191014K02: Day 3 Lecture 1
Work in progress.
Closest String
For two strings s_1, s_2 of the same length \ell, we define the Hamming distance between s_1 and s_2 to be:
d\left(s_1, s_2\right):=\left|\left\{i: s_1[i] \neq s_2[i]\right\}\right|
so, for example, d(horse, force) = 2.
For a string c and a set S of strings of length \ell:
d(S, c):=\max _{t \in S} d(t,c)
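These two definitions translate directly into code; here is a small Python sketch (the function names are ours, for illustration):

```python
# Hamming distance between two equal-length strings.
def hamming(s1, s2):
    return sum(1 for a, b in zip(s1, s2) if a != b)

# Distance from a candidate string c to a set S of strings:
# the worst case over all strings in S.
def set_distance(S, c):
    return max(hamming(t, c) for t in S)

assert hamming("horse", "force") == 2
```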
GOAL: a (1+\varepsilon)-approximation algorithm for finding a string c that minimizes d(S, c).
An Integer Linear Program for Closest String
For every position p and every letter \alpha \in \Sigma, introduce a binary indicator variable x_{p,\alpha} with the following semantics:
\begin{equation*} x_{p,\alpha} = \begin{cases} 1 & \text{if } c[p] = \alpha\\ 0 & \text{otherwise.} \end{cases} \end{equation*}
The constraints:
- At every position we have one letter:
\forall p \in [\ell]: \sum_{\alpha \in \Sigma} x_{p,\alpha} = 1
- We control the distance:
\forall t \in S: \quad \sum_{p=1}^\ell \left({\color{indianred}1-x_{p, t[p]}} \right) \leqslant d
…and ask the ILP to minimize d.
Now, as usual, relax the ILP and solve the associated LP.
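As a concrete sketch, here is one way to write down the relaxation in Python with PuLP (an assumed, illustrative choice of LP library; any LP solver works the same way, and the function name is ours):

```python
import pulp  # pip install pulp

def solve_lp_relaxation(S, Sigma):
    """LP relaxation of the Closest String ILP: x in {0,1} becomes 0 <= x <= 1."""
    ell = len(S[0])
    prob = pulp.LpProblem("closest_string_lp", pulp.LpMinimize)

    # One relaxed indicator per (position, letter) pair.
    x = {(p, a): pulp.LpVariable(f"x_{p}_{a}", lowBound=0, upBound=1)
         for p in range(ell) for a in Sigma}
    d = pulp.LpVariable("d", lowBound=0)

    prob += d  # objective: minimize the distance bound d

    # At every position, the letter "mass" sums to one.
    for p in range(ell):
        prob += pulp.lpSum(x[p, a] for a in Sigma) == 1

    # For every input string t: sum_p (1 - x[p, t[p]]) <= d.
    for t in S:
        prob += pulp.lpSum(1 - x[p, t[p]] for p in range(ell)) <= d

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {k: v.value() for k, v in x.items()}, d.value()
```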
Rounding. Think of the \text{OPT}_{\text{LP}} variable values as “voting” for characters at each position. For example, if \ell = 5, \Sigma = \{A,C,G,T\}, and the \text{OPT}_{\text{LP}} values turn out to be:

| p | A | C | G | T |
|---|-----|-----|------|------|
| 1 | 0.9 | 0 | 0.05 | 0.05 |
| 2 | 0.1 | 0.2 | 0.4 | 0.3 |
| 3 | 0.5 | 0.1 | 0.1 | 0.3 |
| 4 | 0.1 | 0.7 | 0.1 | 0.1 |
| 5 | 0.2 | 0.2 | 0 | 0.6 |
Then you might be tempted to “round” the solution to AGACT, because for each position p \in [5], the letters A, G, A, C, and T dominate the vote for that position. However: it turns out that this natural rounding strategy can be arbitrarily bad!
Instead of picking the top choice, we do the following: for every position p \in [\ell], treat the x_{p,\alpha}’s as a probability distribution over \Sigma (the LP constraints guarantee the values are nonnegative and sum to one). The randomized rounding step samples from this distribution to obtain the solution:
Set c[p] = \alpha with probability x_{p,\alpha}.
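A minimal sketch of this sampling step in Python, reusing the (assumed) marginals x_lp returned by the LP sketch above:

```python
import random

def round_solution(x_lp, Sigma, ell):
    """Sample each position independently from the LP marginals."""
    letters = list(Sigma)
    c = []
    for p in range(ell):
        # Clip tiny negative values an LP solver may return;
        # random.choices normalizes the weights for us.
        weights = [max(x_lp[p, a], 0.0) for a in letters]
        c.append(random.choices(letters, weights=weights, k=1)[0])
    return "".join(c)
```

Putting the pieces together: `x_lp, d_lp = solve_lp_relaxation(S, Sigma)`, then `c = round_solution(x_lp, Sigma, len(S[0]))`, and `set_distance(S, c)` measures how well the rounding did.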
Define the indicator random variables q_{p,\alpha}, which tell us when c differs from \alpha at position p; these are a stepping stone to capturing distances:
\begin{equation*} q_{p,\alpha} = \begin{cases} 1 & \text{if } {\color{indianred}c[p] \neq \alpha},\\ 0 & \text{otherwise}. \end{cases} \end{equation*}
Then we have the natural notion of a random variable to capture the distance between the string output by our randomized algorithm and any fixed string t \in S:
d(c,t) := \sum_{p \in [\ell]} q_{p, t[p]}
It’s time for our first cool claim.
Fix a t \in S. What is E[d(c,t)]? It turns out: at most d, and therefore at most \text{OPT}_{\text{LP}}, which in turn is at most \text{OPT}_{\text{ILP}} = \text{OPT}.
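To spell this out: by linearity of expectation, the rounding probabilities, and the distance constraint of the LP,
\mathbb{E}[d(c,t)] = \sum_{p=1}^{\ell} \operatorname{Pr}[c[p] \neq t[p]] = \sum_{p=1}^{\ell} \left(1 - x_{p,t[p]}\right) \leqslant d.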
Note that d(c,t) is a sum of independent 0/1 random variables whose expectation is upper bounded by OPT. So our useful Chernoff variation applies here, and we get the following:
\operatorname{Pr}[d(c,t)-\text{OPT} > \varepsilon \cdot \text{OPT}] \leqslant e^{-\varepsilon^2 \text{OPT} / 3}.
Now, applying the union bound over all n choices of t \in S, we get:
\operatorname{Pr}[{\color{darkseagreen}d(c,S)} > (1+\varepsilon) \cdot \text{OPT}] \leqslant \frac{{\color{darkseagreen}n}}{e^{\varepsilon^2 \text{OPT} / 3}}.
So, if, for example:
\frac{n}{e^{\varepsilon^2 \text{OPT} / 3}} \leqslant \frac{1}{2},
then it’s a win!
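Unpacking that condition:
\frac{n}{e^{\varepsilon^2 \text{OPT} / 3}} \leqslant \frac{1}{2} \iff e^{\varepsilon^2 \text{OPT}/3} \geqslant 2n \iff \text{OPT} \geqslant \frac{3\ln(2n)}{\varepsilon^2}.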
We don’t have a win when OPT is really, really small; in particular, if:
\text{OPT} \leqslant \frac{3\ln(2n)}{\varepsilon^2}
We handle this case with a local search algorithm.
Coming Soon.