Algorithms/stuff we need to know:
Lecture 01:
Information = -log_2(probability).
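E.g. an event with probability 1/8 carries -log_2(1/8) = 3 bits of information; a fair coin flip carries 1 bit.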
Lecture 02:
How PCR works:
– Separate DNA strands, which happens naturally at high temps
– Add small single-strand DNA sequence (primer) and DNA Polymerase
– Let the primers hybridize (sometimes called anneal)
– Let the “heat activated” Polymerase extend the sequence
– Produces two copies, just by cycling the temperature
– Repeat. (9)
How microarrays work (25).
Lecture 03:
Tower of Hanoi: could show up as a problem (10).
Selection sort: find the smallest element and put it in the first position, scan through the rest for the second smallest and put it in the second position, etc. Obviously O(n^2) time. (32)
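A minimal Python sketch (names are mine, not from the lecture):

from typing import List

def selection_sort(a: List[int]) -> List[int]:
    """In-place selection sort: O(n^2) comparisons."""
    for i in range(len(a)):
        # find the index of the smallest remaining element
        m = min(range(i, len(a)), key=a.__getitem__)
        a[i], a[m] = a[m], a[i]   # put it in position i
    return a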
Big-O notation. Big-Omega. (42)
Lecture 04:
Restriction enzyme: will cut at certain base pair sequences.
Double digest problem: (8)
– Start by taking the power set of AB (the digest in which both enzymes cut), and keep only those subsets whose sum equals a fragment length in A or a length in B.
– From these, build two dictionaries, dA and dB, where the keys are the fragment lengths of A (or B) and the values are the sets of all the ways you could make that length from elements of AB.
– Then create the set of all disjoint unions for dA and likewise for dB; call these duA and duB. This means finding all the ways you can make the lengths of A while using each element of the dictionary dA only once. So if dA = {20: {20}, 50: {10,40} {20,30} {50}, …} then one duA would be {20: (20), 50: (10,40), …}, but you can't use {20,30} for the 50 there because you already used that 20.
– The last step is to make vertices from all the values of AB; the edges of this graph are given by the values of the duA dictionary and the duB dictionary.
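A sketch of just the dictionary-building step in Python (function names and the index-based representation are my assumptions; indices are used so duplicate fragment lengths in AB stay distinct):

from itertools import combinations

def length_dict(ab, targets):
    """For each target length (a fragment of A, or of B), list every
    subset of the AB fragments, as index tuples, summing to that length."""
    d = {t: [] for t in targets}
    for r in range(1, len(ab) + 1):
        for idxs in combinations(range(len(ab)), r):
            total = sum(ab[i] for i in idxs)
            if total in d:
                d[total].append(idxs)
    return d

# dA = length_dict(AB, A); dB = length_dict(AB, B)
# The disjoint-union step then picks one entry per target length so that
# no AB index is used twice, separately for dA and for dB.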
Partial digestion: a lot easier. You're given the multiset of every pairwise fragment length, and all it becomes is placing points on a number line in such a way that their pairwise distances exactly reproduce the lengths you were given. (24)
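A sketch of the standard backtracking solution (Skiena-style; whether this matches the lecture's exact presentation is an assumption):

from collections import Counter

def partial_digest(lengths):
    """Place points on a line so their pairwise distances exactly
    use up the given multiset of fragment lengths."""
    remaining = Counter(lengths)
    width = max(lengths)
    remaining[width] -= 1
    points = {0, width}

    def place():
        if not +remaining:                 # all distances used up
            return sorted(points)
        y = max(k for k, v in remaining.items() if v > 0)
        # the largest unused distance must touch 0 or the right end
        for cand in (y, width - y):
            dists = Counter(abs(cand - p) for p in points)
            if cand not in points and all(remaining[k] >= v for k, v in dists.items()):
                points.add(cand)
                remaining.subtract(dists)
                result = place()
                if result:
                    return result
                remaining.update(dists)    # undo and try the other side
                points.remove(cand)
        return None

    return place()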
Lecture 05:
Motif finding, with point mutations. Could be done by finding the consensus and then doing a brute-force search (22), but that will take way too long. A better approach is the Median String Problem (24): we're trying to find a motif, a string of length l (called an l-mer), that is the closest match to sites on a bunch of different DNA strands. So what you do is take all 4^l possible l-mers, slide each one along every DNA strand, and record the minimum Hamming distance over each strand. Sum the minimums over all the strands; whenever an l-mer beats the previous best total, replace it and remember that l-mer (33). Various improvements are possible, mainly by not trying all 4^l l-mers, because after you test some you'll know that others matching certain prefixes must do worse (40).
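A brute-force sketch in Python (function names are mine):

from itertools import product

def min_hamming(pattern, seq):
    """Smallest Hamming distance between pattern and any window of seq."""
    l = len(pattern)
    return min(sum(a != b for a, b in zip(pattern, seq[i:i + l]))
               for i in range(len(seq) - l + 1))

def median_string(dna, l):
    """Try all 4^l l-mers; keep the one with the smallest total
    min-Hamming distance summed over all strands."""
    best_d, best_word = float("inf"), None
    for tup in product("ACGT", repeat=l):
        word = "".join(tup)
        d = sum(min_hamming(word, s) for s in dna)
        if d < best_d:
            best_d, best_word = d, word
    return best_word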
Branch and Bound introduced: Essentially, when you have a sort of tree based algorithm (that is, there is some structure to your guesses at solutions), you can cut down on the exponential run-time of that algorithm by only exploring branches when they meet the criteria you need for a solution (that is, you’ll exclude that branch if the best possible solution you could have achieved from that branch is no greater than your current best solution). So essentially you’re bounding your test-set as you branch (48,51).
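A sketch of the bound applied to the median string search above (the exact pruning rule used in lecture is my assumption): the total distance of a prefix can only grow as the prefix is extended, so any prefix that already ties the current best full score can be discarded along with its whole subtree.

def median_string_bb(dna, l):
    """Median string by branch and bound over the tree of l-mer prefixes."""
    def dist(pattern):
        # total min Hamming distance of `pattern`, measured only over windows
        # where a full l-mer could still start, so it lower-bounds every
        # extension of `pattern`
        p = len(pattern)
        return sum(min(sum(a != b for a, b in zip(pattern, s[i:i + p]))
                       for i in range(len(s) - l + 1))
                   for s in dna)

    best_d, best_word = float("inf"), None
    stack = [""]
    while stack:
        prefix = stack.pop()
        if len(prefix) == l:
            d = dist(prefix)
            if d < best_d:
                best_d, best_word = d, prefix
        elif dist(prefix) < best_d:            # bound: prune hopeless branches
            stack.extend(prefix + c for c in "ACGT")
    return best_word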
Lecture 06:
Pancake flipping: essentially, given a stack of pancakes that isn't sorted by size, what is the minimum number of flips of top subsections of the stack needed to get them ordered by size (4).
Bring-to-top method, self-explanatory (6).
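A minimal sketch (returning the flip sizes, not the intermediate stacks, is my choice of representation):

def pancake_sort(stack):
    """For each size, flip the largest unsorted pancake to the top, then
    flip it down into place: at most 2(n - 1) flips total."""
    flips = []
    for size in range(len(stack), 1, -1):
        i = stack.index(max(stack[:size]))
        if i == size - 1:
            continue                              # already in place
        if i > 0:
            stack[:i + 1] = stack[:i + 1][::-1]   # bring it to the top
            flips.append(i + 1)
        stack[:size] = stack[:size][::-1]         # flip it down into place
        flips.append(size)
    return flips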
Reversals: the application of pancake flipping to biology; sometimes a section of DNA gets inverted (16). Sorting by reversals starts from the front, assumes the prefix so far is in order, looks for the next out-of-place element, and reverses a segment to put it in order, scanning left to right each pass (23).
Greedy algorithm reversals (reversal sort): keep your prefix (the first numbers that are already sorted) fixed and just reverse segments past it to fix the next element. Takes at most n-1 steps. (25). But this isn't optimal (see next lecture).
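A sketch (assuming the permutation is of 1..n, as in the slides):

def greedy_reversal_sort(p):
    """At step i, reverse so that element i+1 lands in position i;
    at most n - 1 reversals, but not always the minimum number."""
    reversals = []
    for i in range(len(p) - 1):
        j = p.index(i + 1)
        if j != i:
            p[i:j + 1] = p[i:j + 1][::-1]
            reversals.append((i, j))
    return reversals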
Lecture 07:
The approximation ratio of an algorithm on an input: is the number of steps that algorithm took on that input over the optimal number of steps for that input (3).
The approximation ratio (performance guarantee): of an algorithm, is the max approximation ratio on the space of inputs (4).
Adjacencies: a pair of neighboring elements that are consecutive, i.e. |π_(i+1) - π_i| = 1 (6).
Breakpoints: occur between neighboring non-adjacent pairs. So 1 | 9 | 3 4 | 7 8 | 2 | 6 5 (7). You can prepend 0 and append n+1 to set context for the reversing. Breakpoints are the bottleneck for sorting by reversals: once they are all removed, the permutation is sorted. Each “useful” reversal eliminates at least 1 and at most 2 breakpoints. (9)
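Counting b(π) in Python (a sketch; the extension with 0 and n+1 follows the slide):

def count_breakpoints(p):
    """Extend with 0 and n+1, then count neighbors differing by more than 1."""
    ext = [0] + list(p) + [len(p) + 1]
    return sum(abs(ext[i + 1] - ext[i]) != 1 for i in range(len(ext) - 1))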
Strip: an interval between two consecutive breakpoints in a permutation.
– Decreasing strip: strip of elements in decreasing order (e.g. 6 5 and 3 2 ).
– Increasing strip: strip of elements in increasing order (e.g. 7 8)
– A single-element strip can be declared either increasing or decreasing. We will choose to declare them as decreasing with exception of extension strips (with 0 and n+1) (11).
The proposed algorithm:
– Choose the decreasing strip with the smallest element k in π (it’ll always be the rightmost).
– Find k – 1 in the permutation (it’ll always be flanked by a breakpoint).
– Reverse the segment between k and k – 1.
– Repeat until there is no decreasing strip (16-19).
– If no decreasing strips exist, flip an increasing strip to create one. (17).
The approximation ratio for this is 4. (27).
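A sketch of the whole algorithm (the strip bookkeeping is my reconstruction of the slides, so treat the details as assumptions):

def breakpoint_reversal_sort(perm):
    """Greedy breakpoint-removal sort; returns the reversals applied
    to the extended permutation [0] + perm + [n+1]."""
    p = [0] + list(perm) + [len(perm) + 1]
    reversals = []

    def strips():
        # maximal runs between breakpoints, as (start, end) index pairs
        runs, s = [], 0
        for i in range(len(p) - 1):
            if abs(p[i + 1] - p[i]) != 1:
                runs.append((s, i))
                s = i + 1
        runs.append((s, len(p) - 1))
        return runs

    while len(strips()) > 1:                # more than one strip <=> breakpoints remain
        runs = strips()
        # decreasing strips; interior singletons count as decreasing
        dec = [(a, b) for a, b in runs
               if p[a] > p[b] or (a == b and 0 < a < len(p) - 1)]
        if dec:
            k = min(p[b] for _, b in dec)   # smallest element in a decreasing strip
            i, j = p.index(k), p.index(k - 1)
            lo, hi = min(i, j) + 1, max(i, j)   # reversing this span joins k-1 and k
        else:
            lo, hi = runs[1]                # flip an increasing strip to create one
        p[lo:hi + 1] = p[lo:hi + 1][::-1]
        reversals.append((lo, hi))
    return reversals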
Lecture 08:
Dynamic programming: caching answers to subproblems to reduce the runtime/number of operations needed (4).
Optimal substructure of the dynamic programming solution: if the longest path from (0,0) to (n,m) passes through some vertex (i,j), then the portion from (0,0) to (i,j) must itself be a longest path to (i,j). Otherwise, you could lengthen the overall path by swapping in the longer sub-path. (15).
Essence of the dynamic programming problem, without the caching:
MT(n, m)
  if n = 0 and m = 0
    return 0
  if n = 0
    return MT(0, m-1) + len(edge from (0, m-1) to (0, m))
  if m = 0
    return MT(n-1, 0) + len(edge from (n-1, 0) to (n, 0))
  x <- MT(n-1, m) + len(edge from (n-1, m) to (n, m))
  y <- MT(n, m-1) + len(edge from (n, m-1) to (n, m))
  return max{x, y}
(16).
Without the caching, the code above recomputes the same subproblems over and over and takes exponential time; with caching (filling each of the n x m table entries exactly once) the running time is O(nm) for an n-by-m grid.
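A cached, bottom-up version in Python (a sketch; the `down`/`right` edge-weight matrices and their indexing are my assumptions about the input format):

def manhattan_tourist(down, right, n, m):
    """Longest path from (0,0) to (n,m): down[i][j] is the weight of the
    edge entering (i,j) from above, right[i][j] from the left."""
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i][0]      # leftmost column
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j]     # top row
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i][j],
                          s[i][j - 1] + right[i][j])
    return s[n][m]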
Directed graph version of problem (25).
Lecture 09:
Measuring similarity between a pair of DNA sequences. (1).
Hamming distance: just counts point differences between sequences. So for v: ATA and w: TAT, the Hamming distance is 3 (the maximum). It neglects insertions and deletions, though. (5).
Edit distance: allows insertions/deletions as well, to account for the fact that two sequences can be nearly identical yet differ at every position (like ATA and TAT, which are one shift apart). (7).
Longest common subsequence: special case of edit distance where you can’t substitute (11).
Good alignments: use as many diagonal segments (matches) as possible. A good alignment from (i..k) that passes through j ends with a good alignment from (j..k) and begins with a good alignment from (i..j). (16).
Longest common subsequence code: (28). Backtracking code (figuring out what that subsequence is based on the arrows) (29).
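A sketch of both pieces in Python:

def lcs(v, w):
    """Fill the LCS table, then backtrack the 'arrows' to recover
    the actual subsequence."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])
    # backtrack from (n, m): diagonal on a match, otherwise follow the max
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1]:
            out.append(v[i - 1])
            i -= 1
            j -= 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))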
Longest common subsequence problem: does not allow for mismatches (which we would want in order to model point mutations). (30).
New scoring scheme: +1 for matches, -something for mismatches, -something smaller for insertions/deletions. (31).
Global alignment problem: tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. (44).
Local Alignment Problem: tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph. Local alignment allows for the possibility that you’re actually adjusting the region over which the two sequences match. Look at the slide for a diagram (44). The reason for local alignment is that two species may be similar over some regions but dissimilar over other regions.
Local Alignment Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignment of all possible substrings (47).
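A score-only sketch (the +1/-1 weights are placeholders, not the lecture's):

def local_alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Smith-Waterman scoring: like the global DP, but a 0 option acts as
    a 'free ride' to any (i,j), and the answer is the table's maximum."""
    s = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    best = 0
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(0,
                          s[i - 1][j - 1] + diag,
                          s[i - 1][j] + indel,
                          s[i][j - 1] + indel)
            best = max(best, s[i][j])
    return best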
Lecture 10:
Consecutive gaps (one long gap) are more acceptable than the same number of scattered single-position gaps (14).
Rules for gaps when consecutive insertions/deletions: (16).
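The usual affine convention (an assumption about what slide 16 says): a gap of length x costs ρ + σ·x, i.e. a large penalty ρ for opening the gap plus a smaller penalty σ per extra position, so one long gap scores better than x scattered single gaps.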
3-Layer Manhattan grid/gap penalties: alignment among 3 different sequences at once, to score their mutual distances (21-24).
Pair-wise alignment: when given a set of three sequences to align, this just refers to any combination of two out of the three. (33).
Because the run-time of 3-layer (and k-layer) alignment is so high, it makes more sense to align the pairs individually and use an algorithm to combine them (33). The process for doing this is explained in (33-40).
Entropy: -Σ_x p_x log_2(p_x), where p_x is the frequency of nucleotide x ∈ {A,T,G,C} in a given column of the multiple alignment; more conserved columns have lower entropy (52).
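E.g. a perfectly conserved column A A A A has entropy 0; a column A A A T has -(3/4·log_2(3/4) + 1/4·log_2(1/4)) ≈ 0.81 bits; an evenly split column has the maximum, 2 bits.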
You can also go backwards, from a multiple alignment to the pairwise alignments it implies, which is like projecting the cube onto its faces (57).
Lecture 11:
Gene Prediction: figuring out how the DNA is being read (like where we are starting, because depending on your offset you could code a different protein). So there are 6 different reading frames: 3 offsets for each strand of the DNA, because a codon is a triplet (1).
Exons: the part of the DNA that actually codes for stuff (4).
Exon Chaining Problem: given a set of apparent exons, find a maximum set of non-overlapping exons. Input: a set of weighted intervals (based on something we don’t care about). Output: A maximum chain of intervals from this set (19).
Greedy algorithm for exon chaining: take the highest-weight interval, then fit in the rest. Comes up with a solution, but not necessarily the best one (20).
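A DP that does find the best chain (a sketch via weighted interval scheduling, which solves the same problem; the (left, right, weight) triple format is my assumption):

from bisect import bisect_right

def exon_chaining(intervals):
    """Max total weight of non-overlapping (l, r, w) intervals,
    processed in order of right endpoint; touching endpoints allowed here."""
    intervals = sorted(intervals, key=lambda t: t[1])
    rights = [r for _, r, _ in intervals]
    best = [0] * (len(intervals) + 1)
    for k, (l, r, w) in enumerate(intervals, 1):
        # last interval (among the first k-1) ending at or before l
        p = bisect_right(rights, l, 0, k - 1)
        best[k] = max(best[k - 1], best[p] + w)   # skip it, or chain it
    return best[-1]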
Spliced alignment: given candidate exons and a related target sequence, find the chain of exons whose concatenation aligns best against the target. Good luck (23).
Lecture 12:
Divide and conquer algorithms: divide the problem into sub-problems, then conquer by solving the sub-problems recursively (if they are small enough, solve them in brute-force fashion), then combine the solutions of the sub-problems into a solution of the original problem (the tricky part). (2).
An example of divide and conquer is mergesort (4). Mergesort's run time is O(n log n) (10).
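A sketch:

def mergesort(a):
    """Divide, conquer recursively, combine by merging sorted halves."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = mergesort(a[:mid]), mergesort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # the combine step
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]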
To do the backtracking for alignment, you need O(nm) memory, because you have to remember all the arrows (11). But to compute the score table itself, you only need the previous column, i.e. two columns of memory at a time, O(n) (12).
Divide and Conquer sequence alignment: do the alignment in linear space by finding where the optimal path crosses the middle column (computing prefix scores forward and suffix scores backward, each in O(n) space), then recursing on the two halves (14-beyond).
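A sketch of the key step for LCS (finding where an optimal path crosses the middle of v, in linear space; splitting on v rather than w is my choice):

def lcs_row(v, w):
    """Last row of the LCS table for v vs w, in O(len(w)) memory."""
    row = [0] * (len(w) + 1)
    for ch in v:
        diag = 0                      # previous row's value at j-1
        for j in range(1, len(w) + 1):
            diag, row[j] = row[j], (diag + 1 if ch == w[j - 1]
                                    else max(row[j], row[j - 1]))
    return row

def middle_split(v, w):
    """Column j where an optimal LCS path crosses row len(v)//2;
    recurse on (v[:mid], w[:j]) and (v[mid:], w[j:])."""
    mid = len(v) // 2
    fwd = lcs_row(v[:mid], w)
    bwd = lcs_row(v[mid:][::-1], w[::-1])[::-1]
    j = max(range(len(w) + 1), key=lambda t: fwd[t] + bwd[t])
    return mid, j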
Four-Russians speedup: a way to make the dynamic programming solution to sequence alignment (LCS) faster. It partitions the grid into blocks whose size balances the likelihood that a block's answer has already been precomputed against the cost of computing it, to maximize speed (28). The Four-Russians runtime is O(n^2 / log(n log n)), which he says is close to linear (42). I think the original runtime was O(n^2).
Summary:
We take advantage of the fact that for each block of size t = log(n), we can pre-compute all possible scores and store them in a lookup table of size n^(3/2).
We used the Four-Russians speedup to go from a quadratic running time for LCS to a subquadratic running time: O(n^2 / log(n log n)).