Cheat Sheet for COMP 555 Midterm

March 3, 2015

My cheat sheet for COMP 555, Bioalgorithms, which was really a joke of a class in the end. Pretty disappointing. If I recall correctly, the professor was pretty sexist at times too.

How to do problem 4 on the problem set:

Given: Set A cut by one enzyme, set B cut by a different enzyme, set AB cut by both.

Start by taking the power-set of AB (the set in which both are cut), and then only take those elements whose sum is a length in A or a length in B. You make two dictionaries, dA and dB, from these results, where the keys are the lengths achieved in A or B and the value is a set of all of the ways you could have made that length from the elements in AB. Then you create the set of all disjoint unions for dA and also dB, call these duA and duB. This is done by just trying to find all the ways you can make the lengths of A by only using unique values of the dictionary dA. So if dA = {20:{20}, 50:{10,40}{20,30}{50}, etc} then one duA would be {20:(20), 50:(10,40), etc}, but you can’t use the {20,30} for the 50 because you already used that 20. Do this to create a set of duAs and duBs, and the last step is make vertices with all the values of AB and then edges for this graph are the values of both the duA dictionary and the duB dictionary.

Information: -log_2(probability)
6 ways out of 36 to
roll a 7. Thus, 7 conveys -log(6/36) = 2.58 bits.

Can determine how close genes are on a chromosome based on their combined frequency in a population. So you can anticipate a frequency based on the assumption that they're independent, and depending on how frequently they show up combined in the population, you deduce how far they are on a chromosome.

Most amino acids have redundant encodings.

Cutting by restriction enzymes makes reverse complement palindromes, they leave “sticky ends”, which are like overhangs.

PCR (Polymerase Chain Reaction):
You essentially just use this to copy a single strand of DNA a bunch so that you can do tests. You split it, then have some things in a pool of nucleotides that attach onto it and duplicate it, and then split it and repeat the process.

DNA sequencing: 3 steps. Create sub-fragments that differ in length by a singe nucleotide. Label each fragment with one of 4 different tags. Sort the fragments according to size.

Gel Electrophoresis: the process of sorting the

Motif finding using median finding:

We’re trying to find a motif, which is a string of length l, called an l-mer, that is the closest match to sites on a bunch of different DNA strands. So what you do, which is efficient, is to take all 4^L possibilities, and run them along the DNA strands and for each figure out the minimum Hamming distance for the entire DNA strand. I think you sum the minimums over all the strands, and then if any L-mer does better than the previous minimum, you replace it and remember that L-mer.

You have to do this for all 6 possible frames. There are 6 frames for each

You can do various optimizations by essentially doing variable matches for your L-mer, so like match ACT***, and then if this general one doesn’t do better than the previous minimum, then you know none of the specific ones will do better.

Hamming distance: just the number of differences between two DNA strands.

Notes from practice:

Answer: The number of base pairs that differ between individuals is 3x10^9x0.001 = 3x10^6. Each base pair can take 1 out of 4 combinations. Therefore, the total number of different human genome can be as many as 4^300000000. The number of bits that are needed to encode these variations is log_2(4^300000) = 6 million bits.

At what position should the l-mer appear to maximally improve he performance of the dynamic programming algorithm?

If the l-mer is near the middle of each string, then the tableu size can be cut by more than half. In the tableau, it’s like choosing a position, and then do the tableue for the remaining area.

Suppose you are given two sequences of lengths n and m where n> m. What is the lowest score that could be expected from a global alignment assuming a match score of +1; a mismatch score of 0; and a gap penalty of -1?

If the two strings have no sequences (eg. AAAAAA, TTTT) then the best socer possible is -(m-n) gaps for a low score of n-m.

On he test, he asks how many times a loop runs. If this loop is doing 2^n, n!, or n^3 iterations. All are nested for loops.

For the coin change problem, will need to know that essentially think about how

Make sure understand break point reversals.

Splice alignment don’t need to worry about.
Don’t need to worry about x-on chaining.

Understand global and local alignments.

Four-russian speedup:

I can solve dynamic programming problems by doing