Study Guide for COMP 555 Final

April 30, 2015
A study guide I made for studying COMP 555 at UNC by reviewing all the lectures.

Lecture 13:

Eulerian Cycle Problem: Find a cycle that visits every edge exactly once. (3). Can be done in n time. More on it in SBH below. 

Hamiltonian Cycle Problem: visit every vertex exactly once. But this is NP-complete. Takes exponential time (9). 

Shortest Superstring Problem: given set of strings, find the smallest string that contains all of them. Can map this to Traveling salesman problem, which is NP-complete. (23) Can make this into a graph (25). 

SBH (Sequencing by Hybridization) Problem: Given a set of all l-mers from an unknown string, reconstruct the string (many may exist) (37). Could be done with hamiltonian, but that takes forever. Better done with Eulerian path (41). Graph is “balanced” if indegree of ever vertex in equal to outdegree. A graph is Eulerian if and only if its vertices are balanced (modification: can contain as many as two vertices whose in and out degree differ by one). (43). Doing the eulerian cycle (45). 

Lecture 14:

Not much to see here, folks. 


Lecture 15: 

Peptide Sequencing Problem: figuring out which peptide existed by comparing spectra (33). Don’t currently understand.

Deficiency of Shared Peaks Count: intuitive measure of spectral similarity. (54). 

Shifts are covered (Spectral alignment problem) (62). 

Dynamic Programming algorithm for Spectral alignment: Algorithm for spectral alignment. Running time is O(n^4 * k) (73). There’s a corresponding fast one that does O(n^2 *k) (75). 

Lecture 16: 

l-mer repeats: find short l-mers (usually 10 to 13) in a string, which are easy to find. The extend l-mer into longer, maximal repeats. (4)

Basic hashing: generate integer key based on a record. Index by the key. (8)

Pattern matching problem: find all occurrences of some string in a text. Better than brute force, you can do the algo on (17), which has runtime O(nm) where n is string length and m is length of what you’re trying to match (18). 

Keyword trees: start with a root, and sort of branch out with every letter in the word if it’s not already there. Quadratic to create keyword trees. (40). 

Suffix tree: concatenation of keyword tree if what you’re doing is adding in all suffixes of a string, and the index at the bottom of the leaf node is the index at which that terminal ends (39). Can create suffix tree in order N time (41). Start with an empty node, then add in the first letter and create a node for it, then the second letter and add it to that existing node and create it’s own node for it, etc (44). Searching 
	What letter occurs most frequently? 
		Look at the top nodes for each letter, and count the number of leaf nodes 		under them. 
	What is the shortest pattern that occurs once in the string? 
		The shortest path (by number of letters) to a leaf node. (52)


Lecture 17: 

Searching for a string of length m in a text of length n. Then indexing strings with trees: keyword tree is O(m) search independent of number of keywords, suffix tree is O(n) construction, O(m) search. Suffix arrays are a practical alternative, O(log(n)) search time. (2). 

Suffix tree will have m leaves for a string of length m. (6). Given a suffix tree, how long to determine if a pattern of length n occurs in the sequence? O(n). (6). Suffix tree storage is O(n^2), but can get it down to O(n) + O(n). (7). 

Suffix tree depth first traversal corresponds to alphabetizing all suffixes, if you alp (8). We can represent this tree as an array, based on its in order traversal . Make sure you keep the branches in alphabetical order. (10). 

Given a suffix array, how long to determine if a pattern of length n occurs in the sequence? O(n*log(m)). (12)

BWT: start with a string “tarheel$”, then rotate it across its length so “arheel$t”, etc. Then sort this resulting list alphabetically, and return the last letter of each. The way this works actually allows you to compress the data, because for some reason when you do this, you end up with a long string of “A”’s for instance which you can just represent as a single A50 or whatever. So it’s useful for compression. (17). 

BWT is faster than suffix array: BWT O(n) search, suffix array O(nlogn) search. (21). 



Lecture 19: 

Skipped reading. If anything in depth about BWTs, go here. FM indexes, etc. 

Lecture 20:

Clique: is a graph where every vertex is connected via an edge to every other vertex. (2). 

Clique graph: is a graph where each connected component is a clique. (2). 

Corrupted cliques problem: Essentially, you just remove/make clusters out of an existing graph by changing edges. Input is a graph G, output is the smallest number of edge additions and or removals that transforms G into a clique graph. (4). 

Distance graph: between two species, for instance, is just creating cliques/clusters based on distances. You want to connect things which have a distance smaller than some chosen threshold. (5). 

CAST algorithm: algorithm for determining clusters, essentially. (8). 

Distance matrix: distances between species, based on their genes. We don’t know the evolution of the species and their relations but we want to. We want to form a tree out of this matrix. (25)

Constructing a distance tree based on distance matrix: This requires solving (n choose 2) equations, but we have enough variables (2n-3) to solve them. An unrooted tree with n leaves has 2n-3 edges. (29). 

Essentially, take the matrix, reduce each entry by 2*delta, until you get a degenerative triple (have an edge that’s zero to a node, essentially), remove that node, continue, etc, then rebuild back up until you get a loop. (40). 

Seems like these have to be additive matrices. (30). 

How to construct the tree based on the matrix. Very complicated though (32). 

AdditivePhylogeny algorithm tells you if matrix is additive (43). 

Lecture 21:

What to do if the distance matrix is not additive. We find the tree that works approximately (2). 

Hierarchical clustering, essentially (5-7). Exact algorithm (7). 

Parsimony scoring: (15). 

Fitch algorithm: (35). 

Lecture 22: 

SNPs (sequence diversity patterns): (3-30). 

Lecture 23:

Hidden Markov models. 

Fair bet casino problem: input sequence of coin tosses made by some combination of two possible coins, and output a sequence of states of the coins that indicates the most likely coins used to generate that sequence. (7). 

Hidden Markov formal definitions: (17-18)

Decoding problem (what we’re using HMMs for): Goal: find the optimal path of state transitions given a set of observations. Input sequence of observations generated by an HMM, then output a path that maximizes the probability that the path happened over all possible paths for that observation (23). 

Viterbi solution to HMM: essentially create the graph, as you are used to (25). 

Forward-backward problem: don’t really understand why it’s necessary. Seems like it does the HMM’s thing, but it tells you the most likely state at a given flip rather than the whole thing (39). 

Don’t necessarily know the parameters for the HMM, so have to use Expectation maximization. (47). 

Profile representations: represent the frequency of a given base at a position using a matrix, for multiple DNA sequences (49). 

Lecture 24: 

Select(L,k) finds the kth smallest element in L. (3). 

Want to find medians or first quartiles in O(n). (3). 

Select() does essentially quick sort, but with less work: pick an element from the list, split into two smaller lists larger than your choice and smaller, then, remember you’re trying to find the kth value, if k>len of the lower one, then run select again on that, otherwise do it on the top half, and if it’s neither, then you chose the kth element (4)

Runtime of select() depends on selection of m (the first value choice), want to choose m to be around the middle then you get O(n). Worst case it’s O(n^2). Best to just choose randomly (12). 

Randomized select code (13). 

Randomized Select: worst case is O(n^2), expected is O(n). 

Motif finding problem: given a list of t sequences each of length n, find the best pattern of length l that appears in each of the t sequences (16). 

Scoring strings with a profile: Now, given a profile matrix (just frequency table for your t sequences at each base pair), you can figure out the probability that a given l-mer was created by the profile by comparing it to the consensus string (18). Super easy to do (21). 

Most probable l-mer is just going to be doing this for ever possible l-mer, across the entire sequence (24). 

You can also say “find the t most probable l-mers”. And because you don’t want to rule things out in case they just don’t match with the consensus, you can also say “let’s not do 0 probabilities, but just really small”. (28-29). 

Greedy profile motif search: (32). 

Gibbs sampling: essentially doing more probability stuff to make what was done before even faster/smarter (36).