I wanted to dip my toe into Machine Learning and Artificial Intelligence stuff, but I didn't want to actually invest the time to do the boring schoolwork associated with it, so I decided to just download all of the slides from our Machine Learning class and study the shit out of them for a couple weeks. I later did research with this professor, Tamara Berg. She was great, but I did feel bad because I didn't really dedicate much time to our research together, since I was busy trying to get Coursicle off the ground before we graduated.

As with anything complex that I learn, I spent a lot of time trying to write down descriptions for the different methods/concepts in the simplest way possible. If I don't have an intuition about something, I don't really feel like I understand it.

What excites me about AI is how remarkably simple it is at its core. It just intuitively feels like this is how intelligence evolved naturally via selection. 

Email to Tamara: “Over the past few weeks I’ve read through all of your Artificial Intelligence lectures and done some of the corresponding readings as well.” Be sure to mention that I’m not looking for anything paid, I just want to learn more about Machine Learning. The text that went along with the lectures was always understandable to me. My particular interest is in Natural Language Processing, and how we could teach a computer to talk or listen intelligently. I haven’t had a chance to dig through the papers, but it seems like the auto-captioning for images is in my area of interest because I’m interested in the combination of machine learning and natural language processing. Just the idea that we can have a computer do something that we didn’t explicitly program it to do, that’s exciting. 

Notes: Roman worked for her last summer, said he wasn’t doing too great in her class but she was still fine with him working with her. Asked about paid/unpaid, I don’t care if it’s unpaid. He talked to her in person, so I’d probably do the same. 

Notes: while Google has demonstrated success in generating high accuracy recognition networks, it’s important to note that these networks could have been subjected to millions of test cases. Obviously, humans are able to perform with significantly less training, and so there is plenty more advancement to be done on the object recognition front. http://techcrunch.com/2014/11/18/new-google-research-project-can-auto-caption-complex-images/

A smart chatbot (essentially passing the Turing test) requires: 
	Natural language processing (actually parse the sentence)
	Knowledge representation (find a way to store knowledge in some way)
	Automated reasoning (figure out answers to questions based on the knowledge)
	Machine learning (general pattern finding, learning from experience) 

An agent: 
	Anything that can be thought of as perceiving its environment, and then acting on that environment. Essentially, the agent is trying to maximize a performance measure. 

Performance measure (utility function): 
	Everything we design should have some sense of performance measure (probably, this is ideally a numerical range), which it is trying to maximize. Abstractly, the agent will calculate, for every possible action, ExpectedUtility(action) =  sum over all outcomes of Utility(outcome) * P(outcome | action). So the sum over all outcomes of the utility of that outcome times the probability of the outcome given that action. Remember, these utility functions are designed to reveal preferences for the agent, that is, a high relative utility (likely normalized) should indicate that this action will lead to an outcome that is very preferential for the agent. 
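
	A minimal sketch of that expected-utility calculation in Python; the actions, outcomes, and probabilities here are made up purely for illustration:

```python
# Hypothetical utilities and outcome probabilities, for illustration only.
utility = {"win": 1.0, "draw": 0.0, "lose": -1.0}

# P(outcome | action) for two made-up actions.
outcome_probs = {
    "aggressive": {"win": 0.5, "draw": 0.1, "lose": 0.4},
    "defensive":  {"win": 0.2, "draw": 0.6, "lose": 0.2},
}

def expected_utility(action):
    # Sum over all outcomes of Utility(outcome) * P(outcome | action).
    return sum(utility[o] * p for o, p in outcome_probs[action].items())

best_action = max(outcome_probs, key=expected_utility)
print(best_action, {a: expected_utility(a) for a in outcome_probs})
```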

Some environment types: 
	Fully observable vs partially observable: can the agent see the entire environment or just part of it. 
	Deterministic vs stochastic: next state strictly determined by current state + action, or is there some chance to the evolution? 

	Episodic vs sequential: are the decisions essentially independent, that is, you take them case by case in episodes or is a decision going to change the environment for subsequent decisions too. 

	Static vs dynamic: is the environment changing while the agent is deciding, or is everything essentially on pause. 

	Discrete vs continuous: are the state of the environment and the number of possible actions finite, or are we dealing with continuous quantities. 

	Single-agent vs multi-agent: are there multiple agents in the environment. 

	Known vs unknown: does the agent know from the start the outcomes of all its actions, or does it have to learn them? 

Search: 
	Essentially trying to go from a start state to some end state, with some cost function guiding behavior. You want to minimize moves, for instance; you have some actions you can take and your environment is in some kind of state as well. You can calculate a few layers of “given this, what state will I be in, and given that state, what would I want to do to minimize cost”, so you do dynamic programming but just down to a few layers. Assumes: single agent, deterministic, fully observable, discrete environment. 

Constraint satisfaction problems (CSPs):
	Essentially, you’re doing better than plain search (which goes through states by trying actions, possibly guided by some utility function) by eliminating states based on constraints, so you know that some states aren’t possible and you rule those out. How do we do this? You have some set of constraints, and all you need to do is choose some start move (it can be random), then given that start move, see what other moves you can make that don’t break any constraints for your second move, and continue on; any time you are exploring a branch that dies before having satisfied everything, you just backtrack up the tree. (Look at “csp.pptx” lecture.) So imagine the classic map coloring problem with 6 states: choose one to be blue, then move down a layer and choose the next to be yellow, etc., always obeying the constraints, and if every branch you hit ends up failing to reach a solution, you go back up and change one of your earlier guesses. One common way of choosing what to do at each stage is to pick the most constrained variable (employing this sort of tactic is called using a heuristic). You can also choose the variable whose assignment imposes the most constraints on the others (apparently the first heuristic is best, but you can use this one to break ties). For more efficiency, you can do early failure checking (called forward checking, a form of constraint propagation) by keeping track of all possible assignments for the currently unassigned variables, and if some variable is left with no valid assignment, terminate the branch right there (rather than having the program reach the point where it realizes there is no available value to assign to that variable). Sort of think of a heuristic as “if I were doing this manually, how would I probably do it to make things go fastest”. Another way to do consistency checking is called arc consistency, which takes every pair of variables and, for every assignment of a value in their current domains, makes sure the other variable still has a valid assignment. You can run arc consistency to reduce the search space early, so you’ll essentially know upfront that certain assignments just won’t be possible. Local search is also an option: it starts with an assignment that actually does violate constraints, and then tries to make changes that greedily minimize the constraint violations; it can get stuck in local minima. 
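
	A small sketch of that backtracking idea in Python, with the most-constrained-variable heuristic and forward checking, on the usual textbook map-coloring example (the map and colors here are just that textbook instance):

```python
# Backtracking search for map coloring with forward checking.
neighbors = {
    "WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"], "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q": ["NT", "SA", "NSW"], "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": [],
}
colors = ["red", "green", "blue"]

def backtrack(assignment, domains):
    if len(assignment) == len(neighbors):
        return assignment
    # Heuristic: pick the unassigned variable with the fewest remaining values.
    var = min((v for v in neighbors if v not in assignment), key=lambda v: len(domains[v]))
    for color in domains[var]:
        # Forward checking: prune this color from each neighbor's domain.
        new_domains = {v: [c for c in d if not (v in neighbors[var] and c == color)]
                       for v, d in domains.items()}
        if all(new_domains[v] for v in neighbors if v not in assignment and v != var):
            result = backtrack({**assignment, var: color}, new_domains)
            if result:
                return result
    return None  # dead end: caller backtracks and tries a different earlier guess

print(backtrack({}, {v: list(colors) for v in neighbors}))
```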

Natural Language Processing (basic): 
	Can classify words into types so as to identify meanings or attitudes about things. Can classify things as simply as “plus, minus, equals, and not” words, so positive words, negative words, negators, etc. 

Games: 
	Branching factors for moves can be huge (~35 different moves every turn), which means an ungodly number of possible game scenarios, which means it’s not really possible to compute them all. For instances where you can calculate the whole tree, simply choose a player’s perspective and then, for all terminal scenarios, give a winning scenario a 1, losing -1, and tie 0, and then go up the tree setting each parent node’s utility to the max or min of its children (depending on whose turn it is at that node), percolating up the entire tree. Then at each turn, one player will be choosing the move whose subtree has max utility and the other will choose min utility. If you can’t make it to the end states because there are way too many, you can try to evaluate the probability of winning given each possible state you could put yourself in and then choose moves based on that (this probability could be based on a database which contains probabilities of winning, or it could be based on some evaluation function of the current state which gives some indication of how likely a win is, such as in chess where having pawns is worth some amount, having a queen is worth more, etc.). It’s really all about seeing as far ahead as possible. 
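
	A toy sketch of plain minimax (no pruning) in Python; the game tree values are invented, standing in for win/lose/tie utilities at the leaves:

```python
# Plain minimax on a toy game tree (leaf values made up for illustration).
# Leaves are numbers (utility from the maximizing player's perspective);
# internal nodes are lists of child subtrees.
toy_tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]

def minimax(node, maximizing):
    if not isinstance(node, list):        # leaf: terminal utility
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# The maximizing player compares the minimax value of each available move.
best_move = max(range(len(toy_tree)), key=lambda i: minimax(toy_tree[i], False))
print(best_move, minimax(toy_tree, True))  # -> 0, 3
```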

Probability world views: 
	Frequentist: 
		Things are truly random, and so the best we can do to model the thing is to look at how it occurred in the past and then take averages. 

	Bayesian: 
		Things aren’t truly random (?), but we’re just unsure about the variables involved so we have uncertainties about the variables that contribute to events. This results in using past data to come up with predictions, and then update that model in light of new data. 

Monte Carlo Simulation: 
	To compute the utility of actions, you can just simulate taking that action and then simulate how the game could play out by actually choosing outcomes based on their random probability. 
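
	A minimal sketch of that idea: estimate an action’s utility by averaging the outcomes of many simulated playouts. The “game” here is made up, just random point swings:

```python
import random

# Made-up stochastic "game": after taking the action, each step randomly
# adds or subtracts points until the playout ends.
def simulate_playout(action):
    score = 1 if action == "risky" else 0
    for _ in range(10):
        score += random.choice([-1, 0, 2]) if action == "risky" else random.choice([0, 1])
    return score

def monte_carlo_utility(action, n_playouts=10_000):
    # Average outcome over many simulated playouts approximates expected utility.
    return sum(simulate_playout(action) for _ in range(n_playouts)) / n_playouts

print({a: round(monte_carlo_utility(a), 2) for a in ["risky", "safe"]})
```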

Maximum Expected Utility (MEU): 
	Could be considered the foundation of AI, or really, if you think about it, intelligence in general: an intelligent being will attempt to maximize its utility in future states. It does this by looking at all the possible actions it can take, estimating the utility of each of those actions (taking into account the probabilities of the resulting states if the model is not fully deterministic), and then making the decision that will most likely lead to maximum utility. These utilities are often normalized against the worst and best possible outcomes. 

Rational Preferences: 
	The preferences that a rational being has must obey some axioms, the “axioms of rationality”. A rational agent must obey these axioms when considering any set of outcomes. Some examples of these axioms are orderability (any two outcomes must be greater than, less than, or equal/indifferent), transitivity, etc. It’s been proven that given some set of preferences that are rational (i.e., they obey the axioms), there exists a utility function that is consistent with those preferences. 


Markov Decision Processes (MDP): 
	Defined as a set of states, s, with some initial state s_0. Each state has a set of available actions A(s), and a transition model (inherently stochastic) which gives the probability P(s’ | s, a), that is, the probability of getting to state s’ given you are in state s and take action a. Lastly, there is a reward function R(s) which is the reward for being in state s. We want to find a policy pi(s), which gives us the action that an agent should take in a given state. This policy is exactly what we’re trying to find. Notice here that our policy will only take into account the current state, not things that happened in the past. In an MDP, R(s) is known, so we know the rewards a priori. 
	So how do you find this policy? Well, the idea is to choose the policy which has maximum utility over all possible state sequences, but there could be infinite state sequences, so you do this: say you have some possible sequence of states s_0, s_1, s_2, …, and of course you want to maximize reward R(s_0) + R(s_1) + R(s_2) + …, but that sum doesn’t converge as is because it’s infinite, so you choose a 0 < gamma < 1 for all of your state sequences and have the importance of each subsequent state decrease by a factor of gamma, so you then have the sum from t = 0 to infinity of gamma^t * R(s_t), that’s R(s_0) + gamma*R(s_1) + gamma^2*R(s_2) + …, which converges. Because it’s geometric, it’s <= R_max / (1 - gamma). Could read more about value iteration and policy iteration if interested in the specifics of how to compute this. 

The Bellman equation: 
	Equation used to calculate the utility of a state (in a Markov Decision Processes) for comparison purposes to figure out what action an agent should take. Essentially, it calculates all of the utilities of the states that it could enter by doing: the reward for that state R(s) plus the dampening factor gamma times essentially the best utility it can get if it chooses that state (it obviously must do this recursively, but probably only to a certain level). This is denoted as U(s) = R(s) + gamma * max_(a \in A(s)) sum_s’ (P(s’ | s, a)U(s’)). Where A(s) are all the valid actions at state s. 
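
	A small sketch of value iteration, which just applies this Bellman update repeatedly until the utilities stop changing; the tiny MDP (states, transitions, rewards) is invented for illustration:

```python
# Value iteration on a tiny made-up MDP.
states = ["A", "B", "End"]
actions = {"A": ["stay", "go"], "B": ["go"], "End": []}
# P[(s, a)] is a list of (probability, next_state) pairs.
P = {
    ("A", "stay"): [(1.0, "A")],
    ("A", "go"):   [(0.8, "B"), (0.2, "A")],
    ("B", "go"):   [(1.0, "End")],
}
R = {"A": -0.1, "B": -0.1, "End": 1.0}
gamma = 0.9

U = {s: 0.0 for s in states}
for _ in range(100):
    # Bellman update: U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) * U(s')
    U = {s: R[s] + gamma * max((sum(p * U[s2] for p, s2 in P[(s, a)])
                                for a in actions[s]), default=0.0)
         for s in states}

print({s: round(u, 3) for s, u in U.items()})
```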

Reinforcement Learning: 
	Essentially a Markov Decision Process, but now we don’t actually know the transition model or the reward function. That means you essentially have to try out actions and come up with a way of learning. Still looking for a policy, that is, the best action to take in any given state. The workflow goes like this: take some action, observe the outcome (successor state and reward), update your internal representation of the environment and policy, and if you reach a terminal state just restart (each pass is called a trial). 
	Passive learning: 
		You are given a policy and you need to evaluate how good it is (essentially by experimentally determining the state utilities). You still don’t know the transitions or rewards. Two types: model-based and model-free. 
		In model-based, you’re essentially trying to experimentally figure out the transition probabilities and rewards, and then from that do an explicit calculation to see if the ideal policy (I guess by the Bellman equation?) based on these probabilities and rewards agrees with the policy given (imagine the Bellman equation, and trying to explicitly determine the transition probabilities and rewards; then you should be able to produce an ideal policy from this model, which you compare to the given policy). 

		In model-free, you don’t try to learn the probabilities; an example is TD-learning, in which you just observe the transitions and rewards and then try to sort of back-fit them to the Bellman equation (imagine the Bellman equation, but now imagine pretty much ignoring the probabilities and just using an empirically determined average U(s’); the update, applied to U(s) every time a transition s -> s’ is observed, is U(s) <- U(s) + alpha*(R(s) + gamma*U(s’) - U(s)), where 0 < alpha < 1 is the learning rate). A small sketch of this update appears after the Bayesian Networks notes below.

Bayesian Networks: 
	⁃	Independence: for a causal chain X->Y->Z, we can conclude that X and Z are not independent, but they are independent if Y is given. For common cause, X->Y and X->Z, we know that Y and Z are not independent, again unless X is given. Finally we have common effect, which is X->Y and Z->Y, and here X and Z are independent unless we are given that Y occurred. 
	⁃	Learning: so, two tasks. One: given the network structure (so you know the probability dependencies) and given a dataset (which you can use to infer probabilities), you’re supposed to find the probabilities (the conditional probability table for each variable, that is, each node). The second task: say you only have the dataset; you’re supposed to figure out the network structure and the probabilities (a lot harder). 
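
	Quick sketch of the TD(0) update from the model-free passive learning note above; the little chain world, its transition noise, and its rewards are all made up for illustration:

```python
import random

# TD(0) utility updates from observed transitions in a made-up 1-D chain world.
# Policy: always try to move right; state 4 is terminal with reward 1.
gamma, alpha = 0.9, 0.1
U = {s: 0.0 for s in range(5)}

def reward(s):
    return 1.0 if s == 4 else -0.04

for _ in range(1000):                     # many trials
    s = 0
    while True:
        if s == 4:                        # terminal state: utility is just its reward
            U[s] += alpha * (reward(s) - U[s])
            break
        s_next = s + 1 if random.random() < 0.8 else s   # noisy transition
        # TD update: U(s) <- U(s) + alpha * (R(s) + gamma*U(s') - U(s))
        U[s] += alpha * (reward(s) + gamma * U[s_next] - U[s])
        s = s_next

print({s: round(u, 2) for s, u in U.items()})
```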

Probabilistic Inference: 
	Essentially, you know the full joint distribution (that is, the probability of every combination of variable values), and you’re tasked with some “query” variable, which is a variable whose probability you want to figure out, given some “evidence”, that is, some variable in your table known to be some value, e.g. E=e. Then, say you have P(X, E, Y); you just have to do some marginal sums to figure out P(X | E=e), which is P(X, e)/P(e), proportional to Sigma_y P(X, e, y). This involves a lot of sums, though, and that’s going to be a problem if you have a really, really large full joint distribution. 
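
	A toy sketch of this inference by enumeration on a made-up joint distribution over three boolean variables:

```python
from itertools import product

# Made-up full joint distribution over three boolean variables (X, E, Y).
# The eight probabilities are invented for illustration and sum to 1.
joint = {
    assignment: p
    for assignment, p in zip(product([True, False], repeat=3),
                             [0.12, 0.18, 0.02, 0.08, 0.08, 0.12, 0.18, 0.22])
}

def query_X_given_e(e_value):
    # P(X | E=e) is proportional to sum over y of P(X, e, y); then normalize.
    unnormalized = {x: sum(joint[(x, e_value, y)] for y in [True, False])
                    for x in [True, False]}
    total = sum(unnormalized.values())
    return {x: p / total for x, p in unnormalized.items()}

print(query_X_given_e(True))
```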

Approximate Inference (Monte Carlo Algorithms): 
	Use this method when the exact probabilistic inference calculation would just be too much to deal with, that is, when your full joint distribution is huge. Here you just take a lot of samples and say that they’ll converge to the actual probability. This is so simple because you don’t have to really worry about dependencies as you did with the Bayesian network unless your “query” (that is, what you’re trying to find out) asks for a dependency (like P(X=true | Y=false)); you just take a bunch of samples and tally them. Everything’s straightforward, because you just store the data as samples and then calculate the probabilities you’re being queried for in the exact way you would if you were to do it by hand with that sample data. 
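
	A minimal sketch of the sampling idea: draw lots of samples from a made-up model, keep the ones matching the evidence, and tally:

```python
import random

# Made-up generative model: cloudy 40% of the time, and rain depends on cloudiness.
def sample_world():
    cloudy = random.random() < 0.4
    rain = random.random() < (0.8 if cloudy else 0.1)
    return cloudy, rain

# Estimate P(cloudy | rain) by tallying samples where the evidence (rain) holds.
n, cloudy_and_rain, rain_total = 100_000, 0, 0
for _ in range(n):
    cloudy, rain = sample_world()
    if rain:
        rain_total += 1
        cloudy_and_rain += cloudy

print(cloudy_and_rain / rain_total)   # close to the exact answer 0.32 / 0.38 ≈ 0.84
```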

Conditional Probability Table (CPT). Just a table that gives the probabilities of a variable given the outcomes of the variables it depends on. 

Expectation-Maximization algorithm: for the case where you want to learn the parameters of a probabilistic model (that is, I believe, the degree of influence that one variable has on another occurring). So, you’re trying to learn the probability of Rain, for instance, given Cloudy or not Cloudy. And say you have a training set of outcomes (actual outcomes, like training example 1: cloudy was false and rainy was true; training example 2: cloudy was false and rainy was false; etc.) which you would normally use to estimate these parameters, but say you don’t have the complete training set. What you do is set the parameters (the conditional probability tables) randomly to start, fill in the missing values in your training set based on these parameters, and then go back and re-estimate the parameters based on the newly completed training set. Continue until convergence. 

Naive Bayes Representation: Spam is the class, and the words are the features. We want P(spam | message), and in fact we can estimate this using P(message | spam) * P(spam) and check to make sure this is larger than the same quantity for not-spam. We can do this by looking closer at P(message | spam): essentially, break the message up into a bag of words with no order and say that the words are independent given that the message is spam/not spam (this is obviously silly), and then we can say P(message | spam) = P(word_1, word_2, …, word_n | spam), which equals P(w_1 | spam) * P(w_2 | spam) * …; thus, we just need to estimate the likelihood that an individual word will appear given spam, and then multiply this by the probability of spam. In practice, multiplying all these small probabilities is a problem because of floating point underflow, so instead do some log stuff. Slide 51 in bayesnets3 has great notes for clustering. 

Laplacian smoothing: say you never saw a word in your training set, then its probability of spam is zero which throws off the above calculation. To fix this, we can pretend we have seen every vocabulary word one more time than we did (prevents the zero). P(word | spam) = (number of word occurrences in spam +1)/(total number of words in spam messages + total number of unique words). 
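
A tiny sketch putting the last two notes together: a naive Bayes spam scorer with Laplace smoothing, working in log space to avoid underflow. The four training messages are made up:

```python
import math
from collections import Counter

# Made-up training data: (message, is_spam).
training = [
    ("win money now", True), ("free money offer", True),
    ("lunch tomorrow", False), ("project meeting tomorrow", False),
]

spam_words = Counter(w for m, s in training if s for w in m.split())
ham_words = Counter(w for m, s in training if not s for w in m.split())
vocab = set(spam_words) | set(ham_words)
p_spam = sum(1 for _, s in training if s) / len(training)

def word_prob(word, counts):
    # Laplace smoothing: pretend every vocabulary word was seen one extra time.
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

def log_score(message, counts, prior):
    # Work with sums of logs to avoid floating-point underflow.
    return math.log(prior) + sum(math.log(word_prob(w, counts)) for w in message.split())

msg = "free money tomorrow"
is_spam = log_score(msg, spam_words, p_spam) > log_score(msg, ham_words, 1 - p_spam)
print("spam" if is_spam else "not spam")
```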

Markov Chain (first-order): let’s now consider a probability model but where time is involved, so we have possible states to be in and we have some transition probabilities, that is say there’s a .9 probability it will be rainy tomorrow if it was rainy today, and .9 probability it will be sunny tomorrow if it was sunny today (the fact that there is only dependence on the day before indicates it’s first-order). So, you start in some given state (say it’s sunny), then you move along a chain of possible outcomes with the corresponding probabilities. Note that the probabilities at each of the times can be calculated solely based on the probabilities in the state before (what makes it first-order). Eventually, apparently, these probabilities converge to something fixed, so for instance our example converges to .5 that it is sunny on a given day and .5 rainy, regardless of if we start with rainy or sunny in the start state. This is the problem: uncertainty accumulates, so after a certain amount of time we just can’t make predictions (you can’t predict weather very far in the future because of this uncertainty). 
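
A quick sketch of this rain/sun chain, propagating the state distribution forward and watching it converge to .5/.5:

```python
# First-order Markov chain from the note: P(rain tomorrow | rain today) = 0.9,
# P(sun tomorrow | sun today) = 0.9.
transition = {"rain": {"rain": 0.9, "sun": 0.1}, "sun": {"rain": 0.1, "sun": 0.9}}

belief = {"rain": 0.0, "sun": 1.0}     # start certain that it's sunny
for day in range(50):
    # Tomorrow's distribution depends only on today's (first-order).
    belief = {s2: sum(belief[s1] * transition[s1][s2] for s1 in belief) for s2 in belief}

print({s: round(p, 3) for s, p in belief.items()})   # converges to ~0.5 / 0.5
```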

Hidden Markov Model (HMM): Imagine a Markov chain, but you don’t observe the states themselves, only some derivative of the states that is probabilistically defined. So, say you only know if it’s raining based on whether you see this guy has an umbrella (and sometimes he doesn’t even have an umbrella when it’s raining). So essentially, the state of the world is somewhat hidden from you. This is applicable to many, many things, such as voice recognition (the evidence being the Fourier-decomposed wave at many time slices, the states being the phonemes, which are essentially the sound units we think they’re saying based on those frequencies, and the transition model being how the phonemes are able to be strung together to make words). 

Filtering: update the agent’s belief state about the environment (say that there is some uncertainty in what the agent knows about the environment). This is done via evidence collected from each of the time intervals. 

To find the most likely sequence to actually occur, we can use the Viterbi algorithm: http://en.wikipedia.org/wiki/Viterbi_algorithm. 

K-means: This is known as a “flat” clustering strategy, and it’s not deterministic (obviously). Another downside is that you have to know how many clusters you’re searching for. Say you have some clusters on a plane (vector form); choose k random points (k being the number of groups you intend to find) and claim that those points are the centers of the clusters (they obviously likely won’t be), then assign all of your data points to the nearest of your k points, then set each new cluster center to be the average of all the data points assigned to that cluster. Repeat this until no data points change which cluster they belong to. Taking the average also actually minimizes the squared distance from the center to all points in a cluster. We can obviously see that the initial random point choices have a big effect and could get us stuck in something sub-optimal, so you may want to run the algorithm a bunch of times and go with the centers that are reproduced the most. 
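
A minimal k-means sketch on made-up 2-D points (plain Python, no libraries), following the assign-then-recompute-centers loop described above:

```python
import random

# Plain k-means on made-up 2-D points.
def kmeans(points, k, iterations=100):
    centers = random.sample(points, k)               # random initial centers
    for _ in range(iterations):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Move each center to the mean of its assigned points (keep it if empty).
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

data = [(1, 1), (1.5, 2), (0.8, 1.2), (8, 8), (8.5, 7.5), (9, 8.2)]
print(kmeans(data, k=2))
```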

Agglomerative clustering: this is known as a hierarchical strategy, which means it’s deterministic, which is good. Start with each data point in its own cluster and then at each iteration merge the two closest clusters. Note that with this, you need to figure out how you’re going to define closeness of clusters which have multiple data points. You’ll get different behavior for the different choices. Ideas are closest pair, farthest pair, average of pairs, or distance between centers. Note that with hierarchical clustering, you don’t need to know how many clusters you’re trying to find. 

Divisive clustering: this is known as a hierarchical strategy, which means it’s deterministic, which is good. Start with everything in one cluster, then recursively split clusters (not sure by what metric). Note this is top-down rather than bottom-up (as in agglomerative). 

Unsupervised Machine Learning for object recognition: first, you extract features from an image (exact process not described), and then you do the same for a bunch of images and you plot these features in some way on a graph, and then find the clusters in this graph (usually using k-means); each cluster center represents that feature across all these images. Then, when you’re looking at an image, you can try to represent it as the occurrences of these features in the image. You really decide what your feature recognizers are going to be. 

Naive Bayes Classifier: say you have a function, f, which is trying to return a y (a class), for an input data, x. Then you could figure out y for an f(x) using Bayes by doing argmax_y P(y | x), that is which y produces the largest output for this. Or, and this should be proportional to the above, argmax_y P(y) * P(x | y) = argmax_y P(y) * P(x_1 | y) * P(x_2 | y) … where x_n is an attribute of the x data point. So we’re essentially saying how likely is the class y times the probability of the feature x_i occurring in y, but for all attributes in x. 

Machine learning concepts: You have training data (what you train your classifiers on), held out set (what you can use to tune things), and then your test data (what you actually test your classifier on to see how well it does). 

Nearest neighbor classification: given some vector set with pre-classified data, choose the closest data point to the new data point and give the new data point the existing data point’s class, or the majority of the k (you choose k) nearest data points. 
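
A small k-nearest-neighbor sketch; the labeled points and labels are made up:

```python
from collections import Counter

# k-nearest-neighbor classification on made-up labeled 2-D points.
labeled = [((1, 1), "cat"), ((1.2, 0.9), "cat"), ((5, 5), "dog"), ((5.5, 4.8), "dog")]

def knn_classify(point, k=3):
    # Sort training points by squared distance; take the majority label of the k closest.
    nearest = sorted(labeled,
                     key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((1.1, 1.0)))   # -> "cat"
```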

Very impressive image completion in classification1, slides 70 and beyond. 

Decision trees: say you want to decide whether or not to do something; you can look at all of the various environmental factors that should feed into making that decision, and construct a tree using training data that allows you to decide on any input of factors. For instance, say you’re waiting at a restaurant and you want to decide if you should leave due to the wait; then the first node in your tree could be “price?” with the three options $, $$, $$$, and if it’s $$$ then you should just decide to go no matter what, and if it’s $$ or $ then you ask “are they family friendly?”, and if no then you leave, and if yes then you ask another question, etc. When constructing a decision tree, you essentially want to minimize the size of the tree to help make things generalize to new cases (that is, essentially, not overfit), so you choose the attribute that splits the results into the purest subtrees; you want to choose attributes at the top that give as close to an obvious split as possible. This can be represented as information gain = entropy(parent) - average_entropy(children), where entropy is a measure of how near uniformly distributed the balance of results (in our restaurant example: waits or leaves) is. Essentially, this measure tells us whether this choice of attribute actually split the data well (you want high entropy in the parent going to low entropy in both children, because low entropy means the results are mostly one or the other and not mixed). In creating the tree, we calculate the information gain for all attributes, choose the max, then continue recursively. You can do some things to improve accuracy, such as bagging (this may be random forests): instead of making one decision tree from your entire sample data set, you take random samples of the data set and create n decision trees, then pass new data through each (each model gets a vote for what it thinks you should do) and take the majority vote. Boosting: essentially, take a single “weak” (not great) decision tree or really any classifier and give it some weight; you also have weights on the sample data; you run the samples through your classifier and increase the weights of the samples you got wrong, and each time you do this you add another weight*new classifier, building up an ensemble that can handle problems (e.g. non-linear ones) that a single weak classifier can’t. 
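
A short sketch of the entropy / information gain computation used to pick splits; the labels and the two candidate splits are made up (one pure, one useless):

```python
import math
from collections import Counter

# Entropy and information gain for picking a decision-tree split.
def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent_labels, children_labels):
    # entropy(parent) minus the size-weighted average entropy of the children.
    total = len(parent_labels)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children_labels)
    return entropy(parent_labels) - weighted

parent = ["wait", "wait", "leave", "leave", "wait", "leave"]
split_on_price = [["wait", "wait", "wait"], ["leave", "leave", "leave"]]   # a pure split
split_on_noise = [["wait", "leave", "wait"], ["leave", "wait", "leave"]]   # a useless split
print(information_gain(parent, split_on_price), information_gain(parent, split_on_noise))
```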

Perceptron: more information in “Learning machine learning” document. Should review classification3 slide 8 and more for more info on weight update rule, should review this. 

Cool demo of image classification: http://demo.caffe.berkeleyvision.org/

Linear classification: covered in the “Learning machine learning” document, but should be noted that in linear classification there may be infinitely many lines that do the job of classifying, but we want to choose the one with the largest margin (as this does the best for generalizability). Called a “large margin linear classifier”. Can use the package “liblinear” for doing linear classification; fast to train. 

The Kernel Trick: if you are dealing with some data points that aren’t linearly separable, then you can try mapping them to a higher dimensional space in which they could be linearly separable. 
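
A toy illustration of the idea (not an actual kernelized SVM): 1-D points that no single threshold separates become linearly separable after mapping x -> (x, x^2):

```python
# Class A sits near zero, class B sits far from zero on either side,
# so no single threshold on x separates them. (Points are made up.)
inner = [-0.5, 0.0, 0.7]        # class A
outer = [-3.0, -2.5, 2.8, 3.2]  # class B

def lift(x):
    return (x, x * x)           # map to a higher-dimensional space

# In the lifted space, the horizontal line x2 = 2 separates the classes.
for label, points in [("A", inner), ("B", outer)]:
    print(label, [lift(x) for x in points], "above x2=2:", [lift(x)[1] > 2 for x in points])
```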

AI-Complete/Strong AI: a problem is AI-complete if solving it effectively requires making an AI that is as smart as a human. 

Convolution: setting each pixel to a weighted sum of its neighboring values (for example, their average), which can help reduce the noise of an image. 
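
A minimal sketch of that averaging (box blur) special case on a tiny made-up grayscale image:

```python
# Averaging ("box blur") convolution over a tiny made-up grayscale image.
image = [
    [10, 10, 10, 80],
    [10, 90, 10, 10],
    [10, 10, 10, 10],
]

def box_blur(img):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Average the 3x3 neighborhood, clipped at the image border.
            neighbors = [img[j][i]
                         for j in range(max(0, y - 1), min(h, y + 2))
                         for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(neighbors) / len(neighbors)
    return out

for row in box_blur(image):
    print([round(v, 1) for v in row])
```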

Edges: when breaking down a photo, what you can do is look at the intensity over a region of the image, plot that intensity (essentially, how much white light in a greyscale image), and then if you take the first derivative of the intensity, the peaks indicate the edges, i.e., where there’s a sharp drop or rise. 

Image Retrieval: one way to do this is to create a 3D histogram of all of the colors present in the image (so, go through every pixel and figure out how many times each color (a combination of the 3 sub-pixels) is present in the image). 
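
A small sketch of building such a color histogram; the pixel values and the choice of 4 bins per channel are made up:

```python
from collections import Counter

# 3-D color histogram for image retrieval: count how often each (R, G, B)
# bin appears. The "image" here is just a made-up list of RGB pixels.
pixels = [(255, 0, 0), (250, 5, 5), (0, 255, 0), (0, 0, 255), (254, 2, 1)]

def color_histogram(pixels, bins_per_channel=4):
    bin_size = 256 // bins_per_channel
    # Quantize each channel into a few bins so similar colors land in the same bucket.
    return Counter((r // bin_size, g // bin_size, b // bin_size) for r, g, b in pixels)

print(color_histogram(pixels))
```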

Segmentation: grouping features that belong together (top down is pixels belong together because they come from the same object and bottom up is pixels belong together because they look similar). 

Sliding Window Approaches: essentially you just slide a window over the entire image and detect each region for faces. Face detection in cameras is done this way. 

Communication for speaker (NLP): needs intention (decide what info should be transmitted), generation (turn that information into words), and synthesis (output that string in the desired form, text or speech). 

Communication for listener: perception (actually map the speech or text to a string of words), analysis (determine the information content of the string; this involves syntactic interpretation (parsing), which is getting the right parse tree; semantic interpretation, extracting the meaning of the string in its logical form; and pragmatic interpretation, considering the context that could alter the meaning), and incorporation (decide whether or not to believe the content of the string and add it to the Knowledge Base). 

Syntax: the form (like the order of the words and the parts of speech etc). A sentence ending in n prepositional phrases has over 2^n syntactic interpretations. 

Semantics: the meaning of the text. Problem is, there’s a lot of ambiguity in text, so the same statement can mean very different things e.g. “you have a green light”. 

The reason language is so complex is because ambiguity acts as compression, essentially. 

Morphological analysis: the process of breaking up words into their morphemes (the smallest linguistic units that have semantic meaning): “carry + ed”, “depend + ent” 

Word Sense Disambiguation (WSD): the process of determining the proper sense of each ambiguous word. 

Textual Entailment: determine whether one sentence implies another. Example “Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.” —?—> Yahoo bought Overture. 

Anaphora Resolution/Co-reference: Determine which phrases in a document refer to the same thing. So “John put the carrot on the plate and ate it” (was “it” referring to the plate or the carrot?). 

Corpus: a collection of text, often annotated in some way. The idea is to train the machine learning system on the manually annotated corpora, and then apply its linguistic knowledge to an NLP system which annotates text.