Just before I started auditing the Machine Learning class at UNC, I started reading some pretty good explanations of machine learning and found it rather intuitive at its base. These are my notes from those readings.

Learning Machine Learning:

Good intro: http://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer

Another good intro, with focus on computer vision: http://engineering.flipboard.com/2015/05/scaling-convnets/

Deep learning: just using a lot of layers (which may be non-linear). 

A great quote from an article about how Deep Learning is applied to language: A neural network can “learn” words by spooling through text and calculating how each word it encounters could have been predicted from the words before or after it. By doing this, the software learns to represent every word as a vector that indicates its relationship to other words—a process that uncannily captures concepts in language. The difference between the vectors for “king” and “queen” is the same as for “husband” and “wife,” for example. The vectors for “paper” and “cardboard” are close together, and those for “large” and “big” are even closer. Source: http://www.technologyreview.com/featuredstory/540001/teaching-machines-to-understand-us/

Machine learning is really any program that can improve its performance (its output, compared to some ideal measure) after being exposed to some sort of experience. 

Supervised Machine Learning: when the program learns with training data (is given some examples of what to output with certain input). 
		Regression: when we’re trying to fit a continuous relationship between input and output. 
		Classification: when we’re trying to say yes or no to some input. 

Unsupervised Machine Learning: when the program is not given any training data, and must figure out everything from just the raw data. 

Some of the below will only directly apply to supervised learning:

At the core of the algorithm, of course, is a function. Call it h(x). x is almost surely a multi-dimensional vector, which means that the output will likely depend on a large range of variables. For a single-dimensional case, say h(x) = c_0 + c_1 * x. All we’re trying to do is tweak c_0 and c_1 to get ideal output. 
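As a minimal sketch (the names here are my own), the single-dimensional hypothesis is just a line with two tunable parameters:

```python
# Single-dimensional hypothesis h(x) = c0 + c1 * x.
# c0 and c1 are the parameters that training will tweak.
def h(x, c0, c1):
    return c0 + c1 * x

# With c0 = 1 and c1 = 2: h(3) = 1 + 2*3 = 7.
print(h(3, 1, 2))  # 7
```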

Use case for understanding: predicting market value of house. x_1 could be square footage, x_2 could be number of bedrooms, etc. 

The tweaking process: this depends on another function (the “cost function,” also known as the “loss function”). Which cost function is used is a choice. This function, call it J(c), takes all of the parameters and calculates the average degree of error between a predicted value (h(x)) and the ideal (actual) value, y, over all values in the training data. For example, we could use the least squares difference. Remember, the goal is to minimize this average error. 
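Sticking with the single-dimensional h(x) = c_0 + c_1 * x, a least-squares cost function could be sketched like this (the function name `j` is just mirroring the J(c) notation above):

```python
# Least-squares cost: average squared difference between the
# prediction h(x) = c0 + c1*x and the actual value y.
def j(c0, c1, xs, ys):
    return sum((c0 + c1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3]
ys = [2, 4, 6]           # data that is exactly y = 2x
print(j(0, 2, xs, ys))   # 0.0 -- a perfect fit has zero cost
print(j(0, 1, xs, ys) > 0)  # True -- any worse fit costs more
```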

With this clear definition of a cost function, we understand now that all we must do is minimize J(c). We can do this with calculus (this is gradient descent): start at some initial parameter vector c, take the derivative of the cost function there, use this derivative to figure out which modifications of c would make J(c) smaller than its current value (this may involve, for instance, c_1 getting bigger, c_2 getting smaller, c_3 getting bigger, etc.), take a step of some size in those directions, then for this new c repeat the process: find the derivative at that point, figure out how to make modifications, and continue until you have found a minimum. 
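Here is a sketch of that loop for the single-dimensional case, with the derivatives of the least-squares cost worked out by hand. The step size (`alpha`) and iteration count are arbitrary choices for this toy data:

```python
# One gradient-descent step for h(x) = c0 + c1*x with least-squares cost.
def gradient_step(c0, c1, xs, ys, alpha=0.01):
    n = len(xs)
    errs = [c0 + c1 * x - y for x, y in zip(xs, ys)]
    dc0 = 2 * sum(errs) / n                              # dJ/dc0
    dc1 = 2 * sum(e * x for e, x in zip(errs, xs)) / n   # dJ/dc1
    return c0 - alpha * dc0, c1 - alpha * dc1            # step downhill

xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]  # data generated by y = 2x + 1
c0, c1 = 0.0, 0.0
for _ in range(5000):
    c0, c1 = gradient_step(c0, c1, xs, ys)
print(round(c0, 2), round(c1, 2))  # roughly 1.0 and 2.0
```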

For classification, we want something that makes a prediction between 0 and 1. The prediction’s distance from .5 indicates its degree of certainty, where 0 represents a negative and 1 represents a positive. 
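A common way to squash an unbounded score into that 0-to-1 range is the logistic (sigmoid) function; the notes above don’t name it, but it’s the standard choice:

```python
import math

# The sigmoid maps any real number into (0, 1), so it can serve
# as a classification prediction with 0.5 as the uncertain midpoint.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))   # 0.5 -- maximally uncertain
print(sigmoid(6))   # ~0.998 -- confident positive
print(sigmoid(-6))  # ~0.002 -- confident negative
```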


“Deep learning” (really just using neural networks in Machine learning): 


In the case of binary (supervised) classifiers, we can call the deciding trendline used to determine whether to output a 0 or 1 a “perceptron”. Say we have inputs x and weights w, but also a constant bias, b, so our determination/classification function is f(x) = w · x + b. When trying to determine whether to call an input a 0 or 1, we simply output based on whether f(x) is above 0, for instance (the exact cutoff doesn’t really matter, since you trained against it). 
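A perceptron is small enough to sketch in a few lines. The weights and bias below are hand-picked (not trained) to show the decision rule in action:

```python
# Perceptron decision rule: output 1 if w . x + b is above 0, else 0.
def perceptron(x, w, b):
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else 0

# With weights (1, 1) and bias -1.5, this computes logical AND:
print(perceptron([1, 1], [1, 1], -1.5))  # 1
print(perceptron([1, 0], [1, 1], -1.5))  # 0
print(perceptron([0, 0], [1, 1], -1.5))  # 0
```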

Here’s the problem: this approach only allows linear distinction (f(x) is linear in x, so the decision boundary must be a line). What that means is that if your data set is, say, graphed on a 2D plane, and colors denote 1’s and 0’s, then you can only draw a single line on this plane as a differentiator. But if you were to apply multiple perceptrons (having multiple lines), then you could essentially slice up the data in many different ways (e.g. above trend line 1 but below 2, so that’s a 1, vs. below trend lines 1 and 2, which is a 0). 

Feedforward neural network:

Now, paint this picture differently. Imagine layers of neurons (circles), connecting to each other with arrows. The first layer is the input layer, the second is hidden, and the last is output. The input layer actually just holds the values of the input (so for a 4-dimensional input x, you’d have 4 input neurons). The neurons hold numerical values, and are used as inputs for the next layer. Now where are the functions in this? Each of the arrows has a weight, and so essentially by propagating the inputs forward by those weights, we are computing f(x) = w · x for each middle neuron (we could add a +b here, too). 
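The forward propagation described above can be sketched in a few lines. Sizes and weights here are made up purely for illustration, and there’s no nonlinearity yet:

```python
# One layer's forward pass: each output neuron computes w . x + b.
def layer(inputs, weights, biases):
    # weights[j] is the list of weights on the arrows into neuron j.
    return [sum(w * x for w, x in zip(ws, inputs)) + b
            for ws, b in zip(weights, biases)]

# 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = layer([1.0, 2.0], [[0.5, 0.5], [1.0, -1.0]], [0.0, 0.0])
output = layer(hidden, [[1.0, 1.0]], [0.0])
print(hidden, output)  # [1.5, -1.0] [0.5]
```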

Imagine using the above setup for determining the function XOR (this example is in the article above). With two transfer neurons, one would learn how to draw a horizontal line in the middle, and the other a vertical one in the middle, and then the activation function would learn to say 1 when above one line and to the right of the other, etc. 

Still, we can only draw straight lines with this approach unless we introduce some non-linearity in the transfer and activation functions. 
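To make the XOR idea concrete, here’s a sketch using a step function as the non-linearity. The weights are hand-picked rather than learned; the two hidden neurons play the roles of the two dividing lines described above:

```python
# XOR from two step-activated hidden neurons (weights chosen by hand).
def step(z):
    return 1 if z > 0 else 0

def xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # fires when at least one input is 1 (OR)
    h2 = step(x1 + x2 - 1.5)    # fires only when both inputs are 1 (AND)
    return step(h1 - h2 - 0.5)  # OR but not AND

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

Without the `step` non-linearity the whole network would collapse into a single linear function, and no choice of weights could produce XOR.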

	During training, you can calculate error (typically using mean squared error, 1/2(ideal - actual)^2) and attempt to minimize the average error over all inputs, to try to find the ideal set of weights. But this can be a lot of weights, so you’re really searching a (# of weights)-dimensional surface for its global minimum. 

	There’s a proven theorem (the universal approximation theorem) that a single finite layer of transfer neurons can be trained to approximate nearly any continuous function. 

One known problem is overfitting, likely the central problem in machine learning: the network may fit the training data too closely, to the point where it performs poorly on real-world data, likely because real-world inputs lack the irrelevant subtleties that were present in the training data and that the network mistakenly learned as important. 

Feature detection (auto encoders): 

	For instance, in a greyscale image, say we were to take every pixel as an input and output the same-size image, but in the middle we had significantly fewer neurons than pixels. This sort of setup forces (apparently) the transfer neurons to learn only the important features of the input. By doing this sort of feature-based compression, we’re actually helping solve the problem of overfitting, because we’re not matching every tiny feature in the sample dataset but only the big ones that are shared amongst most/all. In terms of human thought, you can sort of imagine this as your mental archetype of an object: notice that you could vary many of its individual features and still recognize it as that object. 
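Structurally, an autoencoder is just the feedforward setup with a narrow middle. The sketch below only shows the shapes (16 “pixels” squeezed through a 4-neuron bottleneck); the weights are random, whereas in practice they’d be trained so the output matches the input:

```python
import random

random.seed(0)

# A layer with random weights, purely to illustrate the shapes involved.
def layer(inputs, n_out):
    return [sum(random.uniform(-1, 1) * x for x in inputs)
            for _ in range(n_out)]

image = [random.random() for _ in range(16)]   # 16 input "pixels"
code = layer(image, 4)                         # compressed features
reconstruction = layer(code, 16)               # same size as the input
print(len(image), len(code), len(reconstruction))  # 16 4 16
```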

Contrastive Divergence: 

	This is a sort of weird, iterative back/forward propagation technique in which the input is passed in, there is propagation to the transfer layer, this output is then propagated back to the input layer and then re-input to the transfer layer, and then there is some function for updating the weights given the difference between the original transfer output and the secondary transfer output. In terms of human thought, this is sort of like seeing something and trying to classify it, then trying to visualize the object based on its classification and trying to classify that? I can see how this sort of highlights features.
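A heavily simplified sketch of one round of that procedure (this is the "CD-1" variant used with restricted Boltzmann machines, stripped of sampling and bias terms; all sizes and the learning rate are illustrative):

```python
import math
import random

random.seed(0)
n_visible, n_hidden, lr = 4, 2, 0.1
# W[i][j] is the weight between visible unit i and hidden unit j.
W = [[random.uniform(-0.1, 0.1) for _ in range(n_hidden)]
     for _ in range(n_visible)]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(v):   # visible -> hidden
    return [sigmoid(sum(v[i] * W[i][j] for i in range(n_visible)))
            for j in range(n_hidden)]

def backward(h):  # hidden -> visible (reconstruction)
    return [sigmoid(sum(h[j] * W[i][j] for j in range(n_hidden)))
            for i in range(n_visible)]

v0 = [1, 0, 1, 0]
h0 = forward(v0)     # first pass to the hidden layer
v1 = backward(h0)    # reconstruct the input from the hidden values
h1 = forward(v1)     # second pass, from the reconstruction
# Update weights from the difference between the two passes.
for i in range(n_visible):
    for j in range(n_hidden):
        W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
```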