Misc Definitions, and Intro to Linear Classifiers (Perceptron)
Definitions
Machine Learning
- think "learning a machine", not "learning by a machine"
Types of Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Examples
Supervised:
map bitmap of handwritten digit to digit class {0,1,...,9,not-a-digit}.
map credit card transaction authorization requests to Pr(fraudulent)
map patch of lung X-Ray to Pr(cancerous)
Unsupervised:
figure out the different letters in some unknown alphabet
figure out the 3D structure of the world from seeing lots of movies
figure out the structure of English from hearing many spoken sentences
Reinforcement Learning:
figure out which moves are good (ie lead to winning) in Chess
Success Stories of Machine Learning
PAPnet - multi-billion dollar success - saved tens of thousands of
women's lives - maps image of plated-out cell from stained Pap smear
slide to rating of abnormality, top-ranked abnormal cells in slide
presented to human operator
Handwritten digit recognition for postal zip-code sorting
Handwriting recognition for PDAs
Speech recognition - used to be hand-crafted rules, now dominated by
machine learning techniques, especially HMMs (hidden markov models).
Allows replacement of phone menu trees, hands-free phone dialing,
word-based searching of corpus of audio data, etc.
Fraud detection for credit cards
Example Learning Theorem!
Simple example of theory in machine learning: we will define a very
simple learning situation and then analyze it. Typically we are
concerned with performance on new (unseen) inputs. Let us assume
that inputs x are drawn iid from some unknown distribution p(x) and
that future inputs will be drawn from the same distribution.
One sided error: assume there is some threshold x_c and that all
x=x_c should be classified
"not okay". Imagine that incorrectly classifying an x that is really
okay as not okay (false negative) is costly but not very costly,
while incorrectly classifying an x that is actually not okay as okay
is a major disaster. (Think of classifying fruits as to whether they
are poisonous. Acceptable to throw away non-poisonous fruit, but
disastrous to eat poisonous fruit.) Consider the following "online"
learning system.
At each time step we get a fruit, with a 1D measurement x(t). We
need to decide whether x(t) is okay. After making our decision, we
are told the correct classification. We would like to come up with a
simple algorithm that (a) never makes a false positive error, (b)
minimizes the number of false negative errors, (c) has a provable
bound on the number of false negative errors it will make.
Consider the following algorithm. We maintain a threshold
\hat{x}_c(t), initialized at \hat{x}_c(1) = -\infty.
We use \hat{x}_c(t) to classify x(t). If we make a mistake (which
had better be a false negative) we raise \hat{x}_c(t) = x(t).
Theorem: the expected number of false negatives after the T-th actual
(true) negative is log(T).
Proof: let T(i) be the time we see the i-th positive exemplar.
If we get x(T(t)) wrong, it can only be because we have never seen
such a large positive exemplar. The chance of that one being the
largest we have ever seen is 1/t. So the odds of getting the t-th
positive examplar wrong are 1/t, and summing that up, the total
number wrong up to the t-th is 1/1 + 1/2 + 1/3 + ... + 1/t = log(t).
(Meaning the natural logarithm.)
We start with supervised learning.
Let us define some terminology.
learning a machine that looks like this:
y = h(x; w)
where x is the input (eg sensory input), w are the free parameters of
the machine we need to set properly to make it match the training
data, and y is the output of the machine given input x and parameter
setting w.
We need a name for the actual output of the machine given input x_p
(which we call y_p) and the desired output of the machine given input
x_p, which we call d_p.
Input Space X, x \in X
Output Space Y, y \in Y.
distinguish between space of possible machine outputs and space of
desired (target) outputs: these may differ. So y \in Y but d \in D.
Examplars
examples of inputs of corresponding output. We say
x_p \mapsto d_p
is an "examplar", or more colloquially a "training pattern."
Weight Space
can be continuous, discrete, or both in potentially complicated ways.
w \in W.
Hypothesis Space H. Parametrized by w, perhaps over-parametrized.
h_w \in H \subset (X \rightarrow Y)
H is that subset of (X \rightarrow Y) that can actually be
constructed by setting w appropriately. So
H = { the mappings x \mapsto h(x; w) such that w \in W }
We use standard variable names for members of these spaces.
Sometimes this standardization conflicts with other sorts of
standardization, in which case we punt in the interest of clarity.
Linear Classifier (aka Perceptron)
input space is R^n
output space is {-1,+1}, corresponding to no/yes dichotomy
(why {-1,+1} instead of {0,1} ? The symmetry will prove useful in
simplifying some formulas later.)
weight space is R^(n+1)
hypothesis space is linear decision surface
if w.x > theta then y = +1 else y = -1
division between +1 region and -1 region is w.x = theta.
(the boundary, in input space, between different outputs is called
the separation surface.)
note that changing theta keeps the direction of the separation plane
the same, just translating it.
And note that if we set theta=0 then the separating plane goes
through the origin, and if perpendicular to w if w is regarded as a
vector in input space. This means that even for nonzero theta, the
surface is perpendicular to the vector w.
Note that we can switch the "sense" of the separation surface, ie
make the +1 region into the -1 region and vice-versa, by changing the
sign of w, ie the sign of all entries of w. (Exercise: does theta
also need to be adjusted?)
Note that the mapping from (w,theta) to a "hypothesis" is
many-to-one, as for instance in 2D
(w1=2, w2=3, theta=4)
and
(w1=4, w2=6, theta=8)
give the same mapping from x to y. In fact, scaling w and
theta by some alpha>0 will leave the hypothesis unchanged.
Perceptron References
- Wikipedia Perceptron
entry
- Perceptron
Learning Rule chapter from the textbook Neural Network
Design by Martin T. Hagan, Howard B. Demuth, and Mark
H. Beale, ISBN 0-9717321-0-8