The Next Generation of Neural Networks


>>It’s always fun introducing people who
need no introduction. But for those of you who don’t know Geoff and his work, he pretty
much created–he helped create the field of machine learning as it now exists and was
on the cutting edge back when it was the bleeding edge of statistical machine learning and neural
nets when they made their resurgence for the first time in our lifetime, and has
been a constant force pushing the analysis in the field away from the touchy-feely
"let's tweak something until it thinks" approach and towards building
systems that we can understand and that actually do useful things that make our lives better.
So you–if you read the talk announcement, you’ve seen all of his many accomplishments
and members of various royal societies, etcetera, so I won’t list those. I think instead of
taking up more of his time, I’m just going to hand the microphone over to Geoff.
>>HINTON: Thank you. I’ve got–I got it. So the main aim of neural network research
is to make computers recognize patterns better by emulating the way the brain does it. We
know the brain learns to extract many layers of features from the sensory data. We don’t
know how it does it. So it’s a sort of joint enterprise of science and engineering. The
first generation of neural networks–I can give you a two minute history of neural networks.
The first generation with things like Perceptrons, where you had hand coded features, they didn’t
adapt so you might put an image–the pixels of an image here, has some hand coded features,
and you’d learn the weights to decision units and if you wanted funding, you’d make decision
units like that. These were fundamentally limited in what they could do, as Minsky and Papert
pointed out in 1969, and so people stopped doing them. Then sometime later, people figured out how
to change the weights of the feature detectors as well as the weights of the decision units.
So what you would do is take an image here, you'd go forwards through a feed-forward neural
network, you will compare the answer the network gave with the correct answer, you take some
measure of that discrepancy and you send it backwards through the net and as you go backwards
through the net, you compute the derivatives for all of the connection strengths here, both
these ones and those ones and those ones, of the discrepancy between the correct answer
and what you got, and you change all these weights to get closer to the correct answer.
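That forward-and-backward procedure can be sketched in a few lines. This is a minimal illustration with made-up sizes, a logistic non-linearity, and a squared-error discrepancy, not anything from the talk itself:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A minimal backpropagation sketch: one hidden layer, squared-error loss.
# All sizes, the learning rate, and the target are illustrative.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (4, 3))   # input -> hidden weights
W2 = rng.normal(0, 0.1, (3, 2))   # hidden -> output weights
x = rng.normal(size=4)
target = np.array([1.0, 0.0])

for step in range(1000):
    h = sigmoid(x @ W1)           # forward pass
    y = sigmoid(h @ W2)
    err = y - target              # discrepancy at the output
    # Backward pass: the chain rule gives derivatives for every weight.
    dy = err * y * (1 - y)
    dh = (dy @ W2.T) * h * (1 - h)
    W2 -= 0.5 * np.outer(h, dy)   # move the weights toward the correct answer
    W1 -= 0.5 * np.outer(x, dh)

print(sigmoid(sigmoid(x @ W1) @ W2))
```

After enough steps the output moves close to the target, which is all the chain rule is doing here.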
That’s backpropagation, and it’s just the chain rule. It works for non-linear units
so potentially, these can learn very powerful things and it was a huge disappointment. I
can say that now because I've got something better. Basically, we thought when we got this that
we could learn anything and we'd get lots and lots of features: object recognition,
speech recognition, it’ll be easy. There’s some problems, it worked for some things,
[INDISTINCT] can make it work for more or less anything. But in the hands of other people,
it has its limitations and something else came along so there was a temporary digression
called kernel methods where what you do is you do Perceptrons in a cleverer way. You
take each training example and you turn the training example into a feature. Basically the
feature is how similar are you to this training example. And then, you have a clever optimization
algorithm that decides to throw away some of those features and also decides how to
weight the ones it keeps. But when you’re finished, you just got these fixed features
produced according to a fixed recipe that didn’t learn and some weights on these features
to make your decision. So it’s just a Perceptron. There’s a lot of clever math to how you optimize
it, but it’s just a Perceptron. And what happened was people forgot all of Minsky and Papert’s
criticisms about Perceptrons not being able to do much. Also it worked better than backpropagation
in quite a few things which was deeply embarrassing, but it says a lot more about how bad backpropagation
was than about how good support vector machines are. So if you ask what's wrong with
backpropagation, it requires labeled data and some of you here may know it's easier to
get data than labels. If you have a–here's a model of the brain: you [INDISTINCT] about
that many parameters and you [INDISTINCT] for about that many seconds. Actually, twice
as many which is important to some of us. There’s not enough information in labels to
constrain that many parameters. You need ten to the five bits or bytes per second. There’s
only one place you’re going to get that and that’s the sensory input. So the brain must
be building a model of the sensory input, not of these labels. The labels don’t have
enough information. Also the learning time didn’t scale well. You couldn’t learn lots
of layers. The whole point of backpropagation was to learn lots of layers and if you gave
it like ten layers to learn, it would just take forever. And then there’s some neural
things I won’t talk about. So if you want to overcome these limitations, we want to
keep the efficiency of a gradient method for updating the parameters but instead of trying
to learn the probability of a label given an image, where you need the labels, we’re
just going to try and learn the probability of an image. That is, we’re going to try and
build a generative model that if you run it will produce stuff that looks like the sensory
data. In other words, we're going to try and learn to do computer graphics, and once we can do
that, then computer vision is just going to be inferring how the computer graphics produce
this image. So what kind of a model could the brain be using for that? The building
blocks I’m going to use are a bit like neurons. They’re intended to be a bit like neurons.
They're these binary stochastic neurons. They get some input and they output
either a one or a zero, so it's easy to communicate and it's probabilistic. So
this is the probability of giving a one as a function of the total input you get which
is your external input plus what you get for other neurons times the weights on the connections.
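That probability is just the logistic function of the total input. Here is a minimal sketch of such a binary stochastic neuron; the inputs, weights, and function names are all illustrative:

```python
import numpy as np

# A binary stochastic neuron: it outputs a 1 with probability given by the
# logistic of its total input, which is its external input plus the states
# of the other neurons weighted by the connection strengths.
def fire_probability(external_input, states, weights):
    total = external_input + states @ weights
    return 1.0 / (1.0 + np.exp(-total))

def sample(p, rng):
    return int(rng.random() < p)   # emit a 1 or a 0, stochastically

rng = np.random.default_rng(1)
p = fire_probability(0.5, np.array([1.0, 0.0, 1.0]), np.array([2.0, -1.0, 0.5]))
print(p)             # probability of giving a 1, here sigmoid(3.0) ≈ 0.95
print(sample(p, rng))
```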
And we’re going to hook those up into a little module that I call a restricted Boltzmann
Machine. This is the module here, it has a layer of pixels and a layer of feature detectors.
So it looks like we're never going to learn lots and lots of layers of feature detectors.
It looks like we've thrown out the baby with the bathwater and we're now just restricted
to learning one layer of features, but we'll fix that later. We're going to have very
restricted connectivity, hence the name, where this is going to be a bipartite graph.
The visible units for now don’t connect to each other and the hidden units don’t connect
to each other. The advantage of that is if I tell you the state of the pixels, these
become independent and so you can update them independently and in parallel. So given some
pixels and given that you know the weights on the connections, you can update all these
units in parallel, and so you’ve got your feature activations very simply, there’s no
lateral interactions there. These networks are governed by an energy function and the
energy function determines the probability of the network adopting particular states
just like in a physical system. These stochastic units will kind of rattle around and they’ll
tend to enter low energy states and avoid high energy states. The weights determine
the energies linearly. The probabilities are an exponential function of the energies, so the
log probabilities are a linear function of the weights, and that makes
learning easy. There's a very simple algorithm that Terry Sejnowski and I invented back
in 1982. In a general network, you can run it but it’s very, very slow. In this restricted
Boltzmann Machine, it’s much more efficient. And I’m just going to show you what the Maximum
Likelihood Learning Algorithm looks like. That is, suppose you said take one of your
parameter on your connection, how do I change that parameter so that when I run this machine
in generative mode, in computer graphics mode, it’s more likely to generate stuff like the
stuff I’ve observed? And so here’s what you should do, you should take a data vector,
an image, and you should put it here on the visible units and then you should let the
visible units via their current weights activate the feature detectors. So you provide input
to each feature detector and you now make a stochastic decision about what the feature
detector should turn on. Lots of positive input, it almost certainly turns on, lots
of negative input it almost certainly turns off. Then, given the binary state of the feature
detectors, we now reconstruct the pixels from the feature detectors and we just keep going
like that. And if we run this chain for a long time, this is called a Markov chain,
and this process is called alternating Gibbs sampling. If we go backwards and forwards for a
long time, we’ll get fantasies from the model. This is the kind of stuff the model would
like to produce. These are the things that the model shows you when it’s in its low energy
states given its current parameters. So that’s the sort of stuff it believes in, this is
the data and obviously you want to say to it, believe in the data, not your own fantasies.
And so we'd like to change the parameters, the weights on the connections, so as to
make this more likely and that less likely. And the way to do that is to measure
how often a pixel i and a feature detector j are on together when I'm showing you the data
vector v. And then measure how often they’re on together when the model is just fantasizing
and raise the weights by how often they’re on together when it’s seeing data and lower
the weights by how often they’re on together when it’s fantasizing. And what that will
do is it’ll make it happy with the data, low energy, and less happy with its fantasies.
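The rule he just stated, together with the one-step reconstruction shortcut he describes a moment later, can be sketched like this. The sizes, the learning rate, and the use of probabilities rather than binary states on the way down are all illustrative choices, not the talk's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch of the RBM weight update. The maximum-likelihood rule raises w_ij
# by how often pixel i and feature detector j are on together on the data,
# and lowers it by the same statistic on the model's fantasies; the one-step
# shortcut below measures the "fantasy" term after a single reconstruction
# instead of running the Markov chain until it settles.
rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 4, 0.1
W = rng.normal(0, 0.01, (n_visible, n_hidden))

def cd1_update(v_data):
    p_h = sigmoid(v_data @ W)                       # up: activate the features
    h = (rng.random(n_hidden) < p_h).astype(float)  # stochastic binary states
    p_v = sigmoid(h @ W.T)                          # down: reconstruct the pixels
    p_h_recon = sigmoid(p_v @ W)                    # up again
    positive = np.outer(v_data, p_h)                # statistics on the data
    negative = np.outer(p_v, p_h_recon)             # statistics on the reconstruction
    return lr * (positive - negative)

v = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
for _ in range(200):
    W += cd1_update(v)
```

After training, reconstructing v gives high probabilities on the pixels that were on and low probabilities on the ones that were off, which is the sense in which the model has become "happy with the data."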
And so it will–its fantasies will gradually move towards the data. If its fantasies are
just like the data, then these correlations, the probability of pixel i and feature detector
j being on together in the fantasies will be just the same as in the data, and so it’ll
stop learning. So it’s a very simple local learning rule that a neuron could implement
because it just involves the activity of a neuron and the other neuron it connects
to. And that will do Maximum Likelihood Learning, but it's slow. You have to let it settle for like
a hundred steps. So, I thought about how to make this algorithm go a hundred thousand
times faster. The way you do it is instead of running for a hundred steps, you just run
for one step. So now you go up, you come down and you go up again. And you take this difference
in statistics and that’s quite efficient to do. It took me 17 years to figure this out
and in that time computers got a thousand times faster. So, the change in the weight
now is the difference–is a learning rate times the difference between statistics measured
with data and statistics measured with reconstructions of the data. That’s not doing Maximum Likelihood
Learning but it works well anyway. So I’m going to show you a little example, we are
going to take a little image where we’re going to have handwritten digits, this is just a
toy example. We're going to put random weights on the connections then we're going to activate
the binary feature detectors given the input they’re getting from the pixels, then we’re
going to reconstruct the image and initially we can get a lousy reconstruction, this will
be very different from the data because they’re random weights. And then we’re going to activate
the feature detectors again and we’re going to increment the connections on the data and
we're going to decrement the connections on the reconstructions, and that is eventually going
to learn nice weights for us, as I'll show you, nice connection strengths that will make this
be a very good model of [INDISTINCT]. It's important to run the algorithm the right way round,
where on the data you increment connection strengths and on the reconstruction you decrement
them–the reconstruction is really a sort of screwed-up version of the data that's been infected by the prejudices of the model. So
the model kind of interprets the data in terms of its features then it reconstructs something,
it would rather see than the data. Now you could try running a learning algorithm where
you take the data, you interpret it, you imagine the data is what you would like to see and
then you learn on that. That’s the algorithm George Bush runs and it doesn’t work very
well. So, after you’ve been doing some learning on this for not very long, I’m now showing
you 25,000 connection strengths. Each of these squares is one of the features on this slide. That's
a feature and the intensity here shows you the strength of the connection to the pixels.
So this feature really wants to have these pixels off and it really wants to have these
pixels on and it doesn’t care much about the other ones, mid-gray means zero. And you can
see the features are fairly local and these features are now very good at reconstructing
twos. It was trained on twos. So if I show you–show it some twos it never saw before,
and get it to reconstruct them, you can see it reconstructs them pretty well. The funny
pixels here which aren't quite right are because I'm using Vista. So you can see the reconstruction
is very like the data and the–it’s not quite identical but it’s a very good reconstruction
for a wide variety of twos and these are ones it didn’t see during training, okay. Now what
I'm going to do–that's not that surprising; if you just copied the pixels and copied them
back, you’d get the same thing, right? So that would work very well. But now I’m going
to show it something it didn’t train on. And what you have to imagine is that Iraq is made
of threes but George Bush thinks it’s made of twos, okay? So here’s the real data and
this is what George Bush sees. That’s actually inconsistent with my previous joke because
[INDISTINCT] this learning algorithm. Sorry about that. Okay, so you see that it perverts
the data into what it would like to believe which is like what it’s trained on. Okay,
that was just a toy example. Now what we're going to do is train a layer of features like
that in the way I just showed you. We get these features that are good at reconstructing
the data, at least for the kind of data it’s trained on. And then we’re going to take the
activations of those features and we’re going to make those data and train another layer,
okay. And then we’re going to keep doing that and for reasons that are slightly complicated
and I will partially explain, this works extremely well. You get more and more abstract features
as you go up and once you’ve gone up through about three layers, you got very nice abstract
features that are very good then for doing things like classification. But all these
features were learned without ever knowing the labels. It can be proved that every time
we add another layer, we get a better model of the training data or to be more precise,
we improve a lower bound on how good a model we've got of the training data. So here's a quick
explanation of what's going on. When we learn the weights in this little restricted Boltzmann
Machine, those weights define the probability that, given a vector here, we're reconstructing
a particular vector there. So that’s the probability of a visible vector given a hidden vector.
They also define this whole Markov chain, if you went backwards and forwards many times.
And so if you went backwards and forwards many times, and then looked to see what you
got here, you’ll get some probability distribution of the hidden vectors and the weights defining
that. And so you can think of the weights as defining both a mapping from these vectors
of activity over the hidden units to the pixels, to images, that’s this term and the same weights
define a prior over these vectors of hidden activities. When you learn the next level of Boltzmann
Machine up, you’re going to say, “Let’s keep this, keep this mapping, and let’s learn a
better model of the posterior that we’ve got here when we use this mapping,” and you keep
replacing the posterior–implicit posterior defined by these weights by a better one which
is the p of v given h defined by the next Boltzmann Machine. And so what you’re really
doing is dividing this task into two tasks. One is, find me a distribution that’s a little
bit simpler than the data distribution. Don’t go the whole way to try and find a full model,
just find me something a bit simpler than the data distribution. This is going to be
easy [INDISTINCT] Boltzmann Machine to model, that’s very nonparametric. And then find me
a parametric mapping from that slightly simpler distribution to the data distribution. So
I call this creeping parameterization. What you’re really doing is–it’s like taking the
shell off an onion, you got this distribution you want to model. Let’s take off one shell
which is this and get a very similar distribution that’s a bit easier to model and some parameters
that tell us how to turn this one into this one, and then that's going to solve the problem
of modeling this distribution. So that’s what’s going on when you learn these multiple layers.
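The layer-by-layer procedure can be sketched as follows; `train_rbm` here is a stand-in for one-step contrastive divergence with made-up sizes and data, not the actual training code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Greedy layer-by-layer sketch: train one RBM, treat its feature
# activations as data for the next RBM, and repeat.
def train_rbm(data, n_hidden, lr=0.1, epochs=50, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    W = rng.normal(0, 0.01, (data.shape[1], n_hidden))
    for _ in range(epochs):
        for v in data:
            p_h = sigmoid(v @ W)
            h = (rng.random(n_hidden) < p_h).astype(float)
            p_v = sigmoid(h @ W.T)                   # one-step reconstruction
            W += lr * (np.outer(v, p_h) - np.outer(p_v, sigmoid(p_v @ W)))
    return W

rng = np.random.default_rng(0)
data = (rng.random((20, 8)) < 0.5).astype(float)     # toy binary "images"
weights = []
layer_input = data
for n_hidden in (6, 4, 2):                           # three layers of features
    W = train_rbm(layer_input, n_hidden, rng=rng)
    weights.append(W)
    layer_input = sigmoid(layer_input @ W)           # activations become the next "data"

print([W.shape for W in weights])
```

Each pass peels one shell off the onion: the activations fed upward are a slightly simpler distribution than the data below them.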
After you’ve learned say three layers, you have a model that’s a bit surprising. This
is the last restricted Boltzmann Machine we learned. So here we have this sort of model
that says, “To generate from the model, go backwards and forwards.” But because we just
kept the p of v given h from the previous models, this is a directed model where you
sort of get chunk, chunk to generate. So the right way to generate from this combined model
when you’ve learned three layers of features, is to take the top two layers and go backwards
and forwards for a long time. Fortunately, you don't actually need to generate from it;
I'm just telling you how you would if you did. We want this for perception so really,
you just need to do perceptual inference, which is chunk, chunk, chunk; it's very fast. But
to generate, you'd have to go backwards and forwards for a long time and then once you've
decided on a pattern here, you go–just go chunk, chunk, that’s very directed and easy.
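That generation procedure, alternating Gibbs sampling at the top followed by a single directed down pass, might look like this in outline; the layer sizes and weight matrices are random placeholders rather than a trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch of generating from the stacked model: run alternating Gibbs
# sampling between the top two layers for a while, then make one quick
# deterministic top-down pass through the directed layers below.
rng = np.random.default_rng(0)
W_top = rng.normal(0, 0.1, (4, 6))    # undirected top-level RBM weights
W_down = rng.normal(0, 0.1, (4, 8))   # directed generative weights below

h = (rng.random(6) < 0.5).astype(float)
for _ in range(100):                  # backwards and forwards at the top
    v_top = (rng.random(4) < sigmoid(h @ W_top.T)).astype(float)
    h = (rng.random(6) < sigmoid(v_top @ W_top)).astype(float)

pixels = sigmoid(v_top @ W_down)      # the single down pass: "chunk"
print(pixels.shape)
```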
So I’m now going to learn a particular model of some handwritten digits but all the digit
classes now. So we’re going to put slightly bigger images of handwritten digits from a
very standard data set where we know how well other methods do. In fact it’s a data set
on which support vector machines beat backpropagation, which was bad news for backpropagation, but we're
going to reverse that in a minute. We’re going to learn 500 features now instead of 50. Once
we’ve learned those, we’re going to take the data, map it through these weights which are
just these weights in the opposite direction, and get some feature vectors. We’re going
to treat those as data and learn this guy, then we’re going to take these feature vectors,
we're going to tack on ten label units. So now we need the labels but I'll get rid
of that later. And so we’ve got a 510 dimensional vector here and we’re going to learn a joint
density model of the labels and the features. We’re not trying to get from the features
to the labels, we’re trying to say why do these two things go together? So we’re learning
a joint model of both, not a discriminative model. When we’ve completed this learning,
what we’re going to end up with is, the top level here is a Boltzmann Machine and so it
has an energy function, and you can think of that as a landscape. When the weights are
all small here or close to zero, then the energy landscape is very flat. All the different
configurations here are more or less equally good. As it learns, it’s going to carve ravines
in this energy landscape. If you think of it as a 510 dimensional energy landscape,
these ravines are going to have the property that in the floor of the ravine, there’s about
ten degrees of freedom and those are the ways in which a digit can [INDISTINCT] and still
be a good instance of that digit, like a two with a bigger loop or a longer tail. Up the
sides of the ravine, there’s like 490 directions and those are the ways in which, if you vary
the image, it wouldn’t be such a good two anymore. But the nice thing is, it’s going
to learn long narrow ravines so that one two can be very different from another two and
yet connected by this ravine–the ravine's captured the manifold–so it could wander from one
to another in a way that it won’t wander from a two to a three even though the three might
be more similar in pixels to the two. Okay. I want to show you this generative model actually
generating. Before I do that, I want to own up, we did a little bit of fine tuning which
actually took longer than the original learning, where you–after you’ve done that greedy layer
by layer learning, you do a bit of fine tuning where you put in images, you do a forward
pass, bottom up with binary states and when you do this forward pass, you adjust the connections
slightly so that what you get in one layer would be better at reconstructing the activity
in the layer below. Then you do a few iterations at the top level Boltzmann Machine,
you go backwards and forwards a few times to get the learning signal there. And then
you do a down pass. And during the down pass, you adjust the connections going upwards so
they’re better at reconstructing what caused the activity in that layer. So during the
down pass, you know what caused activity because you caused it and you’re trying to recover
those causes. That fine tuning helps but it will work without it. So now I’m going to
attempt to show you a movie. That's not very nice. Okay, there's that network. Here's where
we’re going to put images. Here’s 500 features, 500 features, 2,000 features and the ten labels.
First of all, we’re going to do some perception. So I’m going to give it an image and tell
it to run forwards. Oops, sorry? I didn’t mean that. I meant that. And you’ll see, these
are stochastic, they keep changing, but it’s very sure that it’s a four. See, those are
the identities of these neurons. It knows that’s a four and it has no doubt about it,
even though its feature detectors are fluctuating a bit. If I give it a five, hopefully it’ll
think it’s a five. Yeah, it doesn’t have any doubt. So now let’s be mean to it because
that’s a lot more fun. I’m going to give it that. So, it says, so, four, six, eight, four,
eight, eight, eight, eight, eight, eight, four. It can’t make up its mind whether it’s
a four or an eight, and that’s pretty reasonable in those circumstances. It will actually,
for that one, say eight a bit more often than anything else. So, we’ve classed it as getting
that right but it’s very unsure whether it’s an eight or a four. And just occasionally,
it thinks it can be other things like a two, but it basically thinks four or eight. I can
make it run faster so you can–okay. It’s basically four or eight, an occasional six.
I could give it something like this and it thinks basically one or seven and occasionally
a four. Because I programmed this myself, I want to point out that it’s very reasonable
for this–this is my baby, and it's very reasonable for it to think that might be a four, because,
look, you could see the four in there, okay. Okay. Now, that was just doing perception
but the very same model does generation. So, what I can do is I can fix the top level unit
and all I’ve done is I’ve fixed the state of one neuron. There’s a million connections
there because that's 2,000 by 500. I just fixed this one neuron but when I fix that state, then
the weights, the 2,000 weights coming out of there to these neurons here, what they’ll
do is they’ll lower the energy of the ravine for twos and they’ll raise the energy of the
ravine for all of the other guys. So, now we’ve got this landscape in which you got
all these ravines but the two ravine has been lowered. And if you put it at a random point,
it will eventually stumble into the two ravine and then it will stay there and wander around.
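The clamping he describes, fixing a label unit and letting Gibbs sampling settle the free units, can be sketched like this; the layer sizes and random weights are placeholders, not the trained network in the demo:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch of clamping: the visible layer of the top-level RBM holds feature
# units plus ten label units; we fix one label unit on, leave the others
# off, and only let Gibbs sampling update the feature units and the hiddens.
rng = np.random.default_rng(0)
n_feat, n_labels, n_hidden = 6, 10, 8
W = rng.normal(0, 0.1, (n_feat + n_labels, n_hidden))

v = np.zeros(n_feat + n_labels)
v[n_feat + 2] = 1.0                    # clamp the label unit for class "2"
for _ in range(50):                    # alternating Gibbs sampling
    h = (rng.random(n_hidden) < sigmoid(v @ W)).astype(float)
    v_free = (rng.random(n_feat) < sigmoid(h @ W.T)[:n_feat]).astype(float)
    v[:n_feat] = v_free                # only the feature units get updated;
    # the ten label units stay clamped throughout

print(v[n_feat:])                      # still exactly one label unit on
```

Clamping that one unit is what lowers the energy of one ravine relative to the others, so the free units wander there.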
So, let’s see if we can do that. So, what’s really going on here is I’m just going backwards
and forwards up here. Ignore that for now. I’m going backwards and forwards here and
letting it gradually settle until it’s into a state that this network’s happy with. So,
that’s his brain state and that doesn’t mean much to you. If you look at that, you don’t
really know what it means. So, what we’re going to do is, as it’s settling, we’re going
to play out the generative model here. We’re going to do computer graphics to see what
that would have generated. And so, what you got here is that’s what’s going on in its
brain and this is what’s going on in its mind. So, you can see what this is thinking and
I’m serious about that. That is–I know it sounds crazy, when I say to you I’m seeing
a pink elephant, what I mean is, I’ve got a brain state such that, if there were a pink
elephant out there, this would be perception. That's how mental states work. They're funny
because they’re hypothetical, not because they’re made of spooky stuff. So, I use this
language where the terms refer to things in the world because I was saying, “What would
have to be in the world for this brain state to be perception?” Now, if I got a generative
model, I can take the–take the brain state and say, “Well, what would have to be in the
world for that to be perception?” Well, that. So, that’s what it’s thinking, that’s its
mental state right there. So, you got brain states and mental states and most psychologists
won’t show you both. Let’s go a bit faster. And it still hasn’t settled into the two ravine.
And now it’s about in the two ravine. And now it’s just wandering around in that two
ravine and this is what it’s thinking. It knows about all sorts of different twos and
it’s very good that it does because that means it can recognize weird twos. Let’s give it
another one. It hasn’t got into the eight ravine properly yet. It will jump [INDISTINCT]
the ravines, he’s not really there. But by now, it will be in the eight ravine and it
will show you all the sorts of different eights it believes in, if you run it long enough.
If you run it for an hour now, it would probably just stay in the eight ravine showing you
all sorts of different eights, okay. Let’s do one more because I liked it so much. Again,
it’s not really in the five ravine properly yet. No, that was a six. By now it’s in the
five ravine and it will show you all sorts of weird fives, ones without tops, some occasional
sixes. And it ends up with a pretty weird one but that’s definitely a five and it’s
very good that it knows that that's definitely a five because it needs to recognize things
like that. Okay. That’s it for the demo. I have to get rid of that. Okay. So, here’s
some examples of things it can recognize. These are all the ones it got right and you
can see it recognizes a wide variety of twos. It recognizes that this is a one despite
that and it recognizes that this is a seven because of that. If you try writing a program
by hand that will do that, you'll find it's kind of tricky if you'd never thought of these
examples in advance. If you compare it with support vector machines, now what we’re doing
here is we’re taking a pure machine learning task. We’re not giving it any prior knowledge
about pixels being next to other pixels. We’re not giving it extra transformations of the
data. So, this is without–it’s a pure machine learning task without any extra help. If you
get extra help, you could make all the methods a lot better. But a support vector machine
done by DeCoste and Scholkopf was very good: it got 1.4%. The best you can do with standard
backpropagation is about 1.6%. This gets 1.25%, and significance here is about a difference
of 0.1%. So, this is significantly better than that. [INDISTINCT] maybe gets 3.3%. Now, I
fine-tuned that to be good at generation so I could show you it generating using this
sort of up down algorithm but we can also use backpropagation for fine-tuning. And now
that I’ve got this way of finding features from the sensory data, I can say things like
nobody in their right mind would ever suggest that you would use a local search technique
like backpropagation to search some huge non-linear space by starting with small random weights.
It will get stuck in local [INDISTINCT]. And that is indeed true. What we’re going to do
is we’re going to search this huge non-linear space of possible features by finding features
in the sensory data and then finding features in the combinations of features we find in
the sensory data and keep doing that. And we’ll design our features like that. So, we
didn’t need labels, we just need sensory data. Once we designed all our features we can then
use backpropagation to slightly fine-tune them to make the category boundaries be in
the right place. So, a pure version of that would be to say let’s learn the same net but
without any labels. Okay? So, we do all the pre-training like this. After we pre-trained
now, what we’re going to do is we’re going to attach ten label units to the top and we’re
going to use backpropagation to fine-tune these and the fine-tuning is hardly going
to change the weights at all but is going to make the discrimination performance a lot
better. So, this is going to be discriminative fine-tuning and [INDISTINCT] 1.15% errors
and all the code for doing the pre-training and the fine-tuning is on my webpage, if you
want to try it. Now, given that we now know how to get features from data, we can now
train things we never used to be able to train with backpropagation. If you take a net like
this where we’re going to put in the digit, and we’re going to try and get out the same
digit but we’re going to put like eight layers of non-linearities in between, if you start
with small random weights and you backpropagate, you get small, small times small, and by the
time you get back here, you get small to the power eight and you don’t get any gradient.
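The small-to-the-power-eight effect is easy to see numerically. Assuming an illustrative weight scale of 0.1 and the logistic function's maximum slope of 0.25, the backward signal shrinks by that factor at every layer:

```python
# Why small random weights stall backpropagation through many layers:
# the backward signal is multiplied by a small weight and by a sigmoid
# derivative of at most 0.25 at each layer, so after eight layers the
# gradient has shrunk roughly as small**8. Numbers are illustrative.
w, sigmoid_slope = 0.1, 0.25
grad = 1.0
for layer in range(8):
    grad *= w * sigmoid_slope
print(grad)   # ≈ 1.5e-13: effectively no gradient reaches the early layers
```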
If you put in big random weights, you'll get a gradient but you'll have decided in
advance where you’re going to be in the search space. What we’re going to do is learn this
Boltzmann Machine here. After we’ve learned that, we’re going to map the data to get activity
patterns and then learn this Boltzmann Machine. Then we're going to learn this Boltzmann Machine
but with linear hidden units. And then what we’re going to do is put the transposed weights
here because this is good at reconstructing that. So, this should be good and so on. And
we’re going to use that as a starting point and then we do backpropagation from there
and it will slightly change all of these weights and it will make this work really well. And
so now what it's done is it's communicated this 28 by 28 image via this bottleneck of
30 units but using a highly non-linear transformation to compress it. If you make everything linear
here, you leave out all these layers and make everything linear, this is PCA, Principal
Components, which is a standard way to compress things. If you put in all these non-linear
layers, it’s much better than PCA. So, this is all done without labels, now. You just
give it the digits, you don’t tell it which is which. These are examples of the real digits,
just one example of each class. These are the reconstructions from those 30 activities
in the hidden layer and you can see they’re actually better than the data. This is a dangerous
line of thought. PCA does this and you can see it’s kind of hopeless compared to this
method. At least that’s what you’re meant to see. Now, we can apply this to document
vectors. I don’t find documents as interesting as digits but I know some people are interested
in them. You could take a document vector and you could take the counts of the 2000
most common words and there’s a big database like this of 800,000 documents. And so we
took 400,000–sorry. Yeah, I know. I see people smiling. [INDISTINCT] 100,000, I’m an academic,
okay. We then train up a neural net like this, where these are now [INDISTINCT] units. For
those of you who know machine learning, we can use any units in the exponential family,
where the log probability is linear in the parameters. So, we train up this to get some
features, we train up this to get some features, and then we train up this until you get just
two linear features. That seems a little excessive and obviously when we reconstruct, we’re not
going to get quite the right counts. But you’ll get counts that are much closer
to the right counts than the base rates. So, when we’re done here, if you have a high count for
Iraq and Cheney and torture, up here, you’ll get high counts for similar things. So, we
can turn a document into a point in the two dimensional space. And of course once we got
a point in two dimensional space, we can plot it in 2D. And for this database, someone had
gone through by hand, more or less by hand, and labeled all the documents. We didn’t use
the labels, okay. But now when we plot the point in 2D, we can color the point by the
class of the document. So, if you do the standard technique which is Latent Semantic Analysis
which is just a version of PCA, and you layout these documents in 2D, that’s what you get.
And you can see the green ones are in a slightly different place from these blue ones but it’s
a bit of a mess. If you use our method, it does a little bit better. You get that. And
so now, if you look at these documents–these are business documents, right? If you look
at these documents here, you can see there’s lots of different kinds of documents about
accounts and earnings. Presumably, there’s an Enron cluster in here somewhere and it
would be very nice to know which are the companies that are in this Enron cluster. Okay. But
there’s something more interesting you can do. That’s just for visualization. But now
I’m going to show you how to solve the following problem. Suppose I’d give you a document.
So, this isn’t like what I call Google Search where you use a few key words and you find
what you want. This is–I give you a document and I ask you to find similar documents to
the one I gave you. Okay? Documents with similar semantic content. So, I’m using a document
as a query. What we’re going to do is we’re going to take our big database of documents,
a whole million of them, and we’re going to train up this network and it’s going to convert
these documents into 30 numbers. I’m going to use logistic units here, that is numbers
that range between 1 and 0 and we’re going to train it as Boltzmann Machines. Then we’re
going to back propagate and we’ll get intermediate values here that convey lots of information.
And then we’re going to start adding noise here and we’re going to add lots and lots
of noise. Now, if I add lots and lots of noise to something that has an output between 0
and 1, there’s only one way it can transmit a lot of information. It’s got to make the
total input that comes from below be either very big and positive, in which case it’ll
give one, or very big and negative, in which case it’ll give a zero. And in both those
cases, it will resist the noise. If it uses any intermediate value, the outcome will be
determined by the noise. So, it won’t transmit information, so it won’t be very good at getting
the right answers.>>So the noise is something like Gaussian,
it’s not binary flipping.>>HINTON: It’s Gaussian noise. And we gradually
increase the standard deviation and it’s noise in the input to the unit. And we gradually
increase this, and we use a funny kind of noise that I don’t want to get into, that
makes it easier to use conjugate gradient descent. And what will happen is, these will
turn into binary units. So, we now have a way of converting the word count vector of a
document into a 30-bit binary vector. And now we can do what I call supermarket search.
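The noise trick can be sketched in a few lines. The snippet below uses plain Gaussian noise (not the "funny kind of noise" mentioned later in the talk) and simply measures how much a logistic unit's output wobbles for different total inputs; all the numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_spread(total_input, noise_sd, n_samples=10000):
    """Standard deviation of a logistic unit's output when Gaussian noise is
    added to its total input. Large-magnitude inputs pin the output near 0 or 1,
    so the unit resists the noise; inputs near zero get randomized."""
    noisy = total_input + noise_sd * rng.standard_normal(n_samples)
    return sigmoid(noisy).std()

for total_input in (0.0, 2.0, 10.0):
    print(total_input, output_spread(total_input, noise_sd=4.0))
```

As the noise grows, only strongly positive or strongly negative total inputs can still transmit a bit reliably, which is what drives the code units toward binary values.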
So, suppose you want to find things that are like a can of sardines. What you do is you
go to your local supermarket and you say to the cashier, “Where do you keep the sardines?”
And you go to where the sardines are and then you just look around and there’s all the things
similar to sardines because the supermarket arrange things sensibly. Now, it doesn’t quite
work because you don’t find the anchovies, as I discovered when I came to North America,
I couldn’t find the anchovies. They weren’t anywhere near the sardines and the tuna. That’s
because they’re near the pizza toppings. But that’s just because it’s a three dimensional
supermarket. If there was a 30 dimensional supermarket, they could be close to the pizza
toppings and close to the sardines. So, what we’re going to do is we’re going to take a
document and using our learned network, we’re going to hash it to this 30-bit code. But
this is a hash code that was learned. It’s not some random little thing. It was learned
with lots of machine learning. So, it has the property that similar documents mapped
to similar codes. So, now we can use hashing for doing approximate matches. Everybody knows
hashing is nice and fast and that you usually can’t do approximate matches. But with machine
learning, you can have both. So, we take our document, we hash it to a code and in this
memory space, at each point in the memory space, we put a pointer to the document that
has that code and your [INDISTINCT] so if two documents have the same code, you can
figure out what to do. So now, with the query document, we just go there and now we just
look around like in the supermarket. And the nearby similar documents will have nearby
codes. And so, all you need to do to find a similar document is flip a bit and do a
memory access. Okay. That’s two machine instructions. So, if you were to have a database, let’s
say 10 Billion documents, and I give you one and say, “Give me a 100,000 documents similar
to this one,” because the other search technique I’m going to use can only cope with a
100,000. You’re going to have to, 100,000 times, flip a bit
and do a memory access. So, that’s only 200,000 machine instructions. I need two machine instructions
per document. It’s completely independent of the size of your database. Okay. Because
you’ve laid things out like in a supermarket, you’ve got a document supermarket now [INDISTINCT]
so, if you compare it with–well, we’ve actually only tried it because we’re academic, on 20-bit
codes and a million documents and it works just fine, but nothing could possibly go wrong
when you scale it up. It’s actually quite accurate. That is, if you compare it with
a sort of gold standard method, it’s about the same accuracy and when you now take your
shortlist that you find in this very fast way and you give those guys in the shortlist
to the gold standard method, it works better than the gold standard method alone. It’s
much better than locality sensitive hashing. In terms of speed, we used the code
that’s on the web for that and ours is about 50 times faster. And in terms of accuracy,
locality sensitive hashing will always be less good than this because it’s just a hack
for doing this. And locality sensitive hashing works on the count vector. If you work on
the count vector, you will never understand the similarity between the document that says,
“Gonzales quits,” and the document that says “Wolfowitz resigns.” They’re very similar
but not in the word count vector. But if you’ve compressed it down to some semantic features,
they’re very similar documents. So, the summary is that I showed you how to use this simple
little Boltzmann Machine with the bipartite connections to learn a layer of features.
Then I showed you that if you take those features, you can learn more features. And as you go
up this hierarchy, you get more and more complicated features that are going to be better and better
for doing classification. This produces good generative models. So they’re good at reconstructing
data, or producing data like the data you saw, if you fine-tune with this [INDISTINCT]
algorithm which has this funny name. If you want good discriminative models, what you
do is then fine-tune with backpropagation. But the good news is you don’t need labels
for all of your training data. You can learn all these features on very big data sets
then with just a few million labels or even a few hundred labels, you can backpropagate
to fine-tune it for discrimination. And that will work much better than for example using
any machine learning method that just uses the labeled data. It’s a huge win. You can use
the unlabeled data very effectively. And I’ve shown you that it can also be used for explicit
dimensionality reduction where you get a low-dimensional bottleneck and that you can do search for
similar things very fast. And of course we’d like to apply it to images, but for images
you have a problem, which is: in documents, a word is very indicative of what the document
is about. In an image, what’s indicative of what the image is about is a recognized object
and so what we are trying to do now is make it recognize objects so that [INDISTINCT]
then we can get the objects in the image and then apply the semantic hashing technique.
But we haven’t done that yet. I see I’ve managed to talk very fast so I can show you a little
bit about how we’re going to do the image recognition. Suppose you want a generative
model, like a graphics model, which would allow you to take a type of object and produce an
image of that object. So, I say square and I say what its pose is, its position and orientation.
Then we might have a top-down model that, from this and this, predicts where the parts might
be. And if it’s a kind of sloppy model, it’ll say this [INDISTINCT] to be round about there,
and this [INDISTINCT] to be round about there. And if we pick randomly from these distributions,
we’ll get a square where the edges don’t meet up. Now, one way we can solve that is to generate
very accurately here. We could say, I’m going to generate each piece just right. But that
requires high bandwidth and lots of work. We’re going to generate sloppily. We’re going
to generate a redundant set of pieces and then we’re going to know how the pieces fit
together. We’re going to know a corner must be co-linear with an edge and the edges here
must be co-linear with corners. And now, by lateral interactions here, using something
called a Markov Random Field, we can get it to settle into that. And so now, [INDISTINCT]
process is at each level, the level above says where the major pieces should be, roughly,
and a level that knows about how these pieces go together, like how eyes and noses and mouths
go together, says, “Okay, the nose should be exactly above the middle of the mouth and
the eyes should be at exactly the same height.” The level above doesn’t need to specify that,
that’s known locally. So, how are we going to learn that? Well, we’re going to introduce
lateral interactions between the visible units. That’s fine. The real crucial thing in these
nets is you don’t have lateral interactions between the hidden units. So, we can learn
that and the way we learn that is we put an image in here, we activate the features then
with the features fixed providing constant top-down input, we run this lateral interactions
to let this network settle down and we replace the binary variables by real value variables.
So, we’re doing something called mean-field. We let this settle down with something it
is happier with, a reconstruction. It doesn’t need to get all the way to equilibrium, it
just needs to get a bit better than this. And then, we apply a normal learning algorithm
to these correlations and these correlations, like this. But we can also learn the lateral
interactions by saying, “Take the correlations in the data minus the correlations in the
reconstructions,” and that’ll learn all these lateral interactions. So now what we’re going
to do is, we’re going to learn a network with 400 input units for a 20 by 20 patch of an image.
This is just preliminary work. When we learn the first network, these aren’t connected.
Then when we use these feature activities to learn the second level Boltzmann Machine,
we connect these together and we learn these and these. Then when we learn the top Boltzmann
Machine, we connect these together and we learn these weights and these weights. When
we’re finished, we can generate from the model. And so as a control, what we’re going to do
is, we’re going to learn this model on patches of natural images, which are notoriously
difficult things to model because anything could happen in a patch of a natural image.
So, it’s a very hard thing to build a density model of. We’re going to learn it without
lateral connections and we get a model that’s very like many other models. When you generate
from it, what you get is clouds. So, here’s natural image patches and they have the property
that there’s not much going on and then there’s a sudden [INDISTINCT] of structure like here.
So, if you apply a linear filter to these things, the linear filter will usually produce
a zero and occasionally produce a huge output. If you apply a linear filter to these things,
it will produce some kind of Gaussian distribution. These have exactly the same sort of power
spectrum as these. What they don’t have is this sort of heavy-tailed distribution where
there’s not much happening and then a lot happening, and long range structure. So, now
what happens if we put in the lateral interactions and do the learning again? If you put the
lateral interactions in, they can say things like if you have a piece [INDISTINCT] and
you’d like a piece of that somewhere around here, put it here where it lines up. So, that
will make much longer range interactions. And so now when we generate from the model
with lateral interactions, we get that and you can see that these are much more like
real image patches. They pass many of the statistical tests for being real image patches.
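One statistical test of this kind is the excess kurtosis of linear filter outputs: heavy-tailed for signals with occasional bursts of structure, near zero for Gaussian "clouds". The sketch below uses synthetic 1-D stand-ins for the two kinds of signal, since the actual patches aren't reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2)

def excess_kurtosis(x):
    """Excess kurtosis: 0 for a Gaussian, large for heavy-tailed signals."""
    x = x - x.mean()
    return (x**4).mean() / (x**2).mean()**2 - 3.0

# A stand-in for natural patches: mostly flat, with occasional sharp structure.
sparse_signal = rng.standard_normal(100000) * (rng.random(100000) < 0.05)
# A stand-in for "cloud" samples: Gaussian everywhere, with matched variance.
gaussian_signal = rng.standard_normal(100000) * np.sqrt(0.05)

edge_filter = np.array([-1.0, 1.0])  # a crude derivative filter
print(excess_kurtosis(np.convolve(sparse_signal, edge_filter, mode="valid")))
print(excess_kurtosis(np.convolve(gaussian_signal, edge_filter, mode="valid")))
```

The filter output on the bursty signal is mostly near zero with occasional huge values, giving a large excess kurtosis; on the Gaussian signal it stays close to zero.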
They’ve got this kind of much longer range structure. They’ve got sort of co-linear things
and things at right angles and all sorts of nice structure in them, which we didn’t have
before. And so we’re getting–this is probably the best model there is of natural image patches.
If you ask anybody else who models them, “Show me samples from your generative model.” They
say, “Oh, well, we tried that and it looked terrible. So we never published those.” This
is, I think, the first model that generates nice samples. [INDISTINCT]
has models that are maybe comparable. What we’d like to do now is make more layers and we’d
also like to have attention. So, as you go up, you focus on parts of the image. And what
I want to do is get something–you’re given an image, you go up, it’s focusing on parts
and it gives you a figure at the top. It gives you what you see, which is you look at an
image and you see a face. And then you look again, you see the eye. Then you look again,
you see a group of four people. And those are the things that come out and those are
going to be like the words that need to go into an image retrieval system. You’re going
to have–this is going to run for a long time learning and then it’s going to run for quite
a long time on each image, but that’s all [INDISTINCT] okay. I’m done.
>>So, it looks like we’ve got time for questions. If you have questions can you–if you have
questions, can you please hit the mic in the middle so that the folks at their offices
can hear.>>Okay. Hi. So, you were saying that this
method doesn’t require labels. I was just wondering if it would actually help if you
have labels for at least some of your training data?
>>HINTON: Oh, yes. Labels help. The main thing is to show that you can do a lot without
them and therefore you can have much more leverage from a few labels. Yeah.
>>Okay. Thanks.>>HINTON: So, for example in the semantic
hashing idea, you could, as you’re learning those 30 dimensional codes, you could say
if two things are from the same class and the codes are far apart, introduce a small
force pulling them together. And we’ve got a paper on that in [INDISTINCT] last year.
And that will improve the sort of clustering of things of the same class. But the point
is you can do it without knowing the classes as well.
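The idea just described, a small force pulling together codes of the same class when they drift apart, might look like the following sketch; the pairwise loop and the `strength` and `margin` values are invented for illustration and aren't taken from the paper mentioned:

```python
import numpy as np

def pull_together(codes, labels, strength=0.1, margin=2.0):
    """One sweep of a gentle supervised force: for pairs with the same label
    whose codes are further apart than `margin`, nudge both codes toward each
    other. `strength` and `margin` are made-up illustration values."""
    codes = codes.copy()
    n = len(codes)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                diff = codes[j] - codes[i]
                if np.linalg.norm(diff) > margin:
                    codes[i] += strength * diff
                    codes[j] -= strength * diff
    return codes

codes = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
labels = [0, 0, 1]
new_codes = pull_together(codes, labels)
# The two class-0 codes move toward each other; the class-1 code is untouched.
```

A force like this would be added on top of the unsupervised objective, so the codes still mostly reflect the data rather than the labels.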
>>Hi. Now, so, people have built auto encoders for a long time before and they used
regular sigmoid units and used backprop to train them.
>>HINTON: But they never work very well.>>Correct. Would–if we actually have multiple
layers of these, over these sigmoid units and use–and train them in the same fashion
as you’re doing, one layer at a time, would it work as well as RBMs or not?
>>HINTON: Okay. That’s a very good question. So, it’s a bit confusing. This deep thing
with multiple layers trained with RBMs is called a multi-layer auto encoder. But you could
also have a very small auto encoder with one hidden layer that’s non-linear and train that
up. And the RBM is just like that. So, you could train these little auto encoders and
stack them together and then train the whole thing with backprop. That’s what the question
was. And that will work much better than the old way of training auto encoders, but not quite
as well as this. So, Yoshua Bengio has a paper where he compared doing auto encoders with
doing restricted Boltzmann Machines, and the restricted Boltzmann Machines worked better,
especially for things like [INDISTINCT] backgrounds.>>I’ve got a–I’ve got a question which–if
I could ask…>>HINTON: Okay.
>>…because I’m holding a microphone. So, this morning we were talking about the–about
news with–where the problem with news is that everything changes from day to day. Do
you have any intuition–this is one of those unfair, “What do you think would happen,”
do you have any intuition on how hard it would be to adapt a deep network like this once
your input distribution changes or as it continues to change?
>>HINTON: Okay. So one good thing about this learning is everything scales linearly with
[INDISTINCT] training data. There’s no quadratic optimization anywhere that’s going to screw
you for big databases. The other thing is, because it’s basically stochastic online learning,
if your distribution changes slightly, you can track that very easily. You don’t have
to start again. So, if it’s the case that the news tomorrow has quite a lot in common
with the news over the last few months and few years, and you just need to change your
model a bit rather than start again, then this is going to be good for tracking
and it’s not going to be as much work as learning it all in the first place. And in fact, once
you got all of these layers of features, basically changing the interactions in high level features
will get you lots of mileage without much work.
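The tracking point can be illustrated with the simplest possible online learner: a single parameter chasing a slowly drifting target with small stochastic updates and no restarts. All the numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

# Online stochastic updates track a slowly drifting distribution without
# retraining from scratch: here, one parameter chasing a drifting mean.
theta = 0.0
lr = 0.05
errors_late = []
for t in range(4000):
    target = 0.001 * t                  # the "news" drifts slowly over time
    x = target + rng.standard_normal()  # one fresh noisy observation
    theta += lr * (x - theta)           # tiny gradient step, never a restart
    if t > 2000:
        errors_late.append(abs(theta - target))

print(np.mean(errors_late))  # stays small even though the target keeps moving
```

The estimate lags the drift by roughly the drift rate divided by the learning rate, so slow drift costs almost nothing compared to relearning from scratch.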
>>Sir I have another question about the a–so, about the supermarket search. You were saying
you just flip a bit in your hash code. So, what I’m wondering is, you know, one thing
that I’m not sure about is like if you flip one of these bits you might not necessarily
get something there?>>HINTON: That’s fine.
>>I mean, how do you know that you’re going to find something there? And then also, maybe,
is there some way of finding better bits to flip and like how do you decide which ones?
>>HINTON: So, of course. If you make the number of addresses be about the same as the
number of documents, the average answer is one.
>>Right.>>HINTON: Okay. And you’ve–if there’s nothing
there, you can flip more bits.>>Sure.
>>HINTON: So, yes. You’ll get some misses but that’s just sort of a constant.
>>Right.>>HINTON: We can look at, actually, how evenly
spread over addresses it is and typically, most of the addresses won’t be used and a
typical address would be used like three or four times. So, it’s not as uniform as we’d
like but that could all be improved. And we’ve only done this once. We’ve just trained this
network once on one data set and that’s all the research we’ve done so far, really. If
we could get a tiny bit of money from someone, we could make this whole thing work much better.
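The bit-flip lookup being discussed fits in a few lines. The sketch below uses 6-bit codes and four documents for readability; the learned network that produces the codes is assumed to exist upstream:

```python
def build_table(codes):
    """Map each binary code (stored as an int) to the list of document ids
    stored there. Documents that share a code simply share the slot."""
    table = {}
    for doc_id, code in enumerate(codes):
        table.setdefault(code, []).append(doc_id)
    return table

def supermarket_search(table, query_code, n_bits=30):
    """Look at the query's own address, then every address one bit-flip away:
    one XOR and one dictionary probe per neighbour, independent of corpus size."""
    hits = list(table.get(query_code, []))
    for bit in range(n_bits):
        hits.extend(table.get(query_code ^ (1 << bit), []))
    return hits

codes = [0b000000, 0b000001, 0b100000, 0b011111]
table = build_table(codes)
print(supermarket_search(table, 0b000000, n_bits=6))  # → [0, 1, 2]
```

Flipping more bits just widens the Hamming ball, which is the fallback mentioned for when an address is empty.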
>>So, one thing that is special about digits is that they evolved in a way that makes
them discriminative.>>HINTON: Yes.
>>So, you would hope–it’s not that surprising that an unsupervised method can extract
features that are discriminative. I was wondering what happens with [INDISTINCT] the other applications
where–so clearly, when you do unsupervised, you might throw away some very indicative
features right there.>>HINTON: Yes. So, basically, there’s two
kinds of learning, there’s discriminative learning where you take your input and your
whole aim in life is to predict the label. And then there’s generative learning where
you take your input and your whole aim in life is to understand what’s going on in this
input. You want to build a model that explains why you got these inputs and not other inputs.
Now, if you do that generative approach, you need a big computer and you’re going to explain
all sort of stuff that’s completely irrelevant to the task you’re interested in. So, you’re
going to waste lots of computation. On the other hand, you’re not going to need as much
training data because each image is going to contain lots of stuff and you can start
building your features without yet using information in the labels. So, if you’ve got a very small
computer, what you should do is discriminative learning so you don’t waste any effort. If
you got a big computer, do generative learning, you’ll waste lots of the cycles but you’ll
make better use of the limited [INDISTINCT] label data. That’s my claim.
>>Hi Geoff. I have a question. What happened to regularization? What kind of regularization
is implicit in all of your stages?>>HINTON: Okay. So, we’re using a little
bit of weight decay and the way we set the weight decay was just–we fiddled about it
for a bit to see what worked on the–on a validation set, the usual method. And if you
don’t use any weight decay, it works. If you use weight decay, it works a bit better. And
it’s not crucial how much you use. So, we are using some weight decays here but it’s
not a big deal. And like I say, all of the code is in [INDISTINCT] on my web page. There’s
a pointer on my web page. So, you can go and look at all those things and all the little
fudges we use.>>Right. But the Boltzmann Machine is fundamentally
sort of entropic regularization and then your little pieces of tuning with weight decay
are from the other family. So, you’re blending both [INDISTINCT]
>>HINTON: No. The Boltzmann Machine, it’s true. There’s a lot of regularization that comes
from the fact that the hidden units are binary stochastic. So, they can’t transmit
much information.>>Yes.
>>HINTON: That does lots of regularization for you, compared with the normal auto encoder.
But in addition, we say don’t make the weights too big. And one reason for that is not just
regularization, it’s–it makes the Markov chain mix faster if you don’t make the weights
too big.>>Thanks.
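For concreteness, a CD-1 weight update with the weight-decay term folded in might be sketched as below; the learning rate and decay constant are illustrative values (not the ones tuned on the validation set), and biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, visible, lr=0.1, weight_decay=0.0002):
    """One contrastive divergence (CD-1) update for an RBM, biases omitted."""
    # Up: hidden probabilities and a binary sample, driven by the data.
    h_prob = sigmoid(visible @ W)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Down and up again: reconstruct the visibles, then the hiddens.
    v_recon = sigmoid(h_sample @ W.T)
    h_recon = sigmoid(v_recon @ W)
    # Learning rule: data correlations minus reconstruction correlations,
    # plus the "don't make the weights too big" penalty.
    positive = visible.T @ h_prob
    negative = v_recon.T @ h_recon
    n = visible.shape[0]
    return W + lr * ((positive - negative) / n - weight_decay * W)

W = 0.01 * rng.standard_normal((6, 3))
data = (rng.random((20, 6)) < 0.5).astype(float)
W = cd1_step(W, data)
```

The decay term shrinks every weight slightly on each update, which also keeps the Markov chain mixing well, as noted above.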
>>Hi. So, in your example of digits, you actually tell them–tell the algorithm that
they are ten classes.>>HINTON: Yes.
>>So, I wonder, well, what is the impact if we do not give this number correct? So,
yeah.>>HINTON: Okay. So, what you can do is you
can take this auto encoder that goes down to 30 real numbers and not tell it how many
classes there are, just give it the images, get these 30 real numbers. Then you can take
those 30 real numbers and apply a dimensionality reduction technique that Sam Roweis and I
have developed, and the latest version of that, you can lay them out in 2D and you will
get 11 classes. And it did that without ever knowing any labels. You’ll get just these
clusters which is close to 10. It often thinks that the continental sevens are a separate
cluster.>>So you are saying this is [INDISTINCT]
you have tried and that’s what happened, or?>>HINTON: I might even have it in this talk
somewhere. I might not, though. It’s on my–it’s–oh, there you go. That’s pure unsupervised on
the digits. Now in this case, these are twos and these are twos. In 30D, it’s got the clusters.
When you force it down to 2D, it wants to keep the twos next to each other but it will
also wants these–these are the spiky twos and these are the sevens, and it wants those
close. And these are the loopy twos and these are the threes, and it wants those close.
But it also wants the threes close to the eights. And so in 2D, there just isn’t enough
space to make ten clusters. But look, it made 11 there and if I don’t cheat and do this
in black and white, you can still see there’s sort of roughly 11 clusters. So, this was
pure unsupervised and it found that structure in the data. So, when psychologists tell you,
you impose categories on this data, they aren’t really there in the world, it’s rubbish. I
mean, they’re really there.>>So the magic number is 30. Is it–if I
choose other number, it will be fine with it?
>>HINTON: If you choose a smaller number, you might not preserve enough information
to be able to keep the classes. And if you choose a bigger number, then PCA will do it
better. So your comparison with PCA won’t be as good.
>>Thank you.>>How does the performance of the digit classification
vary according to the number of layers you are using?
>>HINTON: Okay. Obviously, using the number of layers I showed you is one of the best
numbers to use. If you use fewer layers, it works a bit worse. If you use more layers,
it works about the same. I’ve now got a–I’ve got a very good Dutch student who has the
[INDISTINCT] he doesn’t believe a word I say, and we will know–he’s using like 40 cluster
machines and he’s going to get the answer to this. But so far, I’m right that using
fewer layers isn’t as good and he hasn’t got to more layers yet. With the same number
of layers, he’s actually made it work better, and we’ll see if he makes it work
better with more layers.>>Just [INDISTINCT] guess a related question.
So, it’s clear how to evaluate these models, say, if you have some labeled data and [INDISTINCT]
you can try to see if you predict it similarly. But if you try generative, these Boltzmann
Machines with like, especially [INDISTINCT] interactions in the same levels and so on,
if I gave you another set, can you say how good generatively it is and is it easy?
>>HINTON: Okay.>>How do you evaluate…
>>HINTON: Yeah.>>…that kind of part of it?
>>HINTON: So, the problem with these Boltzmann Machines is there’s a partition function, and
what you’d love to do is take your data set, hold out some examples, train your generative
model on the training set and then say what is the log probability of these held out examples?
>>Exactly.>>HINTON: And that would be the sort of gold
standard. And that’s very hard to do. You know the log probability up to a constant
but you don’t know the constant. So, people in my group and I are working very hard on
a method for interpolating between Boltzmann Machines that allows you to use a Boltzmann
Machine with zero weights which is a pretty dumb model and then gradually change the weights
towards the Boltzmann Machine that you eventually learned and you can get the ratio of the partition
functions of all these Boltzmann Machines so in the end, you can get the partition function.
You can get a pretty good estimate. This is called–it’s a version of annealed importance
sampling, something called bridging. And we think we’re going to be able to get pretty accurate estimates
of the partition function now by running for like, you know, a 100 hours.
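The interpolation idea can be demonstrated on a model small enough to enumerate, where the exact partition function is available as a check. Everything below is a toy: the real method draws Markov chain samples from each intermediate machine, while here we sample exactly from tiny enumerable distributions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)

# A tiny, fully enumerable "Boltzmann Machine": 6 binary units, random weights.
w = rng.standard_normal(6)
states = np.array(list(product([0, 1], repeat=6)), dtype=float)
scores = states @ w                    # higher score = lower energy

def partition(beta):
    return np.exp(beta * scores).sum()  # exact, only possible because it's tiny

# Interpolate from the zero-weight model (beta=0, Z known: 2^6) to the learned
# model (beta=1), estimating each ratio Z_{k+1}/Z_k from samples of model k.
betas = np.linspace(0.0, 1.0, 21)
log_ratio = 0.0
for b0, b1 in zip(betas[:-1], betas[1:]):
    p = np.exp(b0 * scores)
    p /= p.sum()
    idx = rng.choice(len(states), size=5000, p=p)
    log_ratio += np.log(np.mean(np.exp((b1 - b0) * scores[idx])))

estimate = np.log(2.0**6) + log_ratio
exact = np.log(partition(1.0))
print(estimate, exact)  # the bridged estimate should land close to the exact value
```

Each factor estimates the ratio of neighbouring partition functions, and the product bridges from the dumb model's known partition function to the learned model's unknown one.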
>>Yes. Yes.>>HINTON: You do this after you’ve learned
just to show how good you are. But the other thing you can do is you can generate from
the model and you can see that the stuff it generates looks good and you can then take
the stuff you generated from the model and you can apply statistical test to that and
statistical test to the real data and statistical test to the other guy’s data, the other guy’s
generative data. And if you choose the right statistical test, you can make the other guy’s
data look terrible.>>Okay. Okay. I think we’re out of time now.
I’d like to thank Geoff again and…