How to Do Mathematics Easily – Intro to Deep Learning #4


Hey, are you okay? Siraj, show them the math behind deep learning. Totally.

Hello world, it's Siraj, and let's learn about the math needed to do deep learning. Math is in everything, not just every field of engineering and science. It's between every note in a piece of music and hidden in the textures of a painting. Deep learning is no different. Math helps us define rules for our neural network so we can learn from our data. If you wanted to, you could use deep learning without ever knowing anything about math. There are a bunch of readily available APIs for tasks like computer vision and language translation. But if you want to use a library like TensorFlow to make a custom model to solve a problem, knowing what math terms mean when you see them pop up is helpful. And if you want to advance the field through research, don't even trip! You definitely need to know the math.

Deep learning mainly pulls from three branches of math: linear algebra, statistics and calculus. If you don't know any of these topics, I recommend a cheat sheet of the important concepts, and I've linked to one for each in the description. So let's go over the four-step process of building a deep learning pipeline and talk about how math is used at each step.

Once we've got a dataset that we want to use, we want to process it. We can clean the data of any empty values and remove features that are not necessary, but these steps don't require math. A step that does, though, is called normalization. This is an optional step that can help our model reach convergence faster, since all the values operate on the same scale. Convergence is the point when our prediction gives us the lowest error possible. This idea comes from statistics.

You have a 17.4 percent chance of making a straight.

There are several strategies to normalize data, although a popular one is called min-max scaling. If we have some given data, we can use the following equation to normalize it.
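The on-screen equation is not captured in this transcript. The standard min-max formula is x' = (x - min) / (max - min), and a minimal NumPy sketch of it could look like the block below. The function name `min_max_scale`, the sample values, and the handling of a constant column are assumptions made for illustration, not something shown in the video.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D sequence of numbers to the range [0, 1]."""
    x = np.asarray(values, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # All values are identical, so the formula would divide by zero.
        # Returning zeros is just one possible convention (an assumption here).
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

# Hypothetical feature column, made up for illustration.
heights_cm = [150.0, 160.0, 175.0, 200.0]
scaled = min_max_scale(heights_cm)
print(scaled)
```

In practice this would be applied to each feature column separately, so that every feature ends up on the same 0 to 1 scale.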
We take each value in the list and subtract the minimum value from it, then divide that result by the maximum value minus the minimum value. We then have a new list of data within the range of 0 to 1, and we do this for every feature we have so they're all on the same scale.

After normalizing our data, we have to ensure that it's in a format that our neural network will accept. This is where linear algebra comes in. There are four terms in linear algebra that show up consistently: scalars, vectors, matrices and tensors. A scalar is just a single number. A vector is a one-dimensional array of numbers. A matrix is a two-dimensional array of numbers. And a tensor is an n-dimensional array of numbers. So a matrix, scalar, vector and spectre, wait, not spectre, can all be represented as a tensor. We want to convert our data, whatever form it's in, be that images, words or videos, into tensors, where n is the number of features our data has and defines the dimensionality of our tensor.

Let's use a three-layer feed-forward neural network capable of predicting a binary output given an input as our base example to illustrate some more math concepts going forward.

When do we use math in deep learning?
When we normalize during processing.
Learn a model's parameters by searching.
And random weights be initializing.
Tensors flow… from input to out,
Then measure the error to measure the doubt.
It gives us what's real and what's expected.
Back propagate to get cost corrected.

We'll import our only dependency, NumPy, then initialize our input data and output data as matrices. Once our data is in the right format, we want to build our deep neural network. Deep nets have what are called hyperparameters. These are the high-level tuning knobs of the network that we define, and they help decide things like how fast our model runs, how many neurons there are per layer, and how many hidden layers there are. Basically, the more complex your neural network gets, the more hyperparameters you'll have. You can tune these manually, using knowledge you have about the problem you're solving to guess probable values and observe the result. Based on the result, you can tweak them accordingly and repeat that process iteratively.

But another strategy you could use is random search. You can identify ranges for each hyperparameter, then create a search algorithm that picks values from those ranges at random from a uniform distribution of possibilities, which means all possible values have the same probability of being chosen. This process repeats until it finds hyperparameters that perform well enough. Yay for statistics! We only have the number of epochs as our hyperparameter, since we have a very simple neural network.

We'll use probability to decide our weight values, too. One common method is randomly initializing each weight with a sample from a normal distribution with a low deviation, meaning values are pretty close together. We'll use it to create a weight matrix with a dimension of three by four: three rows to match the three features of our input, and four columns for the nodes in the next layer. So every node in the input layer is connected to every node in the next layer. The weight values will be in the range from -1 to 1. Since we have three layers, we'll initialize two weight matrices. The next set of weights has a dimension of four by one, where one is the size of our output.

As data propagates forward in a neural network, each layer applies its own respective operation to it, transforming it in some way, until it eventually outputs a prediction. This is all linear algebra. It's all tensor math. We'll initialize a for loop to train our network for 60,000 iterations. Then we'll want to initialize our layers. The first layer, our input, gets the input data. The next layer computes the dot product of the first layer and the first weight matrix. When we multiply two matrices together, like in the case of applying weight values to input data, we call that the dot product. Then it applies a non-linearity to the result, which we decided is going to be a sigmoid. It takes a real-valued number and squashes it into a range between 0 and 1. So that's the operation that occurs in layer 1, and the same occurs in the next layer. We'll take that value from layer 1 and propagate it forward to layer 2, computing the dot product of it and the next weight matrix, then squashing it into output probabilities with our non-linearity. Since we only have three layers, this output value is our prediction.

The way we improve this prediction, the way our network learns, is by optimizing our network over time. So how do we optimize it? Enter calculus. The first prediction our model makes will be inaccurate. To improve it, we first need to quantify exactly how wrong our prediction is. We'll do this by measuring the error, or cost. The error specifies how far off the predicted output is from the expected output. Once we have the error value, we want to minimize it, because the smaller the error, the better our prediction. Training a neural network means minimizing the error over time. We don't want to change our input data, but we can change our weights to help minimize this error. If we just brute-forced all the possible weights to see what gave us the most accurate prediction, it would take a very long time to compute. Instead, we want some sense of direction for how we can update our weights such that in the next round of training our output is more accurate. To get this direction, we'll want to calculate the gradient of our error with respect to our weight values. We can calculate this by using what's called the derivative in calculus. When we set deriv to true for our nonlin function, it'll calculate the derivative of a sigmoid. That means the slope of a sigmoid at a given point, which is the prediction values we give it from l2.

We want to minimize our error as much as possible, and we can intuitively think of this process as dropping a ball into a bowl, where the smallest error value is at the bottom of the bowl. Once we drop the ball in, we'll calculate the gradient at each of its positions. If the gradient is negative, we'll move the ball to the right. If it's positive, we'll move the ball to the left. And we're using the gradient to update our weights accordingly each time. We'll keep repeating the process until eventually the gradient is zero, which will give us the smallest error value. This process is called gradient descent, because we are descending our gradient to approach zero and using it to update our weight values iteratively.

I understand everything now. Still understand everything.

So to do this programmatically, we'll multiply the derivative we calculated for our prediction by the error. This gives us our error-weighted derivative, which we'll call l2_delta. This is a matrix of values, one for each predicted output, and gives us a direction. We'll later use this direction to update this layer's associated weight values. This process of calculating the error at a given layer and using it to help calculate the error-weighted gradient, so that we can update our weights in the right direction, will be done recursively for every layer, starting from the last back to the first. We are propagating our error backwards after we've computed our prediction by propagating forward. This is called back propagation. So we'll multiply the l2_delta values by the transpose of its associated weight matrix to get the previous layer's error, then use that error to do the same operation as before, to get direction values to update the associated layer's weights so error is minimized. Lastly, we'll update the weight matrices for each associated layer by adding to each the dot product of that layer's transposed input and its respective delta. When we run our code, we can see that the error values decrease over time, and our prediction eventually becomes very accurate.

So, to break it down: deep learning borrows from three branches of math, linear algebra, statistics and calculus. A neural net performs a series of operations on an input tensor to compute a prediction, and we can optimize a prediction by using gradient descent to back propagate our errors recursively, updating our weight values for every layer during training.

The coding challenge winner from the last video is Jovian Lin. Jovian tried out a bunch of different models to predict sentiment from a dataset of video game reviews.
Wizard of the week! And the runner-up is Vishal Batchu. He tested out several different recurrent nets and eloquently recorded his experiment in his ReadMe. The coding challenge for this video is to train a deep neural net to predict the magnitude of an earthquake and use a strategy to learn the optimal hyperparameters. Details are in the ReadMe. Post your GitHub link in the comments, and I’ll announce the winner next video. Please subscribe if you wanna see more videos like this. Check out this related video, and for now I got to get my math turned up to a million. So thanks for watching!
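For reference, the three-layer network walked through in this video can be sketched in NumPy roughly as follows. This is a minimal sketch under stated assumptions, not necessarily the exact code shown on screen. The names nonlin, deriv, l2 and l2_delta are mentioned in the video. The toy dataset, the random seed, the other variable names (syn0, syn1, l0, l1, errors) and the uniform weight draw in the range -1 to 1 are assumptions made for illustration.

```python
import numpy as np

def nonlin(x, deriv=False):
    """Sigmoid, or its slope when deriv=True (x must then already be a sigmoid output)."""
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Toy data (assumed): 4 samples with 3 features each, and 1 binary label per sample.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

np.random.seed(1)  # fixed seed (assumed) so repeated runs give the same result

# Two weight matrices for three layers, drawn uniformly from [-1, 1).
# A low-deviation normal draw would be another option; uniform is an assumption here.
syn0 = 2 * np.random.random((3, 4)) - 1  # input layer -> hidden layer
syn1 = 2 * np.random.random((4, 1)) - 1  # hidden layer -> output layer

errors = []
for epoch in range(60000):
    # Forward propagation: dot product, then sigmoid, at each layer.
    l0 = X
    l1 = nonlin(l0.dot(syn0))
    l2 = nonlin(l1.dot(syn1))

    # How far off is the prediction from the expected output?
    l2_error = y - l2
    if epoch % 10000 == 0:
        errors.append(np.mean(np.abs(l2_error)))

    # Back propagation: error-weighted derivative at each layer, last to first.
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)

    # Update each weight matrix with its layer's transposed input times its delta.
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

print(errors)
print(l2.round(3))
```

Printing `errors` shows the mean absolute error recorded every 10,000 iterations, which should shrink as training progresses.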