Building makemore Part 2: MLP

hi everyone today we are continuing our implementation of makemore now in the last lecture we implemented the bigram language model and we implemented it both using counts and also using a super simple neural network that had a single linear layer now this is the jupyter notebook that we built out last lecture and we saw that the way we approached this is that we looked at only the single previous character and we predicted the distribution for the character that would go next in the sequence and we did that by taking counts and normalizing them into probabilities so that each row here sums to one now this is all well and good if you only have one character of previous context and this works and it's approachable the problem with this model of course is that the predictions from this model are not very good because you only take one character of context so the model didn't produce very name like sounding things now the problem with this approach though is that if we are to take more context into account when predicting the next character in a sequence things quickly blow up and this table the size of this table grows and in fact it grows exponentially with the length of the context because if we only take a single character at a time that's 27 possibilities of context but if we take two characters in the past and try to predict the third one suddenly the number of rows in this matrix you can look at it that way is 27 times 27 so there's 729 possibilities for what could have come in the context if we take three characters as the context suddenly we have 20 000 possibilities of context and so there's just way too many rows of this matrix it's way too few counts for each possibility and the whole thing just kind of explodes and doesn't work very well so that's why today we're going to move on to this bullet point here and we're going to implement a multi-layer perceptron model to predict the next uh character in a sequence and this modeling approach that we're going to adopt follows this paper benguetal 2003 so i have the paper pulled up here now this isn't the very first paper that proposed the use of multiglio perceptrons or neural networks to predict the next character or token in a sequence but it's definitely one that is uh was very influential around that time it is very often cited to stand in for this idea and i think it's a very nice write-up and so this is the paper that we're going to first look at and then implement now this paper has 19 pages so we don't have time to go into the full detail of this paper but i invite you to read it it's very readable interesting and has a lot of interesting ideas in it as well in the introduction they describe the exact same problem i just described and then to address it they propose the following model now keep in mind that we are building a character level language model so we're working on the level of characters in this paper they have a vocabulary of 17 000 possible words and they instead build a word level language model but we're going to still stick with the characters but we'll take the same modeling approach now what they do is basically they propose to take every one of these words seventeen thousand words and they're going to associate to each word a say thirty dimensional feature vector so every word is now embedded into a thirty dimensional space you can think of it that way so we have 17 000 points or vectors in a 30 dimensional space and that's um you might imagine that's very crowded that's a lot of points for a very small space now in the beginning these words are initialized completely randomly so they're spread out at random but then we're going to tune these embeddings of these words using back propagation so during the course of training of this neural network these points or vectors are going to basically move around in this space and you might imagine that for example words that have very similar meanings or that are indeed synonyms of each other might end up in a very similar part of the space and conversely words that mean very different things would go somewhere else in a space now their modeling approach otherwise is identical to ours they are using a multi-layer neural network to predict the next word given the previous words and to train the neural network they are maximizing the log likelihood of the training data just like we did so the modeling approach itself is identical now here they have a concrete example of this intuition why does it work basically suppose that for example you are trying to predict a dog was running in a blank now suppose that the exact phrase a dog was running in a has never occurred in a training data and here you are at sort of test time later when the model is deployed somewhere and it's trying to make a sentence and it's saying a dog was running in a blank and because it's never encountered this exact phrase in the training set you're out of distribution as we say like you don't have fundamentally any reason to suspect what might come next but this approach actually allows you to get around that because maybe you didn't see the exact phrase a dog was running in a something but maybe you've seen similar phrases maybe you've seen the phrase the dog was running in a blank and maybe your network has learned that a and the are like frequently are interchangeable with each other and so maybe it took the embedding for a and the embedding for the and it actually put them like nearby each other in the space and so you can transfer knowledge through that embedding and you can generalize in that way similarly the network could know that cats and dogs are animals and they co-occur in lots of very similar contexts and so even though you haven't seen this exact phrase or if you haven't seen exactly walking or running you can through the embedding space transfer knowledge and you can generalize to novel scenarios so let's now scroll down to the diagram of the neural network they have a nice diagram here and in this example we are taking three previous words and we are trying to predict the fourth word in a sequence now these three previous words as i mentioned uh we have a vocabulary of 17 000 um possible words so every one of these basically basically are the index of the incoming word and because there are 17 000 words this is an integer between 0 and 16999 now there's also a lookup table that they call c this lookup table is a matrix that is 17 000 by say 30. and basically what we're doing here is we're treating this as a lookup table and so every index is plucking out a row of this embedding matrix so that each index is converted to the 30 dimensional vector that corresponds to the embedding vector for that word so here we have the input layer of 30 neurons for three words making up 90 neurons in total and here they're saying that this matrix c is shared across all the words so we're always indexing into the same matrix c over and over um for each one of these words next up is the hidden layer of this neural network the size of this hidden neural layer of this neural net is a hoppy parameter so we use the word hyperparameter when it's kind of like a design choice up to the designer of the neural net and this can be as large as you'd like or as small as you'd like so for example the size could be a hundred and we are going to go over multiple choices of the size of this hidden layer and we're going to evaluate how well they work so say there were 100 neurons here all of them would be fully connected to the 90 words or 90 um numbers that make up these three words so this is a fully connected layer then there's a 10 inch long linearity and then there's this output layer and because there are 17 000 possible words that could come next this layer has 17 000 neurons and all of them are fully connected to all of these neurons in the hidden layer so there's a lot of parameters here because there's a lot of words so most computation is here this is the expensive layer now there are 17 000 logits here so on top of there we have the softmax layer which we've seen in our previous video as well so every one of these logits is exponentiated and then everything is normalized to sum to 1 so that we have a nice probability distribution for the next word in the sequence now of course during training we actually have the label we have the identity of the next word in a sequence that word or its index is used to pluck out the probability of that word and then we are maximizing the probability of that word with respect to the parameters of this neural net so the parameters are the weights and biases of this output layer the weights and biases of the hidden layer and the embedding lookup table c and all of that is optimized using back propagation and these uh dashed arrows ignore those uh that represents a variation of a neural nut that we are not going to explore in this video so that's the setup and now let's implement it okay so i started a brand new notebook for this lecture we are importing pytorch and we are importing matplotlib so we can create figures then i am reading all the names into a list of words like i did before and i'm showing the first eight right here keep in mind that we have a 32 000 in total these are just the first eight and then here i'm building out the vocabulary of characters and all the mappings from the characters as strings to integers and vice versa now the first thing we want to do is we want to compile the data set for the neural network and i had to rewrite this code um i'll show you in a second what it looks like so this is the code that i created for the dataset creation so let me first run it and then i'll briefly explain how this works so first we're going to define something called block size and this is basically the context length of how many characters do we take to predict the next one so here in this example we're taking three characters to predict the fourth one so we have a block size of three that's the size of the block that supports the prediction then here i'm building out the x and y the x are the input to the neural net and the y are the labels for each example inside x then i'm airing over the first five words i'm doing first five just for efficiency while we are developing all the code but then later we're going to come here and erase this so that we use the entire training set so here i'm printing the word emma and here i'm basically showing the examples that we can generate the five examples that we can generate out of the single um sort of word emma so when we are given the context of just uh dot dot the first character in a sequence is e in this context the label is m when the context is this the label is m and so forth and so the way i build this out is first i start with a padded context of just zero tokens then i iterate over all the characters i get the character in the sequence and i basically build out the array y of this current character and the array x which stores the current running context and then here see i print everything and here i um crop the context and enter the new character in a sequence so this is kind of like a rolling window of context now we can change the block size here to for example four and in that case we'll be predicting the fifth character given the previous four or it can be five and then it would look like this or it can be say ten and then it would look something like this we're taking ten characters to predict the eleventh one and we're always padding with dots so let me bring this back to three just so that we have what we have here in the paper and finally the data set right now looks as follows from these five words we have created a data set of 32 examples and each input of the neural net is three integers and we have a label that is also an integer y so x looks like this these are the individual examples and then y are the labels so given this let's now write a neural network that takes these axes and predicts the y's first let's build the embedding lookup table c so we have 27 possible characters and we're going to embed them in a lower dimensional space in the paper they have 17 000 words and they bet them in uh spaces as small dimensional as 30. so they cram 17 000 words into 30 dimensional space in our case we have only 27 possible characters so let's grab them in something as small as to start with for example a two-dimensional space so this lookup table will be random numbers and we'll have 27 rows and we'll have two columns right so each 20 each one of 27 characters will have a two-dimensional embedding so that's our matrix c of embeddings in the beginning initialized randomly now before we embed all of the integers inside the input x using this lookup table c let me actually just try to embed a single individual integer like say five so we get a sense of how this works now one way this works of course is we can just take the c and we can index into row five and that gives us a vector the fifth row of c and um this is one way to do it the other way that i presented in the previous lecture is actually seemingly different but actually identical so in the previous lecture what we did is we took these integers and we used the one hot encoding to first encode them so f.1 hot we want to encode integer 5 and we want to tell it that the number of classes is 27 so that's the 26 dimensional vector of all zeros except the fifth bit is turned on now this actually doesn't work the reason is that this input actually must be a doorstop tensor and i'm making some of these errors intentionally just so you get to see some errors and how to fix them so this must be a tester not an int fairly straightforward to fix we get a one hot vector the fifth dimension is one and the shape of this is 27. and now notice that just as i briefly alluded to in the previous video if we take this one hot vector and we multiply it by c then what would you expect well number one first you'd expect an error because expected scalar type long but found float so a little bit confusing but the problem here is that one hot the data type of it is long it's a 64-bit integer but this is a float tensor and so pytorch doesn't know how to multiply an int with a float and that's why we had to explicitly cast this to a float so that we can multiply now the output actually here is identical and that it's identical because of the way the matrix multiplication here works we have the one hot um vector multiplying columns of c and because of all the zeros they actually end up masking out everything in c except for the fifth row which is plucked out and so we actually arrive at the same result and that tells you that here we can interpret this first piece here this embedding of the integer we can either think of it as the integer indexing into a lookup table c but equivalently we can also think of this little piece here as a first layer of this bigger neural net this layer here has neurons that have no non-linearity there's no 10h they're just linear neurons and their weight matrix is c and then we are encoding integers into one hot and feeding those into a neural net and this first layer basically embeds them so those are two equivalent ways of doing the same thing we're just going to index because it's much much faster and we're going to discard this interpretation of one hot inputs into neural nets and we're just going to index integers and create and use embedding tables now embedding a single integer like 5 is easy enough we can simply ask pytorch to retrieve the fifth row of c or the row index five of c but how do we simultaneously embed all of these 32 by three integers stored in array x luckily pytorch indexing is fairly flexible and quite powerful so it doesn't just work to ask for a single element five like this you can actually index using lists so for example we can get the rows five six and seven and this will just work like this we can index with a list it doesn't just have to be a list it can also be a actually a tensor of integers and we can index with that so this is a integer tensor 567 and this will just work as well in fact we can also for example repeat row 7 and retrieve it multiple times and that same index will just get embedded multiple times here so here we are indexing with a one-dimensional tensor of integers but it turns out that you can also index with multi-dimensional tensors of integers here we have a two-dimensional in tensor of integers so we can simply just do c at x and this just works and the shape of this is 32 by 3 which is the original shape and now for every one of those 32 by 3 integers we've retrieved the embedding vector here so basically we have that as an example the 13th or example index 13 the second dimension is the integer 1 as an example and so here if we do c of x which gives us that array and then we index into 13 by two of that array then we we get the embedding here and you can verify that c at one which is the integer at that location is indeed equal to this you see they're equal so basically long story short pytorch indexing is awesome and to embed simultaneously all of the integers in x we can simply do c of x and that is our embedding and that just works now let's construct this layer here the hidden layer so we have that w1 as i'll call it are these weights which we will initialize randomly now the number of inputs to this layer is going to be three times two right because we have two dimensional embeddings and we have three of them so the number of inputs is 6 and the number of neurons in this layer is a variable up to us let's use 100 neurons as an example and then biases will be also initialized randomly as an example and let's and we just need 100 of them now the problem with this is we can't simply normally we would take the input in this case that's embedding and we'd like to multiply it with these weights and then we would like to add the bias this is roughly what we want to do but the problem here is that these embeddings are stacked up in the dimensions of this input tensor so this will not work this matrix multiplication because this is a shape 32 by 3 by 2 and i can't multiply that by 6 by 100 so somehow we need to concatenate these inputs here together so that we can do something along these lines which currently does not work so how do we transform this 32 by 3 by 2 into a 32 by 6 so that we can actually perform this multiplication over here i'd like to show you that there are usually many ways of implementing what you'd like to do in torch and some of them will be faster better shorter etc and that's because torch is a very large library and it's got lots and lots of functions so if you just go to the documentation and click on torch you'll see that my slider here is very tiny and that's because there are so many functions that you can call on these tensors to transform them create them multiply them add them perform all kinds of different operations on them and so this is kind of like the space of possibility if you will now one of the things that you can do is if we can control here ctrl f for concatenate and we see that there's a function torque.cat short for concatenate and this concatenates the given sequence of tensors in a given dimension and these sensors must have the same shape etc so we can use the concatenate operation to in a naive way concatenate these three embeddings for each input so in this case we have m of amp of the shape and really what we want to do is we want to retrieve these three parts and concatenate them so we want to grab all the examples we want to grab first the zeroth index and then all of this so this plucks out the 32 by 2 embeddings of just the first word here and so basically we want this guy we want the first dimension and we want the second dimension and these are the three pieces individually and then we want to treat this as a sequence and we want to torch that cat on that sequence so this is the list tor.cat takes a sequence of tensors and then we have to tell it along which dimension to concatenate so in this case all these are 32 by 2 and we want to concatenate not across dimension 0 by the cross dimension one so passing in one gives us a result the shape of this is 32 by 6 exactly as we'd like so that basically took 32 and squashed these by concatenating them into 32 by 6. now this is kind of ugly because this code would not generalize if we want to later change the block size right now we have three inputs three words but what if we had five then here we would have to change the code because i'm indexing directly well torch comes to rescue again because that turns out to be a function called unbind and it removes a tensor dimension so it removes the tensor dimension returns a tuple of all slices along a given dimension without it so this is exactly what we need and basically when we call torch dot unbind torch dot unbind of m and pass in dimension 1 index 1 this gives us a list of a list of tensors exactly equivalent to this so running this gives us a line 3 and it's exactly this list and so we can call torch.cat on it and along the first dimension and this works and this shape is the same but now this is uh it doesn't matter if we have block size 3 or 5 or 10 this will just work so this is one way to do it but it turns out that in this case there's actually a significantly better and more efficient way and this gives me an opportunity to hint at some of the internals of torch.tensor so let's create an array here of elements from 0 to 17 and the shape of this is just 18. it's a single picture of 18 numbers it turns out that we can very quickly re-represent this as different sized and dimensional tensors we do this by calling a view and we can say that actually this is not a single vector of 18 this is a two by nine tensor or alternatively this is a nine by two tensor or this is actually a three by three by two tensor as long as the total number of elements here multiply to be the same this will just work and in pytorch this operation calling that view is extremely efficient and the reason for that is that in each tensor there's something called the underlying storage and the storage is just the numbers always as a one-dimensional vector and this is how this tensor is represented in the computer memory it's always a one-dimensional vector but when we call that view we are manipulating some of attributes of that tensor that dictate how this one-dimensional sequence is interpreted to be an n-dimensional tensor and so what's happening here is that no memory is being changed copied moved or created when we call that view the storage is identical but when you call that view some of the internal attributes of the view of the sensor are being manipulated and changed in particular that's something there's something called a storage offset strides and shapes and those are manipulated so that this one-dimensional sequence of bytes is seen as different and dimensional arrays there's a blog post here from eric called pi torch internals where he goes into some of this with respect to tensor and how the view of the tensor is represented and this is really just like a logical construct of representing the physical memory and so this is a pretty good um blog post that you can go into i might also create an entire video on the internals of torch tensor and how this works for here we just note that this is an extremely efficient operation and if i delete this and come back to our end we see that the shape of our end is 32 by three by two but we can simply ask for pytorch to view this instead as a 32 by six and the way this gets flattened into a 32 by six array just happens that these two get stacked up in a single row and so that's basically the concatenation operation that we're after and you can verify that this actually gives the exact same result as what we had before so this is an element y equals and you can see that all the elements of these two tensors are the same and so we get the exact same result so long story short we can actually just come here and if we just view this as a 32x6 instead then this multiplication will work and give us the hidden states that we're after so if this is h then h shape is now the 100 dimensional activations for every one of our 32 examples and this gives the desired result let me do two things here number one let's not use 32 we can for example do something like m.shape at 0 so that we don't hard code these numbers and this would work for any size of this amp or alternatively we can also do negative one when we do negative one pi torch will infer what this should be because the number of elements must be the same and we're saying that this is 6 by church will derive that this must be 32 or whatever else it is if m is of different size the other thing is here um one more thing i'd like to point out is here when we do the concatenation this actually is much less efficient because um this concatenation would create a whole new tensor with a whole new storage so new memory is being created because there's no way to concatenate tensors just by manipulating the view attributes so this is inefficient and creates all kinds of new memory uh so let me delete this now we don't need this and here to calculate h we want to also dot 10h of this to get our oops to get our h so these are now numbers between negative one and one because of the 10h and we have that the shape is 32 by 100 and that is basically this hidden layer of activations here for every one of our 32 examples now there's one more thing i've lost over that we have to be very careful with and that this and that's this plus here in particular we want to make sure that the broadcasting will do what we like the shape of this is 32 by 100 and the ones shape is 100. so we see that the addition here will broadcast these two and in particular we have 32 by 100 broadcasting to 100. so broadcasting will align on the right create a fake dimension here so this will become a 1 by 100 row vector and then it will copy vertically for every one of these rows of 32 and do an element wise addition so in this case the correct thing will be happening because the same bias vector will be added to all the rows of this matrix so that is correct that's what we'd like and it's always good practice you just make sure so that you don't shoot yourself in the foot and finally let's create the final layer here so let's create w2 and v2 the input now is 100 and the output number of neurons will be for us 27 because we have 27 possible characters that come next so the biases will be 27 as well so therefore the logits which are the outputs of this neural net are going to be um h multiplied by w2 plus b2 logistic shape is 32 by 27 and the logits look good now exactly as we saw in the previous video we want to take these logits and we want to first exponentiate them to get our fake counts and then we want to normalize them into a probability so prob is counts divide and now counts dot sum along the first dimension and keep them as true exactly as in the previous video and so prob that shape now is 32 by 27 and you'll see that every row of prob sums to one so it's normalized so that gives us the probabilities now of course we have the actual letter that comes next and that comes from this array y which we which we created during the dataset creation so why is this last piece here which is the identity of the next character in the sequence that we'd like to now predict so what we'd like to do now is just as in the previous video we'd like to index into the rows of prob and in each row we'd like to pluck out the probability assigned to the correct character as given here so first we have torch.range of 32 which is kind of like a iterator over numbers from 0 to 31 and then we can index into prob in the following way prop in torch.range of 32 which iterates the roads and in each row we'd like to grab this column as given by y so this gives the current probabilities as assigned by this neural network with this setting of its weights to the correct character in the sequence and you can see here that this looks okay for some of these characters like this is basically 0.2 but it doesn't look very good at all for many other characters like this is 0.0701 probability and so the network thinks that some of these are extremely unlikely but of course we haven't trained the neural network yet so this will improve and ideally all of these numbers here of course are one because then we are correctly predicting the next character now just as in the previous video we want to take these probabilities we want to look at the lock probability and then we want to look at the average probability and the negative of it to create the negative log likelihood loss so the loss here is 17 and this is the loss that we'd like to minimize to get the network to predict the correct character in the sequence okay so i rewrote everything here and made it a bit more respectable so here's our data set here's all the parameters that we defined i'm now using a generator to make it reproducible i clustered all the parameters into a single list of parameters so that for example it's easy to count them and see that in total we currently have about 3400 parameters and this is the forward pass as we developed it and we arrive at a single number here the loss that is currently expressing how well this neural network works with the current setting of parameters now i would like to make it even more respectable so in particular see these lines here where we take the logits and we calculate the loss we're not actually reinventing the wheel here this is just um classification and many people use classification and that's why there is a functional.cross entropy function in pytorch to calculate this much more efficiently so we can just simply call f.cross entropy and we can pass in the logits and we can pass in the array of targets y and this calculates the exact same loss so in fact we can simply put this here and erase these three lines and we're going to get the exact same result now there are actually many good reasons to prefer f.cross entropy over rolling your own implementation like this i did this for educational reasons but you'd never use this in practice why is that number one when you use f.cross entropy by torch will not actually create all these intermediate tensors because these are all new tensors in memory and all this is fairly inefficient to run like this instead pytorch will cluster up all these operations and very often create have fused kernels that very efficiently evaluate these expressions that are sort of like clustered mathematical operations number two the backward pass can be made much more efficient and not just because it's a fused kernel but also analytically and mathematically it's much it's often a very much simpler backward pass to implement we actually sell this with micrograd you see here when we implemented 10h the forward pass of this operation to calculate the 10h was actually a fairly complicated mathematical expression but because it's a clustered mathematical expression when we did the backward pass we didn't individually backward through the x and the two times and the minus one in division etc we just said it's one minus t squared and that's a much simpler mathematical expression and we were able to do this because we're able to reuse calculations and because we are able to mathematically and analytically derive the derivative and often that expression simplifies mathematically and so there's much less to implement so not only can can it be made more efficient because it runs in a fused kernel but also because the expressions can take a much simpler form mathematically so that's number one number two under the hood f that cross entropy can also be significantly more um numerically well behaved let me show you an example of how this works suppose we have a logits of negative 2 3 negative 3 0 and 5 and then we are taking the exponent of it and normalizing it to sum to 1. so when logits take on this values everything is well and good and we get a nice probability distribution now consider what happens when some of these logits take on more extreme values and that can happen during optimization of the neural network suppose that some of these numbers grow very negative like say negative 100 then actually everything will come out fine we still get the probabilities that um you know are well behaved and they sum to one and everything is great but because of the way the x works if you have very positive logits let's say positive 100 in here you actually start to run into trouble and we get not a number here and the reason for that is that these counts have an if here so if you pass in a very negative number to x you just get a very negative sorry not negative but very small number very very near zero and that's fine but if you pass in a very positive number suddenly we run out of range in our floating point number that represents these counts so basically we're taking e and we're raising it to the power of 100 and that gives us if because we run out of dynamic range on this floating point number that is count and so we cannot pass very large logits through this expression now let me reset these numbers to something reasonable the way pi torch solved this is that you see how we have a well-behaved result here it turns out that because of the normalization here you can actually offset logits by any arbitrary constant value that you want so if i add 1 here you actually get the exact same result or if i add 2 or if i subtract three any offset will produce the exact same probabilities so because negative numbers are okay but positive numbers can actually overflow this x what patrick does is it internally calculates the maximum value that occurs in the logits and it subtracts it so in this case it would subtract five and so therefore the greatest number in logits will become zero and all the other numbers will become some negative numbers and then the result of this is always well behaved so even if we have 100 here previously not good but because pytorch will subtract 100 this will work and so there's many good reasons to call cross-entropy number one the forward pass can be much more efficient the backward pass can be much more efficient and also things can be much more numerically well behaved okay so let's now set up the training of this neural net we have the forward pass uh we don't need these is that we have the losses equal to the f.cross entropy that's the forward pass then we need the backward pass first we want to set the gradients to be zero so for p in parameters we want to make sure that p dot grad is none which is the same as setting it to zero in pi torch and then lost that backward to populate those gradients once we have the gradients we can do the parameter update so for p in parameters we want to take all the data and we want to nudge it learning rate times p dot grad and then we want to repeat this a few times and let's print the loss here as well now this won't suffice and it will create an error because we also have to go for pn parameters and we have to make sure that p dot requires grad is set to true in pi torch and this should just work okay so we started off with loss of 17 and we're decreasing it let's run longer and you see how the loss decreases a lot here so if we just run for a thousand times we get a very very low loss and that means that we're making very good predictions now the reason that this is so straightforward right now is because we're only um overfitting 32 examples so we only have 32 examples uh of the first five words and therefore it's very easy to make this neural net fit only these two 32 examples because we have 3 400 parameters and only 32 examples so we're doing what's called overfitting a single batch of the data and getting a very low loss and good predictions um but that's just because we have so many parameters for so few examples so it's easy to uh make this be very low now we're not able to achieve exactly zero and the reason for that is we can for example look at logits which are being predicted and we can look at the max along the first dimension and in pi torch max reports both the actual values that take on the maximum number but also the indices of piece and you'll see that the indices are very close to the labels but in some cases they differ for example in this very first example the predicted index is 19 but the label is five and we're not able to make loss be zero and fundamentally that's because here the very first or the zeroth index is the example where dot dot dot is supposed to predict e but you see how dot dot dot is also supposed to predict an o and dot dot is also supposed to predict an i and then s as well and so basically e o a or s are all possible outcomes in a training set for the exact same input so we're not able to completely over fit and um and make the loss be exactly zero so but we're getting very close in the cases where there's a unique input for a unique output in those cases we do what's called overfit and we basically get the exact same and the exact correct result so now all we have to do is we just need to make sure that we read in the full data set and optimize the neural net okay so let's swing back up where we created the dataset and we see that here we only use the first five words so let me now erase this and let me erase the print statements otherwise we'd be printing way too much and so when we processed the full data set of all the words we now had 228 000 examples instead of just 32. so let's now scroll back down to this is much larger reinitialize the weights the same number of parameters they all require gradients and then let's push this print out lost.item to be here and let's just see how the optimization goes if we run this okay so we started with a fairly high loss and then as we're optimizing the loss is coming down but you'll notice that it takes quite a bit of time for every single iteration so let's actually address that because we're doing way too much work forwarding and backwarding 220 000 examples in practice what people usually do is they perform forward and backward pass and update on many batches of the data so what we will want to do is we want to randomly select some portion of the data set and that's a mini batch and then only forward backward and update on that little mini batch and then we iterate on those many batches so in pytorch we can for example use storage.randint we can generate numbers between 0 and 5 and make 32 of them i believe the size has to be a tuple in my torch so we can have a tuple 32 of numbers between zero and five but actually we want x dot shape of zero here and so this creates uh integers that index into our data set and there's 32 of them so if our mini batch size is 32 then we can come here and we can first do a mini batch construct so in the integers that we want to optimize in this single iteration are in the ix and then we want to index into x with ix to only grab those rows so we're only getting 32 rows of x and therefore embeddings will again be 32 by three by two not two hundred thousand by three by two and then this ix has to be used not just to index into x but also to index into y and now this should be many batches and this should be much much faster so okay so it's instant almost so this way we can run many many examples nearly instantly and decrease the loss much much faster now because we're only dealing with mini batches the quality of our gradient is lower so the direction is not as reliable it's not the actual gradient direction but the gradient direction is good enough even when it's estimating on only 32 examples that it is useful and so it's much better to have an approximate gradient and just make more steps than it is to evaluate the exact gradient and take fewer steps so that's why in practice uh this works quite well so let's now continue the optimization let me take out this lost item from here and uh place it over here at the end okay so we're hovering around 2.5 or so however this is only the loss for that mini batch so let's actually evaluate the loss here for all of x and for all of y just so we have a full sense of exactly how all the model is doing right now so right now we're at about 2.7 on the entire training set so let's run the optimization for a while okay right 2.6 2.57 2.53 okay so one issue of course is we don't know if we're stepping too slow or too fast so this point one i just guessed it so one question is how do you determine this learning rate and how do we gain confidence that we're stepping in the right sort of speed so i'll show you one way to determine a reasonable learning rate it works as follows let's reset our parameters to the initial settings and now let's print in every step but let's only do 10 steps or so or maybe maybe 100 steps we want to find like a very reasonable set search range if you will so for example if this is like very low then we see that the loss is barely decreasing so that's not that's like too low basically so let's try this one okay so we're decreasing the loss but like not very quickly so that's a pretty good low range now let's reset it again and now let's try to find the place at which the loss kind of explodes uh so maybe at negative one okay we see that we're minimizing the loss but you see how uh it's kind of unstable it goes up and down quite a bit um so negative one is probably like a fast learning rate let's try negative 10. okay so this isn't optimizing this is not working very well so negative 10 is way too big negative one was already kind of big um so therefore negative one was like somewhat reasonable if i reset so i'm thinking that the right learning rate is somewhere between uh negative zero point zero zero one and um negative one so the way we can do this here is we can use uh torch shot lens space and we want to basically do something like this between zero and one but um those number of steps is one more parameter that's required let's do a thousand steps this creates 1000 numbers between 0.01 and 1 but it doesn't really make sense to step between these linearly so instead let me create learning rate exponent and instead of 0.001 this will be a negative 3 and this will be a zero and then the actual lrs that we want to search over are going to be 10 to the power of lre so now what we're doing is we're stepping linearly between the exponents of these learning rates this is 0.001 and this is 1 because 10 to the power of 0 is 1. and therefore we are spaced exponentially in this interval so these are the candidate learning rates that we want to sort of like search over roughly so now what we're going to do is here we are going to run the optimization for 1000 steps and instead of using a fixed number we are going to use learning rate indexing into here lrs of i and make this i so basically let me reset this to be again starting from random creating these learning rates between negative zero points between 0.001 and um one but exponentially stopped and here what we're doing is we're iterating a thousand times we're going to use the learning rate um that's in the beginning very very low in the beginning is going to be 0.001 but by the end it's going to be 1. and then we're going to step with that learning rate and now what we want to do is we want to keep track of the uh learning rates that we used and we want to look at the losses that resulted and so here let me track stats so lri.append lr and um lost side that append loss that item okay so again reset everything and then run and so basically we started with a very low learning rate and we went all the way up to a learning rate of negative one and now what we can do is we can plt that plot and we can plot the two so we can plot the learning rates on the x-axis and the losses we saw on the y-axis and often you're going to find that your plot looks something like this where in the beginning you had very low learning rates so basically anything barely anything happened then we got to like a nice spot here and then as we increase the learning rate enough we basically started to be kind of unstable here so a good learning rate turns out to be somewhere around here um and because we have lri here um we actually may want to um do not lr not the learning rate but the exponent so that would be the lre at i is maybe what we want to log so let me reset this and redo that calculation but now on the x axis we have the exponent of the learning rate and so we can see the exponent of the learning rate that is good to use it would be sort of like roughly in the valley here because here the learning rates are just way too low and then here where we expect relatively good learning rates somewhere here and then here things are starting to explode so somewhere around negative one x the exponent of the learning rate is a pretty good setting and 10 to the negative one is 0.1 so 0.1 is actually 0.1 was actually a fairly good learning rate around here and that's what we had in the initial setting but that's roughly how you would determine it and so here now we can take out the tracking of these and we can just simply set lr to be 10 to the negative one or basically otherwise 0.1 as it was before and now we have some confidence that this is actually a fairly good learning rate and so now we can do is we can crank up the iterations we can reset our optimization and we can run for a pretty long time using this learning rate oops and we don't want to print that's way too much printing so let me again reset and run ten thousand stops okay so we're 0.2 2.48 roughly let's run another 10 000 steps 2.46 and now let's do one learning rate decay what this means is we're going to take our learning rate and we're going to 10x lower it and so we're at the late stages of training potentially and we may want to go a bit slower let's do one more actually at 0.1 just to see if we're making a dent here okay we're still making dent and by the way the bi-gram loss that we achieved last video was 2.45 so we've already surpassed the bi-gram model and once i get a sense that this is actually kind of starting to plateau off people like to do as i mentioned this learning rate decay so let's try to decay the loss the learning rate i mean and we achieve it about 2.3 now obviously this is janky and not exactly how you would train it in production but this is roughly what you're going through you first find a decent learning rate using the approach that i showed you then you start with that learning rate and you train for a while and then at the end people like to do a learning rate decay where you decay the learning rate by say a factor of 10 and you do a few more steps and then you get a trained network roughly speaking so we've achieved 2.3 and dramatically improved on the bi-gram language model using this simple neural net as described here using these 3 400 parameters now there's something we have to be careful with i said that we have a better model because we are achieving a lower loss 2.3 much lower than 2.45 with the bi-gram model previously now that's not exactly true and the reason that's not true is that this is actually fairly small model but these models can get larger and larger if you keep adding neurons and parameters so you can imagine that we don't potentially have a thousand parameters we could have 10 000 or 100 000 or millions of parameters and as the capacity of the neural network grows it becomes more and more capable of overfitting your training set what that means is that the loss on the training set on the data that you're training on will become very very low as low as zero but all that the model is doing is memorizing your training set verbatim so if you take that model and it looks like it's working really well but you try to sample from it you will basically only get examples exactly as they are in the training set you won't get any new data in addition to that if you try to evaluate the loss on some withheld names or other words you will actually see that the loss on those can be very high and so basically it's not a good model so the standard in the field is to split up your data set into three splits as we call them we have the training split the dev split or the validation split and the test split so training split test or um sorry dev or validation split and test split and typically this would be say eighty percent of your data set this could be ten percent and this ten percent roughly so you have these three splits of the data now these eighty percent of your trainings of the data set the training set is used to optimize the parameters of the model just like we're doing here using gradient descent these 10 percent of the examples the dev or validation split they're used for development over all the hyper parameters of your model so hyper parameters are for example the size of this hidden layer the size of the embedding so this is a hundred or a two for us but we could try different things the strength of the regularization which we aren't using yet so far so there's lots of different hybrid parameters and settings that go into defining your neural net and you can try many different variations of them and see whichever one works best on your validation split so this is used to train the parameters this is used to train the hyperprimers and test split is used to evaluate basically the performance of the model at the end so we're only evaluating the loss on the test plate very very sparingly and very few times because every single time you evaluate your test loss and you learn something from it you are basically starting to also train on the test split so you are only allowed to test the loss on a test set very very few times otherwise you risk overfitting to it as well as you experiment on your model so let's also split up our training data into train dev and test and then we are going to train on train and only evaluate on tests very very sparingly okay so here we go here is where we took all the words and put them into x and y tensors so instead let me create a new cell here and let me just copy paste some code here because i don't think it's that complex but we're going to try to save a little bit of time i'm converting this to be a function now and this function takes some list of words and builds the arrays x and y for those words only and then here i am shuffling up all the words so these are the input words that we get we are randomly shuffling them all up and then um we're going to set n1 to be the number of examples that there's 80 of the words and n2 to be 90 of the way of the words so basically if len of words is 32 000 n1 is well sorry i should probably run this n1 is 25 000 and n2 is 28 000. and so here we see that i'm calling build data set to build the training set x and y by indexing into up to and one so we're going to have only 25 000 training words and then we're going to have roughly n2 minus n1 3 3 000 validation examples or dev examples and we're going to have when of words basically minus and two or 3 204 examples here for the test set so now we have x's and y's for all those three splits oh yeah i'm printing their size here inside the function as well but here we don't have words but these are already the individual examples made from those words so let's now scroll down here and the data set now for training is more like this and then when we reset the network when we're training we're only going to be training using x train x train and y train so that's the only thing we're training on let's see where we are on the single batch let's now train maybe a few more steps training neural networks can take a while usually you don't do it inline you launch a bunch of jobs and you wait for them to finish um can take in multiple days and so on luckily this is a very small network okay so the loss is pretty good oh we accidentally used a learning rate that is way too low so let me actually come back we use the decay learning rate of 0.01 so this will train much faster and then here when we evaluate let's use the dep set here xdev and ydev to evaluate the loss okay and let's now decay the learning rate and only do say 10 000 examples and let's evaluate the dev loss ones here okay so we're getting about 2.3 on dev and so the neural network when it was training did not see these dev examples it hasn't optimized on them and yet when we evaluate the loss on these dev we actually get a pretty decent loss and so we can also look at what the loss is on all of training set oops and so we see that the training and the dev loss are about equal so we're not over fitting um this model is not powerful enough to just be purely memorizing the data and so far we are what's called underfitting because the training loss and the dev or test losses are roughly equal so what that typically means is that our network is very tiny very small and we expect to make performance improvements by scaling up the size of this neural net so let's do that now so let's come over here and let's increase the size of the neural net the easiest way to do this is we can come here to the hidden layer which currently has 100 neurons and let's just bump this up so let's do 300 neurons and then this is also 300 biases and here we have 300 inputs into the final layer so let's initialize our neural net we now have ten thousand ex ten thousand parameters instead of three thousand parameters and then we're not using this and then here what i'd like to do is i'd like to actually uh keep track of uh tap um okay let's just do this let's keep stats again and here when we're keeping track of the loss let's just also keep track of the steps and let's just have i here and let's train on thirty thousand or rather say okay let's try thirty thousand and we are at point one and we should be able to run this and optimize the neural net and then here basically i want to plt.plot the steps against the loss so these are the x's and y's and this is the loss function and how it's being optimized now you see that there's quite a bit of thickness to this and that's because we are optimizing over these mini batches and the mini batches create a little bit of noise in this uh where are we in the def set we are at 2.5 so we still haven't optimized this neural net very well and that's probably because we made it bigger it might take longer for this neural net to converge um and so let's continue training um yeah let's just continue training one possibility is that the batch size is so low that uh we just have way too much noise in the training and we may want to increase the batch size so that we have a bit more um correct gradient and we're not thrashing too much and we can actually like optimize more properly okay this will now become meaningless because we've reinitialized these so yeah this looks not pleasing right now but there probably is like a tiny improvement but it's so hard to tell let's go again 2.52 let's try to decrease the learning rate by factor two okay we're at 2.32 let's continue training we basically expect to see a lower loss than what we had before because now we have a much much bigger model and we were under fitting so we'd expect that increasing the size of the model should help the neural net 2.32 okay so that's not happening too well now one other concern is that even though we've made the 10h layer here or the hidden layer much much bigger it could be that the bottleneck of the network right now are these embeddings that are two dimensional it can be that we're just cramming way too many characters into just two dimensions and the neural net is not able to really use that space effectively and that that is sort of like the bottleneck to our network's performance okay 2.23 so just by decreasing the learning rate i was able to make quite a bit of progress let's run this one more time and then evaluate the training and the dev loss now one more thing after training that i'd like to do is i'd like to visualize the um embedding vectors for these characters before we scale up the embedding size from two because we'd like to make uh this bottleneck potentially go away but once i make this greater than two we won't be able to visualize them so here okay we're at 2.23 and 2.24 so um we're not improving much more and maybe the bottleneck now is the character embedding size which is two so here i have a bunch of code that will create a figure and then we're going to visualize the embeddings that were trained by the neural net on these characters because right now the embedding has just two so we can visualize all the characters with the x and the y coordinates as the two embedding locations for each of these characters and so here are the x coordinates and the y coordinates which are the columns of c and then for each one i also include the text of the little character so here what we see is actually kind of interesting the network has basically learned to separate out the characters and cluster them a little bit uh so for example you see how the vowels a e i o u are clustered up here so that's telling us that is that the neural net treats these is very similar right because when they feed into the neural net the embedding uh for all these characters is very similar and so the neural net thinks that they're very similar and kind of like interchangeable if that makes sense um then the the points that are like really far away are for example q q is kind of treated as an exception and q has a very special embedding vector so to speak similarly dot which is a special character is all the way out here and a lot of the other letters are sort of like clustered up here and so it's kind of interesting that there's a little bit of structure here after the training and it's not definitely not random and these embeddings make sense so we're now going to scale up the embedding size and won't be able to visualize it directly but we expect that because we're under fitting and we made this layer much bigger and did not sufficiently improve the loss we're thinking that the um constraint to better performance right now could be these embedding pictures so let's make them bigger okay so let's scroll up here and now we don't have two dimensional embeddings we are going to have say 10 dimensional embeddings for each word then this layer will receive 3 times 10 so 30 inputs will go into the hidden layer let's also make the hidden layer a bit smaller so instead of 300 let's just do 200 neurons in that hidden layer so now the total number of elements will be slightly bigger at 11 000 and then here we have to be a bit careful because um okay the learning rate we set to 0.1 here we are hardcoded in six and obviously if you're working in production you don't wanna be hard-coding magic numbers but instead of six this should now be thirty um and let's run for fifty thousand iterations and let me split out the initialization here outside so that when we run this cell multiple times it's not going to wipe out our loss in addition to that here let's instead of logging lost.item let's actually log the let's do log 10 i believe that's a function of the loss and i'll show you why in a second let's optimize this basically i'd like to plot the log loss instead of the loss because when you plot the loss many times it can have this hockey stick appearance and log squashes it in uh so it just kind of like looks nicer so the x-axis is step i and the y-axis will be the loss i and then here this is 30. ideally we wouldn't be hard-coding these okay so let's look at the loss okay it's again very thick because the mini batch size is very small but the total loss over the training set is 2.3 and the the tests or the def set is 2.38 as well so so far so good uh let's try to now decrease the learning rate by a factor of 10 and train for another 50 000 iterations we'd hope that we would be able to beat uh 2.32 but again we're just kind of like doing this very haphazardly so i don't actually have confidence that our learning rate is set very well that our learning rate decay which we just do at random is set very well and um so the optimization here is kind of suspect to be honest and this is not how you would do it typically in production in production you would create parameters or hyper parameters out of all these settings and then you would run lots of experiments and see whichever ones are working well for you okay so we have 2.17 now and 2.2 okay so you see how the training and the validation performance are starting to slightly slowly depart so maybe we're getting the sense that the neural net is getting good enough or that number of parameters is large enough that we are slowly starting to overfit let's maybe run one more iteration of this and see where we get but yeah basically you would be running lots of experiments and then you are slowly scrutinizing whichever ones give you the best depth performance and then once you find all the hyper parameters that make your dev performance good you take that model and you evaluate the test set performance a single time and that's the number that you report in your paper or wherever else you want to talk about and brag about your model so let's then rerun the plot and rerun the train and death and because we're getting lower loss now it is the case that the embedding size of these was holding us back very likely okay so 2.162.19 is what we're roughly getting so there's many ways to go from many ways to go from here we can continue tuning the optimization we can continue for example playing with the sizes of the neural net or we can increase the number of uh words or characters in our case that we are taking as an input so instead of just three characters we could be taking more characters as an input and that could further improve the loss okay so i changed the code slightly so we have here 200 000 steps of the optimization and in the first 100 000 we're using a learning rate of 0.1 and then in the next 100 000 we're using a learning rate of 0.01 this is the loss that i achieve and these are the performance on the training and validation loss and in particular the best validation loss i've been able to obtain in the last 30 minutes or so is 2.17 so now i invite you to beat this number and you have quite a few knobs available to you to i think surpass this number so number one you can of course change the number of neurons in the hidden layer of this model you can change the dimensionality of the embedding lookup table you can change the number of characters that are feeding in as an input as the context into this model and then of course you can change the details of the optimization how long are we running what is the learning rate how does it change over time how does it decay you can change the batch size and you may be able to actually achieve a much better convergence speed in terms of how many seconds or minutes it takes to train the model and get your result in terms of really good loss and then of course i actually invite you to read this paper it is 19 pages but at this point you should actually be able to read a good chunk of this paper and understand pretty good chunks of it and this paper also has quite a few ideas for improvements that you can play with so all of those are not available to you and you should be able to beat this number i'm leaving that as an exercise to the reader and that's it for now and i'll see you next time before we wrap up i also wanted to show how you would sample from the model so we're going to generate 20 samples at first we begin with all dots so that's the context and then until we generate the zeroth character again we're going to embed the current context using the embedding table c now usually uh here the first dimension was the size of the training set but here we're only working with a single example that we're generating so this is just the mission one just for simplicity and so this embedding then gets projected into the end state you get the logits now we calculate the probabilities for that you can use f.softmax of logits and that just basically exponentiates the logits and makes them sum to one and similar to cross entropy it is careful that there's no overflows once we have the probabilities we sample from them using torture multinomial to get our next index and then we shift the context window to append the index and record it and then we can just decode all the integers to strings and print them out and so these are some example samples and you can see that the model now works much better so the words here are much more word like or name like so we have things like ham joes you know it's starting to sound a little bit more name-like so we're definitely making progress but we can still improve on this model quite a lot okay sorry there's some bonus content i wanted to mention that i want to make these notebooks more accessible and so i don't want you to have to like install jupyter notebooks and torch and everything else so i will be sharing a link to a google colab and google collab will look like a notebook in your browser and you can just go to the url and you'll be able to execute all of the code that you saw in the google collab and so this is me executing the code in this lecture and i shortened it a little bit but basically you're able to train the exact same network and then plot and sample from the model and everything is ready for you to like tinker with the numbers right there in your browser no installation necessary so i just wanted to point that out and the link to this will be in the video description

Building makemore Part 2: MLP

Scores

Summary

Key Points

Key Takeaways

Primary Category

Secondary Categories

Topics

Entities

people

organizations

products

technologies

domain_specific

Sentiment

Content Type

Difficulty

Tone