With all the interest in AI programs, and specifically ChatGPT, I was quite curious about how it and other programs like it work.
I started my research with a visit to Wikipedia, followed by several days of general searching on the topic. Much of what I found was either very simplistic or highly technical, requiring a deeper knowledge base than I had. I needed something on the order of a primer that would let me build and test something without a graduate-level course.
My real journey started when I found Andrej Karpathy’s nanoGPT open-source code at https://github.com/karpathy/nanoGPT.
It contains the Python code and dataset, allowing you to feel the magic of actually building, training, and running a character-level Generative Pretrained Transformer (GPT).
Opening the link brings you to the directory structure, followed by the README.md file containing the program description and instructions on how to build, train, fine-tune, and run the code. It explains that nanoGPT is a simple, character-based Generative Pretrained Transformer (GPT). Using that model, you are shown how to train a Language Model (LM) on a dataset containing all the dialog of Shakespeare’s plays.
This is an open-source project containing the working code and dataset for a small demonstration AI Language Model (LM). It was quite enlightening to train and run. I don’t think it will change the world, but it will open your eyes to how a GPT works, and why Graphics Processing Units (GPUs) are essential: it took three days of processing on a PC without a GPU to train the simplest model.
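A “character-level” model like nanoGPT’s Shakespeare demo doesn’t need a fancy tokenizer: every distinct character in the training text simply becomes one integer ID. Here is a minimal sketch of that idea in plain Python (the names `stoi`, `itos`, `encode`, and `decode` are illustrative, not nanoGPT’s exact code):

```python
# Build a character-level vocabulary from a toy training text.
text = "To be, or not to be"

chars = sorted(set(text))                      # the model's entire vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s):
    """Turn a string into a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)
print(len(chars), ids[:5])   # vocabulary size and the first few ids
```

The model only ever sees these integers; training on all of Shakespeare works the same way, just with a larger text and therefore a slightly larger character vocabulary.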
Once I had experimented with nanoGPT, I wanted to understand how the code worked. Reviewing the source files did not help; I needed a better grounding in the underlying concepts.
At the end of the README file there was a link to a series of his lectures titled Neural Networks: Zero to Hero, designed to give users the basic understanding of the tools I needed. I consider these an outstanding series of videos that take you from the very start through building a basic Generative Pretrained Transformer (GPT) written in Python.
Neural Networks: Zero To Hero. https://karpathy.ai/zero-to-hero.html
This is the most step-by-step, spelled-out explanation of backpropagation and the training of neural networks. It assumes only basic knowledge of Python and a vague recollection of calculus from high school.
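The heart of that first lecture is a tiny scalar autograd engine. The sketch below captures the mechanism in the spirit of micrograd (this is a hedged illustration, not Karpathy’s actual code): each `Value` remembers how it was produced, so calling `backward()` can push gradients back through the expression graph.

```python
class Value:
    """A scalar that tracks its own gradient through + and *."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._children = _children
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad           # d(a+b)/da = 1
            other.grad += out.grad          # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in topological order so each gradient is complete
        # before it is propagated further back.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a            # loss = a*b + a
loss.backward()
print(a.grad, b.grad)       # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
```

Training a network is then just: compute a loss, call `backward()`, and nudge every parameter against its gradient.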
We implement a bigram character-level language model, which we will further complexify in follow-up videos into a modern Transformer language model, like GPT. In this video, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently evaluating neural networks, and (2) the overall framework of language modeling, which includes model training, sampling, and the evaluation of a loss (e.g. the negative log likelihood for classification).
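The bigram idea can be sketched without torch at all: count how often each character follows each other character, turn the counts into probabilities, and score the model with the average negative log likelihood (a hedged plain-Python sketch; the toy word list here is illustrative, not the video’s dataset):

```python
import math
from collections import Counter

words = ["emma", "olivia", "ava"]          # toy stand-in for a names dataset

# Count every adjacent character pair; '.' marks the start/end of a word.
pairs = Counter()
for w in words:
    chs = "." + w + "."
    for c1, c2 in zip(chs, chs[1:]):
        pairs[(c1, c2)] += 1

totals = Counter()
for (c1, _), n in pairs.items():
    totals[c1] += n

def prob(c1, c2):
    """P(next char = c2 | current char = c1), from the counts."""
    return pairs[(c1, c2)] / totals[c1]

# Average negative log likelihood over the data: lower is better,
# and a model that predicted every next character perfectly would score 0.
nll, count = 0.0, 0
for w in words:
    chs = "." + w + "."
    for c1, c2 in zip(chs, chs[1:]):
        nll += -math.log(prob(c1, c2))
        count += 1
print(nll / count)
```

The video’s version stores the same counts in a torch.Tensor so the probability table and the loss can be computed in a few vectorized operations.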
We implement a multilayer perceptron (MLP) character-level language model. In this video we also introduce many basics of machine learning (e.g. model training, learning rate tuning, hyperparameters, evaluation, train/dev/test splits, under/overfitting, etc.).
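One of those basics, the train/dev/test split, is easy to show concretely (an illustrative sketch; the 80/10/10 proportions are a common convention, not the video’s exact numbers): fit on train, tune hyperparameters on dev, and touch test only once at the very end.

```python
import random

data = list(range(1000))        # stand-in for a list of training examples
random.seed(42)                 # fixed seed so the split is reproducible
random.shuffle(data)            # shuffle before splitting to avoid ordering bias

n1 = int(0.8 * len(data))       # 80% train
n2 = int(0.9 * len(data))       # next 10% dev, last 10% test
train, dev, test = data[:n1], data[n1:n2], data[n2:]
print(len(train), len(dev), len(test))
```

Underfitting shows up as a high loss on both splits; overfitting shows up as a train loss far below the dev loss.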
We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward pass activations, backward pass gradients, and some of the pitfalls when they are improperly scaled. We also look at the typical diagnostic tools and visualizations you’d want to use to understand the health of your deep network. We learn why training deep neural nets can be fragile and introduce the first modern innovation that made doing so much easier: Batch Normalization. Residual connections and the Adam optimizer remain notable todos for a later video.
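The core of Batch Normalization’s forward pass is small enough to sketch in plain Python (a hedged illustration of the idea, not the video’s torch code): standardize each batch of activations to zero mean and unit variance, then let a learnable gain and bias restore whatever scale the network prefers.

```python
import math

def batchnorm(xs, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize a batch of scalars, then scale and shift."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    # eps guards against dividing by zero when the batch has no variance
    return [gain * (x - mean) / math.sqrt(var + eps) + bias for x in xs]

batch = [2.0, 4.0, 6.0, 8.0]
out = batchnorm(batch)
print(out)   # roughly zero mean, unit variance
```

Keeping activations in this well-scaled regime is precisely what makes the deep network less fragile to initialization and learning-rate choices.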
We take the 2-layer MLP (with BatchNorm) from the previous video and backpropagate through it manually without using PyTorch autograd’s loss.backward(): through the cross entropy loss, 2nd linear layer, tanh, batchnorm, 1st linear layer, and the embedding table. Along the way, we get a strong intuitive understanding about how gradients flow backwards through the compute graph and on the level of efficient Tensors, not just individual scalars like in micrograd. This helps build competence and intuition around how neural nets are optimized and sets you up to more confidently innovate on and debug modern neural networks.
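The most satisfying result of that manual backward pass is the cross-entropy gradient: for logits z and correct class y, d(loss)/dz is simply softmax(z) minus the one-hot vector for y. A plain-Python sketch can verify the formula against a numerical finite-difference gradient (illustrative values; the video derives this for torch tensors):

```python
import math

def softmax(logits):
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])

logits, target = [1.0, 2.0, 0.5], 1

# Analytic gradient: softmax(z) - onehot(target)
probs = softmax(logits)
analytic = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Numerical gradient by finite differences, to check the formula
h = 1e-6
numerical = []
for i in range(len(logits)):
    bumped = logits[:]
    bumped[i] += h
    numerical.append((cross_entropy(bumped, target) - cross_entropy(logits, target)) / h)

print(analytic)
print(numerical)
```

When the two gradients agree, you know the hand-derived backward pass matches what loss.backward() would have computed.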
We take the 2-layer MLP from the previous video and make it deeper with a tree-like structure, arriving at a convolutional neural network architecture similar to the WaveNet (2016) from DeepMind. In the WaveNet paper, the same hierarchical architecture is implemented more efficiently using causal dilated convolutions (not yet covered). Along the way we get a better sense of torch.nn and what it is and how it works under the hood, and what a typical deep learning development process looks like (a lot of reading of documentation, keeping track of multidimensional tensor shapes, moving between jupyter notebooks and repository code, …).
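The tree-like structure can be sketched shape-first (a hedged illustration of the fusion pattern, not the network itself): instead of crushing all context characters into one layer at once, neighbours are merged in pairs, level by level. Here “merging” is just tuple-pairing so the hierarchy’s shape is visible; in the real model each merge is a learned linear layer plus nonlinearity.

```python
def fuse_pairs(seq):
    """Combine consecutive pairs: 8 items -> 4 -> 2 -> 1."""
    return [(seq[i], seq[i + 1]) for i in range(0, len(seq), 2)]

context = list("abcdefgh")        # 8 context characters
level = context
while len(level) > 1:
    level = fuse_pairs(level)
    print(len(level), level)
```

Three levels of pairwise fusion turn 8 context characters into a single representation, which is exactly the binary-tree pattern WaveNet’s dilated convolutions compute efficiently.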
We build a Generatively Pretrained Transformer (GPT), following the paper “Attention Is All You Need” and OpenAI’s GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!). I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and basics of tensors and PyTorch nn, which we take for granted in this video.
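The mechanism at the center of that final build, causal self-attention, can be sketched for a single head in plain Python (a hedged illustration: a real GPT uses learned query/key/value projections, multiple heads, and torch tensors, all omitted here). Each position takes a weighted average of itself and earlier positions, never the future:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_self_attention(x):
    """x: list of token vectors. Position t attends to positions 0..t only."""
    d = len(x[0])
    out = []
    for t in range(len(x)):
        # Scaled dot-product scores against self and all earlier positions;
        # stopping at t+1 is the causal mask.
        scores = [sum(qi * ki for qi, ki in zip(x[t], x[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)             # weights sum to 1 per position
        out.append([sum(w * x[s][j] for s, w in enumerate(weights))
                    for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = causal_self_attention(tokens)
print(result[0])   # position 0 can only see itself, so it is unchanged
```

Stacking this attention step with MLP blocks, residual connections, and layer normalization is essentially the whole GPT architecture.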
Neural Networks: Zero-To-Hero lecture notes and exercises