Explains advancements in large language models (LLMs). Covers scaling laws - the relationships among model size, data size, and compute - and how emergent abilities such as in-context learning, multi-step reasoning, and instruction following arise once certain scaling thresholds are crossed. Describes the evolution of the transformer architecture with Mixture of Experts (MoE), the three-phase training process (pretraining, supervised fine-tuning, and Reinforcement Learning from Human Feedback (RLHF) for alignment), and advanced reasoning techniques such as chain-of-thought prompting, which significantly improve performance on complex tasks.
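
For concreteness, one widely cited form of these scaling laws is the Chinchilla parameterization (Hoffmann et al., 2022), which models pretraining loss as a function of parameter count N and training-token count D; the fitted constants below are the paper's reported values, quoted roughly:

```latex
% Chinchilla-style scaling law: loss as a function of
% model parameters N and training tokens D.
\[
  L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
% Fitted constants (approximate, from Hoffmann et al., 2022):
% E \approx 1.69, \quad A \approx 406.4, \quad B \approx 410.7,
% \quad \alpha \approx 0.34, \quad \beta \approx 0.28.
% Minimizing L under a fixed compute budget C \approx 6ND implies
% N and D should grow in roughly equal proportion - on the order
% of 20 training tokens per parameter.
\]
```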
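To make the MoE idea concrete, here is a minimal NumPy sketch of top-k expert routing. The expert count, dimensions, and random weights are illustrative stand-ins; real MoE layers train the router jointly with the experts and add load-balancing losses.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 8, 4, 2  # illustrative sizes, not from any real model

# Each "expert" is a small feed-forward weight matrix; the router scores
# which experts should process a given token. Both are learned in practice.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router                        # score every expert for this token
    top = np.argsort(logits)[-k:]              # keep only the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts
    # Only k of n_experts actually run, so per-token compute stays small
    # even as total parameter count grows - the core appeal of MoE.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # (8,)
```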
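And a minimal sketch of chain-of-thought prompting, contrasting a direct prompt with a few-shot prompt that demonstrates step-by-step reasoning. Only the prompt construction matters here; either string can be sent to any completion API. The exemplar problems are the classic ones from Wei et al. (2022).

```python
# Direct prompting: ask for the answer with no worked example.
direct_prompt = (
    "Q: A cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)

# Chain-of-thought prompting: a few-shot exemplar demonstrates step-by-step
# reasoning, which the model then imitates on the new question. In
# sufficiently large models this markedly improves multi-step accuracy.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: A cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"  # model is expected to reason: 23 - 20 = 3; 3 + 6 = 9; answer 9
)

print(cot_prompt)
```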
