
LLM Foundations Part 1: A Guide to Modern Language Models


Introduction

Welcome to a new series exploring the foundations of the technology that has captured the world’s imagination: Large Language Models (LLMs). We’ve previously covered the building blocks of classical machine learning and even the core concepts of neural networks. Now, we dive into the massive, sophisticated models like GPT, LLaMA, and BERT that power modern AI.

This first post will demystify the models themselves. We’ll look at the architecture that makes them possible, explore the different “families” of LLMs, and understand how they are trained. The complex code you’ve seen, like the LlamaModel class, is a real-world implementation of the very concepts we’ll discuss, from handling vocabularies and positional embeddings to advanced techniques like Mixture-of-Experts.

Our journey will cover:

  1. A high-level overview of what an LLM is.
  2. A recap of the Transformer architecture that underpins them all.
  3. A look at the different model families: Encoder-Only (BERT), Decoder-Only (GPT, LLaMA), and Encoder-Decoder (T5).
  4. The training lifecycle: Pre-training, Fine-tuning, and Instruction Tuning.

1. What Are LLMs?

At its core, a Large Language Model is a massive neural network, with anywhere from a few billion to hundreds of billions of parameters, trained on an enormous corpus of text and code. Its fundamental objective during this initial training is surprisingly simple: predict the next word (or “token”) in a sequence.

For example, given the input “The cat sat on the…”, the model is trained to predict the word “mat”. By doing this billions of times across trillions of words, the model develops a deep, nuanced statistical understanding of language. This simple goal leads to incredible emergent abilities, allowing LLMs to perform tasks they were never explicitly trained on, such as translation, summarization, question answering, and even reasoning.
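
To make the objective concrete, here is a minimal sketch of next-token prediction as a classification problem, written in PyTorch. The tiny vocabulary and the logit values are invented for the illustration; a real model computes its scores from billions of learned parameters.

```python
import torch
import torch.nn.functional as F

# Hypothetical five-word vocabulary for the example "The cat sat on the mat".
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

# After reading "The cat sat on the", the model emits one score (logit) per vocabulary entry.
# These numbers are made up; a real model derives them from its learned parameters.
logits = torch.tensor([[1.0, 0.2, 0.1, 0.3, 4.5]])  # highest score on "mat"

# The training target is the token that actually came next.
target = torch.tensor([vocab["mat"]])

# Pre-training minimizes this cross-entropy loss at every position of every sequence.
loss = F.cross_entropy(logits, target)
print(f"{loss.item():.4f}")  # small, because the model already favors "mat"
```

A softmax over the logits turns the scores into a probability distribution over the vocabulary; the loss is simply the negative log-probability assigned to the correct next token.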

2. The Core Architecture: A Transformer Recap

As we discussed in a previous post, the Transformer architecture is the engine inside every modern LLM. Its key innovation is the Self-Attention Mechanism, which allows a model to weigh the importance of different words in the input sequence when processing any given word.

For example, in the sentence “The robot picked up the ball because it was heavy,” self-attention helps the model understand that “it” refers to the “ball,” not the “robot.”

The Transformer is composed of two main parts:

  • The Encoder: Reads and understands the entire input sequence.
  • The Decoder: Generates the output sequence, one token at a time.

The LlamaModel code you’ve seen gives us a peek into the real-world complexity. When it processes q_proj and k_proj weights, it is setting up the Query and Key matrices, which are fundamental components used to calculate attention scores.
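
To make the Query/Key/Value idea concrete, here is a minimal single-head self-attention sketch in PyTorch. The toy dimensions and the single head are simplifications; real LLaMA-style models use many heads per layer plus extras such as causal masking and RoPE.

```python
import math
import torch
import torch.nn as nn

d_model = 8  # toy embedding size; production models use thousands of dimensions

# Learned projections, corresponding to the q_proj, k_proj, and v_proj weights in a checkpoint.
q_proj = nn.Linear(d_model, d_model, bias=False)
k_proj = nn.Linear(d_model, d_model, bias=False)
v_proj = nn.Linear(d_model, d_model, bias=False)

def self_attention(x):
    """x: (seq_len, d_model) token embeddings for one sequence."""
    Q, K, V = q_proj(x), k_proj(x), v_proj(x)
    # Compare every token's query against every token's key.
    scores = Q @ K.T / math.sqrt(d_model)
    # Softmax turns the scores into attention weights that sum to 1 for each token.
    weights = torch.softmax(scores, dim=-1)
    # Each output vector is a weighted mix of the value vectors.
    return weights @ V

x = torch.randn(6, d_model)  # six toy token embeddings
print(self_attention(x).shape)  # torch.Size([6, 8])
```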

3. The LLM Family Tree: GPT, BERT, and T5

Not all LLMs use the full Transformer. They can be categorized into three main families based on which parts of the architecture they use.

a. Decoder-Only Models (e.g., GPT, LLaMA, PaLM)

This is the most popular architecture for generative AI assistants. These models use only the Decoder stack of the Transformer.

  • How they work: They are auto-regressive: each new token is predicted from all of the tokens that came before it, which makes them natural text generators.
  • Strengths: Excellent at creative writing, chatbots, summarization, and any task that requires generating new text.
  • The LlamaModel Code: The code you have is a perfect example of a decoder-only model. It includes advanced features seen in modern LLaMA models:
    • Vocabulary & Tokenizer: The set_vocab method shows the complexity of handling different tokenizers (SentencePiece, GPT-2 BPE) and special tokens. Models don’t see words; they see numerical tokens.
    • Rotary Positional Embeddings (RoPE): The generate_extra_tensors method calculates RoPE factors. RoPE encodes word order by rotating the query and key vectors as a function of position, a more flexible scheme than the original Transformer’s fixed sinusoidal positional encodings.
    • Mixture-of-Experts (MoE): The logic for block_sparse_moe points to an MoE architecture (like in Mixtral). In an MoE model, there are multiple “expert” sub-networks, and a gating mechanism routes each token to the most relevant experts. This allows models to have a huge number of parameters but be computationally efficient during inference, as only a fraction of the experts are used for any given token.
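
To make the routing idea concrete, here is a hypothetical top-2 routing layer in PyTorch. The sizes are toy values; for reference, Mixtral uses eight experts with two active per token, and production implementations batch the routing far more efficiently than this loop.

```python
import torch
import torch.nn as nn

d_model, n_experts, top_k = 16, 4, 2  # toy sizes for illustration

# Each "expert" is a small feed-forward network; only top_k of them run per token.
experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, d_model))
     for _ in range(n_experts)]
)
gate = nn.Linear(d_model, n_experts)  # the router that scores experts for each token

def moe_layer(x):
    """x: (n_tokens, d_model). Send each token to its top_k experts and mix the results."""
    weights, chosen = torch.topk(gate(x), top_k, dim=-1)   # best experts per token
    weights = torch.softmax(weights, dim=-1)               # renormalize over the chosen experts
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(n_experts):
            mask = chosen[:, slot] == e                    # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * experts[e](x[mask])
    return out

tokens = torch.randn(5, d_model)
print(moe_layer(tokens).shape)  # torch.Size([5, 16]); only 2 of the 4 experts ran per token
```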

b. Encoder-Only Models (e.g., BERT, RoBERTa)

These models use only the Encoder stack of the Transformer.

  • How they work: They are designed to build a deep, bidirectional understanding of a piece of text. During training, some words in a sentence are masked (hidden), and the model’s job is to predict those masked words using context from both the left and the right (you can try this yourself with the short fill-mask sketch after this list).
  • Strengths: Excellent for analytical tasks that require a deep understanding of the entire text, such as sentiment analysis, text classification, and named entity recognition. They are not designed for generating text.
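
If you have the Hugging Face transformers library installed, you can see the masked-word objective in action with the fill-mask pipeline; the model name and example sentence are just one possible choice, and the weights download on first use.

```python
from transformers import pipeline

# Wrap a pre-trained BERT checkpoint in a fill-mask pipeline.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses context on both sides of [MASK] to rank candidate words for the blank.
for prediction in unmasker("The cat sat on the [MASK]."):
    print(f'{prediction["token_str"]:>10}  {prediction["score"]:.3f}')
```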

c. Encoder-Decoder Models (e.g., T5, BART)

These models use the full, original Transformer architecture.

  • How they work: They frame every NLP problem as a “text-to-text” task. For example, to classify a sentence, you might prompt the model with "classify sentiment: This movie is wonderful." and train it to output the text "positive" (a runnable sketch follows this list).
  • Strengths: Very versatile and can perform a wide range of tasks, from translation and summarization to classification, by changing the input prefix.
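
As a runnable illustration of the text-to-text framing, the sketch below uses the public t5-small checkpoint with one of T5's original task prefixes (translation); the sentiment prefix in the list above is illustrative rather than one this particular checkpoint was trained on.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# t5-small is a small public checkpoint; weights download on first use.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The task is specified entirely by the text prefix, and the answer comes back as text.
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```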

4. The Training Lifecycle of an LLM

Creating a powerful LLM involves several stages of training.

a. Pre-training

This is the initial, massively expensive phase. The model is trained on a colossal, diverse dataset (e.g., a large portion of the public internet) for its primary objective (e.g., next-token prediction for a decoder-only model). This is where the model learns grammar, facts about the world, reasoning abilities, and its general understanding of language. This phase can take months and cost millions of dollars in cloud computing.

b. Fine-tuning

Once a model is pre-trained, it can be adapted for specific tasks or domains. Fine-tuning involves taking the pre-trained model and continuing to train it on a much smaller, high-quality, domain-specific dataset. For example, you could fine-tune a general model on a dataset of medical research papers to create a medical expert model. This is far cheaper than pre-training from scratch.
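
As a rough sketch of what "continuing to train" means in code, here is a bare-bones causal-LM fine-tuning loop in PyTorch with the transformers library. The checkpoint, texts, and hyperparameters are placeholders; a real run would stream a tokenized domain corpus in batches and usually add evaluation, checkpointing, and parameter-efficient methods such as LoRA.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# A small public checkpoint stands in for "the pre-trained model"; substitute your own.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Placeholder domain-specific texts standing in for, say, medical research papers.
domain_texts = [
    "Aspirin inhibits cyclooxygenase enzymes, reducing prostaglandin synthesis.",
    "The trial enrolled 120 patients with stage 2 hypertension.",
]

model.train()
for epoch in range(3):  # a few passes over the tiny dataset
    for text in domain_texts:
        batch = tokenizer(text, return_tensors="pt")
        # With labels equal to the input IDs, the model computes the next-token loss itself.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss {outputs.loss.item():.3f}")
```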

c. Instruction Tuning and Alignment

To make a base model useful as an AI assistant, it needs to learn how to follow human instructions. Instruction tuning is a form of fine-tuning where the model is trained on a dataset of (instruction, response) pairs.
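
In practice, much of instruction tuning comes down to rendering the (instruction, response) pairs into a single text that the model learns to complete with the usual next-token objective. The template below is a hypothetical example; every project defines its own prompt format.

```python
# Hypothetical (instruction, response) pairs; real datasets contain many thousands of them.
examples = [
    {"instruction": "Summarize: The meeting covered Q3 revenue and the hiring plan.",
     "response": "The meeting was about Q3 revenue and hiring."},
    {"instruction": "Translate to French: Good morning.",
     "response": "Bonjour."},
]

# One possible template; the loss is often computed only on the response portion.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

training_texts = [TEMPLATE.format(**ex) for ex in examples]
print(training_texts[0])
```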

Furthermore, to ensure the model is helpful, harmless, and honest, it undergoes an alignment process. The most common method is Reinforcement Learning from Human Feedback (RLHF). In RLHF, human reviewers rank different model responses to the same prompt. A separate “reward model” is then trained to predict which responses humans would prefer. Finally, the LLM is fine-tuned using reinforcement learning to maximize the score it gets from this reward model, effectively aligning its behavior with human preferences.
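
The core of the reward-model step is a pairwise preference loss: the reward for the response the reviewers preferred should come out higher than the reward for the one they rejected. Here is a minimal sketch of that loss with made-up reward scores; a real reward model produces these scalars from the full prompt-and-response text.

```python
import torch
import torch.nn.functional as F

# Hypothetical scalar rewards assigned to two responses to the same prompt.
reward_chosen = torch.tensor([1.7])    # the response human reviewers preferred
reward_rejected = torch.tensor([0.4])  # the response they ranked lower

# Pairwise (Bradley-Terry style) loss: push the chosen reward above the rejected one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(f"{loss.item():.4f}")
```

In the original RLHF recipe, the final reinforcement-learning step uses Proximal Policy Optimization (PPO), with a penalty that keeps the tuned model from drifting too far from the pre-trained one.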

Conclusion

Modern Large Language Models are the culmination of decades of research, built upon the elegant and powerful Transformer architecture. They come in different flavors—decoder-only for generation, encoder-only for analysis, and full encoder-decoder for text-to-text tasks. Understanding their training lifecycle, from the brute-force pre-training to the nuanced alignment of instruction tuning, is key to appreciating how these incredible tools are created and how they continue to evolve.

In the next part of this series, we’ll explore how to effectively communicate with these models through the art of Prompt Engineering.

Suggested Reading

  • “Attention Is All You Need” (Vaswani et al., 2017): The paper that started it all.
  • The Illustrated Transformer by Jay Alammar: The best visual explanation of the Transformer architecture on the internet.
  • Hugging Face Blog: A fantastic resource for deep dives into specific models and concepts.