Simple Transformer Model Tutorial

A Comprehensive Tutorial on Building a Simple Transformer Model with Hugging Face and PyTorch (GPU Support)

Introduction

Welcome to this in-depth tutorial on building a simple Transformer model using Python, Hugging Face, and PyTorch with GPU support. Transformers, introduced in the seminal paper “Attention is All You Need” by Vaswani et al. (2017), have revolutionized natural language processing (NLP) and are the backbone of modern large language models (LLMs) like BERT, GPT, and Llama. This tutorial aims to demystify the Transformer architecture for beginners, walking you through its components, implementation, and training with a simple example. We will use the Hugging Face transformers library and PyTorch, leveraging GPU acceleration for efficient computation.

This tutorial is intentionally detailed, providing thorough explanations of each component, step-by-step code, and descriptions of the accompanying figures (available for download). It prepares you for a follow-up episode where we will extend this knowledge to build and fine-tune a full-fledged LLM. By the end, you’ll have a working Transformer model, understand its architecture, and be ready to explore advanced topics in the next episode.

Prerequisites

Before diving in, ensure you have the following:

  • Python Knowledge: Familiarity with Python programming, including functions, classes, and basic data structures.
  • PyTorch Basics: Understanding of PyTorch tensors, neural network modules, and training loops.
  • Hugging Face: Basic familiarity with the Hugging Face transformers library (we’ll guide you through its usage).
  • Hardware: A system with a CUDA-compatible GPU for accelerated training. If you don’t have a GPU, the code will fall back to CPU, but GPU is recommended for performance.
  • Environment Setup:
    • Python 3.8+
    • Install required libraries:
      pip install torch transformers datasets numpy matplotlib
      
    • A CUDA-enabled GPU with the appropriate PyTorch version installed (e.g., torch>=2.0.0+cu121 for CUDA 12.1). Check your setup with:
      import torch
      print(torch.cuda.is_available())  # Should print True if GPU is available
      

What is a Transformer?

The Transformer is a deep learning model architecture that relies entirely on the attention mechanism, eliminating the need for recurrent neural networks (RNNs) like LSTMs. It excels in sequence-to-sequence tasks, such as machine translation, text generation, and more. The key innovation is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence when processing each word, capturing long-range dependencies efficiently.

Key Components of a Transformer

A Transformer consists of an Encoder and a Decoder, each composed of multiple layers (or blocks). Here’s a breakdown of the main components:

  1. Input Embedding: Converts input tokens (e.g., words or subwords) into dense vectors.
  2. Positional Encoding: Adds information about the position of each token in the sequence, as Transformers lack inherent sequential order.
  3. Multi-Head Self-Attention: Computes attention scores to focus on relevant parts of the input sequence.
  4. Feed-Forward Neural Network: Applies a position-wise fully connected layer to each token.
  5. Layer Normalization and Residual Connections: Stabilizes training and improves gradient flow.
  6. Decoder: Generates output sequences, using masked self-attention to prevent attending to future tokens.
  7. Output Layer: Maps the decoder’s output to a probability distribution over the vocabulary.

Architecture Diagram

To visualize the Transformer, refer to the architecture diagram (described below for download). The diagram shows the encoder and decoder stacks, with arrows indicating data flow through self-attention, feed-forward layers, and residual connections.

Figure 1: Transformer Architecture
Description: A diagram illustrating the Transformer model with an encoder (left) and decoder (right). The encoder processes the input sequence, while the decoder generates the output. Key components include input embeddings, positional encodings, multi-head attention, feed-forward networks, and layer normalization. Arrows show residual connections and the flow from encoder to decoder via cross-attention.
File: transformer_architecture.png (Download separately)

Step-by-Step Implementation

We’ll implement a simple Transformer model for a sequence-to-sequence task: translating English numbers (e.g., “one”) to their numerical form (e.g., “1”). This toy example simplifies tokenization and focuses on the model architecture. We’ll use Hugging Face for tokenization and PyTorch for the model and training.

Step 1: Environment Setup

First, let’s set up the environment and check for GPU availability.

import torch
import torch.nn as nn
import torch.optim as optim
from transformers import AutoTokenizer
import numpy as np
import matplotlib.pyplot as plt
from datasets import Dataset

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

This code imports necessary libraries and ensures the model runs on GPU if available, falling back to CPU otherwise.

Step 2: Prepare a Toy Dataset

For simplicity, we’ll create a dataset mapping English number words to their numerical equivalents (e.g., “one” → “1”). We’ll use Hugging Face’s Dataset class for data handling.

# Create a toy dataset
data = {
    "input": ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"],
    "output": ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]
}
dataset = Dataset.from_dict(data)

# Initialize tokenizer (using a pre-trained tokenizer for simplicity)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the dataset
def tokenize_function(examples):
    input_enc = tokenizer(examples["input"], padding="max_length", truncation=True, max_length=10)
    output_enc = tokenizer(examples["output"], padding="max_length", truncation=True, max_length=10)
    return {
        "input_ids": input_enc["input_ids"],
        "attention_mask": input_enc["attention_mask"],
        "labels": output_enc["input_ids"]
    }

tokenized_dataset = dataset.map(tokenize_function, batched=True)

Here, we:

  • Create a small dataset with 10 examples.
  • Use the BERT tokenizer for simplicity (though a custom tokenizer could be used for this toy task).
  • Tokenize inputs and outputs, padding sequences to a fixed length (10 tokens) and generating attention masks.
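
If you’d like to see what the tokenizer produces, the optional snippet below (not required for the rest of the tutorial) prints the token IDs, attention mask, and token strings for the first example.

# Optional: inspect how the BERT tokenizer encodes the first example
sample = tokenized_dataset[0]
print(sample["input_ids"])       # token IDs for "one", padded to length 10
print(sample["attention_mask"])  # 1s for real tokens ([CLS], one, [SEP]), 0s for padding
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))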

Step 3: Define the Transformer Model

We’ll implement a simple Transformer model using PyTorch. This model includes an encoder and decoder, with multi-head self-attention and feed-forward layers.

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2, dim_feedforward=128, max_seq_length=10):
        super(SimpleTransformer, self).__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, max_seq_length)
        
        # Encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        # Decoder
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        
        # Output layer
        self.fc_out = nn.Linear(d_model, vocab_size)
        
    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # Embed and add positional encoding
        src = self.embedding(src) * torch.sqrt(torch.tensor(self.d_model, dtype=torch.float32))
        src = self.pos_encoder(src)
        tgt = self.embedding(tgt) * torch.sqrt(torch.tensor(self.d_model, dtype=torch.float32))
        tgt = self.pos_encoder(tgt)
        
        # Encoder forward
        memory = self.encoder(src, mask=src_mask)
        
        # Decoder forward
        output = self.decoder(tgt, memory, tgt_mask=tgt_mask)
        
        # Output projection
        output = self.fc_out(output)
        return output

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        return x + self.pe[:, :x.size(1), :]

# Initialize model
vocab_size = tokenizer.vocab_size
model = SimpleTransformer(vocab_size=vocab_size).to(device)

Explanation:

  • Embedding Layer: Maps token IDs to dense vectors of size d_model (64 in this case).
  • Positional Encoding: Adds sine and cosine functions to encode token positions, as described in Vaswani et al. (2017).
  • TransformerEncoder/Decoder: Uses PyTorch’s built-in TransformerEncoderLayer and TransformerDecoderLayer for multi-head attention and feed-forward layers.
  • Output Layer: Projects the decoder output to the vocabulary size for token prediction.
  • The model is moved to the GPU (or CPU) using .to(device).
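
Before training, it helps to sanity-check the model with a dummy forward pass. The optional snippet below feeds random token IDs through the untrained model and confirms that the output shape is (batch size, sequence length, vocabulary size).

# Optional sanity check: random token IDs in, logits over the vocabulary out
dummy_src = torch.randint(0, vocab_size, (2, 10), device=device)
dummy_tgt = torch.randint(0, vocab_size, (2, 10), device=device)
with torch.no_grad():
    print(model(dummy_src, dummy_tgt).shape)  # torch.Size([2, 10, 30522]) for bert-base-uncased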

Step 4: Training the Model

We’ll train the model with teacher forcing: the decoder receives the label sequence shifted right by one token, a causal mask hides future positions, and a cross-entropy loss with the Adam optimizer drives learning. We’ll also generate a training loss plot (described for download).

# Training setup
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 50
batch_size = 2

# Convert dataset to PyTorch tensors
def format_batch(batch):
    return {
        "input_ids": torch.tensor(batch["input_ids"], dtype=torch.long).to(device),
        "attention_mask": torch.tensor(batch["attention_mask"], dtype=torch.long).to(device),
        "labels": torch.tensor(batch["labels"], dtype=torch.long).to(device)
    }

# Training loop
losses = []
model.train()
for epoch in range(num_epochs):
    epoch_loss = 0
    for i in range(0, len(tokenized_dataset), batch_size):
        batch = tokenized_dataset[i:i+batch_size]
        batch = format_batch(batch)

        # Teacher forcing: the decoder input is the label sequence shifted right by one
        # token, and the model learns to predict the next label token at each position
        decoder_input = batch["labels"][:, :-1]
        targets = batch["labels"][:, 1:]

        # Causal mask so the decoder cannot attend to future positions
        seq_len = decoder_input.size(1)
        tgt_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"), device=device), diagonal=1)

        # Forward pass
        outputs = model(batch["input_ids"], decoder_input, tgt_mask=tgt_mask)
        loss = criterion(outputs.reshape(-1, vocab_size), targets.reshape(-1))

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    avg_loss = epoch_loss / (len(tokenized_dataset) // batch_size)
    losses.append(avg_loss)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

# Plot training loss
plt.figure(figsize=(10, 6))
plt.plot(range(1, num_epochs+1), losses, marker='o')
plt.title("Training Loss Over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.grid(True)
plt.savefig("training_loss.png")

Figure 2: Training Loss Plot
Description: A line plot showing the training loss over 50 epochs. The x-axis represents the epoch number, and the y-axis shows the average loss per epoch. The plot includes markers at each epoch and a grid for readability.
File: training_loss.png (Download separately)

Step 5: Inference

Let’s test the model by translating a number word to its numerical form.

def translate_number(model, tokenizer, text, max_length=10):
    model.eval()
    with torch.no_grad():
        # Tokenize input (padded to the same length used in training)
        inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=max_length)
        input_ids = inputs["input_ids"].to(device)

        # Greedy decoding: start from [CLS] (BERT has no dedicated BOS token)
        output_ids = torch.full((1, 1), tokenizer.cls_token_id, dtype=torch.long, device=device)
        for _ in range(max_length - 1):
            seq_len = output_ids.size(1)
            tgt_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"), device=device), diagonal=1)
            outputs = model(input_ids, output_ids, tgt_mask=tgt_mask)
            next_token = outputs[:, -1, :].argmax(dim=-1, keepdim=True)
            output_ids = torch.cat([output_ids, next_token], dim=1)
            # Stop at [SEP], which plays the role of the end-of-sequence token
            if next_token.item() == tokenizer.sep_token_id:
                break

        # Decode output
        output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return output_text

# Test the model
test_input = "five"
result = translate_number(model, tokenizer, test_input)
print(f"Input: {test_input}, Output: {result}")

This code greedily decodes one token at a time, starting from the [CLS] token and stopping when the model emits [SEP]. For the input “five”, the model should output “5” once training has converged.

Understanding the Transformer Architecture

Let’s dive deeper into the Transformer’s components, as they form the foundation for LLMs.

Multi-Head Self-Attention

The self-attention mechanism computes a weighted sum of input embeddings, where weights are determined by the similarity between tokens. Multi-head attention splits the input into multiple subspaces, allowing the model to focus on different aspects of the sequence simultaneously.

Formula: For a single attention head:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

  • $Q$: Query matrix
  • $K$: Key matrix
  • $V$: Value matrix
  • $d_k$: Dimension of keys/queries

Multi-head attention concatenates the outputs of multiple attention heads and projects them with a learned matrix $W^O$:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^O$$
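
To make the formula concrete, here is a minimal sketch of scaled dot-product attention for a single head. It is illustrative only; the tutorial model relies on the multi-head attention built into PyTorch’s TransformerEncoderLayer and TransformerDecoderLayer.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # scores: similarity of each query with every key, scaled by sqrt(d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # each row of weights sums to 1
    return weights @ V                              # weighted sum of the values

# Example: batch of 1, sequence of 5 tokens, head dimension 16
Q = K = V = torch.randn(1, 5, 16)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 5, 16])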

Positional Encoding

Since Transformers process tokens in parallel, they need positional encodings to capture word order. We used sine and cosine functions, as proposed by Vaswani et al. (2017):

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

Feed-Forward Networks and Residual Connections

Each Transformer layer includes a position-wise feed-forward network (FFN):

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Residual connections add the input to the output of each sub-layer, followed by layer normalization:

$$\text{LayerNorm}(x + \text{Sublayer}(x))$$
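
As an illustration of the post-norm pattern above, here is a minimal sketch of a feed-forward sub-layer wrapped in a residual connection and layer normalization. It is not part of the tutorial model, which uses PyTorch’s built-in encoder and decoder layers.

import torch
import torch.nn as nn

class FeedForwardSubLayer(nn.Module):
    def __init__(self, d_model=64, dim_feedforward=128):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),  # x W_1 + b_1
            nn.ReLU(),                            # max(0, ·)
            nn.Linear(dim_feedforward, d_model),  # · W_2 + b_2
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # LayerNorm(x + Sublayer(x)): the post-norm arrangement from Vaswani et al. (2017)
        return self.norm(x + self.ffn(x))

# Example: a batch of 2 sequences, 10 tokens each, d_model = 64
x = torch.randn(2, 10, 64)
print(FeedForwardSubLayer()(x).shape)  # torch.Size([2, 10, 64])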

Preparing for the Next Episode: LLMs

This tutorial provides a foundation for understanding Transformers, which are the building blocks of LLMs. In the next episode, we’ll:

  • Scale up the model with more layers and parameters.
  • Use a larger dataset (e.g., from Hugging Face’s datasets library).
  • Fine-tune a pre-trained LLM like GPT-2 or Llama for specific tasks.
  • Explore advanced techniques like LoRA (Low-Rank Adaptation) for efficient fine-tuning.

Tips for Scaling to LLMs

  1. Use Pre-trained Models: Hugging Face provides pre-trained models like distilbert or gpt2 that can be fine-tuned for specific tasks.
  2. Efficient Training: Use mixed precision training (torch.cuda.amp) to reduce memory usage on GPUs; see the sketch after this list.
  3. Data Quality: LLMs require large, diverse datasets. Consider datasets like wikitext or bookcorpus.
  4. Hyperparameter Tuning: Experiment with learning rates, batch sizes, and model sizes.
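
To illustrate tip 2, here is a hedged sketch of a single mixed-precision training step for the model from this tutorial. It assumes the model, criterion, optimizer, vocab_size, and a batch dict prepared as in Step 4, and it requires a CUDA device.

from torch.cuda.amp import autocast, GradScaler

# Sketch of one mixed-precision training step (assumes model, criterion, optimizer,
# vocab_size, and a `batch` prepared as in Step 4, running on a CUDA device)
scaler = GradScaler()

optimizer.zero_grad()
with autocast():  # run the forward pass in float16 where it is numerically safe
    decoder_input = batch["labels"][:, :-1]
    targets = batch["labels"][:, 1:]
    seq_len = decoder_input.size(1)
    tgt_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"), device=device), diagonal=1)
    outputs = model(batch["input_ids"], decoder_input, tgt_mask=tgt_mask)
    loss = criterion(outputs.reshape(-1, vocab_size), targets.reshape(-1))

scaler.scale(loss).backward()  # scale the loss to avoid float16 gradient underflow
scaler.step(optimizer)         # unscale gradients, then take the optimizer step
scaler.update()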

Visualizations

To aid understanding, download the following figures:

  1. Transformer Architecture (transformer_architecture.png): Shows the encoder-decoder structure with labeled components.
  2. Training Loss Plot (training_loss.png): Displays the loss curve over epochs, indicating training progress.

Conclusion

In this tutorial, we built a simple Transformer model using PyTorch and Hugging Face, trained it on a toy dataset, and visualized the training process. We explored the Transformer’s architecture, including self-attention, positional encoding, and feed-forward layers, with GPU support for efficient computation. This foundation prepares you for the next episode, where we’ll dive into LLMs, scaling up the model and tackling real-world NLP tasks.

Feel free to experiment with the code, adjust hyperparameters, or try different datasets. Share your version on GitHub if you like, and stay tuned for the LLM episode!

Citations and References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
  2. Hugging Face Documentation: https://huggingface.co/docs/transformers
  3. PyTorch Documentation: https://pytorch.org/docs/stable/index.html