Parameter-Efficient Fine-Tuning with LoRA

Low-rank adaptation (LoRA) is an advanced machine learning technique designed to enhance the performance of pretrained models when applied to specific, often smaller, datasets. Rather than adjusting all the parameters of a large model, LoRA focuses on a subset of parameters that can be represented in a low-rank format. This methodology involves modifying only a limited number of parameters, which makes it more computationally efficient.
The significance of LoRA lies in its ability to enable effective fine-tuning of large-scale models on task-specific data while markedly reducing the time and computational resources typically required. By strategically limiting the changes to a small, manageable set of parameters, researchers and practitioners can achieve high levels of performance on new tasks without extensive retraining. This approach not only saves time but also allows smaller teams with limited resources to leverage powerful pretrained models effectively, making it an important tool in the field of machine learning.
Imagine we are working with a large weight matrix, denoted as $W$, for a specific layer in a neural network. During the process of backpropagation, we aim to optimize the network’s performance by adjusting these weights based on the gradients computed with respect to a loss function. This adjustment is captured in a matrix called \(\Delta W\), which quantifies the necessary updates needed to minimize the loss and improve the model's accuracy during training.
In the context of regular training and fine-tuning, the updated weights can be expressed with the following equation:
\(W_{\text{updated}} = W + \Delta W\)
This formula shows that the new weights \(W_{\text{updated}}\) are obtained by adding the changes defined by \(\Delta W\) to the original weights \(W\).
To enhance the efficiency of weight updates, the low-rank adaptation (LoRA) method proposed by Hu et al. provides a compelling alternative. Instead of directly computing the full \(\Delta W\) matrix, LoRA learns a more efficient approximation by expressing it as the product of two much smaller matrices, \(A\) and \(B\). This approximation can be stated as:
\(\Delta W \approx AB \)
Thus, within the LoRA framework, the weight update can be reformulated as:
\(W_{\text{updated}} = W + AB\)
This means that, instead of materializing a potentially large \(\Delta W\), we only need to work with the smaller matrices \(A\) and \(B\): if \(W\) has shape \(d \times k\), then \(A\) is \(d \times r\) and \(B\) is \(r \times k\) for a rank \(r \ll \min(d, k)\), which significantly reduces computational overhead and memory usage.
To visually understand this process, consider a diagram that places the traditional full fine-tuning method alongside the LoRA approach, highlighting the distinct mechanisms of weight updates. This comparative illustration allows for a clearer grasp of the efficiency gains achieved with LoRA while still effectively adapting the network to new tasks or data.
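To make these savings concrete, here is a quick back-of-the-envelope parameter count. The 768×768 shape and rank of 16 below are illustrative choices (768 happens to be the hidden size of GPT-2 small, which we use later), not values fixed by the method itself:

```python
d, k, r = 768, 768, 16           # illustrative layer shape and LoRA rank
full = d * k                     # parameters in a full delta-W update
lora = d * r + r * k             # parameters in A (d x r) plus B (r x k)
print(full, lora, full // lora)  # 589824 24576 24
```

Even at this modest rank, the low-rank factorization needs 24x fewer parameters per layer than a full update.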

Let's download the SMS Spam Collection dataset and split the data into training, validation, and test sets.
import os
import urllib.request
import urllib.error
import zipfile
from pathlib import Path

import pandas as pd

def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
    if data_file_path.exists():
        print(f"{data_file_path} already exists. Skipping download and extraction.")
        return

    # Downloading the file
    with urllib.request.urlopen(url) as response:
        with open(zip_path, "wb") as out_file:
            out_file.write(response.read())

    # Unzipping the file
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extracted_path)

    # Add .tsv file extension
    original_file_path = Path(extracted_path) / "SMSSpamCollection"
    os.rename(original_file_path, data_file_path)
    print(f"File downloaded and saved as {data_file_path}")
def create_balanced_dataset(df):
    # Count the instances of "spam"
    num_spam = df[df["Label"] == "spam"].shape[0]
    # Randomly sample "ham" instances to match the number of "spam" instances
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)
    # Combine the "ham" subset with the "spam" instances
    balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]])
    return balanced_df
def random_split(df, train_frac, validation_frac):
    # Shuffle the entire DataFrame
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)
    # Calculate split indices
    train_end = int(len(df) * train_frac)
    validation_end = train_end + int(len(df) * validation_frac)
    # Split the DataFrame
    train_df = df[:train_end]
    validation_df = df[train_end:validation_end]
    test_df = df[validation_end:]
    return train_df, validation_df, test_df
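As a quick sanity check of the split arithmetic in random_split, here is a standalone sketch that uses plain row indices in place of a DataFrame (the 100-row count is made up for illustration):

```python
n = 100                                                # hypothetical number of rows
train_frac, validation_frac = 0.7, 0.1
train_end = int(n * train_frac)                        # 70
validation_end = train_end + int(n * validation_frac)  # 80
rows = list(range(n))
train = rows[:train_end]
val = rows[train_end:validation_end]
test = rows[validation_end:]
print(len(train), len(val), len(test))                 # 70 10 20
```

With fractions 0.7 and 0.1, everything beyond the validation boundary (the remaining 20%) becomes the test set.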
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extracted_path = "sms_spam_collection"
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"
try:
    download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError) as e:
    print(f"Primary URL failed: {e}. Trying backup URL...")
    url = "https://f001.backblazeb2.com/file/LLMs-from-scratch/sms%2Bspam%2Bcollection.zip"
    download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])
balanced_df = create_balanced_dataset(df)
balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})
train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)
train_df.to_csv("train.csv", index=None)
validation_df.to_csv("validation.csv", index=None)
test_df.to_csv("test.csv", index=None)
Create a PyTorch dataset
import torch
import tiktoken
from torch.utils.data import Dataset

class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)
        # Pre-tokenize texts
        self.encoded_texts = [
            tokenizer.encode(text) for text in self.data["Text"]
        ]
        if max_length is None:
            self.max_length = self._longest_encoded_length()
        else:
            self.max_length = max_length
            # Truncate sequences if they are longer than max_length
            self.encoded_texts = [
                encoded_text[:self.max_length]
                for encoded_text in self.encoded_texts
            ]
        # Pad sequences to the longest sequence
        self.encoded_texts = [
            encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))
            for encoded_text in self.encoded_texts
        ]

    def __getitem__(self, index):
        encoded = self.encoded_texts[index]
        label = self.data.iloc[index]["Label"]
        return (
            torch.tensor(encoded, dtype=torch.long),
            torch.tensor(label, dtype=torch.long)
        )

    def __len__(self):
        return len(self.data)

    def _longest_encoded_length(self):
        max_length = 0
        for encoded_text in self.encoded_texts:
            encoded_length = len(encoded_text)
            if encoded_length > max_length:
                max_length = encoded_length
        return max_length
tokenizer = tiktoken.get_encoding("gpt2")
train_dataset = SpamDataset("train.csv", max_length=None, tokenizer=tokenizer)
val_dataset = SpamDataset("validation.csv", max_length=train_dataset.max_length, tokenizer=tokenizer)
test_dataset = SpamDataset("test.csv", max_length=train_dataset.max_length, tokenizer=tokenizer)
Now we create a DataLoader for each of the datasets we created previously
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    drop_last=True,
)
val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False,
)
test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False,
)
Let's iterate through the data loaders and check that each batch contains 8 training examples, where each training example consists of 120 tokens
print("Train loader:")
for input_batch, target_batch in train_loader:
    pass

print("Input batch dimensions:", input_batch.shape)
print("Label batch dimensions", target_batch.shape)
Train loader:
Input batch dimensions: torch.Size([8, 120])
Label batch dimensions torch.Size([8])
Let's print the total number of batches in each dataset
print(f"{len(train_loader)} training batches")
print(f"{len(val_loader)} validation batches")
print(f"{len(test_loader)} test batches")
130 training batches
19 validation batches
38 test batches
We will now download and load the pre-trained model. This whole exercise was covered in the previous blog, so we leverage the same code and import it here as a package.
from gpt_download import download_and_load_gpt2
from previous_chapters import GPTModel, load_weights_into_gpt
# Alternatively:
# from llms_from_scratch.ch04 import GPTModel
# from llms_from_scratch.ch05 import load_weights_into_gpt
CHOOSE_MODEL = "gpt2-small (124M)"
INPUT_PROMPT = "Every effort moves"
BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")
model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval();
To ensure that the model was loaded correctly, let's double-check that it generates coherent text
from previous_chapters import (
    generate_text_simple,
    text_to_token_ids,
    token_ids_to_text
)
text_1 = "Every effort moves you"
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_1, tokenizer),
    max_new_tokens=15,
    context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokenizer))
Every effort moves you forward.
The first step is to understand the importance of your work
We begin by preparing the model for classification fine-tuning. This process involves modifying the existing architecture to better suit our specific task. One key step is to replace the output layer, which is responsible for generating predictions. We will typically substitute it with a new layer that matches the number of classes in our target dataset. This adjustment ensures that the model can effectively learn to categorize inputs based on the unique features of our data, allowing for improved performance on the classification task at hand.
torch.manual_seed(123)
num_classes = 2
model.out_head = torch.nn.Linear(in_features=BASE_CONFIG["emb_dim"], out_features=num_classes)
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
model.to(device);
Let’s take a closer look at calculating the initial classification accuracy of the non-finetuned model. We anticipate that this accuracy will be approximately 50%. This expectation indicates that the model is currently unable to reliably differentiate between spam and non-spam messages. As it stands, the model's performance suggests that it is making random guesses rather than utilizing any learned patterns or signals in the data to make informed predictions. This baseline accuracy will serve as a crucial point of reference for evaluating improvements once the model undergoes further training and fine-tuning.
@torch.no_grad()  # Disable gradient tracking for efficiency
def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    model.eval()
    correct_predictions, num_examples = 0, 0
    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            input_batch, target_batch = input_batch.to(device), target_batch.to(device)
            logits = model(input_batch)[:, -1, :]  # Logits of last output token
            predicted_labels = torch.argmax(logits, dim=-1)
            num_examples += predicted_labels.shape[0]
            correct_predictions += (predicted_labels == target_batch).sum().item()
        else:
            break
    return correct_predictions / num_examples
torch.manual_seed(123)
train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)
val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)
test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)
print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")
Training accuracy: 46.25%
Validation accuracy: 45.00%
Test accuracy: 48.75%
To start, we initialize a LoRALayer that creates the two matrices \(A\) and \(B\), along with the alpha scaling hyperparameter, which controls the impact of the low-rank updates, and the rank hyperparameter \(r\), which determines the inner dimensionality of the factored matrices.
Once this layer is configured, it accepts an input tensor and computes the corresponding output through the matrix products involving \(A\) and \(B\). This output reflects the transformed representation of the input based on the layer's learned parameters.

import math
class LoRALayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
        torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # similar to standard weight initialization
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x
In the above code, the hyperparameter rank controls the inner dimensions of the matrices A and B. This parameter determines the number of additional parameters introduced by LoRA and plays a crucial role in balancing model adaptability and parameter efficiency.
The second hyperparameter, alpha, is a scaling factor applied to the output of the low-rank adaptation. It controls the degree to which the output of the adapted layer can influence the original output of the layer being modified. Essentially, this serves as a way to regulate the impact of the low-rank adaptation on the layer's output.
Thus far, the LoRALayer class we implemented allows us to transform the inputs x of the layer. However, in LoRA, we typically aim to replace existing Linear layers so that weight updates are applied to the pre-trained weights.
We are introducing a specialized layer called LinearWithLoRA that builds upon the previously developed LoRALayer. This new layer is designed to be seamlessly integrated into existing neural network architectures, specifically as a substitute for traditional Linear layers. Its applications are particularly relevant in modules like the self-attention mechanism and the feedforward sublayers within large language models (LLMs). By leveraging the benefits of LoRA, the LinearWithLoRA layer enhances the model's efficiency and adaptability, allowing for improved performance while maintaining the underlying structure of the network.
class LinearWithLoRA(torch.nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)
It's important to note that when we initialize the weight matrix \(B\) (referred to as self.B in the LoRALayer), we set all of its values to zero. This initialization plays a crucial role in the functioning of the low-rank adaptation approach. Specifically, when we perform matrix multiplication between \(A\) and \(B\), the result will be a matrix filled entirely with zeros. Because of this, the operation does not impact the original weights of the model at the start of training; adding a zero matrix to the original weights leaves them unchanged.
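We can check this zero-initialization property directly with a couple of tensors — a standalone sketch that mirrors the shapes used in LoRALayer, with made-up dimensions:

```python
import torch

torch.manual_seed(123)
in_dim, rank, out_dim = 8, 2, 6  # illustrative dimensions
A = torch.randn(in_dim, rank)    # A can hold arbitrary values
B = torch.zeros(rank, out_dim)   # B starts at zero, as in LoRALayer
x = torch.randn(4, in_dim)

delta = x @ A @ B                # the LoRA branch's contribution
print(bool((delta == 0).all()))  # True: the branch adds nothing at initialization
```

Only once training updates \(B\) away from zero does the LoRA branch start to modify the layer's behavior.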
To implement LoRA within the previously defined GPT model, we create a function named replace_linear_with_lora. This function systematically goes through the model and replaces each instance of the standard Linear layers with our modified LinearWithLoRA layers. This transition allows the model to leverage the benefits of the LoRA architecture while maintaining the integrity of its pre-existing structure.
def replace_linear_with_lora(model, rank, alpha):
    for name, module in model.named_children():
        if isinstance(module, torch.nn.Linear):
            # Replace the Linear layer with LinearWithLoRA
            setattr(model, name, LinearWithLoRA(module, rank, alpha))
        else:
            # Recursively apply the same function to child modules
            replace_linear_with_lora(module, rank, alpha)
Then we freeze the original model parameters and use the replace_linear_with_lora function to replace the model's Linear layers with LinearWithLoRA layers.
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters before: {total_params:,}")
for param in model.parameters():
    param.requires_grad = False
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters after: {total_params:,}")
Total trainable parameters before: 124,441,346
Total trainable parameters after: 0
replace_linear_with_lora(model, rank=16, alpha=16)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable LoRA parameters: {total_params:,}")
Total trainable LoRA parameters: 2,666,528
As we can see, LoRA reduces the number of trainable parameters by almost 50x.
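The two counts printed above make the reduction easy to quantify:

```python
total_before = 124_441_346  # trainable parameters reported before freezing
lora_trainable = 2_666_528  # trainable LoRA parameters reported after injection
print(f"{total_before / lora_trainable:.1f}x fewer trainable parameters")  # 46.7x fewer trainable parameters
```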
Let's fine-tune the model by reusing the training function from the previous blog
import time
from previous_chapters import train_classifier_simple
# Alternatively:
# from llms_from_scratch.ch06 import train_classifier_simple
start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
num_epochs = 5
train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=50, eval_iter=5,
)
end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
Ep 1 (Step 000000): Train loss 3.820, Val loss 3.462
Ep 1 (Step 000050): Train loss 0.396, Val loss 0.364
Ep 1 (Step 000100): Train loss 0.111, Val loss 0.229
Training accuracy: 97.50% | Validation accuracy: 95.00%
Ep 2 (Step 000150): Train loss 0.135, Val loss 0.073
Ep 2 (Step 000200): Train loss 0.008, Val loss 0.052
Ep 2 (Step 000250): Train loss 0.021, Val loss 0.179
Training accuracy: 97.50% | Validation accuracy: 97.50%
Ep 3 (Step 000300): Train loss 0.096, Val loss 0.080
Ep 3 (Step 000350): Train loss 0.010, Val loss 0.116
Training accuracy: 97.50% | Validation accuracy: 95.00%
Ep 4 (Step 000400): Train loss 0.003, Val loss 0.151
Ep 4 (Step 000450): Train loss 0.008, Val loss 0.077
Ep 4 (Step 000500): Train loss 0.001, Val loss 0.147
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 5 (Step 000550): Train loss 0.007, Val loss 0.094
Ep 5 (Step 000600): Train loss 0.000, Val loss 0.056
Training accuracy: 100.00% | Validation accuracy: 97.50%
Training completed in 12.10 minutes.
from previous_chapters import plot_values
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))
plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses, label="loss")

Now we calculate the accuracies on the full dataset
train_accuracy = calc_accuracy_loader(train_loader, model, device)
val_accuracy = calc_accuracy_loader(val_loader, model, device)
test_accuracy = calc_accuracy_loader(test_loader, model, device)
print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")
Training accuracy: 100.00%
Validation accuracy: 96.64%
Test accuracy: 97.33%




