Advancedllamallmfine-tuninginferenceragpeftloralibpytorchtransformersnlpmachine-learningmetaai

Llama Cookbook - LLaMA Fine-tuning Framework

Official comprehensive guide for LLaMA model inference, fine-tuning, RAG, and end-to-end use cases. Includes examples for domain adaptation and LLM applications.

Step 1
Overview
Llama Cookbook (formerly llama-recipes) is Meta's official companion project to the Llama models. It provides comprehensive examples and recipes for getting started with inference, fine-tuning for domain adaptation, RAG (Retrieval Augmented Generation), and end-to-end use cases with the Llama model family.

Step 2

Technology Stack

Llama Cookbook is built with the following technologies:

Name: llama-cookbook (formerly llama-recipes)
License: MIT
Stars: ~18,337
Owner: meta-llama
Repo: https://github.com/meta-llama/llama-cookbook
Website: https://www.llama.com

Languages:
- Jupyter Notebook (Primary: 13M+ lines)
- Python (~800K lines)
- Java (~56K lines)
- JavaScript (~35K lines)
- Kotlin (~11K lines)

Core Dependencies:
- PyTorch >=2.2 - Deep learning framework
- Accelerate - Training acceleration
- Transformers >=4.45.1 - Hugging Face models
- Peft - Parameter-efficient fine-tuning
- LoraLib - Low-Rank Adaptation
- Datasets - Hugging Face datasets
- bitsandbytes - 8/4-bit quantization

Optional Dependencies:
- vllm: High-performance inference
- auditnlg: Sensitive topics safety checker
- langchain: LangChain integration
- tests: pytest-mock for testing

Key Features:
- Inference for Llama models
- Fine-tuning with Full Finetuning, LoRA, QLoRA
- RAG (Retrieval Augmented Generation)
- End-to-end use cases and applications
- Model distillation
- 3P integrations with providers
- Multi-GPU and distributed training
- Jupyter notebooks for interactive exploration

Step 3

Installation

Install Llama Cookbook using pip or from source.

# Install with pip (recommended)
pip install llama-cookbook

# Install with optional dependencies
pip install llama-cookbook[tests]           # For unit tests
pip install llama-cookbook[vllm]            # For vLLM inference
pip install llama-cookbook[auditnlg]        # For safety checker
pip install llama-cookbook[langchain]       # For LangChain examples

# Install multiple optional dependencies
pip install llama-cookbook[tests,vllm,auditnlg]

# Install from source (for development)
git clone https://github.com/meta-llama/llama-cookbook.git
cd llama-cookbook
pip install -U pip setuptools
pip install -e .

# For development with all dependencies
cd llama-cookbook
pip install -e .[tests,auditnlg,vllm,langchain]

Step 4

PyTorch Installation (CUDA-aware)

Install PyTorch with the correct CUDA version for your GPU.

# Check your CUDA version
nvidia-smi

# Install PyTorch with CUDA 11.8 (recommended)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install PyTorch with CUDA 12.1 (for H100 and newer GPUs)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify installation
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')

Step 5

Getting Llama Models

Access Llama models from Hugging Face Hub.

# Log into Hugging Face
pip install huggingface_hub
huggingface-cli login

# Visit https://huggingface.co/meta-llama to access models
# Models with 'hf' suffix are already converted to HF format

# Example model names:
# - meta-llama/Llama-3.3-70B-Instruct-hf
# - meta-llama/Llama-3.2-1B-hf
# - meta-llama/Llama-3.2-1B-Instruct-hf
# - meta-llama/Llama-3.1-8B-Instruct-hf

# Note: Models with 'hf' suffix require NO conversion step

Step 6

Basic Inference

Run inference with a Llama model.

# Using transformers library
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct-hf",
    token="YOUR_HF_TOKEN"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct-hf",
    torch_dtype="auto",
    device_map="auto",
    token="YOUR_HF_TOKEN"
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0.7
)

result = pipe("What is machine learning?")
print(result[0]['generated_text'])

Step 7

Fine-tuning with LoRA

Perform parameter-efficient fine-tuning using LoRA.

# Finetuning with LoRA
cd src/llama_cookbook/finetuning/

# Using the provided scripts
python finetune_lora.py \
    --model_name meta-llama/Llama-3.2-1B-Instruct-hf \
    --dataset_name your-dataset \
    --max_seq_length 2048 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 3 \
    --learning_rate 2e-4 \
    --output_dir ./results

Step 8

Fine-tuning with QLoRA

Quantized LoRA for memory-efficient fine-tuning.

# Finetuning with QLoRA (4-bit quantization)
# QLoRA allows 7B+ models to fit on single consumer GPU (24GB)

python finetune_qlora.py \
    --model_name meta-llama/Llama-3.1-8B-Instruct-hf \
    --use_peft True \
    --load_in_4bit True \
    --bnb_4bit_quant_type nf4 \
    --lora_r 64 \
    --lora_alpha 16 \
    --dataset_name your-dataset \
    --max_seq_length 2048 \
    --output_dir ./qlora_results

⚠ Heads up: QLoRA requires GPU with at least 16GB VRAM for 7B models, 24GB+ for larger models.

Step 9

RAG (Retrieval Augmented Generation)

Build RAG systems with Llama models.

# RAG with Llama Cookbook
# Check the src/llama_cookbook/rag/ directory for examples

# Key components:
# 1. Document loading and chunking
# 2. Embedding models
# 3. Vector stores (FAISS, etc.)
# 4. Llama model for generation

# See RAG examples in:
# - getting-started/RAG/
# - src/llama_cookbook/rag/

Step 10

Using Jupyter Notebooks

Interactive exploration with Jupyter notebooks.

# Start Jupyter
cd getting-started/
jupyter notebook

# Key notebooks:
# - build_with_llama_api.ipynb   (Llama API integration)
# - build_with_llama_4.ipynb     (5M context, Llama 4)

# Prerequisites:
pip install jupyter jupyterlab notebook

Step 11

Configuration Options

Key configuration parameters.

Fine-tuning Parameters:

Model:
--model_name      : Path to model
--output_dir      : Output directory
--use_peft        : Enable parameter-efficient fine-tuning
--peft_method     : peft, lora, or qlora

LoRA:
--lora_r          : LoRA rank (default: 64)
--lora_alpha      : LoRA alpha (default: 16)
--lora_dropout    : LoRA dropout (default: 0.1)

Data:
--dataset_name    : HF dataset or local path
--max_seq_length  : Maximum sequence length

Training:
--per_device_train_batch_size : Batch size per GPU
--gradient_accumulation_steps : Accumulation steps
--num_train_epochs    : Number of epochs
--learning_rate       : Learning rate (e.g., 2e-4)
--lr_scheduler_type   : cosine, linear, constant

Optimization:
--fp16              : Use 16-bit precision
--bf16              : Use bfloat16 precision
--gradient_checkpointing : Enable checkpointing

Step 12

Multi-GPU Training

Distributed training across multiple GPUs.

# Using accelerate (recommended)
accelerate config

# Run with accelerate
accelerate launch finetune_script.py \
    --model_name meta-llama/Llama-3.1-8B-Instruct-hf \
    --use_peft True \
    --peft_method lora

# Check GPU availability
nvidia-smi
python -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')"

Step 13

vLLM Inference

High-throughput inference using vLLM.

# Install vLLM
pip install llama-cookbook[vllm]

# Or install separately
pip install vllm

# Usage with vllm package:
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct-hf",
    tensor_parallel_size=4,
    max_model_len=8192
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

outputs = llm.generate(
    prompts=["What is machine learning?"],
    sampling_params=sampling_params
)

for output in outputs:
    print(output.outputs[0].text)

Step 14

Repository Structure

Understanding the organization:

llama-cookbook/
├── 3p-integrations/       # Provider-specific recipes
├── end-to-end-use-cases/  # Complete applications
│   ├── whatsapp_llama_4_bot/
│   ├── research_paper_analyzer/
│   └── ...
├── getting-started/       # Core tutorials
│   ├── Inference examples
│   ├── Finetuning examples
│   ├── RAG examples
│   ├── build_with_llama_api.ipynb
│   └── build_with_llama_4.ipynb
├── src/                   # llama-recipes library
│   ├── llama_cookbook/
│   └── docs/              # Fine-tuning FAQ
├── pyproject.toml
└── README.md

Step 15

End-to-End Use Cases

Complete application examples.

Available Use Cases:

1. WhatsApp Llama 4 Bot
   - Path: end-to-end-use-cases/whatsapp_llama_4_bot/
   - Integrate Llama API with WhatsApp

2. Research Paper Analyzer
   - Path: end-to-end-use-cases/research_paper_analyzer/
   - Analyze academic papers with Llama 4

3. Book Character Mind Map
   - Path: end-to-end-use-cases/book-character-mindmap/
   - Create character relationships from books

4. 5M Token Long Context
   - Path: getting-started/build_with_llama_4.ipynb
   - Handle extremely long documents (Llama 4)

Step 16

Key Features

Llama Cookbook capabilities:

1. Inference: Text generation with Llama models
2. Fine-tuning: Full, LoRA, QLoRA methods
3. RAG: Retrieval augmented generation
4. Distillation: Knowledge transfer
5. End-to-End: Complete applications
6. 3P Integrations: Provider recipes
7. Multi-GPU: Distributed training
8. Jupyter: Interactive notebooks
9. Quantization: 8-bit, 4-bit
10. vLLM: High-throughput inference
11. Safety: AuditNLG checker
12. LangChain: Integration
13. Long Context: 5M tokens (Llama 4)
14. Llama API: Official API
15. Hugging Face Native

Step 17

Use Cases

Ideal applications:

1. Domain Adaptation: Fine-tune for specific domains
2. Custom Models: Task-specific variants
3. RAG Systems: Knowledge-grounded apps
4. Chatbots: Conversational AI
5. Content Generation: Text workflows
6. Code Generation: Programming
7. Research: LLM experimentation
8. Production: Enterprise deployments
9. API Integration: Cloud solutions
10. Cost Reduction: Smaller models

Step 18

FAQ

Frequently asked questions.

Q: What happened to llama-recipes?
A: Renamed to llama-cookbook.

Q: Links broken/folders missing?
A: Repo was refactored. Use archive-main branch.

Q: Where to find model details?
A: https://www.llama.com

Q: How to access models?
A: Visit https://huggingface.co/meta-llama

Q: 'hf' suffix vs original models?
A: 'hf' = already converted to HF format.

Q: Minimum GPU requirement?
A: 7B with QLoRA: 16GB VRAM. 70B: Multi-GPU.

Step 19

Resources

Additional resources.

Main Resources:
- Repository: https://github.com/meta-llama/llama-cookbook
- Llama Models: https://www.llama.com
- Llama API: https://llama.developer.meta.com
- Hugging Face: https://huggingface.co/meta-llama
- Models: https://github.com/meta-llama/llama-models
- Synthetic Data Kit: https://github.com/meta-llama/synthetic-data-kit
- Llama Prompt Ops: https://github.com/meta-llama/llama-prompt-ops
- Contributing: See CONTRIBUTING.md