Llama Cookbook - LLaMA Fine-tuning Framework
Official comprehensive guide for LLaMA model inference, fine-tuning, RAG, and end-to-end use cases. Includes examples for domain adaptation and LLM applications.
- Step 1
Overview
Llama Cookbook (formerly llama-recipes) is Meta's official companion project to the Llama models. It provides comprehensive examples and recipes for getting started with inference, fine-tuning for domain adaptation, RAG (Retrieval Augmented Generation), and end-to-end use cases with the Llama model family.
- Step 2
Technology Stack
Llama Cookbook is built with the following technologies:
Name: llama-cookbook (formerly llama-recipes) License: MIT Stars: ~18,337 Owner: meta-llama Repo: https://github.com/meta-llama/llama-cookbook Website: https://www.llama.com Languages: - Jupyter Notebook (Primary: 13M+ lines) - Python (~800K lines) - Java (~56K lines) - JavaScript (~35K lines) - Kotlin (~11K lines) Core Dependencies: - PyTorch >=2.2 - Deep learning framework - Accelerate - Training acceleration - Transformers >=4.45.1 - Hugging Face models - Peft - Parameter-efficient fine-tuning - LoraLib - Low-Rank Adaptation - Datasets - Hugging Face datasets - bitsandbytes - 8/4-bit quantization Optional Dependencies: - vllm: High-performance inference - auditnlg: Sensitive topics safety checker - langchain: LangChain integration - tests: pytest-mock for testing Key Features: - Inference for Llama models - Fine-tuning with Full Finetuning, LoRA, QLoRA - RAG (Retrieval Augmented Generation) - End-to-end use cases and applications - Model distillation - 3P integrations with providers - Multi-GPU and distributed training - Jupyter notebooks for interactive exploration - Step 3
Installation
Install Llama Cookbook using pip or from source.
# Install with pip (recommended) pip install llama-cookbook # Install with optional dependencies pip install llama-cookbook[tests] # For unit tests pip install llama-cookbook[vllm] # For vLLM inference pip install llama-cookbook[auditnlg] # For safety checker pip install llama-cookbook[langchain] # For LangChain examples # Install multiple optional dependencies pip install llama-cookbook[tests,vllm,auditnlg] # Install from source (for development) git clone https://github.com/meta-llama/llama-cookbook.git cd llama-cookbook pip install -U pip setuptools pip install -e . # For development with all dependencies cd llama-cookbook pip install -e .[tests,auditnlg,vllm,langchain] - Step 4
PyTorch Installation (CUDA-aware)
Install PyTorch with the correct CUDA version for your GPU.
# Check your CUDA version nvidia-smi # Install PyTorch with CUDA 11.8 (recommended) pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # Install PyTorch with CUDA 12.1 (for H100 and newer GPUs) pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # Verify installation python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}') - Step 5
Getting Llama Models
Access Llama models from Hugging Face Hub.
# Log into Hugging Face pip install huggingface_hub huggingface-cli login # Visit https://huggingface.co/meta-llama to access models # Models with 'hf' suffix are already converted to HF format # Example model names: # - meta-llama/Llama-3.3-70B-Instruct-hf # - meta-llama/Llama-3.2-1B-hf # - meta-llama/Llama-3.2-1B-Instruct-hf # - meta-llama/Llama-3.1-8B-Instruct-hf # Note: Models with 'hf' suffix require NO conversion step - Step 6
Basic Inference
Run inference with a Llama model.
# Using transformers library from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline tokenizer = AutoTokenizer.from_pretrained( "meta-llama/Llama-3.2-1B-Instruct-hf", token="YOUR_HF_TOKEN" ) model = AutoModelForCausalLM.from_pretrained( "meta-llama/Llama-3.2-1B-Instruct-hf", torch_dtype="auto", device_map="auto", token="YOUR_HF_TOKEN" ) pipe = pipeline( "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256, temperature=0.7 ) result = pipe("What is machine learning?") print(result[0]['generated_text']) - Step 7
Fine-tuning with LoRA
Perform parameter-efficient fine-tuning using LoRA.
# Finetuning with LoRA cd src/llama_cookbook/finetuning/ # Using the provided scripts python finetune_lora.py \ --model_name meta-llama/Llama-3.2-1B-Instruct-hf \ --dataset_name your-dataset \ --max_seq_length 2048 \ --per_device_train_batch_size 2 \ --gradient_accumulation_steps 4 \ --num_train_epochs 3 \ --learning_rate 2e-4 \ --output_dir ./results - Step 8
Fine-tuning with QLoRA
Quantized LoRA for memory-efficient fine-tuning.
# Finetuning with QLoRA (4-bit quantization) # QLoRA allows 7B+ models to fit on single consumer GPU (24GB) python finetune_qlora.py \ --model_name meta-llama/Llama-3.1-8B-Instruct-hf \ --use_peft True \ --load_in_4bit True \ --bnb_4bit_quant_type nf4 \ --lora_r 64 \ --lora_alpha 16 \ --dataset_name your-dataset \ --max_seq_length 2048 \ --output_dir ./qlora_results⚠ Heads up: QLoRA requires GPU with at least 16GB VRAM for 7B models, 24GB+ for larger models. - Step 9
RAG (Retrieval Augmented Generation)
Build RAG systems with Llama models.
# RAG with Llama Cookbook # Check the src/llama_cookbook/rag/ directory for examples # Key components: # 1. Document loading and chunking # 2. Embedding models # 3. Vector stores (FAISS, etc.) # 4. Llama model for generation # See RAG examples in: # - getting-started/RAG/ # - src/llama_cookbook/rag/ - Step 10
Using Jupyter Notebooks
Interactive exploration with Jupyter notebooks.
# Start Jupyter cd getting-started/ jupyter notebook # Key notebooks: # - build_with_llama_api.ipynb (Llama API integration) # - build_with_llama_4.ipynb (5M context, Llama 4) # Prerequisites: pip install jupyter jupyterlab notebook - Step 11
Configuration Options
Key configuration parameters.
Fine-tuning Parameters: Model: --model_name : Path to model --output_dir : Output directory --use_peft : Enable parameter-efficient fine-tuning --peft_method : peft, lora, or qlora LoRA: --lora_r : LoRA rank (default: 64) --lora_alpha : LoRA alpha (default: 16) --lora_dropout : LoRA dropout (default: 0.1) Data: --dataset_name : HF dataset or local path --max_seq_length : Maximum sequence length Training: --per_device_train_batch_size : Batch size per GPU --gradient_accumulation_steps : Accumulation steps --num_train_epochs : Number of epochs --learning_rate : Learning rate (e.g., 2e-4) --lr_scheduler_type : cosine, linear, constant Optimization: --fp16 : Use 16-bit precision --bf16 : Use bfloat16 precision --gradient_checkpointing : Enable checkpointing - Step 12
Multi-GPU Training
Distributed training across multiple GPUs.
# Using accelerate (recommended) accelerate config # Run with accelerate accelerate launch finetune_script.py \ --model_name meta-llama/Llama-3.1-8B-Instruct-hf \ --use_peft True \ --peft_method lora # Check GPU availability nvidia-smi python -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')" - Step 13
vLLM Inference
High-throughput inference using vLLM.
# Install vLLM pip install llama-cookbook[vllm] # Or install separately pip install vllm # Usage with vllm package: from vllm import LLM, SamplingParams llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct-hf", tensor_parallel_size=4, max_model_len=8192 ) sampling_params = SamplingParams( temperature=0.7, top_p=0.9, max_tokens=512 ) outputs = llm.generate( prompts=["What is machine learning?"], sampling_params=sampling_params ) for output in outputs: print(output.outputs[0].text) - Step 14
Repository Structure
Understanding the organization:
llama-cookbook/ ├── 3p-integrations/ # Provider-specific recipes ├── end-to-end-use-cases/ # Complete applications │ ├── whatsapp_llama_4_bot/ │ ├── research_paper_analyzer/ │ └── ... ├── getting-started/ # Core tutorials │ ├── Inference examples │ ├── Finetuning examples │ ├── RAG examples │ ├── build_with_llama_api.ipynb │ └── build_with_llama_4.ipynb ├── src/ # llama-recipes library │ ├── llama_cookbook/ │ └── docs/ # Fine-tuning FAQ ├── pyproject.toml └── README.md - Step 15
End-to-End Use Cases
Complete application examples.
Available Use Cases: 1. WhatsApp Llama 4 Bot - Path: end-to-end-use-cases/whatsapp_llama_4_bot/ - Integrate Llama API with WhatsApp 2. Research Paper Analyzer - Path: end-to-end-use-cases/research_paper_analyzer/ - Analyze academic papers with Llama 4 3. Book Character Mind Map - Path: end-to-end-use-cases/book-character-mindmap/ - Create character relationships from books 4. 5M Token Long Context - Path: getting-started/build_with_llama_4.ipynb - Handle extremely long documents (Llama 4) - Step 16
Key Features
Llama Cookbook capabilities:
1. Inference: Text generation with Llama models 2. Fine-tuning: Full, LoRA, QLoRA methods 3. RAG: Retrieval augmented generation 4. Distillation: Knowledge transfer 5. End-to-End: Complete applications 6. 3P Integrations: Provider recipes 7. Multi-GPU: Distributed training 8. Jupyter: Interactive notebooks 9. Quantization: 8-bit, 4-bit 10. vLLM: High-throughput inference 11. Safety: AuditNLG checker 12. LangChain: Integration 13. Long Context: 5M tokens (Llama 4) 14. Llama API: Official API 15. Hugging Face Native - Step 17
Use Cases
Ideal applications:
1. Domain Adaptation: Fine-tune for specific domains 2. Custom Models: Task-specific variants 3. RAG Systems: Knowledge-grounded apps 4. Chatbots: Conversational AI 5. Content Generation: Text workflows 6. Code Generation: Programming 7. Research: LLM experimentation 8. Production: Enterprise deployments 9. API Integration: Cloud solutions 10. Cost Reduction: Smaller models - Step 18
FAQ
Frequently asked questions.
Q: What happened to llama-recipes? A: Renamed to llama-cookbook. Q: Links broken/folders missing? A: Repo was refactored. Use archive-main branch. Q: Where to find model details? A: https://www.llama.com Q: How to access models? A: Visit https://huggingface.co/meta-llama Q: 'hf' suffix vs original models? A: 'hf' = already converted to HF format. Q: Minimum GPU requirement? A: 7B with QLoRA: 16GB VRAM. 70B: Multi-GPU. - Step 19
Resources
Additional resources.
Main Resources: - Repository: https://github.com/meta-llama/llama-cookbook - Llama Models: https://www.llama.com - Llama API: https://llama.developer.meta.com - Hugging Face: https://huggingface.co/meta-llama - Models: https://github.com/meta-llama/llama-models - Synthetic Data Kit: https://github.com/meta-llama/synthetic-data-kit - Llama Prompt Ops: https://github.com/meta-llama/llama-prompt-ops - Contributing: See CONTRIBUTING.md
Feature requests
Sign in to suggest features or vote on existing ones.
No feature requests yet.
Discussion
Sign in to join the discussion.
No comments yet.