InstructLab - Framework for Fine-tuning LLMs with Synthetic Data

Framework for fine-tuning LLMs with synthetic data using the LAB (Large-Scale Alignment for ChatBots) method. Now refactored into SDG Hub and Training Hub components.

Step 1

Overview

InstructLab (ilab) is a framework for fine-tuning Large Language Models (LLMs) with synthetic data using the LAB (Large-Scale Alignment for ChatBots) method.

⚠ Heads up: The original instructlab/instructlab repository was archived on April 23, 2026. Development continues in separate component repositories.

Step 2

Project Evolution & Architecture

As of September 2, 2025, InstructLab was restructured into separate repositories for improved maintainability.

Project Evolution Summary

Original Repository (Archived):
Repository: instructlab/instructlab
Stars: ~12,000
Status: Archived (April 23, 2026)
Last release: v0.26.1 (May 5, 2025)
URL: https://github.com/instructlab/instructlab

New Component Repositories:

1. SDG Hub (Synthetic Data Generation)
   - URL: https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub
   - Stars: ~142
   - Status: Actively maintained
   
2. Training Hub
   - URL: https://github.com/Red-Hat-AI-Innovation-Team/training_hub
   - Status: Actively maintained

3. Taxonomy (Still Active)
   - URL: https://github.com/instructlab/taxonomy
   - Status: Community contributions ongoing

Step 3

Technology Stack & Requirements

InstructLab core technology stack and system requirements:

Core Technologies:
- Python: 85.5% of codebase
- Shell: 9.3%
- Jupyter: 2.9%
- Dockerfile: 1.6%
- License: Apache-2.0

System Requirements:
- Python >= 3.10 (v0.26.1 dropped Python 3.10 support, so that release needs 3.11 or later)
- RAM: 16GB+ recommended (32GB+ for production)
- Storage: 20GB+ for models and data
- GPU: Optional but recommended (NVIDIA CUDA)

Core Dependencies:
- instructlab-core: CLI functionality
- instructlab-sdg: Synthetic data generation
- instructlab-training: Training pipeline
- transformers: HuggingFace models
- torch: PyTorch
- llama_cpp_python: Local inference

Optional:
- vllm: High-performance inference
- bitsandbytes: Quantization (4bit/8bit)
- accelerate: Distributed training
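
To see which of these requirements a machine already meets, you can probe them from Python. The sketch below is illustrative and not part of InstructLab. The import names are assumptions (a package's pip name and import name can differ), and the minimum Python version is a parameter because it depends on the release you install.

```python
import importlib.util
import sys

# Import names are assumptions; for example, llama_cpp_python is imported
# as llama_cpp.
CORE_MODULES = ["transformers", "torch", "llama_cpp"]
OPTIONAL_MODULES = ["vllm", "bitsandbytes", "accelerate"]


def python_ok(minimum=(3, 10), version=None):
    """Return True if the interpreter version is at least `minimum`.

    Pass minimum=(3, 11) for v0.26.1, which the list above says removed 3.10.
    """
    version = tuple(version or sys.version_info[:2])
    return version >= tuple(minimum)


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing core modules:", missing_modules(CORE_MODULES))
    print("Missing optional modules:", missing_modules(OPTIONAL_MODULES))
```

An empty "missing" list only means the module can be located; it does not confirm that the installed version is compatible.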

Step 4

Installation Options

Multiple installation methods depending on your needs:

# Option 1: Archived Original Version
pip install instructlab==0.26.1  # Last stable release

# Option 2: From Source (Archived)
git clone https://github.com/instructlab/instructlab.git
cd instructlab
git checkout v0.26.1
pip install -e ".[all]"  # quoted so shells such as zsh do not expand the brackets

# Option 3: New SDG Hub (Recommended for new projects)
git clone https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git
cd sdg_hub
pip install -e .

# Option 4: Training Hub
git clone https://github.com/Red-Hat-AI-Innovation-Team/training_hub.git
cd training_hub
pip install -e .

# Verify Installation
ilab --version  # Original package
python -c "import sdg_hub"  # New package

# Clone Taxonomy (Active Repository)
git clone https://github.com/instructlab/taxonomy.git

Step 5

Quick Start Workflow

Basic workflow demonstrating InstructLab core functionality:

# Step 1: Download a model (Granite-7b recommended)
ilab model download --model granite-7b

# Alternative: Use HuggingFace directly
huggingface-cli download ibm-granite/granite-7b-lab \
    --local-dir ./granite-7b

# Step 2: Start chat with base model
ilab model chat --model ./granite-7b

# Step 3: Generate synthetic data from taxonomy
ilab data generate \
    --taxonomy-path ./taxonomy \
    --output-dir ./generated/ \
    --seed 42

# Step 4: Train model with synthetic data
ilab model train \
    --model ./granite-7b \
    --data-path ./generated/data.jsonl \
    --output-dir ./trained_model/ \
    --num-epochs 5 \
    --batch-size 8

# Step 5: Evaluate the trained model
ilab model evaluate \
    --model ./trained_model/model.pth \
    --benchmark mmlu

# Step 6: Test new model
ilab model chat --model ./trained_model/model.pth

Step 6

Taxonomy Structure & Usage

Taxonomy is the knowledge base that powers synthetic data generation:

# Clone and explore taxonomy
git clone https://github.com/instructlab/taxonomy.git
# Stay in the parent directory; the paths below are relative to it.

# View structure
ls taxonomy/knowledge/  # Knowledge topics
ls taxonomy/compositional_skills/  # Task examples

# Validate taxonomy before use
ilab taxonomy diff --taxonomy-path taxonomy/

# Generate data from specific topic
ilab data generate \
    --taxonomy-path taxonomy/knowledge/python/ \
    --output-dir ./py_data/

# Create custom taxonomy file
# taxonomy/knowledge/my-topic.yaml

cat > my_topic.yaml << 'EOF'
name: python_basics
description: "Python programming basics"
tasks:
  - type: knowledge
    questions:
      - text: "What is a Python list?"
        answered_question: |
          A mutable ordered collection using square brackets.
EOF

# Validate and generate
ilab taxonomy diff --taxonomy-path my_topic.yaml
ilab data generate --taxonomy-path ./my_topic.yaml

# Preview generated data
head -n 5 ./generated/knowledge.jsonl
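
For a structured look at the generated records rather than raw lines, a few lines of Python are enough. This is a minimal sketch that assumes the output is standard JSON Lines (one JSON object per line). The helper name `preview_jsonl` is hypothetical, and the record fields depend on the generator version.

```python
import json
from itertools import islice


def preview_jsonl(path, limit=5):
    """Return the first `limit` records of a JSON Lines file as Python objects."""
    records = []
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines so a stray empty line does not raise a decode error.
        lines = (line for line in handle if line.strip())
        for line in islice(lines, limit):
            records.append(json.loads(line))
    return records
```

Calling `preview_jsonl("./generated/knowledge.jsonl")` returns a list you can inspect or pretty-print with `json.dumps(record, indent=2)`.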

Step 7

Training Parameters Reference

Key training configuration options:

# Basic Training
ilab model train \
    --model ./granite-7b \
    --taxonomy-path ./taxonomy \
    --output-dir ./output

# Advanced Training
ilab model train \
    --model ./granite-7b \
    --data-path ./generated/data.jsonl \
    --output-dir ./trained/ \
    --num-epochs 10 \
    --batch-size 16 \
    --learning-rate 4.0e-5 \
    --device cuda \
    --quantization 4bit \
    --max-seq-length 2048 \
    --seed 42 \
    --gradient-checkpointing

# Parameter Reference:
--model           Path to base model (required)
--taxonomy-path   Path to taxonomy YAML files (use this or --data-path)
--data-path       Path to generated JSONL data
--output-dir      Output directory for trained model (required)
--num-epochs      Training epochs (default: 1)
--batch-size      Training batch size (default: 1)
--learning-rate   Learning rate (default: 4.0e-5)
--device          cuda, cpu, or auto (default: auto)
--quantization    4bit or 8bit quantization
--max-seq-length  Max sequence length (default: 128)
--seed            Random seed for reproducibility
--gradient-checkpointing Enable memory optimization
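
When training runs are launched from scripts, it can help to build the argument list programmatically so that typos in flag names fail early. The sketch below is an assumption-laden illustration: the flag names are copied from the parameter reference above and have not been checked against any particular ilab release, and `build_train_command` and `as_shell` are hypothetical helpers.

```python
import shlex

# Flags copied from the parameter reference above; not verified against
# any specific ilab release.
VALUE_FLAGS = [
    "model", "taxonomy-path", "data-path", "output-dir", "num-epochs",
    "batch-size", "learning-rate", "device", "quantization",
    "max-seq-length", "seed",
]
BOOLEAN_FLAGS = ["gradient-checkpointing"]


def build_train_command(**options):
    """Turn keyword options into an `ilab model train` argument list."""
    normalized = {key.replace("_", "-"): value for key, value in options.items()}
    unknown = set(normalized) - set(VALUE_FLAGS) - set(BOOLEAN_FLAGS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    command = ["ilab", "model", "train"]
    for name in VALUE_FLAGS:  # fixed order keeps the output reproducible
        if normalized.get(name) is not None:
            command += [f"--{name}", str(normalized[name])]
    for name in BOOLEAN_FLAGS:  # boolean flags take no value
        if normalized.get(name):
            command.append(f"--{name}")
    return command


def as_shell(command):
    """Render an argument list as a copy-pasteable shell line."""
    return shlex.join(command)
```

The resulting list can be passed to `subprocess.run(command, check=True)`, or printed with `as_shell` for review before running.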

Step 8

Configuration & Environment

Configuration file management:

# Config file location: ~/.config/instructlab/config.yaml

# View current configuration
ilab config show

# Example config
cat > ~/.config/instructlab/config.yaml << 'EOF'
generic:
  debug_level: INFO

model:
  chat:
    model_path: /path/to/granite-7b

taxonomy:
  path: /path/to/taxonomy

cli:
  sdg_backend: null  # or 'ray' for distributed

serve:
  model_path: /path/to/model
  device: auto
EOF

# Environment variables
export ILAB_MODEL_PATH=/path/to/model
export ILAB_TAXONOMY_PATH=/path/to/taxonomy
export ILAB_DEVICE=cuda

# Override with CLI flags
ilab chat --model ./custom-model --config ~/my-config.yaml
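
With three sources for the same setting (CLI flag, environment variable, config file), it helps to be explicit about which one wins. The text above only states that CLI flags override. The full ordering used in this sketch (CLI flag, then environment variable, then config file, then default) is an assumption for illustration, not documented ilab behavior, and `resolve_setting` is a hypothetical helper.

```python
import os


def resolve_setting(cli_value, env_name, config, config_keys, default=None, environ=None):
    """Pick a setting: CLI flag > environment variable > config file > default.

    The ordering is an assumption for illustration only.
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if environ.get(env_name):
        return environ[env_name]
    node = config
    for key in config_keys:  # walk nested keys such as ("serve", "device")
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return default if node is None else node
```

For example, with the config shown above loaded into a dict, `resolve_setting(None, "ILAB_DEVICE", config, ("serve", "device"))` falls back to the file's `serve.device` value when `ILAB_DEVICE` is unset.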

Step 9

Supported Models

Recommended models for InstructLab:

1. IBM Granite (Primary)
   - granite-7b-lab (Most tested)
   - granite-3.0-3b (Faster)
   
2. LLaMA Family
   - LLaMA-3.1-8B (Meta)
   - LLaMA-2-13b

3. Merlinite
   - merlinite-7b

Size Guidelines:
- 3B: Fast, ~6GB VRAM
- 7B: Recommended, ~12GB VRAM
- 13B+: Better quality, 16-24GB VRAM
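
A rough cross-check for these figures is the memory taken by the weights alone: parameter count times bytes per parameter. A 3B model at 16 bits per parameter needs 3e9 × 2 bytes = 6 GB, which matches the 3B guideline. The sketch below covers weights only; activations, optimizer state, and framework overhead come on top, so treat the results as a lower bound rather than a sizing rule.

```python
def weight_memory_gb(params_billion, bits_per_param=16):
    """Memory needed for the model weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (3, 7, 13):
        row = ", ".join(
            f"{bits}-bit: {weight_memory_gb(size, bits):.1f} GB" for bits in (16, 8, 4)
        )
        print(f"{size}B -> {row}")
```

The same arithmetic shows why 4-bit or 8-bit quantization helps on small GPUs: halving the bits per parameter halves the weight footprint.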

Hardware Requirements (7B models):
- Minimum: 16GB RAM + 8GB GPU VRAM
- Recommended: 32GB RAM + 16GB GPU VRAM

Cloud Alternatives:
- Google Colab
- AWS SageMaker
- Lambda Labs
- RunPod

Step 10

Troubleshooting Guide

Common issues and solutions:

1. CUDA/GPU Issues
   Check: python -c "import torch; print(torch.cuda.is_available())"
   Fix: Install a PyTorch build whose CUDA version matches your NVIDIA driver

2. Out of Memory (OOM)
   Fix:
   - Enable 4bit/8bit quantization
   - Reduce batch_size
   - Enable gradient checkpointing
   - Reduce max_seq_length

3. Slow Generation
   Fix:
   - Use vLLM for faster inference
   - Reduce the number of taxonomy files processed per run
   - Enable caching

4. Taxonomy Errors
   Check: ilab taxonomy diff --taxonomy-path my_file.yaml
   Fix: Validate YAML syntax, check required fields

5. Dependency Conflicts
   Fix:
   - Use virtual environment
   - Check Python version (>=3.10)
   - Update pip

6. Quantization Issues
   Fix:
   - Install bitsandbytes
   - Check GPU compatibility
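
For the CUDA and quantization checks above, a slightly fuller diagnostic than the one-liner can save a round trip. This sketch uses only standard PyTorch calls and degrades gracefully when PyTorch is not installed; `cuda_report` is a hypothetical helper, not an ilab command.

```python
def cuda_report():
    """Summarize whether PyTorch is installed and can see a CUDA device."""
    report = {
        "torch_installed": False,
        "torch_cuda_version": None,
        "cuda_available": False,
        "devices": [],
    }
    try:
        import torch
    except ImportError:
        return report  # PyTorch missing: nothing more to check
    report["torch_installed"] = True
    # CUDA version PyTorch was built against; None for CPU-only builds.
    report["torch_cuda_version"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `torch_cuda_version` is `None` while a GPU is present, the installed PyTorch is a CPU-only build, which points to issue 1 rather than a driver problem.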

Step 11

Resources & Community

Important resources for learning and support:

Repositories:
- Original (Archived): https://github.com/instructlab/instructlab
- SDG Hub: https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub
- Training Hub: https://github.com/Red-Hat-AI-Innovation-Team/training_hub
- Taxonomy: https://github.com/instructlab/taxonomy

Documentation:
- Website: https://instructlab.ai
- Docs: https://docs.instructlab.ai

Research:
- LAB Paper: https://arxiv.org/abs/2403.01081

Community:
- Governance: https://docs.instructlab.ai/community/GOVERNANCE/
- Discussions: https://github.com/instructlab/instructlab/discussions
- Security: https://github.com/instructlab/.github/blob/main/SECURITY.md
