
InstructLab - Framework for Fine-tuning LLMs with Synthetic Data

Framework for fine-tuning LLMs with synthetic data using the LAB (Large-Scale Alignment for ChatBots) method. Now refactored into SDG Hub and Training Hub components.

  1. Step 1

    Overview

    InstructLab (ilab) is a framework for fine-tuning large language models (LLMs) on synthetic data using the LAB (Large-Scale Alignment for ChatBots) method. The original repository has since been archived, and the project has been refactored into separate component repositories.

    ⚠ Heads up: The original instructlab/instructlab repository is archived (April 23, 2026). Development continues in new component repositories.
  2. Step 2

    Project Evolution & Architecture

    As of September 2, 2025, InstructLab was restructured into separate repositories for improved maintainability.

    Project Evolution Summary
    
    Original (Archived):
    Repository: instructlab/instructlab
    Stars: ~12,000
    Status: Archived (April 23, 2026)
    Last release: v0.26.1 (May 5, 2025)
    URL: https://github.com/instructlab/instructlab
    
    New Component Repositories:
    
    1. SDG Hub (Synthetic Data Generation)
       - URL: https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub
       - Stars: ~142
       - Status: Actively maintained
       
    2. Training Hub
       - URL: https://github.com/Red-Hat-AI-Innovation-Team/training_hub
       - Status: Actively maintained
    
    3. Taxonomy (Still Active)
       - URL: https://github.com/instructlab/taxonomy
       - Status: Community contributions ongoing
  3. Step 3

    Technology Stack & Requirements

    InstructLab core technology stack and system requirements:

    Core Technologies:
    - Python: 85.5% of codebase
    - Shell: 9.3%
    - Jupyter: 2.9%
    - Dockerfile: 1.6%
    - License: Apache-2.0
    
    System Requirements:
    - Python: 3.11 or newer for v0.26.1 (that release removed Python 3.10 support; earlier releases accept 3.10)
    - RAM: 16GB+ recommended (32GB+ for production)
    - Storage: 20GB+ for models and data
    - GPU: Optional but recommended (NVIDIA CUDA)
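
The Python and storage requirements above can be checked before installing anything. The following is a minimal sketch using only the standard library; the function name `check_environment` and its parameters are hypothetical and not part of InstructLab. The minimum Python version is a parameter, so set it to the floor of the release you plan to install.

```python
import shutil
import sys


def check_environment(min_python=(3, 10), min_free_gb=20, path="."):
    """Return simple pass/fail checks for the local machine."""
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes -> decimal gigabytes
    return {
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "disk_ok": free_gb >= min_free_gb,
        "free_gb": round(free_gb, 1),
    }


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

RAM and GPU checks are left out because they need third-party packages or vendor tools.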
    
    Core Dependencies:
    - instructlab-core: CLI functionality
    - instructlab-sdg: Synthetic data generation
    - instructlab-training: Training pipeline
    - transformers: HuggingFace models
    - torch: PyTorch
    - llama_cpp_python: Local inference
    
    Optional:
    - vllm: High-performance inference
    - bitsandbytes: Quantization (4bit/8bit)
    - accelerate: Distributed training
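
To see which of the packages listed above are present in the current environment, one option is to query installed distribution metadata. This is a sketch, assuming the names in the lists are the pip distribution names; that may not hold for every entry, so a missing result can also mean the name differs. `installed_versions` is a hypothetical helper.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Names copied from the lists above (assumed to match the pip names).
CORE = [
    "instructlab-core",
    "instructlab-sdg",
    "instructlab-training",
    "transformers",
    "torch",
    "llama_cpp_python",
]
OPTIONAL = ["vllm", "bitsandbytes", "accelerate"]

if __name__ == "__main__":
    for name, version in installed_versions(CORE + OPTIONAL).items():
        print(f"{name}: {version or 'not installed'}")
```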
  4. Step 4

    Installation Options

    Multiple installation methods depending on your needs:

    # Option 1: Archived Original Version
    pip install instructlab==0.26.1  # Last stable release
    
    # Option 2: From Source (Archived)
    git clone https://github.com/instructlab/instructlab.git
    cd instructlab
    git checkout 0.26.1
    pip install -e ".[all]"  # quoted so shells such as zsh do not expand the brackets
    
    # Option 3: New SDG Hub (Recommended for new projects)
    git clone https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git
    cd sdg_hub
    pip install -e .
    
    # Option 4: Training Hub
    git clone https://github.com/Red-Hat-AI-Innovation-Team/training_hub.git
    cd training_hub
    pip install -e .
    
    # Verify Installation
    ilab --version  # Original package
    python -c "import sdg_hub"  # New package
    
    # Clone Taxonomy (Active Repository)
    git clone https://github.com/instructlab/taxonomy.git
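
The two verification commands above can also be expressed as one Python check that reports which installation path is present. This is a sketch using only the standard library; `detect_install` is a hypothetical helper, and the default names `ilab` and `sdg_hub` are taken from the commands above.

```python
import importlib.util
import shutil


def detect_install(cli_name="ilab", module_name="sdg_hub"):
    """Report whether the CLI is on PATH and whether the module is importable."""
    return {
        "cli_path": shutil.which(cli_name),  # None when the command is not on PATH
        "module_found": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    print(detect_install())
```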
  5. Step 5

    Quick Start Workflow

    Basic workflow demonstrating InstructLab core functionality:

    # Step 1: Download a model (Granite-7b recommended)
    ilab model download --repository instructlab/granite-7b-lab
    
    # Alternative: Use HuggingFace directly
    huggingface-cli download ibm-granite/granite-7b-lab \
        --local-dir ./granite-7b
    
    # Step 2: Start chat with base model
    ilab model chat --model ./granite-7b
    
    # Step 3: Generate synthetic data from taxonomy
    ilab data generate \
        --taxonomy-path ./taxonomy \
        --output-dir ./generated/ \
        --seed 42
    
    # Step 4: Train model with synthetic data
    ilab model train \
        --model ./granite-7b \
        --data-path ./generated/data.jsonl \
        --output-dir ./trained_model/ \
        --num-epochs 5 \
        --batch-size 8
    
    # Step 5: Evaluate the trained model
    ilab model evaluate \
        --model ./trained_model/model.pth \
        --benchmark mmlu
    
    # Step 6: Test new model
    ilab model chat --model ./trained_model/model.pth
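
Because each step in the workflow above depends on the previous one, it can help to run them through a small driver that stops at the first failure. The following is a sketch, not an InstructLab feature: `run_steps` is a hypothetical helper, and the `STEPS` list uses placeholder Python commands so the example runs anywhere. Substitute the `ilab` invocations from the workflow, one argument list per step.

```python
import shlex
import subprocess
import sys


def run_steps(steps, dry_run=True):
    """Run (label, argv) pairs in order and stop at the first failure.

    Returns the labels that completed (or would run, in dry-run mode).
    """
    done = []
    for label, argv in steps:
        print(f"[{label}] {shlex.join(argv)}")
        if not dry_run:
            result = subprocess.run(argv)
            if result.returncode != 0:
                print(f"[{label}] failed with exit code {result.returncode}")
                break
        done.append(label)
    return done


# Placeholder commands (hypothetical); the second one fails on purpose.
STEPS = [
    ("first", [sys.executable, "-c", "print('ok')"]),
    ("second", [sys.executable, "-c", "raise SystemExit(1)"]),
    ("third", [sys.executable, "-c", "print('never reached')"]),
]
```

With `dry_run=True` the commands are only printed, which is a cheap way to review a long pipeline before committing GPU time to it.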
  6. Step 6

    Taxonomy Structure & Usage

    Taxonomy is the knowledge base that powers synthetic data generation:

    # Clone and explore taxonomy (run from the directory that will hold ./taxonomy)
    git clone https://github.com/instructlab/taxonomy.git
    
    # View structure
    ls taxonomy/knowledge/              # Knowledge topics
    ls taxonomy/compositional_skills/   # Task examples
    
    # Validate taxonomy before use
    ilab taxonomy diff --taxonomy-path taxonomy/
    
    # Generate data from specific topic
    ilab data generate \
        --taxonomy-path taxonomy/knowledge/python/ \
        --output-dir ./py_data/
    
    # Create custom taxonomy file (my_topic.yaml in the current directory)
    cat > my_topic.yaml << 'EOF'
    name: python_basics
    description: "Python programming basics"
    tasks:
      - type: knowledge
        questions:
          - text: "What is a Python list?"
            answered_question: |
              A mutable ordered collection using square brackets.
    EOF
    
    # Validate and generate
    ilab taxonomy diff --taxonomy-path my_topic.yaml
    ilab data generate --taxonomy-path ./my_topic.yaml --output-dir ./generated/
    
    # Preview generated data
    head -n 5 ./generated/knowledge.jsonl
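
The preview step above prints raw lines. To check that each record is valid JSON and to see which fields it carries, the same preview can be done in Python. This is a sketch using only the standard library; `preview_jsonl` is a hypothetical helper, and it makes no assumption about which fields the generated records contain.

```python
import json
from itertools import islice


def preview_jsonl(lines, limit=5):
    """Parse the first `limit` non-blank lines of JSON Lines input.

    `lines` is any iterable of strings, for example an open file handle.
    Raises json.JSONDecodeError if a previewed line is not valid JSON.
    """
    non_blank = (line for line in lines if line.strip())
    return [json.loads(line) for line in islice(non_blank, limit)]


# Usage with a file on disk:
#     with open("./generated/knowledge.jsonl", encoding="utf-8") as handle:
#         for record in preview_jsonl(handle):
#             print(sorted(record))  # field names of each record
```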
  7. Step 7

    Training Parameters Reference

    Key training configuration options:

    # Basic Training
    ilab model train \
        --model ./granite-7b \
        --taxonomy-path ./taxonomy \
        --output-dir ./output
    
    # Advanced Training
    ilab model train \
        --model ./granite-7b \
        --data-path ./generated/data.jsonl \
        --output-dir ./trained/ \
        --num-epochs 10 \
        --batch-size 16 \
        --learning-rate 4.0e-5 \
        --device cuda \
        --quantization 4bit \
        --max-seq-length 2048 \
        --seed 42 \
        --gradient-checkpointing
    
    # Parameter Reference:
    --model           Path to base model (required)
    --taxonomy-path   Path to taxonomy YAML files (use this or --data-path)
    --data-path       Path to generated JSONL data (use this or --taxonomy-path)
    --output-dir      Output directory for trained model (required)
    --num-epochs      Training epochs (default: 1)
    --batch-size      Training batch size (default: 1)
    --learning-rate   Learning rate (default: 4.0e-5)
    --device          cuda, cpu, or auto (default: auto)
    --quantization    4bit or 8bit quantization
    --max-seq-length  Max sequence length (default: 128)
    --seed            Random seed for reproducibility
    --gradient-checkpointing Enable memory optimization
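
The epoch, batch-size, and sequence-length settings interact: together they determine how many optimizer steps a run takes and how many tokens a single batch can hold. The arithmetic is sketched below, assuming one optimizer step per batch, no gradient accumulation, and a final partial batch that is kept. `training_plan` is a hypothetical helper, and its defaults mirror the defaults listed above.

```python
import math


def training_plan(num_samples, batch_size=1, num_epochs=1, max_seq_length=128):
    """Derive step counts and a per-batch token ceiling from the settings."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
        # Upper bound: every sequence padded or truncated to max_seq_length.
        "max_tokens_per_batch": batch_size * max_seq_length,
    }


if __name__ == "__main__":
    # The "Advanced Training" settings above, with a hypothetical 1,000 samples.
    print(training_plan(1000, batch_size=16, num_epochs=10, max_seq_length=2048))
```

For 1,000 samples this gives 63 steps per epoch, 630 steps in total, and at most 32,768 tokens per batch. The last figure is the one that grows when out-of-memory errors appear.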
  8. Step 8

    Configuration & Environment

    Configuration file management:

    # Config file location
    ~/.config/instructlab/config.yaml
    
    # View current configuration
    ilab config show
    
    # Example config
    cat > ~/.config/instructlab/config.yaml << 'EOF'
    generic:
      debug_level: INFO
    
    model:
      chat:
        model_path: /path/to/granite-7b
    
    taxonomy:
      path: /path/to/taxonomy
    
    cli:
      sdg_backend: null  # or 'ray' for distributed
    
    serve:
      model_path: /path/to/model
      device: auto
    EOF
    
    # Environment variables
    export ILAB_MODEL_PATH=/path/to/model
    export ILAB_TAXONOMY_PATH=/path/to/taxonomy
    export ILAB_DEVICE=cuda
    
    # Override with CLI flags (--config is a global option and goes before the subcommand)
    ilab --config ~/my-config.yaml model chat --model ./custom-model
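
With three places to set the same value (config file, environment variable, CLI flag), it helps to be explicit about which one wins. The text above states only that CLI flags override; the ordering sketched below, with environment variables above the config file, is an assumption and a common convention rather than documented InstructLab behavior. `resolve_setting` is a hypothetical helper.

```python
import os


def resolve_setting(key, cli_args, env_var, config, default=None, environ=None):
    """Pick a value by precedence: CLI flag, then environment, then config file."""
    environ = os.environ if environ is None else environ
    if cli_args.get(key) is not None:
        return cli_args[key]
    if environ.get(env_var):  # unset or empty variables fall through
        return environ[env_var]
    return config.get(key, default)


if __name__ == "__main__":
    config = {"model_path": "/path/to/granite-7b"}  # value from the example config
    print(resolve_setting("model_path", {}, "ILAB_MODEL_PATH", config))
```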
  9. Step 9

    Supported Models

    Recommended models for InstructLab:

    1. IBM Granite (Primary)
       - granite-7b-lab (Most tested)
       - granite-3.0-3b (Faster)
       
    2. LLaMA Family
       - LLaMA-3.1-8B (Meta)
       - LLaMA-2-13b
    
    3. Merlinite
       - merlinite-7b
    
    Size Guidelines:
    - 3B: Fast, ~6GB VRAM
    - 7B: Recommended, ~12GB VRAM
    - 13B+: Better quality, 16-24GB VRAM
    
    Hardware Requirements (7B models):
    - Minimum: 16GB RAM + 8GB GPU VRAM
    - Recommended: 32GB RAM + 16GB GPU VRAM
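
A rough lower bound for the figures above comes from the weights alone: parameter count times bytes per parameter. The sketch below covers only that term. Activations, the KV cache, and (for training) gradients and optimizer state come on top, so real usage is higher and the guideline numbers above will not match it exactly. `weight_memory_gb` and the precision labels are hypothetical names.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "8bit": 1.0, "4bit": 0.5}


def weight_memory_gb(params_billions, precision="fp16"):
    """Memory for the model weights alone, in decimal gigabytes."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (3, 7, 13):
        row = {p: weight_memory_gb(size, p) for p in ("fp16", "8bit", "4bit")}
        print(f"{size}B: {row}")
```

A 7B model needs about 14 GB for fp16 weights and about 3.5 GB at 4-bit, which is why quantization is the first remedy listed for memory problems.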
    
    Cloud Alternatives:
    - Google Colab
    - AWS SageMaker
    - Lambda Labs
    - RunPod
  10. Step 10

    Troubleshooting Guide

    Common issues and solutions:

    1. CUDA/GPU Issues
       Check: python -c "import torch; print(torch.cuda.is_available())"
       Fix: Match PyTorch CUDA version
    
    2. Out of Memory (OOM)
       Fix:
       - Enable 4bit/8bit quantization
       - Reduce batch_size
       - Enable gradient checkpointing
       - Reduce max_seq_length
    
    3. Slow Generation
       Fix:
       - Use vLLM for faster inference
       - Reduce taxonomy files
       - Enable caching
    
    4. Taxonomy Errors  
       Check: ilab taxonomy diff --taxonomy-path my_file.yaml
       Fix: Validate YAML syntax, check required fields
    
    5. Dependency Conflicts
       Fix:
       - Use virtual environment
       - Check Python version (Python 3.10 support was removed in v0.26.1)
       - Update pip
    
    6. Quantization Issues
       Fix:
       - Install bitsandbytes
       - Check GPU compatibility
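
The one-line CUDA check in item 1 raises an ImportError when PyTorch is missing, which can hide the actual problem. A slightly longer sketch that separates the three cases (no PyTorch, PyTorch without CUDA, working CUDA) is shown below. `cuda_status` is a hypothetical helper; the `torch` calls it uses are standard PyTorch attributes.

```python
import importlib.util


def cuda_status():
    """Summarize GPU visibility without failing when PyTorch is absent."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check above can run first

    if not torch.cuda.is_available():
        return "torch installed, CUDA not available"
    name = torch.cuda.get_device_name(0)
    return f"CUDA available: {name} (torch built for CUDA {torch.version.cuda})"


if __name__ == "__main__":
    print(cuda_status())
```

When the second case appears on a machine that has an NVIDIA GPU, the usual cause is the one named above: the installed PyTorch build does not match the CUDA driver.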
    
  11. Step 11

    Resources & Community

    Important resources for learning and support:

    Repositories:
    - Original (Archived): https://github.com/instructlab/instructlab
    - SDG Hub: https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub
    - Training Hub: https://github.com/Red-Hat-AI-Innovation-Team/training_hub
    - Taxonomy: https://github.com/instructlab/taxonomy
    
    Documentation:
    - Website: https://instructlab.ai
    - Docs: https://docs.instructlab.ai
    
    Research:
    - LAB Paper: https://arxiv.org/abs/2403.01081
    - Governance: https://docs.instructlab.ai/community/GOVERNANCE/
    
    Community:
    - Discussions: https://github.com/instructlab/instructlab/discussions
    - Security: https://github.com/instructlab/.github/blob/main/SECURITY.md
