
vLLM: High-Throughput LLM Inference and Serving Setup

A complete setup guide for vLLM, a high-throughput, memory-efficient inference and serving engine for Large Language Models. vLLM uses PagedAttention for KV cache memory management, continuous batching to keep the GPU busy, and ships an OpenAI-compatible API server. It supports 200+ model architectures along with quantization, distributed inference, and multi-GPU deployment.

  1. Step 1

    System Prerequisites

    vLLM is optimized for high-performance LLM inference on GPU-accelerated systems. Before installation, verify your environment meets the requirements. vLLM works best on Linux with NVIDIA GPUs, though experimental support exists for AMD GPUs, TPUs, and other accelerators.

    # Check Python version (3.9-3.12 supported, 3.12 recommended)
    python --version
    
    # Check CUDA availability and version
    nvidia-smi
    
    # Check GPU compute capability (7.0+ recommended)
    nvidia-smi --query-gpu=compute_cap --format=csv
    
    # Verify system architecture
    uname -m  # x86_64 recommended
    ⚠ Heads up: vLLM requires an NVIDIA GPU with CUDA for optimal performance. CPU-only mode is experimental and significantly slower. For production workloads, use GPUs with at least 16GB VRAM.
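    The same checks can be scripted. The following is a minimal sketch, assuming the version range and compute capability threshold quoted above; the function names are illustrative and not part of vLLM.

```python
import subprocess
import sys


def python_supported(version=sys.version_info[:2], low=(3, 9), high=(3, 12)):
    """Return True if (major, minor) lies in the range this guide lists."""
    return low <= tuple(version) <= high


def parse_compute_caps(csv_text):
    """Parse `nvidia-smi --query-gpu=compute_cap --format=csv` output.

    The first line is the CSV header; each later line describes one GPU.
    Capabilities are returned as (major, minor) tuples, e.g. (8, 6).
    """
    lines = [ln.strip() for ln in csv_text.strip().splitlines()]
    return [tuple(int(part) for part in ln.split(".")) for ln in lines[1:] if ln]


def query_compute_caps():
    """Run nvidia-smi and return the parsed list, or [] if it is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_compute_caps(out)
```

    For example, `all(cap >= (7, 0) for cap in query_compute_caps())` tells you whether every detected GPU meets the recommended capability. Note that `all()` over an empty list is True, so check that the list is non-empty first.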
  2. Step 2

    Installation via pip (Recommended)

    The simplest installation method is via pip, which automatically handles PyTorch and CUDA dependencies. vLLM ships with pre-compiled binaries for CUDA 12.1 by default, with CUDA 11.8 binaries also available.

    # Create a virtual environment (recommended)
    python -m venv vllm-env
    source vllm-env/bin/activate
    
    # Install vLLM with CUDA 12.1 (default)
    pip install vllm
    
    # Or install with CUDA 11.8
    export VLLM_VERSION=0.6.5
    export PYTHON_VERSION=312
    pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl \
      --extra-index-url https://download.pytorch.org/whl/cu118
    
    # Verify installation
    python -c "import vllm; print(vllm.__version__)"
  3. Step 3

    Installation via Docker (Production Recommended)

    Docker provides the most reliable deployment method, especially for production environments. The official vLLM Docker images come pre-configured with all dependencies, CUDA drivers, and optimized kernels.

    # Pull the latest vLLM Docker image
    docker pull vllm/vllm-openai:latest
    
    # Run a simple test to verify GPU access
    # (override the image entrypoint, which otherwise starts the API server)
    docker run --rm --gpus all \
      --entrypoint python3 \
      vllm/vllm-openai:latest \
      -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
    
    # Run interactive container for testing
    docker run --rm --gpus all -it \
      --entrypoint /bin/bash \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      vllm/vllm-openai:latest
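    ⚠ Heads up: Ensure the NVIDIA Container Toolkit is installed on your host system. With NVIDIA's apt repository configured, install it with: sudo apt-get install -y nvidia-container-toolkit && sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker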
  4. Step 4

    Installation from Source (For Development)

    Building from source gives you access to the latest features and allows custom compilation flags. This method is recommended for contributors, researchers, or when you need bleeding-edge features.

    # Install build dependencies
    pip install --upgrade pip
    pip install wheel packaging ninja "setuptools-scm>=8"  # quoted so the shell does not treat >= as a redirect
    
    # Clone the repository
    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    
    # Install in editable mode (for development)
    pip install -e .
    
    # Or build and install normally
    pip install .
    
    # Verify the installation
    vllm --version
  5. Step 5

    Basic Offline Inference

    The simplest way to use vLLM is through its Python API for offline batch inference. This mode is ideal for processing large datasets, benchmarking, or when you don't need a server.

    from vllm import LLM, SamplingParams
    
    # Initialize the model
    llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
    
    # Define sampling parameters
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=256
    )
    
    # Generate outputs
    prompts = [
        "Explain quantum computing in simple terms:",
        "Write a Python function to reverse a string:",
    ]
    
    outputs = llm.generate(prompts, sampling_params)
    
    # Print results
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt}")
        print(f"Generated: {generated_text}\n")
  6. Step 6

    Launch OpenAI-Compatible API Server

    vLLM's most powerful feature is its OpenAI-compatible API server, allowing drop-in replacement for OpenAI services. This is the recommended deployment mode for production applications.

    # Basic server launch
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --host 0.0.0.0 \
      --port 8000
    
    # Production-optimized launch
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --host 0.0.0.0 \
      --port 8000 \
      --gpu-memory-utilization 0.9 \
      --max-model-len 4096 \
      --max-num-seqs 256 \
      --api-key sk-your-secret-key \
      --served-model-name llama-3.2-3b
    ⚠ Heads up: The --api-key parameter is critical for production. Never use a placeholder key such as 'token-abc123' from documentation examples in exposed deployments. Generate a secure key with: openssl rand -base64 32
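    If you prefer to generate the key from Python rather than openssl, the standard library's secrets module does the same job. This is a small sketch; the "sk-" prefix only mirrors the placeholder used in this guide and is not required by vLLM.

```python
import secrets


def generate_api_key(prefix="sk-", nbytes=32):
    """Return a random URL-safe key built from `nbytes` of OS-level randomness."""
    return prefix + secrets.token_urlsafe(nbytes)


def keys_match(presented, expected):
    """Compare two keys in constant time to avoid leaking timing information."""
    return secrets.compare_digest(presented.encode(), expected.encode())
```

    Pass the generated value to --api-key when launching the server and store it in your secret manager rather than in shell history.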
  7. Step 7

    Docker-based API Server Deployment

    Running the API server in Docker provides isolation, reproducibility, and simplified deployment. This is the recommended production setup, especially when using orchestration platforms like Kubernetes.

    # Run vLLM API server with Docker
    docker run -d \
      --name vllm-server \
      --gpus all \
      -p 8000:8000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      -e HUGGING_FACE_HUB_TOKEN=<your-hf-token> \
      vllm/vllm-openai:latest \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --host 0.0.0.0 \
      --port 8000 \
      --gpu-memory-utilization 0.9
    
    # Check server logs
    docker logs -f vllm-server
    
    # Test the server
    curl http://localhost:8000/v1/models
  8. Step 8

    Query the API Server

    Once the server is running, you can query it using curl, the OpenAI Python SDK, or any HTTP client. The server implements OpenAI's Completions and Chat Completions endpoints, so most existing clients work after changing only the base URL.

    # Test with curl - chat completions
    # (if the server was started with --served-model-name, use that name as "model")
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer sk-your-secret-key" \
      -d '{
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Explain PagedAttention"}
        ],
        "temperature": 0.7,
        "max_tokens": 500
      }'
    
    # List available models
    curl http://localhost:8000/v1/models
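    The same request can be issued from Python without any third-party package. The sketch below uses only the standard library; the helper names are illustrative, and the URL, key, and model name are the placeholders from the curl example.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, user_message, max_tokens=500):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

    With a server running, `send(build_chat_request("http://localhost:8000/v1", "sk-your-secret-key", "meta-llama/Llama-3.2-3B-Instruct", "Explain PagedAttention"))` returns the generated text.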
  9. Step 9

    Use OpenAI Python SDK with vLLM

    The easiest way to integrate vLLM into existing applications is via the OpenAI Python SDK. Simply change the base_url to point to your vLLM server - no other code changes required.

    from openai import OpenAI
    
    # Configure client to use vLLM server
    client = OpenAI(
        api_key="sk-your-secret-key",
        base_url="http://localhost:8000/v1"
    )
    
    # Chat completions (streaming)
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a haiku about AI"}
        ],
        stream=True,
        temperature=0.7
    )
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
    
    print()
  10. Step 10

    Multi-GPU Deployment with Tensor Parallelism

    For larger models that don't fit in a single GPU, vLLM supports tensor parallelism to distribute the model across multiple GPUs. This enables serving models like Llama-70B or Mixtral-8x7B efficiently.

    # Launch with 4-way tensor parallelism
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-70B-Instruct \
      --tensor-parallel-size 4 \
      --host 0.0.0.0 \
      --port 8000
    
    # Docker deployment with multi-GPU
    # (--ipc=host gives the worker processes access to host shared memory)
    docker run -d \
      --gpus '"device=0,1,2,3"' \
      --ipc=host \
      -p 8000:8000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      vllm/vllm-openai:latest \
      --model meta-llama/Llama-3.1-70B-Instruct \
      --tensor-parallel-size 4
    
    # Check GPU utilization
    watch -n 1 nvidia-smi
    ⚠ Heads up: Tensor parallelism works best when all GPUs sit on the same node with fast interconnects (NVLink recommended). To spread a model across nodes, add pipeline parallelism with --pipeline-parallel-size.
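    A quick back-of-the-envelope calculation shows why a 70B model needs several GPUs. The sketch below counts weights only and assumes they are split evenly across the tensor-parallel group; KV cache, activations, and runtime overhead come on top.

```python
def weight_gb_per_gpu(num_params_billion, bytes_per_param=2, tensor_parallel_size=1):
    """Approximate weight memory per GPU in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 corresponds to FP16/BF16 weights.
    """
    total_gb = num_params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return total_gb / tensor_parallel_size


# 70B parameters in 16-bit precision
single_gpu = weight_gb_per_gpu(70)                         # 140.0 GB
four_way = weight_gb_per_gpu(70, tensor_parallel_size=4)   # 35.0 GB per GPU
print(single_gpu, four_way)
```

    At roughly 140 GB, the weights alone exceed any single 80 GB GPU, while a 4-way split leaves about 35 GB of weights per device and room for the KV cache.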
  11. Step 11

    Model Quantization for Memory Efficiency

    vLLM supports various quantization methods to reduce memory footprint and increase throughput. Quantization can reduce model size by 50-75% with minimal quality loss, enabling larger batch sizes.

    # AWQ 4-bit quantization (pre-quantized model)
    python -m vllm.entrypoints.openai.api_server \
      --model TheBloke/Llama-2-70B-AWQ \
      --quantization awq \
      --dtype half
    
    # GPTQ 4-bit quantization
    python -m vllm.entrypoints.openai.api_server \
      --model TheBloke/Llama-2-70B-GPTQ \
      --quantization gptq
    
    # FP8 quantization (NVIDIA H100/Ada GPUs)
    python -m vllm.entrypoints.openai.api_server \
      --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
      --quantization fp8
    
    # Check GPU memory usage while a model is loaded
    # (includes the pre-allocated KV cache, not only the weights)
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv
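    The 50-75% figure follows directly from the bit widths involved. The sketch below is idealized arithmetic: real quantized checkpoints also store scales and zero points and often keep some layers in higher precision, so the actual saving is somewhat smaller.

```python
def size_reduction_percent(baseline_bits, quantized_bits):
    """Ideal reduction in weight size when moving between bit widths."""
    return 100.0 * (1 - quantized_bits / baseline_bits)


def weight_gb(num_params_billion, bits_per_param):
    """Ideal weight size in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * bits_per_param / 8


print(size_reduction_percent(16, 8))  # 16-bit -> 8-bit (e.g. FP8)
print(size_reduction_percent(16, 4))  # 16-bit -> 4-bit (e.g. AWQ, GPTQ)
print(weight_gb(70, 4))               # a 70B model at 4 bits per weight
```

    Under these assumptions, 8-bit formats halve the weights and 4-bit formats cut them by three quarters, bringing a 70B model from about 140 GB to about 35 GB.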
  12. Step 12

    Advanced Configuration Options

    Fine-tune vLLM behavior with advanced parameters for memory management, scheduling, and performance optimization. These settings can dramatically impact throughput and latency based on your workload.

    # Full production configuration example
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --host 0.0.0.0 \
      --port 8000 \
      --gpu-memory-utilization 0.95 \
      --max-model-len 8192 \
      --max-num-seqs 512 \
      --max-num-batched-tokens 8192 \
      --enable-prefix-caching \
      --disable-log-requests \
      --trust-remote-code \
      --dtype bfloat16 \
      --kv-cache-dtype auto \
      --max-parallel-loading-workers 4
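    To reason about --gpu-memory-utilization, --max-model-len, and --max-num-seqs together, it helps to estimate how much KV cache one token consumes. The sketch below uses the standard per-token formula; the architecture numbers (28 layers, 8 KV heads, head dimension 128) are assumptions for illustration and should be read from your model's config.json.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache needed for one token.

    The factor 2 accounts for the separate key and value tensors per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_gib, bytes_per_token):
    """How many tokens fit in a KV cache budget given in GiB."""
    return int(kv_budget_gib * 1024**3 // bytes_per_token)


per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128)
print(per_token)                         # bytes per token in 16-bit precision
print(max_cached_tokens(10, per_token))  # tokens that fit in a 10 GiB budget
```

    Dividing the token capacity by --max-model-len gives a rough upper bound on how many full-length sequences can be resident at once, which is a useful sanity check on --max-num-seqs.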
  13. Step 13

    Configure Sampling Parameters

    Control generation behavior with sampling parameters that affect output quality, diversity, and speed. These can be set per-request or as server defaults.

    from vllm import SamplingParams
    
    # Deterministic sampling (temperature=0)
    sampling_params = SamplingParams(
        temperature=0.0,
        top_p=1.0,
        max_tokens=1024
    )
    
    # Creative sampling
    sampling_params = SamplingParams(
        temperature=0.9,
        top_p=0.95,
        top_k=50,
        max_tokens=2048,
        presence_penalty=0.1,
        frequency_penalty=0.1
    )
    
    # Constrained generation with stop sequences
    sampling_params = SamplingParams(
        temperature=0.7,
        max_tokens=512,
        stop=["\n\n", "```", "END"],
        skip_special_tokens=True
    )
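    To make the effect of temperature and top_p concrete, here is a toy illustration on a hand-written distribution. It is a sketch of the general technique, not vLLM's sampler, and it requires temperature > 0 (a temperature of 0 is handled as greedy decoding rather than by division).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Return indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.9))  # the least likely token is excluded
print(softmax_with_temperature([2.0, 1.0, 0.0], 0.5))
```

    Sampling then draws only from the kept indices after renormalizing, which is why a lower top_p or temperature yields more conservative output.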
  14. Step 14

    Understanding PagedAttention

    PagedAttention is vLLM's core innovation: it manages KV cache memory in fixed-size, non-contiguous blocks (like virtual memory paging) instead of pre-allocating one large contiguous chunk per request. Blocks are allocated on demand as a sequence grows, which nearly eliminates memory fragmentation and over-reservation. The vLLM paper reports 2-4x higher throughput than earlier serving systems such as FasterTransformer and Orca at comparable latency. Combined with continuous batching, vLLM keeps the GPU busy by admitting new requests into the running batch as soon as others complete.

    # Monitor PagedAttention block allocation
    # The server logs report KV cache usage periodically;
    # VLLM_LOGGING_LEVEL=DEBUG makes the logs more verbose
    VLLM_LOGGING_LEVEL=DEBUG python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --block-size 16 \
      --num-gpu-blocks-override 2048
    
    # Key metrics to watch:
    # - GPU blocks used / total
    # - Number of running sequences
    # - Batched token throughput
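    The paging idea can be illustrated with a toy allocator. This is a simplified sketch of block-table bookkeeping, not vLLM's actual implementation; the class and method names are hypothetical.

```python
class ToyBlockAllocator:
    """Tracks which fixed-size physical blocks hold each sequence's KV cache."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # every owned block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = ToyBlockAllocator(num_blocks=4, block_size=16)
for _ in range(16):
    alloc.append_token("a")   # fills one block
alloc.append_token("b")       # a second sequence takes the next block
alloc.append_token("a")       # "a" grows into a non-adjacent block
print(alloc.block_tables)
```

    Sequence "a" ends up with two blocks that are not adjacent in physical memory, and no memory is reserved for tokens that have not been generated yet. That is the property that lets many sequences share a fixed KV cache pool.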
  15. Step 15

    Production Deployment Best Practices

    For production deployments, implement proper security, monitoring, and scaling strategies. Use a reverse proxy for TLS termination, implement request rate limiting, and monitor GPU utilization and error rates.

    # NGINX reverse proxy configuration
    server {
        listen 443 ssl http2;
        server_name api.yourdomain.com;
    
        ssl_certificate /path/to/cert.pem;
        ssl_certificate_key /path/to/key.pem;
    
        location / {
            proxy_pass http://localhost:8000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            
            # Increase timeout for long generations
            proxy_read_timeout 300s;
            proxy_connect_timeout 75s;

            # Forward streamed tokens immediately instead of buffering them
            proxy_buffering off;
        }
    }
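    Rate limiting is usually enforced in the proxy or API gateway, but the underlying policy is easy to see in code. The following is a minimal token-bucket sketch with hypothetical names, assuming one bucket per API key; it is an illustration of the policy rather than a production limiter.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock            # injectable so the logic can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

    A gateway in front of vLLM would call allow() once per incoming request and answer with HTTP 429 when it returns False.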
  16. Step 16

    Monitoring and Observability

    vLLM exposes Prometheus metrics for monitoring server health, request latency, GPU utilization, and throughput. Integrate with your existing observability stack for production monitoring.

    # The /metrics endpoint is served by default; these flags only reduce log noise
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --disable-log-requests \
      --uvicorn-log-level warning
    
    # Metrics available at http://localhost:8000/metrics
    curl http://localhost:8000/metrics
    
    # Key metrics:
    # - vllm:num_requests_running
    # - vllm:num_requests_waiting
    # - vllm:gpu_cache_usage_perc
    # - vllm:time_to_first_token_seconds
    # - vllm:time_per_output_token_seconds
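    For quick scripts that do not run a full Prometheus stack, the metrics text can be parsed directly. The sketch below handles the simple "name{labels} value" line format and assumes no trailing timestamps; the sample values in the example are made up for illustration.

```python
def parse_prometheus_text(text):
    """Return {"metric{labels}": value} from Prometheus text exposition output."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser does not understand
    return samples


def first_value(samples, metric_name):
    """Return the first sample whose key starts with metric_name, or None."""
    for key, value in samples.items():
        if key.startswith(metric_name):
            return value
    return None


sample = """\
# HELP vllm:num_requests_running Number of requests currently running
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.25
"""
metrics = parse_prometheus_text(sample)
print(first_value(metrics, "vllm:gpu_cache_usage_perc"))
```

    In practice you would feed it the body returned by the /metrics endpoint and alert when the waiting-requests or cache-usage values stay high.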
  17. Step 17

    Benchmarking Performance

    Measure your vLLM deployment's throughput and latency with a simple load test. The results help tune configuration parameters for your specific hardware and workload. The vLLM repository also includes dedicated benchmark scripts in its benchmarks/ directory.

    # Install ApacheBench (Debian/Ubuntu) for load testing
    sudo apt-get install -y apache2-utils
    
    # Start the server in the background
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.2-3B-Instruct &
    
    # Wait for server startup
    until curl -sf http://localhost:8000/health > /dev/null; do sleep 5; done
    
    # Write the request body that ab will POST
    echo '{"model": "meta-llama/Llama-3.2-3B-Instruct", "prompt": "Once upon a time", "max_tokens": 100}' > request.json
    
    # Send a single request as a smoke test
    curl -X POST http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d @request.json
    
    # Load test: 1000 requests, 10 concurrent
    ab -n 1000 -c 10 -p request.json \
      -T application/json \
      http://localhost:8000/v1/completions
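    If you collect per-request latencies yourself, a few lines of Python turn them into the usual summary figures. This is a sketch with hypothetical names; it uses nearest-rank percentiles and assumes every request generated the same number of tokens.

```python
import math
import statistics


def summarize(latencies_s, tokens_per_request, wall_time_s):
    """Summarize a load test from per-request latencies (in seconds)."""
    ordered = sorted(latencies_s)

    def percentile(p):
        # Nearest-rank method: the smallest value with at least p% of samples at or below it
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "requests_per_s": len(ordered) / wall_time_s,
        "tokens_per_s": len(ordered) * tokens_per_request / wall_time_s,
        "mean_s": statistics.mean(ordered),
        "p50_s": percentile(50),
        "p95_s": percentile(95),
    }


report = summarize([1.0, 2.0, 3.0, 4.0], tokens_per_request=100, wall_time_s=4.0)
print(report)
```

    Comparing these numbers before and after changing flags such as --max-num-seqs or --gpu-memory-utilization shows whether a setting helps your workload.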
  18. Step 18

    Common Use Cases and Patterns

    vLLM excels at various LLM serving scenarios. Here are practical examples for common production patterns including batch processing, real-time chat, and embeddings generation.

    # Use Case 1: Batch document summarization
    from vllm import LLM, SamplingParams
    
    llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
    documents = [...]  # Load your documents
    prompts = [f"Summarize: {doc}" for doc in documents]
    outputs = llm.generate(prompts, SamplingParams(max_tokens=200))
    
    # Use Case 2: Real-time streaming chat
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-key")
    for chunk in client.chat.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}],
        stream=True
    ):
        print(chunk.choices[0].delta.content or "", end="")
    
    # Use Case 3: Embeddings generation
    # (the server must be serving an embedding model, such as the one named below)
    response = client.embeddings.create(
        model="BAAI/bge-large-en-v1.5",
        input="Your text to embed"
    )
    vector = response.data[0].embedding
  19. Step 19

    Troubleshooting Common Issues

    Solutions to frequently encountered problems when deploying vLLM. Most issues relate to GPU memory, CUDA compatibility, or model loading errors.

    # Issue: CUDA out of memory
    # Solution: Reduce gpu-memory-utilization or max-num-seqs
    python -m vllm.entrypoints.openai.api_server \
      --model <model> \
      --gpu-memory-utilization 0.8 \
      --max-num-seqs 128
    
    # Issue: Model download fails
    # Solution: Set Hugging Face token
    export HUGGING_FACE_HUB_TOKEN=<your-token>
    
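    # Issue: Slow server startup
    # Cause: vLLM loads the weights and captures CUDA graphs before it accepts requests
    # Solution: Wait for startup to finish, or pass --enforce-eager to skip
    # CUDA graph capture (faster startup, lower throughput)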
    
    # Issue: CUDA version mismatch
    # Solution: Check CUDA version and install matching vLLM build
    nvcc --version
    pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
    
    # Enable debug logging
    VLLM_LOGGING_LEVEL=DEBUG python -m vllm.entrypoints.openai.api_server \
      --model <model>
    ⚠ Heads up: If you encounter 'ImportError: cannot import name X from vllm', your installation may be corrupted. Uninstall with pip uninstall vllm -y and reinstall from scratch.
  20. Step 20

    Updating vLLM

    vLLM is actively developed with frequent releases adding new features, model support, and performance improvements. Stay updated to benefit from the latest optimizations and bug fixes.

    # Update via pip
    pip install --upgrade vllm
    
    # Update Docker image
    docker pull vllm/vllm-openai:latest
    
    # Update from source
    cd vllm
    git pull origin main
    pip install -e .
    
    # Check current version
    python -c "import vllm; print(vllm.__version__)"
    
    # View changelog
    curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | grep body
