DeepSeek Open Infrastructure Index: Production AI Infrastructure Toolkit
A comprehensive collection of production-tested AI infrastructure tools for AGI development, including GPU kernels, distributed systems, parallel file systems, and MoE training components.
- Step 1
What is Open Infrastructure Index?
DeepSeek's Open Infrastructure Index is a curated collection of production-tested AI infrastructure components that power DeepSeek's online services. The tools are released as open source, with an emphasis on community-driven development. The index repository is licensed under CC0-1.0 (public domain dedication); each component repository carries its own license file, so check it before commercial or research use.
The V3/R1 inference system is reported to reach 73.7k input and 14.8k output tokens per second per H800 node, which indicates how these components perform together in production.
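To put the per-node figures in per-GPU terms, the short sketch below divides them by the GPU count. The count of 8 GPUs per node is an assumption that this guide does not state, so adjust it to your hardware.

```python
# Figures quoted above, in tokens per second per H800 node.
INPUT_TOKENS_PER_NODE = 73_700
OUTPUT_TOKENS_PER_NODE = 14_800

# Assumption (not stated in this guide): 8 GPUs per node.
GPUS_PER_NODE = 8


def per_gpu(tokens_per_node: float, gpus_per_node: int = GPUS_PER_NODE) -> float:
    """Average tokens per second handled by one GPU in the node."""
    return tokens_per_node / gpus_per_node


if __name__ == "__main__":
    print(f"input : {per_gpu(INPUT_TOKENS_PER_NODE):,.1f} tokens/s per GPU")
    print(f"output: {per_gpu(OUTPUT_TOKENS_PER_NODE):,.1f} tokens/s per GPU")
```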
- Step 2
Core components overview
The Open Infrastructure Index consists of seven major components, each solving specific challenges in AI infrastructure:
FlashMLA: Optimized decoding kernel for Hopper GPUs with BF16 support and paged KV cache functionality (block size 64).
DeepEP: The first open-source expert-parallel (EP) communication library for MoE model training and inference, supporting both intra-node (NVLink) and inter-node (RDMA) communication.
DeepGEMM: FP8 GEMM library supporting dense and MoE operations, reported to exceed 1350 FP8 TFLOPS on Hopper GPUs.
DualPipe: Bidirectional pipeline parallelism algorithm for computation-communication overlap in model training.
EPLB: Expert-parallel load balancer for mixture-of-experts models.
3FS (Fire-Flyer File System): Parallel file system achieving 6.6 TiB/s aggregate read throughput, supporting training data, checkpoints, and vector search.
Smallpond: Lightweight data processing framework built on top of 3FS and DuckDB for petabyte-scale dataset processing.
- Step 3
Technology stack
The components target the following hardware and software:
Hardware requirements:
- NVIDIA Hopper GPUs (H800, SM90) or newer (SM100)
- High-bandwidth NVLink for GPU-to-GPU communication
- RDMA-capable networks (200-400Gbps InfiniBand demonstrated)
- Thousands of SSDs for distributed storage (3FS)
- Multi-NUMA domain configurations
Software stack:
- Programming Languages: CUDA, C++20, Python 3.8+, Rust 1.75+
- Deep Learning: PyTorch 2.0+, NCCL 2.30.4+
- CUDA Toolkit: 12.3+ (12.9+ recommended for SM100)
- Build Systems: CMake, Python setuptools
- Libraries: CUTLASS 4.0+, {fmt}, NVSHMEM
- Database: DuckDB (for Smallpond), FoundationDB 7.1+ (for 3FS)
- Other: libfuse 3.16.1+, various compression and SSL libraries
Key innovations:
- JIT (Just-In-Time) compilation for CUDA kernels at runtime
- FP8 precision support for improved throughput
- Disaggregated storage architecture
- Computation-communication overlap techniques
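The JIT idea in the list above can be pictured as a compile-on-first-use cache. The sketch below is a plain-Python stand-in: `KernelCache` and its methods are hypothetical names for illustration, and the "compile" step only builds a Python closure, whereas the real libraries generate and build GPU code.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]
Kernel = Callable[[Matrix], Matrix]


class KernelCache:
    """Compile-on-first-use cache keyed by (kernel name, shape)."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, Tuple[int, int]], Kernel] = {}
        self.compiles = 0  # how many times the "compiler" actually ran

    def _compile(self, name: str, shape: Tuple[int, int]) -> Kernel:
        # Stand-in for an expensive compile step; a real system would
        # generate and build GPU code here.
        self.compiles += 1
        rows, cols = shape

        def kernel(x: Matrix) -> Matrix:
            # The shape constants are baked into the specialised function.
            return [[2.0 * x[i][j] for j in range(cols)] for i in range(rows)]

        return kernel

    def get(self, name: str, shape: Tuple[int, int]) -> Kernel:
        """Return a cached kernel, compiling it only on the first request."""
        key = (name, shape)
        if key not in self._cache:
            self._cache[key] = self._compile(name, shape)
        return self._cache[key]
```

The first call for a given shape pays the compile cost; later calls with the same key reuse the cached function.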
Hardware:
├── NVIDIA Hopper GPUs (SM90) or newer (SM100)
├── NVLink interconnect
├── RDMA networks (InfiniBand)
└── Distributed SSD storage
Software:
├── CUDA 12.3+ / 12.9+
├── PyTorch 2.0+
├── C++20 / CUDA / Python 3.8+ / Rust 1.75+
├── CMake build system
├── NCCL 2.30.4+
├── CUTLASS 4.0+
└── DuckDB / FoundationDB
- Step 4
FlashMLA: Efficient MLA decoding kernel
FlashMLA provides optimized attention kernels for Hopper GPUs, supporting Multi-Query Attention (MQA) and Multi-Head Attention (MHA) with sparse and dense operations.
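The overview mentions a paged KV cache with block size 64. As a rough illustration of what paging means, the sketch below maps a token position to a physical cache block through a block table. The function names and the example block table are hypothetical and are not part of FlashMLA's API.

```python
from typing import List, Tuple

BLOCK_SIZE = 64  # block size quoted in the component overview


def locate(token_pos: int, block_table: List[int],
           block_size: int = BLOCK_SIZE) -> Tuple[int, int]:
    """Map a token position to (physical block id, offset inside the block)."""
    logical_block, offset = divmod(token_pos, block_size)
    return block_table[logical_block], offset


def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of cache blocks required to hold seq_len tokens (ceiling division)."""
    return -(-seq_len // block_size)


if __name__ == "__main__":
    # Hypothetical block table: logical blocks 0, 1, 2 live in physical blocks 7, 2, 9.
    table = [7, 2, 9]
    print(locate(130, table))
    print(blocks_needed(130))
```

Because sequences only hold references to fixed-size blocks, memory can be allocated block by block as a sequence grows instead of reserving the maximum length up front.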
Requirements:
- GPU: SM90 or SM100 architecture
- CUDA: 12.8+ (12.9+ for SM100)
- PyTorch: 2.0+
- Languages: C++ (49%), CUDA (39%), Python (12%)
# Clone and install FlashMLA
git clone https://github.com/deepseek-ai/FlashMLA.git flash-mla
cd flash-mla
git submodule update --init --recursive
pip install -v .
# The installation compiles CUDA kernels for optimized attention
- Step 5
DeepEP: Expert-parallel communication library
DeepEP is the first open-source expert-parallel communication library designed for efficient MoE model operations. All kernels compile at runtime via JIT, requiring no CUDA compilation during installation.
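Expert-parallel communication follows a dispatch, compute, combine pattern. The single-process sketch below simulates it with top-1 routing and a toy "expert" that scales its input. All function names are hypothetical; DeepEP performs the two exchange steps with GPU all-to-all kernels over NVLink and RDMA rather than Python dictionaries.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def dispatch(values: List[float], expert_ids: List[int],
             experts_per_rank: int) -> Dict[int, List[Tuple[int, int, float]]]:
    """Group tokens by the rank that hosts their chosen expert."""
    per_rank: Dict[int, List[Tuple[int, int, float]]] = defaultdict(list)
    for idx, (value, expert) in enumerate(zip(values, expert_ids)):
        per_rank[expert // experts_per_rank].append((idx, expert, value))
    return dict(per_rank)


def run_experts(per_rank: Dict[int, List[Tuple[int, int, float]]]
                ) -> Dict[int, List[Tuple[int, float]]]:
    """Toy expert computation: expert e multiplies its input by (e + 1)."""
    return {rank: [(idx, value * (expert + 1)) for idx, expert, value in items]
            for rank, items in per_rank.items()}


def combine(results: Dict[int, List[Tuple[int, float]]],
            num_tokens: int) -> List[float]:
    """Return expert outputs to their original token positions."""
    out = [0.0] * num_tokens
    for items in results.values():
        for idx, y in items:
            out[idx] = y
    return out


if __name__ == "__main__":
    sent = dispatch([1.0, 2.0, 3.0, 4.0], [3, 0, 2, 1], experts_per_rank=2)
    print(combine(run_experts(sent), num_tokens=4))
```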
Requirements:
- GPU: Hopper (SM90) or architectures with SM90 PTX ISA support
- Python: 3.8+
- PyTorch: 2.10+
- CUDA: 12.3+
- NCCL: 2.30.4+
- NVSHMEM (for legacy method support)
- Languages: CUDA (49.8%), C++ (25.1%), Python (24.1%)
# Install NCCL via pip for automatic library discovery
pip install "nvidia-nccl-cu13>=2.30.4" --no-deps

# Clone and build DeepEP
git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP
python setup.py build
python setup.py install

# Test the installation (the Python module is named deep_ep)
python -c "import deep_ep; print('DeepEP installed successfully')"
- Step 6
DeepGEMM: FP8 GEMM library
DeepGEMM delivers high-performance FP8 matrix operations for dense and MoE computations, with reported throughput above 1350 FP8 TFLOPS on Hopper GPUs. Like DeepEP, it compiles its kernels at runtime via JIT, so no CUDA kernels need to be compiled during installation.
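To get a feel for what the quoted throughput means, the sketch below counts the floating-point operations in a GEMM and derives a lower bound on its runtime at 1350 TFLOPS. The helper names are illustrative, and the bound ignores memory traffic, quantization overhead, and launch latency.

```python
PEAK_TFLOPS = 1350.0  # throughput figure quoted above


def gemm_flops(m: int, n: int, k: int) -> int:
    """An (m x k) by (k x n) GEMM performs m*n*k multiply-adds, i.e. 2*m*n*k FLOPs."""
    return 2 * m * n * k


def min_time_ms(m: int, n: int, k: int, tflops: float = PEAK_TFLOPS) -> float:
    """Lower bound on runtime in milliseconds at the given sustained throughput."""
    return gemm_flops(m, n, k) / (tflops * 1e12) * 1e3


if __name__ == "__main__":
    m = n = k = 4096
    print(f"{gemm_flops(m, n, k):,} FLOPs, at least {min_time_ms(m, n, k):.4f} ms")
```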
Requirements:
- GPU: SM90 or SM100 architecture
- CUDA: 12.3+ (12.9+ recommended)
- Python: 3.8+
- PyTorch: 2.1+
- C++20 compatible compiler
- CUTLASS: 4.0+
- {fmt} library
- Languages: CUDA (48.2%), C++ (35.2%), Python (16.2%)
# Clone with submodules
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM

# Development setup
./develop.sh

# Production installation
./install.sh

# Verify installation (the Python module is named deep_gemm)
python -c "import deep_gemm; print('DeepGEMM ready')"
- Step 7
DualPipe: Bidirectional pipeline parallelism
DualPipe implements bidirectional pipeline parallelism for efficient computation-communication overlap during model training. Designed for DeepSeek V3/R1 training workflows.
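The benefit of a pipeline schedule is often summarised by its bubble, the time stages spend idle. The sketch below compares two bubble estimates. The expressions are assumptions taken from the comparison table in the DualPipe README and the DeepSeek-V3 technical report, not from this guide, so verify them there before relying on the numbers. Here `f` is the forward time of one chunk, `b` the full backward time, `w` the weight-gradient part of backward, and `f_and_b` the time of a mutually overlapped forward and backward pair.

```python
def bubble_1f1b(pp: int, f: float, b: float) -> float:
    """Assumed 1F1B bubble: (PP - 1) * (F + B)."""
    return (pp - 1) * (f + b)


def bubble_dualpipe(pp: int, f_and_b: float, b: float, w: float) -> float:
    """Assumed DualPipe bubble: (PP / 2 - 1) * (F&B + B - 3W)."""
    if pp % 2 != 0:
        raise ValueError("this expression assumes an even number of pipeline stages")
    return (pp // 2 - 1) * (f_and_b + b - 3 * w)


if __name__ == "__main__":
    # Illustrative timings in arbitrary units; measure your own model instead.
    pp, f, b, w, f_and_b = 8, 1.0, 2.0, 1.0, 3.0
    print("1F1B     bubble:", bubble_1f1b(pp, f, b))
    print("DualPipe bubble:", bubble_dualpipe(pp, f_and_b, b, w))
```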
Requirements:
- PyTorch: 2.0+
- Languages: Python (100%)
Note: Real-world applications require implementing a custom overlapped_forward_backward method tailored to your specific module architecture.
# Clone DualPipe
git clone https://github.com/deepseek-ai/DualPipe.git
cd DualPipe

# Run example scripts
python examples/example_dualpipe.py
python examples/example_dualpipev.py

# For production use, customize the overlapped_forward_backward method
# in your training pipeline
- Step 8
EPLB: Expert-parallel load balancer
EPLB computes optimal expert replication and placement plans for GPU load balancing in mixture-of-experts models.
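The two ideas named above, replication and placement, can be illustrated with a simple greedy heuristic. This is a minimal sketch and not EPLB's actual algorithm: `replicate` and `place` are hypothetical names, and the sketch ignores constraints a real balancer handles, such as expert groups, node locality, and equal replica counts per GPU.

```python
import heapq
from typing import List, Tuple


def replicate(loads: List[float], num_replicas: int) -> List[int]:
    """Give each extra replica to the expert with the highest load per replica."""
    if num_replicas < len(loads):
        raise ValueError("need at least one replica per expert")
    counts = [1] * len(loads)
    for _ in range(num_replicas - len(loads)):
        hottest = max(range(len(loads)), key=lambda e: loads[e] / counts[e])
        counts[hottest] += 1
    return counts


def place(loads: List[float], counts: List[int],
          num_gpus: int) -> Tuple[List[List[int]], List[float]]:
    """Assign replicas, heaviest first, to the currently least-loaded GPU."""
    replicas = sorted(
        ((loads[e] / counts[e], e) for e in range(len(loads)) for _ in range(counts[e])),
        reverse=True,
    )
    heap = [(0.0, g) for g in range(num_gpus)]  # (accumulated load, gpu id)
    gpu_experts: List[List[int]] = [[] for _ in range(num_gpus)]
    gpu_loads = [0.0] * num_gpus
    for load, expert in replicas:
        total, gpu = heapq.heappop(heap)
        gpu_experts[gpu].append(expert)
        gpu_loads[gpu] = total + load
        heapq.heappush(heap, (gpu_loads[gpu], gpu))
    return gpu_experts, gpu_loads


if __name__ == "__main__":
    loads = [100.0, 20.0, 20.0, 20.0]  # one hot expert, three cold ones
    counts = replicate(loads, num_replicas=6)
    print(counts, place(loads, counts, num_gpus=2))
```

Without replication, whichever GPU hosts the hot expert carries a load of at least 100; splitting that expert across replicas lets the heaviest GPU end up below that.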
Requirements:
- PyTorch (version not explicitly specified)
- Languages: Python (100%)
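# Clone EPLB
git clone https://github.com/deepseek-ai/eplb.git
cd eplb

# Usage example (argument order follows the repository README; verify it against your checkout)
python << 'EOF'
import torch
import eplb

# Per-layer expert loads, shape [num_layers, num_experts]; replace with your own metrics
weight = torch.tensor([[90, 132, 40, 61], [20, 107, 104, 64]])

# Compute expert rebalancing plan: 8 physical replicas, 2 expert groups, 1 node, 4 GPUs
phy2log, log2phy, logcnt = eplb.rebalance_experts(weight, 8, 2, 1, 4)
print(phy2log)
EOF
- Step 9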
3FS: Fire-Flyer parallel file system
3FS is a high-performance parallel file system designed for AI workloads, achieving 6.6 TiB/s aggregate read throughput. It uses a disaggregated architecture combining thousands of SSDs across storage nodes with RDMA-capable networks.
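For capacity planning, the aggregate figure can be turned into rough read times and per-node bandwidth. In the sketch below the dataset size, node count, and efficiency factor are hypothetical inputs chosen for illustration; only the 6.6 TiB/s figure comes from the text above.

```python
AGGREGATE_READ_TIB_PER_S = 6.6  # aggregate read throughput quoted above


def read_time_seconds(dataset_tib: float,
                      throughput_tib_per_s: float = AGGREGATE_READ_TIB_PER_S,
                      efficiency: float = 1.0) -> float:
    """Time to read a dataset once, given a sustained fraction of peak throughput."""
    return dataset_tib / (throughput_tib_per_s * efficiency)


def per_node_gib_per_s(num_storage_nodes: int,
                       throughput_tib_per_s: float = AGGREGATE_READ_TIB_PER_S) -> float:
    """Average bandwidth each storage node must serve to reach the aggregate rate."""
    return throughput_tib_per_s * 1024 / num_storage_nodes


if __name__ == "__main__":
    # Hypothetical inputs: a 66 TiB dataset and a 100-node storage cluster.
    print(f"{read_time_seconds(66.0):.1f} s to read 66 TiB at peak")
    print(f"{per_node_gib_per_s(100):.1f} GiB/s per storage node")
```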
Requirements:
- System libraries: cmake, libuv1-dev, liblz4-dev, liblzma-dev, libdouble-conversion-dev
- Flags, logging, and testing libraries: libgflags-dev, libgoogle-glog-dev, libgtest-dev, libgmock-dev
- SSL and compression utilities
- FoundationDB: 7.1+
- libfuse: 3.16.1+
- Rust: 1.75.0+
- Compiler: clang-14 (primary), g++10 or g++11 (alternatives)
- OS: Ubuntu 20.04/22.04, openEuler, OpenCloudOS, TencentOS
- Languages: C++ (87%), Rust (4.4%), Python (2.1%)
# Install system dependencies (Ubuntu)
sudo apt-get update
sudo apt-get install -y cmake libuv1-dev liblz4-dev liblzma-dev \
  libdouble-conversion-dev libgflags-dev libgoogle-glog-dev \
  libgtest-dev libgmock-dev clang-14

# Clone 3FS
git clone https://github.com/deepseek-ai/3FS.git
cd 3FS
git submodule update --init --recursive

# Build with CMake
mkdir build && cd build
cmake .. -DCMAKE_C_COMPILER=clang-14 -DCMAKE_CXX_COMPILER=clang++-14
make -j$(nproc)

# Docker alternative for pre-configured environments
# docker pull <3fs-image>  # Check repository for official images
⚠ Heads up: 3FS requires significant infrastructure setup including FoundationDB cluster deployment, RDMA-capable networking, and distributed SSD storage. Refer to the official documentation for production deployment guidance.
- Step 10
Smallpond: Data processing on 3FS
Smallpond is a lightweight data processing framework built on top of 3FS and DuckDB, designed for petabyte-scale datasets without long-running services. It leverages DuckDB's query engine for local processing while integrating with 3FS for distributed storage.
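A common pattern in partitioned data processing is to hash-partition rows by key, aggregate each partition independently, and merge the results. The pure-Python sketch below shows that pattern only; the function names are hypothetical, it does not use Smallpond's or DuckDB's API, and in a real deployment each partition would be processed by a separate task reading from shared storage.

```python
import zlib
from typing import Dict, List, Tuple

Row = Dict[str, object]


def hash_partition(rows: List[Row], key: str, num_partitions: int) -> List[List[Row]]:
    """Send every row with the same key value to the same partition."""
    parts: List[List[Row]] = [[] for _ in range(num_partitions)]
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        parts[bucket].append(row)
    return parts


def min_max(partition: List[Row], key: str, value: str) -> Dict[object, Tuple[float, float]]:
    """Per-key (min, max) of a numeric column within one partition."""
    out: Dict[object, Tuple[float, float]] = {}
    for row in partition:
        v = float(row[value])
        lo, hi = out.get(row[key], (v, v))
        out[row[key]] = (min(lo, v), max(hi, v))
    return out


def run(rows: List[Row], key: str, value: str,
        num_partitions: int) -> Dict[object, Tuple[float, float]]:
    """Partition, aggregate each partition, then merge the disjoint results."""
    merged: Dict[object, Tuple[float, float]] = {}
    for part in hash_partition(rows, key, num_partitions):
        merged.update(min_max(part, key, value))  # keys never span partitions
    return merged


if __name__ == "__main__":
    data = [{"ticker": "A", "price": 1.0}, {"ticker": "B", "price": 5.0},
            {"ticker": "A", "price": 3.0}]
    print(run(data, "ticker", "price", num_partitions=3))
```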
Requirements:
- Python: 3.8 to 3.12
- DuckDB: high-performance analytical database engine
- 3FS: distributed file system backend
- Languages: Python (100%)
# Install Smallpond via pip
pip install smallpond

# For development
pip install smallpond[dev]

# For building documentation
pip install smallpond[docs]

# Run unit tests
pytest

# Basic usage example
python << 'EOF'
import smallpond
# Configure connection to 3FS
# Process large-scale datasets using DuckDB queries
# Leverage distributed storage for scalability
EOF
- Step 11
Integration and workflow
The Open Infrastructure Index components work together to create a complete AI training and inference pipeline:
Training workflow:
- Store training data in 3FS for high-throughput access
- Use Smallpond to preprocess and prepare datasets
- Leverage DualPipe for efficient pipeline parallelism
- Apply DeepEP for expert-parallel communication in MoE models
- Use EPLB to balance expert loads across GPUs
- Accelerate matrix operations with DeepGEMM FP8 kernels
Inference workflow:
- Load model checkpoints from 3FS
- Use FlashMLA for optimized attention operations
- Apply DeepGEMM for fast FP8 inference
- Leverage DeepEP for distributed MoE inference
Performance considerations:
- FP8 precision trades minimal accuracy for significant throughput gains
- JIT compilation allows kernel optimization at runtime
- Disaggregated storage separates compute and storage scaling
- RDMA networks minimize communication overhead
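# Example: Combining components in one MoE training step.
# Illustrative outline only: helpers marked "hypothetical" are placeholders,
# not functions exported by these libraries. See each repository's README
# for the real entry points.
import torch
import deep_ep    # DeepEP Python module
import deep_gemm  # DeepGEMM Python module
import eplb

# 1. Recompute the expert placement from observed per-expert loads (EPLB).
#    expert_loads: [num_layers, num_experts] tensor collected by your trainer.
phy2log, log2phy, logcnt = eplb.rebalance_experts(
    expert_loads, num_replicas, num_groups, num_nodes, num_gpus
)

# 2. Dispatch tokens to the ranks that host their experts (DeepEP).
expert_input = ep_dispatch(hidden_states, topk_idx)    # hypothetical wrapper

# 3. Run the expert MLPs with FP8 GEMMs (DeepGEMM). torch.autocast has no
#    FP8 mode, so FP8 tensors and their scales are passed explicitly.
expert_output = fp8_expert_mlp(expert_input, weights)  # hypothetical wrapper

# 4. Combine expert outputs back onto the originating ranks (DeepEP).
output = ep_combine(expert_output)                     # hypothetical wrapper
- Step 12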
Hardware and deployment considerations
Successfully deploying the Open Infrastructure Index requires careful hardware planning:
GPU requirements:
- NVIDIA Hopper (H800) or newer GPUs are mandatory for most components
- SM90 architecture minimum; SM100 supported with CUDA 12.9+
- Multiple GPUs per node connected via NVLink for optimal performance
Network requirements:
- 200-400Gbps InfiniBand or equivalent RDMA-capable networking
- Low-latency interconnects for inter-node communication
- Dedicated storage network for 3FS (optional but recommended)
Storage requirements:
- Thousands of SSDs for 3FS deployment (production scale)
- High IOPS and bandwidth capabilities
- Multi-NUMA domain configurations for optimal access patterns
Software environment:
- Linux distribution: Ubuntu 20.04/22.04, openEuler, OpenCloudOS, or TencentOS
- CUDA driver matching toolkit version
- NCCL properly configured for multi-node communication
- FoundationDB cluster for 3FS metadata management
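The GPU and CUDA constraints listed above can be checked before installation. The sketch below is a minimal preflight check on values you supply yourself (for example from `nvidia-smi` and `nvcc --version`); the function names are hypothetical, and the thresholds simply restate the SM90 minimum and the CUDA 12.3+/12.9+ figures from this guide.

```python
from typing import List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '12.9' into (12, 9) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def check_environment(compute_capability: Tuple[int, int], cuda_version: str) -> List[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems: List[str] = []
    if compute_capability < (9, 0):
        problems.append("GPU is older than SM90 (Hopper)")
    # SM100 needs the newer toolkit; SM90 works from CUDA 12.3.
    required = "12.9" if compute_capability >= (10, 0) else "12.3"
    if parse_version(cuda_version) < parse_version(required):
        problems.append(f"CUDA {cuda_version} is older than the required {required}")
    return problems


if __name__ == "__main__":
    for cc, cuda in [((9, 0), "12.3"), ((8, 0), "12.3"), ((10, 0), "12.4")]:
        print(cc, cuda, check_environment(cc, cuda) or "ok")
```

Comparing versions as integer tuples avoids the string-comparison trap where "12.10" would sort before "12.9".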
Cloud deployment: Major cloud providers offer Hopper GPU instances (H100, H200) that meet the GPU requirements. Matching the performance DeepSeek reports, however, also depends on the networking and storage configuration, which is easiest to control on bare-metal deployments.
⚠ Heads up: The Open Infrastructure Index is designed for large-scale AI infrastructure. Small-scale deployments may not realize the full performance benefits and may be better served by simpler alternatives.
- Step 13
Academic papers and documentation
DeepSeek has published academic papers detailing the design and performance of their infrastructure:
SC24 (Supercomputing 2024): Papers on distributed training systems and storage architectures
ISCA25 (International Symposium on Computer Architecture 2025): Papers on GPU kernel optimization and memory systems
These publications provide deep technical insights into the design decisions, performance characteristics, and optimization techniques used throughout the infrastructure stack.
Key publications:
- SC24: Distributed systems and storage
- ISCA25: GPU optimization and memory
- Step 14
Resources and community
Main repository: https://github.com/deepseek-ai/open-infra-index
Individual component repositories:
- FlashMLA: https://github.com/deepseek-ai/FlashMLA
- DeepEP: https://github.com/deepseek-ai/DeepEP
- DeepGEMM: https://github.com/deepseek-ai/DeepGEMM
- DualPipe: https://github.com/deepseek-ai/DualPipe
- EPLB: https://github.com/deepseek-ai/eplb
- 3FS: https://github.com/deepseek-ai/3FS
- Smallpond: https://github.com/deepseek-ai/smallpond
License: CC0-1.0 (public domain dedication)
Language support: Primary documentation in English; Chinese language documentation available
Community: Each repository has its own issue tracker and contribution guidelines
Main: github.com/deepseek-ai/open-infra-index
Components:
├── FlashMLA (GPU kernels)
├── DeepEP (expert-parallel)
├── DeepGEMM (FP8 GEMM)
├── DualPipe (pipeline parallelism)
├── EPLB (load balancing)
├── 3FS (file system)
└── Smallpond (data processing)
License: CC0-1.0 (public domain)