Advanced | ai, infrastructure, gpu, cuda, distributed-systems, moe, deepseek, pytorch, hopper, fp8, parallel-computing

DeepSeek Open Infrastructure Index: Production AI Infrastructure Toolkit

A comprehensive collection of production-tested AI infrastructure tools for AGI development, including GPU kernels, distributed systems, parallel file systems, and MoE training components.

Step 1
What is Open Infrastructure Index?
DeepSeek's Open Infrastructure Index is a curated collection of production-tested AI infrastructure components that power DeepSeek's online services. The tools are released openly to support community-driven development. The index repository is published under CC0-1.0 (a public domain dedication); each component lives in its own repository, so check that repository's license before commercial or research use.

The V3/R1 inference system built on these components reports a throughput of 73.7k input tokens and 14.8k output tokens per second per H800 node.
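To put the per-node figures in perspective, the short calculation below breaks them down per GPU and extrapolates to a day. It is a back-of-the-envelope sketch: the value of 8 GPUs per node is an assumption made here for illustration (the text does not state it), and the daily figure assumes the quoted rate is sustained, which a real service would not guarantee.

```python
# Figures quoted in the text (tokens per second per H800 node)
INPUT_TOKENS_PER_SEC = 73_700
OUTPUT_TOKENS_PER_SEC = 14_800

# Assumption for illustration only: 8 GPUs per node (not stated in the text)
GPUS_PER_NODE = 8
SECONDS_PER_DAY = 24 * 60 * 60


def per_gpu(tokens_per_sec_per_node: float, gpus: int = GPUS_PER_NODE) -> float:
    """Split a per-node rate evenly across the GPUs in the node."""
    return tokens_per_sec_per_node / gpus


def per_day(tokens_per_sec: float) -> float:
    """Extrapolate a rate to tokens per day, assuming it is sustained."""
    return tokens_per_sec * SECONDS_PER_DAY


print(f"input per GPU:  {per_gpu(INPUT_TOKENS_PER_SEC):,.1f} tokens/s")
print(f"output per GPU: {per_gpu(OUTPUT_TOKENS_PER_SEC):,.1f} tokens/s")
print(f"output per node per day: {per_day(OUTPUT_TOKENS_PER_SEC) / 1e9:.2f} billion tokens")
```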
Step 2
Core components overview
The Open Infrastructure Index consists of seven major components, each solving specific challenges in AI infrastructure:

FlashMLA: Optimized decoding kernel for Hopper GPUs with BF16 support and paged KV cache functionality (block size 64).

DeepEP: The first open-source expert-parallel (EP) communication library for MoE model training and inference, supporting both intra-node (NVLink) and inter-node (RDMA) communication.

DeepGEMM: FP8 GEMM library supporting dense and MoE operations, achieving 1350+ FP8 TFLOPS on Hopper GPUs.

DualPipe: Bidirectional pipeline parallelism algorithm for computation-communication overlap in model training.

EPLB: Expert-parallel load balancer for mixture-of-experts models.

3FS (Fire-Flyer File System): Parallel file system achieving 6.6 TiB/s aggregate read throughput, supporting training data, checkpoints, and vector search.

Smallpond: Lightweight data processing framework built on top of 3FS and DuckDB for petabyte-scale dataset processing.
Step 3
Technology stack
The Open Infrastructure Index components target the following hardware and software:

Hardware requirements:
- NVIDIA Hopper GPUs (H800, SM90) or newer (SM100)
- High-bandwidth NVLink for GPU-to-GPU communication
- RDMA-capable networks (200-400Gbps InfiniBand demonstrated)
- Thousands of SSDs for distributed storage (3FS)
- Multi-NUMA domain configurations
Software stack:
- Programming Languages: CUDA, C++20, Python 3.8+, Rust 1.75+
- Deep Learning: PyTorch 2.0+, NCCL 2.30.4+
- CUDA Toolkit: 12.3+ (12.9+ recommended for SM100)
- Build Systems: CMake, Python setuptools
- Libraries: CUTLASS 4.0+, {fmt}, NVSHMEM
- Database: DuckDB (for Smallpond), FoundationDB 7.1+ (for 3FS)
- Other: libfuse 3.16.1+, various compression and SSL libraries
Key innovations:
- JIT (Just-In-Time) compilation for CUDA kernels at runtime
- FP8 precision support for improved throughput
- Disaggregated storage architecture
- Computation-communication overlap techniques
```
Hardware:
├── NVIDIA Hopper GPUs (SM90/SM100)
├── NVLink interconnect
├── RDMA networks (InfiniBand)
└── Distributed SSD storage

Software:
├── CUDA 12.3+ / 12.9+
├── PyTorch 2.0+
├── C++20 / CUDA / Python 3.8+ / Rust 1.75+
├── CMake build system
├── NCCL 2.30.4+
├── CUTLASS 4.0+
└── DuckDB / FoundationDB
```
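The JIT approach listed under key innovations amounts to compiling a kernel the first time a given configuration is requested and reusing it afterwards. The sketch below imitates that pattern in plain Python with a cache. `get_kernel` and `compile_count` are hypothetical names invented for this illustration, and the "compilation" is just building a closure, not real CUDA compilation.

```python
import functools

compile_count = 0  # how many times the stand-in "compiler" actually ran


@functools.lru_cache(maxsize=None)
def get_kernel(scale: int):
    """Stand-in for JIT compilation: build a function specialised for `scale`."""
    global compile_count
    compile_count += 1
    return lambda xs: [x * scale for x in xs]


print(get_kernel(2)([1, 2, 3]))  # first request for scale=2: "compiles"
print(get_kernel(2)([4, 5]))     # same configuration: served from the cache
print(get_kernel(3)([1]))        # new configuration: "compiles" again
print("compilations:", compile_count)
```

The first call for each configuration pays the compilation cost; later calls with the same configuration reuse the cached result.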
Step 4
FlashMLA: Efficient MLA decoding kernel
FlashMLA provides optimized Multi-head Latent Attention (MLA) kernels for Hopper GPUs, supporting Multi-Query Attention (MQA) and Multi-Head Attention (MHA) modes with sparse and dense operations.

Requirements:
- GPU: SM90 or SM100 architecture
- CUDA: 12.8+ (12.9+ for SM100)
- PyTorch: 2.0+
- Languages: C++ (49%), CUDA (39%), Python (12%)
```
# Clone and install FlashMLA
git clone https://github.com/deepseek-ai/FlashMLA.git flash-mla
cd flash-mla
git submodule update --init --recursive
pip install -v .

# The installation compiles CUDA kernels for optimized attention
```
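The paged KV cache with block size 64 mentioned in the component overview stores each sequence's keys and values in fixed-size blocks that need not be contiguous in memory, with a block table mapping logical blocks to physical ones. The sketch below shows only that indexing arithmetic. The block table contents and the function names are hypothetical, and this is not FlashMLA's API.

```python
import math

BLOCK_SIZE = 64  # tokens per KV-cache block, as stated for FlashMLA


def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of cache blocks required to hold seq_len tokens."""
    return math.ceil(seq_len / block_size)


def locate(token_pos: int, block_table: list, block_size: int = BLOCK_SIZE) -> tuple:
    """Map a logical token position to (physical block id, offset within block)."""
    logical_block, offset = divmod(token_pos, block_size)
    return block_table[logical_block], offset


# Hypothetical block table: logical blocks 0..2 live in physical blocks 7, 2, 9
block_table = [7, 2, 9]
seq_len = 150

print(blocks_needed(seq_len))    # 3
print(locate(0, block_table))    # (7, 0)
print(locate(64, block_table))   # (2, 0)
print(locate(149, block_table))  # (9, 21)
```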
Step 5
DeepEP: Expert-parallel communication library
DeepEP is the first open-source expert-parallel communication library designed for efficient MoE model operations. All kernels compile at runtime via JIT, requiring no CUDA compilation during installation.

Requirements:
- GPU: Hopper (SM90) or architectures with SM90 PTX ISA support
- Python: 3.8+
- PyTorch: 2.10+
- CUDA: 12.3+
- NCCL: 2.30.4+
- NVSHMEM (for legacy method support)
- Languages: CUDA (49.8%), C++ (25.1%), Python (24.1%)
```
# Install NCCL via pip for automatic library discovery
pip install "nvidia-nccl-cu13>=2.30.4" --no-deps

# Clone and build DeepEP
git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP
python setup.py build
python setup.py install

# Test the installation
python -c "import deep_ep; print('DeepEP installed successfully')"
```
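Expert-parallel communication boils down to sending each token to the ranks that host the experts it was routed to, then gathering the results back. The sketch below models only the bookkeeping for the first half: how many tokens each rank would receive for a given routing. It is plain Python rather than DeepEP code. The contiguous expert placement, the rule that a token is sent to a rank once even if several of its experts live there, and all names are assumptions made for illustration.

```python
from collections import Counter

NUM_EXPERTS = 8
NUM_RANKS = 4
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS


def owner_rank(expert_id: int) -> int:
    """Rank hosting an expert under a contiguous placement (assumed here)."""
    return expert_id // EXPERTS_PER_RANK


def dispatch_counts(topk_experts: list) -> Counter:
    """Count how many tokens each rank receives.

    A token routed to several experts on the same rank is counted once for it.
    """
    counts = Counter()
    for experts in topk_experts:
        for rank in {owner_rank(e) for e in experts}:
            counts[rank] += 1
    return counts


# Hypothetical routing: each of 4 tokens picks its top-2 experts
topk = [[0, 1], [2, 7], [3, 4], [6, 7]]
counts = dispatch_counts(topk)
print(dict(sorted(counts.items())))  # {0: 1, 1: 2, 2: 1, 3: 2}
```

Uneven counts like these are what make all-to-all traffic irregular in MoE models, and they are the input a load balancer works from.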
Step 6
DeepGEMM: FP8 GEMM library
DeepGEMM delivers high-performance FP8 matrix operations for dense and MoE computations, achieving over 1350 FP8 TFLOPS on Hopper GPUs. Like DeepEP, it compiles its kernels at runtime via JIT, so no CUDA kernel compilation is needed during installation.

Requirements:
- GPU: SM90 or SM100 architecture
- CUDA: 12.3+ (12.9+ recommended)
- Python: 3.8+
- PyTorch: 2.1+
- C++20 compatible compiler
- CUTLASS: 4.0+
- {fmt} library
- Languages: CUDA (48.2%), C++ (35.2%), Python (16.2%)
```
# Clone with submodules
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM

# Development setup
./develop.sh

# Production installation
./install.sh

# Verify installation
python -c "import deep_gemm; print('DeepGEMM ready')"
```
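A TFLOPS figure is easier to interpret next to a concrete matrix shape. The sketch below counts the floating-point operations in a GEMM (one multiply and one add per element of the inner product) and converts between runtime and throughput. The matrix shape is hypothetical, and the 1350 TFLOPS value is used as an idealized rate, so real kernels will typically take longer than the computed time.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput implied by a measured runtime."""
    return gemm_flops(m, n, k) / seconds / 1e12


def ideal_seconds(m: int, n: int, k: int, peak_tflops: float) -> float:
    """Runtime if the kernel ran at exactly peak_tflops."""
    return gemm_flops(m, n, k) / (peak_tflops * 1e12)


m, n, k = 4096, 7168, 2048  # hypothetical shape
t = ideal_seconds(m, n, k, peak_tflops=1350)
print(f"{gemm_flops(m, n, k) / 1e9:.1f} GFLOP, {t * 1e6:.1f} microseconds at 1350 TFLOPS")
```

In practice you would time the kernel yourself and pass the measured duration to `achieved_tflops` to compare against the quoted figure.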
Step 7
DualPipe: Bidirectional pipeline parallelism
DualPipe implements bidirectional pipeline parallelism for efficient computation-communication overlap during model training. It was designed for DeepSeek V3/R1 training workflows.

Requirements:
- PyTorch: 2.0+
- Languages: Python (100%)
Note: Real-world applications require implementing a custom overlapped_forward_backward method tailored to your specific module architecture.
```
# Clone DualPipe
git clone https://github.com/deepseek-ai/DualPipe.git
cd DualPipe

# Run example scripts
python examples/example_dualpipe.py
python examples/example_dualpipev.py

# For production use, customize the overlapped_forward_backward method
# in your training pipeline
```
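The benefit of computation-communication overlap can be seen with a toy timing model. The sketch below is not the DualPipe schedule. It is a simplified two-resource model, assumed here for illustration, in which the communication for one chunk runs while the next chunk computes. All timings are hypothetical.

```python
def serial_time(compute, comm):
    """Each chunk computes, then communicates, with nothing running in parallel."""
    return sum(compute) + sum(comm)


def overlapped_time(compute, comm):
    """Chunk i's communication runs while chunk i+1 computes."""
    compute_end = 0.0
    comm_end = 0.0
    for c, m in zip(compute, comm):
        compute_end += c  # compute runs back to back
        # communication waits for its data and for the link to be free
        comm_end = max(compute_end, comm_end) + m
    return comm_end


compute_ms = [4.0, 4.0, 4.0, 4.0]  # hypothetical per-chunk compute times
comm_ms = [3.0, 3.0, 3.0, 3.0]     # hypothetical per-chunk communication times

print("serial:    ", serial_time(compute_ms, comm_ms), "ms")      # 28.0 ms
print("overlapped:", overlapped_time(compute_ms, comm_ms), "ms")  # 19.0 ms
```

When communication is shorter than computation, only the last chunk's communication stays exposed. That is the effect a custom `overlapped_forward_backward` implementation aims for within a real module.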

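Step 8
EPLB: Expert-parallel load balancer
EPLB computes expert replication and placement plans that balance load across GPUs in mixture-of-experts models.

Requirements:
- PyTorch (version not explicitly specified)
- Languages: Python (100%)
```
# Clone EPLB
git clone https://github.com/deepseek-ai/eplb.git
cd eplb

# Usage example
python << 'EOF'
import torch
import eplb

# Per-expert load statistics, shape [num_layers, num_logical_experts]
# (illustrative values; substitute your own measurements)
weight = torch.tensor([[90, 132, 40, 61, 104, 165, 39, 4, 73, 56, 183, 86],
                       [20, 107, 104, 64, 19, 197, 187, 157, 172, 86, 16, 27]])

# Compute expert rebalancing plan. Positional arguments, in order:
# 16 physical expert replicas, 4 expert groups, 2 nodes, 8 GPUs in total
phy2log, log2phy, logcnt = eplb.rebalance_experts(weight, 16, 4, 2, 8)
print(phy2log)  # logical expert id held by each physical replica
EOF
```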
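The idea behind replication and placement can be illustrated with a small greedy heuristic. This is a sketch assumed for illustration and not EPLB's actual algorithm: it gives extra replicas to the experts with the highest load per replica, then assigns replicas, heaviest first, to the least-loaded GPU that still has a free slot. All function names and load values are hypothetical.

```python
def replicate(loads, num_replicas):
    """Give extra replicas to whichever expert has the highest load per replica."""
    counts = [1] * len(loads)
    for _ in range(num_replicas - len(loads)):
        heaviest = max(range(len(loads)), key=lambda e: loads[e] / counts[e])
        counts[heaviest] += 1
    return counts


def place(loads, counts, num_gpus):
    """Assign replicas, heaviest first, to the least-loaded GPU with a free slot."""
    slots = sum(counts) // num_gpus  # equal number of replicas per GPU
    replicas = sorted(
        ((loads[e] / counts[e], e) for e in range(len(loads)) for _ in range(counts[e])),
        reverse=True,
    )
    gpus = [{"load": 0.0, "experts": []} for _ in range(num_gpus)]
    for load, expert in replicas:
        free = [g for g in gpus if len(g["experts"]) < slots]
        target = min(free, key=lambda g: g["load"])
        target["load"] += load
        target["experts"].append(expert)
    return gpus


loads = [90, 132, 40, 61, 104, 165, 39, 4]  # hypothetical per-expert loads
counts = replicate(loads, num_replicas=12)
gpus = place(loads, counts, num_gpus=4)

print("replicas per expert:", counts)
for i, g in enumerate(gpus):
    print(f"GPU {i}: load={g['load']:.1f} experts={g['experts']}")
```

Splitting the heaviest experts across replicas lowers the largest single unit of work, which is what lets the placement step even out per-GPU load.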

Step 9
3FS: Fire-Flyer parallel file system
3FS is a high-performance parallel file system designed for AI workloads, achieving 6.6 TiB/s aggregate read throughput. It uses a disaggregated architecture combining thousands of SSDs across storage nodes with RDMA-capable networks.

Requirements:
- System libraries: cmake, libuv1-dev, liblz4-dev, liblzma-dev, libdouble-conversion-dev
- Flags, logging, and testing libraries: libgflags-dev, libgoogle-glog-dev, libgtest-dev, libgmock-dev
- SSL and compression utilities
- FoundationDB: 7.1+
- libfuse: 3.16.1+
- Rust: 1.75.0+
- Compiler: clang-14 (primary), g++10 or g++11 (alternatives)
- OS: Ubuntu 20.04/22.04, openEuler, OpenCloudOS, TencentOS
- Languages: C++ (87%), Rust (4.4%), Python (2.1%)
```
# Install system dependencies (Ubuntu)
sudo apt-get update
sudo apt-get install -y cmake libuv1-dev liblz4-dev liblzma-dev \
  libdouble-conversion-dev libgflags-dev libgoogle-glog-dev \
  libgtest-dev libgmock-dev clang-14

# Clone 3FS
git clone https://github.com/deepseek-ai/3FS.git
cd 3FS
git submodule update --init --recursive

# Build with CMake
mkdir build && cd build
cmake .. -DCMAKE_C_COMPILER=clang-14 -DCMAKE_CXX_COMPILER=clang++-14
make -j$(nproc)

# Docker alternative for pre-configured environments
# docker pull <3fs-image> # Check repository for official images
```
⚠ Heads up: 3FS requires significant infrastructure setup including FoundationDB cluster deployment, RDMA-capable networking, and distributed SSD storage. Refer to the official documentation for production deployment guidance.
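The aggregate read throughput figure translates into concrete scan times. The sketch below is simple arithmetic under idealized assumptions: the quoted rate is sustained and evenly shared. The dataset size and client count are hypothetical.

```python
AGGREGATE_READ_TIB_PER_S = 6.6  # figure quoted in the text


def read_seconds(size_tib: float, throughput_tib_per_s: float = AGGREGATE_READ_TIB_PER_S) -> float:
    """Time to read size_tib once at the given aggregate throughput."""
    return size_tib / throughput_tib_per_s


def per_client_gib_per_s(num_clients: int, throughput_tib_per_s: float = AGGREGATE_READ_TIB_PER_S) -> float:
    """Even share of the aggregate throughput per client, in GiB/s."""
    return throughput_tib_per_s * 1024 / num_clients


dataset_tib = 1024  # hypothetical dataset of 1 PiB
clients = 500       # hypothetical number of concurrent readers

print(f"full scan: {read_seconds(dataset_tib):.0f} s")
print(f"per client: {per_client_gib_per_s(clients):.1f} GiB/s")
```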
Step 10
Smallpond: Data processing on 3FS
Smallpond is a lightweight data processing framework built on top of 3FS and DuckDB, designed for petabyte-scale datasets without long-running services. It leverages DuckDB's query engine for local processing while integrating with 3FS for distributed storage.

Requirements:
- Python: 3.8 to 3.12
- DuckDB: high-performance analytical database engine
- 3FS: distributed file system backend
- Languages: Python (100%)
```
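# Install Smallpond via pip
pip install smallpond

# For development (quotes keep the shell from expanding the brackets)
pip install "smallpond[dev]"

# For building documentation
pip install "smallpond[docs]"

# Run unit tests (from a checkout of the repository)
pytest

# Basic usage example
python << 'EOF'
import smallpond

# Start a session
sp = smallpond.init()

# Load a Parquet dataset (placeholder path; it may point at a 3FS mount)
df = sp.read_parquet("prices.parquet")

# Partition by key, then run a DuckDB SQL query on each partition
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Write the results back to storage
df.write_parquet("output/")
EOF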
```
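One common way to process large datasets with a local query engine is to hash-partition rows by key so that each partition can be aggregated independently. The sketch below shows that pattern in plain Python, with a hand-written aggregate standing in for a DuckDB query. It is an illustration of the idea under those assumptions, not Smallpond's implementation, and the function names and data are hypothetical.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode()) % num_partitions
        parts[idx].append(row)
    return parts


def min_max_by_key(partition, key, value):
    """Per-partition aggregate: (min, max) of `value` for each key."""
    out = {}
    for row in partition:
        lo, hi = out.get(row[key], (row[value], row[value]))
        out[row[key]] = (min(lo, row[value]), max(hi, row[value]))
    return out


rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 20.0},
    {"ticker": "AAA", "price": 9.5},
    {"ticker": "CCC", "price": 5.0},
    {"ticker": "BBB", "price": 22.0},
]

parts = hash_partition(rows, "ticker", 3)

# Each key lives in exactly one partition, so partial results never conflict
result = {}
for part in parts:
    result.update(min_max_by_key(part, "ticker", "price"))

print(result)
```

Because partitions share no keys, each one could be handed to a separate worker and the outputs simply concatenated.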
Step 11
Integration and workflow
The Open Infrastructure Index components work together to create a complete AI training and inference pipeline:

Training workflow:
1. Store training data in 3FS for high-throughput access
2. Use Smallpond to preprocess and prepare datasets
3. Leverage DualPipe for efficient pipeline parallelism
4. Apply DeepEP for expert-parallel communication in MoE models
5. Use EPLB to balance expert loads across GPUs
6. Accelerate matrix operations with DeepGEMM FP8 kernels
Inference workflow:
1. Load model checkpoints from 3FS
2. Use FlashMLA for optimized attention operations
3. Apply DeepGEMM for fast FP8 inference
4. Leverage DeepEP for distributed MoE inference
Performance considerations:
- FP8 precision trades minimal accuracy for significant throughput gains
- JIT compilation allows kernel optimization at runtime
- Disaggregated storage separates compute and storage scaling
- RDMA networks minimize communication overhead
```
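# Illustrative sketch only: how the components relate within one MoE layer.
# deep_gemm, deep_ep, eplb, and flash_mla are the packages' import names.
# Functions prefixed with "placeholder_" are hypothetical stand-ins, not
# library APIs; consult each repository's README for the actual calls.
# Tensors and sizes (expert_loads, query, ...) are assumed to be defined.
import torch
import eplb

# 1. Plan expert placement from observed loads
#    expert_loads: tensor of shape [num_layers, num_logical_experts]
phy2log, log2phy, logcnt = eplb.rebalance_experts(
    expert_loads, num_replicas, num_groups, num_nodes, num_gpus
)

# 2. Attention (FlashMLA covers the decoding path at inference time)
attn_output = placeholder_attention(query, kv_cache)

# 3. Route tokens to the ranks hosting their experts (DeepEP dispatch)
expert_input = placeholder_dispatch(attn_output, topk_idx)

# 4. Expert FFN as FP8 GEMMs (DeepGEMM). Inputs are quantized explicitly,
#    because torch.autocast does not support FP8 dtypes.
expert_output = placeholder_fp8_gemm(expert_input, expert_weight)

# 5. Return expert outputs to their source ranks (DeepEP combine)
output = placeholder_combine(expert_output)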
```
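The FP8 trade-off listed under performance considerations can be made concrete by rounding a few numbers to the nearest FP8 value. The sketch below emulates the E4M3 format (4 exponent bits, 3 mantissa bits, largest finite value 448) in plain Python with saturation at the maximum. It is a numerical illustration written for this guide, not how DeepGEMM quantizes tensors, and the sample values are arbitrary.

```python
import math

E4M3_MAX = 448.0   # largest finite E4M3 value
E4M3_MIN_EXP = -6  # exponent of the smallest normal value
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value representable in FP8 E4M3 (saturating)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), E4M3_MAX)
    # Below the normal range the exponent stays fixed (subnormal spacing)
    exp = max(math.floor(math.log2(mag)), E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # gap between neighbouring values
    return sign * min(round(mag / step) * step, E4M3_MAX)


for value in [0.3, 1.0, -3.3, 100.0, 1000.0]:
    q = round_to_e4m3(value)
    rel_err = abs(q - value) / abs(value)
    print(f"{value:>8} -> {q:>8}  relative error {rel_err:.2%}")
```

With 3 mantissa bits, values inside the normal range carry a relative rounding error of at most 2^-4 (6.25%), while anything above 448 saturates. This is why FP8 pipelines typically rescale tensors before quantizing them.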
Step 12
Hardware and deployment considerations
Successfully deploying the Open Infrastructure Index requires careful hardware planning:

GPU requirements:
- NVIDIA Hopper-class GPUs (such as the H800) or newer are required by the GPU libraries (FlashMLA, DeepEP, DeepGEMM)
- SM90 architecture minimum; SM100 supported with CUDA 12.9+
- Multiple GPUs per node connected via NVLink for optimal performance
Network requirements:
- 200-400Gbps InfiniBand or equivalent RDMA-capable networking
- Low-latency interconnects for inter-node communication
- Dedicated storage network for 3FS (optional but recommended)
Storage requirements:
- Thousands of SSDs for 3FS deployment (production scale)
- High IOPS and bandwidth capabilities
- Multi-NUMA domain configurations for optimal access patterns
Software environment:
- Linux distribution: Ubuntu 20.04/22.04, openEuler, OpenCloudOS, or TencentOS
- CUDA driver matching toolkit version
- NCCL properly configured for multi-node communication
- FoundationDB cluster for 3FS metadata management
Cloud deployment: Major cloud providers offer Hopper GPU instances (H100, H200) that meet the GPU requirements. However, matching the performance figures reported by DeepSeek may also require bare-metal deployment with custom networking and storage configurations.
⚠ Heads up: The Open Infrastructure Index is designed for large-scale AI infrastructure. Small-scale deployments may not realize the full performance benefits and may be better served by simpler alternatives.
Step 13
Academic papers and documentation
DeepSeek has published academic papers detailing the design and performance of their infrastructure:

SC24 (Supercomputing 2024): Papers on distributed training systems and storage architectures

ISCA25 (International Symposium on Computer Architecture 2025): Papers on GPU kernel optimization and memory systems

These publications provide deep technical insights into the design decisions, performance characteristics, and optimization techniques used throughout the infrastructure stack.
```
Key publications:
- SC24: Distributed systems and storage
- ISCA25: GPU optimization and memory
```
Step 14
Resources and community
Main repository: https://github.com/deepseek-ai/open-infra-index

Individual component repositories:
- FlashMLA: https://github.com/deepseek-ai/FlashMLA
- DeepEP: https://github.com/deepseek-ai/DeepEP
- DeepGEMM: https://github.com/deepseek-ai/DeepGEMM
- DualPipe: https://github.com/deepseek-ai/DualPipe
- EPLB: https://github.com/deepseek-ai/eplb
- 3FS: https://github.com/deepseek-ai/3FS
- Smallpond: https://github.com/deepseek-ai/smallpond
License: CC0-1.0 (public domain dedication)

Language support: Primary documentation in English; Chinese language documentation available

Community: Each repository has its own issue tracker and contribution guidelines
```
Main: github.com/deepseek-ai/open-infra-index
Components:
├── FlashMLA (GPU kernels)
├── DeepEP (expert-parallel)
├── DeepGEMM (FP8 GEMM)
├── DualPipe (pipeline parallelism)
├── EPLB (load balancing)
├── 3FS (file system)
└── Smallpond (data processing)

License: CC0-1.0 (public domain)
```
