Video Subtitle Remover: AI-powered hardcoded subtitle removal

AI-based tool for removing hardcoded subtitles and text watermarks from videos and images using advanced inpainting models. Local processing without third-party APIs.

  1. Step 1

    What is Video Subtitle Remover?

    Video Subtitle Remover (VSR) is an AI-based tool that removes hardcoded subtitles and text-like watermarks from videos or images. Unlike soft subtitles, which can be toggled off, hardcoded subtitles are burned permanently into the video frames. VSR detects subtitle regions with an OCR model and fills them in with an inpainting model that reconstructs the background content. The tool runs entirely locally, with no third-party API calls, so your content stays on your machine.

  2. Step 2

    Technology stack

    Video Subtitle Remover is built on a modern Python-based AI/ML stack with GPU acceleration support:

    Core Framework:

    • Python 3.12+ (with support for Python 3.13)
    • PyTorch 2.7.0 for deep learning
    • TorchVision 0.22.0 for vision models
    • PaddlePaddle 3.0.0 for OCR capabilities

    Computer Vision & Processing:

    • OpenCV (cv2) for image/video processing
    • FFmpeg for video encoding, frame extraction, and audio merging
    • PP-OCRv5 model for text detection and recognition

    AI Inpainting Models:

    • STTN (Spatio-Temporal Transformer Network) - Fast, uses temporal information from adjacent frames
    • LAMA (Large Mask Inpainting) - Single-frame neural fill, best for images and animated content
    • ProPainter - Hybrid approach combining TBE (Temporal Background Extraction) and LAMA refinement
    • OpenCV traditional inpainting as fallback

    GUI & Interface:

    • PySimpleGUI for the graphical user interface
    • Command-line interface for batch processing

    Hardware Acceleration:

    • CUDA support for NVIDIA GPUs (recommended)
    • DirectML for AMD/Intel GPU acceleration
    • CPU fallback mode (slower)
    • Apple Silicon (macOS) optimization

    Deployment:

    • Docker containers with CUDA 11.8, 12.6, 12.8, DirectML, and CPU variants
    • Conda/pip for dependency management
    • GitHub Actions for CI/CD
    Core Stack:
    ├── Python 3.12+
    ├── PyTorch 2.7.0 + TorchVision 0.22.0
    ├── PaddlePaddle 3.0.0
    ├── OpenCV (cv2)
    └── FFmpeg
    
    AI Models:
    ├── PP-OCRv5 (text detection)
    ├── STTN (temporal inpainting)
    ├── LAMA (single-frame inpainting)
    └── ProPainter (hybrid inpainting)
    
    Hardware Support:
    ├── CUDA (NVIDIA GPU) - Recommended
    ├── DirectML (AMD/Intel GPU)
    ├── CPU (no GPU)
    └── Apple Silicon (macOS)
  3. Step 3

    How it works: The subtitle removal pipeline

    VSR uses a multi-stage pipeline to remove subtitles:

    1. Text Detection (PP-OCRv5)

    • Video frames are extracted using FFmpeg
    • PP-OCRv5 detects text regions in each frame
    • Bounding boxes are generated around detected subtitles
    • Scene detection skips static frames for efficiency

    2. Mask Generation

    • Detected text regions are converted to binary masks
    • Masks define which pixels need to be reconstructed
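
    As a rough illustration of this step, the sketch below turns bounding boxes into a binary mask using only the standard library. The function name and the (x_min, y_min, x_max, y_max) box layout are assumptions made for this example, not VSR's internal API; real code would normally use NumPy arrays.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max), with the max edges exclusive.
Box = Tuple[int, int, int, int]


def boxes_to_mask(width: int, height: int, boxes: List[Box]) -> List[bytearray]:
    """Return a height x width mask where 1 marks pixels to reconstruct."""
    mask = [bytearray(width) for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip the box to the frame so out-of-range detections are harmless.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
        if x0 >= x1 or y0 >= y1:
            continue  # box lies entirely outside the frame
        for y in range(y0, y1):
            mask[y][x0:x1] = b"\x01" * (x1 - x0)
    return mask


if __name__ == "__main__":
    # Render a tiny 8x4 frame with one detected box as ASCII art.
    for row in boxes_to_mask(8, 4, [(2, 1, 6, 3)]):
        print("".join("#" if value else "." for value in row))
```

    Every pixel set to 1 is handed to the inpainting stage; everything else is left untouched.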

    3. AI Inpainting

    • STTN mode: Leverages temporal information - subtitles are sparse in time, so the model reconstructs background from adjacent frames where subtitles are absent or different
    • LAMA mode: Single-frame neural fill using surrounding context
    • ProPainter mode: First uses TBE to reconstruct temporal background, then LAMA refines residual artifacts
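
    The temporal idea behind STTN mode can be illustrated with a deliberately naive sketch: for each masked pixel, copy the value from the nearest frame in time where that pixel is not covered. This is not the STTN model (which is a learned transformer network); it is a toy stand-in on grayscale lists, and all names are made up for this example.

```python
from typing import List, Optional

Frame = List[List[int]]  # grayscale pixel values, indexed [y][x]
Mask = List[List[int]]   # 1 = covered by a subtitle, 0 = clean


def nearest_clean(masks: List[Mask], t: int, y: int, x: int) -> Optional[int]:
    """Search outward from frame t for a frame whose pixel (y, x) is unmasked."""
    for offset in range(1, len(masks)):
        for candidate in (t - offset, t + offset):
            if 0 <= candidate < len(masks) and not masks[candidate][y][x]:
                return candidate
    return None  # the pixel is covered in every frame


def temporal_fill(frames: List[Frame], masks: List[Mask]) -> List[Frame]:
    """Fill masked pixels from the nearest frame where that pixel is clean."""
    out = [[row[:] for row in frame] for frame in frames]
    for t, mask in enumerate(masks):
        for y, row in enumerate(mask):
            for x, covered in enumerate(row):
                if not covered:
                    continue
                donor = nearest_clean(masks, t, y, x)
                if donor is not None:
                    out[t][y][x] = frames[donor][y][x]
    return out
```

    When a pixel is covered in every frame, this sketch leaves it unchanged; that is the situation where a single-frame approach such as LAMA mode has to fill the region from the surrounding context instead.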

    4. Video Reconstruction

    • Inpainted frames are re-encoded using FFmpeg
    • Original audio track is merged back
    • Output video maintains original resolution and quality
    Pipeline Flow:
    
    [Input Video]
        ↓
    [FFmpeg Frame Extraction]
        ↓
    [PP-OCRv5 Text Detection] → Bounding boxes
        ↓
    [Mask Generation] → Binary masks
        ↓
    [AI Inpainting Model]
      ├── STTN (temporal)
      ├── LAMA (single-frame)
      └── ProPainter (hybrid)
        ↓
    [FFmpeg Re-encoding]
        ↓
    [Audio Merge]
        ↓
    [Output Video (subtitle-free)]
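
    The final audio merge can be approximated with a plain FFmpeg call. The sketch below builds such a command from Python; the file names are hypothetical, and VSR's own FFmpeg invocation may use different options. The flags themselves (-map, -c copy, -shortest) are standard FFmpeg options.

```python
import subprocess
from typing import List


def build_mux_command(inpainted_video: str, original_video: str, output: str) -> List[str]:
    """Build an ffmpeg command that pairs inpainted video with the original audio."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", inpainted_video,     # input 0: frames after subtitle removal
        "-i", original_video,      # input 1: source of the audio track
        "-map", "0:v:0",           # take the video stream from input 0
        "-map", "1:a:0?",          # take audio from input 1; '?' tolerates no audio
        "-c", "copy",              # copy streams without re-encoding
        "-shortest",               # stop at the end of the shorter stream
        output,
    ]


def mux_audio(inpainted_video: str, original_video: str, output: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mux_command(inpainted_video, original_video, output), check=True)
```

    Keeping command construction separate from execution makes the command easy to print and inspect before running it.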
  4. Step 4

    Prerequisites

    Before setting up Video Subtitle Remover, ensure you have the appropriate hardware and software:

    For Docker Deployment (Recommended):

    • Docker 20.10+ and Docker Compose
    • NVIDIA GPU with CUDA support (10/20/30/40 series)
    • NVIDIA Container Toolkit (the nvidia-container-toolkit package, successor to the deprecated nvidia-docker2)
    • At least 8GB GPU VRAM (16GB+ recommended for ProPainter)
    • 10GB+ free disk space for Docker images and models

    For Source Installation:

    • Python 3.12 or 3.13
    • NVIDIA GPU with CUDA 11.8 or 12.x support
    • CUDA Toolkit and cuDNN installed
    • Conda or venv for virtual environment
    • FFmpeg installed and available in PATH
    • At least 8GB GPU VRAM

    General Requirements:

    • 64-bit operating system (Linux, Windows, or macOS)
    • For CPU-only mode: Modern CPU with AVX2 support (very slow, not recommended)
    # Check Docker and NVIDIA support
    docker --version
    nvidia-smi  # Check GPU and CUDA version
    
    # For source installation:
    python --version  # Should be 3.12 or 3.13
    nvcc --version    # Check CUDA Toolkit
    ffmpeg -version   # Check FFmpeg
    
    # Recommended: Check GPU VRAM
    nvidia-smi --query-gpu=memory.total --format=csv,noheader
    ⚠ Heads up: GPU strongly recommended. CPU mode is available, but processing is 10-50x slower: a subtitle removal task that takes 5 minutes on an NVIDIA RTX 3090 may take hours on CPU. NVIDIA GPUs with CUDA are the primary supported hardware.
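
    The same prerequisite checks can be scripted. The following is a small sketch that only looks for the command-line tools named above on PATH and checks the Python version; the function name and report keys are made up for this example, and finding a tool on PATH does not prove that it works.

```python
import shutil
import sys
from typing import Dict, Iterable


def preflight(tools: Iterable[str] = ("docker", "nvidia-smi", "nvcc", "ffmpeg")) -> Dict[str, bool]:
    """Report which tools are on PATH and whether Python is 3.12 or 3.13."""
    report = {tool: shutil.which(tool) is not None for tool in tools}
    report["python 3.12-3.13"] = (3, 12) <= sys.version_info[:2] <= (3, 13)
    return report


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

    For a Docker-only setup, nvcc and a matching host Python are not needed, so pass just the tools you care about, for example preflight(("docker", "nvidia-smi")).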
  5. Step 5

    Quick start with Docker (NVIDIA GPU)

    The fastest way to get started is Docker with GPU support. First, make sure the NVIDIA Container Toolkit is installed, then pull and run the image that matches your GPU generation.

    # For NVIDIA 10/20/30 Series GPUs (CUDA 11.8)
    docker pull eritpchy/video-subtitle-remover:1.4.0-cuda11.8
    
    docker run -it --gpus all \
      --name vsr \
      -v "$(pwd)/videos:/workspace" \
      eritpchy/video-subtitle-remover:1.4.0-cuda11.8 \
      python backend/main.py \
      -i /workspace/input.mp4 \
      -o /workspace/output.mp4
    
    # For NVIDIA 40 Series GPUs (CUDA 12.x)
    docker pull eritpchy/video-subtitle-remover:1.4.0-cuda12
    
    docker run -it --gpus all \
      --name vsr \
      -v "$(pwd)/videos:/workspace" \
      eritpchy/video-subtitle-remover:1.4.0-cuda12 \
      python backend/main.py \
      -i /workspace/input.mp4 \
      -o /workspace/output.mp4
    ⚠ Heads up: NVIDIA Container Toolkit required. Docker containers can only access the GPU when the NVIDIA Container Toolkit (the nvidia-container-toolkit package, successor to the deprecated nvidia-docker2) is installed on the host. See the NVIDIA Container Toolkit installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
  6. Step 6

    Docker with DirectML (AMD/Intel GPU)

    For non-NVIDIA GPUs (AMD or Intel), use the DirectML variant, which provides GPU acceleration on Windows through DirectX:

    REM Pull DirectML image (Windows only)
    docker pull eritpchy/video-subtitle-remover:1.4.0-directml
    
    REM Run with DirectML support (Command Prompt syntax: ^ continues a line)
    docker run -it --name vsr ^
      -v "%cd%\videos:/workspace" ^
      eritpchy/video-subtitle-remover:1.4.0-directml ^
      python backend/main.py ^
      -i /workspace/input.mp4 ^
      -o /workspace/output.mp4
  7. Step 7

    Installation from source (Linux/Windows)

    For development or customization, install from source. This gives you full control over the environment and allows you to use the GUI interface.

    Step 1: Set up virtual environment

    Step 2: Install PyTorch and PaddlePaddle

    Step 3: Clone repository and install dependencies

    Step 4: Download AI models

    # Step 1: Create conda environment
    conda create -n vsr python=3.12
    conda activate vsr
    
    # Step 2: Install PyTorch with CUDA support
    # For CUDA 11.8:
    pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu118
    
    # For CUDA 12.6 (use the cu128 index instead for CUDA 12.8):
    pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126
    
    # Install PaddlePaddle with GPU support (CUDA 11.8 build; use .../stable/cu126/ for CUDA 12.6)
    python -m pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
    
    # Step 3: Clone and install
    git clone https://github.com/YaoFANGUK/video-subtitle-remover.git
    cd video-subtitle-remover
    pip install -r requirements.txt
    
    # Step 4: Models will auto-download on first run
    # Or manually download to models/ directory from releases
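
    After installing, a quick import check can confirm that the main packages are visible to the active environment. This sketch only tests whether the modules can be located, not whether GPU support works; the module names (torch, torchvision, paddle, cv2) are the usual import names for the packages listed above, and the helper itself is hypothetical.

```python
import importlib.util
from typing import Dict, Iterable


def check_modules(names: Iterable[str] = ("torch", "torchvision", "paddle", "cv2")) -> Dict[str, bool]:
    """Return, for each top-level module name, whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for module, found in check_modules().items():
        print(f"{module:12s} {'found' if found else 'NOT FOUND'}")
```

    A missing entry usually means the package was installed into a different environment than the one currently activated.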
  8. Step 8

    Installation from source (macOS Apple Silicon)

    For Apple Silicon Macs (M1/M2/M3), use the MPS (Metal Performance Shaders) backend for GPU acceleration:

    # Create virtual environment
    python3 -m venv vsr-env
    source vsr-env/bin/activate
    
    # Install dependencies for macOS
    pip install torch==2.7.0 torchvision==0.22.0
    pip install paddlepaddle==3.0.0
    
    # Clone repository
    git clone https://github.com/YaoFANGUK/video-subtitle-remover.git
    cd video-subtitle-remover
    
    # Install requirements
    pip install -r requirements.txt
    
    # Ensure FFmpeg is installed
    brew install ffmpeg
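
    To confirm which accelerator PyTorch can see after installation, a short check like the one below may help. torch.cuda.is_available() and torch.backends.mps.is_available() are real PyTorch calls; the helper name is made up, and whether VSR's --device option accepts "mps" is not confirmed by this guide.

```python
def pick_device() -> str:
    """Return the best device string PyTorch reports: CUDA, then MPS, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    if torch.cuda.is_available():
        return "cuda:0"
    # torch.backends.mps exists only in builds with Metal support compiled in.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(f"PyTorch will run on: {pick_device()}")
```

    On an Apple Silicon Mac with a working install, the expected result is "mps"; seeing "cpu" there suggests the installed PyTorch build lacks Metal support.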
  9. Step 9

    Using the GUI interface

    Video Subtitle Remover provides a user-friendly GUI built with PySimpleGUI for interactive subtitle removal:

    Features:

    • Visual file selection
    • Preview of detected subtitle regions
    • Real-time progress monitoring
    • Adjustable detection parameters
    • Choice of inpainting models
    • Batch processing support
    # Launch the GUI (source installation)
    python gui.py
    
    # The GUI window will open with:
    # 1. Input file selection
    # 2. Output path specification
    # 3. Model selection (STTN/LAMA/ProPainter)
    # 4. Detection parameters:
    #    - Text detection threshold
    #    - Mask dilation size
    #    - Frame sampling rate
    # 5. Start processing button
  10. Step 10

    Command-line usage and parameters

    For batch processing and automation, use the command-line interface with various configuration options:

    # Basic usage
    python backend/main.py -i input.mp4 -o output.mp4
    
    # Specify inpainting model
    python backend/main.py -i input.mp4 -o output.mp4 --inpaint_mode sttn-det
    # Options: sttn-auto, sttn-det, lama, propainter, opencv
    
    # Adjust text detection threshold (0.0-1.0)
    python backend/main.py -i input.mp4 -o output.mp4 --det_threshold 0.5
    
    # Skip subtitle detection (use entire bottom region)
    python backend/main.py -i input.mp4 -o output.mp4 --inpaint_mode sttn-auto
    
    # Process image instead of video
    python backend/main.py -i input.jpg -o output.jpg --inpaint_mode lama
    
    # Specify GPU device
    python backend/main.py -i input.mp4 -o output.mp4 --device cuda:0
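
    When calling the CLI from a script, it can help to validate arguments before launching a long job. The wrapper below is a sketch that assumes the flags shown above (-i, -o, --inpaint_mode, --det_threshold, --device) behave as described in this guide; the function itself is hypothetical and not part of VSR.

```python
import sys
from typing import List, Optional

# Mode names as listed in the options comment above.
INPAINT_MODES = ("sttn-auto", "sttn-det", "lama", "propainter", "opencv")


def build_vsr_command(
    input_path: str,
    output_path: str,
    mode: str = "sttn-det",
    det_threshold: Optional[float] = None,
    device: Optional[str] = None,
) -> List[str]:
    """Build the argument list for backend/main.py, rejecting invalid values early."""
    if mode not in INPAINT_MODES:
        raise ValueError(f"unknown inpaint mode {mode!r}; expected one of {INPAINT_MODES}")
    if det_threshold is not None and not 0.0 <= det_threshold <= 1.0:
        raise ValueError("det_threshold must be between 0.0 and 1.0")
    cmd = [sys.executable, "backend/main.py", "-i", input_path, "-o", output_path,
           "--inpaint_mode", mode]
    if det_threshold is not None:
        cmd += ["--det_threshold", str(det_threshold)]
    if device is not None:
        cmd += ["--device", device]
    return cmd
```

    The returned list can be passed directly to subprocess.run(cmd, check=True) from the repository root.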
  11. Step 11

    Choosing the right inpainting model

    Different inpainting models excel in different scenarios:

    STTN (Spatio-Temporal Transformer Network)

    • Best for: Live-action videos with camera movement
    • Speed: Fast (5-18 minutes for a 2-hour 1080p video on RTX 3090, depending on mode)
    • Quality: Excellent for videos where background is visible in other frames
    • VRAM: 6-8GB
    • Modes:
      • sttn-det: Uses OCR to detect subtitles
      • sttn-auto: Skips detection, processes entire bottom region (faster)
    • How it works: Reconstructs background from adjacent frames where subtitles differ

    LAMA (Large Mask Inpainting)

    • Best for: Static images, animated videos, or when background never appears
    • Speed: Moderate (10-25 minutes for a 2-hour 1080p video)
    • Quality: Good single-frame inpainting using surrounding context
    • VRAM: 4-6GB
    • How it works: Neural fill based on neighboring pixels in the same frame

    ProPainter

    • Best for: Complex scenes with intense movement and no clean frames
    • Speed: Slowest (30-60 minutes for a 2-hour 1080p video)
    • Quality: Best overall quality with minimal artifacts
    • VRAM: 12-16GB (high memory usage)
    • How it works: TBE reconstructs temporal background, then LAMA refines

    OpenCV Traditional

    • Best for: Testing or fallback only
    • Speed: Very fast but poor quality
    • Quality: Low (visible artifacts and blur)
    # Use STTN for typical live-action content (recommended)
    python backend/main.py -i movie.mp4 -o clean.mp4 --inpaint_mode sttn-det
    
    # Use LAMA for animated content or images
    python backend/main.py -i anime.mp4 -o clean.mp4 --inpaint_mode lama
    
    # Use ProPainter for highest quality (needs 12-16GB VRAM)
    python backend/main.py -i action.mp4 -o clean.mp4 --inpaint_mode propainter
    
    # Use sttn-auto for fastest processing (skips OCR)
    python backend/main.py -i series.mp4 -o clean.mp4 --inpaint_mode sttn-auto
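
    The selection guidance above can be condensed into a small helper. This is one possible reading of the recommendations in this step, not logic taken from VSR; the content labels and the 12GB cutoff for ProPainter are assumptions drawn from the VRAM figures listed here.

```python
def choose_inpaint_mode(content: str, vram_gb: float, prefer_speed: bool = False) -> str:
    """Suggest an --inpaint_mode value.

    content is one of: "live-action", "intense-motion", "animation", "image".
    """
    if content in ("animation", "image"):
        return "lama"        # single-frame fill suits images and animation
    if prefer_speed:
        return "sttn-auto"   # skips OCR detection for the fastest run
    if content == "intense-motion" and vram_gb >= 12:
        return "propainter"  # highest quality, but needs the most memory
    return "sttn-det"        # default for typical live-action video


if __name__ == "__main__":
    print(choose_inpaint_mode("live-action", vram_gb=8))
```

    Treat the result as a starting point; a short test clip is the most reliable way to compare modes on your own footage.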
  12. Step 12

    Advanced configuration and tuning

    Fine-tune the subtitle detection and removal process for optimal results:

    Text Detection Parameters:

    • det_threshold: Confidence threshold for text detection (default: 0.3, range: 0.0-1.0)
      • Lower = more sensitive, may detect non-subtitle text
      • Higher = less sensitive, may miss faint subtitles

    Mask Parameters:

    • mask_dilation: Expand detected regions to ensure complete coverage (default: 2 pixels)

    Processing Parameters:

    • frame_rate: Frame sampling rate for detection (default: 1, process every frame)
      • Set to 2-5 to speed up at cost of potentially missing subtitle changes

    Memory Optimization:

    • For 8GB VRAM: Use STTN or LAMA, avoid ProPainter
    • For 12GB+ VRAM: All models available
    • Reduce video resolution if VRAM errors occur
    # Lower threshold for faint or small subtitles
    python backend/main.py -i input.mp4 -o output.mp4 --det_threshold 0.2
    
    # Higher threshold to ignore watermarks or logos
    python backend/main.py -i input.mp4 -o output.mp4 --det_threshold 0.6
    
    # Expand mask to cover subtitle shadows/glow
    python backend/main.py -i input.mp4 -o output.mp4 --mask_dilation 4
    
    # Speed up by sampling every 3rd frame
    python backend/main.py -i input.mp4 -o output.mp4 --frame_rate 3
    
    # Full configuration example
    python backend/main.py \
      -i input.mp4 \
      -o output.mp4 \
      --inpaint_mode sttn-det \
      --det_threshold 0.4 \
      --mask_dilation 3 \
      --device cuda:0
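
    To see what mask dilation does, the sketch below grows a binary mask by a given number of pixels in every direction using a square neighborhood. The square shape is an assumption for illustration; VSR may dilate differently, and production code would typically use an optimized routine such as OpenCV's.

```python
from typing import List


def dilate(mask: List[List[int]], pixels: int) -> List[List[int]]:
    """Grow every masked pixel into a (2 * pixels + 1)-wide square, clipped to the frame."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            # Mark the neighborhood around (y, x), staying inside the frame.
            for ny in range(max(0, y - pixels), min(height, y + pixels + 1)):
                for nx in range(max(0, x - pixels), min(width, x + pixels + 1)):
                    out[ny][nx] = 1
    return out


if __name__ == "__main__":
    single = [[0] * 5 for _ in range(5)]
    single[2][2] = 1
    for row in dilate(single, 1):
        print("".join("#" if value else "." for value in row))
```

    A larger dilation covers outlines, shadows, and glow around the text, at the cost of asking the inpainting model to reconstruct more pixels.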
  13. Step 13

    Batch processing multiple videos

    Process multiple videos efficiently using shell scripts or Python automation:

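    #!/bin/bash
    # Bash script for batch processing (Linux/macOS)
    shopt -s nullglob            # skip the loop when no .mp4 files match
    mkdir -p ./output_videos     # make sure the output directory exists
    for video in ./input_videos/*.mp4; do
      filename=$(basename "$video" .mp4)
      python backend/main.py \
        -i "$video" \
        -o "./output_videos/${filename}_clean.mp4" \
        --inpaint_mode sttn-det
    done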
    
    # PowerShell script for batch processing (Windows)
    New-Item -ItemType Directory -Force -Path .\output_videos | Out-Null
    Get-ChildItem -Path .\input_videos\*.mp4 | ForEach-Object {
      $output = "output_videos\$($_.BaseName)_clean.mp4"
      python backend/main.py -i $_.FullName -o $output --inpaint_mode sttn-det
    }
    
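    # Python script for advanced batch processing
    import os
    import subprocess
    import sys
    
    input_dir = "./input_videos"
    output_dir = "./output_videos"
    os.makedirs(output_dir, exist_ok=True)
    
    for filename in sorted(os.listdir(input_dir)):
        if filename.lower().endswith(".mp4"):
            input_path = os.path.join(input_dir, filename)
            output_path = os.path.join(output_dir, f"{os.path.splitext(filename)[0]}_clean.mp4")
            # sys.executable reuses the interpreter of the active environment
            result = subprocess.run([
                sys.executable, "backend/main.py",
                "-i", input_path,
                "-o", output_path,
                "--inpaint_mode", "sttn-det"
            ])
            # Report failures but keep processing the remaining files
            if result.returncode != 0:
                print(f"Failed: {filename} (exit code {result.returncode})", file=sys.stderr)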
  14. Step 14

    Troubleshooting common issues

    CUDA Out of Memory:

    • Reduce video resolution or use a model with lower VRAM requirements
    • Close other GPU applications
    • Use STTN instead of ProPainter

    Text Detection Failing:

    • Lower the det_threshold parameter
    • Ensure subtitles are clear and high-contrast
    • Try sttn-auto mode to skip detection

    Poor Inpainting Quality:

    • Use ProPainter for highest quality (requires more VRAM)
    • Increase mask_dilation to cover subtitle shadows
    • For animated content, switch from STTN to LAMA

    Slow Processing Speed:

    • Verify GPU is being used: check nvidia-smi during processing
    • Use sttn-auto mode to skip OCR detection
    • Increase frame_rate parameter (may miss subtitle changes)

    FFmpeg Errors:

    • Ensure FFmpeg is installed and in PATH
    • Check video codec compatibility
    • Try converting input to a standard format (H.264 MP4)

    Model Download Failures:

    • Models auto-download on first run (requires internet)
    • Manually download from GitHub Releases if needed
    • Check models/ directory for existing files
    # Check if GPU is being utilized
    watch -n 1 nvidia-smi
    
    # Verify CUDA is available in Python
    python -c "import torch; print(torch.cuda.is_available())"
    
    # Test FFmpeg installation
    ffmpeg -version
    
    # Check available VRAM
    nvidia-smi --query-gpu=memory.free --format=csv,noheader
    
    # Convert video to standard format if needed
    ffmpeg -i input.mkv -c:v libx264 -c:a aac input.mp4
    
    # Clear Python cache if imports fail
    find . -type d -name __pycache__ -exec rm -r {} +
    find . -type f -name '*.pyc' -delete
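
    For scripted checks before a long job, the free-VRAM query above can be parsed in Python. This sketch assumes nvidia-smi prints one value per GPU in the form "12345 MiB", which is what the csv,noheader format produces; the function names are made up for this example.

```python
import re
import subprocess
from typing import List


def parse_memory_mib(output: str) -> List[int]:
    """Extract one integer (MiB) per GPU from nvidia-smi csv,noheader output."""
    values = []
    for line in output.splitlines():
        match = re.fullmatch(r"\s*(\d+)\s*MiB\s*", line)
        if match:
            values.append(int(match.group(1)))
    return values


def free_vram_mib() -> List[int]:
    """Query nvidia-smi for free memory; raises if the tool is missing or fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_mib(result.stdout)
```

    Comparing the result against the VRAM figures in the model comparison (for example 12GB or more for ProPainter) lets a script fall back to a lighter mode instead of failing with an out-of-memory error.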
  15. Step 15

    Performance benchmarks

    Approximate processing times for a 2-hour 1080p video on different hardware:

    NVIDIA RTX 4090:

    • STTN-auto: 3-5 minutes
    • STTN-det: 8-12 minutes
    • LAMA: 12-18 minutes
    • ProPainter: 15-25 minutes

    NVIDIA RTX 3090:

    • STTN-auto: 5-8 minutes
    • STTN-det: 12-18 minutes
    • LAMA: 18-28 minutes
    • ProPainter: 25-40 minutes

    NVIDIA RTX 3060 (12GB):

    • STTN-auto: 10-15 minutes
    • STTN-det: 20-30 minutes
    • LAMA: 30-45 minutes
    • ProPainter: 50-80 minutes

    CPU Only (Ryzen 9 5950X):

    • STTN: 4-8 hours
    • LAMA: 5-10 hours
    • ProPainter: Not recommended (10+ hours)

    Factors affecting speed:

    • Video resolution (4K has 4x the pixels of 1080p, so expect roughly 4x the time)
    • Subtitle density (more text = more processing)
    • Frame rate (60fps has twice the frames of 30fps, so expect roughly 2x the time)
    • Inpainting mode and parameters
    Processing Time Comparison (2-hour 1080p video):
    
    RTX 4090:
      STTN-auto:   █░░░░░░░░░  3-5 min
      STTN-det:    ███░░░░░░░  8-12 min
      LAMA:        ████░░░░░░  12-18 min
      ProPainter:  █████░░░░░  15-25 min
    
    RTX 3090:
      STTN-auto:   ██░░░░░░░░  5-8 min
      STTN-det:    ████░░░░░░  12-18 min
      LAMA:        ██████░░░░  18-28 min
      ProPainter:  ████████░░  25-40 min
    
    RTX 3060 (12GB):
      STTN-auto:   ████░░░░░░  10-15 min
      STTN-det:    ███████░░░  20-30 min
      LAMA:        █████████░  30-45 min
      ProPainter:  ██████████  50-80 min
    
    CPU (Ryzen 9 5950X):
      STTN:        4-8 hours 😰
      LAMA:        5-10 hours 😱
      ProPainter:  10+ hours 💀
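
    The resolution and frame-rate factors can be turned into a back-of-the-envelope estimate. The sketch below scales a baseline time linearly with pixel count and frame rate; it assumes the baseline figures refer to 1080p at 30fps, which the benchmarks above do not state explicitly, and it ignores subtitle density.

```python
def estimate_minutes(
    baseline_minutes: float,
    width: int,
    height: int,
    fps: float,
    base_width: int = 1920,
    base_height: int = 1080,
    base_fps: float = 30.0,
) -> float:
    """Scale a baseline processing time by pixel count and frame rate."""
    pixel_ratio = (width * height) / (base_width * base_height)
    fps_ratio = fps / base_fps
    return baseline_minutes * pixel_ratio * fps_ratio


if __name__ == "__main__":
    # 4K at 60fps, starting from a 10-minute baseline for 1080p at 30fps.
    print(f"{estimate_minutes(10, 3840, 2160, 60):.0f} minutes")
```

    Real scaling is rarely perfectly linear (memory limits and I/O also matter), so treat the result as an order-of-magnitude guide.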
  16. Step 16

    Project structure and customization

    Understanding the codebase structure for customization and development:

    backend/ - Core processing logic

    • main.py: CLI entry point
    • subtitle_detector.py: PP-OCR text detection
    • inpainter.py: Inpainting model orchestration
    • video_processor.py: FFmpeg video handling

    models/ - Pre-trained model weights

    • PP-OCR models (text detection/recognition)
    • STTN/LAMA/ProPainter weights
    • Auto-downloaded on first run

    gui.py - PySimpleGUI interface

    requirements.txt - Python dependencies

    docker/ - Dockerfile variants for different CUDA versions

    video-subtitle-remover/
    ├── backend/
    │   ├── main.py                # CLI entry point
    │   ├── subtitle_detector.py   # PP-OCR detection
    │   ├── inpainter.py          # Model orchestration
    │   ├── video_processor.py     # FFmpeg integration
    │   └── utils.py              # Helper functions
    ├── models/                    # Model weights (auto-dl)
    │   ├── pp_ocr/
    │   ├── sttn/
    │   ├── lama/
    │   └── propainter/
    ├── gui.py                     # PySimpleGUI interface
    ├── requirements.txt           # Dependencies
    ├── docker/
    │   ├── Dockerfile.cuda11.8
    │   ├── Dockerfile.cuda12
    │   └── Dockerfile.directml
    └── README.md
  17. Step 17

    Resources and community

    Official Repository: https://github.com/YaoFANGUK/video-subtitle-remover

    Docker Hub: https://hub.docker.com/r/eritpchy/video-subtitle-remover

    GitHub Issues: https://github.com/YaoFANGUK/video-subtitle-remover/issues

    Releases & Changelog: https://github.com/YaoFANGUK/video-subtitle-remover/releases

    Related Projects:

    • video-subtitle-extractor: Extract hardcoded subtitles to SRT files

    Model References:

    • STTN Paper: "Learning Joint Spatial-Temporal Transformations for Video Inpainting"
    • LAMA Paper: "Resolution-robust Large Mask Inpainting with Fourier Convolutions"
    • ProPainter: "ProPainter: Improving Propagation and Transformer for Video Inpainting"
    • PP-OCRv5: PaddleOCR text detection and recognition model

    License: Apache License 2.0 - Free for personal and commercial use

    Support: Community support via GitHub Issues

    Main Repo:    https://github.com/YaoFANGUK/video-subtitle-remover
    Docker:       https://hub.docker.com/r/eritpchy/video-subtitle-remover
    Issues:       https://github.com/YaoFANGUK/video-subtitle-remover/issues
    Releases:     https://github.com/YaoFANGUK/video-subtitle-remover/releases
    License:      Apache-2.0
    
    Related:
    - Subtitle Extractor: https://github.com/YaoFANGUK/video-subtitle-extractor
    
    Model Papers:
    - STTN: arXiv:2007.10247
    - LAMA: arXiv:2109.07161
    - ProPainter: arXiv:2309.03897
    - PP-OCR: https://github.com/PaddlePaddle/PaddleOCR
