Video Subtitle Remover: AI-powered hardcoded subtitle removal
An AI-based tool that removes hardcoded subtitles and text watermarks from videos and images using inpainting models. Processing runs locally, with no third-party API calls.
- Step 1
What is Video Subtitle Remover?
Video Subtitle Remover (VSR) is an AI-based tool that removes hardcoded subtitles and text-like watermarks from videos or images. Unlike soft subtitles (which can be toggled off), hardcoded subtitles are permanently burned into the video. VSR uses advanced deep learning models to detect subtitle regions and intelligently fill them by reconstructing the background content. The tool runs entirely locally without requiring third-party API calls, giving you full control over your content and privacy.
- Step 2
Technology stack
Video Subtitle Remover is built on a modern Python-based AI/ML stack with GPU acceleration support:
Core Framework:
- Python 3.12 or 3.13
- PyTorch 2.7.0 for deep learning
- TorchVision 0.22.0 for vision models
- PaddlePaddle 3.0.0 for OCR capabilities
Computer Vision & Processing:
- OpenCV (cv2) for image/video processing
- FFmpeg for video encoding, frame extraction, and audio merging
- PP-OCRv5 model for text detection and recognition
AI Inpainting Models:
- STTN (Spatio-Temporal Transformer Network) - Fast, uses temporal information from adjacent frames
- LAMA (Large Mask Inpainting) - Single-frame neural fill, best for images and animated content
- ProPainter - Hybrid approach combining TBE (Temporal Background Extraction) and LAMA refinement
- OpenCV traditional inpainting as fallback
GUI & Interface:
- PySimpleGUI for the graphical user interface
- Command-line interface for batch processing
Hardware Acceleration:
- CUDA support for NVIDIA GPUs (recommended)
- DirectML for AMD/Intel GPU acceleration
- CPU fallback mode (slower)
- Apple Silicon (macOS) optimization
Deployment:
- Docker containers with CUDA 11.8, 12.6, 12.8, DirectML, and CPU variants
- Conda/pip for dependency management
- GitHub Actions for CI/CD
Core Stack:
├── Python 3.12+
├── PyTorch 2.7.0 + TorchVision 0.22.0
├── PaddlePaddle 3.0.0
├── OpenCV (cv2)
└── FFmpeg

AI Models:
├── PP-OCRv5 (text detection)
├── STTN (temporal inpainting)
├── LAMA (single-frame inpainting)
└── ProPainter (hybrid inpainting)

Hardware Support:
├── CUDA (NVIDIA GPU) - Recommended
├── DirectML (AMD/Intel GPU)
├── CPU (no GPU)
└── Apple Silicon (macOS)
- Step 3
How it works: The subtitle removal pipeline
VSR uses a multi-stage pipeline to remove subtitles:
1. Text Detection (PP-OCRv5)
- Video frames are extracted using FFmpeg
- PP-OCRv5 detects text regions in each frame
- Bounding boxes are generated around detected subtitles
- Scene detection skips static frames for efficiency
2. Mask Generation
- Detected text regions are converted to binary masks
- Masks define which pixels need to be reconstructed
3. AI Inpainting
- STTN mode: Leverages temporal information - subtitles are sparse in time, so the model reconstructs background from adjacent frames where subtitles are absent or different
- LAMA mode: Single-frame neural fill using surrounding context
- ProPainter mode: First uses TBE to reconstruct temporal background, then LAMA refines residual artifacts
4. Video Reconstruction
- Inpainted frames are re-encoded using FFmpeg
- Original audio track is merged back
- Output video keeps the original resolution; because frames are re-encoded, final quality depends on the encoder settings
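The mask generation stage can be illustrated with a small sketch. It is not VSR's actual code: the function name `boxes_to_mask`, the `(x_min, y_min, x_max, y_max)` box layout, and the list-of-lists mask are assumptions made for illustration, using only the standard library. It shows how detected bounding boxes might become a binary mask, with an optional dilation margin to catch subtitle shadows or glow.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (x_min, y_min, x_max, y_max), max exclusive.
Box = Tuple[int, int, int, int]


def boxes_to_mask(boxes: List[Box], width: int, height: int,
                  dilation: int = 0) -> List[List[int]]:
    """Return a height x width mask: 1 = pixel to reconstruct, 0 = keep."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Grow the box by `dilation` pixels on every side, clamped to the frame.
        x0 = max(0, x0 - dilation)
        y0 = max(0, y0 - dilation)
        x1 = min(width, x1 + dilation)
        y1 = min(height, y1 + dilation)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 1
    return mask


if __name__ == "__main__":
    demo = boxes_to_mask([(2, 2, 6, 4)], width=8, height=6, dilation=1)
    for row in demo:
        print("".join("#" if value else "." for value in row))
```

A real implementation would more likely operate on NumPy arrays, but the logic is the same: the inpainting model only reconstructs pixels where the mask is 1.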
Pipeline Flow:

    [Input Video]
          ↓
    [FFmpeg Frame Extraction]
          ↓
    [PP-OCRv5 Text Detection] → Bounding boxes
          ↓
    [Mask Generation] → Binary masks
          ↓
    [AI Inpainting Model]
      ├── STTN (temporal)
      ├── LAMA (single-frame)
      └── ProPainter (hybrid)
          ↓
    [FFmpeg Re-encoding]
          ↓
    [Audio Merge]
          ↓
    [Output Video (subtitle-free)]
- Step 4
Prerequisites
Before setting up Video Subtitle Remover, ensure you have the appropriate hardware and software:
For Docker Deployment (Recommended):
- Docker 20.10+ and Docker Compose
- NVIDIA GPU with CUDA support (10/20/30/40 series)
- NVIDIA Container Toolkit (the nvidia-container-toolkit package, which supersedes the older nvidia-docker2)
- At least 8GB GPU VRAM (16GB+ recommended for ProPainter)
- 10GB+ free disk space for Docker images and models
For Source Installation:
- Python 3.12 or 3.13
- NVIDIA GPU with CUDA 11.8 or 12.x support
- CUDA Toolkit and cuDNN installed
- Conda or venv for virtual environment
- FFmpeg installed and available in PATH
- At least 8GB GPU VRAM
General Requirements:
- 64-bit operating system (Linux, Windows, or macOS)
- For CPU-only mode: Modern CPU with AVX2 support (very slow, not recommended)
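As a rough preflight check, the requirements above can be tested from Python before installing anything. This is a minimal sketch, not part of VSR: the function name `check_prerequisites` and the report keys are made up for illustration, and it only confirms that the executables are on PATH, not that they work.

```python
import shutil
import sys


def check_prerequisites(tools=("docker", "nvidia-smi", "ffmpeg")):
    """Report whether the Python version and required executables look usable."""
    # The guide asks for Python 3.12 or 3.13.
    report = {"python_ok": sys.version_info[:2] in ((3, 12), (3, 13))}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:12s} {'found' if ok else 'MISSING'}")
```

For a source installation without Docker, pass `tools=("nvidia-smi", "ffmpeg")` to skip the Docker check.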
    # Check Docker and NVIDIA support
    docker --version
    nvidia-smi        # Check GPU and CUDA version

    # For source installation:
    python --version  # Should be 3.12 or 3.13
    nvcc --version    # Check CUDA Toolkit
    ffmpeg -version   # Check FFmpeg

    # Recommended: Check GPU VRAM
    nvidia-smi --query-gpu=memory.total --format=csv,noheader

⚠ Heads up: GPU strongly recommended. While CPU mode is available, processing is 10-50x slower. A subtitle removal task that takes 5 minutes on an NVIDIA RTX 3090 may take hours on CPU. NVIDIA GPUs with CUDA are the primary supported hardware.
- Step 5
Quick start with Docker (NVIDIA GPU)
The fastest way to get started is using Docker with GPU support. First, ensure you have NVIDIA Container Toolkit installed, then pull and run the appropriate image based on your GPU generation.
    # For NVIDIA 10/20/30 Series GPUs (CUDA 11.8)
    docker pull eritpchy/video-subtitle-remover:1.4.0-cuda11.8
    docker run -it --gpus all \
        --name vsr \
        -v "$(pwd)/videos:/workspace" \
        eritpchy/video-subtitle-remover:1.4.0-cuda11.8 \
        python backend/main.py \
        -i /workspace/input.mp4 \
        -o /workspace/output.mp4

    # For NVIDIA 40 Series GPUs (CUDA 12.x)
    docker pull eritpchy/video-subtitle-remover:1.4.0-cuda12
    docker run -it --gpus all \
        --name vsr \
        -v "$(pwd)/videos:/workspace" \
        eritpchy/video-subtitle-remover:1.4.0-cuda12 \
        python backend/main.py \
        -i /workspace/input.mp4 \
        -o /workspace/output.mp4

⚠ Heads up: NVIDIA Container Toolkit required. GPU access inside Docker containers needs the NVIDIA Container Toolkit (the nvidia-container-toolkit package, which supersedes the older nvidia-docker2). See the NVIDIA Container Toolkit installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
- Step 6
Docker with DirectML (AMD/Intel GPU)
For non-NVIDIA GPUs (AMD or Intel), use the DirectML variant which provides GPU acceleration on Windows through DirectX:
    REM Pull DirectML image (Windows only, run from Command Prompt)
    docker pull eritpchy/video-subtitle-remover:1.4.0-directml

    REM Run with DirectML support (^ continues a line in Command Prompt)
    docker run -it --name vsr ^
        -v "%cd%/videos:/workspace" ^
        eritpchy/video-subtitle-remover:1.4.0-directml ^
        python backend/main.py ^
        -i /workspace/input.mp4 ^
        -o /workspace/output.mp4
- Step 7
Installation from source (Linux/Windows)
For development or customization, install from source. This gives you full control over the environment and allows you to use the GUI interface.
Step 1: Set up virtual environment
Step 2: Install PyTorch and PaddlePaddle
Step 3: Clone repository and install dependencies
Step 4: Download AI models
    # Step 1: Create conda environment
    conda create -n vsr python=3.12
    conda activate vsr

    # Step 2: Install PyTorch with CUDA support
    # For CUDA 11.8:
    pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu118
    # For CUDA 12.6 (PyTorch 2.7.0 has no cu121 build; use cu128 for CUDA 12.8):
    pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126

    # Install PaddlePaddle with GPU support
    python -m pip install paddlepaddle-gpu==3.0.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

    # Step 3: Clone and install
    git clone https://github.com/YaoFANGUK/video-subtitle-remover.git
    cd video-subtitle-remover
    pip install -r requirements.txt

    # Step 4: Models will auto-download on first run
    # Or manually download to models/ directory from releases
- Step 8
Installation from source (macOS Apple Silicon)
For Apple Silicon Macs (M1/M2/M3), use the MPS (Metal Performance Shaders) backend for GPU acceleration:
    # Create virtual environment
    python3 -m venv vsr-env
    source vsr-env/bin/activate

    # Install dependencies for macOS
    pip install torch==2.7.0 torchvision==0.22.0
    pip install paddlepaddle==3.0.0

    # Clone repository
    git clone https://github.com/YaoFANGUK/video-subtitle-remover.git
    cd video-subtitle-remover

    # Install requirements
    pip install -r requirements.txt

    # Ensure FFmpeg is installed
    brew install ffmpeg
- Step 9
Using the GUI interface
Video Subtitle Remover provides a user-friendly GUI built with PySimpleGUI for interactive subtitle removal:
Features:
- Visual file selection
- Preview of detected subtitle regions
- Real-time progress monitoring
- Adjustable detection parameters
- Choice of inpainting models
- Batch processing support
    # Launch the GUI (source installation)
    python gui.py

    # The GUI window will open with:
    # 1. Input file selection
    # 2. Output path specification
    # 3. Model selection (STTN/LAMA/ProPainter)
    # 4. Detection parameters:
    #    - Text detection threshold
    #    - Mask dilation size
    #    - Frame sampling rate
    # 5. Start processing button
- Step 10
Command-line usage and parameters
For batch processing and automation, use the command-line interface with various configuration options:
    # Basic usage
    python backend/main.py -i input.mp4 -o output.mp4

    # Specify inpainting model
    # Options: sttn-auto, sttn-det, lama, propainter, opencv
    python backend/main.py -i input.mp4 -o output.mp4 --inpaint_mode sttn-det

    # Adjust text detection threshold (0.0-1.0)
    python backend/main.py -i input.mp4 -o output.mp4 --det_threshold 0.5

    # Skip subtitle detection (use entire bottom region)
    python backend/main.py -i input.mp4 -o output.mp4 --inpaint_mode sttn-auto

    # Process image instead of video
    python backend/main.py -i input.jpg -o output.jpg --inpaint_mode lama

    # Specify GPU device
    python backend/main.py -i input.mp4 -o output.mp4 --device cuda:0
- Step 11
Choosing the right inpainting model
Different inpainting models excel in different scenarios:
STTN (Spatio-Temporal Transformer Network)
- Best for: Live-action videos with camera movement
- Speed: Fast (roughly 5-18 minutes for a 2-hour 1080p video on an RTX 3090, depending on mode)
- Quality: Excellent for videos where background is visible in other frames
- VRAM: 6-8GB
- Modes:
  - sttn-det: Uses OCR to detect subtitles
  - sttn-auto: Skips detection, processes entire bottom region (faster)
- How it works: Reconstructs background from adjacent frames where subtitles differ
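The temporal idea behind this mode can be illustrated with a deliberately simplified sketch. It is not STTN, which is a learned transformer model; it swaps in a plain nearest-clean-frame copy and assumes a static camera. The names `temporal_fill` and `_first_clean` and the grayscale list-of-lists frame layout are assumptions made for illustration.

```python
from typing import List, Optional

Frame = List[List[int]]  # grayscale frame as rows of pixel values
Mask = List[List[int]]   # 1 = subtitle pixel, 0 = clean pixel


def _first_clean(masks: List[Mask], t: int, offset: int,
                 y: int, x: int) -> Optional[int]:
    """Return the index of a frame `offset` steps from t where (y, x) is clean."""
    for candidate in (t - offset, t + offset):
        if 0 <= candidate < len(masks) and not masks[candidate][y][x]:
            return candidate
    return None


def temporal_fill(frames: List[Frame], masks: List[Mask]) -> List[Frame]:
    """Fill each masked pixel from the nearest frame in time where it is clean."""
    out = [[row[:] for row in frame] for frame in frames]
    for t in range(len(frames)):
        for y, mask_row in enumerate(masks[t]):
            for x, masked in enumerate(mask_row):
                if not masked:
                    continue
                # Search outward in time: t-1, t+1, t-2, t+2, ...
                for offset in range(1, len(frames)):
                    donor = _first_clean(masks, t, offset, y, x)
                    if donor is not None:
                        out[t][y][x] = frames[donor][y][x]
                        break
                # Pixels that are masked in every frame are left unchanged.
    return out
```

The last comment marks the limitation of any purely temporal approach: if the background behind the subtitle never appears in another frame, there is nothing to borrow, which is the case where a single-frame model is the better fit.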
LAMA (Large Mask Inpainting)
- Best for: Static images, animated videos, or when background never appears
- Speed: Moderate (10-25 minutes for a 2-hour 1080p video)
- Quality: Good single-frame inpainting using surrounding context
- VRAM: 4-6GB
- How it works: Neural fill based on neighboring pixels in the same frame
ProPainter
- Best for: Complex scenes with intense movement and no clean frames
- Speed: Slowest (30-60 minutes for a 2-hour 1080p video)
- Quality: Best overall quality with minimal artifacts
- VRAM: 12-16GB (high memory usage)
- How it works: TBE reconstructs temporal background, then LAMA refines
OpenCV Traditional
- Best for: Testing or fallback only
- Speed: Very fast but poor quality
- Quality: Low (visible artifacts and blur)
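The guidance above can be condensed into a small decision helper. This is only a sketch: the function name `choose_inpaint_mode` and the content labels are hypothetical, and the 16 GB cutoff for ProPainter is an assumption taken from the VRAM recommendation in this guide rather than a limit enforced by the tool.

```python
def choose_inpaint_mode(content: str, vram_gb: float, use_ocr: bool = True) -> str:
    """Map the guidance in this section to an --inpaint_mode value.

    content: "live_action", "animation", "image", or "complex_motion"
             (hypothetical labels used only in this sketch).
    """
    if content in ("animation", "image"):
        # Single-frame fill suits stills and animated content.
        return "lama"
    if content == "complex_motion" and vram_gb >= 16:
        # Assumed cutoff: ProPainter is recommended only with 16 GB+ of VRAM.
        return "propainter"
    # Default for live action, and fallback when VRAM is too small for ProPainter.
    return "sttn-det" if use_ocr else "sttn-auto"


if __name__ == "__main__":
    print(choose_inpaint_mode("live_action", vram_gb=8))
    print(choose_inpaint_mode("complex_motion", vram_gb=24))
```

The OpenCV mode is left out on purpose, since it is described as a testing or fallback option only.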
    # Use STTN for typical live-action content (recommended)
    python backend/main.py -i movie.mp4 -o clean.mp4 --inpaint_mode sttn-det

    # Use LAMA for animated content or images
    python backend/main.py -i anime.mp4 -o clean.mp4 --inpaint_mode lama

    # Use ProPainter for highest quality (16GB+ VRAM recommended)
    python backend/main.py -i action.mp4 -o clean.mp4 --inpaint_mode propainter

    # Use sttn-auto for fastest processing (skips OCR)
    python backend/main.py -i series.mp4 -o clean.mp4 --inpaint_mode sttn-auto
- Step 12
Advanced configuration and tuning
Fine-tune the subtitle detection and removal process for optimal results:
Text Detection Parameters:
det_threshold: Confidence threshold for text detection (default: 0.3, range: 0.0-1.0)
- Lower = more sensitive, may detect non-subtitle text
- Higher = less sensitive, may miss faint subtitles
Mask Parameters:
mask_dilation: Expand detected regions to ensure complete coverage (default: 2 pixels)
Processing Parameters:
frame_rate: Frame sampling rate for detection (default: 1, process every frame)
- Set to 2-5 to speed up at cost of potentially missing subtitle changes
Memory Optimization:
- For 8GB VRAM: Use STTN or LAMA, avoid ProPainter
- For 12GB+ VRAM: All models available
- Reduce video resolution if VRAM errors occur
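To make the effect of the detection threshold and frame sampling concrete, here is a minimal sketch. The helper names and the `(box, confidence)` tuple layout are assumptions made for illustration, and whether the tool compares with `>=` or `>` is not stated in this guide; the sketch uses `>=`.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]
Detection = Tuple[Box, float]  # (bounding box, confidence) - hypothetical layout


def filter_detections(detections: List[Detection],
                      det_threshold: float = 0.3) -> List[Box]:
    """Keep only boxes whose confidence reaches det_threshold."""
    return [box for box, score in detections if score >= det_threshold]


def frames_to_scan(total_frames: int, frame_rate: int = 1) -> List[int]:
    """Frame indices on which detection runs when sampling every frame_rate-th frame."""
    return list(range(0, total_frames, frame_rate))


if __name__ == "__main__":
    detections = [((0, 0, 10, 5), 0.25), ((0, 20, 10, 25), 0.70)]
    print(filter_detections(detections, det_threshold=0.3))  # faint box dropped
    print(filter_detections(detections, det_threshold=0.2))  # both boxes kept
    print(frames_to_scan(10, frame_rate=3))                  # frames 1, 2, 4, 5, ... skipped
```

With `frame_rate=3`, a subtitle that appears on frame 1 and changes on frame 2 is never inspected directly, which is the trade-off noted above.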
    # Lower threshold for faint or small subtitles
    python backend/main.py -i input.mp4 -o output.mp4 --det_threshold 0.2

    # Higher threshold to ignore watermarks or logos
    python backend/main.py -i input.mp4 -o output.mp4 --det_threshold 0.6

    # Expand mask to cover subtitle shadows/glow
    python backend/main.py -i input.mp4 -o output.mp4 --mask_dilation 4

    # Speed up by sampling every 3rd frame
    python backend/main.py -i input.mp4 -o output.mp4 --frame_rate 3

    # Full configuration example
    python backend/main.py \
        -i input.mp4 \
        -o output.mp4 \
        --inpaint_mode sttn-det \
        --det_threshold 0.4 \
        --mask_dilation 3 \
        --device cuda:0
- Step 13
Batch processing multiple videos
Process multiple videos efficiently using shell scripts or Python automation:
Bash script for batch processing (Linux/macOS):

    #!/bin/bash
    # Create the output directory if it does not exist yet
    mkdir -p ./output_videos
    for video in ./input_videos/*.mp4; do
        filename=$(basename "$video" .mp4)
        python backend/main.py \
            -i "$video" \
            -o "./output_videos/${filename}_clean.mp4" \
            --inpaint_mode sttn-det
    done

PowerShell script for batch processing (Windows):

    New-Item -ItemType Directory -Force -Path output_videos | Out-Null
    Get-ChildItem -Path .\input_videos\*.mp4 | ForEach-Object {
        $output = "output_videos\$($_.BaseName)_clean.mp4"
        python backend/main.py -i $_.FullName -o $output --inpaint_mode sttn-det
    }

Python script for advanced batch processing:

    import os
    import subprocess
    import sys

    input_dir = "./input_videos"
    output_dir = "./output_videos"
    os.makedirs(output_dir, exist_ok=True)

    for filename in sorted(os.listdir(input_dir)):
        if filename.lower().endswith(".mp4"):
            input_path = os.path.join(input_dir, filename)
            output_path = os.path.join(output_dir, f"{os.path.splitext(filename)[0]}_clean.mp4")
            # sys.executable keeps the run inside the active virtual environment
            result = subprocess.run([
                sys.executable, "backend/main.py",
                "-i", input_path,
                "-o", output_path,
                "--inpaint_mode", "sttn-det",
            ])
            # Report failures but keep processing the remaining files
            if result.returncode != 0:
                print(f"Failed: {filename}", file=sys.stderr)
- Step 14
Troubleshooting common issues
CUDA Out of Memory:
- Reduce video resolution or use a model with lower VRAM requirements
- Close other GPU applications
- Use STTN instead of ProPainter
Text Detection Failing:
- Lower the det_threshold parameter
- Ensure subtitles are clear and high-contrast
- Try sttn-auto mode to skip detection
Poor Inpainting Quality:
- Use ProPainter for highest quality (requires more VRAM)
- Increase mask_dilation to cover subtitle shadows
- For animated content, switch from STTN to LAMA
Slow Processing Speed:
- Verify GPU is being used: check nvidia-smi during processing
- Use sttn-auto mode to skip OCR detection
- Increase frame_rate parameter (may miss subtitle changes)
FFmpeg Errors:
- Ensure FFmpeg is installed and in PATH
- Check video codec compatibility
- Try converting input to a standard format (H.264 MP4)
Model Download Failures:
- Models auto-download on first run (requires internet)
- Manually download from GitHub Releases if needed
- Check models/ directory for existing files
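When chasing out-of-memory errors it can help to read free VRAM programmatically. The sketch below wraps the same `nvidia-smi` query used elsewhere in this guide; the function names are made up for illustration, and it assumes the usual one-line-per-GPU output such as `24258 MiB`.

```python
import shutil
import subprocess
from typing import List


def parse_free_vram_mib(csv_output: str) -> List[int]:
    """Parse lines such as '24258 MiB' (one per GPU) into integers."""
    values = []
    for line in csv_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        try:
            values.append(int(parts[0]))
        except ValueError:
            # Skip entries such as '[N/A]' that carry no number.
            continue
    return values


def query_free_vram_mib() -> List[int]:
    """Return free VRAM per GPU in MiB, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_free_vram_mib(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    print(query_free_vram_mib())
```

An empty list means no NVIDIA driver was found, which is also a quick hint that processing would fall back to CPU.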
    # Check if GPU is being utilized
    watch -n 1 nvidia-smi

    # Verify CUDA is available in Python
    python -c "import torch; print(torch.cuda.is_available())"

    # Test FFmpeg installation
    ffmpeg -version

    # Check available VRAM
    nvidia-smi --query-gpu=memory.free --format=csv,noheader

    # Convert video to standard format if needed
    ffmpeg -i input.mkv -c:v libx264 -c:a aac input.mp4

    # Clear Python cache if imports fail
    find . -type d -name __pycache__ -exec rm -r {} +
    find . -type f -name '*.pyc' -delete
- Step 15
Performance benchmarks
Approximate processing times for a 2-hour 1080p video on different hardware:
NVIDIA RTX 4090:
- STTN-auto: 3-5 minutes
- STTN-det: 8-12 minutes
- LAMA: 12-18 minutes
- ProPainter: 15-25 minutes
NVIDIA RTX 3090:
- STTN-auto: 5-8 minutes
- STTN-det: 12-18 minutes
- LAMA: 18-28 minutes
- ProPainter: 25-40 minutes
NVIDIA RTX 3060 (12GB):
- STTN-auto: 10-15 minutes
- STTN-det: 20-30 minutes
- LAMA: 30-45 minutes
- ProPainter: 50-80 minutes
CPU Only (Ryzen 9 5950X):
- STTN: 4-8 hours
- LAMA: 5-10 hours
- ProPainter: Not recommended (10+ hours)
Factors affecting speed:
- Video resolution (4K has four times the pixels of 1080p, so expect roughly 4x the processing time)
- Subtitle density (more text = more processing)
- Frame rate (60fps takes 2x longer than 30fps)
- Inpainting mode and parameters
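The resolution and frame-rate factors can be turned into a back-of-the-envelope estimate. This sketch assumes processing time scales linearly with pixel count and frame rate, and that the benchmark figures refer to 30 fps footage; the guide does not state the benchmark frame rate, so that baseline is an assumption, as is the function name.

```python
def estimate_minutes(baseline_minutes: float, width: int, height: int, fps: float,
                     base_width: int = 1920, base_height: int = 1080,
                     base_fps: float = 30.0) -> float:
    """Scale a 1080p baseline time linearly with pixel count and frame rate."""
    pixel_ratio = (width * height) / (base_width * base_height)
    fps_ratio = fps / base_fps
    return baseline_minutes * pixel_ratio * fps_ratio


if __name__ == "__main__":
    # A job that takes 10 minutes at 1080p/30fps, rescaled to other formats.
    print(estimate_minutes(10, 3840, 2160, 30))  # 4K: four times the pixels
    print(estimate_minutes(10, 1920, 1080, 60))  # 60 fps: twice the frames
```

Treat the result as a rough planning figure only; subtitle density and the chosen inpainting mode are not modeled.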
Processing Time Comparison (2-hour 1080p video):

    Hardware           STTN-auto    STTN-det     LAMA         ProPainter
    RTX 4090           3-5 min      8-12 min     12-18 min    15-25 min
    RTX 3090           5-8 min      12-18 min    18-28 min    25-40 min
    RTX 3060 (12GB)    10-15 min    20-30 min    30-45 min    50-80 min

    CPU (Ryzen 9 5950X): STTN 4-8 hours, LAMA 5-10 hours, ProPainter 10+ hours
- Step 16
Project structure and customization
Understanding the codebase structure for customization and development:
backend/ - Core processing logic
- main.py: CLI entry point
- subtitle_detector.py: PP-OCR text detection
- inpainter.py: Inpainting model orchestration
- video_processor.py: FFmpeg video handling
models/ - Pre-trained model weights
- PP-OCR models (text detection/recognition)
- STTN/LAMA/ProPainter weights
- Auto-downloaded on first run
gui.py - PySimpleGUI interface
requirements.txt - Python dependencies
docker/ - Dockerfile variants for different CUDA versions
video-subtitle-remover/
├── backend/
│   ├── main.py                # CLI entry point
│   ├── subtitle_detector.py   # PP-OCR detection
│   ├── inpainter.py           # Model orchestration
│   ├── video_processor.py     # FFmpeg integration
│   └── utils.py               # Helper functions
├── models/                    # Model weights (auto-dl)
│   ├── pp_ocr/
│   ├── sttn/
│   ├── lama/
│   └── propainter/
├── gui.py                     # PySimpleGUI interface
├── requirements.txt           # Dependencies
├── docker/
│   ├── Dockerfile.cuda11.8
│   ├── Dockerfile.cuda12
│   └── Dockerfile.directml
└── README.md
- Step 17
Resources and community
Official Repository: https://github.com/YaoFANGUK/video-subtitle-remover
Docker Hub: https://hub.docker.com/r/eritpchy/video-subtitle-remover
GitHub Issues: https://github.com/YaoFANGUK/video-subtitle-remover/issues
Releases & Changelog: https://github.com/YaoFANGUK/video-subtitle-remover/releases
Related Projects:
- video-subtitle-extractor: Extract hardcoded subtitles to SRT files
Model References:
- STTN Paper: "Learning Joint Spatial-Temporal Transformations for Video Inpainting"
- LAMA Paper: "Resolution-robust Large Mask Inpainting with Fourier Convolutions"
- ProPainter: "ProPainter: Improving Propagation and Transformer for Video Inpainting"
- PP-OCRv5: PaddlePaddle's latest OCR model
License: MIT License - Free for personal and commercial use
Support: Community support via GitHub Issues
Main Repo: https://github.com/YaoFANGUK/video-subtitle-remover
Docker: https://hub.docker.com/r/eritpchy/video-subtitle-remover
Issues: https://github.com/YaoFANGUK/video-subtitle-remover/issues
Releases: https://github.com/YaoFANGUK/video-subtitle-remover/releases
License: MIT

Related:
- Subtitle Extractor: https://github.com/YaoFANGUK/video-subtitle-extractor

Model Papers:
- STTN: arXiv:2007.10247
- LAMA: arXiv:2109.07161
- ProPainter: arXiv:2309.03897
- PP-OCR: https://github.com/PaddlePaddle/PaddleOCR