
Video Subtitle Remover: AI-powered hardcoded subtitle removal

AI-based tool for removing hardcoded subtitles and text watermarks from videos and images using advanced inpainting models. Local processing without third-party APIs.

Step 1
What is Video Subtitle Remover?
Video Subtitle Remover (VSR) is an AI-based tool that removes hardcoded subtitles and text-like watermarks from videos or images. Unlike soft subtitles (which can be toggled off), hardcoded subtitles are permanently burned into the video. VSR uses advanced deep learning models to detect subtitle regions and intelligently fill them by reconstructing the background content. The tool runs entirely locally without requiring third-party API calls, giving you full control over your content and privacy.
Step 2
Technology stack
Video Subtitle Remover is built on a modern Python-based AI/ML stack with GPU acceleration support:

Core Framework:
- Python 3.12 or 3.13
- PyTorch 2.7.0 for deep learning
- TorchVision 0.22.0 for vision models
- PaddlePaddle 3.0.0 for OCR capabilities
Computer Vision & Processing:
- OpenCV (cv2) for image/video processing
- FFmpeg for video encoding, frame extraction, and audio merging
- PP-OCRv5 model for text detection and recognition
AI Inpainting Models:
- STTN (Spatio-Temporal Transformer Network) - Fast, uses temporal information from adjacent frames
- LAMA (Large Mask Inpainting) - Single-frame neural fill, best for images and animated content
- ProPainter - Hybrid approach combining TBE (Temporal Background Extraction) and LAMA refinement
- OpenCV traditional inpainting as fallback
GUI & Interface:
- PySimpleGUI for the graphical user interface
- Command-line interface for batch processing
Hardware Acceleration:
- CUDA support for NVIDIA GPUs (recommended)
- DirectML for AMD/Intel GPU acceleration
- CPU fallback mode (slower)
- Apple Silicon (macOS) optimization
Deployment:
- Docker containers with CUDA 11.8, 12.6, 12.8, DirectML, and CPU variants
- Conda/pip for dependency management
- GitHub Actions for CI/CD
```
Core Stack:
├── Python 3.12+
├── PyTorch 2.7.0 + TorchVision 0.22.0
├── PaddlePaddle 3.0.0
├── OpenCV (cv2)
└── FFmpeg

AI Models:
├── PP-OCRv5 (text detection)
├── STTN (temporal inpainting)
├── LAMA (single-frame inpainting)
└── ProPainter (hybrid inpainting)

Hardware Support:
├── CUDA (NVIDIA GPU) - Recommended
├── DirectML (AMD/Intel GPU)
├── CPU (no GPU)
└── Apple Silicon (macOS)
```
Step 3
How it works: The subtitle removal pipeline
VSR uses a multi-stage pipeline to remove subtitles:

1. Text Detection (PP-OCRv5)
- Video frames are extracted using FFmpeg
- PP-OCRv5 detects text regions in each frame
- Bounding boxes are generated around detected subtitles
- Scene detection skips static frames for efficiency
2. Mask Generation
- Detected text regions are converted to binary masks
- Masks define which pixels need to be reconstructed
3. AI Inpainting
- STTN mode: Leverages temporal information - subtitles are sparse in time, so the model reconstructs background from adjacent frames where subtitles are absent or different
- LAMA mode: Single-frame neural fill using surrounding context
- ProPainter mode: First uses TBE to reconstruct temporal background, then LAMA refines residual artifacts
4. Video Reconstruction
- Inpainted frames are re-encoded using FFmpeg
- Original audio track is merged back
- Output video maintains original resolution and quality
```
Pipeline Flow:

[Input Video]
    ↓
[FFmpeg Frame Extraction]
    ↓
[PP-OCRv5 Text Detection] → Bounding boxes
    ↓
[Mask Generation] → Binary masks
    ↓
[AI Inpainting Model]
  ├── STTN (temporal)
  ├── LAMA (single-frame)
  └── ProPainter (hybrid)
    ↓
[FFmpeg Re-encoding]
    ↓
[Audio Merge]
    ↓
[Output Video (subtitle-free)]
```
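To make the temporal idea behind STTN mode concrete, here is a deliberately simplified sketch. It is not the STTN network: it only copies each masked pixel from the nearest frame in which that pixel is unmasked, which captures the intuition that subtitles are sparse in time. Frames are tiny grayscale grids stored as nested lists, and the names `temporal_fill` and `_nearest_clean_frame` are hypothetical.

```python
from typing import List, Optional

Frame = List[List[int]]   # grayscale pixel values
Mask = List[List[bool]]   # True where a subtitle pixel must be replaced


def temporal_fill(frames: List[Frame], masks: List[Mask]) -> List[Frame]:
    """Fill masked pixels from the nearest frame where the pixel is clean.

    Toy illustration only: a real model such as STTN also handles motion,
    lighting changes, and pixels that are never visible.
    """
    output = [[row[:] for row in frame] for frame in frames]
    for t in range(len(frames)):
        for y, mask_row in enumerate(masks[t]):
            for x, is_masked in enumerate(mask_row):
                if not is_masked:
                    continue
                donor = _nearest_clean_frame(masks, t, y, x)
                if donor is not None:
                    output[t][y][x] = frames[donor][y][x]
                # If no clean frame exists the pixel is left unchanged; a
                # single-frame method would have to take over here.
    return output


def _nearest_clean_frame(masks: List[Mask], t: int, y: int, x: int) -> Optional[int]:
    """Return the index of the closest frame whose pixel (y, x) is unmasked."""
    for offset in range(1, len(masks)):
        for candidate in (t - offset, t + offset):
            if 0 <= candidate < len(masks) and not masks[candidate][y][x]:
                return candidate
    return None


if __name__ == "__main__":
    # Three 1x2 frames; the subtitle covers the left pixel of the middle frame.
    frames = [[[10, 10]], [[255, 10]], [[12, 10]]]
    masks = [[[False, False]], [[True, False]], [[False, False]]]
    print(temporal_fill(frames, masks)[1])  # -> [[10, 10]]
```

When a pixel is masked in every frame, this sketch leaves it untouched, which is the situation the guide assigns to single-frame methods such as LAMA.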
Step 4
Prerequisites
Before setting up Video Subtitle Remover, ensure you have the appropriate hardware and software:

For Docker Deployment (Recommended):
- Docker 20.10+ and Docker Compose
- NVIDIA GPU with CUDA support (10/20/30/40 series)
- NVIDIA Container Toolkit (the successor to the legacy nvidia-docker2 package)
- At least 8GB GPU VRAM (16GB+ recommended for ProPainter)
- 10GB+ free disk space for Docker images and models
For Source Installation:
- Python 3.12 or 3.13
- NVIDIA GPU with CUDA 11.8 or 12.x support
- CUDA Toolkit and cuDNN installed
- Conda or venv for virtual environment
- FFmpeg installed and available in PATH
- At least 8GB GPU VRAM
General Requirements:
- 64-bit operating system (Linux, Windows, or macOS)
- For CPU-only mode: Modern CPU with AVX2 support (very slow, not recommended)
```
# Check Docker and NVIDIA support
docker --version
nvidia-smi  # Check GPU and CUDA version

# For source installation:
python --version  # Should be 3.12 or 3.13
nvcc --version    # Check CUDA Toolkit
ffmpeg -version   # Check FFmpeg

# Recommended: Check GPU VRAM
nvidia-smi --query-gpu=memory.total --format=csv,noheader
```
⚠ Heads up: **GPU Strongly Recommended**: While CPU mode is available, processing is 10-50x slower. A subtitle removal task that takes 5 minutes on an NVIDIA RTX 3090 may take hours on CPU. NVIDIA GPUs with CUDA are the primary supported hardware.
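
The same prerequisite checks can be scripted. The following is a minimal sketch using only the Python standard library; the function name `check_prerequisites` is hypothetical, and it only tests whether each tool is on PATH, not whether its version is suitable.

```python
import shutil
import sys


def check_prerequisites(required_tools=("ffmpeg", "nvidia-smi", "docker")):
    """Return a dict mapping each check name to True or False."""
    results = {
        # The guide targets Python 3.12 or 3.13.
        "python>=3.12": sys.version_info >= (3, 12),
    }
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

For a source installation without Docker, pass `required_tools=("ffmpeg", "nvidia-smi", "nvcc")` instead.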

Step 5

Quick start with Docker (NVIDIA GPU)

The fastest way to get started is using Docker with GPU support. First, ensure you have NVIDIA Container Toolkit installed, then pull and run the appropriate image based on your GPU generation.

```
# Run only the variant that matches your GPU; both commands use the container name "vsr".

# For NVIDIA 10/20/30 Series GPUs (CUDA 11.8)
docker pull eritpchy/video-subtitle-remover:1.4.0-cuda11.8

docker run -it --gpus all \
  --name vsr \
  -v "$(pwd)/videos:/workspace" \
  eritpchy/video-subtitle-remover:1.4.0-cuda11.8 \
  python backend/main.py \
  -i /workspace/input.mp4 \
  -o /workspace/output.mp4

# For NVIDIA 40 Series GPUs (CUDA 12.x)
docker pull eritpchy/video-subtitle-remover:1.4.0-cuda12

docker run -it --gpus all \
  --name vsr \
  -v "$(pwd)/videos:/workspace" \
  eritpchy/video-subtitle-remover:1.4.0-cuda12 \
  python backend/main.py \
  -i /workspace/input.mp4 \
  -o /workspace/output.mp4
```

⚠ Heads up: **NVIDIA Container Toolkit Required**: You must install the NVIDIA Container Toolkit (the successor to the legacy nvidia-docker2 package) to enable GPU access in Docker containers. See [NVIDIA Container Toolkit Installation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
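
If you launch the container from a script, it can help to build the `docker run` arguments as a list so that paths with spaces are passed safely. This is a sketch under assumptions: `build_docker_command` is a hypothetical helper, the image tags are the ones quoted in this step, and it uses `--rm` instead of `-it --name vsr` so that repeated runs do not collide on the container name.

```python
import shlex
from pathlib import Path

IMAGE = "eritpchy/video-subtitle-remover"  # repository name from the commands in this step


def build_docker_command(tag, host_dir, input_name, output_name, gpus="all"):
    """Build a `docker run` argument list equivalent to the commands in this step."""
    host_dir = Path(host_dir).resolve()    # Docker needs an absolute host path
    return [
        "docker", "run", "--rm",           # --rm removes the container afterwards
        "--gpus", gpus,
        "-v", f"{host_dir}:/workspace",
        f"{IMAGE}:{tag}",
        "python", "backend/main.py",
        "-i", f"/workspace/{input_name}",
        "-o", f"/workspace/{output_name}",
    ]


if __name__ == "__main__":
    cmd = build_docker_command("1.4.0-cuda11.8", "videos", "input.mp4", "output.mp4")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```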

Step 6

Docker with DirectML (AMD/Intel GPU)

For non-NVIDIA GPUs (AMD or Intel), use the DirectML variant, which provides GPU acceleration on Windows through DirectX:

```
# Pull DirectML image (Windows only)
docker pull eritpchy/video-subtitle-remover:1.4.0-directml

# Run with DirectML support (PowerShell syntax: backtick line continuation)
docker run -it --name vsr `
  -v "${PWD}/videos:/workspace" `
  eritpchy/video-subtitle-remover:1.4.0-directml `
  python backend/main.py `
  -i /workspace/input.mp4 `
  -o /workspace/output.mp4
```

Step 7

Installation from source (Linux/Windows)

For development or customization, install from source. This gives you full control over the environment and allows you to use the GUI interface.

1. Set up a virtual environment
2. Install PyTorch and PaddlePaddle
3. Clone the repository and install dependencies
4. Download the AI models

```
# Step 1: Create conda environment
conda create -n vsr python=3.12
conda activate vsr

# Step 2: Install PyTorch with CUDA support (run only the line matching your CUDA version)
# For CUDA 11.8:
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.6 (PyTorch 2.7.0 has no cu121 wheels):
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126

# Install PaddlePaddle with GPU support from the PaddlePaddle package index
# (CUDA 11.8 shown; replace cu118 with cu126 for CUDA 12.6)
python -m pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/

# Step 3: Clone and install
git clone https://github.com/YaoFANGUK/video-subtitle-remover.git
cd video-subtitle-remover
pip install -r requirements.txt

# Step 4: Models will auto-download on first run
# Or manually download to models/ directory from releases
```

Step 8

Installation from source (macOS Apple Silicon)

For Apple Silicon Macs (M1/M2/M3), use the MPS (Metal Performance Shaders) backend for GPU acceleration:

```
# Create virtual environment
python3 -m venv vsr-env
source vsr-env/bin/activate

# Install dependencies for macOS
pip install torch==2.7.0 torchvision==0.22.0
pip install paddlepaddle==3.0.0

# Clone repository
git clone https://github.com/YaoFANGUK/video-subtitle-remover.git
cd video-subtitle-remover

# Install requirements
pip install -r requirements.txt

# Ensure FFmpeg is installed
brew install ffmpeg
```
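
After installing, you can confirm which accelerator PyTorch sees. The sketch below uses `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, both part of PyTorch; the helper name `pick_device` is hypothetical, and whether VSR selects its device the same way is an assumption.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so nothing can be accelerated
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is PyTorch's Metal backend for Apple Silicon.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

On an M1/M2/M3 Mac with a working install this should report `mps`; if it reports `cpu`, check that the virtual environment is active and that PyTorch installed without errors.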

Step 9

Using the GUI interface

Video Subtitle Remover provides a user-friendly GUI built with PySimpleGUI for interactive subtitle removal:

Features:

- Visual file selection
- Preview of detected subtitle regions
- Real-time progress monitoring
- Adjustable detection parameters
- Choice of inpainting models
- Batch processing support

```
# Launch the GUI (source installation)
python gui.py

# The GUI window will open with:
# 1. Input file selection
# 2. Output path specification
# 3. Model selection (STTN/LAMA/ProPainter)
# 4. Detection parameters:
#    - Text detection threshold
#    - Mask dilation size
#    - Frame sampling rate
# 5. Start processing button
```

Step 10

Command-line usage and parameters

For batch processing and automation, use the command-line interface with various configuration options:

```
# Basic usage
python backend/main.py -i input.mp4 -o output.mp4

# Specify inpainting model
python backend/main.py -i input.mp4 -o output.mp4 --inpaint_mode sttn-det
# Options: sttn-auto, sttn-det, lama, propainter, opencv

# Adjust text detection threshold (0.0-1.0)
python backend/main.py -i input.mp4 -o output.mp4 --det_threshold 0.5

# Skip subtitle detection (use entire bottom region)
python backend/main.py -i input.mp4 -o output.mp4 --inpaint_mode sttn-auto

# Process image instead of video
python backend/main.py -i input.jpg -o output.jpg --inpaint_mode lama

# Specify GPU device
python backend/main.py -i input.mp4 -o output.mp4 --device cuda:0
```
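
To see how these flags fit together, here is a hypothetical `argparse` parser that mirrors them. It is not the project's actual parser: the long option names `--input` and `--output` and the defaults for `--inpaint_mode` and `--device` are assumptions, while the mode list and the 0.3 threshold default come from this guide.

```python
import argparse

# Mode names as listed in this guide.
INPAINT_MODES = ["sttn-auto", "sttn-det", "lama", "propainter", "opencv"]


def threshold(value: str) -> float:
    """argparse type that accepts a float in the range 0.0 to 1.0."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError("threshold must be between 0.0 and 1.0")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical mirror of the CLI flags shown above."""
    parser = argparse.ArgumentParser(description="Sketch of the VSR command-line flags")
    parser.add_argument("-i", "--input", required=True, help="input video or image")
    parser.add_argument("-o", "--output", required=True, help="output path")
    # Assumed default; the guide recommends sttn-det for typical content.
    parser.add_argument("--inpaint_mode", choices=INPAINT_MODES, default="sttn-det")
    parser.add_argument("--det_threshold", type=threshold, default=0.3)
    parser.add_argument("--device", default="cuda:0")  # assumed default
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-i", "input.mp4", "-o", "output.mp4"])
    print(args.inpaint_mode, args.det_threshold, args.device)
```

Restricting `--inpaint_mode` with `choices` means an unsupported value such as `sttn` is rejected immediately instead of failing later in the pipeline.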

Step 11
Choosing the right inpainting model
Different inpainting models excel in different scenarios:

STTN (Spatio-Temporal Transformer Network)
- Best for: Live-action videos with camera movement
- Speed: Fast (5-18 minutes for a 2-hour 1080p video on RTX 3090, depending on mode)
- Quality: Excellent for videos where background is visible in other frames
- VRAM: 6-8GB
- Modes:
  - sttn-det: Uses OCR to detect subtitles
  - sttn-auto: Skips detection, processes entire bottom region (faster)
- How it works: Reconstructs background from adjacent frames where subtitles differ
LAMA (Large Mask Inpainting)
- Best for: Static images, animated videos, or when background never appears
- Speed: Moderate (10-25 minutes for a 2-hour 1080p video)
- Quality: Good single-frame inpainting using surrounding context
- VRAM: 4-6GB
- How it works: Neural fill based on neighboring pixels in the same frame
ProPainter
- Best for: Complex scenes with intense movement and no clean frames
- Speed: Slowest (30-60 minutes for a 2-hour 1080p video)
- Quality: Best overall quality with minimal artifacts
- VRAM: 12-16GB (high memory usage)
- How it works: TBE reconstructs temporal background, then LAMA refines
OpenCV Traditional
- Best for: Testing or fallback only
- Speed: Very fast but poor quality
- Quality: Low (visible artifacts and blur)
```
# Use STTN for typical live-action content (recommended)
python backend/main.py -i movie.mp4 -o clean.mp4 --inpaint_mode sttn-det

# Use LAMA for animated content or images
python backend/main.py -i anime.mp4 -o clean.mp4 --inpaint_mode lama

# Use ProPainter for highest quality (needs 12-16GB VRAM)
python backend/main.py -i action.mp4 -o clean.mp4 --inpaint_mode propainter

# Use sttn-auto for fastest processing (skips OCR)
python backend/main.py -i series.mp4 -o clean.mp4 --inpaint_mode sttn-auto
```
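The guidance above can be condensed into a small decision helper. This is only a sketch: `choose_inpaint_mode` and its content labels are hypothetical, and the VRAM cut-offs are simply the lower bounds of the ranges quoted in this step.

```python
def choose_inpaint_mode(content: str, vram_gb: float, skip_ocr: bool = False) -> str:
    """Map the guidance above onto a mode name.

    content: "live-action", "animation", "image", or "high-motion".
    """
    if content in ("animation", "image"):
        # LAMA is listed at 4-6GB; below that only OpenCV is left.
        return "lama" if vram_gb >= 4 else "opencv"
    if content == "high-motion" and vram_gb >= 12:
        return "propainter"  # listed at 12-16GB
    if vram_gb >= 6:
        # STTN is listed at 6-8GB; sttn-auto skips OCR detection.
        return "sttn-auto" if skip_ocr else "sttn-det"
    if vram_gb >= 4:
        return "lama"
    return "opencv"


if __name__ == "__main__":
    print(choose_inpaint_mode("live-action", vram_gb=8))   # -> sttn-det
    print(choose_inpaint_mode("high-motion", vram_gb=16))  # -> propainter
```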
Step 12
Advanced configuration and tuning
Fine-tune the subtitle detection and removal process for optimal results:

Text Detection Parameters:
- det_threshold: Confidence threshold for text detection (default: 0.3, range: 0.0-1.0)
  - Lower = more sensitive, may detect non-subtitle text
  - Higher = less sensitive, may miss faint subtitles
Mask Parameters:
- mask_dilation: Expand detected regions to ensure complete coverage (default: 2 pixels)
Processing Parameters:
- frame_rate: Frame sampling rate for detection (default: 1, process every frame)
  - Set to 2-5 to speed up at cost of potentially missing subtitle changes
Memory Optimization:
- For 8GB VRAM: Use STTN or LAMA, avoid ProPainter
- For 12GB+ VRAM: All models available
- Reduce video resolution if VRAM errors occur
```
# Lower threshold for faint or small subtitles
python backend/main.py -i input.mp4 -o output.mp4 --det_threshold 0.2

# Higher threshold to ignore watermarks or logos
python backend/main.py -i input.mp4 -o output.mp4 --det_threshold 0.6

# Expand mask to cover subtitle shadows/glow
python backend/main.py -i input.mp4 -o output.mp4 --mask_dilation 4

# Speed up by sampling every 3rd frame
python backend/main.py -i input.mp4 -o output.mp4 --frame_rate 3

# Full configuration example
python backend/main.py \
  -i input.mp4 \
  -o output.mp4 \
  --inpaint_mode sttn-det \
  --det_threshold 0.4 \
  --mask_dilation 3 \
  --device cuda:0
```
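
To illustrate what `det_threshold` and `mask_dilation` control, the sketch below filters detections by confidence and then grows each surviving box before writing it into a binary mask. It is an assumption-laden illustration, not the project's code: `Detection` and `build_mask` are hypothetical names, and the dilation is a simple rectangular expansion rather than a morphological operation.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    x1: int       # left edge (inclusive)
    y1: int       # top edge (inclusive)
    x2: int       # right edge (exclusive)
    y2: int       # bottom edge (exclusive)
    score: float  # detector confidence, 0.0 to 1.0


def build_mask(detections: List[Detection], width: int, height: int,
               det_threshold: float = 0.3, mask_dilation: int = 2) -> List[List[int]]:
    """Return a height x width grid with 1 where pixels should be inpainted."""
    mask = [[0] * width for _ in range(height)]
    for det in detections:
        if det.score < det_threshold:
            continue  # too uncertain to treat as a subtitle
        # Grow the box by mask_dilation pixels and clamp it to the frame.
        x1 = max(0, det.x1 - mask_dilation)
        y1 = max(0, det.y1 - mask_dilation)
        x2 = min(width, det.x2 + mask_dilation)
        y2 = min(height, det.y2 + mask_dilation)
        for y in range(y1, y2):
            mask[y][x1:x2] = [1] * (x2 - x1)
    return mask


if __name__ == "__main__":
    boxes = [Detection(3, 3, 5, 4, 0.9), Detection(0, 0, 2, 1, 0.1)]
    for row in build_mask(boxes, width=8, height=6, mask_dilation=1):
        print("".join("#" if value else "." for value in row))
```

With the default threshold of 0.3, the low-confidence box in the top-left corner is ignored, and the remaining 2x1 box grows to 4x3 pixels. A larger `mask_dilation` covers glow and shadows at the cost of more pixels for the inpainting model to reconstruct.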

Step 13

Batch processing multiple videos

Process multiple videos efficiently using shell scripts or Python automation:

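Bash script for batch processing (Linux/macOS):

```
#!/bin/bash
# Remove subtitles from every .mp4 in ./input_videos
mkdir -p ./output_videos
for video in ./input_videos/*.mp4; do
  [ -e "$video" ] || continue  # skip when the glob matches nothing
  filename=$(basename "$video" .mp4)
  python backend/main.py \
    -i "$video" \
    -o "./output_videos/${filename}_clean.mp4" \
    --inpaint_mode sttn-det
done
```

PowerShell script for batch processing (Windows):

```
# Create the output folder if it does not exist yet
New-Item -ItemType Directory -Force -Path .\output_videos | Out-Null
Get-ChildItem -Path .\input_videos\*.mp4 | ForEach-Object {
  $output = "output_videos\$($_.BaseName)_clean.mp4"
  python backend/main.py -i $_.FullName -o $output --inpaint_mode sttn-det
}
```

Python script for advanced batch processing:

```
import subprocess
import sys
from pathlib import Path

input_dir = Path("./input_videos")
output_dir = Path("./output_videos")
output_dir.mkdir(parents=True, exist_ok=True)

for input_path in sorted(input_dir.glob("*.mp4")):
    output_path = output_dir / f"{input_path.stem}_clean.mp4"
    # sys.executable reuses the interpreter (and environment) running this script
    result = subprocess.run([
        sys.executable, "backend/main.py",
        "-i", str(input_path),
        "-o", str(output_path),
        "--inpaint_mode", "sttn-det",
    ])
    # Report failures but keep processing the remaining files
    if result.returncode != 0:
        print(f"Failed: {input_path.name} (exit code {result.returncode})")
```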

Step 14
Troubleshooting common issues
CUDA Out of Memory:
- Reduce video resolution or use a model with lower VRAM requirements
- Close other GPU applications
- Use STTN instead of ProPainter
Text Detection Failing:
- Lower the det_threshold parameter
- Ensure subtitles are clear and high-contrast
- Try sttn-auto mode to skip detection
Poor Inpainting Quality:
- Use ProPainter for highest quality (requires more VRAM)
- Increase mask_dilation to cover subtitle shadows
- For animated content, switch from STTN to LAMA
Slow Processing Speed:
- Verify GPU is being used: check nvidia-smi during processing
- Use sttn-auto mode to skip OCR detection
- Increase frame_rate parameter (may miss subtitle changes)
FFmpeg Errors:
- Ensure FFmpeg is installed and in PATH
- Check video codec compatibility
- Try converting input to a standard format (H.264 MP4)
Model Download Failures:
- Models auto-download on first run (requires internet)
- Manually download from GitHub Releases if needed
- Check models/ directory for existing files
```
# Check if GPU is being utilized
watch -n 1 nvidia-smi

# Verify CUDA is available in Python
python -c "import torch; print(torch.cuda.is_available())"

# Test FFmpeg installation
ffmpeg -version

# Check available VRAM
nvidia-smi --query-gpu=memory.free --format=csv,noheader

# Convert video to standard format if needed
ffmpeg -i input.mkv -c:v libx264 -c:a aac input.mp4

# Clear Python cache if imports fail
find . -type d -name __pycache__ -exec rm -r {} +
find . -type f -name '*.pyc' -delete
```
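
The VRAM check can also be done from Python before starting a long job. This sketch shells out to the same `nvidia-smi` query shown above and parses lines such as `24576 MiB`; the function names are hypothetical, and the 8GB cut-off is taken from this guide's memory advice rather than from the tool itself.

```python
import shutil
import subprocess
from typing import List


def parse_memory_mib(csv_output: str) -> List[int]:
    """Parse lines such as "24576 MiB" into integers, one per GPU."""
    values = []
    for line in csv_output.splitlines():
        tokens = line.split()
        # Skip blank lines and non-numeric values such as "[N/A]".
        if tokens and tokens[0].isdigit():
            values.append(int(tokens[0]))
    return values


def free_vram_mib() -> List[int]:
    """Query free VRAM per GPU, or return [] when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_memory_mib(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for index, mib in enumerate(free_vram_mib()):
        status = "enough for STTN/LAMA" if mib >= 8 * 1024 else "may hit out-of-memory errors"
        print(f"GPU {index}: {mib} MiB free ({status})")
```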

Step 15

Performance benchmarks

Approximate processing times for a 2-hour 1080p video on different hardware:

NVIDIA RTX 4090:
- STTN-auto: 3-5 minutes
- STTN-det: 8-12 minutes
- LAMA: 12-18 minutes
- ProPainter: 15-25 minutes

NVIDIA RTX 3090:
- STTN-auto: 5-8 minutes
- STTN-det: 12-18 minutes
- LAMA: 18-28 minutes
- ProPainter: 25-40 minutes

NVIDIA RTX 3060 (12GB):
- STTN-auto: 10-15 minutes
- STTN-det: 20-30 minutes
- LAMA: 30-45 minutes
- ProPainter: 50-80 minutes

CPU Only (Ryzen 9 5950X):
- STTN: 4-8 hours
- LAMA: 5-10 hours
- ProPainter: Not recommended (10+ hours)

Factors affecting speed:
- Video resolution (4K has four times the pixels of 1080p, so expect roughly 4x longer)
- Subtitle density (more text = more processing)
- Frame rate (60fps has twice as many frames as 30fps, so expect roughly 2x longer)
- Inpainting mode and parameters

```
Processing Time Comparison (2-hour 1080p video):

RTX 4090:
  STTN-auto:   █░░░░░░░░░  3-5 min
  STTN-det:    ███░░░░░░░  8-12 min
  LAMA:        ████░░░░░░  12-18 min
  ProPainter:  █████░░░░░  15-25 min

RTX 3090:
  STTN-auto:   ██░░░░░░░░  5-8 min
  STTN-det:    ████░░░░░░  12-18 min
  LAMA:        ██████░░░░  18-28 min
  ProPainter:  ████████░░  25-40 min

RTX 3060 (12GB):
  STTN-auto:   ████░░░░░░  10-15 min
  STTN-det:    ███████░░░  20-30 min
  LAMA:        █████████░  30-45 min
  ProPainter:  ██████████  50-80 min

CPU (Ryzen 9 5950X):
  STTN:        4-8 hours 😰
  LAMA:        5-10 hours 😱
  ProPainter:  10+ hours 💀
```
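
For a rough estimate on your own footage, the RTX 3090 figures can be scaled by the rules of thumb listed under "Factors affecting speed". This is back-of-the-envelope arithmetic, not a benchmark: `estimate_minutes` is a hypothetical helper, and it assumes that the reference times were measured at 30fps and that time scales linearly with video length.

```python
from typing import Tuple

# 2-hour 1080p reference ranges in minutes, copied from the RTX 3090 figures above.
RTX_3090_MINUTES = {
    "sttn-auto": (5, 8),
    "sttn-det": (12, 18),
    "lama": (18, 28),
    "propainter": (25, 40),
}


def estimate_minutes(mode: str, duration_hours: float = 2.0,
                     is_4k: bool = False, fps: float = 30.0) -> Tuple[float, float]:
    """Scale a reference range by duration, resolution, and frame rate."""
    low, high = RTX_3090_MINUTES[mode]
    factor = duration_hours / 2.0      # reference videos are 2 hours long
    factor *= 4.0 if is_4k else 1.0    # rule of thumb: 4K is about 4x slower
    factor *= fps / 30.0               # rule of thumb: 60fps is about 2x slower
    return (low * factor, high * factor)


if __name__ == "__main__":
    # A 1-hour 1080p video at 60fps: half the length, twice the frames per second.
    print(estimate_minutes("sttn-det", duration_hours=1.0, fps=60.0))  # -> (12.0, 18.0)
```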

Step 16

Project structure and customization

Understanding the codebase structure for customization and development:

backend/ - Core processing logic
- main.py: CLI entry point
- subtitle_detector.py: PP-OCR text detection
- inpainter.py: Inpainting model orchestration
- video_processor.py: FFmpeg video handling

models/ - Pre-trained model weights
- PP-OCR models (text detection/recognition)
- STTN/LAMA/ProPainter weights
- Auto-downloaded on first run

gui.py - PySimpleGUI interface

requirements.txt - Python dependencies

docker/ - Dockerfile variants for different CUDA versions

```
video-subtitle-remover/
├── backend/
│   ├── main.py                # CLI entry point
│   ├── subtitle_detector.py   # PP-OCR detection
│   ├── inpainter.py           # Model orchestration
│   ├── video_processor.py     # FFmpeg integration
│   └── utils.py               # Helper functions
├── models/                    # Model weights (auto-dl)
│   ├── pp_ocr/
│   ├── sttn/
│   ├── lama/
│   └── propainter/
├── gui.py                     # PySimpleGUI interface
├── requirements.txt           # Dependencies
├── docker/
│   ├── Dockerfile.cuda11.8
│   ├── Dockerfile.cuda12
│   └── Dockerfile.directml
└── README.md
```

Step 17
Resources and community
Official Repository: https://github.com/YaoFANGUK/video-subtitle-remover

Docker Hub: https://hub.docker.com/r/eritpchy/video-subtitle-remover

GitHub Issues: https://github.com/YaoFANGUK/video-subtitle-remover/issues

Releases & Changelog: https://github.com/YaoFANGUK/video-subtitle-remover/releases

Related Projects:
- video-subtitle-extractor: Extract hardcoded subtitles to SRT files
Model References:
- STTN Paper: "Learning Joint Spatial-Temporal Transformations for Video Inpainting"
- LAMA Paper: "Resolution-robust Large Mask Inpainting with Fourier Convolutions"
- ProPainter: "ProPainter: Improving Propagation and Transformer for Video Inpainting"
- PP-OCRv5: PaddlePaddle's latest OCR model
License: MIT License - Free for personal and commercial use

Support: Community support via GitHub Issues
```
Main Repo:    https://github.com/YaoFANGUK/video-subtitle-remover
Docker:       https://hub.docker.com/r/eritpchy/video-subtitle-remover
Issues:       https://github.com/YaoFANGUK/video-subtitle-remover/issues
Releases:     https://github.com/YaoFANGUK/video-subtitle-remover/releases
License:      MIT

Related:
- Subtitle Extractor: https://github.com/YaoFANGUK/video-subtitle-extractor

Model Papers:
- STTN: arXiv:2007.10247
- LAMA: arXiv:2109.07161
- ProPainter: arXiv:2309.03897
- PP-OCR: https://github.com/PaddlePaddle/PaddleOCR
```
