
Accurate Bangla Lip-Sync AI
Python · Wav2Lip GAN · GFPGAN · FFmpeg · OpenCV · NumPy · librosa · CUDA
Problem
Generating high-fidelity lip-sync video from an image or video plus an audio track is notoriously brittle. Common failures include chin cropping, soft mouth edges, and NumPy/OpenCV/librosa version breakage in the legacy Wav2Lip codebase, and most setups fail outright on Bangla audio.
Solution
Built a production-ready lip-sync tool optimised for Bangla but language-agnostic. It uses the wav2lip_gan checkpoint for sharper mouth edges and precise phoneme sync, with face-box padding of 0 10 0 0 (top, bottom, left, right) so that the extra 10 pixels below the detected face stop chin cropping. On first run it bootstraps itself by cloning the Wav2Lip repository and checking that the required models are present. It also ships production patches: automatic patching of the np.complex/np.float/np.int aliases, which were deprecated in NumPy 1.20 and removed in 1.24, so no NumPy downgrade is needed; the MJPG codec in place of DIVX to avoid video-writer failures across Windows and Linux; and fixes for deprecated librosa calls. Optional GFPGAN post-processing restores and upscales facial detail in the final output.
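As an illustration of the production patches described above, the following is a minimal sketch of one way to rewrite legacy source text before running it. It is not taken from the project: the function name patch_legacy_source is hypothetical, and rewriting source text is an assumption about the mechanism (re-adding the aliases at runtime would be another option). It replaces the removed NumPy aliases with the Python builtins they pointed at, and swaps a DIVX fourcc string literal for MJPG.

```python
import re

# Match np.float / np.int / np.complex (or the numpy.* spelling) as whole
# names only. The trailing \b keeps sized types such as np.float32,
# np.int64 and np.complex128 from matching.
_ALIAS_RE = re.compile(r"\b(?:np|numpy)\.(float|int|complex)\b")

# Match the fourcc literal 'DIVX' or "DIVX", remembering the quote style.
_DIVX_RE = re.compile(r"(['\"])DIVX\1")


def patch_legacy_source(text: str) -> str:
    """Return a patched copy of `text` (hypothetical helper, not project code).

    - np.float / np.int / np.complex become the builtins float / int / complex.
    - A 'DIVX' string literal becomes 'MJPG', keeping the original quotes.
    """
    text = _ALIAS_RE.sub(lambda m: m.group(1), text)
    return _DIVX_RE.sub(r"\1MJPG\1", text)


if __name__ == "__main__":
    legacy = (
        "a = np.float(x)\n"
        "b = np.zeros(3, dtype=np.int)\n"
        "c = np.float32(1)\n"
        "fourcc = cv2.VideoWriter_fourcc(*'DIVX')\n"
    )
    print(patch_legacy_source(legacy))
```

The sketch works on plain text, so it would also rewrite matches inside comments or string literals. That is usually harmless for this kind of patch, but it is worth checking the diff before overwriting any file.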
Impact
A reliable, reproducible lip-sync pipeline that runs in modern environments where the original Wav2Lip breaks. It is useful for dubbing, avatars and localised video content, and it uses GPU acceleration when available with a graceful fallback to CPU.
