Behind the Scenes
DuoSubs aligns two subtitle files using the following steps:
- Parse and Detect Language: Subtitles are loaded, and the dominant language of each file is automatically identified.
- Tokenization: Each subtitle line is tokenized by breaking it at:
  • punctuation marks, for space-separated languages.
  • punctuation marks and whitespace, for non-space-separated languages.
- Extract and filter non-overlapping subtitles (synced mode only): Subtitle segments that do not overlap in time are optionally extracted and retained for later combination.
- Estimate tokenized subtitle pairings using DTW: For segments with overlapping timestamps, the pairing is estimated using Dynamic Time Warping (DTW), based on semantic similarity between tokenized texts.
- Refine alignment using a sliding window approach: The initial alignment is adjusted using local neighborhood context in a sliding window with a size of 3. This step is important because the DTW-based pairing may result in duplicate secondary subtitles.
- Extract and filter extended subtitles from the primary track (cuts mode only): Extended subtitle spans are identified with HMM-denoised binary masks, then extracted using progressive similarity filtering.
- Refine alignment using a sliding window approach: The alignment is adjusted again in a sliding window with a size of 2 subtitle segments to achieve a better result.
- Combine aligned and non-overlapping subtitles or extended subtitles: The aligned overlapping segments are merged with the filtered non-overlapping ones to produce a coherent bilingual subtitle track.
- Eliminate unnecessary newlines: Unnecessary extra newlines are removed from the subtitle texts.
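
The tokenization and DTW pairing steps above can be illustrated with a minimal sketch. This is not DuoSubs's actual implementation: `split_on_punctuation` and `dtw_path` are hypothetical helpers, the punctuation set is an assumption, and the cost matrix is hand-written here, whereas in practice it would presumably be derived from the semantic similarity model (for example, 1 minus cosine similarity between token embeddings).

```python
import re


def split_on_punctuation(line):
    """Hypothetical tokenizer for space-separated languages.

    Breaks a subtitle line at common punctuation marks. The exact
    punctuation set used by DuoSubs is an assumption here.
    """
    parts = re.split(r"[.!?,;:]+", line)
    return [p.strip() for p in parts if p.strip()]


def dtw_path(cost):
    """Return the minimum-cost monotonic warping path through a cost matrix.

    cost[i][j] is the dissimilarity between primary token i and
    secondary token j (lower means more similar).
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] holds the cheapest cumulative cost of a path ending at
    # cell (i - 1, j - 1); row 0 and column 0 are padding.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance both sequences
                acc[i - 1][j],      # advance primary only
                acc[i][j - 1],      # advance secondary only
            )
    # Backtrack from the last cell to recover the pairing.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


tokens = split_on_punctuation("Hello, world! How are you?")
# Toy cost matrix: 2 primary tokens against 3 secondary tokens.
pairs = dtw_path([[0, 1, 1], [1, 0, 0]])
print(tokens)
print(pairs)
```

In this toy run, primary token 1 is paired with both secondary tokens 1 and 2. DTW allows one element to be matched several times, which is the kind of duplicate pairing the sliding-window refinement step is meant to resolve.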
Known Limitations
The accuracy of the merging process depends on the model selected.
Some models may produce unreliable results for unsupported or low-resource languages.
Some sentence fragments from the secondary subtitles may be misaligned with the primary subtitle lines due to the tokenization algorithm used.
Secondary subtitles might contain extra whitespace as a result of token-level merging.
In mixed and cuts modes, the algorithm may not work reliably, since matching lines have no timestamp overlap and either subtitle file could contain extra or missing lines.