Education 2026-03-26 6 min read

How shortube.pro Generates Word-Level Captions: The Technical Process

A look under the hood at how shortube.pro's caption generation pipeline works — from audio to animated words.

Caption generation sounds simple but involves a precise multi-step process. Here's exactly how shortube.pro turns spoken audio into animated word-level captions.

Step 1: Audio extraction

When a video is ingested, FFmpeg extracts the audio track as a WAV or MP3 file. This isolated audio is cleaner for transcription than the raw video file.
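As a rough illustration of this step, the sketch below builds an FFmpeg argument list that strips the video stream and writes a mono 16 kHz WAV file. The function name, file names and the choice of mono 16 kHz PCM are assumptions made for the example, not shortube.pro's confirmed settings.

```python
import subprocess


def build_extract_command(video_path: str, audio_path: str,
                          sample_rate: int = 16000) -> list[str]:
    """Return an FFmpeg argument list that extracts the audio track as WAV.

    Hypothetical helper: the exact flags the service uses are not documented.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", video_path,          # input video
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to mono
        "-ar", str(sample_rate),   # resample (Whisper works on 16 kHz audio)
        "-c:a", "pcm_s16le",       # 16-bit PCM, i.e. a plain WAV payload
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run the extraction; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(build_extract_command(video_path, audio_path), check=True)
```

Keeping the command builder separate from the `subprocess` call makes the flags easy to inspect and test without FFmpeg installed.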

Step 2: Whisper transcription with timestamps

The extracted audio is sent to OpenAI Whisper, specifically the large-v3 model, which shortube.pro uses for its accuracy on conversational content.

Whisper returns a transcript with two levels of timestamps:
- Segment level: "Here is the complete sentence." [0.00 → 5.32]
- Word level: "Here" [0.00 → 0.31] "is" [0.31 → 0.45] etc.

The word-level timestamps are what enable animated captions.
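To make the two timestamp levels concrete, here is a minimal sketch that flattens a Whisper result into a list of words. It assumes the result shape produced by the open-source `openai-whisper` package when `word_timestamps=True` is set (segments containing a `words` list). A hosted API may return a different shape, and `flatten_words` and `transcribe_words` are hypothetical helper names.

```python
def flatten_words(result: dict) -> list[dict]:
    """Collect every word with its start/end time (in seconds)."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append({
                "text": w["word"].strip(),  # Whisper words carry a leading space
                "start": float(w["start"]),
                "end": float(w["end"]),
            })
    return words


def transcribe_words(audio_path: str, model_name: str = "large-v3") -> list[dict]:
    """Transcribe a file locally; assumes `pip install openai-whisper`."""
    import whisper  # imported lazily so flatten_words works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, word_timestamps=True)
    return flatten_words(result)


# A hand-written result using the timings from the example above.
sample = {
    "segments": [{
        "start": 0.0, "end": 5.32, "text": " Here is the complete sentence.",
        "words": [
            {"word": " Here", "start": 0.0, "end": 0.31},
            {"word": " is", "start": 0.31, "end": 0.45},
        ],
    }]
}
print(flatten_words(sample))
```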

Step 3: Caption formatting

Each word is formatted with:
- Exact start and end timestamp (millisecond precision)
- Font size, color and position in the 9:16 frame
- Highlight color for the "active" word (karaoke effect)
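One common way to express this formatting is an ASS subtitle file, where each event shows the caption line with a single word recolored. The sketch below takes that approach as an assumption. The article does not say which caption format shortube.pro uses internally, and the `Word` class, `ass_time` and `karaoke_events` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def ass_time(seconds: float) -> str:
    """Format seconds as H:MM:SS.cc (ASS timestamps use centiseconds)."""
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360000)
    minutes, rem = divmod(rem, 6000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def karaoke_events(words: list[Word],
                   base: str = "&HFFFFFF&",      # white (ASS colors are BGR)
                   active: str = "&H00FFFF&") -> list[str]:  # yellow
    """Emit one Dialogue event per word, highlighting that word in the line."""
    events = []
    for i, word in enumerate(words):
        parts = []
        for j, other in enumerate(words):
            if i == j:
                # Switch to the highlight color, then back to the base color.
                parts.append(f"{{\\c{active}}}{other.text}{{\\c{base}}}")
            else:
                parts.append(other.text)
        # Hold the highlight until the next word starts so the line never blinks.
        end = words[i + 1].start if i + 1 < len(words) else word.end
        events.append(
            f"Dialogue: 0,{ass_time(word.start)},{ass_time(end)},"
            f"Default,,0,0,0,,{' '.join(parts)}"
        )
    return events
```

A real pipeline would also split long sentences into short lines of a few words each. This sketch treats the whole word list as one caption line.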

Step 4: FFmpeg rendering

During the video render step, FFmpeg uses the word timestamps to overlay caption text on each frame of the video. The active word changes color on the exact frame it's spoken. The result is burned permanently into the video — no separate caption track, no metadata dependency.
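For illustration, burning captions in with FFmpeg might look like the sketch below. It assumes the captions were first written to an ASS subtitle file and that the FFmpeg build includes libass (needed for the `ass` filter). The article does not state which filter shortube.pro uses, and `build_burn_command` is a hypothetical helper.

```python
import subprocess


def build_burn_command(video_path: str, ass_path: str,
                       output_path: str) -> list[str]:
    """Return an FFmpeg argument list that burns ASS captions into the video."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # source video
        "-vf", f"ass={ass_path}", # draw the captions onto every frame
        "-c:v", "libx264",        # burning in text requires re-encoding video
        "-c:a", "copy",           # the audio stream is left untouched
        output_path,
    ]


def burn_captions(video_path: str, ass_path: str, output_path: str) -> None:
    """Run the render; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(build_burn_command(video_path, ass_path, output_path),
                   check=True)
```

Paths containing characters such as `:` or `'` need escaping inside the filter string, which this sketch does not handle.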

Accuracy

Whisper's word error rate on clear English speech is under 5%. On accented English, technical vocabulary or noisy audio, accuracy drops. For best caption quality:
- Record in a quiet environment
- Speak clearly and at a moderate pace
- Avoid heavy background music during speaking portions
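Word error rate is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the transcript into a reference, divided by the number of reference words. The sketch below computes it with a word-level edit distance. It shows the standard definition and is not necessarily how the figure above was measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words gives a WER of 0.2 (20%).
print(word_error_rate("here is the complete sentence",
                      "here is a complete sentence"))
```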

