Education 2026-03-26 6 min read

How shortube.pro Generates Word-Level Captions: The Technical Process

A look under the hood at how shortube.pro's caption generation pipeline works — from audio to animated words.

Caption generation sounds simple but involves a precise multi-step process. Here's exactly how shortube.pro turns spoken audio into animated word-level captions.

Step 1: Audio extraction

When a video is ingested, FFmpeg extracts the audio track as a WAV or MP3 file. This isolated audio is cleaner for transcription than the raw video file.
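As a rough sketch of this step, the snippet below builds and runs an FFmpeg command from Python. The post doesn't publish the exact flags, so the mono 16 kHz WAV settings and the function names are assumptions; they were chosen because Whisper resamples its input to 16 kHz mono anyway.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path: str, audio_path: str) -> list[str]:
    """Build an FFmpeg command that extracts a mono 16 kHz WAV track."""
    return [
        "ffmpeg",
        "-y",                  # overwrite the output file if it exists
        "-i", video_path,      # input video
        "-vn",                 # drop the video stream
        "-ac", "1",            # downmix to mono (assumed setting)
        "-ar", "16000",        # resample to 16 kHz (assumed setting)
        "-c:a", "pcm_s16le",   # 16-bit PCM, i.e. plain WAV
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str) -> Path:
    """Run FFmpeg and return the path of the extracted audio file."""
    subprocess.run(build_extract_command(video_path, audio_path), check=True)
    return Path(audio_path)
```

Calling `extract_audio("input.mp4", "audio.wav")` needs an `ffmpeg` binary on the PATH; `check=True` raises an exception if FFmpeg exits with an error.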

Step 2: Whisper transcription with timestamps

The extracted audio is sent to OpenAI Whisper — specifically the large-v3 model, which provides the best accuracy/speed balance for conversational content.

Whisper returns a transcript with two levels of timestamps:
- Segment level: "Here is the complete sentence." [0.00 → 5.32]
- Word level: "Here" [0.00 → 0.31] "is" [0.31 → 0.45] etc.

The word-level timestamps are what make animated captions possible.
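To make the two timestamp levels concrete, here is a minimal sketch that assumes the open-source `openai-whisper` Python package, where `transcribe(..., word_timestamps=True)` attaches a `words` list to each segment. A hosted API may return a different shape, and the helper names (`transcribe`, `flatten_words`) and the sample data are illustrative, not shortube.pro's actual code.

```python
from typing import TypedDict


class Word(TypedDict):
    text: str
    start: float  # seconds
    end: float    # seconds


def transcribe(audio_path: str) -> dict:
    """Transcribe with word timestamps (needs `pip install openai-whisper`)."""
    import whisper  # imported lazily so flatten_words works without it

    model = whisper.load_model("large-v3")
    return model.transcribe(audio_path, word_timestamps=True)


def flatten_words(result: dict) -> list[Word]:
    """Collect every word from every segment into one flat, ordered list."""
    words: list[Word] = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            # Whisper keeps the leading space on each word; strip it here.
            words.append(
                {"text": w["word"].strip(), "start": w["start"], "end": w["end"]}
            )
    return words


# Hand-written sample in the shape the open-source package returns.
sample = {
    "segments": [
        {
            "start": 0.0,
            "end": 5.32,
            "text": " Here is the complete sentence.",
            "words": [
                {"word": " Here", "start": 0.0, "end": 0.31},
                {"word": " is", "start": 0.31, "end": 0.45},
            ],
        }
    ]
}
words = flatten_words(sample)
```

The flat list is easier to work with downstream than nested segments, because the caption steps only care about one word at a time.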

Step 3: Caption formatting

Each word is formatted with:

  • Exact start and end timestamp (millisecond precision)
  • Font size, color and position in the 9:16 frame
  • Highlight color for the "active" word (karaoke effect)
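One way to picture the per-word record described above is a pair of small dataclasses. This is only a sketch: the class names, the 1080x1920 coordinates, and the default font size and colors are all assumed values for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionStyle:
    font_size: int = 72               # assumed default
    color: str = "#FFFFFF"            # assumed base text color
    highlight_color: str = "#FFD400"  # assumed "active" word color
    x: int = 540                      # horizontal centre of a 1080x1920 frame
    y: int = 1400                     # lower part of the frame


@dataclass(frozen=True)
class CaptionWord:
    text: str
    start_ms: int
    end_ms: int
    style: CaptionStyle


def format_words(words, style=CaptionStyle()):
    """Attach millisecond timestamps and a style to each transcribed word.

    `words` is a list of dicts with "text", "start" and "end" (in seconds).
    """
    return [
        CaptionWord(
            text=w["text"],
            start_ms=round(w["start"] * 1000),
            end_ms=round(w["end"] * 1000),
            style=style,
        )
        for w in words
    ]


captions = format_words(
    [
        {"text": "Here", "start": 0.0, "end": 0.31},
        {"text": "is", "start": 0.31, "end": 0.45},
    ]
)
```

Storing timestamps as integer milliseconds avoids floating-point drift when the values are later compared against frame times.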

Step 4: FFmpeg rendering

During the video render step, FFmpeg uses the word timestamps to overlay caption text on each frame of the video. The active word changes color on the exact frame it's spoken. The result is burned permanently into the video — no separate caption track, no metadata dependency.
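The post doesn't say which FFmpeg filter does the overlay. A common approach, assumed here, is to write an ASS subtitle file and burn it in with FFmpeg's `ass` filter (which needs an FFmpeg build with libass). The sketch emits one subtitle event per word: the full phrase, with only the active word recolored. The style values and function names are illustrative, and a real pipeline would also split long transcripts into short phrases.

```python
HEADER = """[Script Info]
ScriptType: v4.00+
PlayResX: 1080
PlayResY: 1920

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,72,&H00FFFFFF,&H000000FF,&H00000000,&H00000000,-1,0,0,0,100,100,0,0,1,4,0,2,40,40,400,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
"""


def ass_time(ms: int) -> str:
    """Format milliseconds as an ASS timestamp (H:MM:SS.cc, centiseconds)."""
    cs = ms // 10
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"


def karaoke_events(words, highlight="&H00D4FF&"):
    """One Dialogue line per word: the full phrase with that word recolored.

    `words` is a list of (text, start_ms, end_ms) tuples for a single phrase.
    `highlight` is an ASS color in blue-green-red order (assumed value).
    """
    lines = []
    for i, (_, start_ms, end_ms) in enumerate(words):
        # Hold the highlight until the next word starts so the text never blinks.
        until = words[i + 1][1] if i + 1 < len(words) else end_ms
        parts = [
            "{\\c" + highlight + "}" + w + "{\\r}" if j == i else w
            for j, (w, _, _) in enumerate(words)
        ]
        lines.append(
            f"Dialogue: 0,{ass_time(start_ms)},{ass_time(until)},Default,,0,0,0,,"
            + " ".join(parts)
        )
    return lines


def build_ass(words) -> str:
    """Return the full text of an ASS file for one phrase."""
    return HEADER + "\n".join(karaoke_events(words)) + "\n"


def build_burn_command(video_in: str, ass_path: str, video_out: str) -> list[str]:
    """FFmpeg command that burns the captions into the video frames."""
    return [
        "ffmpeg", "-y", "-i", video_in,
        "-vf", f"ass={ass_path}",  # simple file names only; paths may need escaping
        "-c:a", "copy",            # leave the audio stream untouched
        video_out,
    ]
```

One trade-off of this format: ASS timestamps are stored in centiseconds, so the millisecond values are truncated to 10 ms steps. That is still finer than a single frame at common frame rates.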

Accuracy

Whisper's word error rate on clear English speech is under 5%. On accented English, technical vocabulary or noisy audio, accuracy drops. For best caption quality:

  • Record in a quiet environment
  • Speak clearly and at a moderate pace
  • Avoid heavy background music during speaking portions
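For context on the word error rate quoted above: WER is the number of word substitutions, insertions and deletions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below is a generic implementation for checking your own recordings, not shortube.pro's evaluation code; it lowercases and splits on whitespace, which is a simplifying assumption (it doesn't strip punctuation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "here is the complete sentence",
    "here is a complete sentence",
)
```

Here one of the five reference words was transcribed incorrectly, so `wer` is 0.2, or 20%.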
