
Pipeline Stages

VidPipe executes a 15-stage pipeline for each video. Each stage is wrapped in runStage(), which catches errors, so a failure in one stage does not abort the pipeline; subsequent stages proceed with whatever data is available. (The one exception is Ingestion; see Error Handling below.)

Stage Overview

 1. Ingestion: Copies video into the repo structure, extracts metadata with FFprobe
 2. Transcription: Extracts audio and runs OpenAI Whisper for word-level transcription
 3. Silence Removal: AI detects and removes dead-air segments, capped at 20% of video
 4. Captions: Generates SRT, VTT, and ASS subtitle files with karaoke highlighting
 5. Caption Burn: Burns ASS captions into the video via FFmpeg
 6. Shorts: AI identifies best 15–60s moments and extracts clip variants
 7. Medium Clips: AI identifies 1–3 min standalone segments with crossfade transitions
 8. Chapters: AI detects topic boundaries and outputs chapter markers
 9. Summary: AI writes a Markdown README with key-frame screenshots
10. Social Media: Generates platform-tailored posts for 5 platforms
11. Short Posts: Generates per-short social media posts for all 5 platforms
12. Medium Clip Posts: Generates per-medium-clip social media posts for all 5 platforms
13. Queue Build: Copies posts and video variants into publish-queue/ for review
14. Blog: AI writes a dev.to-style blog post with web-sourced links
15. Git Push: Auto-commits and pushes all generated assets

Data Flow

Two transcripts flow through the pipeline:

  • Adjusted transcript — timestamps shifted to match the silence-removed video. Used by captions (stages 4–5) so subtitles align with the edited video.
  • Original transcript — unmodified Whisper output. Used by shorts, medium clips, and chapters (stages 6–8) because clips are cut from the original video.

Shorts and chapters are generated before the summary so the README can reference them.
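To make the timestamp adjustment concrete, here is a minimal sketch of how word timings can be shifted after silence removal. The Word shape and the cut-list format are illustrative assumptions, not VidPipe's actual types:

```typescript
// Sketch: shift word timestamps after removing silent regions.
// `Word` and `removedRegions` are illustrative shapes, not VidPipe's real types.
interface Word {
  text: string;
  start: number; // seconds in the original video
  end: number;
}

// Returns words re-timed to the silence-removed video; words inside removed regions are dropped.
function adjustTimestamps(words: Word[], removedRegions: { start: number; end: number }[]): Word[] {
  return words
    .filter((w) => !removedRegions.some((r) => w.start >= r.start && w.end <= r.end))
    .map((w) => {
      // Total removed time before this word determines how far it shifts left.
      const shift = removedRegions
        .filter((r) => r.end <= w.start)
        .reduce((sum, r) => sum + (r.end - r.start), 0);
      return { ...w, start: w.start - shift, end: w.end - shift };
    });
}
```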

Stage Details

1. Ingestion

Copies the video file into the repo's recordings/{slug}/ directory and extracts metadata using FFprobe.

  • Input: Path to a .mp4 video file
  • Output: VideoFile object with slug, duration, size, repoPath, videoDir
  • Tools: FFprobe (duration, resolution, file size)
  • Skip flag: none (required; the pipeline aborts if ingestion fails)
  • Enum: PipelineStage.Ingestion
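For illustration, an FFprobe invocation along these lines would yield the metadata this stage records. The flags are standard ffprobe options; the result shape is an assumption, not VidPipe's exact code:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Ask ffprobe for container- and stream-level metadata as JSON.
async function probeVideo(path: string) {
  const { stdout } = await run("ffprobe", [
    "-v", "error",
    "-show_entries", "format=duration,size",
    "-show_entries", "stream=width,height",
    "-of", "json",
    path,
  ]);
  const info = JSON.parse(stdout);
  return {
    duration: Number(info.format.duration), // seconds
    size: Number(info.format.size),         // bytes
    width: info.streams?.[0]?.width,
    height: info.streams?.[0]?.height,
  };
}
```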

2. Transcription

Extracts audio from the video as a 64 kbps mono MP3, then sends it to OpenAI Whisper for transcription. Files larger than 25 MB are automatically chunked and results are merged.

  • Input: VideoFile (ingested video)
  • Output: transcript.json with segments, words (start/end timestamps), detected language, and duration
  • Tools: FFmpeg (audio extraction), OpenAI Whisper API (whisper-1)
  • Skip flag: none
  • Enum: PipelineStage.Transcription
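A sketch of what the extraction and Whisper call could look like with the OpenAI Node SDK. The 25 MB chunk-and-merge step is omitted, and the exact FFmpeg flags beyond "64 kbps mono MP3" are assumptions:

```typescript
import fs from "node:fs";
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import OpenAI from "openai";

const run = promisify(execFile);
const openai = new OpenAI();

// Extract a 64 kbps mono MP3, then request word-level timestamps from Whisper.
async function transcribe(videoPath: string, audioPath: string) {
  await run("ffmpeg", ["-y", "-i", videoPath, "-vn", "-ac", "1", "-b:a", "64k", audioPath]);
  return openai.audio.transcriptions.create({
    file: fs.createReadStream(audioPath),
    model: "whisper-1",
    response_format: "verbose_json",
    timestamp_granularities: ["word", "segment"],
  });
}
```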

3. Silence Removal

Detects silence regions in the audio using FFmpeg's silencedetect filter. An AI agent then decides which regions to remove, capping total removal at 20% of the video duration. The video is trimmed using singlePassEdit().

  • Input: VideoFile, Transcript
  • Output: {slug}-edited.mp4, transcript-edited.json (adjusted timestamps)
  • Agent: SilenceRemovalAgent (tools: detect_silence, decide_removals)
  • Tools: FFmpeg (silencedetect filter, segment-based trim)
  • Skip flag: --no-silence-removal
  • Enum: PipelineStage.SilenceRemoval
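A sketch of running silencedetect and parsing its output; the -30 dB threshold and 0.5 s minimum duration are illustrative values, and the agent's 20% cap decision is not shown:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Run silencedetect and pair up silence_start / silence_end markers from stderr.
async function detectSilence(videoPath: string) {
  const { stderr } = await run("ffmpeg", [
    "-i", videoPath,
    "-af", "silencedetect=noise=-30dB:d=0.5",
    "-f", "null", "-",
  ]);
  const starts = [...stderr.matchAll(/silence_start: ([\d.]+)/g)].map((m) => Number(m[1]));
  const ends = [...stderr.matchAll(/silence_end: ([\d.]+)/g)].map((m) => Number(m[1]));
  return starts.map((start, i) => ({ start, end: ends[i] }));
}
```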

4. Captions

Generates subtitle files from the transcript. Uses the adjusted transcript (post silence-removal) when available, otherwise the original. No AI agent is needed — this is a direct format conversion.

  • Input: Adjusted or original Transcript
  • Output: captions/captions.srt, captions/captions.vtt, captions/captions.ass
  • Formats: SRT (SubRip), VTT (WebVTT), ASS (Advanced SubStation Alpha with karaoke word highlighting)
  • Skip flag: --no-captions
  • Enum: PipelineStage.Captions
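As an example of the direct format conversion, a minimal SRT writer over transcript segments (the Segment shape is an assumption):

```typescript
// Minimal SRT generation from transcript segments; the Segment shape is assumed.
interface Segment { start: number; end: number; text: string }

// 12.345 s -> "00:00:12,345" (SRT uses a comma before the milliseconds).
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3_600_000)).padStart(2, "0");
  const m = String(Math.floor((ms % 3_600_000) / 60_000)).padStart(2, "0");
  const s = String(Math.floor((ms % 60_000) / 1000)).padStart(2, "0");
  return `${h}:${m}:${s},${String(ms % 1000).padStart(3, "0")}`;
}

function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}
```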

5. Caption Burn

Burns the ASS subtitle file into the video using FFmpeg. When silence was also removed, uses singlePassEditAndCaption() to combine silence removal and caption burning in a single re-encode pass from the original video. Otherwise, uses burnCaptions() standalone.

  • Input: ASS caption file, edited or original video, keep-segments (if silence was removed)
  • Output: {slug}-captioned.mp4
  • Tools: FFmpeg (ass subtitle filter)
  • Skip flag: --no-captions
  • Enum: PipelineStage.CaptionBurn
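A standalone burn along the lines of burnCaptions() could look like this; the codec and quality flags are illustrative, not VidPipe's exact settings:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Re-encode the video with the ASS subtitles rendered into the frames; audio is copied untouched.
async function burnAssCaptions(videoIn: string, assPath: string, videoOut: string) {
  await run("ffmpeg", [
    "-y", "-i", videoIn,
    "-vf", `ass=${assPath}`,
    "-c:v", "libx264", "-crf", "18",
    "-c:a", "copy",
    videoOut,
  ]);
}
```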

6. Shorts

An AI agent analyzes the original transcript to identify the best 15–60 second moments. Clips can be single segments or composites (multiple non-contiguous segments concatenated). Each short is extracted and then rendered in platform-specific variants.

  • Input: VideoFile, original Transcript
  • Output (per short): {slug}.mp4 (landscape), -portrait.mp4 (9:16), -square.mp4 (1:1), -feed.mp4 (4:5), -captioned.mp4, -portrait-captioned.mp4, {slug}.md
  • Agent: ShortsAgent (tool: plan_shorts)
  • Tools: FFmpeg (segment extraction, aspect-ratio variants, caption burning, portrait hook overlay)
  • Skip flag: --no-shorts
  • Enum: PipelineStage.Shorts
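The aspect-ratio variants imply center-crop filters roughly like the following. The crop/scale expressions are standard FFmpeg, but the output resolutions and argument layout are assumptions:

```typescript
// Center-crop filters for the aspect-ratio variants; output resolutions are assumptions.
const variantFilters: Record<string, string> = {
  portrait: "crop=ih*9/16:ih,scale=1080:1920", // 9:16
  square: "crop=ih:ih,scale=1080:1080",        // 1:1
  feed: "crop=ih*4/5:ih,scale=1080:1350",      // 4:5
};

// Example: extract a 15-60 s moment and render the portrait variant.
// Args: seek to the clip start, limit duration, apply crop/scale, re-encode.
const portraitArgs = (src: string, start: number, duration: number, out: string) => [
  "-y", "-ss", String(start), "-t", String(duration),
  "-i", src, "-vf", variantFilters.portrait, out,
];
```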

7. Medium Clips

An AI agent identifies 1–3 minute standalone segments from the original transcript. Composite clips use crossfade (xfade) transitions between segments. Captions are burned with medium style (smaller, bottom-positioned).

  • Input: VideoFile, original Transcript
  • Output (per clip): {slug}.mp4, {slug}-captioned.mp4, {slug}.md
  • Agent: MediumVideoAgent (tool: plan_medium_clips)
  • Tools: FFmpeg (segment extraction, xfade transitions, caption burning)
  • Skip flag: --no-medium-clips
  • Enum: PipelineStage.MediumClips
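A sketch of joining two pre-cut segments with a crossfade; the 0.5 s fade and the offset arithmetic are illustrative, not VidPipe's exact parameters:

```typescript
// Crossfade two pre-cut segments with xfade. The fade starts 0.5 s before the
// first segment ends; audio is blended with acrossfade.
function xfadeArgs(segA: string, segB: string, segADuration: number, out: string): string[] {
  const fade = 0.5;
  return [
    "-y", "-i", segA, "-i", segB,
    "-filter_complex",
    `[0:v][1:v]xfade=transition=fade:duration=${fade}:offset=${segADuration - fade}[v];` +
      `[0:a][1:a]acrossfade=d=${fade}[a]`,
    "-map", "[v]", "-map", "[a]",
    out,
  ];
}
```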

8. Chapters

An AI agent analyzes the original transcript to detect topic boundaries, producing chapter markers in four formats.

  • Input: VideoFile, original Transcript
  • Output: chapters/chapters.json, chapters/chapters.md, chapters/chapters.ffmetadata, chapters/chapters-youtube.txt
  • Agent: ChapterAgent (tool: generate_chapters)
  • Formats: JSON (structured data), Markdown (table), FFmpeg metadata, YouTube description timestamps
  • Skip flag: none
  • Enum: PipelineStage.Chapters
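As an example of one output format, a sketch that turns chapter markers into YouTube description timestamps. The Chapter shape is an assumption; YouTube expects the first entry at 0:00:

```typescript
// Convert chapter markers to YouTube description timestamps ("0:00 Intro" lines).
interface Chapter { start: number; title: string }

function toYouTubeChapters(chapters: Chapter[]): string {
  return chapters
    .map((c) => {
      const m = Math.floor(c.start / 60);
      const s = String(Math.floor(c.start % 60)).padStart(2, "0");
      return `${m}:${s} ${c.title}`;
    })
    .join("\n");
}
```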

9. Summary

An AI agent captures key frames from the video and writes a narrative README.md with brand voice. Runs after shorts and chapters so it can reference them in the summary.

  • Input: VideoFile, Transcript, ShortClip[], Chapter[]
  • Output: README.md (with embedded screenshots), key-frame images
  • Agent: SummaryAgent (tools: capture_frame, write_summary)
  • Skip flag: none
  • Enum: PipelineStage.Summary
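The capture_frame tool presumably wraps a single-frame FFmpeg grab along these lines (the JPEG quality flag is illustrative):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Grab one frame at the given timestamp as a JPEG key-frame screenshot.
async function captureFrame(videoPath: string, atSeconds: number, outPath: string) {
  await run("ffmpeg", [
    "-y", "-ss", String(atSeconds), "-i", videoPath,
    "-frames:v", "1", "-q:v", "2",
    outPath,
  ]);
}
```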

10. Social Media

An AI agent generates platform-specific posts for the full video across 5 platforms: TikTok, YouTube, Instagram, LinkedIn, and X. Uses Exa web search to find relevant links.

  • Input: VideoFile, Transcript, VideoSummary
  • Output: social-posts/tiktok.md, youtube.md, instagram.md, linkedin.md, x.md
  • Agent: SocialMediaAgent (tools: search_links, create_posts)
  • Platforms (character limits): TikTok (2200), YouTube (5000), Instagram (2200), LinkedIn (3000), X (280)
  • Skip flag: --no-social
  • Enum: PipelineStage.SocialMedia
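The character limits from the table, expressed as a config; the truncation helper is an assumption, not VidPipe's actual enforcement logic:

```typescript
// Platform character limits from the table above.
const CHAR_LIMITS = {
  tiktok: 2200,
  youtube: 5000,
  instagram: 2200,
  linkedin: 3000,
  x: 280,
} as const;

// Illustrative helper: trim a generated post to fit its platform.
function fitToPlatform(post: string, platform: keyof typeof CHAR_LIMITS): string {
  const limit = CHAR_LIMITS[platform];
  return post.length <= limit ? post : post.slice(0, limit - 1) + "…";
}
```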

11. Short Posts

For each short clip, generates per-platform social media posts. Posts are saved alongside the short clip.

  • Input: VideoFile, ShortClip, Transcript
  • Output: shorts/{slug}/posts/{platform}.md for each platform
  • Agent: ShortPostsAgent (reuses SocialMediaAgent logic)
  • Skip flag: --no-social
  • Enum: PipelineStage.ShortPosts

12. Medium Clip Posts

For each medium clip, generates per-platform social media posts. Posts are saved alongside the medium clip.

  • Input: VideoFile, MediumClip, Transcript
  • Output: medium-clips/{slug}/posts/{platform}.md for each platform
  • Agent: MediumClipPostsAgent (reuses SocialMediaAgent logic)
  • Skip flag: --no-social
  • Enum: PipelineStage.MediumClipPosts

13. Queue Build

Copies social media posts and video variants into a flat publish-queue/ folder for review and scheduling before publishing. Only runs when social posts were generated.

  • Input: VideoFile, ShortClip[], MediumClip[], SocialPost[], captioned video path
  • Output: publish-queue/ directory with flattened posts and video files
  • Skip flag: --no-social-publish
  • Enum: PipelineStage.QueueBuild

14. Blog

An AI agent writes a dev.to-style blog post (800–1500 words) with YAML frontmatter. Uses Exa web search to find relevant links to include.

  • Input: VideoFile, Transcript, VideoSummary
  • Output: social-posts/devto.md
  • Agent: BlogAgent (tools: search_web, write_blog)
  • Skip flag: none
  • Enum: PipelineStage.Blog
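A sketch of the dev.to-style frontmatter such a post opens with; the field set and values are illustrative, not VidPipe's generated output:

```typescript
// Build illustrative dev.to-style YAML frontmatter for the blog post.
// The chosen fields (title, published, tags) are assumptions about what the agent emits.
const frontmatter = (title: string, tags: string[]) => `---
title: ${title}
published: false
tags: ${tags.join(", ")}
---
`;

// e.g. frontmatter("Trimming Dead Air with FFmpeg", ["ffmpeg", "typescript", "ai"])
```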

15. Git Push

Runs git add -A, git commit, and git push for all generated assets in the recording folder.

  • Input: slug (recording folder name)
  • Output: Git commit pushed to origin main
  • Skip flag: --no-git
  • Enum: PipelineStage.GitPush

Error Handling

Each stage is wrapped in runStage() which:

  1. Records the current stage for cost tracking
  2. Executes the stage function in a try/catch
  3. Logs success or failure with wall-clock duration
  4. Pushes a StageResult record (success, error message, duration in ms)
  5. Returns undefined on failure so callers can null-check

This design produces partial results — if shorts generation fails, the summary and social posts can still be generated from the transcript. The only exception is Ingestion (stage 1), which aborts the pipeline if it fails since all subsequent stages depend on video metadata.
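A minimal sketch of that wrapper, with names and types inferred from this page rather than taken from VidPipe's source:

```typescript
interface StageResult {
  stage: string;
  success: boolean;
  error?: string;
  durationMs: number;
}

const stageResults: StageResult[] = [];

// Assumed cost-tracking hook; stands in for whatever VidPipe actually uses.
const costTracker = { setCurrentStage: (_stage: string) => {} };

// Sketch of runStage(): track the stage, time it, record the outcome,
// and return undefined on failure so callers can null-check.
async function runStage<T>(stage: string, fn: () => Promise<T>): Promise<T | undefined> {
  costTracker.setCurrentStage(stage); // 1. record the current stage for cost tracking
  const startedAt = Date.now();
  try {
    const result = await fn(); // 2. execute the stage function
    const durationMs = Date.now() - startedAt;
    console.log(`✔ ${stage} (${durationMs} ms)`); // 3. log success with wall-clock duration
    stageResults.push({ stage, success: true, durationMs }); // 4. push a StageResult record
    return result;
  } catch (err) {
    const durationMs = Date.now() - startedAt;
    const message = (err as Error).message;
    console.error(`✖ ${stage} failed after ${durationMs} ms: ${message}`);
    stageResults.push({ stage, success: false, error: message, durationMs });
    return undefined; // 5. undefined on failure so downstream stages can null-check
  }
}
```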