DEMO · MULTIMODAL LAYER

Give your agent eyes.

Contendeo is the multimodal layer that lets your AI actually see video — not just read its transcript.

// the asymmetry problem

Transcripts capture half the signal.

When a speaker says "look at this chart" and points at a number on screen, a transcript-only tool loses the number. Contendeo returns it.

▸ Reference clip — Nate Herk, "Claude Opus 4.7 Just Dropped" · 05:30–06:35
Transcript-only output (audio → text)
"The speaker discusses Claude Opus 4.7 benchmarks, mentioning it performs well on software engineering tasks and shows improvements over previous models. He talks about office-related evaluations and compares it to GPT models and earlier Claude releases. Overall the tone is positive about the release."
Contendeo deep_analyze output (audio + vision + OCR)

  "summary" "Opus 4.7 release benchmark walkthrough"
  "visual_assets" 
    
      "type" "benchmark_table"
      "timestamp" "05:41"
      "extracted_data" 
        "swe_bench_pro" 
          "opus_4_7" 64.3
          "opus_4_6" 53.4
          "gpt_5_4"   57.7
        
        "office_qa_pro" 
          "opus_4_7" 80.6
          "opus_4_6" 57.1
        
      
    
    
      "type" "bar_chart"
      "timestamp" "06:12"
      "chart_title" "SWE-bench Pro score comparison"
      "axis_labels" 
        "x" "model"
        "y" "% solved"
      
      "extracted_values" 
         "label" "Opus 4.7" "value" 64.3 
         "label" "GPT-5.4"  "value" 57.7 
         "label" "Opus 4.6" "value" 53.4 
      
    
  
Missed by transcript
  • SWE-bench Pro · Opus 4.7 = 64.3%
  • SWE-bench Pro · Opus 4.6 = 53.4%
  • SWE-bench Pro · GPT-5.4 = 57.7%
  • OfficeQA Pro · Opus 4.7 = 80.6%
  • OfficeQA Pro · Opus 4.6 = 57.1%

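Because the payload above is plain JSON, agent-side code can pull exact figures back out of it. A minimal sketch of a consumer (the helper name and the trimmed payload are illustrative, not part of Contendeo's API):

```python
import json

# A trimmed version of the deep_analyze payload shown above.
payload = json.loads("""
{
  "summary": "Opus 4.7 release benchmark walkthrough",
  "visual_assets": [
    {
      "type": "benchmark_table",
      "timestamp": "05:41",
      "extracted_data": {
        "swe_bench_pro": {"opus_4_7": 64.3, "opus_4_6": 53.4, "gpt_5_4": 57.7}
      }
    }
  ]
}
""")

def benchmark_scores(payload: dict, benchmark: str) -> dict:
    """Collect every model score reported for one benchmark across all tables."""
    scores = {}
    for asset in payload.get("visual_assets", []):
        if asset.get("type") == "benchmark_table":
            scores.update(asset.get("extracted_data", {}).get(benchmark, {}))
    return scores

print(benchmark_scores(payload, "swe_bench_pro"))
# → {'opus_4_7': 64.3, 'opus_4_6': 53.4, 'gpt_5_4': 57.7}
```

The same traversal works for `bar_chart` assets by reading `extracted_values` instead of `extracted_data`.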
// your agent deserves to see

Works on more than benchmark charts.

Any video where the information is split between what's said and what's shown.

Cost analysis from cooking videos

Price overlays + casual narration → full cost breakdown with per-ingredient line items and total.

UI state changes in product demos

Dashboard walkthrough → structured event timeline with exact values at each step.

Chart data from finance clips

Ticker chyrons + commentary → precise position + price extraction, timestamped to the frame.

// how it works

Six stages. One structured output.

Every video URL runs through the same pipeline. Parallelized where possible; results merged at the end.

yt-dlp
Fetch source from any supported URL
ffmpeg
Demux audio + sample keyframes
Groq Whisper
Timestamped transcript + speaker turns
Tesseract OCR
Text burned into frames
Claude Vision
Scene describe + chart extraction
Unified output
One JSON the LLM can reason over
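The six stages above can be sketched roughly as follows. Every function body here is a stub standing in for the real tool (yt-dlp, ffmpeg, Groq Whisper, Tesseract, Claude Vision); only the shape of the flow — independent audio and vision branches merged into one payload — reflects the description above:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stages; in the real pipeline these wrap yt-dlp, ffmpeg,
# Groq Whisper, Tesseract OCR, and Claude Vision respectively.
def fetch_source(url):  return {"video": f"<bytes of {url}>"}
def demux(video):       return {"audio": "<pcm>", "keyframes": ["f1", "f2"]}
def transcribe(audio):  return [{"t": "05:41", "text": "look at this chart"}]
def ocr(frames):        return [{"t": "05:41", "text": "64.3"}]
def describe(frames):   return [{"t": "05:41", "scene": "benchmark table"}]

def analyze(url: str) -> dict:
    media = demux(fetch_source(url)["video"])
    # The audio and vision branches are independent, so run them concurrently.
    with ThreadPoolExecutor() as pool:
        f_transcript = pool.submit(transcribe, media["audio"])
        f_ocr = pool.submit(ocr, media["keyframes"])
        f_scenes = pool.submit(describe, media["keyframes"])
    # Merge everything into the single JSON an LLM can reason over.
    return {
        "transcript": f_transcript.result(),
        "ocr": f_ocr.result(),
        "scenes": f_scenes.result(),
    }

result = analyze("https://youtube.com/watch?v=example")
print(sorted(result))  # → ['ocr', 'scenes', 'transcript']
```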
// pricing

Credit-based. No wasted spend.

Pay only for processing. Cache hits are free, failures are refunded automatically, unused credits roll over on paid plans.

Free
$0
10 credits on signup
  • All four tools
  • No card required
  • Cache hits always free
  • Auto-refund on failure
Get started
Power
$39
per month
  • 500 credits / month
  • Batch processing enabled
  • Webhooks for async jobs
  • Rollover up to 1,000 credits
Start Power
Pay as you go: $0.15 per credit. No subscription required. Top up anytime.
Credit cost per tool
quick_transcribe · 1 credit
deep_analyze · 5 credits
clip_context (quick / deep) · 1 / 3 credits
batch_analyze · per video (−10% at 5+)
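As a sanity check on the table, here is how a batch job's cost works out under those rules. One assumption not stated above: the per-video cost in a batch is taken to equal the underlying tool's cost.

```python
# Fixed credit costs from the table above.
TOOL_COST = {
    "quick_transcribe": 1,
    "deep_analyze": 5,
    "clip_context_quick": 1,
    "clip_context_deep": 3,
}

def batch_cost(n_videos: int, tool: str = "deep_analyze") -> float:
    """Per-video pricing, with the 10% batch discount kicking in at 5+ videos."""
    per_video = TOOL_COST[tool]
    discount = 0.9 if n_videos >= 5 else 1.0
    return n_videos * per_video * discount

print(batch_cost(4))   # → 20.0  (4 × 5 credits, no discount)
print(batch_cost(10))  # → 45.0  (10 × 5 credits, 10% off)
```

At $0.15 per pay-as-you-go credit, that 10-video deep batch would run 45 × $0.15 = $6.75.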
// faq

Questions, answered.

How is this different from a YouTube transcript MCP?

Transcript MCPs give you a timestamped text track from the audio and stop there. Contendeo runs keyframe extraction in parallel and pipes each scene through Claude Vision plus Tesseract OCR, so numbers burned into slides, chart axis values, code on-screen, and UI labels all end up in the structured output. If the speaker says "look at this number" and points at a chart, a transcript MCP loses the number. Contendeo returns it.

Can your LLM already watch videos — why do I need Contendeo?

Claude's native multimodal input accepts static images, not streaming video frames. When you paste a YouTube URL into Claude, it fetches the page transcript — it doesn't decode or watch the video. Contendeo runs yt-dlp, ffmpeg keyframe extraction, and vision analysis on your behalf and returns a structured payload Claude can reason over. You pay credits per run; otherwise Claude never saw the frames.

What video platforms does Contendeo support?

YouTube, Instagram Reels, Vimeo, Twitter/X, TikTok, and direct MP4 or webm URLs. yt-dlp handles roughly a thousand sites — the listed ones are what we explicitly test and harden against bot-detection upstream (proxy pool plus PO-token provider for YouTube SABR). Runtime ceiling is two hours per video.

How is pricing calculated? What counts as a credit?

Each tool has a fixed credit cost: quick_transcribe 1, deep_analyze 5, clip_context 1 (quick) or 3 (deep), batch_analyze per-video with 10% off at 5+ videos. Credits are deducted atomically at job start and auto-refunded on any failure — download error, API timeout, invalid URL. Cache hits return free: identical URL plus same tool plus same params within the 7-day cache window costs zero credits.

Is my video data stored anywhere?

No. Videos are fetched to a tmpfs, decoded in memory, and purged the moment the tool returns. We cache the analysis result for 7 days keyed by a hash of URL, tool, and params — but never the raw video or audio. The cache is what makes cache hits free; it is not a video library.
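That cache key can be sketched as a stable hash over the canonical request. The exact hashing scheme below is an assumption for illustration, not Contendeo's documented internals:

```python
import hashlib
import json

def cache_key(url: str, tool: str, params: dict) -> str:
    """Stable key: the same URL + tool + params always hash identically."""
    canonical = json.dumps(
        {"url": url, "tool": tool, "params": params},
        sort_keys=True,            # param order must not change the key
        separators=(",", ":"),     # no whitespace variance
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

a = cache_key("https://youtu.be/x", "deep_analyze", {"depth": "full", "lang": "en"})
b = cache_key("https://youtu.be/x", "deep_analyze", {"lang": "en", "depth": "full"})
print(a == b)  # → True: a reordered-params request is still a cache hit
```

Sorting keys before hashing is what makes two logically identical requests collide on the same cache entry within the 7-day window.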

How do I install Contendeo in Claude?

In claude.ai or the desktop app: + Connectors → Add custom connector → paste https://contendeo.app/mcp → Add → sign in with Google. Any Claude plan with custom-connector access works (Pro, Max, Team, Enterprise). Free plans support one custom connector — spend yours here.