Contendeo is the multimodal layer that lets your AI actually see video — not just read its transcript.
When a speaker says "look at this chart" and points at a number on screen, a transcript-only tool loses the number. Contendeo returns it.
```json
{
  "summary": "Opus 4.7 release benchmark walkthrough",
  "visual_assets": [
    {
      "type": "benchmark_table",
      "timestamp": "05:41",
      "extracted_data": {
        "swe_bench_pro": { "opus_4_7": 64.3, "opus_4_6": 53.4, "gpt_5_4": 57.7 },
        "office_qa_pro": { "opus_4_7": 80.6, "opus_4_6": 57.1 }
      }
    },
    {
      "type": "bar_chart",
      "timestamp": "06:12",
      "chart_title": "SWE-bench Pro score comparison",
      "axis_labels": { "x": "model", "y": "% solved" },
      "extracted_values": [
        { "label": "Opus 4.7", "value": 64.3 },
        { "label": "GPT-5.4", "value": 57.7 },
        { "label": "Opus 4.6", "value": 53.4 }
      ]
    }
  ]
}
```
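The payload is directly machine-readable. A minimal sketch in plain Python (no SDK assumed) that pulls the headline SWE-bench Pro numbers out of a response shaped like the example above:

```python
import json

# A response shaped like the example above, abbreviated to the fields we read.
raw = """
{
  "summary": "Opus 4.7 release benchmark walkthrough",
  "visual_assets": [
    {
      "type": "benchmark_table",
      "timestamp": "05:41",
      "extracted_data": {
        "swe_bench_pro": {"opus_4_7": 64.3, "opus_4_6": 53.4, "gpt_5_4": 57.7}
      }
    }
  ]
}
"""

payload = json.loads(raw)

# Index assets by type for quick lookup, then read the benchmark scores.
assets = {a["type"]: a for a in payload["visual_assets"]}
scores = assets["benchmark_table"]["extracted_data"]["swe_bench_pro"]

best = max(scores, key=scores.get)
print(best, scores[best])  # opus_4_7 64.3
```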
Any video where the information is split between what's said and what's shown.
Price overlays + casual narration → full cost breakdown with per-ingredient line items and total.
Dashboard walkthrough → structured event timeline with exact values at each step.
Ticker chyrons + commentary → precise position + price extraction, timestamped to the frame.
Every video URL runs through the same pipeline. Parallelized where possible; results merged at the end.
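The fan-out/merge shape can be sketched with Python's stdlib. The stage names here are illustrative, not Contendeo's internals:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stages: each returns a partial result for one video URL.
def transcribe(url):
    return {"transcript": f"transcript of {url}"}

def extract_visuals(url):
    return {"visual_assets": [f"keyframes of {url}"]}

def analyze(url):
    """Run independent stages in parallel, then merge into one payload."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(stage, url) for stage in (transcribe, extract_visuals)]
        merged = {"url": url}
        for f in futures:
            merged.update(f.result())
    return merged

result = analyze("https://example.com/video.mp4")
```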
Pay only for processing. Cache hits are free, failures are refunded automatically, unused credits roll over on paid plans.
Transcript MCPs give you a timestamped text track from the audio and stop there. Contendeo runs keyframe extraction in parallel and pipes each scene through Claude Vision plus Tesseract OCR, so numbers burned into slides, chart axis values, code on-screen, and UI labels all end up in the structured output. If the speaker says "look at this number" and points at a chart, a transcript MCP loses the number. Contendeo returns it.
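The keyframe-plus-OCR leg can be approximated with off-the-shelf tools. A sketch assuming ffmpeg and Tesseract are installed; the scene threshold and paths are illustrative, not Contendeo's actual settings:

```python
import subprocess

def keyframe_cmd(video_path, out_pattern, scene_threshold=0.3):
    """Build an ffmpeg command that emits one PNG per detected scene change."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"select='gt(scene,{scene_threshold})'",
        "-vsync", "vfr", out_pattern,
    ]

def ocr_frame(png_path):
    """Run Tesseract on one frame; returns the recognized text."""
    return subprocess.run(
        ["tesseract", png_path, "stdout"],
        capture_output=True, text=True, check=True,
    ).stdout

cmd = keyframe_cmd("talk.mp4", "frames/%04d.png")
# subprocess.run(cmd, check=True)  # then ocr_frame() on each emitted PNG
```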
Claude's native multimodal input accepts static images, not streaming video frames. When you paste a YouTube URL into Claude, it fetches the page transcript; it doesn't decode or watch the video. Contendeo runs yt-dlp, ffmpeg keyframe extraction, and vision analysis on your behalf and returns a structured payload Claude can reason over. You pay credits per run; without that run, Claude never sees the frames.
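The fetch step looks roughly like this (command construction only; the flags are standard yt-dlp options, and the tmpfs path and filter are assumptions for illustration, not Contendeo's configuration):

```python
MAX_DURATION_S = 2 * 60 * 60  # two-hour runtime ceiling

def fetch_cmd(url, workdir="/dev/shm/contendeo"):
    """Build a yt-dlp invocation that downloads one video into a tmpfs workdir,
    skipping anything over the duration ceiling."""
    return [
        "yt-dlp", "--no-playlist",
        "--match-filter", f"duration <= {MAX_DURATION_S}",
        "-o", f"{workdir}/%(id)s.%(ext)s",
        url,
    ]

cmd = fetch_cmd("https://example.com/watch?v=abc123")
```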
YouTube, Instagram Reels, Vimeo, Twitter/X, TikTok, and direct MP4 or webm URLs. yt-dlp handles roughly a thousand sites — the listed ones are what we explicitly test and harden against bot-detection upstream (proxy pool plus PO-token provider for YouTube SABR). Runtime ceiling is two hours per video.
Each tool has a fixed credit cost: quick_transcribe is 1, deep_analyze is 5, clip_context is 1 (quick) or 3 (deep), and batch_analyze is priced per video with 10% off at 5+ videos. Credits are deducted atomically at job start and auto-refunded on any failure: download error, API timeout, invalid URL. Cache hits are free: an identical URL with the same tool and params within the 7-day cache window costs zero credits.
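Those pricing rules reduce to a small function. A sketch, with illustrative names rather than an official SDK:

```python
# Fixed per-run credit costs, as listed above.
TOOL_COSTS = {
    "quick_transcribe": 1,
    "deep_analyze": 5,
    "clip_context_quick": 1,
    "clip_context_deep": 3,
}

def batch_analyze_cost(per_video_tool, n_videos):
    """Per-video cost times count, with 10% off once the batch hits 5 videos."""
    total = TOOL_COSTS[per_video_tool] * n_videos
    if n_videos >= 5:
        total *= 0.9
    return total

batch_analyze_cost("deep_analyze", 4)  # 20 credits, no discount
batch_analyze_cost("deep_analyze", 5)  # 22.5 credits, discount applied
```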
No. Videos are fetched to a tmpfs, decoded in memory, and purged the moment the tool returns. We cache the analysis result for 7 days keyed by a hash of URL, tool, and params — but never the raw video or audio. The cache is what makes cache hits free; it is not a video library.
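That cache key can be sketched as a hash over a canonicalized (URL, tool, params) triple. The exact hash function and canonicalization here are assumptions; only the key ingredients come from the description above:

```python
import hashlib
import json

def cache_key(url, tool, params):
    """Stable key: identical inputs always hash the same, regardless of dict order."""
    canonical = json.dumps(
        {"url": url, "tool": tool, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

k1 = cache_key("https://example.com/v.mp4", "deep_analyze", {"lang": "en", "depth": 2})
k2 = cache_key("https://example.com/v.mp4", "deep_analyze", {"depth": 2, "lang": "en"})
# k1 == k2: param order doesn't create a spurious cache miss
```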
In claude.ai or the desktop app: + → Connectors → Add custom connector → paste https://contendeo.app/mcp/ → Add → sign in with Google. Any Claude plan with custom-connector access works (Pro, Max, Team, Enterprise). Free plans support one custom connector — spend yours here.