
From Opaque Video to Monetizable Moments: Building a Contextual Ad Engine with TwelveLabs

Nathan Che
Most CTV and FAST platforms make ad decisions without analyzing what's actually happening on screen. This tutorial walks through building a contextual ad engine that uses TwelveLabs Pegasus 1.5 for structured scene intelligence, Marengo 3.0 for semantic embeddings, and Databricks Delta Lake for enterprise analytics. The result: ad placement driven by real video understanding with full IAB 3.1 taxonomy compliance and FreeWheel-compatible payloads.

May 19, 2026
14 Minutes
TLDR
Most CTV/FAST platforms still make ad decisions without looking at what's actually happening on screen. This tutorial walks through building a production-grade contextual ad engine that uses TwelveLabs Pegasus 1.5 for structured scene intelligence, Marengo 3.0 for multimodal embeddings, and Databricks Delta Lake for enterprise analytics. The result: ad placement driven by real video understanding rather than stale metadata, with full IAB 3.1 taxonomy compliance and FreeWheel-compatible payloads.
What you'll build: A complete pipeline that transforms video content into queryable context, matches ads to scenes based on semantic similarity and brand safety rules, identifies optimal break points, and exports decisioning data to Databricks for downstream analytics.
Introduction
Most ad decision stacks treat video as an opaque blob. They rely on metadata, content labels, or historical audience segments to make placement decisions. Everything except the video itself.
This approach works for broad targeting. Keyword matching can get you in the right ballpark. But it leaves significant revenue on the table because it fails to account for three things:
Timing: Ads placed without awareness of scene transitions interrupt the viewing experience and drive abandonment.
Context: Brand safety violations happen when systems can't see what's actually happening on screen. An alcohol ad shouldn't run during a scene depicting addiction recovery.
Depth: Surface-level demographic targeting misses the nuance of household income, viewing device, and real-time engagement signals.
This tutorial addresses all three by building a contextual ad engine that treats video as queryable, structured data rather than a black box. The engine combines:
TwelveLabs Pegasus 1.5 for fine-grained scene understanding: sentiment, tone, cast, environment, and GARM-aligned safety signals
TwelveLabs Marengo 3.0 for multimodal semantic embeddings that enable scene-to-ad similarity scoring
Databricks Delta Lake + Mosaic AI Vector Search for enterprise-grade storage and retrieval
FreeWheel/OpenRTB-compatible payload generation for direct integration with existing ad servers

Figure 1: Intelligence Scene Extraction in Video Inventory
The goal: answer the question "Which ad should run at this break, for this audience, in this specific scene, while respecting brand safety and campaign constraints?" with data grounded in actual video content.
Here's a walkthrough of the finished application:

Prerequisites
Before starting, you'll need:
Node.js 18+ and npm/yarn/pnpm
TwelveLabs API Key with two indexes:
TL_INDEX_ID for content videos
TL_AD_INDEX_ID for ad creatives
Vercel Blob Token (BLOB_READ_WRITE_TOKEN) for handling large video file transfers to TwelveLabs
OpenAI API Key (optional) for low-latency text embedding during IAB 3.1 taxonomy mapping
Databricks Workspace (optional) with DATABRICKS_TOKEN, DATABRICKS_HOST, DATABRICKS_HTTP_PATH, and optionally DATABRICKS_CATALOG and DATABRICKS_SCHEMA
Clone and run:
git clone https://github.com/nathanchess/twelvelabs-context-ad-engine.git
cd twelvelabs-context-ad-engine
npm install
cp .env.example .env.local
npm
Architecture Overview

Figure 2: Contextual Ad Engine Backend Architecture (LucidChart)
The architecture leverages two TwelveLabs models that serve complementary roles:
Marengo 3.0 is the encoder. It transforms video into searchable 512-dimensional vector embeddings, making products, emotions, environments, and moments queryable. This enables semantic matching between ad creatives and content scenes.
Pegasus 1.5 is the reasoning model. It generates structured metadata about each scene: demographics, brand safety flags, sentiment, and targeting recommendations. It supports structured outputs, producing consistent JSON that downstream systems can parse deterministically.
The engine folds the outputs of both models into a single deterministic calculation, shown on the right-hand side of the architecture diagram: (User-Ad Match Score) x (Scene-Ad Match Score). Ads are therefore recommended through scene-level decisions grounded in real video understanding, not through pre-written text metadata.
This allows the ad engine to treat each segment as a living context signal, considering:
Tone
Sentiment
Environment
Brand Safety
This approach makes ad decisions based on what's actually happening in the video, not on content metadata that was labeled weeks ago. For deeper background on the underlying technology, see the TwelveLabs Platform Overview and TwelveLabs Research.
Core Ad Decision | Placement Logic
The core decision logic combines both signals into a single score:
totalScore = adAffinity * sceneFit
Where adAffinity measures how well an ad fits the viewer profile (demographics, interests, policy constraints) and sceneFit measures how well the ad creative fits the current scene (semantic similarity + safety + tone + environment).
The scoring pipeline combines four weighted signals into the sceneFit calculation:
sceneFit =
  suitableMatch  * 0.15 +  // Pegasus suitable_categories overlap
  environmentFit * 0.15 +  // environment-category affinity
  toneCompat     * 0.10 +  // emotional tone compatibility
  contextMatch   * 0.60    // Marengo semantic cosine similarity
The weighting is intentional. In CTV/OTT monetization, the largest CPM lift typically comes from semantic context quality, so Marengo 3.0 drives most of the score. The remaining signals preserve rule-based controllability for policy and content safety teams who need deterministic guardrails.
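As a minimal sketch of the two formulas above, the weighted sum and the final product can be written as plain functions. The `SceneSignals` interface, the `clamp01` helper, and the assumption that every signal is already normalized to the range 0 to 1 are illustrative choices, not code taken from the repository.

```typescript
// Hypothetical shape: each signal is assumed to be pre-normalized to [0, 1].
interface SceneSignals {
  suitableMatch: number;  // Pegasus suitable_categories overlap
  environmentFit: number; // environment-category affinity
  toneCompat: number;     // emotional tone compatibility
  contextMatch: number;   // Marengo semantic cosine similarity
}

// Weights copied from the sceneFit formula above; they sum to 1.
const SCENE_FIT_WEIGHTS: SceneSignals = {
  suitableMatch: 0.15,
  environmentFit: 0.15,
  toneCompat: 0.10,
  contextMatch: 0.60,
};

// Guard against out-of-range inputs so one bad signal cannot dominate.
const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

function sceneFit(signals: SceneSignals): number {
  return (
    clamp01(signals.suitableMatch) * SCENE_FIT_WEIGHTS.suitableMatch +
    clamp01(signals.environmentFit) * SCENE_FIT_WEIGHTS.environmentFit +
    clamp01(signals.toneCompat) * SCENE_FIT_WEIGHTS.toneCompat +
    clamp01(signals.contextMatch) * SCENE_FIT_WEIGHTS.contextMatch
  );
}

// totalScore = adAffinity * sceneFit
function totalScore(adAffinity: number, signals: SceneSignals): number {
  return clamp01(adAffinity) * sceneFit(signals);
}
```

Because the two factors are multiplied, an ad with zero viewer affinity scores zero no matter how well it fits the scene, and vice versa.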
Step 1: Generate Structured Ad Metadata (Pegasus + IAB + FreeWheel)
This step extracts structured scene intelligence from video content using Pegasus 1.5, normalizes it to IAB 3.1 taxonomy, and generates FreeWheel-compatible key-value pairs for ad server integration.
1.1 - Run Pegasus 1.5 with Structured Output
The /api/analyze endpoint handles three tasks:
Accepts a prompt from the frontend (from the /videos or /ads page)
Checks Vercel Blob cache to avoid redundant processing
Calls Pegasus 1.5 with structured output and stores the result
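import { TwelveLabs } from "twelvelabs-js";

const tl_client = new TwelveLabs({ apiKey: process.env.TL_API_KEY });

// videoId, prompt, and response_format are supplied by the route handler
const result = await tl_client.analyze(
  { videoId, prompt, temperature: 0.2, response_format },
  { timeoutInSeconds: 90 }
);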
The output is time-aligned metadata that downstream systems can reason over: scene boundaries, sentiment, environment, cast, and safety flags. This replaces brittle keyword-based targeting with grounded video understanding.
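Before handing this metadata to downstream systems, it can help to validate it defensively. The sketch below assumes a hypothetical `SceneSegment` shape with field names chosen for illustration (`startSec`, `endSec`, `sentiment`, `environment`, `cast`, `safetyFlags`); the real schema is whatever the `response_format` passed to Pegasus defines.

```typescript
// Hypothetical segment shape; field names are assumptions, not the repo's schema.
interface SceneSegment {
  startSec: number;
  endSec: number;
  sentiment: string;
  environment: string;
  cast: string[];
  safetyFlags: string[];
}

const isStringArray = (x: unknown): x is string[] =>
  Array.isArray(x) && x.every((item) => typeof item === "string");

// Type guard: accept only objects that carry every expected field.
function isSceneSegment(value: unknown): value is SceneSegment {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const startSec = v["startSec"];
  const endSec = v["endSec"];
  return (
    typeof startSec === "number" &&
    typeof endSec === "number" &&
    endSec > startSec &&
    typeof v["sentiment"] === "string" &&
    typeof v["environment"] === "string" &&
    isStringArray(v["cast"]) &&
    isStringArray(v["safetyFlags"])
  );
}

// Parse a JSON array of segments, drop malformed entries, sort by start time.
function parseSceneSegments(rawJson: string): SceneSegment[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(isSceneSegment).sort((a, b) => a.startSec - b.startSec);
}
```

Dropping malformed segments instead of throwing keeps one bad scene from blocking the rest of the video, at the cost of hiding the failure, so a production version would likely log what it discards.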
1.2 - Normalize Model Output to IAB 3.1 via Embedding KNN ID Matching
The analysis output from Pegasus needs to map to the IAB Content Taxonomy 3.1 for ad server compatibility. The pipeline uses text embeddings and k-nearest-neighbor matching against canonical IAB IDs.

The approach maintains a closed reference table of approved IAB 3.1 rows:
export const IAB_ALLOWED_ROWS = [
  { tier1: "Alcohol", tier2: "Spirits", code: "1005" },
  { tier1: "Alcohol", tier2: "Beer", code: "1003" },
  { tier1: "Consumer Packaged Goods", tier2: "General Food", tier3: "Snacks", code: "1169" },
  { tier1: "Finance and Insurance", tier2: "Stocks and Investments", code: "1338" },
  { tier1: "Vehicles", tier2: "Automotive Ownership", tier3: "New Vehicle Ownership", code: "1536" },
  // ...
] as const;
Each row is embedded once at index time. At runtime, candidate labels from the model are embedded and matched via KNN to the nearest canonical IAB rows, then thresholded and deduplicated:
export function normalizeIabWithKnnPolicy(
  rawInput: unknown,
  categoryKey?: string
): IabPolicyResult {
  const rawItems = Array.isArray(rawInput) ? rawInput : [];

  // 1) Embed candidate text from model output
  const embeddedCandidates = embedCandidateLabels(rawItems);

  // 2) KNN against canonical IAB 3.1 embedding index
  const knnMatches = queryIabKnnIndex(embeddedCandidates, { k: 5 });

  // 3) Keep only policy-compliant matches above similarity threshold
  const normalizedItems = dedupeAndSort(
    applyIabMatchPolicy(knnMatches).filter(
      (item): item is IabTaxonomyItem => Boolean(item)
    )
  );

  const high = normalizedItems.filter((item) => item.confidence >= IAB_HIGH_CONFIDENCE);
  const medium = normalizedItems.filter((item) => item.confidence >= IAB_MEDIUM_CONFIDENCE);

  let effectiveItems: IabTaxonomyItem[] = [];
  let fallbackApplied = false;
  let fallbackReason: string | null = null;

  if (high.length > 0) {
    effectiveItems = high;
  } else if (medium.length > 0) {
    effectiveItems = medium;
    fallbackReason = "No high-confidence Tier-2/3 matches; using medium-confidence Tier-1 band.";
  } else {
    const fallback = (categoryKey && FALLBACK_BY_CATEGORY_KEY[categoryKey]) || [];
    effectiveItems = fallback;
    fallbackApplied = true;
    fallbackReason = fallback.length
      ? "No medium-confidence KNN matches; applied deterministic vertical fallback."
      : "No medium-confidence KNN matches and no category fallback mapping found.";
  }

  const effectiveTier1 = [...new Set(effectiveItems.map((item) => item.tier1))];
  const effectiveTier2 = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.tier2))]
    : [];
  const effectiveTier3 = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.tier3).filter((tier3): tier3 is string => Boolean(tier3)))]
    : [];
  const effectiveIabIds = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.iabId).filter(Boolean))]
    : [];

  const averageConfidence = normalizedItems.length > 0
    ? normalizedItems.reduce((sum, item) => sum + item.confidence, 0) / normalizedItems.length
    : 0;

  return {
    normalizedItems,
    effectiveTier1,
    effectiveTier2,
    effectiveTier3,
    effectiveIabIds,
    averageConfidence,
    fallbackApplied,
    fallbackReason,
  };
}
This pipeline:
Embeds model-generated category phrases
Runs KNN similarity search against canonical IAB 3.1 row embeddings
Snaps candidates to valid IAB taxonomy rows/IDs only
Deduplicates and ranks matches by confidence
Promotes high-confidence rows as effective targeting fields
Applies deterministic vertical fallback when confidence is too low
This is critical for production ad tech: it prevents taxonomy hallucination, enforces valid IAB 3.1 IDs, and still captures semantic nuance through embedding-based matching.
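The KNN step itself is not shown in the tutorial, so here is a rough sketch of what a brute-force version could look like for a small closed table. The `IabRow` and `IabMatch` types, the `nearestIabRows` function, and the three-dimensional toy vectors are all assumptions for illustration; real embeddings would come from a text-embedding model, and the repository's `queryIabKnnIndex` may work differently.

```typescript
// Hypothetical types for a brute-force KNN over a closed IAB reference table.
interface IabRow {
  code: string;
  label: string;
  embedding: number[];
}

interface IabMatch {
  code: string;
  label: string;
  confidence: number; // cosine similarity to the candidate
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every row, drop weak matches, keep the k most similar.
function nearestIabRows(
  candidate: number[],
  index: IabRow[],
  k: number,
  minConfidence: number
): IabMatch[] {
  return index
    .map((row) => ({
      code: row.code,
      label: row.label,
      confidence: cosine(candidate, row.embedding),
    }))
    .filter((match) => match.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, k);
}

// Toy 3-dimensional vectors, made up purely so the example runs.
const toyIndex: IabRow[] = [
  { code: "1005", label: "Alcohol > Spirits", embedding: [1, 0, 0] },
  { code: "1003", label: "Alcohol > Beer", embedding: [0.8, 0.6, 0] },
  { code: "1536", label: "Vehicles > New Vehicle Ownership", embedding: [0, 0, 1] },
];

const toyMatches = nearestIabRows([0.9, 0.1, 0], toyIndex, 2, 0.5);
```

Because every result is a row from the closed table, the model can never emit an IAB code that does not exist; the worst case is an empty list, which is where the deterministic fallback takes over.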
1.3 - Build FreeWheel KVP Payload from Normalized Metadata

Once IAB and context signals are normalized, the engine generates FreeWheel key-value pairs for downstream ad serving:
const freewheelPayload = {
  ad_server: "Freewheel",
  endpoint: "https://ads.freewheel.tv/ad/p/1",
  generated_kvps: {
    vw_brand: toBrand(parsed.company),
    vw_ctx_inc: includeContexts.join(","),
    vw_ctx_exc: excludeContexts.join(","),
    vw_garm_floor: "strict",
    vw_duration: String(duration),
    vw_ad_title: parsed.proposedTitle || "untitled",
    vw_iab_t1: policy.effectiveTier1.join(","),
    vw_iab_t2: policy.effectiveTier2.join(","),
    vw_iab_t3: policy.effectiveTier3.join(","),
    // effectiveIabIds is the field returned by normalizeIabWithKnnPolicy
    vw_iab_codes: policy.effectiveIabIds.join(","),
    vw_iab_conf: policy.averageConfidence.toFixed(3),
  },
};
The key fields:
vw_ctx_inc combines target contexts and Pegasus-recommended contexts
vw_ctx_exc combines campaign exclusions, Pegasus negatives, and GARM flags
vw_iab_* fields are populated only from normalized/effective classes
This step is what connects AI-generated understanding to existing ad ops workflows. TwelveLabs provides semantic intelligence. Policy normalization ensures deterministic, auditable taxonomy behavior. FreeWheel/OpenRTB mapping makes the outputs deployable in production ad servers.
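How the include and exclude context lists are assembled is not shown, so the following is one possible sketch. The `ContextSources` shape, the snake_case normalization, and the rule that an excluded context is also removed from the include list are assumptions made for illustration.

```typescript
// Hypothetical input shape; the five source lists mirror the description above.
interface ContextSources {
  targetContexts: string[];
  pegasusRecommended: string[];
  campaignExclusions: string[];
  pegasusNegatives: string[];
  garmFlags: string[];
}

// Normalize labels so "Bar Scene" and "bar scene" collapse to one key.
function normalizeContext(label: string): string {
  return label.trim().toLowerCase().replace(/\s+/g, "_");
}

// Merge several lists into one de-duplicated list, preserving first-seen order.
function mergeContexts(...sources: string[][]): string[] {
  const seen = new Set<string>();
  for (const source of sources) {
    for (const label of source) {
      const key = normalizeContext(label);
      if (key) seen.add(key);
    }
  }
  return Array.from(seen);
}

function buildContextKvps(input: ContextSources): { vw_ctx_inc: string; vw_ctx_exc: string } {
  const exclude = mergeContexts(
    input.campaignExclusions,
    input.pegasusNegatives,
    input.garmFlags
  );
  const excluded = new Set(exclude);
  // Assumption: exclusions always win, so an excluded context is never also included.
  const include = mergeContexts(input.targetContexts, input.pegasusRecommended).filter(
    (context) => !excluded.has(context)
  );
  return { vw_ctx_inc: include.join(","), vw_ctx_exc: exclude.join(",") };
}
```

Letting exclusions win keeps the payload internally consistent: the ad server never receives the same context in both fields.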
Step 2: Build Multimodal Embeddings with Marengo
Both content scenes and ad creatives are vectorized into the same 512-dimensional embedding space using Marengo 3.0. This enables true semantic matching between scenes and ads, not just keyword overlap.
export function cosineSimilarity(vecA: number[], vecB: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  const len = Math.min(vecA.length, vecB.length);
  for (let i = 0; i < len; i++) {
    dot += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

A visualization of these embeddings is available in the deployed application under the Metadata View, showing how semantically similar scenes cluster together.
To improve ranking spread, the engine normalizes expected cosine values and applies a non-linear boost (power transform). This separates high-quality matches more clearly in candidate rankings, making the difference between a 0.7 and 0.8 similarity score more meaningful for ad selection.
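The normalization and power transform could look roughly like the sketch below. The `floor`, `ceiling`, and `exponent` defaults are placeholder values chosen for illustration, not the constants used by the engine, and `boostSimilarity` is a hypothetical name.

```typescript
// Rescale a raw cosine score into [0, 1] over an expected range, then apply
// a power transform. All three defaults are illustrative assumptions.
function boostSimilarity(
  cosine: number,
  floor = 0.2,    // cosine values at or below this map to 0
  ceiling = 0.9,  // cosine values at or above this map to 1
  exponent = 2    // exponent > 1 stretches differences near the top of the range
): number {
  if (ceiling <= floor) throw new Error("ceiling must be greater than floor");
  const normalized = Math.min(1, Math.max(0, (cosine - floor) / (ceiling - floor)));
  return Math.pow(normalized, exponent);
}

// With these defaults, the gap between 0.7 and 0.8 widens after boosting.
const rawGap = 0.8 - 0.7;
const boostedGap = boostSimilarity(0.8) - boostSimilarity(0.7);
```

An exponent above 1 compresses weak matches toward zero and spreads strong matches apart, which is the ranking behavior described above; an exponent below 1 would do the opposite.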
Step 3: Identify Optimal Ad Breaks
Before recommending ads, the engine identifies optimal monetization points within the content. Pegasus 1.5 analyzes each scene for:
Post-segment break quality
Interruption risk
Emotional valley detection
Transition type bonus
Mode-aware safety multiplier (strict, balanced, revenue_max)

The engine then applies spacing constraints and selects top breaks greedily with chronological ordering.
This matters in production because ad relevance is only useful if insertion timing is viewer-safe and UX-aware. A perfectly matched ad placed mid-sentence or during an emotional climax will still drive abandonment.
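A minimal sketch of greedy selection under a spacing constraint is shown below. It assumes each candidate already carries a single combined `score` (the factors listed above folded into one number); the `BreakCandidate` type and the `selectBreaks` function are hypothetical names, not the engine's implementation.

```typescript
// Hypothetical candidate: a timestamp plus one pre-combined break-quality score.
interface BreakCandidate {
  timeSec: number;
  score: number;
}

// Greedy selection: take the best-scoring candidate first, skip any candidate
// that falls too close to one already chosen, then return in playback order.
function selectBreaks(
  candidates: BreakCandidate[],
  maxBreaks: number,
  minSpacingSec: number
): BreakCandidate[] {
  const byScore = candidates.slice().sort((a, b) => b.score - a.score);
  const chosen: BreakCandidate[] = [];
  for (const candidate of byScore) {
    if (chosen.length >= maxBreaks) break;
    const farEnough = chosen.every(
      (picked) => Math.abs(picked.timeSec - candidate.timeSec) >= minSpacingSec
    );
    if (farEnough) chosen.push(candidate);
  }
  return chosen.sort((a, b) => a.timeSec - b.timeSec);
}

const selectedBreaks = selectBreaks(
  [
    { timeSec: 120, score: 0.9 },
    { timeSec: 150, score: 0.85 }, // strong, but too close to the 120s break
    { timeSec: 400, score: 0.7 },
    { timeSec: 700, score: 0.6 },
  ],
  2,
  180
);
```

Greedy selection is not guaranteed to maximize the total score across all breaks, but it is simple to audit and always keeps the single best break.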
Step 4: Rank Ads with Safety Gates + Diversity Constraints
With optimal break points identified, embeddings computed, and metadata extracted, the engine can rank ads. But raw scoring isn't enough. Production ad decisioning requires two additional layers:
1. Hard Gates for Brand Safety
Before scoring, ads are filtered through:
User/category eligibility checks
Negative campaign context overlap detection
GARM-sensitive exclusions (alcohol, gambling, violence)
Safety mode gate policies
2. Cross-Break Diversity
No viewer, even one who loves cars, wants to see a car ad at every break. The engine enforces:
Same-ad repetition caps across breaks
Category frequency limits
Fallback logic when diversity constraints suppress top candidates

The result is an ad plan that is both high-scoring and broadcast-realistic, tailored to each viewer, scene, and available inventory.
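The two layers can be sketched together as a single pass over the breaks. Everything here is an illustrative assumption: the `RankedAd` and `BreakSlot` shapes, the single `garmSensitive` flag standing in for the full set of hard gates, and the choice to fall back to the best eligible ad when the diversity caps rule out every candidate.

```typescript
// Hypothetical shapes; one boolean stands in for the full set of hard gates.
interface RankedAd {
  adId: string;
  category: string;
  score: number;
  garmSensitive: boolean;
}

interface BreakSlot {
  breakId: string;
  allowGarmSensitive: boolean;
  candidates: RankedAd[];
}

function planBreaks(
  slots: BreakSlot[],
  maxPerAd: number,
  maxPerCategory: number
): Record<string, string | null> {
  const adCount = new Map<string, number>();
  const categoryCount = new Map<string, number>();
  const plan: Record<string, string | null> = {};

  for (const slot of slots) {
    // Hard gate first: ineligible ads are removed before any scoring decision.
    const eligible = slot.candidates
      .filter((ad) => slot.allowGarmSensitive || !ad.garmSensitive)
      .sort((a, b) => b.score - a.score);

    // Diversity: best eligible ad that is still under both caps.
    const diverse = eligible.find(
      (ad) =>
        (adCount.get(ad.adId) ?? 0) < maxPerAd &&
        (categoryCount.get(ad.category) ?? 0) < maxPerCategory
    );

    // Fallback: if the caps suppress everything, reuse the best eligible ad.
    // The hard gate is never bypassed.
    const pick = diverse ?? eligible[0] ?? null;
    plan[slot.breakId] = pick ? pick.adId : null;
    if (pick) {
      adCount.set(pick.adId, (adCount.get(pick.adId) ?? 0) + 1);
      categoryCount.set(pick.category, (categoryCount.get(pick.category) ?? 0) + 1);
    }
  }
  return plan;
}
```

Filtering before ranking matters: a brand-unsafe ad never competes on score, so no weighting change can accidentally let it through.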
Step 5: Export to Databricks for Enterprise Retrieval and Analytics
The metadata, embeddings, and ad decisioning data generated by this engine are only valuable if they flow into enterprise workflows. The engine exports all signals to Databricks Delta tables for downstream analytics and ML pipelines.
Queries are generated on-demand to match each user's Databricks workspace:
CREATE OR REPLACE VIEW ad_metadata_premium_spirits_vec AS
SELECT
  creative_id,
  campaign_name,
  from_json(marengo_embedding_json, 'array<double>') AS embedding
FROM main.default.ad_metadata_premium_spirits
WHERE vector_sync_status = 'embedded_marengo_clip_avg'
This data lift into Databricks enables:
Mosaic AI Vector Search Indexing for semantic retrieval at scale
Campaign QA with full audit trails on every decision
Similarity Retrieval for creative ops and competitive analysis
Model-assisted planning in BI and ML pipelines
For teams evaluating enterprise rollout, this is where the TwelveLabs + Databricks combination becomes compelling: model-native video intelligence meets production data governance and retrieval infrastructure.
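The view above parses a `marengo_embedding_json` column and filters on the status `embedded_marengo_clip_avg`. One plausible reading, and it is only an assumption, is that each creative's clip-level embeddings are averaged into one vector and stored as a JSON string. A sketch of preparing such a row is below; `averageEmbeddings` and `toDeltaRow` are hypothetical helpers, and the actual write to Databricks is left out.

```typescript
// Element-wise mean of clip-level embeddings (assumed meaning of "clip_avg").
function averageEmbeddings(clips: number[][]): number[] {
  if (clips.length === 0) return [];
  const dim = clips[0].length;
  const sum: number[] = new Array<number>(dim).fill(0);
  for (const clip of clips) {
    if (clip.length !== dim) {
      throw new Error("all clip embeddings must share one dimension");
    }
    for (let i = 0; i < dim; i++) sum[i] += clip[i];
  }
  return sum.map((value) => value / clips.length);
}

// Column names follow the SQL view above; the row shape itself is an assumption.
function toDeltaRow(creativeId: string, campaignName: string, clips: number[][]) {
  return {
    creative_id: creativeId,
    campaign_name: campaignName,
    // Stored as a JSON string so from_json(..., 'array<double>') can parse it.
    marengo_embedding_json: JSON.stringify(averageEmbeddings(clips)),
    vector_sync_status: "embedded_marengo_clip_avg",
  };
}

const exampleRow = toDeltaRow("creative-001", "Premium Spirits", [
  [1, 2],
  [3, 4],
]);
```

Storing the vector as a JSON string keeps the export path simple, and the view converts it back to a typed array only where vector search needs it.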
Why TwelveLabs for Contextual Advertising
You've now built (or walked through) a contextual ad engine that makes placement decisions based on actual video content rather than stale metadata. Few systems outside of purpose-built video AI can support this depth of decisioning across timing, sentiment, semantics, and policy in a production-ready architecture.
TwelveLabs provides the foundation:
Pegasus 1.5 for fine-grained, structured scene intelligence
Marengo 3.0 for multimodal semantic retrieval and matching
An API-first architecture that integrates cleanly into existing ad tech stacks
The combination transforms video from an opaque storage cost into a queryable, monetizable asset.
Start Building
Playground: Home | TwelveLabs
API Reference: Introduction | TwelveLabs
Product Overview: Video AI Platform: Search, Analyze & Embed - TwelveLabs
Sales | Enterprise: Contact TwelveLabs: Talk to Sales
Architecture Diagram: https://lucid.app/lucidchart/ef8d11e1-3f00-4bf0-b411-ab8e3bb3606b/edit?viewport_loc=515%2C-1146%2C5419%2C2654%2C0_0&invitationId=inv_09de1972-142b-4369-9df4-f91eb3f5a949
Deployed Application: Contextual Ad Engine — TwelveLabs
TLDR
Most CTV/FAST platforms still make ad decisions without looking at what's actually happening on screen. This tutorial walks through building a production-grade contextual ad engine that uses TwelveLabs Pegasus 1.5 for structured scene intelligence, Marengo 3.0 for multimodal embeddings, and Databricks Delta Lake for enterprise analytics. The result: ad placement driven by real video understanding rather than stale metadata, with full IAB 3.1 taxonomy compliance and FreeWheel-compatible payloads.
What you'll build: A complete pipeline that transforms video content into queryable context, matches ads to scenes based on semantic similarity and brand safety rules, identifies optimal break points, and exports decisioning data to Databricks for downstream analytics.
Introduction
Most ad decision stacks treat video as an opaque blob. They rely on metadata, content labels, or historical audience segments to make placement decisions. Everything except the video itself.
This approach works for broad targeting. Keyword matching can get you in the right ballpark. But it leaves significant revenue on the table because it fails to account for three things:
Timing: Ads placed without awareness of scene transitions interrupt the viewing experience and drive abandonment.
Context: Brand safety violations happen when systems can't see what's actually happening on screen. An alcohol ad shouldn't run during a scene depicting addiction recovery.
Depth: Surface-level demographic targeting misses the nuance of household income, viewing device, and real-time engagement signals.
This tutorial addresses all three by building a contextual ad engine that treats video as queryable, structured data rather than a black box. The engine combines:
TwelveLabs Pegasus 1.5 for fine-grained scene understanding: sentiment, tone, cast, environment, and GARM-aligned safety signals
TwelveLabs Marengo 3.0 for multimodal semantic embeddings that enable scene-to-ad similarity scoring
Databricks Delta Lake + Mosaic AI Vector Search for enterprise-grade storage and retrieval
FreeWheel/OpenRTB-compatible payload generation for direct integration with existing ad servers

Figure 1: Intelligence Scene Extraction in Video Inventory
The goal: answer the question "Which ad should run at this break, for this audience, in this specific scene, while respecting brand safety and campaign constraints?" with data grounded in actual video content.
Here's a walkthrough of the finished application:

Prerequisites
Before starting, you'll need:
Node.js 18+ and npm/yarn/pnpm
TwelveLabs API Key with two indexes:
TL_INDEX_IDfor content videosTL_AD_INDEX_IDfor ad creatives
Vercel Blob Token (
BLOB_READ_WRITE_TOKEN) for handling large video file transfers to TwelveLabsOpenAI API Key (optional) for low-latency text embedding during IAB 3.1 taxonomy mapping
Databricks Workspace (optional) with
DATABRICKS_TOKEN,DATABRICKS_HOST,DATABRICKS_HTTP_PATH, and optionallyDATABRICKS_CATALOGandDATABRICKS_SCHEMA
Clone and run:
>> git clone https://github.com/nathanchess/twelvelabs-context-ad-engine.git >> cd contextual-ad-engine >> npm install >> cp .env.example .env.local >> npm
Architecture Overview

Figure 2: Contextual Ad Engine Backend Architecture (LucidChart)
The architecture leverages two TwelveLabs models that serve complementary roles:
Marengo 3.0 is the encoder. It transforms video into searchable 512-dimensional vector embeddings, making products, emotions, environments, and moments queryable. This enables semantic matching between ad creatives and content scenes.
Pegasus 1.5 is the reasoning model. It generates structured metadata about each scene: demographics, brand safety flags, sentiment, and targeting recommendations. It supports structured outputs, producing consistent JSON that downstream systems can parse deterministically.
By leveraging their unique capabilities and metadata generated into a single deterministic calculation, shown on the right hand side of the technical architecture diagram, of (User-Ad Match Score) x (Scene-Ad Match Score) we are able to recommend ads not based off of pre-written text metadata, but making scene-level decisions grounded in real video understanding.
This allows the ad engine to treat each segment as a living context signal, considering:
Tone
Sentiment
Environment
Brand Safety
This approach makes ad decisions based on what's actually happening in the video, not on content metadata that was labeled weeks ago. For deeper background on the underlying technology, see the TwelveLabs Platform Overview and TwelveLabs Research.
Core Ad Decision | Placement Logic
The core decision logic combines both signals into a single score:
totalScore = adAffinity * sceneFit
Where adAffinity measures how well an ad fits the viewer profile (demographics, interests, policy constraints) and sceneFit measures how well the ad creative fits the current scene (semantic similarity + safety + tone + environment).
The scoring pipeline combines four weighted signals into the sceneFit calculation:
sceneFit = suitableMatch * 0.15 + // Pegasus suitable_categories overlap environmentFit * 0.15 + // environment-category affinity toneCompat * 0.10 + // emotional tone compatibility contextMatch * 0.60 // Marengo semantic cosine similarity
The weighting is intentional. In CTV/OTT monetization, the largest CPM lift typically comes from semantic context quality, so Marengo 3.0 drives most of the score. The remaining signals preserve rule-based controllability for policy and content safety teams who need deterministic guardrails.
Step 1: Generate Structured Ad Metadata (Pegasus + IAB + FreeWheel)
This step extracts structured scene intelligence from video content using Pegasus 1.5, normalizes it to IAB 3.1 taxonomy, and generates FreeWheel-compatible key-value pairs for ad server integration.
1.1 - Run Pegasus 1.5 with Structured Output
The /api/analyze endpoint handles three tasks:
Accepts a prompt from the frontend (from the
/videosor/adspage)Checks Vercel Blob cache to avoid redundant processing
Calls Pegasus 1.5 with structured output and stores the result
const tl_client = new TwelveLabs({ apiKey: process.env.TL_API_KEY }); const result = await tl_client.analyze({ videoId, prompt, temperature: 0.2, response_format }, { timeoutInSeconds: 90 });
The output is time-aligned metadata that downstream systems can reason over: scene boundaries, sentiment, environment, cast, and safety flags. This replaces brittle keyword-based targeting with grounded video understanding.
1.2 - Normalize Model Output to IAB 3.1 via Embedding KNN ID Matching
The analysis output from Pegasus needs to map to the IAB Content Taxonomy 3.1 for ad server compatibility. The pipeline uses text embeddings and k-nearest-neighbor matching against canonical IAB IDs.

The approach maintains a closed reference table of approved IAB 3.1 rows:
export const IAB_ALLOWED_ROWS = [ { tier1: "Alcohol", tier2: "Spirits", code: "1005" }, { tier1: "Alcohol", tier2: "Beer", code: "1003" }, { tier1: "Consumer Packaged Goods", tier2: "General Food", tier3: "Snacks", code: "1169" }, { tier1: "Finance and Insurance", tier2: "Stocks and Investments", code: "1338" }, { tier1: "Vehicles", tier2: "Automotive Ownership", tier3: "New Vehicle Ownership", code: "1536" }, // ... ] as const;
Each row is embedded once at index time. At runtime, candidate labels from the model are embedded and matched via KNN to the nearest canonical IAB rows, then thresholded and deduplicated:
export function normalizeIabWithKnnPolicy( rawInput: unknown, categoryKey?: string ): IabPolicyResult { const rawItems = Array.isArray(rawInput) ? rawInput : []; // 1) Embed candidate text from model output const embeddedCandidates = embedCandidateLabels(rawItems); // 2) KNN against canonical IAB 3.1 embedding index const knnMatches = queryIabKnnIndex(embeddedCandidates, { k: 5 }); // 3) Keep only policy-compliant matches above similarity threshold const normalizedItems = dedupeAndSort( applyIabMatchPolicy(knnMatches).filter( (item): item is IabTaxonomyItem => Boolean(item) ) ); const high = normalizedItems.filter((item) => item.confidence >= IAB_HIGH_CONFIDENCE); const medium = normalizedItems.filter((item) => item.confidence >= IAB_MEDIUM_CONFIDENCE); let effectiveItems: IabTaxonomyItem[] = []; let fallbackApplied = false; let fallbackReason: string | null = null; if (high.length > 0) { effectiveItems = high; } else if (medium.length > 0) { effectiveItems = medium; fallbackReason = "No high-confidence Tier-2/3 matches; using medium-confidence Tier-1 band."; } else { const fallback = (categoryKey && FALLBACK_BY_CATEGORY_KEY[categoryKey]) || []; effectiveItems = fallback; fallbackApplied = true; fallbackReason = fallback.length ? "No medium-confidence KNN matches; applied deterministic vertical fallback." : "No medium-confidence KNN matches and no category fallback mapping found."; } const effectiveTier1 = [...new Set(effectiveItems.map((item) => item.tier1))]; const effectiveTier2 = high.length > 0 ? [...new Set(effectiveItems.map((item) => item.tier2))] : []; const effectiveTier3 = high.length > 0 ? [...new Set(effectiveItems.map((item) => item.tier3).filter((tier3): tier3 is string => Boolean(tier3)))] : []; const effectiveIabIds = high.length > 0 ? [...new Set(effectiveItems.map((item) => item.iabId).filter(Boolean))] : []; const averageConfidence = normalizedItems.length > 0 ? 
normalizedItems.reduce((sum, item) => sum + item.confidence, 0) / normalizedItems.length : 0; return { normalizedItems, effectiveTier1, effectiveTier2, effectiveTier3, effectiveIabIds, averageConfidence, fallbackApplied, fallbackReason, }; }
This pipeline:
Embeds model-generated category phrases
Runs KNN similarity search against canonical IAB 3.1 row embeddings
Snaps candidates to valid IAB taxonomy rows/IDs only
Deduplicates and ranks matches by confidence
Promotes high-confidence rows as effective targeting fields
Applies deterministic vertical fallback when confidence is too low
This is critical for production ad tech: it prevents taxonomy hallucination, enforces valid IAB 3.1 IDs, and still captures semantic nuance through embedding-based matching.
1.3 - Build FreeWheel KVP Payload from Normalized Metadata

Once IAB and context signals are normalized, the engine generates FreeWheel key-value pairs for downstream ad serving:
const freewheelPayload = { ad_server: "Freewheel", endpoint: "https://ads.freewheel.tv/ad/p/1", generated_kvps: { vw_brand: toBrand(parsed.company), vw_ctx_inc: includeContexts.join(","), vw_ctx_exc: excludeContexts.join(","), vw_garm_floor: "strict", vw_duration: String(duration), vw_ad_title: parsed.proposedTitle || "untitled", vw_iab_t1: policy.effectiveTier1.join(","), vw_iab_t2: policy.effectiveTier2.join(","), vw_iab_t3: policy.effectiveTier3.join(","), vw_iab_codes: policy.effectiveCodes.join(","), vw_iab_conf: policy.averageConfidence.toFixed(3), }, };
The key fields:
vw_ctx_inccombines target contexts and Pegasus-recommended contextsvw_ctx_exccombines campaign exclusions, Pegasus negatives, and GARM flagsvw_iab_*fields are populated only from normalized/effective classes
This step is what connects AI-generated understanding to existing ad ops workflows. TwelveLabs provides semantic intelligence. Policy normalization ensures deterministic, auditable taxonomy behavior. FreeWheel/OpenRTB mapping makes the outputs deployable in production ad servers.
Step 2: Build Multimodal Embeddings with Marengo
Both content scenes and ad creatives are vectorized into the same 512-dimensional embedding space using Marengo 3.0. This enables true semantic matching between scenes and ads, not just keyword overlap.
export function cosineSimilarity(vecA: number[], vecB: number[]): number { let dot = 0, normA = 0, normB = 0; const len = Math.min(vecA.length, vecB.length); for (let i = 0; i < len; i++) { dot += vecA[i] * vecB[i]; normA += vecA[i] * vecA[i]; normB += vecB[i] * vecB[i]; } if (normA === 0 || normB === 0) return 0; return dot / (Math.sqrt(normA) * Math.sqrt(normB)); }

A visualization of these embeddings is available in the deployed application under the Metadata View, showing how semantically similar scenes cluster together.
Introduction
Most ad decision stacks treat video as an opaque blob. They rely on metadata, content labels, or historical audience segments to make placement decisions. Everything except the video itself.
This approach works for broad targeting. Keyword matching can get you in the right ballpark. But it leaves significant revenue on the table because it fails to account for three things:
Timing: Ads placed without awareness of scene transitions interrupt the viewing experience and drive abandonment.
Context: Brand safety violations happen when systems can't see what's actually happening on screen. An alcohol ad shouldn't run during a scene depicting addiction recovery.
Depth: Surface-level demographic targeting misses the nuance of household income, viewing device, and real-time engagement signals.
This tutorial addresses all three by building a contextual ad engine that treats video as queryable, structured data rather than a black box. The engine combines:
TwelveLabs Pegasus 1.5 for fine-grained scene understanding: sentiment, tone, cast, environment, and GARM-aligned safety signals
TwelveLabs Marengo 3.0 for multimodal semantic embeddings that enable scene-to-ad similarity scoring
Databricks Delta Lake + Mosaic AI Vector Search for enterprise-grade storage and retrieval
FreeWheel/OpenRTB-compatible payload generation for direct integration with existing ad servers

Figure 1: Intelligence Scene Extraction in Video Inventory
The goal: answer the question "Which ad should run at this break, for this audience, in this specific scene, while respecting brand safety and campaign constraints?" with data grounded in actual video content.
Here's a walkthrough of the finished application:

Prerequisites
Before starting, you'll need:
Node.js 18+ and npm/yarn/pnpm
TwelveLabs API Key with two indexes:
TL_INDEX_ID for content videos
TL_AD_INDEX_ID for ad creatives
Vercel Blob Token (BLOB_READ_WRITE_TOKEN) for handling large video file transfers to TwelveLabs
OpenAI API Key (optional) for low-latency text embedding during IAB 3.1 taxonomy mapping
Databricks Workspace (optional) with DATABRICKS_TOKEN, DATABRICKS_HOST, DATABRICKS_HTTP_PATH, and optionally DATABRICKS_CATALOG and DATABRICKS_SCHEMA
Clone and run:
>> git clone https://github.com/nathanchess/twelvelabs-context-ad-engine.git
>> cd twelvelabs-context-ad-engine
>> npm install
>> cp .env.example .env.local
>> npm
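Before starting the app, it can help to confirm that the variables listed above are actually set. The snippet below is a minimal sketch rather than code from the repository: the helper name missingEnvVars and the choice of which variables count as required are assumptions drawn from the prerequisites list (TL_API_KEY is the name used in the Pegasus call later in this tutorial).

```typescript
// Hypothetical helper (not from the repository): report which required
// environment variables are missing or blank.
type Env = Record<string, string | undefined>;

// Assumed split: the OpenAI and Databricks settings are described as
// optional above, so they are not checked here.
const REQUIRED_VARS: readonly string[] = [
  "TL_API_KEY",
  "TL_INDEX_ID",
  "TL_AD_INDEX_ID",
  "BLOB_READ_WRITE_TOKEN",
];

function missingEnvVars(env: Env, required: readonly string[] = REQUIRED_VARS): string[] {
  return required.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

const sampleEnv: Env = { TL_API_KEY: "example-key", TL_INDEX_ID: "content-index" };
console.log(missingEnvVars(sampleEnv)); // ["TL_AD_INDEX_ID", "BLOB_READ_WRITE_TOKEN"]
```

In a Node.js process you would pass process.env instead of the sample object.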
Architecture Overview

Figure 2: Contextual Ad Engine Backend Architecture (LucidChart)
The architecture leverages two TwelveLabs models that serve complementary roles:
Marengo 3.0 is the encoder. It transforms video into searchable 512-dimensional vector embeddings, making products, emotions, environments, and moments queryable. This enables semantic matching between ad creatives and content scenes.
Pegasus 1.5 is the reasoning model. It generates structured metadata about each scene: demographics, brand safety flags, sentiment, and targeting recommendations. It supports structured outputs, producing consistent JSON that downstream systems can parse deterministically.
The engine folds both models' outputs into a single deterministic calculation, shown on the right-hand side of the architecture diagram: (User-Ad Match Score) x (Scene-Ad Match Score). Recommendations therefore come from scene-level decisions grounded in real video understanding rather than from pre-written text metadata.
This allows the ad engine to treat each segment as a living context signal, considering:
Tone
Sentiment
Environment
Brand Safety
This approach makes ad decisions based on what's actually happening in the video, not on content metadata that was labeled weeks ago. For deeper background on the underlying technology, see the TwelveLabs Platform Overview and TwelveLabs Research.
Core Ad Decision | Placement Logic
The core decision logic combines both signals into a single score:
totalScore = adAffinity * sceneFit
Where adAffinity measures how well an ad fits the viewer profile (demographics, interests, policy constraints) and sceneFit measures how well the ad creative fits the current scene (semantic similarity + safety + tone + environment).
The scoring pipeline combines four weighted signals into the sceneFit calculation:
sceneFit =
  suitableMatch * 0.15 +  // Pegasus suitable_categories overlap
  environmentFit * 0.15 + // environment-category affinity
  toneCompat * 0.10 +     // emotional tone compatibility
  contextMatch * 0.60     // Marengo semantic cosine similarity
The weighting is intentional. In CTV/OTT monetization, the largest CPM lift typically comes from semantic context quality, so Marengo 3.0 drives most of the score. The remaining signals preserve rule-based controllability for policy and content safety teams who need deterministic guardrails.
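To make the two formulas concrete, here is a small self-contained sketch that combines them. The weights are taken from the formula above; the type and function names, and the assumption that every input signal is already normalized to the range 0 to 1, are illustrative and not taken from the repository.

```typescript
// Signals are assumed to be pre-normalized to [0, 1].
interface SceneSignals {
  suitableMatch: number;  // Pegasus suitable_categories overlap
  environmentFit: number; // environment-category affinity
  toneCompat: number;     // emotional tone compatibility
  contextMatch: number;   // Marengo semantic cosine similarity
}

// Weights copied from the sceneFit formula; they sum to 1.0.
const SCENE_FIT_WEIGHTS: SceneSignals = {
  suitableMatch: 0.15,
  environmentFit: 0.15,
  toneCompat: 0.1,
  contextMatch: 0.6,
};

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

function computeSceneFit(s: SceneSignals): number {
  return (
    clamp01(s.suitableMatch) * SCENE_FIT_WEIGHTS.suitableMatch +
    clamp01(s.environmentFit) * SCENE_FIT_WEIGHTS.environmentFit +
    clamp01(s.toneCompat) * SCENE_FIT_WEIGHTS.toneCompat +
    clamp01(s.contextMatch) * SCENE_FIT_WEIGHTS.contextMatch
  );
}

// totalScore = adAffinity * sceneFit
function computeTotalScore(adAffinity: number, s: SceneSignals): number {
  return clamp01(adAffinity) * computeSceneFit(s);
}

const exampleScore = computeTotalScore(0.8, {
  suitableMatch: 1,
  environmentFit: 0.5,
  toneCompat: 0.5,
  contextMatch: 0.7,
});
console.log(exampleScore.toFixed(3)); // "0.556"
```

Because the two factors are multiplied, a weak viewer match or a weak scene match is enough to pull the total down; neither signal can compensate fully for the other.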
Step 1: Generate Structured Ad Metadata (Pegasus + IAB + FreeWheel)
This step extracts structured scene intelligence from video content using Pegasus 1.5, normalizes it to IAB 3.1 taxonomy, and generates FreeWheel-compatible key-value pairs for ad server integration.
1.1 - Run Pegasus 1.5 with Structured Output
The /api/analyze endpoint handles three tasks:
Accepts a prompt from the frontend (from the /videos or /ads page)
Checks Vercel Blob cache to avoid redundant processing
Calls Pegasus 1.5 with structured output and stores the result
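import { TwelveLabs } from "twelvelabs-js";

const tl_client = new TwelveLabs({ apiKey: process.env.TL_API_KEY });

// videoId, prompt, and response_format are defined earlier in the route handler (not shown).
// A low temperature keeps the structured output consistent across runs.
const result = await tl_client.analyze(
  { videoId, prompt, temperature: 0.2, response_format },
  { timeoutInSeconds: 90 }
);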
The output is time-aligned metadata that downstream systems can reason over: scene boundaries, sentiment, environment, cast, and safety flags. This replaces brittle keyword-based targeting with grounded video understanding.
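Downstream code should not trust model output blindly, even when it is structured. The sketch below shows one way to validate and order scene segments before using them. The SceneSegment shape and its field names are hypothetical, chosen to mirror the signals named above; they are not the schema used by the tutorial's /api/analyze endpoint.

```typescript
// Hypothetical segment shape (illustrative field names only).
interface SceneSegment {
  startSec: number;
  endSec: number;
  sentiment: string;
  environment: string;
  safetyFlags: string[];
}

// Type guard: accept only objects with every field present and well-typed.
function isSceneSegment(value: unknown): value is SceneSegment {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const start = v["startSec"];
  const end = v["endSec"];
  const flags = v["safetyFlags"];
  return (
    typeof start === "number" &&
    typeof end === "number" &&
    Number.isFinite(start) &&
    Number.isFinite(end) &&
    end > start &&
    typeof v["sentiment"] === "string" &&
    typeof v["environment"] === "string" &&
    Array.isArray(flags) &&
    flags.every((f) => typeof f === "string")
  );
}

// Parse a JSON string, drop malformed segments, and sort by start time.
function parseSceneSegments(rawJson: string): SceneSegment[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return []; // malformed JSON: treat as "no usable segments"
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(isSceneSegment).sort((a, b) => a.startSec - b.startSec);
}

const rawOutput = JSON.stringify([
  { startSec: 30, endSec: 55, sentiment: "tense", environment: "office", safetyFlags: [] },
  { startSec: 0, endSec: 30, sentiment: "upbeat", environment: "kitchen", safetyFlags: ["alcohol"] },
  { startSec: 55, sentiment: "calm" }, // missing fields, so it is dropped
]);
console.log(parseSceneSegments(rawOutput).map((s) => s.startSec)); // [0, 30]
```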
1.2 - Normalize Model Output to IAB 3.1 via Embedding KNN ID Matching
The analysis output from Pegasus needs to map to the IAB Content Taxonomy 3.1 for ad server compatibility. The pipeline uses text embeddings and k-nearest-neighbor matching against canonical IAB IDs.

The approach maintains a closed reference table of approved IAB 3.1 rows:
export const IAB_ALLOWED_ROWS = [
  { tier1: "Alcohol", tier2: "Spirits", code: "1005" },
  { tier1: "Alcohol", tier2: "Beer", code: "1003" },
  { tier1: "Consumer Packaged Goods", tier2: "General Food", tier3: "Snacks", code: "1169" },
  { tier1: "Finance and Insurance", tier2: "Stocks and Investments", code: "1338" },
  { tier1: "Vehicles", tier2: "Automotive Ownership", tier3: "New Vehicle Ownership", code: "1536" },
  // ...
] as const;
Each row is embedded once at index time. At runtime, candidate labels from the model are embedded and matched via KNN to the nearest canonical IAB rows, then thresholded and deduplicated:
export function normalizeIabWithKnnPolicy(
  rawInput: unknown,
  categoryKey?: string
): IabPolicyResult {
  const rawItems = Array.isArray(rawInput) ? rawInput : [];

  // 1) Embed candidate text from model output
  const embeddedCandidates = embedCandidateLabels(rawItems);

  // 2) KNN against canonical IAB 3.1 embedding index
  const knnMatches = queryIabKnnIndex(embeddedCandidates, { k: 5 });

  // 3) Keep only policy-compliant matches above similarity threshold
  const normalizedItems = dedupeAndSort(
    applyIabMatchPolicy(knnMatches).filter(
      (item): item is IabTaxonomyItem => Boolean(item)
    )
  );

  const high = normalizedItems.filter((item) => item.confidence >= IAB_HIGH_CONFIDENCE);
  const medium = normalizedItems.filter((item) => item.confidence >= IAB_MEDIUM_CONFIDENCE);

  let effectiveItems: IabTaxonomyItem[] = [];
  let fallbackApplied = false;
  let fallbackReason: string | null = null;

  if (high.length > 0) {
    effectiveItems = high;
  } else if (medium.length > 0) {
    effectiveItems = medium;
    fallbackReason = "No high-confidence Tier-2/3 matches; using medium-confidence Tier-1 band.";
  } else {
    const fallback = (categoryKey && FALLBACK_BY_CATEGORY_KEY[categoryKey]) || [];
    effectiveItems = fallback;
    fallbackApplied = true;
    fallbackReason = fallback.length
      ? "No medium-confidence KNN matches; applied deterministic vertical fallback."
      : "No medium-confidence KNN matches and no category fallback mapping found.";
  }

  const effectiveTier1 = [...new Set(effectiveItems.map((item) => item.tier1))];
  const effectiveTier2 = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.tier2))]
    : [];
  const effectiveTier3 = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.tier3).filter((tier3): tier3 is string => Boolean(tier3)))]
    : [];
  const effectiveIabIds = high.length > 0
    ? [...new Set(effectiveItems.map((item) => item.iabId).filter(Boolean))]
    : [];
  const averageConfidence = normalizedItems.length > 0
    ? normalizedItems.reduce((sum, item) => sum + item.confidence, 0) / normalizedItems.length
    : 0;

  return {
    normalizedItems,
    effectiveTier1,
    effectiveTier2,
    effectiveTier3,
    effectiveIabIds,
    averageConfidence,
    fallbackApplied,
    fallbackReason,
  };
}
This pipeline:
Embeds model-generated category phrases
Runs KNN similarity search against canonical IAB 3.1 row embeddings
Snaps candidates to valid IAB taxonomy rows/IDs only
Deduplicates and ranks matches by confidence
Promotes high-confidence rows as effective targeting fields
Applies deterministic vertical fallback when confidence is too low
This is critical for production ad tech: it prevents taxonomy hallucination, enforces valid IAB 3.1 IDs, and still captures semantic nuance through embedding-based matching.
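The function above calls queryIabKnnIndex without showing it. As a rough illustration of the KNN step only, the sketch below runs a brute-force nearest-neighbor search over a tiny in-memory index. Everything here is an assumption made for illustration: the nearestIabRows name, the 3-dimensional toy vectors standing in for real text embeddings, and the 0.5 threshold. The repository's actual index and thresholds may differ.

```typescript
interface IabRow {
  code: string;
  label: string;
  embedding: number[]; // toy vectors; real text embeddings have many more dimensions
}

interface KnnMatch {
  code: string;
  label: string;
  confidence: number; // cosine similarity to the query
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force KNN: score every row, drop weak matches, keep the top k.
function nearestIabRows(
  query: number[],
  index: IabRow[],
  k: number,
  minConfidence: number
): KnnMatch[] {
  return index
    .map((row) => ({ code: row.code, label: row.label, confidence: cosine(query, row.embedding) }))
    .filter((m) => m.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, k);
}

// Codes come from the reference table above; the vectors are made up.
const toyIndex: IabRow[] = [
  { code: "1005", label: "Alcohol > Spirits", embedding: [0.9, 0.1, 0.0] },
  { code: "1003", label: "Alcohol > Beer", embedding: [0.7, 0.3, 0.1] },
  { code: "1169", label: "Consumer Packaged Goods > General Food > Snacks", embedding: [0.0, 0.2, 0.9] },
];

// Pretend this is the embedding of a model phrase such as "premium whiskey".
const queryVector = [1, 0.1, 0];
console.log(nearestIabRows(queryVector, toyIndex, 2, 0.5).map((m) => m.code)); // ["1005", "1003"]
```

Because candidates can only snap to rows that exist in the index, a label the model invents can never surface as an IAB ID; at worst it falls below the threshold and triggers the fallback path.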
1.3 - Build FreeWheel KVP Payload from Normalized Metadata

Once IAB and context signals are normalized, the engine generates FreeWheel key-value pairs for downstream ad serving:
const freewheelPayload = {
  ad_server: "Freewheel",
  endpoint: "https://ads.freewheel.tv/ad/p/1",
  generated_kvps: {
    vw_brand: toBrand(parsed.company),
    vw_ctx_inc: includeContexts.join(","),
    vw_ctx_exc: excludeContexts.join(","),
    vw_garm_floor: "strict",
    vw_duration: String(duration),
    vw_ad_title: parsed.proposedTitle || "untitled",
    vw_iab_t1: policy.effectiveTier1.join(","),
    vw_iab_t2: policy.effectiveTier2.join(","),
    vw_iab_t3: policy.effectiveTier3.join(","),
    // effectiveIabIds is the field returned by normalizeIabWithKnnPolicy above
    vw_iab_codes: policy.effectiveIabIds.join(","),
    vw_iab_conf: policy.averageConfidence.toFixed(3),
  },
};
The key fields:
vw_ctx_inc combines target contexts and Pegasus-recommended contexts
vw_ctx_exc combines campaign exclusions, Pegasus negatives, and GARM flags
vw_iab_* fields are populated only from normalized/effective classes
This step is what connects AI-generated understanding to existing ad ops workflows. TwelveLabs provides semantic intelligence. Policy normalization ensures deterministic, auditable taxonomy behavior. FreeWheel/OpenRTB mapping makes the outputs deployable in production ad servers.
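The include and exclude fields are each built by combining several lists into one comma-separated value. A minimal sketch of that merge is shown below. The mergeContexts helper and its normalization rules (trim, lowercase, drop empties, keep first occurrence) are assumptions for illustration, not the repository's implementation.

```typescript
// Hypothetical helper: merge several context lists into one
// comma-separated value, removing blanks and duplicates.
function mergeContexts(...lists: string[][]): string {
  const seen = new Set<string>();
  for (const list of lists) {
    for (const raw of list) {
      const value = raw.trim().toLowerCase();
      if (value !== "") seen.add(value); // Set keeps first-insertion order
    }
  }
  return [...seen].join(",");
}

const targetContexts = ["Nightlife", "bar "];
const pegasusRecommended = ["bar", "Celebration", ""];
console.log(mergeContexts(targetContexts, pegasusRecommended)); // "nightlife,bar,celebration"
```

Deduplicating before joining keeps the key-value payload compact and makes the exported string stable for audit comparisons.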
Step 2: Build Multimodal Embeddings with Marengo
Both content scenes and ad creatives are vectorized into the same 512-dimensional embedding space using Marengo 3.0. This enables true semantic matching between scenes and ads, not just keyword overlap.
export function cosineSimilarity(vecA: number[], vecB: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  const len = Math.min(vecA.length, vecB.length);
  for (let i = 0; i < len; i++) {
    dot += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

A visualization of these embeddings is available in the deployed application under the Metadata View, showing how semantically similar scenes cluster together.
To improve ranking spread, the engine normalizes expected cosine values and applies a non-linear boost (power transform). This separates high-quality matches more clearly in candidate rankings, making the difference between a 0.7 and 0.8 similarity score more meaningful for ad selection.
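The tutorial does not list the constants used for this step, so the following is only a sketch of the general technique: rescale the raw cosine into 0 to 1 over an expected range, then raise it to a power. The boostSimilarity name, the expected range of 0.2 to 0.9, and the exponent of 2 are all assumed values chosen for illustration.

```typescript
// Min-max normalize a raw cosine over an assumed expected range,
// then apply a power transform (gamma > 1 stretches the high end).
function boostSimilarity(
  cosine: number,
  expectedMin = 0.2, // assumed lower bound of typical scene-ad cosines
  expectedMax = 0.9, // assumed upper bound
  gamma = 2          // assumed exponent
): number {
  if (expectedMax <= expectedMin) {
    throw new RangeError("expectedMax must be greater than expectedMin");
  }
  const normalized = Math.min(1, Math.max(0, (cosine - expectedMin) / (expectedMax - expectedMin)));
  return Math.pow(normalized, gamma);
}

console.log(boostSimilarity(0.7).toFixed(3)); // "0.510"
console.log(boostSimilarity(0.8).toFixed(3)); // "0.735"
```

With these assumed settings, the raw gap of 0.1 between 0.7 and 0.8 becomes a gap of roughly 0.22 after the transform, while the same raw gap between 0.3 and 0.4 shrinks to about 0.06. That is the intended effect: strong matches separate from each other, weak matches compress toward zero.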
Step 3: Identify Optimal Ad Breaks
Before recommending ads, the engine identifies optimal monetization points within the content. Pegasus 1.5 analyzes each scene for:
Post-segment break quality
Interruption risk
Emotional valley detection
Transition type bonus
Mode-aware safety multiplier (strict, balanced, revenue_max)

The engine then applies spacing constraints and selects top breaks greedily with chronological ordering.
This matters in production because ad relevance is only useful if insertion timing is viewer-safe and UX-aware. A perfectly matched ad placed mid-sentence or during an emotional climax will still drive abandonment.
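The selection step described above can be sketched as a greedy pass: take candidates from best to worst, skip any that fall too close to a break already chosen, then return the picks in playback order. The names and the example numbers below are illustrative assumptions, and each candidate's score is assumed to already combine the signals listed above.

```typescript
interface BreakCandidate {
  timeSec: number; // position of the candidate break in the content
  score: number;   // combined break-quality score (assumed precomputed)
}

function selectBreaks(
  candidates: BreakCandidate[],
  maxBreaks: number,
  minSpacingSec: number
): BreakCandidate[] {
  // Greedy: consider the highest-scoring candidates first.
  const byScore = [...candidates].sort((a, b) => b.score - a.score);
  const chosen: BreakCandidate[] = [];
  for (const candidate of byScore) {
    if (chosen.length >= maxBreaks) break;
    const farEnough = chosen.every(
      (picked) => Math.abs(picked.timeSec - candidate.timeSec) >= minSpacingSec
    );
    if (farEnough) chosen.push(candidate);
  }
  // Return in chronological order for playback.
  return chosen.sort((a, b) => a.timeSec - b.timeSec);
}

const breakCandidates: BreakCandidate[] = [
  { timeSec: 60, score: 0.6 },
  { timeSec: 120, score: 0.9 },
  { timeSec: 150, score: 0.8 },
  { timeSec: 400, score: 0.7 },
  { timeSec: 700, score: 0.5 },
];
console.log(selectBreaks(breakCandidates, 3, 180).map((b) => b.timeSec)); // [120, 400, 700]
```

Note that the second-best candidate (150 seconds) is skipped because it sits only 30 seconds from the best one. Greedy selection is simple and predictable, though it is not guaranteed to maximize the total score of the selected set.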
Step 4: Rank Ads with Safety Gates + Diversity Constraints
With optimal break points identified, embeddings computed, and metadata extracted, the engine can rank ads. But raw scoring isn't enough. Production ad decisioning requires two additional layers:
1. Hard Gates for Brand Safety
Before scoring, ads are filtered through:
User/category eligibility checks
Negative campaign context overlap detection
GARM-sensitive exclusions (alcohol, gambling, violence)
Safety mode gate policies
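A hard gate is a yes/no filter that runs before any scoring. The sketch below shows the shape of such a filter. The types, the firstFailedGate name, and in particular the way each safety mode treats sensitive categories are assumptions made for illustration; the tutorial does not spell out the per-mode rules.

```typescript
type SafetyMode = "strict" | "balanced" | "revenue_max";

interface AdCandidate {
  id: string;
  category: string;           // e.g. "alcohol", "automotive"
  negativeContexts: string[]; // contexts the campaign must not appear next to
}

interface SceneContext {
  contexts: string[];  // contexts detected in the scene
  garmFlags: string[]; // sensitive topics flagged for the scene
}

const GARM_SENSITIVE_CATEGORIES = new Set(["alcohol", "gambling", "violence"]);

// Returns the name of the first gate that rejects the ad, or null if it passes.
function firstFailedGate(
  ad: AdCandidate,
  scene: SceneContext,
  blockedCategories: string[],
  mode: SafetyMode
): string | null {
  // 1) User/category eligibility
  if (blockedCategories.includes(ad.category)) return "eligibility";
  // 2) Negative campaign context overlap
  if (ad.negativeContexts.some((c) => scene.contexts.includes(c))) return "negative_context";
  // 3) GARM-sensitive exclusions, tightened or relaxed by safety mode (assumed policy):
  //    strict blocks sensitive categories outright; balanced blocks them only
  //    in scenes that carry a GARM flag; revenue_max lets them through.
  if (GARM_SENSITIVE_CATEGORIES.has(ad.category)) {
    if (mode === "strict") return "garm";
    if (mode === "balanced" && scene.garmFlags.length > 0) return "garm";
  }
  return null;
}

const recoveryScene: SceneContext = { contexts: ["support_group"], garmFlags: ["addiction"] };
const whiskeyAd: AdCandidate = { id: "whiskey-30s", category: "alcohol", negativeContexts: [] };
console.log(firstFailedGate(whiskeyAd, recoveryScene, [], "balanced")); // "garm"
```

Returning the gate name instead of a bare boolean gives an audit trail for every rejected ad.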
2. Cross-Break Diversity
No viewer, even one who loves cars, wants to see a car ad at every break. The engine enforces:
Same-ad repetition caps across breaks
Category frequency limits
Fallback logic when diversity constraints suppress top candidates

The result is an ad plan that is both high-scoring and broadcast-realistic, tailored to each viewer, scene, and available inventory.
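To illustrate the diversity pass, the sketch below walks the breaks in order and, at each one, picks the best-scoring ad that has not hit its repetition or category cap. The type names, the cap values, and the choice to leave a slot empty (null) when every candidate is capped are assumptions; the repository's fallback behavior may differ.

```typescript
interface RankedAd {
  adId: string;
  category: string;
  score: number;
}

interface DiversityCaps {
  maxPerAd: number;       // same-ad repetition cap across breaks
  maxPerCategory: number; // category frequency limit across breaks
}

function planBreaks(rankedPerBreak: RankedAd[][], caps: DiversityCaps): (RankedAd | null)[] {
  const adCounts = new Map<string, number>();
  const categoryCounts = new Map<string, number>();

  return rankedPerBreak.map((candidates) => {
    const sorted = [...candidates].sort((a, b) => b.score - a.score);
    // Fallback: walk down the ranking until a candidate satisfies both caps.
    const pick = sorted.find(
      (ad) =>
        (adCounts.get(ad.adId) ?? 0) < caps.maxPerAd &&
        (categoryCounts.get(ad.category) ?? 0) < caps.maxPerCategory
    );
    if (pick === undefined) return null; // every candidate is capped
    adCounts.set(pick.adId, (adCounts.get(pick.adId) ?? 0) + 1);
    categoryCounts.set(pick.category, (categoryCounts.get(pick.category) ?? 0) + 1);
    return pick;
  });
}

const sameCandidates: RankedAd[] = [
  { adId: "car-a", category: "auto", score: 0.9 },
  { adId: "car-b", category: "auto", score: 0.8 },
  { adId: "snack-c", category: "food", score: 0.6 },
];
const adPlan = planBreaks(
  [sameCandidates, sameCandidates, sameCandidates],
  { maxPerAd: 1, maxPerCategory: 2 }
);
console.log(adPlan.map((ad) => ad?.adId ?? "none")); // ["car-a", "car-b", "snack-c"]
```

Even though the car ads outscore the snack ad at every break, the third break goes to the snack ad because both car creatives and the auto category have reached their caps.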
Step 5: Export to Databricks for Enterprise Retrieval and Analytics
The metadata, embeddings, and ad decisioning data generated by this engine are only valuable if they flow into enterprise workflows. The engine exports all signals to Databricks Delta tables for downstream analytics and ML pipelines.
Queries are generated on-demand to match each user's Databricks workspace:
CREATE OR REPLACE VIEW ad_metadata_premium_spirits_vec AS
SELECT
  creative_id,
  campaign_name,
  from_json(marengo_embedding_json, 'array<double>') AS embedding
FROM main.default.ad_metadata_premium_spirits
WHERE vector_sync_status = 'embedded_marengo_clip_avg'
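Generating this statement per workspace mostly means substituting the catalog, schema, and table names. A minimal sketch follows; the buildVectorViewSql name and the identifier check are assumptions, while the column names and the status filter are copied from the query above. Validating identifiers matters here because they are interpolated directly into SQL text.

```typescript
interface WorkspaceTarget {
  catalog: string; // e.g. the DATABRICKS_CATALOG setting
  schema: string;  // e.g. the DATABRICKS_SCHEMA setting
  table: string;
}

// Allow only plain identifiers so that user-supplied names cannot inject SQL.
const SAFE_IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

function buildVectorViewSql(target: WorkspaceTarget): string {
  for (const part of [target.catalog, target.schema, target.table]) {
    if (!SAFE_IDENTIFIER.test(part)) {
      throw new Error(`Unsafe SQL identifier: ${part}`);
    }
  }
  const source = `${target.catalog}.${target.schema}.${target.table}`;
  return [
    `CREATE OR REPLACE VIEW ${target.table}_vec AS`,
    `SELECT`,
    `  creative_id,`,
    `  campaign_name,`,
    `  from_json(marengo_embedding_json, 'array<double>') AS embedding`,
    `FROM ${source}`,
    `WHERE vector_sync_status = 'embedded_marengo_clip_avg'`,
  ].join("\n");
}

console.log(
  buildVectorViewSql({ catalog: "main", schema: "default", table: "ad_metadata_premium_spirits" })
);
```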
This data lift into Databricks enables:
Mosaic AI Vector Search Indexing for semantic retrieval at scale
Campaign QA with full audit trails on every decision
Similarity Retrieval for creative ops and competitive analysis
Model-assisted planning in BI and ML pipelines
For teams evaluating enterprise rollout, this is where the TwelveLabs + Databricks combination becomes compelling: model-native video intelligence meets production data governance and retrieval infrastructure.
Why TwelveLabs for Contextual Advertising
You've now built (or walked through) a contextual ad engine that makes placement decisions based on actual video content rather than stale metadata. Few systems outside of purpose-built video AI can support this depth of decisioning across timing, sentiment, semantics, and policy in a production-ready architecture.
TwelveLabs provides the foundation:
Pegasus 1.5 for fine-grained, structured scene intelligence
Marengo 3.0 for multimodal semantic retrieval and matching
An API-first architecture that integrates cleanly into existing ad tech stacks
The combination transforms video from an opaque storage cost into a queryable, monetizable asset.
Start Building
Playground: Home | TwelveLabs
API Reference: Introduction | TwelveLabs
Product Overview: Video AI Platform: Search, Analyze & Embed - TwelveLabs
Sales | Enterprise: Contact TwelveLabs: Talk to Sales
Architecture Diagram: https://lucid.app/lucidchart/ef8d11e1-3f00-4bf0-b411-ab8e3bb3606b/edit?viewport_loc=515%2C-1146%2C5419%2C2654%2C0_0&invitationId=inv_09de1972-142b-4369-9df4-f91eb3f5a949
Deployed Application: Contextual Ad Engine — TwelveLabs
© 2021-2026 TwelveLabs, Inc. All Rights Reserved