How to Scrape and Normalize Ad Performance for AI-driven Creative Optimization

2026-02-16 · 9 min read

Collect ad creative metrics across platforms, normalize to one schema, and prepare features for AI video ad generation.

Hook: You need reliable creative signals now — not fragmented spreadsheets

Ad ops and growth teams in 2026 are drowning in platform-specific metrics: Google Ads reports, Meta Insights, TikTok dashboards, and CSV exports that use different names, currencies, and timezones. That fragmentation breaks ML pipelines and slows down AI-driven creative optimization. This guide gives reproducible scraping templates, a canonical ad performance schema, and step-by-step feature engineering to feed modern video-ad generators and tabular foundation models.

Why this matters in 2026

Nearly 90% of advertisers now use generative AI for video creative, but adoption alone doesn't win; what separates winners is the quality and consistency of the input signals. Tabular foundation models and enterprise LLMs trained on structured tables became mainstream in late 2025 and early 2026, and clean, normalized tables are the new competitive moat. If your creative signals are scattered, your AI video generator hallucinates or makes poor creative choices.

Short story: Bad schema + bad features = bad creative. Normalize first, generate later.

High-level pipeline

  • Ingest: Prefer official APIs when available; fall back to UI scraping for creative-level details missing from APIs.
  • Normalize: Map platform-specific fields into a canonical schema and normalize units, currencies, and times.
  • Enrich: Extract creative-level features (duration, aspect, motion, on-screen text, frames, CLIP embeddings).
  • Feature store: Store per-creative and aggregated features in a single table for model training and inference. Consider edge datastore strategies when low-latency serving or regional storage matters.
  • Governance: Log provenance, consent, and rate limits; enforce retention and PII rules.

Ingest: Templates and reproducible scrapers

Use official APIs for reliability (Google Ads API, Meta Marketing API, TikTok Marketing API). Use headless browsers only where APIs lack creative-level fields such as raw asset URLs or creative thumbnails. Below are two reproducible templates: an API ingestion snippet and a Playwright UI scraper with anti-bot mitigation guidance.

1) Google Ads API reporting template (Python)

Use GAQL to pull campaign->ad group->ad metrics at the creative level. This snippet uses single quotes and minimal dependencies.

from google.ads.googleads.client import GoogleAdsClient

client = GoogleAdsClient.load_from_storage()
service = client.get_service('GoogleAdsService')

query = '''
select
  campaign.id,
  ad_group.id,
  ad_group_ad.ad.id,
  ad_group_ad.ad.name,
  metrics.impressions,
  metrics.clicks,
  metrics.cost_micros,
  metrics.video_views,
  metrics.video_view_rate,
  segments.date
from ad_group_ad
where segments.date between '2026-01-01' and '2026-01-15'
'''

response = service.search_stream(customer_id='INSERT_CUSTOMER', query=query)
for batch in response:
    for row in batch.results:
        # yield or write to CSV/JSON lines
        print({
            'platform': 'google',
            'campaign_id': row.campaign.id,
            'ad_id': row.ad_group_ad.ad.id,
            'name': row.ad_group_ad.ad.name,
            'impressions': row.metrics.impressions,
            'clicks': row.metrics.clicks,
            'cost_micros': row.metrics.cost_micros,
            'video_views': row.metrics.video_views,
            'date': row.segments.date
        })

2) Playwright UI fallback scraper (Python)

Remember: prefer APIs. Use UI scraping only when asset URLs, thumbnails, or creative text overlays are not accessible via the API. If you must scrape, use rotating residential proxies, realistic browsing patterns, and exponential backoff, and keep CAPTCHAs and legal constraints in mind.

from playwright.sync_api import sync_playwright

def scrape_creatives(username, password, start_url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(user_agent='Mozilla/5.0 (compatible)')
        page = context.new_page()
        page.goto(start_url)
        # login flow - adapt selectors
        page.fill('input[name="email"]', username)
        page.fill('input[name="password"]', password)
        page.click('button[type=submit]')
        page.wait_for_load_state('networkidle')  # wait for the login redirect to settle
        # navigate to creative gallery
        page.goto(start_url + '/creatives')
        items = page.query_selector_all('.creative-card')
        for it in items:
            creative = {
                'platform': 'ui-fallback',
                'creative_id': it.get_attribute('data-id'),
                'title': it.query_selector('.title').inner_text(),
                'thumbnail': it.query_selector('img').get_attribute('src')
            }
            print(creative)
        browser.close()

# Use with care and obey platform ToS

Normalization: The canonical ad performance schema

Define a single, platform-agnostic schema. This acts as the contract your models, dashboards, and feature store rely on. Below is a minimal yet practical schema. Store raw payloads in a cold bucket for replay and provenance.

Canonical schema (one row per creative x day x placement)

  • platform - string, e.g., 'google', 'meta', 'tiktok'
  • date - ISO date in UTC
  • campaign_id - string
  • ad_group_id - string or null
  • ad_id - string
  • creative_id - string
  • creative_type - enum 'video'|'image'|'carousel'
  • asset_urls - array of strings
  • duration_ms - int for videos
  • aspect_ratio - float (width/height)
  • impressions - int
  • clicks - int
  • video_views - int
  • completed_views - int
  • spend_usd - float (normalize currencies)
  • device - string
  • placement - string
  • audience_segment - string
  • raw_payload_ref - pointer to raw JSON blob
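
For teams that want this contract enforced in code, the same schema can be expressed as a typed record and validated at ingest. A minimal sketch using Python dataclasses (the class name and defaults are illustrative; field names match the list above):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CreativeDayRow:
    # one row per creative x day x placement
    platform: str                   # 'google' | 'meta' | 'tiktok'
    date: str                       # ISO date, UTC
    campaign_id: str
    ad_id: str
    creative_id: str
    creative_type: str              # 'video' | 'image' | 'carousel'
    impressions: int
    clicks: int
    spend_usd: float
    ad_group_id: Optional[str] = None
    asset_urls: List[str] = field(default_factory=list)
    duration_ms: Optional[int] = None      # videos only
    aspect_ratio: Optional[float] = None   # width / height
    video_views: Optional[int] = None
    completed_views: Optional[int] = None
    device: Optional[str] = None
    placement: Optional[str] = None
    audience_segment: Optional[str] = None
    raw_payload_ref: Optional[str] = None  # pointer to the raw JSON blob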

Normalization rules

  • Currency: Normalize to USD using a nightly FX job with reliable market rates.
  • Timezones: Convert all dates/times to UTC; store original timezone in raw payload.
  • Missing fields: Keep nulls explicit and log sample rate by platform to track coverage.
  • Deduplication: Use a deterministic key: concat(platform, campaign_id, ad_id, creative_hash, date).
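
One way to implement the deduplication rule is a small helper that fingerprints the primary asset and joins the key parts; the helper names are illustrative and assume the asset bytes are available at ingest:

import hashlib

def creative_hash(asset_bytes: bytes) -> str:
    # stable fingerprint of the primary asset (video or image bytes)
    return hashlib.sha256(asset_bytes).hexdigest()[:16]

def dedup_key(platform: str, campaign_id: str, ad_id: str, asset_bytes: bytes, date: str) -> str:
    # deterministic key: concat(platform, campaign_id, ad_id, creative_hash, date)
    return '|'.join([platform, campaign_id, ad_id, creative_hash(asset_bytes), date])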

Pandas mapping example

import pandas as pd

mapping = {
  'impr': 'impressions',
  'views': 'video_views',
  'spend': 'cost',
  'currency': 'currency'
}

def normalize_currency(df, fx_rates):
    # fx_rates: USD per one unit of local currency (e.g., 1 EUR = 1.07 USD)
    df['spend_usd'] = df.apply(lambda r: r['cost'] * fx_rates.get(r['currency'], 1.0), axis=1)
    return df

raw = pd.read_json('platform_dump.json', lines=True)
raw = raw.rename(columns=mapping)
raw['date'] = pd.to_datetime(raw['date'], utc=True).dt.date
raw = normalize_currency(raw, {'EUR': 1.07, 'USD': 1.0})
raw.to_parquet('normalized_creatives.parquet')

Feature engineering for AI video ad generation

Models need both numeric performance signals and rich creative features. Below are deterministic features that materially improve AI video generators and tabular models in 2026.

Core feature groups

  • Temporal performance: CTR, VTR, CTR by placement, uplift in first 3 seconds.
  • Creative composition: duration, aspect_ratio, frame_rate, scene_count.
  • Content signals: on-screen text count, brand logo presence, speaker on camera flag.
  • Visual embeddings: CLIP embeddings per keyframe, dominant color histograms.
  • Contextual targets: audience_segment, day_of_week, hour_of_day.

Extract frames and compute CLIP embeddings (pipeline)

  1. Extract keyframes at 1 fps or use scene-detection for variable FPS.
  2. Resize to model input and compute CLIP or image encoder embeddings.
  3. Aggregate embeddings across frames by median or attention-weighted mean.

# ffmpeg keyframe extraction (shell)
# ffmpeg -i input.mp4 -vf fps=1 frames/out%04d.jpg

# minimal Python for CLIP embeddings
import numpy as np
import torch
import clip
from PIL import Image

model, preprocess = clip.load('ViT-B/32', device='cpu')

frames = ['frames/out0001.jpg', 'frames/out0002.jpg']
embs = []
for f in frames:
    img = preprocess(Image.open(f)).unsqueeze(0)
    with torch.no_grad():
        e = model.encode_image(img)
    embs.append(e.cpu().numpy())

# aggregate per-frame embeddings into one vector per creative
clip_embedding = np.median(np.vstack(embs), axis=0)

On-screen text and speech-to-text

Use OCR on frames for on-screen text counts and timestamps. Use a speech-to-text service for transcripts and compute textual features: word count, brand token density, sentiment, and first-3-seconds keyword presence.
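
As one possible implementation of the OCR step (assuming pytesseract with a local Tesseract install; any OCR engine works the same way), count on-screen characters across the frames extracted earlier and feed the total into the text_density feature below:

from PIL import Image
import pytesseract  # assumes the Tesseract binary is installed locally

def on_screen_chars(frame_paths):
    # total characters of OCR'd text across sampled frames
    total = 0
    for path in frame_paths:
        text = pytesseract.image_to_string(Image.open(path))
        total += len(text.strip())
    return total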

Feature examples

  • engagement_per_sec = (clicks + video_views) / duration_seconds
  • first3s_ctr = clicks within first 3 seconds / impressions during first 3s
  • text_density = total_on_screen_chars / duration_seconds
  • brand_logo_presence = boolean
  • clip_sim_to_top_1 = cosine_sim(creative_embedding, embedding_of_top1)
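
A minimal pandas sketch of these scalar features, assuming the normalized table from earlier plus hypothetical first3s_clicks, first3s_impressions, and on_screen_chars columns (the first two require per-second reporting from the platform; the last comes from the OCR step above):

import numpy as np
import pandas as pd

df = pd.read_parquet('normalized_creatives.parquet')
dur_s = (df['duration_ms'] / 1000.0).replace(0, np.nan)

df['ctr'] = df['clicks'] / df['impressions'].replace(0, np.nan)
df['engagement_per_sec'] = (df['clicks'] + df['video_views']) / dur_s
df['first3s_ctr'] = df['first3s_clicks'] / df['first3s_impressions'].replace(0, np.nan)
df['text_density'] = df['on_screen_chars'] / dur_s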

Storing features: table design and SQL

Store per-creative features in a feature table to serve models during training and inference. Below is a compact SQL DDL suitable for a data warehouse.

create table creative_features (
  platform varchar,
  date date,
  campaign_id varchar,
  creative_id varchar,
  duration_ms int,
  aspect_ratio float,
  impressions bigint,
  clicks bigint,
  spend_usd double,
  ctr float,
  vtr float,
  engagement_per_sec float,
  text_density float,
  brand_logo_present boolean,
  clip_embedding varbinary, -- or separate vector column
  raw_payload_ref varchar,
  primary key (platform, creative_id, date)
);

Modeling tips for AI-driven creative generation

  • Train models on aggregated windows (7/14/28 days) and include recency flags. In 2026, recency matters more as creative trends flip weekly.
  • Use tabular foundation models for feature-rich inputs. They outperform generic LLMs when you provide normalized, well-typed columns.
  • Augment numeric features with vector columns: CLIP embeddings for visual similarity, and transcript embeddings for semantic matching.
  • Use causal evaluation: A/B test generated creatives vs human baseline, and instrument early-warning metrics like hallucination rate and factual brand mentions. For automating compliance checks on generated outputs, see legal & compliance automation.
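
A sketch of the aggregation-window tip above, assuming the normalized table from earlier (window sizes and the 7-day recency cutoff are illustrative):

import pandas as pd

df = pd.read_parquet('normalized_creatives.parquet')
df['date'] = pd.to_datetime(df['date'])
snapshot = df['date'].max()

frames = []
for window in (7, 14, 28):
    recent = df[df['date'] > snapshot - pd.Timedelta(days=window)]
    agg = (recent.groupby('creative_id')[['impressions', 'clicks', 'spend_usd']]
                 .sum()
                 .add_suffix(f'_{window}d'))
    frames.append(agg)

features = pd.concat(frames, axis=1).fillna(0)

# recency flag: creative served at least once in the last 7 days
last_seen = df.groupby('creative_id')['date'].max()
features['is_recent_7d'] = (snapshot - last_seen).dt.days < 7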

Operational concerns and anti-bot strategy

Scraping at scale triggers anti-bot defenses. Practical mitigations:

  • Prefer official APIs. They reduce legal risk and improve reliability.
  • Rotate IPs and user agents. Use session pooling and realistic timing; plan for provider flaps like you would in mass-automation work — see guidance on handling provider changes.
  • Use headless browsers with stealth plugins. Randomize mouse events and navigation patterns.
  • Handle CAPTCHAs by routing to human-in-the-loop or third-party solving only when legally permissible.
  • Monitor telemetry: 429/503 rates, anomaly detection on response shape, and automatic backoff policies.
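
A minimal retry wrapper for HTTP pulls illustrating the backoff policy (requests-based; retry counts and delays are illustrative, and official API client libraries often handle this for you):

import random
import time

import requests

def fetch_with_backoff(url, max_retries=5, base_delay=1.0):
    # jittered exponential backoff on rate-limit / unavailable responses
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=30)
        if resp.status_code not in (429, 503):
            resp.raise_for_status()
            return resp
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError(f'giving up on {url} after {max_retries} retries')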

Be explicit about legal and compliance boundaries. Scraping may violate platform terms of service and applicable laws. Consult legal counsel and prefer API ingestion for high-volume, long-running pipelines.

Proven patterns and real-world example

Case study: an eCommerce advertiser in late 2025 normalized creative signals across Google, Meta, and TikTok into the schema above. They ingested daily, enriched video features, and trained a tabular foundation model to propose 3-second hooks. Results in 90 days:

  • 10% lift in first-week CTR for AI-generated variants
  • 30% reduction in human video editing time via automated templates
  • Improved governance: all creatives tracked with provenance and raw payload refs

The secret was consistent, high-fidelity features: accurate first-3-second CTR, CLIP-based scene embeddings, and clean spend normalization.

Testing, monitoring, and feature drift

Build these checks into your pipeline:

  • Schema validation at ingest. Fail fast if expected columns are missing. Document schemas and public mappings using a documentation tool — see docs tooling comparisons.
  • Coverage dashboards showing percent of creatives with video duration, asset URL, and embeddings.
  • Drift detection: monitor distributional shifts for CTR, duration, and embedding cosine similarity from baseline.
  • Retraining cadence policy: weekly for fast-moving accounts, monthly for stable portfolios; scale training and retrain infra with strategies like auto-sharding blueprints and scalable pipelines.
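
For the drift check, a two-sample KS test on the per-creative CTR distribution is often enough as a first alarm; this sketch assumes scipy and an illustrative 0.05 threshold:

from scipy.stats import ks_2samp

def ctr_drift(baseline_ctr, current_ctr, alpha=0.05):
    # flag drift when the current CTR distribution diverges from the baseline window
    stat, p_value = ks_2samp(baseline_ctr, current_ctr)
    return {'ks_stat': stat, 'p_value': p_value, 'drifted': p_value < alpha}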

Security, privacy, and governance in 2026

Data governance is non-negotiable. In 2026 regulators and major platforms expect documented data lineage and opt-out compliance. Practical steps:

  • Keep PII out of feature tables. Remove or hash identifiers unless strictly necessary.
  • Log raw_payload_ref and a hash for replayability, not the full raw content in production tables.
  • Set retention windows that match legal and business needs; automatically purge old raw payloads. For storage architecture tradeoffs, see the guidance on distributed file systems and edge storage.
  • Record consent and ad account ownership metadata for auditing. Design audit trails that prove provenance and ownership—see audit trail design.
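
A minimal sketch of the identifier-hashing rule from the first point above (the salt handling is illustrative; manage real salts in your secret store):

import hashlib

def hash_identifier(value: str, salt: str) -> str:
    # keep joinability across tables without storing the raw identifier
    return hashlib.sha256((salt + value).encode('utf-8')).hexdigest()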

Actionable checklist to implement this in your stack

  1. Inventory: catalog platforms, fields available via API vs UI.
  2. Design: adopt the canonical schema above and build mapping configs per platform.
  3. Ingest: implement API pulls, with Playwright fallbacks for missing creative assets.
  4. Normalize: currency, timezone, and column mapping; persist raw payloads.
  5. Enrich: FFmpeg for frames, CLIP for embeddings, OCR and STT for text features.
  6. Feature store: store per-creative and aggregated features with provenance. Consider edge datastore patterns when you need geo-aware feature serving.
  7. Model: train tabular foundation models and test with causal A/B experiments. Keep legal checks in CI using automated compliance tooling where practical.
  8. Monitor: set alerts for drift, scraping failures, and coverage regressions.

Key takeaways

  • Normalize early: canonical tables unlock downstream ML and reduce debugging time.
  • Enrich smart: combine simple scalar metrics with visual and textual embeddings for best outcomes.
  • Prefer APIs: scraping is a last resort and requires strict governance. Plan your anti-bot and provider-resilience strategies in advance, and be prepared to rotate IPs and user agents.
  • Automate and monitor: schema checks and drift detection are essential in 2026's fast-moving ad ecosystem.

Further reading and tools

  • Google Ads API docs, Meta Marketing API docs, TikTok Marketing API docs
  • FFmpeg for frame extraction
  • OpenAI/LAION CLIP implementations and tabular foundation model literature (2025-2026)
  • IAB reports on AI adoption in advertising (2026)

Call to action

If you want reproducible starter templates, a canonical schema JSON, and a runnable Docker pipeline that pulls, normalizes, and produces features for an AI video generator, download our starter kit or contact the scrapes.us engineering team for a tailored audit and POC. Move from fragmented metrics to a single reliable feature table and unlock generative creative at scale.
