Using ClickHouse for High-Speed OLAP on Web-Scraped Data: Implementation Walkthrough
2026-03-04

A hands-on 2026 guide to architecting ClickHouse for large-scale scraped data: schema, ingestion (Kafka/HTTP), partitioning, vectors, and benchmarks.

Why ClickHouse is the fast OLAP engine you need for large-scale scraped data

Scraping pipelines break when data volume spikes, anti-bot measures slow crawls, and your analytics warehouse chokes on wide, nested page payloads. In 2026, teams demand predictable ingestion, sub-second aggregate queries over billions of rows, and cheap storage for raw HTML and embeddings. ClickHouse has become a mainstream OLAP choice—backed by major funding and rapid feature growth—and it shines when used as the central analytics store for scraped data.

The evolution of ClickHouse for scraped-data analytics (2024–2026)

Through late 2025 and into 2026, ClickHouse has consolidated features that matter for web-scraping workloads: faster bulk ingestion paths, expanded vector/array tooling, richer index types, and improved cloud managed offerings. Backed by a major funding round in 2025, vendor momentum means production-grade features (replication, distributed queries, TTL, projections) are now stable on large clusters, making ClickHouse a strong alternative to columnar cloud warehouses for OLAP on scraped datasets.

What this guide covers

  • Schema patterns tuned for scraped content and metadata
  • Robust ingestion pipelines: Kafka, HTTP bulk, and batch loaders
  • Partitioning, ORDER BY, TTL, and compression choices
  • Vectorization and embedding storage for semantic search
  • Query examples and benchmarking tips for 2026 workloads

1. Schema design: store raw, parsed, and derived data efficiently

Design for three layers in the same or separate tables: raw (full HTML/response), parsed (extracted fields like title, meta), and metrics/embeddings (tokens, vectors). This separation lets you keep raw payloads compressed with a different TTL and still run fast aggregations on the parsed layer.

Use MergeTree families. Keep cardinality-sensitive columns as LowCardinality(String) and heavy text as String with a compression codec. Use Array(Float32) or ClickHouse's vector-friendly types for embeddings.

Example: raw_pages (for retention and audit)

CREATE TABLE raw_pages (
    url String,
    domain LowCardinality(String),
    fetch_time DateTime64(3),
    http_status UInt16,
    headers String CODEC(ZSTD(3)),   -- JSON blob
    body String CODEC(ZSTD(3)),      -- raw HTML; heavy text compresses well with ZSTD
    content_length UInt32
  ) ENGINE = MergeTree()
  PARTITION BY toYYYYMM(fetch_time)
  ORDER BY (domain, url)
  TTL fetch_time + INTERVAL 180 DAY
  SETTINGS index_granularity = 8192;
  

Example: parsed_pages (fast analytics)

CREATE TABLE parsed_pages (
    url String,
    domain LowCardinality(String),
    url_hash UInt64,            -- sipHash64(url)
    fetch_time DateTime64(3),
    status UInt16,
    title String,
    meta_description String,
    content_text String,        -- cleaned text
    links Nested(
      href String,
      anchor String
    ),
    language LowCardinality(String),
    tokens Array(String),        -- optional token list
    token_count UInt32
  ) ENGINE = ReplacingMergeTree(fetch_time)
  PARTITION BY (domain, toYYYYMM(fetch_time))
  ORDER BY (domain, url_hash)
  SETTINGS index_granularity = 4096;
  

Example: embeddings (semantic search and ML)

CREATE TABLE page_embeddings (
    url String,
    url_hash UInt64,
    fetch_time DateTime64(3),
    embedding Array(Float32),   -- 1536-d or 768-d depending on model
    embedding_norm Float32,     -- precomputed L2 norm to speed cosine
    model_version String
  ) ENGINE = MergeTree()
  PARTITION BY toYYYYMM(fetch_time)
  ORDER BY (url_hash)
  SETTINGS index_granularity = 8192;
  

2. Ingestion patterns for reliability and scale

Choose an ingestion design that separates collection from storage: crawlers should push to an ingest queue (Kafka/RabbitMQ/SQS), then worker pools parse and push to ClickHouse. This improves resilience against temporary database outages and lets you scale parsers independently.
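The worker side of this design boils down to batching parsed rows into large payloads before they hit ClickHouse, since many small inserts create excessive MergeTree parts. A minimal sketch of such a batcher (the function name and default batch size are illustrative, not ClickHouse settings):

```python
import json

def to_ndjson_batches(rows, batch_size=10_000):
    """Group parsed rows into NDJSON payloads sized for ClickHouse bulk inserts.

    Large batches (10k-100k rows) keep the number of MergeTree parts low.
    Each yielded payload can be POSTed as an INSERT ... FORMAT JSONEachRow body.
    """
    batch = []
    for row in rows:
        batch.append(json.dumps(row))
        if len(batch) >= batch_size:
            yield "\n".join(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield "\n".join(batch)
```

In a real pipeline a worker would pull rows from the queue, accumulate them through this generator, and retry a failed POST with the same payload, which is what makes the queue-based design resilient to database outages.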

Pattern A — Kafka engine + materialized view (streaming)

Use ClickHouse's Kafka engine plus a materialized view to stream data into MergeTree. This gives at-least-once semantics and leverages Kafka's retention for replay.

-- Kafka topic schema uses JSONEachRow
CREATE TABLE kafka_raw_pages (
  url String,
  domain String,
  fetch_time DateTime64(3),
  status UInt16,
  headers String,
  body String
) ENGINE = Kafka SETTINGS
  kafka_broker_list = 'kafka:9092',
  kafka_topic_list = 'crawl_raw',
  kafka_group_name = 'ch_ingest_group',
  kafka_format = 'JSONEachRow';

CREATE MATERIALIZED VIEW mv_kafka_raw
TO raw_pages
AS
SELECT url, domain, fetch_time, status AS http_status, headers, body, length(body) AS content_length
FROM kafka_raw_pages;
  

Pattern B — HTTP bulk inserts (fast for batch loaders)

For periodic bulk loads after parsing, use ClickHouse's HTTP insert endpoint. Example Python snippet for JSONEachRow bulk push:

import json
import requests

rows = [
  {"url": "https://example.com/a", "domain": "example.com",
   "fetch_time": "2026-01-01 12:00:00", "http_status": 200,
   "headers": "{}", "body": "..."}
]
payload = '\n'.join(json.dumps(r) for r in rows)
resp = requests.post(
    'http://clickhouse:8123/',
    params={'query': 'INSERT INTO raw_pages FORMAT JSONEachRow'},
    data=payload.encode('utf-8'),
)
print(resp.status_code, resp.text)
  

Pattern C — Batch files via clickhouse-client (ETL style)

For scheduled backfills produce newline-delimited JSON (NDJSON) or Parquet and use clickhouse-client or clickhouse-local to load efficiently.

3. Partitioning, ORDER BY, and index strategies

Good partitioning reduces file count and speeds queries. For scraped data you can partition by time, by domain, or both. Time-based partitions simplify TTL; domain-based partitions speed domain-level analytics.

Guidelines

  • Time partitioning: PARTITION BY toYYYYMM(fetch_time) for monthly partitions. Use daily only if you ingest many terabytes per day.
  • Domain partitioning: combine domain and month for workloads dominated by domain-level aggregations, but only when domain cardinality is modest—a long tail of domains can explode the partition count.
  • ORDER BY: choose columns used in joins, group-by, and range scans. Use url_hash (UInt64) for deterministic ordering of strings.
  • TTL: use different TTLs for raw_pages (e.g., 180 days) and parsed_pages/embeddings (retain longer if used for models).
  • Compression: use CODEC(ZSTD(level)) for large strings. Apply Delta or LZ4 for numeric columns.

Advanced: secondary indexes and bloom filters

ClickHouse supports specialized index types (tokenbf_v1 / bloom_filter) for speeding up expensive LIKE/contains predicates on big text fields. Use them sparingly and measure with sample queries.
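As an illustration, a token bloom filter on parsed_pages.content_text might look like the following (the index name and the tokenbf_v1 parameters—filter size in bytes, hash count, seed—are starting points to tune against your own queries):

```sql
ALTER TABLE parsed_pages
  ADD INDEX idx_content_tokens content_text TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;

-- Build the index for existing parts; new inserts index automatically
ALTER TABLE parsed_pages MATERIALIZE INDEX idx_content_tokens;
```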

4. Vectorization and semantic search patterns

By 2026 production pipelines commonly attach embeddings to pages for semantic matching and ML features. ClickHouse can store embeddings in columns and run approximate similarity search inside SQL for hybrid OLAP/semantic workloads.

Store normalized embeddings

Precompute embedding norm and optionally store normalized vectors to make cosine similarity a fast dot product.
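The algebra behind this is simple: dividing each vector by its L2 norm once at write time means cosine similarity at query time is just a dot product, which is what the SQL arraySum/arrayMap expression below computes. A small sketch (pure Python for clarity; in practice you would normalize with NumPy or in the embedding service):

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm so cosine similarity reduces to a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are pre-normalized."""
    return sum(x * y for x, y in zip(a, b))
```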

-- Find top-10 similar pages using dot product
-- (query_embedding is a placeholder: bind it as a query parameter or inline the array literal)
SELECT url, fetch_time,
  arraySum(arrayMap((x,y) -> x * y, embedding, query_embedding)) AS dot
FROM page_embeddings
WHERE fetch_time >= now() - INTERVAL 90 DAY
ORDER BY dot DESC
LIMIT 10;
  

If you store pre-normalized embeddings, dot product == cosine similarity. For large embedding dimensions, consider a two-step strategy: 1) use a cheap filter (domain, time, or an approximate index), 2) compute exact scores for the top candidates.
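The two-step strategy can be sketched as follows (the data layout and function name are illustrative; in production step 1 would be a ClickHouse WHERE clause or an ANN index, not a Python loop):

```python
def top_k_similar(query, candidates, k=10, domain=None):
    """Two-step search sketch: cheap metadata filter, then exact dot-product scoring.

    `candidates` is a list of dicts with 'url', 'domain', and a pre-normalized
    'embedding' vector.
    """
    # Step 1: cheap filter (here by domain; could also be a time range)
    pool = [c for c in candidates if domain is None or c["domain"] == domain]
    # Step 2: exact scoring on the reduced pool only
    scored = [(sum(x * y for x, y in zip(query, c["embedding"])), c["url"])
              for c in pool]
    scored.sort(reverse=True)
    return [url for _, url in scored[:k]]
```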

Annoy/FAISS hybrid

For sub-100ms kNN at billion-scale, many teams (2025–2026) combine an ANN engine like FAISS/Annoy for candidate generation, then store the embeddings in ClickHouse for scoring and enrichment—this provides the best of both worlds.

5. Materialized views and pre-aggregations

Use Materialized Views to power near-real-time dashboards and reduce query cost. Projections (introduced in recent ClickHouse releases) can also speed repeated multi-dimensional queries.

-- requires the target table domain_hourly_stats to exist beforehand
CREATE MATERIALIZED VIEW mv_domain_hourly
TO domain_hourly_stats
AS
SELECT
  domain,
  toStartOfHour(fetch_time) AS hour,
  count() AS page_count,
  uniqExact(url) AS unique_urls,
  avg(token_count) AS avg_tokens
FROM parsed_pages
GROUP BY domain, hour;
  

6. Query examples: analytics you will run every day

Top domains by volume in the last 7 days

SELECT domain, count() AS pages
FROM parsed_pages
WHERE fetch_time >= now() - INTERVAL 7 DAY
GROUP BY domain
ORDER BY pages DESC
LIMIT 50;
  

Time-series: pages per hour for a domain

SELECT toStartOfHour(fetch_time) AS hour, count() AS pages
FROM parsed_pages
WHERE domain = 'example.com' AND fetch_time >= now() - INTERVAL 30 DAY
GROUP BY hour
ORDER BY hour;
  

Ad-hoc: find pages mentioning a keyword (approximate)

SELECT url, title
FROM parsed_pages
WHERE match(content_text, 'payment processor') -- if match expr is available / indexed
LIMIT 100;
  

7. Benchmarking: measure what matters

Measure both ingestion throughput and query latency under realistic concurrency. Track:

  • rows/sec (bulk inserts and streaming)
  • median/95th/99th query latency for your top queries
  • storage per row (raw vs parsed vs embeddings)

Benchmark recipe

  1. Generate synthetic NDJSON with realistic URL/domain distribution, text length distribution, and embedding dimensions.
  2. Load via the exact ingestion path you intend to use (Kafka or HTTP bulk).
  3. Run TPC-like query mixes: heavy aggregation, point lookup, join to enrichment tables, vector-kNN if used.
  4. Use clickhouse-benchmark for simple insert/query loops or custom load scripts for complex flows.

Record data skew effects—many scraped datasets have a long-tail of domains; shards may become imbalanced without domain-aware sharding.

8. Scaling: replication, sharding, and cloud options

For production you’ll want replication and horizontal sharding:

  • ReplicatedMergeTree for high availability; use ClickHouse Keeper or ZooKeeper.
  • Distributed tables to query across shards.
  • Consider cloud-managed ClickHouse if operational overhead is a concern; hosted offerings also simplify cross-region replication and backups.

Shard key choices

Shard by hashed domain (sipHash64(domain) % shard_count) to reduce hot-shard effects from a few high-volume domains. Alternatively, use a combined hash of (domain, toYYYYMM(fetch_time)) if your queries are often time-scoped.
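The routing logic is plain hash-mod assignment. A sketch (ClickHouse itself would evaluate sipHash64(domain) % shard_count in the Distributed table's sharding key; this substitutes a stdlib hash to illustrate the same mechanism, so the exact shard values differ):

```python
import hashlib

def shard_for(domain, shard_count):
    """Deterministic shard assignment by hashed domain (hash-mod routing)."""
    # Take the first 8 bytes of SHA-256 as a stand-in 64-bit hash
    h = int.from_bytes(hashlib.sha256(domain.encode()).digest()[:8], "big")
    return h % shard_count
```

Because the hash is computed on the domain alone, all rows for one domain land on one shard, which keeps domain-level aggregations local at the cost of some imbalance from very high-volume domains.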

9. Operational tips & cost control

  • Use different TTLs per table to control storage costs—raw HTML is expensive; consider compressing and deleting older raw pages.
  • Monitor merges—excessive small parts indicate inefficient inserts (use larger batches or buffer tables).
  • Run periodic OPTIMIZE FINAL only for rare compaction needs; rely on background merges normally.
  • Watch memory usage of complex queries (joins, arrayMap) to avoid out-of-memory failures—use LIMIT or pre-aggregations.

10. Security, compliance, and ethical scraping

Keep an audit trail (raw_pages) and retention policies aligned with compliance needs. Redact PII at ingest if required. Use robots.txt and respect site terms; maintain a legal review for commercial use of scraped data. ClickHouse can help with fine-grained access control via role-based auth and external authentication integrations.

Practical takeaway: separate raw and parsed storage, use Kafka for resilient ingestion, partition by month+domain, and store embeddings for semantic layers while delegating heavy ANN to specialized engines if latency requirements are strict.

Example end-to-end flow (summary)

  1. Crawler fetches pages and publishes JSONEachRow to Kafka topic crawl_raw.
  2. ClickHouse consumes crawler topic via Kafka engine and materialized view into raw_pages (audit) and parsed_pages (analytics).
  3. Worker pipeline computes embeddings and upserts into page_embeddings.
  4. Materialized Views populate domain_hourly_stats for dashboards and downstream alerting.
  5. Hybrid semantic search: FAISS for ANN candidate generation, ClickHouse for enrichment and final scoring.

Closing notes: future-proofing your ClickHouse scraping stack in 2026

ClickHouse's fast columnar engine, expanding vector and index tooling, and growing managed deployments make it a top option for OLAP on scraped data in 2026. The key is architecting for streaming ingestion, clear separation of raw vs derived data, and hybrid strategies for nearest-neighbor search. Expect ongoing improvements to vector performance and cloud features—plan to iterate on your embedding strategy and benchmarking routinely.

Call to action

Ready to prototype? Start by creating the three tables above on a dev ClickHouse instance, push a week of real crawl data via Kafka or the HTTP endpoint, and run the benchmarking recipe. If you want a tailored assessment—share your crawl volume, retention needs, and query patterns and we’ll recommend partitioning, shard strategy, and a cost estimate for a production ClickHouse cluster.
