Quickstart: Converting Scraped HTML Tables into a Tabular Model-ready Dataset


Unknown
2026-02-24
11 min read

Reproducible Python quickstart to scrape HTML tables, flatten nested cells, infer types, and export schema-compliant Parquet for tabular models.

Cut the friction: go from messy HTML tables to a model-ready tabular dataset in minutes

Scraping HTML tables is routine — until nested cells, rowspan/colspan, inconsistent typing, and missing values break your pipeline. If you’re building analytics or feeding a tabular foundation model, those edge cases become project-stoppers. This quickstart gives a short, reproducible example (Python) that scrapes HTML tables, flattens nested cells, infers column types robustly, and emits a schema-compliant dataset (Parquet + JSON schema) ready for tabular models and downstream ML.

What you’ll get — gist first (inverted pyramid)

  • One-file reproducible Python pipeline that accepts a URL or raw HTML
  • Robust HTML table parsing that handles colspan/rowspan and nested tables
  • Deterministic type inference (int/float/datetime/bool/categorical/text)
  • Export to Parquet + machine-readable JSON schema for model ingestion

Why this matters in 2026

Tabular foundation models are no longer a novelty — industry coverage in 2025–2026 (see Forbes, Jan 15, 2026) highlights the commercial value of unlocking structured data for AI. Companies are moving from ad‑hoc CSVs to disciplined, schema-first pipelines that protect data quality and reduce fine-tuning costs. That makes reliable HTML table ingestion a practical priority for data teams turning web‑sourced tables into high-value features.

“Structured data is AI’s next major frontier.” — Industry coverage, Jan 2026

Quick reproducible example — environment & dependencies

Run this locally in Python 3.10+ (or in a container). Install the small dependency set:

pip install requests beautifulsoup4 lxml pandas pyarrow python-dateutil

Optional (for JavaScript rendered pages): install Playwright or Selenium; I cover that in the edge cases section.

Full example: parse, flatten, infer, save

Copy this file to html_table_quickstart.py and run. The script demonstrates parsing an inline HTML example and the same functions work for a fetched URL.

#!/usr/bin/env python3
"""
Quick reproducible pipeline:
- parse an HTML table with colspan/rowspan and nested tables
- build a pandas DataFrame
- infer types and build a schema
- save Parquet and a JSON schema
"""

import json
import re
from datetime import datetime, timezone

import pandas as pd
import requests
from bs4 import BeautifulSoup
from dateutil.parser import parse as dateparse

# ---------- Utilities: normalize text ----------

def clean_text(node):
    """Extract text from a BeautifulSoup node; flatten nested tables and lists."""
    # If the node contains a nested <table>, flatten it into a compact string
    nested_table = node.find('table')
    if nested_table:
        # render the nested table as pipe-separated rows
        rows = []
        for tr in nested_table.find_all('tr'):
            cells = [t.get_text(strip=True) for t in tr.find_all(['td', 'th'])]
            rows.append(' | '.join(cells))
        return ' || '.join(rows)

    # otherwise handle lists and line breaks
    for br in node.find_all(['br', 'p']):
        br.replace_with('\n')
    text = node.get_text(separator=' ', strip=True)
    # normalize whitespace
    return re.sub(r'\s+', ' ', text)

# ---------- Core: table parser handling colspan/rowspan ----------

def parse_html_table(table_tag):
    """
    Parse a BeautifulSoup <table> element into a 2D list of cell texts.
    Handles colspan and rowspan by filling placeholders.
    """
    grid = []    # finalized rows (lists of cell strings)
    spans = {}   # (row_idx, col_idx) -> value carried down by an earlier rowspan

    # only rows/cells belonging to *this* table, not to nested tables
    rows = [tr for tr in table_tag.find_all('tr')
            if tr.find_parent('table') is table_tag]
    for r_idx, tr in enumerate(rows):
        row = []
        col_idx = 0
        cells = [c for c in tr.find_all(['td', 'th']) if c.find_parent('tr') is tr]
        for cell in cells:
            # fill columns occupied by a rowspan from a previous row
            while (r_idx, col_idx) in spans:
                row.append(spans.pop((r_idx, col_idx)))
                col_idx += 1
            colspan = int(cell.get('colspan', 1))
            rowspan = int(cell.get('rowspan', 1))
            text = clean_text(cell)
            for c in range(colspan):
                row.append(text)  # repeat the text across the colspan for simplicity
                if rowspan > 1:
                    # register the value for the rows this cell spans into
                    for r in range(1, rowspan):
                        spans[(r_idx + r, col_idx + c)] = text
            col_idx += colspan
        # apply spans that target trailing columns of this row
        while (r_idx, col_idx) in spans:
            row.append(spans.pop((r_idx, col_idx)))
            col_idx += 1
        grid.append(row)

    # normalize rows to the same width
    width = max(len(r) for r in grid)
    for r in grid:
        while len(r) < width:
            r.append(None)
    return grid

# ---------- Type inference ----------

def infer_column_type(series, sample_size=1000, cat_threshold=0.2):
    """
    Infer a type: int, float, datetime, bool, categorical, or text.
    Uses simple heuristics tuned for scraped tables.
    Returns (pandas_dtype, canonical_type_str).
    """
    s = series.dropna().astype(str).str.strip()
    if s.empty:
        return ('string', 'text')

    # boolean (common explicit markers)
    low = s.str.lower()
    is_bool_mask = low.isin(['true', 'false', 'yes', 'no', 'y', 'n', '1', '0'])
    if is_bool_mask.sum() / max(1, len(low)) > 0.9:
        return ('bool', 'bool')

    # numeric tests: a strict integer pattern first, then a general float parse
    int_success = s.str.match(r'^-?\d+$').sum() / max(1, len(s))
    to_float = pd.to_numeric(s.str.replace(',', ''), errors='coerce')
    float_success = to_float.notna().sum() / max(1, len(s))
    if int_success > 0.9:
        return ('Int64', 'int')
    if float_success > 0.9:
        return ('float64', 'float')

    # datetime test on a sample
    sample = s[:sample_size]
    dt_success = 0
    for val in sample:
        try:
            dateparse(val, fuzzy=False)
            dt_success += 1
        except Exception:
            pass
    if dt_success / max(1, len(sample)) > 0.9:
        return ('datetime64[ns]', 'datetime')

    # categorical (few uniques relative to size)
    unique_frac = s.nunique() / max(1, len(s))
    if unique_frac < cat_threshold and s.nunique() < 2000:
        return ('category', 'categorical')

    return ('string', 'text')

# ---------- Schema builder & saver ----------

def build_schema(df, sample_examples=5):
    schema = {'columns': []}
    for col in df.columns:
        dtype, canonical = infer_column_type(df[col])
        examples = df[col].dropna().astype(str).unique()[:sample_examples].tolist()
        schema['columns'].append({
            'name': str(col),
            'dtype': dtype,
            'type': canonical,
            'nullable': bool(df[col].isnull().any()),  # plain bool for JSON
            'examples': examples,
        })
    schema['generated_at'] = datetime.now(timezone.utc).isoformat()
    return schema

# ---------- Runner: parse inline HTML, create DataFrame, save ----------

# Inline sample: rowspan on Product/Launch, one nested table in Notes
SAMPLE_HTML = '''
<table>
  <tr><th>Product</th><th>Price</th><th>Launch</th><th>Notes</th></tr>
  <tr>
    <td rowspan="2">Widget A</td>
    <td>$10.00</td>
    <td rowspan="2">2025-11-01</td>
    <td><table><tr><td>color</td><td>blue</td></tr></table></td>
  </tr>
  <tr>
    <td>$12.50 (incl. tax)</td>
    <td>Limited</td>
  </tr>
  <tr>
    <td>Widget B</td>
    <td></td>
    <td>15 Dec 2024</td>
    <td>best-seller</td>
  </tr>
</table>
'''

def html_table_to_dataframe(html_or_url, selector='table', fetch_remote=True):
    """
    Accepts a URL or raw HTML. If fetch_remote is True and html_or_url is an
    http(s) URL, it fetches the page. Returns (pandas.DataFrame, schema dict).
    """
    if fetch_remote and re.match(r'^https?://', html_or_url):
        r = requests.get(html_or_url, timeout=15)
        r.raise_for_status()
        html = r.text
    else:
        html = html_or_url

    soup = BeautifulSoup(html, 'lxml')
    table_tag = soup.select_one(selector)
    if table_tag is None:
        raise ValueError('No table found with selector: %s' % selector)

    grid = parse_html_table(table_tag)
    # treat the first row as the header
    header = [c or '' for c in grid[0]]
    df = pd.DataFrame(grid[1:], columns=header)

    # basic cleanup: convert currency and percent strings
    def normalize_numeric_cell(x):
        if not isinstance(x, str):
            return x
        y = x.replace('$', '').replace(',', '').strip()
        if y.endswith('%'):
            try:
                return float(y[:-1]) / 100.0
            except ValueError:
                return x
        try:
            if re.match(r'^-?\d+$', y):
                return int(y)
            return float(y)
        except ValueError:
            return x

    # shallow normalization pass: trim whitespace, then numeric cleanup
    for col in df.columns:
        df[col] = df[col].map(lambda x: x.strip() if isinstance(x, str) else x)
        df[col] = df[col].map(normalize_numeric_cell)

    # infer types and cast where safe
    for col in df.columns:
        dtype, canonical = infer_column_type(df[col])
        if dtype == 'Int64':
            df[col] = pd.to_numeric(df[col], errors='coerce').astype('Int64')
        elif dtype == 'float64':
            df[col] = pd.to_numeric(df[col], errors='coerce')
        elif dtype == 'datetime64[ns]':
            df[col] = pd.to_datetime(df[col], errors='coerce', format='mixed')
        elif dtype == 'bool':
            df[col] = df[col].astype(str).str.lower().map(
                {'true': True, 'false': False, 'yes': True, 'no': False,
                 '1': True, '0': False})
        elif dtype == 'category':
            df[col] = df[col].astype('category')
        else:
            df[col] = df[col].astype('string')

    schema = build_schema(df)
    return df, schema

if __name__ == '__main__':
    df, schema = html_table_to_dataframe(SAMPLE_HTML, fetch_remote=False)
    print('DataFrame preview:')
    print(df.head())

    # save outputs
    out_parquet = 'scraped_table.parquet'
    out_schema = 'scraped_table.schema.json'
    df.to_parquet(out_parquet, engine='pyarrow', index=False)
    with open(out_schema, 'w', encoding='utf8') as f:
        json.dump(schema, f, indent=2)
    print(f'Wrote {out_parquet} and {out_schema}')

What the code does — quick explanation

  • clean_text: flattens nested structures inside a cell. A nested table is rendered as pipe-separated cells with its rows joined by double pipes (cell | cell || cell | cell), so nothing is lost.
  • parse_html_table: builds a full rectangular grid and honors colspan/rowspan. It fills spanned cells to keep column alignment stable.
  • infer_column_type: heuristics to decide int/float/datetime/bool/categorical/text. Tunable thresholds help avoid misclassification.
  • html_table_to_dataframe: orchestrates fetch-or-use-html, parsing, normalization (currency/percent), type casting, then emits a schema JSON and Parquet file.

Expected output (from sample)

The script writes scraped_table.parquet and scraped_table.schema.json. The JSON schema contains per-column name, dtype, canonical type, nullable, and sample values — helpful for data contracts or for model ingestion pipelines.
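For orientation, the schema file has roughly this shape (field names follow build_schema; the column entry below is illustrative, not actual output):

```json
{
  "columns": [
    {
      "name": "Price",
      "dtype": "float64",
      "type": "float",
      "nullable": true,
      "examples": ["10.0", "12.5"]
    }
  ],
  "generated_at": "2026-02-24T12:00:00+00:00"
}
```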

Edge cases & production hardening

JavaScript-rendered tables

If the target page renders tables client-side, use Playwright or Selenium to render, then pass rendered HTML into the same parser function. Playwright is lightweight and scriptable for headless rendering:

# sketch: assumes `url` is defined
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    html = page.content()
    browser.close()

df, schema = html_table_to_dataframe(html, fetch_remote=False)

Anti-bot measures, rate limits, and ethics

  • Respect robots.txt and site TOS; many sites disallow scraping of structured tables — check and document your compliance.
  • Use polite crawling: delays, exponential backoff, and low concurrency. Implement adaptive retry logic when you see 429/503 responses.
  • For protected sites or CAPTCHAs, do not bypass intentionally; use official APIs or data partnerships.
  • Log request IDs, response codes, and content hashes so you can audit scraping events later.
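One way to sketch the backoff logic (fetch and sleep are injected parameters introduced here so the policy can be unit-tested; in production, wrap requests.get so it returns a (status_code, text) pair):

```python
import time

def polite_get(url, fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry with exponential backoff when the server answers 429/503.

    `fetch` is any callable returning (status_code, body); `sleep` is
    injectable so the policy can be tested without actually waiting.
    """
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status not in (429, 503):
            return status, body
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

Keeping the delay schedule in one place also makes it easy to add jitter or a per-domain cap later.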

Schema drift and validation

In production, expect schemas to drift as sites change. Add a schema validation step using tools like Great Expectations or custom checks that compare new schema to expected schema and fail fast when critical columns vanish or types change.
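A lightweight custom check along those lines might look like this (the schema dicts follow the build_schema layout from the quickstart script; schema_drift_errors is a name introduced here):

```python
def schema_drift_errors(expected, actual, strict_types=True):
    """Compare two schema dicts (build_schema layout) and list breaking changes:
    columns that vanished and, optionally, columns whose canonical type changed."""
    exp = {c["name"]: c for c in expected["columns"]}
    act = {c["name"]: c for c in actual["columns"]}
    errors = [f"missing column: {n}" for n in exp if n not in act]
    if strict_types:
        for name, col in exp.items():
            if name in act and act[name]["type"] != col["type"]:
                errors.append(
                    f"type changed for {name}: {col['type']} -> {act[name]['type']}")
    return errors
```

Run it right after ingestion and fail fast (raise if the list is non-empty) before a drifted dataset reaches training.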

Data lineage & provenance

Save provenance: source URL, fetch timestamp, HTTP response code, and a content digest (SHA256) with every dataset. Tabular foundation models are sensitive to subtle dataset shifts; keeping lineage reduces debug time.
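A minimal provenance record could be as simple as this sketch (the field names are a suggestion, not a standard):

```python
import hashlib
from datetime import datetime, timezone

def provenance_record(url, status_code, content: bytes):
    """Capture source URL, fetch time, HTTP status, and a SHA-256 content digest."""
    return {
        "source_url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "status_code": status_code,
        "sha256": hashlib.sha256(content).hexdigest(),
    }
```

Write it next to the dataset (e.g. a .provenance.json sidecar) so every Parquet snapshot can be traced back to the exact bytes it was parsed from.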

Scaling: automation and operations

  • Containerize the pipeline and schedule with Airflow/Argo Workflows. Use small worker pools and autoscale for bursts.
  • Store datasets in object storage with partitioning by date/source. Parquet + partitioning gives fast reads for model training.
  • Maintain a catalog (Glue/Metastore) with registered schema JSON and versioning for each dataset snapshot.
  • Alert on schema changes, high null rates, or missing columns. Instrument with metrics and dashboards.

Preparing data for tabular foundation models (TFMs)

TFMs expect consistent feature schemas and typed columns. The following steps turn scraped tables into model-ready datasets:

  • Canonicalize column names (snake_case, no punctuation) and include semantic types in the schema.
  • Normalize units (currencies, percentages, units) and store both raw and canonicalized columns when useful.
  • Impute carefully; prefer explicit nulls and a null mask column over brittle imputation when the downstream model can handle missing data.
  • Provide metadata: cardinality, examples, and provenance so the model trainer can decide embeddings/vocab sizes for categorical columns.
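For the column-name step, a small canonicalizer goes a long way (a sketch; extend the character handling for your own sources):

```python
import re

def canonicalize_name(name: str) -> str:
    """Lower snake_case a scraped header: drop punctuation, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name)        # punctuation -> spaces
    name = re.sub(r"\s+", "_", name.strip())    # whitespace runs -> single _
    return name.lower()
```

Apply it to df.columns right after parsing, and record the original header in the schema's examples or metadata so the mapping stays auditable.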

Late 2025 and early 2026 saw an acceleration in production use of tabular models and stronger expectations around data governance. Practical implications:

  • Companies now favor schema-first ingestion to reduce fine-tuning burden and enable prompt-based retrieval over structured features.
  • There’s increased regulatory attention on automated scraping — expect stricter audits and the need to keep robust provenance and consent records.
  • Emerging tools combine LLMs with rule-based cleaners to repair noisy scraped cells automatically. Consider augmenting the pipeline with an LLM-based repair step to standardize text labels and correct parsing mistakes.
  • On-device and edge inference of tabular models is growing — smaller, high-quality datasets matter more than ever for model compression and deployment.

Real-world checklist before production

  1. Confirm legal/compliance approval and capture robots.txt/TOS evidence.
  2. Implement robust retries, backoff, and caching to avoid repeated heavy fetches.
  3. Add schema validation and alerts for drift.
  4. Record provenance (URL, timestamp, content hash) in metadata.
  5. Persist the Parquet file plus its JSON schema and register the dataset in your catalog.

Actionable takeaways

  • Start small: use the one-file reproducible pipeline to validate parsing on a handful of target pages.
  • Make schema explicit: store a JSON schema alongside data so downstream teams and TFMs get deterministic inputs.
  • Automate drift detection: add lightweight validation tests and alerting rather than manual checks.
  • Respect source rules and log provenance — it's required for audits and model explanations.

Final notes and next steps

This quickstart is intentionally compact: the core parsing and type-inference concepts are fully reproducible and practical for real scraping tasks. For production systems, complement this with a renderer (Playwright) for JS pages, a scheduler, and a policy layer for compliance.

Try it now: copy the script, run with the inline sample, then point it at a small list of target URLs. If you want a template repository, or a version that integrates Playwright and Great Expectations for validation, reach out or fork the snippet into your team’s pipeline.

Call to action

Convert one messy table today — run the script, register the Parquet + JSON schema in your catalog, and feed it into your next tabular model training job. For a production-ready template (Playwright rendering, retries, monitoring, and Great Expectations checks) grab the extended repo on our site or contact the team to help operationalize scraping-to-schema pipelines.


Related Topics

#snippets #tabular #tutorial