Quickstart: Converting Scraped HTML Tables into a Tabular Model-ready Dataset


Unknown
2026-02-24
11 min read

Reproducible Python quickstart to scrape HTML tables, flatten nested cells, infer types, and export schema-compliant Parquet for tabular models.

Cut the friction: go from messy HTML tables to a model-ready tabular dataset in minutes

Scraping HTML tables is routine — until nested cells, rowspan/colspan, inconsistent typing, and missing values break your pipeline. If you’re building analytics or feeding a tabular foundation model, those edge cases become project-stoppers. This quickstart gives a short, reproducible example (Python) that scrapes HTML tables, flattens nested cells, infers column types robustly, and emits a schema-compliant dataset (Parquet + JSON schema) ready for tabular models and downstream ML.

What you’ll get — gist first (inverted pyramid)

  • One-file reproducible Python pipeline that accepts a URL or raw HTML
  • Robust HTML table parsing that handles colspan/rowspan and nested tables
  • Deterministic type inference (int/float/datetime/bool/categorical/text)
  • Export to Parquet + machine-readable JSON schema for model ingestion

Why this matters in 2026

Tabular foundation models are no longer a novelty — industry coverage in 2025–2026 (see Forbes, Jan 15, 2026) highlights the commercial value of unlocking structured data for AI. Companies are moving from ad‑hoc CSVs to disciplined, schema-first pipelines that protect data quality and reduce fine-tuning costs. That makes reliable HTML table ingestion a practical priority for data teams turning web‑sourced tables into high-value features.

“Structured data is AI’s next major frontier.” — Industry coverage, Jan 2026

Quick reproducible example — environment & dependencies

Run this locally in Python 3.10+ (or in a container). Install the small dependency set:

pip install requests beautifulsoup4 lxml pandas pyarrow python-dateutil

Optional (for JavaScript rendered pages): install Playwright or Selenium; I cover that in the edge cases section.

Full example: parse, flatten, infer, save

Copy this file to html_table_quickstart.py and run. The script demonstrates parsing an inline HTML example and the same functions work for a fetched URL.

#!/usr/bin/env python3
"""
Quick reproducible pipeline:
- parse an HTML table with colspan/rowspan and nested tables
- build a pandas DataFrame
- infer types and build a schema
- save Parquet and a JSON schema
"""

import json
import re
from datetime import datetime, timezone

import pandas as pd
import requests
from bs4 import BeautifulSoup
from dateutil.parser import parse as dateparse

# ---------- Utilities: normalize text ----------

def clean_text(node):
    """Extract text from a BeautifulSoup node; flatten nested tables and lists."""
    # If the node contains a nested <table>, flatten it into a compact string
    nested_table = node.find('table')
    if nested_table:
        # render the nested table as pipe-separated rows
        rows = []
        for tr in nested_table.find_all('tr'):
            cells = [t.get_text(strip=True) for t in tr.find_all(['td', 'th'])]
            rows.append(' | '.join(cells))
        return ' || '.join(rows)

    # otherwise handle lists and line breaks
    for br in node.find_all(['br', 'p']):
        br.replace_with('\n')
    text = node.get_text(separator=' ', strip=True)
    # normalize whitespace
    return re.sub(r'\s+', ' ', text)

# ---------- Core: table parser handling colspan/rowspan ----------

def parse_html_table(table_tag):
    """
    Parse a BeautifulSoup <table> element into a 2D list of cell texts.
    Handles colspan and rowspan by filling placeholders.
    """
    grid = []    # finalized rows (lists of cell strings)
    spans = {}   # (row_idx, col_idx) -> value carried down by an earlier rowspan

    # only rows/cells belonging to *this* table, not to nested tables
    rows = [tr for tr in table_tag.find_all('tr')
            if tr.find_parent('table') is table_tag]
    for r_idx, tr in enumerate(rows):
        row = []
        col_idx = 0
        cells = [c for c in tr.find_all(['td', 'th']) if c.find_parent('tr') is tr]
        for cell in cells:
            # fill columns occupied by a rowspan from a previous row
            while (r_idx, col_idx) in spans:
                row.append(spans.pop((r_idx, col_idx)))
                col_idx += 1
            colspan = int(cell.get('colspan', 1))
            rowspan = int(cell.get('rowspan', 1))
            text = clean_text(cell)
            for c in range(colspan):
                row.append(text)  # repeat the text across the colspan for simplicity
                if rowspan > 1:
                    # register the value for the rows this cell spans into
                    for r in range(1, rowspan):
                        spans[(r_idx + r, col_idx + c)] = text
            col_idx += colspan
        # apply spans that target trailing columns of this row
        while (r_idx, col_idx) in spans:
            row.append(spans.pop((r_idx, col_idx)))
            col_idx += 1
        grid.append(row)

    # normalize rows to the same width
    width = max(len(r) for r in grid)
    for r in grid:
        while len(r) < width:
            r.append(None)
    return grid

# ---------- Type inference ----------

def infer_column_type(series, sample_size=1000, cat_threshold=0.2):
    """
    Infer a type: int, float, datetime, bool, categorical, or text.
    Uses simple heuristics tuned for scraped tables.
    Returns (pandas_dtype, canonical_type_str).
    """
    s = series.dropna().astype(str).str.strip()
    if s.empty:
        return ('string', 'text')

    # boolean (common explicit markers)
    low = s.str.lower()
    is_bool_mask = low.isin(['true', 'false', 'yes', 'no', 'y', 'n', '1', '0'])
    if is_bool_mask.sum() / max(1, len(low)) > 0.9:
        return ('bool', 'bool')

    # numeric tests: a strict integer pattern first, then a general float parse
    int_success = s.str.match(r'^-?\d+$').sum() / max(1, len(s))
    to_float = pd.to_numeric(s.str.replace(',', ''), errors='coerce')
    float_success = to_float.notna().sum() / max(1, len(s))
    if int_success > 0.9:
        return ('Int64', 'int')
    if float_success > 0.9:
        return ('float64', 'float')

    # datetime test on a sample
    sample = s[:sample_size]
    dt_success = 0
    for val in sample:
        try:
            dateparse(val, fuzzy=False)
            dt_success += 1
        except Exception:
            pass
    if dt_success / max(1, len(sample)) > 0.9:
        return ('datetime64[ns]', 'datetime')

    # categorical (few uniques relative to size)
    unique_frac = s.nunique() / max(1, len(s))
    if unique_frac < cat_threshold and s.nunique() < 2000:
        return ('category', 'categorical')

    return ('string', 'text')

# ---------- Schema builder & saver ----------

def build_schema(df, sample_examples=5):
    schema = {'columns': []}
    for col in df.columns:
        dtype, canonical = infer_column_type(df[col])
        examples = df[col].dropna().astype(str).unique()[:sample_examples].tolist()
        schema['columns'].append({
            'name': str(col),
            'dtype': dtype,
            'type': canonical,
            'nullable': bool(df[col].isnull().any()),  # plain bool for JSON
            'examples': examples,
        })
    schema['generated_at'] = datetime.now(timezone.utc).isoformat()
    return schema

# ---------- Runner: parse inline HTML, create DataFrame, save ----------

# Inline sample: rowspan on Product/Launch, one nested table in Notes
SAMPLE_HTML = '''
<table>
  <tr><th>Product</th><th>Price</th><th>Launch</th><th>Notes</th></tr>
  <tr>
    <td rowspan="2">Widget A</td>
    <td>$10.00</td>
    <td rowspan="2">2025-11-01</td>
    <td><table><tr><td>color</td><td>blue</td></tr></table></td>
  </tr>
  <tr>
    <td>$12.50 (incl. tax)</td>
    <td>Limited</td>
  </tr>
  <tr>
    <td>Widget B</td>
    <td></td>
    <td>15 Dec 2024</td>
    <td>best-seller</td>
  </tr>
</table>
'''

def html_table_to_dataframe(html_or_url, selector='table', fetch_remote=True):
    """
    Accepts a URL or raw HTML. If fetch_remote is True and html_or_url is an
    http(s) URL, it fetches the page. Returns (pandas.DataFrame, schema dict).
    """
    if fetch_remote and re.match(r'^https?://', html_or_url):
        r = requests.get(html_or_url, timeout=15)
        r.raise_for_status()
        html = r.text
    else:
        html = html_or_url

    soup = BeautifulSoup(html, 'lxml')
    table_tag = soup.select_one(selector)
    if table_tag is None:
        raise ValueError('No table found with selector: %s' % selector)

    grid = parse_html_table(table_tag)
    # treat the first row as the header
    header = [c or '' for c in grid[0]]
    df = pd.DataFrame(grid[1:], columns=header)

    # basic cleanup: convert currency and percent strings
    def normalize_numeric_cell(x):
        if not isinstance(x, str):
            return x
        y = x.replace('$', '').replace(',', '').strip()
        if y.endswith('%'):
            try:
                return float(y[:-1]) / 100.0
            except ValueError:
                return x
        try:
            if re.match(r'^-?\d+$', y):
                return int(y)
            return float(y)
        except ValueError:
            return x

    # shallow normalization pass: trim whitespace, then numeric cleanup
    for col in df.columns:
        df[col] = df[col].map(lambda x: x.strip() if isinstance(x, str) else x)
        df[col] = df[col].map(normalize_numeric_cell)

    # infer types and cast where safe
    for col in df.columns:
        dtype, canonical = infer_column_type(df[col])
        if dtype == 'Int64':
            df[col] = pd.to_numeric(df[col], errors='coerce').astype('Int64')
        elif dtype == 'float64':
            df[col] = pd.to_numeric(df[col], errors='coerce')
        elif dtype == 'datetime64[ns]':
            df[col] = pd.to_datetime(df[col], errors='coerce', format='mixed')
        elif dtype == 'bool':
            df[col] = df[col].astype(str).str.lower().map(
                {'true': True, 'false': False, 'yes': True, 'no': False,
                 '1': True, '0': False})
        elif dtype == 'category':
            df[col] = df[col].astype('category')
        else:
            df[col] = df[col].astype('string')

    schema = build_schema(df)
    return df, schema

if __name__ == '__main__':
    df, schema = html_table_to_dataframe(SAMPLE_HTML, fetch_remote=False)
    print('DataFrame preview:')
    print(df.head())

    # save outputs
    out_parquet = 'scraped_table.parquet'
    out_schema = 'scraped_table.schema.json'
    df.to_parquet(out_parquet, engine='pyarrow', index=False)
    with open(out_schema, 'w', encoding='utf8') as f:
        json.dump(schema, f, indent=2)
    print(f'Wrote {out_parquet} and {out_schema}')

What the code does — quick explanation

  • clean_text: flattens nested structures inside a cell. A nested table is rendered as pipe-separated cells with its rows joined by double pipes (cell | cell || cell | cell), so nothing is lost.
  • parse_html_table: builds a full rectangular grid and honors colspan/rowspan. It fills spanned cells to keep column alignment stable.
  • infer_column_type: heuristics to decide int/float/datetime/bool/categorical/text. Tunable thresholds help avoid misclassification.
  • html_table_to_dataframe: orchestrates fetch-or-use-html, parsing, normalization (currency/percent), type casting, then emits a schema JSON and Parquet file.

Expected output (from sample)

The script writes scraped_table.parquet and scraped_table.schema.json. The JSON schema contains per-column name, dtype, canonical type, nullable, and sample values — helpful for data contracts or for model ingestion pipelines.
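For orientation, the schema file has roughly this shape (field names follow build_schema; the column entry below is illustrative, not actual output):

```json
{
  "columns": [
    {
      "name": "Price",
      "dtype": "float64",
      "type": "float",
      "nullable": true,
      "examples": ["10.0", "12.5"]
    }
  ],
  "generated_at": "2026-02-24T12:00:00+00:00"
}
```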

Edge cases & production hardening

JavaScript-rendered tables

If the target page renders tables client-side, use Playwright or Selenium to render, then pass rendered HTML into the same parser function. Playwright is lightweight and scriptable for headless rendering:

# sketch: assumes `url` is defined
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    html = page.content()
    browser.close()

df, schema = html_table_to_dataframe(html, fetch_remote=False)

Anti-bot measures, rate limits, and ethics

  • Respect robots.txt and site TOS; many sites disallow scraping of structured tables — check and document your compliance.
  • Use polite crawling: delays, exponential backoff, and low concurrency. Implement adaptive retry logic when you see 429/503 responses.
  • For protected sites or CAPTCHAs, do not bypass intentionally; use official APIs or data partnerships.
  • Log request IDs, response codes, and content hashes so you can audit scraping events later.
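One way to sketch the backoff logic (fetch and sleep are injected parameters introduced here so the policy can be unit-tested; in production, wrap requests.get so it returns a (status_code, text) pair):

```python
import time

def polite_get(url, fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry with exponential backoff when the server answers 429/503.

    `fetch` is any callable returning (status_code, body); `sleep` is
    injectable so the policy can be tested without actually waiting.
    """
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status not in (429, 503):
            return status, body
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

Keeping the delay schedule in one place also makes it easy to add jitter or a per-domain cap later.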

Schema drift and validation

In production, expect schemas to drift as sites change. Add a schema validation step using tools like Great Expectations or custom checks that compare new schema to expected schema and fail fast when critical columns vanish or types change.
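A lightweight custom check along those lines might look like this (the schema dicts follow the build_schema layout from the quickstart script; schema_drift_errors is a name introduced here):

```python
def schema_drift_errors(expected, actual, strict_types=True):
    """Compare two schema dicts (build_schema layout) and list breaking changes:
    columns that vanished and, optionally, columns whose canonical type changed."""
    exp = {c["name"]: c for c in expected["columns"]}
    act = {c["name"]: c for c in actual["columns"]}
    errors = [f"missing column: {n}" for n in exp if n not in act]
    if strict_types:
        for name, col in exp.items():
            if name in act and act[name]["type"] != col["type"]:
                errors.append(
                    f"type changed for {name}: {col['type']} -> {act[name]['type']}")
    return errors
```

Run it right after ingestion and fail fast (raise if the list is non-empty) before a drifted dataset reaches training.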

Data lineage & provenance

Save provenance: source URL, fetch timestamp, HTTP response code, and a content digest (SHA256) with every dataset. Tabular foundation models are sensitive to subtle dataset shifts; keeping lineage reduces debug time.
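A minimal provenance record could be as simple as this sketch (the field names are a suggestion, not a standard):

```python
import hashlib
from datetime import datetime, timezone

def provenance_record(url, status_code, content: bytes):
    """Capture source URL, fetch time, HTTP status, and a SHA-256 content digest."""
    return {
        "source_url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "status_code": status_code,
        "sha256": hashlib.sha256(content).hexdigest(),
    }
```

Write it next to the dataset (e.g. a .provenance.json sidecar) so every Parquet snapshot can be traced back to the exact bytes it was parsed from.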

Scaling: automation and operations

  • Containerize the pipeline and schedule with Airflow/Argo Workflows. Use small worker pools and autoscale for bursts.
  • Store datasets in object storage with partitioning by date/source. Parquet + partitioning gives fast reads for model training.
  • Maintain a catalog (Glue/Metastore) with registered schema JSON and versioning for each dataset snapshot.
  • Alert on schema changes, high null rates, or missing columns. Instrument with metrics and dashboards.

Preparing data for tabular foundation models (TFMs)

TFMs expect consistent feature schemas and typed columns. The following steps turn scraped tables into model-ready datasets:

  • Canonicalize column names (snake_case, no punctuation) and include semantic types in the schema.
  • Normalize units (currencies, percentages, units) and store both raw and canonicalized columns when useful.
  • Impute carefully; prefer explicit nulls and a null mask column over brittle imputation when the downstream model can handle missing data.
  • Provide metadata: cardinality, examples, and provenance so the model trainer can decide embeddings/vocab sizes for categorical columns.
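For the column-name step, a small canonicalizer goes a long way (a sketch; extend the character handling for your own sources):

```python
import re

def canonicalize_name(name: str) -> str:
    """Lower snake_case a scraped header: drop punctuation, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name)        # punctuation -> spaces
    name = re.sub(r"\s+", "_", name.strip())    # whitespace runs -> single _
    return name.lower()
```

Apply it to df.columns right after parsing, and record the original header in the schema's examples or metadata so the mapping stays auditable.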

Late 2025 and early 2026 saw an acceleration in production use of tabular models and stronger expectations around data governance. Practical implications:

  • Companies now favor schema-first ingestion to reduce fine-tuning burden and enable prompt-based retrieval over structured features.
  • There’s increased regulatory attention on automated scraping — expect stricter audits and the need to keep robust provenance and consent records.
  • Emerging tools combine LLMs with rule-based cleaners to repair noisy scraped cells automatically. Consider augmenting the pipeline with an LLM-based repair step to standardize text labels and correct parsing mistakes.
  • On-device and edge inference of tabular models is growing — smaller, high-quality datasets matter more than ever for model compression and deployment.

Real-world checklist before production

  1. Confirm legal/compliance approval and capture robots.txt/TOS evidence.
  2. Implement robust retries, backoff, and caching to avoid repeated heavy fetches.
  3. Add schema validation and alerts for drift.
  4. Record provenance (URL, timestamp, content hash) in metadata.
  5. Persist the Parquet file plus its JSON schema and register the dataset in your catalog.

Actionable takeaways

  • Start small: use the one-file reproducible pipeline to validate parsing on a handful of target pages.
  • Make schema explicit: store a JSON schema alongside data so downstream teams and TFMs get deterministic inputs.
  • Automate drift detection: add lightweight validation tests and alerting rather than manual checks.
  • Respect source rules and log provenance — it's required for audits and model explanations.

Final notes and next steps

This quickstart is intentionally compact: the core parsing and type-inference concepts are fully reproducible and practical for real scraping tasks. For production systems, complement this with a renderer (Playwright) for JS pages, a scheduler, and a policy layer for compliance.

Try it now: copy the script, run with the inline sample, then point it at a small list of target URLs. If you want a template repository, or a version that integrates Playwright and Great Expectations for validation, reach out or fork the snippet into your team’s pipeline.

Call to action

Convert one messy table today — run the script, register the Parquet + JSON schema in your catalog, and feed it into your next tabular model training job. For a production-ready template (Playwright rendering, retries, monitoring, and Great Expectations checks) grab the extended repo on our site or contact the team to help operationalize scraping-to-schema pipelines.


Related Topics

#snippets #tabular #tutorial