Quickstart: Converting Scraped HTML Tables into a Tabular Model-ready Dataset
Reproducible Python quickstart to scrape HTML tables, flatten nested cells, infer types, and export schema-compliant Parquet for tabular models.
Cut the friction: go from messy HTML tables to a model-ready tabular dataset in minutes
Scraping HTML tables is routine — until nested cells, rowspan/colspan, inconsistent typing, and missing values break your pipeline. If you’re building analytics or feeding a tabular foundation model, those edge cases become project-stoppers. This quickstart gives a short, reproducible example (Python) that scrapes HTML tables, flattens nested cells, infers column types robustly, and emits a schema-compliant dataset (Parquet + JSON schema) ready for tabular models and downstream ML.
What you’ll get — the gist first
- One-file reproducible Python pipeline that accepts a URL or raw HTML
- Robust HTML table parsing that handles colspan/rowspan and nested tables
- Deterministic type inference (int/float/datetime/bool/categorical/text)
- Export to Parquet + machine-readable JSON schema for model ingestion
Why this matters in 2026
Tabular foundation models are no longer a novelty — industry coverage in 2025–2026 (see Forbes, Jan 15, 2026) highlights the commercial value of unlocking structured data for AI. Companies are moving from ad‑hoc CSVs to disciplined, schema-first pipelines that protect data quality and reduce fine-tuning costs. That makes reliable HTML table ingestion a practical priority for data teams turning web‑sourced tables into high-value features.
“Structured data is AI’s next major frontier.” — Industry coverage, Jan 2026
Quick reproducible example — environment & dependencies
Run this locally in Python 3.10+ (or in a container). Install the small dependency set:
pip install requests beautifulsoup4 lxml pandas pyarrow python-dateutil
Optional (for JavaScript rendered pages): install Playwright or Selenium; I cover that in the edge cases section.
Full example: parse, flatten, infer, save
Copy the script below into html_table_quickstart.py and run it. The script parses an inline HTML example, and the same functions work for a fetched URL.
#!/usr/bin/env python3
"""
Quick reproducible pipeline:
- parse HTML table with colspan/rowspan and nested tables
- build pandas DataFrame
- infer types and build schema
- save Parquet and JSON schema
"""
import json
import re
from datetime import datetime
from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup
from dateutil.parser import parse as dateparse
# ---------- Utilities: normalize text ----------
def clean_text(node):
"""Extract text from a BeautifulSoup node; flatten nested tables and lists."""
    # If node contains a nested <table>, flatten it into a compact string
nested_table = node.find('table')
if nested_table:
# render nested table as pipe-separated rows
rows = []
for tr in nested_table.find_all('tr'):
cells = [t.get_text(strip=True) for t in tr.find_all(['td', 'th'])]
rows.append(' | '.join(cells))
return ' || '.join(rows)
    # otherwise convert explicit line breaks; <p> text itself is preserved by get_text
    for br in node.find_all('br'):
        br.replace_with('\n')
text = node.get_text(separator=' ', strip=True)
# normalize whitespace
return re.sub(r'\s+', ' ', text)
# ---------- Core: table parser handling colspan/rowspan ----------
def parse_html_table(table_tag):
"""
Parse a BeautifulSoup element into a 2D list of cell texts.
Handles colspan and rowspan by filling placeholders.
"""
# list of finalized rows (lists)
grid = []
# active map for cells that span to future rows: (row_idx, col_idx) -> remaining_rows, value
spans = {}
rows = table_tag.find_all('tr')
for r_idx, tr in enumerate(rows):
# ensure grid has row
if len(grid) <= r_idx:
grid.append([])
row = grid[r_idx]
# advance col index using existing length
col_idx = 0
# helper to advance to next free col
        def next_free_col():
            nonlocal col_idx
            # skip columns reserved by rowspans from earlier rows and columns already filled
            while (r_idx, col_idx) in spans or col_idx < len(row):
                col_idx += 1
            return col_idx
for cell in tr.find_all(['td', 'th']):
col_idx = next_free_col()
colspan = int(cell.get('colspan', 1))
rowspan = int(cell.get('rowspan', 1))
text = clean_text(cell)
# ensure row length
while len(row) < col_idx:
row.append(None)
# place the cell and mark spanned cells
for c in range(colspan):
target_col = col_idx + c
# ensure enough width
while len(row) <= target_col:
row.append(None)
                row[target_col] = text  # repeat the text across all spanned columns
if rowspan > 1:
# register spans for following rows
for r in range(1, rowspan):
key = (r_idx + r, target_col)
spans[key] = {'remaining': 1, 'value': text}
col_idx += colspan
# after filling explicit cells, apply spans that target this row
max_col = max(len(row), max((k[1] for k in spans.keys() if k[0] == r_idx), default=-1) + 1)
for c in range(max_col):
key = (r_idx, c)
if key in spans and (spans[key]['value'] is not None):
# ensure row width
while len(row) <= c:
row.append(None)
if row[c] is None:
row[c] = spans[key]['value']
# clear span after applying
del spans[key]
# normalize rows to same width
    width = max((len(r) for r in grid), default=0)
for r in grid:
while len(r) < width:
r.append(None)
return grid
# ---------- Type inference ----------
def infer_column_type(series, sample_size=1000, cat_threshold=0.2):
"""
Infer type: int, float, datetime, bool, categorical, or text.
Uses simple heuristics tuned for scraped tables.
Returns (pandas dtype, canonical_type_str).
"""
s = series.dropna().astype(str).str.strip()
if s.empty:
return ('string', 'text')
# try boolean (common explicit markers)
low = s.str.lower()
is_bool_mask = low.isin(['true', 'false', 'yes', 'no', 'y', 'n', '1', '0'])
if is_bool_mask.sum() / max(1, len(low)) > 0.9:
return ('bool', 'bool')
# numeric tests
    # integer test: accept only strictly integral string forms (floats must not pass)
    int_success = s.str.fullmatch(r'[+-]?\d+').sum() / max(1, len(s))
to_float = pd.to_numeric(s.str.replace(',', ''), errors='coerce')
float_success = to_float.notna().sum() / max(1, len(s))
if int_success > 0.9:
return ('Int64', 'int')
if float_success > 0.9:
return ('float64', 'float')
    # datetime test on a bounded sample (positional slice avoids label-based indexing)
    sample = s.iloc[:sample_size]
    dt_success = 0
    for val in sample:
        try:
            dateparse(val, fuzzy=False)
            dt_success += 1
        except Exception:
            pass
    if dt_success / max(1, len(sample)) > 0.9:
        return ('datetime64[ns]', 'datetime')
# categorical (few uniques relative to size)
unique_frac = s.nunique() / max(1, len(s))
if unique_frac < cat_threshold and s.nunique() < 2000:
return ('category', 'categorical')
return ('string', 'text')
# ---------- Schema builder & saver ----------
def build_schema(df, sample_examples=5):
schema = {'columns': []}
for col in df.columns:
dtype, canonical = infer_column_type(df[col])
examples = df[col].dropna().astype(str).unique()[:sample_examples].tolist()
schema['columns'].append({
'name': str(col),
'dtype': dtype,
'type': canonical,
            'nullable': bool(df[col].isnull().any()),  # cast: numpy bool_ is not JSON-serializable
'examples': examples
})
schema['generated_at'] = datetime.utcnow().isoformat() + 'Z'
return schema
# ---------- Runner: parse inline html, create dataframe, save ----------
SAMPLE_HTML = '''
<table>
  <tr><th>Product</th><th>Price</th><th>Launch</th><th>Notes</th></tr>
  <tr>
    <td rowspan="2">Widget A</td>
    <td>$10.00</td>
    <td>2025-11-01</td>
    <td><table><tr><td>color</td><td>blue</td></tr></table></td>
  </tr>
  <tr>
    <td>$12.50 (incl. tax)</td>
    <td colspan="2">Limited</td>
  </tr>
  <tr>
    <td>Widget B</td>
    <td>15</td>
    <td>Dec 2024</td>
    <td>best-seller</td>
  </tr>
</table>
'''
def html_table_to_dataframe(html_or_url, selector='table', fetch_remote=True):
"""
Accepts a URL or raw HTML. If fetch_remote is True and html_or_url is a URL (http/https), it fetches.
Returns pandas.DataFrame and the JSON schema.
"""
if fetch_remote and re.match(r'^https?://', html_or_url):
r = requests.get(html_or_url, timeout=15)
r.raise_for_status()
html = r.text
else:
html = html_or_url
soup = BeautifulSoup(html, 'lxml')
table_tag = soup.select_one(selector)
if table_tag is None:
raise ValueError('No table found with selector: %s' % selector)
grid = parse_html_table(table_tag)
    # treat the first grid row as the header
    header = [c or '' for c in grid[0]]
data_rows = grid[1:]
df = pd.DataFrame(data_rows, columns=header)
    # trim and normalize whitespace (DataFrame.applymap is deprecated in recent pandas)
    df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))
# basic cleanup: convert currency and percent strings
def normalize_numeric_cell(x):
if not isinstance(x, str):
return x
y = x.replace('$', '').replace(',', '').strip()
if y.endswith('%'):
try:
return float(y[:-1]) / 100.0
except Exception:
return x
try:
if re.match(r'^-?\d+$', y):
return int(y)
return float(y)
except Exception:
return x
# apply a shallow normalization pass
for col in df.columns:
df[col] = df[col].map(normalize_numeric_cell)
# infer types and cast where safe
for col in df.columns:
dtype, canonical = infer_column_type(df[col])
if dtype == 'Int64':
df[col] = pd.to_numeric(df[col], errors='coerce', downcast='integer').astype('Int64')
elif dtype == 'float64':
df[col] = pd.to_numeric(df[col], errors='coerce')
elif dtype == 'datetime64[ns]':
df[col] = pd.to_datetime(df[col], errors='coerce')
elif dtype == 'bool':
df[col] = df[col].astype(str).str.lower().map({'true': True, 'false': False, 'yes': True, 'no': False, '1': True, '0': False})
elif dtype == 'category':
df[col] = df[col].astype('category')
else:
df[col] = df[col].astype('string')
schema = build_schema(df)
return df, schema
if __name__ == '__main__':
df, schema = html_table_to_dataframe(SAMPLE_HTML, fetch_remote=False)
print('DataFrame preview:')
print(df.head())
# save outputs
out_parquet = 'scraped_table.parquet'
out_schema = 'scraped_table.schema.json'
df.to_parquet(out_parquet, engine='pyarrow', index=False)
with open(out_schema, 'w', encoding='utf8') as f:
json.dump(schema, f, indent=2)
print(f'Wrote {out_parquet} and {out_schema}')
What the code does — quick explanation
- clean_text: flattens nested structures inside a cell. Nested tables are rendered as pipe-separated segments (row1 || row2) so nothing is lost.
- parse_html_table: builds a full rectangular grid and honors colspan/rowspan. It fills spanned cells to keep column alignment stable.
- infer_column_type: heuristics to decide int/float/datetime/bool/categorical/text. Tunable thresholds help avoid misclassification.
- html_table_to_dataframe: orchestrates fetch-or-use-html, parsing, normalization (currency/percent), type casting, then emits a schema JSON and Parquet file.
Expected output (from sample)
The script writes scraped_table.parquet and scraped_table.schema.json. The JSON schema contains per-column name, dtype, canonical type, nullable, and sample values — helpful for data contracts or for model ingestion pipelines.
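Each column entry in scraped_table.schema.json is shaped like this (values here are illustrative, not captured output), with a top-level generated_at timestamp alongside the columns array:

```json
{
  "name": "Price",
  "dtype": "string",
  "type": "text",
  "nullable": false,
  "examples": ["$10.00", "$12.50 (incl. tax)", "15"]
}
```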
Edge cases & production hardening
JavaScript-rendered tables
If the target page renders tables client-side, use Playwright or Selenium to render, then pass rendered HTML into the same parser function. Playwright is lightweight and scriptable for headless rendering:
# requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    html = page.content()
    browser.close()

df, schema = html_table_to_dataframe(html, fetch_remote=False)
Anti-bot measures, rate limits, and ethics
- Respect robots.txt and site TOS; many sites disallow scraping of structured tables — check and document your compliance.
- Use polite crawling: delays, exponential backoff, and low concurrency. Implement adaptive retry logic when you see 429/503 responses.
- For protected sites or CAPTCHAs, do not bypass intentionally; use official APIs or data partnerships.
- Log request IDs, response codes, and content hashes so you can audit scraping events later.
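The backoff logic above can be sketched as follows; polite_get and backoff_delay are hypothetical helper names, not part of the quickstart script:

```python
import random
import time

import requests


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with jitter: base * 2^attempt, capped, scaled into [0.5x, 1.0x)."""
    return min(cap, base * (2 ** attempt)) * (0.5 + random.random() / 2)


def polite_get(url, max_retries=4, session=None):
    """GET with retries on 429/503; raises on other HTTP errors."""
    sess = session or requests.Session()
    for attempt in range(max_retries):
        resp = sess.get(url, timeout=15)
        if resp.status_code not in (429, 503):
            resp.raise_for_status()
            return resp
        time.sleep(backoff_delay(attempt))  # back off before the next attempt
    resp.raise_for_status()
    return resp
```

Keep concurrency low and cache responses so retries never hammer the same page.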
Schema drift and validation
In production, expect schemas to drift as sites change. Add a schema validation step using tools like Great Expectations or custom checks that compare new schema to expected schema and fail fast when critical columns vanish or types change.
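A minimal drift check against the schema JSON produced by build_schema might look like this; diff_schemas is an illustrative helper, not a Great Expectations replacement:

```python
def diff_schemas(expected, actual):
    """Compare two schema dicts (shape produced by build_schema) and
    return human-readable issues; an empty list means no critical drift."""
    exp = {c['name']: c['type'] for c in expected['columns']}
    act = {c['name']: c['type'] for c in actual['columns']}
    issues = []
    for name, typ in exp.items():
        if name not in act:
            issues.append(f'missing column: {name}')
        elif act[name] != typ:
            issues.append(f'type changed: {name} {typ} -> {act[name]}')
    return issues
```

Fail the pipeline run when diff_schemas reports any issue touching a critical column.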
Data lineage & provenance
Save provenance: source URL, fetch timestamp, HTTP response code, and a content digest (SHA256) with every dataset. Tabular foundation models are sensitive to subtle dataset shifts; keeping lineage reduces debug time.
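Capturing that lineage is only a few lines; provenance_record and its field names are illustrative:

```python
import hashlib
from datetime import datetime, timezone


def provenance_record(url, status_code, content_bytes):
    """Bundle provenance fields to store next to each dataset snapshot."""
    return {
        'source_url': url,
        'fetched_at': datetime.now(timezone.utc).isoformat(),
        'http_status': status_code,
        # content digest lets you detect silent upstream changes
        'content_sha256': hashlib.sha256(content_bytes).hexdigest(),
    }
```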
Scaling: automation and operations
- Containerize the pipeline and schedule with Airflow/Argo Workflows. Use small worker pools and autoscale for bursts.
- Store datasets in object storage with partitioning by date/source. Parquet + partitioning gives fast reads for model training.
- Maintain a catalog (Glue/Metastore) with registered schema JSON and versioning for each dataset snapshot.
- Alert on schema changes, high null rates, or missing columns. Instrument with metrics and dashboards.
Preparing data for tabular foundation models (TFMs)
TFMs expect consistent feature schemas and typed columns. The steps above help make scraping-compatible datasets:
- Canonicalize column names (snake_case, no punctuation) and include semantic types in the schema.
- Normalize units (currencies, percentages, units) and store both raw and canonicalized columns when useful.
- Impute carefully; prefer explicit nulls and a null mask column over brittle imputation when the downstream model can handle missing data.
- Provide metadata: cardinality, examples, and provenance so the model trainer can decide embeddings/vocab sizes for categorical columns.
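Column canonicalization from the first bullet can be as simple as this sketch (canonical_name is a hypothetical helper):

```python
import re


def canonical_name(col):
    """snake_case a scraped header: strip punctuation, collapse whitespace, lowercase."""
    s = re.sub(r'[^\w\s]', '', str(col)).strip().lower()
    return re.sub(r'\s+', '_', s) or 'unnamed'
```

Apply it with df.rename(columns=canonical_name) before export so the stored schema already uses canonical names.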
2026 trends & future-proofing
Late 2025 and early 2026 saw an acceleration in production use of tabular models and stronger expectations around data governance. Practical implications:
- Companies now favor schema-first ingestion to reduce fine-tuning burden and enable prompt-based retrieval over structured features.
- There’s increased regulatory attention on automated scraping — expect stricter audits and the need to keep robust provenance and consent records.
- Emerging tools combine LLMs with rule-based cleaners to repair noisy scraped cells automatically. Consider augmenting the pipeline with an LLM-based repair step to standardize text labels and correct parsing mistakes.
- On-device and edge inference of tabular models is growing — smaller, high-quality datasets matter more than ever for model compression and deployment.
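A rule-based first pass is cheap to add before any LLM repair step; standardize_label is an illustrative sketch:

```python
def standardize_label(value, canonical_labels):
    """Map a noisy scraped label onto a canonical set using a
    case- and whitespace-insensitive match; unmatched values pass
    through unchanged (candidates for an LLM repair step)."""
    key = ' '.join(str(value).lower().split())
    lookup = {' '.join(c.lower().split()): c for c in canonical_labels}
    return lookup.get(key, value)
```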
Real-world checklist before production
- Confirm legal/compliance approval and capture robots.txt/TOS evidence.
- Implement robust retries, backoff, and caching to avoid repeated heavy fetches.
- Add schema validation and alerts for drift.
- Record provenance (URL, timestamp, content hash) in metadata.
- Persist Parquet + neat JSON schema and register dataset in your catalog.
Actionable takeaways
- Start small: use the one-file reproducible pipeline to validate parsing on a handful of target pages.
- Make schema explicit: store a JSON schema alongside data so downstream teams and TFMs get deterministic inputs.
- Automate drift detection: add lightweight validation tests and alerting rather than manual checks.
- Respect source rules and log provenance — it's required for audits and model explanations.
Final notes and next steps
This quickstart is intentionally compact: the core parsing and type-inference concepts are fully reproducible and practical for real scraping tasks. For production systems, complement this with a renderer (Playwright) for JS pages, a scheduler, and a policy layer for compliance.
Try it now: copy the script, run with the inline sample, then point it at a small list of target URLs. If you want a template repository, or a version that integrates Playwright and Great Expectations for validation, reach out or fork the snippet into your team’s pipeline.
Call to action
Convert one messy table today — run the script, register the Parquet + JSON schema in your catalog, and feed it into your next tabular model training job. For a production-ready template (Playwright rendering, retries, monitoring, and Great Expectations checks) grab the extended repo on our site or contact the team to help operationalize scraping-to-schema pipelines.