Deploying Tabular Foundation Models to Clean Scraped Price Lists: A Recipe
Hands-on 2026 recipe: convert messy HTML/PDF vendor price lists into normalized tables using tabular foundation models, with code and tests.
Hook: Your price lists are a mess — here’s a reproducible recipe to fix them
You scrape dozens of vendor price lists every week and get back a pile of HTML tables, scanned PDFs, and CSVs with inconsistent columns, duplicate SKUs, mixed units, and vendor-specific descriptions. Downstream analytics, margin calculations, and ML models break as a result. This article shows a practical, production-ready recipe (2026) that combines reliable scraping, lightweight OCR, and a tabular foundation model to perform deduplication, unit conversion, and schema mapping, all reproducible with code snippets and testable steps.
Why tabular foundation models matter for price scraping in 2026
In late 2025 and early 2026 the industry shifted from text-first LLM workflows to specialized tabular foundation models (TFMs) that understand rows, columns, and table-level semantics. These models excel at mapping vendor‑specific columns to canonical schemas, inferring units, normalizing price formats, and resolving duplicate items across feeds. The result: less brittle heuristic code, fewer manual rules, and faster time-to-clean data.
At the same time, cloud GPU and memory price volatility (a trend visible after CES 2026) means teams increasingly prefer hosted TFMs or small, efficient on-prem agents rather than running huge general-purpose models for each ingestion pipeline. This recipe is built for that hybrid reality: use cheap pre-processing, call a focused tabular inference endpoint for complex normalization, and run local deterministic code for cheap, repeated tasks.
Recipe overview — what you’ll end up with
- Ingestion pipeline: HTML scraping + PDF extraction into raw row lists (see notes on robust scraping and secure storage)
- Preprocessing: text cleanup, tokenization, and weak schema hints
- TFM tasks: schema mapping, unit normalization, and deduplication
- Postprocessing: deterministic conversions, fuzzy matching, and validation rules
- Testing: unit tests and data-quality checks to catch regressions (integrate with observability and CI)
1) Ingest: HTML and PDF scraping (practical snippets)
Start by extracting every candidate row as a minimal JSON object: {vendor, raw_text, source_type, page, bounding_box?}. Keep source provenance for audits; persist originals to a secure, auditable store.
HTML example (BeautifulSoup)
import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get('https://vendor.example/prices.html', timeout=30)
soup = BeautifulSoup(r.text, 'html.parser')

rows = []
for table in soup.select('table'):
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    for tr in table.select('tbody tr'):
        cells = [td.get_text(' ', strip=True) for td in tr.find_all(['td', 'th'])]
        rows.append({'vendor': 'vendor.example', 'raw': cells, 'headers': headers, 'source': 'html'})

df = pd.DataFrame(rows)
print(df.head())
PDF extraction (pdfplumber, layout-aware)
Scanned PDFs or text PDFs need a two-step strategy: OCR for images and table extraction for born-digital PDFs. Use pdfplumber for text PDFs and Tesseract (via pytesseract) for scanned pages. Keep page and bbox metadata for later human review.
import pdfplumber
import pytesseract

rows = []
with pdfplumber.open('vendor_pricelist.pdf') as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        if tables:
            for table in tables:
                for row in table:
                    rows.append({'vendor': 'vendor.pdf', 'raw': row, 'page': i, 'source': 'pdf'})
        else:
            # fallback: rasterize the page and OCR it (extract_tables returns [] for scanned pages)
            pil = page.to_image(resolution=200).original
            text = pytesseract.image_to_string(pil)
            # simple line-split fallback
            for line in text.splitlines():
                if line.strip():
                    rows.append({'vendor': 'vendor.pdf', 'raw': [line], 'page': i, 'source': 'pdf-ocr'})
2) Preprocess: token-level cleanup and weak schema hints
Before calling the model, normalize whitespace, strip currency symbols into separate fields, and extract obvious numeric candidates with regex. These weak hints reduce model latency and improve accuracy.
import re
from decimal import Decimal

def extract_price(s):
    s = s.replace(',', '')
    # prefer a number anchored to a currency symbol; fall back to the last number in the row
    m = re.search(r'[£$€]\s?(\d+(?:\.\d+)?)', s)
    if m:
        return Decimal(m.group(1))
    candidates = re.findall(r'\d+(?:\.\d+)?', s)
    return Decimal(candidates[-1]) if candidates else None

for r in rows:
    text = ' '.join(c for c in r['raw'] if c) if isinstance(r['raw'], list) else r['raw']
    r['clean_text'] = re.sub(r'\s+', ' ', text).strip()
    r['price_hint'] = extract_price(r['clean_text'])
3) Call a Tabular Foundation Model for schema mapping & normalization
Use a dedicated TFM API (hosted or vendor-provided) to sidestep brittle heuristics. The TFM should accept:
- a small context: canonical schema definition
- a batch of raw rows (clean_text, price_hint, vendor, source)
- tasks: map_columns, normalize_units, dedupe_candidates
Below is a practical, provider-agnostic request pattern. Adapt to your vendor API or self-hosted endpoint; consider local-first deployment for sensitive catalogs.
# Example: POST to /v1/tabular/infer
payload = {
    'schema': {
        'columns': [
            {'name': 'sku', 'type': 'string'},
            {'name': 'description', 'type': 'string'},
            {'name': 'price_usd', 'type': 'money', 'currency': 'USD'},
            {'name': 'uom', 'type': 'string'},
            {'name': 'pack_qty', 'type': 'integer'},
        ]
    },
    'rows': [
        {'id': 1, 'clean_text': 'ACME Bolt 1/2in 100pc $12.50', 'price_hint': 12.5, 'vendor': 'v1'},
        {'id': 2, 'clean_text': 'ACME Bolt 12mm pk100 USD 12.50', 'price_hint': 12.5, 'vendor': 'v2'},
    ],
    'tasks': ['map_columns', 'normalize_units', 'dedupe'],
}
# requests.post('https://tfm.example/api/v1/tabular/infer', json=payload)
Expected model response (abbreviated):
{
  "rows": [
    {
      "id": 1,
      "mapped": {"sku": null, "description": "ACME Bolt 1/2in", "price_usd": 12.5, "uom": "inch", "pack_qty": 100},
      "dedupe_key": "acme-bolt-0.5in"
    },
    {
      "id": 2,
      "mapped": {"sku": null, "description": "ACME Bolt 12mm", "price_usd": 12.5, "uom": "mm", "pack_qty": 100},
      "dedupe_key": "acme-bolt-12mm"
    }
  ],
  "mappings": {
    "description": {"confidence": 0.98},
    "price_usd": {"confidence": 0.95}
  }
}
Why separate model tasks?
TFMs are good at semantic understanding (what field this text likely refers to). Deterministic tasks like currency math and exact unit conversion are cheaper and more reliable when done locally with libraries — use the model to produce the high-level mapping and normalized candidate values, then run deterministic code for conversions and checks. Also instrument calls with observability and cost controls to avoid runaway inference bills.
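As a concrete illustration of that split, here is a minimal sketch that checks the model's mapped price against the deterministic price_hint and flags disagreements for review. Field names follow the request/response examples above; joining rows on an id field is an assumption.

# Minimal sketch: verify the model's mapped price against the local regex hint.
# Assumes rows carry the same 'id' used in the request payload (illustrative).
def crosscheck_prices(model_rows, raw_rows, tolerance=0.01):
    hints = {r['id']: r.get('price_hint') for r in raw_rows}
    flagged = []
    for row in model_rows:
        mapped_price = (row.get('mapped') or {}).get('price_usd')
        hint = hints.get(row.get('id'))
        if mapped_price is not None and hint is not None:
            # disagreement between model and deterministic parser: send to review
            if abs(float(mapped_price) - float(hint)) > tolerance:
                flagged.append(row['id'])
    return flagged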
4) Unit conversion and deterministic validation
Use a unit library (pint) for safe conversions. The model may return uom candidates with confidence scores. Accept high-confidence suggestions automatically; queue low-confidence items for human review (a gating sketch follows the conversion helper below).
from pint import UnitRegistry

ureg = UnitRegistry()

def convert_to_base(qty, uom_str):
    try:
        q = (qty * ureg(uom_str)).to_base_units()
        return float(q.magnitude), str(q.units)
    except Exception:
        return None, None

# Example
q, unit = convert_to_base(1, '12mm')  # model might output '12mm' or 'mm'
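Building on that helper, a gating sketch for the accept/queue split. It assumes the model returns a per-row confidence for each mapped field, which the abbreviated response above only shows at batch level:

# Sketch: auto-accept high-confidence unit suggestions, queue the rest for review.
# Assumes a per-row 'confidence' dict for mapped fields (illustrative shape).
CONFIDENCE_THRESHOLD = 0.9

accepted, review_queue = [], []
for row in model_resp['rows']:
    uom = row['mapped'].get('uom')
    uom_conf = row.get('confidence', {}).get('uom', 0.0)
    qty, base_unit = convert_to_base(1, uom) if uom else (None, None)
    if uom_conf >= CONFIDENCE_THRESHOLD and base_unit is not None:
        row['mapped']['uom_base'] = base_unit
        accepted.append(row)
    else:
        review_queue.append(row)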
5) Deduplication: combine model keys with fuzzy matching
Deduplication should be deterministic across runs but robust to small vendor variations. Use the TFM’s suggested dedupe_key as a blocking key, then use a fuzzy match library (rapidfuzz) to cluster remaining candidates. Persist stable canonical IDs, and cache model outputs so identical inputs do not trigger repeated calls.
from rapidfuzz import fuzz

# cluster by dedupe_key first
clusters = {}
for r in model_resp['rows']:
    key = r.get('dedupe_key') or r['mapped'].get('description')
    clusters.setdefault(key, []).append(r)

# within-cluster fuzzy merge on description
canonical_rows = []
for key, group in clusters.items():
    # choose the longest description as canonical (heuristic)
    canonical = max(group, key=lambda g: len(g['mapped']['description']))
    # check for near-duplicates to merge prices
    for g in group:
        score = fuzz.token_sort_ratio(canonical['mapped']['description'], g['mapped']['description'])
        if score > 92:
            # merge logic: choose lowest price, sum pack_qty, etc.
            pass
    canonical_rows.append(canonical)
6) Schema mapping strategies and examples
Two practical mapping strategies are effective in production:
- Schema-first mapping: provide the model with the canonical schema and a few examples per vendor. Best when you control the target warehouse schema.
- Schema-less extraction then canonical mapping: ask the model to extract labeled fields from each row (description, qty, uom, price) and then map the extracted fields to the canonical schema using rules. Best when vendors vary widely; a schema-less request sketch follows the schema-first example below.
Example: schema-first prompt fragment sent as structured input to the TFM:
{
    'schema_hint': [
        {'target': 'sku', 'examples': ['SKU 12345', 'Part #: 12345']},
        {'target': 'description', 'examples': ['10mm Hex Bolt Stainless', 'Bolt, Zinc Plated, 1/2in']},
        {'target': 'price_usd', 'examples': ['$12.50', 'USD 12.50', '12.50']}
    ]
}
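For the schema-less strategy, a comparable request asks the model to label fields first and leaves canonical mapping to local rules. The task name, labels, and mapping table below are illustrative, not a specific vendor API:

# Hypothetical schema-less request: extract labeled fields, then map to the
# canonical schema locally. Task and label names are illustrative.
payload = {
    'rows': [
        {'id': 1, 'clean_text': 'ACME Bolt 1/2in 100pc $12.50', 'vendor': 'v1'},
    ],
    'tasks': ['extract_fields'],   # ask for labeled fields, not a schema mapping
}

# local rule-based mapping from extracted labels to the canonical schema
LABEL_TO_CANONICAL = {
    'part_number': 'sku',
    'item_description': 'description',
    'unit_price': 'price_usd',
    'unit_of_measure': 'uom',
    'pack_quantity': 'pack_qty',
}

def to_canonical(extracted):
    return {LABEL_TO_CANONICAL.get(k, k): v for k, v in extracted.items()}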
7) Testing and validation — prevent regressions
Implement the following checks as CI steps when updating mapping logic or model versions:
- Column coverage: percentage of rows with a mapped price and unit
- Value sanity checks: no negative prices, price per unit bounds
- Deduplication stability: consistent canonical IDs across runs for same inputs
- Model drift alerts: track mapping confidence and flag sudden changes in average confidence (tie these alerts into your observability dashboards)
Example automated test (pytest)
def test_price_coverage(cleaned_rows):
    covered = sum(1 for r in cleaned_rows if r.get('mapped', {}).get('price_usd') is not None)
    assert covered / len(cleaned_rows) > 0.98
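The same pattern covers the sanity and stability checks listed above. A sketch, assuming rows expose an id and dedupe_key and that previous_canonical_ids is a fixture loaded from the prior run:

def test_no_negative_prices(cleaned_rows):
    prices = [r['mapped']['price_usd'] for r in cleaned_rows
              if r.get('mapped', {}).get('price_usd') is not None]
    assert all(p >= 0 for p in prices)

def test_dedupe_stability(cleaned_rows, previous_canonical_ids):
    # canonical keys for identical inputs should not change between runs
    current = {r['id']: r.get('dedupe_key') for r in cleaned_rows}
    changed = {k for k, v in current.items()
               if k in previous_canonical_ids and previous_canonical_ids[k] != v}
    assert not changed, f"dedupe keys changed for rows: {sorted(changed)}"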
8) Human-in-the-loop (HITL) and auditability
Keep a compact review interface that shows:
- original row text
- TFM-mapped fields and confidence
- suggested canonical grouping
- quick accept/override actions
For compliance and auditing, persist the model response, mapping version, and any human overrides. This is crucial for vendor disputes and for tracing training data drift. Store provenance in an auditable store and periodically review the pipeline to retire underused tooling that adds cost and complexity.
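One lightweight way to persist that trail is an append-only JSONL record per row decision; the record shape below is illustrative:

import json
from datetime import datetime, timezone

def write_audit_record(path, row_id, model_response, mapping_version, override=None):
    # append-only audit trail: model output, mapping version, and any human override
    record = {
        'row_id': row_id,
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'mapping_version': mapping_version,
        'model_response': model_response,
        'human_override': override,
    }
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record) + '\n')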
9) Cost and performance considerations (2026 realities)
Late-2025/early-2026 trends changed cost math:
- TFMs optimized for tabular tasks are far smaller and cheaper than general LLMs for the same job.
- Memory and GPU availability remains cyclical — run bulk preprocessing locally, batch model calls, and use confidence thresholds to bypass the model for trivial rows. Use observability to measure model calls per row and keep TFM usage to the fraction of rows that need semantic mapping.
- Where possible, cache model outputs for identical inputs across vendors to avoid redundant calls; a caching sketch follows below.
Practical budgeting tip: measure model calls per row and aim to keep TFM usage to the percentage of rows that require semantic mapping (often 10–30%). Cheap deterministic parsing should handle the rest.
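A content-hash cache captures most of those savings for repeated, identical rows. A sketch, with an in-memory dict standing in for whatever store you use and call_fn standing in for your batched TFM request:

import hashlib

_tfm_cache = {}  # in production, back this with Redis, SQLite, or object storage

def cached_tfm_call(rows, call_fn):
    # only send rows the cache has not seen; reuse prior responses for the rest
    results, to_send = {}, []
    for row in rows:
        key = hashlib.sha256(row['clean_text'].encode('utf-8')).hexdigest()
        if key in _tfm_cache:
            results[row['id']] = _tfm_cache[key]
        else:
            to_send.append((key, row))
    if to_send:
        # assumes the response rows come back in the same order they were sent
        response = call_fn([r for _, r in to_send])
        for (key, row), mapped in zip(to_send, response['rows']):
            _tfm_cache[key] = mapped
            results[row['id']] = mapped
    return results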
10) Evaluation: metrics to track
Track these KPIs weekly:
- Mapping accuracy (sampled human-verified)
- Price extraction recall/precision
- Deduplication F1 against a gold set
- Time-to-clean (seconds per feed)
- Model confidence distribution and percentage of low-confidence rows
11) Quick pitfalls and troubleshooting
- Bad source HTML: vendor-side scripts may inject invisible characters; normalize Unicode and strip zero-width spaces (see the snippet after this list).
- Currency ambiguity: if a vendor uses local currency symbols inconsistently, ask the TFM to infer currency from vendor metadata or fall back to an IP-derived locale, and document that inference for compliance review.
- Packaging ambiguity: “12x100” vs “100x12” — define canonical pack semantics and provide examples to the model.
- Model drift: maintain versioned sample datasets and re-evaluate mapping accuracy before switching model versions in production.
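For the first pitfall, a small cleanup helper that applies NFKC normalization and strips zero-width characters before any parsing:

import re
import unicodedata

ZERO_WIDTH = re.compile(r'[\u200b\u200c\u200d\u2060\ufeff]')  # zero-width chars and BOM

def normalize_text(s):
    s = unicodedata.normalize('NFKC', s)   # fold full-width digits, ligatures, etc.
    s = ZERO_WIDTH.sub('', s)
    return re.sub(r'\s+', ' ', s).strip()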
12) Example end-to-end flow (compact)
- Scrape source (HTML/PDF) → extract per-row raw_text
- Preprocess: cleanup, price_hint, numeric extraction
- Batch-call TFM with schema hints → get mapped fields + dedupe keys + confidence
- Run deterministic converters (pint) and fuzzy dedupe (rapidfuzz)
- Persist mapped rows, canonical IDs, provenance, and model metadata (consider provenance stores)
- Run CI tests and surface low-confidence rows to review queue
13) Reproducible snippet: run locally then swap the TFM endpoint
Put the workflow in a lightweight script allowing you to swap the model endpoint in one place. This makes A/B testing model versions straightforward.
class PriceCleaner:
    def __init__(self, tfm_endpoint):
        self.tfm = tfm_endpoint

    def ingest(self, source):
        # call earlier scraping code
        pass

    def call_model(self, rows):
        # batch and POST to self.tfm
        pass

    def postprocess(self, model_resp):
        # deterministic conversions, dedupe
        pass

# run tests and export
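Swapping endpoints for an A/B comparison then stays a one-line change. The endpoints below are hypothetical placeholders for your own hosted or on-prem TFMs:

cleaner_a = PriceCleaner('https://tfm.example/api/v1/tabular/infer')
cleaner_b = PriceCleaner('https://tfm-candidate.example/api/v1/tabular/infer')

for cleaner in (cleaner_a, cleaner_b):
    rows = cleaner.ingest('vendor_pricelist.pdf')
    resp = cleaner.call_model(rows)
    cleaner.postprocess(resp)
    # compare mapping coverage and confidence between the two runs here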
14) Future-proofing and 2026 predictions
Expect the next 12–24 months to bring:
- Standardized tabular model APIs and interchange formats (reduces adapter code)
- Better small-footprint TFMs you can run on-prem for sensitive catalogs (see notes on local-first appliances)
- Improved multimodal table extraction for mixed HTML/PDF/Images with layout-aware TFMs
The practical win in 2026 isn’t replacing scrapers with a model — it’s replacing brittle heuristics with a hybrid: light pre-processing + targeted TFM calls + deterministic conversions.
Actionable takeaways (one-page checklist)
- Extract minimal per-row provenance (vendor, page, bbox)
- Precompute easy hints (price_hint, numeric tokens) before model calls
- Use the TFM for high-value semantic tasks: mapping, unit inference, canonical keys
- Perform deterministic unit conversions locally with pint and fuzzy clustering with rapidfuzz
- Persist model responses and confidence for audit and drift detection
- Implement CI tests for mapping coverage and dedupe stability
Final notes: compliance and ethics
When scraping price lists, always respect robots.txt, vendor terms, and regional scraping laws. Store only the minimum required vendor data and keep an auditable trail of provenance and human approvals. Where data is sensitive, prefer on-prem TFM runs or enterprise-hosted models with clear data-use contracts and secure storage (zero-trust storage).
Call to action
Ready to try this recipe on your feeds? Clone a starter repo, swap in your TFM endpoint, and run the provided tests to benchmark mapping accuracy in an hour. If you need a consultation to adapt this to your scale or compliance constraints, reach out to the scrapes.us team; we help teams automate reliable price ingestion pipelines that scale.