fmerian

Tabstack Structured Extraction - Extract web data into structured JSON, no scraper required.

Define a schema, pass a URL, get back JSON that matches. Tabstack's extract endpoint turns any web page into structured output, with no parsing code and no LLM call to maintain. The generate endpoint adds AI instructions for reasoned answers rather than raw fields. Both enforce your schema on every call, even when the page changes. Tune speed with effort levels and target any country with geo_target. Mozilla-backed: your data is never sold or used to train models. 10,000 free credits to start.

Anand Thakkar

No-scraper structured extraction solves a real pain. The challenge has always been handling dynamic content and lazy-load patterns reliably at scale. Running a full browser context per request is expensive, but lighter HTML parsing doesn't catch enough on modern SPAs. How do you handle JS-heavy pages? Do you spin up a real browser for every extraction or have a tiered approach to keep costs down?

Tessa Kriesel

@anand_thakkar1 This is exactly the tradeoff we built around: no, it's not a real browser on every request. Extract and generate give you three effort levels, and you pick the tier:

  • min: plain HTTP fetch, no JS. Lowest cost and latency, for static or server-rendered pages.

  • standard (default): balanced handling that covers most pages without full browser rendering.

  • max: full browser render that executes JS and handles lazy-loaded content. For known heavy SPAs.

So a real browser is the heaviest tier, used for the pages that need it rather than the default for every call. Responses are cached too, so repeat requests for the same page don't re-fetch (unless you use nocache).

Honest tradeoff: the lighter tiers are faster and cheaper but can miss content on the most dynamic pages, which is exactly when you reach for max.
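
One way a caller might apply this tradeoff is to start at the cheapest tier and move up only when required fields come back empty. The sketch below is an illustration, not part of the Tabstack SDK: `ExtractFn`, `missingFields`, and `extractWithEscalation` are hypothetical names, and the extraction call is injected so the logic does not depend on any particular client signature.

```typescript
// Effort tiers named in the thread, cheapest first.
type Effort = "min" | "standard" | "max";
const EFFORT_ORDER: readonly Effort[] = ["min", "standard", "max"];

// Hypothetical signature: stands in for whatever client call performs the extraction.
type ExtractFn = (url: string, effort: Effort) => Promise<Record<string, unknown>>;

interface Attempt {
  effort: Effort;
  data: Record<string, unknown>;
  missing: string[];
}

// A field counts as missing when it is absent, null, a blank string, or an empty array.
function missingFields(
  result: Record<string, unknown>,
  required: readonly string[],
): string[] {
  return required.filter((key) => {
    const value = result[key];
    if (value === undefined || value === null) return true;
    if (typeof value === "string") return value.trim() === "";
    if (Array.isArray(value)) return value.length === 0;
    return false;
  });
}

// Try each tier in order and stop at the first one that fills every required field.
// If no tier succeeds, the result of the last (heaviest) attempt is returned.
async function extractWithEscalation(
  extract: ExtractFn,
  url: string,
  required: readonly string[],
): Promise<Attempt> {
  let last: Attempt = { effort: "min", data: {}, missing: [...required] };
  for (const effort of EFFORT_ORDER) {
    const data = await extract(url, effort);
    last = { effort, data, missing: missingFields(data, required) };
    if (last.missing.length === 0) break;
  }
  return last;
}
```

Escalating costs extra calls on pages that end up needing max, so for sites already known to be heavy SPAs it is likely cheaper to request max directly.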

Oleksii Sekundant

The "no scraper to maintain" pitch lands for anyone who's watched selectors break every time a site reships its markup. Does Tabstack lean on the rendered DOM or a model to infer structure — and how does it hold up on pages that lazy-load behind scroll?

Tessa Kriesel

@oleksii_sekundant Tabstack uses the rendered DOM plus a model, not selectors. For JS-heavy pages it renders the page in a real headless browser, then a model maps the rendered content to the JSON schema you define. The part that solves your selector pain: extraction is schema-driven, not selector-driven. You describe the meaning of the data you want, not a DOM path, so when a site reships its markup there is no selector to break. The model re-infers structure from the new render against the same schema.
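
To make "schema-driven, not selector-driven" concrete, here is a minimal sketch of what such a schema could look like. It assumes the `json_schema` parameter accepts a standard JSON Schema object; the product fields, the `Product` type, and the `isProduct` guard are hypothetical examples, not part of Tabstack's API.

```typescript
// Assumption: json_schema takes a standard JSON Schema object.
// Each field is described by meaning; there is no CSS selector or DOM path anywhere.
const productSchema = {
  type: "object",
  properties: {
    name: { type: "string", description: "Product name as shown to shoppers" },
    price: { type: "number", description: "Current price, without currency symbol" },
    currency: { type: "string", description: "ISO 4217 currency code, e.g. USD" },
    inStock: { type: "boolean", description: "Whether the product can be ordered now" },
  },
  required: ["name", "price"],
} as const;

// The shape the caller expects back, written by hand to mirror the schema above.
interface Product {
  name: string;
  price: number;
  currency?: string;
  inStock?: boolean;
}

// Defensive check on the caller's side: confirms the required keys exist
// and have the expected primitive types before the value is used as a Product.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const hasRequired = productSchema.required.every((key) => key in v);
  return hasRequired && typeof v["name"] === "string" && typeof v["price"] === "number";
}
```

Because the schema names what the data means, the same object can be reused unchanged after a site redesign; only the page's markup has moved.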

On lazy-load behind scroll: the rendered path handles it. After navigation it waits for the network to go idle and for JS to render, then scrolls the page in passes (with short pauses) to trigger lazy-loaded content before it reads the DOM. So data that only appears as you scroll down gets pulled in.

The honest boundary: that scroll pass is bounded, so it covers typical lazy-load-on-scroll, not endless infinite feeds. For unbounded scrolling, or "scroll, then click or interact, then read" flows, the automate endpoint drives a real browser and can scroll as a deliberate step, then return structured output.
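
The bounded scroll pass can be pictured with a generic sketch like the one below. This is not Tabstack's implementation: the `ScrollablePage` interface, the `scrollInPasses` function, and the default limits are all hypothetical, chosen only to show why a capped loop covers typical lazy-loading but not an endless feed.

```typescript
// Hypothetical minimal view of a rendered page; a real browser driver would back these.
interface ScrollablePage {
  scrollHeight(): Promise<number>;
  scrollToBottom(): Promise<void>;
  pause(ms: number): Promise<void>;
}

// Scroll in passes until the page stops growing or the pass limit is reached.
// Returns how many passes were performed.
async function scrollInPasses(
  page: ScrollablePage,
  maxPasses = 5, // illustrative cap, not a documented value
  pauseMs = 250, // illustrative pause, not a documented value
): Promise<number> {
  let previous = await page.scrollHeight();
  let passes = 0;
  while (passes < maxPasses) {
    await page.scrollToBottom();
    await page.pause(pauseMs); // give lazy-loaded content time to arrive
    passes++;
    const current = await page.scrollHeight();
    if (current === previous) break; // nothing new loaded, so the page is done
    previous = current;
  }
  return passes;
}
```

On a page with a finite amount of lazy content the loop exits early once the height stops changing. On an infinite feed the height keeps growing, so only the cap ends the loop, which is the case where an interaction-driven flow is the better fit.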

Fabian Maume

Do you plan to implement some user agent rotation?

For now, all requests are signed with the same user agent: Mozilla-Tabstack/1.0 (+https://tabstack.ai)

Tessa Kriesel

@fabian_maume Good catch, and the single agent string is intentional. Every request identifies as Mozilla-Tabstack/1.0 with a contact URL on purpose, so site operators can see exactly who is accessing them and reach us directly. Identifiable, predictable access is the posture we want right now. If there's a specific case where the single UA is blocking a legitimate extraction for you, tell us more about it. That's genuinely useful for how we prioritize.

fmerian

Looking forward to seeing what you're building with @Tabstack by Mozilla!

Corey Haines

Another amazing shipment 🛳️

Tessa Kriesel

@corey_haines aww thanks for the kind words!

Chris Davis

Seems like an interesting concept. How well do you handle sites with heavy JS rendering?

Tessa Kriesel

@chris_davis23 Heavy JS rendering is handled through the effort parameter on the extract and generate endpoints:

  • max: full headless browser rendering. It executes JavaScript and waits for dynamic content to load before pulling data. This is the setting for SPAs (React, Vue, Angular, Next.js client-side), lazy-loaded content, and pricing or product grids that only appear after JS runs.

  • standard (default): lighter JS handling that covers most pages.

  • min: static HTML only, no JS, for lowest latency.

So for a JS-heavy site you set effort: 'max' and extract against the fully rendered DOM:

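// Assumes `client` is an already-initialized Tabstack SDK client.
const data = await client.extract.json({
  url: 'https://example.com',
  effort: 'max', // full headless browser render, executes JS before extraction
  json_schema: { /* the shape you want back */ }
});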

The same effort control applies to markdown extraction and the generate endpoint. And if a page only reveals content after interaction (click, scroll, log in), the automate endpoint drives a real browser to do that first, then hands back structured output.

xiaosong

The structured extraction angle is useful, especially if it keeps schema drift visible instead of just returning a 'best effort' JSON blob. Not sure if I missed it, but can teams version extraction rules per site/workflow?