Data Extraction

Extract structured data from HTML content using CSS or XPath selectors. The extraction rules system covers everything from simple selectors to advanced scenarios, including dynamic content and AI-powered extraction.

Key Features

🎯 Precise Targeting: Extract exactly what you need using CSS or XPath selectors
📦 Multiple Formats: Get data as text, HTML, or specific attributes
📋 Bulk Extraction: Extract single items or lists of elements
🔄 Dynamic Content: Works seamlessly with JavaScript-rendered pages
🤖 AI-Powered: Extract data using natural language queries (LLM) (Beta)

Overview

The data extraction feature allows you to:

Extract specific elements using CSS or XPath selectors
Get data in various formats (text, HTML, attributes)
Extract single items or lists of items
Use both simple and advanced extraction rules

Basic Usage

The simplest way to extract data is to pass key-value pairs, where each key is the name you want in the output and each value is the selector:

json

{
    "title": "h1",
    "subtitle": "#subtitle"
}

This will return:

json

{
    "title": "The BitFetcher Blog",
    "subtitle": "We help you get better at web-scraping"
}

For simple extractions, you can use shorthand notation. The API will automatically detect the selector type and extract the text content.

Rule Types

Simple String Rules

For basic extractions, use a string selector:

json

{
    "title": "h1",                    // Extract text from h1
    "link": "a@href",                 // Extract href attribute from anchor
    "description": ".content p"       // Extract text from paragraph
}
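To make the shorthand concrete, the sketch below expands a string rule into the full rule object it appears to stand for. It is an illustration only, not the service's parser: `expand_rule` is a hypothetical helper, the `"item"` default for `type` is an assumption, and the split on a trailing `@attribute` is one plausible reading of the examples above.

```python
import re

# A trailing "@name" where name looks like an HTML attribute name.
_ATTR_SUFFIX = re.compile(r"^(?P<selector>.+)@(?P<attr>[A-Za-z_][\w:.-]*)$")


def expand_rule(rule):
    """Expand a shorthand string into a full rule object (hypothetical helper)."""
    match = _ATTR_SUFFIX.match(rule)
    if match:
        # "a@href" -> selector "a", output "@href"
        selector, output = match["selector"], "@" + match["attr"]
    else:
        # No attribute suffix: extract the text content.
        selector, output = rule, "text"
    return {
        "selector": selector,
        "selector_type": "auto",  # documented default
        "output": output,
        "type": "item",           # assumed default, not confirmed by the docs
    }


print(expand_rule("a@href"))
print(expand_rule("meta[name='description']@content"))
print(expand_rule("//a[@class='external']"))  # "@class" sits inside a predicate, so no split
```

Under this reading, an `@` inside an XPath predicate is left alone because the text after it is not a bare attribute name. An XPath that ends in an attribute step, such as `//a/@href`, would be ambiguous, so the full rule object is the safer choice there.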

Advanced Rules

For more complex scenarios, use the full rule object:

json

{
    "products": {
        "selector": ".product-item",    // CSS or XPath selector
        "selector_type": "css",         // "css", "xpath", or "auto" (default)
        "output": "text",               // "text", "html", or "@attribute"
        "type": "list"                  // "item" or "list"
    }
}
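Because a full rule has several fields with a fixed set of allowed values, it can help to check rules locally before sending them. The following is a minimal sketch using only the values listed in the comments above; `validate_rule` is a hypothetical helper, and the `"item"` default for `type` is an assumption.

```python
ALLOWED_SELECTOR_TYPES = ("css", "xpath", "auto")
ALLOWED_TYPES = ("item", "list")


def validate_rule(name, rule):
    """Return a list of problems found in one full rule object (empty if none)."""
    problems = []

    selector = rule.get("selector")
    if not isinstance(selector, str) or not selector.strip():
        problems.append(f"{name}: 'selector' must be a non-empty string")

    selector_type = rule.get("selector_type", "auto")
    if selector_type not in ALLOWED_SELECTOR_TYPES:
        problems.append(f"{name}: unknown selector_type {selector_type!r}")

    output = rule.get("output", "text")
    is_attribute = isinstance(output, str) and output.startswith("@") and len(output) > 1
    if output not in ("text", "html") and not is_attribute:
        problems.append(f"{name}: output must be 'text', 'html', or '@attribute'")

    kind = rule.get("type", "item")  # assumed default
    if kind not in ALLOWED_TYPES:
        problems.append(f"{name}: type must be 'item' or 'list'")

    return problems


rules = {
    "products": {"selector": ".product-item", "selector_type": "css", "output": "text", "type": "list"},
    "broken": {"selector": "", "output": "href", "type": "many"},
}
for rule_name, rule in rules.items():
    for problem in validate_rule(rule_name, rule):
        print(problem)
```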

Selector Types

CSS Selectors

CSS selectors are the default when the selector doesn't start with /:

json

{
    "title": "#main-title",
    "meta_description": "meta[name='description']@content",
    "product_prices": {
        "selector": ".price-tag",
        "type": "list"
    }
}

XPath Selectors

XPath selectors are automatically detected when starting with /, or can be explicitly specified:

json

{
    "external_links": {
        "selector": "//a[@class='external']",
        "selector_type": "xpath",
        "type": "list",
        "output": "@href"
    }
}
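The detection rule described in this section (a leading `/` means XPath, anything else means CSS) can be written in a few lines. This is a sketch of the rule as stated, not the service's code, and `detect_selector_type` is a hypothetical name.

```python
def detect_selector_type(selector, selector_type="auto"):
    """Resolve "auto" to "css" or "xpath" using the leading-slash rule."""
    if selector_type != "auto":
        return selector_type  # an explicit choice always wins
    return "xpath" if selector.startswith("/") else "css"


print(detect_selector_type("#main-title"))
print(detect_selector_type("//a[@class='external']"))
print(detect_selector_type("(//a)[1]"))
print(detect_selector_type("(//a)[1]", "xpath"))
```

One consequence of the rule as written: XPath expressions that begin with `(` or `.`, such as `(//a)[1]` or `.//a`, do not start with `/`, so they would presumably be treated as CSS unless `selector_type` is set to `"xpath"` explicitly.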

List Extraction

To extract multiple elements, use type: "list":

json

{
    "product_titles": {
        "selector": ".product-item h2",
        "type": "list",
        "output": "text"
    },
    "product_images": {
        "selector": ".product-item img",
        "type": "list",
        "output": "@src"
    }
}

Response:

json

{
    "product_titles": [
        "iPhone 15 Pro Max",
        "Samsung Galaxy S24"
    ],
    "product_images": [
        "iphone15.jpg",
        "galaxy24.jpg"
    ]
}

When using list extraction, make sure your selector matches all desired elements. Test with different page states to ensure reliability.
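List rules return parallel arrays, so a common next step is pairing them into one record per product. The sketch below does that on the client side with the response shown above. `to_records` is a hypothetical helper, and pairing by position assumes every product has both a title and an image.

```python
from itertools import zip_longest

response = {
    "product_titles": ["iPhone 15 Pro Max", "Samsung Galaxy S24"],
    "product_images": ["iphone15.jpg", "galaxy24.jpg"],
}


def to_records(result, field_map):
    """Pair parallel lists by position; shorter lists are padded with None."""
    names = list(field_map)
    columns = [result.get(field_map[name]) or [] for name in names]
    return [dict(zip(names, row)) for row in zip_longest(*columns)]


def lengths_match(result, keys):
    """True when all listed fields returned the same number of items."""
    return len({len(result.get(key) or []) for key in keys}) <= 1


products = to_records(response, {"title": "product_titles", "image": "product_images"})
print(products[0])  # {'title': 'iPhone 15 Pro Max', 'image': 'iphone15.jpg'}
print(lengths_match(response, ["product_titles", "product_images"]))  # True
```

If the lengths differ, positional pairing can attach an image to the wrong title. In that case a narrower selector, or extracting each product container separately, is likely the more reliable route.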

Output Types

The extractor supports three output formats:

text: Extracts text content (default)
html: Extracts HTML content
@attribute: Extracts specific HTML attributes (e.g., @href, @src)

Text Output

json

{
    "title": {
        "selector": "h1",
        "output": "text"
    }
}

HTML Output

json

{
    "content": {
        "selector": ".article",
        "output": "html"
    }
}

Attribute Output

json

{
    "image_url": {
        "selector": "img.hero",
        "output": "@src"
    }
}
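To show how the three output types differ for the same element, here is a local illustration using Python's standard library. It only mimics the idea and is not how the service works internally. Whether the service's `html` output includes the element's own tag, and whether `text` is whitespace-trimmed, are not stated in this document.

```python
import xml.etree.ElementTree as ET

# A small, well-formed snippet standing in for a fetched page.
snippet = '<div><a class="external" href="/docs">Read <b>more</b></a></div>'
link = ET.fromstring(snippet).find("a")


def render(element, output="text"):
    """Return the element as text, HTML, or a single attribute value."""
    if output == "text":
        return "".join(element.itertext())  # text of the element and its children
    if output == "html":
        return ET.tostring(element, encoding="unicode", method="html")
    if output.startswith("@"):
        return element.get(output[1:])  # None when the attribute is absent
    raise ValueError(f"unsupported output type: {output!r}")


print(render(link, "text"))    # Read more
print(render(link, "html"))
print(render(link, "@href"))   # /docs
print(render(link, "@title"))  # None
```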

Real-World Examples

Email Address Extraction

Extract all email links from a page. Because the rule reads the href attribute, each returned value still includes the mailto: prefix:

json

{
    "email_addresses": {
        "selector": "a[href^='mailto']",
        "output": "@href",
        "type": "list"
    }
}
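The `@href` values of `mailto` links carry the scheme and may carry a query string or several recipients, so a little client-side cleanup is usually needed. A minimal sketch, assuming the values come back exactly as written in the page; `emails_from_mailto` is a hypothetical helper.

```python
from urllib.parse import unquote


def emails_from_mailto(hrefs):
    """Turn mailto: hrefs into a de-duplicated list of bare addresses."""
    addresses = []
    for href in hrefs or []:
        scheme, _, rest = href.partition(":")
        if scheme.strip().lower() != "mailto":
            continue  # skip anything that is not a mailto link
        recipients = rest.split("?", 1)[0]  # drop ?subject=... and similar
        for address in recipients.split(","):
            address = unquote(address).strip()
            if address and address not in addresses:
                addresses.append(address)
    return addresses


hrefs = [
    "mailto:sales@example.com?subject=Hello",
    "MAILTO:a@example.com,b%40example.com",
    "mailto:sales@example.com",
]
print(emails_from_mailto(hrefs))
# ['sales@example.com', 'a@example.com', 'b@example.com']
```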

Product Information

Combine shorthand and full rules to extract the page title, meta description, product prices, and navigation links in one rule set:

json

{
    "page_title": "h1",
    "meta_description": "meta[name='description']@content",
    "product_prices": {
        "selector": ".price-tag",
        "type": "list",
        "output": "text"
    },
    "navigation_links": {
        "selector": "nav a",
        "type": "list",
        "output": "@href"
    }
}
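Attribute values such as `@href` appear to be returned as written in the page (the List Extraction response shows bare file names like `iphone15.jpg`), so links may be relative. The sketch below resolves them against the page URL on the client side. The page URL and the sample links are made up for illustration.

```python
from urllib.parse import urljoin


def absolutize(links, page_url):
    """Resolve relative links against the URL of the page they came from."""
    return [urljoin(page_url, link) for link in links or []]


page_url = "https://example.com/shop/index.html"  # hypothetical page
navigation_links = ["/pricing", "docs/intro", "https://other.example/x", "#top"]

for url in absolutize(navigation_links, page_url):
    print(url)
# https://example.com/pricing
# https://example.com/shop/docs/intro
# https://other.example/x
# https://example.com/shop/index.html#top
```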

Error Handling

The extractor handles errors gracefully:

Skips individual rules that fail to match
Continues processing other rules even if some fail
Returns null for unmatched selectors
Provides meaningful error messages for debugging
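Since unmatched selectors come back as null rather than raising an error, the caller has to check for missing values. A minimal sketch of that check follows. `missing_fields` is a hypothetical helper, and treating an empty list or empty string as "missing" is an assumption, because this document does not say what a list rule returns when nothing matches.

```python
def missing_fields(result, expected):
    """Return the expected field names that are absent, null, or empty."""
    return [name for name in expected if result.get(name) in (None, [], "")]


result = {"title": "Hello", "subtitle": None, "prices": []}
expected = ["title", "subtitle", "prices", "rating"]

missing = missing_fields(result, expected)
if missing:
    print("No data for:", ", ".join(missing))
# No data for: subtitle, prices, rating
```

Logging the missing names on every run makes it easier to notice when a site redesign has broken a selector.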

Best Practices

Use Specific Selectors: More specific selectors are less likely to break with HTML changes
Choose Appropriate Types: Use type: "list" when expecting multiple elements
Prefer CSS Selectors: Use CSS selectors when possible for better readability
Test Edge Cases: Verify your rules work with empty results and malformed HTML
Use Attribute Extraction: When possible, extract attributes directly instead of parsing text

Integration with JavaScript Rendering

When using data extraction with JavaScript-rendered pages, make sure to:

Enable JavaScript rendering with js_render=true
Set appropriate wait times for dynamic content to load
Use selectors that target the final rendered state of the page

For more information about JavaScript rendering, see our Headless Browser documentation.
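As a rough illustration of combining the two features, the sketch below builds a query string that enables rendering and carries a rule set. Only `js_render=true` comes from this page. The `url` and `extract_rules` parameter names are placeholders chosen for the example, so check the API reference for the real names before using this.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_query(target_url, rules, js_render=True):
    """Encode the target URL, rendering flag, and rules as a query string."""
    params = {
        "url": target_url,                                # placeholder name
        "js_render": "true" if js_render else "false",    # from this page
        "extract_rules": json.dumps(rules, separators=(",", ":")),  # placeholder name
    }
    return urlencode(params)


rules = {
    "title": "h1",
    "prices": {"selector": ".price-tag", "type": "list"},
}
query = build_query("https://example.com/products", rules)
print(query)

# Round-trip check: the rules survive URL encoding unchanged.
decoded = parse_qs(query)
print(json.loads(decoded["extract_rules"][0]) == rules)  # True
```

Serializing the rules as compact JSON and letting `urlencode` handle escaping avoids hand-escaping quotes and brackets in selectors.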
