Data Extraction

Extract structured data from HTML content using CSS or XPath selectors. The extraction rules system covers everything from simple key-value rules to advanced scenarios, including dynamic content and AI-powered extraction.

Key Features

  • 🎯 Precise Targeting: Extract exactly what you need using CSS or XPath selectors
  • 📦 Multiple Formats: Get data as text, HTML, or specific attributes
  • 📋 Bulk Extraction: Extract single items or lists of elements
  • 🔄 Dynamic Content: Works seamlessly with JavaScript-rendered pages
  • 🤖 AI-Powered: Extract data using natural language queries (LLM) (Beta)

Overview

The data extraction feature allows you to:

  • Extract specific elements using CSS or XPath selectors
  • Get data in various formats (text, HTML, attributes)
  • Extract single items or lists of items
  • Use both simple and advanced extraction rules

Basic Usage

The simplest way to extract data is using key-value pairs where the key is your desired data name and the value is the selector:

json
{
    "title": "h1",
    "subtitle": "#subtitle"
}

This will return:

json
{
    "title": "The BitFetcher Blog",
    "subtitle": "We help you get better at web-scraping"
}

Rule Types

Simple String Rules

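For basic extractions, use a string selector. Append @attribute to the selector to extract an attribute instead of the element's text (the // comments below are annotations only; strict JSON does not allow comments):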

json
{
    "title": "h1",                    // Extract text from h1
    "link": "a@href",                 // Extract href attribute from anchor
    "description": ".content p"       // Extract text from paragraph
}
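
To make the shorthand concrete, here is a minimal Python sketch of how a string rule such as "a@href" could be expanded into an explicit selector and output pair. The helper name expand_simple_rule and the regular expression are illustrative assumptions, not the service's actual parser:

```python
import re

# Hypothetical helper: splits "selector@attribute" shorthand into its parts.
# The trailing "@name" must run to the end of the string, so an XPath
# predicate such as [@class='external'] is not mistaken for an attribute suffix.
_ATTR_SUFFIX = re.compile(r"^(?P<selector>.+?)@(?P<attr>[A-Za-z_][\w:.-]*)$")


def expand_simple_rule(rule: str) -> dict:
    """Expand a string rule into an explicit selector/output pair."""
    match = _ATTR_SUFFIX.match(rule)
    if match:
        return {"selector": match["selector"], "output": "@" + match["attr"]}
    # No attribute suffix: text is the documented default output.
    return {"selector": rule, "output": "text"}


print(expand_simple_rule("a@href"))
print(expand_simple_rule(".content p"))
```

Under these assumptions, "a@href" expands to the selector "a" with output "@href", while ".content p" keeps the default text output.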

Advanced Rules

For more complex scenarios, use the full rule object:

json
{
    "products": {
        "selector": ".product-item",    // CSS or XPath selector
        "selector_type": "css",         // "css", "xpath", or "auto" (default)
        "output": "text",               // "text", "html", or "@attribute"
        "type": "list"                  // "item" or "list"
    }
}
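
Before sending a rule object, it can help to check it locally against the allowed values listed in the comments above. The following is a sketch only: validate_rule is a hypothetical helper, and it assumes that "selector" is required and that "item" is the default when "type" is omitted, neither of which is stated explicitly here.

```python
ALLOWED_SELECTOR_TYPES = {"css", "xpath", "auto"}
ALLOWED_TYPES = {"item", "list"}


def validate_rule(name: str, rule: dict) -> list[str]:
    """Return human-readable problems; an empty list means the rule looks fine."""
    problems = []
    # Assumption: every rule object needs a non-empty selector.
    if not rule.get("selector"):
        problems.append(f"{name}: missing 'selector'")
    if rule.get("selector_type", "auto") not in ALLOWED_SELECTOR_TYPES:
        problems.append(f"{name}: selector_type must be css, xpath, or auto")
    output = rule.get("output", "text")
    is_attribute = isinstance(output, str) and output.startswith("@") and len(output) > 1
    if output not in ("text", "html") and not is_attribute:
        problems.append(f"{name}: output must be text, html, or @attribute")
    # Assumption: "item" is the default when "type" is omitted.
    if rule.get("type", "item") not in ALLOWED_TYPES:
        problems.append(f"{name}: type must be item or list")
    return problems


rules = {"products": {"selector": ".product-item", "output": "text", "type": "list"}}
for rule_name, rule_body in rules.items():
    print(rule_name, validate_rule(rule_name, rule_body))
```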

Selector Types

CSS Selectors

A selector that doesn't start with / is treated as a CSS selector by default:

json
{
    "title": "#main-title",
    "meta_description": "meta[name='description']@content",
    "product_prices": {
        "selector": ".price-tag",
        "type": "list"
    }
}

XPath Selectors

A selector that starts with / is automatically detected as XPath. You can also set selector_type to "xpath" explicitly:

json
{
    "external_links": {
        "selector": "//a[@class='external']",
        "selector_type": "xpath",
        "type": "list",
        "output": "@href"
    }
}
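
The detection rule described in this section is small enough to sketch in a few lines of Python. The function name resolve_selector_type is hypothetical; the logic only mirrors the documented behavior (an explicit selector_type wins, otherwise a leading / means XPath):

```python
def resolve_selector_type(selector: str, selector_type: str = "auto") -> str:
    """Mirror the documented rule: explicit type wins, else a leading '/' means XPath."""
    if selector_type != "auto":
        return selector_type
    return "xpath" if selector.startswith("/") else "css"


print(resolve_selector_type("//a[@class='external']"))
print(resolve_selector_type("#main-title"))
```

Setting selector_type explicitly is the safer choice for XPath expressions that do not begin with /, since this sketch (like the rule as documented) would classify them as CSS.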

List Extraction

To extract multiple elements, use type: "list":

json
{
    "product_titles": {
        "selector": ".product-item h2",
        "type": "list",
        "output": "text"
    },
    "product_images": {
        "selector": ".product-item img",
        "type": "list",
        "output": "@src"
    }
}

Response:

json
{
    "product_titles": [
        "iPhone 15 Pro Max",
        "Samsung Galaxy S24"
    ],
    "product_images": [
        "iphone15.jpg",
        "galaxy24.jpg"
    ]
}
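
Each list rule returns its own array, so related values arrive as parallel lists. A short Python sketch of pairing them into one record per product follows; pair_products is a hypothetical helper, and it assumes both lists are in document order and have the same length, which will not hold if some products lack an image.

```python
def pair_products(response: dict) -> list[dict]:
    """Combine the parallel title and image lists into one record per product."""
    titles = response.get("product_titles") or []
    images = response.get("product_images") or []
    # Refuse to pair lists of different lengths rather than misalign records.
    if len(titles) != len(images):
        raise ValueError(f"got {len(titles)} titles but {len(images)} images")
    return [{"title": title, "image": image} for title, image in zip(titles, images)]


response = {
    "product_titles": ["iPhone 15 Pro Max", "Samsung Galaxy S24"],
    "product_images": ["iphone15.jpg", "galaxy24.jpg"],
}
for product in pair_products(response):
    print(product)
```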

Output Types

The extractor supports three output formats:

  • text: Extracts text content (default)
  • html: Extracts HTML content
  • @attribute: Extracts specific HTML attributes (e.g., @href, @src)

Text Output

json
{
    "title": {
        "selector": "h1",
        "output": "text"
    }
}

HTML Output

json
{
    "content": {
        "selector": ".article",
        "output": "html"
    }
}

Attribute Output

json
{
    "image_url": {
        "selector": "img.hero",
        "output": "@src"
    }
}
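
To illustrate the difference between the three output formats, the sketch below approximates them with Python's standard-library ElementTree on a small, well-formed snippet. This is only an analogy: the snippet is made up, ElementTree is not the service's parser, and whether "html" returns the element's outer or inner markup is an assumption here (outer markup is shown).

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup used only for this illustration.
SNIPPET = (
    "<div class='article'>"
    "<h1>Hello <em>world</em></h1>"
    "<img class='hero' src='hero.jpg' />"
    "</div>"
)

root = ET.fromstring(SNIPPET)
heading = root.find("h1")
image = root.find("img[@class='hero']")

# "text": the element's text content with nested tags stripped.
text_output = "".join(heading.itertext())
# "html": the element's markup, nested tags included.
html_output = ET.tostring(heading, encoding="unicode")
# "@src": the value of a single attribute.
attr_output = image.get("src")

print(text_output)
print(html_output)
print(attr_output)
```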

Real-World Examples

Email Address Extraction

Extract the email links on a page. The rule returns the raw href of each mailto link, so each value is expected to keep its mailto: prefix:

json
{
    "email_addresses": {
        "selector": "a[href^='mailto']",
        "output": "@href",
        "type": "list"
    }
}
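
Because the rule returns href values rather than bare addresses, some client-side cleanup is usually needed. Here is a minimal Python sketch, assuming the values look like standard mailto: URLs; emails_from_mailto is a hypothetical helper and not part of the extractor.

```python
from urllib.parse import unquote, urlsplit


def emails_from_mailto(hrefs):
    """Return unique addresses, in first-seen order, from mailto: hrefs."""
    seen = {}
    for href in hrefs or []:
        parts = urlsplit(href.strip())
        if parts.scheme != "mailto":
            continue  # ignore anything that is not a mailto link
        # The path holds the address list; ?subject=... lands in parts.query.
        for address in unquote(parts.path).split(","):
            address = address.strip()
            if address:
                # Deduplicate case-insensitively, keeping the first spelling seen.
                seen.setdefault(address.lower(), address)
    return list(seen.values())


hrefs = ["mailto:sales@example.com?subject=Hi", "mailto:Sales@example.com"]
print(emails_from_mailto(hrefs))
```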

Product Information

Extract page metadata, product prices, and navigation links with a single set of rules:

json
{
    "page_title": "h1",
    "meta_description": "meta[name='description']@content",
    "product_prices": {
        "selector": ".price-tag",
        "type": "list",
        "output": "text"
    },
    "navigation_links": {
        "selector": "nav a",
        "type": "list",
        "output": "@href"
    }
}

Error Handling

The extractor handles errors gracefully:

  • Skips individual rules that fail to match
  • Continues processing other rules even if some fail
  • Returns null for unmatched selectors
  • Provides meaningful error messages for debugging
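
On the client side, this behavior means a response should be checked rule by rule rather than assumed complete. The sketch below separates matched from unmatched rules; split_results is a hypothetical helper, and it treats both a null value and an absent key as "unmatched", since the exact shape for skipped rules is an assumption.

```python
def split_results(rules: dict, response: dict):
    """Separate rules that produced a value from those that came back empty."""
    found = {}
    missing = []
    for name in rules:
        value = response.get(name)  # None for both null values and absent keys
        if value is None:
            missing.append(name)
        else:
            found[name] = value
    return found, missing


rules = {"title": "h1", "subtitle": "#subtitle"}
response = {"title": "The BitFetcher Blog", "subtitle": None}
found, missing = split_results(rules, response)
print("matched:", found)
print("unmatched:", missing)
```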

Best Practices

  1. Use Specific Selectors: More specific selectors are less likely to break with HTML changes
  2. Choose Appropriate Types: Use type: "list" when expecting multiple elements
  3. Prefer CSS Selectors: Use CSS selectors when possible for better readability
  4. Test Edge Cases: Verify your rules work with empty results and malformed HTML
  5. Use Attribute Extraction: When possible, extract attributes directly instead of parsing text

Integration with JavaScript Rendering

When using data extraction with JavaScript-rendered pages, make sure to:

  1. Enable JavaScript rendering with js_render=true
  2. Set appropriate wait times for dynamic content to load
  3. Use selectors that target the final rendered state of the page

For more information about JavaScript rendering, see our Headless Browser documentation.
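
As a rough illustration of combining the steps above, the following Python sketch builds a query string that enables JavaScript rendering and carries the extraction rules. Only js_render=true comes from this page; the parameter names url, extract_rules, and wait, and the idea of passing rules as a JSON-encoded string, are assumptions to be checked against the API reference.

```python
import json
from urllib.parse import urlencode


def build_query(target_url: str, rules: dict, wait_ms: int = 2000) -> str:
    """Build a URL-encoded query string for a JavaScript-rendered extraction."""
    params = {
        "url": target_url,  # hypothetical parameter name
        "js_render": "true",  # documented on this page
        # Hypothetical parameter name; rules serialized as compact JSON.
        "extract_rules": json.dumps(rules, separators=(",", ":")),
        "wait": wait_ms,  # hypothetical wait time in milliseconds
    }
    return urlencode(params)


print(build_query("https://example.com", {"title": "h1"}))
```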