Data Extraction
Extract structured data from HTML content using CSS or XPath selectors. The extraction rules system covers everything from simple selectors to advanced scenarios, including dynamic content and AI-powered extraction.
Key Features
- 🎯 Precise Targeting: Extract exactly what you need using CSS or XPath selectors
- 📦 Multiple Formats: Get data as text, HTML, or specific attributes
- 📋 Bulk Extraction: Extract single items or lists of elements
- 🔄 Dynamic Content: Works seamlessly with JavaScript-rendered pages
- 🤖 AI-Powered: Extract data using natural language queries (LLM) (Beta)
Overview
The data extraction feature allows you to:
- Extract specific elements using CSS or XPath selectors
- Get data in various formats (text, HTML, attributes)
- Extract single items or lists of items
- Use both simple and advanced extraction rules
Basic Usage
The simplest way to extract data is using key-value pairs where the key is your desired data name and the value is the selector:
json
{
"title": "h1",
"subtitle": "#subtitle"
}
This will return:
json
{
"title": "The BitFetcher Blog",
"subtitle": "We help you get better at web-scraping"
}
For simple extractions, you can use shorthand notation. The API will automatically detect the selector type and extract the text content.
Rule Types
Simple String Rules
For basic extractions, use a string selector:
json
{
"title": "h1", // Extract text from h1
"link": "a@href", // Extract href attribute from anchor
"description": ".content p" // Extract text from paragraph
}
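To make the shorthand concrete, the following Python sketch shows one way a string rule could be expanded into a full rule object. It is an illustration, not the API's actual parser: `expand_rule` is a hypothetical helper, the `"item"` default for `type` is an assumption, and so is the choice to honor the `@attribute` suffix only on CSS selectors (since `@` is ordinary syntax inside XPath).

```python
import re

# Assumption: a trailing "@name" is shorthand for attribute output on CSS selectors.
_ATTR_SUFFIX = re.compile(r"^(?P<selector>.+?)@(?P<attr>[A-Za-z_][\w-]*)$")


def expand_rule(rule: str) -> dict:
    """Expand a shorthand string rule into a full rule object (hypothetical helper)."""
    if rule.startswith("/"):
        # A leading "/" marks the selector as XPath; leave any "@" untouched.
        return {"selector": rule, "selector_type": "xpath",
                "output": "text", "type": "item"}
    match = _ATTR_SUFFIX.match(rule)
    if match:
        # "a@href" -> selector "a", output "@href"
        return {"selector": match["selector"], "selector_type": "css",
                "output": "@" + match["attr"], "type": "item"}
    # Plain CSS selector: extract text content.
    return {"selector": rule, "selector_type": "css",
            "output": "text", "type": "item"}


link_rule = expand_rule("a@href")
meta_rule = expand_rule("meta[name='description']@content")
xpath_rule = expand_rule("//a[@class='external']")
```

Writing the expanded form out by hand is rarely necessary, but it helps when a shorthand rule behaves unexpectedly and you want to see which selector and output it implies.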
Advanced Rules
For more complex scenarios, use the full rule object:
json
{
"products": {
"selector": ".product-item", // CSS or XPath selector
"selector_type": "css", // "css", "xpath", or "auto" (default)
"output": "text", // "text", "html", or "@attribute"
"type": "list" // "item" or "list"
}
}
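Because the full rule object accepts only a small set of values per field, it can be worth checking rules on the client before sending them. The sketch below is a minimal validator based on the values listed above. `rule_problems` is a hypothetical helper, and the `"item"` default for `type` is an assumption.

```python
ALLOWED_SELECTOR_TYPES = ("css", "xpath", "auto")
ALLOWED_TYPES = ("item", "list")


def rule_problems(rule: dict) -> list[str]:
    """Return human-readable problems found in a full rule object."""
    problems = []

    selector = rule.get("selector")
    if not isinstance(selector, str) or not selector.strip():
        problems.append("selector must be a non-empty string")

    # "auto" is the documented default for selector_type.
    if rule.get("selector_type", "auto") not in ALLOWED_SELECTOR_TYPES:
        problems.append("selector_type must be 'css', 'xpath', or 'auto'")

    # "text" is the documented default; attributes are written as "@name".
    output = rule.get("output", "text")
    is_attribute = isinstance(output, str) and output.startswith("@") and len(output) > 1
    if output not in ("text", "html") and not is_attribute:
        problems.append("output must be 'text', 'html', or '@attribute'")

    # Assumption: a rule without "type" behaves like "item".
    if rule.get("type", "item") not in ALLOWED_TYPES:
        problems.append("type must be 'item' or 'list'")

    return problems


ok = rule_problems({"selector": ".product-item", "selector_type": "css",
                    "output": "text", "type": "list"})
bad = rule_problems({"selector": "", "output": "json", "type": "many"})
```

Here `ok` is an empty list, while `bad` reports one problem each for the selector, the output, and the type.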
Selector Types
CSS Selectors
CSS selectors are the default when the selector doesn't start with /:
json
{
"title": "#main-title",
"meta_description": "meta[name='description']@content",
"product_prices": {
"selector": ".price-tag",
"type": "list"
}
}
XPath Selectors
XPath selectors are detected automatically when the selector starts with /, or can be specified explicitly:
json
{
"external_links": {
"selector": "//a[@class='external']",
"selector_type": "xpath",
"type": "list",
"output": "@href"
}
}
List Extraction
To extract multiple elements, use type: "list":
json
{
"product_titles": {
"selector": ".product-item h2",
"type": "list",
"output": "text"
},
"product_images": {
"selector": ".product-item img",
"type": "list",
"output": "@src"
}
}
Response:
json
{
"product_titles": [
"iPhone 15 Pro Max",
"Samsung Galaxy S24"
],
"product_images": [
"iphone15.jpg",
"galaxy24.jpg"
]
}
When using list extraction, make sure your selector matches all desired elements. Test with different page states to ensure reliability.
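Each list rule is extracted independently, so the response contains parallel lists rather than one record per product. The Python sketch below pairs them up by position and flags uneven lengths, which usually mean that some product is missing an element and the lists are no longer aligned. `to_records` is a hypothetical helper, and the response is the sample shown above.

```python
from itertools import zip_longest

response = {
    "product_titles": ["iPhone 15 Pro Max", "Samsung Galaxy S24"],
    "product_images": ["iphone15.jpg", "galaxy24.jpg"],
}


def to_records(response: dict, fields: list[str]) -> list[dict]:
    """Combine parallel list results into one dict per position."""
    # Treat a missing or null field as an empty list.
    columns = [response.get(field) or [] for field in fields]
    # zip_longest pads shorter lists with None instead of dropping rows.
    return [dict(zip(fields, row)) for row in zip_longest(*columns)]


def lists_aligned(response: dict, fields: list[str]) -> bool:
    """True when every requested list has the same length."""
    return len({len(response.get(field) or []) for field in fields}) <= 1


fields = ["product_titles", "product_images"]
records = to_records(response, fields)
aligned = lists_aligned(response, fields)
```

If `aligned` is false, pairing by index is unreliable, and a more specific selector is the safer fix.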
Output Types
The extractor supports three output formats:
- text: Extracts text content (default)
- html: Extracts HTML content
- @attribute: Extracts a specific HTML attribute (e.g., @href, @src)
Text Output
json
{
"title": {
"selector": "h1",
"output": "text"
}
}
HTML Output
json
{
"content": {
"selector": ".article",
"output": "html"
}
}
Attribute Output
json
{
"image_url": {
"selector": "img.hero",
"output": "@src"
}
}
Real-World Examples
Email Address Extraction
Extract all email addresses from a page by collecting the href of every mailto link:
json
{
"email_addresses": {
"selector": "a[href^='mailto']",
"output": "@href",
"type": "list"
}
}
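Since this rule outputs the `href` attribute, the returned values are likely to be full `mailto:` URLs rather than bare addresses. Assuming that is the case, the Python sketch below cleans them up using only the standard library. `emails_from_mailto` is a hypothetical helper and the addresses are made-up sample data.

```python
from urllib.parse import unquote, urlsplit


def emails_from_mailto(hrefs) -> list[str]:
    """Turn mailto: hrefs into a de-duplicated list of addresses."""
    emails = []
    for href in hrefs or []:
        parts = urlsplit(href.strip())
        if parts.scheme != "mailto":
            continue  # skip anything that is not a mailto link
        # The path holds the recipients; the ?subject=... query is dropped.
        for address in unquote(parts.path).split(","):
            address = address.strip()
            if address and address not in emails:
                emails.append(address)
    return emails


sample = [
    "mailto:sales@example.com?subject=Hi",
    "mailto:a@example.com,b@example.com",
    "mailto:sales@example.com",
]
emails = emails_from_mailto(sample)
```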
Product Information
Extract detailed product information:
json
{
"page_title": "h1",
"meta_description": "meta[name='description']@content",
"product_prices": {
"selector": ".price-tag",
"type": "list",
"output": "text"
},
"navigation_links": {
"selector": "nav a",
"type": "list",
"output": "@href"
}
}
Error Handling
The extractor handles errors gracefully:
- Skips individual rules that fail to match
- Continues processing other rules even if some fail
- Returns null for unmatched selectors
- Provides meaningful error messages for debugging
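Because unmatched selectors come back as null instead of raising an error, client code should check for missing values explicitly. The Python sketch below is one way to do that. `missing_fields` is a hypothetical helper and the response body is sample data. Whether an unmatched list rule returns null or an empty list is not stated here, so the sketch reports both.

```python
import json

# Sample response body; the field names reuse examples from this page.
raw_body = '{"title": "The BitFetcher Blog", "subtitle": null, "product_prices": []}'


def missing_fields(data: dict) -> list[str]:
    """Return the names of rules that produced no data."""
    # Assumption: an unmatched list rule may be null or an empty list.
    return sorted(
        name for name, value in data.items()
        if value is None or value == []
    )


data = json.loads(raw_body)
missing = missing_fields(data)
if missing:
    print("No match for:", ", ".join(missing))
```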
Best Practices
- Use Specific Selectors: More specific selectors are less likely to break with HTML changes
- Choose Appropriate Types: Use type: "list" when expecting multiple elements
- Prefer CSS Selectors: Use CSS selectors when possible for better readability
- Test Edge Cases: Verify your rules work with empty results and malformed HTML
- Use Attribute Extraction: When possible, extract attributes directly instead of parsing text
Integration with JavaScript Rendering
When using data extraction with JavaScript-rendered pages, make sure to:
- Enable JavaScript rendering with js_render=true
- Set appropriate wait times for dynamic content to load
- Use selectors that target the final rendered state of the page
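As a rough illustration of the checklist above, the Python sketch below assembles a query string that combines JavaScript rendering with extraction rules. Only `js_render=true` comes from this page. The parameter names `url`, `extract_rules`, and `wait` are hypothetical placeholders, so check the API reference for the real names before using this.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_query(target_url: str, rules: dict, wait_ms=None) -> str:
    """Build a URL-encoded query string for a rendered extraction request."""
    params = {
        "url": target_url,        # hypothetical parameter name
        "js_render": "true",      # documented on this page
        # Hypothetical parameter name; rules are sent as compact JSON.
        "extract_rules": json.dumps(rules, separators=(",", ":")),
    }
    if wait_ms is not None:
        params["wait"] = str(wait_ms)  # hypothetical wait-time parameter
    return urlencode(params)


rules = {"product_prices": {"selector": ".price-tag", "type": "list"}}
query = build_query("https://example.com/products", rules, wait_ms=2000)

# Round trip: decoding the query string recovers the original rules.
decoded = parse_qs(query)
decoded_rules = json.loads(decoded["extract_rules"][0])
```

Serializing the rules with `json.dumps` and letting `urlencode` handle escaping avoids hand-built query strings, which tend to break on quotes and brackets in selectors.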
For more information about JavaScript rendering, see our Headless Browser documentation.