- Google Search: Enables comprehensive extraction of Google SERP data across all result types.
  - Supports selection of localized Google domains (e.g., `google.com`, `google.ad`) to retrieve region-specific search results.
  - Pagination supported for retrieving results beyond the first page.
  - Supports a search result filtering toggle to control whether to exclude duplicate or similar content.
- Google Trends: Retrieves keyword trend data from Google, including popularity over time, regional interest, and related searches.
  - Supports multi-keyword comparison.
  - Supports multiple data types: `interest_over_time`, `interest_by_region`, `related_queries`, and `related_topics`.
  - Allows filtering by specific Google properties (Web, YouTube, News, Shopping) for source-specific trend analysis.
- Universal Scraping: Designed for modern, JavaScript-heavy websites, allowing dynamic content extraction.
  - Global premium proxy support for bypassing geo-restrictions and improving reliability.
- Crawl: Recursively crawl a website and its linked pages to extract site-wide content.
  - Supports configurable crawl depth and scoped URL targeting.
- Scrape: Extract content from a single webpage with high precision.
  - Supports “main content only” extraction to exclude ads, footers, and other non-essential elements.
  - Allows batch scraping of multiple standalone URLs.
## Overview

### Integration details
| Class | Package | Serializable | JS support | Version |
| --- | --- | --- | --- | --- |
| ScrapelessUniversalScrapingTool | langchain-scrapeless | ✅ | ❌ | |
### Tool features

| Native async | Returns artifact | Return data |
| --- | --- | --- |
| ✅ | ✅ | html, markdown, links, metadata, structured content |
## Setup

The integration lives in the `langchain-scrapeless` package.

```python
!pip install langchain-scrapeless
```
### Credentials
You’ll need a Scrapeless API key to use this tool. You can set it as an environment variable:
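For example, assuming the key is read from a `SCRAPELESS_API_KEY` environment variable (verify the exact variable name against the package documentation), a minimal setup might look like:

```python
import os

# Assumed environment variable name; check the langchain-scrapeless docs.
if not os.environ.get("SCRAPELESS_API_KEY"):
    os.environ["SCRAPELESS_API_KEY"] = "your-api-key"
```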
## Instantiation

Here we show how to instantiate an instance of the Scrapeless Universal Scraping Tool. This tool allows you to scrape any website using a headless browser with JavaScript rendering capabilities, customizable output types, and geo-specific proxy support.

The tool accepts the following parameters during instantiation:

- `url` (required, str): The URL of the website to scrape.
- `headless` (optional, bool): Whether to use a headless browser. Default is `True`.
- `js_render` (optional, bool): Whether to enable JavaScript rendering. Default is `True`.
- `js_wait_until` (optional, str): Defines when to consider the JavaScript-rendered page ready. Default is `'domcontentloaded'`. Options include:
  - `load`: Wait until the page is fully loaded.
  - `domcontentloaded`: Wait until the DOM is fully loaded.
  - `networkidle0`: Wait until the network is idle.
  - `networkidle2`: Wait until the network is idle for 2 seconds.
- `outputs` (optional, str): The specific type of data to extract from the page. Options include: `phone_numbers`, `headings`, `images`, `audios`, `videos`, `links`, `menus`, `hashtags`, `emails`, `metadata`, `tables`, and `favicon`.
- `response_type` (optional, str): Defines the format of the response. Default is `'html'`. Options include:
  - `html`: Return the raw HTML of the page.
  - `plaintext`: Return the plain text content.
  - `markdown`: Return a Markdown version of the page.
  - `png`: Return a PNG screenshot.
  - `jpeg`: Return a JPEG screenshot.
- `response_image_full_page` (optional, bool): Whether to capture and return a full-page image when using screenshot output (`png` or `jpeg`). Default is `False`.
- `selector` (optional, str): A specific CSS selector to scope scraping within a part of the page. Default is `None`.
- `proxy_country` (optional, str): Two-letter country code for geo-specific proxy access (e.g., `'us'`, `'gb'`, `'de'`, `'jp'`). Default is `'ANY'`.
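To illustrate how these parameters and their defaults fit together, the sketch below uses a hypothetical `build_scrape_params` helper (not part of `langchain-scrapeless`) that assembles and validates a parameter dictionary mirroring the documented defaults; the real tool accepts these values directly:

```python
def build_scrape_params(url, **overrides):
    """Assemble a parameter dict mirroring the documented defaults.

    Hypothetical helper for illustration only.
    """
    params = {
        "url": url,
        "headless": True,
        "js_render": True,
        "js_wait_until": "domcontentloaded",
        "outputs": None,
        "response_type": "html",
        "response_image_full_page": False,
        "selector": None,
        "proxy_country": "ANY",
    }
    unknown = set(overrides) - set(params)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params.update(overrides)

    # Enforce the documented option sets.
    if params["js_wait_until"] not in {
        "load", "domcontentloaded", "networkidle0", "networkidle2"
    }:
        raise ValueError(f"invalid js_wait_until: {params['js_wait_until']}")
    if params["response_type"] not in {
        "html", "plaintext", "markdown", "png", "jpeg"
    }:
        raise ValueError(f"invalid response_type: {params['response_type']}")
    return params


# Example: geo-targeted Markdown scrape of a single page (illustrative URL).
params = build_scrape_params(
    "https://example.com",
    response_type="markdown",
    proxy_country="us",
)
```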