While using Turbo, latency is reduced drastically. However, one major problem with the conventional way of using HyperCrawl Turbo is that it returns every URL available on the page/domain, including links we never intended to crawl.
Problem Simulation
Suppose we run the following code:
from hypercrawlturbo import scraper

# Define the URL of the webpage to scrape
url_to_scrape = "https://hyperllm.gitbook.io/hyperllm"

# Call the scrape_urls function and pass in the URL
extracted_urls = scraper.scrape_urls(url_to_scrape)

# Process the extracted URLs
for url in extracted_urls:
    print(url)
    # Here you can perform further processing on each URL, such as visiting it or storing it in a database
Now, in the response, as you can see, it scrapes our docs and gives back the URLs. However, in line #12 of the output, you can see a URL that we did not want to crawl.
We want crawlers to return URLs that we can scrape. However, if the crawler also returns URLs from sponsored links, the scrape load skyrockets and we end up scraping tons of unknown data, which might contaminate our original source-retrieval dataset.
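To make the problem concrete, here is a hypothetical example of the kind of mixed output the crawler can return; the URLs below are illustrative, not actual output:

# Hypothetical example of what extracted_urls might contain
extracted_urls = [
    "https://hyperllm.gitbook.io/hyperllm/getting-started",
    "https://hyperllm.gitbook.io/hyperllm/turbo",
    "https://sponsored-ads.example.com/landing",  # external/sponsored link we do not want
]

# Scraping everything blindly also pulls in the external page
for url in extracted_urls:
    print(url)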
The Fix
Now, here's how you can extend the code to meet this requirement:
from hypercrawlturbo import scraper
from urllib.parse import urlparse

# Define the URL of the webpage to scrape
url_to_scrape = "https://hyperllm.gitbook.io/hyperllm"

# Parse the domain of the URL to scrape
parsed_url = urlparse(url_to_scrape)
base_domain = parsed_url.netloc

# Call the scrape_urls function and pass in the URL
extracted_urls = scraper.scrape_urls(url_to_scrape)

# Filter and process the extracted URLs
for url in extracted_urls:
    # Parse the domain of the extracted URL
    parsed_extracted_url = urlparse(url)
    extracted_domain = parsed_extracted_url.netloc

    # Check if the extracted URL's domain matches the base domain
    if extracted_domain == base_domain:
        print(url)
        # Here you can perform further processing on each URL, such as visiting it or storing it in a database
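If you need this filter in more than one place, the same logic can be wrapped in a small helper. This is a minimal sketch; filter_same_domain is a hypothetical name, not part of the HyperCrawl API:

from urllib.parse import urlparse

def filter_same_domain(urls, base_url):
    # Keep only URLs whose domain matches the domain of base_url
    base_domain = urlparse(base_url).netloc
    return [url for url in urls if urlparse(url).netloc == base_domain]

# Usage with the variables defined above:
# same_domain_urls = filter_same_domain(extracted_urls, url_to_scrape)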
Explanation
Parsing the Domain:
urlparse(url_to_scrape): This function parses the URL to scrape into its components, from which we extract the base domain (e.g., hyperllm.gitbook.io).
base_domain = parsed_url.netloc: Extracts the network location part of the URL, which is the domain.
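As a quick illustration of what urlparse returns (this is standard-library behaviour, independent of HyperCrawl):

from urllib.parse import urlparse

parsed = urlparse("https://hyperllm.gitbook.io/hyperllm")
print(parsed.scheme)  # https
print(parsed.netloc)  # hyperllm.gitbook.io  <- the domain we compare against
print(parsed.path)    # /hyperllm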
Filtering Extracted URLs:
For each URL in extracted_urls, the domain is parsed using urlparse(url).
The domain of each extracted URL (extracted_domain = parsed_extracted_url.netloc) is then compared to the base domain.
Only URLs with the same domain as the base domain are printed or further processed.
By adding this domain-checking logic, the script ensures that only URLs belonging to the specified domain are included in the results.
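One caveat: netloc matches the host exactly, so subdomains (docs.example.com vs. example.com) and ports (example.com:8080) count as different domains. If you want to treat subdomains as part of the same site, here is a minimal sketch of a looser check; same_registered_host is a hypothetical helper, and a simple suffix check is not a substitute for full public-suffix handling:

from urllib.parse import urlparse

def same_registered_host(url, base_domain):
    # Loose match: treat sub.example.com as belonging to example.com
    host = urlparse(url).netloc.split(":")[0]  # drop any port
    return host == base_domain or host.endswith("." + base_domain)

print(same_registered_host("https://docs.example.com/page", "example.com"))  # True
print(same_registered_host("https://evil-example.com/page", "example.com"))  # False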