Domain Restriction

Using Turbo reduces latency drastically. However, one major problem with the conventional way of using HyperCrawl Turbo is that it returns every URL found on the page, regardless of which domain it belongs to.

Problem Simulation

Suppose we run the following code:


from hypercrawlturbo import scraper

# Define the URL of the webpage to scrape
url_to_scrape = "https://hyperllm.gitbook.io/hyperllm"

# Call the scrape_urls function and pass in the URL
extracted_urls = scraper.scrape_urls(url_to_scrape)

# Process the extracted URLs
for url in extracted_urls:
    print(url)
    # Here you can perform further processing on each URL, such as visiting it or storing it in a database

Then, the response that we get is:


https://hyperllm.gitbook.io/hyperllm
https://hyperllm.gitbook.io/hyperllm
https://hyperllm.gitbook.io/hyperllm/company/what-is-hyperllm
https://hyperllm.gitbook.io/hyperllm/company/what-are-our-key-achievements
https://hyperllm.gitbook.io/hyperllm/hypercrawl/what-is-hypercrawl
https://hyperllm.gitbook.io/hyperllm/hypercrawl/versions-and-alterations
https://hyperllm.gitbook.io/hyperllm/hypercrawl/installation
https://hyperllm.gitbook.io/hyperllm/hypercrawl/usage
https://hyperllm.gitbook.io/hyperllm/hypercrawl/performance-testing
https://hyperllm.gitbook.io/hyperllm/hyperefficiency/what-is-hyperefficiency
https://www.gitbook.com/?utm_source=content&utm_medium=trademark&utm_campaign=4Nv6vvgZBuXWPHIU2cl0
https://hyperllm.gitbook.io/hyperllm/company/what-is-hyperllm

The Actual Problem

In the response, you can see that the scraper crawls our docs and returns their URLs. However, the https://www.gitbook.com link near the end of the output is not a URL we wanted to crawl.

We want crawlers to return URLs that we can scrape. However, if the crawler also returns URLs from sponsored or external links, the scrape load skyrockets and we end up scraping large amounts of unrelated data, which can contaminate our original source-retrieval dataset.
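
To see why a plain scrape mixes domains, you can compare the netloc (network location) of each URL using Python's standard urllib.parse module. The snippet below is a small illustration using two URLs from the output above; it relies only on the standard library.


from urllib.parse import urlparse

# Domain of the page we actually want to crawl
print(urlparse("https://hyperllm.gitbook.io/hyperllm").netloc)
# -> hyperllm.gitbook.io

# Domain of the sponsored/external link that appeared in the output
print(urlparse("https://www.gitbook.com/?utm_source=content&utm_medium=trademark&utm_campaign=4Nv6vvgZBuXWPHIU2cl0").netloc)
# -> www.gitbook.com

Because the two netloc values differ, a simple equality check against the base domain is enough to drop the external link, which is exactly what the fix below does.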

The Fix

Here's how you can extend the code to meet this requirement:


from hypercrawlturbo import scraper
from urllib.parse import urlparse

# Define the URL of the webpage to scrape
url_to_scrape = "https://hyperllm.gitbook.io/hyperllm"

# Parse the domain of the URL to scrape
parsed_url = urlparse(url_to_scrape)
base_domain = parsed_url.netloc

# Call the scrape_urls function and pass in the URL
extracted_urls = scraper.scrape_urls(url_to_scrape)

# Filter and process the extracted URLs
for url in extracted_urls:
    # Parse the domain of the extracted URL
    parsed_extracted_url = urlparse(url)
    extracted_domain = parsed_extracted_url.netloc
    
    # Check if the extracted URL's domain matches the base domain
    if extracted_domain == base_domain:
        print(url)
        # Here you can perform further processing on each URL, such as visiting it or storing it in a database

Explanation

  1. Parsing the Domain:

    • urlparse(url_to_scrape): Parses the URL we want to scrape into its components so that the base domain (e.g., hyperllm.gitbook.io) can be extracted.

    • base_domain = parsed_url.netloc: Extracts the network location part of the URL, which is the domain.

  2. Filtering Extracted URLs:

    • For each URL in extracted_urls, the domain is parsed using urlparse(url).

    • The domain of each extracted URL (extracted_domain = parsed_extracted_url.netloc) is then compared to the base domain.

    • Only URLs with the same domain as the base domain are printed or further processed.

By adding this domain-checking logic, the script ensures that only URLs belonging to the specified domain are included in the results.
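
If you apply this pattern to several starting URLs, you can wrap the same logic in a small helper. The sketch below is an illustrative refactor of the code above, not part of the HyperCrawl Turbo API; the helper name filter_urls_to_domain is made up here, and it assumes scraper.scrape_urls behaves as shown earlier.


from urllib.parse import urlparse

from hypercrawlturbo import scraper


def filter_urls_to_domain(start_url):
    """Scrape start_url and keep only URLs on the same domain (illustrative helper)."""
    base_domain = urlparse(start_url).netloc
    return [
        url
        for url in scraper.scrape_urls(start_url)
        if urlparse(url).netloc == base_domain
    ]


# Usage: only hyperllm.gitbook.io URLs are printed
for url in filter_urls_to_domain("https://hyperllm.gitbook.io/hyperllm"):
    print(url)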
