# Domain Restriction

While using Turbo, the latency is reduced drastically. However, one major problem with the convectional way of using HyperCrawl Turbo is that it takes away all the URLs available on the page/domain.&#x20;

### Problem Simulation

Suppose we run the following code :&#x20;

```python

from hypercrawlturbo import scraper

# Define the URL of the webpage to scrape
url_to_scrape = "https://hyperllm.gitbook.io/hyperllm"

# Call the scrape_urls function and pass in the URL
extracted_urls = scraper.scrape_urls(url_to_scrape)

# Process the extracted URLs
for url in extracted_urls:
    print(url)
    # Here you can perform further processing on each URL, such as visiting it or storing it in a database

```

Then, the response that we get is :&#x20;

{% code lineNumbers="true" %}

```xml

https://hyperllm.gitbook.io/hyperllm
https://hyperllm.gitbook.io/hyperllm
https://hyperllm.gitbook.io/hyperllm/company/what-is-hyperllm
https://hyperllm.gitbook.io/hyperllm/company/what-are-our-key-achievements
https://hyperllm.gitbook.io/hyperllm/hypercrawl/what-is-hypercrawl
https://hyperllm.gitbook.io/hyperllm/hypercrawl/versions-and-alterations
https://hyperllm.gitbook.io/hyperllm/hypercrawl/installation
https://hyperllm.gitbook.io/hyperllm/hypercrawl/usage
https://hyperllm.gitbook.io/hyperllm/hypercrawl/performance-testing
https://hyperllm.gitbook.io/hyperllm/hyperefficiency/what-is-hyperefficiency
https://www.gitbook.com/?utm_source=content&utm_medium=trademark&utm_campaign=4Nv6vvgZBuXWPHIU2cl0
https://hyperllm.gitbook.io/hyperllm/company/what-is-hyperllm

```

{% endcode %}

### The Actual Problem&#x20;

Now, in the response, as you can see, it scrapes our docs and gives back the URL. However, in line #12, you can see that it is not a URL we wanted to crawl.&#x20;

We want cralwers to return URLs that we can scrape. However, if we crawl back URLs from sponsored links, then the scrape load skyrockets and we end up scraping tons of unknown data which might tamper our original source retrieval dataset.&#x20;

### The Fix

Now, here's how you can extend the code functionalities to meet this demand :&#x20;

```python

from hypercrawlturbo import scraper
from urllib.parse import urlparse

# Define the URL of the webpage to scrape
url_to_scrape = "https://hyperllm.gitbook.io/hyperllm"

# Parse the domain of the URL to scrape
parsed_url = urlparse(url_to_scrape)
base_domain = parsed_url.netloc

# Call the scrape_urls function and pass in the URL
extracted_urls = scraper.scrape_urls(url_to_scrape)

# Filter and process the extracted URLs
for url in extracted_urls:
    # Parse the domain of the extracted URL
    parsed_extracted_url = urlparse(url)
    extracted_domain = parsed_extracted_url.netloc
    
    # Check if the extracted URL's domain matches the base domain
    if extracted_domain == base_domain:
        print(url)
        # Here you can perform further processing on each URL, such as visiting it or storing it in a database

```

### Explanation

1. **Parsing the Domain**:
   * `urlparse(url_to_scrape)`: This function is used to parse the URL to scrape and extract the base domain (e.g., `hyperllm.gitbook.io`).
   * `base_domain = parsed_url.netloc`: Extracts the network location part of the URL, which is the domain.
2. **Filtering Extracted URLs**:
   * For each URL in `extracted_urls`, the domain is parsed using `urlparse(url)`.
   * The domain of each extracted URL (`extracted_domain = parsed_extracted_url.netloc`) is then compared to the base domain.
   * Only URLs with the same domain as the base domain are printed or further processed.

By adding this domain-checking logic, the script ensures that only URLs belonging to the specified domain are included in the results.&#x20;


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://hyperllm.gitbook.io/hypercrawl/extending-hypercrawlturbo-functionalities/domain-restriction.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
