The Crawler is a powerful tool designed to streamline the process of creating and maintaining your agent’s knowledge base. By automatically scraping and importing content from specified websites, the crawler ensures your agent stays up-to-date with the latest information.

How the Crawler Works

The crawler operates by systematically visiting web pages, extracting relevant content, and organizing it into markdown suitable for your agent’s knowledge base. This process involves crawling through links, scraping text, and formatting the data for optimal use by AI models.

Crawler Jobs

The core of the crawler functionality revolves around crawler jobs. Each job is associated with specific source URLs you want to crawl and is identified by a unique ID. The developer API offers several ways to interact with the crawler.
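As a rough illustration of what creating a job over a REST-style API might look like, here is a minimal Python sketch. The endpoint path, payload field names, and auth header are assumptions for this example, not the documented TIXAE API; consult the API reference for the actual contract.

```python
import json
import urllib.request

# Hypothetical endpoint and payload shape -- assumptions for illustration,
# not the documented TIXAE API contract.
API_BASE = "https://api.example.com/v1"

payload = {
    "sourceUrls": ["https://www.tixaeagents.ai"],  # where the crawl begins
    "refreshRate": "24h",                          # assumed enum value
    "pageLimit": 100,                              # 10-500 per the dashboard
}

req = urllib.request.Request(
    f"{API_BASE}/crawler/jobs",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",  # placeholder credential
    },
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    job = json.load(resp)
    print("Created job:", job.get("id"))  # each job is identified by a unique ID
```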

(Screenshot: the job details UI.)

Creating a New Crawler Job

To initiate a new crawler job, follow these steps:

1. Set Source URLs

Navigate to the Crawler tab from the menu on the left side of your dashboard, click New Job, and enter the main URL(s) you want the crawler to begin with (e.g., https://www.tixaeagents.ai).

Ensure each URL is valid and fully qualified, in the format https://example.com.
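If you assemble source URLs programmatically, a quick standard-library check can catch malformed entries before you create a job:

```python
from urllib.parse import urlparse

def is_valid_source_url(url: str) -> bool:
    """Accept only absolute http(s) URLs such as https://example.com."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_valid_source_url("https://www.tixaeagents.ai"))  # True
print(is_valid_source_url("www.tixaeagents.ai"))          # False: scheme missing
```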

2. Configure Crawl Settings

There are two scrape options to consider that determine the quality of the scrape: a normal scrape and a deep scrape. A deep scrape is more thorough but costs more credits per page (see Monitor Credit Usage below).

3. Set the Crawler Refresh Rate

The crawler refresh rate determines how often the crawler will update the current job with potential new information from the scraped site. This is particularly useful for websites that update frequently, such as e-commerce sites.

You can create separate crawler jobs for different sub-pages. This allows you to set different refresh rates for various sections of a website. For example, on a Shopify site, you might want to update /collections/protein-powder more frequently than the main page, as product information changes more often.
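Conceptually, that Shopify setup amounts to two jobs with different refresh rates. A minimal sketch, with hypothetical field names and rate strings (the dashboard exposes these as UI settings, not a schema):

```python
# Hypothetical job definitions showing per-section refresh rates.
# Field names and rate strings are assumptions for this sketch.
jobs = [
    # Main site content changes rarely: refresh weekly.
    {"sourceUrl": "https://shop.example.com/", "refreshRate": "7d"},
    # Product pages change often: refresh every 6 hours.
    {"sourceUrl": "https://shop.example.com/collections/protein-powder",
     "refreshRate": "6h"},
]
```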

Available Refresh Rate Options:

- Every 1 hour
- Every 6 hours
- Every 12 hours
- Every 24 hours
- Every 7 days
- Never
4. Specify Page Limit

Set the maximum number of pages to scrape for that job, ranging from 10 up to 500 pages.

Review the sitemap beforehand to determine the optimal number of pages to scrape. To view a site’s sitemap, append /sitemap.xml to a valid URL, e.g., https://www.tixaeagents.ai/sitemap.xml.
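If counting sitemap entries by hand is tedious, a short standard-library script can tally them for you (this assumes a single sitemap file using the standard namespace, not a sitemap index):

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.tixaeagents.ai/sitemap.xml"

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# Standard sitemaps list pages as <url><loc>...</loc></url> entries.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
page_urls = [el.text for el in tree.findall(".//sm:loc", ns)]

# Clamp to the crawler's supported range of 10-500 pages.
suggested_limit = min(max(len(page_urls), 10), 500)
print(f"{len(page_urls)} URLs in sitemap; suggested page limit: {suggested_limit}")
```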

5. Define URL Patterns

- Match URLs: include subpages you want to scrape, based on the source URL.
- Unmatch Patterns: specify URLs or patterns to exclude from the scrape.
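To see how the two pattern lists interact, here is a minimal filtering sketch using glob-style matching; the crawler’s exact pattern syntax is not documented in this section, so treat the semantics below as an assumption:

```python
from fnmatch import fnmatch

# Glob-style patterns; the crawler's actual matching syntax may differ,
# so these lists are assumptions for illustration.
match_patterns = ["https://www.tixaeagents.ai/docs/*"]
unmatch_patterns = ["https://www.tixaeagents.ai/docs/changelog/*"]

def should_scrape(url: str) -> bool:
    """A page is kept if it matches an include pattern and no exclude pattern."""
    included = any(fnmatch(url, p) for p in match_patterns)
    excluded = any(fnmatch(url, p) for p in unmatch_patterns)
    return included and not excluded

print(should_scrape("https://www.tixaeagents.ai/docs/crawler"))         # True
print(should_scrape("https://www.tixaeagents.ai/docs/changelog/june"))  # False
```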

Coming soon: Ability to assign crawl jobs directly to specific agents for automatic knowledge base updates.

Scraped Pages

After completing a crawler job, you can review and manage the scraped pages in the job’s dedicated interface. This section provides an overview of all pages collected during the job, along with status messages, such as when the maximum page limit is reached or while the crawler is active.


Managing Scraped Pages

You can perform the following actions on the scraped pages:

- Select Pages: check the pages you want to process further.
- Export: download selected pages as a zip file containing .txt documents.
- Import: add selected pages to the knowledge base of your chosen agent.

Scraped Page Data

The page data view shows the detailed content of each page scraped in the job.

Example snippet of scraped information in markdown:

TIXAE AI provides access to a wide range of **state-of-the-art** AI models, ensuring that your agents are always equipped with the best and newest models on the market.

=========================================================

As soon as **new models** are released, the TIXAE AI team promptly updates the platform. This means you typically get access to the latest and most powerful models **right away**.

Scraped information is formatted in markdown for easy reading by LLMs. To learn more about formatting KB documents, visit the formatting doc.

Crawler Job Status

When you initiate a new crawler job, it will progress through several status stages:

  1. Pending: The crawler has started and is in the process of gathering URLs and scraping content.
  2. Active: The crawler is actively scraping pages.
  3. Completed: The job has finished, and all specified pages have been scraped.

You will receive a notification in the dashboard when the job status changes to Completed.
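If you track jobs programmatically instead of waiting for the dashboard notification, a simple polling loop along these lines works; the endpoint URL and status field are assumptions mirroring the stages above:

```python
import json
import time
import urllib.request

# Hypothetical status endpoint -- an assumption for this sketch.
JOB_STATUS_URL = "https://api.example.com/v1/crawler/jobs/JOB_ID"

while True:
    with urllib.request.urlopen(JOB_STATUS_URL) as resp:
        status = json.load(resp).get("status")
    print("Job status:", status)  # expected: pending -> active -> completed
    if status == "completed":
        break
    time.sleep(30)  # check again in 30 seconds
```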

Best Practices and Tips

Optimize URL Patterns

Carefully define match and unmatch patterns to focus on the most relevant content.

Monitor Credit Usage

Remember that each page scraped costs credits: 1 credit for a normal scrape, 10 for a deep scrape.
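For example, a job capped at 200 pages consumes up to 200 credits as a normal scrape, but up to 2,000 credits as a deep scrape.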

Regular Updates

Set appropriate refresh rates for dynamic content to keep your knowledge base current.

Review Before Import

Always review scraped content before importing it into your agent’s knowledge base.