Crawler
Efficiently build your agent's knowledge base with the TIXAE AI Crawler
The Crawler is a powerful tool designed to streamline the process of creating and maintaining your agent's knowledge base. By automatically scraping and importing content from specified websites, the crawler ensures your agent stays up to date with the latest information.
How the Crawler Works
The crawler operates by systematically visiting web pages, extracting relevant content, and organizing it into markdown suitable for your agent's knowledge base. This process involves crawling through links, scraping text, and formatting the data for optimal use by AI models.
Crawler Jobs
The core of the crawler functionality revolves around crawler jobs. Each job is associated with specific source URLs you want to crawl and is identified by a unique ID. The developer API offers several ways to interact with the crawler.
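The exact routes and payloads are defined in the developer API reference. Purely as an illustration of the flow, creating a job programmatically might look like the sketch below; the base URL, endpoint path, request fields, and the CrawlerJob shape are all assumptions rather than the documented contract.

```typescript
// Hypothetical sketch of creating a crawler job via the developer API.
// The base URL, route, payload fields, and response shape are assumptions;
// consult the API reference for the real contract.
const API_BASE = "https://api.example.com/v1"; // assumed base URL
const API_KEY = process.env.TIXAE_API_KEY ?? "";

interface CrawlerJob {
  id: string;           // the unique job ID
  sourceUrls: string[]; // URLs the crawl starts from
  status: "pending" | "active" | "completed";
}

async function createCrawlerJob(sourceUrls: string[]): Promise<CrawlerJob> {
  const res = await fetch(`${API_BASE}/crawler/jobs`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ sourceUrls }),
  });
  if (!res.ok) throw new Error(`Job creation failed: ${res.status}`);
  return (await res.json()) as CrawlerJob;
}
```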
Job Details UI
Creating a New Crawler Job
To initiate a new crawler job, follow these steps:
Set Source URLs
Navigate to the Crawler tab in the menu on the left side of your dashboard, click New job, and enter the main URL(s) you want the crawler to begin with (e.g., https://www.tixaeagents.ai).
Ensure you use valid URLs in the correct format: https://example.com
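If you generate source URLs in your own tooling, it is worth validating them before submission. A minimal sketch using the standard URL constructor:

```typescript
// Minimal sketch: validate a source URL before submitting it to the crawler.
// Rejects anything that does not parse or is not http(s).
function isValidSourceUrl(input: string): boolean {
  try {
    const url = new URL(input);
    return url.protocol === "https:" || url.protocol === "http:";
  } catch {
    return false; // not parseable as a URL at all
  }
}

console.log(isValidSourceUrl("https://www.tixaeagents.ai")); // true
console.log(isValidSourceUrl("tixaeagents.ai"));             // false (missing scheme)
```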
Configure Crawl Settings
There are two options to consider that determine the quality of the scrape:
Refresh Rate
Set Crawler Refresh Rate
The crawler refresh rate determines how often the crawler will update the current job with potential new information from the scraped site. This is particularly useful for websites that update frequently, such as e-commerce sites.
You can create separate crawler jobs for different sub-pages. This allows you to set different refresh rates for various sections of a website. For example, on a Shopify site, you might want to update /collections/protein-powder more frequently than the main page, as product information changes more often.
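As a rough sketch of that setup, the two jobs could be modeled like this; the refresh-rate option names and limits below are assumptions for illustration, not the crawler's actual values:

```typescript
// Illustrative only: two jobs for one store with different refresh rates.
// The refresh-rate names here are assumed, not the crawler's option labels.
interface JobConfig {
  sourceUrl: string;
  refreshRate: "daily" | "weekly" | "monthly"; // assumed option names
  pageLimit: number;
}

const jobs: JobConfig[] = [
  // Product listings change often, so refresh them daily.
  {
    sourceUrl: "https://shop.example.com/collections/protein-powder",
    refreshRate: "daily",
    pageLimit: 100,
  },
  // The main page rarely changes; a monthly refresh is enough.
  { sourceUrl: "https://shop.example.com", refreshRate: "monthly", pageLimit: 10 },
];
```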
Available Refresh Rate Options:
Specify Page Limit
Set the maximum number of pages to scrape for that job, ranging from 10 up to 500 pages.
Review the sitemap beforehand to determine the optimal number of pages to scrape. To view it, append /sitemap.xml to a valid URL, e.g. https://www.tixaeagents.ai/sitemap.xml.
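You can also count the sitemap's entries programmatically to pick a sensible limit. This sketch assumes a plain sitemap.xml; a sitemap index file would need one more level of fetching:

```typescript
// Sketch: count the <loc> entries in a sitemap to choose a page limit.
async function countSitemapUrls(siteUrl: string): Promise<number> {
  const res = await fetch(new URL("/sitemap.xml", siteUrl));
  if (!res.ok) throw new Error(`No sitemap found: ${res.status}`);
  const xml = await res.text();
  // Crude but dependency-free: one <loc> tag per listed page.
  return (xml.match(/<loc>/g) ?? []).length;
}

countSitemapUrls("https://www.tixaeagents.ai").then((n) =>
  console.log(`Sitemap lists ${n} pages`)
);
```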
Define URL Patterns
Match URLs: Include subpages you want to scrape based on the source URL.
Unmatch Patterns: Specify URLs or patterns to exclude from the scrape. See the sketch after this list for how the two typically interact.
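A common convention, and the one assumed in the sketch below, is that a URL is scraped only if it matches at least one match pattern and no unmatch pattern. The wildcard syntax here is illustrative, not necessarily the crawler's exact pattern language:

```typescript
// Sketch of how match/unmatch patterns typically interact: keep a URL only
// if it matches an include pattern and no exclude pattern. The "*" wildcard
// syntax is an assumption for illustration.
function globToRegExp(pattern: string): RegExp {
  // Escape regex metacharacters, then turn "*" into ".*".
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, ".*")}$`);
}

function shouldScrape(url: string, match: string[], unmatch: string[]): boolean {
  const included = match.some((p) => globToRegExp(p).test(url));
  const excluded = unmatch.some((p) => globToRegExp(p).test(url));
  return included && !excluded;
}

// Keep product pages, skip the cart:
console.log(
  shouldScrape(
    "https://shop.example.com/collections/protein-powder",
    ["https://shop.example.com/collections/*"],
    ["https://shop.example.com/cart*"],
  ),
); // true
```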
Coming soon: Ability to assign crawl jobs directly to specific agents for automatic knowledge base updates.
Scraped Pages
After completing a crawler job, you can review and manage the scraped pages in the job's dedicated interface. This section provides an overview of all pages collected during the job, along with status messages, such as when the maximum page limit is reached or while the crawler is still active.
Managing Scraped Pages
You can perform the following actions on the scraped pages:
Select Pages
Check the pages you want to process further
Export
Download selected pages as a zip file containing .txt documents
Import
Add selected pages to the knowledge base of your chosen agent
Scraped Page Data
The scraped page data view shows a detailed breakdown of each page scraped in the job.
Example snippet of scraped information in markdown:
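As a stand-in example, a scraped product page might be converted to markdown like this (all names and values below are invented for illustration):

```markdown
# Whey Protein Isolate, 2 lb

## Product details

- **Price:** $39.99
- **Availability:** In stock
- **Flavors:** Vanilla, Chocolate

Cold-filtered whey protein isolate with 25 g of protein per serving,
third-party tested for purity.
```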
Scraped information is formatted in markdown for easy reading by LLMs. To learn more about formatting KB documents, visit the formatting doc.
Crawler Job Status
When you initiate a new crawler job, it will progress through several status stages:
- Pending: The job has started and the crawler is gathering the URLs it will scrape.
- Active: The crawler is actively scraping pages.
- Completed: The job has finished, and all specified pages have been scraped.
You will receive a notification in the dashboard when the job status changes to Completed.
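If you run jobs through the developer API, you can poll for the same transition. In this sketch the endpoint path, auth header, and lowercase status values are assumptions:

```typescript
// Sketch: poll a crawler job until it reports "completed". The route,
// auth header, and status values are assumptions for illustration.
const API_BASE = "https://api.example.com/v1"; // assumed base URL
const API_KEY = process.env.TIXAE_API_KEY ?? "";

async function waitForCompletion(jobId: string): Promise<void> {
  for (;;) {
    const res = await fetch(`${API_BASE}/crawler/jobs/${jobId}`, {
      headers: { Authorization: `Bearer ${API_KEY}` },
    });
    if (!res.ok) throw new Error(`Status check failed: ${res.status}`);
    const job = (await res.json()) as { status: "pending" | "active" | "completed" };
    console.log(`Job ${jobId} is ${job.status}`);
    if (job.status === "completed") return;
    await new Promise((resolve) => setTimeout(resolve, 30_000)); // poll every 30 s
  }
}
```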
Best Practices and Tips
Optimize URL Patterns
Carefully define match and unmatch patterns to focus on the most relevant content.
Monitor Credit Usage
Remember that each page scraped costs credits (1 credit per page for a normal scrape, 10 per page for a deep scrape), so a 100-page deep-scrape job, for example, consumes 1,000 credits.
Regular Updates
Set appropriate refresh rates for dynamic content to keep your knowledge base current.
Review Before Import
Always review scraped content before importing it into your agent's knowledge base.