Sources

Train your AI agent with websites, documents, text, and Q&A pairs. Learn how to manage, optimize, and troubleshoot your training data.

Training Sources

Sources are the foundation of everything your AI agent knows. Every response your agent generates is grounded in the training data you provide. The quality, accuracy, and comprehensiveness of your sources directly determine the quality of your agent's responses. There is no shortcut -- an agent is only as good as the data behind it.

This guide covers every source type Chatsby supports, how the processing pipeline works, best practices for structuring your data, and how to troubleshoot common issues.

The Role of Sources in Agent Quality

When a user asks your agent a question, the system searches your training data for the most relevant content, then uses that content to generate a response. This means:

  • If the answer exists in your sources and is well-structured, the agent will respond accurately.
  • If the answer exists but is buried in poorly formatted content, the agent may miss it or generate an incomplete response.
  • If the answer does not exist in your sources, the agent will either say it does not know or, if not properly instructed, may generate an inaccurate response.

Investing time in curating high-quality sources is the single most impactful thing you can do to improve your agent's performance.

Source Types

Chatsby supports four source types, each suited to different kinds of content and use cases.

Website Crawling

Website crawling automatically extracts text content from one or more web pages. This is the fastest way to train your agent on existing public content like documentation, help centers, blog posts, and product pages.

Navigate to Sources

From your agent dashboard, click the Sources tab, then click Add Source and select Website.

Enter the URL

Paste the root URL you want to crawl. Toggle Crawl subpages to include linked pages under the same domain. Set the crawl depth to control how many levels of links to follow (1 = only the given page, 2 = the page plus directly linked pages, and so on).

Preview and Fetch

Click Fetch Links to preview the list of pages that will be crawled. Review the list and deselect any pages that are not relevant to your agent's purpose.

Start Training

Click Train to begin processing. Chatsby extracts the text content from each page, chunks it into segments, generates embeddings, and indexes everything for retrieval.
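The crawl depth setting from the steps above can be pictured as a breadth-first walk over the link graph. The sketch below is only an illustration of the "1 = only the given page" convention. The `LINKS` map and the `pages_within_depth` function are hypothetical names, not part of Chatsby, and the real crawler may order or filter pages differently.

```python
from collections import deque


def pages_within_depth(links, root, depth):
    """Return pages reachable from root in discovery order.

    Uses the convention described above: depth 1 means only the root page,
    depth 2 adds pages linked directly from it, and so on.
    """
    seen = {root}
    order = [root]
    queue = deque([(root, 1)])
    while queue:
        page, level = queue.popleft()
        if level >= depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, level + 1))
    return order


# Hypothetical site structure used only for illustration.
LINKS = {
    "/docs": ["/docs/setup", "/docs/faq"],
    "/docs/setup": ["/docs/setup/advanced"],
    "/docs/faq": ["/docs"],
}

print(pages_within_depth(LINKS, "/docs", 1))  # ['/docs']
print(pages_within_depth(LINKS, "/docs", 2))  # ['/docs', '/docs/setup', '/docs/faq']
```

Raising the depth by one level can add many pages at once, which is why previewing the fetched list before training is worthwhile.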

What gets extracted: Chatsby extracts visible text content from the page body, including headings, paragraphs, lists, and tables. Navigation menus, footers, sidebars, and scripts are excluded automatically.
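To get a rough feel for what this kind of extraction keeps and drops, the following sketch uses Python's standard `html.parser` module. It is an approximation under stated assumptions: the set of skipped tags, the `VisibleTextExtractor` class, and the sample page are all hypothetical, and Chatsby's actual extraction rules are not documented here.

```python
from html.parser import HTMLParser

# Assumption: these tags approximate "navigation, footers, sidebars, and scripts".
SKIPPED_TAGS = {"nav", "footer", "aside", "script", "style"}


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts


# Hypothetical page used only for illustration.
SAMPLE = """
<html><body>
<nav><a href="/">Home</a></nav>
<h1>Return Policy</h1>
<p>Items can be returned within 30 days.</p>
<script>console.log("tracking");</script>
<footer>Copyright</footer>
</body></html>
"""

print(extract_visible_text(SAMPLE))
```

Running a page of your own through a check like this can show whether important content sits inside elements that are likely to be excluded.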

JavaScript-rendered sites: Chatsby uses a headless browser for crawling, so content rendered by JavaScript frameworks (React, Vue, Angular) is captured. However, content behind authentication walls or that requires user interaction (clicking tabs, expanding accordions) may not be extracted. For such content, use direct text input or file uploads instead.

Crawl frequency: Website sources can be manually refreshed at any time by clicking the refresh icon. Chatsby re-crawls the URL and updates the training data with the latest content. For frequently changing content, set up a refresh schedule or use the API to trigger re-crawls programmatically.

File Uploads

Upload documents directly to train your agent on content that is not publicly accessible on the web, such as internal policies, product manuals, research reports, or training materials.

| Format | Extensions | Max File Size | What Gets Extracted |
| --- | --- | --- | --- |
| PDF | .pdf | 10 MB | Text content, including text from scanned documents via OCR. Tables and structured layouts are preserved where possible. |
| Word | .doc, .docx | 10 MB | All text content including headings, paragraphs, lists, and tables. Images and embedded media are not extracted. |
| Plain Text | .txt | 5 MB | The full text content of the file. |
| CSV | .csv | 5 MB | Each row is treated as a discrete piece of information. Column headers provide context for each cell value. |
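The CSV behavior in the table (each row as a discrete piece of information, with column headers giving context) can be sketched as follows. This is a minimal illustration, not Chatsby's actual formatting: the `rows_to_statements` function, the "header: value" layout, and the sample data are assumptions.

```python
import csv
import io


def rows_to_statements(csv_text):
    """Turn each CSV row into one self-describing line of text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    statements = []
    for row in reader:
        # Pair every non-empty cell with its column header for context.
        pairs = [f"{header}: {value}" for header, value in row.items() if value]
        statements.append("; ".join(pairs))
    return statements


# Hypothetical sample data used only for illustration.
SAMPLE_CSV = "Plan,Price,Users\nStarter,$19/month,3\nPro,$49/month,10\n"

for statement in rows_to_statements(SAMPLE_CSV):
    print(statement)
# Plan: Starter; Price: $19/month; Users: 3
# Plan: Pro; Price: $49/month; Users: 10
```

The practical takeaway is that descriptive column headers matter: a header like "Price" gives a cell far more context than "col2".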

Navigate to Sources

From your agent dashboard, click the Sources tab, then click Add Source and select File Upload.

Select Files

Drag and drop files into the upload area, or click to browse. You can upload multiple files at once. Each file is processed as a separate source.

Upload and Train

Click Upload to begin. Chatsby extracts the text content, processes it through the chunking and embedding pipeline, and adds it to your agent's knowledge base.

OCR (Optical Character Recognition) is supported for scanned PDFs. The extraction accuracy depends on the scan quality. For best results, ensure scanned documents are at least 300 DPI with clear, legible text.
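If you know a scanned page's pixel dimensions and its physical paper size, the 300 DPI guideline can be checked with simple arithmetic. The helper below is a hypothetical sketch, assuming the scan covers the full page with no cropping.

```python
def scan_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Estimate scan resolution as the smaller of the two per-axis values."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


# US Letter paper is 8.5 x 11 inches.
print(scan_dpi(2550, 3300, 8.5, 11))  # 300.0
print(scan_dpi(1275, 1650, 8.5, 11))  # 150.0
```

A result below 300 suggests rescanning at a higher resolution before uploading.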

Direct Text Input

Direct text input is ideal for content that does not exist in a file or on a webpage. Use it for FAQs, product descriptions, company policies, pricing information, or any content you want to write specifically for your agent.

Navigate to Sources

From your agent dashboard, click the Sources tab, then click Add Source and select Text.

Enter a Title

Give the source a descriptive title (e.g., "Return Policy" or "Product Pricing Q3 2026"). This helps you identify the source later when managing your training data.

Write or Paste Content

Enter the content in the text editor. Use clear headings, short paragraphs, and structured formatting. The more organized the content, the better the agent can retrieve and use it.

Save and Train

Click Save to process the content. It is immediately available for your agent to use.

When to use text input over other source types:

  • The content is not published anywhere and exists only in your head or in internal notes.
  • You want to provide curated, agent-optimized versions of content rather than raw website or document text.
  • You need to combine information from multiple places into a single, coherent source.

Q&A Pairs

Q&A pairs give you the highest level of precision and control. When a user asks a question that closely matches a Q&A pair, the agent uses the provided answer directly, ensuring accuracy for your most critical topics.

Navigate to Sources

From your agent dashboard, click the Sources tab, then click Add Source and select Q&A.

Enter the Question

Write the question as a customer would ask it. Include common variations if possible (e.g., "What is your refund policy?" / "How do I get a refund?").

Enter the Answer

Write the complete, definitive answer. Be thorough -- this is the exact response the agent will use or heavily reference.

Save

Click Save. You can add as many Q&A pairs as needed. Each pair is stored as a separate, high-priority source.

When to use Q&A pairs:

  • For questions where accuracy is non-negotiable (pricing, legal disclaimers, refund policies).
  • For questions that your agent consistently gets wrong using other source types.
  • For questions with answers that are too nuanced or specific to be reliably extracted from general documents.

Source Processing Pipeline

Understanding what happens after you add a source helps you optimize your training data and troubleshoot issues.

When you add any source, Chatsby processes it through a three-stage pipeline:

  1. Extraction -- Raw text is extracted from the source (crawled from a webpage, parsed from a document, or taken directly from your input).
  2. Chunking -- The extracted text is split into smaller, semantically meaningful segments. Chunking ensures that the retrieval system can find the most relevant piece of information for a given question, rather than searching through an entire document.
  3. Embedding and Indexing -- Each chunk is converted into a numerical vector (embedding) that captures its semantic meaning. These embeddings are stored in a vector index, enabling fast similarity search when users ask questions.
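The chunking stage above can be approximated with a simple paragraph-packing routine. This is a sketch under assumptions, not Chatsby's algorithm: the `chunk_text` function and the `max_chars` limit are hypothetical, and a production chunker would likely also use headings and sentence boundaries.

```python
def chunk_text(text, max_chars=200):
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)  # close the full chunk
            current = paragraph     # start a new one
    if current:
        chunks.append(current)
    return chunks


# Three paragraphs of 120, 60, and 150 characters.
SAMPLE_TEXT = "\n\n".join(["A" * 120, "B" * 60, "C" * 150])
print([len(chunk) for chunk in chunk_text(SAMPLE_TEXT)])  # [182, 150]
```

Because paragraph breaks act as split points in a scheme like this, short, self-contained paragraphs tend to produce cleaner chunks than long unbroken text.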

When a user sends a message, the system converts the message into an embedding, searches the index for the most similar chunks, and provides those chunks as context for the AI model to generate a response.
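The retrieval step can be illustrated with a toy similarity search. The sketch below swaps in a word-count vector for the embedding, which is a deliberate simplification: real systems use learned embedding models, and the model Chatsby uses is not specified here. The function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical indexed chunks used only for illustration.
CHUNKS = [
    "Refunds are issued within 30 days of purchase.",
    "The Pro plan costs $49/month and includes up to 10 users.",
    "Support is available by email on weekdays.",
]

print(top_chunks("How much does the Pro plan cost?", CHUNKS))
```

The retrieved chunks are what the model sees as context, so a question can only be answered well if some chunk closely matches it.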

Processing time depends on the size and type of source. A single webpage typically processes in under 30 seconds. Large PDF files may take several minutes. You can monitor processing status on the Sources tab.

Managing Sources

Viewing Source Status

The Sources tab displays all training data for your agent in a table with the following columns:

| Column | Description |
| --- | --- |
| Name | The source title or URL. |
| Type | Website, File, Text, or Q&A. |
| Characters | The total number of characters extracted from the source. |
| Status | Active (ready for use), Processing (currently being indexed), or Failed (an error occurred during processing). |
| Date Added | When the source was first created. |

Refreshing Sources

For website sources, click the refresh icon to re-crawl the URL and update the training data with the latest content. This is essential when your website content changes and you want the agent to reflect those updates.

After refreshing, test the affected topics in the Playground to verify the updated content is being used correctly.

Deleting Sources

Click the delete icon next to any source to permanently remove it. The agent will no longer reference that content when generating responses. The source is removed immediately, though the index may take a few moments to update fully.

Bulk Operations

Select multiple sources using the checkboxes to perform bulk actions:

  • Bulk delete -- Remove multiple sources at once.
  • Bulk refresh -- Re-crawl multiple website sources simultaneously.

Source Quality Guidelines

Not all content makes good training data. Follow these guidelines to maximize source effectiveness.

Do:

  • Write in clear, direct language. Avoid jargon unless it is terminology your users will use.
  • Structure content with headings and short paragraphs. The chunking algorithm uses structural cues to create better segments.
  • Include specific facts, numbers, and details. "Our Pro plan costs $49/month and includes up to 10 users" is far more useful than "Check our pricing page for details."
  • Keep sources up to date. Outdated information is worse than no information because users will trust it.

Do not:

  • Upload entire books or massive documents that are only tangentially relevant. More data is not always better -- irrelevant content can dilute the quality of retrieval.
  • Include duplicate content across multiple sources. Conflicting duplicates confuse the retrieval system.
  • Upload content with heavy formatting artifacts, headers/footers repeated on every page, or legalese boilerplate unless it is genuinely needed.
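One way to spot duplicate content before uploading is to compare the word sets of your sources. The sketch below uses Jaccard similarity for this. It is a rough local check, not a Chatsby feature: the function names, the 0.8 threshold, and the sample sources are all assumptions.

```python
import re
from itertools import combinations


def word_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of distinct words that two texts have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(sources, threshold=0.8):
    """Return (name, name, score) for every pair at or above the threshold."""
    sets = {name: word_set(text) for name, text in sources.items()}
    flagged = []
    for left, right in combinations(sorted(sets), 2):
        score = jaccard(sets[left], sets[right])
        if score >= threshold:
            flagged.append((left, right, round(score, 2)))
    return flagged


# Hypothetical sources used only for illustration.
SOURCES = {
    "faq.txt": "Refunds are available within 30 days of purchase.",
    "policy.pdf": "Refunds are available within 30 days of purchase!",
    "pricing.txt": "The Pro plan costs $49/month.",
}

print(near_duplicates(SOURCES))  # [('faq.txt', 'policy.pdf', 1.0)]
```

Flagged pairs are candidates for consolidation into a single source.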

Troubleshooting

Source Processing Failed

If a source shows a Failed status, the most common causes are:

  • Website unreachable -- The URL returned a 404 or 403 error, or the request timed out. Verify the URL is publicly accessible.
  • File format unsupported -- Ensure the file matches one of the supported formats listed above.
  • File too large -- Check the size against the per-format limits in the File Uploads table and the maximum file size for your plan under Plan Limits.
  • Encoding issues -- Text files must use UTF-8 encoding. Re-save the file with UTF-8 encoding and re-upload.
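Several of these failure causes can be caught locally before uploading. The sketch below checks the extension, size, and encoding of a file's bytes against the limits in the File Uploads table. It makes some assumptions: 1 MB is taken as 1024 * 1024 bytes, the UTF-8 check is applied only to .txt files, plan-level limits are ignored, and the function names are hypothetical.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: the exact byte definition of "MB" is not documented

# Per-format limits taken from the File Uploads table.
MAX_BYTES = {
    ".pdf": 10 * MB,
    ".doc": 10 * MB,
    ".docx": 10 * MB,
    ".txt": 5 * MB,
    ".csv": 5 * MB,
}


def check_upload(filename, data):
    """Return a list of likely problems; an empty list means none were found."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MAX_BYTES:
        return [f"unsupported format: {suffix or 'no extension'}"]
    problems = []
    if len(data) > MAX_BYTES[suffix]:
        problems.append("file too large")
    if suffix == ".txt":
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            problems.append("not valid UTF-8")
    return problems


def reencode_to_utf8(data, source_encoding="latin-1"):
    """Re-save text bytes as UTF-8, given the encoding they were written in."""
    return data.decode(source_encoding).encode("utf-8")


legacy = "café".encode("latin-1")
print(check_upload("notes.txt", legacy))                    # ['not valid UTF-8']
print(check_upload("notes.txt", reencode_to_utf8(legacy)))  # []
```

The `source_encoding` default is a guess. If the original encoding is different, re-encoding with the wrong value will produce garbled characters, so confirm it in your editor first.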

Content Not Being Used in Responses

If your agent is not referencing a source you added:

  • Verify the source status is Active on the Sources tab.
  • Test in the Playground and expand source citations to check which sources are being referenced.
  • The content may be too similar to other sources, causing the retrieval system to prefer the other source. Consider consolidating related content into a single source.
  • The question may not be semantically similar enough to the source content. Add a Q&A pair for precision.

Agent Giving Wrong Answers from a Source

If the agent cites the correct source but generates an incorrect response:

  • The source content itself may be ambiguous or contradictory. Review and rewrite for clarity.
  • The relevant information may be buried in a long document. Extract the key content into a dedicated text source or Q&A pair.
  • The temperature setting may be too high, causing the model to paraphrase inaccurately. Lower it to 0.1-0.3.
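As general background on the last point, temperature in most language models rescales token scores before sampling, so lower values concentrate probability on the top-scoring choice. The sketch below shows that effect with a plain softmax. The scores are made up, and this illustrates the general technique rather than Chatsby's internals.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    scaled = [score / temperature for score in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate tokens.
SCORES = [2.0, 1.0, 0.5]

for temperature in (1.0, 0.2):
    probabilities = softmax_with_temperature(SCORES, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

At the lower temperature nearly all of the probability lands on the first candidate, which is why low settings tend to give more literal, repeatable answers.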

Plan Limits

Source limits vary by plan. Consult your plan details for exact numbers.

| Resource | Free | Starter | Professional | Enterprise |
| --- | --- | --- | --- | --- |
| Sources per Agent | 5 | 25 | 100 | Unlimited |
| Max File Size | 5 MB | 10 MB | 10 MB | 25 MB |
| Max Characters per Source | 100,000 | 400,000 | 1,000,000 | 5,000,000 |
| Website Crawl Depth | 2 levels | 5 levels | 10 levels | Unlimited |
| Q&A Pairs per Agent | 20 | 100 | 500 | Unlimited |

If you reach a plan limit, existing sources continue to work. You will need to upgrade or remove existing sources before adding new ones.
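If you track sources in your own tooling, the limits above can be encoded as a lookup table. The sketch below copies the values from the Plan Limits table and uses `None` for "Unlimited". The dictionary keys and the `can_add_source` helper are hypothetical, and the numbers should be confirmed against your plan details.

```python
# Values copied from the Plan Limits table; None stands for "Unlimited".
PLAN_LIMITS = {
    "Free": {"sources": 5, "max_file_mb": 5, "max_chars": 100_000,
             "crawl_depth": 2, "qa_pairs": 20},
    "Starter": {"sources": 25, "max_file_mb": 10, "max_chars": 400_000,
                "crawl_depth": 5, "qa_pairs": 100},
    "Professional": {"sources": 100, "max_file_mb": 10, "max_chars": 1_000_000,
                     "crawl_depth": 10, "qa_pairs": 500},
    "Enterprise": {"sources": None, "max_file_mb": 25, "max_chars": 5_000_000,
                   "crawl_depth": None, "qa_pairs": None},
}


def can_add_source(plan, current_sources):
    """Return True if the plan still has room for one more source."""
    limit = PLAN_LIMITS[plan]["sources"]
    return limit is None or current_sources < limit


print(can_add_source("Free", 4))            # True
print(can_add_source("Free", 5))            # False
print(can_add_source("Enterprise", 10_000)) # True
```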