Sources
Train your AI agent with websites, documents, text, and Q&A pairs. Learn how to manage, optimize, and troubleshoot your training data.
Training Sources
Sources are the foundation of everything your AI agent knows. Every response your agent generates is grounded in the training data you provide. The quality, accuracy, and comprehensiveness of your sources directly determine the quality of your agent's responses. There is no shortcut -- an agent is only as good as the data behind it.
This guide covers every source type Chatsby supports, how the processing pipeline works, best practices for structuring your data, and how to troubleshoot common issues.
The Role of Sources in Agent Quality
When a user asks your agent a question, the system searches your training data for the most relevant content, then uses that content to generate a response. This means:
- If the answer exists in your sources and is well-structured, the agent will respond accurately.
- If the answer exists but is buried in poorly formatted content, the agent may miss it or generate an incomplete response.
- If the answer does not exist in your sources, the agent will either say it does not know or, if not properly instructed, may generate an inaccurate response.
Investing time in curating high-quality sources is the single most impactful thing you can do to improve your agent's performance.
Source Types
Chatsby supports four source types, each suited to different kinds of content and use cases.
Website Crawling
Website crawling automatically extracts text content from one or more web pages. This is the fastest way to train your agent on existing public content like documentation, help centers, blog posts, and product pages.
Navigate to Sources
From your agent dashboard, click the Sources tab, then click Add Source and select Website.
Enter the URL
Paste the root URL you want to crawl. Toggle Crawl subpages to include linked pages under the same domain. Set the crawl depth to control how many levels of links to follow (1 = only the given page, 2 = the page plus directly linked pages, and so on).
Preview and Fetch
Click Fetch Links to preview the list of pages that will be crawled. Review the list and deselect any pages that are not relevant to your agent's purpose.
Start Training
Click Train to begin processing. Chatsby extracts the text content from each page, chunks it into segments, generates embeddings, and indexes everything for retrieval.
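The crawl depth setting from the Enter the URL step can be pictured as a breadth-first walk that stops following links once the configured level is reached. The sketch below is illustrative only: the `SITE` link map and the `pages_within_depth` function are hypothetical names, and it does not describe how Chatsby's crawler is actually implemented.

```python
from collections import deque


def pages_within_depth(links, root, depth):
    """Return pages reachable from root, where depth 1 means the root page only."""
    seen = {root}
    order = [root]
    queue = deque([(root, 1)])
    while queue:
        page, level = queue.popleft()
        if level >= depth:
            continue  # links on this page are beyond the configured depth
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, level + 1))
    return order


# Hypothetical site structure, used only for illustration.
SITE = {
    "/docs": ["/docs/setup", "/docs/faq"],
    "/docs/setup": ["/docs/setup/advanced"],
    "/docs/faq": ["/docs"],
}

print(pages_within_depth(SITE, "/docs", 1))  # ['/docs']
print(pages_within_depth(SITE, "/docs", 2))  # ['/docs', '/docs/setup', '/docs/faq']
```

Under this model, raising the depth by one adds only the pages linked from the previous level, which is why previewing the list with Fetch Links is a useful check before training.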
What gets extracted: Chatsby extracts visible text content from the page body, including headings, paragraphs, lists, and tables. Navigation menus, footers, sidebars, and scripts are excluded automatically.
JavaScript-rendered sites: Chatsby uses a headless browser for crawling, so content rendered by JavaScript frameworks (React, Vue, Angular) is captured. However, content behind authentication walls or that requires user interaction (clicking tabs, expanding accordions) may not be extracted. For such content, use direct text input or file uploads instead.
Crawl frequency: Website sources can be manually refreshed at any time by clicking the refresh icon. Chatsby re-crawls the URL and updates the training data with the latest content. For frequently changing content, set up a refresh schedule or use the API to trigger re-crawls programmatically.
File Uploads
Upload documents directly to train your agent on content that is not publicly accessible on the web, such as internal policies, product manuals, research reports, or training materials.
| Format | Extensions | Max File Size | What Gets Extracted |
|---|---|---|---|
| PDF | .pdf | 10 MB | Text content, including text from scanned documents via OCR. Tables and structured layouts are preserved where possible. |
| Word | .doc, .docx | 10 MB | All text content including headings, paragraphs, lists, and tables. Images and embedded media are not extracted. |
| Plain Text | .txt | 5 MB | The full text content of the file. |
| CSV | .csv | 5 MB | Each row is treated as a discrete piece of information. Column headers provide context for each cell value. |
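To see why clear column headers matter for CSV sources, it can help to picture each row being turned into a small, self-describing snippet. The sketch below is one plausible way to do that and is an assumption, not Chatsby's actual extraction logic. The `rows_to_snippets` function and the sample data are hypothetical.

```python
import csv
import io


def rows_to_snippets(csv_text):
    """Turn each CSV row into one 'header: value' snippet."""
    reader = csv.DictReader(io.StringIO(csv_text))
    snippets = []
    for row in reader:
        # Pair every non-empty cell with its column header for context.
        parts = [f"{header}: {value}" for header, value in row.items() if value]
        snippets.append("; ".join(parts))
    return snippets


# Made-up sample data for illustration.
SAMPLE = "plan,price,users\nPro,$49/month,10\n"

for snippet in rows_to_snippets(SAMPLE):
    print(snippet)  # plan: Pro; price: $49/month; users: 10
```

A header such as `price` gives the cell `$49/month` meaning on its own row, whereas a header like `col2` would not.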
Navigate to Sources
From your agent dashboard, click the Sources tab, then click Add Source and select File Upload.
Select Files
Drag and drop files into the upload area, or click to browse. You can upload multiple files at once. Each file is processed as a separate source.
Upload and Train
Click Upload to begin. Chatsby extracts the text content, processes it through the chunking and embedding pipeline, and adds it to your agent's knowledge base.
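Before uploading a batch of files, the formats and size limits from the table above can be checked locally. The following is a minimal sketch that mirrors that table. The `check_upload` helper is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption. Plan-level limits (see Plan Limits) may also apply.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: the limits are measured in binary megabytes

# Per-format limits copied from the File Uploads table.
MAX_BYTES = {
    ".pdf": 10 * MB,
    ".doc": 10 * MB,
    ".docx": 10 * MB,
    ".txt": 5 * MB,
    ".csv": 5 * MB,
}


def check_upload(filename, size_bytes):
    """Return 'ok' or a short reason why the file would likely be rejected."""
    ext = Path(filename).suffix.lower()
    limit = MAX_BYTES.get(ext)
    if limit is None:
        return f"unsupported format: {ext or 'no extension'}"
    if size_bytes > limit:
        return f"too large: {size_bytes} bytes exceeds {limit} bytes for {ext}"
    return "ok"


print(check_upload("manual.PDF", 3 * MB))  # ok
print(check_upload("slides.pptx", 1))      # unsupported format: .pptx
```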
Direct Text Input
Direct text input is ideal for content that does not exist in a file or on a webpage. Use it for FAQs, product descriptions, company policies, pricing information, or any content you want to write specifically for your agent.
Navigate to Sources
From your agent dashboard, click the Sources tab, then click Add Source and select Text.
Enter a Title
Give the source a descriptive title (e.g., "Return Policy" or "Product Pricing Q3 2026"). This helps you identify the source later when managing your training data.
Write or Paste Content
Enter the content in the text editor. Use clear headings, short paragraphs, and structured formatting. The more organized the content, the better the agent can retrieve and use it.
Save and Train
Click Save to process the content. It is immediately available for your agent to use.
When to use text input over other source types:
- The content is not published anywhere and exists only in your head or in internal notes.
- You want to provide curated, agent-optimized versions of content rather than raw website or document text.
- You need to combine information from multiple places into a single, coherent source.
Q&A Pairs
Q&A pairs give you the highest level of precision and control. When a user asks a question that closely matches a Q&A pair, the agent uses the provided answer directly, ensuring accuracy for your most critical topics.
Navigate to Sources
From your agent dashboard, click the Sources tab, then click Add Source and select Q&A.
Enter the Question
Write the question as a customer would ask it. Include common variations if possible (e.g., "What is your refund policy?" / "How do I get a refund?").
Enter the Answer
Write the complete, definitive answer. Be thorough -- this is the exact response the agent will use or heavily reference.
Save
Click Save. You can add as many Q&A pairs as needed. Each pair is stored as a separate, high-priority source.
When to use Q&A pairs:
- For questions where accuracy is non-negotiable (pricing, legal disclaimers, refund policies).
- For questions that your agent consistently gets wrong using other source types.
- For questions with answers that are too nuanced or specific to be reliably extracted from general documents.
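The idea of a question "closely matching" a Q&A pair can be illustrated with a toy matcher. This sketch uses plain string similarity from Python's `difflib` as a stand-in. How Chatsby actually matches questions is not described here and is likely semantic rather than character-based. The `QA_PAIRS` data, the `best_match` function, the 0.8 threshold, and the sample answer are all hypothetical.

```python
from difflib import SequenceMatcher

# Made-up Q&A pair with two phrasings of the same question.
QA_PAIRS = [
    {
        "questions": ["What is your refund policy?", "How do I get a refund?"],
        "answer": "Sample answer: refunds are handled by the support team.",
    },
]


def normalize(text):
    """Lowercase, drop question marks, and collapse whitespace."""
    return " ".join(text.lower().replace("?", "").split())


def best_match(user_question, pairs, threshold=0.8):
    """Return the stored answer if any phrasing is similar enough, else None."""
    best_score, best_answer = 0.0, None
    for pair in pairs:
        for question in pair["questions"]:
            score = SequenceMatcher(
                None, normalize(user_question), normalize(question)
            ).ratio()
            if score > best_score:
                best_score, best_answer = score, pair["answer"]
    return best_answer if best_score >= threshold else None


print(best_match("how do I get a refund", QA_PAIRS))
```

Listing several phrasings per pair, as suggested in the Enter the Question step, gives any matcher more chances to recognize the user's wording.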
Source Processing Pipeline
Understanding what happens after you add a source helps you optimize your training data and troubleshoot issues.
When you add any source, Chatsby processes it through a three-stage pipeline:
- Extraction -- Raw text is extracted from the source (crawled from a webpage, parsed from a document, or taken directly from your input).
- Chunking -- The extracted text is split into smaller, semantically meaningful segments. Chunking ensures that the retrieval system can find the most relevant piece of information for a given question, rather than searching through an entire document.
- Embedding and Indexing -- Each chunk is converted into a numerical vector (embedding) that captures its semantic meaning. These embeddings are stored in a vector index, enabling fast similarity search when users ask questions.
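The chunking stage can be sketched as packing paragraphs into segments up to a size budget. This is a simplified illustration, not Chatsby's chunking algorithm. The `chunk_text` function and the 500-character budget are assumptions chosen for the example.

```python
def chunk_text(text, max_chars=500):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate  # still fits, or this is the first paragraph
        else:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


document = "A" * 300 + "\n\n" + "B" * 300 + "\n\n" + "C" * 100
print([len(chunk) for chunk in chunk_text(document)])  # [300, 402]
```

Because chunks are closed at paragraph boundaries in this model, short, well-separated paragraphs tend to stay together with their related content.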
When a user sends a message, the system converts the message into an embedding, searches the index for the most similar chunks, and provides those chunks as context for the AI model to generate a response.
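The retrieval step described above can be sketched with a toy similarity search. Real embeddings come from a trained model. Here, simple word counts stand in for them so the example stays self-contained. The `embed`, `cosine`, and `top_chunks` names and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    query = embed(question)
    index = [(chunk, embed(chunk)) for chunk in chunks]  # the 'vector index'
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


CHUNKS = [
    "Our Pro plan costs $49/month and includes up to 10 users.",
    "Text files must use UTF-8 encoding.",
    "Website sources can be refreshed by clicking the refresh icon.",
]

print(top_chunks("How much does the Pro plan cost?", CHUNKS, k=1))
```

The chunks returned by a search like this are what the AI model receives as context, which is why a specific, self-contained chunk produces a better answer than a vague one.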
Managing Sources
Viewing Source Status
The Sources tab displays all training data for your agent in a table with the following columns:
| Column | Description |
|---|---|
| Name | The source title or URL. |
| Type | Website, File, Text, or Q&A. |
| Characters | The total number of characters extracted from the source. |
| Status | Active (ready for use), Processing (currently being indexed), Failed (an error occurred during processing). |
| Date Added | When the source was first created. |
Refreshing Sources
For website sources, click the refresh icon to re-crawl the URL and update the training data with the latest content. This is essential when your website content changes and you want the agent to reflect those updates.
After refreshing, test the affected topics in the Playground to verify the updated content is being used correctly.
Deleting Sources
Click the delete icon next to any source to permanently remove it. The agent will no longer reference that content when generating responses. Deletion is immediate, but the agent may take a few moments to fully re-index.
Bulk Operations
Select multiple sources using the checkboxes to perform bulk actions:
- Bulk delete -- Remove multiple sources at once.
- Bulk refresh -- Re-crawl multiple website sources simultaneously.
Source Quality Guidelines
Not all content makes good training data. Follow these guidelines to maximize source effectiveness.
Do:
- Write in clear, direct language. Avoid jargon unless it is terminology your users will use.
- Structure content with headings and short paragraphs. The chunking algorithm uses structural cues to create better segments.
- Include specific facts, numbers, and details. "Our Pro plan costs $49/month and includes up to 10 users" is far more useful than "Check our pricing page for details."
- Keep sources up to date. Outdated information is worse than no information because users will trust it.
Do not:
- Upload entire books or massive documents that are only tangentially relevant. More data is not always better -- irrelevant content can dilute the quality of retrieval.
- Include duplicate content across multiple sources. Conflicting duplicates confuse the retrieval system.
- Upload content with heavy formatting artifacts, headers/footers repeated on every page, or legalese boilerplate unless it is genuinely needed.
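Repeated headers and footers can often be removed before upload. The sketch below assumes the document text is already split into one string per page. The `strip_repeated_lines` helper and its 0.8 threshold are hypothetical choices for illustration.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.8):
    """Drop lines that appear on at least min_fraction of the pages."""
    if len(pages) < 2:
        return list(pages)  # nothing can be 'repeated' in a single page
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    cutoff = min_fraction * len(pages)
    repeated = {line for line, seen in counts.items() if seen >= cutoff}
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]


# Made-up two-page document with a shared header and footer.
PAGES = [
    "ACME Handbook\nReturns are accepted within 30 days.\nConfidential",
    "ACME Handbook\nShipping takes 3 to 5 days.\nConfidential",
]

print(strip_repeated_lines(PAGES))
```

Review the result before uploading, since a short line that legitimately appears on most pages would also be removed by this approach.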
Troubleshooting
Source Processing Failed
If a source shows a Failed status, the most common causes are:
- Website unreachable -- The URL returned a 404, 403, or timeout. Verify the URL is publicly accessible.
- File format unsupported -- Ensure the file matches one of the supported formats listed above.
- File too large -- Check the size against the limits in the table above.
- Encoding issues -- Text files must use UTF-8 encoding. Re-save the file with UTF-8 encoding and re-upload.
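For the encoding case, a text file can be re-saved as UTF-8 with a few lines of Python. This is a minimal sketch: the `resave_as_utf8` helper is hypothetical, and the `cp1252` fallback is an assumption about the file's original encoding that may not hold for your files.

```python
from pathlib import Path


def resave_as_utf8(src, dst, fallback_encoding="cp1252"):
    """Read src, decoding as UTF-8 or a fallback encoding, and write dst as UTF-8."""
    raw = Path(src).read_bytes()
    try:
        text = raw.decode("utf-8")  # already valid UTF-8
    except UnicodeDecodeError:
        # Assumption: the file was saved in the fallback encoding.
        text = raw.decode(fallback_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

For example, `resave_as_utf8("policy.txt", "policy-utf8.txt")` would write a UTF-8 copy ready for upload. If the text looks wrong afterwards, the fallback encoding was probably a bad guess.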
Content Not Being Used in Responses
If your agent is not referencing a source you added:
- Verify the source status is Active on the Sources tab.
- Test in the Playground and expand source citations to check which sources are being referenced.
- The content may be too similar to other sources, causing the retrieval system to prefer the other source. Consider consolidating related content into a single source.
- The question may not be semantically similar enough to the source content. Add a Q&A pair for precision.
Agent Giving Wrong Answers from a Source
If the agent cites the correct source but generates an incorrect response:
- The source content itself may be ambiguous or contradictory. Review and rewrite for clarity.
- The relevant information may be buried in a long document. Extract the key content into a dedicated text source or Q&A pair.
- The temperature setting may be too high, causing the model to paraphrase inaccurately. Lower it to 0.1-0.3.
Plan Limits
Source limits vary by plan, as summarized in the table below. Consult your plan details to confirm the exact numbers for your account.
| Resource | Free | Starter | Professional | Enterprise |
|---|---|---|---|---|
| Sources per Agent | 5 | 25 | 100 | Unlimited |
| Max File Size | 5 MB | 10 MB | 10 MB | 25 MB |
| Max Characters per Source | 100,000 | 400,000 | 1,000,000 | 5,000,000 |
| Website Crawl Depth | 2 levels | 5 levels | 10 levels | Unlimited |
| Q&A Pairs per Agent | 20 | 100 | 500 | Unlimited |
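The limits in the table can be encoded as data to check a source before adding it. This sketch copies two rows from the table above. The `PLAN_LIMITS` structure and the `problems_adding_source` function are hypothetical, and `None` is used here to represent "Unlimited".

```python
# Values copied from the Plan Limits table; None stands for "Unlimited".
PLAN_LIMITS = {
    "Free": {"sources": 5, "max_chars": 100_000},
    "Starter": {"sources": 25, "max_chars": 400_000},
    "Professional": {"sources": 100, "max_chars": 1_000_000},
    "Enterprise": {"sources": None, "max_chars": 5_000_000},
}


def problems_adding_source(plan, existing_sources, new_source_chars):
    """Return a list of reasons a new source would exceed the plan's limits."""
    limits = PLAN_LIMITS[plan]
    problems = []
    if limits["sources"] is not None and existing_sources >= limits["sources"]:
        problems.append(f"{plan} plan allows {limits['sources']} sources per agent")
    if new_source_chars > limits["max_chars"]:
        problems.append(
            f"source has {new_source_chars} characters, limit is {limits['max_chars']}"
        )
    return problems


print(problems_adding_source("Free", 5, 120_000))
print(problems_adding_source("Enterprise", 1_000, 4_000_000))  # []
```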