Web Sources
Crawl and index websites to build your AI assistant's knowledge base. Configure sitemap indexing, URL patterns, and crawl settings.
Web sources allow Ask0 to automatically crawl and index websites, documentation sites, blogs, and any other web content. This is the most common way to build your AI assistant's knowledge base.
How Web Crawling Works
Ask0's intelligent web crawler:
- Starts from your specified URL(s)
- Discovers pages through links and sitemaps
- Respects robots.txt and crawler directives
- Extracts content while removing navigation and ads
- Processes and chunks content for optimal retrieval
- Indexes for semantic search
- Refreshes on your defined schedule
Setting Up Web Sources
Navigate to Sources
In your Ask0 project dashboard, click on Sources in the sidebar.
Add Web Source
Click "Add Source" and select "Web Crawler" from the options.
Configure Crawler Settings
Start URLs:
- https://docs.yoursite.com
- https://help.yoursite.com
Include Patterns:
- https://docs.yoursite.com/**
- https://help.yoursite.com/articles/**
Exclude Patterns:
- **/archive/**
- **/internal/**
- **/*.pdf
- **/changelog
Max Pages: 1000
Max Depth: 5
Follow External Links: No
Respect Robots.txt: Yes
Refresh: Daily at 2:00 AM UTC
Start Crawling
Click "Save & Start Crawling". The initial crawl typically takes:
- Small site (< 100 pages): 2-5 minutes
- Medium site (100-1000 pages): 5-20 minutes
- Large site (1000+ pages): 20-60 minutes
Configuration Options
URL Patterns
Use glob patterns to control what gets crawled:
Include Patterns:
- https://docs.site.com/v2/** # All v2 documentation
- https://blog.site.com/2024/** # 2024 blog posts only
- **/*tutorial* # Any URL containing "tutorial"
Exclude Patterns:
- **/temp/** # Temporary pages
- **/*?print=true # Printer-friendly versions
- **/api/v1/** # Old API docs
Advanced Settings
Follow Redirects: Yes
Max Redirects: 5
User Agent: Ask0Bot/1.0
Accept Language: en-US,en
Crawl Delay: 1 second
Timeout: 30 seconds
Concurrent Requests: 5
Rate Limit: 10 requests/second
Max Page Size: 10 MB
Max Response Time: 30 seconds
Retry Failed Pages: 3 times
Backoff Strategy: Exponential
Extract Main Content: Yes
Remove Navigation: Yes
Remove Footers: Yes
Remove Ads: Yes
Convert Tables: To Markdown
Process JavaScript: No (for SPAs, set to Yes)
Extract Metadata: Yes
Auth Type: Basic / OAuth / Cookie
Username: your-username
Password: your-password
Bearer Token: your-token
Cookies:
- name: session_id
value: abc123
Handling Different Site Types
Documentation Sites
For documentation sites, focus on technical content and exclude changelog/release notes unless specifically needed.
Include:
- /docs/**
- /api/**
- /guides/**
- /tutorials/**
Exclude:
- /blog/** # Often redundant with docs
- /changelog/** # Version-specific, can confuse
- /download/** # Binary files
Single Page Applications (SPAs)
For React, Vue, Angular sites that render content client-side:
Process JavaScript: Yes
Wait for Selector: .content-loaded
Render Timeout: 5000ms
Use Headless Browser: Yes
Blogs & Content Sites
Include:
- /posts/**
- /articles/**
- /resources/**
Exclude:
- /tag/** # Tag archives
- /author/** # Author pages
- /page/** # Pagination
Extract:
- Article content
- Publication date
- Author information
Multi-language Sites
Languages:
- en # English
- es # Spanish
- fr # French
URL Patterns:
- /en/**
- /es/**
- /fr/**
Start URLs:
- https://en.site.com
- https://es.site.com
- https://fr.site.com
Crawl Scheduling
Schedule Options
Schedule: Manual
Schedule: Every 6 hours
Schedule: Daily at 2:00 AM
Schedule: Weekly on Sunday
Schedule: Monthly on 1st
Schedule: "0 2 * * *" # Cron syntax: daily at 2 AM
Incremental Updates
Enable smart crawling to only update changed content:
Incremental Mode: Enabled
Check Method: Last-Modified Header
Check Method: Content Hash
Check Method: Sitemap Changes
Monitoring & Debugging
Crawl Status
Monitor your crawl progress:
- Queued: Pages waiting to be crawled
- Processing: Currently crawling
- Completed: Successfully indexed
- Failed: Errors during crawling
- Skipped: Excluded by rules
Common Issues & Solutions
Crawler Blocked (403/429 errors)
- Reduce crawl rate
- Add delays between requests
- Check robots.txt compliance
- Contact site owner if needed
Missing Content
- Check if content requires JavaScript
- Verify include/exclude patterns
- Look for authentication requirements
- Check max depth settings
Duplicate Content
- Add canonical URL handling
- Exclude print/mobile versions
- Use parameter exclusions
Crawl Logs
Access detailed logs for debugging:
[2024-01-15 02:00:00] Starting crawl for https://docs.site.com
[2024-01-15 02:00:01] Found sitemap: https://docs.site.com/sitemap.xml
[2024-01-15 02:00:02] Discovered 245 URLs from sitemap
[2024-01-15 02:00:03] Crawling: https://docs.site.com/getting-started
[2024-01-15 02:00:04] ✓ Indexed: /getting-started (1.2kb)
[2024-01-15 02:00:05] ⚠ Skipped: /internal/admin (excluded pattern)
[2024-01-15 02:00:06] ✗ Failed: /broken-page (404)
Optimization Tips
1. Use Sitemaps
If available, configure sitemap URLs for faster discovery:
Sitemaps:
- https://site.com/sitemap.xml
- https://site.com/sitemap-docs.xml
2. Optimize Patterns
Be specific with patterns to avoid crawling unnecessary pages:
Include: /docs/current/** # Specific: good
Include: /** # Too broad: avoid
3. Content Quality
Ensure crawled content is high quality:
- Well-structured HTML with semantic tags
- Clear headings and sections
- Minimal boilerplate text
- Updated regularly
4. Monitor Performance
Regularly check these metrics:
- Crawl duration
- Pages per minute
- Error rate
- Content freshness
API Integration
Manage web sources programmatically:
// Create web source
const source = await ask0.sources.create({
  type: 'web',
  name: 'Product Documentation',
  config: {
    startUrls: ['https://docs.example.com'],
    includePatterns: ['/guides/**', '/api/**'],
    excludePatterns: ['/archive/**'],
    maxPages: 1000,
    schedule: 'daily'
  }
});

// Trigger manual crawl
await ask0.sources.crawl(source.id);

// Get crawl status
const status = await ask0.sources.getCrawlStatus(source.id);
console.log(`Crawled ${status.completed}/${status.total} pages`);

// Update configuration
await ask0.sources.update(source.id, {
  config: {
    maxPages: 2000,
    schedule: 'hourly'
  }
});
Best Practices
- Start Small: Begin with core pages, then expand
- Test Patterns: Verify include/exclude rules before full crawl
- Monitor First Crawl: Watch for issues during initial indexing
- Regular Maintenance: Review crawl logs and update patterns
- Respect Resources: Don't overload target servers
- Version Control: Track configuration changes
Advanced Features
Custom Headers
Add headers for specific requirements:
Headers:
Accept: application/json
X-API-Key: your-api-key
Proxy Support
Route crawls through proxy:
Proxy:
URL: http://proxy.example.com:8080
Username: proxyuser
Password: proxypass
Content Transformation
Apply transformations during indexing:
Transformations:
- Remove dates from URLs
- Convert relative to absolute links
- Extract structured data (JSON-LD)