Web Sources
Crawl and index websites to build your AI assistant's knowledge base. Configure sitemap indexing, URL patterns, and crawl settings.
Web sources allow Ask0 to automatically crawl and index websites, documentation sites, blogs, and any other web content. This is the most common way to build your AI assistant's knowledge base.
How Web Crawling Works
Ask0's intelligent web crawler:
- Starts from your specified URL(s)
- Discovers pages through links and sitemaps
- Respects robots.txt and crawler directives
- Extracts content while removing navigation and ads
- Processes and chunks content for optimal retrieval
- Indexes for semantic search
- Refreshes on your defined schedule
Setting Up Web Sources
Navigate to Sources
In your Ask0 project dashboard, click on Sources in the sidebar.
Add Web Source
Click "Add Source" and select "Web Crawler" from the options.
Configure Crawler Settings
Start URLs:
- https://docs.yoursite.com
- https://help.yoursite.com
Include Patterns:
- https://docs.yoursite.com/**
- https://help.yoursite.com/articles/**
Exclude Patterns:
- **/archive/**
- **/internal/**
- **/*.pdf
- **/changelog
Max Pages: 1000
Max Depth: 5
Follow External Links: No
Respect Robots.txt: Yes
Refresh: Daily at 2:00 AM UTC
Start Crawling
Click "Save & Start Crawling". The initial crawl typically takes:
- Small site (< 100 pages): 2-5 minutes
- Medium site (100-1000 pages): 5-20 minutes
- Large site (1000+ pages): 20-60 minutes
Configuration Options
URL Patterns
Use glob patterns to control what gets crawled:
Include Patterns:
- https://docs.site.com/v2/** # All v2 documentation
- https://blog.site.com/2024/** # 2024 blog posts only
- **/*tutorial* # Any URL containing "tutorial"
Exclude Patterns:
- **/temp/** # Temporary pages
- **/*?print=true # Printer-friendly versions
- **/api/v1/** # Old API docs
Advanced Settings
Follow Redirects: Yes
Max Redirects: 5
User Agent: Ask0Bot/1.0
Accept Language: en-US,en
Crawl Delay: 1 second
Timeout: 30 seconds
Concurrent Requests: 5
Rate Limit: 10 requests/second
Max Page Size: 10 MB
Max Response Time: 30 seconds
Retry Failed Pages: 3 times
Backoff Strategy: Exponential
Extract Main Content: Yes
Remove Navigation: Yes
Remove Footers: Yes
Remove Ads: Yes
Convert Tables: To Markdown
Process JavaScript: No (for SPAs, set to Yes)
Extract Metadata: Yes
Auth Type: Basic / OAuth / Cookie
Username: your-username
Password: your-password
Bearer Token: your-token
Cookies:
- name: session_id
value: abc123
Handling Different Site Types
Documentation Sites
For documentation sites, focus on technical content and exclude changelog/release notes unless specifically needed.
Include:
- /docs/**
- /api/**
- /guides/**
- /tutorials/**
Exclude:
- /blog/** # Often redundant with docs
- /changelog/** # Version-specific, can confuse
- /download/** # Binary files
Single Page Applications (SPAs)
For React, Vue, Angular sites that render content client-side:
Process JavaScript: Yes
Wait for Selector: .content-loaded
Render Timeout: 5000ms
Use Headless Browser: Yes
Blogs & Content Sites
Include:
- /posts/**
- /articles/**
- /resources/**
Exclude:
- /tag/** # Tag archives
- /author/** # Author pages
- /page/** # Pagination
Extract:
- Article content
- Publication date
- Author information
Multi-language Sites
Languages:
- en # English
- es # Spanish
- fr # French
URL Patterns:
- /en/**
- /es/**
- /fr/**
Start URLs:
- https://en.site.com
- https://es.site.com
- https://fr.site.com
Crawl Scheduling
Schedule Options
Schedule: Manual
Schedule: Every 6 hours
Schedule: Daily at 2:00 AM
Schedule: Weekly on Sunday
Schedule: Monthly on 1st
Schedule: "0 2 * * *" # Cron syntax: daily at 2 AM
Incremental Updates
Enable smart crawling to only update changed content:
Incremental Mode: Enabled
Check Method: Last-Modified Header
Check Method: Content Hash
Check Method: Sitemap Changes
Monitoring & Debugging
Crawl Status
Monitor your crawl progress:
- Queued: Pages waiting to be crawled
- Processing: Currently crawling
- Completed: Successfully indexed
- Failed: Errors during crawling
- Skipped: Excluded by rules
Common Issues & Solutions
Crawler Blocked (403/429 errors)
- Reduce crawl rate
- Add delays between requests
- Check robots.txt compliance
- Contact site owner if needed
Missing Content
- Check if content requires JavaScript
- Verify include/exclude patterns
- Look for authentication requirements
- Check max depth settings
Duplicate Content
- Add canonical URL handling
- Exclude print/mobile versions
- Use parameter exclusions
Crawl Logs
Access detailed logs for debugging:
[2024-01-15 02:00:00] Starting crawl for https://docs.site.com
[2024-01-15 02:00:01] Found sitemap: https://docs.site.com/sitemap.xml
[2024-01-15 02:00:02] Discovered 245 URLs from sitemap
[2024-01-15 02:00:03] Crawling: https://docs.site.com/getting-started
[2024-01-15 02:00:04] ✓ Indexed: /getting-started (1.2kb)
[2024-01-15 02:00:05] ⚠ Skipped: /internal/admin (excluded pattern)
[2024-01-15 02:00:06] ✗ Failed: /broken-page (404)
Optimization Tips
1. Use Sitemaps
If available, configure sitemap URLs for faster discovery:
Sitemaps:
- https://site.com/sitemap.xml
- https://site.com/sitemap-docs.xml
2. Optimize Patterns
Be specific with patterns to avoid crawling unnecessary pages:
Include: /docs/current/** # Specific: good
Include: /** # Too broad: avoid
3. Content Quality
Ensure crawled content is high quality:
- Well-structured HTML with semantic tags
- Clear headings and sections
- Minimal boilerplate text
- Updated regularly
4. Monitor Performance
Regularly check these metrics:
- Crawl duration
- Pages per minute
- Error rate
- Content freshness
API Integration
Manage web sources programmatically:
// Create web source
const source = await ask0.sources.create({
  type: 'web',
  name: 'Product Documentation',
  config: {
    startUrls: ['https://docs.example.com'],
    includePatterns: ['/guides/**', '/api/**'],
    excludePatterns: ['/archive/**'],
    maxPages: 1000,
    schedule: 'daily'
  }
});

// Trigger manual crawl
await ask0.sources.crawl(source.id);

// Get crawl status
const status = await ask0.sources.getCrawlStatus(source.id);
console.log(`Crawled ${status.completed}/${status.total} pages`);

// Update configuration
await ask0.sources.update(source.id, {
  config: {
    maxPages: 2000,
    schedule: 'hourly'
  }
});
Best Practices
- Start Small: Begin with core pages, then expand
- Test Patterns: Verify include/exclude rules before full crawl
- Monitor First Crawl: Watch for issues during initial indexing
- Regular Maintenance: Review crawl logs and update patterns
- Respect Resources: Don't overload target servers
- Version Control: Track configuration changes
Advanced Features
Custom Headers
Add headers for specific requirements:
Headers:
Accept: application/json
X-API-Key: your-api-key
Proxy Support
Route crawls through proxy:
Proxy:
URL: http://proxy.example.com:8080
Username: proxyuser
Password: proxypass
Content Transformation
Apply transformations during indexing:
Transformations:
- Remove dates from URLs
- Convert relative to absolute links
- Extract structured data (JSON-LD)