Web Sources

Crawl and index websites to build your AI assistant's knowledge base. Configure sitemap indexing, URL patterns, and crawl settings.

Web sources allow Ask0 to automatically crawl and index websites, documentation sites, blogs, and any other web content. This is the most common way to build your AI assistant's knowledge base.

How Web Crawling Works

Ask0's intelligent web crawler:

  1. Starts from your specified URL(s)
  2. Discovers pages through links and sitemaps
  3. Respects robots.txt and crawler directives
  4. Extracts content while removing navigation and ads
  5. Processes and chunks content for optimal retrieval
  6. Indexes for semantic search
  7. Refreshes on your defined schedule
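
For intuition, here is a heavily simplified sketch of that loop in code. It is illustrative only (Node 18+ for the global fetch), not Ask0's actual crawler: robots.txt checks, sitemap parsing, chunking, and indexing are stubbed out or omitted, and the start URL is a placeholder.

// Illustrative sketch of the crawl loop described above -- not Ask0's real code.
// Assumes Node 18+ (global fetch). Link discovery and content extraction are simplified.
const seen = new Set();
const queue = ['https://docs.example.com/'];

while (queue.length > 0 && seen.size < 1000) {          // Max Pages guard
  const url = queue.shift();
  if (seen.has(url)) continue;
  seen.add(url);

  const res = await fetch(url);                          // fetch the page
  if (!res.ok) continue;
  const html = await res.text();

  // discover new links (a real crawler also parses sitemaps and respects robots.txt)
  for (const [, href] of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
    if (href.startsWith('https://docs.example.com/')) queue.push(href);
  }

  // extract main content, chunk it, and index it (stubbed here)
  const text = html.replace(/<[^>]+>/g, ' ');            // crude tag stripping
  console.log(`indexed ${url} (${text.length} chars)`);
}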

Setting Up Web Sources

Navigate to Sources

In your Ask0 project dashboard, click on Sources in the sidebar.

Add Web Source

Click "Add Source" and select "Web Crawler" from the options.

Configure Crawler Settings

Start URLs:
  - https://docs.yoursite.com
  - https://help.yoursite.com

Include Patterns:
  - https://docs.yoursite.com/**
  - https://help.yoursite.com/articles/**

Exclude Patterns:
  - **/archive/**
  - **/internal/**
  - **/*.pdf
  - **/changelog

Max Pages: 1000
Max Depth: 5
Follow External Links: No
Respect Robots.txt: Yes

Refresh: Daily at 2:00 AM UTC

Start Crawling

Click "Save & Start Crawling". The initial crawl typically takes:

  • Small site (< 100 pages): 2-5 minutes
  • Medium site (100-1000 pages): 5-20 minutes
  • Large site (1000+ pages): 20-60 minutes

Configuration Options

URL Patterns

Use glob patterns to control what gets crawled:

Include Patterns:
  - https://docs.site.com/v2/**    # All v2 documentation
  - https://blog.site.com/2024/**  # 2024 blog posts only
  - **/*tutorial*                  # Any URL containing "tutorial"

Exclude Patterns:
  - **/temp/**                     # Temporary pages
  - **/*?print=true                # Printer-friendly versions
  - **/api/v1/**                   # Old API docs
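
To sanity-check rules like the ones above before running a full crawl, you can approximate the matching locally. The converter below is a rough stand-in for a glob matcher, not Ask0's actual pattern engine, which may differ in edge cases:

// Rough glob-to-regex approximation for locally testing URL patterns.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, '\\$&'); // escape regex chars except *
  const pattern = escaped
    .replace(/\*\*/g, '\u0000')     // placeholder for ** so single * is handled first
    .replace(/\*/g, '[^/]*')        // *  matches within one path segment
    .replace(/\u0000/g, '.*');      // ** matches across path segments
  return new RegExp(`^${pattern}$`);
}

const include = globToRegExp('https://docs.site.com/v2/**');
const exclude = globToRegExp('**/temp/**');

console.log(include.test('https://docs.site.com/v2/auth/tokens')); // true
console.log(exclude.test('https://docs.site.com/v2/temp/draft'));  // true -> would be skipped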

Advanced Settings

Request behavior:
  Follow Redirects: Yes
  Max Redirects: 5
  User Agent: Ask0Bot/1.0
  Accept Language: en-US,en
  Crawl Delay: 1 second
  Timeout: 30 seconds
  Concurrent Requests: 5
  Rate Limit: 10 requests/second
  Max Page Size: 10 MB
  Max Response Time: 30 seconds
  Retry Failed Pages: 3 times
  Backoff Strategy: Exponential

Content extraction:
  Extract Main Content: Yes
  Remove Navigation: Yes
  Remove Footers: Yes
  Remove Ads: Yes
  Convert Tables: To Markdown
  Process JavaScript: No (for SPAs, set to Yes)
  Extract Metadata: Yes

Authentication (for protected sites; set only the fields your auth type needs):
  Auth Type: Basic / OAuth / Cookie
  Username: your-username      # Basic
  Password: your-password      # Basic
  Bearer Token: your-token     # OAuth
  Cookies:                     # Cookie
    - name: session_id
      value: abc123
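
If you create sources through the API (covered under API Integration below), credentials can likely be included in the source config as well. The auth field and its shape here are illustrative assumptions, not a documented schema; check the API reference for the exact format.

// Hypothetical example: supplying Basic auth when creating a source via the API.
// The `auth` field name and shape are illustrative assumptions, not a documented schema.
const protectedSource = await ask0.sources.create({
  type: 'web',
  name: 'Internal Knowledge Base',
  config: {
    startUrls: ['https://kb.internal.example.com'],
    maxPages: 500,
    auth: {                               // assumed field mirroring the dashboard's Auth settings
      type: 'basic',
      username: process.env.KB_USERNAME,  // keep credentials out of source control
      password: process.env.KB_PASSWORD
    }
  }
});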

Handling Different Site Types

Documentation Sites

For documentation sites, focus on technical content and exclude changelog/release notes unless specifically needed.

Include:
  - /docs/**
  - /api/**
  - /guides/**
  - /tutorials/**
Exclude:
  - /blog/**        # Often redundant with docs
  - /changelog/**   # Version-specific, can confuse
  - /download/**    # Binary files

Single Page Applications (SPAs)

For React, Vue, or Angular sites that render content client-side:

Process JavaScript: Yes
Wait for Selector: .content-loaded
Render Timeout: 5000ms
Use Headless Browser: Yes
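
To confirm locally that your content only appears after client-side rendering (and therefore needs Process JavaScript enabled), a quick headless-browser check with a tool like Playwright can help. Playwright is separate from Ask0, and the URL below is a placeholder; the selector matches the Wait for Selector example above.

// Quick local check with Playwright (npm i playwright) -- unrelated to Ask0 itself.
// If the text only appears in the rendered page, enable Process JavaScript.
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://app.example.com/docs/getting-started');

// Wait for the same selector you configure in "Wait for Selector"
await page.waitForSelector('.content-loaded', { timeout: 5000 });
const renderedText = await page.innerText('body');
console.log(renderedText.length > 0 ? 'content rendered client-side' : 'no content found');

await browser.close();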

Blogs & Content Sites

Include:
  - /posts/**
  - /articles/**
  - /resources/**
Exclude:
  - /tag/**         # Tag archives
  - /author/**      # Author pages
  - /page/**        # Pagination
Extract:
  - Article content
  - Publication date
  - Author information

Multi-language Sites

Languages:
  - en    # English
  - es    # Spanish
  - fr    # French

URL Patterns:
  - /en/**
  - /es/**
  - /fr/**

Start URLs:
  - https://en.site.com
  - https://es.site.com
  - https://fr.site.com

Crawl Scheduling

Schedule Options

Choose a manual trigger, a preset interval, or a custom cron expression:

Schedule: Manual

Schedule: Every 6 hours
Schedule: Daily at 2:00 AM
Schedule: Weekly on Sunday
Schedule: Monthly on 1st

Schedule: "0 2 * * *"  # Daily at 2 AM

Incremental Updates

Enable incremental mode so scheduled re-crawls only update content that has changed. Choose one check method:

Incremental Mode: Enabled
Check Method: Last-Modified Header / Content Hash / Sitemap Changes
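
As a mental model for the Content Hash method: the crawler can store a fingerprint of each page's extracted text and skip re-indexing when the fingerprint is unchanged. A minimal sketch follows; it is illustrative, not Ask0's implementation.

// Minimal sketch of content-hash change detection -- not Ask0's implementation.
import { createHash } from 'node:crypto';

const previousHashes = new Map();   // url -> hash from the last crawl

function hasChanged(url, extractedText) {
  const hash = createHash('sha256').update(extractedText).digest('hex');
  const changed = previousHashes.get(url) !== hash;
  previousHashes.set(url, hash);
  return changed;                   // only re-chunk and re-index when true
}

console.log(hasChanged('/getting-started', 'Welcome to the docs...')); // true (first crawl)
console.log(hasChanged('/getting-started', 'Welcome to the docs...')); // false (unchanged)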

Monitoring & Debugging

Crawl Status

Monitor your crawl progress:

  • Queued: Pages waiting to be crawled
  • Processing: Currently crawling
  • Completed: Successfully indexed
  • Failed: Errors during crawling
  • Skipped: Excluded by rules

Common Issues & Solutions

Crawler Blocked (403/429 errors)

  • Reduce crawl rate
  • Add delays between requests
  • Check robots.txt compliance
  • Contact site owner if needed

Missing Content

  • Check if content requires JavaScript
  • Verify include/exclude patterns
  • Look for authentication requirements
  • Check max depth settings

Duplicate Content

  • Add canonical URL handling
  • Exclude print/mobile versions
  • Use parameter exclusions
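
One common source of duplicates is the same page reachable under slightly different URLs (tracking parameters, print views, trailing slashes). Normalizing URLs before they are queued is the general fix; the sketch below illustrates the idea and is not Ask0's internal logic.

// Rough URL normalization sketch for de-duplicating crawl targets.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = '';                                           // fragments never change content
  for (const param of ['print', 'utm_source', 'utm_medium', 'utm_campaign']) {
    url.searchParams.delete(param);                        // drop print/tracking parameters
  }
  url.pathname = url.pathname.replace(/\/+$/, '') || '/';  // collapse trailing-slash variants
  return url.toString();
}

console.log(normalizeUrl('https://docs.site.com/guides/?print=true#setup'));
// -> https://docs.site.com/guides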

Crawl Logs

Access detailed logs for debugging:

[2024-01-15 02:00:00] Starting crawl for https://docs.site.com
[2024-01-15 02:00:01] Found sitemap: https://docs.site.com/sitemap.xml
[2024-01-15 02:00:02] Discovered 245 URLs from sitemap
[2024-01-15 02:00:03] Crawling: https://docs.site.com/getting-started
[2024-01-15 02:00:04] ✓ Indexed: /getting-started (1.2kb)
[2024-01-15 02:00:05] ⚠ Skipped: /internal/admin (excluded pattern)
[2024-01-15 02:00:06] ✗ Failed: /broken-page (404)

Optimization Tips

1. Use Sitemaps

If available, configure sitemap URLs for faster discovery:

Sitemaps:
  - https://site.com/sitemap.xml
  - https://site.com/sitemap-docs.xml
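
To preview which URLs a sitemap will hand to the crawler, you can fetch it and list its <loc> entries yourself. This quick check is not Ask0-specific (Node 18+ for the global fetch), and it does not follow sitemap index files that point to further sitemaps.

// List the URLs exposed by a sitemap (Node 18+ global fetch).
// Sitemap index files (<sitemapindex>) reference further sitemaps and are not handled here.
const xml = await (await fetch('https://site.com/sitemap.xml')).text();
const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);

console.log(`${urls.length} URLs in sitemap`);
console.log(urls.slice(0, 5));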

2. Optimize Patterns

Be specific with patterns to avoid crawling unnecessary pages:

Good (scoped to the current docs version):
  Include: /docs/current/**

Too broad (spends the page budget on everything):
  Include: /**

3. Content Quality

Ensure crawled content is high quality:

  • Well-structured HTML with semantic tags
  • Clear headings and sections
  • Minimal boilerplate text
  • Updated regularly

4. Monitor Performance

Regularly check these metrics:

  • Crawl duration
  • Pages per minute
  • Error rate
  • Content freshness

API Integration

Manage web sources programmatically:

// Create web source
const source = await ask0.sources.create({
  type: 'web',
  name: 'Product Documentation',
  config: {
    startUrls: ['https://docs.example.com'],
    includePatterns: ['/guides/**', '/api/**'],
    excludePatterns: ['/archive/**'],
    maxPages: 1000,
    schedule: 'daily'
  }
});

// Trigger manual crawl
await ask0.sources.crawl(source.id);

// Get crawl status
const status = await ask0.sources.getCrawlStatus(source.id);
console.log(`Crawled ${status.completed}/${status.total} pages`);

// Update configuration
await ask0.sources.update(source.id, {
  config: {
    maxPages: 2000,
    schedule: 'hourly'
  }
});
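
To block until a manually triggered crawl finishes, you can poll the status endpoint. The loop below assumes completion is signalled by completed reaching total, as suggested by the snippet above; the real status object may expose an explicit state field, so check the API reference.

// Trigger a crawl and poll until it finishes.
// Assumes `status.completed === status.total` signals completion; the exact
// completion signal may differ in the real API.
await ask0.sources.crawl(source.id);

let status = await ask0.sources.getCrawlStatus(source.id);
while (status.total === 0 || status.completed < status.total) {
  await new Promise((resolve) => setTimeout(resolve, 10_000));  // wait 10s between checks
  status = await ask0.sources.getCrawlStatus(source.id);
}
console.log(`Crawl finished: ${status.completed}/${status.total} pages indexed`);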

Best Practices

  1. Start Small: Begin with core pages, then expand
  2. Test Patterns: Verify include/exclude rules before full crawl
  3. Monitor First Crawl: Watch for issues during initial indexing
  4. Regular Maintenance: Review crawl logs and update patterns
  5. Respect Resources: Don't overload target servers
  6. Version Control: Track configuration changes

Advanced Features

Custom Headers

Add headers for specific requirements:

Headers:
  Accept: application/json
  X-API-Key: your-api-key

Proxy Support

Route crawls through proxy:

Proxy:
  URL: http://proxy.example.com:8080
  Username: proxyuser
  Password: proxypass

Content Transformation

Apply transformations during indexing:

Transformations:
  - Remove dates from URLs
  - Convert relative to absolute links
  - Extract structured data (JSON-LD)
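
The relative-to-absolute link conversion is easiest to picture with the standard URL constructor; this shows the general technique, not Ask0's internal code.

// Resolving relative links against the page URL -- the general technique behind
// the "Convert relative to absolute links" transformation.
const pageUrl = 'https://docs.site.com/guides/setup';

console.log(new URL('../api/auth', pageUrl).toString());     // https://docs.site.com/api/auth
console.log(new URL('images/flow.png', pageUrl).toString()); // https://docs.site.com/guides/images/flow.png
console.log(new URL('/changelog', pageUrl).toString());      // https://docs.site.com/changelog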

Next Steps