Agent Disco check

Common Crawl index presence

passive · Category LLM training data · Weight 8 · Key llm_training.common_crawl

Description

Queries the Common Crawl CDX endpoint for the most recent monthly snapshot and counts the target's pages in it. Common Crawl's corpus underpins most open-source LLM training runs, so presence in the index is the strongest signal that an LLM already knows this site. A threshold of 10 pages separates pass (10 or more: properly crawled) from warn (fewer than 10: marginally known).
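
As a rough illustration of the lookup described above, here is a minimal Python sketch. It is not this check's actual implementation. It assumes that the public index server at index.commoncrawl.org lists its collections newest first in collinfo.json, that a query with output=json returns one JSON object per line, and that a 404 response means no captures. The function names, the limit of 50 records, and the de-duplication by URL are hypothetical choices made for the example. The source also does not say how zero pages is scored, so the sketch simply reports warn for anything under the threshold.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumption: the public Common Crawl index server and its collection list.
COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"
PASS_THRESHOLD = 10  # from the description: 10 or more pages is a pass


def build_query_url(cdx_api: str, domain: str, limit: int = 50) -> str:
    """Build a CDX query for every URL under the domain (hypothetical helper)."""
    params = urllib.parse.urlencode(
        {"url": f"{domain}/*", "output": "json", "limit": limit}
    )
    return f"{cdx_api}?{params}"


def count_distinct_pages(body: str) -> int:
    """Count distinct URLs in a response holding one JSON object per line."""
    urls = set()
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "url" in record:
            urls.add(record["url"])
    return len(urls)


def classify(page_count: int, threshold: int = PASS_THRESHOLD) -> str:
    """Map a page count to the two outcomes named in the description."""
    return "pass" if page_count >= threshold else "warn"


def check_domain(domain: str) -> str:
    """Query the newest snapshot and classify the domain (needs network access)."""
    with urllib.request.urlopen(COLLINFO_URL, timeout=30) as resp:
        collections = json.load(resp)
    # Assumption: the first entry is the most recent snapshot.
    cdx_api = collections[0]["cdx-api"]
    try:
        with urllib.request.urlopen(
            build_query_url(cdx_api, domain), timeout=30
        ) as resp:
            body = resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Assumption: the server answers 404 when it has no captures.
        if err.code != 404:
            raise
        body = ""
    return classify(count_distinct_pages(body))
```

Calling check_domain("example.com") would return "pass" or "warn". Capping the query with a small limit keeps the request cheap, since the check only needs to know whether the count reaches the threshold, not the exact total.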