Description
Queries the Common Crawl CDX endpoint for the most recent monthly snapshot and counts how many of the target's pages it contains. Common Crawl's corpus is a major input to most open-source LLM training runs, so presence in it is the load-bearing signal for "an LLM already knows this site." The threshold is 10 pages: a target with 10 or more pages in the snapshot passes (properly crawled), while one below the threshold gets a warn (marginally known).
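
The check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function names, the `limit` value, counting distinct URLs as "pages," and treating a 404 as "no captures" are all assumptions made for the example. It relies on the public `collinfo.json` listing at index.commoncrawl.org to find the newest snapshot's CDX API URL, and it only models the pass/warn split, since the description does not say how a target with zero pages is reported.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"
PASS_THRESHOLD = 10  # from the description: 10+ pages counts as a pass


def latest_cdx_api(timeout=30):
    """Return the CDX API URL of the newest snapshot listed in collinfo.json."""
    with urllib.request.urlopen(COLLINFO_URL, timeout=timeout) as resp:
        collections = json.load(resp)
    # Assumption: the listing is ordered newest first.
    return collections[0]["cdx-api"]


def fetch_cdx_lines(cdx_api, domain, limit=50, timeout=30):
    """Fetch up to `limit` newline-delimited JSON records for domain/*."""
    query = urllib.parse.urlencode(
        {"url": f"{domain}/*", "output": "json", "limit": limit}
    )
    try:
        with urllib.request.urlopen(f"{cdx_api}?{query}", timeout=timeout) as resp:
            return resp.read().decode("utf-8").splitlines()
    except urllib.error.HTTPError as err:
        # Assumption: the index answers 404 when it holds no captures.
        if err.code == 404:
            return []
        raise


def count_captured_urls(cdx_lines):
    """Count distinct URLs among the CDX records (hypothetical 'page' metric)."""
    urls = set()
    for line in cdx_lines:
        line = line.strip()
        if line:
            urls.add(json.loads(line)["url"])
    return len(urls)


def classify(page_count, threshold=PASS_THRESHOLD):
    """Map a page count to the pass/warn outcome from the description."""
    return "pass" if page_count >= threshold else "warn"


def check_domain(domain):
    """Run the whole check for one domain (performs network requests)."""
    lines = fetch_cdx_lines(latest_cdx_api(), domain)
    count = count_captured_urls(lines)
    return count, classify(count)
```

Calling `check_domain("example.com")` would return the page count together with its status. Keeping the counting and classification steps free of network access makes the threshold logic easy to test on canned CDX records.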