# Agent Disco — full reference
> Grade any public URL for AI-agent discoverability. This file is the long-form knowledge dump — start at https://agentdisco.io/llms.txt for the short version.
Agent Disco fetches the handful of URIs agent frameworks actually read (robots.txt, llms.txt, /.well-known/ai-plugin.json, /.well-known/agent.json, /.well-known/mcp.json, OpenAPI specs, SDK signals, registry presence) and returns a letter grade A–F plus a per-category breakdown. The scanner runs on-demand only: no background crawling, no sitemap-walking, no link-following. One scan sends 10–20 requests against the target origin at 2 req/s max, then stops.
## How to use
1. Submit a URL at https://agentdisco.io/ — or POST to /api/v1/scans via the API.
2. The scanner runs all registered checks against the target. Per-category weights come from config/packages/agent_disco.yaml; each check's weight is published at /checks.
3. When the scan completes (status=completed), the grade + per-category breakdown + quick-win hints are available at /report/{host} and /api/v1/websites/{host}.
4. Per-scan detail (individual check findings with evidence) lives at /scan/{id}.
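The four steps above can be sketched as a request plan. Only the endpoint paths come from this document; the POST body shape, polling behavior, and the naive host extraction below are illustrative assumptions:

```python
BASE = "https://agentdisco.io"

def scan_request_plan(url: str, scan_id: str = "{id}") -> list[tuple[str, str]]:
    """Order of calls for one scan: submit the URL, then (once
    status=completed) read the per-host report and the per-scan findings.
    The POST payload and polling loop are left out -- not specified here."""
    host = url.split("//", 1)[-1].split("/", 1)[0]  # naive host extraction
    return [
        ("POST", f"{BASE}/api/v1/scans"),            # step 1: submit the URL
        ("GET",  f"{BASE}/api/v1/websites/{host}"),  # step 3: grade + breakdown (JSON)
        ("GET",  f"{BASE}/report/{host}"),           # step 3: human-readable report
        ("GET",  f"{BASE}/scan/{scan_id}"),          # step 4: per-check findings
    ]
```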
## Scoring groups (AD-35a)
The overall grade is a weighted average of six groups:
- 25% protocol_surface (well_known/ai_plugin_json, well_known/agent_json, well_known/mcp_json, protocols/*, identity/tls)
- 25% api_docs (api/*, docs/*)
- 20% onboarding (onboarding/*, identity/email_auth)
- 15% crawl_llm_training (crawl/*, root_level/*, html_meta/*, llm_training/*)
- 10% trust_realtime (anti_bot/*, identity/*)
- 5% economic_federation (registries/*)
Letter bands: A ≥ 85, B ≥ 70, C ≥ 55, D ≥ 40, F < 40. Skipped and errored checks are excluded from scoring (they carry `pointsPossible = null`).
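The weighting and letter bands above can be sketched as follows. How exclusion of skipped/errored groups interacts with the weights (renormalising the remaining weights) is an assumption, not something this document specifies:

```python
# Group weights from the AD-35a scoring section (sum to 1.0).
WEIGHTS = {
    "protocol_surface": 0.25,
    "api_docs": 0.25,
    "onboarding": 0.20,
    "crawl_llm_training": 0.15,
    "trust_realtime": 0.10,
    "economic_federation": 0.05,
}

def overall_score(group_scores: dict) -> float:
    """Weighted average of per-group scores (each 0-100). Groups scored
    None (all checks skipped/errored, pointsPossible null) are excluded
    and the remaining weights renormalised -- an assumption."""
    scored = {g: s for g, s in group_scores.items() if s is not None}
    total_w = sum(WEIGHTS[g] for g in scored)
    return sum(WEIGHTS[g] * s for g, s in scored.items()) / total_w

def letter(score: float) -> str:
    """Map a 0-100 score to the published letter bands."""
    for band, cutoff in (("A", 85), ("B", 70), ("C", 55), ("D", 40)):
        if score >= cutoff:
            return band
    return "F"
```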
## Full check catalogue
### Well-known URIs
- **well_known.agent_json** (weight 10) — /.well-known/agent.json (A2A AgentCard): Looks for an A2A AgentCard at `/.well-known/agent.json` ([a2aprotocol.ai](https://a2aprotocol.ai)) and checks for the four load-bearing top-level keys: `name`, `description`, `skills`, `endpoints`. We only grade presence + shape; deep A2A capability discovery is a separate sub-product.
- **well_known.ai_plugin_json** (weight 8) — /.well-known/ai-plugin.json manifest: Looks for the ChatGPT-plugin manifest at `/.well-known/ai-plugin.json` and counts how many of the load-bearing OpenAI-schema keys are present (`name_for_human`, `name_for_model`, `description_for_model`, `api`). The convention is informally deprecated, but many LLM runtimes still probe for it.
- **well_known.mcp_json** (weight 8) — /.well-known/mcp.json (Model Context Protocol): Looks for an MCP (Model Context Protocol) manifest at `/.well-known/mcp.json`. Records the top-level keys but only insists on the presence of at least one MCP indicator (`server`, `capabilities`, or `tools`) because the schema is still evolving.
- **well_known.openapi** (weight 6) — /.well-known/openapi.{json,yaml}: Probes the two RFC 8615 well-known OpenAPI paths (`/.well-known/openapi.json`, `/.well-known/openapi.yaml`) and confirms the body is OpenAPI 3.x. This is narrower than the general OpenAPI discovery check — sites that expose their spec at the well-known path earn both credits.
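For orientation, a minimal `/.well-known/agent.json` with the four graded top-level keys might look like the fragment below. All values are illustrative, and the shape of each `skills` entry is an assumption — consult the A2A spec for the real schema:

```json
{
  "name": "Example Agent",
  "description": "Answers questions about example.com's product catalogue.",
  "skills": [
    { "id": "catalogue-search", "description": "Search products by keyword" }
  ],
  "endpoints": { "a2a": "https://example.com/api/a2a" }
}
```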
### Root-level files
- **root_level.ai_txt** (weight 4) — /ai.txt AI-crawler directives: Looks for `/ai.txt` — a secondary, still-draft root-level declaration of AI-crawler directives. Multiple drafts compete, so this check only records presence + a body excerpt, not any specific schema. Absence is not a negative signal (skip, not fail).
- **root_level.llms_full_txt** (weight 8) — /llms-full.txt long-form index: Looks for `/llms-full.txt` — the long-form companion to `/llms.txt` ([llmstxt.org](https://llmstxt.org)). Passes on a substantive (≥ 1 KB) text body; warns on small or odd-typed responses; flags an HTML catch-all as a fail.
- **root_level.llms_txt** (weight 8) — /llms.txt index for LLMs: Looks for `/llms.txt` at the site root — a Markdown index of important URLs + summaries for LLM consumers ([llmstxt.org](https://llmstxt.org)). Presence is the signal; we don't enforce the schema. Flags a plausible SPA catch-all (HTML at the path) as a fail.
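A minimal `/llms.txt` following the llmstxt.org convention (a title, a one-line blockquote summary, then sections of annotated links) might look like this — the site name and URLs are placeholders:

```markdown
# Example Co
> One-line summary of what the site offers and who it is for.

## Docs
- [API reference](https://example.com/docs/api): REST endpoints and auth
- [Quickstart](https://example.com/docs/quickstart): first API call in 5 minutes
```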
### Crawl & indexing
- **crawl.feed** (weight 4) — RSS/Atom feed: Looks for an RSS/Atom feed via the conventional paths (`/feed`, `/feed.xml`, `/rss`, `/atom.xml`) and the homepage `<link rel="alternate">` tag. The link-alternate declaration is the more reliable signal; the paths are a fallback. Absence is a skip, not a fail — feeds are secondary to sitemaps for agent discoverability.
- **crawl.robots_txt** (weight 13) — robots.txt AI-agent rules: Parses `/robots.txt` and checks whether the major AI crawlers (GPTBot, ClaudeBot, PerplexityBot, CCBot, Google-Extended, and others) are allowed to crawl the site root. A blanket `User-agent: * / Disallow: /` fails the check outright.
- **crawl.sitemap** (weight 10) — XML sitemap discovery: Looks for a sitemap via both the conventional paths (`/sitemap.xml`, `/sitemap_index.xml`) and any `Sitemap:` directives in `/robots.txt`. Accepts both `<urlset>` and `<sitemapindex>` roots. Absence is a skip, not a fail — we can't tell "no sitemap" apart from "sitemap lives somewhere undeclared".
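To illustrate the robots.txt and sitemap checks together, a `/robots.txt` that would pass might look like this sketch — the crawler names come from the check description above; the sitemap URL is a placeholder:

```text
# Explicitly allow the major AI crawlers at the root
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Declare the sitemap so crawl.sitemap can find it even at a nonstandard path
Sitemap: https://example.com/sitemap.xml
```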
### HTML & meta
- **html_meta.description** (weight 3) — meta description: Looks for a `<meta name="description">` on the homepage, and checks that it falls in the 50-300 character range most search engines and AI agents actually surface. Absence fails; an SPA shell (homepage that doesn't render HTML) is a skip.
- **html_meta.json_ld** (weight 8) — JSON-LD structured data: Parses every `<script type="application/ld+json">` block on the homepage.