stackpeek

methodology · version 1.0

Every check, documented.

Every verdict on every page links back to one of the checks below. If you cannot click through from a finding to its methodology entry, we have not earned the finding. This page is versioned — the change log at the bottom records every revision.

last updated 2026-04-14

the checks


  1. 01

    Third-party origin extraction

    foundational

    What. Parse the homepage HTML and enumerate every script, link, iframe, image, and embed loaded from a domain other than the site's own origin.

    How. Plain GET request with a standard browser user-agent. Parse response HTML with lxml. Extract every src, href, and srcset attribute. Resolve relative URLs. Group by registered domain.

    Why. Every other check depends on this one. If we cannot enumerate third-party origins, we cannot compare them to policy claims.
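The extraction step above can be sketched with the standard library alone (stackpeek itself parses with lxml; the registered-domain split here is a naive last-two-labels stand-in for a proper Public Suffix List lookup, and srcset handling is omitted):

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class RefCollector(HTMLParser):
    """Collect src/href values from script, link, iframe, img, and embed tags."""
    TAGS = {"script", "link", "iframe", "img", "embed"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            for name, value in attrs:
                if name in ("src", "href") and value:
                    self.urls.append(value)


def registered_domain(host):
    # Simplification: last two DNS labels. A real implementation needs
    # the Public Suffix List to handle e.g. .co.uk correctly.
    return ".".join(host.split(".")[-2:])


def third_party_origins(base_url, html):
    """Resolve every reference and group third-party URLs by registered domain."""
    own = registered_domain(urlsplit(base_url).hostname)
    parser = RefCollector()
    parser.feed(html)
    groups = defaultdict(list)
    for raw in parser.urls:
        url = urljoin(base_url, raw)  # resolve relative URLs against the page
        host = urlsplit(url).hostname
        if host and registered_domain(host) != own:
            groups[registered_domain(host)].append(url)
    return dict(groups)
```

Same-origin references (like a relative `/logo.png`) resolve to the site's own registered domain and drop out; everything else lands in the per-domain groups that the later checks consume.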

  2. 02

    Vendor identification and category assignment

    foundational

    What. Map each third-party domain to a known vendor and a category (analytics, advertising, session replay, tag manager, support, payments, CDN, fonts, video, social, security, error tracking).

    How. Match against a bundled database of ~150 well-known domain → vendor entries. For unknown domains, optionally use the Anthropic API to classify based on the domain name and any available evidence. Stackpeek degrades gracefully to rules-only mode if no API key is present.

    Why. Raw domains are not actionable. The category — is this tracking, analytics, advertising, or a font CDN? — is what determines whether a mismatch matters.
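A minimal sketch of the lookup-then-fallback shape described above. The database entries here are illustrative, not the actual bundled data, and `llm_classify` stands in for the optional Anthropic API call:

```python
# Illustrative excerpt of a domain -> (vendor, category) database.
VENDOR_DB = {
    "googletagmanager.com": ("Google Tag Manager", "tag manager"),
    "google-analytics.com": ("Google Analytics", "analytics"),
    "doubleclick.net": ("Google Ads", "advertising"),
    "hotjar.com": ("Hotjar", "session replay"),
    "fonts.gstatic.com": ("Google Fonts", "fonts"),
}


def classify(domain, llm_classify=None):
    """Rules first; optionally fall back to an LLM classifier."""
    if domain in VENDOR_DB:
        return VENDOR_DB[domain]
    if llm_classify is not None:
        return llm_classify(domain)  # e.g. an Anthropic API call
    return (domain, "unknown")       # rules-only degradation, no API key needed
```

The fallback is injected as a callable, so rules-only mode is just `classify(domain)` with no second argument.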

  3. 03

    Privacy policy URL discovery

    foundational

    What. Find the privacy policy URL from the homepage.

    How. Scan homepage <a> tags for href values matching /privacy, /privacy-policy, /legal/privacy, /privacy.html, or similar. Fall back to common default paths. Follow one level of redirect. Emit the first candidate that returns 200 OK.

    Why. Every finding that compares observed behavior to stated policy depends on locating the actual policy document. If no policy is found, the audit reports 'no policy available' as its own finding.
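The discovery order above can be sketched as follows; the regex and fallback paths are illustrative approximations of the patterns listed, and `fetch_status` is a stand-in for the HTTP client that follows one redirect:

```python
import re
from urllib.parse import urljoin

# Approximation of the href patterns listed above.
PRIVACY_HREF = re.compile(r"/(privacy(-policy)?|legal/privacy)(\.html)?/?$", re.I)
FALLBACK_PATHS = ["/privacy", "/privacy-policy"]  # common default paths


def find_policy_url(base_url, hrefs, fetch_status):
    """Return the first candidate URL that answers 200 OK, else None.

    hrefs: href values scraped from homepage <a> tags.
    fetch_status: callable url -> final HTTP status after one redirect.
    """
    candidates = [urljoin(base_url, h) for h in hrefs if PRIVACY_HREF.search(h)]
    candidates += [urljoin(base_url, p) for p in FALLBACK_PATHS]
    for url in candidates:
        if fetch_status(url) == 200:
            return url
    return None  # surfaced downstream as a 'no policy available' finding
```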

  4. 04

    Structured policy claim extraction

    core

    What. Extract structured claims from the privacy policy text: does it claim to collect PII, share with third parties, sell data, use cookies, use analytics, use advertising, and which specific third parties are named.

    How. Strip the policy HTML with trafilatura. Submit the cleaned text to the Anthropic API using a tool-use schema with explicit booleans and a named-third-party list. In --no-llm mode, skip this step entirely and report only the policy URL.

    Why. A privacy policy is a prose document. Converting it into a structured record of yes/no claims and named parties is what makes comparison to observed behavior possible at all.
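The tool-use schema might look something like the sketch below. The field names are assumptions for illustration; the actual schema in stackpeek may differ:

```python
# Hypothetical tool-use schema for structured claim extraction.
# Explicit booleans force the model to commit to yes/no on each claim.
POLICY_CLAIMS_TOOL = {
    "name": "record_policy_claims",
    "description": "Record structured claims extracted from a privacy policy.",
    "input_schema": {
        "type": "object",
        "properties": {
            "collects_pii": {"type": "boolean"},
            "shares_with_third_parties": {"type": "boolean"},
            "sells_data": {"type": "boolean"},
            "uses_cookies": {"type": "boolean"},
            "uses_analytics": {"type": "boolean"},
            "uses_advertising": {"type": "boolean"},
            "named_third_parties": {
                "type": "array",
                "items": {"type": "string"},
            },
        },
        "required": [
            "collects_pii", "shares_with_third_parties", "sells_data",
            "uses_cookies", "uses_analytics", "uses_advertising",
            "named_third_parties",
        ],
    },
}
```

Marking every field required means a missing claim is a schema violation rather than a silent gap, which keeps the downstream rule checks honest.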

  5. 05

    Deterministic rule checks

    core

    What. Apply a set of if-then rules that compare the observed stack to the extracted policy. Example: if policy.uses_analytics is false and observed analytics count > 0, emit a MISMATCH finding.

    How. Python code in stackpeek/auditor.py. Each rule is explicit and unit-tested. Runs with zero LLM calls. Every MISMATCH-level finding must be producible by at least one rule.

    Why. Deterministic rules produce findings that do not depend on model output. They are reproducible, testable, and defensible in a disagreement.
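The example rule from above, sketched end to end. The `Finding` shape and rule name are illustrative, not the actual types in stackpeek/auditor.py:

```python
from dataclasses import dataclass


@dataclass
class Finding:
    level: str    # MISMATCH, WARN, or NOTE
    rule: str     # machine-readable rule id
    message: str


def rule_analytics_denied(policy, observed):
    """If policy.uses_analytics is false and analytics vendors were observed,
    emit a MISMATCH finding; otherwise return None."""
    analytics = [vendor for vendor, category in observed if category == "analytics"]
    if policy.get("uses_analytics") is False and analytics:
        return Finding(
            level="MISMATCH",
            rule="analytics-denied-but-observed",
            message=f"Policy claims no analytics; observed: {', '.join(analytics)}",
        )
    return None
```

Because the rule is a pure function of two inputs, it is trivially unit-testable and produces the same verdict on every run.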

  6. 06

    Inline script pattern matching

    supporting

    What. Scan inline JavaScript in the HTML for known analytics and advertising fingerprints: gtag(), fbq(), _hjSettings, clarity, Sentry.init, amplitude, mixpanel, ttq, twq, rdt, and similar.

    How. Regex pattern matching against inline <script> contents. Each match contributes to the vendor presence set independently of whether an external script tag was observed.

    Why. Many trackers use a single inline configuration script that loads additional resources at runtime. Catching the inline fingerprint is often the only signal available without a headless browser.
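A sketch of the pattern set, using a few of the fingerprints listed above (the exact patterns and vendor names in stackpeek may differ):

```python
import re

# Illustrative pattern -> vendor pairs for inline fingerprints.
INLINE_FINGERPRINTS = [
    (re.compile(r"\bgtag\s*\("), "Google Analytics"),
    (re.compile(r"\bfbq\s*\("), "Meta Pixel"),
    (re.compile(r"\b_hjSettings\b"), "Hotjar"),
    (re.compile(r"\bSentry\.init\b"), "Sentry"),
]


def match_inline(scripts):
    """Return the set of vendors fingerprinted in inline <script> bodies."""
    found = set()
    for body in scripts:
        for pattern, vendor in INLINE_FINGERPRINTS:
            if pattern.search(body):
                found.add(vendor)
    return found
```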

  7. 07

    LLM-assisted nuance findings

    supporting

    What. Ask the LLM to produce findings that pure rules cannot express: partial mismatches, implied disclosures, vague policy language, and missing but expected disclosures.

    How. A structured prompt that includes the observed third parties, the extracted policy claims, and the original policy text, and asks Claude to emit additional findings at NOTE or WARN level. Never used for MISMATCH-level findings — those must come from rules.

    Why. Some disclosure gaps are real but require reading comprehension to detect. The LLM is good at those. They are always labeled NOTE or WARN, never MISMATCH, because they are inherently subjective.
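The severity ceiling described above is worth enforcing in code rather than trusting the prompt alone. A hypothetical gate, assuming findings arrive as dicts with a "level" key:

```python
def gate_llm_findings(findings):
    """Enforce the invariant that LLM-sourced findings are at most WARN.

    Any MISMATCH the model emits is demoted to WARN, because
    MISMATCH-level findings must come from deterministic rules.
    """
    allowed = {"NOTE", "WARN"}
    return [
        {**f, "level": f["level"] if f["level"] in allowed else "WARN"}
        for f in findings
    ]
```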

  8. 08

    Security header inspection

    supporting

    What. Check for Strict-Transport-Security (HSTS), Content-Security-Policy (CSP), Referrer-Policy, X-Content-Type-Options, and Permissions-Policy on the homepage response.

    How. Read the HTTP response headers. Report presence and key directive values. CSP is also parsed to extract the allowed script-src origins for cross-reference with observed third parties.

    Why. Missing security headers are not usually a policy mismatch, but they are published findings in their own right. Users with strong privacy postures often want to see them.
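Both halves of this check are small. A sketch, assuming headers arrive as a plain dict and the CSP value follows the standard semicolon-separated directive syntax:

```python
EXPECTED_HEADERS = [
    "strict-transport-security",
    "content-security-policy",
    "referrer-policy",
    "x-content-type-options",
    "permissions-policy",
]


def inspect_headers(headers):
    """Report each expected header's value, or None if absent (case-insensitive)."""
    lower = {k.lower(): v for k, v in headers.items()}
    return {name: lower.get(name) for name in EXPECTED_HEADERS}


def csp_script_src(csp_value):
    """Extract the script-src source list from a CSP header value."""
    for directive in csp_value.split(";"):
        parts = directive.split()
        if parts and parts[0].lower() == "script-src":
            return parts[1:]
    return []
```

The origins returned by `csp_script_src` are what gets cross-referenced against the third-party set from check 01.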

changelog


  1. 2026-04-14 Initial public methodology published.