The 32 OSINT workers Tracelight runs, and what each one finds
Comprehensive list of the OSINT data sources Tracelight queries on every subject, what each one returns, and the licensing posture. Reference for buyers + integrators.
Tracelight runs 32 OSINT workers in parallel against every subject. People ask what they are. Below is the full list, grouped by signal type, with what each returns + the licensing/access posture.
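For integrators wondering what "in parallel" means in practice, here is a minimal sketch of a concurrent fan-out. It is not Tracelight's actual implementation: the `run_workers` function, the worker signature, and the three stand-in workers are all hypothetical, chosen only to show that one slow or failing source need not hold up the rest.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical worker signature: takes a subject string, returns a dict of findings.
Worker = Callable[[str], Awaitable[dict]]


async def run_workers(subject: str, workers: Dict[str, Worker], timeout: float = 10.0) -> Dict[str, dict]:
    """Run every worker concurrently and collect a per-worker status."""

    async def guarded(name: str, worker: Worker):
        # Each worker gets its own timeout and error handling, so a single
        # bad source is reported as such instead of failing the whole run.
        try:
            data = await asyncio.wait_for(worker(subject), timeout=timeout)
            return name, {"status": "ok", "data": data}
        except asyncio.TimeoutError:
            return name, {"status": "timeout", "data": None}
        except Exception as exc:
            return name, {"status": "error", "data": str(exc)}

    pairs = await asyncio.gather(*(guarded(n, w) for n, w in workers.items()))
    return dict(pairs)


# Stand-in workers (illustrative only; no real API is called).
async def fake_breach_worker(subject: str) -> dict:
    await asyncio.sleep(0.01)
    return {"breaches": 2}


async def fake_failing_worker(subject: str) -> dict:
    raise RuntimeError("upstream unavailable")


async def fake_slow_worker(subject: str) -> dict:
    await asyncio.sleep(1.0)
    return {}


results = asyncio.run(
    run_workers(
        "alice@example.com",
        {"breach": fake_breach_worker, "failing": fake_failing_worker, "slow": fake_slow_worker},
        timeout=0.2,
    )
)
for name, outcome in results.items():
    print(name, outcome["status"])
```

With this shape, total run time is bounded by the slowest worker (or the timeout) rather than by the sum of all workers.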
Identity + breach
**HIBP (Have I Been Pwned)** — email-to-breach lookup. Free for our use, paid commercial tier for higher rate limits.
**Dehashed** — broader breach corpus including stealer logs + paste-site dumps. Paid subscription required; cost passed through transparently.
**BreachDirectory** — secondary breach aggregation source. Useful as a confirmation layer when HIBP returns nothing.
**IntelX** — dark-web + paste-site search. Paid tier required for full corpus access.
**EmailRep** — reputation scoring on an email address (deliverability, age, spam reputation, breach exposure summary).
**Hunter.io** — email-to-domain mapping + email validation + email-format inference.
Cross-platform identity
**Username probe (Sherlock-style)** — checks 30+ social platforms for a given username. Quality-of-life improvements over base Sherlock: timeout handling, retry logic, content sanity check (does the returned page actually look like a profile or a 404?).
**GitHub key fingerprinting** — pulls SSH + GPG public keys from a GitHub username for cryptographic identity correlation.
**Gravatar hash lookup** — links email to Gravatar profile + linked services.
**Wayback Machine** — historical snapshots of profiles + websites for change tracking.
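The Gravatar and GitHub key lookups above both reduce to deterministic hashing, which can be sketched as follows. This is an illustration under stated assumptions, not the worker code: the helper names are hypothetical, the Gravatar hash shown is the long-standing MD5 of the trimmed, lowercased email, and the GitHub URLs are the public `.keys` and `.gpg` endpoints. The sample key blob is made up so the block runs offline.

```python
import base64
import hashlib


def gravatar_hash(email: str) -> str:
    """MD5 of the trimmed, lowercased email, as used in Gravatar URLs."""
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def gravatar_avatar_url(email: str) -> str:
    # d=404 asks Gravatar to return HTTP 404 when no image is registered,
    # which makes "does this email have a Gravatar?" a simple status check.
    return f"https://www.gravatar.com/avatar/{gravatar_hash(email)}?d=404"


def github_key_urls(username: str) -> dict:
    """Public endpoints that list a GitHub user's SSH and GPG public keys."""
    return {
        "ssh": f"https://github.com/{username}.keys",
        "gpg": f"https://github.com/{username}.gpg",
    }


def ssh_fingerprint(pubkey_line: str) -> str:
    """OpenSSH-style SHA256 fingerprint of one 'type base64-blob [comment]' line."""
    blob = base64.b64decode(pubkey_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


# Made-up key blob for illustration; a real line would come from the .keys endpoint.
sample_line = "ssh-ed25519 " + base64.b64encode(b"not-a-real-key-blob").decode("ascii")

print(gravatar_avatar_url("  Alice@Example.com "))
print(github_key_urls("octocat"))
print(ssh_fingerprint(sample_line))
```

The same fingerprint turning up under two different usernames is the kind of correlation the key fingerprinting worker is looking for.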
Sanctions + adverse intel
**OFAC SDN + sectoral sanctions lists** — primary US sanctions check.
**EU consolidated sanctions list** — EU-wide sanctions data.
**UK HMT consolidated sanctions list** — UK sanctions data.
**UN Security Council sanctions list** — international sanctions data.
**PEP database (politically-exposed persons)** — international PEP scan.
**Adverse media corpus** — 80,000+ news source scan for negative coverage.
Court records + corporate
**PACER** — US federal court filings.
**State court systems** — varies by state; coverage where publicly indexed.
**OpenCorporates** — 130+ jurisdictions' corporate registry data.
**Companies House (UK)** — UK corporate registry + officer records.
**SEC EDGAR** — US securities filings.
**Bankruptcy filings (national)** — US bankruptcy case index.
Infrastructure + cyber
**Shodan** — IP + service intelligence (open ports, banners, CVEs visible from outside).
**AbuseIPDB** — IP abuse score + complaint history.
**VirusTotal** — domain + IP reputation, hash lookups for evidence files.
**WHOIS** — domain registration data.
**DNS records** — current + historical DNS for domain investigations.
**SSL certificate transparency** — historical SSL certs issued for a domain (subdomain enumeration).
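To illustrate how certificate transparency data turns into a subdomain list, here is a small sketch. It assumes each log entry is a dict with a `name_value` field holding one or more newline-separated hostnames, which is the shape some public CT search services return as JSON; the function name and sample entries are hypothetical, and no network call is made.

```python
from typing import Iterable, List


def subdomains_from_ct(entries: Iterable[dict], domain: str) -> List[str]:
    """Collect unique hostnames under `domain` from CT log entries."""
    domain = domain.lower().rstrip(".")
    found = set()
    for entry in entries:
        # One certificate can cover several names, one per line.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]  # treat a wildcard cert as evidence of its base name
            if name == domain or name.endswith("." + domain):
                found.add(name)
    return sorted(found)


# Hypothetical sample entries for illustration.
sample_entries = [
    {"name_value": "www.example.com\n*.dev.example.com"},
    {"name_value": "mail.example.com"},
    {"name_value": "WWW.example.com"},  # duplicate after lowercasing
    {"name_value": "example.org"},      # different domain, ignored
]

print(subdomains_from_ct(sample_entries, "example.com"))
```

Matching on `"." + domain` rather than a bare suffix avoids false positives such as `notexample.com`.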
Content + image
**Google Vision OCR** — extract text from screenshot evidence + photo metadata.
**Reverse image search** — find other instances of an image (catches stock photos in fraud cases).
**EXIF metadata extraction** — date / camera / GPS / device from photos.
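EXIF stores GPS coordinates as degrees, minutes, and seconds plus a hemisphere reference (N/S/E/W), so a conversion step is needed before a location can be plotted. The sketch below shows that arithmetic only; the function name is hypothetical, and reading the tags from an image file is left out.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert an EXIF-style degrees/minutes/seconds value to signed decimal degrees."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # South latitudes and west longitudes are negative in decimal notation.
    return -value if ref.upper() in ("S", "W") else value


# Example values: 40 deg 26' 46" N, 79 deg 58' 56" W
lat = dms_to_decimal(40, 26, 46, "N")
lon = dms_to_decimal(79, 58, 56, "W")
print(round(lat, 4), round(lon, 4))
```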
Activity + presence
**Activity heatmap aggregation** — combines signals across sources to build the patterns-of-life view.
**Recurring identifier detection** — cross-case correlation across the workspace's entire dataset.
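Both workers in this group are aggregations over data that the other workers have already collected. A minimal sketch of each idea follows, assuming timezone-aware timestamps and a simple mapping of case ID to identifiers; the function names and data shapes are hypothetical and do not describe Tracelight's internal schema.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Set


def activity_heatmap(timestamps: Iterable[datetime]) -> Counter:
    """Count events per (weekday, hour) bucket in UTC; Monday is weekday 0."""
    counts = Counter()
    for ts in timestamps:
        utc = ts.astimezone(timezone.utc)  # expects timezone-aware datetimes
        counts[(utc.weekday(), utc.hour)] += 1
    return counts


def recurring_identifiers(cases: Dict[str, Set[str]], min_cases: int = 2) -> Dict[str, List[str]]:
    """Return identifiers that appear in at least `min_cases` distinct cases."""
    seen_in = defaultdict(set)
    for case_id, identifiers in cases.items():
        for ident in identifiers:
            seen_in[ident.strip().lower()].add(case_id)
    return {i: sorted(c) for i, c in seen_in.items() if len(c) >= min_cases}


# Illustrative data only.
events = [
    datetime(2024, 1, 1, 9, 15, tzinfo=timezone.utc),   # Monday
    datetime(2024, 1, 1, 9, 45, tzinfo=timezone.utc),   # Monday
    datetime(2024, 1, 2, 22, 5, tzinfo=timezone.utc),   # Tuesday
]
heatmap = activity_heatmap(events)

cases = {
    "case-001": {"alice@example.com", "nightowl42"},
    "case-002": {"NightOwl42", "bob@example.com"},
    "case-003": {"carol@example.com"},
}
recurring = recurring_identifiers(cases)

print(heatmap.most_common(1))
print(recurring)
```

Normalizing identifiers before comparison matters here: without the lowercase step, `nightowl42` and `NightOwl42` would be counted as two unrelated handles.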
Why 32, not more
A 33rd source adds far less marginal value than any of the first 5. Those first 5 (HIBP, OFAC, Hunter, OpenCorporates, and the Sherlock-style probe) get you 80% of the answer for any reasonable subject. The next 27 fill in long-tail signal: court records, dark-web exposure, infrastructure, image metadata.
Could we run 100 sources? Sure. The cost would be more API spend, more rate-limit management, more maintenance burden, and slower per-subject runs. The marginal evidence value at source 50 is approximately zero for the median investigative use case.
If your use case needs a source that isn't on this list, tell us at product@trytracelight.com. A Worker SDK is on the roadmap for the cases where the right answer is "you should run this in your own workspace."
