Methodology

Nullary aggregates negative results in drug discovery from public databases, normalizes them into a unified schema, and exposes them through an MCP server and a REST API with full citation provenance. This page describes how the pipeline works.

TLDR

Five-stage pipeline: source ingestion → triage → extraction → verification → normalization
Public-database records use deterministic extraction from structured sources (confidence = 1.0)
Enterprise tier records use LLM extraction followed by adversarial verification with Claude Opus
Every record carries DOI, source URL, license, extraction confidence, and verification status
Retracted papers are tracked via Retraction Watch and flagged in API responses
Limitations are disclosed up front: proprietary pharma data, full-text supplementary materials, and patents are gaps in public-database coverage that the Enterprise tier addresses

Validated, including the negative results

Two technical reports validate the methodology and analytics described below: a bioactivity paper (DOI 10.5281/zenodo.20370264) reporting 0.775 ROC-AUC under a temporal split, and an Analytics suite paper reporting tractability validated against Open Targets' independent structural druggability at AUC 0.78. Both papers also report their own negative findings: the decoy advantage shrinks from +0.031 (scaffold split) to +0.014 (temporal split), and combining modalities adds no predictive value over single-modality features.

Source ingestion

Each public database is ingested by a dedicated worker on a scheduled cadence. Workers fetch via bulk download or API, depending on what the source supports. Every ingestion is idempotent — re-running doesn't duplicate records. Content-hash deduplication ensures the same paper appearing in multiple sources is matched and clustered, not duplicated. Refresh schedules range from daily (ClinicalTrials.gov) to quarterly (ChEMBL bulk releases).
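As a rough illustration of the idempotency and content-hash ideas above, the sketch below keys each record by a hash of its canonical JSON form, so re-running an ingestion adds nothing new. The function names, the in-memory store, and the example record (including its placeholder DOI) are assumptions made for illustration, not Nullary's actual implementation.

```python
import hashlib
import json


def content_hash(record: dict) -> str:
    """Return a stable SHA-256 digest of a record's canonical JSON form."""
    # sort_keys makes the digest independent of key order in the input dict.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(store: dict, records: list[dict]) -> int:
    """Insert records keyed by content hash; return how many were new."""
    added = 0
    for record in records:
        key = content_hash(record)
        if key not in store:  # an identical record is skipped, not duplicated
            store[key] = record
            added += 1
    return added


# Hypothetical record; the DOI is a placeholder, not a real publication.
batch = [{"source_doi": "10.1000/example", "target": "P00533", "outcome": "inactive"}]
store: dict = {}
first_run = ingest(store, batch)
second_run = ingest(store, batch)  # re-running the same batch adds nothing
```

In this sketch only byte-identical content is matched. Clustering the same paper across sources with different schemas would need hashing over normalized, source-independent fields, which is not shown here.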

Normalization

Every source has its own schema. Nullary maps all of them into a unified findings table with consistent fields: target identifier (UniProt-canonical), compound identifier (InChI key when applicable), assay context, outcome, measurement value, and provenance.

For each modality, additional fields capture modality-specific data — guide RNA sequence for CRISPR, antibody sequences for antibodies, PROTAC components for degraders, and so on. The schema is designed to support 7 current modalities and 4 deferred modalities without breaking changes.
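One way to picture a unified record with modality-specific extensions is a fixed set of core fields plus an open mapping for per-modality data. The class name, field names, and example values below are assumptions for illustration; the actual Nullary schema may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    """Hypothetical unified finding record (field names are illustrative)."""
    target_id: str                             # UniProt-canonical identifier
    assay_context: str
    outcome: str
    modality: str
    compound_id: Optional[str] = None          # InChI key, when applicable
    measurement_value: Optional[float] = None
    provenance: dict = field(default_factory=dict)
    # Modality-specific data lives in an open mapping, so adding a new
    # modality does not change the core fields.
    modality_fields: dict = field(default_factory=dict)


# A CRISPR finding has no compound, but carries a guide RNA sequence.
crispr_finding = Finding(
    target_id="P04637",
    assay_context="viability screen",
    outcome="no_effect",
    modality="crispr",
    modality_fields={"guide_rna_sequence": "<guide sequence>"},
)
```

Keeping modality data in a separate mapping is one design that would allow new modalities without breaking changes; dedicated per-modality tables would be another.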

Provenance

Every record carries these provenance fields:

source_doi — DOI of the originating publication, when available
source_pmid — PubMed identifier, when available
source_url — direct URL to the source database record
source_type — which source database (ChEMBL, DepMap, etc.)
source_license — license of the source data
source_attribution_required — whether attribution is required for redistribution
extraction_confidence — confidence score from 0 to 1
verification_status — auto_published, human_reviewed, pending, or rejected

A query result without complete provenance is treated as a pipeline bug, not a feature.
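A client could enforce the same rule on its side with a small check like the one below. It uses the field names listed above and assumes that source_doi and source_pmid may be absent (both are described as "when available") while the other six fields are always required. The function name is hypothetical.

```python
REQUIRED_FIELDS = (
    "source_url",
    "source_type",
    "source_license",
    "source_attribution_required",
    "extraction_confidence",
    "verification_status",
)
VALID_STATUSES = {"auto_published", "human_reviewed", "pending", "rejected"}


def provenance_problems(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means complete provenance."""
    # Compare against None so that a legitimate False value still counts as present.
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if record.get(name) is None]
    confidence = record.get("extraction_confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        problems.append("extraction_confidence out of range")
    status = record.get("verification_status")
    if status is not None and status not in VALID_STATUSES:
        problems.append("unknown verification_status")
    return problems


# Hypothetical record with illustrative values.
example_record = {
    "source_url": "https://example.org/record/1",
    "source_type": "ChEMBL",
    "source_license": "CC BY-SA 3.0",
    "source_attribution_required": False,
    "extraction_confidence": 1.0,
    "verification_status": "auto_published",
}
example_problems = provenance_problems(example_record)
```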

Confidence scoring

extraction_confidence reflects how the record was produced:

1.0: Deterministic extraction from structured sources (ChEMBL, PubChem, AACT, etc.). The source database itself provides the data in structured form; Nullary normalizes it without LLM interpretation.
0.85-0.95: LLM extraction from semi-structured sources (specific data tables in PDFs) with verification.
0.70-0.85: LLM extraction from supplementary materials with verification.
Below 0.70: Not currently surfaced. Queued for re-extraction with improved prompts.

Records from public databases are all confidence 1.0 (structured sources only). The Enterprise tier introduces LLM-extracted records with the lower confidence tiers.
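The surfacing rule implied by these tiers can be sketched as a simple threshold split. The 0.70 cutoff comes from the tiers above; the function name and the plain-dict record shape are assumptions for illustration.

```python
SURFACE_THRESHOLD = 0.70  # records below this score are not surfaced


def split_by_confidence(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (surfaced, queued for re-extraction)."""
    surfaced, requeued = [], []
    for record in records:
        if record["extraction_confidence"] >= SURFACE_THRESHOLD:
            surfaced.append(record)
        else:
            requeued.append(record)
    return surfaced, requeued


# Hypothetical records: one deterministic, one at the cutoff, one below it.
candidates = [
    {"id": "a", "extraction_confidence": 1.0},
    {"id": "b", "extraction_confidence": 0.70},
    {"id": "c", "extraction_confidence": 0.69},
]
surfaced, requeued = split_by_confidence(candidates)
```

A record scoring exactly 0.70 is surfaced here, since only scores below 0.70 are described as withheld.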

Verification

For LLM-extracted records (Enterprise tier), each candidate finding is verified by an adversarial second pass using Anthropic's Claude Opus model. The verifier is prompted to try to disqualify the finding by looking for misinterpretations, retracted sources, sub-therapeutic concentrations, or alternative explanations. Records that survive verification carry verification_status='auto_published'. Records flagged for human review are set to verification_status='pending' and are reviewed by a domain specialist before publication.
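The routing between the two statuses can be sketched as follows. The verifier here is a stand-in callable that returns a list of objections; the real verifier is a model call whose prompt and output format are not described on this page. Treating any objection as grounds for human review is an assumption of this sketch, as are the function names and the verifier_objections field.

```python
from typing import Callable


def route_finding(finding: dict, verifier: Callable[[dict], list[str]]) -> dict:
    """Attach a verification_status based on the verifier's objections."""
    objections = verifier(finding)
    # Assumption: any objection sends the record to human review.
    status = "pending" if objections else "auto_published"
    return {**finding, "verification_status": status, "verifier_objections": objections}


def stub_verifier(finding: dict) -> list[str]:
    """Stand-in for the adversarial model pass; checks a single condition."""
    return ["source retracted"] if finding.get("source_retracted") else []


clean = route_finding({"id": "f1"}, stub_verifier)
flagged = route_finding({"id": "f2", "source_retracted": True}, stub_verifier)
```

Passing the verifier in as a callable keeps the routing logic testable without a model call.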

Retraction tracking

Nullary monitors Retraction Watch and Crossref for new retractions. When a source paper is retracted, every record derived from it is marked source_retracted=true, and the retraction date and reason are recorded. Retracted records remain queryable but are clearly flagged in API responses.
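A minimal sketch of the flagging step, assuming records are matched to a retraction notice by DOI. The source_retracted field comes from the text above; retraction_date, retraction_reason, the function name, and the placeholder DOIs are hypothetical.

```python
from datetime import date


def apply_retraction(records: list[dict], doi: str, retracted_on: date, reason: str) -> int:
    """Flag every record derived from the retracted DOI; return how many were flagged."""
    flagged = 0
    for record in records:
        source_doi = record.get("source_doi")
        # DOIs are case-insensitive, so compare in lower case.
        if source_doi is not None and source_doi.lower() == doi.lower():
            record["source_retracted"] = True
            record["retraction_date"] = retracted_on.isoformat()
            record["retraction_reason"] = reason
            flagged += 1
    return flagged


# Hypothetical records with placeholder DOIs.
records = [
    {"id": "r1", "source_doi": "10.1000/Example"},
    {"id": "r2", "source_doi": "10.1000/other"},
    {"id": "r3", "source_doi": None},
]
flagged_count = apply_retraction(records, "10.1000/example", date(2026, 1, 15), "data irregularities")
```

Records are updated in place rather than removed, so they stay queryable and carry the flag.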

Data freshness

Per-source refresh cadences are listed on the coverage page. Each source's last successful refresh timestamp is shown in the coverage page header and in API response metadata.

Limitations

What Nullary's public-database coverage does not include:

Proprietary internal failure data from pharma companies
Failure data from unpublished supplementary materials (the Enterprise tier extracts these)
Failure reasons from FDA Complete Response Letters and EMA Assessment Reports (Enterprise tier deep curation)
Failed compounds disclosed only in patents (Enterprise tier patent extraction)
Conference abstracts with copyright restrictions on text redistribution (Enterprise tier, fact-only extraction architecture)

These gaps are honestly disclosed because pretending coverage is complete when it isn't destroys trust faster than any other failure mode.

Reporting issues

If you find a record you believe is wrong, retracted but not flagged, or misclassified, email team@nullary.ai with the record ID. Reviewer notes are added to the record and audit trail.

How to cite Nullary

Recommended citation: Nullary. (2026). Nullary: Negative results intelligence for drug discovery (Version 1.0) [Software]. https://nullary.ai

BibTeX:

@misc{nullary2026,
  title  = {Nullary: Negative results intelligence for drug discovery},
  author = {The Nullary Team},
  year   = {2026},
  url    = {https://nullary.ai},
  note   = {MCP server and REST API; Version 1.0}
}

When citing specific findings: each Nullary record carries source_doi and/or source_pmid for the original publication. Cite the source publication directly (per the source database's license terms). Optionally also cite Nullary as the aggregation layer: “Negative results aggregated via Nullary (https://nullary.ai) on [date], with underlying sources cited per record.”