Methodology
Nullary aggregates negative results in drug discovery from public databases, normalizes them into a unified schema, and exposes them via MCP server and REST API with full citation provenance. This page describes how the pipeline works.
TLDR
- Five-stage pipeline: source ingestion → triage → extraction → verification → normalization
- Public-database records use deterministic extraction from structured sources (confidence = 1.0)
- Enterprise tier records use LLM extraction followed by adversarial verification with Claude Opus
- Every record carries DOI, source URL, license, extraction confidence, and verification status
- Retracted papers are tracked via Retraction Watch and flagged in API responses
- Limitations disclosed: proprietary pharma data is not covered; full-text supplementary materials and patents are gaps the Enterprise tier addresses
Validated, including the negative results
Two technical reports validate the methodology and analytics described below: a bioactivity paper (DOI 10.5281/zenodo.20370264) reporting 0.775 ROC-AUC under a temporal split, and an Analytics Suite paper reporting tractability validated against Open Targets' independent structural druggability at AUC 0.78. Both papers also report their negative findings: the decoy advantage shrinks from +0.031 (scaffold split) to +0.014 (temporal split), and cross-modality combination adds no predictive value over single-modality features.
Source ingestion
Each public database is ingested by a dedicated worker on a scheduled cadence. Workers fetch via bulk download or API, depending on what the source supports. Every ingestion is idempotent — re-running doesn't duplicate records. Content-hash deduplication ensures the same paper appearing in multiple sources is matched and clustered, not duplicated. Refresh schedules range from daily (ClinicalTrials.gov) to quarterly (ChEMBL bulk releases).
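The idempotency and deduplication behavior described above can be sketched as follows. This is a minimal illustration, not Nullary's actual worker code: the `content_hash` and `ingest` functions, the in-memory `store`, the choice of hashed fields, and the example DOI are all assumptions made for the example.

```python
import hashlib
import json


def content_hash(record: dict) -> str:
    """Stable SHA-256 over a record's content (which fields to hash is an assumption)."""
    # Sorting keys makes the hash independent of dict insertion order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(store: dict, source: str, records: list[dict]) -> int:
    """Insert records keyed by content hash; return how many were new."""
    added = 0
    for record in records:
        key = content_hash(record)
        if key not in store:
            store[key] = {"record": record, "sources": set()}
            added += 1
        # Same content seen again or from another source: cluster it, do not duplicate it.
        store[key]["sources"].add(source)
    return added


# Hypothetical usage with a made-up DOI and an illustrative UniProt accession.
store = {}
batch = [{"doi": "10.1000/example", "target": "P00533", "outcome": "inactive"}]
first_run = ingest(store, "chembl", batch)    # adds the record
second_run = ingest(store, "chembl", batch)   # re-run adds nothing
third_run = ingest(store, "pubchem", batch)   # clusters under the same key
```

Keying on a hash of the canonicalized content, rather than on a per-source identifier, is what lets the same finding from two databases land in one cluster.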
Normalization
Every source has its own schema. Nullary maps all of them into a unified findings table with consistent fields: target identifier (UniProt-canonical), compound identifier (InChI key when applicable), assay context, outcome, measurement value, and provenance.
For each modality, additional fields capture modality-specific data — guide RNA sequence for CRISPR, antibody sequences for antibodies, PROTAC components for degraders, and so on. The schema is designed to support 7 current modalities and 4 deferred modalities without breaking changes.
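One way to picture a unified record with modality-specific extensions is the sketch below. The class name `Finding`, the attribute names, the modality labels, and the placeholder guide sequence are assumptions for illustration; the actual column names in the findings table are not given on this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    # Core fields follow the list in the text; exact names are assumed.
    target_uniprot: str                       # UniProt-canonical target identifier
    assay_context: str
    outcome: str
    provenance: dict
    compound_inchikey: Optional[str] = None   # InChI key, only when applicable
    measurement_value: Optional[float] = None
    modality: str = "small_molecule"          # hypothetical modality label
    # Modality-specific data sits in an open mapping, so a new modality
    # can be added without changing the core fields.
    modality_fields: dict = field(default_factory=dict)


# Hypothetical CRISPR record; the guide sequence is a placeholder, not real data.
crispr = Finding(
    target_uniprot="P04637",
    assay_context="knockout viability screen",
    outcome="no_effect",
    provenance={"source_type": "DepMap"},
    modality="crispr",
    modality_fields={"guide_rna_sequence": "ACGTACGTACGTACGTACGT"},
)
```

Keeping modality-specific data outside the core fields is one plausible way to meet the "no breaking changes" goal; the real schema may instead use per-modality tables or typed columns.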
Provenance
Every record carries these provenance fields:
- source_doi: DOI of the originating publication, when available
- source_pmid: PubMed identifier, when available
- source_url: direct URL to the source database record
- source_type: which source database (ChEMBL, DepMap, etc.)
- source_license: license of the source data
- source_attribution_required: whether attribution is required for redistribution
- extraction_confidence: confidence score from 0 to 1
- verification_status: auto_published, human_reviewed, pending, or rejected
A query result without complete provenance is treated as a pipeline bug, not a feature.
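A completeness check along these lines could be run on every record before it is returned. This is a sketch under assumptions: the function name `provenance_problems` is hypothetical, and `source_doi` and `source_pmid` are treated as optional because the text marks them "when available".

```python
# Fields the text lists without a "when available" qualifier.
REQUIRED = (
    "source_url", "source_type", "source_license",
    "source_attribution_required", "extraction_confidence", "verification_status",
)
STATUSES = {"auto_published", "human_reviewed", "pending", "rejected"}


def provenance_problems(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means provenance is complete."""
    # `is None` (not falsiness) so that source_attribution_required=False still counts as present.
    problems = [f"missing {name}" for name in REQUIRED if record.get(name) is None]
    confidence = record.get("extraction_confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        problems.append("extraction_confidence outside [0, 1]")
    status = record.get("verification_status")
    if status is not None and status not in STATUSES:
        problems.append(f"unknown verification_status: {status}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log every defect in a record when triaging a pipeline bug.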
Confidence scoring
extraction_confidence reflects how the record was produced:
- 1.0: Deterministic extraction from structured sources (ChEMBL, PubChem, AACT, etc.). The source database itself provides the data in structured form; Nullary normalizes it without LLM interpretation.
- 0.85-0.95: LLM extraction from semi-structured sources (specific data tables in PDFs) with verification.
- 0.70-0.85: LLM extraction from supplementary materials with verification.
- Below 0.70: Not currently surfaced. Queued for re-extraction with improved prompts.
Records from public databases are all confidence 1.0 (structured sources only). The Enterprise tier introduces LLM-extracted records with the lower confidence tiers.
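The 0.70 cutoff can be illustrated with a small partitioning function. The name `partition_by_confidence` and the record layout are assumptions; only the threshold and the `extraction_confidence` field come from the text.

```python
SURFACE_THRESHOLD = 0.70  # from the text: records below 0.70 are not surfaced


def partition_by_confidence(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (surfaced, requeue) by extraction_confidence."""
    surfaced, requeue = [], []
    for record in records:
        if record["extraction_confidence"] >= SURFACE_THRESHOLD:
            surfaced.append(record)
        else:
            # Below the threshold: held back and queued for re-extraction.
            requeue.append(record)
    return surfaced, requeue
```

A record at exactly 0.70 is surfaced here, since the text only excludes scores below 0.70.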
Verification
For LLM-extracted records (Enterprise tier), each candidate finding is verified by an adversarial second pass using Anthropic's Opus model. The verifier is prompted to disqualify the finding by looking for misinterpretations, retracted sources, sub-therapeutic concentrations, or alternative interpretations. Records that survive verification carry verification_status='auto_published'. Records flagged for human review go to verification_status='pending' and are reviewed by a domain specialist before publication.
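The routing from verifier outcome to status might look like the sketch below. The verdict labels are hypothetical, and mapping a disqualified finding to 'rejected' is an assumption: the text names the 'rejected' status but does not state when it is assigned.

```python
# Hypothetical verifier verdicts mapped onto the documented status values.
VERDICT_TO_STATUS = {
    "survived": "auto_published",
    "needs_review": "pending",     # reviewed by a domain specialist before publication
    "disqualified": "rejected",    # assumption: not stated explicitly in the text
}


def status_after_verification(verdict: str) -> str:
    """Translate a verifier verdict into a verification_status value."""
    if verdict not in VERDICT_TO_STATUS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return VERDICT_TO_STATUS[verdict]
```

Failing loudly on an unrecognized verdict keeps a malformed verifier response from silently publishing a record.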
Retraction tracking
Nullary monitors Retraction Watch and Crossref for new retractions. When a source paper is retracted, every record derived from it is marked source_retracted=true, and the retraction date and reason are recorded. Retracted records remain queryable but are clearly flagged in API responses.
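Flagging derived records could work roughly as follows. This is a sketch, not the production job: the function `flag_retractions`, the shape of the `retractions` lookup, the `retraction_date` and `retraction_reason` field names, and the example DOIs are all assumptions. Only `source_retracted` and `source_doi` are named in the text.

```python
from datetime import date


def flag_retractions(records: list[dict], retractions: dict[str, dict]) -> int:
    """Mark records whose source_doi is in `retractions`; return the number flagged.

    `retractions` maps a lowercased DOI to {"date": ..., "reason": ...}.
    """
    flagged = 0
    for record in records:
        # DOIs are case-insensitive, so compare in lowercase.
        doi = (record.get("source_doi") or "").lower()
        notice = retractions.get(doi)
        if notice is None:
            continue
        record["source_retracted"] = True
        record["retraction_date"] = notice["date"]      # hypothetical field name
        record["retraction_reason"] = notice["reason"]  # hypothetical field name
        flagged += 1
    return flagged


# Hypothetical usage with made-up DOIs.
records = [
    {"id": 1, "source_doi": "10.1000/ABC"},
    {"id": 2, "source_doi": None},
    {"id": 3, "source_doi": "10.1000/xyz"},
]
retractions = {"10.1000/abc": {"date": date(2025, 1, 15), "reason": "data fabrication"}}
flagged_count = flag_retractions(records, retractions)
```

Records are flagged in place and never removed, which matches the requirement that retracted records stay queryable.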
Data freshness
Per-source refresh cadences are listed on the coverage page. Each source's last successful refresh timestamp is shown in the coverage page header and in API response metadata.
Limitations
What Nullary's public-database coverage does not include:
- Proprietary internal failure data from pharma companies
- Failure data from unpublished supplementary materials (the Enterprise tier extracts these)
- Failure reasons from FDA Complete Response Letters and EMA Assessment Reports (Enterprise tier deep curation)
- Failed compounds disclosed only in patents (Enterprise tier patent extraction)
- Conference abstracts with copyright restrictions on text redistribution (Enterprise tier, fact-only extraction architecture)
These gaps are honestly disclosed because pretending coverage is complete when it isn't destroys trust faster than any other failure mode.
Further reading
- Real Measured Negatives as a Substrate for Calibrated, Target-Specific Bioactivity Prediction — methodology and limitations of the prediction API and per-target model registry. DOI 10.5281/zenodo.20370264.
- The Nullary Analytics Suite: Coverage, Tractability, and Cross-Modality Analysis — methods and validation for Premium tier analytics.
Reporting issues
If you find a record you believe is wrong, retracted but not flagged, or misclassified, email team@nullary.ai with the record ID. Reviewer notes are added to the record and audit trail.
How to cite Nullary
Recommended citation: Nullary. (2026). Nullary: Negative results intelligence for drug discovery (Version 1.0) [Software]. https://nullary.ai
@misc{nullary2026,
title = {Nullary: Negative results intelligence for drug discovery},
author = {The Nullary Team},
year = {2026},
url = {https://nullary.ai},
note = {MCP server and REST API; Version 1.0}
}
When citing specific findings: each Nullary record carries source_doi and/or source_pmid for the original publication. Cite the source publication directly (per the source database's license terms). Optionally also cite Nullary as the aggregation layer: “Negative results aggregated via Nullary (https://nullary.ai) on [date], with underlying sources cited per record.”