# Purpose

Community-facing documentation about quality control (QC) for the entire Monarch stack.

# Monarch components - QC and workflow

# Dipper: data ingestion/ETL pipeline (pre-QC)

Dipper ingests data from many different remote sources.

Ingestion of data from remote sources into the Monarch stack is configured as a Jenkins pipeline that runs Dipper on a regular basis and records output, error messages, and runtime failures. Failures and errors are addressed with fairly frequent updates to the Dipper codebase.

The bleeding-edge ingested and transformed data (which have not yet passed the QC process), along with metadata about the ingested data, are output as RDF.

# Monarch data releases (post-QC)

A critical aspect of data ingest is confirming that the ingested data are fit for purpose. With many heterogeneous sources updating their data, their data formats, their data servers, and their data licenses, keeping the ingestion running is a Sisyphean endeavor.

When hundreds of millions of statements are successfully produced, questions abound: What do they contain? How do they compare with previous releases? What is new? What went away? What caused these changes? Does the output conform to the intended model?
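As a minimal sketch of the "what is new, what went away" questions, two releases can be treated as sets of (subject, predicate, object) statements and compared with set differences. The function name `diff_statements` and the `EX:` identifiers are made up for illustration; this is not the code Monarch uses, and a real release would be far too large to hold as in-memory tuples like this.

```python
def diff_statements(previous, current):
    """Return (added, removed) statements between two releases.

    Each release is an iterable of (subject, predicate, object) tuples.
    """
    previous, current = set(previous), set(current)
    added = current - previous      # present now, absent before
    removed = previous - current    # present before, absent now
    return added, removed


# Illustrative toy data only; the EX: identifiers are hypothetical.
previous_release = [
    ("EX:gene1", "EX:associated_with", "EX:disease1"),
    ("EX:gene2", "EX:associated_with", "EX:disease2"),
]
current_release = [
    ("EX:gene1", "EX:associated_with", "EX:disease1"),
    ("EX:gene3", "EX:associated_with", "EX:disease2"),
]

added, removed = diff_statements(previous_release, current_release)
print("added:", sorted(added))
print("removed:", sorted(removed))
```

Answering "what caused these changes" still requires tracing each difference back to its source, which this sketch does not attempt.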

To confirm that data are fit for purpose, the Monarch team periodically performs an extensive QC process on the data output by Dipper.

This process includes the following steps:

  • data output from Dipper are loaded into a new SciGraph instance
  • the contents of the new SciGraph instance are compared with those of the existing production SciGraph instance
  • the results are manually inspected, and large discrepancies between the two SciGraph instances are investigated for possible issues (see here for an example diff result)
  • a "visual reduction" is performed to output a graphical representation of the new data, as well as what has changed from the previous release (example)
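The comparison and inspection steps above can be sketched as a per-category count comparison that flags large relative changes for manual review. This is an illustration under stated assumptions, not the actual Monarch tooling: the function `flag_discrepancies`, the 20% threshold, the category names, and the counts are all hypothetical, and the counts are assumed to have been obtained already from each SciGraph instance.

```python
def flag_discrepancies(old_counts, new_counts, threshold=0.2):
    """Return categories whose count changed by more than `threshold`.

    old_counts / new_counts map a category (for example a predicate or
    a source name) to a statement count. The result maps each flagged
    category to (old, new, relative_change).
    """
    flagged = {}
    for key in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(key, 0)
        new = new_counts.get(key, 0)
        if old == 0:
            # Category is new in this release (or empty in both).
            change = float("inf") if new > 0 else 0.0
        else:
            change = (new - old) / old
        if abs(change) > threshold:
            flagged[key] = (old, new, change)
    return flagged


# Hypothetical counts for the production and candidate instances.
production = {"a": 100, "b": 50, "c": 10}
candidate = {"a": 105, "b": 20, "d": 5}

flagged = flag_discrepancies(production, candidate)
for key, (old, new, change) in flagged.items():
    print(f"{key}: {old} -> {new} ({change:+.0%})")
```

A relative threshold keeps small sources from being drowned out by large ones, but the flagged items are only candidates: deciding whether a change reflects an upstream update or an ingest problem remains a manual step.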

When the data pass this QC process, they are published to the Monarch Data Archive as a new data release. Releases occur approximately once every few months and include:

  • RDF files for the data ingested from each source (example)
  • OWL files for each ontology ingested (example)
  • a detailed QC report with an itemized list of changes between the current and previous release (example)

# Contact

For questions, please contact info@monarchinitiative.org.