Richard H. Scheuermann, Ph.D. Department of Pathology Division of Biomedical Informatics U.T. Southwestern Medical Center Standardizing Metadata Associated.

  • Published on
    23-Dec-2015

Transcript

  • Slide 1
  • Richard H. Scheuermann, Ph.D. Department of Pathology Division of Biomedical Informatics U.T. Southwestern Medical Center Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers N01AI2008038 N01AI40041
  • Slide 2
  • Richard H. Scheuermann, Ph.D. Director of Informatics J. Craig Venter Institute Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers N01AI2008038 N01AI40041
  • Slide 3
  • Genome Sequencing Centers for Infectious Disease (GSCID) Bioinformatics Resource Centers (BRC) www.viprbrc.org www.fludb.org
  • Slide 4
  • High Throughput Sequencing. Enabling technology: epidemiology of outbreaks; pathogen evolution; host range restriction; genetic determinants of virulence and pathogenicity. Metadata requirements: temporal-spatial information about isolates; selective pressures; host species of specimen source; disease severity and clinical manifestations.
  • Slide 5
  • Metadata Submission Spreadsheets [figure with numbered callouts 1 to 4]
  • Slide 6
  • Complex Query Interface
  • Slide 7
  • Metadata Inconsistencies: each project was providing different types of metadata; no consistent nomenclature being used; impossible to perform reliable comparative genomics analysis; required extensive custom bioinformatics system development.
  • Slide 8
  • GSC-BRC Metadata Standards Working Group. NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs. Develop metadata standards for pathogen isolate sequencing projects; bottom up approach; assemble into a semantic framework.
  • Slide 9
  • GSC-BRC Metadata Working Groups
  • Slide 10
  • Metadata Standards Process
      • Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
      • Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
      • Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
      • For each data field, provide a common set of attributes, including definitions, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, etc.
      • Merge subgroup core elements into a common set of core metadata fields and attributes
      • Assemble set of pathogen-specific and project-specific metadata fields to be used in conjunction with core fields
      • Compare, harmonize, map to other relevant initiatives, including OBI, MIGS, MIxS, BioProjects, BioSamples (ongoing)
      • Assemble all metadata fields into a semantic network (ongoing)
      • Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)
      • Draft data submission spreadsheets to be used for all white paper and BRC-associated projects
      • Finalize version 1.0 metadata standard and version 1.0 data submission spreadsheet
      • Beta test version 1.0 standard with new white paper projects, collecting feedback
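The per-field attributes named in the process above (definition, synonyms, allowed value set, expected syntax) can be sketched as a small data structure with a validator. This is only an illustration under stated assumptions: the class `FieldSpec`, the function `validate_row`, and the two example fields with their vocabularies and patterns are hypothetical and are not taken from the working group's standard or spreadsheets.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSpec:
    """Hypothetical description of one metadata field and its attributes."""
    name: str
    definition: str
    synonyms: tuple = ()
    allowed_values: Optional[frozenset] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None                # regular expression, if any

    def validate(self, value: str) -> list:
        """Return a list of problems found for one submitted value."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: '{value}' not in allowed value set")
        if self.syntax is not None and not re.fullmatch(self.syntax, value):
            problems.append(f"{self.name}: '{value}' does not match expected syntax")
        return problems


# Illustrative fields only; names, vocabularies and patterns are invented.
SPECS = {
    "collection_date": FieldSpec(
        name="collection_date",
        definition="Date on which the specimen was collected",
        synonyms=("date collected",),
        syntax=r"\d{4}-\d{2}-\d{2}",
    ),
    "specimen_type": FieldSpec(
        name="specimen_type",
        definition="Kind of material sampled",
        allowed_values=frozenset({"blood", "nasal swab", "soil"}),
    ),
}


def validate_row(row: dict) -> list:
    """Check one spreadsheet row (field name -> value) against every spec."""
    problems = []
    for name, spec in SPECS.items():
        if name not in row:
            problems.append(f"{name}: missing")
        else:
            problems.extend(spec.validate(row[name]))
    return problems
```

Calling `validate_row` on each row of a submission spreadsheet would return an empty list for a conforming row and one message per violated attribute otherwise.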
  • Slide 11
  • Data Fields: Core Project Core Sample Attributes
  • Slide 12
  • [Figure: semantic network of the core sample (CS) metadata fields. Node labels: organism, organism part, environmental material, environment, equipment, person, affiliation, name, ID, specimen X, specimen type, specimen isolation procedure X, specimen isolation procedure type, isolation protocol, specimen source role, specimen capture role, specimen collector role, temporal-spatial region, spatial region, temporal interval, geographic location, GPS location, date/time, hypothesis, IRB/IACUC approval, organism pathogenic disposition, gender, age, health status. Relation labels: has_input, has_output, plays, has_specification, has_part, has_quality, has_disposition, has_affiliation, has_authorization, denotes, located_in, instance_of, is_about. Process label: Specimen Isolation. Field tags: CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]
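A semantic network like the one on this slide can be held as subject-predicate-object triples. The sketch below is an assumption-laden illustration: the node and relation labels come from the slide, but which node connects to which, and in what direction, cannot be recovered from the transcript, so every edge in `TRIPLES` is a guess. The helper names `build_index` and `objects` are hypothetical.

```python
from collections import defaultdict

# Labels are taken from the slide; the specific edges and their directions
# are assumptions made for illustration only.
TRIPLES = [
    ("specimen isolation procedure X", "instance_of", "specimen isolation procedure type"),
    ("specimen isolation procedure X", "has_input", "organism"),
    ("specimen isolation procedure X", "has_output", "specimen X"),
    ("specimen isolation procedure X", "has_specification", "isolation protocol"),
    ("specimen isolation procedure X", "located_in", "temporal-spatial region"),
    ("person", "plays", "specimen collector role"),
    ("ID", "denotes", "specimen X"),
    ("organism", "has_quality", "age"),
]


def build_index(triples):
    """Index triples by (subject, predicate) for direct lookup."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].append(obj)
    return index


def objects(index, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return index.get((subject, predicate), [])


INDEX = build_index(TRIPLES)
```

With this index, `objects(INDEX, "specimen isolation procedure X", "has_output")` returns the specimen produced by the procedure, and an unknown subject or predicate returns an empty list.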
  • Slide 13
  • Metadata Processes [Figure: chain of processes within an Investigation. Process stages: Specimen Isolation, Material Processing, Sequencing Assay, Data Processing, Quality Assessment. Node labels: specimen source (organism or environmental), specimen collector, specimen, specimen isolation process, isolation protocol, sample processing, input sample, reagents, technician, equipment, microorganism enriched NA sample, microorganism genomic NA, sequencing assay, primary data, data transformations (image processing, assembly; variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, quality assessment assay, data archiving process, sequence data record, GenBank ID, type, ID, qualities, temporal-spatial region. Relation labels: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of.]
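The has_input / has_output relations on this slide link each process to the material or data it consumes and produces, which makes it possible to trace a sequence record back to its specimen source. The sketch below assumes one simplified linear chain; the process and entity names are from the slide, but the exact pairing of inputs and outputs in `PROCESSES` is an assumption, and `trace_back` is a hypothetical helper.

```python
# Each entry is (process, has_input, has_output). The pairing of inputs and
# outputs is an assumed, simplified linear chain, not the slide's full graph.
PROCESSES = [
    ("specimen isolation process", "specimen source", "specimen"),
    ("sample processing", "specimen", "microorganism genomic NA"),
    ("sequencing assay", "microorganism genomic NA", "primary data"),
    ("data transformations", "primary data", "sequence data"),
    ("data archiving process", "sequence data", "sequence data record"),
]


def trace_back(entity, processes=PROCESSES):
    """Follow has_output -> has_input links backwards from `entity`.

    Returns a list of (process, input) pairs, nearest step first.
    """
    producer = {out: (proc, inp) for proc, inp, out in processes}
    steps = []
    seen = set()  # guards against cycles in malformed input
    while entity in producer and entity not in seen:
        seen.add(entity)
        proc, inp = producer[entity]
        steps.append((proc, inp))
        entity = inp
    return steps
```

Calling `trace_back("sequence data record")` walks the five steps in reverse and ends at the specimen isolation process and its specimen source; an entity that no process produces yields an empty list.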
  • Slide 14
  • Outcome of Metadata Standards WG: consistent metadata captured across GSCID; guidance to collaborators regarding metadata expectations for sequencing and analysis services; support for more standardized BRC interface development; harmonization with related stakeholders (Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioSample); represented in the context of an extensible semantic framework.
  • Slide 15
  • Conclusions. Metadata standards for microorganism sequencing projects: bottom up approach focuses standard on important features; harmonizing with related standards from the Genomic Standards Consortium, OBO Foundry and NCBI; being beta-tested by GSCIDs for adoption by all NIAID-sponsored sequencing projects. Utility of semantic representation: identified gaps in data field list (e.g. temporal components); includes logical structure for other, project-specific, data fields (extensible); identified gaps in ontology data standards (use case-driven standard development); identified commonalities in data structures (reusable); support for semantic queries and inferential analysis in future. Ontology-based framework is extensible: sequencing => omics.