Workflows for Digital Curation and Preservation

  • Published on
    24-Feb-2016

DESCRIPTION

Workflows for Digital Curation and Preservation. Stacy Kowalczyk, PASIG Dublin 2012, October 17, 2012. Topics: Goals; A Very Brief Introduction to Workflow Systems; Components for Curation; Workflow Scenarios; Future Work.

Transcript

Workflows for Curation

Workflows for Digital Curation and Preservation
Stacy Kowalczyk
PASIG Dublin 2012
October 17, 2012

Topics
  • Goals
  • A Very Brief Introduction to Workflow Systems
  • Components for Curation
  • Workflow Scenarios
  • Future Work

Goals
  • Increase capacity and scalability of curation efforts
  • Develop distributed curation processes
  • Lower costs of curation activities
  • Improve quality with systematic and repeatable processes
  • Reduce human errors

Why Workflow Systems
  • Repetitive and mundane activities simplified
  • Facilitates and enforces best practices
  • Enables efficient scheduling
  • Machinery for coordinating the execution of services and linking together resources
  • Facilitates outreach to researchers for direct deposit and automatic curation

Types of Workflow Systems
  • Kepler
  • BPEL
  • Ptolemy II
  • Triana
  • Taverna

Trident
  • Open source project
  • Based on Microsoft's Windows Workflow Foundation classes
  • Supported by Microsoft Research and academic researchers
  • Integrates with myExperiment
  • Well accepted in the research community: well over 100 peer-reviewed and white papers were discovered from one scholarly aggregation service
  • Graphical workflow design and execution interface

Trident Workflow Components
  • Fixity
  • Data Integrity
  • Metadata Creation
  • Format Normalization and Derivative Generation
  • Persistent Identification
  • Repository Integration

Fixity Components
  • MD5 checksum generator
  • MD5 checksum validator
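
The two fixity components can be illustrated with a minimal Python sketch. This is not the Trident implementation, which the slides do not show; the function names `md5_checksum` and `validate_checksum` are hypothetical, and only the standard `hashlib` module is assumed.

```python
import hashlib


def md5_checksum(path, chunk_size=1 << 20):
    """Generator role: return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_checksum(path, expected_hex):
    """Validator role: True when the file's current MD5 matches the stored value."""
    return md5_checksum(path) == expected_hex.strip().lower()
```

In a workflow, the generator would typically run once at ingest and store its result, and the validator would rerun later against that stored value.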

Data Integrity Components
  • JHOVE for format verification and validation
  • Group validation (for object integrity)
  • ImageMagick

Metadata Creation Components
  • MIX data generator and validator
  • METS data generator and validator
  • Dublin Core and MODS as well

Format Components
Format conversions for normalization and derivative generation
  • .xlsx to .csv
  • .docx to .pdf
  • .ppt to .pdf
  • .tif to .jpg
  • Zipping on demand
  • Image (.tif or .jpg) to .pdf (single document and multipage)
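
As a rough illustration of how the conversion list above could drive a workflow step, the following Python sketch maps each source extension to its target formats and plans the derivative file names. It performs no actual conversion; the table is transcribed from the slide, while the names `CONVERSIONS` and `plan_derivatives` are hypothetical.

```python
from pathlib import PurePath

# Source extension -> target extensions, transcribed from the slide.
CONVERSIONS = {
    ".xlsx": [".csv"],
    ".docx": [".pdf"],
    ".ppt": [".pdf"],
    ".tif": [".jpg", ".pdf"],
    ".jpg": [".pdf"],
}


def plan_derivatives(filename):
    """List the derivative file names a conversion step would need to produce."""
    source = PurePath(filename)
    # Unknown extensions yield an empty plan rather than an error.
    targets = CONVERSIONS.get(source.suffix.lower(), [])
    return [str(source.with_suffix(ext)) for ext in targets]
```

For example, `plan_derivatives("page001.TIF")` plans both a `.jpg` and a `.pdf` derivative, since the slide lists both conversions for TIFF images.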

Repository Component
  • Ingest to DSpace via SWORD
  • DOI generator

Data Ingest Workflows: Scenarios
  • Single part objects (individual images)
  • Multi-part objects (a book)
  • Multiple instantiations of a logical object (Word, PDF, and PPT versions of a research paper)
  • Multiple multi-part objects (a group of letters)
  • Research data products (multiple files of various types)

Single Part Objects

Single Part Objects Workflow

Steps shown in the workflow diagram
  • Derivative Generation
  • Format Validation and Verification
  • Fixity Check
  • Create Tech Metadata
  • Create Intellectual Metadata
  • Create Object Metadata
  • Persistent Identification
  • Deposit in Repository
  • Image Quality Checks

Outputs
For each original image
  • MD5 checksum
  • JHOVE validation and verification report
  • ImageMagick report
  • MIX file
For each derivative file
  • MD5 checksum
  • DOI
For each logical object
  • DC record
  • METS record
  • SWORD package
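
To make the "DC record" output concrete, here is a small Python sketch that serializes a few Dublin Core elements to XML with the standard library. The `record` wrapper element, the function name `build_dc_record`, and the sample values are assumptions for illustration; the slides do not specify the record layout the workflow produces.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Build a simple Dublin Core XML record from (element, value) pairs."""
    # "record" is a hypothetical wrapper element, not part of Dublin Core itself.
    root = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Sample values are made up for illustration only.
xml_text = build_dc_record([
    ("title", "Sample image"),
    ("identifier", "doi:10.0000/example"),
])
```

Passing pairs rather than a dictionary allows repeated elements, such as several `creator` entries, which Dublin Core permits.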

Multi-part Object Workflow

Comic Book
  • RIS
  • Set of .tif files

Steps shown in the workflow diagram
  • Create Tech Metadata
  • Derivative Generation
  • Format Validation and Verification
  • Fixity Check
  • Object Integrity
  • Create Intellectual Metadata
  • Create Object Metadata
  • Persistent Identification
  • Deposit in Repository
  • Image Quality Checks

Outputs
For each individual image file
  • MD5 checksum
  • JHOVE validation and verification report
  • ImageMagick report
  • MIX file
For each derivative file
  • MD5 checksum
For the whole object
  • DOI
  • DC record
  • METS record
  • SWORD package

Multiple Instantiations of a Logical Object Workflow

Papers
  • Each logical object per subdirectory
  • RIS, Word file, and (perhaps) supplemental file

Steps shown in the workflow diagram
  • Format Normalization
  • Format Validation and Verification
  • Fixity Check
  • Create Intellectual Metadata
  • Create Object Metadata
  • Persistent Identification
  • Deposit in Repository
  • Derivative Generation

Outputs
For each original object
  • MD5 checksum
  • JHOVE report
For each derivative object
  • MD5 checksum
  • Output from normalization process
  • DOI for delivery object
For the whole package
  • METS file
  • DC record
  • SWORD package

Multiple Multi-part Object Workflow

Ball collection
  • RIS for collection and inventory spreadsheet
  • Each logical object in separate subdirectory

Steps shown in the workflow diagram
  • Create Tech Metadata
  • Derivative Generation
  • Format Validation and Verification
  • Fixity Check
  • Create Intellectual Metadata
  • Create Object Metadata
  • Persistent Identification
  • Deposit in Repository
  • Image Quality Checks
  • Collection Integrity
  • Create Collection Metadata

Outputs
For each file
  • MD5 checksum
  • JHOVE report
  • MIX file
  • Scanning specifications
  • Derivative files
For each logical object
  • Derivative object
  • DC record
  • METS file
  • DOIs
For the whole collection
  • METS file
  • DC record
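
The slides do not say how the Collection Integrity step works. One plausible reading, given an inventory spreadsheet and one subdirectory per logical object, is a comparison between the two. The Python sketch below assumes the inventory has been exported to CSV with an identifier column; the column name `object_id` and the function name `check_collection` are hypothetical.

```python
import csv
from pathlib import Path


def check_collection(collection_dir, inventory_csv, id_column="object_id"):
    """Compare inventory identifiers with the subdirectories actually present."""
    with open(inventory_csv, newline="", encoding="utf-8") as handle:
        expected = {row[id_column].strip() for row in csv.DictReader(handle)}
    # Only subdirectories count as logical objects; loose files are ignored.
    found = {p.name for p in Path(collection_dir).iterdir() if p.is_dir()}
    return {
        "missing": sorted(expected - found),
        "unexpected": sorted(found - expected),
    }
```

An empty `missing` and `unexpected` list would indicate that the delivered collection matches its inventory, at least at the level of object names.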

Research Data Products

Vortex
  • A subdirectory for each experiment

Steps shown in the workflow diagram
  • Compress Data
  • Fixity Check
  • Create Intellectual Metadata
  • Create Object Metadata
  • Persistent Identification
  • Deposit in Repository

Research Data Products Outputs
  • Zipped data file
  • MD5 checksum
  • FGDC metadata record
  • Dublin Core record
  • METS record
  • SWORD package
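
The first two outputs, a zipped data file and its MD5 checksum, can be sketched in Python as follows. This is an illustration under assumptions, not the workflow's actual component: the function name `package_experiment` is hypothetical, and the output zip is assumed to live outside the experiment directory so the archive does not include itself.

```python
import hashlib
import zipfile
from pathlib import Path


def package_experiment(experiment_dir, output_zip):
    """Zip one experiment directory and return the MD5 of the resulting file."""
    experiment_dir = Path(experiment_dir)
    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sorted traversal keeps the member order stable between runs.
        for path in sorted(experiment_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(experiment_dir).as_posix())
    # Reads the whole zip into memory; a chunked read would suit very large files.
    return hashlib.md5(Path(output_zip).read_bytes()).hexdigest()
```

Computing the checksum on the zip itself, rather than on its members, matches the slide's pairing of one zipped file with one MD5 checksum.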

Post Deposit Curation Workflow Scenarios
  • Fixity verification
  • Format normalization
  • New or additional derivative generation
  • Media migration
  • Persistent identifier updates
  • Metadata updates

Future Work
  • Adding additional components
      • EAD from spreadsheet
      • MARC record support
      • PREMIS support
  • Testing in the lab
      • Digital library scanning labs
      • Research labs
  • Integrating with a production repository

Acknowledgements
This research was made possible through a generous grant from Microsoft Research and by the Data to Insight Center of Indiana University's Pervasive Technology Institute.

Thanks to Kavitha Chandrashankar and Quan Zhou for their help with developing components, workflows, and documentation.

Thank you

skowalcz@indiana.edu
http://d2i.indiana.edu