Published on
24-Feb-2016
DESCRIPTION
Workflows for Digital Curation and Preservation. Stacy Kowalczyk, PASIG Dublin 2012, October 17, 2012. Topics: Goals; A Very Brief Introduction to Workflow Systems; Components for Curation; Workflow Scenarios; Future Work.
Transcript
Workflows for Curation
Workflows for Digital Curation and Preservation
Stacy Kowalczyk
PASIG Dublin 2012
October 17, 2012

Topics
- Goals
- A Very Brief Introduction to Workflow Systems
- Components for Curation
- Workflow Scenarios
- Future Work

Workflows for Curation
Goals
- Increase capacity and scalability of curation efforts
- Develop distributed curation processes
- Lower costs of curation activities
- Improve quality with systematic and repeatable processes
- Reduce human errors

Why Workflow Systems
- Repetitive and mundane activities simplified
- Facilitates and enforces best practices
- Enables efficient scheduling
- Machinery for coordinating the execution of services and linking together resources
- Facilitates outreach to researchers for direct deposit and automatic curation

Types of Workflow Systems
- Kepler
- BPEL
- Ptolemy II
- Triana
- Taverna

Trident
- Open source project
- Based on Microsoft Workflow Foundation classes
- Supported by Microsoft Research and academic researchers
- Integrates with myExperiment
- Well accepted in the research community: well over 100 peer-reviewed and white papers were discovered from one scholarly aggregation service
- Graphical workflow design and execution interface

Trident Workflow Components
- Fixity
- Data Integrity
- Metadata Creation
- Format Normalization and Derivative Generation
- Persistent Identification
- Repository Integration

Fixity Components
- MD5 checksum generator
- MD5 checksum validator
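
The two fixity components can be illustrated with a short sketch. This is not the Trident implementation (which builds on Microsoft Workflow Foundation); it is a minimal Python illustration of what a checksum generator and a checksum validator do. The function names `md5_checksum` and `validate_checksum` are hypothetical.

```python
import hashlib


def md5_checksum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_checksum(path, expected_hex):
    """Return True if the file's current MD5 matches a previously stored value."""
    return md5_checksum(path) == expected_hex.strip().lower()
```

The generator would run once at ingest and its output would be stored with the object; the validator recomputes the digest later and compares it with the stored value, so any change to the file's bytes shows up as a mismatch.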

Data Integrity Components
- JHOVE for format verification and validation
- Group validation (for object integrity)
- ImageMagick

Metadata Creation Components
- MIX data generator and validator
- METS data generator and validator
- Dublin Core and MODS as well

Format Components
Format conversions for normalization and derivative generation:
- .xlsx to .csv
- .docx to .pdf
- .ppt to .pdf
- .tif to .jpg
- Zipping on demand
- Image (.tif or .jpg) to .pdf (single document and multipage)
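
Of the format components listed, zipping on demand is the simplest to sketch with a standard library alone. The following is a minimal Python illustration, not the component from the talk; the function name `zip_directory` is hypothetical, and it assumes the archive is written outside the directory being zipped.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir, zip_path):
    """Zip every file under source_dir, storing paths relative to it."""
    source = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Sorted traversal gives a stable member order between runs.
        for file_path in sorted(source.rglob("*")):
            if file_path.is_file():
                arcname = file_path.relative_to(source).as_posix()
                archive.write(file_path, arcname=arcname)
    return Path(zip_path)
```

Storing member names relative to the source directory keeps local absolute paths out of the archive, which matters when the zipped file is the object that ends up in the repository.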

Repository Component
- Ingest to DSpace via SWORD
- DOI generator

Data Ingest Workflows
Scenarios
- Single part objects (individual images)
- Multi-part objects (a book)
- Multiple instantiations of a logical object (Word, PDF and PPT of a research paper)
- Multiple multi-part objects (a group of letters)
- Research data products (multiple files of various types)

Single Part Objects Workflow
Steps shown in the workflow diagram:
- Derivative Generation
- Format Validation and Verification
- Fixity Check
- Create Tech Metadata
- Create Intellectual Metadata
- Create Object Metadata
- Persistent Identification
- Deposit in Repository
- Image Quality Checks
Outputs
For each original image:
- MD5 checksum
- JHOVE validation and verification report
- ImageMagick report
- MIX file
For each derivative file:
- MD5 checksum
- DOI
For each logical object:
- DC record
- METS record
- SWORD package
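
The idea of chaining curation components into a workflow can be sketched in a few lines. This is only an illustration in Python, not how Trident composes activities; the step functions, the record layout, and the order in which steps run are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def fixity_check(record):
    """Hypothetical step: add an MD5 checksum for the original file."""
    data = Path(record["path"]).read_bytes()
    record["md5"] = hashlib.md5(data).hexdigest()
    return record


def create_object_metadata(record):
    """Hypothetical step: add a stand-in descriptive record.

    A real component would emit DC or METS; a dict keeps the sketch short.
    """
    path = Path(record["path"])
    record["dc"] = {"title": path.stem, "format": path.suffix.lstrip(".")}
    return record


def run_workflow(path, steps):
    """Pass one record through each step in order and log completed steps."""
    record = {"path": str(path), "completed": []}
    for step in steps:
        record = step(record)
        record["completed"].append(step.__name__)
    return record
```

A call such as `run_workflow("page001.tif", [fixity_check, create_object_metadata])` (the file name is made up) returns one record per image carrying the accumulated outputs, which is the sense in which a workflow makes the process systematic and repeatable.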

Multi-part Object Workflow
Comic Book
- RIS
- Set of .tif files
Steps shown in the workflow diagram:
- Create Tech Metadata
- Derivative Generation
- Format Validation and Verification
- Fixity Check
- Object Integrity
- Create Intellectual Metadata
- Create Object Metadata
- Persistent Identification
- Deposit in Repository
- Image Quality Checks
Outputs
For each individual image file:
- MD5 checksum
- JHOVE validation and verification report
- ImageMagick report
- MIX file
For each derivative file:
- MD5 checksum
For the whole object:
- DOI
- DC record
- METS record
- SWORD package

Multiple Instantiations of a Logical Object Workflow
Papers
- Each logical object per subdirectory
- RIS, Word file and (perhaps) supplemental file
Steps shown in the workflow diagram:
- Format Normalization
- Format Validation and Verification
- Fixity Check
- Create Intellectual Metadata
- Create Object Metadata
- Persistent Identification
- Deposit in Repository
- Derivative Generation
Outputs
For each original object:
- MD5 checksum
- JHOVE report
For each derivative object:
- MD5 checksum
- Output from normalization process
- DOI for delivery object
For the whole package:
- METS file
- DC record
- SWORD package

Multiple Multi-part Object Workflow
Ball collection
- RIS for collection and inventory spreadsheet
- Each logical object in separate subdirectory
Steps shown in the workflow diagram:
- Create Tech Metadata
- Derivative Generation
- Format Validation and Verification
- Fixity Check
- Create Intellectual Metadata
- Create Object Metadata
- Persistent Identification
- Deposit in Repository
- Image Quality Checks
- Collection Integrity
- Create Collection Metadata
Outputs
For each file:
- MD5 checksum
- JHOVE report
- MIX file
- Scanning specifications
- Derivative files
For each logical object:
- Derivative object
- DC record
- METS file
- DOIs
For the whole collection:
- METS file
- DC record
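
A collection integrity check for this scenario could compare the inventory against the subdirectories actually delivered. The sketch below is one possible reading of that step, not the component from the talk. It assumes the inventory spreadsheet has been exported to CSV and has a column named `object_id` whose values match the subdirectory names; both the column name and the function name are hypothetical.

```python
import csv
from pathlib import Path


def check_collection_integrity(collection_dir, inventory_csv, id_column="object_id"):
    """Compare inventory identifiers with the subdirectories actually present."""
    with open(inventory_csv, newline="", encoding="utf-8") as handle:
        expected = {row[id_column].strip() for row in csv.DictReader(handle)}
    present = {entry.name for entry in Path(collection_dir).iterdir() if entry.is_dir()}
    return {
        "missing": sorted(expected - present),     # in the inventory, no subdirectory
        "unexpected": sorted(present - expected),  # subdirectory not in the inventory
    }
```

An empty result in both lists would mean every logical object in the inventory has its own subdirectory and nothing extra was delivered.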

Research Data Products
Vortex
- A subdirectory for each experiment
Steps shown in the workflow diagram:
- Compress Data
- Fixity Check
- Create Intellectual Metadata
- Create Object Metadata
- Persistent Identification
- Deposit in Repository
Outputs
- Zipped data file
- MD5 checksum
- FGDC metadata record
- Dublin Core record
- METS record
- SWORD package
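
As an illustration of one of these outputs, a simple Dublin Core record can be generated with Python's standard XML library. This is a minimal sketch rather than the metadata component from the talk: the wrapper element name `metadata`, the function name, and the field values in the usage note are assumptions. The namespace URI is the standard one for the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialize elements with the dc: prefix


def dublin_core_record(fields):
    """Build a simple Dublin Core XML record from (element, value) pairs."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields:
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")
```

For example, `dublin_core_record([("title", "Experiment 1"), ("format", "application/zip")])` returns an XML string with one `dc:title` and one `dc:format` element. Taking a list of pairs rather than a dict allows repeated elements, such as several `dc:creator` entries.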

Post Deposit Curation Workflow
Scenarios
- Fixity verification
- Format normalization
- New or additional derivative generation
- Media migration
- Persistent identifier updates
- Metadata updates

Future Work
- Adding additional components
  - EAD from spreadsheet
  - MARC record support
  - PREMIS support
- Testing in the lab
  - Digital library scanning labs
  - Research labs
- Integrating with a production repository

Acknowledgements
- This research was made possible through a generous grant by Microsoft Research and by the Data to Insight Center of Indiana University's Pervasive Technology Institute
- Thanks to Kavitha Chandrashankar and Quan Zhou for their help with developing components, workflows, and documentation

Thank you
skowalcz@indiana.edu
http://d2i.indiana.edu