Dr Liz Lyon, Associate Director Outreach, UK Digital Curation Centre
An Introduction
Digital Curation Centre: a centre of support for data curation and preservation
Grand Challenge Meeting, Bath, June 2005

Slide 2
Repositories and digital curation
Data preservation: static. For later use?
Data curation: dynamic. In use now (and the future)?
Maintaining and adding value to a trusted body of digital information for current and future use

Slide 3
Assuring permanent access to the records of science & the humanities?
Long term access to primary data
Increasing data volumes from eScience and Grid-enabled / cyberinfrastructure applications
Changing research paradigm: data-driven science, big science
Observational data, simulations, large-scale experimentation
Multi-media resources, statistical data, surveys, geo-spatial data

Slide 4

Slide 5
Facilitate post-processing and knowledge extraction
Enable the acquisition of newly-derived information and knowledge
Run complex algorithms over primary datasets
Mining (data, text, structures)
Modelling (economic, climate, mathematical, biological)
Analysis (statistical, lexical, pattern matching, gene)
Presentation (visualisation, rendering)

Slide 6

Slide 7
Provide additional functionality beyond digital preservation processes
Annotations
Gene and protein sequences
e-Lab books (Smart Tea Project in chemistry)

Slide 8
The scholarly knowledge cycle: linking research data to publications
Diagram labels:
  Research & e-Science workflows
  Aggregator services: national, commercial
  Repositories: institutional, e-prints, subject, data, learning objects
  Data curation: databases & databanks
  Validation
  Harvesting metadata
  Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
  Deposit / self-archiving
  Peer-reviewed publications: journals, conference proceedings
  Publication
  Validation
  Data analysis, transformation, mining, modelling
  Searching, harvesting, embedding
  Presentation services: subject, media-specific, data, commercial portals
  Resource discovery, linking, embedding
  Linking
eBank UK Project: http://www.ukoln.ac.uk/projects/ebank-uk/
Emerging policy on open access to data

Slide 9
DCC people (some of them)
Management & Co-ordination: Director Chris Rusbridge (University of Edinburgh)
Community Support & Outreach: led by Dr Liz Lyon (UKOLN, University of Bath)
Service Definition & Delivery: led by Professor Seamus Ross (HATII [ERPANET], University of Glasgow)
Development: led by Dr David Giaretta (Astronomical Software & Services, CCLRC)
Research: led by Professor Peter Buneman (Informatics, University of Edinburgh)

Slide 10
(Some of) the challenges we face
Standards: Interoperability issues: technical & ??soluble
Scale: Volume and diversity of datasets
Culture: Bringing communities together
  Library/information science/archives document tradition
  Domain research (chemists, astronomers, biologists)
  Computer science (databases)
  Commercial suppliers (storage technology)
Process & Skills: Highly-distributed organisation
  Use collaborative tools, combined skills
Engagement: Existing work & key players

Slide 11
User requirements analysis: some sound bytes
R&D issues: Annotation services, Ontology development, Automating metadata creation, Tools and toolkits, Data Format Description Language, Identifiers, Registries, Economic and cost-benefits studies
Advisory services: Ask-a-Curator, FAQs, reports, briefings, awareness-raising materials, best practice guidance, Storage media, Like Erpanet, advise Government, Research Councils, funding bodies
Professional development: Short courses, conferences, seminars, workshops, secondments to DCC and to working repository services
Outreach: Leadership for the future, case studies, sharing solutions, collaboration with other partners, international peers, industry links
Taxonomy of Users

Slide 12
Outline Taxonomy of digital curation users by role
1. Data Creators
2. Data Curators
3. Data Re-users
4. Policy makers
   - funding bodies
   - other leaders
Data Preservers
Data publishers

Slide 13
Outline Taxonomy by significant function of organisational entity
1. Research
2. Service provision
3. Learning & teaching
4. Funders
5. Policy / strategy makers
Designated communities
Commercial

Slide 14
Advisory services
Responses to queries: from legal to technical guidance
HELPDESK@dcc.ac.uk
FAQs constructed
Informing workshops and information services
Monthly site visits (National Institute of Environmental eScience)

Slide 15
Professional development workshops: 2005 Programme
Persistent identifiers: June, Glasgow
Institutional repositories: July, University of Cambridge, with DSpace
Cost models: July, British Library, London, with the Digital Preservation Coalition
Preservation of medical databases: October, Gulbenkian Institute, Lisbon, with ERPANET & the Wellcome Trust

Slide 16
Standards Watch
Covering existing and emerging standards
Working with community and standards bodies (e.g. ISO)
Organising associates groups around new standards developments
Initiating standardisation definitions where gaps identified
Currently re-purposing Diffuse database of standards materials

Slide 17
Digital Curation Manual
A world class resource
Constructed from topic-specific chapters written by international experts
Editorial board comprising leading researchers and practitioners
45 initial topics including Appraisal and Selection; Costs; Freedom of Information; Interoperability; the OAIS Reference Model; Preservation Strategies; and Open Source
Less in-depth insight offered by DCC Briefing Papers, aimed at needs of senior managers

Slide 18
OAIS Reference Model: Functional Model

Slide 19
Audit and Certification (1)
How can people know who to entrust with their information?
There is a demand for a certification process for Repositories and components, e.g. archive storage
Software Certification standards (ISO 9000 and ISO 17799) do not do the job
OCLC/RLG Trusted Digital Repositories: Attributes and Responsibilities
  high level model for design, delivery and maintenance of digital repositories

Slide 20
Audit and Certification (2)
International expert group led by RLG and NARA is drafting a Certification standard
DCC is participating: aiming for international consensus
Draft goes to Technical Editor end of June
DCC testbeds to support development of audit and certification standards
Commitment to:
  offer guidance on self-audit and self-certification
  carry out independent audits
  issue certificates to qualifying repositories

Slide 21
Tools and Technologies
Accumulate and Maintain Registry and online Repository of relevant tools
  Repository Implementations
  Packaging Tools
  Rendering Software
  Format Converters
  Device Drivers

Slide 22
Representation Registry development
Simple PHP prototype
Scoping study
  Formats, standards, tools
More robust prototype in development
  Based on ebXML & JAXR
  Potentially distributed, cooperative maintenance model
Representation information: describe CCLRC (science) data using EAST
Links to PRONOM, GDFR and other pilots
Aim to handover to services
Development info: see http://dev.dcc.ac.uk for details of Wiki and email list open to all

Slide 23
Research agenda (1)
Publishing & integrating scientific databases
Archiving past states of volatile databases
Database provenance and annotation
Organisational dynamics of trusted repositories
Automating metadata extraction
Cost-benefit analysis of data curation
Rights and responsibilities

Slide 24
The database picture
Source data
Curated data: classified, cleaned, annotated, integrated, cross-linked

Slide 25
Curated databases: some issues
Integrating, publishing and citing data so that someone else can use it.
Annotating existing data and moving annotations to other databases
Provenance: where did this data come from?
Archiving: how do you preserve something that is constantly changing?

Slide 26
Research agenda (2)
Publishing & integrating scientific databases
Archiving past states of volatile databases
Database provenance and annotation
Organisational dynamics of trusted repositories
Automating metadata extraction
Cost-benefit analysis of data curation
Rights and responsibilities
  Public domain, public interest, public funding: paper by Waelde & McGinley

Slide 27
www.dcc.ac.uk

Slide 28
www.ijdc.net
Launch planned July
Peer-review
Editorial Board
Peter Buneman, Editor (research)
Production editor: Philip Hunter
Papers for submission are very welcome!

Slide 29
1st DCC International Conference
Location: Bath, UK
29-30 September 2005
Keynote speakers:
  Clifford Lynch, CNI
  Graham Cameron, European Bio-informatics Institute
DCC Research update
Social highlights

Slide 30
Associates Network
Goals: Develop understanding, share best practice, advance research, promote recognition, develop consensus
Membership: International groups, national bodies, industry partners, funders, research groups, HEIs, FEIs, individuals
Benefits: Early access to R&D outputs, advisory services, training, input to definition and design, community participation
Discussion Forum
www.dcc.ac.uk
Please join us!