Digital Preservation through Cooperation: LOCKSS
Gail McMillan
Digital Library and Archives, University Libraries
Virginia Polytechnic Institute and State University
VIVA Steering Committee and SCHEV LAC
Virginia State University, June 10, 2005

Libraries: Collections, not just Links
Libraries should own, as well as manage, their digital collections, including:
  Content currently leased: VIVA examples
    BioOne, Cambridge University Press, Nature Publishing Group, Project Muse
    See http://lockss.stanford.edu/about/titles.htm
    LOCKSS prevents the publisher from revoking access rights to back content
  Open-access web resources, for example
    Abbey's Web: provides links to biographical information, bibliographies, articles, and other resources about the environmental writer Edward Abbey: http://www.abbeyweb.net/

LOCKSS Basics
  Library uses an inexpensive computer and free software
  Programmatically collects content from the publisher
  Preserves content among LOCKSS servers
  Periodically audits content and repairs as needed from other LOCKSS servers
  Disseminates content to the library's appropriate users
    Host library's readers see the content from the publisher's URL
    Unless it isn't available from there; then it is delivered from the reader's library's LOCKSS-preserved content
    It doesn't look any different

LOCKSS and EJournals
  Library (consortium) negotiates with publishers
  Publishers trust LOCKSS
    Collections begin with subscriptions, not retrospectively
  Libraries have access to their collections in perpetuity
  Outside the appropriate user community, access only to audit and repair files
  Low cost to administer and run
    Less than 1 hour per month
    95% of systems patched in 48 hrs
  Low storage costs (2003): $0.70 = one year, one journal, ~0.5 GB
    600 MHz, 128 MB RAM, bootable CD drive, floppy disk drive
    One PC holds >3,000 years of an average electronic journal (2005)
  LOCKSS software turns a PC into a preservation tool

LOCKSS and Publishers
Suggested license language permits libraries to:
  Collect and preserve currently accessible materials, i.e.,
subscription-based content
  Use materials consistent with original license terms
  Provide copies to others for purposes of audit and repair

Review of Writing and Photography of Appalachia

LOCKSS is for more than just ejournals
  MetaArchive of Southern Digital Culture
  ETDs: Electronic Theses and Dissertations
    ASERL: Association of SouthEastern Research Libraries
  9/11 web sites -- NYPL
  Newspapers -- University of Utah
  Government Documents

NDIIPP: National Digital Information Infrastructure and Preservation Program
  Created by federal legislation in December 2000
  Support preservation of significant born-digital content at risk
  Three areas of focus
    Network of preservation partners: clear instructions from legislators that LC should work with others
    Architectural framework for preservation
    Digital preservation research

MetaArchive NDIIPP Network
http://www.metaarchive.org
  Auburn University
  Emory University
  Georgia Tech
  Virginia Tech
  University of Louisville
  Florida State University

Key Features of a Secure MetaArchive
  Distributed preservation strategy
  Flexible organizational model
  Formal content selection process
  Capability for migrating archives
  Dim archiving strategy
  Low cost of deployment
  Self-sustaining incentives
  Simple preservation exchange mechanisms with the Library of Congress

MetaArchive Project Goals
  Create a conspectus of digital content within the subject domain held by the partner sites
  Harvested body of the most critical content to be preserved (3 terabytes, w/ capability to expand)
  Develop a model cooperative agreement for ongoing collaboration and sustainability
  Distributed preservation network infrastructure based on the LOCKSS software

MetaArchive: Deliverables, more than CLOCKSS
  Define the Scope of the Content
    What is Southern digital culture?
    What is at risk?
  Developing a Conspectus: Content Selection
    What collections will be preserved?
  Metadata
    Adaptations showing any unique or qualified tags
  Rights issues: harvesting for preservation vs.
user access
  MetaArchive's CLOCKSS (Collecting Lots of Copies Keeps Stuff Safe)
    Diversifying LOCKSS
      Software, hardware, collections, communities
    Study problems
      Dynamic content
      Format migration (next grant)
  Cooperative agreement model
    Not only an effective preservation network for one body of digital content, but enable the creation of many others for this important purpose.

http://www.lockss.org

The Digital Preservation Problem
Preservation of digital content is an enormous issue, too big for individual institutions to handle alone.
Working together cooperatively, we'll develop effective models for how to do this.
Preservation system requirement: no single point of failure.

The LOCKSS (Lots of Copies Keeps Stuff Safe) initiative was developed at Stanford to provide libraries with low-cost online storage of content they license from publishers. This enables libraries operating in the online environment to again own the material they purchased, as they do in the print environment. Libraries can return to the necessary role of custodians of scholarly digital materials. Because materials are stored at several sites, the risk of losing content in a disaster is minimised: other copies are stored elsewhere. LOCKSS claims to be the only system available that addresses these problems.

LOCKSS is not expensive, so it can work with small as well as large publishers. It works well for any digital format delivered via the web. LOCKSS system administration can be shared among those with technical expertise and those without. You decide whether to store content on a small number of large machines or on a smaller machine.

LOCKSS is expanding its applications beyond ejournals. I'm involved in:
  MetaArchive of Southern Digital Culture
  ASERL ETDs

Institutional Repositories: libraries are not generally expecting IRs to solve the e-journal preservation problem.
They are turning to solutions such as LOCKSS to do that.

Library uses an inexpensive PC and free LOCKSS software downloaded from the web
LOCKSS clusters six independent servers that audit each other, keeping ejournals complete.
It periodically collects content from the publisher
  If there is a permission statement on the publisher's journal (called a manifest page)
  Any format: text, sound, video, images, etc.
  Publishers grant permission for libraries to collect materials in chunks, called LOCKSS archival units, typically a volume and all its component issues
Preserves content among LOCKSS machines at other institutions
Periodically audits content among LOCKSS machines and repairs as needed
Disseminates content to library users
  Host library's readers see the content at the original URL
  Unless it isn't available from there, and then it is delivered from the reader's library's LOCKSS-preserved content.
  It looks the same from either source, with the exception of dynamic content such as advertisements or rotating GIFs.

70 publishers, >2,000 titles, endorse LOCKSS
Library (consortium) negotiates with publishers
  Permission to collect and preserve content is an addition to the existing agreement, not a separate one.
Publishers trust LOCKSS
  Collections begin with subscriptions, not retrospectively; publishers like this
  Publishers continue to control who collects their e-content and how it is used.
Libraries have access to these collections in perpetuity, even after subscriptions end
Outside the appropriate user community, access only to audit and repair files
  Content is not shared among machines: readers at institutions that did not collect it from the publisher do not get access via an institution that did.
Access to LOCKSS-preserved content is triggered when someone in your authorized community of library users can't access an article, for example, from the publisher's database.

Costs associated with LOCKSS are all low
  Low system administration costs
  Storage costs low (2003): $0.70 = one year, one journal (~0.5 GB)

Public access: preservation and access
Dark archive: preservation only

MetaArchive of Southern Digital Culture: I'll tell you more about this project
ETDs: Electronic Theses and Dissertations
  ASERL: Association of SouthEastern Research Libraries
9/11 web sites -- NYPL
Newspapers -- University of Utah
  State newspaper project with others
Government Documents
  Half-life of a federal web resource is 4 months
  12 library partners looking for support -- international, federal, state, and local gov docs
  In a free society, citizens must be able to access the information published by their governments. A decade of experience with Web-published government information has demonstrated that leaving materials only in the agencies' custody can result in loss of important publications.

One of the issues we have to deal with in some of these projects is defining the unit to be captured and preserved periodically, since we often don't have nice logical units like a volume that is a collection of issues. I've learned that collections in my Digital Library and Archives are far less dead than you'd expect to find in Special Collections.
So we've renamed our version of LOCKSS to CLOCKSS.

Primary outcomes for partnerships:
  Identify and preserve significant content
  Leverage resources, experience via collaboration
  Promote standards and best practices

These 6 collaborating institutions received an award of $1.3M to develop, over 3 years, a preservation cooperative for digital content with a particular focus on Southern culture and history.

Some of the MetaArchive partners will also be part of the ASERL ETD project. Those partners, at this time (6/10/05), are:
  University of Kentucky -- Beth Kraemer
  VT -- me
  FSU -- Robert McDonald
  LOCKSS -- Vicky Reich and Tom Robertson (MetaArchive partner too)
  Georgia Tech -- Tyler Walters
  Vanderbilt -- project initiated by Paul Gherman
  ASERL (John Burger, executive director, coordinating)

[Pettersens renovated Alford-Nixon House]

Our philosophy is that effective digital preservation efforts succeed through a strategy of dispersing multiple copies of content in secure, distributed locations over time and validating the integrity of those copies periodically. We anticipate that if a file at one institution is lost or damaged, it can be replaced with the same file from one of the other five institutions.

Later this summer we will run our first test of the distributed, but closed, preservation network. It is a dark archive, accessible only to these partners, and some of it will be open to the partners only at very specific and brief periods, e.g., embargoed ETDs.

The purpose of a dark archive is to function as a repository for information that can be used as a failsafe during disaster recovery.
http://www.webopedia.com/TERM/D/dark_archive.html

EJOURNALS ARE IN A LIGHT ARCHIVE
That is, access is open to all the appropriate members of the community.

[Anisfield house]

Content Selection
  Preservation efforts are likely to be most coherent around a shared focus
    Subject domain: Southern culture and history
  Selection of collections to be preserved is made by teams of subject specialists and archivists at partner institutions
    These teams are creating a conspectus of collections for consideration and prioritization
    Using the collection framework of the Encyclopedia of Southern Culture
  The selection of specific materials is left to the cooperating institutions

Developing a Conspectus: Content Selection
  What collections will be preserved?
  Metadata schema adapted from many sources: Dublin Core, UK/RSLP, Western States DC Best Practices, IMLS/DCC-UIUC, OCLC/RLG PREMIS (Preservation Metadata)
  Metadata accompanies and makes reference to each collection and provides associated descriptive, structural, administrative, and other kinds of information. (Clifford Lynch, D-Lib Magazine, 1999)
  We will publish online our adapted use of metadata, showing any unique or qualified tags that are used (Storage & Use MD and metadata adapted for LOCKSS are of particular interest).
  Ensuring that we have appropriate rights to harvest and distribute, whether among partners in a closed preservation network or to our sponsor, the Library of Congress, our national library for the public
  Can the digital collections be made available for harvesting?

Small preservation caches vs. megacaches like MetaArchive:
Disk storage arrays attached to each vault server in the network can store 2 TB. Most of this capacity (90%) will be dedicated to the shared content harvest that the cooperative will jointly identify and assemble. The remainder of the network's capacity will be allocated for preservation of critical content identified solely by the individual partners.
By allocating a quota of 40 GB of replicated, secure storage to each partner for preservation of locally determined content, we offer a clear incentive for members to continue in the cooperative in the future. By creating a mechanism for cooperative members to contribute both to the common good and to individual interests, we strike an effective balance that will be sustainable over time.

Will develop a simple and flexible cooperative agreement as a model for other institutions seeking to cooperate for purposes of digital preservation
The cooperative seeks not only to create an effective preservation network for one body of digital content, but to enable the creation of many others for this important purpose

The LOCKSS approach tries to prevent content being lost through budget cuts by dispersing all costs and responsibilities across many institutions. The system's robustness depends upon redundancy of hardware, software, content and administration.
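
The audit-and-repair idea that runs through these notes (several institutions hold the same file, compare copies periodically, and replace a damaged copy from the others) can be illustrated with a small sketch. This is not the LOCKSS polling protocol, which is considerably more involved. It is a simplified majority vote over content hashes. The function name audit_and_repair, the peer labels, and the sample bytes are all hypothetical and chosen only for illustration.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return a SHA-256 fingerprint of one copy of the content."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(copies: dict[str, bytes]) -> list[str]:
    """Compare all copies by hash and overwrite dissenting ones.

    Simplifying assumption: the version held by a strict majority
    of peers is treated as the good one. Returns the names of the
    peers whose copies were repaired.
    """
    votes = Counter(digest(c) for c in copies.values())
    winning_digest, count = votes.most_common(1)[0]
    if count <= len(copies) // 2:
        raise ValueError("no majority agreement; manual review needed")

    # Pick any copy that matches the majority fingerprint.
    good = next(c for c in copies.values() if digest(c) == winning_digest)

    repaired = []
    for peer, content in list(copies.items()):
        if digest(content) != winning_digest:
            copies[peer] = good  # replace the damaged copy
            repaired.append(peer)
    return repaired


# Six hypothetical peers holding the same archival unit.
peers = {name: b"Volume 12, Issue 3" for name in
         ["auburn", "emory", "gatech", "vt", "louisville", "fsu"]}
peers["fsu"] = b"Volume 12, Iss\x00e 3"  # simulate a damaged file

print(audit_and_repair(peers))  # ['fsu']
```

Requiring a strict majority means a single damaged copy among six is repaired automatically, while an even split is flagged for a person to review instead of being resolved by guesswork.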