A METS based information package for long term accessibility of Web Archives

Contents:

  • 1. Introduction
  • 2. Web Archiving
  • 3. Preservation Requirements
  • 4. Conclusion

The British Library’s web archive comprises several terabytes of harvested websites. Like other content streams, this data is ingested into the library’s central preservation repository via a standardized Submission Information Package (SIP). Harvested websites are stored in Archival Information Packages (AIPs), and each AIP is described by a METS file. This paper describes how the operational metadata for resource discovery and the archival metadata are normalized and embedded in the METS descriptor using common metadata profiles such as PREMIS and MODS. For example, the British Library’s METS profile for web archiving defines two different events: a virus check event and a migration event. The profile avoids redundant storage of metadata. The underlying complex content model disaggregates websites into web pages, associated objects and their actual digital manifestations. Defining abstract entities and their manifestations as separate objects allows future tools to support a complex end-to-end process without relying on proprietary data structures. The additional abstract layer ensures accessibility over the long term and makes it possible to carry out preservation actions such as migrations. In this way the library-wide preservation policies and principles become applicable to web content as well. The web archiving profile supports the whole business process as described in the OAIS model, even though it does not define a Dissemination Information Package (DIP). The profile ensures that all the metadata is part of the SIP/AIP, thus enabling long-term accessibility. The author concludes by pointing to the importance of using standardized metadata frameworks and schemas such as METS and PREMIS, as well as an extensible and flexible content model. The paper was presented at the iPRES 2010 conference.
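To illustrate how a PREMIS event can be embedded in a METS descriptor, the following is a minimal sketch in Python. It is not the British Library's actual profile: the function name `build_mets_with_event`, the section IDs, the identifier type "local" and the use of the PREMIS version 2 namespace are all assumptions made for illustration. It only shows the general METS pattern of wrapping extension-schema metadata in `digiprovMD`/`mdWrap`/`xmlData`.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for METS and PREMIS version 2 (the PREMIS version is an assumption).
METS_NS = "http://www.loc.gov/METS/"
PREMIS_NS = "info:lc/xmlns/premis-v2"

ET.register_namespace("mets", METS_NS)
ET.register_namespace("premis", PREMIS_NS)


def _mets(tag):
    """Return a METS-qualified tag name."""
    return f"{{{METS_NS}}}{tag}"


def _premis(tag):
    """Return a PREMIS-qualified tag name."""
    return f"{{{PREMIS_NS}}}{tag}"


def build_mets_with_event(event_type, event_id, event_datetime):
    """Build a skeletal METS document holding one PREMIS event.

    All IDs below are hypothetical placeholders, not values from the
    British Library profile.
    """
    mets = ET.Element(_mets("mets"))
    amd_sec = ET.SubElement(mets, _mets("amdSec"), ID="amd0001")
    digiprov = ET.SubElement(amd_sec, _mets("digiprovMD"), ID="digiprov0001")
    # mdWrap declares which extension schema the wrapped XML belongs to.
    md_wrap = ET.SubElement(digiprov, _mets("mdWrap"), MDTYPE="PREMIS:EVENT")
    xml_data = ET.SubElement(md_wrap, _mets("xmlData"))

    event = ET.SubElement(xml_data, _premis("event"))
    identifier = ET.SubElement(event, _premis("eventIdentifier"))
    ET.SubElement(identifier, _premis("eventIdentifierType")).text = "local"
    ET.SubElement(identifier, _premis("eventIdentifierValue")).text = event_id
    ET.SubElement(event, _premis("eventType")).text = event_type
    ET.SubElement(event, _premis("eventDateTime")).text = event_datetime
    return mets


if __name__ == "__main__":
    root = build_mets_with_event("virus check", "event-0001", "2010-09-19T10:00:00Z")
    print(ET.tostring(root, encoding="unicode"))
```

A migration event would be recorded the same way with a different `eventType` value; a real descriptor would also link the event to the file or manifestation it applies to.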

A readable paper explaining the practical implementation of the web archiving profile applied at the British Library. It includes XML code fragments showing, among other things, how the PREMIS metadata schema serves as an extension schema to METS, along with concrete examples of the application of administrative, technical and provenance metadata. This makes the paper relevant for IT staff who are exploring how to approach the preservation of a web archive through an existing digital repository. This real-life use case shows policy makers at (AV) archives that a web archive can be an integral part of a trusted digital repository (TDR).