Contents:
- 1. Introduction
- 2. Issues
- 3. Standards
- 4. Software
- 5. Case Studies
- 6. Conclusions and Recommendations
- 7. Glossary
- 8. References and Bibliography
This report introduces and discusses the key issues faced by organizations engaged in web archiving initiatives, for example selection, limitations of current crawlers, authenticity, temporal coherence, viruses and de-duplication. Section 4 presents available integrated systems, such as the PANDORA Digital Archiving System (PANDAS), Web Curator Tool (WCT) and NetarchiveSuite, followed by separate subsections on commercial web archiving services; web crawlers, such as HTTrack and Wget; NutchWAX and Solr for indexing and searching web archive collections; and access components and tools, such as Wayback and Memento. The different ways in which a web archiving solution may be implemented are illustrated in Section 5 through three case studies:
- The UK Web Archive: a national library with a self-hosted and self-managed, open source web archiving solution.
- The Internet Memory Foundation: a web archiving service offered by a non-profit foundation and used by several small, medium and large institutions to host and manage their web archiving collections.
- The Coca-Cola Web Archive: a non-public commercial web archive hosted and managed by a third-party commercial organization.
The author concludes that, despite all of the effort, web archives still face significant challenges, especially when archiving social and collaborative platforms. She calls for more funding to develop tools that can raise quality assurance to a higher level, as this area has made the least progress over the last decade. She also concludes that tools to support the long-term preservation of web archives are similarly under-developed. Last but not least, legislative challenges can inhibit collection and limit access. This report is part of the DPC Technology Watch Report series.
The report's comprehensive coverage of web archiving, presented in an easy-to-read style, makes it an interesting read for those who are responsible for managing the lifecycle of web content and are new to web archiving. It targets (AV) archivists who wish to broaden their knowledge of web archiving before embarking on or revising their own initiatives. Because of its summative nature, existing practitioners might also find value in the report.