National Archives sign at Kew Gardens Station

2009-06-03

WARC file format published as an international standard

The big news today in the Web preservation world is the publication of the WARC file format as an international standard. Here's most of the announcement as circulated to various mailing lists:
The International Internet Preservation Consortium is pleased to announce the publication of the WARC file format as an international standard: ISO 28500:2009, Information and documentation -- WARC file format.

[http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]

For many years, heritage organizations have tried to find the most appropriate ways to collect and keep track of World Wide Web material using web-scale tools such as web crawlers. At the same time, these organizations were concerned with the requirement to archive very large numbers of born-digital and digitized files. What was needed was a container format that allows a single file to carry, simply and safely, a very large number of constituent data objects (of unrestricted type, including many binary types) for the purpose of storage, management, and exchange.

Another requirement was that the container need only minimal knowledge of the nature of the objects.

The WARC format is expected to be a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It is an extension of the ARC format [http://www.archive.org/web/researcher/ArcFileFormat.php], which has been used since 1996 to store files harvested on the web. The WARC format offers new possibilities, notably the recording of HTTP request headers, the recording of arbitrary metadata, the allocation of an identifier for every contained file, the management of duplicates and of migrated records, and the segmentation of the records. WARC files are intended to store every type of digital content, whether retrieved by HTTP or by another protocol.
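To make the container idea concrete, here is a minimal Python sketch of how records might be written to and read back from a single WARC-style byte stream. It is an illustration only, not a compliant implementation: the function names build_warc_record and iter_warc_records are hypothetical, the header layout (a "WARC/1.0" version line, named fields, a blank line, the content block, then two CRLFs) follows my reading of the standard, and only a handful of fields are shown. Real archives should be produced with an established tool such as those listed in the announcement.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type: str, target_uri: str, payload: bytes,
                      content_type: str = "application/octet-stream") -> bytes:
    """Serialize one record: version line, named fields, blank line, content block.

    Hypothetical helper for illustration; it covers only a small subset of fields.
    """
    fields = [
        ("WARC-Type", record_type),
        # Each record is allocated its own identifier.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields) + "\r\n"
    # The content block is followed by two CRLFs that close the record.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def iter_warc_records(data: bytes):
    """Yield (version, fields, payload) for each record in a concatenated stream."""
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        fields = dict(line.split(": ", 1) for line in lines[1:])
        start = head_end + 4
        # The reader relies on Content-Length only, so it needs no
        # knowledge of what kind of object the payload is.
        length = int(fields["Content-Length"])
        yield lines[0], fields, data[start:start + length]
        pos = start + length + 4  # skip the trailing CRLF CRLF


# Two objects of different types carried in one container.
archive = (
    build_warc_record("resource", "http://example.org/a", b"hello", "text/plain")
    + build_warc_record("resource", "http://example.org/b", b"\x00\r\n\r\n\xff")
)
for version, fields, payload in iter_warc_records(archive):
    print(version, fields["WARC-Target-URI"], len(payload))
```

Because the reader skips ahead by Content-Length rather than scanning the payload, arbitrary binary content (even bytes that look like a header terminator) passes through untouched, which is the "minimal knowledge of the nature of the objects" property the announcement describes.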

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium [http://netpreserve.org/], whose core mission is to acquire, preserve and make accessible knowledge and information from the Internet for future generations. The IIPC Standards Working Group put forward to ISO TC46/SC4/WG12 a draft presenting the WARC file format. The draft was accepted as a new Work Item by ISO in May 2005.

Over a period of four years, the ISO working group, with the Bibliothèque nationale de France [http://www.bnf.fr/ ] as convener, collaborated closely with IIPC experts to improve the original draft.

WG12 will continue to maintain the standard [http://bibnum.bnf.fr/WARC/] and prepare its future revision.

Standardization offers a guarantee of durability and evolution for the WARC format. It will help web archiving enter the mainstream activities of heritage institutions and other sectors by fostering the development of new tools and ensuring the interoperability of collections. Several applications are already WARC compliant, such as the Heritrix [http://crawler.archive.org/] crawler for harvesting, the WARC tools [http://code.google.com/p/warc-tools/] for data management and exchange, and the Wayback Machine [http://archive-access.sourceforge.net/projects/wayback], NutchWAX [http://archive-access.sourceforge.net/projects/nutch] and other search tools [http://code.google.com/p/search-tools/] for access. The international recognition of the WARC format and its applicability to every kind of digital object will provide strong incentives to use it within and beyond the web archiving community.

A press release is available on the IIPC website:

http://netpreserve.org/press/pr20090601.php
