Dataset originally created 1/25/2021

I. About This Dataset

This dataset is hosted by LC Labs on https://labs.loc.gov/work/experiments/webarchive-datasets/. It consists of web archive capture indexes (CDX files) for content harvested as part of the United States Elections Web Archive (https://www.loc.gov/collections/united-states-elections-web-archive/?fa=partof:united+states+elections+2000).

This is a multi-part dataset containing CDX indexes for the entire United States Elections Web Archive (https://www.loc.gov/collections/united-states-elections-web-archive/about-this-collection/). The entire dataset consists of CDX indexes for every election year from 2000 through the most recent election out of embargo. See Section VI below for more details about rights and access restrictions.

CDX indexes contain one line per web object in the Web Archive, with fields delimited by a single space. See the CDX specification for more information about the format: https://web.archive.org/web/20171123000400/http://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/.

This dataset was initially gathered to satisfy a specific researcher request, but the public can access it until two years after the creation date.

II. What's Included?

This dataset includes 411,815 .cdx.gz files (totaling 260 GB), including derived metadata for each resource in the United States Elections Web Archive. The fields and their contents are described in Section IV below.
The parts of the set consist of the following:

- Election 2000 - 3,521 .cdx.gz files (totaling 1.98 GB)
- Election 2002 - 9,483 .cdx.gz files (totaling 5.24 GB) - there may be 1 more CDX file added to this set
- Election 2004 - 48,451 .cdx.gz files (totaling 11 GB) - there may be 128 more CDX files added to this set
- Election 2006 - 95,797 .cdx.gz files (totaling 31.15 GB)
- Election 2008 - 141,288 .cdx.gz files (totaling 41.24 GB) - there may be 811 more CDX files added to this set
- Election 2010 - 8,974 .cdx.gz files (totaling 35.05 GB) - there may be 731 more CDX files added to this set
- Election 2012 - 16,211 .cdx.gz files (totaling 31.29 GB)
- Election 2014 - 21,541 .cdx.gz files (totaling 20.66 GB)
- Election 2016 - 50,796 .cdx.gz files (totaling 63.93 GB)
- Election 2018 - 15,753 .cdx.gz files (totaling 19.20 GB)

In total, the Library may add 1,671 more CDX files to the existing datasets. The Library will add CDX files for subsequent election years as the content exits embargo.

III. How Was It Created?

The web archive CDX indexes are created as part of a normal process to provide access to archived web objects via the Wayback Machine software. The files in this dataset were gathered from their storage locations and collated into the public AWS S3 bucket for download.

IV. Dataset Field Descriptions

This section lists and describes each of the fields included in the United States Elections Web Archive CDX indexes as they align with the CDX specification linked in Section I. The CDX indexes contain 11 fields (listed in the first line of each CDX file), with the corresponding information for each field as follows:

- urlkey (N): the URL of the captured web object, without the protocol (http://) or the leading www, and in SURT format (http://crawler.archive.org/articles/user_manual/glossary.html#surt). This information was extracted from the CDX index file.
- timestamp (b): timestamp in the form YYYYMMDDhhmmss.
  The time represents the point at which the web object was captured, measured in GMT, as recorded in the CDX index file.
- original (a): the URL of the captured web object, including the protocol (http://) and the leading www, if applicable, extracted from the CDX index file.
- mimetype (m): the media type as recorded in the CDX.
- statuscode (s): the HTTP response code received from the server at the time of capture, e.g., 200, 404. In this case, only codes that matched "200" were selected.
- digest (k): a unique, cryptographic hash of the web object’s payload at the time of the crawl. This provides a distinct fingerprint for the object; it is a Base32 encoded SHA-1 hash, derived from the CDX index file.
- redirect (r): likely blank
- metatags (M): likely blank
- file_size (S): the size of the web object, in bytes, derived from the CDX index file.
- offset (V): the location of the resource in the compressed Web Archive (WARC) file that stores the full archived object
- WARC filename (g): the name of the compressed Web Archive (WARC) file that stores the full archived object

V. Usage

Each dataset has a manifest with links to each individual CDX file. Currently, this is the only way to access the CDX files. Programmatically accessing the files via the manifest may be the most efficient way to obtain all or part of the dataset.

VI. Rights Statement

This dataset was derived from content in the Library’s web archives. The Library follows a notification and permission process in the acquisition of content for the web archives, and to allow researcher access to the archived content, as described on the web archiving program page, https://www.loc.gov/programs/web-archiving/about-this-program/. See the full Rights & Access statement for the collection, which applies to all of the content in this dataset: https://www.loc.gov/collections/united-states-elections-web-archive/about-this-collection/rights-and-access/

VII. Creator and Contributor Information

Creator: Grace Thomas

VIII. Contact Information

Please direct all questions and comments to webcapture@loc.gov.
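
Example: reading a CDX file

The field layout described in Section IV can be sketched in Python as follows. This is a minimal illustration, not an official tool. The dictionary keys, the function names, and the local file path are assumptions made for this sketch. It also assumes that the header line of each file begins with "CDX" (after optional whitespace), that blank fields are still present as placeholders so every line splits into 11 parts, and that GMT timestamps can be treated as UTC.

```python
import gzip
from datetime import datetime, timezone
from typing import Dict, Iterator

# Field order as listed in Section IV. The key names are labels chosen
# for this sketch; they are not defined by the dataset itself.
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "file_size", "offset",
    "warc_filename",
]


def parse_cdx_line(line: str) -> Dict[str, str]:
    """Split one CDX line on single spaces and label the 11 fields."""
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(
            f"expected {len(CDX_FIELDS)} fields, found {len(parts)}"
        )
    return dict(zip(CDX_FIELDS, parts))


def capture_time(record: Dict[str, str]) -> datetime:
    """Convert the YYYYMMDDhhmmss timestamp to a timezone-aware datetime."""
    parsed = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
    # Assumption: the GMT capture time is treated as UTC.
    return parsed.replace(tzinfo=timezone.utc)


def iter_cdx_records(path: str) -> Iterator[Dict[str, str]]:
    """Yield one labeled record per data line of a local .cdx.gz file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            # Skip empty lines and the header line that lists the fields.
            # Assumption: the header starts with "CDX".
            if not stripped or stripped.startswith("CDX"):
                continue
            yield parse_cdx_line(line)
```

For a file already downloaded from the manifest, a call such as `for record in iter_cdx_records("example.cdx.gz")` (a hypothetical file name) would yield one dictionary per captured web object, and `capture_time(record)` would give the capture time as a datetime. Splitting on a single space rather than on arbitrary whitespace keeps blank placeholder fields in position.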