3,500 targets

Available as JSON

Available as Text File


The Archive of Tomorrow is a multi-institutional project, funded by the Wellcome Trust and led by the National Library of Scotland, that explores and preserves online health resources by capturing official and unofficial medical content from across the web. The archived websites form the ‘Talking about Health’ collection within the UK Web Archive, giving researchers and members of the public access to a broad and diverse set of online health resources.

Web archives are a relatively new area of Library collecting and present significant access challenges: the Non-Print Legal Deposit regulations under which these sites are collected often require researchers to visit the Library in person to view the captures. It is hoped that the collection metadata, outlining the collection’s scope and describing the websites it contains, will provide a useful springboard for research on a variety of subjects, from dedicated support communities addressing specific health topics online to life in the UK during the pandemic.

The first dataset is a snapshot of the Talking about Health collection, accessed on 24 April 2023, containing descriptive and technical metadata for around 3,500 targets. The data is available in JSON format, which expresses the collection and sub-collection structure as well as individual target information; an additional file presents the targets in spreadsheet form alongside information on collection URIs. Draft notebooks by Web Archivist Leontien Talboom proposing ways of working with the data are available on GitHub, and a write-up of Topic Modelling research performed by Andrea Kocsis is available on Notion. A second dataset provides a sample of Heritrix crawl metadata from early 2022, documenting the process of capturing UK Web Archive targets.
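As a starting point for exploring the JSON file, the following Python sketch tallies every key that appears anywhere in the nested structure. It is a minimal, schema-agnostic illustration only: it assumes nothing about the field names used in the dataset (the readme file is the place to check those), and the function names and the file path in the usage comment are hypothetical.

```python
import json
from collections import Counter


def count_keys(node, counts=None):
    """Recursively tally every dictionary key found in a parsed JSON value."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        for key, value in node.items():
            counts[key] += 1
            count_keys(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_keys(item, counts)
    # Strings, numbers, booleans and nulls carry no keys, so they are skipped.
    return counts


def summarise_file(path, top=20):
    """Load a JSON file and return its most frequent keys with their counts."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return count_keys(data).most_common(top)


# Example usage (the path below is a placeholder, not the real file name):
# for key, n in summarise_file("talking_about_health.json"):
#     print(f"{key}: {n}")
```

Keys that occur thousands of times are likely to belong to individual target records, while keys that occur only a handful of times are more likely to describe the collection and sub-collection levels; this is a heuristic to guide reading of the readme rather than a description of the actual schema.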


Rights information

This data collection is licensed under a CC-BY 4.0 license.


Download the data

Talking about Health

File contents: 1 readme .txt file (plain text); 1 structured JSON file with sub-collection and target information; 1 .xlsx file (spreadsheet) of the target metadata; 1 .rtf (rich text format) file containing the datasheet for the Archive of Tomorrow project.

File size: 6.3 MB compressed (77.9 MB uncompressed)

Sample crawl data

File contents: 1 readme .txt file (plain text); 1 sample crawl data .txt file (plain text)

File size: 286 MB compressed (1.95 GB uncompressed)

Cite the data

DOI: https://doi.org/10.34812/h88t-dr97

Dataset creator: National Library of Scotland

Dataset publisher: National Library of Scotland

Publication year: 2023

Suggested citation: National Library of Scotland. Talking about Health dataset. National Library of Scotland, 2023. https://doi.org/10.34812/h88t-dr97