Created August-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern
This dataset is the first version of the bibliographic records for the National Bibliography of Scotland (NBS). This version references materials published in Scotland, materials in the Scots language, and materials in the Scottish Gaelic language, drawn from the National Library of Scotland's main catalogue. It is the first iteration of the new National Bibliography of Scotland, which was originally produced in April 2019; the NBS is an ongoing programme of work.
Import libraries to use for cleaning, summarising and exploring the data:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context
# Libraries for data loading
import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np
import string
import re
from collections import defaultdict
# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt
Due to the large size of The National Bibliography of Scotland (NBS) data files, they aren't uploaded to the collections-as-data GitHub repo. To load the NBS MARCXML data into this Notebook, please download the data from the NLS Data Foundry website. Edit the file path below as necessary so that you can run this Notebook on your own computer.
The NBS data is actually metadata, meaning descriptive data about data. In this case, the metadata contains information about books that have been published in Scotland, published in the Scots language, or published in the Scottish Gaelic language. The metadata is provided as MARCXML. MARC is a metadata standard used in libraries; XML is a file format that is much more widely used than MARC, so MARC records are often provided as MARCXML so that systems other than library databases can read the data.
If you've never seen XML data before, check out this sample XML file of MARC metadata from the Library of Congress. To learn more, I'd recommend starting with W3 Schools' tutorial on XML, which explains its purpose, structure (a tree), tag naming conventions, and much more.
To load the NBS MARCXML file, we'll use the Python library ElementTree, which we abbreviate ET. ElementTree loads XML data (or metadata, in our case) in a hierarchical structure, or tree. To iterate through the metadata, we need to find the root, or top-most level, of the tree. From there we can travel up and down to pull out metadata of interest.
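If it helps to see this on a small scale first, here's a minimal sketch (using a made-up two-book XML snippet, not the NBS data) of how ElementTree turns XML text into a navigable tree:
# A minimal sketch: parse a made-up XML snippet (not the NBS data) from a string
sample = """<library>
  <book year="1954"><title>The Fellowship of the Ring</title></book>
  <book year="1937"><title>The Hobbit</title></book>
</library>"""
sample_root = ET.fromstring(sample)           # fromstring() parses text; parse() reads a file
print(sample_root.tag)                        # 'library' -- the root, or top-most level
for book in sample_root:                      # iterating over an element visits its children
    print(book.attrib['year'], book[0].text)  # an attribute of <book>, the text of <title>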
# Edit the file path in parentheses below to point to where the NBS MARCXML file you
# downloaded is saved on your computer (here it's assumed to be in a folder named data)
tree = ET.parse('data/National-Bibliography-of-Scotland-v1-dataset-MARC.xml')
root = tree.getroot()
Let's see how the metadata loaded:
print("Root tag:", str(root.tag))
# print("Root text:", str(root.text)) # empty
print("Root's first child tag:", str(root[0].tag))
print("Root's 4th grandchild tag:")
print(" ",root[0][4].tag)
print(" Datafield attribute:",root[0][4].attrib)
print(" Great grandchild tag:",root[0][4][0].tag)
print(" Great grandchild attribute:",root[0][4][0].attrib)
print(" Great grandchild text:",root[0][4][0].text)
Knowing that a MARC field is a combination of the tag and code, we can see that the MARC field we've printed is 020$a, which is for an International Standard Book Number (ISBN).
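As a quick aside, we could count how many records carry at least one ISBN. This is an illustrative sketch; the MARCXML namespace string is the same one used in the extraction loop later in this notebook:
# An illustrative sketch: count records with at least one ISBN (MARC field 020)
ns = '{http://www.loc.gov/MARC21/slim}'       # the MARCXML namespace prefix
isbn_count = 0
for record in root:
    for datafield in record.findall(ns + 'datafield'):
        if datafield.attrib['tag'] == '020':
            isbn_count += 1
            break                             # count each record once, even with several ISBNs
print("Records with at least one ISBN:", isbn_count)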
MARC was first developed by the Library of Congress and has since been adopted by libraries around the world. The NBS uses MARC Bibliographic, about which more can be read at this website. MARC contains hundreds of metadata fields that libraries can choose to use, with a small number of required fields and many optional fields. In MARC metadata, tags are indicated with 3-digit numbers, indicators are single-digit numbers, and subfields are indicated with lowercase letters. A space separates the tag from its indicators (a pound sign, #, stands in for a blank indicator), and a dollar sign ($) introduces each subfield. For example, the personal name (or primary author) metadata entry has:
- the tag 100
- indicator 0 for forename and 1 for surname
- subfield $a for personal name, $b for numeration, $c for titles and other words associated with a name, $q for a fuller form of the author's name, and $d for dates associated with a name

A metadata entry for an author could look like:
100 1# $a Gregory, Ruth W.
$q (Ruth Wilhelme),
$d 1910-
Some MARC fields are repeatable, such as the International Standard Book Number (ISBN), while others are not, such as Main entry -- Personal name (author). More detail and examples for commonly used fields are available here.
MARCXML uses the same tags, indicators, and subfields, written as attributes inside XML tags < >. For example, the MARC metadata entry above would be the following in MARCXML:
<datafield tag="100" ind1="1" ind2="">
<subfield code="a">Gregory, Ruth W.</subfield>
<subfield code="q">(Ruth Wilhelme)</subfield>
<subfield code="d">1910-</subfield>
</datafield>
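To connect this format back to our code, here's a small sketch that parses the snippet above as a string with ElementTree and pulls out its attributes and subfields:
# A small sketch: read the MARCXML entry above with ElementTree
entry = ET.fromstring('''<datafield tag="100" ind1="1" ind2=" ">
  <subfield code="a">Gregory, Ruth W.</subfield>
  <subfield code="q">(Ruth Wilhelme)</subfield>
  <subfield code="d">1910-</subfield>
</datafield>''')
print(entry.attrib)                # the tag and indicators, as a dictionary of attributes
for subfield in entry:
    print(subfield.attrib['code'], subfield.text)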
First, let's get a sense of how much metadata we have in the National Bibliography of Scotland (NBS) dataset:
records = list(root)   # Element.getchildren() was removed in Python 3.9, so we use list() instead
total_records = len(records)
print("Total records so far:",total_records)
Note that I've printed Total records so far rather than simply Total records. This is because the NBS is a work in progress and we're using the first of what will be many future versions of the NBS.
Let's see what metadata has been documented for the first record (or first child of the root):
for child in records[0]:
    if 'datafield' in child.tag:
        print('Tag:', child.attrib['tag'])
        ind1 = re.match(r"\d", child.attrib['ind1'])   # raw strings avoid invalid-escape warnings
        ind2 = re.match(r"\d", child.attrib['ind2'])
        if ind1 is not None:
            print('  Indicator 1:', ind1[0])
        if ind2 is not None:
            print('  Indicator 2:', ind2[0])
        for grandchild in list(child):                 # getchildren() was removed in Python 3.9
            print('   ', grandchild.attrib['code'], grandchild.text)
Try replacing the number 0 with any number less than 368,961 to see how different records use different combinations of metadata fields! (Remember that a list's maximum index is always 1 less than the length of the list, because indices begin at 0, not 1.)
Let's select a subset of MARC fields that we want to extract from the NBS MARCXML file we've loaded and put those selections in a dataframe. Dataframes are essentially tables. The Python library Pandas, which we abbreviated pd at the start of this notebook, allows us to create dataframes and then run queries over their rows and columns to efficiently analyse the data.
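If dataframes are new to you, here's a toy example (with made-up data, not the NBS metadata) of building one from a dictionary and querying its rows:
# A toy dataframe with made-up data (not the NBS metadata)
toy = pd.DataFrame({'title': ['Book A', 'Book B', 'Book C'],
                    'year': [1999, 2000, 2000]})
print(toy[toy['year'] == 2000])    # a query: keep only rows where year equals 2000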
Step 1: First, we'll create a dictionary (a data type in Python) that matches each MARC tag with the name of the type of metadata the field contains (note that we're only defining a subset of all the available metadata fields):
marc_tags = ['100', '130', '245', '260', '650', '700', '710',]
marc_names = ['Author', 'Uniform title', 'Title statement',
'Publication, distribution, etc.', 'Subject added entry -- Topical term',
'Added entry -- Personal name', 'Added entry -- Corporate name']
marc_dict = dict(zip(marc_tags,marc_names))
for key, value in marc_dict.items():
    print("MARC Tag:", key, "| Tag Name:", value)
Step 2: We'll also create dictionaries for select subfields, one for each tag in the dictionary from step 1, and put them into a dictionary where each subfield dictionary (value) is associated with its corresponding tag (key):
author = {'a' : 'Personal name'}
unif_t = {'a' : 'Uniform title', 'l' : 'Language of work', 'f' : 'Date of work'}
title_stat = {'a' : 'Title proper'}
pub_dist = {'a' : 'Place', 'b' : 'Name', 'c' : 'Date'}
topic = {'a' : 'Topical term'}
pers_name = {'a' : 'Personal name', 'q' : 'Fuller name'}
corp_name = {'a' : 'Corporate or jurisdiction name'}
marc_subfields = [author, unif_t, title_stat, pub_dist, topic, pers_name, corp_name]
marc_tag_subfields = dict(zip(marc_tags, marc_subfields))
for key, value in marc_tag_subfields.items():
    print("MARC Tag:", key, "| Subfields:", value)
Step 3: Now, using the dictionaries from steps 1 and 2 as reference, let's extract the metadata field values from the NBS MARCXML metadata for select tag and subfield combinations:
# To avoid rewriting similar code lines, we'll write a function in which we
# input a child element of the XML tree's root, a MARC tag, and a subfield of that tag,
# and receive as output the text of the MARC field if it's found (and False if it's not found)
def getSubfieldText(elem, marcTag, subfield):
    if elem.attrib['tag'] == marcTag:
        for subelem in list(elem):              # the <subfield> elements of this datafield
            if subelem.attrib['code'] == subfield:
                return subelem.text
    return False                                # not found (wrong tag or missing subfield)
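Before running this function over every record, we can sanity-check it on the first record. This quick test is just illustrative; it isn't part of the extraction itself:
# A quick, illustrative test of getSubfieldText() on the first record
for datafield in records[0].findall('{http://www.loc.gov/MARC21/slim}datafield'):
    title = getSubfieldText(datafield, "245", "a")
    if title:
        print("Title of first record:", title)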
all_authors, all_titles, all_langs, all_pub_dates, all_pub_places, all_topics = [], [], [], [], [], []
for record in records:
    has_author = False
    has_title = False
    has_lang = False
    has_pub_date = False
    has_pub_place = False
    has_topic = False
    authors, titles, langs, pub_dates, pub_places, topics = [], [], [], [], [], []
    for child in record.findall('{http://www.loc.gov/MARC21/slim}datafield'):
        # Get author name in field 100$a
        author = getSubfieldText(child, "100", "a")
        if author:
            has_author = True
            authors += [author]
        # Get title in either field 130$a or 245$a
        title1 = getSubfieldText(child, "130", "a")
        title2 = getSubfieldText(child, "245", "a")
        if title2:
            has_title = True
            titles += [title2]
        elif title1:
            has_title = True
            titles += [title1]
        # Get language in field 130$l
        lang = getSubfieldText(child, "130", "l")
        if lang:
            has_lang = True
            langs += [lang]
        # Get publication date in either field 130$f or 260$c
        pub_date1 = getSubfieldText(child, "130", "f")
        pub_date2 = getSubfieldText(child, "260", "c")
        if pub_date2:
            has_pub_date = True
            pub_dates += [pub_date2]
        elif pub_date1:
            has_pub_date = True
            pub_dates += [pub_date1]
        # Get publication place in field 260$a
        pub_place = getSubfieldText(child, "260", "a")
        if pub_place:
            has_pub_place = True
            pub_places += [pub_place]
        # Get topical terms in field 650$a
        topic = getSubfieldText(child, "650", "a")
        if topic:
            has_topic = True
            topics += [topic]
    # After iterating through all datafield elements of the record
    # (the elements where MARC fields may be found), if a MARC field
    # we searched for isn't found, then add the placeholder string
    # "None" for that record's MARC field text
    if not has_author:
        authors += ["None"]
    if not has_title:
        titles += ["None"]
    if not has_lang:
        langs += ["None"]
    if not has_pub_date:
        pub_dates += ["None"]
    if not has_pub_place:
        pub_places += ["None"]
    if not has_topic:
        topics += ["None"]
    all_authors.append(authors)
    all_titles.append(titles)
    all_langs.append(langs)
    all_pub_dates.append(pub_dates)
    all_pub_places.append(pub_places)
    all_topics.append(topics)
# There should be one sublist inside each all_* list for every record (meaning the lists are all the same length)
# The assertions below will throw an error if this is not the case, indicating that our extraction loop didn't work as expected
assert len(all_topics) == len(all_titles)
assert len(all_titles) == len(all_authors)
assert len(all_authors) == len(all_langs)
assert len(all_langs) == len(all_pub_places)
assert len(all_pub_places) == len(all_pub_dates)
assert len(all_pub_dates) == len(records)
all_topics[0:10]
Step 4: Now we'll turn the lists we created of all topics, titles, authors, languages, publication places, and publication dates into a dataframe, or table, using Pandas:
# First create a dictionary for each column that will be in the dataframe
cols = {'author' : all_authors, 'title' : all_titles, 'topic' : all_topics,
        'language' : all_langs, 'publication_place' : all_pub_places,
        'publication_date' : all_pub_dates}
df = pd.DataFrame(cols)
df.head() # this prints the first 5 rows of a dataframe
df.tail() # df.tail prints the last 5 rows of a dataframe
df.to_csv("NBSv1_subset_messy.csv", index=False, encoding="utf-8")
For those of you running this Notebook in Binder (which doesn't allow enough memory to load the full MARCXML dataset), you can explore this section interactively by loading a subset of the dataset from a CSV file (see the previous section for the code that created this CSV file):
import pandas as pd
df = pd.read_csv("NBSv1_subset_messy.csv")
all_authors = list(df.author)
all_titles = list(df.title)
all_topics = list(df.topic)
all_langs = list(df.language)
all_pub_places = list(df.publication_place)
all_pub_dates = list(df.publication_date)
df.tail()
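One caveat with this shortcut: when list-valued cells are written to a CSV file they come back as strings (for example "['None']" rather than ['None']). If you want actual Python lists again, a sketch like the following, using the standard library's ast module, converts them back (the full-dataset path above doesn't need this step):
# Caveat: list-valued cells come back from the CSV as strings, e.g. "['None']"
# ast.literal_eval() safely turns those strings back into Python lists
import ast
all_authors = [ast.literal_eval(a) for a in df.author]
all_titles = [ast.literal_eval(t) for t in df.title]
all_topics = [ast.literal_eval(t) for t in df.topic]
all_langs = [ast.literal_eval(l) for l in df.language]
all_pub_places = [ast.literal_eval(p) for p in df.publication_place]
all_pub_dates = [ast.literal_eval(d) for d in df.publication_date]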
Although some of the cells in our dataframe df have more than one value, many of them have only one. Let's turn cells with only one value into strings or, for the publication_date column, into numbers. That way the data will be easier to clean, extract, and analyse.
Step 1: First, we'll go through the lists we used to fill each column in the dataframe and flatten them, only keeping sublists if a record has more than one value in the sublist:
def selectivelyFlatten(df_col_list):
    flat = []
    for c in df_col_list:
        if len(c) > 1:
            flat.append(c)      # keep the sublist when a record has multiple values
        else:
            flat.append(c[0])   # otherwise keep the single value as a string
    return flat
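For example (a made-up illustration), a record with a single author becomes a string, while a record with two authors keeps its sublist:
# A made-up illustration of selectivelyFlatten()'s behaviour
print(selectivelyFlatten([['None'], ['Gregory, Ruth W.'], ['Author A', 'Author B']]))
# Expected output: ['None', 'Gregory, Ruth W.', ['Author A', 'Author B']]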
all_authors_flat = selectivelyFlatten(all_authors)
all_titles_flat = selectivelyFlatten(all_titles)
all_topics_flat = selectivelyFlatten(all_topics)
all_langs_flat = selectivelyFlatten(all_langs)
all_pub_places_flat = selectivelyFlatten(all_pub_places)
all_pub_dates_flat = selectivelyFlatten(all_pub_dates)
# There should still be one value in the list for each record, whether it's a string or a sublist
assert(len(all_authors) == len(all_authors_flat))
Step 2: Next, we'll replace the df variable with a new, selectively flattened dataframe:
cols_flat = {'author' : all_authors_flat, 'title' : all_titles_flat, 'topic' : all_topics_flat,
             'language' : all_langs_flat, 'publication_place' : all_pub_places_flat,
             'publication_date' : all_pub_dates_flat}
df = pd.DataFrame(cols_flat)
df.head()
Much better! Note that in metadata, dates often include punctuation indicating uncertainty, such as a question mark (?). For simplicity, we'll extract only the digits from dates for the dataframe, so that the dates can be treated as numbers.
years = []
pub_dates = list(df.publication_date)
for date in pub_dates:
    date = str(date)
    yr = re.search(r"\d{4}", date)   # find the first 4-digit year in the date string
    if yr:
        years += [int(yr[0])]
    else:
        years += [0]                 # 0 marks records with no 4-digit year
df.publication_date = years
df.head()
We can write the DataFrame to a CSV file so that it can be opened in Microsoft Excel or loaded into a new Jupyter Notebook!
# The parameters (in parentheses) are the filename, whether you want to include the index (row numbers), and the file encoding
df.to_csv("NBSv1_subset.csv", index=False, encoding="utf-8")
For those of you running this Notebook in Binder (which doesn't allow enough memory to load the full MARCXML dataset), you can explore this section interactively by loading a subset of the dataset from a CSV file (see the previous section for the code that created this CSV file):
import pandas as pd
df = pd.read_csv("NBSv1_subset.csv")
df.head()
Let's look more closely at the publication dates of books as recorded in the NBS metadata:
published = list(df.publication_date)
unique_dates = list(set(published))
unique_dates.sort()
print(unique_dates)
We assigned 0 to books without a 4-digit year listed in their publication date field, so we can ignore that date. We can also assume that dates after the current year (2020) are invalid. Let's filter our list to include only the dates likely to be correct:
filtered_dates = []
for d in unique_dates:
    if d != 0 and d <= 2020:
        filtered_dates += [d]
print("Earliest publication date:", min(filtered_dates))
print("Latest publication date:", max(filtered_dates))
I wonder what book was published in 1004...
df[df['publication_date'] == 1004]
From a quick Google, it looks like this book was actually published in 1657, so this date must have been a mistake.
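If we wanted to repair this entry, we could overwrite the value directly in the dataframe. A sketch, with the caveat that 1657 comes from that quick search rather than an authoritative source:
# A sketch: replace the erroneous date 1004 with 1657 (from a quick, unverified search)
df.loc[df['publication_date'] == 1004, 'publication_date'] = 1657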
I wonder, what's the most common year a book was published out of those in the NBS?
df['publication_date'].value_counts()
The year 2000! 7,267 books from the NBS were published that year.
print("Total books in NBS so far:", total_records)
print("Percentage of books in NBS published in 2000:", str((7267/total_records)*100)+"%")