# Exploring the National Bibliography of Scotland¶

Created August-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern

### About The National Bibliography of Scotland (version 1) Dataset¶

This dataset is the first version of the bibliographic records for the National Bibliography of Scotland (NBS). It references materials published in Scotland, written in Scots, or written in Scottish Gaelic, drawn from the National Library of Scotland's main catalogue. This first iteration of the new National Bibliography of Scotland was originally produced in April 2019; the NBS is an ongoing programme of work.

Before you begin: If you are interacting with this Notebook in Binder, please note that there is a memory limit that will prevent the first section, 0. Preparation, from running, due to the large file size of this Notebook's data source. You can still work with the NBS metadata in the remaining sections, 1. Data Cleaning and Standardisation and 2. Summary Statistics, using CSV files with a subset of the metadata extracted in the first section!

### 0. Preparation¶

Import libraries to use for cleaning, summarising and exploring the data:

In [2]:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np
import string
import re
from collections import defaultdict

# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt


Due to the large size of The National Bibliography of Scotland (NBS) data files, they aren't uploaded to the collections-as-data GitHub repo. To load the NBS MARCXML data into this Notebook, please download the data from the NLS Data Foundry website. Edit the file path below as necessary so that you can run this Notebook on your own computer.

The NBS data is actually metadata, meaning descriptive data about data. In this case, the metadata contains information about books that have been published in Scotland, written in Scots, or written in Scottish Gaelic. The metadata is provided as MARCXML. MARC is a metadata standard used in libraries. XML is a file format that is more widely used than MARC, so MARC metadata is often provided as MARCXML so that systems other than library databases can read the data.

If you've never seen XML data before, check out this sample XML file of MARC metadata from the Library of Congress. To learn more, I'd recommend starting with W3 Schools' tutorial on XML, which explains its purpose, structure (a tree), tag naming conventions, and much more.

To load the NBS MARC XML file, we'll use the Python library ElementTree. ElementTree, which we abbreviate ET, loads XML data (or metadata, in our case) in a hierarchical structure, or tree. To iterate through the metadata, we need to find the root, or top-most level, of the tree. From there we can travel up and down to pull out metadata of interest.
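If you'd like to experiment with ElementTree before downloading the full dataset, here is a minimal sketch using a small made-up XML string (not real NBS metadata) that shows how to get a root element and step down through its children:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up XML tree for practice (not real NBS metadata)
sample = """
<collection>
  <record><title>First record</title></record>
  <record><title>Second record</title></record>
</collection>
"""

# fromstring parses XML from a string and returns the root element directly
practice_root = ET.fromstring(sample)
print(practice_root.tag)          # collection
print(len(practice_root))         # 2 (the root has two record children)
print(practice_root[0][0].text)   # First record
```

The same indexing (`root[0]`, `root[0][4]`, and so on) is what we'll use below to navigate the real NBS tree.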

In [3]:
# Edit the file path below to correspond to where the NBS MARC XML file is saved on your computer
tree = ET.parse('data/National-Bibliography-of-Scotland-v1-dataset-MARC.xml')
root = tree.getroot()


In [4]:
print("Root tag:", str(root.tag))
# print("Root text:", str(root.text))               # empty
print("Root's first child tag:", str(root[0].tag))
print("Root's 4th grandchild tag:")
print("  ",root[0][4].tag)
print("     Datafield attribute:",root[0][4].attrib)
print("        Great grandchild tag:",root[0][4][0].tag)
print("        Great grandchild attribute:",root[0][4][0].attrib)
print("        Great grandchild text:",root[0][4][0].text)

Root tag: {http://www.loc.gov/MARC21/slim}collection
Root's first child tag: {http://www.loc.gov/MARC21/slim}record
Root's 4th grandchild tag:
{http://www.loc.gov/MARC21/slim}datafield
Datafield attribute: {'tag': '020', 'ind1': ' ', 'ind2': ' '}
Great grandchild tag: {http://www.loc.gov/MARC21/slim}subfield
Great grandchild attribute: {'code': 'a'}
Great grandchild text: 1850980284


Knowing that a MARC field is a combination of a tag and a subfield code, we can see that the field we've printed is 020$a, which holds an International Standard Book Number (ISBN). MARC was first developed by the Library of Congress and has since been adopted by libraries around the world. The NBS uses MARC Bibliographic, about which more can be read at this website. MARC contains hundreds of metadata fields that libraries can choose to use, with a small number of required fields and many optional ones. In MARC documentation, tags are written as 3-digit numbers, indicators as single-digit numbers (a pound sign, #, stands in for a blank indicator), and subfields as lowercase letters prefixed by a dollar sign ($). For example, the personal name (or primary author) metadata entry has:

• tag: 100
• indicators: 0 for forename and 1 for surname
• subfields: $a for personal name, $b for numeration, $c for titles and other words associated with a name, $q for a fuller form of the author's name, $d for dates associated with a name

A metadata entry for an author could look like: 100 1#$a Gregory, Ruth W.,$q (Ruth Wilhelme),$d 1910-

Some MARC fields are repeatable, such as the International Standard Book Number (ISBN), while others are not, such as Main entry -- Personal name (author). More detail and examples for commonly used fields are available here.
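To make the notation concrete, here is a short sketch (not part of the NBS workflow, and using a made-up entry string without spaces after the subfield codes) that splits a textual MARC entry into its tag, indicators, and subfields with a regular expression:

```python
import re

# A made-up MARC entry in the textual notation described above
entry = "100 1#$aGregory, Ruth W.,$q(Ruth Wilhelme),$d1910-"

tag = entry[:3]           # the 3-digit tag
indicators = entry[4:6]   # the two indicator characters (# = blank)
# Each subfield is a $, a one-character code, then text up to the next $
subfields = re.findall(r"\$(\w)([^$]*)", entry)

print(tag)         # 100
print(indicators)  # 1#
print(subfields)   # [('a', 'Gregory, Ruth W.,'), ('q', '(Ruth Wilhelme),'), ('d', '1910-')]
```

This is only for illustration; in the rest of the notebook we'll read the structured MARCXML form instead, which saves us from parsing this notation by hand.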

MARCXML uses the same tags, indicators, and subfields, written as attributes inside XML tags < >. For example, the MARC metadata entry above would be the following in MARCXML:

<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Gregory, Ruth W.</subfield>
<subfield code="q">(Ruth Wilhelme)</subfield>
<subfield code="d">1910-</subfield>
</datafield>
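Since this is exactly the kind of element we'll be reading with ElementTree, here is a sketch that parses the snippet above from a string and pulls out each subfield (note that the standalone snippet omits the `{http://www.loc.gov/MARC21/slim}` namespace prefix that the full NBS file attaches to every tag):

```python
import xml.etree.ElementTree as ET

snippet = """
<datafield tag="100" ind1="1" ind2=" ">
  <subfield code="a">Gregory, Ruth W.</subfield>
  <subfield code="q">(Ruth Wilhelme)</subfield>
  <subfield code="d">1910-</subfield>
</datafield>
"""

field = ET.fromstring(snippet)
print(field.attrib["tag"])    # 100
for sub in field:
    # Each subfield's code is an attribute; its value is the element text
    print(sub.attrib["code"], sub.text)
```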

#### 0.1 Dataset Size¶

First, let's get a sense of how much metadata we have in the National Bibliography of Scotland (NBS) dataset:

In [5]:
records = list(root)   # collect the root's child elements (one per record)
total_records = len(records)
print("Total records so far:",total_records)

Total records so far: 368961


Note that I've printed Total records so far rather than simply Total records. This is because the NBS is a work in progress and we're using the first of what will be many versions of the NBS.

Let's see what metadata has been documented for the first record (or first child of the root):

In [6]:
for child in records[0]:
    if 'datafield' in child.tag:
        print('Tag:', child.attrib['tag'])
        ind1 = re.match(r"\d", child.attrib['ind1'])
        ind2 = re.match(r"\d", child.attrib['ind2'])
        if ind1 != None:
            print('  Indicator 1:', ind1[0])
        if ind2 != None:
            print('  Indicator 2:', ind2[0])

        grandchildren = list(child)
        for grandchild in grandchildren:
            print(' ', grandchild.attrib['code'], grandchild.text)

Tag: 020
a 1850980284
Tag: 020
a 9781850980285
Tag: 035
a (StEdNL)2614-nlsdb-Voyager
Tag: 100
Indicator 1: 1
a Dryden, Derek.
Tag: 245
Indicator 1: 1
Indicator 2: 0
a Salmon fishing on the South Esk Estuary :
b Teachers' Booklet /
c devised by Derek Dryden ; series editor Norman Nichol ; cover illustration by Archie Williams.
Tag: 260
a Hamilton :
b Hamilton College of Education,
c 1978.
Tag: 300
a 14 p. :
b ill., maps ;
c 30 cm.
Tag: 490
Indicator 1: 1
a Learning resources ;
v B78
Tag: 700
Indicator 1: 1
a Nichol, Norman.
Tag: 700
Indicator 1: 1
a Williams, Archie.
Tag: 710
Indicator 1: 2
a Hamilton College of Education.
Tag: 830
Indicator 2: 0
a Learning resources ;
v B78.
Tag: 919
a NBS


Try replacing the number 0 with any number less than 368,961 to see how different records use different combinations of metadata fields! (Remember that a list's maximum index is always 1 less than the length of the list, because the indices begin at 0, not 1.)

#### 0.2 Identifying Subsets of the Data¶

Let's select a subset of MARC fields to extract from the NBS MARCXML file we've loaded and put those selections in a dataframe. Dataframes are essentially tables. The Python library Pandas, which we abbreviated as pd at the start of this notebook, allows us to create dataframes and then run queries over their rows and columns to efficiently analyse the data.

Step 1: First, we'll create a dictionary (a data type in Python) that matches up each MARC field with the name of the type of metadata the field contains (note that we're only defining a subset of all the available metadata field information):

In [7]:
marc_tags = ['100', '130', '245', '260', '650', '700', '710']
marc_names = ['Author', 'Uniform title', 'Title statement',
              'Publication, distribution, etc.', 'Subject added entry -- Topical term',
              'Added entry -- Personal name', 'Added entry -- Corporate name']

marc_dict = dict(zip(marc_tags, marc_names))
for key, value in marc_dict.items():
    print("MARC Tag:", key, "| Tag Name:", value)

MARC Tag: 100 | Tag Name: Author
MARC Tag: 130 | Tag Name: Uniform title
MARC Tag: 245 | Tag Name: Title statement
MARC Tag: 260 | Tag Name: Publication, distribution, etc.
MARC Tag: 650 | Tag Name: Subject added entry -- Topical term
MARC Tag: 700 | Tag Name: Added entry -- Personal name
MARC Tag: 710 | Tag Name: Added entry -- Corporate name


Step 2: We'll also create dictionaries for select subfields, one for each tag in the dictionary from step 1, and put them into a dictionary where each subfield dictionary (value) is associated with its corresponding tag (key):

In [8]:
author = {'a' : 'Personal name'}
unif_t = {'a' : 'Uniform title', 'l' : 'Language of work', 'f' : 'Date of work'}
title_stat = {'a' : 'Title proper'}
pub_dist = {'a' : 'Place', 'b' : 'Name', 'c' : 'Date'}
topic = {'a' : 'Topical term'}
pers_name = {'a' : 'Personal name', 'q' : 'Fuller name'}
corp_name = {'a' : 'Corporate or jurisdiction name'}

marc_subfields = [author, unif_t, title_stat, pub_dist, topic, pers_name, corp_name]
marc_tag_subfields = dict(zip(marc_tags, marc_subfields))
for key, value in marc_tag_subfields.items():
    print("MARC Tag:", key, "| Subfields:", value)

MARC Tag: 100 | Subfields: {'a': 'Personal name'}
MARC Tag: 130 | Subfields: {'a': 'Uniform title', 'l': 'Language of work', 'f': 'Date of work'}
MARC Tag: 245 | Subfields: {'a': 'Title proper'}
MARC Tag: 260 | Subfields: {'a': 'Place', 'b': 'Name', 'c': 'Date'}
MARC Tag: 650 | Subfields: {'a': 'Topical term'}
MARC Tag: 700 | Subfields: {'a': 'Personal name', 'q': 'Fuller name'}
MARC Tag: 710 | Subfields: {'a': 'Corporate or jurisdiction name'}


Step 3: Now, using the dictionaries from steps 1 and 2 as reference, let's extract the metadata field values from the NBS MARCXML metadata for select tag and subfield combinations:

In [9]:
# To avoid rewriting similar code lines, we'll write a function in which we
# input a child element of the XML tree's root, a MARC tag, and a subfield of that tag,
# and receive as output the text of the MARC field if it's found (and False if it's not found)
def getSubfieldText(elem, marcTag, subfield):
    if elem.attrib['tag'] == marcTag:
        for subelem in elem:
            if subelem.attrib['code'] == subfield:
                return subelem.text
    return False
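A quick way to check the idea behind this function without loading the full file is to build a single datafield element by hand and query it. The sketch below repeats the function definition (with a lowercase name, so it doesn't collide with the one above) purely so the cell runs on its own:

```python
import xml.etree.ElementTree as ET

# Standalone copy of the lookup logic, repeated here so this cell is self-contained
def get_subfield_text(elem, marc_tag, subfield):
    if elem.attrib['tag'] == marc_tag:
        for subelem in elem:
            if subelem.attrib['code'] == subfield:
                return subelem.text
    return False

# A made-up 260 (publication) datafield for testing
demo = ET.fromstring(
    '<datafield tag="260" ind1=" " ind2=" ">'
    '<subfield code="a">Edinburgh :</subfield>'
    '<subfield code="c">1978.</subfield>'
    '</datafield>'
)
print(get_subfield_text(demo, "260", "c"))  # 1978.
print(get_subfield_text(demo, "100", "a"))  # False (this element isn't a 100 field)
```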

In [10]:
all_authors, all_titles, all_langs, all_pub_dates, all_pub_places, all_topics = [], [], [], [], [], []
for record in records:
    has_author = False
    has_title = False
    has_lang = False
    has_pub_date = False
    has_pub_place = False
    has_topic = False

    authors, titles, langs, pub_dates, pub_places, topics = [], [], [], [], [], []

    for child in record.findall('{http://www.loc.gov/MARC21/slim}datafield'):
        # Get author name in field 100$a
        author = getSubfieldText(child, "100", "a")
        if author:
            has_author = True
            authors += [author]

        # Get title in either field 130$a or 245$a
        title1 = getSubfieldText(child, "130", "a")
        title2 = getSubfieldText(child, "245", "a")
        if title2:
            has_title = True
            titles += [title2]
        elif title1:
            has_title = True
            titles += [title1]

        # Get language
        lang = getSubfieldText(child, "130", "l")
        if lang:
            has_lang = True
            langs += [lang]

        # Get publication date
        pub_date1 = getSubfieldText(child, "130", "f")
        pub_date2 = getSubfieldText(child, "260", "c")
        if pub_date2:
            has_pub_date = True
            pub_dates += [pub_date2]
        elif pub_date1:
            has_pub_date = True
            pub_dates += [pub_date1]

        # Get publication place
        pub_place = getSubfieldText(child, "260", "a")
        if pub_place:
            has_pub_place = True
            pub_places += [pub_place]

        # Get topical terms
        topic = getSubfieldText(child, "650", "a")
        if topic:
            has_topic = True
            topics += [topic]

    # After iterating through all datafield elements of the record
    # (the elements where MARC fields may be found), if a MARC field
    # we searched for isn't found, then add a placeholder string
    # ("None") for that record's MARC field text
    if not has_author:
        authors += ["None"]
    if not has_title:
        titles += ["None"]
    if not has_lang:
        langs += ["None"]
    if not has_pub_date:
        pub_dates += ["None"]
    if not has_pub_place:
        pub_places += ["None"]
    if not has_topic:
        topics += ["None"]

    all_authors.append(authors)
    all_titles.append(titles)
    all_langs.append(langs)
    all_pub_dates.append(pub_dates)
    all_pub_places.append(pub_places)
    all_topics.append(topics)

# There should be one sublist inside each of the all_* lists for every record (meaning they are all the same length)
# The assertions below will throw an error if this is not the case, indicating that our code didn't work as expected
assert len(all_topics) == len(all_titles)
assert len(all_titles) == len(all_authors)
assert len(all_authors) == len(all_langs)
assert len(all_langs) == len(all_pub_places)
assert len(all_pub_places) == len(all_pub_dates)
assert len(all_pub_dates) == len(records)

In [11]:
all_topics[0:10]

Out[11]:
[['None'],
['Birds', 'Birds'],
['Nature conservation'],
['Wages.'],
['None'],
['Scientific expeditions'],
['None'],
['None']]

Step 4: Now we'll turn the lists we created of all topics, titles, authors, languages, publication places, and publication dates into a dataframe, or table, using Pandas:

In [12]:
# First create a dictionary for each column that will be in the dataframe
cols = {'author' : all_authors, 'title' : all_titles, 'topic' : all_topics, 'language' : all_langs, 'publication_place' : all_pub_places, 'publication_date' : all_pub_dates}
df = pd.DataFrame(cols)
df.head()   # this prints the first 5 rows of a dataframe

Out[12]:
author title topic language publication_place publication_date
0 [Dryden, Derek.] [Salmon fishing on the South Esk Estuary :] [None] [None] [Hamilton :] [1978.]
1 [Macpherson, Iain.] [Attracting new students to adult education :] [Adult education] [None] [Edinburgh :] [1989.]
2 [None] [The breeding birds of south-east Scotland :] [Birds, Birds] [None] [Edinburgh :] [1998.]
3 [None] [Adult education, the challenge of change] [Adult education] [None] [Edinburgh] [1975]
4 [None] [Nature conservation in the Cairngorms :] [Nature conservation] [None] [Edinburgh :] [[1989?].]
In [13]:
df.tail()   # df.tail prints the last 5 rows of a dataframe

Out[13]:
author title topic language publication_place publication_date
368956 [Lister, John,] [Epigrams, and Jeux d'Esprit /] [None] [None] [Edinburgh :] [1870.]
368957 [A. T. G.] [Border reminiscences. Annals of Thornlea [in ... [None] [None] [Galashiels, ] [[1899]]
368958 [A. T. G.] ["Lammermoor leaves" /] [None] [None] [Galashiels,] [1898.]
368959 [A. W. G.] [Dissolution of parliament. A statesman's adve... [None] [None] [Edinburgh, ] [[1880]]
368960 [E. G.] [The Sabbath trader.] [Sunday., Sabbath., Merchants] [None] [Stirling :] [1855.]
In [14]:
df.to_csv("NBSv1_subset_messy.csv", index=False, encoding="utf-8")


### 1. Data Cleaning and Standardisation¶

For those of you running this Notebook in Binder (which doesn't allow enough memory to load the full MARCXML dataset), you can explore this section interactively by loading a subset of the dataset from a CSV file (see the previous section for the code that created this CSV file):

In [2]:
import pandas as pd
import re

df = pd.read_csv("NBSv1_subset_messy.csv")

all_authors = list(df.author)
all_titles = list(df.title)
all_topics = list(df.topic)
all_langs = list(df.language)
all_pub_places = list(df.publication_place)
all_pub_dates = list(df.publication_date)

df.tail()

Out[2]:
author title topic language publication_place publication_date
368956 ['Lister, John,'] ["Epigrams, and Jeux d'Esprit /"] ['None'] ['None'] ['Edinburgh :'] ['1870.']
368957 ['A. T. G.'] ['Border reminiscences. Annals of Thornlea [in... ['None'] ['None'] ['Galashiels, '] ['[1899]']
368958 ['A. T. G.'] ['"Lammermoor leaves" /'] ['None'] ['None'] ['Galashiels,'] ['1898.']
368959 ['A. W. G.'] ["Dissolution of parliament. A statesman's adv... ['None'] ['None'] ['Edinburgh, '] ['[1880]']
368960 ['E. G.'] ['The Sabbath trader.'] ['Sunday.', 'Sabbath.', 'Merchants'] ['None'] ['Stirling :'] ['1855.']

Although some of the cells in our dataframe df have more than one value, many of them only have one value. Let's turn cells with only one value into strings or, for the publication_date column, into numbers. That way the data will be easier to clean, extract, and analyse.

Step 1: First, we'll go through the lists we used to fill each column in the dataframe and flatten them, only keeping sublists if a record has more than one value in the sublist:

In [16]:
def selectivelyFlatten(df_col_list):
    flat = []
    for c in df_col_list:
        if len(c) > 1:
            flat.append(c)
        else:
            flat.append(c[0])
    return flat

all_authors_flat = selectivelyFlatten(all_authors)
all_titles_flat = selectivelyFlatten(all_titles)
all_topics_flat = selectivelyFlatten(all_topics)
all_langs_flat = selectivelyFlatten(all_langs)
all_pub_places_flat = selectivelyFlatten(all_pub_places)
all_pub_dates_flat = selectivelyFlatten(all_pub_dates)

# There should still be one value in the list for each record, whether it's a string or a sublist
assert(len(all_authors) == len(all_authors_flat))


Step 2: Next, we'll replace the df variable with a new, selectively flattened dataframe:

In [17]:
cols_flat = {'author' : all_authors_flat, 'title' : all_titles_flat, 'topic' : all_topics_flat, 'language' : all_langs_flat, 'publication_place' : all_pub_places_flat, 'publication_date' : all_pub_dates_flat}
df = pd.DataFrame(cols_flat)
df.head()   # display the first 5 rows of the new dataframe

Out[17]:
author title topic language publication_place publication_date
0 Dryden, Derek. Salmon fishing on the South Esk Estuary : None None Hamilton : 1978.
1 Macpherson, Iain. Attracting new students to adult education : Adult education None Edinburgh : 1989.
2 None The breeding birds of south-east Scotland : [Birds, Birds] None Edinburgh : 1998.
3 None Adult education, the challenge of change Adult education None Edinburgh 1975
4 None Nature conservation in the Cairngorms : Nature conservation None Edinburgh : [1989?].

Much better! Note that in metadata, dates often include punctuation indicating uncertainty, such as a question mark (?). For simplicity, we'll extract only the digits from dates for the dataframe, so that the dates can be treated as numbers.

Try It! How might you keep this uncertainty present in the data? How else could you represent, with numbers, that there is a range of years in which something is thought to have been published?
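One possible starting point for that Try It (a sketch, not the notebook's method): represent each date as an (earliest, latest, is_uncertain) triple, so a bracketed guess like '[1989?]' stays marked as uncertain and a partial year with dashes can become a whole span of years. The `year_range` helper below is made up for illustration:

```python
import re

def year_range(date_str):
    """Sketch: map a MARC-style date string to (earliest, latest, is_uncertain)."""
    # Brackets and question marks both signal cataloguer uncertainty
    uncertain = '?' in date_str or '[' in date_str
    # A full four-digit year, e.g. '1978.' or '[1989?]'
    m = re.search(r'\d{4}', date_str)
    if m:
        y = int(m[0])
        return (y, y, uncertain)
    # A partial year padded with dashes, e.g. '198-' for sometime in the 1980s
    m = re.search(r'(\d{1,3})-+', date_str)
    if m:
        span = 10 ** (4 - len(m[1]))
        lo = int(m[1]) * span
        return (lo, lo + span - 1, True)
    return (None, None, uncertain)

print(year_range('[1989?].'))  # (1989, 1989, True)
print(year_range('1978.'))     # (1978, 1978, False)
print(year_range('198-'))      # (1980, 1989, True)
```

Keeping a range rather than a single number means later analysis can decide how to treat uncertain dates instead of silently trusting them.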
In [18]:
years = []
pub_dates = list(df.publication_date)
for date in pub_dates:
    date = str(date)
    yr = re.search(r"\d{4}", date)   # find the first 4-digit sequence
    if yr:
        years += [int(yr[0])]
    else:
        years += [0]

In [19]:
df.publication_date = years
df.head()

Out[19]:
author title topic language publication_place publication_date
0 Dryden, Derek. Salmon fishing on the South Esk Estuary : None None Hamilton : 1978
1 Macpherson, Iain. Attracting new students to adult education : Adult education None Edinburgh : 1989
2 None The breeding birds of south-east Scotland : [Birds, Birds] None Edinburgh : 1998
3 None Adult education, the challenge of change Adult education None Edinburgh 1975
4 None Nature conservation in the Cairngorms : Nature conservation None Edinburgh : 1989
Try It! How might you clean the metadata in the other columns of the DataFrame?

We can write the DataFrame to a CSV file so that it can be opened in Microsoft Excel or loaded into a new Jupyter Notebook!

In [20]:
# The parameters (in parentheses) are the filename, whether you want to include the index (row numbers), and the file encoding
df.to_csv("NBSv1_subset.csv", index=False, encoding="utf-8")


### 2. Summary Statistics¶

For those of you running this Notebook in Binder (which doesn't allow enough memory to load the full MARCXML dataset), you can explore this section interactively by loading a subset of the dataset from a CSV file (see the previous section for the code that created this CSV file):

In [21]:
import pandas as pd

df = pd.read_csv("NBSv1_subset.csv")
df.head()

Out[21]:
author title topic language publication_place publication_date
0 Dryden, Derek. Salmon fishing on the South Esk Estuary : None None Hamilton : 1978
1 Macpherson, Iain. Attracting new students to adult education : Adult education None Edinburgh : 1989
2 None The breeding birds of south-east Scotland : ['Birds', 'Birds'] None Edinburgh : 1998
3 None Adult education, the challenge of change Adult education None Edinburgh 1975
4 None Nature conservation in the Cairngorms : Nature conservation None Edinburgh : 1989

Let's look closer at the dates books were published as recorded in the NBS metadata:

In [22]:
published = list(df.publication_date)
unique_dates = list(set(published))
unique_dates.sort()
print(unique_dates)

[0, 1004, 1068, 1074, 1078, 1366, 1474, 1494, 1505, 1507, 1508, 1509, 1535, 1537, 1540, 1541, 1552, 1553, 1554, 1555, 1556, 1558, 1559, 1561, 1562, 1563, 1564, 1565, 1566, 1567, 1568, 1569, 1570, 1571, 1572, 1573, 1574, 1575, 1576, 1577, 1578, 1579, 1580, 1581, 1582, 1584, 1585, 1587, 1588, 1589, 1590, 1591, 1592, 1593, 1594, 1595, 1596, 1597, 1598, 1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607, 1608, 1609, 1610, 1611, 1612, 1613, 1614, 1615, 1616, 1617, 1618, 1619, 1620, 1621, 1622, 1623, 1624, 1625, 1626, 1627, 1628, 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639, 1640, 1641, 1642, 1643, 1644, 1645, 1646, 1647, 1648, 1649, 1650, 1651, 1652, 1653, 1654, 1655, 1656, 1657, 1658, 1659, 1660, 1661, 1662, 1663, 1664, 1665, 1666, 1667, 1668, 1669, 1670, 1671, 1672, 1673, 1674, 1675, 1676, 1677, 1678, 1679, 1680, 1681, 1682, 1683, 1684, 1685, 1686, 1687, 1688, 1689, 1690, 1691, 1692, 1693, 1694, 1695, 1696, 1697, 1698, 1699, 1700, 1701, 1702, 1703, 1704, 1705, 1706, 1707, 1708, 1709, 1710, 1711, 1712, 1713, 1714, 1715, 1716, 1717, 1718, 1719, 1720, 1721, 1722, 1723, 1724, 1725, 1726, 1727, 1728, 1729, 1730, 1731, 1732, 1733, 1734, 1735, 1736, 1737, 1738, 1739, 1740, 1741, 1742, 1743, 1744, 1745, 1746, 1747, 1748, 1749, 1750, 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768, 1769, 1770, 1771, 1772, 1773, 1774, 1775, 1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1784, 1785, 1786, 1787, 1788, 1789, 1790, 1791, 1792, 1793, 1794, 1795, 1796, 1797, 1798, 1799, 1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 
1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2110, 2111, 3417]


We assigned 0 to books without a 4-digit year listed in their publication date field, so we can ignore that value. We can also assume that dates after the current year (2020) are invalid. Let's filter our list to include only the dates likely to be correct:

In [23]:
filtered_dates = []
for d in unique_dates:
    if d != 0 and d <= 2020:
        filtered_dates += [d]
print("Earliest publication date:", min(filtered_dates))
print("Latest publication date:", max(filtered_dates))

Earliest publication date: 1004
Latest publication date: 2019
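The same filtering can be done without a loop, using a pandas boolean mask. The sketch below uses a small made-up DataFrame so it runs on its own; in the notebook you could apply the identical mask to df:

```python
import pandas as pd

# Made-up stand-in for the NBS publication_date column
demo = pd.DataFrame({'publication_date': [0, 1004, 1978, 2019, 2110, 3417]})

# Keep only plausible years: non-zero and not in the future
valid = demo[(demo['publication_date'] != 0) & (demo['publication_date'] <= 2020)]
print(valid['publication_date'].min())  # 1004
print(valid['publication_date'].max())  # 2019
```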


I wonder what book was published in 1004...

In [24]:
df[df['publication_date'] == 1004]

Out[24]:
author title topic language publication_place publication_date
40760 Mackay, Nicci, Spoken in whispers : ['Horse whisperers', 'Human-animal communicati... None Edinburgh : 1004

From a quick Google, it looks like this book was actually published in 1997, so this date must have been a mistake.

I wonder, what's the most common year a book was published out of those in the NBS?

In [25]:
df['publication_date'].value_counts()

Out[25]:
0       43097
2000     7267
1999     6981
2006     6832
2005     6695
...
1509        1
1004        1
1554        1
1559        1
1494        1
Name: publication_date, Length: 483, dtype: int64

The year 2000! 7,267 books from the NBS were published that year.

In [26]:
print("Total books in NBS so far:", total_records)
print("Percentage of books in NBS published in 2000:", str((7267/total_records)*100)+"%")

Total books in NBS so far: 368961
Percentage of books in NBS published in 2000: 1.9695848612726008%

Try It! What other questions can you ask of the metadata? How about finding the most common topics assigned to books in the NBS?
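As a starting point for that last question, here is a sketch using collections.Counter on a small made-up list of topics; the real analysis would feed in the values from df.topic instead (remembering that some cells hold sublists and placeholder "None" strings):

```python
from collections import Counter

# Made-up sample standing in for the NBS topic column
sample_topics = ['Birds', 'Birds', 'Nature conservation', 'Adult education',
                 'Birds', 'None', 'Adult education']

# Count every topic, skipping the "None" placeholders
topic_counts = Counter(t for t in sample_topics if t != 'None')
print(topic_counts.most_common(2))  # [('Birds', 3), ('Adult education', 2)]
```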