Exploring Britain and UK Handbooks

Created in July-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern

About the Britain and UK Handbooks Dataset

The data consists of digitised text from select Britain and UK Handbooks produced between 1954 and 2005. A central statistics bureau (the Central Statistical Office until 1 April 1966, when it merged with the Office of Population Censuses and Surveys and became the Office for National Statistics) produced the Handbooks each year to communicate information about the UK that would impress international diplomats. The Handbooks provide a factual skeleton of the UK, focusing on reporting quantitative information and a civil service perspective.

Before you begin: If you are interacting with this Notebook in Binder, please note that there is a memory limit (see top right corner) that may prevent the entire Notebook from running due to the large size of the dataset. Installing Jupyter Lab or Jupyter Notebook locally will allow you to run the entire Notebeook on your own computer without running into memory limitations.

Table of Contents

  1. Preparation
  2. Data Cleaning and Standardisation
  3. Summary Statistics
  4. Exploratory Analysis

Citations

  • Alex, Beatrice and Llewellyn, Clare. (2020) Library Carpentry: Text & Data Mining. Centre for Data, Culture & Society, University of Edinburgh. http://librarycarpentry.org/lc-tdm/.
  • Bird, Steven and Klein, Ewan and Loper, Edward. (2019) Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. O'Reilly Media. 978-0-596-51649-9. https://www.nltk.org/book/.

0. Preparation

Import libraries to use for cleaning, summarizing and exploring the data:

In [43]:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re

# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt

# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('tagsets')  # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt

To explore the text in the Britain and UK Handbooks collection, we'll mainly use the Natural Language Toolkit (NLTK), a library written for the programming language Python.

The nls-text-handbooks folder (downloadable as Just the text data from the website at the top of this notebook) contains TXT files of digitised text, with numerical names, as well as a CSV inventory file and a TXT ReadMe file. Load only the TXT files of digitised text and tokenise the text (which splits a string into separate words, numbers, and punctuation):

In [2]:
corpus_folder = 'data/nls-text-handbooks/'
wordlists = PlaintextCorpusReader(corpus_folder, '\d.*', encoding='latin1')
corpus_tokens = wordlists.words()
print(corpus_tokens[:10])
['BRITAIN', '1979', '3W', '+', 'L', 'Capita', '!', 'Edinburgh', 'Population', '5']

It's hard to get a sense of how accurately the text has been digitised from this list of 10 tokens, so let's look at one of these words in context. To see phrases in which "Edinburgh" is used, we can use the concordance() method:

In [3]:
t = Text(corpus_tokens)
t.concordance('Edinburgh', lines=20)
Displaying 20 of 2579 matches:
BRITAIN 1979 3W + L Capita ! Edinburgh Population 5 , 196 / GOO ENGLAND A
ondon WC1V 6HB 13a Castle Street , Edinburgh EH2 3AR 41 The Hayes , Cardiff CF1
ield Liverpool Manchester Bradford Edinburgh Bristol Belfast Coventry Cardiff s
Counsellors of State ( the Duke of Edinburgh , the four adult persons next in s
ments , accompanied by the Duke of Edinburgh , and undertakes lengthy tours in 
y government bookshops in London , Edinburgh , Cardiff , Belfast , Manchester ,
five Scottish departments based in Edinburgh and known as the Scottish Office .
 is centred in the Crown Office in Edinburgh . The Parliamentary Draftsmen for 
. The main seat of the court is in Edinburgh where all appeals are heard . All 
 The Court of Session sits only in Edinburgh , and has jurisdiction to deal wit
ersities are : Aberdeen , Dundee , Edinburgh , Glasgow , Heriot - Watt ( Edinbu
nburgh , Glasgow , Heriot - Watt ( Edinburgh ), St . Andrews , Stirling , and S
. Andrews , Glasgow , Aberdeen and Edinburgh from the fifteenth and sixteenth c
the Pentland Hills to the south of Edinburgh . Over 98 per cent of the land in 
 , a major commercial centre , and Edinburgh , Scotland s capital , an administ
 , bife and Dundee , as well as in Edinburgh , where this and other modern indu
c Services Station , East Craigs , Edinburgh , provide scientific and technical
similar service between London and Edinburgh in May 1978 and the construction o
t England and on the route linking Edinburgh , Newcastle upon Tyne , Birmingham
e been introduced from Heathrow to Edinburgh and Belfast . Joint shuttle servic

I'm guessing bife should be Fife as it's closely followed by Dundee, but overall not so bad!

We can also load individual files from the nls-text-handbooks folder:

In [4]:
file = open('data/nls-text-handbooks/205336772.txt', 'r')
sample_text = file.read()
sample_tokens = word_tokenize(sample_text)
print(sample_tokens[:10])
['GH', '.', 'fl-', '[', 'IASG0', '>', 'J^RSEI', 'nice', ']', 'ROME']

However, in this Notebook, we're interested in the entire dataset, so we'll use all its files. Let's find out just how many files, and just how much text, we're working with.

0.1 Dataset Size

In [5]:
def corpusStatistics(plaintext_corpus_read_lists):
    total_tokens = 0
    total_sents = 0
    total_files = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    print("Total...")
    print("  Tokens in Handbooks Data:", total_tokens)
    print("  Sentences in Handbooks Data:", total_sents)
    print("  Files in Handbooks Data:", total_files)

corpusStatistics(wordlists)
Total...
  Tokens in Handbooks Data: 16606800
  Sentences in Handbooks Data: 584618
  Files in Handbooks Data: 50

Note that I've print Tokens rather than words, though the NLTK method used to count those was .words(). This is because words in NLTK include punctuation marks and digits, in addition to alphabetic words.

Across the 50 files that make up the Handbooks dataset, there are over 90 million characters (which could be words, numbers, punctuation, abbreviations, etc.), over 16 million words, and nearly 600,000 sentences. Of course, OCR isn't perfect, so these numbers are estimates, not precise totals.

Variables that store the words and sentences in our dataset will be useful for future analysis. Let's create those now:

In [6]:
def getWordsSents(plaintext_corpus_read_lists):
    all_words = []
    all_sents = []
    for fileid in plaintext_corpus_read_lists.fileids():
        
        file_words = plaintext_corpus_read_lists.words(fileid)
        all_words += [str(word) for word in file_words  if word.isalpha()]
        
        file_sents = sent_tokenize(plaintext_corpus_read_lists.raw(fileid))  #plaintext_corpus_read_lists.sents(fileid)
        all_sents += [str(sent) for sent in file_sents]
        
    return all_words, all_sents
        
handbooks_words, handbooks_sents = getWordsSents(wordlists)
In [7]:
print(handbooks_words[:10])
print(handbooks_words[-10:])
print()
sample_sentences = handbooks_sents[:5] + handbooks_sents[-5:]
for s in sample_sentences:
    # remove new lines and tabs at the start and end of sentences
    s = s.strip('\n')
    s = s.strip('\t')
    # remove new lines and tabs in the middle of sentences
    s = s.replace('\n','')
    s = s.replace('\t','')
    print(s)
['BRITAIN', 'L', 'Capita', 'Edinburgh', 'Population', 'GOO', 'ENGLAND', 'Area', 'km', 'miles']
['AIRWAYS', 'ADEN', 'ALWAYS', 'BAHAMAS', 'AIRWAYS', 'ASSOCIATES', 'August', 'r', 'MORESBY', 'T']

Capita! 1979
Capita!q.miles.kmGOO
^xt:. i - <1.. i 'i&rr
u.
(between pp 390 and 391).t structure390);olomgssalaries
LABI!BO’r
CHICAGO!EAIATED BY BRITISH OVERSEAS AIRWAYS-BRITISH EUROPEAN AIRWA YS-TRANS-CANADA AIR LINES • Q ANT AS EMPIRE AIRWAYS
GIBRALTAR!
GRAND CAYMAN.
MORESBY%TN A1RWA VS • TASMAN EMPIRE AIRWA YS- BRITISH WEST INDIAN AIRWAYS-ADEN ALWAYS BAHAMAS AIRWAYS'* ASSOCIATES

bife isn't the only word the OCR incorrectly digitised. To get a sense of how much of the digitised text we can perform meaningful analysis on, let's figure out how many of NLTK's "words" are actually recognisable English words. We'll use WordNet,* a database of English words, to evaluate which of NLTK's "words" are not valid English words. Section 1. Data Cleaning and Standardisation walks through how to estimate the amount digitisation mistakes.


*Princeton University "About WordNet." WordNet. Princeton University. 2010.

0.2 Identifying Subsets of the Data

Before we move onto cleaning and standardisation, we'll create a lists and dictionary that will help us easily access subsets of the Handbooks data. First need to load the inventory (CSV file) that lists which file name corresponds with which text in the Handbooks dataset. When you open the inventory in Microsoft Excel or a text editor, you can see that there are no column names. The Python library Pandas, which reads CSV files, calls these column names the header. When we use Pandas to read the inventory, we'll create our own header by specifying that the CSV file as None and providing a list of column names.

When Pandas (abbreviated pd when we loaded libraries in the first cell of this notebook) reads a CSV file, it creates a table called a DataFrame from that data. Let's see what the Handbooks inventory DataFrame looks like:

In [8]:
df = pd.read_csv('data/nls-text-handbooks/handbooks-inventory.csv', header=None, names=['fileid', 'title'])
df.head()  # df.head() returns the first 5 rows of a table, df.tail() returns the last 5 rows of the table, and df returns the entire table
Out[8]:
fileid title
0 189742208.txt Britain: An official handbook - 1979 - GII.11
1 189742209.txt Britain: An official handbook - 1980 - GII.11
2 189742210.txt Britain: An official handbook - 1981 - GII.11
3 189742211.txt Britain: An official handbook - 1982 - GII.11
4 189742212.txt Britain: An official handbook - 1983 - GII.11

It looks like the titles of the handbooks in this collection changes over time! The returned DataFrame (the inventory table) isn't showing the entire titles, so let's increase the maximum width of the columns:

In [9]:
pd.set_option('display.max_colwidth', 150)
df.tail()
Out[9]:
fileid title
45 204486117.txt Britain: The official yearbook of the United Kingdom - 2001 - GII.11
46 204882221.txt UK: The official yearbook of the United Kingdom of Great Britain and Northern Ireland - 2002 - GII.11 SER
47 204882222.txt UK: The official yearbook of the United Kingdom of Great Britain and Northern Ireland - 2003 - GII.11 SER
48 204882223.txt UK: The official yearbook of the United Kingdom of Great Britain and Northern Ireland - 2005 - GII.11 SER
49 205336772.txt Britain: An official handbook - 1955 - GII.11

Perfect! Now we can see the full titles for each Handbook file. It looks like they're not in chronological order, though, so let's sort them from earliest to latest publication date.

Step 1: Let's extract the date from each title of the Handbooks dataset using Regular Expressions, which enable us to specify patterns to look for that may be a combination of letters, digits, punctuation, or white space:

In [10]:
titles = list(df['title'])
dates = []
for title in titles:
    # Write a Regular Expression to extract the date from the title
    # (I find it helpful to text out Regular Expressions with Pythex: https://pythex.org/)
    # and turn the date into an Integer, so we can analyse it like a number
    yr = int((re.search('\d{4}', title))[0])
    dates += [yr]
print(dates)
[1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1954, 1956, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2005, 1955]

Step 2: Next, we'll associate the extracted years with their titles and fileids, adding them to the DataFrame in a column named year:

In [11]:
df['year'] = dates
In [12]:
df.sort_values(by=['year'], ascending=True, inplace=True)  # 'inplace=False' would create a different DataFrame that's sorted, instead of sorting the 'df' DataFrame
# df
Try It! Uncomment the last line of code in the cell above by removing the '#' before 'df' to see the sorted DataFrame. Are the titles ordered chronologically?

With a DataFrame, we can access individual cells, for example:

In [13]:
# To view the first (index = 0) row's fileid and title:
print(df.iloc[0][0], df.iloc[0][1])
# To view a title value given a fileid value:
print(df[df.fileid == '204882223.txt']['title'].values[0])
# To view a fileid value given a title value
print(df.loc[df['title'] == 'Britain: An official handbook - 1955 - GII.11'].values[0][0])
204486084.txt Britain: An official handbook - 1954 - GII.11
UK: The official yearbook of the United Kingdom of Great Britain and Northern Ireland - 2005 - GII.11 SER
205336772.txt

Now we can create a two dictionaries of fileids and their associated journal titles, so we can easily identify which wordlists correspond with which text in the Handbooks dataset (you can uncomment any lines of code by removing the # to see what they print):

In [14]:
# 1. Obtain a list of all file IDs
fileids = list(df["fileid"])
# print("Sample file IDs from list of file IDs:\n", fileids[-5:])
# print()

# 2. Obtain a list of all titles
titles = list(df["title"])
# print("Sample titles from list of titles:\n", titles[-5:])
# print()

# 3. Create a dictionary where the keys are file IDs and the values are titles
inventory = dict(zip(fileids, titles))
# print(inventory)
# print()

# 4. Pick a file ID by its index number...
i = 10
a_file_id = fileids[i]
# ... and get the title corresponding with the file ID in the inventory dictionary
print("The title for the file ID at index " + str(i) + ":\n", inventory[a_file_id])
The title for the file ID at index 10:
 Britain: An official handbook - 1965 - GII.11

Python's Natural Language Toolkit (NLTK) library, which we use for text analysis later on, stores the lists of tokens (worldists in the corpus_tokens variable we created) by the file IDs, so it's useful to be able to match the file IDs with their handbook text!

However, the previous corpus_tokens is based on the unsorted collection of wordlists for the Handbooks dataset. We'll fix that in the next section.

1. Data Cleaning and Standardisation

There are several ways to standardise, or "normalise," text, with each way providing suitable text for different types of analysis. For example, to study the vocabulary of a text-based dataset, it's useful to remove punctuation and digits, lowercase the remaining alphabetic words, and then reduce those words to their root form (with stemming or lemmatisation, for example). Alternatively, to identify people and places using named entity recognition, it's important to keep capitalisation in words and keep words in the context of their sentences.

Additionally, when working with a range of files published at different times, sorting the files and their wordlists chronologically is useful so that you can study changes in vocabulary or topics over time.

Step 1: In the previous section, Preparation, we tokenized the Handbooks dataset, creating a list of the words and a list of the sentences in each file. However, the files were not sorted when we created those lists, so we'll create new lists of word tokens and sentence tokens using on the sorted list of Handbooks files, fileid:

In [15]:
def getSortedWordsSents(plaintext_corpus_read_lists):
    all_words = []
    all_words_lower = []
    all_sents = []
    
    # Iterate through the list of SORTED fileids so that 
    # the words and sentences are tokenized in chronological order
    for fileid in fileids:
        file_words = plaintext_corpus_read_lists.words(fileid)
        all_words_lower += [str(word).lower() for word in file_words if word.isalpha()]
        all_words += [str(word) for word in file_words  if word.isalpha()]
        file_sents = sent_tokenize(plaintext_corpus_read_lists.raw(fileid))  
        all_sents += [str(sent) for sent in file_sents]
        
    return all_words, all_words_lower, all_sents
        
handbooks_words, handbooks_words_lower, handbooks_sents = getSortedWordsSents(wordlists)

Step 2: To get an estimate of how accurately OCR digitised the Handbooks, though, we'll use words in the sense that they are recognizable words in the English language. Let's write a regular expression that can tell us whether a string is a word or abbreviation:

In [16]:
isWord = re.compile('[a-zA-z.]+')  # include single letters and abbreviations

Step 3: Lastly, let's use that regular expression to write a function to distinguish words recognisable English words from unrecognisable strings:

In [17]:
def removeNonEnglishWords(list_of_strings):
    english_only = []
    nonenglish = []
    for s in list_of_strings:
        test = isWord.match(s)            # fails if has characters other than letters or a period
        if (test != None):
            passed = test.group()   # get the matching string
            if wordnet.synsets(passed):  # see if WordNet recognizes the matching string
                english_only.append(passed)
            else:
                nonenglish.append(passed)
        else:
            nonenglish.append(passed)
    return english_only, nonenglish
                
recognised, unrecognised = removeNonEnglishWords(handbooks_words)
In [18]:
print("Total alphabetic words recognised in WordNet:", len(recognised))
print("Total alphabetic words NOT reccognised in WordNet:", len(unrecognised))
print("Percentage of alphabetic words that are unrecognised in WordNet:", (len(unrecognised)/len(recognised))*100, "%")
Total alphabetic words recognised in WordNet: 9652429
Total alphabetic words NOT reccognised in WordNet: 3422753
Percentage of alphabetic words that are unrecognised in WordNet: 35.460017369721136 %

Note that these totals and percentage should be used as rough estimates, not precise calculations. WordNet may not recognise some British English terms or Scottish terms since it was developed at Princeton, an American university. There are other data sources that provide lists of valid words to which you could compare words from the Handbooks dataset. Using a combination of several sources of valid English words could provide more accurate estimates.

1.2 Reducing to Root Forms

In addition to tokenisation, lemmatisation is a method of standardising, or "normalising," text. NLTK's WordNet Lemmatizer reduces a token to its root only if the reduction of the token results in a word that's recognized as an English word in WordNet. Here's what that looks like:

In [19]:
# Lemmatize the text (reduce words to their root ONLY if the root is considered a word in WordNet)
wnl = nltk.WordNetLemmatizer()
lemmatised = [wnl.lemmatize(t) for t in handbooks_words_lower if t.isalpha()]  # only include alphabetic tokens
print(lemmatised[500:600])
['be', 'placed', 'on', 'general', 'sale', 'and', 'it', 'wa', 'finally', 'decided', 'to', 'do', 'this', 'after', 'a', 'recommendation', 'by', 'the', 'inter', 'departmental', 'committee', 'on', 'social', 'and', 'economic', 'research', 'the', 'handbook', 'contains', 'factual', 'and', 'statistical', 'information', 'compiled', 'from', 'authoritative', 'and', 'official', 'source', 'about', 'the', 'united', 'kingdom', 'it', 'people', 'and', 'it', 'institution', 'it', 'doe', 'not', 'claim', 'to', 'be', 'comprehensive', 'it', 'principal', 'purpose', 'is', 'to', 'provide', 'basic', 'data', 'on', 'the', 'main', 'aspect', 'of', 'national', 'administration', 'and', 'national', 'economy', 'and', 'to', 'give', 'an', 'account', 'of', 'the', 'part', 'played', 'by', 'the', 'government', 'in', 'the', 'life', 'of', 'the', 'community', 'in', 'considering', 'it', 'content', 'reader', 'in', 'the', 'united', 'kingdom']

2. Summary Statistics

2.1 Frequencies and Sizes

Now that we've created some different cuts of the Handbooks dataset, let's start investigating the frequency of terms as they appear across the dataset. One way to do so is with a frequency distribution, which is a line chart that shows how many times a token appears in the dataset.

Let's plot the frequency distribution using tokens that were recognised by WordNet as English words, excluding stop words (for example: a, an, the), digits, and punctuation:

In [20]:
rec_min_two_letters = [t.lower() for t in recognised if len(t) > 2]
to_exclude = set(stopwords.words('english') + list(string.punctuation) + list(string.digits) + ['also', 'per', '000', 'one', 'many', 'may', 'two', 'see'])
filtered_rec_tokens = [t for t in rec_min_two_letters if not t in to_exclude]

fdist_ft_rec = FreqDist(filtered_rec_tokens)
print("Total tokens after filtering:", fdist_ft_rec.N())  # count the total tokens after filtering
Total tokens after filtering: 7219577
In [21]:
plt.figure(figsize = (20, 8))
plt.rc('font', size=12)

number_of_tokens = 30 # Try increasing or decreasing this number to view more or fewer tokens in the visualization
fdist_ft_rec.plot(number_of_tokens, title='Frequency Distribution for ' + str(number_of_tokens) + ' Most Common Tokens among Recognized English Words in the Handbooks Dataset (excluding stop words)')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1add77668>

We can create another data visualisation, one that illustrates when specific words are used within the Handbooks dataset. This is called a Lexical Dispersion Plot. Since capitalisation is important for identifying place names, we'll use the handbooks_words list rather than the handbooks_words_lower list for the plot. We'll pick some place names (the list of targets) to see when they appear:

In [22]:
corpus_text = Text(handbooks_words)
targets = ['UK', 'Britain', 'British', 'England', 'English', 'Scotland', 'Scottish', 'Ireland', 'Irish', 'Wales', 'Welsh']
plt.figure(figsize=(18,10))
plt.rc('font', size=12)
displt(corpus_text, targets, ignore_case=True, title='Lexical Dispersion Plot of UK and Ireland Place-related Words in the Handbooks Dataset')

Since we sorted the Handbooks tokens by date, the lexical dispersion plot is showing changes in the use of our list of target words (tokens) over time, with the earliest publications to the left and the most recent publications to the right! We can see that the word UK is used with increasing frequency over time. We can also see that English, Welsh, and Irish occur less than British and Scottish.

Try It! Instead of tokens, visualise the frequency distribution of lemmas or stems, and pick your own list of targets! You could try looking at sports, industries (such as mining), or other economy-related topics, for example.

2.2 Uniqueness and Variety

Another way to summarise the Handbooks dataset is to look at the uniqueness and variety of word usage. We can obtain the vocabulary of the text by creating a set of unique tokens that occur in the dataset, as well as creating a set of unique lemmatised tokens that occur in the dataset.

In [23]:
# Remove duplicate tokens from the text (obtain the vocabulary of the text)
t_vocab = set(handbooks_words)
t_vocab_lower = set(handbooks_words_lower)
lemma_vocab = set(lemmatised)
print("Unique tokens:", len(t_vocab))
print("Unique lowercase tokens:", len(t_vocab_lower))
print("Unique lemmatised (lowercase) tokens:", len(lemma_vocab))
print()
rec_vocab = set(recognised)
unrec_vocab = set(unrecognised)
print("Unique recognised words:", len(rec_vocab))
print("Unique unrecognised words:", len(unrec_vocab))
Unique tokens: 72489
Unique lowercase tokens: 57422
Unique lemmatised (lowercase) tokens: 52713

Unique recognised words: 36742
Unique unrecognised words: 33606

The vocabulary of the entire Handbooks dataset contains 70,922 unique words, 36,780 of which are recognised English words in WordNet. The lemmatised vocabulary of the dataset contains 66,172 words.

In [24]:
print(list(lemma_vocab)[:100])
['digitalâ', 'rusedskiâ', 'jermanent', 'leggett', 'gave', 'bso', 'portpatrick', 'certifies', 'gzji', 'minimised', 'wu', 'cordial', 'conductivity', 'hedge', 'vlalochnagar', 'ltuaâ', 'lities', 'respectâ', 'gnvqs', 'bc', 'branching', 'weartieayp', 'thalmic', 'toria', 'theddl', 'unfurnished', 'pont', 'oocno', 'rac', 'informatics', 'ductive', 'ccea', 'maintenâ', 'lte', 'coid', 'minehead', 'kabul', 'chad', 'btec', 'waketield', 'admmisters', 'welj', 'câ', 'granddaughter', 'federation', 'hide', 'vegetarian', 'acrobat', 'perchaâ', 'underneath', 'ratesâ', 'emulsifier', 'ogc', 'cathay', 'iqsl', 'centerpiece', 'yearbook', 'vohi', 'minoritiesâ', 'baofe', 'uci', 'enact', 'monofilament', 'mics', 'betws', 'supplying', 'printerâ', 'rih', 'confecâ', 'loiga', 'jobless', 'repatriated', 'fomy', 'apprenticeshipsâ', 'dam', 'foundry', 'halsey', 'sponsqred', 'sssp', 'talbot', 'filler', 'supplement', 'intelsatâ', 'fcambridge', 'arrested', 'wdges', 'romantic', 'carrying', 'vjz', 'hn', 'ransfersâ', 'asks', 'territorv', 'abandoned', 'corres', 'rehousing', 'traffickersâ', 'reaching', 'upheaval', 'utc']

Since the Handbooks dataset contains multiple publications (one file per year of publication), we could try picking a subset of publications, or even a single publication, and then compare the vocabulary across different publications. What patterns would you expect to see? How might the lexical diversity of the Handbooks dataset compare to the lexical diversity of novels in the Lewis Grassic Gibbon First Editions collection?

3. Exploratory Analysis

Let's group the Handbooks TXT files into 10-year periods so that we can investigate patterns in the Handbooks' text over time, comparing one decade to the next.

We'll group the Handbooks using Regular Expressions on their titles to identify the decade in which they were published:

In [25]:
# Make a dictionary with years as keys and fileids as values
list_of_years = list(df['year'])
fileid_to_year = dict(zip(fileids, list_of_years))

# Create a list for each decade during which the Handbooks were published:
fifties = [f for f in fileids if re.match('195\d{1}', str(fileid_to_year[f]))]
print(fifties)
sixties = [f for f in fileids if re.match('196\d{1}', str(fileid_to_year[f]))]
seventies = [f for f in fileids if re.match('197\d{1}', str(fileid_to_year[f]))]
eighties = [f for f in fileids if re.match('198\d{1}', str(fileid_to_year[f]))]
nineties = [f for f in fileids if re.match('199\d{1}', str(fileid_to_year[f]))]
twotho = [f for f in fileids if re.match('200\d{1}', str(fileid_to_year[f]))]

# Check that the decade lists' lengths sum to the length of the list of all fileids (an error is thrown if they don't)
assert len(fifties) + len(sixties) + len(seventies) + len(eighties) + len(nineties) + len(twotho) == len(fileids)
['204486084.txt', '205336772.txt', '204486085.txt', '204486086.txt', '204486087.txt']
In [26]:
# INPUT: a wordlist (from the PlaintextCorpusReader - see section 0. Preparation),
#        a year and a list of fileids associated with that year
# OUTPUT: a list of word tokens for fileids from the inputted year
def getTokens(plaintext_corpus_read_lists, decade_files):
    all_words = []
    for fileid in decade_files:
        file_words = plaintext_corpus_read_lists.words(fileid)
        all_words += [str(word) for word in file_words  if word.isalpha()]  # isalpha() removes non-letter tokens
    return all_words
In [27]:
fifties_tokens = getTokens(wordlists, fifties)
print(fifties_tokens[:100])
sixties_tokens = getTokens(wordlists, sixties)
seventies_tokens = getTokens(wordlists, seventies)
eighties_tokens = getTokens(wordlists, eighties)
nineties_tokens = getTokens(wordlists, nineties)
twotho_tokens = getTokens(wordlists, twotho)
['m', 'Gti', 'i', 'BRITAIN', 'An', 'Official', 'Handbook', 'BRITAIN', 'An', 'Official', 'Handbook', 'PREPARED', 'BY', 'THE', 'CENTRAL', 'OFFICE', 'OF', 'INFORMATION', 'AND', 'PUBLISHED', 'BY', 'HER', 'MAJESTYâ', 'S', 'STATIONERY', 'OFFICE', 'LONDON', 'Crown', 'Copyright', 'Reserved', 'HER', 'MAJESTYâ', 'S', 'STATIONERY', 'OFFICE', 'Copies', 'of', 'this', 'book', 'may', 'be', 'had', 'from', 'H', 'M', 'Stationery', 'Office', 'York', 'House', 'Kingsway', 'London', 'W', 'C', 'Oxford', 'St', 'London', 'W', 'i', 'orders', 'by', 'post', 'to', 'be', 'sent', 'to', 'P', 'O', 'Box', 'London', 'S', 'E', 'i', 'Castle', 'St', 'Edinburgh', 'King', 'St', 'Manchester', 'Edmund', 'St', 'Birmingham', 'St', 'Andrewâ', 's', 'Crescent', 'Cardiff', 'Tower', 'Lane', 'Bristol', 'Chichester', 'St', 'Belfast', 'or', 'through', 'any', 'bookseller', 'Obtainable', 'in', 'the', 'United']

Great! Now we can analyse the Handbooks dataset by 10-year periods!

3.1 How are the United Kingdom and Great Britain portrayed? How does this change over time?

The Handbooks were written for an international audience to impress people with the success and strength of Britain and the UK. Let's investigate how Britain and the UK are portrayed:

In [28]:
t = Text(corpus_tokens)
t.concordance('Britain', lines=10)
Displaying 10 of 49467 matches:
 BRITAIN 1979 3W + L Capita ! Edinburgh Popu
ges and salaries 1973 - 78 318 Maps Britain inside back cover Economic planning
icity 255 Some minerals produced in Britain ⠀¢ 261 Main railway passenger rou
ween pp 390 and 391 ). Introduction Britain 1979 is the thirtieth official hand
onery Office throughout the world . Britain 1979 is primarily concerned to desc
ember 1 978 THE PHYSICAL BACKGROUND Britain , formally known as the United King
nown as the United Kingdom of Great Britain and Northern Ireland , constitutes 
ope . The largest islands are Great Britain ( comprising the mainlands of Engla
d the Channel Islands between Great Britain and France have a large measure of 
ed Kingdom 244 , 103 94 , 249 Great Britain 229 , 983 88 , 797 England 130 , 44
In [29]:
fdist = FreqDist(handbooks_words)
print("Frequency (percentage) of Britain and the UK in Handbooks dataset:")
print(" - Britain:", (fdist.freq('Britain'))*100, "%")
print(" - GB:", (fdist.freq('GB'))*100, "%")
print(" - UK:", (fdist.freq('UK'))*100, "%")
Frequency (percentage) of Britain and the UK in Handbooks dataset:
 - Britain: 0.2972731087031905 %
 - GB: 0.0018202423492078353 %
 - UK: 0.07828571717013194 %

Let's create Frequency Distribution visualisations for the Handbooks published in the fifties and the 2000s.

Step 1: First, we'll lowercase the words and remove stopwords from the lists:

In [30]:
to_exclude = stopwords.words('english') + ['â', 'per', 'cent']
fifties_filtered = [w.lower() for w in fifties_tokens if not w.lower() in to_exclude]
twotho_filtered = [w.lower() for w in twotho_tokens if not w.lower() in to_exclude]

Step 2: Then, we'll stem the words in both lists:

In [31]:
porter = nltk.PorterStemmer()
porter_stemmed = [porter.stem(w) for w in fifties_filtered]
print("Fifties stems:", porter_stemmed[590:600])  
porter_stemmed = [porter.stem(w) for w in twotho_filtered]
print("Twotho stems:", porter_stemmed[590:600])
Fifties stems: ['purpos', 'great', 'britain', 'compris', 'england', 'wale', 'scotland', 'posit', 'complic', 'fact']
Twotho stems: ['postal', 'area', 'england', 'counti', 'unitari', 'author', 'sinc', 'april', 'region', 'england']

Step 3: Lastly, we'll calculate the frequency distributions of stems for the 1950s and 2000s, and visualise those distributions:

In [32]:
fdist_fifties = FreqDist(fifties_filtered)
fdist_twotho = FreqDist(twotho_filtered)

number_of_tokens = 10 # Try increasing or decreasing this number to view more or fewer tokens in the visualization
plt.figure(figsize = (10, 6))
plt.rc('font', size=12)
fdist_fifties.plot(number_of_tokens, title='Frequency Distribution of Top ' + str(number_of_tokens) + ' Stems in Handbooks from 1954-59')

plt.figure(figsize = (10, 6))
plt.rc('font', size=12)
fdist_twotho.plot(number_of_tokens, title='Frequency Distribution of Top ' + str(number_of_tokens) + ' Stems in Handbooks from 2000-05')
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x161a31c50>

Be careful when comparing these graphs to pay attention to the different scales on their y axes!

3.1 Visualising Words Over Time

Step 1: Let's pick some words that relate to various industries that may appear in the Handbooks and visualise their occurrences over time:

In [33]:
targets = ['mining', 'technology', 'shipbuilding', 'football', 'medicine', 'research', 'digital']
Try It! How could you use the Contents pages in the Handbooks to look at the topics that are added, removed, or maintained over the years each Handbook has been published?

Step 2: Let's group the Handbooks by decade to study these words' occurrences in the Handbooks based on their decade of publication:

In [34]:
# INPUT: a capitalised word (in String format)
# OUTPUT: a list of the ratios of the inputted word's occurrence
#         (lowercased and capitalised) to all words in each 
#         decade group of Handbooks
def wordOccurrenceByDecade(word):
    word_occurs = []
    tokens_lists = [fifties_tokens, sixties_tokens, seventies_tokens, eighties_tokens, nineties_tokens, twotho_tokens]
    for decade in tokens_lists:
        word_count_capital = decade.count(word)
        word_lower = word.lower()
        word_count_lower = decade.count(word_lower)
        total_words = len(decade)
        occurrence = (word_count_capital + word_count_lower)/total_words
        word_occurs += [occurrence]
    return word_occurs

digital = wordOccurrenceByDecade('Digital')
mining = wordOccurrenceByDecade('Mining')
shipbuilding = wordOccurrenceByDecade('Shipbuilding')
technology = wordOccurrenceByDecade('Technology')
football = wordOccurrenceByDecade('Football')

Step 3: Now we'll create a DataFrame of the occurrence data to view the occurrence of the words by decade of Handbooks publications, and we can export the DataFrame as a CSV file so it can be opened in Microsoft Excel or loaded into another Jupyter Notebook as a DataFrame:

In [35]:
col_names = ['1950s', '1960s', '1970s', '1980s', '1990s', '2000s' ]
row_names = ['Digital', 'Football', 'Mining', 'Shipbuilding', 'Technology']
industry_df = pd.DataFrame(data=[digital, football, mining, shipbuilding, technology], columns=col_names, index=row_names)
industry_df.to_csv('handbooks_industry_occurrences.csv')
industry_df

# Transpose the data to rotate the columns and rows of a DataFrame with '.T' or '.transpose()'
industry_df = industry_df.T
industry_df
Out[35]:
Digital Football Mining Shipbuilding Technology
1950s 0.000004 0.000038 0.000095 0.000117 0.000072
1960s 0.000006 0.000140 0.000078 0.000092 0.000275
1970s 0.000008 0.000115 0.000098 0.000084 0.000299
1980s 0.000049 0.000141 0.000089 0.000084 0.000516
1990s 0.000092 0.000202 0.000063 0.000030 0.000800
2000s 0.000182 0.000186 0.000054 0.000027 0.000606

Step 4: Using Altair, we can visualise the occurrence of a single word over the decades of Handbooks publications:

In [37]:
source = pd.DataFrame({
    'decade': col_names,
    'occurrence': list(industry_df['Digital'])
})

alt.Chart(source, title="Occurrence of 'Digital' and 'digital' in Handbooks by Decade").mark_bar(size=60).encode(
    x='decade',
    y=alt.Y('occurrence', axis=alt.Axis(format='%', title='Occurrence'))
).configure_axis(
    grid=False,
    labelAngle=0
).configure_view(
    strokeWidth=0
).properties(
    width=440
)
Out[37]:

Ta da!

Try It! Can you edit the bar chart to show the occurrence of another word by decade, such as mining? Then, try editing the code to visualise the occurrence of a word of your choosing, not one of the words already in our target list!

We can also use Altair for other types of visualisations, such as line charts that display the occurrence of all the words per decade in a single plot. To plot the occurrence of multiple words at once, we need to create a DataFrame with a slightly different structure...

First, though, we'll calculate the occurrence of words in every Handbook:

In [38]:
def wordOccurrenceByFile(word, wordlists, fileids):
    word_occurs = []
    for file in fileids:
        file_words = wordlists.words(file)
        
        word_count_capital = file_words.count(word)
        word_lower = word.lower()
        word_count_lower = file_words.count(word_lower)
        
        total_words = len(list(file_words))
        occurrence = (word_count_capital + word_count_lower)/total_words
        word_occurs += [occurrence]
        
    return word_occurs

digital = wordOccurrenceByFile('Digital', wordlists, fileids)
mining = wordOccurrenceByFile('Mining', wordlists, fileids)
shipbuilding = wordOccurrenceByFile('Shipbuilding', wordlists, fileids)
technology = wordOccurrenceByFile('Technology', wordlists, fileids)
football = wordOccurrenceByFile('Football', wordlists, fileids)
In [39]:
assert (len(digital) == len(fileids))
assert (len(digital) == len(mining))

Now we'll create a new DataFrame:

In [40]:
word = (['digital'] * (len(digital))) + (['mining'] * (len(mining))) + (['shipbuilding'] * (len(shipbuilding))) + (['technology'] * (len(technology))) + (['football'] * (len(football)))

occurrence = digital + mining + shipbuilding + technology + football

yrs = list(df['year'])
year = yrs * 5

word_df = pd.DataFrame({'word': word, 'occurrence': occurrence, 'year': year})
word_df.tail()
Out[40]:
word occurrence year
245 football 0.000151 2000
246 football 0.000164 2001
247 football 0.000152 2002
248 football 0.000162 2003
249 football 0.000096 2005

Using that DataFrame, we'll visualise the occurrence of all the words in every Handbook in our dataset:

In [41]:
alt.Chart(word_df, title="Occurrence of Select Words in the Britain and UK Handbooks").mark_line().encode(
    x='year:O',
    y=alt.Y('occurrence', axis=alt.Axis(format='%')),
    color='word',
    tooltip='word'
)
Out[41]:
Try It! How is Scotland portrayed? How does the portrayal of Britain and the UK compare or contrast with the portrayal of Scotland?
In [42]:
# HINT 1: try using Regular Expressions to search for words related to Scotland and Scottish-ness...

# scot_strings = [s for s in handbooks_words_lower if (re.search('scot$', s) or re.search('scot[tcls]+', s))]
# print("Total tokens related to Scotland:", len(scot_strings))

####################

# HINT 2: Sets in Python are similar to Lists except that they can't have repeating items, 
#         so changing a list to a set is a quick way to remove duplicates from a list!
# unique_scot = set(scot_strings)
# print("Unique tokens related to Scotland:", len(unique_scot))
# print(unique_scot)