Created in July-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern
The data consists of digitised text from select Britain and UK Handbooks produced between 1954 and 2005. A central statistics bureau (the Central Statistical Office until 1 April 1996, when it merged with the Office of Population Censuses and Surveys to become the Office for National Statistics) produced the Handbooks each year to communicate information about the UK that would impress international diplomats. The Handbooks provide a factual skeleton of the UK, focusing on quantitative information reported from a civil service perspective.
Import libraries to use for cleaning, summarizing and exploring the data:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context
# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re
# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt
# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('tagsets') # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt
To explore the text in the Britain and UK Handbooks collection, we'll mainly use the Natural Language Toolkit (NLTK), a library written for the programming language Python.
The nls-text-handbooks folder (downloadable as Just the text data from the website at the top of this notebook) contains TXT files of digitised text, with numerical names, as well as a CSV inventory file and a TXT ReadMe file. Load only the TXT files of digitised text and tokenise the text (which splits a string into separate words, numbers, and punctuation):
corpus_folder = 'data/nls-text-handbooks/'
wordlists = PlaintextCorpusReader(corpus_folder, '\d.*', encoding='latin1')
corpus_tokens = wordlists.words()
print(corpus_tokens[:10])
It's hard to get a sense of how accurately the text has been digitised from this list of 10 tokens, so let's look at one of these words in context. To see phrases in which "Edinburgh" is used, we can use the concordance() method:
t = Text(corpus_tokens)
t.concordance('Edinburgh', lines=20)
I'm guessing bife should be Fife, as it's closely followed by Dundee, but overall not so bad!
We can also load individual files from the nls-text-handbooks folder:
with open('data/nls-text-handbooks/205336772.txt', 'r', encoding='latin1') as file: # use the same encoding as the corpus reader
    sample_text = file.read()
sample_tokens = word_tokenize(sample_text)
print(sample_tokens[:10])
However, in this Notebook, we're interested in the entire dataset, so we'll use all its files. Let's find out just how many files, and just how much text, we're working with.
def corpusStatistics(plaintext_corpus_read_lists):
    total_tokens = 0
    total_sents = 0
    total_files = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    print("Total...")
    print(" Tokens in Handbooks Data:", total_tokens)
    print(" Sentences in Handbooks Data:", total_sents)
    print(" Files in Handbooks Data:", total_files)

corpusStatistics(wordlists)
Note that I've printed Tokens rather than words, though the NLTK method used to count them was .words(). This is because words in NLTK include punctuation marks and digits, in addition to alphabetic words.
Across the 50 files that make up the Handbooks dataset, there are over 90 million characters (letters, digits, punctuation, white space, and so on), over 16 million words, and nearly 600,000 sentences. Of course, OCR isn't perfect, so these numbers are estimates, not precise totals.
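The character total quoted above isn't produced by corpusStatistics(), which counts tokens, sentences and files. If you'd like to check it yourself, a minimal sketch along these lines should work (countCharacters is just an illustrative helper, not part of the original analysis):
# Estimate the total number of characters by summing the length of each file's raw text
def countCharacters(plaintext_corpus_read_lists):
    total_chars = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_chars += len(plaintext_corpus_read_lists.raw(fileid))
    return total_chars

print("Approximate characters in Handbooks Data:", countCharacters(wordlists))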
Variables that store the words and sentences in our dataset will be useful for future analysis. Let's create those now:
def getWordsSents(plaintext_corpus_read_lists):
    all_words = []
    all_sents = []
    for fileid in plaintext_corpus_read_lists.fileids():
        file_words = plaintext_corpus_read_lists.words(fileid)
        all_words += [str(word) for word in file_words if word.isalpha()]
        file_sents = sent_tokenize(plaintext_corpus_read_lists.raw(fileid)) # alternative: plaintext_corpus_read_lists.sents(fileid)
        all_sents += [str(sent) for sent in file_sents]
    return all_words, all_sents
handbooks_words, handbooks_sents = getWordsSents(wordlists)
print(handbooks_words[:10])
print(handbooks_words[-10:])
print()
sample_sentences = handbooks_sents[:5] + handbooks_sents[-5:]
for s in sample_sentences:
    # remove new lines and tabs at the start and end of sentences
    s = s.strip('\n')
    s = s.strip('\t')
    # remove new lines and tabs in the middle of sentences
    s = s.replace('\n','')
    s = s.replace('\t','')
    print(s)
bife isn't the only word the OCR incorrectly digitised. To get a sense of how much of the digitised text we can perform meaningful analysis on, let's figure out how many of NLTK's "words" are actually recognisable English words. We'll use WordNet,* a database of English words, to evaluate which of NLTK's "words" are not valid English words. Section 1. Data Cleaning and Standardisation walks through how to estimate the amount of digitisation mistakes.
*Princeton University "About WordNet." WordNet. Princeton University. 2010.
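As a quick illustration of the kind of check we'll rely on (a small sketch using made-up examples, not tokens pulled systematically from the data), WordNet's synsets() method returns an empty list for strings it doesn't recognise:
# wordnet.synsets() returns a non-empty list for words WordNet recognises
print(wordnet.synsets('fife'))  # a recognised English word, so the list is non-empty
print(wordnet.synsets('bife'))  # an OCR error, so the list is empty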
Before we move on to cleaning and standardisation, we'll create lists and a dictionary that will help us easily access subsets of the Handbooks data. First, we need to load the inventory (a CSV file) that lists which file name corresponds with which text in the Handbooks dataset. When you open the inventory in Microsoft Excel or a text editor, you can see that there are no column names. The Python library Pandas, which reads CSV files, calls these column names the header. When we use Pandas to read the inventory, we'll create our own header by specifying that the CSV file's header is None and providing a list of column names.
When Pandas (abbreviated pd when we loaded libraries in the first cell of this notebook) reads a CSV file, it creates a table called a DataFrame from that data. Let's see what the Handbooks inventory DataFrame looks like:
df = pd.read_csv('data/nls-text-handbooks/handbooks-inventory.csv', header=None, names=['fileid', 'title'])
df.head() # df.head() returns the first 5 rows of a table, df.tail() returns the last 5 rows of the table, and df returns the entire table
It looks like the titles of the handbooks in this collection change over time! The returned DataFrame (the inventory table) isn't showing the full titles, though, so let's increase the maximum width of the columns:
pd.set_option('display.max_colwidth', 150)
df.tail()
Perfect! Now we can see the full titles for each Handbook file. It looks like they're not in chronological order, though, so let's sort them from earliest to latest publication date.
Step 1: Let's extract the date from each title of the Handbooks dataset using Regular Expressions, which enable us to specify patterns to look for that may be a combination of letters, digits, punctuation, or white space:
titles = list(df['title'])
dates = []
for title in titles:
    # Write a Regular Expression to extract the date from the title
    # (I find it helpful to test out Regular Expressions with Pythex: https://pythex.org/)
    # and turn the date into an Integer, so we can analyse it like a number
    yr = int((re.search('\d{4}', title))[0])
    dates += [yr]
print(dates)
Step 2: Next, we'll associate the extracted years with their titles and fileids, adding them to the DataFrame in a column named year:
df['year'] = dates
df.sort_values(by=['year'], ascending=True, inplace=True) # 'inplace=False' would create a different DataFrame that's sorted, instead of sorting the 'df' DataFrame
# df
With a DataFrame, we can access individual cells, for example:
# To view the first (index = 0) row's fileid and title:
print(df.iloc[0][0], df.iloc[0][1])
# To view a title value given a fileid value:
print(df[df.fileid == '204882223.txt']['title'].values[0])
# To view a fileid value given a title value
print(df.loc[df['title'] == 'Britain: An official handbook - 1955 - GII.11'].values[0][0])
Now we can create a dictionary of fileids and their associated handbook titles, so we can easily identify which wordlists correspond with which text in the Handbooks dataset (you can uncomment any lines of code by removing the # to see what they print):
# 1. Obtain a list of all file IDs
fileids = list(df["fileid"])
# print("Sample file IDs from list of file IDs:\n", fileids[-5:])
# print()
# 2. Obtain a list of all titles
titles = list(df["title"])
# print("Sample titles from list of titles:\n", titles[-5:])
# print()
# 3. Create a dictionary where the keys are file IDs and the values are titles
inventory = dict(zip(fileids, titles))
# print(inventory)
# print()
# 4. Pick a file ID by its index number...
i = 10
a_file_id = fileids[i]
# ... and get the title corresponding with the file ID in the inventory dictionary
print("The title for the file ID at index " + str(i) + ":\n", inventory[a_file_id])
Python's Natural Language Toolkit (NLTK) library, which we use for text analysis later on, stores the lists of tokens (the wordlists used to create the corpus_tokens variable) by their file IDs, so it's useful to be able to match the file IDs with their handbook text!
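For example (a quick illustrative check, reusing inventory and a_file_id from the cell above), we can look up a handbook by its file ID and peek at its first tokens:
# Match a file ID to its title, then to its tokens
print(inventory[a_file_id])
print(wordlists.words(a_file_id)[:10])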
However, the previous corpus_tokens is based on the unsorted collection of wordlists for the Handbooks dataset. We'll fix that in the next section.
There are several ways to standardise, or "normalise," text, with each way providing suitable text for different types of analysis. For example, to study the vocabulary of a text-based dataset, it's useful to remove punctuation and digits, lowercase the remaining alphabetic words, and then reduce those words to their root form (with stemming or lemmatisation, for example). Alternatively, to identify people and places using named entity recognition, it's important to keep capitalisation in words and keep words in the context of their sentences.
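For instance (a small illustrative sketch using made-up example words rather than the Handbooks text), stemming and lemmatisation can produce quite different roots:
example_words = ['industries', 'mining', 'centres', 'better']
stemmer = PorterStemmer()
lemmatiser = nltk.WordNetLemmatizer()
print([stemmer.stem(w) for w in example_words])          # stems need not be real words (e.g. 'industri')
print([lemmatiser.lemmatize(w) for w in example_words])  # lemmas are words WordNet recognises (e.g. 'industry')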
Additionally, when working with a range of files published at different times, sorting the files and their wordlists chronologically is useful so that you can study changes in vocabulary or topics over time.
Step 1: In the previous section, Preparation, we tokenised the Handbooks dataset, creating a list of the words and a list of the sentences in each file. However, the files were not sorted when we created those lists, so we'll create new lists of word tokens and sentence tokens using the sorted list of Handbooks files, fileids:
def getSortedWordsSents(plaintext_corpus_read_lists):
    all_words = []
    all_words_lower = []
    all_sents = []
    # Iterate through the list of SORTED fileids so that
    # the words and sentences are tokenized in chronological order
    for fileid in fileids:
        file_words = plaintext_corpus_read_lists.words(fileid)
        all_words_lower += [str(word).lower() for word in file_words if word.isalpha()]
        all_words += [str(word) for word in file_words if word.isalpha()]
        file_sents = sent_tokenize(plaintext_corpus_read_lists.raw(fileid))
        all_sents += [str(sent) for sent in file_sents]
    return all_words, all_words_lower, all_sents
handbooks_words, handbooks_words_lower, handbooks_sents = getSortedWordsSents(wordlists)
Step 2: To estimate how accurately the OCR digitised the Handbooks, we'll use "words" in the sense of strings that are recognisable English words. Let's write a regular expression that can tell us whether a string is a word or an abbreviation:
isWord = re.compile('[a-zA-Z.]+') # include single letters and abbreviations
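As a quick check (illustrative only; these example strings aren't taken from the dataset), the pattern matches alphabetic strings and abbreviations but not strings that start with a digit or another character:
# Try the pattern on a few sample strings
for s in ['Edinburgh', 'e.g.', '1954', '£12']:
    m = isWord.match(s)
    print(s, '->', m.group() if m else None)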
Step 3: Lastly, let's use that regular expression to write a function that distinguishes recognisable English words from unrecognisable strings:
def removeNonEnglishWords(list_of_strings):
    english_only = []
    nonenglish = []
    for s in list_of_strings:
        test = isWord.match(s) # fails if the string has characters other than letters or a period
        if (test != None):
            passed = test.group() # get the matching string
            if wordnet.synsets(passed): # see if WordNet recognises the matching string
                english_only.append(passed)
            else:
                nonenglish.append(passed)
        else:
            nonenglish.append(s) # no match at all, so keep the original string as unrecognised
    return english_only, nonenglish
recognised, unrecognised = removeNonEnglishWords(handbooks_words)
print("Total alphabetic words recognised in WordNet:", len(recognised))
print("Total alphabetic words NOT recognised in WordNet:", len(unrecognised))
print("Percentage of alphabetic words that are unrecognised in WordNet:", (len(unrecognised)/(len(recognised) + len(unrecognised)))*100, "%")
Note that these totals and percentage should be used as rough estimates, not precise calculations. WordNet may not recognise some British English terms or Scottish terms since it was developed at Princeton, an American university. There are other data sources that provide lists of valid words to which you could compare words from the Handbooks dataset. Using a combination of several sources of valid English words could provide more accurate estimates.
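One possible extension along those lines (a rough sketch, not part of the original analysis) is to also accept tokens found in NLTK's words corpus, a Unix-style word list, and count a token as recognised if either source knows it:
nltk.download('words')
from nltk.corpus import words as nltk_words

unix_words = set(w.lower() for w in nltk_words.words())
# Count a token as recognised if WordNet OR the word list knows it (slow: this checks every token)
combined_recognised = [w for w in handbooks_words if wordnet.synsets(w) or w.lower() in unix_words]
print("Recognised by WordNet or the NLTK word list:", len(combined_recognised))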
In addition to tokenisation, lemmatisation is a method of standardising, or "normalising," text. NLTK's WordNet Lemmatizer reduces a token to its root only if the reduction of the token results in a word that's recognized as an English word in WordNet. Here's what that looks like:
# Lemmatize the text (reduce words to their root ONLY if the root is considered a word in WordNet)
wnl = nltk.WordNetLemmatizer()
lemmatised = [wnl.lemmatize(t) for t in handbooks_words_lower if t.isalpha()] # only include alphabetic tokens
print(lemmatised[500:600])
Now that we've created some different cuts of the Handbooks dataset, let's start investigating the frequency of terms as they appear across the dataset. One way to do so is with a frequency distribution, which counts how many times each token appears in the dataset and which NLTK can plot as a line chart.
Let's plot the frequency distribution using tokens that were recognised by WordNet as English words, excluding stop words (for example: a, an, the), digits, and punctuation:
rec_min_two_letters = [t.lower() for t in recognised if len(t) > 2]
to_exclude = set(stopwords.words('english') + list(string.punctuation) + list(string.digits) + ['also', 'per', '000', 'one', 'many', 'may', 'two', 'see'])
filtered_rec_tokens = [t for t in rec_min_two_letters if not t in to_exclude]
fdist_ft_rec = FreqDist(filtered_rec_tokens)
print("Total tokens after filtering:", fdist_ft_rec.N()) # count the total tokens after filtering
plt.figure(figsize = (20, 8))
plt.rc('font', size=12)
number_of_tokens = 30 # Try increasing or decreasing this number to view more or fewer tokens in the visualization
fdist_ft_rec.plot(number_of_tokens, title='Frequency Distribution for ' + str(number_of_tokens) + ' Most Common Tokens among Recognized English Words in the Handbooks Dataset (excluding stop words)')
We can create another data visualisation, one that illustrates when specific words are used within the Handbooks dataset. This is called a Lexical Dispersion Plot. Since capitalisation is important for identifying place names, we'll use the handbooks_words list rather than the handbooks_words_lower list for the plot. We'll pick some place names (the list of targets) to see when they appear:
corpus_text = Text(handbooks_words)
targets = ['UK', 'Britain', 'British', 'England', 'English', 'Scotland', 'Scottish', 'Ireland', 'Irish', 'Wales', 'Welsh']
plt.figure(figsize=(18,10))
plt.rc('font', size=12)
displt(corpus_text, targets, ignore_case=True, title='Lexical Dispersion Plot of UK and Ireland Place-related Words in the Handbooks Dataset')
Since we sorted the Handbooks tokens by date, the lexical dispersion plot is showing changes in the use of our list of target words (tokens) over time, with the earliest publications to the left and the most recent publications to the right! We can see that the word UK is used with increasing frequency over time. We can also see that English, Welsh, and Irish occur less than British and Scottish.
Another way to summarise the Handbooks dataset is to look at the uniqueness and variety of word usage. We can obtain the vocabulary of the text by creating a set of unique tokens that occur in the dataset, as well as creating a set of unique lemmatised tokens that occur in the dataset.
# Remove duplicate tokens from the text (obtain the vocabulary of the text)
t_vocab = set(handbooks_words)
t_vocab_lower = set(handbooks_words_lower)
lemma_vocab = set(lemmatised)
print("Unique tokens:", len(t_vocab))
print("Unique lowercase tokens:", len(t_vocab_lower))
print("Unique lemmatised (lowercase) tokens:", len(lemma_vocab))
print()
rec_vocab = set(recognised)
unrec_vocab = set(unrecognised)
print("Unique recognised words:", len(rec_vocab))
print("Unique unrecognised words:", len(unrec_vocab))
The vocabulary of the entire Handbooks dataset contains 70,922 unique words, 36,780 of which are recognised English words in WordNet. The lemmatised vocabulary of the dataset contains 66,172 words.
print(list(lemma_vocab)[:100])
Since the Handbooks dataset contains multiple publications (one file per year of publication), we could try picking a subset of publications, or even a single publication, and then compare the vocabulary across different publications. What patterns would you expect to see? How might the lexical diversity of the Handbooks dataset compare to the lexical diversity of novels in the Lewis Grassic Gibbon First Editions collection?
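One rough way to start answering that (an illustrative sketch, not a step from the original walkthrough) is the type-token ratio: the number of unique tokens divided by the total number of tokens in a file. The helper below is hypothetical and only looks at one Handbook:
# Type-token ratio for a single Handbook (higher values suggest more varied vocabulary)
def typeTokenRatio(plaintext_corpus_read_lists, fileid):
    tokens = [str(w).lower() for w in plaintext_corpus_read_lists.words(fileid) if w.isalpha()]
    return len(set(tokens)) / len(tokens)

print(inventory[fileids[0]])
print("Type-token ratio:", typeTokenRatio(wordlists, fileids[0]))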
Let's group the Handbooks TXT files into 10-year periods so that we can investigate patterns in the Handbooks' text over time, comparing one decade to the next.
We'll group the Handbooks using Regular Expressions on their titles to identify the decade in which they were published:
# Make a dictionary with years as keys and fileids as values
list_of_years = list(df['year'])
fileid_to_year = dict(zip(fileids, list_of_years))
# Create a list for each decade during which the Handbooks were published:
fifties = [f for f in fileids if re.match('195\d{1}', str(fileid_to_year[f]))]
print(fifties)
sixties = [f for f in fileids if re.match('196\d{1}', str(fileid_to_year[f]))]
seventies = [f for f in fileids if re.match('197\d{1}', str(fileid_to_year[f]))]
eighties = [f for f in fileids if re.match('198\d{1}', str(fileid_to_year[f]))]
nineties = [f for f in fileids if re.match('199\d{1}', str(fileid_to_year[f]))]
twotho = [f for f in fileids if re.match('200\d{1}', str(fileid_to_year[f]))]
# Check that the decade lists' lengths sum to the length of the list of all fileids (an error is thrown if they don't)
assert len(fifties) + len(sixties) + len(seventies) + len(eighties) + len(nineties) + len(twotho) == len(fileids)
# INPUT: a wordlist (from the PlaintextCorpusReader - see section 0. Preparation)
#        and a list of fileids for the Handbooks published in one decade
# OUTPUT: a list of word tokens from those files
def getTokens(plaintext_corpus_read_lists, decade_files):
    all_words = []
    for fileid in decade_files:
        file_words = plaintext_corpus_read_lists.words(fileid)
        all_words += [str(word) for word in file_words if word.isalpha()] # isalpha() removes non-letter tokens
    return all_words
fifties_tokens = getTokens(wordlists, fifties)
print(fifties_tokens[:100])
sixties_tokens = getTokens(wordlists, sixties)
seventies_tokens = getTokens(wordlists, seventies)
eighties_tokens = getTokens(wordlists, eighties)
nineties_tokens = getTokens(wordlists, nineties)
twotho_tokens = getTokens(wordlists, twotho)
Great! Now we can analyse the Handbooks dataset by 10-year periods!
The Handbooks were written for an international audience to impress people with the success and strength of Britain and the UK. Let's investigate how Britain and the UK are portrayed:
t = Text(corpus_tokens)
t.concordance('Britain', lines=10)
fdist = FreqDist(handbooks_words)
print("Frequency (percentage) of Britain and the UK in Handbooks dataset:")
print(" - Britain:", (fdist.freq('Britain'))*100, "%")
print(" - GB:", (fdist.freq('GB'))*100, "%")
print(" - UK:", (fdist.freq('UK'))*100, "%")
Let's create Frequency Distribution visualisations for the Handbooks published in the fifties and the 2000s.
Step 1: First, we'll lowercase the words and remove stopwords from the lists:
to_exclude = stopwords.words('english') + ['â', 'per', 'cent']
fifties_filtered = [w.lower() for w in fifties_tokens if not w.lower() in to_exclude]
twotho_filtered = [w.lower() for w in twotho_tokens if not w.lower() in to_exclude]
Step 2: Then, we'll stem the words in both lists:
porter = nltk.PorterStemmer()
fifties_stemmed = [porter.stem(w) for w in fifties_filtered]
print("Fifties stems:", fifties_stemmed[590:600])
twotho_stemmed = [porter.stem(w) for w in twotho_filtered]
print("Twotho stems:", twotho_stemmed[590:600])
Step 3: Lastly, we'll calculate the frequency distributions of stems for the 1950s and 2000s, and visualise those distributions:
fdist_fifties = FreqDist(fifties_stemmed)
fdist_twotho = FreqDist(twotho_stemmed)
number_of_tokens = 10 # Try increasing or decreasing this number to view more or fewer tokens in the visualization
plt.figure(figsize = (10, 6))
plt.rc('font', size=12)
fdist_fifties.plot(number_of_tokens, title='Frequency Distribution of Top ' + str(number_of_tokens) + ' Stems in Handbooks from 1954-59')
plt.figure(figsize = (10, 6))
plt.rc('font', size=12)
fdist_twotho.plot(number_of_tokens, title='Frequency Distribution of Top ' + str(number_of_tokens) + ' Stems in Handbooks from 2000-05')
Be careful when comparing these graphs to pay attention to the different scales on their y axes!
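If you'd like the two charts to be more directly comparable, one option (a sketch, not a step from the original notebook) is to print relative frequencies with FreqDist's freq() method, which divides each count by the total number of tokens in that distribution:
# Compare proportions rather than raw counts
for stem, count in fdist_fifties.most_common(5):
    print("1950s:", stem, "=", round(fdist_fifties.freq(stem) * 100, 3), "% of tokens")
for stem, count in fdist_twotho.most_common(5):
    print("2000s:", stem, "=", round(fdist_twotho.freq(stem) * 100, 3), "% of tokens")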
Step 1: Let's pick some words that relate to various industries that may appear in the Handbooks and visualise their occurrences over time:
targets = ['mining', 'technology', 'shipbuilding', 'football', 'medicine', 'research', 'digital']
Step 2: Let's group the Handbooks by decade to study these words' occurrences in the Handbooks based on their decade of publication:
# INPUT: a capitalised word (in String format)
# OUTPUT: a list of the ratios of the inputted word's occurrence
# (lowercased and capitalised) to all words in each
# decade group of Handbooks
def wordOccurrenceByDecade(word):
    word_occurs = []
    tokens_lists = [fifties_tokens, sixties_tokens, seventies_tokens, eighties_tokens, nineties_tokens, twotho_tokens]
    for decade in tokens_lists:
        word_count_capital = decade.count(word)
        word_lower = word.lower()
        word_count_lower = decade.count(word_lower)
        total_words = len(decade)
        occurrence = (word_count_capital + word_count_lower)/total_words
        word_occurs += [occurrence]
    return word_occurs
digital = wordOccurrenceByDecade('Digital')
mining = wordOccurrenceByDecade('Mining')
shipbuilding = wordOccurrenceByDecade('Shipbuilding')
technology = wordOccurrenceByDecade('Technology')
football = wordOccurrenceByDecade('Football')
Step 3: Now we'll create a DataFrame of the occurrence data to view the occurrence of the words by decade of Handbooks publications, and we can export the DataFrame as a CSV file so it can be opened in Microsoft Excel or loaded into another Jupyter Notebook as a DataFrame:
col_names = ['1950s', '1960s', '1970s', '1980s', '1990s', '2000s' ]
row_names = ['Digital', 'Football', 'Mining', 'Shipbuilding', 'Technology']
industry_df = pd.DataFrame(data=[digital, football, mining, shipbuilding, technology], columns=col_names, index=row_names)
industry_df.to_csv('handbooks_industry_occurrences.csv')
industry_df
# Transpose the data to rotate the columns and rows of a DataFrame with '.T' or '.transpose()'
industry_df = industry_df.T
industry_df
Step 4: Using Altair, we can visualise the occurrence of a single word over the decades of Handbooks publications:
source = pd.DataFrame({
    'decade': col_names,
    'occurrence': list(industry_df['Digital'])
})
alt.Chart(source, title="Occurrence of 'Digital' and 'digital' in Handbooks by Decade").mark_bar(size=60).encode(
    x='decade',
    y=alt.Y('occurrence', axis=alt.Axis(format='%', title='Occurrence'))
).configure_axis(
    grid=False,
    labelAngle=0
).configure_view(
    strokeWidth=0
).properties(
    width=440
)
Ta da!
We can also use Altair for other types of visualisations, such as line charts that display the occurrence of all the words per decade in a single plot. To plot the occurrence of multiple words at once, we need to create a DataFrame with a slightly different structure...
First, though, we'll calculate the occurrence of words in every Handbook:
def wordOccurrenceByFile(word, wordlists, fileids):
    word_occurs = []
    for file in fileids:
        file_words = wordlists.words(file)
        word_count_capital = file_words.count(word)
        word_lower = word.lower()
        word_count_lower = file_words.count(word_lower)
        total_words = len(list(file_words))
        occurrence = (word_count_capital + word_count_lower)/total_words
        word_occurs += [occurrence]
    return word_occurs
digital = wordOccurrenceByFile('Digital', wordlists, fileids)
mining = wordOccurrenceByFile('Mining', wordlists, fileids)
shipbuilding = wordOccurrenceByFile('Shipbuilding', wordlists, fileids)
technology = wordOccurrenceByFile('Technology', wordlists, fileids)
football = wordOccurrenceByFile('Football', wordlists, fileids)
assert (len(digital) == len(fileids))
assert (len(digital) == len(mining))
Now we'll create a new DataFrame:
word = (['digital'] * (len(digital))) + (['mining'] * (len(mining))) + (['shipbuilding'] * (len(shipbuilding))) + (['technology'] * (len(technology))) + (['football'] * (len(football)))
occurrence = digital + mining + shipbuilding + technology + football
yrs = list(df['year'])
year = yrs * 5
word_df = pd.DataFrame({'word': word, 'occurrence': occurrence, 'year': year})
word_df.tail()
Using that DataFrame, we'll visualise the occurrence of all the words in every Handbook in our dataset:
alt.Chart(word_df, title="Occurrence of Select Words in the Britain and UK Handbooks").mark_line().encode(
    x='year:O',
    y=alt.Y('occurrence', axis=alt.Axis(format='%')),
    color='word',
    tooltip='word'
)
# HINT 1: try using Regular Expressions to search for words related to Scotland and Scottish-ness...
# scot_strings = [s for s in handbooks_words_lower if (re.search('scot$', s) or re.search('scot[tcls]+', s))]
# print("Total tokens related to Scotland:", len(scot_strings))
####################
# HINT 2: Sets in Python are similar to Lists except that they can't have repeating items,
# so changing a list to a set is a quick way to remove duplicates from a list!
# unique_scot = set(scot_strings)
# print("Unique tokens related to Scotland:", len(unique_scot))
# print(unique_scot)
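Putting those two hints together (one possible sketch; the simpler pattern below will also catch unrelated words such as "mascot", so treat the numbers as rough), you could compare how common Scotland-related tokens are in each decade's word list:
# Share of Scotland-related tokens in each decade's word list
scot_pattern = re.compile('scot')
decade_names = ['1950s', '1960s', '1970s', '1980s', '1990s', '2000s']
decade_tokens = [fifties_tokens, sixties_tokens, seventies_tokens, eighties_tokens, nineties_tokens, twotho_tokens]
for name, tokens in zip(decade_names, decade_tokens):
    scot_count = sum(1 for t in tokens if scot_pattern.search(t.lower()))
    print(name, ":", round((scot_count / len(tokens)) * 100, 3), "% of tokens")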