Created August-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern
Lewis Grassic Gibbon was an early 20th-century Scottish novelist who also published under his birth name, James Leslie Mitchell. He was a prolific writer during the short period (five years) in which he published fiction and non-fiction, and the NLS collection contains first editions of all his published books. Gibbon's stories often featured strong central female characters, unusual for an early 20th-century writer. Gibbon's literary influence continues to be felt today: his novel Sunset Song was voted Scotland's favourite novel in 2016, and contemporary Scottish writers such as Ali Smith and A.L. Kennedy have noted Gibbon's influence on their own writing.
Import libraries to use for cleaning, summarising and exploring the data:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context
# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re
from collections import defaultdict
# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt
# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('tagsets') # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt
To explore the text in the Lewis Grassic Gibbon First Editions collection, we'll mainly use the Natural Language Toolkit (NLTK), a library written for the programming language Python.
The nls-text-gibbon folder (downloadable as Just the text data from the website at the top of this notebook) contains TXT files of digitised text, with numerical names, as well as a CSV inventory file and a TXT ReadMe file. Load only the TXT files of digitised text and tokenise the text (which splits the text into a list of its individual words and punctuation in the order they appear):
corpus_folder = 'data/nls-text-gibbon/'
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')
corpus_tokens = wordlists.words()
print(corpus_tokens[100:115])
Note: If you'd like to see how to specify a single TXT file to load as data, check out the Jupyter Notebook for the Britain and UK Handbooks!
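If you'd like a quick taste of that here, a single file ID can be passed straight to the corpus reader's words() method. A minimal sketch, using the file ID that appears later in this notebook for Sunset Song:
# A minimal sketch: load tokens from a single file by passing its file ID to .words()
single_file_tokens = wordlists.words('205174251.txt')
print(single_file_tokens[:15])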
It's hard to get a sense of how accurately the text has been digitised from this list of 15 tokens, so let's look at one of these words in context. To see phrases in which "Scots" is used, we can use the concordance() method:
t = Text(corpus_tokens)
t.concordance("Scots")
There are some mistakes but not too many!
Note how NLTK's concordance() method works: the word "Scots" appears with different meanings, sometimes referring to the language and other times to the people. NLTK has a tagging method that identifies the parts of speech in sentences, so if we wanted to focus on the language Scots, we could look for instances of "Scots" being used as a noun, and if we wanted to focus on the people, we could look for instances of "Scots" being used as an adjective. This method wouldn't return perfect results, though; we could improve them by checking for instances of "Scots" used as an adjective directly before the word "dialects", for example.
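As a rough sketch of that idea, NLTK's pos_tag (imported in the first cell of this notebook) labels each token in a tokenised sentence with a part-of-speech tag; here it's applied to an invented sample sentence rather than the collection itself:
# A rough sketch: tag an invented sample sentence and inspect the tags assigned to "Scots"
sample_sentence = "Most Scots he met spoke in broad Scots dialects."
tagged = pos_tag(word_tokenize(sample_sentence))
print([(word, tag) for word, tag in tagged if word == "Scots"])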
We'll wait to dive into this sort of text analysis until a bit later, though!
First, let's get a sense of how much data (in this case, text) we have in the Lewis Grassic Gibbon First Editions (LGG) dataset:
def corpusStatistics(plaintext_corpus_read_lists):
    total_tokens = 0
    total_sents = 0
    total_files = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    print("Estimated total...")
    print("  Tokens in LGG Data:", total_tokens)
    print("  Sentences in LGG Data:", total_sents)
    print("  Files in LGG Data:", total_files)
corpusStatistics(wordlists)
Note that I've printed "Tokens" rather than "words", even though the NLTK method used to count them was .words(). This is because words in NLTK include punctuation marks and digits, in addition to alphabetic words.
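For a quick illustration of the difference, we can filter the sample slice we printed earlier down to alphabetic tokens only:
# Quick illustration: .words() includes punctuation and digits as tokens,
# which .isalpha() filters out
sample = corpus_tokens[100:115]
print("All tokens:       ", list(sample))
print("Alphabetic tokens:", [t for t in sample if t.isalpha()])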
Next, we'll load the inventory (the CSV file) that lists which file name corresponds with which title. When you open the CSV file in Microsoft Excel or a text editor, you can see that there are no column names. The Python library Pandas, which reads CSV files, calls these column names the header. When we use Pandas to read the inventory, we'll create our own header by specifying header=None when reading the CSV file and providing a list of column names.
When Pandas (abbreviated pd when we loaded libraries in the first cell of this notebook) reads a CSV file, it creates a table called a dataframe from that data. Let's see what the Gibbon inventory dataframe looks like:
df = pd.read_csv('data/nls-text-gibbon/gibbon-inventory.csv', header=None, names=['fileid', 'title'])
df
Since we only have 16 files, we displayed the entire dataframe above. With larger dataframes you may wish to use df.head() or df.tail() to print only the first 5 or last 5 rows of your CSV file (both of which will include the column names from the dataframe's header).
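For example, on a larger inventory you might preview it like this:
# Preview only the first 5 and last 5 rows of the inventory dataframe
print(df.head())
print(df.tail())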
Now that we've created a dataframe, if we want to determine the title of a Gibbon work based on its file ID, we can use the following code:
# obtain a list of all file IDs
fileids = list(df["fileid"])
print("List of file IDs:\n", fileids)
print()
# obtain a list of all titles
titles = list(df["title"])
print("List of titles:\n", titles)
print()
# create a dictionary where the keys are file IDs and the values are titles
lgg_dict = dict(zip(fileids, titles))
print("Dictionary of file IDs and titles:\n", lgg_dict)
print()
# pick a file ID by its index number
a_file_id = fileids[10]
# get the title corresponding with the file ID in the dataframe
print("The title for the file ID at index 10:\n", lgg_dict[a_file_id])
print()
NLTK organises the tokens in the wordlists corpus we created by file ID, so it's useful to be able to match the file IDs with their book titles!
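As a small example, we can use the dictionary to pair one file's opening tokens with its book title:
# A small sketch: label a file's first few tokens with its title via lgg_dict
example_id = fileids[0]
print(lgg_dict[example_id], "->", wordlists.words(example_id)[:10])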
Variables that store the word tokens and sentence tokens in our dataset will be useful for future analysis. Let's create those now:
def getWordsSents(plaintext_corpus_read_lists):
    all_words = []
    all_sents = []
    for fileid in plaintext_corpus_read_lists.fileids():
        file_words = plaintext_corpus_read_lists.words(fileid)
        all_words += [str(word) for word in file_words if word.isalpha()]
        file_sents = sent_tokenize(plaintext_corpus_read_lists.raw(fileid))
        all_sents += [str(sent) for sent in file_sents]
    return all_words, all_sents
lgg_words, lgg_sents = getWordsSents(wordlists)
For some types of analysis, such as identifying people and places named in Gibbon's works, maintaining the original capitalization is important. For other types of analysis, such as analysing the vocabulary of Gibbon's works, standardising words by making them lowercase is important. Let's create a lowercase list of words in the LGG dataset:
lgg_words_lower = [word.lower() for word in lgg_words]
print(lgg_words_lower[0:20])
print(lgg_words[0:20])
Perfect!
In addition to tokenisation, stemming is a method of standardising, or "normalising," text. Stemming reduces words to their root form by removing suffixes; for example, the word "troubling" has the stem "troubl". NLTK provides stemmers that use different algorithms to determine what the root of a word is; we'll use two of them, the Porter Stemmer and the Lancaster Stemmer.
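As a quick sketch of how the two can differ, here are their stems for a handful of example words (chosen arbitrarily for illustration):
# A quick sketch: compare Porter and Lancaster stems for a few example words
porter_demo = nltk.PorterStemmer()
lancaster_demo = nltk.LancasterStemmer()
for word in ["troubling", "exploration", "diversity", "published"]:
    print(word, "->", porter_demo.stem(word), "(Porter) |", lancaster_demo.stem(word), "(Lancaster)")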
The stemming algorithms can take several minutes to run on the full dataset, so both are provided below with the second (the Lancaster Stemmer) commented out (its lines begin with #) so it won't run. If you'd like to see how the stemming algorithms differ on the LGG data, uncomment those lines by highlighting them and pressing cmd + /.
First, though, let's see what stems of the LGG data look like with the Porter Stemmer:
# Stem the text (reduce words to their root, whether or not the root is a word itself)
porter = nltk.PorterStemmer()
porter_stemmed = [porter.stem(t) for t in lgg_words_lower]  # lgg_words_lower contains only alphabetic tokens
print(porter_stemmed[500:600])
# lancaster = nltk.LancasterStemmer()
# lancaster_stemmed = [lancaster.stem(t) for t in lgg_words_lower]  # lgg_words_lower contains only alphabetic tokens
# print(lancaster_stemmed[500:600])
Another approach to reducing words to their root is to lemmatise tokens. NLTK's WordNet Lemmatizer reduces a token to its root only if the reduction of the token results in a word that's recognised as an English word in WordNet. Here's what that looks like:
# Lemmatise the text (reduce words to their root ONLY if the root is considered a word in WordNet)
wnl = nltk.WordNetLemmatizer()
lemmatised = [wnl.lemmatize(t) for t in lgg_words_lower]  # lgg_words_lower contains only alphabetic tokens
print(lemmatised[500:600])
Now that we've created some different cuts of the LGG dataset, let's start investigating the frequency of terms as they appear across the dataset. One way to do so is with a frequency distribution, which counts how many times each token appears in the dataset and can be plotted as a line chart. Let's plot the frequency distribution of alphabetic words in the entire LGG dataset, excluding stop words (for example: a, an, the).
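As a toy illustration of the idea (using a made-up token list rather than the LGG data):
# Toy illustration: FreqDist counts how often each token occurs
toy_fdist = FreqDist(["a", "rose", "is", "a", "rose"])
print(toy_fdist.most_common(3))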
Step 1: First, let's create a new list of lowercase words in the LGG dataset that excludes stop words, and calculate its frequency distribution:
to_exclude = set(stopwords.words('english'))
filtered_lower = [w for w in lgg_words_lower if (len(w) > 2 and not w in to_exclude)]
fdist_filtered_lower = FreqDist(filtered_lower)
print("Total words after filtering:", fdist_filtered_lower.N())
print("50 most common words after filtering:", fdist_filtered_lower.most_common(50))
Notice the â that appears as the most common token on its own, but also at the end of other common tokens. This is probably an OCR error (a mistake in the digitisation process). Let's see what other words appear near â to get a sense of its context:
t = Text(corpus_tokens)
t.concordance("â")
Hmmm. â seems to appear in many different ways, so we probably shouldn't remove it entirely. From the sample of text displayed above, it looks as though sometimes it appears in place of ' or -, while other times it seems nothing should be there at all. It's not possible to know without comparing each occurrence of â to the original text manually.
Step 2: Since manual OCR correction is quite a time-consuming effort, for the purposes of a frequency distribution we'll remove the âs for now:
to_exclude = set(stopwords.words('english'))
filtered_lower = []  # rebuild the filtered list from scratch so Step 1's tokens aren't counted twice
for w in lgg_words_lower:
    if 'â' in w:  # 'â' can appear anywhere in a token, not only at the start
        w = w.replace('â', '')
    if (len(w) > 2 and not w in to_exclude):
        filtered_lower += [w]
Step 3: Now, we recalculate the frequency distribution and visualise it using NLTK and Matplotlib:
fdist_filtered_lower = FreqDist(filtered_lower)
plt.figure(figsize = (20, 8))
plt.rc('font', size=12)
number_of_tokens = 20 # Try increasing or decreasing this number to view more or fewer tokens in the visualization
fdist_filtered_lower.plot(number_of_tokens, title='Frequency Distribution for ' + str(number_of_tokens) + ' Most Common Tokens in the Standardised LGG Dataset (excluding stop words)')
Next, we can look at multiple words at a time. Using collocations(), we can see which pairs of words occur together most often across the LGG dataset:
t.collocations()
"thousand years" is a common bigram, or word pair. I wonder what other words appear in similar contexts to the word "years"...
t.similar('years')
Let's pick a single work of Gibbon's and see what its collocations are:
sunset_song = '205174251.txt'
sunset_song_words = wordlists.words(sunset_song)
s = Text(sunset_song_words)
s.collocations()
Many of the most common bigrams are names!
To measure the uniqueness and variety of words in Gibbon's works, we can calculate the lexical diversity of the files in the LGG dataset. Lexical diversity measures the diversity of word choice, and is calculated by dividing the number of unique words in a work by the total number of words in that work. Dividing by the total number of words, rather than simply counting the unique words, normalises the metric so we can compare the diversity of word choice across works of different lengths.
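As a toy example of the calculation (on a made-up sentence rather than Gibbon's text), eight tokens of which seven are unique give a lexical diversity of 0.875:
# Toy example: 8 tokens, 7 unique ("the" repeats) -> lexical diversity of 7/8 = 0.875
toy_tokens = "the quick brown fox jumps over the fence".split()
print(len(set(toy_tokens)) / len(toy_tokens))
Now let's calculate the lexical diversity for each file in the LGG dataset and add this metric to our inventory dataframe: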
# INPUT: wordlists and the fileid of the wordlist to be tokenised
# OUTPUT: a list of word tokens (in String format) for the inputted fileid
def getWords(plaintext_corpus_read_lists, fileid):
    file_words = plaintext_corpus_read_lists.words(fileid)
    str_words = [str(word) for word in file_words]
    return str_words

words_by_file = []
for file in fileids:
    words_by_file += [getWords(wordlists, file)]

# INPUT: a list of words in String format
# OUTPUT: the number of unique words divided by
#         the total words in the inputted list
def lexicalDiversity(str_words_list):
    return len(set(str_words_list))/len(str_words_list)

lexdiv_by_file = []
for words in words_by_file:
    lexdiv_by_file += [lexicalDiversity(words)]

df['lexicaldiversity'] = lexdiv_by_file
df_lexdiv = df.sort_values(by=['lexicaldiversity', 'title'], inplace=False, ascending=True)
df_lexdiv
print("Lexical diversity of the entire LGG dataset: ", lexicalDiversity(lgg_words_lower))
To make it easier to compare the lexical diversity scores, let's visualise them!
sorted_titles = list(df_lexdiv['title'])
sorted_lexdiv = list(df_lexdiv['lexicaldiversity'])
source = pd.DataFrame({
'Title': sorted_titles,
'Lexdiv': sorted_lexdiv
})
alt.Chart(source, title="Lexical Diversity of Gibbon's Works").mark_bar(size=30).encode(
alt.X('Title', axis=alt.Axis(title='Lewis Grassic Gibbon Work'), type='nominal', sort=None), # If sort unspecified, chart will sort x-axis values alphabetically
alt.Y('Lexdiv', axis=alt.Axis(format='%', title='Lexical Diversity')),
alt.Order(
# Sort the segments of the bars by this field
'Lexdiv',
sort='ascending'
)
).configure_axis(
grid=False
).configure_view(
strokeWidth=0
).properties(
width=500
)
This makes it clear that Hanno, or, The future of exploration is by far the most lexically diverse work in the collection. Interestingly, this was the first book Gibbon wrote (it was published in 1928)!
Using information from the digital.nls.uk website, we can quickly find the publication dates of Gibbon's works and add them to our dataframe. Since the LGG dataset consists of first editions, we know that the publication dates of Gibbon's books at the NLS are the initial, original dates.
published = [1932, 1933, 1933, 1934, 1933, 1932, 1934, 1934, 1934, 1931, 1932, 1934, 1932, 1930, 1931, 1928]
df_lexdiv['published'] = published
df_pub = df_lexdiv.sort_values(by=['published', 'title'], inplace=False, ascending=True)
df_pub.head()
Now let's create another visualisation of the lexical diversity of Gibbon's works, but this time we'll sort the data chronologically by published date:
sorted_titles = list(df_pub['title'])
sorted_lexdiv = list(df_pub['lexicaldiversity'])
sorted_published = list(df_pub['published'])
source = pd.DataFrame({
'Title': sorted_titles,
'Lexdiv': sorted_lexdiv,
'Published': sorted_published
})
alt.Chart(source, title="Lexical Diversity of Gibbon's Works").mark_bar(size=30).encode(
alt.X('Title', axis=alt.Axis(title='Title of Lewis Grassic Gibbon Work'), type='nominal', sort=None), # If sort unspecified, chart will sort x-axis values alphabetically
alt.Y('Lexdiv', axis=alt.Axis(format='%', title='Lexical Diversity')),
alt.Order(
# Sort the segments of the bars by this field
'Lexdiv',
sort='descending'
),
color=alt.Color('Published:O', legend = alt.Legend(title='Date Published')),
tooltip='Title:N'
).configure_axis(
grid=False,
labelFontSize=12,
titleFontSize=12,
labelAngle=-45
).configure_title(
fontSize=14,
).configure_view(
strokeWidth=0
).properties(
width=500
)
Since we don't know the months in which Gibbon's works were published, let's calculate Gibbon's yearly lexical diversity to get a better sense of the trend over time.
Step 1: First, we need to group Gibbon's works by the year in which they were published:
pub_yr = {1928: [], 1930: [], 1931: [], 1932: [], 1933: [], 1934: []}
for index, row in df_pub.iterrows():
    pub_yr[row['published']] += [row['fileid']]
print(pub_yr)
Perfect!
Step 2: Now we'll calculate the average lexical diversity of the works published in each year from 1928 through 1934 (excluding 1929, since the LGG dataset doesn't include any works published that year):
lexdiv_by_year = []
for key, value in pub_yr.items():
    lexdiv_by_file = []
    for fileid in value:
        file_words = wordlists.words(fileid)
        str_words = [str(w.lower()) for w in file_words if w.isalpha()]
        lexdiv_by_file += [lexicalDiversity(str_words)]
    lexdiv_by_year += [sum(lexdiv_by_file)/len(lexdiv_by_file)]
print(lexdiv_by_year)
pub_years = [1928, 1930, 1931, 1932, 1933, 1934]
pub_lex = dict(zip(pub_years, lexdiv_by_year))
pub_lex
source = pd.DataFrame({
'Year': pub_years,
'Average Lexical Diversity': lexdiv_by_year
})
alt.Chart(source, title="Average Yearly Lexical Diversity of Gibbon First Editions").mark_bar(size=60).encode(
alt.X('Year', axis=alt.Axis(title='Year of Publication'), type='ordinal'),
alt.Y('Average Lexical Diversity', axis=alt.Axis(format='%', title='Average Lexical Diversity'))
).configure_axis(
grid=False,
labelFontSize=12,
titleFontSize=12,
labelAngle=0
).configure_title(
fontSize=14,
).configure_view(
strokeWidth=0
).properties(
width=365
)
So Gibbon's lexical diversity does decrease over time, excepting a small increase in the last year he published, 1934!