Created in July-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern
The dataset consists of 468 official publications from British India, mainly from 1850-1950, that report on public health, disease mapping, vaccination efforts, veterinary experiments, and other medical topics. The publications are a subset of a larger collection of 40,000 volumes that report on the administration of British India. The Wellcome Trust funded the digitisation of the medical history volumes in this dataset.
Import libraries to use for cleaning, summarising and exploring the data:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context
# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re
# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt
# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('tagsets') # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt
To explore the text in the A Medical History of British India collection, we'll mainly use the Natural Language Toolkit (NLTK), a library written for the programming language Python.
The nls-text-indiaPapers folder (downloadable as Just the text data from the website at the top of this notebook) contains TXT files of digitised text with numerical names, as well as a CSV inventory file and a TXT ReadMe file. Load only the TXT files of digitised text and tokenise the text (which splits a string into separate words and punctuation):
corpus_folder = 'data/nls-text-indiaPapers/'
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')  # load only the numerically named TXT files
corpus_tokens = wordlists.words()
print(corpus_tokens[:10])
Note: If you'd like to see how to specify a single TXT file to load as data, check out the Jupyter Notebook for the Britain and UK Handbooks!
It's hard to get a sense of how accurately the text has been digitised from this list of 10 tokens, so let's look at one of these words in context. To see phrases in which "India" is used, we can use the concordance() method:
t = Text(corpus_tokens)
t.concordance('India', lines=20) # by default NLTK's concordance method displays 25 lines
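As a related sketch (not in the original cells shown here), NLTK's Text object also has a similar() method, which lists words that appear in similar contexts and gives another quick way to eyeball the digitised text:
# Words that appear in contexts similar to 'India' - another quick quality check
t.similar('India')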
The A Medical History of British India (MHBI) dataset has been digitised and then manually corrected for errors in the digitisation process, so we can be pretty confident in the quality of the text for this dataset.
Let's find out just how much text and just how many files we're working with:
def corpusStatistics(plaintext_corpus_read_lists):
    total_tokens = 0
    total_sents = 0
    total_files = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    print("Total...")
    print(" Tokens in MHBI Data:", total_tokens)
    print(" Sentences in MHBI Data:", total_sents)
    print(" Files in MHBI Data:", total_files)
corpusStatistics(wordlists)
Note that I've printed Tokens rather than Words, even though the NLTK method used to count them was .words(). This is because words in NLTK include punctuation marks and digits, in addition to alphabetic words.
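To make that concrete, here's a quick sketch that pulls out the non-alphabetic entries among the first few hundred tokens:
# A quick check: punctuation and digits still count as tokens
non_alpha = [t for t in corpus_tokens[:200] if not str(t).isalpha()]
print(non_alpha[:10])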
The fileids are the names of the files in the data's source folder:
fileids = list(wordlists.fileids())
fileids[0:3]
We can use the inventory CSV file from the source folder to match the titles of the papers to the corresponding fileid:
df = pd.read_csv('data/nls-text-indiaPapers/indiaPapers-inventory.csv', header=None, names=['fileid', 'title'])
df.head() # prints the first 5 rows (df.tail() prints the last 5 rows)
We can also create a list of the titles from the dataframe column:
titles = list(df['title'])
titles[0:3] # Display the first three titles (from index 0 up to but not including index 3)
Variables that store the characters, words, and sentences in our dataset could be useful for exploratory analysis. Uncomment the lines below (highlight them and press CTRL + / or CMD + /) if you'd like to use those variables for your own analysis:
# def getCharsWordsSents(plaintext_corpus_read_lists, fileids):
#     all_chars = []
#     chars_by_file = dict.fromkeys(fileids)
#     all_words = []
#     words_by_file = dict.fromkeys(fileids)
#     all_words_lower = []
#     words_lower_by_file = dict.fromkeys(fileids)
#     all_sents = []
#     sents_by_file = dict.fromkeys(fileids)
#     for fileid in plaintext_corpus_read_lists.fileids():
#         file_chars = plaintext_corpus_read_lists.raw(fileid)
#         all_chars += [str(char).lower() for char in file_chars]
#         chars_by_file[fileid] = all_chars
#         file_words = plaintext_corpus_read_lists.words(fileid)
#         all_words_lower += [str(word).lower() for word in file_words if word.isalpha()]
#         words_lower_by_file[fileid] = all_words_lower
#         all_words += [str(word) for word in file_words if word.isalpha()]
#         words_by_file[fileid] = all_words
#         file_sents = sent_tokenize(plaintext_corpus_read_lists.raw(fileid))  # plaintext_corpus_read_lists.sents(fileid)
#         all_sents += [str(sent) for sent in file_sents]
#         sents_by_file[fileid] = all_sents
#     return all_chars, chars_by_file, all_words, words_by_file, all_words_lower, words_lower_by_file, all_sents, sents_by_file
# mhbi_chars, mhbi_file_chars, mhbi_words, mhbi_file_words, mhbi_words_lower, mhbi_file_lower_words, mhbi_sents, mhbi_file_sents = getCharsWordsSents(wordlists, fileids)
To make sure the function worked as expected, you can run some quick tests with the output lists and dictionaries:
# print(mhbi_file_chars[fileids[100]][:10])
# print(mhbi_file_words[fileids[355]][30:40])
# print(mhbi_file_sents[fileids[-1]][-20:-10])
# assert(len(mhbi_file_chars) == len(fileids)) # nothing prints if passes, error prints if doesn't pass
# print(mhbi_chars[:100])
# print(mhbi_words[6100:6120])
# print(mhbi_sents[-5:])
# assert(len(mhbi_words_lower) == len(mhbi_words)) # nothing prints if passes, error prints if doesn't pass
Looking good!
Since the OCR has already been manually cleaned, we'll focus this section on identifying the roots of words and the parts of speech in sentences, rather than getting a sense of how many mistakes were made in the OCR process.
First let's create lists of strings from the NLTK tokens that we can use in future analysis:
str_tokens = [str(word) for word in corpus_tokens]
assert(type(str_tokens[0]) == str) # quick test to make sure the output is as expected
print(str_tokens[0:10])
# Lowercase text
lower_str_tokens = [t.lower() for t in str_tokens]
print(lower_str_tokens[-10:])
# Exclude stop words (e.g. the, a, is) - note that the input text must be lowercased!
eng_stopwords = set(stopwords.words('english'))
no_stopwords = [t for t in lower_str_tokens if not t in eng_stopwords]
print(no_stopwords[500:510])
assert(len(no_stopwords) < len(str_tokens))
# Alphabetic tokens only (exclude digits and punctuation)
alpha_tokens = [t for t in str_tokens if t.isalpha()]
alpha_tokens_lower = [t for t in lower_str_tokens if t.isalpha()]
print(alpha_tokens[1000:1010])
assert(len(alpha_tokens_lower) == len(alpha_tokens))
Next, we'll stem the tokens, reducing each token to its root. NLTK provides more than one stemmer, and different stemmers use different algorithms to determine what the root of a word is. Uncomment the code below if you'd like to see a sample of what two of them, the Porter and Lancaster stemmers, produce:
# Stem the text (reduce words to their root, whether or not the root is a word itself)
# porter = nltk.PorterStemmer()
# porter_stemmed = [porter.stem(t) for t in alpha_tokens_lower]
# print(porter_stemmed[500:600])
# lancaster = nltk.LancasterStemmer()
# lancaster_stemmed = [lancaster.stem(t) for t in alpha_tokens_lower]
# print(lancaster_stemmed[500:600])
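As a small illustrative aside (the sample words below are my own choice, not from the original notebook), you can compare the two stemmers directly on a handful of words to see how their algorithms differ:
# Illustrative comparison of the Porter and Lancaster stemmers on a few sample words
sample_words = ['vaccination', 'hospitals', 'diseases', 'sanitary']
print([nltk.PorterStemmer().stem(w) for w in sample_words])
print([nltk.LancasterStemmer().stem(w) for w in sample_words])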
Another approach to reducing words to their root is to lemmatise tokens. NLTK's WordNet Lemmatizer reduces a token to its root only if the reduction results in a word that's recognised as an English word in WordNet. We'll run the lemmatiser here (rather than leaving it commented out) because the lemmatised tokens are used again later in this notebook:
# Lemmatise the text (reduce words to their root ONLY if the root is considered a word in WordNet)
wnl = nltk.WordNetLemmatizer()
lemmatised = [wnl.lemmatize(t) for t in alpha_tokens_lower]  # only include alphabetic tokens
print(lemmatised[500:600])
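We also said this section would look at parts of speech. As a minimal sketch (not part of the original cells shown here), NLTK's pos_tag function, imported above, labels a list of tokens with Penn Treebank part-of-speech tags:
# A minimal sketch: part-of-speech tagging a small slice of the alphabetic tokens
sample_tags = pos_tag(alpha_tokens[1000:1010])
print(sample_tags)
# nltk.help.upenn_tagset('NN')  # uncomment to look up what a tag abbreviation means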
Now that we've created some different cuts of the MHBI dataset, let's start investigating the frequency of terms as they appear across the dataset. One way to do so is with a frequency distribution, which counts how many times each token appears in the dataset; NLTK can plot the result as a line chart.
First let's visualise a frequency distribution for common alphabetic tokens in the dataset (tokens composed of letters, not punctuation or numbers) EXCEPT stop words (e.g. the, and, a):
# Filter one- and two-letter words, stop words, and a few other very common words out of the list of alphabetic tokens
to_exclude = ["per", "two", "one", "also"]
filtered_tokens = [t for t in no_stopwords if (len(t) > 2 and not t in to_exclude)]
# Calculate the frequency distribution for the filtered list of tokens
fdist_ft = FreqDist(filtered_tokens)
print("Total tokens in filtered list:", fdist_ft.N())
# Visualise the frequency distribution for a select number of tokens
plt.figure(figsize = (14, 8)) # customise the width and height of the plot
plt.rc('font', size=12) # customise the font size of the title, axes names, and axes labels
fdist_ft.plot(20, title='Frequency Distribution of the 20 Most Common Words in the Medical History of British India Dataset (excluding stop words, one- and two-letter words, and a few other common words)')
The medical focus is clear from the top 20 words in the MHBI papers, which include cases, hospital, disease, veterinary, vaccination, and plague. Also, the frequency distribution suggests the people writing the papers made an effort to summarise what was going on, since the top three tokens, by far, are total, year, and number (perhaps summarising cases by district?).
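A quick check of the counts behind that reading, as a small sketch using the frequency distribution we just built:
# Look up the raw counts for a few of the words discussed above
for word in ['total', 'year', 'number', 'cases', 'district']:
    print(word, fdist_ft[word])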
Another way to summarise the MHBI dataset is to look at the uniqueness and variety of word usage. We can obtain the vocabulary of the text by creating a set of unique words (alphabetic tokens) that occur in the dataset, as well as creating a set of unique lemmatised words that occur in the dataset.
# Remove duplicate words from the text (obtain the vocabulary of the text)
t_vocab = set(alpha_tokens)
t_vocab_lower = set(alpha_tokens_lower)
lemma_vocab = set(lemmatised)
print("Unique tokens:", len(t_vocab))
print("Unique lowercase tokens:", len(t_vocab_lower))
print("Unique lemmatised (lowercase) tokens:", len(lemma_vocab))
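One common, if rough, summary of variety is lexical diversity, the ratio of unique words to total words. A quick sketch (not in the original cells shown here):
# Lexical diversity: unique lowercase words as a proportion of all lowercase words
lexical_diversity = len(t_vocab_lower) / len(alpha_tokens_lower)
print(round(lexical_diversity, 4))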
We can create a data visualisation that illustrates where specific words occur within the MHBI dataset. This is called a Lexical Dispersion Plot. We'll pick some terms (the list of targets) from the list of the 20 most common words in the dataset:
fdist_ft.most_common(20)
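As a sketch of what that plot could look like (the targets list below is an illustrative choice drawn from the common words, not necessarily the original notebook's selection), we can use the dispersion_plot function imported earlier:
# A sketch of a lexical dispersion plot for a few illustrative target terms
targets = ['plague', 'cholera', 'vaccination', 'hospital', 'veterinary']
plt.figure(figsize=(14, 6))
displt(t, targets)
plt.show()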
Let's determine the top 20 most common words for each paper (file) in the MHBI dataset:
fileids = list(df['fileid'])
id_to_title = dict(zip(fileids, titles))
common_words = {}
to_exclude = list(set(stopwords.words('english'))) + ["per", "two", "one", "also"]
for file in fileids:
    tokens = wordlists.words(file)
    # Filter non-alphabetic words and stop words out of the list of tokens
    tokens_lower = [t.lower() for t in tokens if t.isalpha()]
    filtered_tokens = [t for t in tokens_lower if not t in to_exclude]
    fdist = FreqDist(filtered_tokens)
    title = id_to_title[file]
    words = list(fdist.most_common(20))
    common_words[title] = words
print(len(common_words))
Now you can use the title of a paper to find the 20 most common words in that paper:
paper = id_to_title[fileids[10]]
print(common_words[paper])
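If you don't know a title exactly, a quick keyword search over the titles can help you find candidate papers (the keyword 'plague' below is just an example):
# Find paper titles containing a keyword, then look up their most common words
matching_titles = [title for title in common_words if 'plague' in str(title).lower()]
print(matching_titles[:3])
# print(common_words[matching_titles[0]])  # uncomment if any titles matched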
# t.concordance("cholera")
Consider comparing papers from before and after the 1857 rebellion, following which British East India Company rule in India was taken over by the British Crown. How do the word choice and sentiment of the language differ before and after?
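As a rough, hedged sketch of how you might split the corpus for that comparison (it assumes each title contains a four-digit year, which may not hold for every paper):
# Split paper titles into pre- and post-1858 groups by a four-digit year found in the title
year_pattern = re.compile(r'\b1[89]\d{2}\b')
def title_year(title):
    match = year_pattern.search(str(title))
    return int(match.group()) if match else None
earlier_titles = [t for t in titles if title_year(t) is not None and title_year(t) < 1858]
later_titles = [t for t in titles if title_year(t) is not None and title_year(t) >= 1858]
print(len(earlier_titles), "papers before 1858;", len(later_titles), "papers from 1858 onwards")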
# t.concordance("Native")
# t.concordance("lunatic")
# t.concordance("mental")
Consider lock(ed) hospitals and escapes from them, and prostitution permitted in army barracks.
# t.concordance("women")
# t.concordance("lock")
# t.concordance("vaccination")