Exploring A Medical History of British India

Created in July-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern

About the A Medical History of British India Dataset

The dataset consists of 468 official publications from British India, mainly from 1850-1950, that report on public health, disease mapping, vaccination efforts, veterinary experiments, and other medical topics. The publications are a subset of a larger collection of 40,000 volumes that report on the administration of British India. The Wellcome Trust funded the digitisation of the medical history volumes in this dataset.

Table of Contents

  0. Preparation
  1. Data Cleaning and Standardisation
  2. Summary Statistics
  3. Exploratory Analysis


Citations:

  • Alex, Beatrice and Llewellyn, Clare. (2020) Library Carpentry: Text & Data Mining. Centre for Data, Culture & Society, University of Edinburgh. http://librarycarpentry.org/lc-tdm/.
  • Bird, Steven, Klein, Ewan and Loper, Edward. (2019) Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. O'Reilly Media. ISBN 978-0-596-51649-9. https://www.nltk.org/book/.

0. Preparation

Import libraries to use for cleaning, summarising and exploring the data:

In [27]:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re

# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt

# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
from nltk.tag import pos_tag
nltk.download('punkt')      # tokenisation models used by the word and sentence tokenisers
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # WordNet, used by the lemmatiser
nltk.download('tagsets')    # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt

To explore the text in the A Medical History of British India collection, we'll mainly use the Natural Language Toolkit (NLTK), a library written for the programming language Python.

The nls-text-indiaPapers folder (downloadable as Just the text data from the website at the top of this notebook) contains TXT files of digitised text with numerical names, as well as a CSV inventory file and a TXT ReadMe file. Load only the TXT files of digitised text and tokenise the text (which splits a string into separate words and punctuation):

In [3]:
corpus_folder = 'data/nls-text-indiaPapers/'
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')
corpus_tokens = wordlists.words()
corpus_tokens[:10]  # preview the first 10 tokens
['No', '.', '1111', '(', 'Sanitary', '),', 'dated', 'Ootacamund', ',', 'the']

Note: If you'd like to see how to specify a single TXT file to load as data, check out the Jupyter Notebook for the Britain and UK Handbooks!

It's hard to get a sense of how accurately the text has been digitised from this list of 10 tokens, so let's look at one of these words in context. To see phrases in which "India" is used, we can use the concordance() method:

In [4]:
t = Text(corpus_tokens)
t.concordance('India', lines=20)  # by default NLTK's concordance method displays 25 lines
Displaying 20 of 16495 matches:
ffg . Secretary to the Government of India . Resolution of Government of India 
 India . Resolution of Government of India No . 1 - 137 , dated 5th March 1875 
rch 1875 . Letter from Government of India No . 486 , dated 5th September 1876 
ember 1876 . Letter to Government of India No . 1063 , dated 26th ditto . REFER
ffg . Secretary to the Government of India , Home Department . REFERRING to par
 to paragraph 8 of the Government of India ' s Resolu - tion No . 1 - 136 , dat
inion expressed by the Government of India that any measures of segragation and
filth with which all the villages in India are surrounded is quite sufficient t
the disease in Rajputana and Central India are in the hands of the Presidency S
ffg . Secretary to the Government of India , Home Dept . IN continuation of my 
 the Resolution of the Government of India , Home Department ( Medical ), No 1 
d by the orders of the Government of India dated 5th March 1876 . Report on lep
e to the orders of the Government of India , contained in paragraph 8 of Resolu
ries propounded by the Government of India . 5 . In regard to the extent to whi
 of leprosy in Europeans resident in India which have come to his knowledge - o
 case of an European lady , who left India in 1875 and suffered with symptoms o
ffg . Secretary to the Government of India , Home Dept . IN continuation of let
the information of the Government of India , copy of a letter from the Officiat
rnment of the Punjab . Government of India No . 141 , dated 5th March 1875 , pa
ffg . Secretary to the Government of India , Home Department . IN continuation 

The A Medical History of British India (MHBI) dataset has been digitised and then manually corrected for errors in the digitisation process, so we can be pretty confident in the quality of the text for this dataset.

Let's find out just how much text and just how many files we're working with:

In [5]:
def corpusStatistics(plaintext_corpus_read_lists):
    total_tokens = 0
    total_sents = 0
    total_files = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    print("  Tokens in MHBI Data:", total_tokens)
    print("  Sentences in MHBI Data:", total_sents)
    print("  Files in MHBI Data:", total_files)

corpusStatistics(wordlists)

  Tokens in MHBI Data: 28333479
  Sentences in MHBI Data: 1671768
  Files in MHBI Data: 468

Note that I've printed Tokens rather than Words, even though the NLTK method used to count them was .words(). This is because words in NLTK include punctuation marks and digits, in addition to alphabetic words.
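To see why the token count exceeds a pure word count, here's a minimal sketch. It uses a simple regular expression rather than NLTK's tokeniser, so it's only an approximation of how .words() behaves:

```python
import re

def simple_tokenise(text):
    # Match word-like runs OR runs of punctuation, roughly mimicking
    # how NLTK treats punctuation and digits as tokens in their own right
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = simple_tokenise("No. 1111 (Sanitary), dated Ootacamund.")
print(tokens)
# ['No', '.', '1111', '(', 'Sanitary', '),', 'dated', 'Ootacamund', '.']
print(len(tokens))                              # 9 tokens in total
print(len([t for t in tokens if t.isalpha()]))  # only 4 are alphabetic words
```

Of the nine tokens, only four are alphabetic words; the rest are punctuation and digits, which is exactly why the notebook reports "Tokens" rather than "Words".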

The fileids are the names of the files in the data's source folder:

In [6]:
fileids = list(wordlists.fileids())
fileids[:3]  # preview the first three file names
['74457530.txt', '74457800.txt', '74458285.txt']

We can use the inventory CSV file from the source folder to match the titles of the papers to the corresponding fileid:

In [7]:
df = pd.read_csv('data/nls-text-indiaPapers/indiaPapers-inventory.csv', header=None, names=['fileid', 'title'])
df.head()  # prints the first 5 rows (df.tail() prints the last 5 rows)
fileid title
0 74457530.txt Distribution and causation of leprosy in Briti...
1 74457800.txt Report of an outbreak of cholera in Suhutwar, ...
2 74458285.txt Report of an investigation into the causes of ...
3 74458388.txt Account of plague administration in the Bombay...
4 74458575.txt Inquiry into the circumstances attending an ou...

We can also create a list of the titles from the dataframe column:

In [8]:
titles = list(df['title'])
titles[0:3]   # Display the first three titles (from index 0 up to but not including index 3)
['Distribution and causation of leprosy in British India 1875 - IP/HA.2',
 'Report of an outbreak of cholera in Suhutwar, Bulliah sub-division - IP/30/PI.2',
 'Report of an investigation into the causes of the diseases known in Assam as Kála-Azár and Beri-Beri - IP/3/MB.5']

Variables that store the characters, words, and sentences in our dataset could be useful for exploratory analysis. Uncomment the lines below (highlight them and press CTRL + / or CMD + /) if you'd like to use those variables for your own analysis:

In [9]:
# def getCharsWordsSents(plaintext_corpus_read_lists, fileids):
#     all_chars = []
#     chars_by_file = dict.fromkeys(fileids)
#     all_words = []
#     words_by_file = dict.fromkeys(fileids)
#     all_words_lower = []
#     words_lower_by_file = dict.fromkeys(fileids)
#     all_sents = []
#     sents_by_file = dict.fromkeys(fileids)
#     for fileid in plaintext_corpus_read_lists.fileids():
#         file_chars = plaintext_corpus_read_lists.raw(fileid)
#         all_chars += [str(char).lower() for char in file_chars]
#         chars_by_file[fileid] = all_chars
#         file_words = plaintext_corpus_read_lists.words(fileid)
#         all_words_lower += [str(word).lower() for word in file_words if word.isalpha()]
#         words_lower_by_file[fileid] = all_words_lower
#         all_words += [str(word) for word in file_words  if word.isalpha()]
#         words_by_file[fileid] = all_words
#         file_sents = sent_tokenize(plaintext_corpus_read_lists.raw(fileid))  #plaintext_corpus_read_lists.sents(fileid)
#         all_sents += [str(sent) for sent in file_sents]
#         sents_by_file[fileid] = all_sents
#     return all_chars, chars_by_file, all_words, words_by_file, all_words_lower, words_lower_by_file, all_sents, sents_by_file
# mhbi_chars, mhbi_file_chars, mhbi_words, mhbi_file_words, mhbi_words_lower, mhbi_file_lower_words, mhbi_sents, mhbi_file_sents = getCharsWordsSents(wordlists, fileids)

To make sure the function worked as expected, you can run some quick tests with the output lists and dictionaries:

In [10]:
# print(mhbi_file_chars[fileids[100]][:10])
# print(mhbi_file_words[fileids[355]][30:40])
# print(mhbi_file_sents[fileids[-1]][-20:-10])
# assert(len(mhbi_file_chars) == len(fileids))  # nothing prints if passes, error prints if doesn't pass
In [11]:
# print(mhbi_chars[:100])
# print(mhbi_words[6100:6120])
# print(mhbi_sents[-5:])
# assert(len(mhbi_words_lower) == len(mhbi_words))  # nothing prints if passes, error prints if doesn't pass

Looking good!

1. Data Cleaning and Standardisation

Since the OCR has already been manually cleaned, we'll focus this section on identifying the roots of words and the parts of speech in sentences, rather than getting a sense of how many mistakes were made in the OCR process.

1.1 Tokenisation

First let's create lists of strings from the NLTK tokens that we can use in future analysis:

In [12]:
str_tokens = [str(word) for word in corpus_tokens]
assert(type(str_tokens[0]) == str)  # quick test to make sure the output is as expected

# Lowercase text
lower_str_tokens = [t.lower() for t in str_tokens]

# Exclude stop words (e.g. the, a, is) - note that the input text must be lowercased!
eng_stopwords = set(stopwords.words('english'))
no_stopwords = [t for t in lower_str_tokens if t not in eng_stopwords]
assert(len(no_stopwords) < len(str_tokens))

# Alphabetic tokens only (exclude digits and punctuation)
alpha_tokens = [t for t in str_tokens if t.isalpha()]
alpha_tokens_lower = [t for t in lower_str_tokens if t.isalpha()]
assert(len(alpha_tokens_lower) == len(alpha_tokens))
['No', '.', '1111', '(', 'Sanitary', '),', 'dated', 'Ootacamund', ',', 'the']
['g', '.', 'b', '.', 'c', '.', 'p', '.', 'o', '.']
[',', '424', '705', '491', '214', '8', '11', '5', 'surat', '607']
['the', 'twenty', 'cases', 'mentioned', 'above', 'seventeen', 'are', 'said', 'to', 'have']

1.2 Reducing Words to Root Forms

Next, we'll stem the tokens, reducing each one to its root. NLTK provides several stemmers that use different algorithms to determine what the root of a word is; here we'll try two of them, the Porter and Lancaster stemmers. Here's a sample of what their output looks like.

Note: This code can take several minutes to run (and a lot of memory, which is limited if you're using Binder), which is why the stemmer code has been commented out. You can uncomment the code so that it runs by removing the '#' before each line (highlight all lines and then press [CMD or CTRL] + /).
In [13]:
# Stem the text (reduce words to their root, whether or not the root is a word itself)

# porter = nltk.PorterStemmer()
# porter_stemmed = [porter.stem(t) for t in alpha_tokens_lower]
# print(porter_stemmed[500:600])

# lancaster = nltk.LancasterStemmer()
# lancaster_stemmed = [lancaster.stem(t) for t in alpha_tokens_lower]
# print(lancaster_stemmed[500:600])
['the', 'govern', 'of', 'india', 'that', 'ani', 'measur', 'of', 'segrag', 'and', 'medic', 'treatment', 'of', 'leper', 'throughout', 'the', 'countri', 'would', 'be', 'impractic', 'as', 'a', 'state', 'measur', 'but', 'i', 'do', 'hold', 'that', 'the', 'improv', 'of', 'the', 'hygien', 'condit', 'under', 'which', 'the', 'mass', 'of', 'the', 'peopl', 'live', 'is', 'the', 'onli', 'sure', 'method', 'of', 'stamp', 'out', 'leprosi', 'or', 'ani', 'similar', 'diseas', 'the', 'filth', 'with', 'which', 'all', 'the', 'villag', 'in', 'india', 'are', 'surround', 'is', 'quit', 'suffici', 'to', 'prevent', 'ani', 'hope', 'of', 'success', 'in', 'combat', 'the', 'diseas', 'which', 'it', 'is', 'not', 'difficult', 'to', 'forese', 'will', 'prevail', 'until', 'such', 'an', 'objection', 'state', 'of', 'matter', 'is', 'alter', 'with', 'these']

Another approach to reducing words to their root is to lemmatise tokens. NLTK's WordNet Lemmatizer reduces a token to its root only if the reduction of the token results in a word that's recognised as an English word in WordNet. Uncomment the code below if you'd like to see what that looks like:

In [14]:
# Lemmatise the text (reduce words to their root ONLY if the root is considered a word in WordNet)

# wnl = nltk.WordNetLemmatizer()
# lemmatised = [wnl.lemmatize(t) for t in alpha_tokens_lower]  # only include alphabetic tokens
# print(lemmatised[500:600])
['the', 'government', 'of', 'india', 'that', 'any', 'measure', 'of', 'segragation', 'and', 'medical', 'treatment', 'of', 'leper', 'throughout', 'the', 'country', 'would', 'be', 'impracticable', 'a', 'a', 'state', 'measure', 'but', 'i', 'do', 'hold', 'that', 'the', 'improvement', 'of', 'the', 'hygienic', 'condition', 'under', 'which', 'the', 'mass', 'of', 'the', 'people', 'live', 'is', 'the', 'only', 'sure', 'method', 'of', 'stamping', 'out', 'leprosy', 'or', 'any', 'similar', 'disease', 'the', 'filth', 'with', 'which', 'all', 'the', 'village', 'in', 'india', 'are', 'surrounded', 'is', 'quite', 'sufficient', 'to', 'prevent', 'any', 'hope', 'of', 'success', 'in', 'combating', 'the', 'disease', 'which', 'it', 'is', 'not', 'difficult', 'to', 'foresee', 'will', 'prevail', 'until', 'such', 'an', 'objectionable', 'state', 'of', 'matter', 'is', 'altered', 'with', 'these']

2. Summary Statistics

2.1 Frequencies and Sizes

Now that we've created some different cuts of the MHBI dataset, let's start investigating the frequency of terms as they appear across the dataset. One way to do so is with a frequency distribution, which counts how many times each token appears in the dataset (NLTK can plot this as a line chart).

First let's visualise a frequency distribution for the most common tokens in the dataset, EXCLUDING stop words (e.g. the, and, a) and words of fewer than three letters. Note that because this list is built from no_stopwords rather than from the alphabetic-only tokens, non-alphabetic tokens (such as runs of full stops) can still appear:

In [16]:
# Filter out one-letter words, two-letter words, and a few common filler words
# (stop words were already removed when no_stopwords was created)
to_exclude = ["per", "two", "one", "also"]
filtered_tokens = [t for t in no_stopwords if (len(t) > 2 and t not in to_exclude)]
In [17]:
# Calculate the frequency distribution for the filtered list of tokens
fdist_ft = FreqDist(filtered_tokens)
print("Total tokens in filtered list:", fdist_ft.N())
Total tokens in filtered list: 11670424
In [18]:
# Visualise the frequency distribution for a select number of tokens
plt.figure(figsize = (14, 8))                # customise the width and height of the plot
plt.rc('font', size=12)                       # customise the font size of the title, axes names, and axes labels
fdist_ft.plot(20, title='Frequency Distribution of the 20 Most Common Words in the Medical History of British India Dataset (excluding stop words, 1-letter and 2-letter words)')
<matplotlib.axes._subplots.AxesSubplot at 0x1c69f2898>

The medical focus is clear from the top 20 words in the MHBI papers, which include cases, hospital, disease, veterinary, vaccination, and plague. The frequency distribution also suggests that the papers' authors made an effort to summarise their findings, since the top three word tokens, by far, are total, year, and number (perhaps summarising cases by district?).

Try It! Create a frequency distribution for stemmed words, or stems. What differences do you see in the two frequency distributions?

2.2 Uniqueness and Variety

Another way to summarise the MHBI dataset is to look at the uniqueness and variety of word usage. We can obtain the vocabulary of the text by creating a set of unique words (alphabetic tokens) that occur in the dataset, as well as creating a set of unique lemmatised words that occur in the dataset.

In [20]:
# Remove duplicate words from the text (obtain the vocabulary of the text)
# Note: lemma_vocab requires the lemmatisation cell in section 1.2 to have been uncommented and run
t_vocab = set(alpha_tokens)
t_vocab_lower = set(alpha_tokens_lower)
lemma_vocab = set(lemmatised)
print("Unique tokens:", len(t_vocab))
print("Unique lowercase tokens:", len(t_vocab_lower))
print("Unique lemmatised (lowercase) tokens:", len(lemma_vocab))
Unique tokens: 186247
Unique lowercase tokens: 151872
Unique lemmatised (lowercase) tokens: 144837
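One compact way to express this variety (not computed in the original notebook, so consider it an optional extra) is lexical diversity, the ratio of unique tokens to total tokens:

```python
def lexical_diversity(tokens):
    # Ratio of unique tokens (types) to total tokens:
    # values closer to 1.0 indicate a more varied vocabulary
    return len(set(tokens)) / len(tokens)

sample = ['the', 'plague', 'spread', 'and', 'the', 'plague', 'returned']
print(round(lexical_diversity(sample), 3))  # 0.714
```

Applied to alpha_tokens, this would give the overall lexical diversity of the MHBI dataset; large corpora typically score quite low because common words repeat so often.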

We can create a data visualisation that illustrates when specific words are used within the MHBI dataset. This is called a Lexical Dispersion Plot. We'll pick some terms (the list of targets) from the list of the most common 20 words in the dataset:

In [21]:
fdist_ft.most_common(20)  # the 20 most common filtered tokens and their counts
[('...', 1634899),
 ('total', 94009),
 ('year', 82764),
 ('number', 63297),
 ('cases', 39875),
 ('......', 39757),
 ('district', 30462),
 ('may', 30145),
 ('report', 29538),
 ('hospital', 28273),
 ('government', 26231),
 ('disease', 25705),
 ('veterinary', 25575),
 ('average', 25560),
 ('years', 24749),
 ('vaccination', 24190),
 ('plague', 23489),
 ('table', 21813),
 ('males', 21500),
 ('females', 21153)]
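The plot itself can be drawn with the dispersion_plot function imported earlier as displt, e.g. displt(corpus_tokens, ['total', 'plague', 'vaccination']). Under the hood, a dispersion plot simply records, for each target word, the positions at which it occurs in the token stream. Here's a minimal pure-Python sketch of that step (dispersion_offsets is an illustrative helper, not an NLTK function):

```python
def dispersion_offsets(tokens, targets):
    # For each target word, collect the token positions where it occurs;
    # these offsets are what a lexical dispersion plot draws as tick marks
    offsets = {t: [] for t in targets}
    for i, tok in enumerate(tokens):
        if tok.lower() in offsets:
            offsets[tok.lower()].append(i)
    return offsets

sample = ['plague', 'in', 'the', 'district', 'plague', 'cases', 'vaccination']
print(dispersion_offsets(sample, ['plague', 'vaccination']))
# {'plague': [0, 4], 'vaccination': [6]}
```

Note that the targets are assumed to be lowercase; tokens are lowercased before matching so capitalised occurrences are counted too.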

3. Exploratory Analysis

Let's determine the top 20 most common words for each paper (file) in the MHBI dataset:

In [23]:
fileids = list(df['fileid'])
id_to_title = dict(zip(fileids,titles))
In [24]:
common_words = {}
eng_stopwords = set(stopwords.words('english'))
to_exclude = eng_stopwords | {"per", "two", "one", "also"}  # built once, outside the loop
for file in fileids:
    tokens = wordlists.words(file)

    # Filter non-alphabetic words and stop words out of the list of tokens
    tokens_lower = [t.lower() for t in tokens if t.isalpha()]
    filtered_tokens = [t for t in tokens_lower if t not in to_exclude]
    fdist = FreqDist(filtered_tokens)
    title = id_to_title[file]
    common_words[title] = fdist.most_common(20)


Now you can use the title of a paper to find the 20 most common words in that paper:

In [26]:
paper = id_to_title[fileids[10]]
common_words[paper]  # the 20 most common words in this paper
[('microbe', 26), ('mr', 15), ('hankin', 14), ('plague', 14), ('bubo', 14), ('h', 13), ('found', 12), ('action', 10), ('l', 9), ('acid', 8), ('infection', 7), ('infected', 7), ('days', 7), ('e', 7), ('rt', 7), ('pneumonia', 7), ('water', 6), ('rec', 6), ('fem', 6), ('b', 6)]
Try It! Inspired by questions from MHBI's curator, the questions below offer starting points for topics you could consider exploring in the MHBI dataset!

3.1 Which publications are about cholera? Leprosy? Malaria? Plague? Laboratory medicine?

In [ ]:
# t.concordance("cholera")
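Beyond concordance views, the common_words dictionary built above can identify which papers feature a disease among their most frequent words. A sketch, run here on illustrative data in the same {title: [(word, count), ...]} shape:

```python
def papers_featuring(common_words, term):
    # Return titles of papers whose 20 most common words include the term
    return [title for title, words in common_words.items()
            if any(w == term for w, count in words)]

# Illustrative stand-in for the real common_words dictionary
sample = {
    'Paper A': [('cholera', 40), ('cases', 12)],
    'Paper B': [('plague', 55), ('bubo', 14)],
}
print(papers_featuring(sample, 'cholera'))
# ['Paper A']
```

Passing the real common_words built in the previous cells would return the MHBI paper titles in which that term is a top-20 word.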

3.2 How does the language around the people of India change over time?

Consider comparing papers from before and after the 1857 rebellion, following which rule of India passed from the British East India Company to the British Crown (in 1858). How do the word choice and sentiment of the language differ before and after?
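One possible starting point for grouping papers by period is to pull a year out of each title. This is a sketch that assumes some titles embed a four-digit year, as 'Distribution and causation of leprosy in British India 1875 - IP/HA.2' does; titles without a year would need to be dated another way:

```python
import re

def year_from_title(title):
    # Look for a four-digit year in the 1800s or 1900s; return None if absent
    match = re.search(r'\b(?:18|19)\d{2}\b', title)
    return int(match.group()) if match else None

sample_titles = [
    'Distribution and causation of leprosy in British India 1875 - IP/HA.2',
    'Report of an outbreak of cholera in Suhutwar, Bulliah sub-division - IP/30/PI.2',
]
print([year_from_title(t) for t in sample_titles])
# [1875, None]
```

Since the dataset is mainly from 1850-1950, a clean pre/post-1857 split may not be possible; comparing early (say, pre-1900) and later papers is a workable alternative.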

In [ ]:
# t.concordance("Native")

3.3 How does the language around mental hospitals change over time?

In [ ]:
# t.concordance("lunatic")
In [ ]:
# t.concordance("mental")

3.4 How are women portrayed?

Consider lock(ed) hospitals and escapes from them, and prostitution permitted in army barracks.

In [ ]:
# t.concordance("women")
In [ ]:
# t.concordance("lock")

3.5 What is the rhetoric around vaccinations and, more generally, public health?

In [ ]:
# t.concordance("vaccination")