Exploring Ladies' Edinburgh Debating Society

Created August-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern

About the Ladies' Edinburgh Debating Society Dataset

The Ladies' Edinburgh Debating Society (LEDS) was founded in 1865 by women of the upper-middle and upper classes, at a time when women had limited opportunities for higher education. Members went on to play significant roles in education, suffrage, philanthropy, and anti-slavery efforts. The LEDS Dataset contains digitised text from all volumes of two journals the Society published: The Attempt and The Ladies' Edinburgh Magazine. The first journal contains ten volumes published from 1865 through 1874. The second journal contains six volumes published from 1875 through 1880.

The Ladies' Edinburgh Debating Society, also known as the Edinburgh Essay Society and the Ladies' Edinburgh Essay Society, was dissolved in 1935. A year later, in 1936, the National Library of Scotland acquired the volumes that were digitised in this dataset.

Data format: digitised text
Data creation process: Optical Character Recognition (OCR)
Data source: https://data.nls.uk/data/digitised-collections/edinburgh-ladies-debating-society/

Citations

Alex, Beatrice and Llewellyn, Clare. (2020) Library Carpentry: Text & Data Mining. Centre for Data, Culture & Society, University of Edinburgh. http://librarycarpentry.org/lc-tdm/.
Bird, Steven and Klein, Ewan and Loper, Edward. (2019) Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. O'Reilly Media. 978-0-596-51649-9. https://www.nltk.org/book/.

0. Preparation

Import libraries to use for cleaning, summarising and exploring the data:

# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re
from collections import defaultdict
import urllib.request
import urllib
import json

# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt

# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer 
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('tagsets')  # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt

To explore the text in the Ladies' Edinburgh Debating Society collection, we'll mainly use the Natural Language Toolkit (NLTK), a library written for the programming language Python.

The nls-text-ladiesDebating folder (downloadable as Just the text data from the website at the top of this notebook) contains TXT files of digitised text with numerical names, as well as a CSV inventory file and a TXT ReadMe file. Load only the TXT files of digitised text and tokenise the text (which splits running text into separate words, numbers, and punctuation):

corpus_folder = 'data/nls-text-ladiesDebating/'
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')  # raw string so \d is read as a regex digit class
corpus_tokens = wordlists.words()
print(corpus_tokens[:10])

['â', '\x80¢*', 'â', '\x80¢', 'UL', '.', 'u', '^\\,', 'THE', 'ATTEMPT']
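The odd tokens 'â' and '\x80¢' at the start of this list look like an encoding artefact rather than an OCR mistake: they are what a UTF-8 bullet character turns into when its bytes are read as Latin-1. Here is a minimal sketch that reproduces and then reverses the effect using only built-in string methods. That the source files are UTF-8 is an assumption based on this output, not something the dataset documentation confirms here.

```python
# A bullet point (U+2022) is stored as three bytes in UTF-8
bullet = "\u2022"
utf8_bytes = bullet.encode("utf-8")

# Reading those three bytes as Latin-1 yields three separate characters
misread = utf8_bytes.decode("latin1")
print(repr(misread))

# Reversing the two steps recovers the original character
repaired = misread.encode("latin1").decode("utf-8")
print(repaired == bullet)
```

If that assumption holds, loading the corpus with encoding='utf-8' may avoid tokens like these, though the outputs shown in this notebook were produced with latin1.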

Note: If you'd like to see how to specify a single TXT file to load as data, check out the Jupyter Notebook for Exploring Britain and UK Handbooks!

It's hard to get a sense of how accurately the text has been digitised from this list of 10 tokens, so let's look at a word in context. To see phrases in which "Mrs" is used, we can use the concordance() method:

t = Text(corpus_tokens)
t.concordance('Mrs', lines=20)

Displaying 20 of 1387 matches:
RE , LEITH . MDCCCLXVI . D CONTENTS . Mrs Gaskell , by Incha j ^ rflfl ^ At , f
e Slaves , by Euterpe . . ... . . 282 Mrs Gnmdy , by Dido *^ 4n t '- â ¢"..'- 
e . The way in which she writes about Mrs Gaskell ' s daughters shows that they
of the Slaves , by Euterpe ...... 282 Mrs G The V Morui ; fJ \ V THE ATTEMPT . 
 ; fJ \ V THE ATTEMPT . IS . 6a : sML Mrs Gaskell ' s death has created a blank
 after these are forgotten , those of Mrs Gaskell will retain their place , and
e . The way in which she writes about Mrs Gaskell ' s daughters shows that they
n the midst of an affectionate family Mrs Gaskell gently passed away , leaving 
ged by the violence of its emotions . Mrs Gaskell rightly supposes that a great
. It must have been a satisfaction to Mrs . Gaskell to feel she had given some 
hen we recollect the circumstances of Mrs Gaskell ' s own death . We must draw 
ner never hesitated to send for Mr or Mrs Melville if he wanted sympathy or adv
dea of separation so overwhelmed poor Mrs Campbell , that he , with true kindne
y that there does , Mr Munroe ," said Mrs Campbell , speaking for the first tim
any bad news 1 ' I asked fiercely . " Mrs Melville took my hand and burst into 
 it may prove an unfounded alarm .' " Mrs Melville then told me that great anxi
 directed to the Captain ' s mother . Mrs Melville tried to cheer me by saying 
school companion , Laura Leslie , now Mrs Lea . We have become fast friends aga
taken on the Abbey Hill , till Mr and Mrs Melville proposed coming here , when 
d ever since ." " Thank you very much Mrs Campbell ," said Mr Munroe , " and no

This dataset has not been manually cleaned since OCR was used to digitise the text of The Attempt and The Ladies' Edinburgh Magazine, so it's not surprising to see some non-words appear in the concordance. Even with the digitisation errors, though, we can still get a sense of what's in the text using natural language processing (NLP) methods!
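To make the idea of a concordance concrete, here is a rough sketch of how one could be built by hand: find each position of the target token and keep a few tokens on either side. The function simple_concordance and the sample tokens are hypothetical, and NLTK's own concordance() differs in detail (for instance, it measures context in characters rather than tokens).

```python
def simple_concordance(tokens, target, width=3):
    """Return each occurrence of target with `width` tokens of context on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines

# A short, made-up token list in the style of the corpus
sample_tokens = ["Mrs", "Gaskell", "gently", "passed", "away", ",",
                 "and", "Mrs", "Melville", "took", "my", "hand"]

for line in simple_concordance(sample_tokens, "Mrs"):
    print(line)
```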

0.1 Dataset Size

Before we do much analysis, let's get a sense of how much data we're working with:

def corpusStatistics(plaintext_corpus_read_lists):
    total_chars = 0
    total_tokens = 0
    total_sents = 0
    total_files = 0
    
    # fileids are the TXT file names in the nls-text-ladiesDebating folder:
    for fileid in plaintext_corpus_read_lists.fileids():
        total_chars += len(plaintext_corpus_read_lists.raw(fileid))
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    
    print("Total...")
    print("  Characters in Ladies' Edinburgh Debating Society (LEDS) Data:", total_chars)
    print("  Tokens in LEDS Data:", total_tokens)
    print("  Sentences in LEDS Data:", total_sents)
    print("  Files in LEDS Data:", total_files)

corpusStatistics(wordlists)

Total...
  Characters in Ladies' Edinburgh Debating Society (LEDS) Data: 15096132
  Tokens in LEDS Data: 3145535
  Sentences in LEDS Data: 108011
  Files in LEDS Data: 16

Note that I've printed Tokens rather than words, though the NLTK method used to count those was .words(). This is because words in NLTK include punctuation and numbers, in addition to letters.

0.2 Identifying Subsets of the Data

Next, we'll create two subsets of the data, one for each journal. To do so we first need to load the inventory (CSV file) that lists which file name corresponds with which journal. When you open the inventory in Microsoft Excel or a text editor, you can see that there are no column names. The Python library Pandas, which can read CSV files, calls column names the header. When we use Pandas to read the inventory, we'll create our own header by specifying header=None and providing a list of column names.

When Pandas (abbreviated pd when we loaded libraries in the first cell of this notebook) reads a CSV file, it creates a table called a dataframe from that data. Let's see what the LEDS inventory dataframe looks like:

df = pd.read_csv('data/nls-text-ladiesDebating/ladiesDebating-inventory.csv', header=None, names=['fileid', 'title'])
df

Since we only have 16 files (with indices running from 0 through 15), we can display the entire dataframe. With larger dataframes you may wish to use df.head() or df.tail() to print only the first 5 rows or last 5 rows, respectively.
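The idea of supplying your own column names for a headerless CSV isn't specific to Pandas. As a small sketch, the standard library's csv module does the same thing through its fieldnames argument. The two rows below are written in the shape of the inventory file; the exact formatting of the real file is an assumption.

```python
import csv
import io

# Two rows in the same shape as the inventory file: no header line
raw_csv = (
    "109857781.txt,Attempt - Volume 1 and Select writings - U.431\n"
    "103655658.txt,Ladies' Edinburgh Magazine - Volume 1 - U.393\n"
)

# fieldnames plays the same role as names=[...] in pd.read_csv
reader = csv.DictReader(io.StringIO(raw_csv), fieldnames=["fileid", "title"])
rows = list(reader)

print(rows[0]["fileid"], "->", rows[0]["title"])
```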

Now we can create two dictionaries of file IDs and their associated journal titles, one for The Attempt and one for The Ladies' Edinburgh Magazine:

attempts = {}
mags = {}
for index, row in df.iterrows():
    fileid = row['fileid']
    title = row['title']
    if 'Attempt' in title:
        attempts[fileid] = title
    else: # if 'Magazine' in title:
        mags[fileid] = title
print("The Attempt files:")
print(attempts)
print("\n Ladies' Edinburgh Magazine files:")  # \n is a newline character
print(mags)

The Attempt files:
{'109857781.txt': 'Attempt - Volume 1 and Select writings - U.431', '103655648.txt': 'Attempt - Volume 2 - U.431', '103655649.txt': 'Attempt - Volume 3 - U.431', '103655650.txt': 'Attempt - Volume 4 - U.431', '103655651.txt': 'Attempt - Volume 5 - U.431', '103655652.txt': 'Attempt - Volume 6 - U.431', '103655653.txt': 'Attempt - Volume 7 - U.431', '103655654.txt': 'Attempt - Volume 8 - U.431', '103655655.txt': 'Attempt - Volume 9 - U.431', '103655656.txt': 'Attempt - Volume 10 - U.431'}

 Ladies' Edinburgh Magazine files:
{'103655658.txt': "Ladies' Edinburgh Magazine - Volume 1 - U.393", '103655659.txt': "Ladies' Edinburgh Magazine - Volume 2 - U.393", '103655660.txt': "Ladies' Edinburgh Magazine - Volume 3 - U.393", '103655661.txt': "Ladies' Edinburgh Magazine - Volume 4 - U.393", '103655662.txt': "Ladies' Edinburgh Magazine - Volume 5 - U.393", '103655663.txt': "Ladies' Edinburgh Magazine - Volume 6 - U.393"}

For convenient reference of only fileids, we can also create lists from the dictionaries:

attempt_ids = list(attempts.keys())
mag_ids = list(mags.keys())
print(mag_ids)

['103655658.txt', '103655659.txt', '103655660.txt', '103655661.txt', '103655662.txt', '103655663.txt']

NLTK's corpus reader retrieves tokens by file ID, so it's useful to be able to match the file IDs with their journal titles!

1. Data Cleaning and Standardisation

There are several ways to standardise, or "normalise," text, with each way providing data suitable to different types of analysis. For example, to study the vocabulary of a text, it's useful to remove punctuation and digits, lowercase the remaining alphabetic words, and then reduce those words to their root form (with stemming or lemmatisation - more on this later). Alternatively, to identify people and places using named entity recognition, it's important to keep capitalisation in words and keep words in the context of their sentences.

1.1 Tokenisation

In section 0. Preparation, we tokenised the LEDS dataset when we created the corpus_tokens list. corpus_tokens contains a list of all words, punctuation, and numbers that appear in the LEDS dataset separated into individual items and organised in the order they appear in the LEDS text files. In addition to tokenising words, NLTK also provides methods to tokenise sentences. This is how we counted the number of sentences in section 0.1 Dataset Size.
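To see what word tokenisation does to a single sentence, here is a small sketch using a regular expression. The pattern is the one NLTK documents for its WordPunctTokenizer, which, as far as I'm aware, is the default word tokeniser for PlaintextCorpusReader; the sample sentence is adapted from the concordance output above.

```python
import re

text = "Mrs Gaskell's death has created a blank, in 1865."

# \w+ matches runs of letters and digits; [^\w\s]+ matches runs of punctuation
tokens = re.findall(r"\w+|[^\w\s]+", text)
print(tokens)
```

Notice that the apostrophe becomes its own token, which is why phrases like "Gaskell ' s" appear in the concordance.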

Tokenised words are helpful when analysing the vocabulary of a text. Tokenised sentences are helpful when analysing the linguistic patterns of a text. Let's create lists of tokens as strings (String is Python's data format for text) from the LEDS dataset:

# Create a list of tokens as strings for the entire corpus
str_tokens = [str(word) for word in corpus_tokens]
print(str_tokens[0:10])

# Create a list of tokens as strings for The Attempt
attempt_str_tokens = []
for fileid in attempt_ids:
    attempt_tokens = wordlists.words(fileid)
    attempt_str_tokens += [str(t) for t in attempt_tokens]
print(attempt_str_tokens[-10:])

# Create a list of tokens as strings for Ladies' Edinburgh Magazine
mag_str_tokens = []
for fileid in mag_ids:
    mag_tokens = wordlists.words(fileid)
    mag_str_tokens += [str(t) for t in mag_tokens]
print(mag_str_tokens[200:210])

['â', '\x80¢*', 'â', '\x80¢', 'UL', '.', 'u', '^\\,', 'THE', 'ATTEMPT']
['January', '.', 'EOISBUROH', ':', 'PRINTED', 'BY', 'COLSTON', 'AND', 'SON', '.']
['the', 'Royal', 'Scottish', 'Academy', ',', 'The', ',', 'by', 'M', '.']

Let's also create a list of tokens that are most likely to be valid English words by removing non-alphabetic tokens from str_tokens (e.g. punctuation, numbers):

alpha_tokens = [t for t in str_tokens if t.isalpha()]
print(alpha_tokens[1000:1010])

['but', 'her', 'friends', 'had', 'no', 'cause', 'to', 'complain', 'of', 'her']

Knowing that the digitised text in the LEDS dataset wasn't cleaned up after OCR, there may be words whose letters were incorrectly digitised as punctuation or numbers. To include those words, we'll put all tokens that each have at least one letter in a with_letters list:

with_letters = [t for t in str_tokens if re.search("[a-zA-Z]+", t)]   # A-Z (not A-z) so that symbols such as ^ and \ don't count as letters
print(with_letters[2000:2010])

['he', 'is', 'THE', 'ATT', 'LSI', 'PT', 'discharged', 'after', 'the', 'trial']

Next, we'll create lowercased versions (this is called casefolding in NLP) of the previous lists of tokens, which, as explained at the beginning of this section, can be useful for studying the vocabulary of a dataset:

str_tokens_lower = [(str(word)).lower() for word in corpus_tokens]
alpha_tokens_lower = [t for t in str_tokens_lower if t.isalpha()]
with_letters_lower = [t for t in str_tokens_lower if re.search("[a-zA-Z]+", t)]

# Check that the capitalised and lowercased lists of tokens are the same length, as expected
assert(len(str_tokens_lower) == len(str_tokens))       # an error will be thrown if something went wrong   
assert(len(alpha_tokens_lower) == len(alpha_tokens))   # an error will be thrown if something went wrong
assert(len(with_letters_lower) == len(with_letters))   # an error will be thrown if something went wrong

As stated at the start of this section, we can also tokenise sentences. Tokenising sentences separates running text into individual sentences, which is necessary for analysing sentence structure. Let's create one list of all sentences in the LEDS corpus, and a dictionary of lists for each file in the corpus:

all_sents = []
sents_by_file = dict.fromkeys(wordlists.fileids())
# Iterate through each file in the LEDS corpus
for fileid in wordlists.fileids():
    file_sents = [str(sent) for sent in sent_tokenize(wordlists.raw(fileid))]
    all_sents += file_sents
    sents_by_file[fileid] = file_sents   # store only this file's sentences

print("Sample:", all_sents[200:205])

Sample: ['He reckoned his jokes as a sportsman would count his head of\r\ngame, but the effect was drearily oppressive.', 'Fun must be spontaneous to be delightÂ¬\r\nful, and no one enjoys a joke when he feels that the joker is scoring it as a hit or a\r\nmiss with painful care.', 'Many of the best things ever said are unrecorded, save in the memories of the\r\nhearers ; no one chronicles them, and they go floating about the undercurrents of\r\ntlie world of talk, making little speaking circles whenever they come to the surface,\r\ntill by constant wear the edge is taken off, and they sink for ever among the fossil\r\nB\r\n10 THE ATTEMPT\r\nwitticisms of bygone ages.', 'One such story I have heard of Tliackeray.', 'One Derby\r\nDay he was returning by one of the last trains to London, and saw a little man\r\nrushing wildly about the platform, exclaiming as he looked into each full carriage,\r\n" Good gracious me !']
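The sample shows that each sentence still carries the line breaks of the printed page ('\r\n') and that words split across lines appear with 'Â¬' at the break ("delightÂ¬\r\nful"). If you'd like cleaner sentences, one possible approach is sketched below. It assumes that '¬' (misread here as 'Â¬') always marks a word continued on the next line, which you should check against the text before relying on it; rejoin_lines is a hypothetical helper.

```python
import re

sentence = ("Fun must be spontaneous to be delightÂ¬\r\nful, and no one enjoys a joke "
            "when he feels that the joker is scoring it as a hit or a\r\nmiss with painful care.")

def rejoin_lines(text):
    # Rejoin words split across lines (assumed to be marked by '¬', possibly preceded by 'Â')
    text = re.sub(r"Â?¬\r?\n", "", text)
    # Replace the remaining line breaks with single spaces
    return re.sub(r"\s*\r?\n\s*", " ", text)

cleaned = rejoin_lines(sentence)
print(cleaned)
```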

I wonder if the language changed from The Attempt publication to the later The Ladies' Edinburgh Magazine publication? Let's create lists of all sentences for each of these publications so the language of the two publications can be compared and contrasted:

attempt_file_sents = dict.fromkeys(attempt_ids)
attempt_sents = []
# Iterate through each file of a publication of The Attempt
for fileid in attempt_ids:
    file_sents = [str(sent) for sent in sent_tokenize(wordlists.raw(fileid))]
    attempt_sents += file_sents
    attempt_file_sents[fileid] = file_sents   # store only this file's sentences
    
print("Total sentences in The Attempt:", len(attempt_sents))
print("Sample:", attempt_sents[400:405])
print()
    
mag_file_sents = dict.fromkeys(mag_ids)
mag_sents = []
# Iterate through each file of a publication of The Ladies' Edinburgh Magazine
for fileid in mag_ids:
    file_sents = [str(sent) for sent in sent_tokenize(wordlists.raw(fileid))]
    mag_sents += file_sents
    mag_file_sents[fileid] = file_sents   # store only this file's sentences
    
print("Total sentences in The Ladies' Edinburgh Magazine:", len(mag_sents))
print("Sample:", mag_sents[250:255])

Total sentences in The Attempt: 53053
Sample: ['Is there no one\r\nwhom we could comfort out of our abundance, and whose precious blessings we might\r\nhope to deserve!', 'Gretchen,â\x80\x9d he exclaimed, rising up as the thought seized him,\r\nâ\x80\x9c our old neighbour Dorothea can have but little since her loving son was taken\r\naway: it is the duty of the rich to help the poor, and rich we are compared to the\r\ndesolate widow.', 'What think you, my wife; would it not be seemly that she should\r\npartake of our Christmas bounty!â\x80\x9d The kindly housewife did not reply in words,\r\nbut looking up in her husbandâ\x80\x99s face, gave a pleasant smile and approving nod, and\r\nwas soon again busily engaged with her knitting.', 'With the somewhat impetuous\r\nCarl Holz, to determine anything was at once to do it; so, rising up, he began to\r\nprepare for a walk of some four or five miles through the dense forest in which his\r\nwooden hut was situated.', 'Gretchen would have objected.']

Total sentences in The Ladies' Edinburgh Magazine: 54958
Sample: ['Such trials do not injure the body, but they deaden the\r\nsoul; they induce a weariness of conflict, and lead into\r\nthe reaction of a deliberate refi;sal to feel at all.', 'In a dream once, I found myself choking, stifling in\r\ndeep waters; something like great masses of clinging\r\nâ\x96\xa0weeds imprisoned my feet, so that I could not get out;\r\nand the horror of the situation was heiglitened when I\r\ndiscovered that this was my own hair, which had all f;illen\r\noff.', 'At last 1 struggled on shore, and feebly walked away\r\nThe Ladies^ Edinburgh Magazine.', '13\r\nfrom the water.', 'After a while, I turned to look at the\r\ndepth I had escaped, and behold !']

1.2 Stemming

As we saw in the results of the concordance() method, OCR doesn't result in perfectly digitised text. To get a sense of how many mistakes may have been made in the digitisation process, we can measure how many words in the LEDS dataset are recognisable English words according to a list of words considered valid in the board game Scrabble (as demonstrated in this example).

As mentioned in section 1.1 Tokenisation, there are several ways to standardise ("normalise") text, with each way providing text suitable to different types of analysis. We're concerned with studying vocabulary, since we want to measure how many of the alphabetic tokens that NLTK has identified in the LEDS dataset are valid English words, so we'll work with lowercase, alphabetic tokens from our alpha_tokens_lower list.

To efficiently measure the number of valid and invalid English words, we can further standardise our data through stemming. Stemming reduces words to a root form by stripping affixes; the Porter algorithm we'll use strips suffixes. For example, the word "troubling" has a stem of "troubl."
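To give a feel for what a stemmer does, here is a deliberately simplified sketch that strips a handful of common suffixes. It is not the Porter algorithm (which applies many more rules in several passes), and toy_stem and its suffix list are hypothetical, but it shows how different forms of a word can collapse to one stem.

```python
# A tiny, illustrative suffix list, checked in this order
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["troubling", "troubled", "troubles"]:
    print(w, "->", toy_stem(w))
```

All three forms reduce to "troubl", so they would be counted as the same item when comparing vocabularies.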

In the next 3 steps we'll load the Scrabble dataset of valid English words, stem the Scrabble dataset and LEDS dataset, and then see if the stems from the LEDS dataset are present in the Scrabble dataset.

Step 1: First we'll load the Scrabble file of words (which helpfully includes British English spellings!) and create a lowercased list of those words:

# Open the file in a with block so it is closed automatically after reading
with open('data/scrabble_words.txt', 'r') as file:
    scrabble_words = file.read().split('\n')
scrabble_words_lower = [word.lower() for word in scrabble_words]

assert(len(scrabble_words) == len(scrabble_words_lower))  # the number of words shouldn't change when the list is lowercased

print("Total words in Scrabble list:", len(scrabble_words))
print("Sample of English words from the Scrabble list:", scrabble_words_lower[100:120])

Total words in Scrabble list: 267752
Sample of English words from the Scrabble list: ['abattoirs', 'abattu', 'abature', 'abatures', 'abaxial', 'abaxile', 'abaya', 'abayas', 'abb', 'abba', 'abbacies', 'abbacy', 'abbas', 'abbatial', 'abbe', 'abbed', 'abbes', 'abbess', 'abbesses', 'abbey']

Step 2: Next we'll stem the tokens in the Scrabble list and the LEDS dataset. There are different algorithms that one can use to determine the root of a word; we will use the Porter Stemmer algorithm. To make our code as efficient as possible, we'll remove duplicate stems using sets (a set is a Python data structure similar to a list, except that each item in a set is unique, so there are no duplicates).

This process should give us a smaller number of words to compare and should enable tokens in LEDS to be recognised as English words even if they appeared in a different form in the Scrabble list.

porter = nltk.PorterStemmer()

unique_alpha_lower = list(set(alpha_tokens_lower)) # Remove duplicates from the lowercased, alphabetic tokens in the LEDS dataset
leds_porter_stemmed = [porter.stem(t) for t in unique_alpha_lower]

scrabble_porter_stemmed = [porter.stem(t) for t in scrabble_words_lower]

# Remove duplicates from the Scrabble and LEDS lists of stems
leds_pstemmed_set = list(set(leds_porter_stemmed))
scrabble_pstemmed_set = list(set(scrabble_porter_stemmed))

print(leds_pstemmed_set[:10])
print(scrabble_pstemmed_set[50:60])

['gilpin', 'difeculti', 'comhil', 'lorjic', 'sla', 'indifler', 'niebelungen', 'kilda', 'angehc', 'trulj']
['nigritud', 'thurl', 'fungo', 'antinationalist', 'dout', 'epur', 'goy', 'crasser', 'emet', 'succinylcholin']

Try It! NLTK provides other stemming algorithms you can try, too, such as the Lancaster Stemmer. Try replacing 'nltk.PorterStemmer()' with 'nltk.LancasterStemmer()' to observe the differences in the stems that the algorithms return.

Step 3: Lastly, we'll compare the stems (root forms) of LEDS tokens to the stems of Scrabble words to gauge how many LEDS tokens are recognisable English words.

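scrabble_stem_lookup = set(scrabble_porter_stemmed)   # checking membership in a set is much faster than in a list
recognised_stems = 0
for stem in leds_porter_stemmed:
    if stem in scrabble_stem_lookup:
        recognised_stems += 1
print("Recognised Stems:", (recognised_stems/len(leds_porter_stemmed))*100,"%")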

Recognised Stems: 52.320395201129145 %
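The comparison above boils down to asking, for each stem, "is this in the reference word list?" Here is a toy-sized sketch of the same calculation, using a set for the reference list and also collecting the stems that weren't recognised. The stems below are a small hypothetical sample, not results from the dataset.

```python
# Hypothetical stems standing in for the LEDS and Scrabble stem lists
leds_stems = ["troubl", "friend", "difeculti", "trulj", "friend"]
word_list_stems = {"troubl", "friend", "difficulti", "truli"}

# Count the stems (duplicates included) that are present in the reference set
recognised = sum(1 for stem in leds_stems if stem in word_list_stems)
share = recognised / len(leds_stems) * 100
print(f"Recognised stems: {share:.1f} %")

# List the unique stems that were not found, for closer inspection
unrecognised = sorted(set(leds_stems) - word_list_stems)
print(unrecognised)
```

Looking through a list like unrecognised is one way to spot recurring OCR errors.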

Rather than comparing stems in the Scrabble and LEDS dataset, you could also compare lemmas or the entire vocabularies (all lowercased, unique tokens). Comparing the entire vocabularies will take longer than comparing stems and lemmas, though.

It looks as though just under half the stems in the LEDS text aren't recognised...how might we figure out what some of those words are meant to be?

1.3 Part of Speech Tagging

Another form of standardisation in text analysis is tagging sentences, or identifying the parts of speech in sentences. Identifying parts of speech that compose the structure of sentences is important for analysing linguistic patterns and comparing the writing styles of different texts. We'll use NLTK's built-in part of speech tagger to tag sentences for the entire corpus:

fileids = list(df['fileid'])
tagged_sents = []
for fileid in fileids:
    file = wordlists.raw(fileid)
    sentences = nltk.sent_tokenize(file)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    tagged_sents += [sent for sent in sentences]

print("Total part-of-speech tagged sentences:", len(tagged_sents))

Total part-of-speech tagged sentences: 108011

print("Sample:", tagged_sents[1000:1003])

Sample: [[('The', 'DT'), ('voltaic', 'NN'), ('battery', 'NN'), ('has', 'VBZ'), ('rendered', 'VBN'), ('harmless', 'PDT'), ('a', 'DT'), ('once', 'RB'), ('deadly', 'JJ'), ('method', 'NN'), ('of', 'IN'), ('gilding', 'NN'), ('.', '.')], [('THE', 'DT'), ('ATTEMPT', 'NNP'), ('65', 'CD'), ('The', 'DT'), ('shipâ\x80\x99s', 'NN'), ('galley', 'NN'), ('for', 'IN'), ('distilling', 'VBG'), ('fresh', 'JJ'), ('water', 'NN'), ('from', 'IN'), ('the', 'DT'), ('ocean', 'NN'), (',', ','), ('together', 'RB'), ('with', 'IN'), ('the', 'DT'), ('discovery', 'NN'), ('of', 'IN'), ('the', 'DT'), ('means', 'NNS'), ('of', 'IN'), ('preserving', 'VBG'), ('meat', 'NN'), ('fresh', 'JJ'), (',', ','), ('have', 'VB'), ('driven', 'VBN'), ('the', 'DT'), ('once', 'RB'), ('fatal', 'JJ'), ('scurvy', 'NN'), ('from', 'IN'), ('our', 'PRP$'), ('navy', 'NN'), ('.', '.')], [('The', 'DT'), ('sewing-machine', 'NN'), ('has', 'VBZ'), ('already', 'RB'), ('proved', 'VBN'), ('a', 'DT'), ('great', 'JJ'), ('boon', 'NN'), ('to', 'TO'), ('shoemakers', 'NNS'), (',', ','), ('whose', 'WP$'), ('eyes', 'NNS'), ('were', 'VBD'), ('injured', 'VBN'), ('by', 'IN'), ('straining', 'VBG'), ('closely', 'RB'), ('over', 'IN'), ('their', 'PRP$'), ('black', 'JJ'), ('material', 'NN'), ('.', '.')]]

Great! We'll use these tagged sentences later on, in 3. Exploratory Analysis, to help us identify named entities (i.e. people, places, organisations) in the LEDS dataset.
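As a first taste of what tagged sentences make possible, here is a short sketch that counts how often each part-of-speech tag occurs in the first sentence of the sample output above. It uses only Python's built-in Counter; the same approach could be applied across tagged_sents to compare tag frequencies between publications.

```python
from collections import Counter

# The first tagged sentence from the sample output above
tagged_sentence = [('The', 'DT'), ('voltaic', 'NN'), ('battery', 'NN'), ('has', 'VBZ'),
                   ('rendered', 'VBN'), ('harmless', 'PDT'), ('a', 'DT'), ('once', 'RB'),
                   ('deadly', 'JJ'), ('method', 'NN'), ('of', 'IN'), ('gilding', 'NN'),
                   ('.', '.')]

# Tally the tags, ignoring the words themselves
tag_counts = Counter(tag for word, tag in tagged_sentence)
print(tag_counts.most_common(2))
```

The sample also shows that automatic tagging isn't perfect: "harmless" is labelled PDT (predeterminer), although it works as an adjective in this sentence.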

2. Summary Statistics

2.1 Frequencies and Sizes

Now that we've created some different cuts of the LEDS dataset, let's start investigating the frequency of terms as they appear across the dataset. One way to do so is with a frequency distribution, which records how many times each token appears in a dataset and can be plotted as a line chart. The following 3 steps demonstrate how to visualise frequency distributions.

Step 1: Lowercase the tokens in each LEDS publication and filter them to exclude one-letter words, two-letter words, and stop words (such as and, a, and the):

# Use NLTK's provided stop words for the English language
to_exclude = set(stopwords.words('english'))   # a set makes the membership checks below fast
to_exclude.update(['attempt', 'magazine', 'ladies', 'edinburgh'])    # add words from the journals' titles

# Filter one-letter words, two-letter words, and stop words out of the list of The Attempt tokens 
attempt_min_three_letters = []
attempt_min_three_letters += [t.lower() for t in attempt_str_tokens if len(t) > 2]
attempt_filtered_tokens = [t for t in attempt_min_three_letters if not t in to_exclude]
print("Sample of The Attempt tokens after filtering:", attempt_filtered_tokens[60:70])

# Filter one-letter words, two-letter words, and  stop words out of the list of Ladies' Edinburgh Magazine tokens 
mag_min_three_letters = []
mag_min_three_letters += [t.lower() for t in mag_str_tokens if len(t) > 2]
mag_filtered_tokens = [t for t in mag_min_three_letters if not t in to_exclude]
print("Sample of Ladies' Edinburgh Magazine tokens after filtering:", mag_filtered_tokens[200:210])

Sample of The Attempt tokens after filtering: ['katie', 'veronica', '.......', 'royal', 'hospital', 'sick', 'children', 'new', 'yearâ', 'hymn']
Sample of Ladies' Edinburgh Magazine tokens after filtering: ['profession', 'women', 'eliza', 'dunbar', '383', 'years', 'naomi', 'smith', '....', '380']

Step 2: Calculate the frequency distribution for each LEDS publication using NLTK's FreqDist() method:

# Calculate the frequency distribution for each filtered list of tokens
attempt_fdist = FreqDist(attempt_filtered_tokens)
print("Total tokens in The Attempt after filtering:", attempt_fdist.N())

mag_fdist = FreqDist(mag_filtered_tokens)
print("Total tokens in Ladies' Edinburgh Magazine after filtering:", mag_fdist.N())

Total tokens in The Attempt after filtering: 675377
Total tokens in Ladies' Edinburgh Magazine after filtering: 622426

Step 3: Plot the frequency distributions for each LEDS publication:

# Visualise the frequency distribution for a select number of tokens 
plt.figure(figsize = (18, 8))                # customise the width and height of the plot
plt.rc('font', size=12)                       # customise the font size of the title, axes names, and axes labels
attempt_fdist.plot(20, title='Frequency Distribution of the 20 Most Common Words in The Attempt (excluding stop words, 1-letter and 2-letter words)')

[Figure: line chart of the 20 most common words in The Attempt, excluding stop words and 1-letter and 2-letter words]

# Visualise the frequency distribution for a select number of tokens
plt.figure(figsize = (18, 8))                # customise the width and height of the plot
plt.rc('font', size=12)                       # customise the font size of the title, axes names, and axes labels
mag_fdist.plot(20, title="Frequency Distribution of the 20 Most Common Words in Ladies' Edinburgh Magazine (excluding stop words, 1-letter and 2-letter words)")

[Figure: line chart of the 20 most common words in Ladies' Edinburgh Magazine, excluding stop words and 1-letter and 2-letter words]
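The filtering and counting in the steps above can be sketched with only the standard library, which may help if you want to check the logic on a small example first. NLTK's FreqDist builds on Python's Counter, so the two behave similarly here. The stop word list and tokens below are hypothetical.

```python
from collections import Counter

# A hypothetical stop word list and token sample, for illustration only
stop_words = {"the", "and", "was", "her"}
tokens = ["The", "sea", "was", "calm", ",", "and", "the", "sea", "air", "was", "her", "joy", "."]

# Lowercase, drop tokens shorter than three characters, then drop stop words
filtered = [t.lower() for t in tokens if len(t) > 2]
filtered = [t for t in filtered if t not in stop_words]

# Count how many times each remaining token appears
freq = Counter(filtered)
print(freq.most_common(1))
print("Total tokens after filtering:", sum(freq.values()))
```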

2.2 Uniqueness and Variety

To measure the diversity of word choice in a text, we can use the lexical diversity metric, which is the length of the vocabulary of a text divided by the total length of the text. Length is the total number of words, and vocabulary is a non-repeating list of words (unique words) in a text.
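As a quick worked example of the metric, take a made-up sentence of 11 words, 8 of which are distinct. Its lexical diversity is 8 divided by 11, or roughly 0.73:

```python
# A made-up sentence, already lowercased and free of punctuation
words = "the cat sat on the mat and the dog sat too".split()

# The vocabulary is the set of distinct words
vocab = set(words)

diversity = len(vocab) / len(words)
print(len(words), "words,", len(vocab), "unique, lexical diversity:", round(diversity, 2))
```

A score near 1 would mean almost every word is used only once, while a score near 0 means the same words are repeated many times.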

Let's compare the lexical diversities of the two publications in the LEDS dataset:

Step 1: First, let's remove all tokens that aren't words by excluding tokens that are made up of punctuation and digits, rather than letters. We'll also casefold all the words to standardise them, so that The and the are considered the same word, for example.

# Remove non-alphabetic tokens (exclude punctuation and digits) and lowercase all tokens
attempt_alpha_lower = [t.lower() for t in attempt_str_tokens if t.isalpha()]
mag_alpha_lower = [t.lower() for t in mag_str_tokens if t.isalpha()]

# Print the lengths (total words) of each publication
print("The Attempt length:", len(attempt_alpha_lower), "words")
print("Ladies' Edinburgh Magazine length:", len(mag_alpha_lower), "words")

The Attempt length: 1378441 words
Ladies' Edinburgh Magazine length: 1272415 words

So The Attempt files have a total of slightly more words than those of Ladies' Edinburgh Magazine.

Step 2: Next, let's find the vocabulary of the two publications.

attempt_vocab = set(attempt_alpha_lower)
mag_vocab = set(mag_alpha_lower)

print("The Attempt vocabulary size:", len(attempt_vocab), "words")
print("Ladies' Edinburgh Magazine vocabulary size:", len(mag_vocab), "words")

The Attempt vocabulary size: 50814 words
Ladies' Edinburgh Magazine vocabulary size: 46582 words

So The Attempt has a larger vocabulary size than Ladies' Edinburgh Magazine. Given that The Attempt's overall length is longer, this isn't surprising. To compare the vocabularies (word choice) of the two publications relative to their lengths, we use the lexical diversity metric.

Step 3: Calculate the lexical diversity of each publication.

# INPUT: a list of all words and a vocabulary list for a text source
# OUTPUT: the number of unique words (length of the vocabulary) divided by
#         the total words of a text source (the lexical diversity score)
def lexicalDiversity(all_words, vocab):
    return len(vocab)/len(all_words)

print("The Attempt's lexical diversity score:", lexicalDiversity(attempt_alpha_lower, attempt_vocab))
print("Ladies' Edinburgh Magazine's lexical diversity score:", lexicalDiversity(mag_alpha_lower, mag_vocab))

The Attempt's lexical diversity score: 0.03686338406939434
Ladies' Edinburgh Magazine's lexical diversity score: 0.036609125167496454

The scores are very close! The word choice in The Attempt is only slightly more diverse than that in Ladies' Edinburgh Magazine.

3. Exploratory Analysis

3.1 Who is named in the dataset?

In NLP, named entity recognition is the process of identifying people, places, and organisations ("entities") that are named in a dataset. In order to recognise entities, a dataset of running text must be tokenised into sentences, and then those sentences must be tagged with parts of speech. Entities' names are often capitalised, so we do not casefold text on which we want to run named entity recognition.

We've already tokenised sentences in the LEDS dataset and tagged their parts of speech in 1.3 Part of Speech Tagging, so we can use the resulting tagged_sents list. We'll use spaCy's named entity recognition tool.

First, we need to make sure we have the spaCy language model we are going to use:

try:
    import en_core_web_sm
except ImportError:
    print("Downloading en_core_web_sm model")
    import sys
    !{sys.executable} -m spacy download en_core_web_sm
else:
    print("Already have en_core_web_sm")

Already have en_core_web_sm

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

sentences = []
for fileid in fileids:
    file = wordlists.raw(fileid)
    sentences += nltk.sent_tokenize(file)

person_list = []
for s in sentences:
    s_ne = nlp(s)
    for entity in s_ne.ents:
        if entity.label_ == 'PERSON':
            person_list += [entity.text]

print(len(person_list))

33652
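
Counter was imported above, and one way it could be put to use is to tally which entity strings occur most often. The following is a minimal sketch; the short sample_persons list is made up for illustration and stands in for person_list.

```python
from collections import Counter

# Hypothetical stand-in for person_list
sample_persons = ["Mary", "Ann Ward", "Mary", "Mrs Radcliffe", "Ann Ward", "Mary"]

# Count how often each entity string occurs
person_counts = Counter(sample_persons)

# The two most frequent entity strings with their counts
top_two = person_counts.most_common(2)
print(top_two)
```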

displacy.render(nlp(str(sentences[29997])), jupyter=True, style='ent')

unique_persons = list(set(person_list))
print(len(unique_persons))
names = []
for name in unique_persons:
    # Keep entities containing a capitalised word (raw string avoids escape warnings)
    if re.search(r'([A-Z]{1}([a-z])+\.?)', name):
        names += [name]
print(len(names))

11777
10164
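
To see what the regular expression above keeps and discards, here is a small sketch run on a handful of made-up candidate strings. The pattern is an equivalent form of the one above without the capture groups. Note that re.search only requires one capitalised word somewhere in the string, so a multi-word entity passes as long as any one of its words matches.

```python
import re

# Equivalent to the pattern used above, written as a raw string
name_pattern = re.compile(r'[A-Z][a-z]+\.?')

# Hypothetical entity strings of the kind an NER tool might return
candidates = ["Ann Ward", "M.", "1764", "mary", "Mrs Radcliffe", "the Duke"]

kept = [c for c in candidates if name_pattern.search(c)]
dropped = [c for c in candidates if not name_pattern.search(c)]

print("kept:", kept)
print("dropped:", dropped)
```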

Next, we can use an API called genderize.io to guess how many of the names refer to a male or female:

def guessGender(person_name):
    genderize_url = 'https://api.genderize.io?name='
    country_gb = '&country=GB'
    url = genderize_url+person_name+country_gb
    content = (urllib.request.urlopen(url)).read()
    return str(content).strip("b'")

from urllib.error import HTTPError

gender_guesses = []
errored = []
titles = ['mrs', 'ms', 'mr', 'miss', 'sir', "ma'am", 'lord', 'lady', 'king', 'queen', 'duchess', 'duke', 'mademoiselle', 'madame', 'monsieur', 'signora']
for name in names:
    name = name.lower()
    for title in titles:
        if title in name:
            # Remove the title and any whitespace after the title
            name = name.replace(title, "").strip()
    
    # If the name includes more than a given name (e.g. a family name or middle
    # name), split it into a list of names and keep only the first item
    name = name.split()
    name = name[0]
    
    try:
        guess = guessGender(name)
        gender_guesses += [guess]
    except UnicodeEncodeError:
        errored += [name]
    # If there are too many requests (genderize.io allows
    # only 1,000 per day), end the loop
    except HTTPError:
        break
    
print(gender_guesses[:3])

['{"name":"matthew","gender":"male","probability":1,"count":35521}', '{"name":"der","gender":"male","probability":0.89,"count":2422}', '{"name":"nicholas","gender":"male","probability":0.99,"count":16755}']
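
The request and response handling above can also be sketched with the standard library's URL-encoding and JSON tools, which may avoid encoding errors for names with accented characters and return a dictionary directly. The helper names build_genderize_url and parse_genderize_response are hypothetical, the query parameters simply mirror those used in guessGender, and no network request is made here: the sample bytes are copied from the first response printed above.

```python
import json
from urllib.parse import urlencode

GENDERIZE_URL = 'https://api.genderize.io'  # same endpoint as in guessGender

def build_genderize_url(person_name, country='GB'):
    # urlencode percent-encodes characters that are not safe in a URL
    query = urlencode({'name': person_name, 'country': country})
    return GENDERIZE_URL + '?' + query

def parse_genderize_response(raw_bytes):
    # Decode the bytes that urlopen(...).read() returns, then parse the JSON
    return json.loads(raw_bytes.decode('utf-8'))

print(build_genderize_url('françois'))

# Bytes equivalent to the first response shown above
sample = b'{"name":"matthew","gender":"male","probability":1,"count":35521}'
guess = parse_genderize_response(sample)
print(guess['name'], guess['gender'], guess['probability'])
```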

print("Number of gender guesses made:", len(gender_guesses))

Number of gender guesses made: 967

Try It! The genderize.io API only allows 1,000 requests (gender guesses for 1,000 names) in a single day. How else might you improve the accuracy of the names we extracted from the LEDS dataset so that you can be more certain that the names you send the gender guesser are valid given names?

Since there's a limit on the number of requests one can make to genderize.io in a single day, for now let's simply use the 967 guesses we just made. Let's calculate the number of names guessed to be "male" and "female" with a probability of at least 0.9 (90%). To make it easier to find guesses that meet this criterion, we'll convert the gender guesses to a different data structure. Genderize.io sends responses (returns gender guesses) in the JSON data format, which is similar to Python's dictionary data structure, so we'll convert the string representations of the JSON responses into dictionaries. Then we'll figure out whether a name is guessed as representing a "male" or "female" gender-identifying person with at least 90% probability.

Note: This process is limited because it considers gender a binary rather than a spectrum! Gender guesses from the genderize.io API should thus be taken with a grain of salt: the guesses provide an indication of two categories of people discussed in the LEDS dataset and should not be interpreted as 100% accurate.

import json
male_guesses = []
female_guesses = []
for response in gender_guesses:
    response = response.replace("\\","")
    response = response.replace("\'s","")
    try:
        response = json.loads(response)

        if response["probability"] >= 0.9:
            if response["gender"] == "male":
                male_guesses += [response["name"]]
            elif response["gender"] == "female":
                female_guesses += [response["name"]]

    # If the response is not valid JSON, print it
    # to see if the name is valid
    except ValueError:
        print(response)

print("Names guessed male:", len(male_guesses))
print("Names guessed female:", len(female_guesses))

{"name":"me,"said","gender":null,"probability":0,"count":0}
Names guessed male: 320
Names guessed female: 121

The name that threw an error isn't a valid name, so we won't worry about it. Let's take a closer look at the names guessed as female:

print(female_guesses)

['sigourney', 'einna', 'ann', 'patty', 'alison', 'frangois', 'chamomile', 'mary', 'maria', 'katy', 'carrie', 'marie', 'thorhalla', 'mary', 'dora', 'gabrielle', 'dora', 'janey', 'mary', 'elizabeth', 'meantime', 'lorraine', 'madame', 'anna', 'evelyn', 'reme', 'melrose', 'miriam', 'emma', 'cassandra', 'jeanie', 'flora', 'agatha', 'flavia', 'madame', 'clara', 'ccncetta', 'cramond', 'madame', 'alice', 'matilda', 'signora', 'lucy', 'helen', 'alice', 'elizabeth', 'flora', 'scholastica', 'lizzy', 'campanula', 'edith', 'heather', 'florentine', 'selina', 'katie', 'lara', 'elsie', 'hale', 'silvia', 'katie', 'stacey', 'alma', "o'hara", 'auld', 'patricia', 'flora', 'honeysuckle', 'tableaux', 'susan', 'ellen', 'elissa', 'hannah', 'jemima', 'mary', 'peggy', 'filena', 'ann', 'augusta', 'ethel', 'mademoiselle', 'hester', 'lucy', 'meg', 'francesca', 'rebecca', 'jeanie', 'valeria', 'bonnie', 'polly', 'maria', 'madame', 'maggie', 'jane', 'nelly', 'lisbeth', 'charlotte', 'mademoiselle', 'alice', 'balaklava', 'malvina', 'elizabeth', 'lindsay', 'sandie', 'evelyn', 'agatha', 'aylmer', 'maddalena', 'mary', 'dinah', 'mary', 'lisa', 'myfanwy', 'patricia', 'alma', 'mary', 'emily', 'janet', 'anna', 'mary', 'merry', 'agnes']
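
The list above contains repeats (for example 'mary' and 'dora') and a few titles, such as 'madame', that are not given names. One possible clean-up, sketched here on a short excerpt of the list, is to remove duplicates with a set and drop known titles. The names excerpt and known_titles are illustrative, and the set of titles is an assumption rather than an exhaustive list.

```python
# Short excerpt of the female_guesses list printed above
excerpt = ['ann', 'mary', 'madame', 'dora', 'mary', 'dora',
           'signora', 'mademoiselle', 'ann']

# Titles that slipped through as "names" (assumed, not exhaustive)
known_titles = {'madame', 'mademoiselle', 'signora'}

# Remove duplicates and titles, then sort alphabetically
unique_given_names = sorted(set(excerpt) - known_titles)
print(unique_given_names)
```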

t.concordance("Ann")

Displaying 8 of 8 matches:
ed , I looked from my window . Sister Ann - like , to see the conclusion of the
erhaps better knoAvn in literature as Ann Taylor ) has called attention to an e
 1764 , the only child of William and Ann Ward , who were in trade of some sort
e De Witts of Holland , and also that Ann Ward , afterwards Mrs Radcliffe , liv
ng , feeble - minded Monimia . " Miss Ann Jane Eliza HoUybourn , who equally re
 . Records of a Giklhood . By Frances Ann Kemble . London : Richard Bentley & S
 a person not in membership with us , Ann Gunn and Elizabeth Allen are desired 
ds .' A Sketch of the Quakers . 423 * Ann Gunnand Elizabeth Allen report that t

Try It! How could you use this list of names to begin identifying the women named in the LEDS dataset? Could you create a list of women and see if they have a place in Wikidata? If not, does the LEDS dataset tell you much about them so you could add them to Wikidata or contribute to the [Wikidata:WikiProject Women](https://www.wikidata.org/wiki/Wikidata:WikiProject_Women)?

3.2 Visualising words over time¶

Using Altair, we can visualise the occurrence of a single word in the LEDS dataset. Let's visualise the most commonly occurring name among those guessed to refer to a female (in the female_guesses list created above)!

Step 1: First we need to determine which name in the female_guesses list occurs most frequently in the LEDS dataset:

fdist = nltk.FreqDist(n for n in str_tokens_lower if n.lower() in female_guesses)
fdist.most_common(5)

[('mary', 420),
 ('maggie', 322),
 ('madame', 233),
 ('elsie', 197),
 ('marie', 197)]

str_tokens_lower.count('mary')

420

Okay, so Mary is the most commonly identified female given name! Now let's count how many times Mary occurs in every publication (file) in the LEDS dataset and create a DataFrame (table) with those counts:

def nameCountPerFile(name, plaintext_corpus_read_lists):
    name_count = []
    for file in fileids:
        file_tokens = plaintext_corpus_read_lists.words(file)
        lower_tokens = [t.lower() for t in file_tokens]
        name_count += [lower_tokens.count(name)]
    return name_count

mary_count = nameCountPerFile('mary', wordlists)
df_mary = df.copy()  # copy so that the original df is left unchanged
df_mary['mary_count'] = mary_count
df_mary
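
To illustrate what nameCountPerFile computes, here is a self-contained sketch on toy data. The dictionary toy_files (file ID mapped to a list of tokens) is invented for illustration and stands in for the NLTK corpus reader. The sketch also reports each count per 1,000 tokens, an optional extra step that may help when volumes differ in length.

```python
# Invented token lists standing in for three volumes
toy_files = {
    'vol1.txt': ['Mary', 'went', 'to', 'Edinburgh', '.', 'MARY', 'returned', '.'],
    'vol2.txt': ['No', 'one', 'was', 'named', 'here', '.'],
    'vol3.txt': ['Dear', 'mary', ',', 'write', 'soon', '.'],
}

def name_count_per_file(name, files):
    """Count case-insensitive occurrences of name in each file's tokens."""
    counts = {}
    for fileid, tokens in files.items():
        lower_tokens = [t.lower() for t in tokens]
        counts[fileid] = lower_tokens.count(name)
    return counts

counts = name_count_per_file('mary', toy_files)

# Occurrences per 1,000 tokens, to account for differing file lengths
rates = {f: 1000 * counts[f] / len(toy_files[f]) for f in toy_files}

for fileid in toy_files:
    print(fileid, counts[fileid], round(rates[fileid], 1))
```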

source = df_mary

alt.Chart(source, title="Occurrence of the name 'Mary' in Ladies' Edinburgh Debating Society dataset").mark_bar(size=30).encode(
    alt.X('title:N', axis=alt.Axis(title='Volume'), sort=None),  # The source dataframe, df_mary, is in chronological order, so we don't want a different sorting
    alt.Y('mary_count:Q', axis=alt.Axis(title='Count'), sort=None)
).configure_axis(
    grid=False,
    labelFontSize=12,
    titleFontSize=12,
    labelAngle=-45
).properties(
    width=480
)

	fileid	title
0	109857781.txt	Attempt - Volume 1 and Select writings - U.431
1	103655648.txt	Attempt - Volume 2 - U.431
2	103655649.txt	Attempt - Volume 3 - U.431
3	103655650.txt	Attempt - Volume 4 - U.431
4	103655651.txt	Attempt - Volume 5 - U.431
5	103655652.txt	Attempt - Volume 6 - U.431
6	103655653.txt	Attempt - Volume 7 - U.431
7	103655654.txt	Attempt - Volume 8 - U.431
8	103655655.txt	Attempt - Volume 9 - U.431
9	103655656.txt	Attempt - Volume 10 - U.431
10	103655658.txt	Ladies' Edinburgh Magazine - Volume 1 - U.393
11	103655659.txt	Ladies' Edinburgh Magazine - Volume 2 - U.393
12	103655660.txt	Ladies' Edinburgh Magazine - Volume 3 - U.393
13	103655661.txt	Ladies' Edinburgh Magazine - Volume 4 - U.393
14	103655662.txt	Ladies' Edinburgh Magazine - Volume 5 - U.393
15	103655663.txt	Ladies' Edinburgh Magazine - Volume 6 - U.393