Created August-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern
The Ladies' Edinburgh Debating Society (LEDS) was founded in 1865 by women of the upper-middle and upper classes, at a time when women had limited opportunities for higher education. Members went on to play significant roles in education, suffrage, philanthropy, and anti-slavery efforts. The LEDS Dataset contains digitised text from all volumes of the two journals the Society published: The Attempt and The Ladies' Edinburgh Magazine. The first journal comprises ten volumes published from 1865 through 1874; the second comprises six volumes published from 1875 through 1880.
The Ladies' Edinburgh Debating Society, also known as the Edinburgh Essay Society and the Ladies' Edinburgh Essay Society, was dissolved in 1935. A year later, in 1936, the National Library of Scotland acquired the volumes that were digitised in this dataset.
Import libraries to use for cleaning, summarising and exploring the data:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
getattr(ssl, '_create_unverified_context', None)):
ssl._create_default_https_context = ssl._create_unverified_context
# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re
from collections import defaultdict
import urllib.request
import urllib
import json
# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt
# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('tagsets') # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt
To explore the text in the Ladies' Edinburgh Debating Society collection, we'll mainly use the Natural Language Toolkit (NLTK), a library written for the programming language Python.
The nls-text-ladiesDebating folder (downloadable as Just the text data from the website at the top of this notebook) contains TXT files of digitised text with numerical names, as well as a CSV inventory file and a TXT ReadMe file. Load only the TXT files of digitised text and tokenise the text (which splits running text into separate words, numbers, and punctuation):
corpus_folder = 'data/nls-text-ladiesDebating/'
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')
corpus_tokens = wordlists.words()
print(corpus_tokens[:10])
Note: If you'd like to see how to specify a single TXT file to load as data, check out the Jupyter Notebook for Exploring Britain and UK Handbooks!
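For instance, here's a minimal sketch of that idea using the reader we just created (the file name below is one of the digitised volumes listed in the inventory we load later):
single_file_tokens = wordlists.words('103655648.txt')  # tokens from one volume only
print("Tokens in this volume:", len(single_file_tokens))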
It's hard to get a sense of how accurately the text has been digitised from this list of 10 tokens, so let's look at a word in context. To see phrases in which "Mrs" is used, we can use the concordance() method:
t = Text(corpus_tokens)
t.concordance('Mrs', lines=20)
This dataset was not manually cleaned after the text of The Attempt and The Ladies' Edinburgh Magazine was digitised with Optical Character Recognition (OCR), so it's not surprising to see some non-words appear in the concordance. Even with the digitisation errors, though, we can still get a sense of what's in the text using natural language processing (NLP) methods!
Before we do much analysis, let's get a sense of how much data we're working with:
def corpusStatistics(plaintext_corpus_read_lists):
total_chars = 0
total_tokens = 0
total_sents = 0
total_files = 0
# fileids are the TXT file names in the nls-text-ladiesDebating folder:
for fileid in plaintext_corpus_read_lists.fileids():
total_chars += len(plaintext_corpus_read_lists.raw(fileid))
total_tokens += len(plaintext_corpus_read_lists.words(fileid))
total_sents += len(plaintext_corpus_read_lists.sents(fileid))
total_files += 1
print("Total...")
print(" Characters in Ladies' Edinburgh Debating Society (LEDS) Data:", total_chars)
print(" Tokens in LEDS Data:", total_tokens)
print(" Sentences in LEDS Data:", total_sents)
print(" Files in LEDS Data:", total_files)
corpusStatistics(wordlists)
Note that I've printed Tokens rather than words, though the NLTK method used to count those was .words(). This is because words in NLTK include punctuation and numbers, in addition to letters.
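As a quick sketch of that distinction (a preview of the filtering we do in section 1), we could count how many of the tokens are purely alphabetic:
# Count the tokens made up only of letters (no punctuation or digits)
alpha_token_count = sum(1 for tok in corpus_tokens if str(tok).isalpha())
print("Alphabetic tokens:", alpha_token_count, "of", len(corpus_tokens), "total tokens")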
Next, we'll create two subsets of the data, one for each journal. To do so, we first need to load the inventory (CSV file) that lists which file name corresponds with which journal. When you open the inventory in Microsoft Excel or a text editor, you can see that there are no column names. The Python library Pandas, which can read CSV files, calls the column names the header. When we use Pandas to read the inventory, we'll create our own header by specifying header=None and providing a list of column names.
When Pandas (abbreviated pd when we loaded libraries in the first cell of this notebook) reads a CSV file, it creates a table called a dataframe from that data. Let's see what the LEDS inventory dataframe looks like:
df = pd.read_csv('data/nls-text-ladiesDebating/ladiesDebating-inventory.csv', header=None, names=['fileid', 'title'])
df
Since we only have 16 files (with indices running from 0 through 15), we'll print the entire dataframe. With larger dataframes you may wish to use df.head() or df.tail() to print only the first 5 rows or last 5 rows, respectively.
Now we can create two dictionaries of file IDs and their associated journal titles, one for The Attempt and one for The Ladies' Edinburgh Magazine:
attempts = {}
mags = {}
for index, row in df.iterrows():
fileid = row['fileid']
title = row['title']
if 'Attempt' in title:
attempts[fileid] = title
else: # if 'Magazine' in title:
mags[fileid] = title
print("The Attempt files:")
print(attempts)
print("\n Ladies' Edinburgh Magazine files:") # \n is a newline character
print(mags)
For convenient reference to just the file IDs, we can also create lists from the dictionaries' keys:
attempt_ids = list(attempts.keys())
mag_ids = list(mags.keys())
print(mag_ids)
NLTK organises the corpus we loaded by file ID: the corpus_tokens list concatenates the files in file-ID order, and the wordlists reader can return the tokens of any single file. So it's useful to be able to match the file IDs with their journal titles!
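For example, a quick sketch using the first file ID of The Attempt to pull out the tokens of a single volume:
first_attempt_id = attempt_ids[0]
first_attempt_tokens = wordlists.words(first_attempt_id)  # tokens for just this volume
print(attempts[first_attempt_id], "has", len(first_attempt_tokens), "tokens")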
There are several ways to standardise, or "normalise," text, with each way providing data suitable to different types of analysis. For example, to study the vocabulary of a text, it's useful to remove punctuation and digits, lowercase the remaining alphabetic words, and then reduce those words to their root form (with stemming or lemmatisation - more on this later). Alternatively, to identify people and places using named entity recognition, it's important to keep capitalisation in words and keep words in the context of their sentences.
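As a quick illustration of the difference (a sketch: the later sections use the Porter stemmer, while the lemmatiser shown here relies on the WordNet data downloaded in the first cell):
from nltk.stem import WordNetLemmatizer
# Stemming chops off affixes; lemmatisation maps a word to its dictionary form
print("Stem of 'libraries':", PorterStemmer().stem("libraries"))            # 'librari'
print("Lemma of 'libraries':", WordNetLemmatizer().lemmatize("libraries"))  # 'library'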
In section 0. Preparation, we tokenised the LEDS dataset when we created the corpus_tokens list. corpus_tokens contains all the words, punctuation, and numbers that appear in the LEDS dataset, separated into individual items and organised in the order they appear in the LEDS text files. In addition to tokenising words, NLTK also provides methods to tokenise sentences. This is how we counted the number of sentences in section 0.1 Dataset Size.
Tokenised words are helpful when analysing the vocabulary of a text. Tokenised sentences are helpful when analysing the linguistic patterns of a text. Let's create lists of tokens as strings (String is Python's data format for text) from the LEDS dataset:
# Create a list of tokens as strings for the entire corpus
str_tokens = [str(word) for word in corpus_tokens]
print(str_tokens[0:10])
# Create a list of tokens as strings for The Attempt
attempt_str_tokens = []
for fileid in attempt_ids:
attempt_tokens = wordlists.words(fileid)
attempt_str_tokens += [str(t) for t in attempt_tokens]
print(attempt_str_tokens[-10:])
# Create a list of tokens as strings for Ladies' Edinburgh Magazine
mag_str_tokens = []
for fileid in mag_ids:
mag_tokens = wordlists.words(fileid)
mag_str_tokens += [str(t) for t in mag_tokens]
print(mag_str_tokens[200:210])
Let's also create a list of tokens that are most likely to be valid English words by removing non-alphabetic tokens (e.g. punctuation, numbers) from str_tokens:
alpha_tokens = [t for t in str_tokens if t.isalpha()]
print(alpha_tokens[1000:1010])
Knowing that the digitised text in the LEDS dataset wasn't cleaned up after OCR, there may be words whose letters were incorrectly digitised as punctuation or numbers. To include those words, we'll put all tokens that contain at least one letter in a with_letters list:
with_letters = [t for t in str_tokens if re.search("[a-zA-Z]+", t)]
print(with_letters[2000:2010])
Next, we'll create lowercased versions (this is called casefolding in NLP) of the previous lists of tokens, which, as explained at the beginning of this section, can be useful for studying the vocabulary of a dataset:
str_tokens_lower = [(str(word)).lower() for word in corpus_tokens]
alpha_tokens_lower = [t for t in str_tokens_lower if t.isalpha()]
with_letters_lower = [t for t in str_tokens_lower if re.search("[a-zA-Z]+", t)]
# Check that the capitalised and lowercased lists of tokens are the same length, as expected
assert(len(str_tokens_lower) == len(str_tokens)) # an error will be thrown if something went wrong
assert(len(alpha_tokens_lower) == len(alpha_tokens)) # an error will be thrown if something went wrong
assert(len(with_letters_lower) == len(with_letters)) # an error will be thrown if something went wrong
As stated at the start of this section, we can also tokenise sentences. Tokenising sentences separates running text into individual sentences, which is necessary for analysing sentence structure. Let's create one list of all sentences in the LEDS corpus, and a dictionary of lists for each file in the corpus:
all_sents = []
sents_by_file = dict.fromkeys(wordlists.fileids())
# Iterate through each file in the LEDS corpus
for fileid in wordlists.fileids():
    file_sents = [str(sent) for sent in sent_tokenize(wordlists.raw(fileid))]
    all_sents += file_sents
    sents_by_file[fileid] = file_sents  # store only this file's sentences
print("Sample:", all_sents[200:205])
I wonder whether the language changed from The Attempt to the later The Ladies' Edinburgh Magazine. Let's create lists of all sentences for each of these publications so that their language can be compared and contrasted:
attempt_file_sents = dict.fromkeys(attempt_ids)
attempt_sents = []
# Iterate through each file (volume) of The Attempt
for fileid in attempt_ids:
    file_sents = [str(sent) for sent in sent_tokenize(wordlists.raw(fileid))]
    attempt_sents += file_sents
    attempt_file_sents[fileid] = file_sents  # store only this file's sentences
print("Total sentences in The Attempt:", len(attempt_sents))
print("Sample:", attempt_file_sents["103655648.txt"][400:405])
print()
mag_file_sents = dict.fromkeys(mag_ids)
mag_sents = []
# Iterate through each file (volume) of The Ladies' Edinburgh Magazine
for fileid in mag_ids:
    file_sents = [str(sent) for sent in sent_tokenize(wordlists.raw(fileid))]
    mag_sents += file_sents
    mag_file_sents[fileid] = file_sents  # store only this file's sentences
print("Total sentences in The Ladies' Edinburgh Magazine:", len(mag_sents))
print("Sample:", mag_file_sents["103655659.txt"][250:255])
As we saw in the results of the concordance() method, OCR doesn't produce perfectly digitised text. To get a sense of how many mistakes may have been made in the digitisation process, we can measure how many words in the LEDS dataset are recognisable English words according to a list of words considered valid in the board game Scrabble (as demonstrated in this example).
As mentioned in section 1.1 Tokenisation, there are several ways to standardise ("normalise") text, with each way providing text suitable to different types of analysis. We're concerned with studying vocabulary, since we want to measure how many of the alphabetic tokens that NLTK has identified in the LEDS dataset are valid English words, so we'll work with the lowercased, alphabetic tokens in our alpha_tokens_lower list.
To efficiently measure the number of valid and invalid English words, we can further standardise our data through stemming. Stemming reduces words to their root form by removing suffixes and prefixes. For example, the word "troubling" has a stem of "troubl."
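For instance, a one-line sketch checking that example with the Porter stemmer imported earlier:
print(PorterStemmer().stem("troubling"))  # expected output: 'troubl'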
In the next 3 steps we'll load the Scrabble dataset of valid English words, stem the Scrabble dataset and LEDS dataset, and then see if the stems from the LEDS dataset are present in the Scrabble dataset.
Step 1: First we'll load the Scrabble file of words (which helpfully includes British English spellings!) and create a lowercased list of those words:
with open('data/scrabble_words.txt', 'r') as file:
    scrabble_words = file.read().split('\n')
scrabble_words_lower = [word.lower() for word in scrabble_words]
assert(len(scrabble_words) == len(scrabble_words_lower)) # the number of words shouldn't change when the list is lowercased
print("Total words in Scrabble list:", len(scrabble_words))
print("Sample of English words from the Scrabble list:", scrabble_words_lower[100:120])
Step 2: Next we'll stem the tokens in the Scrabble list and the LEDS dataset. There are different algorithms that one can use to determine the root of a word; we will use the Porter Stemmer algorithm. To make our code as efficient as possible, we'll create sets of the Scrabble and LEDS stems (sets are a Python data structure similar to lists, except that each item in a set is unique, so there are no duplicates).
This process should give us a smaller number of words to compare and should enable tokens in LEDS to be recognised as English words even if they appeared in a different form in the Scrabble list.
porter = nltk.PorterStemmer()
unique_alpha_lower = list(set(alpha_tokens_lower)) # Remove duplicates from the lowercased, alphabetic tokens in the LEDS dataset
leds_porter_stemmed = [porter.stem(t) for t in unique_alpha_lower]
scrabble_porter_stemmed = [porter.stem(t) for t in scrabble_words_lower]
# Remove duplicates from the Scrabble and LEDS lists of stems
leds_pstemmed_set = list(set(leds_porter_stemmed))
scrabble_pstemmed_set = list(set(scrabble_porter_stemmed))
print(leds_pstemmed_set[:10])
print(scrabble_pstemmed_set[50:60])
Step 3: Lastly, we'll compare the stems (root forms) of LEDS tokens to the stems of Scrabble words to gauge how many LEDS tokens are recognisable English words.
scrabble_stem_lookup = set(scrabble_porter_stemmed)  # a true set makes the membership checks below fast
recognised_stems = 0
for stem in leds_porter_stemmed:
    if stem in scrabble_stem_lookup:
        recognised_stems += 1
print("Recognised Stems:", (recognised_stems/len(leds_porter_stemmed))*100, "%")
Rather than comparing stems in the Scrabble and LEDS dataset, you could also compare lemmas or the entire vocabularies (all lowercased, unique tokens). Comparing the entire vocabularies will take longer than comparing stems and lemmas, though.
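Here is a sketch of the lemma-based alternative, assuming WordNet's default (noun) lemmas are good enough for a rough comparison:
from nltk.stem import WordNetLemmatizer
lemmatiser = WordNetLemmatizer()
# Lemmatise the unique LEDS tokens and the Scrabble words, then compare the two sets
leds_lemmas = set(lemmatiser.lemmatize(t) for t in unique_alpha_lower)
scrabble_lemmas = set(lemmatiser.lemmatize(w) for w in scrabble_words_lower)
recognised_lemmas = sum(1 for lemma in leds_lemmas if lemma in scrabble_lemmas)
print("Recognised Lemmas:", (recognised_lemmas/len(leds_lemmas))*100, "%")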
It looks as though just under half the stems in the LEDS text aren't recognised...how might we figure out what some of those words are meant to be?
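One possible starting point (a sketch, reusing the scrabble_stem_lookup set from the previous step): list a sample of the tokens whose stems weren't recognised and view one of them in context with concordance().
# Pair each unique token with its stem and keep the ones whose stems aren't in the Scrabble set
unrecognised_tokens = [tok for tok, stem in zip(unique_alpha_lower, leds_porter_stemmed) if stem not in scrabble_stem_lookup]
print("Sample of unrecognised tokens:", unrecognised_tokens[:15])
t.concordance(unrecognised_tokens[0], lines=5)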
Another form of standardisation in text analysis is tagging sentences, or identifying the parts of speech in sentences. Identifying parts of speech that compose the structure of sentences is important for analysing linguistic patterns and comparing the writing styles of different texts. We'll use NLTK's built-in part of speech tagger to tag sentences for the entire corpus:
fileids = list(df['fileid'])
tagged_sents = []
for fileid in fileids:
file = wordlists.raw(fileid)
sentences = nltk.sent_tokenize(file)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]
tagged_sents += [sent for sent in sentences]
print("Total part-of-speech tagged sentences:", len(tagged_sents))
print("Sample:", tagged_sents[1000:1003])
Great! We'll use these tagged sentences later on, in 3. Exploratory Analysis, to help us identify named entities (i.e. people, places, organisations) in the LEDS dataset.
Now that we've created some different cuts of the LEDS dataset, let's start investigating the frequency of terms as they appear across the dataset. One way to do so is with a frequency distribution, which records how many times each token appears in a dataset and can be plotted as a line chart. The following 3 steps demonstrate how to visualise frequency distributions.
Step 1: Filter the tokens in each LEDS publication to exclude one-letter words, two-letter words, and stop words (such as and, a, and the), and then lowercase all the tokens:
# Use NLTK's provided stop words for the English language
to_exclude = list(set(stopwords.words('english')))
to_exclude += ['attempt', 'magazine', 'ladies', 'edinburgh'] # add words from the journals' titles
# Filter one-letter words, two-letter words, and stop words out of the list of The Attempt tokens
attempt_min_three_letters = []
attempt_min_three_letters += [t.lower() for t in attempt_str_tokens if len(t) > 2]
attempt_filtered_tokens = [t for t in attempt_min_three_letters if not t in to_exclude]
print("Sample of The Attempt tokens after filtering:", attempt_filtered_tokens[60:70])
# Filter one-letter words, two-letter words, and stop words out of the list of Ladies' Edinburgh Magazine tokens
mag_min_three_letters = []
mag_min_three_letters += [t.lower() for t in mag_str_tokens if len(t) > 2]
mag_filtered_tokens = [t for t in mag_min_three_letters if not t in to_exclude]
print("Sample of Ladies' Edinburgh Magazine tokens after filtering:", mag_filtered_tokens[200:210])
Step 2: Calculate the frequency distribution for each LEDS publication using NLTK's FreqDist() method:
# Calculate the frequency distribution for each filtered list of tokens
attempt_fdist = FreqDist(attempt_filtered_tokens)
print("Total tokens in The Attempt after filtering:", attempt_fdist.N())
mag_fdist = FreqDist(mag_filtered_tokens)
print("Total tokens in Ladies' Edinburgh Magazine after filtering:", mag_fdist.N())
Step 3: Plot the frequency distributions for each LEDS publication:
# Visualise the frequency distribution for a select number of tokens
plt.figure(figsize = (18, 8)) # customise the width and height of the plot
plt.rc('font', size=12) # customise the font size of the title, axes names, and axes labels
attempt_fdist.plot(20, title='Frequency Distribution of the 20 Most Common Words in The Attempt (excluding stop words, 1-letter and 2-letter words)')
# Visualise the frequency distribution for a select number of tokens
plt.figure(figsize = (18, 8)) # customise the width and height of the plot
plt.rc('font', size=12) # customise the font size of the title, axes names, and axes labels
mag_fdist.plot(20, title="Frequency Distribution of the 20 Most Common Words in Ladies' Edinburgh Magazine (excluding stop words, 1-letter and 2-letter words)")
To measure the diversity of word choice in a text, we can use the lexical diversity metric, which is the length of the vocabulary of a text divided by the total length of the text. Length is the total number of words, and vocabulary is a non-repeating list of words (unique words) in a text.
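To make the metric concrete, here's a tiny worked example (a sketch with a made-up six-word "text"):
toy_text = ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(len(set(toy_text)) / len(toy_text))  # 5 unique words / 6 total words = 0.8333...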
Let's compare the lexical diversities of the two publications in the LEDS dataset:
Step 1: First, let's remove all tokens that aren't words by excluding tokens made up of punctuation and digits rather than letters. We'll also casefold all the words to standardise them, so that The and the are considered the same word, for example.
# Remove non-alphabetic tokens (exclude punctuation and digits) and lowercase all tokens
attempt_alpha_lower = [t.lower() for t in attempt_str_tokens if t.isalpha()]
mag_alpha_lower = [t.lower() for t in mag_str_tokens if t.isalpha()]
# Print the lengths (total words) of each publication
print("The Attempt length:", len(attempt_alpha_lower), "words")
print("Ladies' Edinburgh Magazine length:", len(mag_alpha_lower), "words")
So The Attempt files contain slightly more words in total than the Ladies' Edinburgh Magazine files.
Step 2: Next, let's find the vocabulary of the two publications.
attempt_vocab = set(attempt_alpha_lower)
mag_vocab = set(mag_alpha_lower)
print("The Attempt vocabulary size:", len(attempt_vocab), "words")
print("Ladies' Edinburgh Magazine vocablary size:", len(mag_vocab), "words")
So The Attempt has a larger vocabulary size than Ladies' Edinburgh Magazine. Given that The Attempt's overall length is longer, this isn't surprising. To compare the vocabularies (word choice) of the two publications relative to their lengths, we use the lexical diversity metric.
Step 3: Calculate the lexical diversity of each publication.
# INPUT: a list of all words and a vocabulary list for a text source
# OUTPUT: the number of unique words (length of the vocabulary) divided by
# the total words of a text source (the lexical diversity score)
def lexicalDiversity(all_words, vocab):
return len(vocab)/len(all_words)
print("The Attempt's lexical diversity score:", lexicalDiversity(attempt_alpha_lower, attempt_vocab))
print("Ladies' Edinburgh Magazine's lexical diversity score:", lexicalDiversity(mag_alpha_lower, mag_vocab))
The scores are very close! The word choice in The Attempt is only slightly more diverse than in Ladies' Edinburgh Magazine.
In NLP, named entity recognition is the process of identifying people, places, and organisations ("entities") that are named in a dataset. In order to recognise entities, a dataset of running text must be tokenised into sentences, and then those sentences must be tagged with parts of speech. Entities' names are often capitalised, so we do not casefold text on which we want to run named entity recognition.
We've already tokenised sentences in the LEDS dataset and tagged their parts of speech in 1.3 Part of Speech Tagging. For named entity recognition we'll use spaCy, which runs its own tokenisation and tagging pipeline, so here we only need the raw sentences of each file.
First, we need to make sure we have the spaCy language model we are going to use:
try:
import en_core_web_sm
except ImportError:
print("Downlading en_core_web_sm model")
import sys
!{sys.executable} -m spacy download en_core_web_sm
else:
print("Already have en_core_web_sm")
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
sentences = []
for fileid in fileids:
file = wordlists.raw(fileid)
sentences += nltk.sent_tokenize(file)
person_list = []
for s in sentences:
s_ne = nlp(s)
for entity in s_ne.ents:
if entity.label_ == 'PERSON':
person_list += [entity.text]
print(len(person_list))
displacy.render(nlp(str(sentences[29997])), jupyter=True, style='ent')
unique_persons = list(set(person_list))
print(len(unique_persons))
names = []
for name in unique_persons:
    if re.search(r'([A-Z]{1}([a-z])+\.?)', name):
names += [name]
print(len(names))
Next, we can use an API called genderize.io to guess whether each name more likely refers to a man or a woman:
def guessGender(person_name):
genderize_url = 'https://api.genderize.io?name='
country_gb = '&country=GB'
url = genderize_url+person_name+country_gb
content = (urllib.request.urlopen(url)).read()
return str(content).strip("b'")
from urllib.error import HTTPError  # needed to catch the API's rate-limit error below
gender_guesses = []
errored = []
titles = ['mrs', 'ms', 'mr', 'miss', 'sir', "ma'am", 'lord', 'lady', 'king', 'queen', 'duchess', 'duke', 'mademoiselle', 'madame', 'monsieur', 'signora']
for name in names:
    name = name.lower()
    for title in titles:
        if title in name:
            # Remove the title and any whitespace after the title
            name = name.replace(title, "").strip()
    # If the name includes more than a given name (i.e. family name, middle
    # name), create a list of each name and take only the first list item
    name = name.split()
    name = name[0]
    try:
        guess = guessGender(name)
        gender_guesses += [guess]
    except UnicodeEncodeError:
        errored += [name]
    # If there are too many requests (genderize.io only allows
    # 1000 per day without an API key), end the loop
    except HTTPError:
        break
print(gender_guesses[:3])
print("Number of gender guesses made:", len(gender_guesses))
There's a limit on the number of requests that can be made to genderize.io in a single day, so for now let's simply use the 967 guesses we just made. Let's calculate the number of names guessed to be "male" and "female" with a probability of at least 0.9 (90%). To make it easier to find guesses that meet this criterion, we'll convert the gender guesses to a different data structure. Genderize.io sends responses (returns gender guesses) in the JSON data format, which is similar to Python's dictionary data structure, so we'll convert the string representations of the JSON responses into dictionaries. Then we'll check whether a name is guessed as representing a "male" or "female" gender-identifying person with at least 90% probability.
import json
male_guesses = []
female_guesses = []
for response in gender_guesses:
response = response.replace("\\","")
response = response.replace("\'s","")
try:
response = json.loads(response)
if response["probability"] >= 0.9:
if response["gender"] == "male":
male_guesses += [response["name"]]
elif response["gender"] == "female":
female_guesses += [response["name"]]
    # If the response can't be parsed or doesn't contain the
    # expected keys, print it to see whether the name is valid
    except (json.JSONDecodeError, KeyError):
        print(response)
print("Names guessed male:", len(male_guesses))
print("Names guessed female:", len(female_guesses))
The name that threw an error isn't a valid name, so we won't worry about it. Let's take a closer look at the names guessed as female:
print(female_guesses)
t.concordance("Ann")
Using Altair, we can visualise the occurrence of a single word in the LEDS dataset. Let's visualise the most commonly occurring name from among those guessed to be referring to a female (in the female_guesses list created above)!
Step 1: First we need to determine which name in the female_guesses list occurs most frequently in the LEDS dataset:
fdist = nltk.FreqDist(n for n in str_tokens_lower if n.lower() in female_guesses)
fdist.most_common(5)
str_tokens_lower.count('mary')
Okay, so Mary is the most commonly identified female given name! Now let's count how many times Mary occurs in each publication (file) in the LEDS dataset and create a DataFrame (table) with those counts:
def nameCountPerFile(name, plaintext_corpus_read_lists):
name_count = []
for file in fileids:
file_tokens = plaintext_corpus_read_lists.words(file)
lower_tokens = [t.lower() for t in file_tokens]
name_count += [lower_tokens.count(name)]
return name_count
mary_count = nameCountPerFile('mary', wordlists)
df_mary = df.copy()  # copy the inventory dataframe so the original isn't modified
df_mary['mary_count'] = mary_count
df_mary
source = df_mary
alt.Chart(source, title="Occurrence of the name 'Mary' in Ladies' Edinburgh Debating Society dataset").mark_bar(size=30).encode(
alt.X('title:N', axis=alt.Axis(title='Volume'), sort=None), # The source dataframe, df_mary, is in chronological order, so we don't want a different sorting
alt.Y('mary_count:Q', axis=alt.Axis(title='Count'), sort=None)
).configure_axis(
grid=False,
labelFontSize=12,
titleFontSize=12,
labelAngle=-45
).properties(
width=480
)