This notebook was originally prepared for the workshop Advanced Text Analysis with SpaCy and Scikit-Learn, presented as part of NYCDH Week 2017. Here, we try out features of the SpaCy library for natural language processing. We also do some statistical analysis using the scikit-learn library.
Installing this software is easiest on a Linux-like system. If you're not already running Linux, you can easily download a distribution and copy it to a USB disk, which you can then boot from. I recommend getting DH-USB, a Linux-based operating system made for the Digital Humanities. DH-USB already has most of this software installed, including the SpaCy data.
If you have a different Linux-like system (including, to greater or lesser degrees, Ubuntu, MacOS, Cygwin, and Bash for Windows), you should be able to run this command to install SpaCy, Scikit-Learn, Pandas, and the other required libraries. (Ete3, a library for tree visualization, is optional.)
sudo pip install spacy scikit-learn pandas ete3
Note that if your system has Python 2 as the default instead of Python 3, you might have to run pip3 instead of pip.
Now download the SpaCy data with this command:
python -m spacy.en.download all
To get my sent2tree library and all the sample data, simply git clone the repository where this notebook lives:
git clone https://github.com/JonathanReeve/advanced-text-analysis-workshop-2017.git
You can also use the cloud-based Digital Humanities workstation platform DH Box to run the code in this notebook. The installation procedure is slightly different. First, run these commands in the Command Line tab:
sudo pip3 install spacy scikit-learn seaborn
sudo pip3 install --pre ete3
sudo apt-get update
sudo apt-get install python3-pyqt4 # Required by ete3
sudo python3 -m spacy.en.download all
git clone https://github.com/JonathanReeve/advanced-text-analysis-workshop-2017.git
Then open the Jupyter Notebook tab, log in, navigate to the directory advanced-text-analysis-workshop-2017, and open this notebook.
The sample data is the script of the 1975 film Monty Python and the Holy Grail, taken from the NLTK Book corpus, and the Project Gutenberg edition of Jane Austen's novel Pride and Prejudice.
Each SpaCy document is already tokenized into words, which are accessible by iterating over the document:
You can also iterate over the sentences.
doc.sents is a generator object, so we can use next() to grab one sentence at a time. Or you can force it into a list, and then do things with it:
For example, let's find the longest sentence(s) in Pride and Prejudice:
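The pattern here can be sketched with plain Python: force the generator into a list, then use max() with a key function. (In this toy sketch the sentences are strings; with SpaCy they would be spans, and len() would count tokens rather than characters.)

```python
# Stand-in for doc.sents, which is also a one-shot generator.
sentences = (s for s in ["Run!", "It is a silly place.", "He is brave."])

sents = list(sentences)        # a generator can only be consumed once, so keep a list
longest = max(sents, key=len)  # key=len picks the longest item
print(longest)                 # -> "It is a silly place."
```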
Using just the indices (.i), we can make a lexical dispersion plot for the occurrences of that word in the novel. (This is just the SpaCy equivalent of the lexical dispersion plot from the NLTK Book, chapter 1.)
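Underneath, a dispersion plot only needs the token offsets at which the word occurs; those offsets become the x-axis. A minimal stdlib sketch, using a toy token list in place of a parsed document:

```python
# Collect the positions at which a word occurs, the way token.i would
# give us positions in a SpaCy document.
tokens = "it is a truth universally acknowledged that it is known".split()
offsets = [i for i, tok in enumerate(tokens) if tok == "it"]
print(offsets)  # -> [0, 7]
```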
See if you can tell which characters end up getting together at the end, just based on this plot.
Named entities can be accessed through doc.ents. Let's find all the types of named entities in Monty Python and the Holy Grail:
What about those that are works of art?
How about groups of people?
"French" here refers to French people, not the French language. We can verify that by getting all the sentences in which this particular type of entity occurs:
Each word already has a part of speech and a tag associated with it. Here's a list of all the parts of speech in Pride and Prejudice:
It's fun to compare the distribution of parts of speech in each text:
Now we can see, for instance, what the most common pronouns might be:
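The counting pattern is just a Counter over (word, part-of-speech) pairs. A toy sketch, with hand-made pairs standing in for SpaCy's (token.text, token.pos_):

```python
from collections import Counter

# Toy tagged tokens standing in for a parsed SpaCy document.
tagged = [("she", "PRON"), ("walked", "VERB"), ("and", "CCONJ"),
          ("she", "PRON"), ("saw", "VERB"), ("him", "PRON")]

# Keep only the pronouns, then count them.
pronouns = Counter(w.lower() for w, pos in tagged if pos == "PRON")
print(pronouns.most_common(2))  # -> [('she', 2), ('him', 1)]
```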
Let's try this on the level of a sentence. First, let's get all the sentences in which Sir Robin is explicitly mentioned:
Now let's analyze just one of these sentences.
Let's look at the tags and parts of speech:
This sentence has lots of properties:
To drill down into the sentence, we can start with the root:
That root has children:
Let's see all of the children for each word:
This is very messy-looking, so let's create a nicer visualization. Here I'll be using a class I wrote called sentenceTree, available in the sent2tree module in this repository. It just shoehorns a SpaCy span (a sentence or other grammatical fragment) into a tree that can be read by ete3, a tree-handling library that allows for some pretty visualizations.
You can already see how useful this might be. Since adjectives are typically children of the things they describe, we can get approximations for adjectives that describe characters. How is Sir Robin described?
Looks like we shouldn't always trust syntactic insight! Now let's do something similar for Pride and Prejudice. First, we'll use named entity extraction to get a list of the most frequently mentioned characters:
Now we can write a function that walks down the tree from each character, looking for the first adjectives it can find:
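The idea of such a function can be sketched without SpaCy. Below, a toy node type (with hypothetical names) mimics a SpaCy token's .children, and a breadth-first walk collects the first adjectives it reaches:

```python
from collections import namedtuple, deque

# Toy stand-in for a SpaCy token: a text, a part of speech, and children.
Node = namedtuple("Node", ["text", "pos", "children"])

brave = Node("brave", "ADJ", [])
robin = Node("Robin", "PROPN", [Node("Sir", "PROPN", []), brave])

def first_adjectives(token):
    """Walk down the tree breadth-first, collecting adjectives."""
    found, queue = [], deque(token.children)
    while queue:
        child = queue.popleft()
        if child.pos == "ADJ":
            found.append(child.text)   # stop descending once we hit an adjective
        else:
            queue.extend(child.children)
    return found

print(first_adjectives(robin))  # -> ['brave']
```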
We'll try it on Mr. Darcy:
Now let's do the same sort of thing, but look for associated verbs. First, let's get all the sentences in which Elizabeth is mentioned:
And we can peek at one of them:
We want the verb associated with Elizabeth, remained, not the root verb of the sentence, walked, which is associated with Mr. Darcy. So let's write a function that will walk up the dependency tree from a character's name until we get to the first verb. We'll use lemmas instead of the conjugated forms to collapse remain, remains, and remained into one verb: remain.
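Walking up the tree is even simpler, since each SpaCy token has exactly one .head. A hedged sketch with a toy token class (the attribute names .head, .pos_, and .lemma_ match SpaCy's, but the class itself is a stand-in):

```python
class ToyToken:
    def __init__(self, text, pos, lemma):
        self.text, self.pos_, self.lemma_ = text, pos, lemma
        self.head = self  # SpaCy convention: the root is its own head

# A miniature dependency chain: Elizabeth -> remained -> walked (root).
walked = ToyToken("walked", "VERB", "walk")
remained = ToyToken("remained", "VERB", "remain")
remained.head = walked
elizabeth = ToyToken("Elizabeth", "PROPN", "Elizabeth")
elizabeth.head = remained

def first_verb_above(token):
    """Follow .head links until we reach a verb; return its lemma."""
    while token.pos_ != "VERB":
        if token.head is token:   # hit the root without finding a verb
            return None
        token = token.head
    return token.lemma_

print(first_verb_above(elizabeth))  # -> 'remain'
```

Because we stop at the first verb, we get "remain" (Elizabeth's verb), not "walk" (the root).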
We can now merge these counts into a single table, and then we can visualize it with Pandas.
SpaCy has a list of probabilities for English words, and these probabilities are automatically associated with each word once we parse the document. Let's see what the distribution is like:
Let's peek at some of the improbable words for Monty Python and the Holy Grail.
Now we can do some rudimentary information extraction by counting the improbable words:
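The counting step amounts to filtering by a probability cutoff and tallying what survives. A sketch with toy log-probabilities standing in for SpaCy's token.prob (log probabilities, so rarer words are more negative); the cutoff of -10 is an arbitrary assumption:

```python
from collections import Counter

# Toy log-probabilities in place of SpaCy's token.prob.
log_probs = {"the": -3.5, "shrubbery": -14.2, "ni": -17.0, "a": -3.1, "grail": -13.0}

words = "the shrubbery the grail a ni ni".split()
improbable = Counter(w for w in words if log_probs[w] < -10)
print(improbable.most_common())  # -> [('ni', 2), ('shrubbery', 1), ('grail', 1)]
```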
What are those words for Pride and Prejudice?
We can do this with ngrams, too, with some fancy Python magic:
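The usual Python magic for n-grams is to zip the token list against shifted copies of itself, like so:

```python
def ngrams(tokens, n):
    """Return the list of n-grams by zipping n shifted views of the tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "it is a silly place".split()
print(ngrams(tokens, 2))
# -> [('it', 'is'), ('is', 'a'), ('a', 'silly'), ('silly', 'place')]
```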
Word embeddings (word vectors) are numeric representations of words, usually generated via dimensionality reduction on a word cooccurrence matrix for a large corpus. The vectors SpaCy uses are the GloVe vectors, Stanford's Global Vectors for Word Representation. These vectors can be used to calculate semantic similarity between words and documents.
Let's look at vectors for Pride and Prejudice. First, let's get the first 150 nouns:
Now let's get vectors and labels for each of them:
A single vector is 300-dimensional, so in order to plot it in 2D, it might help to reduce the dimensionality to the most meaningful dimensions. We can use Scikit-Learn to perform truncated singular value decomposition for latent semantic analysis (LSA).
Plot the results in a scatter plot:
This uses a non-semantic, bag-of-words technique for vectorizing documents. We won't need any of the fancy features of SpaCy for this, just scikit-learn. We'll use a subset of the Inaugural Address Corpus that contains 20th and 21st century inaugural addresses.
First, we'll vectorize the corpus using scikit-learn's TfidfVectorizer class. This creates a matrix of word frequencies. (It doesn't actually use TF-IDF, since we're turning that off in the options below.)
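With IDF weighting off, what the vectorizer produces is essentially per-document term counts, normalized row by row (L2 normalization is scikit-learn's default). A stdlib sketch of that computation, on a toy two-document corpus:

```python
from collections import Counter
import math

docs = ["america and the world", "the world and america and america"]
vocab = sorted({w for d in docs for w in d.split()})  # the shared column order

def vectorize(doc):
    """Count terms, then L2-normalize the row of counts."""
    counts = Counter(doc.split())
    row = [counts[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

matrix = [vectorize(d) for d in docs]
print(vocab)  # -> ['america', 'and', 'the', 'world']
```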
Let's load the Inaugural Address documents into SpaCy to analyze things like average sentence length. SpaCy makes this really easy.
This sort of thing you've probably already seen in the NLTK book, but it's made even easier in SpaCy. We're simply going to count the occurrences of words and divide by the total number of words in the document.
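The computation itself is one line once the words are counted: divide each count by the document length. A minimal sketch on a toy word list:

```python
from collections import Counter

# Count occurrences and divide by the total number of words.
words = "america and the world and america".split()
counts = Counter(words)
proportions = {w: c / len(words) for w, c in counts.items()}
print(proportions["america"])  # -> 0.3333333333333333 (2 of 6 words)
```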
We can easily slice this data frame with words we're interested in, and plot those words across the corpus. For example, let's look at the proportions of the words "America" and "world":
We can even compute, say, the ratio of uses of the word "America" to uses of the word "world."
Using the .similarity() method from earlier, which uses word vectors, we can very easily compute the document similarity between all the documents in our corpus.
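Under the hood, similarity between two vectors is cosine similarity, which we can sketch in plain Python (SpaCy computes this over averaged word vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Parallel vectors have similarity 1.0, regardless of magnitude.
print(round(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # -> 1.0
```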