Advanced Text Analysis with SpaCy and Scikit-Learn


This notebook was originally prepared for the workshop Advanced Text Analysis with SpaCy and Scikit-Learn, presented as part of NYCDH Week 2017. Here, we try out features of the SpaCy library for natural language processing. We also do some statistical analysis using the scikit-learn library.

Prepared by Jonathan Reeve (Group for Experimental Methods in the Humanities, Columbia University). All code here is licensed under the MIT License.

Installation


Installing this software is easiest on a Linux-like system. If you're not already running Linux, you can easily download a distribution and copy it to a USB disk, which you can then boot from. I recommend getting DH-USB, a Linux-based operating system made for the Digital Humanities. DH-USB already has most of this software installed, including the SpaCy data, although you may still need to install ete3.

If you have a different Linux-like system (including, to greater or lesser degrees, Ubuntu, MacOS, Cygwin, and Bash for Windows), you should be able to run this command to install SpaCy, Scikit-Learn, Pandas, and the other required libraries. Ete3, a library for tree visualization, is optional.

sudo pip install spacy scikit-learn pandas ete3

Note that if your system has Python 2 rather than Python 3 as the default, you might have to run pip3 instead of pip.

Now download the SpaCy data with this command:

python -m spacy.en.download all

To get my sent2tree library and all the sample data, simply git clone the repository where this notebook lives:

git clone https://github.com/JonathanReeve/advanced-text-analysis-workshop-2017.git

Installation on DH Box


You can also use the cloud-based Digital Humanities workstation platform DH Box to run the code in this notebook. The installation procedure is slightly different. First, run these commands in the Command Line tab:

sudo pip3 install spacy scikit-learn seaborn
sudo pip3 install --pre ete3
sudo apt-get update
sudo apt-get install python3-pyqt4 # Required by ete3
sudo python3 -m spacy.en.download all
git clone https://github.com/JonathanReeve/advanced-text-analysis-workshop-2017.git

Then open the Jupyter Notebook tab, log in, navigate to the directory advanced-text-analysis-workshop-2017, and open this notebook, advanced-text-analysis.ipynb.

The sample data consists of the script of the 1975 film Monty Python and the Holy Grail, taken from the NLTK Book corpus, and the Project Gutenberg edition of Jane Austen's novel Pride and Prejudice.

Exploring the Document


Each SpaCy document is already tokenized into words, which are accessible by iterating over the document:

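A minimal sketch of the setup, assuming the sample texts live alongside this notebook; the file names (grail.txt, pride.txt) and variable names here are assumptions, and the model-loading call varies by SpaCy version:

import spacy
nlp = spacy.load('en')  # 'en_core_web_sm' in newer SpaCy releases

grail = nlp(open('grail.txt').read())  # Monty Python and the Holy Grail (assumed path)
pride = nlp(open('pride.txt').read())  # Pride and Prejudice (assumed path)

# Iterating over a parsed document yields Token objects.
for word in grail[:8]:
    print(word.text)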

You can also iterate over the sentences. doc.sents is a generator object, so we can use next():

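For example, something along these lines (pride carried over from the sketch above):

sentences = pride.sents   # a generator, not a list
print(next(sentences))    # the first sentence
print(next(sentences))    # the next one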

Or you can force it into a list, and then do things with it:

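For instance:

sentences = list(pride.sents)
len(sentences)   # how many sentences the novel contains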

For example, let's find the longest sentence(s) in Pride and Prejudice:

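One way to sketch it, sorting spans by their length in tokens:

longest = max(list(pride.sents), key=len)   # Spans support len()
print(len(longest), longest.text)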

Exploring Words


Each word has a crazy number of properties:

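A small sample of those properties (the token index is arbitrary):

word = pride[100]
print(word.text, word.i, word.lemma_, word.pos_, word.tag_, word.is_stop, word.prob)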

Using just the indices (.i), we can make a lexical dispersion plot for the occurrences of that word in the novel. (This is just the SpaCy equivalent of the lexical dispersion plot from the NLTK Book, chapter 1.)

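A minimal dispersion plot along these lines, using matplotlib directly (the character names chosen here are assumptions):

import matplotlib.pyplot as plt

names = ['Elizabeth', 'Darcy', 'Jane', 'Bingley']
for y, name in enumerate(names):
    # .i gives each token's offset in the document
    xs = [w.i for w in pride if w.text == name]
    plt.plot(xs, [y] * len(xs), '|')
plt.yticks(range(len(names)), names)
plt.xlabel('word offset in the novel')
plt.show()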

See if you can tell which characters end up getting together at the end, just based on this plot.

Exploring Named Entities


Named entities can be accessed through doc.ents. Let's find all the types of named entities from Monty Python and the Holy Grail:

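Something like:

entityTypes = set(ent.label_ for ent in grail.ents)
print(entityTypes)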

What about those that are works of art?


Place names?


Organizations?


How about groups of people?

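Each of the questions above filters doc.ents by its label. A generic sketch covering all four (the label names follow SpaCy's scheme: 'GPE' for geopolitical places, 'NORP' for nationalities and other groups of people):

def entitiesOfType(doc, label):
    return [ent.text for ent in doc.ents if ent.label_ == label]

entitiesOfType(grail, 'WORK_OF_ART')
entitiesOfType(grail, 'GPE')    # place names
entitiesOfType(grail, 'ORG')    # organizations
entitiesOfType(grail, 'NORP')   # groups of people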

"French" here refers to French people, not the French language. We can verify that by getting all the sentences in which this particular type of entity occurs:

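A rough check along these lines, matching on the surface text of each sentence:

frenchSents = [sent for sent in grail.sents if 'French' in sent.text]
for sent in frenchSents[:5]:
    print(sent.text)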

Parts of Speech


Each word already has a part of speech and a tag associated with it. Here's a list of all the parts of speech in Pride and Prejudice:

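For example:

posTypes = set(word.pos_ for word in pride)
print(posTypes)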

It's fun to compare the distribution of parts of speech in each text:

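A sketch of one way to compare them, as proportions, with Pandas:

from collections import Counter
import pandas as pd

def posProportions(doc):
    counts = Counter(word.pos_ for word in doc)
    total = sum(counts.values())
    return {pos: n / total for pos, n in counts.items()}

posDf = pd.DataFrame({'Grail': posProportions(grail),
                      'Pride': posProportions(pride)})
posDf.plot(kind='bar')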

Now we can see, for instance, what the most common pronouns might be:

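For instance:

from collections import Counter
Counter(w.text.lower() for w in pride if w.pos_ == 'PRON').most_common(10)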

Let's try this on the level of a sentence. First, let's get all the sentences in which Sir Robin is explicitly mentioned:

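A sketch, again matching on the surface text:

robinSents = [sent for sent in grail.sents if 'Robin' in sent.text]
len(robinSents)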

Now let's analyze just one of these sentences.

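Picking one arbitrarily:

sentence = robinSents[0]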

Let's look at the tags and parts of speech:
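Along these lines:

for word in sentence:
    print(word.text, word.pos_, word.tag_)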

Dependency Parsing


Now let's analyze the structure of the sentence.

This sentence has lots of properties:

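A few of them, as a sketch:

print(sentence.text)                  # the raw text of the span
print(sentence.root)                  # the syntactic root of the sentence
print(sentence.start, sentence.end)   # token offsets of the span in the document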

To drill down into the sentence, we can start with the root:

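For example:

root = sentence.root
print(root.text, root.pos_, root.dep_)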

That root has children:

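For example:

list(root.children)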

Let's see all of the children for each word:
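Something like:

for word in sentence:
    print(word.text, '->', list(word.children))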

This is very messy-looking, so let's create a nicer visualization. Here I'll be using a class I wrote called sentenceTree, available in the sent2tree module in this repository. It shoehorns a SpaCy span (a sentence or other grammatical fragment) into a tree that can be read by ete3, a library that allows for some pretty tree visualizations.


You can already see how useful this might be. Since adjectives are typically children of the things they describe, we can get approximations for adjectives that describe characters. How is Sir Robin described?
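A quick sketch of that check, gathering adjective children of each mention:

[child.text for w in grail if w.text == 'Robin'
 for child in w.children if child.pos_ == 'ADJ']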

Looks like we shouldn't always trust syntactic insight! Now let's do something similar for Pride and Prejudice. First, we'll use named entity extraction to get a list of the most frequently mentioned characters:

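For example:

from collections import Counter
characters = Counter(ent.text for ent in pride.ents if ent.label_ == 'PERSON')
characters.most_common(10)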

Now we can write a function that walks down the tree from each character, looking for the first adjectives it can find:
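A sketch of one possible implementation, a breadth-first walk down from each mention (this is a guess at the original function, not a copy of it):

def firstAdjectives(doc, name):
    found = []
    for word in doc:
        if word.text != name:
            continue
        # Search each successive level of descendants, stopping
        # at the first level where any adjectives appear.
        level = list(word.children)
        while level:
            adjectives = [w for w in level if w.pos_ == 'ADJ']
            if adjectives:
                found.extend(w.text for w in adjectives)
                break
            level = [child for w in level for child in w.children]
    return found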

We'll try it on Mr. Darcy:

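For example:

firstAdjectives(pride, 'Darcy')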

Now let's do the same sort of thing, but look for associated verbs. First, let's get all the sentences in which Elizabeth is mentioned:
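Again matching on surface text:

elizabethSents = [sent for sent in pride.sents if 'Elizabeth' in sent.text]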

And we can peek at one of them:

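For instance (the index is arbitrary):

elizabethSents[5]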

We want the verb associated with Elizabeth, remained, not the root verb of the sentence, walked, which is associated with Mr. Darcy. So let's write a function that will walk up the dependency tree from a character's name until we get to the first verb. We'll use lemmas instead of the conjugated forms to collapse remain, remains, and remained into one verb: remain.

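A sketch of such a function (again a guess at the original, not a copy): climb via .head until we hit a verb, and record its lemma. A sentence's root is its own head, which is what stops the loop.

def firstVerbs(doc, name):
    verbs = []
    for word in doc:
        if word.text != name:
            continue
        head = word.head
        while head.pos_ != 'VERB' and head is not head.head:
            head = head.head
        if head.pos_ == 'VERB':
            verbs.append(head.lemma_)
    return verbs

from collections import Counter
darcyVerbs = Counter(firstVerbs(pride, 'Darcy'))
elizabethVerbs = Counter(firstVerbs(pride, 'Elizabeth'))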

We can now merge these counts into a single table, and then we can visualize it with Pandas.

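Something along these lines:

import pandas as pd
verbsDf = pd.DataFrame({'Darcy': darcyVerbs,
                        'Elizabeth': elizabethVerbs}).fillna(0)
verbsDf.sort_values('Elizabeth', ascending=False).head(10).plot(kind='bar')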

Probabilities


SpaCy has a list of probabilities for English words, and these probabilities are automatically associated with each word once we parse the document. Let's see what the distribution is like:

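A quick histogram sketch (word.prob is a log probability):

import matplotlib.pyplot as plt
plt.hist([word.prob for word in pride], bins=50)
plt.xlabel('log probability')
plt.show()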

Let's peek at some of the improbable words for Monty Python and the Holy Grail.

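For instance, sorting the tokens by probability:

[w.text for w in sorted(grail, key=lambda w: w.prob)[:20]]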

Now we can do some rudimentary information extraction by counting the improbable words:

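One way to sketch it (the probability threshold here is arbitrary):

from collections import Counter
improbable = Counter(w.text for w in grail if w.prob < -19)
improbable.most_common(10)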

What are those words for Pride and Prejudice?

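The same one-liner, pointed at the novel (Counter imported as above):

Counter(w.text for w in pride if w.prob < -19).most_common(10)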

We can do this with ngrams, too, with some fancy Python magic:
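The "magic" is presumably zipping the document against itself at an offset to form bigrams; a sketch of that trick:

from collections import Counter
improbableBigrams = Counter(
    (a.text, b.text) for a, b in zip(pride, pride[1:])
    if a.prob < -15 and b.prob < -15)   # another arbitrary threshold
improbableBigrams.most_common(10)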

Word Embeddings (Word Vectors)


Word embeddings (word vectors) are numeric representations of words, usually generated via dimensionality reduction on a word cooccurrence matrix for a large corpus. The vectors SpaCy uses are the GloVe vectors, Stanford's Global Vectors for Word Representation. These vectors can be used to calculate semantic similarity between words and documents.

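For instance, token-to-token similarity (cosine similarity of the underlying GloVe vectors):

tokens = nlp('dog cat banana')
dog, cat, banana = tokens
print(dog.similarity(cat))
print(dog.similarity(banana))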

Let's look at vectors for Pride and Prejudice. First, let's get the first 150 nouns:
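For example:

nouns = [w for w in pride if w.pos_ == 'NOUN'][:150]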

Now let's get vectors and labels for each of them:

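Something like:

import numpy as np
vectors = np.array([w.vector for w in nouns])
labels = [w.text for w in nouns]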

A single vector is 300-dimensional, so in order to plot it in 2D, it might help to reduce the dimensionality to the most meaningful dimensions. We can use Scikit-Learn to perform truncated singular value decomposition for latent semantic analysis (LSA).
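A sketch with Scikit-Learn's TruncatedSVD:

from sklearn.decomposition import TruncatedSVD
reduced = TruncatedSVD(n_components=2).fit_transform(vectors)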

Plot the results in a scatter plot:

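For instance:

import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
plt.scatter(reduced[:, 0], reduced[:, 1], alpha=0.3)
for (x, y), label in zip(reduced, labels):
    plt.annotate(label, (x, y))
plt.show()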

Document Vectorization


This section uses a non-semantic, bag-of-words technique for vectorizing documents. We won't need any of the fancy features of SpaCy for this, just scikit-learn. We'll use a subset of the Inaugural Address Corpus that contains 20th- and 21st-century inaugural addresses.

First, we'll vectorize the corpus using scikit-learn's TfidfVectorizer class. This creates a matrix of word frequencies. (It doesn't actually use TF-IDF, since we're turning that off in the options below.)

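A sketch, assuming the addresses have been read into a list of strings called addresses (a hypothetical name; the exact vectorizer options in the original may differ):

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(use_idf=False)   # plain term frequencies, no IDF weighting
dtm = vectorizer.fit_transform(addresses)     # documents x terms matrix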

Average Sentence Lengths


Let's load the Inaugural Address documents into SpaCy to analyze things like average sentence length. SpaCy makes this really easy.

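A sketch, assuming the addresses have been parsed into a dictionary called inauguralDocs mapping a label (say, '1961-Kennedy') to a SpaCy doc; both names are assumptions:

import numpy as np
avgLengths = {label: np.mean([len(sent) for sent in doc.sents])
              for label, doc in inauguralDocs.items()}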

Term Frequency Distributions


This sort of thing you've probably already seen in the NLTK book, but it's made even easier in SpaCy. We're simply going to count the occurrences of words and divide by the total number of words in the document.

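One way to sketch it, building a words-by-documents frequency table (reusing the hypothetical inauguralDocs from above):

from collections import Counter
import pandas as pd

def termFrequencies(doc):
    counts = Counter(w.text.lower() for w in doc if w.is_alpha)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

freqDf = pd.DataFrame({label: termFrequencies(doc)
                       for label, doc in inauguralDocs.items()}).fillna(0)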

We can easily slice this data frame with words we're interested in, and plot those words across the corpus. For example, let's look at the proportions of the words "America" and "world":

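For example (lowercased, to match the counting above):

freqDf.loc[['america', 'world']].T.plot()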

We can even compute, say, the ratio of uses of the word "America" to uses of the word "world."

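Something like:

(freqDf.loc['america'] / freqDf.loc['world']).plot(kind='bar')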

Document Similarity Matrix


Using the vector-based .similarity() method from earlier, we can very easily compute pairwise similarities between all the documents in our corpus.

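A sketch, again reusing the hypothetical inauguralDocs dictionary:

import pandas as pd
labels = list(inauguralDocs)
simDf = pd.DataFrame(
    [[inauguralDocs[a].similarity(inauguralDocs[b]) for b in labels]
     for a in labels],
    index=labels, columns=labels)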

Exercises

  • Extract all the events from Pride and Prejudice.
  • Make a lexical dispersion plot of the word "ni" in Monty Python and the Holy Grail. What does this tell us?
  • Find the shortest sentence in any inaugural address from our corpus.
  • Find the president that used the lowest proportion of adjectives (or nouns, or verbs) in his inaugural address.
  • Find which of Charles Dickens's novels (or those of any other author) are the most semantically similar.

Learn More


See Also
