Explore Zika research through text data mining


What is this, where am I?


This Jupyter notebook is the second part of the Zika virus tutorial and focuses on the analysis. We will explore the facts extracted in part one and apply text and data mining techniques to them to get a better understanding of the Zika virus.

You can find the first part, where the data is downloaded and the facts are extracted, here. Run it first, because it produces the data needed for this part of the tutorial.

You can find all the necessary information in our central FutureTDM GitHub repository.

A Jupyter notebook consists of cells like this one, which contain explanatory text, images, or executable code. You can step through the cells by selecting each one and pressing Ctrl + Enter or by clicking the Play button in the menu bar. You don't need to know how to program in order to use this notebook. If you already know some Python, feel free to modify and experiment with the code and data. If not, just have a look and engage where you want.

This work is part of Future TDM - The Future of Text and Data Mining, an EU Horizon2020 research project with participation of Open Knowledge International and ContentMine.

What are we going to do?


We will explore the scientific publications downloaded as explained in the Zika virus tutorial.

Several methods are applied. Some of them are descriptive and directly show the result, while others are exploratory: a domain expert has to explore the data and its presentation and draw the conclusions themselves. The following analyses are done:

  • plot a timeline of the publication years
  • get the most mentioned words and species over the full corpus
  • find relations between terms (species, words, authors, journals, publications) through network analysis methods such as community detection, co-occurrences and network projection
  • find all publications in which a term was mentioned
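
As a rough illustration of the co-occurrence and network-projection idea in the list above, the following minimal sketch counts how often two terms are mentioned in the same publication. The paper IDs and term sets are made-up sample data, and `cooccurrences` is a hypothetical helper, not a function from this notebook.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: paper ID -> set of terms mentioned in that paper.
mentions = {
    "paper_a": {"Aedes aegypti", "Zika virus", "Dengue virus"},
    "paper_b": {"Aedes aegypti", "Zika virus"},
    "paper_c": {"Zika virus", "Dengue virus"},
}


def cooccurrences(mentions):
    """Project the paper-term network onto terms: count shared papers per term pair."""
    pairs = Counter()
    for terms in mentions.values():
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(terms), 2):
            pairs[(a, b)] += 1
    return pairs


pair_counts = cooccurrences(mentions)
```

Each key of `pair_counts` is a pair of terms and each value is the number of publications that mention both, which can serve as an edge weight in a term-term network.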

Set up your environment


Import the prepared Python functions into the notebook. If you want to know more about pyCProject, the ContentMine Python wrapper for the CProject, have a look at the GitHub repository.

Define all functions used in the notebook


Reading in the datasets


In the next cell we read the prepared data into the notebook and assign it to the zika variable. The data must be located in the zika/ folder (we are already in this directory). This step collects all metadata and facts from every publication downloaded in the first part and stores them in a CProject object. Each CProject contains many CTrees, one per publication, which are the building blocks of our dataset. Each CTree, in turn, contains the extracted facts and metadata of one scientific publication.
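
To make the nesting concrete, here is a minimal sketch of the structure described above. `SketchCProject` and `SketchCTree` are hypothetical stand-in classes written for illustration only; they are not the pyCProject API, and the IDs and values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SketchCTree:
    """Stand-in for one publication: its metadata and extracted facts."""
    ctree_id: str
    metadata: dict = field(default_factory=dict)
    facts: dict = field(default_factory=dict)  # fact type -> list of terms


@dataclass
class SketchCProject:
    """Stand-in for the whole corpus: a collection of publications."""
    name: str
    ctrees: list = field(default_factory=list)

    def size(self):
        """Number of publications in the corpus."""
        return len(self.ctrees)


# Made-up example corpus with two publications.
zika_sketch = SketchCProject("zika", [
    SketchCTree("id_0001", {"year": 2016}, {"species": ["Aedes aegypti"]}),
    SketchCTree("id_0002", {"year": 2015}, {"species": ["Zika virus"]}),
])
```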

Explore the metadata


This is the basic information about the publications, and a good starting point for putting the corpus in context.

See the publication years


The timeline of the publication years gives an overview of how old the Zika research field is, how many publications exist in total, and how it compares with the other corpora. First we get a list with the number of publications for each year, which is then plotted as a bar chart. The chart is then saved as an SVG file.
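
The two steps, counting per year and plotting, could look roughly like the sketch below. The list of years is made-up sample data, the helper names and the output path are assumptions, and matplotlib is assumed to be installed; the notebook's own plotting code may differ.

```python
from collections import Counter

# Hypothetical publication years, one entry per paper in a corpus.
years = [2016, 2016, 2015, 2016, 2009, 2015]


def publications_per_year(years):
    """Return (year, count) pairs sorted by year."""
    return sorted(Counter(years).items())


def plot_timeline(per_year, path="timeline.svg"):
    """Draw the counts as a bar chart and save it as an SVG file."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    xs = [year for year, _ in per_year]
    ys = [count for _, count in per_year]
    fig, ax = plt.subplots()
    ax.bar(xs, ys)
    ax.set_xlabel("Publication year")
    ax.set_ylabel("Number of publications")
    fig.savefig(path, format="svg")
    plt.close(fig)


per_year = publications_per_year(years)
```

Calling `plot_timeline(per_year)` then writes the chart to `timeline.svg` in the current directory.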


The timelines show us that the research field around Zika is very new and seems to be mostly influenced by last year's outbreak. Publications on Aedes aegypti and the Usutu virus have both increased rapidly since 2000.

Explore the authors


Another interesting thing to know is who the most active researchers in the field are. For this, we compute the most common authors for each corpus. First we get the complete list of authors ordered by the number of associated works, which we then cut down to the top n authors. You can easily adapt how many authors are shown by changing the value of num_authors.
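
A minimal sketch of this ranking, assuming the authors are available as one list of names per publication. The names are placeholders and `top_authors` is a hypothetical helper; only `num_authors` is taken from the notebook.

```python
from collections import Counter

# Hypothetical author lists, one list per publication.
author_lists = [
    ["A. Example", "B. Sample"],
    ["A. Example", "C. Placeholder"],
    ["A. Example", "B. Sample"],
]

num_authors = 2  # how many top authors to show


def top_authors(author_lists, n):
    """Count each author once per publication and return the n most frequent."""
    counts = Counter()
    for authors in author_lists:
        # set() guards against a name listed twice on the same paper.
        counts.update(set(authors))
    return counts.most_common(n)


most_active = top_authors(author_lists, num_authors)
```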

We now want to look at how many authors can be found in the Zika and in the Usutu virus datasets. Be aware: the result only describes the dataset we use. Because our dataset does not contain every publication by every researcher, the outcome has limited validity.

Discussion of outcomes:

We could now take the authors we found and try to find out more about them, or look for their other publications. We could also analyse the author lists further and look for matches between the three corpora. In any case, this is a good starting point for connecting some people, researchers and faces to Zika research.
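
Finding matches between the corpora, as suggested above, comes down to set intersection. The sketch below assumes one set of author names per corpus; the names are placeholders, the corpus keys are illustrative labels, and `shared_authors` is a hypothetical helper.

```python
# Hypothetical author sets for the three corpora; the names are placeholders.
authors_by_corpus = {
    "zika": {"A. Example", "B. Sample", "C. Placeholder"},
    "usutu": {"B. Sample", "D. Dummy"},
    "aedes": {"A. Example", "B. Sample"},
}


def shared_authors(authors_by_corpus):
    """Return the authors that appear in every corpus."""
    sets = list(authors_by_corpus.values())
    return set.intersection(*sets) if sets else set()


in_all = shared_authors(authors_by_corpus)
# Pairwise overlap between two corpora uses the & operator.
in_zika_and_aedes = authors_by_corpus["zika"] & authors_by_corpus["aedes"]
```

Note that exact string matching will miss the same person written in different ways, so real author names usually need some normalisation first.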

Explore the journals


The final piece of metadata we will look at is the journal in which each publication appeared. This is a good hint as to where we can find more related publications. Again, we get the journals ordered by number of mentions and print out the top n.

Discussion of outcomes:

The journal lists of the corpora largely overlap; the main difference is the rank of each journal.

Explore the facts: species, genus, words


Here we use the facts extracted from the text of the publications themselves. This is specific to ContentMine and differs from the more ordinary use of metadata above. It is also more exploratory, so it may or may not give us new insights. Here we have to be especially careful about false interpretations and about having "wrong data". Stay critical.

Explore the words


To get a general understanding of the corpus as a whole, we first want to know the most common words. This gives us more of a feeling than an understanding of what the downloaded corpus is about. It may or may not lead to new insights, but let's give it a try. To change the number of terms listed, just change the value of num_words.
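
A minimal sketch of such a word count, assuming plain-text input. The two text snippets, the tiny stopword list and the helper `top_words` are illustrative assumptions; only `num_words` is taken from the notebook.

```python
import re
from collections import Counter

# Hypothetical snippets standing in for the full texts of the corpus.
texts = [
    "Zika virus infection in pregnancy",
    "Transmission of Zika virus by Aedes mosquitoes",
]

STOPWORDS = {"in", "of", "by", "the", "and"}  # tiny illustrative list
num_words = 2  # how many top words to list


def top_words(texts, n):
    """Lowercase, split into alphabetic tokens, drop stopwords, count."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


common_words = top_words(texts, num_words)
```

The resulting list of (word, count) pairs is also the input a frequency plot would need.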

Plot the frequencies as a histogram.

Let's dive into the data!


First we create a network between papers and entities such as genes or genus. In this network a node is either a unique identifier of a paper, or the name of the entity. An edge or link between nodes is created when a paper mentions an entity. The entities have been identified through the ami-plugins.
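
The construction described above can be sketched with plain dictionaries, as shown below. This is a simplified stand-in rather than the notebook's own implementation; the paper IDs are made up and `build_network` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical facts: paper ID -> entities found in that paper.
paper_entities = {
    "paper_a": ["Aedes aegypti", "Flavivirus"],
    "paper_b": ["Aedes aegypti"],
}


def build_network(paper_entities):
    """Build an undirected paper-entity network as node -> set of neighbours."""
    graph = defaultdict(set)
    for paper, entities in paper_entities.items():
        for entity in entities:
            # One edge per "paper mentions entity" relation, stored both ways.
            graph[paper].add(entity)
            graph[entity].add(paper)
    return dict(graph)


network = build_network(paper_entities)
```

Because edges only ever connect a paper to an entity, the result is a bipartite network with two kinds of nodes.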

Choose the type of entity


Check the most mentioned species


Here we list the most mentioned species in each corpus. This should give us a good overview and, as you can see, some information about vectors and viruses such as Aedes and Flavivirus.

Get all publications where a specific entity is mentioned


Now we want to find out in which publications an entity is mentioned. Pick a name from the list above!

Choose one specific entity for your further analysis.

Change the string to your chosen species. Tip: a species with more than 5 neighbors should give you some interesting results.
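
In a paper-entity network, the publications that mention an entity are simply its neighbors. The following sketch assumes the network is stored as a dictionary of neighbor sets; the data is made up and `papers_mentioning` is a hypothetical helper, not the function used in the notebook.

```python
# Hypothetical paper-entity network: node -> set of neighbouring nodes.
network = {
    "paper_a": {"Aedes aegypti", "Flavivirus"},
    "paper_b": {"Aedes aegypti"},
    "Aedes aegypti": {"paper_a", "paper_b"},
    "Flavivirus": {"paper_a"},
}


def papers_mentioning(network, entity):
    """Return the sorted paper IDs linked to the entity, or [] if it is unknown."""
    return sorted(network.get(entity, set()))


chosen_entity = "Aedes aegypti"  # change this string to your chosen species
papers = papers_mentioning(network, chosen_entity)
```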


Find local communities of species


We identify the three biggest communities of entities and plot them separately. A community subgraph is a collection of, for example, persons that are connected with each other, but not with the rest of the network. The cell also prints the location/organization/person with the most connections in each community.
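
Taken literally, the description above (connected with each other, but not with the rest of the network) matches the connected components of a graph. The sketch below finds them with a breadth-first search. It is a simplification under that assumption: dedicated community detection algorithms work differently, and the notebook's actual method is not shown here. The graph and the helper names are made up.

```python
from collections import deque

# Hypothetical entity network: node -> set of neighbours.
graph = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}


def connected_components(graph):
    """Return the components as sets of nodes, largest first."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)


def most_connected(graph, component):
    """Return the node with the most neighbours (ties: alphabetically first)."""
    return max(sorted(component), key=lambda node: len(graph[node]))


biggest = connected_components(graph)[:3]
hubs = [most_connected(graph, c) for c in biggest]
```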


Your network


The next cell creates a high-resolution visualization of the network around your chosen organization/person/location. It shows related facts, the papers they are mentioned in, and the authors who wrote those papers. The network is also saved to disk for further use, such as wallpapers, blogging or documentation. You can find it in the MozFest2015 folder.

Don't forget to choose a color!

Put a fact into context


In the next cell you can enter an institution/location/person of interest to you and see which papers mention them and what the titles of those papers are. Depending on how busy it was, the list can get a bit long! You can look into the trialsjournal folder, open the subfolder with the ID, and compare it with the acknowledgements section of the paper.
