As my first analysis, I really want to see if there are any patterns or trends in the article titles. In particular, whether popular topics have evolved over the years and just how clickbait-y the article titles have become.
First, let's import the concatenated data set from part 2:
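As a minimal sketch of the import step: the filename and column names (title, url) are assumptions for illustration, and a small in-memory CSV stands in here for the real concatenated file from part 2.

```python
import io
import pandas as pd

# Stand-in for the concatenated CSV from part 2; the column names
# ('title', 'url') are assumptions for illustration.
csv_text = io.StringIO(
    "title,url\n"
    "Alastair Majury,https://medium.com/@a/alastair-majury-why-data-abc123def456\n"
)
df = pd.read_csv(csv_text)
print(df.shape)
```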
Looking at the data, I quickly see that there are issues with the scraped article titles. With Scrapy, I could use either CSS or XPath selectors to grab the desired elements from the HTML, which requires the formatting of each story card and article page to be quite consistent. Surprisingly for a blogging platform, the formatting of the elements was quite diverse, particularly that of the article title. This led me to list all possible CSS selectors (XPath selectors didn't do much better, so I omit those here) in an attempt to grab everything, but I was still getting some partial and blank titles.
As an example, you can see that I was getting only the first part ("Alastair Majury") of several articles written by an author of the same name, even though there is more to the titles when I look at the article pages directly.
I'm sure there are ways that I could have handled this better (again, please let me know in the comments if you have suggestions!), but I noticed that the last part of each article URL contains pretty much the article title. So, I thought an easy way of getting clean-ish (emphasis on the -ish) article titles would be to parse the URLs.
So, I split each URL into parts by /, then further split the hyphen-separated last portion and remove the alphanumeric article ID at the very end. Finally, I put the cleaned strings into a new column.
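The slug-parsing step might look something like this. A sketch only: the helper name and exact cleanup are my own, not the post's actual code.

```python
def title_from_url(url: str) -> str:
    """Recover a rough article title from the trailing slug of a Medium URL."""
    slug = url.rstrip("/").split("/")[-1]  # last path segment
    words = slug.split("-")[:-1]           # drop the trailing alphanumeric ID
    return " ".join(words)

title_from_url("https://medium.com/@a/alastair-majury-why-data-science-ab12cd34ef56")
# → 'alastair majury why data science'
```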
Looking at the same truncated titles above, this approach seems to give me more info. Also, this seems to allow me to get some informative English words for article titles that are not in English.
It's not perfect, but let's give this a try for now.
Here, I will use the workflow presented in the excellent reference "Text Mining with R" for extracting word pairs (bigrams) from a corpus and examining their relationships.
I will first prepare the data in Python and use the R package ggraph to produce the visualization. I can't emphasize enough how much I love being able to use both Python and R in Jupyter notebooks.
First, I will use the Python natural language processing package spaCy to tokenize each (parsed) article title, filter out non-English words, singularize nouns, generate bigrams and finally tally the frequency of each bigram.
Unlike many other examples I have seen, I opted not to remove stop words, as otherwise I would lose bigrams like ('how', 'to') and ('need', 'to') that are so prevalent in Medium articles and are hallmarks of clickbait.
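The tallying step can be sketched as follows; plain whitespace tokenization stands in here for the actual spaCy pipeline (and for the non-English filtering and singularization), and stop words are kept as noted.

```python
from collections import Counter

def count_bigrams(titles):
    """Tally word pairs across titles; stop words are deliberately kept."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        counts.update(zip(words, words[1:]))  # adjacent word pairs
    return counts

count_bigrams(["how to learn data science", "why data science matters"])
```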
Also, as this step takes quite a while, I added a tqdm progress bar to keep me informed of the progress.
Due to singularization, "data" has been converted to "datum".
Unsurprisingly, "data science", "machine learning" and "how to" are the most frequently appearing bigrams in article titles. Looks like we are on the right track.
Finally, as this step takes so long, I will output the bigram counts to a CSV so I can bypass this step in the future.
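A sketch of that caching step; the function name, filename, and column names are assumptions for illustration.

```python
import pandas as pd

def save_bigram_counts(counts, path="bigram_counts.csv"):
    """Write a {(word1, word2): n} tally to CSV so the slow step can be skipped later."""
    df = pd.DataFrame(
        [(w1, w2, n) for (w1, w2), n in counts.items()],
        columns=["word1", "word2", "n"],
    )
    df.to_csv(path, index=False)
    return df
```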
Now, to visualize the relationship between the top 60 (otherwise the figure is too crowded) most frequently appearing bigrams in titles of Medium articles tagged with "Data Science".
Each word appears as a node, and the direction of the arrow connecting two words indicates the order in which they appear in a bigram. Finally, the darkness of the arrow connecting each pair of words is proportional to the frequency of that bigram. For example, we see much darker arrows connecting 'datum' (data) and 'science', or 'machine' and 'learning'.
Phew, that was a lot of writing.
I will be back tomorrow with some topic modeling! :)