Stack Exchange


Stack Exchange hosts sites on a multitude of fields and subjects, including mathematics, physics, philosophy, and data science.

Stack Exchange employs a reputation award system for its questions and answers. Each post, whether a question or an answer, is subject to upvotes and downvotes. This ensures that good posts are easily identifiable.

In this project, we are going to explore Data Science Stack Exchange (DSSE) and find a way to increase its popularity.

The Stack Exchange Data Explorer lets you query the data collected within DSSE using T-SQL. We are going to gather the posts that were created in 2019.

  SELECT Id, PostTypeId, CreationDate, Score, ViewCount, Tags, AnswerCount, FavoriteCount
  FROM posts
  WHERE YEAR(CreationDate) = 2019;

The result is stored in the '2019_questions.csv' file.

Exploring Data

  • Id: An identification number for the post.
  • PostTypeId: An identification number for the type of post.
  • CreationDate: The date and time of creation of the post.
  • Score: The post's score.
  • ViewCount: How many times the post was viewed.
  • Tags: What tags were used.
  • AnswerCount: How many answers the question got (only applicable to question posts).
  • FavoriteCount: How many times the question was favorited (only applicable to question posts).

From the result above, we can see that there are missing values in the FavoriteCount field.

We can see that 7,432 records (84.08%) are missing FavoriteCount values. The column is only applicable to question posts. If a record has no value, it's probably reasonable to think it's not a question post. Let's see what the unique values are. It looks safe to fill the missing values with 0s.
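
The check and the fill described above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's own code: the sample rows are made up, only three of the columns are kept, and the helper names (load_rows, missing_share, fill_missing) are hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for '2019_questions.csv'; the values are made up.
SAMPLE_CSV = """Id,Score,FavoriteCount
1,1,
2,2,1.0
3,0,
"""

def load_rows(handle):
    """Read the CSV into a list of dicts, one per post."""
    return list(csv.DictReader(handle))

def missing_share(rows, column):
    """Return (count, percentage) of rows whose value in `column` is empty."""
    missing = sum(1 for row in rows if row[column] == "")
    return missing, 100 * missing / len(rows)

def fill_missing(rows, column, default="0"):
    """Replace empty values in `column` with `default`, in place."""
    for row in rows:
        if row[column] == "":
            row[column] = default

rows = load_rows(io.StringIO(SAMPLE_CSV))
count, pct = missing_share(rows, "FavoriteCount")
print(f"{count} of {len(rows)} rows ({pct:.2f}%) are missing FavoriteCount")

# Fill the gaps with 0, as the text suggests.
fill_missing(rows, "FavoriteCount")
```

On the real file, the io.StringIO object would be replaced by an open file handle for '2019_questions.csv'.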

The data types of the other columns look adequate; only CreationDate and FavoriteCount need changing. We can change the type of CreationDate to datetime and FavoriteCount to int64, since a count doesn't need decimal places. The other issue is the Tags field, which stores all of a post's tags in a single string.

That could be problematic if we want to track the tags of posts. We are going to change it to a comma-separated string.

Then we can split on the comma to obtain a list.
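
The cleaning steps described above could look roughly like this in plain Python. It is a sketch under assumptions: the raw Tags value is assumed to wrap each tag in angle brackets (for example `<python><keras>`), the timestamp format is assumed, and all function names are hypothetical.

```python
from datetime import datetime

def parse_creation_date(text):
    """Parse a timestamp such as '2019-01-23 09:21:13' (the format is an assumption)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

def favorite_count_to_int(text):
    """Convert a count stored as '1.0' (or '' when missing) to an int."""
    return int(float(text)) if text else 0

def clean_tags(raw):
    """Turn '<python><keras>' into 'python,keras' (the raw format is an assumption)."""
    return raw.strip("<>").replace("><", ",")

def split_tags(raw):
    """Return the tags of one post as a list of strings."""
    cleaned = clean_tags(raw)
    return cleaned.split(",") if cleaned else []

print(clean_tags("<python><keras>"))
print(split_tags("<machine-learning><regression>"))
print(favorite_count_to_int("1.0"))
```

Keeping the tags as a list per post makes it straightforward to count them or to test membership later on.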

Most Used and Most Viewed

From the result above, the most used and most viewed tags are: python, machine-learning, deep-learning, neural-network, keras, tensorflow, classification, scikit-learn.
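
One way to arrive at such rankings is to count how often each tag appears and to sum the view counts of the questions carrying it. The sketch below assumes each question has already been reduced to a view count and a list of tags; the sample data is made up and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical questions as (view count, list of tags); the values are made up.
questions = [
    (120, ["python", "keras"]),
    (300, ["machine-learning", "python"]),
    (45, ["deep-learning", "keras"]),
    (80, ["python"]),
]

def tag_usage(questions):
    """Count how many questions use each tag."""
    used = Counter()
    for _, tags in questions:
        used.update(tags)
    return used

def tag_views(questions):
    """Sum the views of every question a tag appears on."""
    viewed = Counter()
    for views, tags in questions:
        for tag in tags:
            viewed[tag] += views
    return viewed

most_used = tag_usage(questions).most_common(2)
most_viewed = tag_views(questions).most_common(3)
print(most_used)
print(most_viewed)
```

Tags that rank highly in both lists are the ones that are both frequently asked about and frequently read.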

Those tags are related to machine learning, and more specifically to deep learning. It seems that this is what people were most interested in during 2019. Let's see if that's true for other time frames.

Just a Fad?


The following file has the Id, CreationDate, and Tags fields from all time. We will use it to see whether the interest in deep learning was limited to 2019.

We are going to classify a question as a deep-learning question when it has at least one of the following tags: machine-learning, deep-learning, neural-network, keras, tensorflow.
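
The classification rule above, and the share of deep-learning questions over time, can be sketched like this. Grouping by calendar year is an assumption made for illustration (the notebook may have used another period), the sample questions are made up, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Tag set taken from the classification rule in the text.
DEEP_LEARNING_TAGS = {
    "machine-learning", "deep-learning", "neural-network", "keras", "tensorflow",
}

def is_deep_learning(tags):
    """True when the question carries at least one of the deep-learning tags."""
    return not DEEP_LEARNING_TAGS.isdisjoint(tags)

def yearly_share(questions):
    """Map each year to the fraction of its questions classified as deep learning."""
    totals = defaultdict(int)
    deep = defaultdict(int)
    for created, tags in questions:
        totals[created.year] += 1
        if is_deep_learning(tags):
            deep[created.year] += 1
    return {year: deep[year] / totals[year] for year in sorted(totals)}

# Hypothetical questions as (creation date, list of tags); the values are made up.
questions = [
    (datetime(2018, 3, 1), ["python", "pandas"]),
    (datetime(2018, 7, 9), ["keras"]),
    (datetime(2019, 2, 14), ["tensorflow", "python"]),
    (datetime(2019, 5, 2), ["neural-network"]),
    (datetime(2019, 11, 30), ["statistics"]),
    (datetime(2019, 12, 1), ["deep-learning", "keras"]),
]

shares = yearly_share(questions)
print(shares)
```

A share that keeps rising from one period to the next would support the claim that the interest is not a one-off.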




From the exploration above, we found that deep learning is currently the area of greatest interest on DSSE. After examining six years' worth of data (from 2014 to early 2020), we can conclude that it isn't just a fad. The share of questions on this subject has been growing steadily for the past half decade. By focusing on those subject areas, we