Understanding the Problem


Have you ever thought about how hard it is to present your work in a way that everybody understands, including non-technical people? It is even harder for a group, such as a research group or an enterprise team. To overcome this, many groups are now investing in online platforms (such as GitHub, Kaggle, Kyso) to create private or public online communities for knowledge sharing. Once they start using these platforms, customers want to measure how it is going.

The activities of a group (a team of users) can be measured with metrics such as content creation, content performance, and engagement, expressed as the number of posts, comments, and views.

The providers of such platforms have the data, so they can measure the same metrics for all teams and understand their behavior. Here, I propose to analyze several groups of different sizes to discover how size impacts group activities.

Table of Contents


1. Connecting to MongoDB

3. Extracting Data

4. Preprocessing Data

5. Measuring Activity Level

6. Conclusion


Data sources: MongoDB + Mixpanel + Google Analytics

Reproduction: If you want to run this notebook, the install & setup instructions are in the Readme.md


1. Connecting to MongoDB


The data comes from a NoSQL MongoDB database, so the first step is to configure the connection to MongoDB. To keep my credentials out of the notebook, I saved my username, password, and server in a separate file called secret.yml, which has the format:

username: "your_username"
password: "your_password"
server: "@your_server"

Note: If you are reproducing this notebook, please remember to add the 'secret.yml' file to your .gitignore!

Let's use pymongo to connect to the MongoDB instance. The connection string (URI) usually has the format mongodb://username:password@host:port/database.
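A minimal sketch of building that string from secret.yml, assuming PyYAML for parsing and the mongodb+srv scheme. The helper names build_uri and connect are hypothetical (they are not part of pymongo), and the server value is assumed to start with '@' as in the file layout above:

```python
def build_uri(secrets, scheme="mongodb+srv"):
    """Assemble a MongoDB URI from the fields stored in secret.yml.

    The server value already starts with '@' in the layout shown above,
    so it is appended directly after the password.
    """
    return f"{scheme}://{secrets['username']}:{secrets['password']}{secrets['server']}"


def connect(path="secret.yml"):
    """Read the credentials file and return a MongoClient for that server."""
    # Imported here so build_uri stays usable without the drivers installed.
    import yaml  # PyYAML (an assumption; any YAML parser would do)
    from pymongo import MongoClient

    with open(path) as fh:
        secrets = yaml.safe_load(fh)
    return MongoClient(build_uri(secrets))
```

Calling client = connect() would then give the client object used in the rest of the notebook.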


If the username or password contains special characters (for example, an email address with '@' as the username), percent-encode them with urllib.parse.quote_plus() before building the URI.
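As a small illustration, with made-up credentials and the placeholder server name from secret.yml, the encoding step could look like this:

```python
from urllib.parse import quote_plus

# Hypothetical credentials containing characters that are reserved in a URI.
raw_username = "jane@example.com"
raw_password = "p@ss/word:1"

username = quote_plus(raw_username)  # 'jane%40example.com'
password = quote_plus(raw_password)  # 'p%40ss%2Fword%3A1'

# The encoded values can now be placed safely in the connection string.
uri = f"mongodb+srv://{username}:{password}@your_server"
```

Without the encoding, the extra '@', '/' and ':' would be read as URI delimiters and the connection string would be parsed incorrectly.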

Note: If your connection begins with "mongodb+srv:" you need to make sure to install dnspython with: python -m pip install dnspython

3. Extracting Data


The datasource has 3 MongoDB collections:

  • users (and teams)
  • posts
  • comments

To extract data from MongoDB to Pandas, I first select a database:

db = client.user_activity

Then, extract each collection to a DataFrame, collection by collection. Example:

users = pd.DataFrame(list(db.users.find()))

I've repeated this for each one of the 3 collections.

Remember that it is good practice to close the connection to MongoDB after extracting the data.
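Putting the extraction and the clean-up together, a minimal sketch could look like the following. The function name extract_collections is hypothetical. It assumes client is a pymongo MongoClient and that the database is called user_activity, as above:

```python
import pandas as pd

COLLECTIONS = ("users", "posts", "comments")


def extract_collections(client, db_name="user_activity", names=COLLECTIONS):
    """Return {collection name: DataFrame} and close the client afterwards."""
    try:
        db = client[db_name]
        # find() with no filter returns every document in the collection.
        return {name: pd.DataFrame(list(db[name].find())) for name in names}
    finally:
        # Runs even if a query fails, so the connection is never left open.
        client.close()
```

The try/finally block is a design choice: the connection is closed whether or not the extraction succeeds.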



I've saved the extracted data to a cache file so I don't need to download it every time I run the notebook. Some APIs also have historical limits, so it's best to save/update the data every time it's pulled in.
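One simple way to do this kind of caching is sketched below. It uses pickle from the standard library; the helper name load_or_fetch and the cache file name are my assumptions, not necessarily what the original notebook used:

```python
import os
import pickle


def load_or_fetch(cache_path, fetch):
    """Return cached data if the file exists; otherwise call fetch() and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    data = fetch()
    with open(cache_path, "wb") as fh:
        pickle.dump(data, fh)
    return data
```

For example, users = load_or_fetch("users.pkl", lambda: pd.DataFrame(list(db.users.find()))) only queries MongoDB on the first run. Delete the cache file to force a refresh, and only unpickle files you created yourself.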

4. Preprocessing Data


First, let's prepare the data. I combined the users, posts, and comments datasets into a new dataset called teamactivity, grouped by team, in order to analyze the activity metrics per team.
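The exact schema is not shown here, so the following is only a sketch of how such a per-team table could be built with pandas. All column names (user_id, team, post_id, views, comment_id) and the function name build_team_activity are hypothetical, and each user is assumed to belong to exactly one team:

```python
import pandas as pd


def build_team_activity(users, posts, comments):
    """Aggregate users, posts and comments into one row per team.

    Assumed (hypothetical) columns:
      users:    user_id, team
      posts:    post_id, user_id, views
      comments: comment_id, user_id
    """
    size = users.groupby("team")["user_id"].nunique().rename("team_size")

    # Attach each post to its author's team, then total posts and views per team.
    posts_by_team = posts.merge(users[["user_id", "team"]], on="user_id")
    post_stats = posts_by_team.groupby("team").agg(
        posts=("post_id", "count"), views=("views", "sum")
    )

    # Same for comments: total comments and distinct commenting users per team.
    comments_by_team = comments.merge(users[["user_id", "team"]], on="user_id")
    comment_stats = comments_by_team.groupby("team").agg(
        comments=("comment_id", "count"), commenters=("user_id", "nunique")
    )

    # Teams with no posts or comments get zeros instead of NaN.
    return (
        pd.concat([size, post_stats, comment_stats], axis=1)
        .fillna(0)
        .astype(int)
        .rename_axis("team")
        .reset_index()
    )
```

The result has one row per team with the columns team, team_size, posts, views, comments, and commenters.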


There are 151 teams, with sizes ranging from 1 to 505 users. Most teams fall in the range of 1-19 users.


Now, let's find out whether there is a relationship between team size and team activity.

5. Measuring Activity Level


Let's look at how team size impacts the amount of knowledge sharing and engagement, by measuring:

  • Content Creation: the number of posts per team size
  • Views: the number of post views per team size
  • Content Performance: the number of views per post
  • Team Engagement: the number of users who comment, per team and per team size
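
The four metrics above can be sketched in pandas as follows. This assumes a teamactivity table with the (hypothetical) columns team_size, posts, views, and commenters; the function name activity_by_team_size is also an assumption:

```python
import pandas as pd


def activity_by_team_size(teamactivity):
    """Summarise the four metrics for each distinct team size.

    Expects (hypothetical) columns: team_size, posts, views, commenters.
    """
    by_size = teamactivity.groupby("team_size")[["posts", "views", "commenters"]].sum()
    # Content performance: views per post. Sizes with no posts yield NaN, not an error.
    by_size["views_per_post"] = by_size["views"] / by_size["posts"].where(by_size["posts"] > 0)
    return by_size.reset_index()
```

Each row of the result corresponds to one team size, which is the x-axis used for the comparisons in this section.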

As team size increases, the number of posts per team increases exponentially.


Here, as team size increases, the overall number of post views per team also increases. Users can view posts inside their own groups as well as any public post on the platform, which is open to any reader. Views are not unique: if a user views the same post 100 times, it counts as 100 views (not as 1).
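To make the difference between total and unique views concrete, here is a toy example with a made-up view log (the user and post names are purely illustrative):

```python
# Hypothetical view log: one (user, post) pair per page view.
view_events = [
    ("alice", "post-1"),
    ("alice", "post-1"),
    ("alice", "post-1"),
    ("bob", "post-1"),
    ("bob", "post-2"),
]

total_views = len(view_events)        # every view counts, repeats included
unique_views = len(set(view_events))  # each (user, post) pair counted once

print(total_views, unique_views)  # 5 3
```

The numbers reported in this analysis correspond to total_views, not unique_views.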


As the number of posts increases, the overall number of views also increases. These views can represent either reach or impressions. Reach tells how many people are seeing your content; an impression means the content was displayed but may not have generated engagement or a comment. The peak of 4202 posts registered almost 560 thousand views in total, which represents an average of about 133 views per post (560,000 / 4,202).


Some teams comment a lot more than others, which shows up as peaks within each range of team size. For example, the teams in the middle with 387 and 394 users comment more than the average for their size range. As team size increases, team engagement also increases exponentially. Summing all the comments made, the team with 500+ users had an average of 60 comments per person over all periods analyzed together.

6. Conclusion

  • The activity level of groups of users, measured by the total number of posts, comments, and post views, increases substantially as the team size or the number of posts increases.
  • If you plan to increase activity levels in your community, two strategies are valid:
      • Increase the number of users, or
      • Encourage the publication of good new posts; the number of comments and views will follow the trend.