Have you ever thought about how hard it is to create a showcase that everybody understands, including non-technical people? It is even harder for a group, such as a research group or an enterprise team. To overcome this, many groups are now investing in online platforms (such as GitHub, Kaggle and Kyso) to create private or public online communities for knowledge sharing. Once they start using these platforms, customers want to measure how it is going.
A group (a team of users) can have its activity described by metrics such as content creation, content performance and engagement, measured by the number of posts, comments and views.
The providers of such platforms have the data, so they can measure the same metrics for all teams and understand their behavior. Here, I propose to analyze several groups of different sizes to discover how size affects group activity.
Data sources: MongoDB + Mixpanel + Google Analytics
Reproduction: If you want to run this notebook, the install & setup instructions are in the Readme.md
The data comes from a NoSQL MongoDB database. So, the first step is to configure the connection to MongoDB. To secure my connection, I saved my username, password and server in another file, called secret.yml, which has the format:
username: "your_username"
password: "your_password"
server: "@your_server"
Note: If you are reproducing this notebook, please remember to add 'secret.yml' file to your .gitignore!
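As a minimal sketch of how the credentials file above could be read, assuming the PyYAML package is installed; the helper name load_secrets is hypothetical and not part of the original notebook:

```python
from pathlib import Path

import yaml  # provided by the PyYAML package (an assumption, not named in the post)


def load_secrets(path="secret.yml"):
    """Read the credentials file and return its contents as a dict."""
    with Path(path).open(encoding="utf-8") as handle:
        # safe_load returns None for an empty file, so fall back to an empty dict.
        secrets = yaml.safe_load(handle) or {}
    # Fail early if one of the three expected keys is missing.
    missing = {"username", "password", "server"} - set(secrets)
    if missing:
        raise KeyError(f"{path} is missing: {sorted(missing)}")
    return secrets
```

Calling load_secrets() from the notebook keeps the credentials out of the code cells, so only the ignored file contains them.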
Let's use pymongo to connect to the MongoDB instance. The Python connection string (URI) usually has the format mongodb://username:password@server.
If the username or password contains special characters, escape them with urllib. For example, if your username is an email address containing '@', or your password contains any special characters, I recommend percent-encoding both with urllib.parse.quote_plus().
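A small sketch of the escaping step, using only the standard library. The function name build_mongo_uri and the sample credentials are made up for illustration; the server value is assumed to start with "@", as in the secret.yml format shown earlier:

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, server, scheme="mongodb"):
    """Build a MongoDB URI with the username and password percent-encoded."""
    # `server` is expected to carry its leading "@", as in secret.yml.
    return f"{scheme}://{quote_plus(username)}:{quote_plus(password)}{server}"


uri = build_mongo_uri("me@example.com", "p@ss:word", "@your_server")
print(uri)  # mongodb://me%40example.com:p%40ss%3Aword@your_server
```

Pass scheme="mongodb+srv" for connection strings of that kind; the resulting string can then be handed to pymongo's MongoClient.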
Note: If your connection begins with "mongodb+srv:" you need to make sure to install dnspython with:
python -m pip install dnspython
The data source has 3 MongoDB collections: users, posts and comments.
To extract data from MongoDB into pandas, I first select a database:
db = client.user_activity
Then, extract each collection to a DataFrame, collection by collection. Example:
users = pd.DataFrame(list(db.users.find()))
I repeated this for each of the 3 collections.
Remember that it is good practice to close the connection to MongoDB after data extraction.
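The extraction and the closing step could be wrapped together roughly as follows. This is a sketch, not the notebook's actual code: the helper names are hypothetical, and the collection names are taken from the datasets mentioned in this post.

```python
import pandas as pd


def collections_to_frames(db, names):
    """Return {collection name: DataFrame} for each named collection."""
    return {name: pd.DataFrame(list(db[name].find())) for name in names}


def extract_all(uri, db_name="user_activity", names=("users", "posts", "comments")):
    """Connect, pull every collection, and always close the connection."""
    # Imported here so the helper above can be used without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        return collections_to_frames(client[db_name], names)
    finally:
        # Runs even if one of the queries fails.
        client.close()
```

The try/finally block guarantees that the connection is released even when a query raises an error halfway through.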
I saved the extracted data to a cache file so that I don't need to download it every time I run the notebook. Also, some APIs have historical limits, so it's best to save or update the data every time it's pulled in.
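One possible shape for such a cache, assuming pickle files on local disk; the function name, the refresh flag and the file layout are illustrative choices, not what the notebook necessarily does:

```python
from pathlib import Path

import pandas as pd


def load_or_fetch(cache_path, fetch, refresh=False):
    """Return the cached DataFrame, or call `fetch` and cache its result."""
    cache_path = Path(cache_path)
    if cache_path.exists() and not refresh:
        return pd.read_pickle(cache_path)
    frame = fetch()  # e.g. a function that queries MongoDB
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_pickle(cache_path)  # overwrite the cache with the fresh pull
    return frame
```

Pickle is used here because it round-trips arbitrary Python objects stored in the columns; CSV or Parquet would work as well if all columns have plain types.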
First, let's prepare the data. I combined the users, posts and comments datasets into a new dataset called teamactivity, grouped by team in order to analyze the activity metrics per team. Take a look at the dataset format:
There are 151 teams, ranging in size from 1 to 505 users. Teams of 1-19 users are the most frequent.
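For readers who want to reproduce the preparation step, the per-team dataset described above could be assembled along these lines. The column names team_id and views are assumptions made for this sketch; the real collections may use different field names.

```python
import pandas as pd


def build_team_activity(users, posts, comments):
    """Aggregate the three frames into one row per team."""
    per_team = pd.concat(
        [
            users.groupby("team_id").size().rename("team_size"),
            posts.groupby("team_id").size().rename("posts"),
            posts.groupby("team_id")["views"].sum().rename("views"),
            comments.groupby("team_id").size().rename("comments"),
        ],
        axis=1,
    )
    # Teams without posts or comments come out of the join as NaN; count them as 0.
    return per_team.fillna(0).astype(int).rename_axis("team_id").reset_index()
```

With a frame like this, the number of teams is its length and the size distribution can be read from the team_size column.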
Now, let's find out whether there is a relation between team size and team activity.
Let's look at how team size affects the amount of knowledge sharing and engagement, by measuring the number of posts, post views and comments per team.
As the team size increases, the number of posts per team also increases exponentially.
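One way to check how well an exponential trend describes the data is a straight-line fit on the logarithm of the post counts. The sketch below is a generic least-squares approach with NumPy, not the method used to produce the original plots, and the function name is hypothetical:

```python
import numpy as np


def fit_exponential(team_size, posts):
    """Fit posts ~ a * exp(b * team_size) by least squares on log(posts)."""
    x = np.asarray(team_size, dtype=float)
    y = np.asarray(posts, dtype=float)
    keep = y > 0  # the logarithm is undefined for teams with zero posts
    # polyfit returns the highest-degree coefficient first: slope, then intercept.
    b, log_a = np.polyfit(x[keep], np.log(y[keep]), deg=1)
    return np.exp(log_a), b
```

A clearly positive b together with a good fit in log space would support the exponential reading; a poor fit would suggest a different growth pattern.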
Here, as the team size increases, the overall number of post views per team also increases. Users can view posts inside their own groups as well as any public post on the platform, which is open to any reader. Views are not unique: if a user views the same post 100 times, it counts as 100 views (not 1).
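To make the difference between total and unique views concrete, here is a toy example on a made-up event log; the column names user_id and post_id are assumptions for illustration only:

```python
import pandas as pd


def count_views(events):
    """Return (total views, unique user/post pairs) for a view event log."""
    total = len(events)
    unique = len(events.drop_duplicates(["user_id", "post_id"]))
    return total, unique


# User 1 opens post 10 a hundred times, user 2 opens it once.
events = pd.DataFrame({"user_id": [1] * 100 + [2], "post_id": [10] * 101})
print(count_views(events))  # (101, 2)
```

The metric used in this analysis corresponds to the first number, so heavy repeat readers inflate a team's view count.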
As the number of posts increases, the overall number of views also increases. These views can represent either reach or impressions. Reach tells how many people are seeing your content. An impression means the content was displayed, but it may not have generated any engagement or comment. The peak of 4202 posts together registered almost 560 thousand views, which represents an average of about 133 views per post.
Some teams comment a lot more than others, which shows up as peaks within each range of team size. For example, the teams in the middle with 387 and 394 users make more comments than the average for their size range. As the team size increases, team engagement also increases exponentially. Summing all the comments made, the team with more than 500 users had an average of 60 comments per person over all the periods analyzed.
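The two comparisons in this paragraph, comments per person and teams that stand out within their size range, can be sketched with pandas as follows. The column names and the bin edges are hypothetical; the ranges used in the original plots are not given in the text.

```python
import pandas as pd


def flag_above_range_average(team_activity, bin_edges):
    """Add comments per user and flag teams above their size range's mean."""
    out = team_activity.copy()
    out["comments_per_user"] = out["comments"] / out["team_size"]
    # Assign each team to a size range such as (0, 19] or (19, 400].
    out["size_range"] = pd.cut(out["team_size"], bins=bin_edges)
    range_mean = out.groupby("size_range", observed=True)["comments"].transform("mean")
    out["above_range_average"] = out["comments"] > range_mean
    return out
```

Teams flagged True are the ones that would appear as peaks inside their range on a comments-by-size plot.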