Feed Boost Analysis

A Q&A website wants to analyze whether new users who get a "feed boost" receive more answers to their questions. Is the feed boost effective?

Exploring the data


Overall, the distribution of the number of answers is skewed to the right, with a long tail toward higher answer counts.
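
A rough sketch of how this exploration might look, assuming the data comes as two tables with hypothetical names (questions.csv, users.csv) and a hypothetical num_answers column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names -- adjust to the actual dataset.
questions = pd.read_csv("questions.csv")  # one row per question
users = pd.read_csv("users.csv")          # one row per new user

# Distribution of answers per question: right-skewed, with a long tail.
questions["num_answers"].plot(kind="hist", bins=30)
plt.xlabel("Number of answers")
plt.ylabel("Number of questions")
plt.show()
```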

Below, we confirm that the average account ages of the control and experimental users are reasonably similar. We also confirm that the numbers of control and experimental users are about the same.
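
A sketch of those checks, continuing the code above and assuming hypothetical is_boosted and account_age columns in the users table:

```python
# Average account age per group -- expected to be reasonably similar.
print(users.groupby("is_boosted")["account_age"].mean())

# Number of users per group -- expected to be about the same.
print(users["is_boosted"].value_counts())
```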


Now we merge the questions and users dataframes so that each question carries information on whether it was asked by a boosted or non-boosted user.
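
One way the merge might be done, continuing the sketch above and assuming a hypothetical shared user_id key column:

```python
# Attach the boost flag (and account age, used later) to every question.
questions = questions.merge(
    users[["user_id", "is_boosted", "account_age"]],
    on="user_id",
    how="left",
)
```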


We confirm that the numbers of questions asked by boosted and non-boosted users are about the same. Below, we see that the standard deviations of the number of answers for the two groups are also similar, which supports the assumption of homogeneity of variance.
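
A sketch of that check on the merged data, with the same hypothetical column names:

```python
# Question counts and spread of the number of answers per group.
print(questions.groupby("is_boosted")["num_answers"].agg(["count", "mean", "std"]))
```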


1. The effect of the treatment overall, with a statistical test.

Hypothesis testing

We will consider two tests: the t-test and the permutation test. In both we will have the following hypotheses:

Null hypothesis: There is no difference between the number of answers for questions asked by non-boosted and boosted users. Any observed differences are due to chance.

Since we want to test whether the boost is associated with a higher number of answers, our alternative hypothesis is one-sided.

Alternative hypothesis: The number of answers for questions asked by non-boosted users is, on average, less than the number of answers for questions asked by boosted users.

We know that users are randomly assigned to the boosted and non-boosted groups, but for both tests we also need to assume that our questions data is randomly selected from, or otherwise representative of, all the questions asked by new users on the website. We must also assume that our samples are independent. From our exploratory data analysis, we see that the two sample sizes and standard deviations are reasonably similar, so we can assume homogeneity of variance.

While the two visualizations above show that the data is in fact not normal, the t-test only needs to assume that the distribution of the sample means is approximately normal. We can assume this with the Central Limit Theorem. So even if we are extra cautious and go with the permutation test instead, as we do below, we will see that the simulated mean differences are approximately normally distributed.
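
One possible choice of test statistic for the permutation test, continuing the sketch above (the column names remain hypothetical):

```python
import numpy as np

# Test statistic: mean answers for non-boosted questions minus mean answers
# for boosted questions. A negative value means boosted questions get more.
def mean_diff(df):
    boosted = df.loc[df["is_boosted"], "num_answers"].mean()
    non_boosted = df.loc[~df["is_boosted"], "num_answers"].mean()
    return non_boosted - boosted

observed = mean_diff(questions)
```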


An example of one iteration of shuffling the group labels during the permutation test:
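
A sketch of what a single shuffle could look like, reusing the helpers above:

```python
# Randomly reassign the boost labels and recompute the statistic once.
shuffled = questions.assign(
    is_boosted=np.random.permutation(questions["is_boosted"].values)
)
print(mean_diff(shuffled))
```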


Simulated Mean Differences Generated with Permutation Test:


Observed Difference: -0.30061649383815325, p-value = 0.08
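
The simulated distribution and the figures reported above could be produced along these lines (a sketch continuing the code above; the exact p-value depends on the random shuffles):

```python
# Build the null distribution of the statistic by shuffling the labels many times.
simulated = np.array([
    mean_diff(questions.assign(
        is_boosted=np.random.permutation(questions["is_boosted"].values)
    ))
    for _ in range(1000)
])

plt.hist(simulated, bins=30)  # roughly bell-shaped, as noted below
plt.xlabel("Simulated mean difference (non-boosted - boosted)")
plt.show()

# One-sided p-value: fraction of shuffles at least as extreme (as negative)
# as the observed difference.
p_value = np.mean(simulated <= observed)
print(observed, p_value)
```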

The distribution of the simulated mean differences is approximately normal, as anticipated above.

At 0.0896, the p-value is not small enough at an alpha value of 0.05 to say that the observed difference is statistically significant. Thus, the results of the permutation test show that overall, the boost does not correspond with a significantly higher number of answers. The results from the t-test below show a similar result, with a p-value of 0.0865.

t-statistic, p-value, degrees of freedom:
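
A sketch of how this one-sided two-sample t-test might be run with SciPy (the `alternative` argument needs a reasonably recent SciPy; the degrees of freedom are computed for the pooled-variance test):

```python
from scipy import stats

non_boosted = questions.loc[~questions["is_boosted"], "num_answers"]
boosted = questions.loc[questions["is_boosted"], "num_answers"]

# H1: non-boosted questions receive fewer answers on average than boosted ones.
t_stat, p_val = stats.ttest_ind(non_boosted, boosted, equal_var=True, alternative="less")
deg_freedom = len(non_boosted) + len(boosted) - 2
print(t_stat, p_val, deg_freedom)
```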


2. The effect of treatment given the age of the user's account.

The min, mean, and max account age:
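
A one-line sketch of that summary, again assuming the hypothetical account_age column:

```python
print(users["account_age"].agg(["min", "mean", "max"]))
```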


Now we compare:

  • the number of answers for non-boosted versus boosted questions from users below the mean account age, and

  • the number of answers for non-boosted versus boosted questions from users at or above the mean account age.

We analyze the below-mean and the at-or-above-mean account age data separately. We will use the t-test from here onwards.
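
A sketch of how the split and the t-tests might be set up, reusing the hypothetical columns carried over in the merge above:

```python
mean_age = users["account_age"].mean()

below = questions[questions["account_age"] < mean_age]
above = questions[questions["account_age"] >= mean_age]

def one_sided_ttest(df, alternative):
    non_boosted = df.loc[~df["is_boosted"], "num_answers"]
    boosted = df.loc[df["is_boosted"], "num_answers"]
    t_stat, p_val = stats.ttest_ind(non_boosted, boosted, equal_var=True,
                                    alternative=alternative)
    return t_stat, p_val, len(non_boosted) + len(boosted) - 2

# H1: boosted questions receive more answers. Flip the alternative to "greater"
# to test the opposite direction, as done further below for the below-mean group.
print(one_sided_ttest(below, "less"))
print(one_sided_ttest(above, "less"))
```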

"t-statistic, p-value, degrees of freedom:


We get a very large p-value of 0.999 for the below-mean account age data, which means that the number of answers for boosted questions is not significantly greater than the number of answers for non-boosted questions. In fact, when we look at the visualization below, we see that the mean number of answers is actually greater for non-boosted questions among accounts below the average account age.
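
A sketch of such a comparison of group means for the below-mean-age accounts, continuing the code above:

```python
# Mean number of answers per group for accounts below the average age.
below.groupby("is_boosted")["num_answers"].mean().plot(kind="bar")
plt.ylabel("Mean number of answers")
plt.show()
```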


The t-test below also confirms that non-boosted questions receive significantly more answers than boosted questions for accounts below the average age.

t-statistic, p-value, degrees of freedom:


Now, we run a t-test comparing the number of answers for questions asked by boosted and non-boosted users whose accounts are above the mean account age. The p-value is very small, which suggests that boosted questions do receive significantly more answers than non-boosted questions for users above the average account age.

t-statistic, p-value, degrees of freedom:


The overall, below-average account age, and above-average account age results are very different. While the overall test showed no significant difference between the non-boosted and boosted groups, the below-average account age results showed a significant difference in favor of the non-boosted group, and the above-average account age results showed a significant difference in favor of the boosted group.

From the two visualizations, we also observe that for accounts above the average age, the means are much more influenced by extreme values than for accounts below the average age. Another possible test we could run would be a permutation test based on the medians, to see if we would still get a significant difference after accounting for the outliers.
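
Such a variant could reuse the shuffling loop from the permutation test above, swapping in a median-based statistic, for example:

```python
# Median-based test statistic, less sensitive to extreme answer counts.
# Use this in place of mean_diff when building the null distribution.
def median_diff(df):
    boosted = df.loc[df["is_boosted"], "num_answers"].median()
    non_boosted = df.loc[~df["is_boosted"], "num_answers"].median()
    return non_boosted - boosted
```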

3. Possible consequences of the experiment, and other metrics

A potential positive consequence of the experiment would be that it generates actual, productive conversations and interactions. To measure this, we would need to know more than the number of answers. Evaluating the number of upvotes, shares, and/or comments on each answer, in addition to the number of answers, can provide more insight into the 'quality' of the conversation, even if we're not measuring the actual content of the answers. Of course, more popular questions could also attract low-quality answers, so we could evaluate the number of downvotes too.

A negative consequence that could arise is that questions that may be seen as 'spammy', not relevant to a browser's interests, or otherwise non-productive may be boosted over others. This would obviously hurt the average browser's experience. Currently, the "report" option includes categories like "spam", "insincere", "poorly written" and "incorrect topics" that are relevant to this issue, so the number of reports in those categories could also be evaluated. On the flip side, quality questions are often interesting enough that browsers request a well-known writer or expert to answer them, and a metric measuring the number of such requests could provide insight into the types of questions that users may see as worth boosting.