Feed Boost Analysis

A Q&A website wants to analyze whether new users who get a "feed boost" receive more answers to their questions. Is the feed boost effective?

Overall, the number of answers per question is right-skewed, with a long tail toward higher answer counts.

Below, we confirm that the average account ages of the control and experimental users are reasonably similar, and that the two groups contain about the same number of users.

Next, we merge the questions and users dataframes so that each row of the questions dataframe records whether the question was asked by a boosted or non-boosted user.
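The merge step can be sketched as follows. This is a minimal illustration on made-up rows; the column names (`user_id`, `boosted`, `question_id`, `num_answers`) are assumptions, since the original dataframes are not shown here.

```python
import pandas as pd

# Hypothetical stand-ins for the users and questions dataframes.
users = pd.DataFrame({"user_id": [1, 2, 3], "boosted": [0, 1, 0]})
questions = pd.DataFrame({
    "question_id": [101, 102, 103, 104],
    "user_id": [1, 2, 2, 3],
    "num_answers": [0, 3, 1, 2],
})

# Left-join so every question is kept; validate that each question
# maps to at most one user row (many questions per user is allowed).
merged = questions.merge(users, on="user_id", how="left", validate="many_to_one")

print(merged)
```

A left join with `validate="many_to_one"` is one way to guard against accidentally duplicating questions if the users table contained repeated user IDs.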

We confirm that boosted and non-boosted users asked about the same number of questions. Below, we see that the standard deviations for the two groups are also similar, which supports the assumption of homogeneity of variance.
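One way to run this kind of check is a per-group summary of counts and standard deviations. The sketch below uses made-up numbers and assumed column names (`boosted`, `num_answers`). The ratio of the larger to the smaller standard deviation is an informal heuristic, not a formal test.

```python
import pandas as pd

# Hypothetical merged data: one row per question.
merged = pd.DataFrame({
    "boosted": [0, 0, 0, 0, 1, 1, 1, 1],
    "num_answers": [0, 1, 2, 5, 0, 2, 3, 7],
})

# Question count, mean, and sample standard deviation per group.
summary = merged.groupby("boosted")["num_answers"].agg(["count", "mean", "std"])

# Informal heuristic: a ratio close to 1 suggests similar spread.
std_ratio = summary["std"].max() / summary["std"].min()

print(summary)
print(f"std ratio (larger / smaller): {std_ratio:.2f}")
```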

1. The effect of the treatment overall, with a statistical test.

We will consider two tests: the t-test and the permutation test. In both we will have the following hypotheses:

Null hypothesis: There is no difference between the number of answers for questions asked by non-boosted and boosted users. Any observed differences are due to chance.

Since we want to test whether the boost is associated with a higher number of answers, our alternative hypothesis is one-sided.

Alternative hypothesis: On average, questions asked by non-boosted users receive fewer answers than questions asked by boosted users.

We know that users are randomly assigned to the boosted and non-boosted groups, but for both tests we also need to assume that our questions data is randomly selected from, or otherwise representative of, all the questions asked by new users on the website. We must also assume that the samples are independent. From our exploratory data analysis, we see that the two sample sizes and standard deviations are reasonably similar, so we can assume homogeneity of variance.

While the two visualizations above show that the data is in fact not normal, the t-test only needs to assume that the distribution of the sample means is approximately normal, which we can justify with the Central Limit Theorem. To be extra cautious, we also run a permutation test below, where we will see that the simulated mean differences are approximately normally distributed.
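A minimal sketch of a one-sided permutation test of this kind, run on synthetic data rather than the data analyzed here. The variable names, the Poisson-generated counts, and the number of permutations are all assumptions for illustration. The sign convention (non-boosted minus boosted) is inferred from the one-sided hypothesis above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed answer counts; NOT the data from this analysis.
control = rng.poisson(lam=2.0, size=500)   # non-boosted questions
boosted = rng.poisson(lam=2.3, size=500)   # boosted questions


def permutation_test(control, boosted, n_permutations=5000, rng=None):
    """One-sided permutation test for mean(control) - mean(boosted) < 0."""
    rng = np.random.default_rng() if rng is None else rng
    observed = control.mean() - boosted.mean()
    pooled = np.concatenate([control, boosted])
    n_control = len(control)

    simulated = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffling the pooled values is equivalent to shuffling the labels.
        shuffled = rng.permutation(pooled)
        simulated[i] = shuffled[:n_control].mean() - shuffled[n_control:].mean()

    # Share of shuffles with a difference at least as negative as observed.
    p_value = np.mean(simulated <= observed)
    return observed, p_value, simulated


observed, p_value, simulated = permutation_test(control, boosted, rng=rng)
print(f"Observed difference: {observed:.4f}, p-value = {p_value:.4f}")
```

A histogram of `simulated` would show the null distribution of mean differences, which is what the normality claim refers to.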

An example of one iteration of shuffling the class labels during the permutation test:

Simulated Mean Differences Generated with Permutation Test:

Observed Difference: -0.30061649383815325, p-value = 0.08

The distribution of the simulated mean differences is approximately normal, as anticipated.

At 0.0896, the p-value exceeds the significance level of 0.05, so the observed difference is not statistically significant. Thus, the permutation test does not provide evidence that, overall, the boost corresponds with a higher number of answers. The t-test below gives a similar result, with a p-value of 0.0865.
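A one-sided, pooled-variance t-test matching the hypotheses above can be sketched with SciPy. The source does not say which library was used, so this is only one possible way to obtain a t-statistic, p-value, and degrees of freedom. The data is synthetic and the variable names are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic answer counts; NOT the data from this analysis.
control = rng.poisson(lam=2.0, size=500)   # non-boosted questions
boosted = rng.poisson(lam=2.3, size=500)   # boosted questions

# alternative="less" tests whether mean(control) < mean(boosted);
# equal_var=True uses the pooled-variance (Student) t-test.
result = stats.ttest_ind(control, boosted, equal_var=True, alternative="less")

# Degrees of freedom for the pooled-variance test.
df = len(control) + len(boosted) - 2

print(result.statistic, result.pvalue, df)
```

If the equal-variance assumption were in doubt, setting `equal_var=False` would switch to Welch's t-test, which has a different degrees-of-freedom formula.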

t-statistic, p-value, degrees of freedom:

2. The effect of treatment given the age of the user's account.

The min, mean, and max account age:

Now, we compare the number of answers between:

• Non-boosted and boosted users whose account age is below the mean

and

• Non-boosted and boosted users whose account age is above or equal to the mean

We analyze the below-mean and the above-or-equal-to-mean data separately. We will use the t-test from here onwards.
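The split described above could be sketched as follows: divide the questions at the mean account age, then run the same one-sided t-test within each subgroup. The data is randomly generated, and the column names (`boosted`, `account_age_days`, `num_answers`) are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
n = 400

# Synthetic merged data; NOT the data from this analysis.
merged = pd.DataFrame({
    "boosted": rng.integers(0, 2, size=n),
    "account_age_days": rng.integers(0, 60, size=n),
    "num_answers": rng.poisson(lam=2.0, size=n),
})

# Label each question by whether the asker's account age is below the mean.
mean_age = merged["account_age_days"].mean()
merged["age_group"] = np.where(
    merged["account_age_days"] < mean_age, "below_mean", "at_or_above_mean"
)

results = {}
for group_name, group in merged.groupby("age_group"):
    control = group.loc[group["boosted"] == 0, "num_answers"]
    boosted = group.loc[group["boosted"] == 1, "num_answers"]
    # One-sided test: do non-boosted questions get fewer answers?
    res = stats.ttest_ind(control, boosted, equal_var=True, alternative="less")
    results[group_name] = (res.statistic, res.pvalue, len(control) + len(boosted) - 2)

for group_name, (t_stat, p_val, dof) in results.items():
    print(group_name, t_stat, p_val, dof)
```

Note that with `alternative="less"`, a p-value close to 1 points toward the opposite direction (non-boosted receiving more answers), which would be checked by rerunning the test with `alternative="greater"`.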

t-statistic, p-value, degrees of freedom:

We get a very large p-value of 0.999 for the below-mean account age data, which means that the number of answers for boosted questions is not significantly greater than the number of answers for non-boosted questions. In fact, the visualization below shows that, for accounts below the average account age, the mean number of answers is actually greater for non-boosted questions.

The t-test below also confirms that, for accounts below the average age, non-boosted questions receive significantly more answers than boosted questions.

t-statistic, p-value, degrees of freedom:

Now, we run a t-test comparing the number of answers for questions asked by boosted and non-boosted users whose accounts are above the mean account age. The p-value is very small, which suggests that boosted questions do receive significantly more answers than non-boosted questions for users above the average account age.

t-statistic, p-value, degrees of freedom: