Overview

The following is an example of A/B testing using R. It is for training purposes only, and the example dataset is taken from https://github.com/etomaa/A-B-Testing/blob/master/data/Website%20Results.csv.

Split testing is another name for A/B testing. It's used online when one wants to test a new feature or product. This article walks through a repeatable example for inferring whether a new feature is worth adopting or not.

Variant A is shown to the control group, which sees the existing features or products on the website. Variant B is shown to the experimental group, which sees the new version of the feature or product, so we can check whether users like it or whether it increases conversions (bookings). In the given dataset, converted is a logical value with two categories: it is true when the customer completes a booking and false when the customer visits the site but does not make a booking. We are therefore looking for an increase or decrease in conversions with Variant B, to see how the new feature performs.

Assumptions

  1. Test Version A is the control group, which depicts the existing products or features on the website.
  2. Test Version B is the experimental group, used to test the new version of a product or feature to see if users like it or if it increases hotel bookings (conversions).
  3. Converted – based on the given dataset, there are two categories defined by logical values (TRUE or FALSE):
     - Converted = TRUE when the customer successfully makes a booking
     - Converted = FALSE when the customer visits the site but does not make a booking

Out of Scope

  1. A/B testing procedure/process
  2. Type I & Type II errors
  3. Since the practical significance level is not known for this case study, there is no basis for making a recommendation based on it.

Objective

  1. To lay out the conclusions drawn from the results of the sample data provided
  2. To make recommendations on possible paths forward and provide relevant justification.

A/B Test Hypothesis

Null Hypothesis

Versions A and B have the same probability of driving customer bookings or conversions: Pcont_A = PExp_B. In other words, there is no effect and no difference between versions A and B.

Alternative Hypothesis

Versions A and B have different probabilities of driving customer bookings or conversions: PExp_B != Pcont_A. There is a difference between versions A and B. (The business hope is the one-sided version of this claim, that B drives more bookings than A, but the analysis below computes a two-sided test.)

library(readr)
library(tidyverse)
library(pwr)

# Load the example dataset directly from GitHub
ABTest <- read.csv("https://raw.githubusercontent.com/etomaa/A-B-Testing/master/data/Website%20Results.csv")
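
Before doing any analysis, it helps to sanity-check the load. A quick sketch, assuming the CSV parses with the variant and converted columns used throughout this article:

# Inspect column types and the per-variant breakdown of conversions;
# these counts drive everything that follows
str(ABTest)
table(ABTest$variant, ABTest$converted)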

First, we find the current conversion rates of variants A and B.

# Let's take a subset of conversions for variant_A
conversion_subset_A <- ABTest %>% 
    filter(variant == "A" & converted == "TRUE")
  
# Total Number of Conversions for variant_A
conversions_A <- nrow(conversion_subset_A)
  
# Number of Visitors for variant_A
visitors_A <- nrow(ABTest %>% 
    filter(variant == "A"))
  
# Conversion_rate_A
conv_rate_A <- (conversions_A/visitors_A)  
print(conv_rate_A)
## [1] 0.02773925
# Let's take a subset of conversions for variant_B
conversion_subset_B <- ABTest %>% 
    filter(variant == "B" & converted == "TRUE")
  
# Number of Conversions for variant_B
conversions_B <- nrow(conversion_subset_B)
  
# Number of Visitors for variant_B
visitors_B <- nrow(ABTest %>% 
    filter(variant == "B"))
  
# Conversion_rate_B
conv_rate_B <- (conversions_B/visitors_B)  
print(conv_rate_B)
## [1] 0.05068493

Therefore, the baseline conversion rate is about 0.028 for variant A and about 0.051 for variant B. This suggests an increase in bookings with the new feature, but we can go further.
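
As an aside, the same two rates can be computed in a single dplyr pipe. This is a sketch equivalent to the step-by-step code above, assuming converted is read as the strings "TRUE"/"FALSE" (a logical column coerces the same way in the comparison):

# Per-variant visitors, conversions, and conversion rate in one pipe
ABTest %>%
  group_by(variant) %>%
  summarise(visitors    = n(),
            conversions = sum(converted == "TRUE"),
            conv_rate   = conversions / visitors)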

# Run a proportion test for each variant to get confidence
# intervals around the conversion rates (note: conf.level = 0.975
# produces 97.5% intervals, a bit wider than the 95% intervals
# computed by hand later)

control_results <- prop.test(x = conversions_A,
                             n = visitors_A,
                             alternative = 'two.sided', conf.level = 0.975)
control_lower_ci <- control_results$conf.int[1]
control_upper_ci <- control_results$conf.int[2]

alternate_results <- prop.test(x = conversions_B,
                               n = visitors_B,
                               alternative = 'two.sided', conf.level = 0.975)

alternate_lower_ci <- alternate_results$conf.int[1]
alternate_upper_ci <- alternate_results$conf.int[2]


# Gather the conversion rates into a vector.
conv_rate <- c(conv_rate_A, 
               conv_rate_B)

# And another for the lower CI values.
# I'm going to round these to 3 decimal places.
lower_ci  <- c(round(control_lower_ci, 3), 
               round(alternate_lower_ci, 3))

# And one for the upper CI value.
# I'm going to round these to 3 decimal places.
upper_ci <- c(round(control_upper_ci, 3), 
              round(alternate_upper_ci, 3))

# Then add a name for each.
version <- c('A. Control',
             'B. Alternative')

# Finally, create the dataframe from the four vectors.
conv_tbl <- data.frame(version, conv_rate, lower_ci, upper_ci)

# Display the results.
conv_tbl
##          version  conv_rate lower_ci upper_ci
## 1     A. Control 0.02773925    0.016    0.046
## 2 B. Alternative 0.05068493    0.035    0.073

Here we have the confidence intervals for variants A and B. The booking conversion rate is noticeably higher for variant B, suggesting the new feature is beneficial, though the intervals overlap somewhat. We can plot this to get a visual representation of these results.

Now we can create a graph comparing the two. I'm going to use error bars and not highlight the conversion rates; the focus will be on the intervals and nothing else. (Note: ggthemr typically needs to be installed from GitHub rather than CRAN.)

library(scales)
library(ggthemr)
ggthemr("dust")
conv_tbl %>%
  ggplot(aes(x=version, y=conv_rate, group=1)) +
    geom_errorbar(width=.1, aes(ymin=lower_ci, ymax=upper_ci), colour="black")+
  ggtitle('Conversion Rate - Confidence Intervals') + 
  ylab('Conversion Rate') + 
  xlab('Version') +
  scale_y_continuous(labels = scales::percent)
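
If you do want the point estimates marked as well, one extra layer is enough. A sketch reusing conv_tbl from above:

# Same chart with the point estimates drawn on top of the error bars
conv_tbl %>%
  ggplot(aes(x = version, y = conv_rate)) +
  geom_point(size = 2) +
  geom_errorbar(width = .1, aes(ymin = lower_ci, ymax = upper_ci), colour = "black") +
  ylab('Conversion Rate') +
  xlab('Version') +
  scale_y_continuous(labels = scales::percent)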

Uplift

We can dig deeper into the uplift and the statistical significance of the new feature.

uplift <- (conv_rate_B - conv_rate_A) / conv_rate_A * 100
uplift
## [1] 82.71918
# Pooled sample proportion for variants A & B
p_pool <- (conversions_A + conversions_B) / (visitors_A +
                                             visitors_B)
print(p_pool) # 0.03928325
## [1] 0.03928325
# Let's compute Standard error for variants A & B (SE_pool)
SE_pool <- sqrt(p_pool * (1 - p_pool) * ((1 / visitors_A) + 
                                         (1 / visitors_B)))
print(SE_pool) # 0.01020014
## [1] 0.01020014
# Let's compute the margin of error for the pool
MOE <- SE_pool * qnorm(0.975)
print(MOE) # 0.0199919
## [1] 0.0199919
# Point Estimate or Difference in proportion
d_hat <- conv_rate_B - conv_rate_A

# Compute the Z-score so we
# can determine the p-value
z_score <- d_hat / SE_pool

# Let's compute p_value 
# using the z_score value
p_value <- pnorm(q = -z_score, 
                 mean = 0, 
                 sd = 1) * 2


# Let's compute Confidence interval for the 
# pool using pre-calculated results
ci <- c(d_hat - MOE, d_hat + MOE)
ci
## [1] 0.002953777 0.042937584
ci_lower <- d_hat - MOE
ci_lower 
## [1] 0.002953777
ci_upper <- d_hat + MOE
ci_upper 
## [1] 0.04293758
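
As a cross-check on the hand-rolled arithmetic, R's built-in two-sample proportion test runs the same pooled z-test: with correct = FALSE, its chi-squared statistic should equal z_score^2 (about 5.06) and its p-value should match the one computed above.

# Two-sample proportion test without continuity correction;
# this is the built-in equivalent of the pooled z-test above
prop.test(x = c(conversions_B, conversions_A),
          n = c(visitors_B, visitors_A),
          alternative = "two.sided",
          correct = FALSE)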

Analysis

What follows is a consolidated, step-by-step walkthrough of the full analysis, including per-variant confidence intervals and visualizations.

  1. Let's load the following libraries:
library(tidyverse)
library(extrafont)

2a. Let’s load the data into a dataframe ABTest for the analysis:

ABTest <- read.csv("https://raw.githubusercontent.com/etomaa/A-B-Testing/master/data/Website%20Results.csv")

3a. Let’s take a subset of conversions for test_version_A:

conversion_subset_A <- ABTest %>% filter(variant == "A" & converted == "TRUE")

3b. Number of Conversions for test_version_A:

conversions_A <- nrow(conversion_subset_A)

3c. Number of Visitors for test_version_A:

visitors_A <- nrow(ABTest %>% filter(variant == "A"))

3d. Let’s compute conversion rate for test_version_A:

conv_rate_A <-  (conversions_A/visitors_A)  
conv_rate_A
## [1] 0.02773925

3e. Let’s take a subset of conversions for test_version_B:

conversion_subset_B <- ABTest %>% filter(variant == "B" & converted == "TRUE")

3f. Number of Conversions for test_version_B:

conversions_B <- nrow(conversion_subset_B)

3g. Number of Visitors for test_version_B:

visitors_B <- nrow(ABTest %>% filter(variant == "B"))

3h. Let’s compute conversion rate for test_version_B:

conv_rate_B <-  (conversions_B/visitors_B)  
conv_rate_B
## [1] 0.05068493

Relative Uplift

  1. Let’s Calculate Relative uplift in Conversion Rate:
uplift <- (conv_rate_B - conv_rate_A)/ conv_rate_A * 100
uplift 
## [1] 82.71918

Deduction: B is better than A by 82.7%. This is high enough to declare a winner, but we need more supporting evidence before arriving at a final decision.

Pooled Probability for Test Versions A & B

5a. Let’s compute pooled probability:

p_pool <- (conversions_A + conversions_B)/(visitors_A + visitors_B)
p_pool
## [1] 0.03928325

5b. Let’s compute the standard error (SE_pool) for the pool:

SE_pool<- sqrt(p_pool*(1-p_pool) * ((1/visitors_A) + (1/visitors_B)))
SE_pool
## [1] 0.01020014

5c. Let’s compute the margin of error for the pool:

MOE <- SE_pool * qnorm(0.975)
MOE
## [1] 0.0199919

5d. Point Estimate or Difference in proportion:

d_hat <- conv_rate_B - conv_rate_A
d_hat
## [1] 0.02294568

5e. Compute the Z-score so we can determine the p-value:

z_score <- d_hat/SE_pool
z_score 
## [1] 2.249546

P-Value for Test Versions A & B

5f. Let’s compute p_value using the z_score value:

p_value <- pnorm(q = -z_score, mean = 0, sd = 1) * 2
p_value
## [1] 0.02447777

5g. Let’s compute Confidence interval for the pool

ci <- c(d_hat - MOE, d_hat + MOE)
ci
## [1] 0.002953777 0.042937584
ci_lower <- d_hat - MOE
ci_lower 
## [1] 0.002953777
ci_upper <- d_hat + MOE
ci_upper 
## [1] 0.04293758
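
Before moving on: the pwr package was loaded at the start but never used. As an optional aside (an illustration, not part of the original analysis), it can estimate how many visitors per variant would be needed to detect a difference of this size; for these rates it comes out to roughly 550 per variant, so our sample of roughly 720 per variant is adequately powered.

# Effect size for two proportions via the arcsine transformation,
# using the observed conversion rates from this dataset
h <- ES.h(p1 = conv_rate_B, p2 = conv_rate_A)

# Visitors needed per variant for 80% power at the 5% significance
# level; leaving n unspecified tells pwr.2p.test to solve for it
pwr.2p.test(h = h, sig.level = 0.05, power = 0.80)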

6a. Let’s compute Standard Error and Confidence Intervals for Test version A separately

X_hat_A <- conversions_A/visitors_A
X_hat_A 
## [1] 0.02773925
se_hat_A <- sqrt(X_hat_A*(1-X_hat_A)/visitors_A) 
print(se_hat_A) 
## [1] 0.006116051
ci_A <- c(X_hat_A - qnorm(0.975)*se_hat_A, X_hat_A + qnorm(0.975)*se_hat_A) 
ci_A 
## [1] 0.01575201 0.03972649

6b. Let’s compute Standard Error and Confidence Intervals for Test version B separately

X_hat_B <- conversions_B/visitors_B
X_hat_B  
## [1] 0.05068493
se_hat_B <- sqrt(X_hat_B*(1-X_hat_B)/visitors_B) 
print(se_hat_B)
## [1] 0.008118638
ci_B <- c(X_hat_B - qnorm(0.975)*se_hat_B, X_hat_B + qnorm(0.975)*se_hat_B) 
ci_B 
## [1] 0.03477269 0.06659717

6c. *Let’s extract the lower and upper confidence intervals into lower and upper respectively

ci_lower_A <- X_hat_A - qnorm(0.975)*se_hat_A
ci_lower_A
## [1] 0.01575201
ci_upper_A <- X_hat_A + qnorm(0.975)*se_hat_A
ci_upper_A
## [1] 0.03972649
ci_lower_B <- X_hat_B - qnorm(0.975)*se_hat_B
ci_lower_B
## [1] 0.03477269
ci_upper_B <- X_hat_B + qnorm(0.975)*se_hat_B
ci_upper_B
## [1] 0.06659717

7a. Let’s create a dataframe of results for the pool

vis_result_pool <- data.frame(
  metric=c(
    'Estimated Difference',
    'Relative Uplift(%)',
    'pooled sample proportion',
    'Standard Error of Difference',
    'z_score',
    'p-value',
    'Margin of Error',
    'CI-lower',
    'CI-upper'),
  value=c(
    conv_rate_B - conv_rate_A,
    uplift,
    p_pool,
    SE_pool,
    z_score,
    p_value,
    MOE,
    ci_lower,
    ci_upper
  ))
vis_result_pool
##                         metric        value
## 1         Estimated Difference  0.022945680
## 2           Relative Uplift(%) 82.719178082
## 3     pooled sample proportion  0.039283253
## 4 Standard Error of Difference  0.010200138
## 5                      z_score  2.249546089
## 6                      p-value  0.024477774
## 7              Margin of Error  0.019991903
## 8                     CI-lower  0.002953777
## 9                     CI-upper  0.042937584

7b. Let’s create a dataframe of results per test version

vis_result <- data.frame(variant = c("A","B"),
                         visitors = c(visitors_A, visitors_B),
                         conversions = c(conversions_A, conversions_B),
                         conversion_rate = round(c(conv_rate_A, conv_rate_B), 4),
                         Standard_error = round(c(se_hat_A, se_hat_B), 5),
                         ci_lower = round(c(ci_A[1], ci_B[1]), 4),
                         ci_upper = round(c(ci_A[2], ci_B[2]), 4))
vis_result
##   variant visitors conversions conversion_rate Standard_error ci_lower ci_upper
## 1       A      721          20          0.0277        0.00612   0.0158   0.0397
## 2       B      730          37          0.0507        0.00812   0.0348   0.0666

Visualize the Results

8a. Let’s create the dataframe

ABTdata = data.frame(
  variant = factor(rep(c("A", "B"), each=200)),
  # 200 simulated draws per variant, centred on each conversion rate;
  # without a set.seed() call the draws (and the head() below) will vary
  conf_interval = c(rnorm(200, 0.02773925, 0.005891128), rnorm(200, 0.05068493, 0.007793573))
)
head(ABTdata, 4)
##   variant conf_interval
## 1       A    0.02235077
## 2       A    0.02049335
## 3       A    0.03216455
## 4       A    0.02444700

8b. Let’s create a sub-dataframe we will call in our plot

pe = data.frame(variant = c("A","B"), conversion_rate = round(c(conv_rate_A, conv_rate_B),4), lower_confidence = round(c(ci_lower_A,ci_upper_A),4))
pe
##   variant conversion_rate lower_confidence
## 1       A          0.0277           0.0158
## 2       B          0.0507           0.0397

8c. #Let’s plot the distributions

a <- ggplot(ABTdata, aes(x = conf_interval))
a + geom_density(aes(fill = variant), alpha = 0.3) +
  geom_vline(aes(xintercept = lower_confidence[1], color = variant),
             data = pe, linetype = "dashed") +
  geom_vline(aes(xintercept = lower_confidence[2], color = variant),
             data = pe, linetype = "dashed") +
  geom_vline(aes(xintercept = conversion_rate, color = variant),
             data = pe, linetype = "dashed") +
  scale_color_manual(values = c("#b8b7b8", "#b8b7b8"))+
  scale_fill_manual(values = c("#e9e8e9", "#4b779d")) +
  annotate(geom="text", x= 0.0507,  y=29, 
           hjust = 0,     # zero left justifies the text annotation horizontally. 0.5 is center, 1 is right justified.  
           vjust = 0.5,   # > +0.1 range pushes the text annotation down vertically
           #alpha = 0.2,  #alpha reduces the transparency of the text
           label="CR B: 5.06%",
           fontface = "bold",
           colour = "black",
           angle = 90,
           family="Century Gothic",
           size = 4.0) +
  annotate(geom="text", x= 0.0277,  y=35, 
           hjust = 0,     # zero left justifies the text annotation horizontally. 0.5 is center, 1 is right justified.  
           vjust = 0.5,   # > +0.1 range pushes the text annotation down vertically
           #alpha = 0.2,  #alpha reduces the transparency of the text
           label="CR A: 2.77%",
           fontface = "bold",
           angle = 90,
           colour = "black",
           family="Century Gothic",
           size = 4.0)+
  annotate(geom="text", x= 0.0403,  y=24, 
           hjust = 0,     # zero left justifies the text annotation horizontally. 0.5 is center, 1 is right justified.  
           vjust = 0.5,   # > +0.1 range pushes the text annotation down vertically
           #alpha = 0.2,  #alpha reduces the transparency of the text
           label="95%",
           #fontface = "bold",
           angle = 90,
           colour = "black",
           family="Century Gothic",
           size = 4.0)+
  annotate(geom="text", x= 0.0164,  y=24, 
           hjust = 0,     # zero left justifies the text annotation horizontally. 0.5 is center, 1 is right justified.  
           vjust = 0.5,   # > +0.1 range pushes the text annotation down vertically
           #alpha = 0.2,  #alpha reduces the transparency of the text
           label="95%",
           #fontface = "bold",
           angle = 90,
           colour = "black",
           family="Century Gothic",
           size = 4.0)+
  labs(title = "Expected Distributions of Variants A and B",
       x = "Confidence Interval",
       y = "",
       caption = "Etoma Egot") +
  theme_classic() +
  theme(legend.position = "none") +
  theme(axis.ticks.y = element_blank(), #remove y axis ticks
        axis.title.y = element_blank()) + #remove y axis title
  theme(plot.title = ggplot2::element_text(family="Century Gothic",
                                           size=20,
                                           #face="bold",
                                           color="#002849")) +
  theme(plot.subtitle = ggplot2::element_text(family="Century Gothic",
                                              size=16,
                                              face="bold",
                                              color = "#4b779d"))+
  theme(panel.grid.major.y = ggplot2::element_line(color="#FAFAFA"))
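
One practical caveat on the chart above: the family = "Century Gothic" arguments assume that font has been registered with extrafont; if it has not, ggplot2 falls back to the default font with warnings. A one-time setup sketch:

# Import system fonts (slow; only needed once per machine),
# then register them for the current session
library(extrafont)
# font_import()   # uncomment on first use
loadfonts()       # on Windows you may need loadfonts(device = "win")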

  1. Let’ visualize the pool using a normal distribution
ABTdata_pool = data.frame(Conf_interval = rnorm(n = 10000, mean = 0, sd = SE_pool))
#Let's plot the distributions
b <- ggplot(ABTdata_pool, aes(x = Conf_interval))
b + geom_histogram(aes(y = ..density..),
                   colour="black", fill="white", binwidth = 0.00029) +
  geom_density(alpha = 0.2, fill = "#FF6666")+
  theme_classic()
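
To connect this null distribution back to the test result, we can mark the observed difference d_hat on it; it falls in the upper tail, which is what the p-value of roughly 0.024 reflects. A sketch reusing the objects above:

# Overlay the observed difference on the simulated null distribution
b + geom_histogram(aes(y = ..density..),
                   colour = "black", fill = "white", binwidth = 0.00029) +
  geom_density(alpha = 0.2, fill = "#FF6666") +
  geom_vline(xintercept = d_hat, linetype = "dashed", colour = "#4b779d") +
  theme_classic()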

Conclusions Drawn from Data

  1. There are 721 visitors and 20 conversions for test version A, and 730 visitors and 37 conversions for test version B.
  2. The relative uplift is 82.72%, based on a conversion rate of 2.77% for A and 5.07% for B. Hence B is better than A by 82.72%.
  3. The p-value computed for this analysis is 0.0245 (p < 0.05, i.e., below the 5% significance level), so the test result is statistically significant: the observed difference is unlikely to be the result of random chance.
  4. The pooled confidence interval for the difference (lower = 0.0030, upper = 0.0429) does not include zero, and the confidence interval of "B" sits further from zero than that of "A", which further strengthens the recommendation.