Simpson’s paradox, or the Yule–Simpson effect, is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title reversal paradox or amalgamation paradox.
Let’s consider the following scatterplot built from dummy data.
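# Libraries
library(tidyverse)
library(hrbrthemes)
# Fix the random seed for reproducibility (42 is arbitrary)
set.seed(42)
# Create data: within a group, y decreases with x (slope -1/2)
grp_a <- data.frame(x = rnorm(100), y = rnorm(100)) %>% mutate(y = y - x / 2)
# Groups B and C are copies of A shifted by +2 and +4 on both axes
grp_b <- grp_a %>% mutate(x = x + 2, y = y + 2)
grp_c <- grp_a %>% mutate(x = x + 4, y = y + 4)
data <- bind_rows(grp_a, grp_b, grp_c) %>%
  mutate(group = rep(c("A", "B", "C"), each = 100))
# Scatterplot of all points, ignoring the groups
ggplot(data, aes(x = x, y = y)) +
  geom_point(size = 2) +
  theme_ipsum()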
Here, it seems reasonable to state that X and Y are positively correlated. Indeed, the Pearson correlation coefficient is around 0.6 (0.63 for the sample shown here; the exact value depends on the random draw).
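The pooled coefficient can be checked directly. The snippet below is a minimal sketch in base R: it rebuilds the same kind of dummy data and computes the Pearson correlation over all points. The seed and the names `grp_a`, `grp_b`, `grp_c` and `pooled_r` are assumptions made for this illustration, so the printed value will be close to, but not necessarily equal to, the one quoted above.

```r
# Assumption: a fixed, arbitrary seed so the result is reproducible
set.seed(42)

# Same construction as the dummy data: y decreases with x within a group
grp_a <- data.frame(x = rnorm(100), y = rnorm(100))
grp_a$y <- grp_a$y - grp_a$x / 2

# Groups B and C are copies of A shifted by +2 and +4 on both axes
grp_b <- transform(grp_a, x = x + 2, y = y + 2)
grp_c <- transform(grp_a, x = x + 4, y = y + 4)

data <- rbind(grp_a, grp_b, grp_c)
data$group <- rep(c("A", "B", "C"), each = 100)

# Pearson correlation over all 300 points, ignoring the groups
pooled_r <- cor(data$x, data$y, method = "pearson")
print(pooled_r)
```
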
However, let’s check what happens when we take into account the three groups present in the dataset:
# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
# Reuse the `data` frame built above so both plots show the same points
ggplot(data, aes(x = x, y = y, color = group)) +
  geom_point(size = 3) +
  scale_color_viridis(discrete = TRUE) +
  theme_ipsum()
Here, we see that the positive correlation was driven by the differences between groups. Within each group considered separately, the correlation coefficient is actually negative.
This is Simpson’s paradox: the trend between two variables reverses when a third variable is taken into account.
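The reversal can be verified numerically by comparing the pooled correlation with the correlation inside each group. This is a minimal sketch in base R; the seed and the names `grp_a`, `grp_b`, `grp_c`, `pooled_r` and `by_group_r` are assumptions made for this illustration.

```r
# Assumption: a fixed, arbitrary seed so the result is reproducible
set.seed(42)

# Rebuild the dummy data: three shifted copies of the same cloud
grp_a <- data.frame(x = rnorm(100), y = rnorm(100))
grp_a$y <- grp_a$y - grp_a$x / 2
grp_b <- transform(grp_a, x = x + 2, y = y + 2)
grp_c <- transform(grp_a, x = x + 4, y = y + 4)
data <- rbind(grp_a, grp_b, grp_c)
data$group <- rep(c("A", "B", "C"), each = 100)

# Correlation over all points, ignoring the groups
pooled_r <- cor(data$x, data$y)

# Correlation computed separately inside each group
by_group_r <- sapply(split(data, data$group), function(d) cor(d$x, d$y))

print(pooled_r)    # positive
print(by_group_r)  # negative in every group
```

Because groups B and C are shifted copies of group A, the three within-group coefficients are identical: adding a constant to a variable does not change its correlation with another.
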
This matters a great deal for data analysis and data visualization. Simpson’s paradox can lead to wrong conclusions based on spurious correlations. Always double-check the potential effect of confounding variables available in your dataset.
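One possible way to run such a check, sketched here as an illustration rather than a prescribed method, is to fit a linear model with and without the suspected confounder and compare the sign of the slope. The seed and the names `naive_fit` and `adjusted_fit` are assumptions made for this example.

```r
# Assumption: a fixed, arbitrary seed so the result is reproducible
set.seed(42)

# Rebuild the dummy data: three shifted copies of the same cloud
grp_a <- data.frame(x = rnorm(100), y = rnorm(100))
grp_a$y <- grp_a$y - grp_a$x / 2
grp_b <- transform(grp_a, x = x + 2, y = y + 2)
grp_c <- transform(grp_a, x = x + 4, y = y + 4)
data <- rbind(grp_a, grp_b, grp_c)
data$group <- rep(c("A", "B", "C"), each = 100)

# Model 1: ignores the grouping variable
naive_fit <- lm(y ~ x, data = data)

# Model 2: includes the group as an additional explanatory variable
adjusted_fit <- lm(y ~ x + group, data = data)

# Compare the slope of x in both models
print(coef(naive_fit)["x"])     # positive slope
print(coef(adjusted_fit)["x"])  # negative slope once group is accounted for
```

A slope that changes sign when a variable is added to the model is a strong hint that this variable confounds the relationship and should be shown in the chart.
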
A work by Yan Holtz for data-to-viz.com