Observational Studies and Causal Inference

Last Modified

June 24, 2025

Disclaimer

This article was written solely for informational and educational purposes, and does not constitute medical advice. If you have any health concerns, please consult a qualified medical professional.

Randomised controlled trials (RCTs) are widely recognised as the gold standard of study designs, but they are not always possible to perform. For example, it would have been neither ethical nor feasible for Richard Doll and Austin Bradford Hill to randomly assign a group of volunteers to a smoking and non-smoking group in order to show that cigarettes cause lung cancer. However, they were still able to establish a link between smoking and the disease, using first a case-control study and then a prospective cohort study1.

These are both types of observational study2, an alternative to RCTs that can be very informative – as in the case of smoking and lung cancer. Unfortunately, the fact that participants are not randomised to different groups in observational studies can make their results particularly difficult to interpret. In this article, I will discuss how diagrams representing causal relationships can help us to spot potential sources of bias in these kinds of studies.

Causal Inference

Causal relationships are notoriously difficult to distinguish from mere statistical associations, as summed up in the famous saying “correlation does not imply causation”. A humorous example of this can be found in an educational statistics paper in which the author investigates the fairy tale that storks deliver babies by analysing 1980s estimates of stork populations and birth rates in seventeen different European countries3.

Estimates of European Stork Populations and Birth Rates in the 1980s

Country         Stork Pairs   Yearly Births
Albania                 100          83,000
Austria                 300          87,000
Belgium                   1         118,000
Bulgaria              5,000         117,000
Denmark                   9          59,000
France                  140         774,000
Germany               3,300         901,000
Greece                2,500         106,000
Holland                   4         188,000
Hungary               5,000         124,000
Italy                     5         551,000
Poland               30,000         610,000
Portugal              1,500         120,000
Romania               5,000         367,000
Spain                 8,000         439,000
Switzerland             150          82,000
Turkey               25,000       1,576,000

We can plot these data and observe a linear relationship between stork populations and birth rates, with a correlation coefficient of 0.62 to two significant figures.

Plotting the Data and Calculating the Correlation Coefficient in R
stork_pairs <- c(100, 300, 1, 5000, 9, 140, 3300, 2500, 4, 5000, 5, 30000, 1500, 5000, 8000, 150, 25000)
yearly_births <- c(83000, 87000, 118000, 117000, 59000, 774000, 901000, 106000, 188000, 124000, 551000, 610000, 120000, 367000, 439000, 82000, 1576000)
correlation <- cor(stork_pairs, yearly_births) # calculating the pearson correlation coefficient
par(mar=c(bottom=4.2, left=5.1, top=1.8, right=0.9), family="Roboto Slab") # setting the plot options
plot(stork_pairs, yearly_births, # plotting the data points
     xlab="Stork Pairs", ylab="Yearly Births", 
     pch=4, cex=1.8, col="#009ed8", lwd=4.8, cex.lab=1.5, cex.axis=1.5)
abline(lm(yearly_births ~ stork_pairs), # adding the line of best fit
       col = "#ffc000", lty=2, lwd = 4.2)
legend("topleft", # adding a key
       legend=paste0("Line of Best Fit  (r=", round(correlation, 2), ")"), 
       col="#ffc000", lty=2, lwd=4.2, cex=1.5)

Moreover, performing a significance test for correlation with the null hypothesis that there is no relationship between stork populations and birth rates gives us a p-value of less than 0.05, a common threshold for a result to be considered statistically significant4.

Testing the Null Hypothesis in R
significance_test <- cor.test(stork_pairs, yearly_births)
p_val <- significance_test$p.value
cat("p-value:", p_val)
p-value: 0.007897861

Clearly, more storks do not cause more babies to be born. But the non-causality of statistical associations might not always be so obvious. To deal with the challenges of understanding causes and effects, a new branch of statistics known as ‘causal inference’ has emerged over the past few decades5.

This type of statistics often involves the use of directed acyclic graphs – ‘DAGs’ for short – to model the relationships between variables. Let’s explore the essential features of these diagrams.

Directed Acyclic Graphs

In discrete mathematics, a graph is a structure made up of vertices linked by edges. A path between two vertices is a sequence of edges linking them together6.

For example, in the above graph, there is a path between A and C along the sequence of edges between vertices A, B, D, E, and C.

Directed graphs are a particular type of graph in which all the edges are arrows. In these types of graph, a vertex can be a descendant or ancestor of another vertex if there is a path of arrows all pointing in the same direction between them; the descendant lies at the arrowhead end and the ancestor at the tail end. If this path is only one edge long, the ancestor can be called the descendant’s parent and the descendant can be called the parent’s child7.

Here, B is a parent (and thus also an ancestor) of C and D, a child (and descendant) of A, and an ancestor of E.
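
To make these definitions concrete, below is a minimal sketch in base R (the specific edges A → B, B → C, B → D, and D → E are my assumption, chosen to match the description above) that stores a directed graph as an edge list and finds parents and ancestors by repeatedly following arrows backwards.

Finding Parents and Ancestors in a Directed Graph in R
edges <- data.frame(from = c("A", "B", "B", "D"), # assumed edges, one row per arrow
                    to   = c("B", "C", "D", "E"))
find_parents <- function(vertex) edges$from[edges$to == vertex] # vertices one arrow upstream
find_ancestors <- function(vertex) { # all vertices reachable by following arrows backwards
  found <- character(0)
  frontier <- find_parents(vertex)
  while (length(frontier) > 0) {
    found <- union(found, frontier)
    frontier <- setdiff(unlist(lapply(frontier, find_parents)), found)
  }
  found
}
find_parents("C")   # "B"
find_ancestors("E") # "D" "B" "A"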

Directed graphs are ideal for causal inference, because we can use vertices to represent the variables and edges to represent the causal relationships between them, with parents having a direct effect on their children. For example, a simplified DAG showing the effects of different factors on a person’s chance of developing lung cancer might show the variables ‘Genetics’, ‘Environmental Factors’, and ‘Smoking’ all to be parents of the variable ‘Lung Cancer Risk’.

An acyclic graph is simply one that contains no cycles (a cycle is a directed path leading from a variable back to itself), so when we use a DAG to model causal relationships a variable cannot cause itself.
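
DAGs like the simplified lung cancer example above can also be written down and queried in software. The sketch below uses the dagitty R package (my choice of tool – the article itself does not rely on any particular package), with variable names that are assumptions for illustration.

Encoding the Simplified Lung Cancer DAG in R
library(dagitty) # package for creating and analysing DAGs
lung_dag <- dagitty('dag {
  Genetics -> LungCancerRisk
  EnvironmentalFactors -> LungCancerRisk
  Smoking -> LungCancerRisk
}')
parents(lung_dag, "LungCancerRisk") # Genetics, EnvironmentalFactors, Smoking
isAcyclic(lung_dag)                 # TRUE: no variable can cause itself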

An example of a DAG which might explain the correlation between the stork populations and birth rates of European countries could introduce extra variables, such as land area and human population size.

This DAG could account for the association between the ‘Stork Population’ and ‘Birth Rate’ variables without requiring there to be a causal relationship between them; their correlation can be explained by the fact that ‘Land Area’ is both a parent of ‘Stork Population’ and an ancestor of ‘Birth Rate’.
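
Here is one way such a stork DAG could be encoded (again only a sketch: the variables follow the description above, but the specific arrows are my assumption). Conditioning on land area d-separates the two variables, which is the graphical way of saying the association should disappear once land area is accounted for.

Encoding a Possible Stork DAG in R
library(dagitty)
stork_dag <- dagitty('dag {
  LandArea -> StorkPopulation
  LandArea -> HumanPopulation
  HumanPopulation -> BirthRate
}')
ancestors(stork_dag, "StorkPopulation") # includes LandArea
ancestors(stork_dag, "BirthRate")       # includes LandArea and HumanPopulation
dseparated(stork_dag, "StorkPopulation", "BirthRate", "LandArea") # TRUE: no association
                                                                  # expected once land area is fixed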

There are several structures that commonly appear in DAGs representing observational studies. In the rest of this article I want to focus on two of the most important – confounders and colliders – and explore how both of them can help to explain misleading results in medical research.

Confounders

When we are investigating the effect of a variable X on another variable Y using an observational study, one of the most common ways we can be misled is by confounding8. This occurs when a third variable has an effect on both X and Y, which can lead to them being correlated even if there is not in fact a causal relationship between them. For instance, we could say that land area is a confounder when we are investigating the relationship between stork populations and birth rates, because an increase in land area leads to an increase in both variables. In a DAG, we can define a confounder to be a variable on a path between X and Y which is an ancestor of both of them9.

In the most extreme cases, confounding can actually mean that the true effect of X on Y is reversed, in a phenomenon known as Simpson’s paradox10. Robert Heaton gives a helpful illustration of this in his blog post ‘Making peace with Simpson’s Paradox’. He asks us to imagine a hypothetical illness with a specific drug treatment that causes faster recovery at higher dosages. This would seem quite straightforward to prove by collecting data on the amount of treatment different patients were given and the time it took them to recover. Then, however, he complicates the scenario by introducing a third variable – age. If older people tend to recover more slowly, and also tend to require higher dosages of the treatment drug to recover at all, an observational study that does not control for age may actually end up showing a negative correlation between drug dosage and recovery speed.

Simulating and Plotting Data to Illustrate Robert Heaton’s Example of Simpson’s Paradox in R
set.seed(31415) # setting a seed so the simulated data is reproducible
group_1_dosages <- sample(1:20, 20, replace = TRUE) # simulating a range of dosages 
                                                    # taken by 20-39 year olds
group_2_dosages <- sample(21:40, 20, replace = TRUE) # 40–59 year old dosages
group_3_dosages <- sample(41:60, 20, replace = TRUE) # 60–79 year old dosages 
all_dosages <- c(group_1_dosages, # concatenating all the dosages
                 group_2_dosages, 
                 group_3_dosages) 
group_1_recovery_speeds <- 1.2 * group_1_dosages + 100 + rnorm(20,0,5) # simulating the recovery speeds of
                                                                       # 20–39 year olds according to a
                                                                       # linear relationship with dosage
                                                                       # with independent gaussian errors
group_2_recovery_speeds <- 1.2 * group_2_dosages + 50 + rnorm(20,0,5) # 40–59 year old recovery speeds
group_3_recovery_speeds <- 1.2 * group_3_dosages + rnorm(20,0,5) # 60–79 year old recovery speeds
all_recovery_speeds <- c(group_1_recovery_speeds, # concatenating all the recovery speeds
                         group_2_recovery_speeds, 
                         group_3_recovery_speeds)  
par(mar=c(bottom=2.4, left=3, top=2.4, right=0.9), family="Roboto Slab") # setting the plot options
plot(group_1_dosages, group_1_recovery_speeds, # plotting the simulated data for 20-39 year olds
     xlim=c(-3,72), ylim=c(36,120), 
     pch=16, cex=2.4,  col="#009ed8",
     bty="n",  axes = FALSE,) 
points(group_2_dosages, group_2_recovery_speeds, # plotting the simulated data for 40-59 year olds
       pch=17, cex=2.4,  col="#009ed8")
points(group_3_dosages, group_3_recovery_speeds, # plotting the simulated data for 60-79 year olds
       pch=18, cex=2.4,  col="#009ed8")
arrows(x0=0, y0=39, x1=72, y1=39, length=0.15, code=2, lwd=3, col="#808080") # x-axis
arrows(x0=0, y0=39, x1=0, y1=120, length=0.15, code=2, lwd=3, col="#808080") # y-axis
mtext("Drug Dosage", side=1, line=1, cex=1.8) # x-axis label
mtext("Recovery Speed", side=2, line=0, cex=1.8) # y-axis label
abline(lm(group_1_recovery_speeds ~ group_1_dosages), # plotting the line of best fit for 
                                                      # only the data from 20–39 year olds
       col="#006a90", lwd=4.8, lty=2) 
abline(lm(group_2_recovery_speeds ~ group_2_dosages), # plotting the line of best fit for 
                                                      # only the data from 40–59 year olds
       col="#006a90", lwd=4.8, lty=2) 
abline(lm(group_3_recovery_speeds ~ group_3_dosages), # plotting the line of best fit for 
                                                      # only the data from 60–79 year olds
       col="#006a90", lwd=4.8, lty=2) 
abline(lm(all_recovery_speeds ~ all_dosages), # plotting the line of best fit for all the data
       col="#ffc000", lwd=4.8, lty=2)
legend(x=54, y=120, # adding a key for the data points
       legend=c("20–39 Year Olds ", "40–59 Year Olds ", "60–79 Year Olds "), 
       pch=c(16, 17, 18), col="#009ed8", cex=1.5)
legend(x=3, y=54, # adding a key for the lines of best fit
       legend=c("Overall Line of Best Fit ", "Age Group Lines of Best Fit "), lty=2,
       col=c("#ffc000","#006a90"), lwd=4.2, cex=1.5)

In the simulated data plotted above, a higher dosage of the treatment drug corresponds with a faster recovery speed within each age category. However, when we combine data from all the age categories we find a negative correlation between drug dosage and recovery speed.

Finding the Correlation Coefficient in R
correlation <- cor(all_dosages, all_recovery_speeds) # calculating the pearson correlation coefficient
cat("Pearson Correlation Coefficient:", correlation)
Pearson Correlation Coefficient: -0.7453944

When we draw the DAG for this example, we can see that the variable ‘Age’ is acting as a confounder, obscuring the true causal relationship between the ‘Drug Dosage’ and ‘Recovery Speed’ variables.
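
As a quick numerical check on the simulated data above (a sketch of my own, not part of Heaton’s example), we can compare a model that ignores age with one that includes age group: once the confounder is in the model, the estimated dosage effect is positive again, close to the simulated slope of 1.2.

Adjusting for Age Group in R
age_group <- factor(rep(c("20-39", "40-59", "60-79"), each = 20)) # the age group of each simulated patient
naive_fit <- lm(all_recovery_speeds ~ all_dosages)                # ignoring age
adjusted_fit <- lm(all_recovery_speeds ~ all_dosages + age_group) # controlling for age group
coef(naive_fit)["all_dosages"]    # negative: dosage appears to slow recovery
coef(adjusted_fit)["all_dosages"] # roughly +1.2: the true positive effect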

Simpson’s paradox has been observed several times in the medical literature, perhaps most famously in a 1986 observational study by a group of urologists on the effectiveness of different types of procedure for the removal of kidney stones11. Wisely, the researchers recognised that stone size might be a confounding factor, so they split their data according to whether the original stone was less than two centimetres in diameter or not.

Observed Success Rate of Kidney Stone Removal Procedures

Stone Size   Open Surgery                               Percutaneous Nephrolithotomy
Small        ≈ 93% \(\left( \frac{81}{87} \right)\)     ≈ 87% \(\left( \frac{234}{270} \right)\)
Large        ≈ 73% \(\left( \frac{192}{263} \right)\)   ≈ 69% \(\left( \frac{55}{80} \right)\)
Both         ≈ 78% \(\left( \frac{273}{350} \right)\)   ≈ 83% \(\left( \frac{289}{350} \right)\)

The urologists found that open surgery had been more successful than percutaneous nephrolithotomy for both small and large stones. But if they had not stratified their data by stone size, they would have obtained a higher overall success rate for percutaneous nephrolithotomy than for open surgery. The reason for this was that patients with small stones were more likely to be treated with percutaneous nephrolithotomy than patients with larger stones, and these were also the patients most likely to have a good result regardless of the procedure they underwent. So stone size influenced both the choice of treatment and the treatment outcome, and thus acted as a confounder.
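
We can reproduce this reversal directly from the counts in the table above; the short sketch below is just the arithmetic, not the original authors’ analysis.

Reproducing the Kidney Stone Reversal in R
open_surgery <- list(small = c(success = 81, total = 87),   # counts taken from the table above
                     large = c(success = 192, total = 263))
pcnl         <- list(small = c(success = 234, total = 270),
                     large = c(success = 55, total = 80))
success_rate <- function(counts) unname(counts["success"] / counts["total"])
success_rate(open_surgery$small); success_rate(pcnl$small) # ~0.93 vs ~0.87: open surgery better
success_rate(open_surgery$large); success_rate(pcnl$large) # ~0.73 vs ~0.69: open surgery better
success_rate(open_surgery$small + open_surgery$large)      # ~0.78
success_rate(pcnl$small + pcnl$large)                      # ~0.83: pooled, the comparison reverses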

The size of a patient’s stone might have influenced their probability of having one treatment over another for any number of reasons. For instance, perhaps patients with smaller stones were more inclined to be concerned about the side effects of open surgery, or perhaps there was a longer waiting list for percutaneous nephrolithotomy that patients with larger stones tended to be less willing to tolerate. (These are just examples of possible explanations, not necessarily the true reasons that the size of kidney stones affected doctors’ and patients’ choice of treatment in the 1986 study.)

Colliders

Another issue that can arise in observational studies is collider bias. When we are investigating the effect of X on Y, a collider is a variable lying on a path between X and Y where two causal arrows meet, meaning the collider has more than one parent.

In observational medical studies, it is not uncommon that the variables we are investigating, X and Y, both have a causal effect on the probability that a patient is included in the study. In these cases, selection into the study is itself a collider variable.

To understand how collider bias can cause unrelated variables to appear to have a correlation, let’s imagine a hypothetical virus called ‘colliditis’. We’ll suppose that the severity of disease experienced by people who catch colliditis is unrelated to the number of cigarettes they smoke in a day. For simplicity, we can assume that the average number of cigarettes smoked a day and the severity of disease are uniformly distributed in the population of people who have caught colliditis. So if we plotted a random sample of one thousand people from the population of people with the viral infection, it would look like this.

Plotting the Relationship Between Disease Severity and Average Number of Cigarettes Smoked in a Day in a Simulated Sample of People Infected With Colliditis in R
set.seed(31415)
n <- 1000 
infected_sample <- matrix(0, nrow=n, ncol=2) # initialising an empty matrix to store the data for each person
for (i in 1:n) {
  severity <- runif(1, min=0, max=20) # assigning each person in the sample a random severity score between 0 and 20
  cigarettes <- runif(1,0,20) # assigning each person in the sample a random average number of cigarettes smoked per day between 0 and 20
  infected_sample[i,1] <- severity
  infected_sample[i,2] <- cigarettes
}
par(mar=c(bottom=4.2, left=5.1, top=1.8, right=0.9), family="Roboto Slab") 
plot(infected_sample[,1], infected_sample[,2], pch=4, cex=1.2, lwd=2.4, col="#808080",
     xlab="Colliditis Severity Score", ylab="Average Number of Cigarettes Smoked in a Day",
     cex.lab=1.5, cex.axis=1.5)
abline(lm(infected_sample[,2] ~ infected_sample[,1]), lwd=4.8, col="#009ed8") # plotting the line of best fit (cigarettes against severity)

Now suppose that out of this sample, 90% of the people with a colliditis severity score ≥15 are in hospital, and of those with a colliditis severity score <15, those who smoke ≥10 cigarettes a day on average have a 60% chance of being in hospital, compared to a 10% chance for those who smoke <10 cigarettes a day on average. If we plot cigarettes vs colliditis severity for only those in our sample who are hospitalised, we see an interesting result.

Plotting the Relationship Between Disease Severity and Average Number of Cigarettes Smoked in a Day in a Simulated Sample of Hospitalised People Infected With Colliditis in R
set.seed(31415)
hospital_infected_sample <- matrix(rep(NA, 2*n), nrow=n, ncol=2) 
for (i in 1:n) {
  dice_roll <- runif(1, min=0, max=1) # generating random number between 0 and 1
  person_i <- infected_sample[i,]
  if(person_i[1] >= 15 && dice_roll <= 0.9) { # these people have a 90% chance of being in hospital
    hospital_infected_sample[i,] <- person_i
  } else if(person_i[2] >= 10 && dice_roll <= 0.6) { # these people have a 60% chance of being in hospital
    hospital_infected_sample[i,] <- person_i
  } else if(person_i[2] < 10 && dice_roll <= 0.1) { # others have a 10% chance of being in hospital
    hospital_infected_sample[i,] <- person_i
  }
}
# Now we remove the people not in hospital from our sample:
hospital_infected_sample <- na.omit(hospital_infected_sample)
# And plot the result:
par(mar=c(bottom=4.2, left=5.1, top=1.8, right=0.9), family="Roboto Slab") 
plot(hospital_infected_sample[,1], hospital_infected_sample[,2], pch=4, cex=1.2, lwd=2.4, col="#808080",
     xlab="Colliditis Severity Score", ylab="Average Number of Cigarettes Smoked in a Day",
     cex.lab=1.5, cex.axis=1.5)
abline(lm(hospital_infected_sample[,2] ~ hospital_infected_sample[,1]), lwd=4.8, col="#009ed8") # plotting the line of best fit (cigarettes against severity)

If we took the hospitalised population as representative of the general population, we would find a negative relationship between the average number of cigarettes smoked in a day and the severity of colliditis. But we would be failing to account for the fact that having a severe case of colliditis and smoking a large number of cigarettes per day are both factors that make you more likely to be in hospital. If you are in hospital with a low colliditis score, there must be some reason for your hospitalisation other than colliditis – and that reason may well be smoking – so you are more likely to be a heavy smoker than someone hospitalised with a high colliditis score.
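
As a quick check on the simulated samples above (again just a sketch), we can compare the correlation between severity and smoking in the full infected sample with the correlation among only the hospitalised.

Comparing the Correlations in the Two Simulated Samples in R
cor(infected_sample[,1], infected_sample[,2])                   # close to 0: no relationship in the full sample
cor(hospital_infected_sample[,1], hospital_infected_sample[,2]) # clearly negative: induced by selecting on hospitalisation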

The DAG for this example would show that ‘Hospitalisation’ is a collider, and that there is no true causal relationship between the variables ‘Smoking’ and ‘Severity’.

For an example of collider bias in the medical literature, let’s turn to a case-control study of hospital patients with and without pancreatic cancer published in 198112, which found a strong association between patients’ coffee consumption and their likelihood of having developed pancreatic cancer.

Observed Coffee Consumption in Patients With and Without Pancreatic Cancer
(percentage of patients in each diagnosis group drinking a given number of cups per day)

Diagnosis               0 Cups Per Day                            1 or 2 Cups Per Day                         3+ Cups Per Day
Pancreatic Cancer       ≈ 5% \(\left( \frac{20}{367} \right)\)    ≈ 42% \(\left( \frac{153}{367} \right)\)    ≈ 53% \(\left( \frac{194}{367} \right)\)
Not Pancreatic Cancer   ≈ 14% \(\left( \frac{88}{643} \right)\)   ≈ 42% \(\left( \frac{271}{643} \right)\)    ≈ 44% \(\left( \frac{284}{643} \right)\)

In the decades since this study was published, several large prospective cohort studies have found no significant relationship between coffee consumption and pancreatic cancer risk13 14 15, suggesting that the original study may have had a flaw in its design. What could it have been?

In his seminal epidemiology textbook, Leon Gordis points out that the control group for this study – patients without pancreatic cancer – was chosen in an unusual way16. When a patient who did have pancreatic cancer agreed to answer a survey about their dietary habits, control patients were selected by asking the doctor responsible for their care to identify other patients of theirs without the condition. Since these doctors were often gastroenterologists, patients with gastrointestinal disorders were over-represented in the control group. These patients tended to drink less coffee than the average member of the population, either because it exacerbated their symptoms or because they had been advised to cut down their intake by their doctors.

As a result of the atypical coffee consumption in the control group, the results of this observational study are hard to interpret: even if there were no causal relationship between coffee consumption and pancreatic cancer, we would still expect to see an association between them, because people with GI disorders were more likely to be chosen as controls than other patients. We can show this in a DAG in which ‘Selection into Study’ is a collider.
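
One way to write such a DAG down is sketched below using dagitty, with variable names that are my assumptions (having a gastrointestinal disorder is included as the reason the controls drank less coffee). With nothing conditioned on, coffee consumption and pancreatic cancer are d-separated, but conditioning on selection into the study – which is exactly what a case-control study of hospital patients does – opens a path between them.

Encoding Selection Into the Study as a Collider in R
library(dagitty)
coffee_dag <- dagitty('dag {
  PancreaticCancer -> SelectionIntoStudy
  GIDisorder -> SelectionIntoStudy
  GIDisorder -> CoffeeConsumption
}')
parents(coffee_dag, "SelectionIntoStudy") # PancreaticCancer, GIDisorder: two parents, so a collider
dconnected(coffee_dag, "CoffeeConsumption", "PancreaticCancer")                       # FALSE: no open path,
                                                                                      # so no association expected
dconnected(coffee_dag, "CoffeeConsumption", "PancreaticCancer", "SelectionIntoStudy") # TRUE: conditioning on the
                                                                                      # collider induces an association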


References

  1. Jonathan Wood, “Life of a revolutionary,” OxSciBlog, November 11, 2009.↩︎

  2. “Study designs,” Centre for Evidence-Based Medicine, accessed February 26, 2025.↩︎

  3. Robert Matthews, “Storks Deliver Babies (p=0.008),” Teaching Statistics 22, no. 2 (June 2000): 36–38.↩︎

  4. Martin Bland, An Introduction to Medical Statistics, 4th ed. (Oxford University Press, 2015), 117.↩︎

  5. Judea Pearl and Dana Mackenzie, The Book of Why: The New Science of Cause and Effect (Penguin Books, 2019), 1.↩︎

  6. Sander Greenland, Judea Pearl, and James Robins, “Causal Diagrams for Epidemiologic Research,” Epidemiology 10, no. 1 (January 1999): 37–48.↩︎

  7. Judea Pearl, Madelyn Glymour, and Nicholas Jewell, Causal Inference in Statistics: A Primer (Wiley, 2016), 25.↩︎

  8. Jeffrey Aronson, Clare Bankhead, and David Nunan, “Confounding,” Catalogue of Bias, accessed February 26, 2025.↩︎

  9. Ricardo Silva, “Causality,” (lecture, Advanced Tutorial Lecture Series on Machine Learning, Cambridge, November 9, 2006).↩︎

  10. Steven Julious and Mark Mullee, “Confounding and Simpson’s Paradox,” BMJ 309, no. 6968 (December 1994): 1480–1481.↩︎

  11. Clive Charig et al., “Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy,” British Medical Journal (Clinical Research Edition) 292, no. 6524 (March 1986): 879–882.↩︎

  12. Brian MacMahon et al., “Coffee and cancer of the pancreas,” The New England Journal of Medicine 304, no. 11 (March 1981): 630–633.↩︎

  13. Simak Bidel et al., “Coffee consumption and risk of gastric and pancreatic cancer – A prospective cohort study,” International Journal of Cancer 312, no. 7 (April 2012): 1651–1659.↩︎

  14. Kristin Guertin et al., “A prospective study of coffee intake and pancreatic cancer: results from the NIH-AARP Diet and Health Study,” British Journal of Cancer 113, no. 7 (September 2015): 1081–1085.↩︎

  15. Charlie Zhou et al., “Coffee and pancreatic cancer risk among never-smokers in the UK prospective Million Women Study,” International Journal of Cancer 145, no. 6 (September 2019): 1484–1492.↩︎

  16. Leon Gordis, Epidemiology, 4th ed. (Saunders, 2009), 183–185.↩︎