Why I Look at Data Differently

A lesson on residual confounding

Emily Oster

11 min Read

A question I get frequently: Why does my analysis often disagree with groups like the American Academy of Pediatrics, other national bodies, public health experts, or Andrew Huberman (lately I get that last one a lot)? The particular context is usually an observational study of a topic in nutrition or development.

The recent analysis of processed food and cancer is emblematic of many of these cases. In that post, I argued that the relationship observed in the data was extremely likely to reflect correlation and not causation. My argument rested on the observation that people who ate differently also differed on many other features.

In response, a reader wrote in this question:

You emphasize causation vs. correlation, and I think you are pointing to potential confounders that could actually be the root cause of the findings. My question is — can’t and don’t study researchers control for that in their analysis? Can’t they look at the link between screen time and academic success while keeping potential confounders equal across the comparison groups? And if so, wouldn’t that help rule out the impact of other factors and strengthen the case that there is a true link?

This is a very good question, and it clarifies for me where many of the disagreements lie.

The questioner essentially notes: the reason we know that the processed food groups differ a lot is that the authors can see the characteristics of individuals. But because they see these characteristics, they can adjust for them (using statistical tools). While it’s true that education levels are higher among those who eat less processed food, by adjusting for education we can come closer to comparing people with the same education level who eat different kinds of food.

However, in typical data you cannot observe and adjust for all differences. You do not see everything about people. Sometimes this is simply because our variables are coarse: we see whether someone's family income is above or below the poverty line, but nothing more detailed, and those details are important. There are also characteristics we almost never capture in data, like "How much do you like exercise?" or "How healthy are your partner's behaviors?" or even "Where is the closest farmers' market?"

For both of these reasons, in nearly all examples, we worry about residual confounding. That’s the concern that there are still other important differences across groups that might drive the results. Most papers list this possibility in their “limitations” section.

We all agree that this is a concern. Where we differ is in how much of a limitation we believe it to be. In my view, in these contexts (and in many others), residual confounding is so significant a factor that it is hopeless to try to learn causality from this type of observational data. 

This position drives a lot of my concerns with existing research. Thinking about these issues is a huge part of my research and teaching. So I thought I’d spend a little time today explaining why I hold this position. I’m going to start with theory and then discuss two pieces of evidence.

A quick note: this post focuses on concerns about approaches that take non-randomized data and argue for causality based on including observed controls. There are other approaches to non-randomized data (e.g. difference-in-differences, event studies) that support stronger causality claims. See some discussion of those in this older post.

Theory

Conceptually, the gold standard for causality is a randomized controlled trial. In the canonical version of such a trial, researchers randomly allocate half of their participants to treatment and half to control. They then follow them over time and compare outcomes. The key is that because you randomly choose who is in the treatment group, you expect them, on average, to be the same as the control other than the presence of the treatment. So you can get a causal effect of treatment by comparing the groups.
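If you like to see the logic in code, here is a tiny simulation sketch (all of the numbers and variable names are invented for illustration, not taken from any real trial). The point is that randomization balances even a trait we never measure, so a simple difference in means recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# An unobserved trait (say, general health-consciousness) that affects the outcome
health_consciousness = rng.normal(size=n)

# Randomly assign exactly half of participants to treatment
treated = rng.permutation(n) < n // 2

# Outcome depends on the unobserved trait plus a true treatment effect of 2.0
outcome = 2.0 * treated + 3.0 * health_consciousness + rng.normal(size=n)

# Randomization balances the unobserved trait across the two groups...
print(health_consciousness[treated].mean(), health_consciousness[~treated].mean())
# ...so a simple difference in means recovers the causal effect (about 2.0)
print(outcome[treated].mean() - outcome[~treated].mean())
```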

Randomized trials are great but not always possible. A lot of what is done in public health and economics aims to estimate causal effects without randomized trials. The key to doing this is to isolate a source of randomness in some treatment, even if that randomization is not explicit.

For example: Imagine that you’re interested in the effect of going to a selective high school on college enrollment. One simple thing to do would be to compare the students who went to the selective high school with those who did not. But this would be tricky, because there are so many other differences across the students.

Now imagine that the way that admission to the high school works is based on a test score: if you get a score above some cutoff, you get in, and if you are below, you do not. With that kind of mechanism, we can get closer to causality. Let’s say the cutoff score is 150. You’ve got some students who scored 149 and some who scored 150. The second group gets in, the first doesn’t. But their scores are really similar. It may be reasonable to claim that it is effectively random whether you got 149 or 150 — the difference is so small, it could happen by chance. In that case, you can try to figure out the causal effect of the selective high school by comparing the students just above the cutoff with those just below.

This particular technique is called regression discontinuity; it's part of a suite of approaches to estimating causal effects that take advantage of these moments of randomness in the world. The moments do not need to be truly random, but they do need to drive the treatment without independently driving the outcome you are interested in.
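Here is a similarly stylized sketch of that test-score example (again, every number is made up). Comparing all admitted students with all rejected students mixes the school's effect with underlying ability; comparing only students within a couple of points of the 150 cutoff gets close to the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Admissions test scores; students scoring 150 or above get into the selective school
score = rng.normal(150, 15, size=n)
admitted = score >= 150

# College enrollment rises with underlying ability (proxied by the score) and
# gets a true boost of 0.10 from attending the selective school
enroll_prob = np.clip(0.3 + 0.004 * (score - 150) + 0.10 * admitted, 0, 1)
enrolled = rng.random(n) < enroll_prob

# Naive comparison of all admitted vs. all rejected students mixes the school
# effect with the ability difference, so it overstates the effect
print(enrolled[admitted].mean() - enrolled[~admitted].mean())

# RD-style comparison: students within 2 points of the cutoff are nearly identical,
# so the gap between them approximates the causal effect (about 0.10)
near = np.abs(score - 150) < 2
print(enrolled[near & admitted].mean() - enrolled[near & ~admitted].mean())
```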

We can take this lens to the kind of observational data that we often consider. Let’s return to the processed food and cancer example. The approach in that paper was to compare people who ate a lot of processed food with those who ate less. Clearly, in raw terms, this would be unacceptable because there are huge differences across those groups. The authors argue, though, that once they control for those differences, they have mostly addressed this issue.

This argument comes down to: once I control for the variables I see, the choice about processed food is effectively random, or at least unrelated to other aspects of health.

I find this fundamentally implausible. Take two people who have the same level of income, the same education, and the same preexisting conditions, and one of them eats a lot of processed food and the other eats a lot of whole grains and fresh vegetables. I contend that those people are still different. That their choice of food isn't effectively random: it's related to other things about them, things we cannot see. Adding more and more controls doesn't necessarily make this problem better. You're isolating smaller and smaller groups, but still you have to ask why people are making different food choices.
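A small simulation makes the worry concrete. Suppose (purely for illustration, with invented coefficients) that an observed variable like education and an unobserved trait both push people away from processed food and toward better health, and that processed food itself has no causal effect at all. Controlling for education shrinks the estimated "effect," but it does not eliminate it, because the unobserved trait is still doing its work.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# One observed control (education) and one unobserved trait (health-consciousness)
education = rng.normal(size=n)
unobserved = rng.normal(size=n)

# Both push people away from processed food
processed_food = -0.5 * education - 0.5 * unobserved + rng.normal(size=n)

# The true causal effect of processed food on the outcome is zero; the outcome
# is driven entirely by education and the unobserved trait
outcome = 1.0 * education + 1.0 * unobserved + rng.normal(size=n)

def coef(y, xs):
    """Coefficient on the first regressor from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(coef(outcome, [processed_food]))              # raw: clearly negative, purely confounded
print(coef(outcome, [processed_food, education]))   # controlled: smaller, but still negative
```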

Food is a huge part of our lives, and our choices about it are not especially random. Sure, it may be random whether I have a sandwich or a salad for lunch today, but whether I’m eating a bag of Cheetos or a tomato and avocado on whole-grain toast — that is simply not random and not unrelated to other health choices.

This is where, perhaps, I conceptually differ from others. I have to imagine that researchers doing this work do not hold this view. It must be that they think that once we adjust for the observed controls, the differences across people are random, or at least are unrelated to other elements of their health.

This is a theoretical disagreement. But there are at least two things in data that have really reinforced my view — one from my own research and one example from my books.

Selection on observables: Vitamins

Underlying the issue of correlation versus causation are human choices. This is especially true in nutrition. The reason it is hard to learn about causality is that different people make different choices. One of the possible reasons for those different choices is different information, or different processing of information.

A few years ago, I got curious about the role of information — of news — in driving these choices, and I wrote a paper that looked at what happened to health behaviors after changes in health information. I wrote at more length about that paper here, but the basic idea was to analyze who adopts new health behaviors when news comes out suggesting those behaviors are good.

The main application is vitamin E. In the early 1990s, a study came out suggesting vitamin E supplements improved health. What happened as a result was that more people took vitamin E. But not just any people. The new adopters were more educated, richer, more likely to exercise, less likely to smoke, more likely to eat vegetables. In turn, over time, as these people started taking the vitamin, vitamin E started to look even better for health.

Over a period of about a decade, vitamin E went from being only mildly associated with lower mortality to being strongly associated with lower mortality. This is not because the impacts of the vitamin changed! It was because the people who took the vitamin changed. And, importantly, these patterns persisted even when I put in controls.
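Here is a stylized version of that dynamic (made-up numbers, not the estimates from my paper): if the vitamin does nothing at all, but the people who take it shift from a roughly random slice of the population to a disproportionately health-conscious one, the vitamin suddenly "looks" protective.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Unobserved healthiness drives mortality; the vitamin itself does nothing here
healthiness = rng.normal(size=n)
mortality_risk = 0.10 - 0.03 * healthiness

def apparent_benefit(takers):
    """Mortality gap: non-takers minus takers (positive = vitamin 'looks' protective)."""
    return mortality_risk[~takers].mean() - mortality_risk[takers].mean()

# Before the news: takers are roughly a random 20% of people
before = rng.random(n) < 0.20
# After the news: the most health-conscious people are much more likely to start
after = rng.random(n) < 0.20 + 0.30 * (healthiness > 1)

print(apparent_benefit(before))   # roughly zero
print(apparent_benefit(after))    # clearly positive, even though the vitamin does nothing
```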

What this says to me is that these biases in our conclusions — and I saw this in vitamins, but also in sugar and fat — are malleable based on the information out there in the world. Once you acknowledge that what is going on here is people are reading news and reacting to it in different ways, it is hard to believe that the limited observable characteristics we can control for are enough.

Evolving coefficients: Breastfeeding

The second important data point for me is looking carefully at what happens in many of these situations when we introduce more and better controls.

The link between breastfeeding and IQ is a good example. This is a research space where you can find many, many papers showing a positive correlation. The concern, of course, is that moms who breastfeed tend to be more educated, have higher income, and have access to more resources. These variables are also known to be linked to IQ, so it’s difficult to isolate the impacts of breastfeeding.

What these papers typically do is control for some observable differences. And, like the discussion above, we might think, “Well, isn’t that enough? If we can see these detailed demographics, isn’t that going to address the problem?”

The paper I like best for illustrating that, no, this does not address the problem is one that used data which, among other things, included sibling pairs. The authors of this paper do four analyses of the relationship between breastfeeding and IQ:

  1. Raw correlation — no adjustment for anything
  2. Regression adjusting for standard demographics (parental education, etc.)
  3. Regression adjusting for standard demographics plus adjusting for mom IQ score
  4. Within-sibling analysis: compare two siblings, one of whom was breastfed and one of whom was not

The graph below shows their results. When they just compare groups — without adjusting for any other differences — there is a large difference in IQ between breastfed and non-breastfed children. When they add in some demographic adjustments, this difference falls but is still statistically significant. This is where most papers stop. But as these authors add their additional controls, eventually they get to an effect of zero. Comparing across siblings, there is no difference at all.
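To see how that pattern can arise even when breastfeeding has no effect at all, here is a toy simulation (invented numbers, not the paper's data) in which family-level factors drive both breastfeeding and IQ. Each added control shrinks the estimated gap, and the within-sibling comparison takes it to roughly zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000  # number of families, each with two children

# Family-level traits: observed demographics, mom's IQ, and everything unobserved
demographics = rng.normal(size=n)
mom_iq = rng.normal(size=n)
unobserved = rng.normal(size=n)

def simulate_child():
    # Breastfeeding is more common in advantaged families; its true effect on IQ is zero
    breastfed = rng.random(n) < 1 / (1 + np.exp(-(demographics + mom_iq + unobserved)))
    iq = 100 + 3 * demographics + 3 * mom_iq + 3 * unobserved + rng.normal(0, 5, n)
    return breastfed, iq

bf1, iq1 = simulate_child()
bf2, iq2 = simulate_child()

def coef(y, xs):
    """Coefficient on the first regressor from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(coef(iq1, [bf1]))                           # 1. raw gap: large
print(coef(iq1, [bf1, demographics]))             # 2. + demographics: smaller
print(coef(iq1, [bf1, demographics, mom_iq]))     # 3. + mom IQ: smaller still
# 4. within-sibling: among families where exactly one child was breastfed,
#    compare the breastfed sibling to the other one; the gap is roughly zero
disc = bf1 != bf2
print(np.where(bf1[disc], iq1[disc] - iq2[disc], iq2[disc] - iq1[disc]).mean())
```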

The point of this discussion is not to get in the weeds on breastfeeding (you can read my whole chapter from Cribsheet about it). This is an illustrative example of a general issue: the control sets we typically consider are incomplete. There are a lot of papers that report effectively only the first two bars in the graph above. But those simple observable controls are just not sufficient. The residual confounding is real and it is significant.

(If you want another example, you can look back to the very similar kind of graph in Panic Headlines from last week. This problem is everywhere.)

Conclusions

The question of whether a controlled effect in observational data is “causal” is inherently unanswerable. We are worried about differences between people that we cannot observe in the data. We can’t see them, so we must speculate about whether they are there. Based on a couple of decades of working intensely on these questions in both my research and my popular writing, I think they are almost always there. I think they are almost always important, and that a huge share of the correlations we see in observational data are not close to causal.

There are two final notes on this.

First: A common approach in these papers is to hedge in the conclusion by saying, “Well, it might not be causal.” I find this hedge problematic. If the relationship between processed food and cancer isn’t causal, why do we care about it? The obvious interpretation of this result is that you should stop eating processed foods. But if the result isn’t causal, that interpretation is wrong. This hedge is a cop-out. And this approach — to bury the hedge in the conclusion — encourages the poorly informed and inflammatory media coverage that often follows.

Second: I recognize that other people may disagree and find these relationships more compelling. I believe we can have productive conversations about that. To my mind, though, these conversations need to be grounded in the theory I started with. That is, if you want to argue that there is a causal relationship between processed food and cancer, you need to be willing to make a case that you’re approximating a randomized trial with your analysis. If we focus our discussion on that claim, it will discipline our disagreements.

And last: Thank you for indulging my love of econometrics today. My dad may be the only person who gets this far in the newsletter, but even so, it was worth it. Back with more parenting content on Thursday.
