“Wine before pregnancy ‘changes baby’s face’”

Welcome to another entry in our new Panic Headlines series. Today we’re going to tackle prenatal alcohol usage and child face shape, and induced labor and IQ.

A few weeks ago we had this headline in The Telegraph. The first sentence of the article does not disappoint: “Drinking just one small glass of wine a week in the three months before pregnancy may alter the face of your child, a study suggests.” The author follows up with this: “Scientists caution that the face acts as a ‘health mirror’ and the findings could indicate some deeper health issues…”

The study on which this is based appeared in the journal Human Reproduction. In short, the authors use data from a study called the Generation R Study, run in the Netherlands. In it, about 10,000 women are asked many pregnancy questions — including about alcohol consumption before and during pregnancy — and there are a lot of measurements taken of their children. For the purposes of this particular study, they use 3-D face images of children at ages 9 and 13.

The authors generate 200 face measurements (“traits”). They then analyze the relationship between these traits and prenatal alcohol exposure, including both drinking during pregnancy and before pregnancy. They find that, at age 9, several traits are significantly associated with prenatal alcohol exposure, including with pre-pregnancy exposure.

To understand what I see as the key problem with this paper, it is important to understand what we mean by “significant.” You’ll often see papers talk about a result being “significant at the 5% level.” Colloquially, people think of this as meaning the result is real or correct — this is how you know the result is to be believed.

However, “significant” has a formal meaning. A p-value of 5% means that if the true effect were zero, you would expect to see an effect at least this large only 5% of the time. Put differently: imagine a setting in which a treatment had no effect on the outcome. If you analyzed the relationship in 100 different samples, you’d expect to get a significant relationship in about 5 of them, just by chance. This is because of how sampling works.
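To make the idea concrete, here is a minimal simulation sketch. It is not from the paper; the group size, the number of simulated studies, and the function names are all assumptions chosen for illustration. Each simulated study compares two groups drawn from the same distribution, so the true effect is zero, and we count how often the comparison still comes out significant at the 5% level.

```python
import random
from statistics import NormalDist, mean


def null_experiment(rng, n_per_group=50):
    """Simulate one study in which the treatment truly does nothing.

    Both groups are drawn from the same normal distribution (mean 0, SD 1).
    Returns the two-sided p-value from a z-test with known variance.
    """
    treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    diff = mean(treated) - mean(control)
    se = (2.0 / n_per_group) ** 0.5  # standard error of the difference in means
    z = diff / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_studies=2000, alpha=0.05, seed=0):
    """Share of no-effect studies that are 'significant' at level alpha."""
    rng = random.Random(seed)
    hits = sum(null_experiment(rng) < alpha for _ in range(n_studies))
    return hits / n_studies


# The share should land close to alpha, even though nothing real is going on.
print(false_positive_rate())
```

The exact share moves around a little with the random seed, but it should hover near 0.05. That is what “significant at the 5% level” promises when there is no true effect.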

This is relevant here. Imagine that, in reality, there were no differences across groups in their facial features. Even so, if we study 200 different features, we’d expect about 10 of them (5% of 200) to show a relationship that is significant at the 5% level. Just by chance.
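The arithmetic for 200 traits can be checked with a short sketch. It assumes the 200 tests are independent, which facial measurements almost certainly are not, so treat the numbers as a rough guide. The function names are mine, not the paper’s.

```python
from math import comb


def expected_false_positives(n_tests, alpha):
    """Expected number of 'significant' results when every true effect is zero."""
    return n_tests * alpha


def prob_at_least(k, n_tests, alpha):
    """P(X >= k) for X ~ Binomial(n_tests, alpha), assuming independent tests."""
    below_k = sum(
        comb(n_tests, i) * alpha**i * (1.0 - alpha) ** (n_tests - i)
        for i in range(k)
    )
    return 1.0 - below_k


n_traits, alpha = 200, 0.05
print(expected_false_positives(n_traits, alpha))  # 200 * 0.05
print(prob_at_least(1, n_traits, alpha))          # chance of at least one false positive
print(prob_at_least(10, n_traits, alpha))         # chance of ten or more false positives
```

Under these assumptions, at least one spurious hit is close to certain, and ten or more happens roughly half the time.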

The authors know this, of course, and they run a standard adjustment for multiple hypothesis testing. But such adjustments are somewhat ad hoc, and there remains the real concern that when you look at so many different outcomes, across multiple age groups, you’re bound to find something.
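For readers curious what such an adjustment looks like, here is a sketch of two common ones: Bonferroni and Benjamini-Hochberg. I am not claiming either is the method the authors used, and the p-values below are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag a test only if its p-value clears alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate).

    Sort the p-values, find the largest rank k with p_(k) <= (k / m) * alpha,
    and flag every test ranked at or below k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values, invented for this example (not taken from the study).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print(sum(p < 0.05 for p in p_values))    # unadjusted: 5 look "significant"
print(sum(bonferroni(p_values)))          # Bonferroni keeps 1
print(sum(benjamini_hochberg(p_values)))  # Benjamini-Hochberg keeps 2
```

Both procedures raise the bar, but they raise it by different amounts and can flag different sets of results. That dependence on which adjustment you choose is part of why the correction alone does not fully settle the worry.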

Given this concern, it’s important to look for evidence of consistency in the results. If we were consistently seeing one facial feature jump out, across age groups and variations in the analysis, this would provide added confidence.

In fact, the results are all over the place. A few examples…

They run one analysis that relies on a binary measure of alcohol exposure and another that relies on a continuous measure of exposure. The two analyses pick up different facial features, making it look like drinking at all affects one thing, but drinking more affects something else (and not the first thing).
The effects only show up at age 9, not at age 13.
The data includes multiple ethnic groups. When they limit to only those of Dutch nationality, the effects are less significant and different facial features show up.
There is a set of facial traits that are associated with exposure before pregnancy. However, some of these do not show up for babies who are also exposed during pregnancy. For this to hang together, it must be that the exposure during pregnancy somehow cancels out the exposure before.

The results are further undermined by the lack of any theory behind them. There are reasons to think that heavy alcohol exposure during pregnancy could change face shape, but it is unclear by what biological mechanism exposure before pregnancy would matter.

In the end, this paper feels like a data-mining exercise. There is nothing consistent to hang our hat on. It’s scare-mongering by way of a version of p-hacking. Bad!