Baby name data

The website of the US Social Security Administration (SSA) provides baby name data, but it can sometimes be unstable or hard to reach. In the GitHub repository of this script you can find a file with the latest data (BabyData2019.csv.gz), which so far includes baby name statistics from 1880 to 2019. If you want to know more about how to access and process this data, or want the code to reproduce this analysis, check the Rmd version of this handout on GitHub. First, let’s look at how much data per year we get from the dataset:

There are several million babies per year. The speed-up around the late 1930s is due to a change in how the US SSA recorded data. Before 1937, people registered themselves as adults, and only from the 1940s did it become widespread to register babies at birth. So it is better not to read too much into trends that predate that decade.
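As a rough illustration of how such yearly totals can be computed, here is a minimal Python sketch. It is not the code from the Rmd version of this handout, and the column names (`year`, `name`, `sex`, `count`) are assumptions about the layout of BabyData2019.csv.gz. A tiny made-up sample stands in for the real file.

```python
import csv
import gzip
import io
from collections import Counter


def babies_per_year(csv_gz_bytes):
    """Sum the `count` column per `year` in a gzip-compressed CSV.

    The column names are assumed, not taken from the real file.
    """
    totals = Counter()
    with gzip.open(io.BytesIO(csv_gz_bytes), mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            totals[int(row["year"])] += int(row["count"])
    return dict(sorted(totals.items()))


# Made-up rows standing in for the real dataset.
sample = "year,name,sex,count\n1990,Ann,F,10\n1990,Bob,M,7\n1991,Ann,F,12\n"
sample_gz = gzip.compress(sample.encode("utf-8"))

yearly_totals = babies_per_year(sample_gz)
print(yearly_totals)
```

With the real file, you would read its bytes from disk instead of compressing a sample string, and adjust the column names to whatever the file actually uses.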

Let’s look at the trend of two names as an example. Here we plot the number of babies called “Angelina” and “Leonardo” (regardless of gender):
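Counting a name "regardless of gender" just means summing the rows for both sexes within each year. The following Python sketch shows the idea on made-up numbers. The record layout `(year, name, sex, count)` and the helper `name_trend` are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def name_trend(records, name):
    """Return babies per year with the given name, summed over both sexes."""
    per_year = defaultdict(int)
    for year, given_name, _sex, count in records:
        if given_name == name:
            per_year[year] += count
    return dict(sorted(per_year.items()))


# Made-up counts, not real SSA figures.
records = [
    (2000, "Angelina", "F", 30),
    (2000, "Angelina", "M", 1),
    (2000, "Leonardo", "M", 12),
    (2001, "Angelina", "F", 45),
    (2001, "Leonardo", "M", 15),
]

angelina = name_trend(records, "Angelina")
leonardo = name_trend(records, "Leonardo")
print(angelina, leonardo)
```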

You can see the peak of the fashion of naming your daughter after a Hollywood actress, and that the popularity of the name is now pretty much back to where it was before it became fashionable. The case of Leonardo is a bit different: a slower increase that has, so far, lasted longer. The Simmel effect predicts that nothing stays fashionable forever, which in this case would mean that sooner or later we will see a decay in popularity like the one you see for Angelina.

The QWERTY effect in baby names

The QWERTY effect is a hypothesis in psychology that postulates that words written with more right-hand letters of the keyboard are, on average, more positive than words written with more left-hand letters. Kyle Jasmin and Daniel Casasanto found this effect for the first time when comparing how words are typed with how they are scored on a scale from positive to negative. They got further results in English, Portuguese, and German. The effect appears in both left-handed and right-handed people, and even in pseudowords (words that look like they could mean something but are meaningless). I also found some evidence of the effect in the way people give likes to online content and review products, but figuring out the mechanism behind the effect is still an open research question.

One of the most surprising manifestations of the QWERTY effect is baby names. If we try to give “nice” names to our babies, in theory there should be a trend to give more right-handed names to babies since keyboards became popular when computers penetrated society in the 1990s. Here, we reproduce previous evidence that looked at these trends since the 1960s, with a slightly different calculation. We average the number of right-handed letters minus the number of left-handed letters of all baby names in a year and plot the resulting trend:
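The calculation described above can be sketched in Python as follows. The split of letters into left-hand and right-hand keys follows the standard QWERTY touch-typing layout. Weighting each name by its number of babies is an assumption of this sketch, since the text does not say whether the average is taken over names or over babies. The function names and the sample counts are made up.

```python
from collections import defaultdict

# Letters typed with each hand on a standard QWERTY keyboard.
LEFT_HAND = set("qwertasdfgzxcvb")
RIGHT_HAND = set("yuiophjklnm")


def right_minus_left(name):
    """Number of right-hand letters minus number of left-hand letters."""
    letters = name.lower()
    right = sum(ch in RIGHT_HAND for ch in letters)
    left = sum(ch in LEFT_HAND for ch in letters)
    return right - left


def yearly_average(records):
    """Average right-minus-left score per year, weighted by baby counts.

    records: iterable of (year, name, count). The weighting is an assumption.
    """
    weighted_sum = defaultdict(float)
    babies = defaultdict(int)
    for year, name, count in records:
        weighted_sum[year] += right_minus_left(name) * count
        babies[year] += count
    return {year: weighted_sum[year] / babies[year] for year in sorted(babies)}


# Made-up counts: "Liam" scores 3 - 1 = 2, "Sarah" scores 1 - 4 = -3.
trend = yearly_average([(2000, "Liam", 3), (2000, "Sarah", 1)])
print(trend)  # {2000: 0.75}
```

To average over distinct names instead of over babies, replace `count` with 1 in both accumulations.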

It looks like the original result by Casasanto et al. replicates, but we can see a difference with their analysis, which covered data only until 2012. Since the early 2010s, the trend seems to have stopped. Perhaps the QWERTY effect is getting weaker now that phones and tablets are replacing keyboards. While this result replicates, you cannot see the QWERTY effect when you correlate baby name popularity with the way the name is typed over the decades, as this paper has shown. So there might be a trend, but it is not strong enough to say that names with more right-hand letters are more popular than names with more left-hand letters.

Wacky baby name research

There are many papers using the SSA baby name database, some of them published in prestigious journals like PNAS and PRSB. There is a sarcastic journal called “Proceedings of the Natural Institute of Science” (PNIS) that made fun of this trend in a parody paper titled “We are entering an unprecedented age in baby name flux”. The cheekiest graph is Figure 2, where the authors show a scatter plot of the number of unique baby names for girls and for boys versus the yearly average US temperature, reaching the conclusion that “baby name diversity also seems to have risen with the increasing annual temperature of the US (i.e., climate change)”. Here we reproduce that analysis using the average US temperature anomaly from the US Environmental Protection Agency:

The lines show the results of linear regression for boys and girls separately; check our linear regression tutorial to learn more about it. We find the same result as the PNIS article: a positive correlation between the number of unique baby names in a year and the average US temperature, even though we measure temperature as an anomaly rather than in raw Fahrenheit as in the original paper. In particular, we get a correlation coefficient of 0.591 for boys and of 0.544 for girls. But do not be deceived: this does not mean that climate change is causing baby name diversity. Both quantities have an upward trend, and the correlation is a result of that. If you want to dig deeper into this topic, you can run a Granger test yourself and you will see that we have no evidence that rising temperatures cause larger numbers of names for either gender.
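To see how a shared upward trend alone can produce a high correlation, here is a small Python sketch on synthetic data. It is not a Granger test and it does not use the baby name or temperature data. It builds two made-up series that both grow over time but whose fluctuations are unrelated, and compares the Pearson correlation of the levels with the correlation of the year-to-year changes.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def first_differences(values):
    """Year-to-year changes, which removes a linear trend."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Two synthetic series: a linear trend plus wiggles at unrelated frequencies.
steps = range(60)
series_a = [i + 3 * math.sin(1.7 * i) for i in steps]
series_b = [2 * i + 5 * math.cos(2.9 * i) for i in steps]

r_levels = pearson(series_a, series_b)
r_changes = pearson(first_differences(series_a), first_differences(series_b))
print(f"levels: {r_levels:.3f}, changes: {r_changes:.3f}")
```

The levels correlate strongly because both series go up, while the changes barely correlate at all. Differencing is only a quick sanity check here; a Granger test asks the more specific question of whether past values of one series help predict the other.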

The limits of baby name predictability

Baby names are a popular example to illustrate scientific topics. The book Freakonomics covers the imitation part of the Simmel effect and explains how people imitate their richer neighbors when naming their babies. The book goes as far as predicting what the top US baby names will be in 2015, based on a data analysis exercise that is never explained in detail. Here is the prediction:

Enough time has passed and now we can evaluate the prediction with the SSA dataset. We can get the top 24 male and female names in 2015:

| Top female 2015 | Top female 2004 | Top male 2015 | Top male 2004 |
|---|---|---|---|
| Abigail | Abigail | Aiden | Alexander |
| Addison | Alexis | Alexander | Andrew |
| Amelia | Alyssa | Benjamin | Anthony |
| Aubrey | Anna | Carter | Brandon |
| Ava | Ashley | Daniel | Christian |
| Avery | Brianna | David | Christopher |
| Charlotte | Chloe | Elijah | Daniel |
| Chloe | Elizabeth | Ethan | David |
| Elizabeth | Emily | Gabriel | Dylan |
| Ella | Emma | Jackson | Ethan |
| Emily | Grace | Jacob | Jacob |
| Emma | Hailey | James | James |
| Evelyn | Hannah | Jayden | John |
| Grace | Isabella | Joseph | Jonathan |
| Harper | Jessica | Liam | Joseph |
| Isabella | Kayla | Logan | Joshua |
| Madison | Lauren | Lucas | Matthew |
| Mia | Madison | Mason | Michael |
| Olivia | Natalie | Matthew | Nathan |
| Scarlett | Olivia | Michael | Nicholas |
| Sofia | Samantha | Noah | Ryan |
| Sophia | Sarah | Oliver | Samuel |
| Victoria | Sophia | Samuel | Tyler |
| Zoey | Taylor | William | William |

As you can see, there is not much overlap between the prediction and the results for 2015. As a comparison, the table includes the same top list for 2004, the year before the Freakonomics book was published. Just using the 2004 list, you would have made a better prediction. The Freakonomics prediction still deserves some credit for anticipating the increase in popularity of names like “Ava”, “Avery”, “Ella”, “Carter”, “Jackson”, “Liam”, and “Oliver”.
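One simple way to score such a prediction is to count how many predicted names appear in the actual top list. The Python sketch below does this for the naive baseline of "predicting" 2015 with the 2004 list, using the names from the table above. The helper name `overlap` is made up, and the Freakonomics list itself is not included because it is not reproduced in this text.

```python
def overlap(predicted, actual):
    """Names that appear in both lists, sorted alphabetically."""
    return sorted(set(predicted) & set(actual))


top_female_2015 = """Abigail Addison Amelia Aubrey Ava Avery Charlotte Chloe
Elizabeth Ella Emily Emma Evelyn Grace Harper Isabella Madison Mia Olivia
Scarlett Sofia Sophia Victoria Zoey""".split()
top_female_2004 = """Abigail Alexis Alyssa Anna Ashley Brianna Chloe Elizabeth
Emily Emma Grace Hailey Hannah Isabella Jessica Kayla Lauren Madison Natalie
Olivia Samantha Sarah Sophia Taylor""".split()
top_male_2015 = """Aiden Alexander Benjamin Carter Daniel David Elijah Ethan
Gabriel Jackson Jacob James Jayden Joseph Liam Logan Lucas Mason Matthew
Michael Noah Oliver Samuel William""".split()
top_male_2004 = """Alexander Andrew Anthony Brandon Christian Christopher
Daniel David Dylan Ethan Jacob James John Jonathan Joseph Joshua Matthew
Michael Nathan Nicholas Ryan Samuel Tyler William""".split()

female_hits = overlap(top_female_2004, top_female_2015)
male_hits = overlap(top_male_2004, top_male_2015)
print(len(female_hits), "of", len(top_female_2015), "female names")
print(len(male_hits), "of", len(top_male_2015), "male names")
```

The 2004 baseline gets 10 of 24 female names and 11 of 24 male names right. Passing a predicted list as the first argument of `overlap` scores it in the same way.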

What you see is that predicting which particular names will be the most popular is a very difficult task. The Simmel effect describes forces that create observable patterns, but that does not mean that the model can predict which of all names will become popular ten years from now, even if we had data on the social status of parents. This is the difference between the explanatory and the predictive power of a model. A model can explain phenomena without being useful for making predictions, as in this case, but it can also be predictive without giving explanations, as in the case of deep learning or other black-box approaches.

Take-home message: understanding does not imply predictive power, and vice versa.