It's clear that a lot of people lost their jobs after the 2008 financial crash. You can read this chart and say how many: The unemployment rate went up by 5 percent. This is a very ordinary, very reasonable way of talking about this data, exactly the sort of thing that should pop into your head when you see this image. We're going to look deeper.
Where did these numbers come from? What do they actually count? What can the journalist say about this data, in light of recent history? What should the audience do after seeing it? Why do we believe charts like this, and should we? How is an unemployment chart any better, or different, than just asking people about their post-crash lives?
We'll need to go well outside of statistics to make any sense of things. I've been raiding psychology and social science and ethnography, and further afield too, from intelligence analysis to the neurobiology of vision. I've been collecting pieces, hoping to use data more thoughtfully and effectively in my journalism work.
I've tried to organize the things that can be said into three parts: Quantification is what makes data, then the journalist analyzes it, then the result is communicated to the audience.
This process creates "stories," the central products of journalism.
In journalism, a story is a narrative that is not only true but interesting and relevant to the intended audience. Data journalism is different from pure statistical analysis—if there is such a thing—because we need culture, law, and politics to tell us what data matters and how. A procurement database may tell us that the city councilor has been handing out lucrative contracts to his brother.
But this is interesting only if we understand this sort of thing as "corruption" and we've decided to look for it.
A sports journalist might look for entirely different stories in the same data, such as whether or not the city is actually going to build that proposed new stadium.
The data alone doesn't determine the story.
But the story still has to be true, and hopefully also thorough and fair.
What exactly that means isn't always obvious.
The relationship between story, data, culture, and truth is one of the key problems of twenty-first-century journalism.
There is a complex relationship between the idea conveyed by the words "unemployment rate" and the process that produces a particular set of numbers. Normally all of this is backstage, hidden behind the chart. It's the same for any other data. Data is created. It is a record, a document, an artifact, dripping with meaning and circumstance.
A machine recorded a number at some point on some medium, or a particular human on a particular day made a judgment that some aspect of the world was this and not that, and marked a 0 or a 1.
Even before that, someone had to decide that some sort of information was worth recording, had to conceive of the categories and meanings and ways of measurement, and had to set up the whole apparatus of data production.ii
Data production is an elaborate process involving humans, machines, ideas, and reality. It is social, physical, and specific to time and place. I'm going to call this whole process "quantification," a word which I'll use to include everything from dreaming up what should be counted to wiring up sensors.
Suppose you want to know if the unemployment rate is affected by, say, tax policy. You might compare the unemployment rates of countries with different tax rates. The logic here is sound, but a simple comparison is misleading: a great many things can and do affect the unemployment rate, so it's difficult to isolate just the effect of taxes.
Even so, you can build statistical models to help you guess what the unemployment rate would have been if all factors other than tax policy were the same between countries.
We're now talking about imaginary worlds, derived from the real through force of logic.
That's a tricky thing—not always possible, and not always defensible even when formally possible.
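To make the adjustment idea concrete, here is a minimal Python sketch, using entirely invented numbers rather than real country data, of how a naive comparison and a model that holds a second factor constant can give different answers about the same "tax effect":

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical "countries"

# Invented world: unemployment depends on the tax rate AND on a second
# factor (say, export demand) that also happens to correlate with taxes.
tax = rng.uniform(20, 50, n)                       # tax rate, percent
demand = 0.5 * tax + rng.normal(0, 5, n)           # confounding factor
unemployment = 8 + 0.05 * tax - 0.3 * demand + rng.normal(0, 1, n)

# Naive comparison: relate unemployment to tax rates alone.
X_naive = np.column_stack([np.ones(n), tax])
naive_effect = np.linalg.lstsq(X_naive, unemployment, rcond=None)[0][1]

# Adjusted comparison: hold demand constant by including it in the model.
X_adj = np.column_stack([np.ones(n), tax, demand])
adjusted_effect = np.linalg.lstsq(X_adj, unemployment, rcond=None)[0][1]

print(f"naive tax effect:    {naive_effect:+.2f} points per percent of tax")
print(f"adjusted tax effect: {adjusted_effect:+.2f} points per percent of tax")
# The naive estimate mixes the tax effect together with the demand effect;
# the adjusted estimate recovers something close to the true +0.05.
```

The point is not this particular model. It is that the naive comparison quietly mixes together everything that differs between countries, while the model at least makes the adjustment explicit and open to challenge.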
After analysis comes communication. This makes journalism different from scholarship or science, or any field that produces knowledge but doesn't feel the compulsion to tell the public about it in an understandable way. Journalism is for the audience—which is often a very broad audience, potentially millions of people.
From experience and experiments we know quite a lot about how minds work with data. Raw numbers are difficult to interpret without comparisons, which leads to all sorts of normalization formulas. Variation tends to get collapsed into stereotypes, and uncertainty tends to be ignored as we look for patterns and simplifications.
Risk is personal and subjective, but there are sensible ways to compare and communicate odds.
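The simplest of those normalization formulas is just division by population. A small sketch, with made-up city names and figures:

```python
# Made-up figures: raw counts alone make the bigger city look worse.
cities = {
    "Bigville":  {"burglaries": 1200, "population": 800_000},
    "Smalltown": {"burglaries":  300, "population": 100_000},
}

for name, c in cities.items():
    rate = c["burglaries"] / c["population"] * 100_000  # per 100,000 residents
    print(f"{name}: {c['burglaries']} burglaries, {rate:.0f} per 100,000")

# Bigville has four times as many burglaries, but half the burglary rate.
```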
But more than these technical concerns is the question of what is being said about whom. Journalism is supposed to reflect society back to itself, but who is the "we" in the data? Certain people are excluded from any count, and astonishing variation is abstracted into uniformity. The unemployment rate reduces each voice to a single bit: are you looking for work, yes/no? A vast social media data set seems like it ought to tell us deep truths about society, but it cannot say anything about the people who don't post, or the things they don't post about.
Omniscience sounds fantastic, but data is a map and not the territory.
And then there's the audience. What someone understands when they look at the data depends on what they already believe. If you aren't unemployed yourself, you have to rely on some image of "unemployed person" to bring meaning to the idea of an unemployment rate. That image may be positive or negative, it may be justified or untrue, but you have to fill in the idea of unemployment with something to make any sense at all of unemployment statistics.
Data can demolish or reinforce stereotypes, so it's important for the journalist to be aware that these stereotypes are in play.
That is one reason why it's not enough for data to be presented "accurately." We have to ask what the recipient will end up believing about the world, and about the people represented by the data.
Often, data is best communicated by connecting it to stories from the individual lives it represents.
Every authoritarian planner dreams of utopia, but totalitarian technocratic visions have been uniformly disastrous for the people living in them. A fully quantified social order is an insult to freedom, and there are good reasons to suspect such systems will always be defeated by their rigidity. Questions of action can hone and refine data work, but actual action—making a choice and doing—requires practical knowledge, wisdom, and creativity.
The use of statistics in journalism, like the use of statistics in general, will always involve artistry.
4
There were no Hispanics living in the United States before 1970. At least, there weren't according to the census. There couldn't be, because the census form did not include "Hispanic" or "Latino" or anything like it.iii
Actually there were about nine million Hispanics living in the country by 1970.7 In many ways the lack of census data made them invisible.
Quantification is the process that creates data. You can only measure what you can conceive. That's the first challenge of quantification. The next challenge is actually measuring it, and knowing that you measured it accurately. Data is only useful because it represents the world, but that link can be fragile.
At some point, some person or machine counted or measured or categorized, and recorded the result.
The whole process has to work just right, and our understanding of exactly how it all works has to be correct, or the data won't be meaningful.
5
That's the logic behind historian G. Kitson Clark's advice for making generalizations:
Do not guess; try to count. And if you cannot count, admit that you are guessing.
Many words have quantitative aspects. Words like "all," "every," "none," and "some" are so explicitly quantitative that they're called quantifiers in mathematics. Comparisons like "more" and "fewer" are clearly about counting, but much richer words like "better" and "worse" also imply counting or measuring at least two things.
There are words that compare different points in time, like "trend," "progress," and "abandoned." There are words that imply magnitudes, such as "few," "gargantuan," and "scant."
6
Article I, Section 2 of the 1787 Constitution established the census and divided people into three categories: "free persons"; "Indians not taxed"; and "other persons," which really meant "slaves." Although aligned with race, these were also political categories because the census was created to apportion representatives and taxes between the states.
Indians counted for neither representation nor taxes, while slaves were only counted as three-fifths of a person.
This was the compromise between the slave and non-slave states that created the country.
It seems insane now, but that's the history, and a reminder that the census is not an "objective" count but a bureaucratic process that generates data for specific purposes.
Asking why the data was collected does not answer how it was collected, but it's often a big hint.
If self-identification seems the obvious way to determine race, that's because we now understand race as an entanglement of identity, culture, and biology, as much social as genetic. But that is a late twentieth-century understanding. The census officials of the 1950s do not seem to have understood race this way; they simply wanted a more accurate count and took for granted that a person knows their own race.
There is something about self-identification that feels like a step forward in codifying race, a better way of making it visible in the aggregate. It's a more dignified approach. But it has its own serious limitations. It's not the data you need if you want to study race-linked genetic diseases or how people treat strangers differently based on skin color.
We can think of race in many different ways, but the available data has no obligation to match our conceptions.
If you want to know what the data really measures, the only thing that matters is how it was collected.
Hence, the census up to 1950 counts something different than the census from 1960 onward, even though both call it "race." How is it different? That depends on the question you wish to ask of the data.
7
Quantification always involves complex choices, even in the hard sciences. Although friction is a basic force of classical physics, it comes from micro-interactions between surfaces that aren't fully understood. A high school physics textbook will tell you that we usually describe it with two numbers: the coefficient of static friction, which is how hard you have to push to start something sliding, and the coefficient of kinetic friction, which is how hard you have to push to keep it sliding.
But more sophisticated measurements show that friction is actually quite a complex force.
It also depends on velocity, and even on how fast you were sliding previously.
Anyone working with friction has to choose how to quantify it.
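In the textbook version, the choice is to summarize everything with those two coefficients. In standard physics notation (not taken from this chapter), the relations are:

```latex
F_{\text{static}} \le \mu_s N
\qquad\qquad
F_{\text{kinetic}} = \mu_k N
```

Here N is the normal force pressing the two surfaces together: push with less than μ_s N and nothing moves; once the object is sliding, friction resists with roughly μ_k N. Everything messier, like velocity dependence, is simply left out of this quantification.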
In practice we end up replacing such rich concepts with much simpler proxies. We get "test scores" instead of "educational attainment" and "income" as a proxy for "quality of life," while "intelligence" is today measured by a battery of tests which assess many different cognitive skills. In experimental science this is called operationalizing a variable, a fancy name for picking a definition that's both analytically useful and practical enough to create data.
8
You should be skeptical of any headline that says the number of jobs in the United States has changed by less than about 105,000 since last month. That's because the monthly jobs growth estimate has a margin of error of about plus or minus 105,000.16
Political polls also have built-in error. If one candidate is ahead of the other 47 percent to 45 percent, but the margin of error is 5 percent, there is a pretty good chance that another identical poll will show the candidates the other way around. Pretty much any sort of public survey will have intrinsic error, and a reputable source will report the margin of error along with the results.
The error of a measurement is a necessary part of understanding what that measurement means.
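For a simple random sample there is a standard back-of-the-envelope formula for that margin. A sketch in Python, with a hypothetical sample size chosen so the margin comes out near the 5 percent in the example above:

```python
import math

share = 0.47   # the leading candidate's share in the poll
n = 385        # hypothetical number of respondents (not from the text)

# Standard 95 percent margin of error for a proportion
# estimated from a simple random sample.
margin = 1.96 * math.sqrt(share * (1 - share) / n)

print(f"{share:.0%} ± {margin:.1%} at 95% confidence")   # roughly 47% ± 5.0%
```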
Expressing how much error there is may seem obvious now, but it was a key innovation in the history of statistics. There is a random sample in the Old Testament: "The people cast lots to bring one out of every ten of them to live in Jerusalem."vi It couldn't have been long before someone thought of counting by letting each of the chosen stand for 10, but millennia passed before anyone was able to estimate the accuracy of this process.
If this doesn't strike you as audacious, you've probably never thought about just what a poll claims to be able to do. Extrapolating from 150,000 people to 300,000,000 people means collecting information from one person in 2,000, then saying it speaks for the other 1,999. It's like asking only one person in each neighborhood whether he or she is employed.
This distribution tells us everything we can know about the possible error in our sample value. But we'll often want a more understandable summary, and one way of summarizing an error distribution is to say how often we'll get within a certain distance of the correct answer. Let's say we want to know how often we can expect to get either the true answer of 40%, or the closest incorrect answers of 20% and 60%.
This requires adding up the probabilities that we get 20%, 40%, or 60%, which corresponds to seeing one, two, or three unemployed people in our sample.
There's a probability of 0.26 + 0.36 + 0.23 = 0.85 that we'll see one of these three answers.
Among the 2,118,760 different samples of five that we could draw from our population of 50 people, we find that 1,815,400 or 85 percent of them contain one, two, or three unemployed people. Put another way, 85 percent of all samples contain between 20% and 60% unemployed.x This range is known as an 85-percent confidence interval.
Because this interval covers a 40% range, and our best estimate is right in the middle, we say that the estimate has a margin of error of 20%.
The margin of error is always half of the width of the confidence interval.
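That counting argument can be checked directly. A minimal sketch, assuming the worked example's setup of 50 people, 20 of them unemployed, and samples of five drawn without replacement:

```python
from math import comb

population, unemployed, sample_size = 50, 20, 5
employed = population - unemployed

total_samples = comb(population, sample_size)   # all possible samples of five
# Samples containing one, two, or three unemployed people,
# i.e. estimates of 20%, 40%, or 60% unemployment.
close_samples = sum(
    comb(unemployed, k) * comb(employed, sample_size - k) for k in (1, 2, 3)
)

print(total_samples)                  # 2118760
print(close_samples)                  # 1815400
print(close_samples / total_samples)  # 0.8568..., the 85 percent
```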
There are many different ways of phrasing our result, which all mean the same thing.
The 85-percent confidence interval is 20% to 60%.
The answer is 40% with a margin of error of 20%, 17 times out of 20.
We are 85 percent certain that the true answer is between 20% and 60%.
The answer is 40% ± 20% at 85 percent confidence.
Notice that we always use two values to measure the uncertainty: a margin of error and the probability that the true answer falls within that margin of error.xi The range of error, in this case 20% to 60%, is called the 85-percent confidence interval. The 85 percent figure itself is called the confidence level.
Whatever language we use, we have quantified the error in our survey with two values: a range of error and how often the true answer will fall within that range.
If 40% ± 20% at an 85-percent confidence level is a precise enough answer, you've reduced your work by a factor of 10 by asking only five out of 50 people.
11
Finding a story in the data will always be an act of cultural creation. But those stories must still be true! So the rest of this chapter is an introduction to three big ideas that can help draw truth from data. The first is the effect of chance, randomness, or noise, which can obscure the real relation between variables or create the appearance of a connection where none exists.
The second is the nature of cause, and the situations where we can and can't ascribe cause from the data.
Above all is the idea of considering multiple explanations for the same data, rather than just accepting the first explanation that makes sense.
12
Our very first questions have to be about the source of the data, the quantification process. Who recorded this and how? Of course the police knew that there was a new closing time being tested—did this influence them to count differently? Even a true reduction in assaults doesn't necessarily mean this is a good policy.
Maybe there was another way to reduce violence without cutting the evening short, or maybe there was a way to reduce violence much more.
14
The odds are defined as the number of events we are counting divided by the number we are not counting. In gambling the odds are the number of times you win divided by the number of times you don't. The odds of cake here are 2/3, or about 0.67, but we usually report odds by giving the numerator and the denominator separately: the odds are 2 to 3.
You can convert odds to probability by dividing the first number by the sum of the two: 2 to 3 odds is a probability of 2 / (2+3).
Odds of 1 to 1 mean a probability of 1 / (1+1) = 1/2, or a 50/50 chance.
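The conversion is mechanical enough to write down. A tiny sketch, using the numbers from the text above:

```python
def odds_to_probability(for_count, against_count):
    """Convert odds of 'for_count to against_count' into a probability."""
    return for_count / (for_count + against_count)

print(odds_to_probability(2, 3))   # 0.4, from the "2 to 3" cake odds above
print(odds_to_probability(1, 1))   # 0.5, even odds, a 50/50 chance
```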
15
The less likely it is that something can occur by chance, the more likely it is that something other than chance is the right explanation. This sensible statement is no less profound when you think it through. This idea emerged in the 1600s when the first modern statisticians asked questions about games of chance.
If you flip a coin 10 times and get 10 heads, does that mean the coin is rigged or are you just lucky? The less likely it is to get 10 heads in a row from a fair coin, the more likely the coin is a fake.
This principle remains fundamental to the disentangling of cause and chance.
There's a convention of saying that your data is statistically significant if p < 0.05, that is, if there is a 5 percent probability (or less) that you'd see data like yours purely by chance.
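The coin example is small enough to compute exactly. A sketch:

```python
# Probability of seeing 10 heads in a row from a fair coin.
p_ten_heads = 0.5 ** 10
print(p_ten_heads)           # 0.0009765625, about 1 in 1,024

# Compared against the conventional p < 0.05 threshold:
print(p_ten_heads < 0.05)    # True: chance alone is an unlikely explanation
```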
16
Bayesian statistics works by asking: What hypothetical world is most likely to produce the data we have? And how much more likely is it to do so than the alternatives? The possible "worlds" are captured by statistical models, little simulations of hypothetical realities that produce fake data. Then we compare the fake data to the real data to decide which model most closely matches reality.
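As a toy illustration of that model-comparison idea, not any particular journalistic analysis: suppose we observed 8 heads in 10 coin flips, and we want to ask whether a fair-coin world or a world where the coin lands heads 70 percent of the time is more likely to have produced that data.

```python
from math import comb

heads, flips = 8, 10   # invented observation, purely for illustration

def likelihood(p_heads):
    """How often a world where heads come up with probability p_heads
    would produce exactly this many heads in this many flips."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads)**(flips - heads)

fair = likelihood(0.5)     # about 0.044
biased = likelihood(0.7)   # about 0.233

print(f"fair-coin world:   {fair:.3f}")
print(f"biased-coin world: {biased:.3f}")
print(f"ratio: {biased / fair:.1f}")   # the biased world fits about 5x better
```

The biased world produces data like ours roughly five times as often as the fair world. That ratio, not either probability on its own, is what the comparison turns on.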