Saturday, October 30, 2010

statistical inference and racism

Up-front disclaimer: None of the following should be construed as supporting racism. If anything, my point is that racism is as shaky intellectually as it is morally. It's simply wrong to treat an individual unfairly based on indirect inferences from their resemblance to members of a particular race (or ethnicity, gender, religion, etc.). I also apologize up-front for offending statisticians.

It occurred to me, as it doubtless has to others, that racism has some connections to faulty statistical inferences. Of course, I don't mean that the typical racist is computing actual statistical regressions, but that part of their error consists of an "intuitive" misunderstanding that would appear unsound if it were quantified and written out. And naturally, difficulty reaching correct conclusions about probability and statistics is present in any untrained person, not just the racist one.

the instrumental variable

I've heard people aver that, since racial discrimination has continued to decrease, race should no longer be considered a determining factor in people's lives. While this is a happy thought, it naively misses a crucial aspect: (direct) discrimination is not the only way that racism acts.

Assume a statistical model or equation in which neither race nor discrimination is present. It could model any personal outcome, such as educational attainment, as long as race isn't among the model's variables. Essentially, the equation expresses no direct relevance or causation between race and the outcome. Discrimination doesn't even enter the equation, as we posited.

Now run a thought experiment on our model. Suppose we apply the model to randomly selected people, i.e., we gather the info for each person and then calculate (not measure) the outcome with the model. Further suppose that we also record each person's race. Then we find, as in reality, that when we partition the calculated outcomes by race, the distributions differ drastically and consistently. How can this be? Race wasn't one of our variables, because without outright discrimination it had no mechanism of causation.

The answer is that, in this hypothetical model and thought experiment, race could still act as an instrumental variable. Roughly speaking, race is correlated with the outcome through the model variables. For instance, in a model of personal income, it's highly reasonable to have the parents' income as a variable. Although the model may leave out "race", race can nevertheless affect personal income (the outcome) as long as it affects the parents' income (a modeled variable). Thus race can be an indirect cause of an outcome despite not appearing in the model as a "cause".
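The thought experiment can be sketched as a small simulation. Everything numeric here is an invented assumption for illustration only: two hypothetical groups "A" and "B", made-up average parental incomes, and a made-up income formula. The one point it shows is that predicted_income never receives the group as an input, yet the calculated outcomes still split by group because the group shifts the parents' income.

```python
import random
import statistics


def predicted_income(parent_income):
    """Hypothetical model: the outcome depends only on parents' income.

    The person's group is deliberately NOT a parameter of the model.
    The coefficients are invented for illustration.
    """
    return 20_000 + 0.5 * parent_income


def simulate(n=10_000, seed=0):
    """Apply the model to n random people and average the results by group."""
    rng = random.Random(seed)
    outcomes = {"A": [], "B": []}
    for _ in range(n):
        group = rng.choice(["A", "B"])
        # Assumption: the group shifts the parents' average income,
        # standing in for the lingering effect of past barriers.
        parent_mean = 40_000 if group == "A" else 60_000
        parent_income = max(0.0, rng.gauss(parent_mean, 10_000))
        # The outcome is calculated (not measured) from the model alone.
        outcomes[group].append(predicted_income(parent_income))
    return {g: statistics.mean(values) for g, values in outcomes.items()}


means = simulate()
for group, mean in sorted(means.items()):
    print(f"group {group}: mean predicted income = {mean:,.0f}")
```

With these assumed numbers, the gap between the two group averages comes out near half of the assumed gap in parental income, even though the model itself is "race-blind".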

Moreover, race is likely to be an instrumental variable for an array of models, and all the models are likely to be related too. Bad (or missing) education tends to lead to low-paying employment or unemployment, which tends to lead to poverty, which tends to lead to crime, which tends to lead to low property values and taxes, which tends to lead to low school funding, which tends to lead to bad education. Regardless of whether there's a blatant racial barrier presently at work, the long-lasting "fingerprints" of racial differences may be active (e.g., systematic segregation sometime in the past that restricted access to the same set of opportunities).

the overeager Bayesian

Stereotypes are simplifications by nature. Sometimes a stereotype can be a helpful mental shortcut, but often it doesn't reflect the most accurate concept of a "typical" sample from a population. This is notoriously true of racial stereotypes that highlight the worst representatives.

Compared to a proper Bayesian analysis, "reasoning" by a stereotype is overeager to apply shaky conditional probabilities. Consider a very hypothetical population with only two races, each making up 50% of the population. 10% of the entire population has an undesirable characteristic (left unnamed). Of the people in the entire population who have the undesirable characteristic, 75% are of race A. If someone in this population is of race A, what is the probability that they have the undesirable characteristic?

When it comes to a stereotype for the undesirable characteristic, chances are the stereotype is of race A due to prevalence within that subgroup. So the stereotype would tend to override the right Bayesian answer, 7.5%, with a much greater probability that overlooks the facts that 1) only 10% of the entire population has the undesirable characteristic at all and 2) therefore 85% of race A does not have it.


  1. Anonymous, 10:58 AM

    This describes HOW statistical cultural-ism is perpetuated ad infinitum, undercover!! People and systems can, and do, use such “shallow” statistical inferences to promote self-fulfilling prophecies and, in many instances, scams. We see the evidence in instances such as ferrying out govt contracts, road construction crews, the NASA control room, and the congressional senate.

  2. Anonymous, 7:01 PM

    Hi. You said

    So the stereotype would tend to override the right Bayesian answer, 7.5% ...

    Well, the right Bayesian answer isn't 7.5% -- it's 15%.

    The question was

    Whenever someone in this population is of race A, what is the probability of having the undesirable characteristic?

    In Bayesian terms, we can rephrase the question as

    P(undesirable | A) = P(A | undesirable) * P(undesirable) / P(A) = .75 * .10 / .5 = .15

  3. Yes, I got the calculation wrong. I see now that 7.5% doesn't even make intuitive sense. If someone is picked at random without knowing whether they're in A or B, the probability is 10% for the characteristic. Adding the knowledge that the person is in A should raise, not reduce, the original probability.

    To translate it to integers: if there are 120 people, 60 are in A and 60 are in B. 10%, or a total of 12, have the undesirable characteristic. 3/4 of those 12 are in A, so 9 of those in A have it. If someone is known to be in A, then the probability is 9 out of 60, or 15%.
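
The corrected arithmetic from the comments can be double-checked with a short Python sketch. The function name posterior and the variable names are my own labels, not anything standard. Exact fractions are used so that the Bayes' rule result and the head-count result can be compared without rounding.

```python
from fractions import Fraction


def posterior(p_a_given_u, p_u, p_a):
    """Bayes' rule: P(undesirable | A) = P(A | undesirable) * P(undesirable) / P(A)."""
    return p_a_given_u * p_u / p_a


# Figures from the hypothetical population in the post.
p_a = Fraction(1, 2)            # half the population is race A
p_u = Fraction(1, 10)           # 10% have the undesirable characteristic
p_a_given_u = Fraction(3, 4)    # 75% of those who have it are race A

by_bayes = posterior(p_a_given_u, p_u, p_a)

# The same answer by counting heads in a population of 120.
total = 120
in_a = total * p_a                     # 60 people in A
with_u = total * p_u                   # 12 people have the characteristic
with_u_in_a = with_u * p_a_given_u     # 9 of those 12 are in A
by_counting = with_u_in_a / in_a

print(float(by_bayes))      # 0.15
print(float(by_counting))   # 0.15
```

Both routes give 15%, which is above the 10% base rate, as it should be once we know the person is in A.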