Introduction

One of the big lessons I have learnt is that no matter how well (you think) you understand something, there is almost always a deeper or clearer level of understanding available. Even for trivially simple things, like addition. Today, I will describe my experience with this phenomenon in the context of Bayes' Theorem, by going through four different ways I have of conceptualising it. The final way is surprisingly simple but largely unknown; I learnt about it from this 80000 Hours interview with Spencer Greenberg.

Recall that Bayes' Theorem states: \(P(A | B) = \frac{P(A) P(B|A)}{P(B)}\)

Algebraic

For a long time, the main way I thought about Bayes' Theorem was that it was just a consequence of some algebraic re-arrangement:

\[P(B \cap A) = P(A \cap B) \\ P(B) P(A|B) = P(A) P(B|A) \\ P(A|B) = \frac{P(A) P(B|A)}{P(B)}\]

(The second line rewrites each side using the definition of conditional probability, \(P(X \cap Y) = P(X) P(Y \vert X)\).)
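
As a quick numerical check (my own addition, not part of the original argument), here is a minimal Python sketch verifying the re-arrangement on a toy example: one roll of a fair die, with the hypothetical events $A$ = 'the roll is even' and $B$ = 'the roll is at least 4'.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die.
outcomes = range(1, 7)

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event), 6)

A = {o for o in outcomes if o % 2 == 0}  # the roll is even
B = {o for o in outcomes if o >= 4}      # the roll is at least 4

p_A_given_B = prob(A & B) / prob(B)  # definition of conditional probability
p_B_given_A = prob(A & B) / prob(A)

# Both sides of the re-arrangement agree.
assert p_A_given_B == prob(A) * p_B_given_A / prob(B)
print(p_A_given_B)  # 2/3
```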

Pros:

  • It is a derivation of the formula
  • It is easy to follow this argument

Cons:

  • Provides no insight or understanding

Unfortunately, this is how I treated a lot of mathematics during my teenage and undergraduate years. The purpose of a proof is to establish the truth of something, and once you know it is true, you are free to use it. Somehow, I never stepped back and asked myself if there was more understanding available.

‘Changing the universe’ + some algebra

Often when dealing with probabilities, you have to determine all the possibilities of the situation (‘the universe of possibilities’) and then determine which of those corresponds to the event you are interested in.

For example, to determine the probability of getting two heads when you toss a fair coin twice, you might say that there are four total options (that are all equally likely), and one of those is what we are interested in, so the probability is \(\frac{1}{4}\).

How does this relate to Bayes? Well, the way I think about conditional probability is that when calculating the probability of A given B, I am changing the universe so that the possibilities are precisely those that correspond to B. Once I am in this restricted universe, I continue reasoning as normal.

For example, what is the probability of getting two heads given that at least one of the tosses (but we do not know which) is heads? In this case, the universe of possibilities is reduced to three options (we have ruled out the option of TT), so the probability is now \(\frac{1}{3}\).

Converting this to algebra, we get: \(P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) P(B|A)}{P(B)}\)
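
To make the 'changing the universe' idea concrete in code, here is a minimal Python sketch of the two-heads example above (the enumeration approach is my own illustration):

```python
from fractions import Fraction
from itertools import product

# The universe: all equally likely outcomes of two fair coin tosses.
universe = list(product("HT", repeat=2))  # HH, HT, TH, TT

# Change the universe: keep only the outcomes where B holds (at least one head).
restricted = [o for o in universe if "H" in o]  # TT is ruled out

# Reason as normal inside the restricted universe: count the outcomes in A (two heads).
favourable = [o for o in restricted if o == ("H", "H")]

print(Fraction(len(favourable), len(restricted)))  # 1/3
```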

Pros:

  • It is a derivation of the formula
  • Slightly more insight than the original, e.g. we divide by $P(B)$ because $B$ has become the ‘universe of possibilities’.

Cons:

  • Still not particularly insightful

Adjusting a first estimate

A third way I have of thinking about Bayes' Theorem (which, if I remember correctly, only arose after I was exposed to the fourth way below) starts with the idea that a reasonable default belief or first estimate is:

\(P(A \vert B) = P(A)\).

Why is this a reasonable first estimate?

  • The probability of $A$ happening given some information ought to be related to the probability of $A$ happening without any information.
  • If $B$ has no influence on $A$ whatsoever, then this first estimate is exactly correct. (And this is the basis of the formal definition of two events being independent.)

Now that we have a first estimate, we need to adjust it to get the exact probability. This adjustment is the role of the factor \(\frac{P(B \vert A)}{P(B)}\). Why is this a sensible adjustment factor? Let us see when it increases or decreases our first estimate (a small numerical sketch follows the list below).

  • If \(P(B \vert A) > P(B)\), then the adjustment factor is bigger than 1, so we increase our first estimate. This makes sense, or at least feels intuitive: if $B$ is more likely when $A$ occurs than it is in general, then observing $B$ should increase our estimate of $A$ occurring.
  • If \(P(B \vert A) < P(B)\), then the adjustment factor is smaller than 1, so we decrease our first estimate. This makes sense, following similar reasoning to the point above.
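
Here is the promised numerical sketch of the 'first estimate times adjustment' view, using hypothetical numbers of my own choosing:

```python
from fractions import Fraction

# Hypothetical numbers, chosen purely for illustration.
p_A = Fraction(3, 10)         # first estimate: P(A), with no information about B
p_B = Fraction(1, 5)          # P(B)
p_B_given_A = Fraction(1, 2)  # B is more likely when A occurs than in general...

adjustment = p_B_given_A / p_B  # ...so the adjustment factor is bigger than 1
p_A_given_B = p_A * adjustment  # Bayes' Theorem as "first estimate times adjustment"

print(adjustment)   # 5/2
print(p_A_given_B)  # 3/4
```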

I hope that makes sense - please let me know if it does not; it is somewhat vague and fuzzy. The strange thing is that this is the whole point of ‘priors’ and ‘posteriors’ in Bayesian updating. Yet somehow, I never made the intuitive leap that \(P(A \vert B)\) ought to be $P(A)$ with some kind of adjustment. I cannot remember what I thought when I first learnt about Bayesian stats at university, but my guess is that I treated it more mechanically: “If we have these various bits of information, which for formality’s sake we label with fancy terms like ‘prior distribution’ and ‘likelihood function’, then we can use Bayes’ Theorem to calculate this new thing which we label ‘posterior distribution’”.

Pros:

  • Provides insight into what Bayes’ Theorem is actually saying
  • Corresponds to Bayesian statistics / Bayesian updating

Cons:

  • Does not provide a derivation of the formula

Bayes via odds: the unknown version of Bayes’ Theorem

There are two main ideas in this version of Bayes’ Theorem.

  • Replacing probabilities with odds
  • Restating Bayes’ Theorem in terms of odds

Instead of giving a general derivation or discussion, I think the easiest way to illustrate and show off this version of Bayes’ Theorem is by walking through some explicit examples. Before that, though, we should first get used to thinking in terms of odds.

Odds

When describing uncertainty, the usual way is to list all the possible options along with their probabilities. With odds, we instead assign relative probabilities, and collectively these relative probabilities are called the odds.

  • Example. Instead of saying there is a probability of 0.5 of getting H and 0.5 of getting T, we say the odds are 1:1 of getting H and T.
  • Example. Instead of saying the probabilities are 0.25, 0.25 and 0.5 of getting two heads, two tails or one of each, we say the odds are 1:1:2.

(Note, I do not actually know what the official phrasing is. I have just come up with something that feels OK and whose meaning is hopefully clear.)

As a little exercise to check your understanding, determine how you would convert between probabilities and odds.

No really, I recommend you do this. The best way to learn is by actively doing something with the ideas, rather than passively reading.

Here is how I would describe the conversions (there is a short code sketch after the list):

  • Given some odds, you determine the probabilities by dividing all the relative probabilities by the sum of the relative probabilities. This process is known as normalising and the sum of the relative probabilities is the normalising factor or normalising constant.
  • Given some probabilities, there are infinitely many different odds you could create. Say we had probabilities $p_1, p_2$ and $p_3$; then the odds would be any positive multiple of these three numbers: $ap_1:ap_2:ap_3$. Of course, you should pick $a$ to create the most intuitive list of relative probabilities.
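
In Python, the two conversions might look like this (a minimal sketch; the function names are my own):

```python
def odds_to_probs(odds):
    """Divide each relative probability by their sum (the normalising constant)."""
    total = sum(odds)
    return [x / total for x in odds]

def probs_to_odds(probs, scale):
    """Any positive multiple of the probabilities is a valid set of odds."""
    return [p * scale for p in probs]

print(odds_to_probs([1, 1, 2]))                   # [0.25, 0.25, 0.5]
print(probs_to_odds([0.25, 0.25, 0.5], scale=4))  # [1.0, 1.0, 2.0]
```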

Do not worry if this feels strange. That is to be expected with a different way of looking at something that we are so used to looking at in a certain way.

Example 1. Disease given a positive test result

This is a famous question that demonstrates how bad our intuitions are when it comes to reasoning about uncertainty and probabilities. Here are the assumptions:

  • Without any other information, there is a 1 in 1000 chance somebody has the disease, independent of everybody else.
  • There is a test for the disease which is 99% effective. Explicitly, 99% of people who have the disease will get a positive test result, and 99% of people who do not have the disease will get a negative test result.

(Do not think about how one could know these facts. E.g. how can you know the effectiveness of a test if there is not already a test which is 100% effective?)

The question: you take the test and get a positive result. What is the probability that you have the disease?

The most common instinctive answer is 99%, because we know the test is 99% effective. However, the error here is mixing up the probability of having the disease given a positive test result with the probability of getting a positive test result given that you have the disease.

Before continuing, try to answer the question using standard tools and Bayes’ Theorem as usual.

OK, I assume you have tried it. Or at least, that you have done it before. If it was me, I would draw a probability tree and then do the various calculations.

Now, here is how to answer this question using the magical alternative, Bayes via odds (a short code sketch follows the steps).

  1. Determine the odds without any information, i.e. the prior odds.
    • They are 1:999 of having the disease versus not having the disease. (Why is it 999 and not 1000?)
  2. Determine the probabilities of getting a positive test result, conditioned on the possibilities from Step 1.
    • If I have the disease, I have a 0.99 chance of getting a positive result.
    • If I do not have the disease, it is a 0.01 chance.
  3. Multiply the corresponding numbers from Steps 1 and 2 together. Re-scale so numbers are nice.
    • From Step 1 we had 1:999. From Step 2 we had 0.99:0.01
    • Multiplying them together gives 0.99:9.99
    • Re-scaling gives 99:999
  4. Marvel at the fact we’re done. Those final numbers you worked out are the posterior odds!
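
Here is the promised code sketch of the four steps (the variable names are my own, not standard terminology):

```python
from fractions import Fraction

# Step 1: prior odds of having the disease versus not having it.
prior_odds = [1, 999]

# Step 2: probability of a positive test result under each possibility.
likelihoods = [Fraction(99, 100), Fraction(1, 100)]

# Step 3: multiply the corresponding numbers together.
posterior_odds = [o * l for o, l in zip(prior_odds, likelihoods)]
print(posterior_odds)  # [Fraction(99, 100), Fraction(999, 100)], i.e. 99:999

# To recover a probability, divide by the normalising constant.
p_disease = posterior_odds[0] / sum(posterior_odds)
print(p_disease)  # 11/122, roughly 9%
```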

This is amazing! I still find it surreal how straightforward the odds perspective makes the whole process. It is basically magic. (Though, if you did do the question using traditional means, you should see how all the numbers and calculations match up.)

Furthermore, this process makes explicit the effect of the new information. The prior belief is that we have the disease with roughly 1:1000 odds, and the posterior belief is that we have the disease with roughly 1:10 odds: the odds have increased by a factor of roughly 100 (exactly 99). By looking at the calculations (which is easy to do because they are so short), we see that this factor arises as the ratio of the values in Step 2. I have a 99% chance of a positive result if I have the disease, and a 1% chance if I do not, so I should put 99 times more weight on having the disease once I know I have a positive test result.

Example 2. The Monty Hall Problem

Another famous question. I assume you know about it, but here is a brief description of the set-up. You are on a game show. There are three doors. Behind two of the doors are goats and behind the third door is a prize (which you value more than a goat). You initially guess that the prize is behind Door 1. The game show host is nice, and reveals that Door 3 has a goat behind it. You are then offered the chance to switch your choice from Door 1 to Door 2.

Question: Should you stay on Door 1, switch to Door 2, or does it make no difference (statistically speaking)?

This question is a real brain-burner because, counter-intuitively, the answer is that you should switch. The most common thought is that once Door 3 is ruled out, there is a 50:50 chance of the prize being behind Door 1 or Door 2, so it makes no difference if you switch or not.

Here, I will illustrate how, using odds, we can gain some understanding of this situation (with a code sketch after the calculation).

  1. Determine the odds without any information (i.e. before Door 3 was revealed).
    • The odds are 1:1:1 for the prize being behind Doors 1, 2 and 3 respectively.
  2. Determine the probabilities of Door 3 being revealed, conditioned on the events in Step 1.
    • If the prize is behind Door 1, then there is a 50% chance of Door 3 being opened by the host. (The host picks randomly if they have a choice).
    • If the prize is behind Door 2, then there is a 100% chance of Door 3 being opened by the host. (The host will never open the door you originally guessed.)
    • If the prize is behind Door 3, then there is a 0% chance of Door 3 being opened by the host. (The host always reveals a goat.)
  3. Multiply the corresponding values to generate the posterior odds.
    • 0.5:1:0
    • Re-scaling, we get 1:2:0.

The prize is twice as likely to be behind Door 2 as it is Door 1, so you should switch.
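
And here is the promised code sketch, mirroring the disease example (my own illustration):

```python
from fractions import Fraction

# Step 1: prior odds for the prize being behind Doors 1, 2 and 3.
prior_odds = [1, 1, 1]

# Step 2: probability that the host opens Door 3, given each prize location.
likelihoods = [Fraction(1, 2), Fraction(1, 1), Fraction(0, 1)]

# Step 3: multiply the corresponding values and re-scale.
posterior_odds = [o * l for o, l in zip(prior_odds, likelihoods)]
print([2 * x for x in posterior_odds])  # [Fraction(1, 1), Fraction(2, 1), Fraction(0, 1)], i.e. 1:2:0
```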

Note, I do not think this is the most intuitive explanation of the Monty-Hall Problem. But I hope it is an illustration of the incredible ease of this odds-based Bayes’ Theorem.

Exercises

As I said above, the best way to learn is by actively engaging with the ideas, rather than passively reading. Hence, here are some exercises to try out.

  1. Formulate a statement of Bayes’ Theorem in this odds framework. Then prove it.
  2. A bag contains 1000 coins - 999 are fair coins but one of them is dodgy and has two heads. You take one out randomly, toss it five times in a row, and get heads each time. What is the probability you have the dodgy coin? How many heads in a row would you have to observe so that you think it is more likely than not that you have the dodgy coin?
  3. You meet somebody. They tell you they have two kids and that at least one of them is born on a Tuesday. What is the probability the other is born on Friday?

Pros and Cons

Pros

  • Easy to use
  • Better matches how Bayesian reasoning is used in real life. You want to update all probabilities, not just a single one.
  • The short calculation makes explicit exactly how the new information changes the relative probabilities.
  • It reveals that $P(B)$ in the standard formulation carries no information; it is just a normalising constant.

Cons

  • It is not a derivation.
  • It requires an unfamiliar change in perspective.

Final remarks

I hope you gained some new insights into Bayes’ Theorem. If you have any other perspectives on it, please let me know! As I said at the start, it is always surprising how things that you feel you understand contain hidden layers and deeper insights.

Lastly, I recommend you listen to the 80000 Hours interview with Spencer Greenberg. First, you will hear Spencer’s description of and perspectives on this odds framework. Second, you will learn a whole bunch of other insightful things: Spencer researches thinking and rationality and tries to develop software tools that make use of his insights. For example, Spencer discusses his research on when people are over- or under-confident.