Suppose you write a program and you send the source code to two of your friends, ${\cal A}$ and ${\cal B}$. Your two friends read the code and when they finish, $A$ errors are detected by ${\cal A}$, $B$ errors are detected by ${\cal B}$, and $C$ errors are detected by both. So, in total, $A+B-C$ errors are detected and can now be eliminated. We wish to *estimate* the number of errors that remain unnoticed and uncorrected.

The original version of this problem concerns manuscripts and proofreaders, instead of source code and programmers. It was posed and solved by George Polya and published in 1976 in The American Mathematical Monthly under the title *Probabilities in Proofreading*. Because the problem is interesting and Polya’s solution is short and elegant, I have decided to record and share it. Also, since sharing and reading code is a frequent activity in software development, the estimate may be useful to some readers of this blog.

## Estimating the number of unnoticed errors

Let $E$ be the number of all errors, noticed and unnoticed, in the source code. Our goal is to estimate the value of $E-(A+B-C)$. Let $p$ be the probability that friend ${\cal A}$ notices any given error and $q$ the analogous probability for friend ${\cal B}$. The expected number of errors detected by ${\cal A}$ is $E{\cdot}p$ and by ${\cal B}$ is $E{\cdot}q$. Assuming that the two friends detect errors independently of each other, the expected number of errors detected by both friends is $E{\cdot}p{\cdot}q$.

Because we are interested in an *estimate*, we can safely assume that the expected numbers are approximately equal to the number of errors detected, that is, $E{\cdot}p \sim A$, $E{\cdot}q \sim B$, and $E{\cdot}p{\cdot}q \sim C$. (We use the notation $\sim$ to denote that two numbers are approximately equal.)
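To get a feeling for how reasonable these approximations are, here is a minimal simulation sketch in Python. It is not part of Polya's solution: the function name `simulate_review`, the seed, and the values of $E$, $p$, and $q$ are assumptions chosen only for illustration. Each error is noticed by each friend independently, exactly as assumed above.

```python
import random


def simulate_review(total_errors, p, q, seed=0):
    """Simulate two independent reviewers and return the counts (A, B, C).

    total_errors plays the role of E, and p and q are the per-error
    detection probabilities of the two friends (illustrative values only).
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    found_a = found_b = found_both = 0
    for _ in range(total_errors):
        a_sees = rng.random() < p  # friend A notices this error
        b_sees = rng.random() < q  # friend B notices it, independently of A
        found_a += int(a_sees)
        found_b += int(b_sees)
        found_both += int(a_sees and b_sees)
    return found_a, found_b, found_both


if __name__ == "__main__":
    E, p, q = 10_000, 0.6, 0.5
    A, B, C = simulate_review(E, p, q)
    print("A =", A, "compared with E*p   =", E * p)
    print("B =", B, "compared with E*q   =", E * q)
    print("C =", C, "compared with E*p*q =", E * p * q)
```

With a large number of errors the observed counts should land close to the expected values; with only a handful of errors the approximation is much rougher, so the final estimate should be read with that in mind.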

We now have all the ingredients to conclude the solution. Recall that our goal is to estimate the value of $E-(A+B-C)$. We calculate:

\[

\beginproof

\pexp{E-(A+B-C)}

\hint{=}{we rewrite $E$ to prepare the introduction of $A$, $B$, and $C$}

\pexp{\frac{E{\cdot}p{\cdot}E{\cdot}q}{E{\cdot}p{\cdot}q} - (A+B-C)}

\hint{\sim}{assumption on estimates}

\pexp{\frac{A{\cdot}B}{C} - (A+B-C)}

\hint{=}{arithmetic}

\pexp{\frac{(A-C){\cdot}(B-C)}{C}~~.}

\endproof

\]

This is the desired estimate!
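For readers who want to plug in their own numbers, the formula can be wrapped in a few lines of Python. This is only a sketch: the function name `estimate_unnoticed` and the input checks are my own assumptions, not something taken from Polya's paper. Note that the estimate is undefined when $C = 0$, because the derivation divides by $C$.

```python
def estimate_unnoticed(a, b, c):
    """Estimate the number of unnoticed errors as (A - C) * (B - C) / C.

    a: errors found by the first reader
    b: errors found by the second reader
    c: errors found by both readers
    """
    if c <= 0:
        # The derivation divides by C, so no estimate exists without overlap.
        raise ValueError("need at least one error found by both readers")
    if c > min(a, b):
        raise ValueError("c cannot exceed the number found by either reader")
    return (a - c) * (b - c) / c


if __name__ == "__main__":
    # One friend finds 20 errors, the other finds 15, and 10 are found by both.
    print(estimate_unnoticed(20, 15, 10))  # prints 5.0
```

In this example $A{\cdot}B/C = 30$ errors are estimated in total and $A+B-C = 25$ were detected, which leaves the 5 reported by the function.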

Nice piece. This is a well-known practice in ecology for estimating population size (and, more recently, in sociology and policy-making for estimating, for example, the number of homeless people). It’s called mark and recapture or capture-recapture.

http://en.wikipedia.org/wiki/Mark_and_recapture

There are even some (not many) papers applying this technique to defect estimation in software.

In 1976, this was probably a good answer. Nowadays, we have a peculiar kind of proofreader: the computer. If you have several proofreaders and you ignore one of them in these calculations, what effect does this have on the estimate?