On page 24 of The Elements of Statistical Learning (ESL) by Hastie et al., the bias-variance decomposition is stated but not derived. The derivation turns out to be easy, if a bit tedious. I am presenting it here using notation similar to ESL, in the hope that it saves someone some time.

I’d also like to credit these notes, which provided me the trick necessary to derive this, but which unfortunately did not provide the gory details.

To recap the notation used in ESL, x_0 is the point at which we want to evaluate our estimate of the function f, while f(x_0) and \hat{y}_0 denote the true value of the function and our estimate, respectively. From here on out, we’ll drop the subscripts and write x, f(x) and \hat{y}.

Recall the definitions of Variance and Bias Squared:

\text{Variance} = E[(\hat{y} - E[\hat{y}])^2]
= E[\hat{y}^2 - 2E[\hat{y}]\hat{y} + E[\hat{y}]^2]
\text{Bias}^2 = (E[\hat{y}] - f(x))^2
= E[\hat{y}]^2 - 2E[\hat{y}]f(x) + f(x)^2

Now we have mean-squared error:

\text{MSE} = E[(f(x)-\hat{y})^2]
= E[(f(x)- E[\hat{y}] + E[\hat{y}] - \hat{y})^2]
= E[(f(x)- E[\hat{y}] + E[\hat{y}] - \hat{y})(f(x)- E[\hat{y}] + E[\hat{y}] - \hat{y})]
= E[\underline{f(x)^2} - \underline{f(x)E[\hat{y}]} + f(x)E[\hat{y}] - f(x)\hat{y}
- \underline{E[\hat{y}]f(x)} + \underline{E[\hat{y}]^2} - E[\hat{y}]^2 + E[\hat{y}]\hat{y}
+ E[\hat{y}]f(x) - E[\hat{y}]^2 + \underline{E[\hat{y}]^2} - \underline{E[\hat{y}]\hat{y}}
-\hat{y}f(x) + \hat{y}E[\hat{y}] - \underline{\hat{y}E[\hat{y}]} + \underline{\hat{y}^2}]
= E[\hat{y}^2-2E[\hat{y}]\hat{y} + E[\hat{y}]^2]
+ E[E[\hat{y}]^2 -2E[\hat{y}]f(x) + f(x)^2]
+ E[f(x)E[\hat{y}] - f(x)\hat{y} - E[\hat{y}]^2 + E[\hat{y}]\hat{y}
+ E[\hat{y}]f(x) - E[\hat{y}]^2 - \hat{y}f(x) + \hat{y}E[\hat{y}]]
= \text{Variance} + \text{Bias}^2
+ f(x)E[\hat{y}] - f(x)E[\hat{y}] - E[\hat{y}]^2 + E[\hat{y}]^2
+ f(x)E[\hat{y}] - f(x)E[\hat{y}] +E[\hat{y}]^2 - E[\hat{y}]^2
= \text{Variance} + \text{Bias}^2

The key trick is to add and subtract E[\hat{y}] inside the squared error, which leaves the MSE unchanged. After that, the rest is tedium: expand the product, then regroup the terms to match the definitions of variance and bias squared above, using the linearity of expectation, i.e. E[aX] = aE[X] and E[X+Y] = E[X]+E[Y]. We also use the fact that E[E[X]]=E[X]. Here f(x) and E[\hat{y}] are constants and only \hat{y} is random, so, for example, E[f(x)\hat{y}] = f(x)E[\hat{y}] and E[E[\hat{y}]^2] = E[\hat{y}]^2.
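
For readers who want to see why the leftover terms cancel without tracking all sixteen products, the same argument can be sketched more compactly. The shorthand a and b is my own and does not appear in ESL: let a = f(x) - E[\hat{y}], which is a constant, and b = E[\hat{y}] - \hat{y}, which is random with E[b] = E[\hat{y}] - E[\hat{y}] = 0. Then:

\text{MSE} = E[(a + b)^2]
= E[a^2] + 2E[ab] + E[b^2]
= a^2 + 2aE[b] + E[b^2]
= \text{Bias}^2 + 0 + \text{Variance}

The cross term 2aE[b] is exactly the collection of non-underlined terms in the long expansion, which is why they sum to zero.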

Note that the underlined terms in the third step are collected into the first two lines of the fourth step, where they make up the variance and the bias squared.
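
As a sanity check on the identity MSE = Variance + Bias^2 derived above, here is a small simulation sketch in Python. Everything in it is an illustrative assumption rather than anything from ESL: the true value f(x) = 2.0, the Gaussian noise, and the deliberately biased estimator (a sample mean shrunk by a factor of 0.8). Expectations are replaced by averages over many simulated training sets.

```python
import random


def simulate_estimates(f_x, n_obs=10, shrink=0.8, noise_sd=1.0,
                       n_trials=20000, seed=0):
    """Draw n_trials training sets and return the estimate y_hat from each.

    Hypothetical setup: each training set holds n_obs noisy observations of
    f(x), and the estimator is the sample mean multiplied by `shrink`,
    which makes it biased on purpose.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        sample = [f_x + rng.gauss(0.0, noise_sd) for _ in range(n_obs)]
        estimates.append(shrink * sum(sample) / n_obs)
    return estimates


def decompose(estimates, f_x):
    """Return (mse, variance, bias_squared), with E[.] taken as a sample mean."""
    n = len(estimates)
    mean_est = sum(estimates) / n                        # stands in for E[y_hat]
    mse = sum((f_x - y) ** 2 for y in estimates) / n     # E[(f(x) - y_hat)^2]
    variance = sum((y - mean_est) ** 2 for y in estimates) / n
    bias_sq = (mean_est - f_x) ** 2
    return mse, variance, bias_sq


f_x = 2.0  # assumed true value of the function at the evaluation point
estimates = simulate_estimates(f_x)
mse, variance, bias_sq = decompose(estimates, f_x)

print(f"MSE               = {mse:.6f}")
print(f"Variance + Bias^2 = {variance + bias_sq:.6f}")
```

The two printed numbers should agree up to floating-point rounding, not merely approximately: the algebra in the derivation holds just as well when every expectation is a sample average, because the cross term vanishes for the same reason.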