The Oracle (8) - let’s go all the way!

Bayesian statistics

Hierarchical models

Author

Written by Gianluca

Published

July 7, 2014

This is (maybe) the final post in the series dedicated to the prediction of the World Cup results \(-\) I’ll try and actually write another to wrap things up and summarise a few comments, but this will probably be a bit later on. This time, we’ve decided to use our model, which so far has been applied incrementally, ie stage-by-stage, to predict the results of both the semifinals and the final.

The first part is relatively straightforward; the quarter finals have been played and we know the results. Thus, we can repeat the procedure (which we described here) and i) update the data with the observed results; ii) update the “current form” variable and the offset; iii) re-run the model to estimate each team’s propensity to score; iv) predict the result of the unobserved games \(-\) in this case the two semifinals (Brazil-Germany and Argentina-Netherlands).

However, to give the model a nice twist, I thought we should include some piece of extra information that is available right now, ie the fact that Brazil will, for certain, play their semifinal without their suspended captain Thiago Silva and their injured “star player” Neymar (who will also miss the final, due to the severity of his injury). Thus, we ran the model by modifying the offset variable (see a more detailed description here) for Brazil, to slightly decrease their “short-term” quality. [NB: if this were a “serious” model, we would probably try to embed these changes in a more formal way, rather than as “ad hoc” modifications to the general set-up. Nevertheless, I believe that the possibility of dealing with additional information, possibly in the form of subjective/expert knowledge, is actually a strength of the modelling framework. Of course, you could say that the selection of the offset distribution is arbitrary and that other choices were possible \(-\) that’s true, and a “serious” model would certainly require more extensive sensitivity analysis at this stage!]
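To give a feel for what lowering an offset does, here is a minimal sketch. It assumes a log-linear Poisson model for goals scored, in which the offset enters additively on the log scale, and it treats the offset as a fixed number rather than a random quantity. The function name, the baseline rate and the size of the penalty are all hypothetical and are not the values used in our model.

```python
import math


def scoring_rate(log_baseline: float, offset: float) -> float:
    """Expected goals under an assumed model: log(theta) = log_baseline + offset."""
    return math.exp(log_baseline + offset)


# Hypothetical numbers, for illustration only (not fitted values).
log_baseline = math.log(1.5)  # assumed log scoring rate before any adjustment
penalty = -0.10               # assumed ad hoc reduction for the missing players

full_strength = scoring_rate(log_baseline, offset=0.0)
depleted = scoring_rate(log_baseline, offset=penalty)

# An additive shift on the log scale is a multiplicative change in the rate.
ratio = depleted / full_strength
print(f"rate multiplier: {ratio:.3f}")
```

Under these assumptions, a small negative shift in the offset scales the expected number of goals by `exp(penalty)`, whatever the baseline is.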

Using this formulation of the model, we get the following results, in terms of the _overall_ probability of going through to the final (ie accounting for potential draws in the 90 minutes and then extra time and possibly penalties, as discussed here):

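| Team 1 | Team 2 | P(Team 1 reaches final) | P(Team 2 reaches final) |
|---|---|---|---|
| Brazil | Germany | 0.605 | 0.395 |
| Argentina | Netherlands | 0.510 | 0.490 |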

So, the second semifinal is predicted to be much tighter (nearly 50:50), while Brazil are still favourites to reach the final, according to the model prediction.
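One way to turn simulated 90-minute scores into an overall probability of going through is sketched below. It assumes that a draw after 90 minutes is resolved by a single tie-break probability covering extra time and penalties. The function name, the `p_tiebreak` parameter and the toy scores are illustrative assumptions, not the procedure described in the earlier post.

```python
from typing import Sequence


def prob_advance(
    goals_a: Sequence[int],
    goals_b: Sequence[int],
    p_tiebreak: float = 0.5,
) -> float:
    """Share of simulations in which team A goes through.

    A win in 90 minutes counts fully; a draw counts with weight
    p_tiebreak, the assumed chance that A prevails in extra time
    or on penalties.
    """
    if len(goals_a) != len(goals_b) or len(goals_a) == 0:
        raise ValueError("need two non-empty score sequences of equal length")
    n = len(goals_a)
    wins = sum(a > b for a, b in zip(goals_a, goals_b))
    draws = sum(a == b for a, b in zip(goals_a, goals_b))
    return (wins + p_tiebreak * draws) / n


# Toy simulated scores, made up for illustration (not model output).
sim_a = [2, 1, 0, 1, 3]
sim_b = [1, 1, 0, 2, 0]

p_a = prob_advance(sim_a, sim_b)
p_b = prob_advance(sim_b, sim_a)
print(f"A advances: {p_a:.3f}, B advances: {p_b:.3f}")
```

With a symmetric tie-break (`p_tiebreak = 0.5` for both sides) the two probabilities sum to 1, as in the table above.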

As I said earlier, however, this time we’ve gone beyond the simple one-step prediction and have used these results to also re-run the model before the actual results of the semifinals are known and thus predict the overall outcome, _ie_ who’s winning the World Cup.

Overall, our estimation gives the following probabilities of winning the championship (these may not sum to 1 because of rounding):

Brazil: 0.372

Germany: 0.174

Argentina: 0.245

Netherlands: 0.206
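These figures can be related to the semifinal probabilities through the law of total probability: a team can only win the cup if it first reaches the final, so P(win cup) = P(reach final) × P(win final | reach final). As a rough check, the sketch below divides the two sets of published figures to back out the implied probability of winning the final for a team that gets there, averaged over its possible opponents. It assumes the same probabilities of reaching the final apply in both runs of the model; the variable names are illustrative and the results inherit the rounding of the inputs.

```python
# Probabilities of reaching the final (from the semifinal predictions).
p_final = {
    "Brazil": 0.605,
    "Germany": 0.395,
    "Argentina": 0.510,
    "Netherlands": 0.490,
}

# Probabilities of winning the championship (from the list above).
p_champion = {
    "Brazil": 0.372,
    "Germany": 0.174,
    "Argentina": 0.245,
    "Netherlands": 0.206,
}

# P(win final | reach final) = P(win cup) / P(reach final)
implied = {team: p_champion[team] / p_final[team] for team in p_final}

for team, p in implied.items():
    print(f"{team}: {p:.3f}")

# Sum of the championship probabilities; differs from 1 only by rounding.
total = sum(p_champion.values())
print(f"total: {total:.3f}")
```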

Of course, these probabilities encode extra uncertainty, because we’re going one extra step forward into the future \(-\) we don’t know which of the potential futures will occur for the semifinals. Leaving the model aside, I think I would probably like the Netherlands to win \(-\) if only for the fact that in that way, Italy would still be the 2nd most frequent World Cup winners, only one title behind Brazil, and one and two ahead of Germany and Argentina, respectively.