1) Introduction

In principle, testing the quality of predictions is a problem. One predicts a probability with which a certain event will occur. Then it occurs or it does not. There is no real yardstick for whether the prediction was “good” or “bad”, since the very concept of probability admits both cases, occurrence and non-occurrence. So if we predict 70% for an event and it occurs, it was not automatically a good prognosis. If you bet on it, with money staked, and then win, that is very nice. And maybe the bet only came about because we predicted a different probability from other people. But still, it is not really a measure of the prediction’s quality.

The 70% that we predicted should occur neither more nor less than 70% of the time. If we repeatedly rate an event at 70% and it occurs again and again, then the verdict would rather be: “You didn’t predict that well. It happens much more often than 70%.” In this respect, every event that we forecast with a value of less than 1 (=100%), and this actually concerns every event that lies in the future, must sometimes occur and sometimes not occur. And it must do so in the correct, i.e. the predicted, ratio. At 70%, it would have to occur 70 times and fail to occur 30 times in 100 attempts.

But the problem remains mainly this: In the real world, a certain random experiment – and I am simply taking the event “a football match takes place” here – is only carried out once under the given conditions. Even if we could persuade Borussia Dortmund and Bayern Munich to play each other again the next day after their match, the conditions would definitely not be the same. One player got injured. The result was a surprise, Dortmund won, Bayern doubled their efforts. The conditions are completely different. And even if they were the same for two times: You can’t make a real random experiment out of it, where you check your results in the long term. It would have to be feasible 100 or 1000 times.

So you only ever have a forecast and then a result. Was the forecast good? If you bet, count your money. But that’s not enough. With the method presented here, you cannot check the individual forecast, but actually you examine the quality of the prophet. Each individual value could be wrong, badly estimated, totally off the mark. But in the long run, one can still see whether this prophet achieves results with his predictions that are good within the framework of his own predictions. The numbers check themselves, as curious as that sounds.

Of course, you can extend this afterwards and hold two prophets or even just two forecasts against each other. If one predicts 70% for such a one-time event, but the other predicts 60% and it occurs, then for this one case the man with the 70% was “better”. (This principle will be examined in more detail in the following chapter “The perfect betting game”).

So the topic here is: how can one check in the long run — apart from the money counting method, in case one competes on the betting market with it — whether predictions made were good? A prediction is always just an estimate of the various outcomes of a one-time, non-repeatable random experiment, expressed in probabilities. So it’s true, there is a method. Here, the reader is to be introduced very carefully to this problem and its solution.

2) The method of hit expectation

If one does not find counting money as a method sufficiently good – of course, nowadays and in our society this is the decisive criterion, but there is also a scientific part to how close one can get to a “truth” –, then there is a second method that is still quite simple and descriptive. This statistic was also carried in the specially created database for all bets placed. This statistic concerns the expectation of hits in relation to the hits that occurred.

The method works as follows: through the specially developed system (explained in the chapter of the same name), a probability was calculated for each event on which a bet was placed. The “expected hits” for this one game are, of course, exactly the size of the predicted probability. If one bets on an event for which a probability of 50% was predicted, then one expects 0.5 hits, i.e. half a hit. This is logical. If you add up all the probabilities for the events on which you have bet, you get a total hit expectation. You can do this per day, per week, per month or per year. Of course, you can also do it for your whole life. So the hit expectation is the sum, over all events on which bets were placed, of the probabilities one determined oneself.

Then there is a result for each of these bets placed. Did the bet event occur or not? Each event that has occurred results in a hit, each event that has not occurred results in 0 hits. The sum of all bets that have occurred can therefore be compared with the sum of the expected hits.

This method works well and reliably. If you always keep in mind that the bets are placed under the condition that “the odds bet are higher than the fair odds” (explained in the chapters “How an odds is created” and “My system”), then you can assume that you will also win money if you achieve your hit expectation reasonably well. In addition to the two numbers, there is also the number “minimum required hits for par”, where the reciprocals of the achieved odds are added up (also explained in “My System”). If you exceed these, you should win.
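The three numbers described above can be sketched in a few lines. This is only an illustration: the function name `hit_summary`, the example bets and the assumption that odds are quoted as decimal odds (so that their reciprocal is the break-even probability) are not taken from the database mentioned in the text.

```python
def hit_summary(bets):
    """Summarise a list of bets.

    Each bet is a tuple (predicted_probability, odds_taken, occurred).
    Assumption: odds are decimal odds, so 1/odds is the break-even probability.
    """
    expected_hits = sum(p for p, _, _ in bets)              # sum of own probabilities
    actual_hits = sum(1 for _, _, occurred in bets if occurred)
    hits_for_par = sum(1.0 / odds for _, odds, _ in bets)   # minimum hits to break even
    return expected_hits, actual_hits, hits_for_par


# Hypothetical example data, invented purely for illustration.
bets = [(0.50, 2.20, True), (0.70, 1.60, True), (0.40, 3.00, False), (0.25, 5.00, True)]
expected, actual, par = hit_summary(bets)
print(f"expected hits {expected:.2f}, actual hits {actual}, hits for par {par:.2f}")
```

In this invented sample the hits for par lie below the hit expectation, which is exactly the condition “odds bet higher than fair odds” expressed in hits.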

However, from a scientific point of view, you can already feel the limitations here. After all, the quality of the predictions is only checked for the games on which the bets were placed. Perhaps that is only a fraction of the games? Furthermore, bets are only placed in the case of extreme deviations in the assessments. That could already sufficiently distort statistics that would depend on objectivity. In addition, the forecast is only checked on two of the three outcomes. If we stick to classic betting, then bets are placed on 1, X or 2. So three outcomes are estimated, but only two of them are checked. The reason why there are two is that one event is always predicted and thus the counter probability is also estimated. So if you bet on the said 70%, you are also betting against the remaining 30%. Only how these are divided between the other two outcomes is not checked, although in reality it is also predicted.

3) Up to this point, the mathematician still feels at ease.

In the first and simplest example to illustrate this, we take a known probability and forecast its occurrence in the size of the known probability. That is not even “forecasting”. Nevertheless, there might already be a learning effect. There is a term that does not yet exist in this form in probability theory. Namely, there is an “average expected probability”. Why mathematicians have ignored this term and circumstance so far (well, I risk overlooking existing scientific work here; however, I have not yet been able to find such a term in the mathematical literature) also becomes clear quite quickly: the classical examples are mostly repeatable experiments with fixed probabilities of occurrence. In reality, at least in sports betting, they are neither fixed nor known, nor is the experiment ever repeatable. So on this form of black ice, a mathematician in particular no longer feels comfortable. He then probably prefers to stay on solid ground.

The learning effect to be achieved now is the following: There is this so-called “average expected probability”. This is calculated analogously to the calculation of any other expected value: You take the probability and multiply it by the numerical value of the outcome. As a reminder, let’s take the example of throwing a die. The expected value of the number rolled is calculated as: 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5. In each case, the probability of occurrence (1/6 each) is multiplied by the value on the die (1 to 6). (The average expected probability is identical to the “determination” explained a little further below. The terms are nevertheless explained separately).

Analogously, the expected value for the average expected probability is calculated, in the same example, as: 1/6 * 1/6 + 1/6 * 1/6 + 1/6 * 1/6 + 1/6 * 1/6 + 1/6 * 1/6 + 1/6 * 1/6. The numerical value is exactly equal to the probability of occurrence. It is virtually a squaring of the probabilities. The result is the average expected probability. A chance of 1/6 occurs with probability 1/6.

After multiplying out and adding up, the result is 1/6. Of course. All probabilities were the same. Then it makes no sense. (See also the chapter “Expected value and equity”). The 1 comes with a probability of 1/6, as does the 2. We simply multiply the probability of occurrence by the numerical value, so in this case the numerical value is the probability of occurrence itself, so the result is 1/6 * 1/6 for the 1, the same for the 2, all values are added up, the sum of the individual probabilities was also 1, as prescribed, and we get the number 1/6. This is the “average expected probability” for this example. And this expectation is also fulfilled. Because no matter what number you then roll, its probability of occurrence was 1/6 and your expectation beforehand was also 1/6. This does not result in any increase in information. But it is still true. Here, however, the cases should be examined in which there are different and, what is more, all unknown probabilities of occurrence (nevertheless predicted, estimated in their amount), but that happens further down.
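The calculation for the die can be checked with exact fractions. The following is a minimal sketch; the function names are chosen here for illustration and are not part of the original text.

```python
from fractions import Fraction


def expected_value(probs, values):
    """Ordinary expected value: probability times numerical value, summed."""
    return sum(p * v for p, v in zip(probs, values))


def average_expected_probability(probs):
    """Same recipe, but the 'value' of each outcome is its own probability."""
    if sum(probs) != 1:
        raise ValueError("probabilities must add up to 1")
    return sum(p * p for p in probs)


die = [Fraction(1, 6)] * 6
print(expected_value(die, range(1, 7)))       # expected number rolled
print(average_expected_probability(die))      # average expected probability
```

Using `Fraction` instead of floating point avoids rounding, so the check that the probabilities add up to 1 is exact.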

Nevertheless, one can examine the simple example even further: if one checks it practically, something very boring happens: one rolls the dice once, or even a hundred times. But each time there is an event to which we have given the probability 1/6 in advance. And we compare that with the average expected probability. And that, as calculated above, was also 1/6. So our result coincides with our expectation. To what extent has mathematics been enriched by this? A trivial statement is confirmed with a trivial calculation and a simple experiment. What was the point of that?

To make sense of it, we have to examine an example in which at least initially there is no uniform distribution. So we look at an unequal distribution. To do this, we put 10 balls in a pot. Two are white, eight are red. You draw a ball blindfolded. This is a random experiment with two outcomes. But the two outcomes do not have the same probability. We note down for each experiment: white or red. We note the probability of the event occurring. And we calculate the “average expected probability” again. In this example, this is now 0.8 * 0.8 + 0.2 * 0.2 = 0.68. Now we simply note down what happens if we carry out the experiment 50 times.

Here is the result of the experiment, looking specifically at the event “We draw a red ball”. This has a probability of 80%:

(The experiment was conducted in Excel; the random numbers each ensure occurrence or non-occurrence, depending on whether greater or less than 0.8).

What do these columns, including the sums, tell us? Well, column 1 shows the probability of drawing a red ball. Column 2 shows our predicted probability. In this case, this is also quite boring, because we know the truth (or apparently know it; other chapters are a reminder of that). Column 3 contains the expected average probability. This is the expected value, calculated like other expected values. One expects an average probability of 68%. Yes, that is how it is. Because: sometimes the 80 comes and sometimes the 20 comes, the 80 comes 80% of the time, the 20 comes 20% of the time. So multiply out, add up.
In column 4 is a random number that determines whether the event has occurred or not. This random number was generated by the computer for this experiment. Its neutrality was assumed.
Column 5 only shows whether the event “drawing the red ball” occurred (1) or did not occur (0).
Column 6 shows the probability of the event that occurred. This is logical. If the 80, the more probable event, has occurred, we note the 80 there; if the less probable event comes, i.e. a white ball was drawn, we note the 20(%) in the column. Good. So whenever the random number in column 4 is less than 0.8, the more likely event has occurred; when the random number is greater than 0.8, the less likely event has occurred.

Now consider again for a moment what one would expect to happen in this column in the long run. Yes, one would expect that 20% of the time there would be a 20 and that 80% of the time there would be an 80. Of course, knowledge of the probability of occurrence is assumed (this is only emphasised here because it is almost never known in the case of the later events that are forecast, i.e. also in life as it really is). So the “average” should be exactly 68. The 80 comes 80% of the time and the 20 comes 20% of the time, which results in an average of 68. That’s how we calculated it and that’s how we expect it.
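The table described above can be reproduced outside of Excel. This is a rough sketch under stated assumptions: it uses Python’s standard pseudo-random generator with an arbitrary seed, so the individual draws will not match the original run, and the function name `simulate_draws` is invented here.

```python
import random


def simulate_draws(p_red, n_draws, seed=0):
    """Return one row per draw: (random number, occurred 1/0, probability that occurred)."""
    rng = random.Random(seed)          # seed is arbitrary; it only makes the run repeatable
    rows = []
    for _ in range(n_draws):
        r = rng.random()
        occurred = 1 if r < p_red else 0                    # red ball drawn?
        occurred_prob = p_red if occurred else 1 - p_red    # note the 80 or the 20
        rows.append((r, occurred, occurred_prob))
    return rows


rows = simulate_draws(0.8, 50)
hits = sum(occurred for _, occurred, _ in rows)
average_occurred = sum(q for _, _, q in rows) / len(rows)
print(f"red drawn {hits} times, average occurred probability {average_occurred:.4f}")
```

The printed average can then be held against the expected 0.68; with only 50 draws, a deviation of a few points in either direction is to be expected.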

In this experiment, “by chance” the 80 came once too rarely, so only 39 times out of 50, instead of the “expected” 40. This is absolutely not unusual.

I add the following: in a random experiment with a known 80% probability of occurrence, getting exactly the expected 40 hits in 50 trials is the most probable of all outcomes. However, its probability is still only 13.98%. The calculation rule for this is 0.8 to the power of 40 * 0.2 to the power of 10 * (50 over 10). So the most likely outcome is to draw a red ball exactly the expected 40 times. Nevertheless, it is, colloquially speaking, “rather unlikely”. The outcome “we draw the red ball exactly 39 times in 50” had a probability of 12.71%. Analogously, we calculate 0.8 to the power of 39 * 0.2 to the power of 11 * (50 over 11) = 12.71%. (This is the calculation rule for the binomial distribution.)
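The two percentages can be verified directly with the binomial formula. A minimal sketch (the function name `binomial_pmf` is chosen here; `math.comb` computes the “50 over 10” term):

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k hits in n independent trials with hit probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_40 = binomial_pmf(40, 50, 0.8)
p_39 = binomial_pmf(39, 50, 0.8)
print(f"exactly 40 red: {p_40:.2%}, exactly 39 red: {p_39:.2%}")

# The most probable number of hits among all 51 possibilities.
most_likely = max(range(51), key=lambda k: binomial_pmf(k, 50, 0.8))
```

Since (50 over 10) equals (50 over 40), it does not matter whether the hits or the misses are counted in the binomial coefficient.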

However, the effect of this small deviation of 39 instead of 40 times “red ball drawn” can also be seen in the column of the occurred probability. We take the sum and average of this column at the very end and see that the “average probability” was only 66.80%. We had expected it to coincide with the other figure, the average expected probability. Only it didn’t happen in this experiment. Sure, that was also because the red ball was drawn once too infrequently. If it had been drawn 40 times, the so beautifully exact 68% would also have come out in this column. One must always bear in mind that as long as the probabilities are known, nothing decisive has yet been gained with my newly introduced quantity. That was not my claim. But later I will talk about events that are not based on a known probability. And there it is an advantage to look at this figure. Patience, then.
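With a single fixed probability, the average of the occurred-probability column depends only on the number of hits. A short sketch makes the 66.80% and the 68% traceable (the function name is an illustrative choice):

```python
def average_occurred_probability(hits, trials, p_assigned):
    """Average of the 'probability that occurred' column for a fixed assigned probability."""
    misses = trials - hits
    # Each hit contributes p_assigned, each miss contributes the counter-probability.
    return (hits * p_assigned + misses * (1 - p_assigned)) / trials


with_39_hits = average_occurred_probability(39, 50, 0.8)
with_40_hits = average_occurred_probability(40, 50, 0.8)
print(f"39 hits: {with_39_hits:.4f}, 40 hits: {with_40_hits:.4f}")
```

Each missing hit lowers the average by (0.8 - 0.2) / 50 = 0.012, which is exactly the gap between the two values.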

The mathematician, however, feels really comfortable up to here. Everything is right and everything works out exactly. That’s how you love it. There are small statistical deviations, he likes to tolerate these. We can even calculate the probabilities for the deviations. We know almost everything. And if, as incidentally in my first attempt, the red ball is only drawn 36 times, then we wonder briefly, calculate the standard deviation and say: “Ok, can happen.”
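The “Ok, can happen” for 36 red balls can be made concrete with the standard deviation of the binomial distribution. This is a small sketch; expressing the deviation in units of standard deviations is a choice made here for illustration, not something the text prescribes.

```python
from math import sqrt


def binomial_std(n, p):
    """Standard deviation of the number of hits in n trials with hit probability p."""
    return sqrt(n * p * (1 - p))


def deviation_in_std(hits, n, p):
    """How many standard deviations the observed hits lie from the expected n * p."""
    return (hits - n * p) / binomial_std(n, p)


std = binomial_std(50, 0.8)
z_36 = deviation_in_std(36, 50, 0.8)
print(f"standard deviation {std:.2f} hits, 36 hits is {z_36:.2f} standard deviations off")
```

A result roughly one and a half standard deviations below expectation is nothing unusual in a single run.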

But now it gets a bit more complicated…

4) A relatively good prophet

In the second example, we are a bit mean to a selected test candidate. We show him a transparent drum, let’s say it contains 100 balls, he even knows that, and let him guess how many red and white balls he thinks are inside. He should then check this “experimentally”.

But we sort a few more white balls in front, visible to him. He estimates that there are 70 red and 30 white balls, because the red balls outnumber the white ones. But we only have 20 white and 80 red balls, so 80:20 as before. He estimates 70:30. Now let’s see what happens in this experiment. And I’ll just use the same random numbers. Then you can also analyse the difference quite well.

The contents of the columns have already been explained above. Now let’s try to interpret the differences. Our candidate still has no idea how many white and red balls there were and should now try to draw conclusions from this data. So he would have expected 35 hits. But it turned out to be 39. Well, if you don’t know any better, you can tolerate that, can’t you? It is also logical that he has a deviation in the column “expected probability” and “average probability”. Since in this experimental set-up the distribution is identical for each draw (with putting back!), he naturally suspects that he might have underestimated the number of red balls. But he cannot be sure.

However, the average probability in column 6 is now higher than the one he expected. This was reversed in the previous example.

In this still quite simple and illustrative example, one can of course simply use the “relative frequency”. So how often did he draw red? Ok, if he doesn’t know better, it is advisable to estimate the probability at 39 (hits) / 50 (attempts), i.e. 78%. In this respect, my measure of the expected and achieved probability does not help that much. It simply delivers the same statement, which here, best expressed, is: “I probably underestimated the number of red balls.” Both results, the number of hits and the comparison of the average probabilities, can only suggest this conclusion. “Coincidentally”, it is also true. But only because we still know the true distribution of the balls.
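The candidate’s three numbers (his expected average probability, the average probability that occurred from his point of view, and the relative frequency) can be put side by side. A minimal sketch, assuming as in the text that he assigns 70% to red and observes 39 red balls in 50 draws; the function name is invented here.

```python
def candidate_view(p_estimate, hits, trials):
    """Return (expected average, occurred average, relative frequency) for a fixed estimate."""
    expected = p_estimate**2 + (1 - p_estimate) ** 2
    occurred = (hits * p_estimate + (trials - hits) * (1 - p_estimate)) / trials
    relative_frequency = hits / trials
    return expected, occurred, relative_frequency


expected, occurred, frequency = candidate_view(0.7, 39, 50)
print(f"expected {expected:.3f}, occurred {occurred:.3f}, relative frequency {frequency:.2f}")
```

Both comparisons point the same way: the occurred average exceeds the expected one, and the relative frequency exceeds the estimate of 70%.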

For an even better illustration, here is the whole thing represented as a diagram:

The purple line represents the current average probability. The jags at the beginning are caused by the fact that every time the more probable event occurs, i.e. a red ball is drawn, it jags upwards, and when the less probable event occurs it jags downwards (even further than the other way round). After 50 trials, the movement is still relatively clear. If one were to make even more attempts, the jags would gradually no longer be recognisable. The participant’s assumption here was that the probability was constant at 70%, i.e. that 70% of the balls were red. Thus the blue line is constant at 58% (0.7 * 0.7 + 0.3 * 0.3). Certainly, one can assume that the number of red balls was underestimated.

But one must always bear in mind that with a test number of 50, there could be such a curve even with a correct estimation (i.e. actually 70% red balls). So if one would now correct the estimation based on this observation, this curve drawn above, one could also make a mistake.
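The jagged line of the current average probability is simply a running average over the probabilities that occurred. The following sketch shows how such a series could be computed; the short outcome sequence is made up for illustration and is not the data behind the diagram.

```python
from itertools import accumulate


def occurred_probabilities(outcomes, p_assigned):
    """Map each outcome (1 = red, 0 = white) to the probability that was assigned to it."""
    return [p_assigned if outcome else 1 - p_assigned for outcome in outcomes]


def running_average(values):
    """Average of the first k values, for k = 1 .. len(values)."""
    return [total / count for count, total in enumerate(accumulate(values), start=1)]


outcomes = [1, 1, 0, 1]                      # hypothetical first four draws
curve = running_average(occurred_probabilities(outcomes, 0.7))
print([round(point, 4) for point in curve])
```

The single white ball pulls the curve down by more than a red ball lifts it, which is the asymmetry of the jags described above.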

The diagram is also interesting from the point of view of reality (known here). The diagram looks like this:

The interpretation is easy with such an aesthetic diagram: the purple curve, i.e. the current average probability that has occurred, jags exactly like the one in the upper diagram, for the same reasons, but it then approaches the blue one, i.e. the expected average probability, beautifully. The value of the average expected probability, here the correct one of 0.8 * 0.8 + 0.2 * 0.2 = 0.68, is higher than in the previous diagram, so that the curves really do converge.

The mathematician would have conventionally thought that the error was due to the too high number of hits. Nevertheless, I am introducing a term that is important for the time being. Its use and understanding could well be significant for further reading.

The term is “the determination”. The question is how much can I “commit” myself in a random experiment with unknown probabilities. The measure for the commitment is the deviation from the average expected probability to be assumed in the case of equal distribution. If the distribution is assumed to be equal, the commitment would be minimal, so one could then intuitively speak of “no commitment”.

This sounds much more complicated than it is. In reality, we have to take n, i.e. possibly many outcomes, as a basis for each event to be predicted. The questions “Who will win the European Football Championship this year?” or “Who will be Formula 1 World Champion?” each have n answers or outcomes (for the European Football Championship finals I know, n=16 at the beginning of the finals; for Formula 1 n = number of participants, but certainly greater than 2). But of course you can also reduce every prediction to two outcomes, such as these: “Germany will be European champion” and the counterstatement: “Germany will not be European champion”. Nevertheless, it makes sense to allow for the n outcomes. In football, in a match, there are already three, i.e. team 1 wins, the match is drawn, or team 2 wins. And the equal distribution would always mean that all n outcomes are equally probable.

If someone wants to make it easy for himself as a prophet, then he simply predicts the probability 1/n for each possible outcome. Just as with dice or roulette. Each number has the probability 1/6 (dice) or 1/37 (roulette). Full stop, end, done with the prognosis. A football match is about to start? All right, I’ll predict 1, X or 2, probabilities? Sure, a third of each. This person obviously hasn’t made up his mind. He doesn’t know any better or it’s the truth. But it does not result in a determination.

Now, if you use the formula for calculating the average expected probability for this simple case, something very boring happens. Each event is assigned 1/n. Each 1/n is multiplied by itself. At the end, the sum is formed. And the sum is 1/n * 1/n + 1/n * 1/n + … + 1/n * 1/n = 1/n. So the expected average probability is 1/n. And what happens in the column of the event that occurred? Oh wonder, yes, it also says 1/n each time, that’s clear. The probability of the event occurring is also always 1/n, because that is exactly what he predicted for the sake of simplicity.

So he achieves exact congruence between the numbers of expected average probability and average probability that has occurred. But whether his forecast was also good and correct? One thing is certain: he did not commit himself. We can only compare him with someone who commits himself. And you will notice that the number of average expected probabilities increases as you consider one of the outcomes to be more probable than another. Now this brings me back to the initial proposition of what and how you measure commitment.

So the more you commit to a random experiment, regardless of whether the outcomes are known or unknown, the larger the number of average expected probabilities becomes. The number 0.9 * 0.9 + 0.1 * 0.1 is larger than 0.8 * 0.8 + 0.2 * 0.2 and this in turn is larger than 0.7 * 0.7 + 0.3 * 0.3. The smallest number that can come out of this form of calculation is 0.5 * 0.5 + 0.5 * 0.5, because that is only 0.5. (The calculation is always the same: Two numbers that add up to 1 are squared and then added together).
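The ordering of these numbers is easy to check for the two-outcome case. A minimal sketch (the function name `determination` follows the term used in this chapter but is otherwise an illustrative choice):

```python
def determination(p):
    """Sum of the squares of a probability and its counter-probability."""
    return p * p + (1 - p) * (1 - p)


for p in (0.5, 0.7, 0.8, 0.9):
    print(f"p = {p:.1f}: determination = {determination(p):.2f}")

# Scan the interval [0, 1] in steps of 0.01 to locate the minimum.
grid = [k / 100 for k in range(101)]
p_min = min(grid, key=determination)
```

The scan over the grid is only a numerical check; algebraically, p * p + (1 - p) * (1 - p) = 0.5 + 2 * (p - 0.5) * (p - 0.5), which is smallest at p = 0.5.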

A small diagram to illustrate this:

The blue line represents the probability. The yellow line is the determination, the sum of the squares of the probability and its counter-probability (the two probabilities themselves add up to 1). The smallest value of the determination is reached at a probability of 0.5. Obviously, the curve is symmetrical around this 0.5, because it does not matter whether you commit yourself in the form “the event is rather probable” or “the event is rather improbable”: you have simultaneously estimated the counter-probability, and this in turn is either (very) large or (very) small.

Determination, therefore, does not mean saying, “Here’s the deal, Bayern will win today” or something like that. Determination is measurable. It measures how far the average expected probability of a random experiment with n outcomes deviates from the minimum determination, the value under uniform distribution. The minimum determination (i.e. none at all) is always the sum of the n products 1/n * 1/n, and this always results in 1/n. So it is a major determination if you say “Schumi wins Formula 1 90% of the time” (oh, he doesn’t race any more?), “Becker wins Wimbledon 60% of the time” (those were the days!), “Jan Ullrich wins the Tour de France 40% of the time” (boo-hoo) or “Germany becomes European champion 22% of the time”. (I calculated the last figure before the European Championship and announced it live on TV during Gerd Delling’s week. Measured against the number of participants (16), where equal distribution would have meant 1/16, i.e. approx. 6%, it is quite a high value, i.e. a high determination. It was even the highest number of all participants, which is why Germany was the favourite according to my computer, although not the best team. That was luck of the draw.)
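For n outcomes, the determination and its distance from the minimum 1/n could be computed as follows. This is a sketch only: the 1-X-2 probabilities in the example are invented, and measuring the distance as a simple difference is one possible reading of “deviation”.

```python
from fractions import Fraction


def determination(probs):
    """Average expected probability of a forecast over n outcomes."""
    if sum(probs) != 1:
        raise ValueError("probabilities must add up to 1")
    return sum(p * p for p in probs)


def excess_over_uniform(probs):
    """Determination minus the minimum 1/n that equal distribution would give."""
    return determination(probs) - Fraction(1, len(probs))


# Hypothetical 1-X-2 forecast: 50% home win, 30% draw, 20% away win.
forecast = [Fraction(50, 100), Fraction(30, 100), Fraction(20, 100)]
uniform = [Fraction(1, 3)] * 3
print(determination(forecast), excess_over_uniform(forecast))
print(determination(uniform), excess_over_uniform(uniform))
```

The uniform forecast has an excess of exactly zero, which corresponds to “no commitment”.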

Determination is measurable. But let’s move on to another experiment…

3) We are still simulating, but at least we are already simulating life.

We now expand our experiment even more. The lottery wheel is now huge but still transparent.
The experiment is carried out as before. Only I take the right to change the number of red and white balls before each draw.

I can now represent any probability. We can assume that even the number of total balls is open, unknown. But you can assume a very large number. Maybe it is 10,000.

So our candidate looks at the drum and tries to determine how many red balls are contained, as a percentage. He guesses or estimates a number. He also has some basis for knowing the rough order of magnitude of this number. But I know the number exactly. Let us now see what will happen.

Here is an illustration of a possible sequence:

Now we try to interpret the results here. We look again at the individual columns. Column 1 shows how many red balls (in percent) were actually in the drum. So we are not yet at reality, because we (almost always) don’t know.

Column 2 still shows the average probability expected on the basis of this prediction (which is, however, exact). Column 3 contains the probability assumed by the test candidate, i.e. predicted, estimated, guessed. This column always differs from column 1. However, he guesses quite well, as can be seen by examining individual figures. But still: there is a deviation from reality (still known here).

In column 4 is the expected average probability from the candidate’s point of view. If the number of red/white balls were not known, it would be his only possibility to check the quality of his numbers. But at least he would have them.

Column 5 contains the random number. If it is smaller than the correct probability, the draw counts as a 1; if it is larger, as a 0. The 1 stands for “red ball drawn”, the 0 stands for “white ball drawn”. Column 6 finally gives the probability of the event that occurred, but again based on the real probability and not on the candidate’s assumption.

And in column 7 I have added who predicted “better” in the individual attempt. Of course, the first player, i.e. myself, has the advantage of “knowing” the probability. Therefore, the first player should have an advantage with each tip. Nevertheless, it is possible that he will be caught up by reality, by what actually happens, and that the random experiment will turn out to his disadvantage. After all, it remains a matter of chance. Someone predicts a higher probability, and rightly so, but the event does not occur. That is commonplace.
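The comparison in column 7 can be expressed as a small rule: whoever assigned the higher probability to what actually happened was “better” in that single attempt. A minimal sketch (function name and the return convention 1, 2 or 0 for a tie are choices made here):

```python
def better_forecaster(p1, p2, occurred):
    """Return 1 or 2 for the player who rated the actual outcome higher, 0 for a tie.

    p1 and p2 are the two players' probabilities for the event;
    occurred says whether the event took place.
    """
    q1 = p1 if occurred else 1 - p1    # probability player 1 gave to what happened
    q2 = p2 if occurred else 1 - p2    # probability player 2 gave to what happened
    if q1 > q2:
        return 1
    if q2 > q1:
        return 2
    return 0


print(better_forecaster(0.7, 0.6, occurred=True))    # event occurs
print(better_forecaster(0.7, 0.6, occurred=False))   # event does not occur
```

This also shows why knowing the truth is no guarantee in the single case: the player with the higher, and even correct, probability loses the comparison whenever the event fails to occur.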

In this run, however, the favourite prevailed. He was better 54 times, the opponent only 46 times.

Despite knowing the truth, there is a discrepancy between the expected and the achieved average probability. Nevertheless, player 1, i.e. reality, prevailed over player 2. Player 1 expected 71.09%, while player 2 expected 72.49%. However, only 69.40% came true. So in this example, the outsider event occurred too often.
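A run of this kind could be simulated roughly as follows. Everything about the setup is an assumption made for this sketch and not taken from the original spreadsheet: the true share of red balls is drawn uniformly at random before each draw, the candidate’s guess is the true share plus a uniform error of at most `max_error`, and the seed is arbitrary. The numbers will therefore not match the ones quoted above.

```python
import random


def simulate_two_players(n_trials, max_error=0.1, seed=1):
    """Compare player 1 (knows the true probability) with player 2 (guesses it)."""
    rng = random.Random(seed)
    sums = {"expected_1": 0.0, "expected_2": 0.0, "occurred_1": 0.0, "occurred_2": 0.0}
    better = {1: 0, 2: 0}
    for _ in range(n_trials):
        p_true = rng.random()                                   # assumed: uniform share of red
        error = rng.uniform(-max_error, max_error)              # assumed: symmetric error
        p_guess = min(0.99, max(0.01, p_true + error))
        red = rng.random() < p_true                             # the actual draw
        q_true = p_true if red else 1 - p_true                  # occurred prob., player 1
        q_guess = p_guess if red else 1 - p_guess               # occurred prob., player 2
        sums["expected_1"] += p_true**2 + (1 - p_true) ** 2
        sums["expected_2"] += p_guess**2 + (1 - p_guess) ** 2
        sums["occurred_1"] += q_true
        sums["occurred_2"] += q_guess
        if q_true > q_guess:
            better[1] += 1
        elif q_guess > q_true:
            better[2] += 1
    averages = {name: total / n_trials for name, total in sums.items()}
    return averages, better


averages, better = simulate_two_players(100)
print(averages)
print(better)
```

Changing the seed or increasing `n_trials` shows how much such a run can vary, which is the point of repeating the experiment.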

Nevertheless, in order to make it even more illustrative here, too, I have created two more diagrams. Look:

The purple curve represents the average expected probability. This can (obviously) never be less than 50%. It also develops reasonably steadily. The jags at the beginning arise because chance sometimes produces very extreme probabilities (one of the two sides close to 100%) and sometimes rather balanced ones. The fluctuations decrease noticeably later on.

The blue curve shows the average probability. This moves very far downwards at the beginning. The reason for this is equally obvious: the outsider event occurred several times. And although the curves almost touch in the middle, the blue one always remains below (due to the initial outsider successes). But 100 attempts are not excessive. One may assume that it would adjust at some point.

The two curves from the perspective of player 2 are somewhat different:

The difference is always greater. And it stays that way until the end. The points are similar, however, because the same random experiment is used. Nevertheless, the fact that player 2 is only guessing the probabilities has an effect. If you notice that the blue curves are not identical either (this is, by the way, also the reason why I made two diagrams), I still have to explain the fact: For player 2, you would actually still have to have the column “Probability of event occurring from player 2’s point of view” in the number columns as well. Since he doesn’t know the truth, he would always have to enter his own estimated value (or its equivalent in the case of non-occurrence) in this column. For the diagram, however, I have used this column of numbers.

I can’t help it, I’ll have to do a second run to see if it can look different, and if so, how. Are you also curious? I’ll also spare you the string of numbers and just show the last two diagrams. First from the perspective of truth, player 1:

I knew there was another way. First, the outsider event also occurred too often. Hence the blue line far in the basement. But then the favourite events come too often and overtake the expected ones. But, remember, 100 attempts are still relatively few.

Here is the perspective of player 2:

Parallels can be seen, but, oh horror, player 2 is ahead at the end. His result is clearly better. I also checked, in fact player 2 has also been “right” more often, the better assessment. A total of 52 times. No wonder that he is in the lead overall.

Now I have increased the number of attempts to 1000. Look at these results:

Here, in the long run, there is almost exact congruence between the two lines. However, I confess that I also had deviations in a few more runs. So, at least over the distance of 1000 trials, it is far from guaranteed that the lines will move so nicely. Apart from that, as you can see, there was also quite a big deviation between the 200th and the 300th attempt.
Of course, this curve is the reality. The exact and known probability was used as a basis.

Here is the picture of player 2.

So I can breathe a sigh of relief. The difference is so obvious. He makes a permanent mistake and it has an effect. That is reassuring.

But if you enjoy pondering the little wonders of mathematics, I will be happy to discuss the following:
You may have noticed that at first glance it does not seem logical that the blue curve should run so obviously below the purple curve. Why this might strike you is easily explained: the candidate always has only a small deviation from reality. He estimates the number of balls. He is always wrong, that is logical. Sometimes he errs in one direction and sometimes in the other. So sometimes he underestimates the number, sometimes he overestimates it. So the error should sometimes fluctuate upwards and sometimes downwards. The deviation of the two lines could be either positive or negative. So you would probably expect the lines to intersect from time to time and the blue one to be too high at times as well.

Now the explanation of why the result nevertheless follows a mathematical logic. If the drum always contained very many or very few red balls, i.e. high or low probabilities and thus a high determination, then this expectation would be absolutely correct. The problem always arises when the probabilities are fairly balanced. So let’s assume the case that there are about 45% red balls. But the candidate, who does a perfectly good job, is mistaken in the sense that he estimates the share to be 55%. Then the following effect occurs: the determination is hit exactly (take the sum of the squares of the probabilities, which is the same for 45:55 and 55:45). So he hits the expected value (by chance) exactly. But the probability he assigns to the occurrence of the event “draw red ball” is much too high. Whenever red is not drawn, i.e. white comes (which is in fact the favourite event, whose probability he underestimated), he can only record the 45% he gave it, and so the occurred value falls behind the expected one. In the long run, this effect works out as seen in the diagram.
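The 45%/55% case can be calculated through. In the sketch below, p is the true probability of red and q the candidate’s estimate; both function names are invented here. The long-run average that occurs from the candidate’s point of view is p * q + (1 - p) * (1 - q), because red comes with probability p and he then notes q, while white comes with probability 1 - p and he then notes 1 - q.

```python
def expected_average(q):
    """What the candidate expects: the determination of his own estimate."""
    return q * q + (1 - q) * (1 - q)


def long_run_occurred_average(p_true, q):
    """What he will see in the long run if the true probability is p_true."""
    return p_true * q + (1 - p_true) * (1 - q)


p_true, q = 0.45, 0.55
expected = expected_average(q)
occurred = long_run_occurred_average(p_true, q)
gap = expected - occurred
print(f"expected {expected:.3f}, occurred in the long run {occurred:.3f}, gap {gap:.3f}")
```

Multiplying out shows that the gap equals (q - p) * (2 * q - 1). If one writes q = p + e with an error e, this becomes e * (2 * p - 1) + 2 * e * e. Under the assumption that the error e is zero on average, the first part cancels out over many draws, while the second part is never negative. That would explain why the occurred curve tends to end up below the expected one even though the errors go in both directions.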

5) In real life

The difference in real life is clear: there, nobody knows the true probabilities. So if we forecast a Bundesliga match day and write down estimates for 1-X-2, adding up to 100% for each match, then this is pure fantasy. It can be a more or less good estimate. To check its quality, you can bet on it, settle up afterwards and count the money, or you can just look at the figures and enjoy them or doubt them. After the games are over, you can note next to each forecast which event occurred. But how does that help us?

Well, with the help of this method we can now actually check our own assessments. Not a single one of them, but after a large number of predictions and results you gradually get an impression of whether you have estimated well or badly.

I can simply put down my figures for the last Bundesliga season here, and we’ll try to interpret them:

First the predictions for 1-X-2

(These figures are calculated as the sum of the probabilities for each 1-X-2. As you can see, the sum is 306 in each case, which corresponds to the number of games in a season).
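How such sums come about can be sketched as follows. The three matches and their results are invented purely for illustration; only the summing rule is taken from the text.

```python
# Hypothetical forecasts for three matches: (home win, draw, away win).
forecasts = [
    (0.50, 0.27, 0.23),
    (0.38, 0.29, 0.33),
    (0.62, 0.22, 0.16),
]
results = ["1", "X", "1"]  # made-up outcomes, for illustration only

labels = ("1", "X", "2")
# Expected count per outcome = sum of the probabilities written down for it.
expected = {label: sum(f[i] for f in forecasts) for i, label in enumerate(labels)}
actual = {label: results.count(label) for label in labels}

for label in labels:
    print(label, round(expected[label], 2), actual[label])

# Each forecast sums to 1, so the expected counts sum to the number of matches.
print(round(sum(expected.values()), 2))  # 3.0
```

With a full season of 306 matches the same sum would come to 306.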

So I “underestimated” the home advantage somewhat in the 2007/2008 Bundesliga season. There were slightly more home wins than expected, slightly more draws, but too few away wins. My computer reacts independently and automatically to such developments. It is questionable, however, whether the trend towards a greater home advantage will be confirmed. The computer therefore reacts rather slowly, and in the long run it has proved right to adjust the parameters only slowly.

The same tendency, of course, with the goals scored:

Too many goals for the home teams, too few for the away teams. The difference here is actually rather frighteningly large. But if you look at the figures from previous years, you realise that last season was rather just an “outlier”.

The other two figures, however, the expected and the actual probability, look much better again:

Now let’s try, in keeping with the chapter, to interpret these two numbers. First of all, it is pleasing when the numbers are close to each other; this rules out gross misjudgements quite reliably. Since the expected probability is slightly higher than the actual probability, it suggests that the favourites won a little too seldom. So, unlike what the statistics on home and away wins suggest, the favourites were if anything overestimated, albeit very slightly (in practice the home team is usually the favourite). This means that the slight surplus of home wins was due more to surprises than to favourites. One would therefore conclude once again that the numbers are on the whole about right, and that the discrepancies between home and away are more likely to be coincidental.

Since it is a bit boring to see only the results of one season, I also include the figures of the 2006/2007 season here:

Predictions for 1-X-2

The trend here is the other way round: too few home wins in relation to those expected. Slightly too few draws, but too many away wins.

The predictions for home-away goals

Here the ratio is even less favourable for the home teams. However, you can also see that the predictions for the two years taken together balance out quite well. So it is probably just a (normal, expected, permissible) statistical deviation. Now to the expected and actual probabilities:

Here, however, the trend is clearly confirmed, with nothing to offset it: the favourites were overestimated. There were too few favourite wins and too few home wins. And mostly the home team is the favourite.

Still, the numbers and the overall deviations are not necessarily worrying.

6) Weather forecast

Now I can give you another example from real life. You can even try it out yourself to see how my method works and how good the weather forecasts are. Or you can compete against the weather services. So let’s look at the simple example of the rain forecast. I suspect that rain forecasts are given as probabilities because people complained that they were promised rain and none came. Or were the complaints rather the other way round?

In any case, predicting rain is clearly a real problem, especially when the forecast is supposed to cover an entire area: it can rain in one region and not in another. The question of timing is also unresolved, as is the amount of rain that suffices to classify the forecast as “occurred” (have you ever felt a single drop and wondered whether it was raining now?).

For our simplified example, let’s pretend for a moment that we want to forecast the likelihood of rain at a particular location at a particular time, and can then rate it as having occurred or not at that time. I still wonder whether meteorologists actually check their forecast values, at least by recording whether the forecast event occurred. I imagine it like this: the forecast probability of rain was 70%. The next day a note is made: yes, it rained, or no, it did not. If that is the case, then the quality of these forecasts could of course be checked in exactly the same way using my method.
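Such a check of a forecast log could look roughly like this. It is a minimal sketch, assuming one yes/no note per day and the two figures used throughout this chapter (sum of squares as the expected value, the probability given to what actually happened as the achieved value). The function name and the five-day log are made up for illustration.

```python
def evaluate(log):
    """log: list of (forecast probability of rain, True if it rained).

    Returns (average expected probability, average achieved probability).
    """
    expected = 0.0
    achieved = 0.0
    for q, rained in log:
        expected += q * q + (1 - q) * (1 - q)  # sum of squares for this day
        achieved += q if rained else 1 - q     # probability given to what happened
    n = len(log)
    return expected / n, achieved / n


# Made-up five-day log, for illustration only.
log = [(0.7, True), (0.7, False), (0.2, False), (0.9, True), (0.5, True)]
expected, achieved = evaluate(log)
print(round(expected, 3), round(achieved, 3))
```

Five days are of course far too few to judge anything; the two figures only become meaningful over a long series.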

And for my little example, I introduced an additional component: Another player who, in principle, knows the overall chance of rain over, say, a month quite well, and is thus also correct on average, but ultimately has no real assessment for the individual forecast. Let us first take a brief look at the example and then try to interpret it.

So let’s interpret this result: The first participant in this case knows the truth. As in the example before, this is the one who specifies the number of balls. This is unrealistic and unfair, but good enough for the example to illustrate the effects. He has the correct estimate for the average expected probability and the correct hit expectation (in the example, 30.46 expected rainy days out of 50, i.e. about 61%; a true rainy season). The random number then decides whether it rains or not. Column 5, as usual, gives the probability of the event that occurred.

We first interpret the result of player 1, who in this case knew the truth exactly. He expected 30.46 hits, but 31 occurred. That is pretty much correct, but still pure coincidence. The larger deviation in the columns “average expected” and “average occurred” provides us with a new insight: this deviation is rather disturbingly large. How does it come about? Well, a closer look reveals that although the number of hits may be correct, the “wrong” events have come. The distribution of hits among the probabilities is out of proportion: the more probable event has occurred too often. You can compare columns 4 and 5 directly with each other, case by case. You will see that there is often a deviation in favour of the greater estimated probability. And it does not matter whether the favourite event was “it rains” or “it does not rain”.

So if we did not know the truth, the result would give cause for concern: it would suggest that we had underestimated the favourite event and that the example would have allowed a higher determination (i.e. another player with a “better” estimate of the truth could have assumed a higher determination and thus defeated us). Two reminders are in order here. First, this is an Excel simulation based on knowledge of the actual probabilities, and this “truth” is not usually known in reality; I repeat this from time to time so that it sinks in. Second, the pseudo-random numbers used ensure equalisation in the long run anyway, which is not guaranteed in any random experiment carried out in practice. As the result came about here, it is a purely statistical, random deviation that we, like so many other things, simply have to tolerate.

But now let’s turn to the opponent’s assessment: He had a good idea of how often it would rain on the given 50 days (possibly based on statistics from previous years). This estimate would possibly satisfy an old-fashioned mathematician: He compares hit expectation and hits scored and congratulates the prophet: “Well done. You can’t expect more.”

We, however, with the knowledge we have gained, can expose him as a charlatan. He may have guessed the total number of hits, but he was wrong in almost every single case. This has the following consequence: his expected and achieved values are close to each other, but they are not nearly high enough. For even we had assumed a very high determination (69.05%), and, as already mentioned, even that was not high enough for this series of trials.
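The comparison of the two players can be reproduced in a small simulation. This is only a sketch and not the Excel sheet described in the text. The uniformly drawn "true" daily probabilities, the seed and all names are assumptions made here for illustration.

```python
import random


def averages(forecasts, rained):
    """Average expected and average achieved probability for one forecaster."""
    n = len(forecasts)
    expected = sum(q * q + (1 - q) * (1 - q) for q in forecasts) / n
    achieved = sum(q if r else 1 - q for q, r in zip(forecasts, rained)) / n
    return expected, achieved


rng = random.Random(2008)  # arbitrary seed, only to make the run repeatable
days = 50

# Assumption: the "true" daily rain probabilities are drawn uniformly at random.
truth = [rng.uniform(0.05, 0.95) for _ in range(days)]
rained = [rng.random() < p for p in truth]

player1 = truth                       # knows each day's true probability
player2 = [sum(truth) / days] * days  # only knows the overall average

print("expected rainy days:", round(sum(player1), 2), round(sum(player2), 2))
print("player 1:", averages(player1, rained))
print("player 2:", averages(player2, rained))
```

Both players expect the same number of rainy days, yet player 1's expected determination can never be lower than player 2's, because the sum of squares is smallest when every day receives the same average value.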

I also show this in the diagram, first the perspective of player 2:

These curves look fantastic and, if there were no opponent, would definitely cause satisfaction. One has given a certain determination, this is small but obviously correct. The curves converge and meet almost exactly towards the end.
Of course, this is because player 2 hit the number of hits pretty well. In other words: there was no deviation by his forecast from those of previous years or however he may have derived his values. The long-standing statistics were confirmed.

But here is the diagram from the point of view of player 1, i.e. from the point of view of reality, of the correct assessments:

Although the values here do differ noticeably (please bear in mind the short run of 50 trials, 50 days), they are nevertheless so much higher that one simply has to recognise that this forecast was better. The experiment itself allowed for a higher determination. The player has recognised this and also estimated it that way. A deviation nevertheless occurs, a statistical coincidence.

We can even look at the four curves together in this case:

Player 1’s significantly higher level always outweighs the slightly larger deviation. Still, consider two points. First, the set of predicted events was absolutely identical. Second, a person who always predicted 50-50 would not have a deviation at any single point. This merely shows why someone predicting at a lower level has a much easier time achieving his expectation.

We need to look a little more closely at why he seems to have such good scores and would not doubt his results if there were no comparison: he has forecast according to an average. This can lead to good results if the average is achieved overall. In other words, there is no deviation from the basis for the forecast. Our result, that of player 1, however, comes from the fact that we have given each day individually, i.e. without any long-term expectations or insights. We did it by analysing, let’s say, the high and low pressure areas, wind movements, satellite images, air pressure, etc. This resulted in a forecast for the overall weather. This also resulted in a forecast for the total number of hits. But this was only the sum of the individual probabilities.

So if, in a comparable period of time, the average should not occur but a completely different number, our forecast would still be as good as it is: every day a forecast, every day perhaps a small deviation from reality, but on the whole good. The other person, who forecasts according to the average value, would have made a clearly recognisable (also systematic) mistake here. He has assumed the average. But this can easily change due to external circumstances. That’s where he gets the penalty.

In any case, good forecasting is best recognised by the fact that one expects a high determination and also achieves it (approximately). Forecasting a low determination and then achieving it is much easier (as for the one who does not commit at all).

So the result of player 1 would also be seen in practice as clearly better, despite the larger deviation. He dared to set a high value and also achieved it (approximately). The experiment allowed a higher determination. The man recognised that.

You should never forget that in reality there is always the doubt. Here our player 1 had it easy. In reality, however, he would not have known the “correct” assessments either. Nevertheless, even if he had not known the truth, the result would have been better.

7) Summary

I will try to summarise again what we are actually investigating here:

In order to test the quality of our predictions, we have to achieve the best possible congruence between the average expected probability and the average actual probability. If we succeed in keeping these numbers close together, we have predicted well in our sense. However, the goal must be to get the numbers as high as possible.

For, as we have seen, someone who predicts only 1/n for each outcome, i.e. 1/3 in the case of football matches, meets his own expectation effortlessly in terms of the deviation; it will be small. If the experiment allows nothing else, we do have to keep 1/n as the estimate, or at least seriously consider it (examples: roulette, dice). But if the experiment allows us to commit ourselves, we have to get the two numbers as high as possible while keeping them close to each other. Realistically, we have to commit as far as the experimental set-up allows, in each individual case. We must, so to speak, recognise the inherent determination of the experiment in order to become good prophets.

Of course, the other numbers also contribute to the verification: the hit expectation (one can and should always carry this along; in earlier times it was the only criterion, so to speak) and the goal expectation (especially in football). Nevertheless, the criterion of average expected/achieved probabilities remains a very essential one.
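The case of the 1/n forecaster is easy to verify. The sketch below assumes three possible outcomes numbered 0, 1 and 2; the function name and the sequence of results are arbitrary and invented for illustration.

```python
def uniform_check(n, outcomes):
    """Expected and achieved average probability for someone who always says 1/n."""
    forecast = [1 / n] * n
    expected = sum(p * p for p in forecast)  # n * (1/n)**2 = 1/n
    achieved = sum(forecast[o] for o in outcomes) / len(outcomes)
    return expected, achieved


# Any sequence of results gives the same picture; this one is made up.
print(uniform_check(3, [0, 0, 2, 1, 0, 2]))
```

Whatever happens, he is credited with 1/n for the event that occurred, so expected and achieved values coincide, only at the lowest possible level.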

After all, if you want to think about it again, it is not very easy to get a statement from a series of completely independent, non-repeatable events with unknown probabilities of occurrence. At least I have now provided you with an additional criterion. And: it did not exist in mathematics until now.

Another point worth mentioning: the average expected probability of approx. 38.4% that I determined for the 2006/2007 and 2007/2008 Bundesliga seasons together provides a further clue. Such a “determination” corresponds roughly to figures like 51.7% for the favourite, 25% for the draw and 23.3% for the underdog, as you can see by adding up the squares of the individual values (0.517*0.517 + 0.25*0.25 + 0.233*0.233 = 0.384).
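The arithmetic can be checked in a few lines. This sketch takes the three values as they appear in the sum of squares (0.517, 0.25 and 0.233); the variable names are chosen here for illustration.

```python
shares = [0.517, 0.25, 0.233]  # favourite, draw, underdog, as used in the sum above

assert abs(sum(shares) - 1.0) < 1e-9        # the three shares cover all outcomes
determination = sum(p * p for p in shares)  # sum of squares
print(round(determination, 3))              # 0.384
```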

So the distribution I assumed for the favourite is already at over 50%. So emotionally speaking, I’ve already made a fairly firm decision. Because there are quite a few very even games in the Bundesliga. Nevertheless, the favourite is on average over 50% (better: the favourite event; in the rarest cases, however, it is the draw, but occasionally it is the away team; most often it is the home team).

So we have found a verification method for our own predictions. Nevertheless, it would of course be interesting to see how the results compare when we set two prophets against each other.

I realise that these statements can all sound a little confusing. Nevertheless, I would at least like to mention this: even if I have achieved a (seemingly) good approximation to reality with my results, I cannot completely rule out that another prophet predicts an even higher expected probability and achieves it. The nature of the experiment puts us on thin ice. It may be that the experiment itself (non-repeatable, independent events with unknown probabilities) allows a higher determination. So a second player could possibly have expected 41% and also achieved 41%. He would simply have to have predicted a higher probability than I did for the event that eventually occurred. Not in every case, but in a sufficiently large number of them. One could then say: here “he knew something” that I did not know. He suddenly writes down 70% for an event such as Wolfsburg winning against Stuttgart, and Wolfsburg then wins. I had assumed 40% as normal, because I “didn’t know anything”. If he had misjudged that, he could still get away scot-free in that one case. Only if he often takes such risks and writes down excessive, unrealistic chances would the mathematics strike inexorably and punish him in the form of high deviations. But if he really knows it that well, then he may, no, he must predict it that way.

However, the availability and thus verifiability of my other numbers (expected 1-X-2 and goal expectations, although I would have to see his numbers first for that) makes it rather unlikely that there is a better prediction.

But if we want to determine the quality of several prophets, tipsters or players in the long term, and by means other than betting and counting money, it makes sense to look at the system of “perfect betting” that I invented. Without a financial stake it works as a betting game, and should then be called the “perfect betting game”; with a financial stake, it is the perfect method of settling two or more assessments against each other, purely on the basis of the assigned probabilities.

Investigations in this regard can be found in the chapters “Betting games” and “The perfect betting game”.