An interesting article by statistician Jeff Sonas.
While I am not enough of a mathematician or statistician to comment on this, I do know it does not take a genius to know that something is wrong with the FIDE rating formula. This is an interesting article about proposed changes for the system. I saved this article, as it seems whenever I count on a website to save something they delete it! So here it is.
by Jeff Sonas
Every three months, FIDE publishes a list of chess ratings for thousands of players around the world. These ratings are calculated by a formula that Professor Arpad Elo developed decades ago. This formula has served the chess world quite well for a long time, but I believe that the time has come to make some significant changes to that formula.
At the start of August, I participated in a four-day conference in Moscow about rating systems, sponsored by WorldChessRating. One of the conclusions from this conference was that an extensive "clean" database of recent games was needed, in order to run tests on any new rating formula that was developed. In subsequent weeks, Vladimir Perevertkin collected the raw results from hundreds of thousands of games between 1994 and 2001, and I have imported that information into my own database for analysis.
I have experimented with lots of different rating formulas, generating historical ratings from 1994-2001 based upon those formulas. For instance, we can see what would have happened if all of the blitz and rapid games were actually included in the rating calculation, or if different coefficients within the formulas were adjusted. All of the following suggestions are based upon that analysis.
EXECUTIVE SUMMARY – FOUR MAIN SUGGESTIONS
Suggestion #1: Use a more dynamic K-Factor
I believe that the basic FIDE rating formula is sound, but it does need to be modified. Instead of the conservative K-Factor of 10 which is currently used, a value of 24 should be used. This will make the FIDE ratings more than twice as dynamic as they currently are. The value of 24 appears to be the most accurate K-Factor, as well. Ratings that use other K-Factors are not as successful at predicting the outcomes of future classical games.
Suggestion #2: Get rid of the complicated Elo table
Elo's complicated table of numbers should be discarded, in favor of a simple linear model where White has a 100% expected score with a 390-point (or more) rating advantage, and a 0% expected score with a 460-point (or more) rating disadvantage. Other expected scores in between can be interpolated with a simple straight line. Note that this assigns a value of 35 rating points to having the White pieces, so White will have an expected score of 50% with a 35-point rating deficit, and an expected score of 54% if the players' ratings are identical. This model is far more accurate than Elo's table of values. Elo's theoretical calculations do not match the empirical data from actual results, and do not take the color of pieces into account either. They also show a statistical bias against the higher-rated players.
Suggestion #3: Include faster time control games, which receive less weight than a classical game
Classical games should be given their normal importance. Games played at the "modern" FIDE control are not as significant, and thus should only be given an 83% importance. Rapid games should be given a 29% importance, and blitz games an 18% importance. The choice to rate these types of games will actually improve the ratings' ability to predict the outcome of future classical games. By using these particular "weights", the ratings will be more accurate than if rapid and blitz games were completely excluded. The exact values of 83%, 29%, and 18% have been optimized for maximal accuracy and classical predictive power of the ratings. If you prefer a more exact definition that recognizes different types of rapid controls, or one that incorporates increments, I have included a graph further down which allows you to calculate more precise coefficients for arbitrary time controls.
Suggestion #4: Calculate the ratings monthly rather than quarterly
There is no reason why rating lists need to be out of date. A monthly interval is quite practical, considering that the calculation time for these ratings is almost negligible. The popularity of the Professional ratings shows that players prefer a more dynamic and more frequently-updated list.
A SIMPLER FORMULA
In some ways, the Elo approach is already very simple. Whenever a "rated" game of chess is played, the difference in FIDE ratings is checked against a special table of numbers to determine what each player's "predicted" score in the game should be. If you do better than that table predicts, your rating will increase by a proportionate amount. If you do worse than "predicted", your rating will decrease correspondingly.
Let's say, for instance, that you have a rating of 2600, and you play a 20-game match against somebody rated 2500. In these games, your rating advantage is 100 points. The sacred Elo table of numbers tells us that your predicted score in that match is 12.8/20. Thus if you actually score +5 (12.5/20), that would be viewed as a slightly subpar performance, and your rating would decrease by 3 points as a result.
However, the unspoken assumption here is that the special table of numbers is accurate. Today's chess statistician has the advantage of incredible computing power, as well as millions of games' worth of empirical evidence. Neither of these resources was available to Elo at the time his table of numbers was proposed. Thus it is possible, today, to actually check the accuracy of Elo's theory. Here is what happens if you graph the actual data:
Elo's numbers (represented by the white curve) came from a theoretical calculation. (If you care about the math, Elo's 1978 book tells us that the numbers are based upon the distribution of the difference of two Gaussian variables with identical variances but different means.) This inverse exponential distribution is so complicated that there is no way to provide a simple formula predicting the score from the two players' ratings. All you can do is consult the special table of numbers.
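For readers who want to see roughly where the table's numbers come from, here is a minimal Python sketch. It assumes (this is not stated in the article) that each player's performance has a standard deviation of 200 rating points, so the difference of two performances has a standard deviation of 200 times the square root of 2. The function name `elo_expected_score` is purely illustrative, and the result approximates the table rather than reproducing FIDE's published values exactly. It reuses the 2600 vs. 2500 match from the earlier example.

```python
import math


def elo_expected_score(rating_diff, sigma=200.0):
    """Approximate expected score for a player rated `rating_diff` points higher.

    Assumption: each performance is normally distributed with standard
    deviation `sigma`, so the difference of two performances has standard
    deviation sigma * sqrt(2).
    """
    z = rating_diff / (sigma * math.sqrt(2.0))
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# The 20-game match from the text: 2600 vs. 2500, actual score 12.5/20.
per_game = elo_expected_score(100)
# The table works in whole percentage points, so round before scaling up.
match_expected = 20 * round(per_game, 2)
# Rating change with the current K-Factor of 10.
rating_change = 10 * (12.5 - match_expected)

print(f"per-game expectancy: {per_game:.3f}")
print(f"predicted match score: {match_expected:.1f}/20")
print(f"rating change: {rating_change:+.1f}")
```

Under these assumptions the per-game expectancy for a 100-point favorite comes out near 64%, which is consistent with the 12.8/20 prediction and the 3-point loss quoted in the example.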
I don't know why it has to be so complicated. Look at the blue line in my graph. A straight line, fitted to the data, is clearly a more accurate depiction of the relationship than Elo's theoretical curve. Outside of the +/- 350 range, there is insufficient data to draw any conclusions, but this range does include well over 99% of all rated games. I have a theory about where Elo's calculations may have gone astray (having to do with the uncertainty of rating estimates), but the relevant point is that there is considerable room for improvement in Elo's formula.
Why do we care so much about this? Well, a player's rating is going to go up or down, based on whether the player is performing better than they "should" be performing. If you tend to face opponents at the same strength as you, you should score about 50%; your rating will go up if you have a plus score, and down if you have a minus score. However, what if you tend to face opponents who are 80-120 points weaker than you? Is a 60% score better or worse than predicted? What about a 65% score? More than half of the world's top-200 actually do have an average rating advantage of 80-120 points, across all of their games, so this is an important question.
Let's zoom into that last graph a little bit (also averaging White and Black games together). The white curve in the next graph shows you your predicted score from the Elo table, if you are the rating favorite by 200 or fewer points. That white curve is plotted against the actual data, based on 266,000 games between 1994 and 2001, using the same colors as the previous graph:
There is a consistent bias in Elo's table of numbers against the higher-rated player. To put it bluntly, if you are the higher-rated player, a normal performance will cause you to lose rating points. You need an above-average performance just to keep your rating level. Conversely, if you are the lower-rated player, a normal performance will cause you to gain rating points.
For instance, in that earlier example where you had a rating of 2600 and scored 12.5/20 against a 2500-rated opponent, you would lose a few rating points. As it turns out, your 12.5/20 score was actually a little BETTER than would be expected from the ratings. Using the blue line in the last graph, you can see that a 100-point rating advantage should lead to a score slightly over 61%, and you actually scored 62.5%. Thus, despite a performance that was slightly above par, you would actually lose rating points, due to the inaccuracy of Elo's table of numbers.
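The "slightly over 61%" figure can be checked with a short Python sketch. It uses the linear coefficients given in the formula summary later in this article (0.541767 and 0.001164, with the advantage capped at +390 and -460) and averages the favorite's expectancy over a game with White and a game with Black, which is how the zoomed-in graph is described. The function names are illustrative only, and this is a sketch of the arithmetic rather than the exact procedure used for the graph.

```python
INTERCEPT = 0.541767   # White's expected score when ratings are equal
SLOPE = 0.001164       # change in expected score per rating point


def white_expected(white_advantage):
    """White's expected score under the linear model, with the stated caps."""
    clamped = max(-460, min(390, white_advantage))
    return INTERCEPT + SLOPE * clamped


def favorite_expected(advantage):
    """Favorite's expected score, averaged over one White and one Black game."""
    as_white = white_expected(advantage)
    # With Black, White's rating advantage is the negative of the favorite's.
    as_black = 1.0 - white_expected(-advantage)
    return (as_white + as_black) / 2.0


linear_prediction = favorite_expected(100)
actual = 12.5 / 20
print(f"linear model: {linear_prediction:.4f}, actual: {actual:.4f}")
```

Averaging over colors cancels the White bonus, so within the capped range the favorite's expectancy reduces to 50% plus 0.1164% per rating point.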
It may seem trivial to quibble over a few rating points, but this is a consistent effect which can have large cumulative impact over time. For instance, it appears that this effect cost Garry Kasparov about 15 rating points over the course of the year 2000, and the same for Alexei Shirov. With their very high ratings, each of those players faced opposition that (on average) was weaker by 80-120 points, and so the ratings of both Kasparov and Shirov were artificially diminished by this effect.
In contrast, Vladimir Kramnik also had a high rating in 2000, but due to his large number of games against Kasparov during that year, Kramnik's average rating advantage (against his opponents) was far smaller than Kasparov's or Shirov's. Thus, this bias only cost Kramnik 1 or 2 rating points over the course of the year 2000.
The bias also has an effect on the overall rating pool. It compresses the ratings into a smaller range, so the top players are underrated and the bottom players are overrated. Players who tend to be the rating favorites in most of their games (such as the top-100 or top-200 players) are having their ratings artificially diminished due to this effect. Thus the rise in grandmaster ratings that we have seen in recent years would have been even greater had a more accurate rating system been in place. You will see an illustration of this later on, when we look at some monthly top-ten lists since 1997 using various rating formulas.
It's great to have some sort of scientific justification for your formula, as Professor Elo did, but it seems even more important to have a formula which is free of bias. It shouldn't matter whether you face a lot of stronger, weaker, or similar-strength opponents; your rating should be as accurate an estimate of your strength as possible, and this simply does not happen with Elo's formula. My "linear model" is much simpler to calculate, easier to explain, significantly more accurate, and shows less bias.
A MORE DYNAMIC FORMULA
For all its flaws, the Elo rating formula is still a very appealing one. Other rating systems require more complicated calculations, or the retention of a large amount of historical game information. However, the Professional ratings are known to be considerably more dynamic than the FIDE ratings, and for this reason most improving players favor the Professional ratings. For instance, several months ago Vladimir Kramnik called the FIDE ratings "conservative and stagnant".
Nevertheless, it is important to realize that there is nothing inherently "dynamic" in Ken Thompson's formula for the Professional ratings. And there is nothing inherently "conservative" in Arpad Elo's formula for the FIDE ratings. In each case there is a numerical constant, used within the calculation, which completely determines how dynamic or conservative the ratings will be.
In the case of the Elo ratings, this numerical constant is the attenuation factor, or "K-Factor". In case you don't know, let me briefly explain what the K-Factor actually does. Every time you play a game, there is a comparison between what your score was predicted to be, and what it actually was. The difference between the two is multiplied by the K-Factor, and that is how much your rating will change. Thus, if you play a tournament and score 8.5 when you were predicted to score 8.0, you have outperformed your rating by 0.5 points. With a K-Factor of 10, your rating would go up by 5 points. With a K-Factor of 32, on the other hand, your rating would go up by 16 points.
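The rule just described is a one-line calculation. Here is a minimal Python sketch of it, using the tournament numbers from the paragraph above; the function name `rating_change` is a hypothetical label, not something taken from FIDE's regulations.

```python
def rating_change(actual_score, predicted_score, k_factor):
    """Rating change: the K-Factor times (actual score minus predicted score)."""
    return k_factor * (actual_score - predicted_score)


# Scoring 8.5 when 8.0 was predicted, under several K-Factors.
for k in (10, 24, 32):
    print(f"K-Factor {k}: {rating_change(8.5, 8.0, k):+.1f} points")
```

The same half-point overperformance is worth 5 points at a K-Factor of 10 and 16 points at 32, so the K-Factor alone sets how strongly the rating reacts to each result.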
In the current FIDE scheme, a player will forever have a K-Factor of 10, once they reach a 2400 rating. With a K-Factor of 5, the FIDE ratings would be far more conservative. With a K-Factor of 40, they would leap around wildly, but the ratings would still be more accurate than the current ratings. The particular choice of 10 is somewhat arbitrary and could easily be doubled or tripled without drastic consequences, other than a more dynamic (and more accurate) FIDE rating system.
As an example of how the K-Factor affects ratings, consider the following graph for Viktor Korchnoi's career between 1980 and 1992. Using the MegaBase CD from Chessbase, I ran some historical rating calculations using various K-Factors, and this graph shows Korchnoi's rating curve for K-Factors of 10, 20, and 32. Note that these ratings will differ from the actual historical FIDE ratings, since MegaBase provides a different game database than that used by the FIDE ratings.
You can see that the red curve (K-Factor of 10) is fairly conservative, slower to drop during 1982-3 when Korchnoi clearly was declining, and remaining relatively constant from 1985 through 1992, almost always within the same 50-point range. For a K-Factor of 20, however, Korchnoi's rating jumps around within a 100-point range over the same 1985-1992 period (see the blue curve), whereas with a K-Factor of 32 there is almost a 200-point swing during those years (see the yellow curve). Thus the K-Factor can easily cause an Elo formula to be either very conservative or very dynamic.
For the Thompson formula, there is also a numerical constant which determines how dynamic the ratings will be. The current Professional ratings use a player's last 100 games, with the more recent games weighted more heavily. If they used the last 200 games instead, the ratings would be sluggish and resistant to change. If they used the last 50 games, they would be even more dynamic. You might think that Professional ratings using only the last 50 games would be far more dynamic than any reasonable Elo-style formula, but in fact the Elo formula with a K-Factor of 32 seems to be even more dynamic than a Thompson formula which uses only the last 50 games. Take a look at the career rating curve for Jan Timman from 1980 to 1992, using those two different formulas. Again, I did these calculations myself, using data from MegaBase 2000.
It is clear that the red curve (Elo-32) is even more dynamic than the blue curve (Thompson-50), with higher peaks and lower valleys. However, it should also be clear that the two rating systems are very similar. If you could pick the right numerical constants, the Thompson and Elo formulas would yield extremely similar ratings. In these examples, I chose Korchnoi and Timman more or less at random; my point was to show that there is nothing inherently "dynamic" about the Professional ratings or "conservative" about the FIDE ratings. It is really almost a mathematical accident that they are this way, unless perhaps the initial Thompson formula was specifically intended to be more dynamic than FIDE's ratings.
So, it is clear that the FIDE ratings could be made more dynamic simply by increasing the K-Factor. Is this a good idea?
In an attempt to answer this question, I have run many rating calculations for the time period between 1994 and 2001, using various formulas. In each case, I retroactively determined how accurate the ratings were at predicting future results. Based on those calculations, it became possible to draw a curve showing the relationship between K-Factor and accuracy of the ratings:
It appears that a K-Factor of 24 is optimal. For smaller values, the ratings are too slow to change, and so ratings are not as useful in predicting how well players will do each month. For larger values, the ratings are too sensitive to recent results. In essence, they "overreact" to a player's last few events, and will often indicate a change in strength when one doesn't really exist. You can see from this graph that even using a super-dynamic K-Factor of 40 would still result in greater accuracy than the current value of 10.
RAPID AND BLITZ
Recent years have seen an increased emphasis on games played at faster time controls. Official FIDE events no longer use the "classical" time controls, and rapid and blitz games are regularly used as tiebreakers, even at the world championship level. There are more rapid events than ever, but rapid and blitz games are completely ignored by the master FIDE rating list. Instead, a separate "rapid" list, based on a small dataset, is maintained and published infrequently and sporadically.
For now, to keep things simple, I want to consider only four classifications of time controls. The "Classical" time control, of course, refers to the traditional time controls of two hours for 40 moves, one hour for 20 moves, and then half an hour for the rest of the game. "Modern" (FIDE) controls are at least 90 minutes per player per game, up to the Classical level. "Blitz" controls are always five-minute games with no increments, and "Rapid" has a maximum of 30 minutes per player per game (or 25 minutes if increments are used). I understand that these four classifications don't include all possible time controls (what about g/60, for instance?). However, please be patient. I will get to those near the end of this article.
The question of whether to rate faster games, and whether to combine them all into a "unified" list, is a very controversial topic. I don't feel particularly qualified to talk about all aspects of this, so as usual I will stick to the statistical side. Let's go through the argument, point-by-point.
(1) I am trying to come up with a "better" rating formula.
(2) By my definition, a rating formula is "better" if it is more accurate at
predicting future classical games.
(3) The goal is to develop a rating formula with "optimal" classical predictive
power.
(4) Any data which significantly improves the predictive power of the rating
should be used.
(5) If ratings that incorporate faster-time-control games are actually "better"
at predicting the results of future classical games, then the faster games
should be included in the rating formula.
It is clear that Modern, Rapid, and Blitz games all provide useful information about a player's ability to play classical chess. The statistics confirm that conclusion. However, the results of a single Classical game are more significant than the results of a single Modern game. Similarly, the results of a single Modern game are more significant than the results of a single Rapid game, and so on.
If we were to count all games equally, then a 10-game blitz tournament, played one afternoon, would count the same as a 10-game classical tournament, played over the course of two weeks. That doesn't feel right, and additionally it would actually hurt the predictive power of the ratings, since they would be unduly influenced by the blitz results. Thus it appears that the faster games should be given an importance greater than zero, but less than 100%.
This can be accomplished by assigning "coefficients" to the various time controls, with Classical given a coefficient of 100%. For example, let's say you did quite well in a seven-round Classical tournament and as a result you would gain 10 rating points. What if you had managed the exact same results in a seven-round Rapid tournament instead? In that case, if the coefficient for Rapid time controls were 30%, then your rating would only go up by 3 points, rather than 10 points.
How should those coefficients be determined? The question lies somewhat outside of the realm of statistics, but I can at least answer the statistical portion of it. Again, I must return to the question of accuracy and predictive power. If we define a "more accurate" rating system as one which does a better job of predicting future outcomes than a "less accurate" rating system, then it becomes possible to try various coefficients and check out the accuracy of predictions for each set. Data analysis would then provide us with "optimal" coefficients for each time control, leading to the "optimal" rating system.
Before performing the analysis, my theory was that a Modern (FIDE) time control game would provide about 70%-80% as much information as an actual classical game, a rapid game would be about 30%-50%, and a blitz game would be about 5%-20%. The results of the time control analysis would "feel" right if it identified coefficients that fit into those expected ranges. Here were the results:
The "optimal" value for each coefficient appears as the peak of each curve. Thus you can see that a coefficient of 83% for Modern is ideal, with other values (higher or lower) leading to less accurate predictions in the ratings. Similarly, the optimal value for Blitz is 18%, and the optimal value for Rapid is 29%. Not quite in the ranges that I had expected, but nevertheless the numbers seem quite reasonable.
A MORE ACCURATE FORMULA
To summarize, here are the key features of the Sonas rating formula:
(1) Percentage expectancy comes from a simple linear formula:
White's %score = 0.541767 + 0.001164 * White rating advantage,
treating White's rating advantage as +390 if it is better than +390,
or -460 if it is worse than -460.
(2) Attenuation factor (K-Factor) should be 24 rather than 10.
(3) Give Classical games an importance of 100%, whereas Modern games
are 83%, Rapid games are 29%, and Blitz games are 18%. Alternatively,
use the graph at the end of this article to arrive at an exact coefficient which
is specific to the particular time control being used.
(4) Calculate the rating lists at the end of every month.
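The first three points above can be combined into a short Python sketch of a single-game update. One assumption is made here that the list does not spell out: the time-control coefficient is applied as a straight multiplier on the K-Factor term, in line with the earlier example where a 30% coefficient turns a 10-point gain into 3 points. The names `K_FACTOR`, `WEIGHTS`, `white_expected`, and `rating_changes` are hypothetical, and the sketch ignores everything outside a single game (new players, monthly batching, and so on).

```python
K_FACTOR = 24

# Importance of each time control, as listed in point (3).
WEIGHTS = {"classical": 1.00, "modern": 0.83, "rapid": 0.29, "blitz": 0.18}


def white_expected(white_rating, black_rating):
    """White's expected score from the linear formula in point (1)."""
    advantage = max(-460, min(390, white_rating - black_rating))
    return 0.541767 + 0.001164 * advantage


def rating_changes(white_rating, black_rating, white_score, time_control):
    """Return (white_change, black_change) for one game.

    `white_score` is 1.0 for a White win, 0.5 for a draw, 0.0 for a loss.
    Assumption: the time-control weight simply scales the K-Factor term.
    """
    expected = white_expected(white_rating, black_rating)
    delta = K_FACTOR * WEIGHTS[time_control] * (white_score - expected)
    return delta, -delta


# The same result counts for less at faster time controls.
for control in ("classical", "modern", "rapid", "blitz"):
    white_change, _ = rating_changes(2600, 2600, 1.0, control)
    print(f"{control:>9}: White gains {white_change:.2f}")
```

Because the two changes are equal and opposite, each game moves points between the players without adding any to the pool, just as in the existing Elo scheme.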
This formula was specifically optimized to be as accurate as possible, so it should come as no surprise that the Sonas ratings are much better at predicting future classical game outcomes than are the existing FIDE ratings. In fact, in every single month that I looked at, from January 1997 through December 2001, the total error (in predicting players' monthly scores) was higher for the FIDE ratings than for the Sonas ratings:
How can I claim that the Sonas ratings are "more accurate" or "more effective at predicting"? I went through each month and used the two sets of ratings to predict the outcome of every game played during that month. Then, at the end of the month, for each player, I added up their predicted score using the Elo ratings, and their predicted score using the Sonas ratings. Each of those rating systems had an "error" for the player during that month, which was the absolute difference between the player's actual total score and the rating system's predicted total score.
For example, in April 2000 Bu Xiangzhi played 18 classical games, with a +7 score for a total of 12.5 points. Based on his rating and his opponents' ratings in those games, the Elo rating system had predicted a score of 10.25, whereas the Sonas rating system had predicted a score of 11.75. In this case, the Elo error would be 2.25, whereas the Sonas error would be 0.75. By adding up all of the errors, for all players during the month, we can see what the total error was for the Sonas ratings, and also for the Elo ratings. Then we can compare them, and see which rating system was more effective in its predictions of games played during that month. In the last graph, you can see that the Sonas ratings turned out to be more effective than the Elo ratings in every single one of the 60 months from January 1997 to December 2001.
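The error measure described above is easy to express in Python. The sketch below uses only the Bu Xiangzhi figures quoted in the text; the function names and the record layout (one pair of actual and predicted totals per player per month) are assumptions made for illustration.

```python
def prediction_error(actual_total, predicted_total):
    """Absolute difference between a player's actual and predicted monthly score."""
    return abs(actual_total - predicted_total)


def total_error(records):
    """Sum of per-player errors for a month.

    `records` is an iterable of (actual_total, predicted_total) pairs,
    one pair per player.
    """
    return sum(prediction_error(actual, predicted) for actual, predicted in records)


# Bu Xiangzhi, April 2000: 12.5 points from 18 classical games.
elo_error = prediction_error(12.5, 10.25)
sonas_error = prediction_error(12.5, 11.75)
print(f"Elo error: {elo_error}, Sonas error: {sonas_error}")
```

Comparing two rating systems for a month then amounts to calling `total_error` once with each system's predictions and seeing which total is smaller.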
You are probably wondering what the top-ten list would look like, if the Sonas formula were used instead of the Elo formula. Rather than giving you a huge list of numbers, I'll give you a few pictures instead.
First, let's look at the "control group", which is the current Elo system (including only Classical and Modern games). These ratings are based upon a database of 266,000 games covering the period between January 1994 and December 2001. The game database is that provided by Vladimir Perevertkin, rather than the actual FIDErated game database, and these ratings are calculated 12 times a year rather than 2 or 4. Thus the ratings shown below are not quite the same as the actual published FIDE ratings, but they do serve as an effective control group.
Next, you can see the effect of a higher K-Factor. Using a K-Factor of 24 rather than 10, players' ratings are much more sensitive to their recent results. For instance, you can see Anatoly Karpov's rating (the black line) declining much more steeply in the next graph. Similarly, with the more dynamic system, Garry Kasparov dropped down very close to Viswanathan Anand after Linares 1998. In fact, Kasparov briefly fell to #3 on this list in late 2000, after Kramnik defeated him in London and then Anand won the FIDE championship. And Michael Adams was very close behind at #4.
Finally, by examining the next graph, you can see the slight effect upon the ratings if faster time controls are incorporated. In the years between 1994 and 1997, Kasparov and Anand did even better at rapid chess than at classical chess, and so you can see that their ratings are a little bit higher when rapid games are included. Some other players show some differences, but not significant ones. In general, the two graphs are almost identical.
You might also notice that the ratings based upon a linear model with a K-Factor of 24 are about 50 points higher than the ratings with the current formula. As I mentioned previously, this is mostly due to a deflationary effect in the current formula, rather than an inflationary effect in the linear model. Since there is an unintentional bias against higher-rated players in the Elo table of numbers, the top players are having their ratings artificially depressed in the current system. This bias would be removed through the use of my linear model.
It is unsurprising that a rating system with a higher K-Factor would have some inflation, though. If a player does poorly over a number of events and then stops playing, they will have "donated" rating points to the pool of players. Perhaps someone scored 30/80 rather than the predicted 40/80, over a few months. In the current system, they would have donated 100 points to the pool, whereas with a K-Factor of 24, it would have been 240 points instead. Since a very successful player will probably keep playing, while a very unsuccessful player might well stop playing, this will have an inflationary effect on the overall pool. Of course, this is a very simplistic explanation and I know that the question of inflation vs. deflation is a very complicated one.
I am not suggesting that we suddenly recalculate everyone's rating and publish a brand-new rating list. For one thing, it's not fair to retroactively rate games that were "unrated" games at the time they were played. By showing you these graphs, I am merely trying to illustrate how my rating system would behave over time. Hopefully this will illustrate what it would mean to have a K-Factor of 24 rather than 10, and you can also see the impact of faster time controls.
For the sake of continuity of the "official" rating list, it seems reasonable that if this formula were adopted, everyone should retain their previous rating at the cutover point. Once further games were played, the ratings would begin to change (more rapidly than before) from that starting point.
OTHER TIME CONTROLS
The above conclusions about time controls were based upon only four different classifications: Blitz, Rapid, Modern, and Classical. However, those classifications do not include all typical time controls. For instance, Modern has a minimum of 90 minutes per player per game, whereas Rapid has a maximum of 30 minutes per player per game. Ideally, it would be possible to incorporate the coefficients for these four classifications into a "master list" which could tell you what the coefficient should be for g/60, or g/15 vs. g/30 for that matter.
I did a little bit of analysis on some recent TWIC archives, and determined that about 50% of games last between 30 and 50 moves, with the average game length being 37 moves. I therefore defined a "typical" game length as 40 moves, and then looked at how much time a player would use in a "typical" game in various time controls, if they used their maximum allowable time to reach move 40.
This means a player would spend 5 minutes on a typical Blitz game, 5-30 minutes on a typical Rapid game, 90-120 minutes on a typical Modern game, and 120 minutes on a typical Classical game. Finally, I graphed my earlier coefficients of 18%, 29%, 83%, and 100% against the typical amount of time used, and arrived at the following important graph:
This sort of approach (depending upon the maximum time used through 40 moves) is really useful because it lets you incorporate increments into the formula. A blitz game where you have 5 minutes total will obviously count as a 5-minute game in the above graph, and you can see that the coefficient would be 18%. A blitz game where you get 5 minutes total, plus 15 seconds per move, would in fact typically be a 15-minute game (5 minutes + 40 moves, at one extra minute per four moves = 15 minutes), and so the recommended coefficient would be 27% instead for that time control.
The very common time control of 60 minutes per player per game would of course count as a 60-minute game, and you can see that this would be 55%. And the maximum coefficient of 100% would be reached by a classical time control where you get a full 120 minutes for your first 40 moves.
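Since the graph itself is not reproduced here, the following Python sketch can only approximate it. The "typical minutes" calculation follows the text directly (base time plus 40 moves' worth of increment). The coefficient lookup is a loud assumption: it draws straight lines between the four values quoted in the surrounding paragraphs (5 minutes: 18%, 15 minutes: 27%, 60 minutes: 55%, 120 minutes: 100%), and the real curve may well differ between those points. All names are hypothetical.

```python
TYPICAL_MOVES = 40  # "typical" game length defined in the text

# (typical minutes, coefficient) pairs quoted in the text.
QUOTED_POINTS = [(5, 0.18), (15, 0.27), (60, 0.55), (120, 1.00)]


def typical_minutes(base_minutes, increment_seconds=0):
    """Maximum time one player can use to reach move 40."""
    return base_minutes + TYPICAL_MOVES * increment_seconds / 60.0


def coefficient(minutes):
    """Approximate coefficient for a game of the given typical length.

    Assumption: piecewise-linear interpolation between the quoted points,
    held flat below 5 minutes and above 120 minutes.
    """
    if minutes <= QUOTED_POINTS[0][0]:
        return QUOTED_POINTS[0][1]
    if minutes >= QUOTED_POINTS[-1][0]:
        return QUOTED_POINTS[-1][1]
    for (x0, y0), (x1, y1) in zip(QUOTED_POINTS, QUOTED_POINTS[1:]):
        if x0 <= minutes <= x1:
            fraction = (minutes - x0) / (x1 - x0)
            return y0 + fraction * (y1 - y0)


# Blitz with a 15-second increment, as in the example above.
minutes = typical_minutes(5, increment_seconds=15)
print(f"{minutes:.0f} typical minutes, coefficient about {coefficient(minutes):.2f}")
```

Values at the quoted points match the text by construction; anything in between should be read as a rough guess until checked against the actual graph.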
CONCLUSION
It is more important than ever before for ratings to be accurate. In the past, invitations to Candidate events were based upon a series of qualification events. Now, however, invitations and pairings are often taken directly from the rating list. The field for the recent Dortmund candidates' tournament was selected by averaging everyone's FIDE and Professional ratings into a combined list, and then picking the top players from that list. For the first time, a tournament organizer has acknowledged that the FIDE ratings are not particularly accurate, and that a different formula might work better.
The FIDE ratings are way too conservative, and the time control issue also needs to be addressed thoughtfully. I know that this is an extremely tricky issue, and it would be ridiculous to suggest that it is simply a question of mathematics. If change does come about, it will be motivated by dozens of factors. Nevertheless, I hope that my efforts will prove useful to the debate. I also hope you agree with me that the "Sonas" formula described in this article would be a significant improvement upon the "Elo" formula which has served the chess world so well for decades.
Please send me email at jeff@chessmetrics.com if you have any questions, comments, or suggestions. In addition, please feel free to distribute or reprint text or graphics from this article, as long as you credit the original author (that's me).
Jeff Sonas
Additional reading
* FIDE Rapid Ratings
* Shirov on Rapid Ratings
* Krasenkow replies to Shirov
* Milov replies to Krasenkow and Shirov
Page last edited on: 01/04/2013
