Chess Engines and Playing Strength: Are We Reaching the Limit?

The latest release of Stockfish, version 17, has not brought significant Elo gains. In autotuning and with highly unbalanced opening suites, the program appears to be around 50 Elo points stronger than version 16, but most well-known rating lists (CCRL/CEGT/SP-CC…) agree that the actual increase is much smaller. With unbiased openings, it is almost negligible.

This result leads me to a fundamental question: are we reaching a limit beyond which, regardless of computing power or the time given to the engine for each move, performance improvements become marginal?

If so, this stagnation could be due to the fact that engines like Stockfish are coming closer to a sort of “perfect play.” After solving most critical positions, only marginal details remain, which require enormous resources to improve but do not produce tangible differences in the final outcome. At a certain point, additional calculation depth no longer leads to better moves but only confirms those previously chosen.

To determine whether this phenomenon is real, I ran tests on a set of unbiased openings, initially comparing Stockfish 17 with itself at progressively longer time controls. I later added Crystal 8 and Stockfish 15.1 to the test pool, for two reasons. First, I was curious to see how Crystal performs: it is a Stockfish derivative highly valued for game analysis, whose code modifications weaken it slightly in pure strength but allow it to solve positions that Stockfish either cannot solve or takes much longer to solve. Second, I wanted engines with different playing styles, which is why I added Stockfish 15.1. It has a similar strength to its more recent counterpart but a distinct playing style, thanks to a neural network of a different size trained with different methods.

Chosen openings

To avoid excessively unbalanced starting positions, I selected 12 classic, balanced openings, each limited to a maximum of three moves:

  • Open Game
  • Queen’s Gambit
  • Slav Defense
  • Ruy Lopez
  • Sicilian Defense
  • English Opening
  • Caro-Kann
  • Pirc Defense
  • Reti Opening
  • King’s Indian Defense
  • Nimzo-Indian Defense
  • Queen’s Indian Defense

The engines faced each other, playing each opening once as White and once as Black.

The suite of openings is available in EPD format: openings10.epd

Thinking time and computation power

The test was conducted at a repeating time control of 40 moves in X seconds, with an initial X of 5 seconds that was doubled at each step, up to a thinking time 1024 times greater. Specifically, the time controls are as follows:

  • x1 = 40 moves in 5 seconds, 256MB hash
  • x2 = 40 moves in 10 seconds, 256MB hash
  • x4 = 40 moves in 20 seconds, 256MB hash
  • x8 = 40 moves in 40 seconds, 256MB hash
  • x16 = 40 moves in 80 seconds, 256MB hash
  • x32 = 40 moves in 160 seconds, 512MB hash
  • x64 = 40 moves in 320 seconds, 512MB hash
  • x128 = 40 moves in 640 seconds, 512MB hash
  • x256 = 40 moves in 1280 seconds, 1024MB hash
  • x512 = 40 moves in 2560 seconds, 1024MB hash
  • x1024 = 40 moves in 5120 seconds, 1024MB hash
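
The ladder above follows a simple doubling rule, which can be sketched in a few lines of Python. This is only an illustration of the list: the function and constant names are my own, and the hash sizes are simply copied from the entries above.

```python
# Sketch of the doubling schedule listed above (all names are illustrative).
BASE_SECONDS = 5         # x1 = 40 moves in 5 seconds
MOVES_PER_PERIOD = 40


def hash_mb(multiplier: int) -> int:
    """Hash size at each step, copied from the list above."""
    if multiplier <= 16:
        return 256
    if multiplier <= 128:
        return 512
    return 1024


def schedule():
    """Return (label, seconds per 40 moves, hash in MB) for x1 .. x1024."""
    rows = []
    for step in range(11):              # 2**0 .. 2**10
        multiplier = 2 ** step
        rows.append((f"x{multiplier}", BASE_SECONDS * multiplier, hash_mb(multiplier)))
    return rows


for label, seconds, hash_size in schedule():
    per_move = seconds / MOVES_PER_PERIOD   # average time per move
    print(f"{label:>6}: 40 moves in {seconds:>4} s "
          f"(about {per_move:g} s per move), {hash_size} MB hash")
```

On average this ranges from 0.125 seconds per move at x1 to 128 seconds per move at x1024.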

I couldn’t go further due to hardware limitations.

As for the available computing power, the tests were conducted with the Threads=1 option, using a PC with a Core i7 12700 CPU.

The test was conducted using CuteChess, and the results were calculated with the Ordo 1.0 program by Miguel Ballicora.
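
For readers who want to reproduce a single pairing, here is a rough sketch of how such a match could be launched. It assumes the command-line front end cutechess-cli was used, which is not stated above, and the engine paths, names and output file are hypothetical. The script only builds and prints the command; it does not run it.

```python
import shlex


def engine_args(cmd, name, seconds, hash_mb):
    """Arguments for one engine: 40 moves in `seconds`, repeating, one thread."""
    return [
        "-engine", f"cmd={cmd}", f"name={name}", "proto=uci",
        f"tc=40/{seconds}",
        f"option.Hash={hash_mb}", "option.Threads=1",
    ]


def match_command(engine_a, engine_b, openings="openings10.epd", pgn_out="games.pgn"):
    """Build a cutechess-cli command line (assumed setup, not the author's)."""
    return (
        ["cutechess-cli"]
        + engine_args(*engine_a)
        + engine_args(*engine_b)
        + ["-openings", f"file={openings}", "format=epd", "order=sequential",
           "-games", "2", "-repeat",      # each opening once per color
           "-rounds", "12",               # 12 openings in the suite
           "-pgnout", pgn_out]
    )


# Hypothetical pairing: Stockfish 17 at x8 against Stockfish 17 at x4.
command = match_command(
    ("./stockfish17", "Stockfish 17 x8", 40, 256),
    ("./stockfish17", "Stockfish 17 x4", 20, 256),
)
print(shlex.join(command))
```

The resulting PGN file would then be passed to Ordo to compute the ratings.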

Test results

Below is the rating list obtained using the Ordo 1.0 software. The thinking time used is indicated for each engine (for example, Stockfish 17 x8 played at a repeating time control of 40 moves in 40 seconds). In addition to the three engines listed previously, I added others to estimate the actual playing strength.

The rating list is anchored at 3550 Elo for Stockfish 11 playing at a fixed, repeating time control of 40 moves in 80 seconds. On the PC used in these tests, this is comparable to a repeating 40 moves in 120 minutes on a hypothetical 90 MHz Pentium. I use this unusual “unit of measure” to anchor the ratings to a value that gives results comparable to those obtained in the past, when programs were also tested against human players.
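
The equivalence stated above implies a fixed speed ratio between the test PC and the reference Pentium 90. The short calculation below makes that ratio explicit; it uses only the two time controls given in the text, and the variable names are my own.

```python
# Time allowed for 40 moves under the two time controls quoted above.
ANCHOR_SECONDS = 80          # test PC: 40 moves in 80 seconds
P90_SECONDS = 120 * 60       # Pentium 90: 40 moves in 120 minutes

speed_ratio = P90_SECONDS / ANCHOR_SECONDS
print(f"Implied speed ratio: {speed_ratio:g}x")
print(f"Average per move: {ANCHOR_SECONDS / 40:g} s on the test PC, "
      f"{P90_SECONDS / 40:g} s on the Pentium 90")
```

In other words, the anchor treats one second on the test PC as worth 90 seconds on the Pentium 90.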

As a cross-check, I ran some tests with the older Rebel 6, also on a PC emulating the performance of a Pentium 90. The rating obtained, about 2430 Elo, is comparable to the rating achievable by the program on an actual P90 (accounting for margins of error and different openings).

   # PLAYER                        : RATING  ERROR   POINTS  PLAYED    (%)
   1 Stockfish 17 x512             :   3971     22    179.5     336   53.4%
   2 Stockfish 17 x1024            :   3970     23    150.0     288   52.1%
   3 Stockfish 17 x256             :   3968     22    156.0     288   54.2%
   4 Crystal 8 x1024               :   3963     30     60.5     120   50.4%
   5 Stockfish 15.1 x1024          :   3961     32     60.0     120   50.0%
   6 Stockfish 17 x128             :   3960     22    173.0     323   53.6%
   7 Crystal 8 x512                :   3959     26     84.5     168   50.3%
   8 Stockfish 15.1 x512           :   3953     27     82.0     168   48.8%
   9 Stockfish 17 x64              :   3950     20    231.5     406   57.0%
  10 Stockfish 15.1 x256           :   3950     24     96.5     192   50.3%
  11 Crystal 8 x256                :   3940     23    121.0     240   50.4%
  12 Crystal 8 x128                :   3936     22    110.0     216   50.9%
  13 Crystal 8 x64                 :   3932     20    199.0     360   55.3%
  14 Stockfish 15.1 x128           :   3924     29     84.0     144   58.3%
  15 Stockfish 17 x32              :   3914     17    326.5     549   59.5%
  16 Stockfish 15.1 x32            :   3898     21    183.5     312   58.8%
  17 Crystal 8 x32                 :   3893     20    203.0     360   56.4%
  18 Stockfish 15.1 x64            :   3888     23    141.0     240   58.8%
  19 Stockfish 17 x16              :   3887     18    363.5     572   63.5%
  20 Stockfish 17 x64 MultiPV=4    :   3871     22    116.0     216   53.7%
  21 Stockfish 15.1 x16            :   3852     19    295.5     480   61.6%
  22 Stockfish 17 x8               :   3829     19    245.5     452   54.3%
  23 Stockfish 15.1 x8             :   3824     21    226.0     384   58.9%
  24 Crystal 8 x16                 :   3821     19    221.0     432   51.2%
  25 Stockfish 17 x32 MultiPV=4    :   3810     23     98.0     216   45.4%
  26 Stockfish 15.1 x4             :   3760     21    230.0     408   56.4%
  27 Crystal 8 x8                  :   3759     19    258.0     480   53.8%
  28 Stockfish 17 x16 MultiPV=4    :   3749     22    127.5     288   44.3%
  29 Stockfish 17 x4               :   3749     19    266.0     504   52.8%
  30 Crystal 8 x4                  :   3688     23    153.5     312   49.2%
  31 Stockfish 15.1 x2             :   3677     17    364.0     696   52.3%
  32 Stockfish 17 x8 MultiPV=4     :   3662     22    123.5     360   34.3%
  33 Stockfish 17 x2               :   3643     21    273.5     528   51.8%
  34 Crystal 8 x2                  :   3588     21    213.5     456   46.8%
  35 Stockfish 15.1                :   3553     21    210.0     432   48.6%
  36 Stockfish 11 (40/80s)         :   3550     26    191.0     480   39.8%
  37 Stockfish 17                  :   3536     24    216.5     528   41.0%
  38 Crystal 8                     :   3453     28    123.0     384   32.0%
  39 Gull 3                        :   3192     44     90.5     312   29.0%
  40 Laser 1.7                     :   3187     72     29.0     120   24.2%
  41 Rybka 2.3.2                   :   2919     72     62.5     120   52.1%
  42 Ice 4                         :   2851     70     60.5     288   21.0%
  43 Naraku                        :   2586     99     24.5      72   34.0%
  44 Rebel 6                       :   2432    103     11.0     120    9.2%

White advantage = 45.51 +/- 2.18
Draw rate (equal opponents) = 87.55 % +/- 0.77

Interpretation of the results

The following graph shows the Elo increase as calculation time increases.

[Figure: Stockfish 17 Elo increase with increasing thinking time]

As shown in the table, the Elo ratings of each engine initially rise as thinking time increases, highlighting the correlation between the time available per move and the engine’s performance. However, even allowing for the margin of error, beyond a certain threshold the ratings stabilize at approximately 3960-3970 Elo. This is clearly reflected in the results for the x512 and x1024 sets, where the ratings of Stockfish 17, Stockfish 15.1, and Crystal 8 tend to converge.
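
To make the flattening visible in numbers, the sketch below computes the Elo gained by Stockfish 17 at each doubling of thinking time. The ratings are copied from the table above; the helper name is my own.

```python
# Stockfish 17 ratings from the table above, keyed by time multiplier.
SF17 = {1: 3536, 2: 3643, 4: 3749, 8: 3829, 16: 3887, 32: 3914,
        64: 3950, 128: 3960, 256: 3968, 512: 3971, 1024: 3970}


def gains_per_doubling(ratings):
    """Return [(multiplier, Elo gained over the previous step), ...]."""
    steps = sorted(ratings)
    return [(m, ratings[m] - ratings[prev]) for prev, m in zip(steps, steps[1:])]


for multiplier, gain in gains_per_doubling(SF17):
    print(f"x{multiplier:<5} {gain:+d}")
```

The first two doublings are each worth more than 100 Elo, while the last three together add about 10 Elo, well inside the error margins of 22 to 23 points reported for those entries.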

It is interesting to note that while Crystal 8 scores lower than Stockfish 17 at shorter times (as expected), and Stockfish 15.1 falls behind at intermediate times, the gap between the three programs tends to narrow until they almost coincide. Furthermore, as the thinking time increases, decisive games between the engines become rarer. With the engines given 512 times the initial thinking time, all games between them ended in draws regardless of the opening, so the differences in strength were determined only by their results against weaker versions of themselves.
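
The convergence can be checked directly from the table. The sketch below lists the rating gap between Stockfish 17 and the other two engines at each multiplier, plus the overall spread among the three; all figures are copied from the table, and the function names are my own.

```python
MULTIPLIERS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
RATINGS = {
    "Stockfish 17":   [3536, 3643, 3749, 3829, 3887, 3914, 3950, 3960, 3968, 3971, 3970],
    "Stockfish 15.1": [3553, 3677, 3760, 3824, 3852, 3898, 3888, 3924, 3950, 3953, 3961],
    "Crystal 8":      [3453, 3588, 3688, 3759, 3821, 3893, 3932, 3936, 3940, 3959, 3963],
}


def gap_to_sf17(engine):
    """Stockfish 17 rating minus the given engine's rating, per multiplier."""
    return [a - b for a, b in zip(RATINGS["Stockfish 17"], RATINGS[engine])]


def spread(index):
    """Highest minus lowest rating among the three engines at one multiplier."""
    values = [ratings[index] for ratings in RATINGS.values()]
    return max(values) - min(values)


for i, m in enumerate(MULTIPLIERS):
    print(f"x{m:<5} SF15.1 {gap_to_sf17('Stockfish 15.1')[i]:+4d}  "
          f"Crystal {gap_to_sf17('Crystal 8')[i]:+4d}  spread {spread(i):3d}")
```

By these figures the spread among the three engines shrinks from 100 Elo at x1 to 9 Elo at x1024, which is smaller than the error margins reported for those entries. Note that in this table Stockfish 15.1 is nominally ahead of Stockfish 17 at x1 to x4, by margins comparable to the reported errors.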

I also ran a test with MultiPV=4 enabled to see whether, as thinking time increases, having the program consider several principal variations could increase its strength. No such increase occurs: the results with MultiPV=4 match those of the same program without MultiPV at a quarter of the thinking time, or are even weaker, except at very short thinking times (40 moves in 10 seconds or less). Although the MultiPV option lets the engine consider several variations at once, it must divide its computing resources among these lines, reducing the time devoted to each one. So while MultiPV mode can be useful for quickly evaluating alternatives, analyzing a single line is more effective for maximizing the engine’s playing strength at longer time controls.
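
The “quarter of the thinking time” comparison can be read off the table as follows. The ratings are copied from the table above; the dictionary and function names are my own.

```python
# Stockfish 17 ratings from the table above, keyed by time multiplier.
SINGLE_PV = {2: 3643, 4: 3749, 8: 3829, 16: 3887}
MULTI_PV4 = {8: 3662, 16: 3749, 32: 3810, 64: 3871}


def multipv_vs_quarter_time():
    """MultiPV=4 rating minus the single-PV rating at a quarter of the time."""
    return {m: rating - SINGLE_PV[m // 4] for m, rating in MULTI_PV4.items()}


for multiplier, diff in multipv_vs_quarter_time().items():
    print(f"x{multiplier} MultiPV=4 vs x{multiplier // 4}: {diff:+d} Elo")
```

Only the shortest pairing (x8 with MultiPV=4 against x2) comes out in favor of MultiPV; at x16 the two are level, and at x32 and x64 the MultiPV=4 runs are slightly weaker than the single-line runs at a quarter of the time. All four differences are within the reported error margins.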

This test seems to confirm that, with balanced openings, increasing the thinking time beyond a certain point no longer makes the engine stronger: playing strength reaches a plateau. Past that point, engines are unable to translate additional time into significantly better play, and their strength approaches an asymptote.

Furthermore, the test suggests that even weaker engines, when given sufficient computing time, reach the same limit.

At least for Stockfish and its main variants, this limit appears to be around 3950-4000 Elo.

These tests were conducted in single-core mode, so the effect of multi-core analysis on engine performance remains to be evaluated. Using multiple cores makes the search non-deterministic, because the threads explore the game tree in parallel and the lines each one examines depend on timing. This effectively widens the search and leads to greater variability in results, which could improve the engine’s performance. However, I do not believe the outcome would deviate significantly from the limit of around 4000 Elo.
