Reinforcement Learning for 2x2-Games: Studying the Stationary Probability Distribution
by
Thomas Brenner
Max-Planck-Institute for Research into Economic Systems
Studies of learning processes in games have become increasingly frequent in recent years. Many approaches based on different models of learning have been suggested, and their characteristics and implications have been analysed. One of the most frequently studied types of learning is reinforcement learning, which is also supported by many experimental studies in economics as well as in psychology. This paper therefore studies the implications of reinforcement learning for behaviour in games.
Previous studies of reinforcement learning processes in games have focussed on three questions: whether they converge to Nash equilibrium-like behaviour, whether they lead to a dynamic on the individual level similar to the replicator dynamic, and what the dynamics of behaviour on the population level look like. These studies have made it clear that reinforcement learning does not imply that every individual chooses the same action if we wait long enough. Instead, the action taken by an individual depends on that individual's history, including the individual's own moves. Since behaviour at each point in time is described by a probability distribution over the possible actions, actions are chosen randomly. Thus, even if two individuals face exactly the same situation, their experiences generally differ. As a consequence, a theoretical investigation of reinforcement learning processes in games cannot give a single prediction for the strategy that an individual will use after an infinitely large number of repetitions of the game. Such an investigation can, however, determine the probability with which each strategy is used after an infinitely long period of time. This probability distribution is calculated here.
To this end, the reinforcement learning process is formulated in a stochastic manner for a repeated decision situation with 2 options. Several formulations of reinforcement learning are used in the literature; this paper uses the formulation proposed by Cross. A continuous approximation of the resulting stochastic dynamic is then calculated. In contrast to the usual formulation in the literature, this continuous formulation is still stochastic and has the form of a Fokker-Planck equation. For a Fokker-Planck equation the stationary probability distribution can be calculated. Thus, we obtain a mathematical expression for the probabilities that the players use certain strategies in an arbitrarily chosen 2x2 game after an infinite number of repetitions. This expression is rather unwieldy. However, if all payoffs are positive, the probabilities are different from zero only for pure strategies, that is, for the four possible combinations of pure strategies.
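As a rough illustration of the kind of learning rule involved, the sketch below implements one step of a Cross-type update for a choice between two options. It assumes the form of the rule commonly attributed to Cross, in which the probability of the chosen option moves toward one in proportion to the payoff received. The function name `cross_update`, the step size `alpha`, and the restriction of payoffs to the interval [0, 1] are assumptions made for illustration and are not taken from the paper.

```python
def cross_update(p, chosen, payoff, alpha=0.1):
    """One step of a Cross-type reinforcement rule for two options.

    p      -- current probability of choosing option 0
    chosen -- index (0 or 1) of the option actually taken
    payoff -- reinforcement received, assumed to lie in [0, 1]
    alpha  -- hypothetical step size in (0, 1]

    Returns the updated probability of choosing option 0.
    """
    if chosen not in (0, 1):
        raise ValueError("chosen must be 0 or 1")
    if not 0.0 <= payoff <= 1.0:
        raise ValueError("payoff must lie in [0, 1]")
    step = alpha * payoff
    if chosen == 0:
        # Option 0 was reinforced: its probability moves toward 1.
        return p + step * (1.0 - p)
    # Option 1 was reinforced: the probability of option 0 shrinks.
    return p - step * p
```

Because the update is applied only to the option that was actually drawn, two learners facing the same payoffs can end up with different probabilities, which is why the outcome is described by a distribution rather than by a single prediction.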
To illustrate the results, the explicit probability distribution is given for three prominent games: the prisoner's dilemma, the chicken game, and matching pennies. Furthermore, these results are compared with the results of simulations of the reinforcement learning process. This comparison shows that the approximations necessary to obtain the stationary probability distribution do not influence the results significantly. Finally, some general statements are deduced from the results about the convergence of reinforcement learning to the optimal choice in the case of a 2-armed bandit and to a Nash equilibrium in the case of 2x2 games. It turns out that, given some usual assumptions, Nash equilibria are more likely to be reached than other outcomes. However, depending on the payoffs, a Nash equilibrium is reached with a probability smaller than one.
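A simulation of the kind mentioned above might be sketched as follows. This is not the author's simulation: the payoff values, the step size `alpha`, the number of rounds and runs, and the names `simulate` and `outcome_frequencies` are all hypothetical, and the update assumes a Cross-type rule with payoffs rescaled to lie in (0, 1]. Each run is classified by the pure strategy each player is more likely to use at the end, which is only a crude stand-in for the limit after infinitely many repetitions.

```python
import random
from collections import Counter

def simulate(payoffs, rounds=5000, alpha=0.1, rng=None):
    """Run one repeated 2x2 game with two Cross-type learners.

    payoffs[(a, b)] = (payoff to row player, payoff to column player),
    all values assumed to lie in [0, 1].
    Returns [p_row, p_col], each player's probability of action 0.
    """
    rng = rng or random.Random()
    p = [0.5, 0.5]  # both players start out indifferent (an assumption)
    for _ in range(rounds):
        actions = tuple(0 if rng.random() < p[i] else 1 for i in range(2))
        rewards = payoffs[actions]
        for i in range(2):
            step = alpha * rewards[i]
            if actions[i] == 0:
                p[i] += step * (1.0 - p[i])  # reinforce action 0
            else:
                p[i] -= step * p[i]          # reinforce action 1
    return p

def outcome_frequencies(payoffs, runs=200, seed=0, **kwargs):
    """Estimate how often each pure-strategy pair is approached."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        p = simulate(payoffs, rng=rng, **kwargs)
        counts[tuple(0 if q > 0.5 else 1 for q in p)] += 1
    return {pair: n / runs for pair, n in counts.items()}

# Hypothetical prisoner's dilemma: action 0 = cooperate, 1 = defect.
PD = {
    (0, 0): (0.6, 0.6),
    (0, 1): (0.1, 1.0),
    (1, 0): (1.0, 0.1),
    (1, 1): (0.3, 0.3),
}

if __name__ == "__main__":
    print(outcome_frequencies(PD, runs=100, rounds=2000))
```

The frequencies printed depend on the assumed payoffs and step size; comparing them with an analytically derived distribution is the type of check described above.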
Date received: June 7, 2000
Copyright © 2000 by the author(s). The author(s) of this document and the organizers of the conference have granted their consent to include this abstract in Atlas Conferences Inc. Document # cafi-01.