
Acknowledgments


A. M. Turing, Computing machinery and intelligence, by permission from the editors of Mind, October, 1950, 59:433-460.

A. Newell, J. C. Shaw, and H. Simon, Chess playing programs and the problem of complexity, by permission from the IBM Journal of Research and Development, October, 1958, 2:320-335.

A. L. Samuel, Some studies in machine learning using the game of checkers, by permission from the IBM Journal of Research and Development, July, 1959, 3:211-229. The author acknowledges:


Many different people have contributed to these studies through stimulating discussions of the basic problems. From time to time the writer was assisted by several different programmers, although most of the detailed work was his own. The forbearance of the machine room operators and their willingness to play the machine at all hours of the day and night are also greatly appreciated.

A. Newell, J. C. Shaw, and H. Simon, Empirical explorations with the Logic Theory Machine, by permission of the authors from the Proceedings of the Western Joint Computer Conference, 1957, 15:218-239. This research was part of a project conducted jointly by Newell and Shaw of the RAND Corporation, Santa Monica, and H. Simon of the Carnegie Institute of Technology.

H. Gelernter, Realization of a geometry-theorem proving machine, by permission from the Proceedings of an International Conference on Information Processing, Paris: UNESCO House, 1959, pp. 273-282. The author acknowledges:

The technical and programming assistance of J. R. Hansen and D. W. Loveland has been indispensable to the success of this project. N. Rochester and J. McCarthy contributed much to the early development of ideas, and Rochester supplied the necessary administrative support as well. Other members of the Information Research Department of IBM, and W. G. Bouricius, P. C. Gilmore, J. P. Lazarus, and P. D. Welch, in particular, contributed to the author’s understanding of the problem in his conversations with them.

The research project itself is a consequence of the Dartmouth Summer Research Project on Artificial Intelligence held in 1956, during which M. L. Minsky pointed out the potential utility of the diagram to a geometry theorem-proving machine.


70 ARTIFICIAL INTELLIGENCE

White is lost but relatively best was 22. R-Q3 blockading the passed Queen Pawn. 22. R-QB1 indicates that an order is missing to avoid exchanges after losing material, unless such exchanges deserve a high rating for specific reasons covered by other orders.

22 . . . KR-QB1 23 QR-Q1

White is just floundering in a lost position.

23 . . . KR-B6 24 P-N4

“There are no good moves in bad positions!”

24 . . . KR X P 25 B-N3

Best, White at least stops the mating attack.

25 . . . P-Q6 26 R-QB1 B-N4

26 R-QB1 indicates that an order is missing that would make the machine avoid getting forked. Better was 26 . . . , P-Q7 winning instantly (26 . . . , P-Q7, 27. R X R, P X R = Q ch, 28. K-N2, Q-Q8!, 29. R-B8 ch, B-Q1).

27 R X R P X R

28 B-K5 P-B8 Q

29 R X Q B X R 30 Resigns

Best, but I’m sure the programmers were just getting tired! Such test games give indeed excellent indications as to the type of general principles the program should include in addition to material balance, development, and center control, to eliminate antipositional moves as much as possible.

SOME STUDIES IN MACHINE LEARNING USING THE GAME OF CHECKERS

by A. L. Samuel

Introduction

The studies reported here have been concerned with the programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning. While this is not the place to dwell on the importance of machine-learning procedures, or to discourse on the philosophical aspects,¹ there is obviously a very large amount of work, now done by people, which is quite trivial in its demands on the intellect but does, nevertheless, involve some learning. We have at our command computers with adequate data-handling ability and with sufficient computational speed to make use of machine-learning techniques, but our knowledge of the basic principles of these techniques is still rudimentary. Lacking such knowledge, it is necessary to specify methods of problem solution in minute and exact detail, a time-consuming and costly procedure. Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.

¹ Some of these are quite profound and have a bearing on the questions raised by Nelson Goodman in Fact, Fiction and Forecast, Cambridge, Mass.: Harvard, 1954.

General Methods of Approach

At the outset it might be well to distinguish sharply between two general approaches to the problem of machine learning. One method, which might be called the Neural-Net Approach, deals with the possibility of inducing learned behavior into a randomly connected switching net (or its simulation on a digital computer) as a result of a reward-and-punishment routine. A second, and much more efficient approach, is to produce the equivalent of a highly organized network which has been designed to learn only certain specific things. The first method should lead to the development of general-purpose learning machines. A comparison between the size of the switching nets that can be reasonably constructed or simulated at the present time and the size of the neural nets used by animals, suggests that we have a long way to go before we obtain practical devices.² The second procedure requires reprogramming for each new application, but it is capable of realization at the present time. The experiments to be described here were based on this second approach.

Choice of Problem

For some years the writer has devoted his spare time to the subject of machine learning and has concentrated on the development of learning procedures as applied to games.³ A game provides a convenient vehicle for such study as contrasted with a problem taken from life, since many of the complications of detail are removed. Checkers, rather than chess (Shannon, 1950; Bernstein and Roberts, 1958b; Kister et al., 1957; Newell, Shaw, and Simon, 1958b), was chosen because the simplicity of its rules permits greater emphasis to be placed on learning techniques. Regardless of the relative merits of the two games as intellectual pastimes, it is fair to state that checkers contains all of the basic characteristics of an intellectual activity in which heuristic procedures and learning processes can play a major role and in which these processes can be evaluated.

Some of these characteristics might well be enumerated. They are:

(1) The activity must not be deterministic in the practical sense. There exists no known algorithm which will guarantee a win or a draw in checkers, and the complete explorations of every possible path through a checker game would involve perhaps 10⁴⁰ choices of moves which, at 3 choices per millimicrosecond, would still take 10²¹ centuries to consider.

(2) A definite goal must exist (the winning of the game), and at least one criterion or intermediate goal must exist which has a bearing on the achievement of the final goal and for which the sign should be known. In checkers the goal is to deprive the opponent of the possibility of moving,

² Warren S. McCulloch (1949) has compared the digital computer to the nervous system of a flatworm. To extend this comparison to the situation under discussion would be unfair to the worm, since its nervous system is actually quite highly organized as compared with the random-net studies by Farley and Clark (1954), Rochester, Holland, Haibt, and Duda (1956), and by Rosenblatt (1958).

³ The first operating checker program for the IBM 701 was written in 1952. This was recoded for the IBM 704 in 1954. The first program with learning was completed in 1955 and demonstrated on television on February 24, 1956.

MACHINE LEARNING USING THE GAME OF CHECKERS 73

and the dominant criterion is the number of pieces of each color on the board. The importance of having a known criterion will be discussed later.

(3) The rules of the activity must be definite and they should be known. Games satisfy this requirement. Unfortunately, many problems of economic importance do not. While in principle the determination of the rules can be a part of the learning process, this is a complication which might well be left until later.

(4) There should be a background of knowledge concerning the activity against which the learning progress can be tested.

(5) The activity should be one that is familiar to a substantial body of people so that the behavior of the program can be made understandable to them. The ability to have the program play against human opponents (or antagonists) adds spice to the study and, incidentally, provides a convincing demonstration for those who do not believe that machines can learn.
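The exhaustive-search estimate in item (1) can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes the garbled exponents read as 10^40 choices and 10^21 centuries, takes a millimicrosecond to be one nanosecond, and uses a 365.25-day year. The function and constant names are illustrative.

```python
# Rough check of the exhaustive-search estimate in item (1).
CHOICES = 10**40                 # assumed reading of the number of move choices
RATE_PER_SECOND = 3 * 10**9      # 3 choices per millimicrosecond (nanosecond)
SECONDS_PER_CENTURY = 100 * 365.25 * 24 * 3600  # assumption: 365.25-day years


def centuries_to_search(choices=CHOICES, rate=RATE_PER_SECOND):
    """Return the time, in centuries, needed to consider every choice."""
    return choices / rate / SECONDS_PER_CENTURY


if __name__ == "__main__":
    print(f"{centuries_to_search():.2e} centuries")
```

Under these assumptions the result is on the order of 10^21 centuries, which agrees with the figure quoted in the text.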

Having settled on the game of checkers for our learning studies, we must, of course, first program the computer to play legal checkers, that is, we must express the rules of the game in machine language and we must arrange for the mechanics of accepting an opponent’s moves and of reporting the computer’s moves, together with all pertinent data desired by the experimenter. The general methods for doing this were described by Shannon in 1950 as applied to chess rather than checkers. The basic program used in these experiments is quite similar to the program described by Strachey in 1952. The availability of a larger and faster machine (the IBM 704), coupled with many detailed changes in the programming procedure, leads to a fairly interesting game, even without any learning. The basic forms of the program will now be described.

The Basic Checker-playing Program

The computer plays by looking ahead a few moves and by evaluating the resulting board positions much as a human player might do. Board positions are stored by sets of machine words, four words normally being used to represent any particular board position. Thirty-two bit positions (of the 36 available in an IBM 704 word) are, by convention, assigned to the 32 playing squares on the checkerboard, and pieces appearing on these squares are represented by 1’s appearing in the assigned bit positions of the corresponding word. “Looking ahead” is prepared for by computing all possible next moves, starting with a given board position. The indicated moves are explored in turn by producing new board-position records corresponding to the conditions after the move in question (the old board positions being saved to facilitate a return to the starting point) and the process can be repeated. This look-ahead procedure is carried several


Figure 1. A “tree” of moves which might be investigated during the look-ahead procedure. The actual branchings are much more numerous than those shown, and the “tree” is apt to extend to as many as 20 levels.

moves in advance, as illustrated in Fig. 1. The resulting board positions are then scored in terms of their relative value to the machine.
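The bit-per-square board storage described above can be sketched as follows. This is a minimal illustration, not the program's actual layout: numbering the squares 1 to 32 and mapping square n to bit n-1 are assumptions, and the function names are hypothetical. The text does not say how the four words of a position divide up the pieces.

```python
WORD_BITS = 36   # IBM 704 word length, as stated in the text
SQUARES = 32     # playing squares on the checkerboard


def squares_to_word(squares):
    """Pack a collection of occupied squares (1..32) into one machine word."""
    word = 0
    for sq in squares:
        if not 1 <= sq <= SQUARES:
            raise ValueError(f"square out of range: {sq}")
        word |= 1 << (sq - 1)   # assumed mapping: square n -> bit n-1
    return word


def word_to_squares(word):
    """Unpack a machine word into the sorted list of occupied squares."""
    return [sq for sq in range(1, SQUARES + 1) if (word >> (sq - 1)) & 1]
```

With such a layout, a full position could be held as several words, for example one per piece type and color, and set operations on squares become single bitwise instructions.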

The standard method of scoring the resulting board positions has been in terms of a linear polynomial. A number of schemes of an abstract sort were tried for evaluating board positions without regard to the usual checker concepts, but none of these was successful.⁴ One way of looking at the various terms in the scoring polynomial is that those terms with

⁴ One of the more interesting of these was to express a board position in terms of the first and higher moments of the white and black pieces separately about two orthogonal axes on the board. Two such sets of axes were tried, one set being parallel to the sides of the board and the second set being those through the diagonals.


numerically small coefficients should measure criteria related as intermediate goals to the criteria measured by the larger terms. The achievement of these intermediate goals indicates that the machine is going in the right direction, such that the larger terms will eventually increase. If the program could look far enough ahead we need only ask, “Is the machine still in the game?”⁵ Since it cannot look this far ahead in the usual situation, we must substitute something else, say the piece ratio, and let the machine continue the look-ahead until one side has gained a piece advantage. But even this is not always possible, so we have the program test to see if the machine has gained a positional advantage, et cetera. Numerical measures of these various properties of the board positions are then added together (each with an appropriate coefficient which defines its relative importance) to form the evaluation polynomial.

More specifically, as defined by the rules for checkers, the dominant scoring parameter is the inability for one side or the other to move.⁶ Since this can occur but once in any game, it is tested for separately and is not included in the scoring polynomial as tabulated by the computer during play. The next parameter to be considered is the relative piece advantage. It is always assumed that it is to the machine’s advantage to reduce the number of the opponent’s pieces as compared to its own. A reversal of the sign of this term will, in fact, cause the program to play “giveaway” checkers, and with learning it can only learn to play a better and better giveaway game. Were the sign of this term not known by the programmer it could, of course, be determined by tests, but it must be fixed by the experimenter and, in effect, it is one of the instructions to the machine defining its task. The numerical computation of the piece advantage has been arranged in such a way as to account for the well-known property that it is usually to one’s advantage to trade pieces when one is ahead and to avoid trades when behind. Furthermore, it is assumed that kings are more valuable than pieces, the relative weights assigned to them being three to two.⁷ This ratio means that the program will trade three men for two kings, or two kings for three men, if by so doing it can obtain some positional advantage.
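The shape of such a linear scoring polynomial can be sketched briefly. This is a minimal illustration and not the program's actual computation: the term names and coefficient values are hypothetical, and the simple weighted difference used for piece advantage does not reproduce the trade-when-ahead adjustment mentioned in the text. Only the three-to-two king-to-man weighting is taken from the source.

```python
MAN_WEIGHT = 2    # relative weights from the text: kings to men as three to two
KING_WEIGHT = 3


def piece_advantage(own_men, own_kings, opp_men, opp_kings):
    """Weighted material difference, positive when the machine is ahead."""
    own = MAN_WEIGHT * own_men + KING_WEIGHT * own_kings
    opp = MAN_WEIGHT * opp_men + KING_WEIGHT * opp_kings
    return own - opp


def evaluate(terms, coefficients):
    """Linear polynomial: sum of coefficient times measured term value."""
    return sum(coefficients[name] * value for name, value in terms.items())


# Hypothetical coefficients; a large one marks the dominant term.
COEFFICIENTS = {"piece_advantage": 100, "mobility": 1}
score = evaluate({"piece_advantage": piece_advantage(5, 1, 4, 1), "mobility": 3},
                 COEFFICIENTS)
```

With these weights three men and two kings balance exactly, so a trade between them is decided by the smaller positional terms, as the text describes.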

The choice for the parameters to follow this first term of the scoring polynomial and their coefficients then becomes a matter of concern. Two courses are open: either the experimenter can decide what these subsequent terms are to be, or he can arrange for the program to make the selection. We will discuss the first case in some detail in connection with

⁵ This apt phraseology was suggested by John McCarthy.

⁶ Not the capture of all the opponent’s pieces, as popularly assumed, although all games end in this fashion.

⁷ The use of a weight ratio rather than this, conforming more closely to the values assumed by many players, can lead into certain logical complications, as found by Strachey (1952).


[Figure 2: tree diagram. Legend: machine chooses branch with largest score; opponent expected to choose branch with smallest score.]

Figure 2. Simplified diagram showing how the evaluations are backed up through the “tree” of possible moves to arrive at the best next move. The evaluation process starts at (3).

the rote-learning studies and leave for a later section the discussion of various program methods of selecting parameters and adjusting their coefficients.

It is not satisfactory to select the initial move which leads to the board position with the highest score, since to reach this position would require the cooperation of the opponent. Instead, an analysis must be made proceeding backward from the evaluated board positions through the “tree” of possible moves, each time with consideration of the intent of the side whose move is being examined, assuming that the opponent would always attempt to minimize the machine’s score while the machine acts to maximize its score. At each branch point, then, the corresponding board position is given the score of the board position which would result from the most favorable move. Carrying this “minimax” procedure back to the starting point results in the selection of a “best move.” The score of the board position at the end of the most likely chain is also brought back, and for learning purposes this score is now assigned to the present board position. This process is shown in Fig. 2. The best move is executed, reported on the console lights, and tabulated by the printer.
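The backing-up procedure just described can be sketched as a short recursive routine. This is a minimal illustration under stated assumptions: the tree is given as nested Python lists whose leaves are already-evaluated scores, which is a hypothetical stand-in for the program's board-position records, and the function names are not from the source.

```python
def minimax(node, machine_to_move=True):
    """Back up leaf scores: the machine maximizes, the opponent minimizes.

    A node is either a number (an evaluated board position) or a list of
    child nodes reachable by one move.
    """
    if isinstance(node, (int, float)):
        return node
    scores = [minimax(child, not machine_to_move) for child in node]
    return max(scores) if machine_to_move else min(scores)


def best_move(root):
    """Return (index of the best initial move, its backed-up score)."""
    scores = [minimax(child, machine_to_move=False) for child in root]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Three initial moves; the third looks one level deeper than the others.
tree = [[3, 5], [2, 9], [[0, 7], [4, 1]]]
```

In this example the second move leads to the highest leaf score (9), but the opponent would never cooperate in reaching it, so the routine prefers the third move with a backed-up score of 4.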

The opponent is then permitted to make his move, which can be communicated to the machine either by means of console switches or by means of punched cards. The computer verifies the legality of the opponent’s move, rejecting⁸ or accepting it, and the process is repeated. When the program can look ahead and predict a win, this fact is reported on the

⁸ The only departure from complete generality of the game as programmed is that the program requires the opponent to make a permissible move, including the taking of a capture if one is offered. “Huffing” is not permitted.


printer. Similarly, the program concedes when it sees that it is going to lose.

Ply Limitations

Playing-time considerations make it necessary to limit the look-ahead distance to some fairly small value. This distance is defined as the ply (a ply of 2 consisting of one proposed move by the machine and the anticipated reply by the opponent). The ply is not fixed but depends upon the dynamics of the situation, and it varies from move to move and from branch to branch during the move analysis. A great many schemes of adjusting the look-ahead distance have been tried at various times, some of them quite complicated. The most effective one, although quite detailed, is simple in concept and is as follows. The program always looks ahead a minimum distance, which for the opening game and without learning is usually set at three moves. At this minimum ply the program will evaluate the board position if none of the following conditions occurs: (1) the next move is a jump, (2) the last move was a jump, or (3) an exchange offer is possible. If any one of these conditions exists, the program continues looking ahead. At a ply of 4 the program will stop and evaluate the resulting board position if conditions (1) and (3) above are not met. At a ply of 5 or greater, the program stops the look-ahead whenever the next ply level does not offer a jump. At a ply of 11 or greater, the program will terminate the look-ahead, even if the next move is to be a jump, should one side at this time be ahead by more than two kings (to prevent the needless exploration of obviously losing or winning sequences). The program stops at a ply of 20 regardless of all conditions (since the memory space for the look-ahead moves is then exhausted) and an adjustment in score is made to allow for the pending jump. Finally, an adjustment is made in the levels of the break points between the different conditions when time is saved through rote learning (see below) and when the total number of pieces on the board falls below an arbitrary number. All break points are determined by single data words which can be changed at any time by manual intervention.
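The stopping rules above can be collected into a single predicate. This is only a sketch of one reading of the text: the `Situation` record, its field names, and the `king_lead` measure of how far one side is ahead are hypothetical, and the break points are written as module constants to mirror the remark that they are adjustable data words.

```python
from dataclasses import dataclass

# Break points, kept as data so they can be changed (values from the text).
MIN_PLY = 3
NO_LAST_JUMP_PLY = 4
JUMP_ONLY_PLY = 5
LEAD_CUTOFF_PLY = 11
MAX_PLY = 20
KING_LEAD_LIMIT = 2


@dataclass
class Situation:
    ply: int
    next_move_is_jump: bool        # condition (1)
    last_move_was_jump: bool       # condition (2)
    exchange_offer_possible: bool  # condition (3)
    king_lead: int = 0             # hypothetical: lead of one side, in kings


def should_stop(s: Situation) -> bool:
    """Return True if the look-ahead should end and the board be evaluated."""
    if s.ply >= MAX_PLY:
        return True                        # memory for look-ahead exhausted
    if s.ply < MIN_PLY:
        return False                       # always look ahead a minimum distance
    if s.ply < NO_LAST_JUMP_PLY:           # ply 3: none of (1), (2), (3)
        return not (s.next_move_is_jump or s.last_move_was_jump
                    or s.exchange_offer_possible)
    if s.ply < JUMP_ONLY_PLY:              # ply 4: neither (1) nor (3)
        return not (s.next_move_is_jump or s.exchange_offer_possible)
    if s.ply >= LEAD_CUTOFF_PLY and s.king_lead > KING_LEAD_LIMIT:
        return True                        # obviously won or lost sequence
    return not s.next_move_is_jump         # ply 5 and beyond: only jumps extend
```

The score adjustment for a pending jump at ply 20 and the shifting of break points under rote learning are not modeled here.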

This tying of the ply with board conditions achieves three desired results. In the first place, it permits board evaluations to be made under conditions of relative stability for so-called dead positions, as defined by Turing (Bowden, 1953). Secondly, it causes greater surveillance of those paths which offer better opportunities for gaining or losing an advantage. Finally, since branching is usually seriously restricted by a jump situation, the total number of board positions and moves to be considered is still held down to a reasonable number and is more equitably distributed between the various possible initial moves.

As a practical matter, machine playing time usually has been limited to approximately 30 seconds per move. Elaborate table look-up procedures, fast sorting and searching procedures, and a variety of new programming tricks were developed, and full use was made of all of the resources of the IBM 704 to increase the operating speed as much as possible. One can, of course, set the playing time at any desired value by adjustments of the permitted ply; too small a ply results in a bad game and too large a ply makes the game unduly costly in terms of machine time.

Other Modes of Play

For study purposes the program was written to accommodate several variations of this basic plan. One of these permits the program to play against itself, that is, to play both sides of the game. This mode of play has been found to be especially good during the early stages of learning.

The program can also follow book games presented to it either on cards or on magnetic tape. When operating in this mode, the program decides at each point in the game on its next move in the usual way and reports this proposed move. Instead of actually making this move, the program refers to the stored record of a book game and makes the book move. The program records its evaluation of the two moves, and it also counts and reports the number of possible moves which the program rates as being better than the book move and the number it rates as being poorer. The sides are then reversed and the process is repeated. At the end of a book game a correlation coefficient is computed, relating the machine’s indicated moves to those moves adjudged best by the checker masters.⁹
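The correlation coefficient mentioned here is defined in the accompanying footnote in terms of two counts, L (moves judged poorer than the book move) and H (moves judged better). A minimal sketch follows, assuming the formula reads C = (L - H)/(L + H); the source prints the numerator with a stray symbol, but only L and H are defined. The function name and the error handling are illustrative.

```python
def move_correlation(poorer, better):
    """Return C = (L - H) / (L + H).

    poorer: L, the number of legal moves rated poorer than the book moves.
    better: H, the number of legal moves rated better than the book moves.
    """
    total = poorer + better
    if total == 0:
        raise ValueError("no alternative moves were rated")
    return (poorer - better) / total
```

C is +1 when the program never rates any move above the book move, 0 when it rates as many moves better as poorer, and -1 when every alternative is preferred to the book move.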

It should be noted that the emphasis throughout all of these studies has been on learning techniques. The temptation to improve the machine’s game by giving it standard openings or other man-generated knowledge of playing techniques has been consistently resisted. Even when book games are played, no weight is given to the fact that the moves as listed are presumably the best possible moves under the circumstances.

For demonstration purposes, and also as a means of avoiding lost machine time while an opponent is thinking, it is sometimes convenient to play several simultaneous games against different opponents. With the program in its present form the most convenient number for this purpose has been found to be six, although eight have been played on a number of occasions.

Games may be started with any initial configuration for the board position so that the program may be tested on end games, checker puzzles, et cetera. For nonstandard starting conditions, the program lists the initial

⁹ This coefficient is defined as C = (L - H)/(L + H), where L is the total number of different legal moves which the machine judged to be poorer than the indicated book moves, and H is the total number which it judged to be better than the book moves.


piece arrangement. From time to time, and at the end of each game, the program also tabulates various bits of statistical information which assist in the evaluation of playing performance.

Numerous other features have also been added to make the program convenient to operate (for details see Appendix A), but these have no direct bearing on the problem of learning, to which we will now turn our attention.

Rote Learning and Its Variants

Perhaps the most elementary type of learning worth discussing would be a form of rote learning in which the program simply saved all of the board positions encountered during play, together with their computed scores. Reference could then be made to this memory record and a certain amount of computing time might be saved. This can hardly be called a very advanced form of learning; nevertheless, if the program then utilizes the saved time to compute further in depth it will improve with time.

Fortunately, the ability to store board information at a ply of 0 and to look up boards at a larger ply provides the possibility of looking much farther in advance than might otherwise be possible. To understand this, consider a very simple case where the look-ahead is always terminated at a fixed ply, say 3. Assume further that the program saves only the board positions encountered during the actual play with their associated backed-up scores. Now it is this list of previous board positions that is used to look up board positions while at a ply level of 3 in the subsequent games. If a board position is found, its score has, in effect, already been backed up by three levels, and if it becomes effective in determining the move to be made, it is a 6-ply score rather than a simple 3-ply score. This new initial board position with its 6-ply score is, in turn, saved and it may be encountered in a future game and the score backed up by an additional set of three levels, et cetera. This procedure is illustrated in Fig. 3. The incorporation of this variation, together with the simpler rote-learning feature, results in a fairly powerful learning technique which has been studied in some detail.
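The way a stored score deepens the effective ply can be illustrated with a toy search. This is a sketch under several assumptions that are not from the source: positions are integers in a binary tree, scores are always taken from the viewpoint of the side to move (a negamax convention standing in for the standardized boards), and `search`, `successors`, `static_score`, and `LEAF_SCORES` are hypothetical names.

```python
def search(board, depth, memory, successors, static_score):
    """Return (backed-up score, effective ply) for the side to move at `board`."""
    if depth == 0:
        if board in memory:                # rote lookup at the look-ahead horizon
            return memory[board]           # the score carries its own ply count
        return static_score(board), 0
    best = None
    for child in successors(board):
        score, ply = search(child, depth - 1, memory, successors, static_score)
        candidate = (-score, ply + 1)      # flip viewpoint, add one ply
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


def successors(n):
    """Toy game: every position has two replies."""
    return [2 * n + 1, 2 * n + 2]


LEAF_SCORES = {8: 9, 9: -9, 10: -9, 11: 9, 12: 9, 13: 9, 14: 9}


def static_score(n):
    """Toy evaluation: a table with a default value."""
    return LEAF_SCORES.get(n, -4)


memory = {}
memory[7] = search(7, 3, memory, successors, static_score)   # an earlier game
shallow = search(0, 3, {}, successors, static_score)         # no rote memory
deep = search(0, 3, memory, successors, static_score)        # with rote memory
```

Here position 7 was analyzed to 3 ply in an earlier game. When a later 3-ply search from position 0 reaches it, the result is a 6-ply score, and in this toy case the deeper information reverses the shallow assessment.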

Several additional features had to be incorporated into the program before it was practical to embark on learning studies using this storage scheme. In the first place, it was necessary to impart a sense of direction to the program in order to force it to press on toward a win. To illustrate this, consider the situation of two kings against one king, which is a winning combination for practically all variations in board positions. In time, the program can be assumed to have stored all of these variations, each associated with a winning score. Now, if such a situation is encountered, the program will look ahead along all possible paths and each path will


[Figure 3: look-ahead tree with ply numbers 1 to 3. Evaluations would normally be made at ply 3; a typical board position found in memory carries its score from a previous look-ahead search.]

Figure 3. Simplified representation of the rote-learning process, in which information saved from a previous game is used to increase the effective ply of the backed-up score.

lead to a winning combination, in spite of the fact that only one of the possible initial moves may be along the direct path toward the win while all of the rest may be wasting time. How is the program to differentiate between these?

A good solution is to keep a record of the ply value of the different board positions at all times and to make a further choice between board positions on this basis. If ahead, the program can be arranged to push directly toward the win while, if behind, it can be arranged to adopt delaying tactics. The most recent method used is to carry the effective ply along with the score by simply decreasing the magnitude of the score a small amount each time it is backed up a ply level during the analyses. If the program is now faced with a choice of board positions whose scores differ only by the ply number, it will automatically make the most advantageous choice: a low-ply alternative if winning and a high-ply alternative if losing. This sense of direction should not be overlooked. Even without “learning,” it is very important. Several of the early attempts at learning failed because the direction sense was not properly taken into account.
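The effect of shrinking the score slightly at each backed-up level can be shown in a few lines. This is a minimal sketch: the discount of one unit per ply and the score of 1000 for a decided game are hypothetical values, and the function names are illustrative.

```python
PLY_DISCOUNT = 1   # hypothetical "small amount" removed per ply level


def back_up(score, discount=PLY_DISCOUNT):
    """Reduce the magnitude of a score by `discount`, never crossing zero."""
    if score > 0:
        return max(score - discount, 0)
    if score < 0:
        return min(score + discount, 0)
    return 0


def backed_up_score(leaf_score, plies):
    """Score as seen after being backed up through `plies` levels."""
    for _ in range(plies):
        leaf_score = back_up(leaf_score)
    return leaf_score


WIN, LOSS = 1000, -1000   # hypothetical scores for decided positions
quick_win, slow_win = backed_up_score(WIN, 2), backed_up_score(WIN, 6)
quick_loss, slow_loss = backed_up_score(LOSS, 2), backed_up_score(LOSS, 6)
```

Because the machine always takes the larger score, it prefers the win that is 2 ply away over the one 6 ply away, and prefers the loss that is 6 ply away over the one 2 ply away, with no separate rule for either case.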

Cataloging and Culling Stored Information

Since practical considerations limit the number of board positions which can be saved, and since the time to search through those that are saved can [...] positions which are not believed to be of much value. The most effective cataloging system found to date starts by standardizing all board positions, first by reversing the pieces and piece positions if it is a board position in which White is to move, so that all boards are reported as if it were Black’s turn to move. This reduces by nearly a factor of two the number of boards which must be saved. Board positions, in which all of the pieces are kings, can be reflected about the diagonals with a possible fourfold reduction in the number which must be saved. A more compact board representation than the one employed during play is also used so as to minimize the storage requirements.

After the board positions are standardized, they are grouped into records on the basis of (1) the number of pieces on the board, (2) the presence or absence of a piece advantage, (3) the side possessing this advantage, (4) the presence or absence of kings on the board, (5) the side having the so-called “move,” or opposition advantage, and finally (6) the first moments of the pieces about normal and diagonal axes through the board. During play, newly acquired board positions are saved in the memory until a reasonable number have been accumulated, and they are then merged with those on the “memory tape” and a new memory tape is produced. Board positions within a record are listed in a serial fashion, being sorted with respect to the words which define them. The records are arranged on the tape in the order that they are most likely to be needed during the course of a game, board positions with 12 pieces to a side coming first, et cetera. This method of cataloging is very important because it cuts tape-searching time to a minimum.

Reference must be made, of course, to the board positions already saved, and this is done by reading the correct record into the memory and searching through it by a dichotomous search procedure. Usually five or more records are held in memory at one time, the exact number at any time depending upon the lengths of the particular records in question. Normally, the program calls three or four new records into memory during each new move, making room for them as needed, by discarding the records which have been held the longest.
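A dichotomous (binary) search through one sorted record can be sketched with Python's standard `bisect` module. The record layout here is an assumption made for illustration: each board is a tuple of machine words used as the sort key, held in one list, with the scores in a parallel list. The function name and sample data are hypothetical.

```python
from bisect import bisect_left


def lookup(keys, scores, board_key):
    """Binary-search a sorted record; return the stored score or None.

    keys:   sorted list of board keys (tuples of machine words).
    scores: list of scores, parallel to `keys`.
    """
    i = bisect_left(keys, board_key)
    if i < len(keys) and keys[i] == board_key:
        return scores[i]
    return None


# A tiny record of three boards, sorted by their defining words.
record_keys = [(1, 2), (3, 4), (5, 6)]
record_scores = [10, -3, 7]
```

Each lookup halves the remaining span of the record, so the number of comparisons grows only with the logarithm of the record length.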

Two different procedures have been found to be of value in limiting the number of board positions that are saved; one based on the frequency of use, and the second on the ply. To keep track of the frequency of use, an age term is carried along with the score. Each new board position to be saved is arbitrarily assigned an age. When reference is made to a stored board position, either to update its score or to utilize it in the look-ahead procedure, the age recorded for this board position is divided by two. This is called refreshing. Offsetting this, each board position is automatically aged by one unit at the memory merge times (normally occurring about once every 20 moves). When the age of any one board position reaches an arbitrary maximum value this board position is expunged from the record. This is a form of forgetting. New board positions which remain unused are soon forgotten, while board positions which are used several times in succession will be refreshed to such an extent that they will be remembered even if not used thereafter for a fairly long period of time. This form of refreshing and forgetting was adopted on the basis of reflections as to the frailty of human memories. It has proven to be very effective.
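The refreshing and forgetting scheme can be sketched as a small class. This is only an illustration: the starting age of 8, the maximum age of 12, the use of integer halving, and the class and method names are all assumptions, since the text gives the rules but not the values.

```python
INITIAL_AGE = 8   # hypothetical; the text says the age is "arbitrarily assigned"
MAX_AGE = 12      # hypothetical "arbitrary maximum value"


class BoardMemory:
    """Stored board positions with an age term carried alongside each score."""

    def __init__(self):
        self.entries = {}                  # board -> [score, age]

    def save(self, board, score):
        self.entries[board] = [score, INITIAL_AGE]

    def refer(self, board):
        """Look up a board; a successful reference halves its age (refreshing)."""
        entry = self.entries.get(board)
        if entry is None:
            return None
        entry[1] //= 2                     # assumption: integer division
        return entry[0]

    def merge(self):
        """At each memory merge, age every board by one unit and forget old ones."""
        for board in list(self.entries):
            self.entries[board][1] += 1
            if self.entries[board][1] >= MAX_AGE:
                del self.entries[board]    # forgetting
```

With these values a board that is never referred to is dropped at the fourth merge, while a board referred to twice soon after saving survives many more merges.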

In addition to the limitations imposed by forgetting, it seemed desirable to place a restriction on the maximum size of any one record. Whenever an arbitrary limit is reached, enough of the lowest-ply board positions are automatically culled from the record to bring the size well below the maximum.

Before embarking on a study of the learning capabilities of the system as just described, it was, of course, first necessary to fix the terms and coefficients in the evaluation polynomial. To do this, a number of different sets of values were tested by playing through a series of book games and computing the move correlation coefficients. These values varied from 0.2 for the poorest polynomial tested, to approximately 0.6 for the one finally adopted. The selected polynomial contained four terms (as contrasted with the use of 16 terms in later experiments). In decreasing order of importance these were: (1) piece advantage, (2) denial of occupancy, (3) mobility, and (4) a hybrid term which combined control of the center and piece advancement.

Rote-learning Tests

After a scoring polynomial was arbitrarily picked, a series of games was played, both self-play and play against many different individuals (several of these being checker masters). Many book games were also followed, some of these being end games. The program learned to play a very good opening game and to recognize most winning and losing end positions many moves in advance, although its midgame play was not greatly improved. This program now qualifies as a rather better-than-average novice, but definitely not as an expert.

At the present time the memory tape contains something over 53,000 board positions (averaging 3.8 words each) which have been selected from a much larger number of positions by means of the culling techniques described. While this is still far from the number which would tax the listing and searching procedures used in the program, rough estimates, based on the frequency with which the saved boards are utilized during normal play (these figures being tabulated automatically), indicate that a library tape containing at least 20 times the present number of board positions would be needed to improve the midgame play significantly. At the present rate of acquisition of new positions this would require an inordinate amount of play and, consequently, of machine time.

The general conclusions which can be drawn from these tests are that:

(1) An effective rote-learning technique must include a procedure to give the program a sense of direction, and it must contain a refined system for cataloging and storing information.

(2) Rote-learning procedures can be used effectively on machines with the data-handling capacity of the IBM 704 if the information which must be saved and searched does not occupy more than, roughly, one million words, and if not more than one hundred or so references need to be made to this information per minute. These figures are, of course, highly dependent upon the exact efficiency of cataloging which can be achieved.

(3) The game of checkers, when played with a simple scoring scheme and with rote learning only, requires more than this number of words for master caliber of play and, as a consequence, is not completely amenable to this treatment on the IBM 704.

(4) A game, such as checkers, is a suitable vehicle for use during the development of learning techniques, and it is a very satisfactory device for demonstrating machine learning procedures to the unbelieving.

Learning Procedure Involving Generalizations

An obvious way to decrease the amount of storage needed to utilize past experience is to generalize on the basis of experience and to save only the generalizations. This should, of course, be a continuous process if it is to be truly effective, and it should involve several levels of abstraction. A start has been made in this direction by having the program select a subset of possible terms for use in the evaluation polynomial and by having the program determine the sign and magnitude of the coefficients which multiply these parameters. At the present time this subset consists of 16 terms chosen from a list of 38 parameters. The piece-advantage term needed to define the task is computed separately and, of course, is not altered by

7 This playing-time requirement, while large in terms of cost, would be less than the time which the checker master probably spends to acquire his proficiency.


tion of any one game. Program Alpha is used to play against human opponents, and during self-play Alpha and Beta play each other.

At the end of each self-play game a determination is made of the relative playing ability of Alpha, as compared with Beta, by a neutral portion of the program. If Alpha wins (or is adjudged to be ahead when a game is otherwise terminated), the then current scoring system used by Alpha is given to Beta. If, on the other hand, Beta wins or is ahead, this fact is recorded as a black mark for Alpha. Whenever Alpha receives an arbitrary number of black marks (usually set at three) it is assumed to be on the wrong track, and a fairly drastic and arbitrary change is made in its scoring polynomial (by reducing the coefficient of the leading term to zero). This action is necessary on occasion, since the entire learning process is an attempt to find the highest point in multidimensional scoring space in the presence of many secondary maxima on which the program can become trapped. By manual intervention it is possible to return to some previous condition or make some other change if it becomes apparent that the learning process is not functioning properly. In general, however, the program seeks to extricate itself from traps and to improve more or less continuously.
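The Alpha and Beta bookkeeping just described can be sketched in a few lines of Python. This is only an illustrative sketch and not the original IBM 704 routine: the names `SelfPlayState` and `record_game`, the dictionary representation of a scoring polynomial, the reading of "leading term" as the term with the largest coefficient in magnitude, and the resetting of the black-mark count after a drastic change are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SelfPlayState:
    alpha_poly: dict          # term name -> coefficient (Alpha's scoring polynomial)
    beta_poly: dict           # Beta's scoring polynomial, fixed during a game
    black_marks: int = 0      # games in which Beta won or was ahead
    max_black_marks: int = 3  # "usually set at three" in the text


def record_game(state: SelfPlayState, alpha_ahead: bool) -> SelfPlayState:
    """Update the bookkeeping after one self-play game."""
    if alpha_ahead:
        # Alpha won or was adjudged ahead: Beta receives Alpha's scoring system.
        state.beta_poly = dict(state.alpha_poly)
        return state
    # Beta won or was ahead: a black mark for Alpha.
    state.black_marks += 1
    if state.black_marks >= state.max_black_marks:
        # Drastic change: zero the coefficient of the leading term.
        # Assumption: "leading" means the largest coefficient in magnitude.
        leading = max(state.alpha_poly, key=lambda t: abs(state.alpha_poly[t]))
        state.alpha_poly[leading] = 0
        state.black_marks = 0  # assumption: the count starts over afterwards
    return state
```

Zeroing the leading coefficient moves Alpha to a quite different point in scoring space, which is the purpose the text gives for the drastic change: to get off a secondary maximum.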

The capability of the program can be tested at any time by having Alpha play one or more book games (with the learning procedure temporarily immobilized) and by correlating its play with the recommendations of the masters or, more interestingly, by pitting it against a human player.

Polynomial Modification Procedure

If Alpha is to make changes in its scoring polynomial, it must be given some trustworthy criteria for measuring performance. A logical difficulty presents itself, since the only measuring parameter available is this same scoring polynomial that the process is designed to improve. Recourse is had to the peculiar property of the look-ahead procedure, which makes it less important for the scoring polynomial to be particularly good the further ahead the process is continued. This means that one can evaluate the relative change in the positions of two players, when this evaluation is made over a fairly large number of moves, by using a scoring system which is much too gross to be significant on a move-by-move basis.

Perhaps an even better way of looking at the matter is that we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board position of the chain of moves which most probably will occur during actual play. Of course, if one could develop a perfect system of this sort it would be the equivalent of always looking ahead to the end of the game. The nearer this ideal is approached, the better would be the play.¹¹

11 There is a logical fallacy in this argument. The program might save only invariant terms which have nothing to do with goodness of play; for example, it might


In order to obtain a sufficiently large span to make use of this characteristic, Alpha keeps a record of the apparent goodness of its board positions as the game progresses. This record is kept by computing the scoring polynomial for each board position encountered in actual play and by saving this polynomial in its entirety. At the same time, Alpha also computes the backed-up score for all board positions, using the look-ahead procedure described earlier. At each play by Alpha the initial board score, as saved from the previous Alpha move, is compared with the backed-up score for the current position. The difference between these scores, defined as delta, is used to check the scoring polynomial. If delta is positive it is reasonable to assume that the initial board evaluation was in error and terms which contributed positively should have been given more weight, while those that contributed negatively should have been given less weight. A converse statement can be made for the case where delta is negative. Presumably, in this case, either the initial board evaluation was incorrect, or a wrong choice of moves was made, and greater weight should have been given to terms making negative contributions, with less weight to positive terms. These changes are not made directly but are brought about in an involved way which will now be described.

A record is kept of the correlation existing between the signs of the individual term contributions in the initial scoring polynomial and the sign of delta. After each play an adjustment is made in the values of the correlation coefficients, due account being taken of the number of times that each particular term has been used and has had a nonzero value. The coefficient for the polynomial term (other than the piece-advantage term) with the then largest correlation coefficient is set at a prescribed maximum value with proportionate values determined for all of the remaining coefficients. Actually, the term coefficients are fixed at integral powers of 2, this power being defined by the ratio of the correlation coefficients. More precisely, if the ratio of two correlation coefficients is equal to or larger than n but less than n + 1, where n is an integer, then the ratio of the two term coefficients is set equal to 2^n. This procedure was adopted in order to increase the range in values of the term coefficients. Whenever a correlation-coefficient calculation leads to a negative sign, a corresponding reversal is made in the sign associated with the term itself.
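A minimal sketch of this power-of-2 rule, read literally, is given below. The function name `term_coefficients`, the default `max_power` (standing in for the "prescribed maximum value"), the treatment of a zero correlation as a zero coefficient, and the floor at 2^0 are assumptions introduced for illustration; the text does not give these details.

```python
import math


def term_coefficients(correlations, max_power=10):
    """Map correlation coefficients to signed power-of-2 term coefficients.

    correlations: dict of term name -> correlation coefficient in [-1, 1].
    The term with the largest |correlation| gets 2**max_power; every other
    term is scaled down by 2**n, where n = floor(largest / own), following
    the rule "ratio in [n, n + 1) gives a coefficient ratio of 2**n".
    """
    leader = max(correlations, key=lambda t: abs(correlations[t]))
    c_max = abs(correlations[leader])
    coeffs = {}
    for term, c in correlations.items():
        if c == 0 or c_max == 0:
            coeffs[term] = 0  # assumption: uncorrelated terms carry no weight
            continue
        if term == leader:
            power = max_power
        else:
            n = math.floor(c_max / abs(c))
            power = max(max_power - n, 0)  # assumption: never go below 2**0
        # A negative correlation reverses the sign associated with the term.
        coeffs[term] = int(math.copysign(2 ** power, c))
    return coeffs
```

For example, with correlations of 0.75, 0.25 and -0.125 the ratios to the leader are 3 and 6, so the three coefficients come out as 2^10, 2^7 and -2^4. A modest spread in correlation is thus stretched into a wide spread in weight, which is the stated purpose of the rule.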

Instabilities

It should be noted that the span of moves over which delta is computed consists of a remembered part and an anticipated portion. During the remembered play, use had been made of Alpha’s current scoring polynomial to determine Alpha’s moves but not to determine the opponent’s moves,

count the squares on the checkerboard. The forced inclusion of the piece-advantage term prevents this.


while during the anticipation play the moves for both sides are made using Alpha’s scoring polynomial. One is tempted to increase the sensitivity of delta as an indicator of change by increasing the span of the remembered portion. This has been found to be dangerous since the coefficients in the evaluation polynomial and, indeed, the terms themselves, may change between the time of the remembered evaluation and the time at which the anticipation evaluation is made. As a matter of fact, this difficulty is present even for a span of one move pair. It is necessary to recompute the scoring polynomial for a given initial board position after a move has been determined and after the indicated corrections in the scoring polynomial have been made, and to save this score for future comparisons, rather than to save the score used to determine the move. This may seem a trivial point, but its neglect in the initial stages of these experiments led to oscillations quite analogous to the instability induced in electrical circuits by long delays in a feedback loop.

As a means of stabilizing against minor variations in the delta values, an arbitrary minimum value was set, and when delta fell below this minimum for any particular move no change was made in the polynomial. This same minimum value is used to set limits for the initial board evaluation score to decide whether or not it will be assumed to be zero. This minimum is recomputed each time and, normally, has been fixed at the average value of the coefficients for the terms in the currently existing evaluation polynomial.

Still another type of instability can occur whenever a new term is introduced into the scoring polynomial. Obviously, after only a single move the correlation coefficient of this new term will have a magnitude of 1, even though it might go to 0 after the very next move. To prevent violent fluctuations due to this cause, the correlation coefficients for newly introduced terms are computed as if these terms had already been used several times and had been found to have a zero correlation coefficient. This is done by replacing the times-used number in the calculation by an arbitrary number (usually set at 16) until the usage does, in fact, equal this number.

After a term has been in use for some time, quite the opposite action is desired so that the more recent experience can outweigh earlier results. This is achieved, together with a substantial reduction in calculation time, by using powers of 2 in place of the actual times used and by limiting the maximum power that is used. To be specific, at any stage of play defined as the Nth move, corrections to the values of the correlation coefficients $C_N$ are made using 16 for N until N equals 32, whereupon 32 is used until N equals 64, et cetera, using the formula

$$C_N = C_{N-1} - \frac{C_{N-1} \pm 1}{N}$$

and a value for N larger than 256 is never used.
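The update can be sketched as follows. The function names `effective_n` and `update_correlation` are hypothetical, and the resolution of the ± sign is an interpretation: the sketch moves the coefficient toward +1 when the sign of the term's contribution agrees with the sign of delta and toward -1 when it disagrees, which makes the formula a running average of ±1 observations.

```python
def effective_n(times_used: int) -> int:
    """Return the N used in the update: 16 until 32 uses, then the largest
    power of 2 not exceeding the usage count, capped at 256."""
    if times_used < 32:
        return 16
    n = 32
    while n * 2 <= times_used and n < 256:
        n *= 2
    return n


def update_correlation(c_prev: float, agrees: bool, times_used: int) -> float:
    """Apply C_N = C_{N-1} - (C_{N-1} -/+ 1) / N for one observation.

    agrees: True if the term's contribution had the same sign as delta
            (assumed reading of the +/- in the formula).
    """
    n = effective_n(times_used)
    target = 1.0 if agrees else -1.0
    return c_prev - (c_prev - target) / n
```

Because N never falls below 16, a newly introduced term starting at zero can move by at most 1/16 on its first use, and because N never exceeds 256, an old term continues to respond to recent experience.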


After a minimum was set for delta it seemed reasonable to attach greater weight to situations leading to large values of delta. Accordingly, two additional categories are defined. If a contribution to delta is made by the first term, meaning that a change has occurred in the piece ratio, the indicated changes in the correlation coefficients are doubled, while if the value of delta is so large as to indicate that an almost sure win or lose will result, the effect on the correlation coefficients is quadrupled.

Term Replacement

Mention has been made several times of the procedure for replacing terms in the scoring polynomial. The program, as it is currently running, contains 38 different terms (in addition to the piece-advantage term), 16 of these being included in the scoring polynomial at any one time and the remaining 22 being kept in reserve. After each move a low-term tally is recorded against that active term which has the lowest correlation coefficient and, at the same time, a test is made to see if this brings its tally count up to some arbitrary limit, usually set at 8. When this limit is reached for any specific term, this term is transferred to the bottom of the reserve list, and it is replaced by a term from the head of the reserve list. This new term enters the polynomial with zero values for its correlation coefficient, times used, and low-tally count. On the average, then, an active term is replaced once each eight moves and the replaced terms are given another chance after 176 moves. As a check on the effectiveness of this procedure, the program reports on the usage which has accrued against each discarded term. Terms which are repeatedly rejected after a minimum amount of usage can be removed and replaced with completely new terms.
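The replacement cycle can be illustrated with a short sketch using a queue for the reserve list. The class name `TermPool`, its field names, and the reading of "lowest correlation coefficient" as lowest in magnitude are assumptions for this example only.

```python
from collections import deque


class TermPool:
    """Active terms plus a first-in, first-out reserve list."""

    def __init__(self, active, reserve, tally_limit=8):
        self.active = {t: self._fresh() for t in active}
        self.reserve = deque(reserve)
        self.tally_limit = tally_limit  # "usually set at 8"

    @staticmethod
    def _fresh():
        # A new term enters with zero correlation, usage, and low-tally count.
        return {"corr": 0.0, "used": 0, "low_tally": 0}

    def after_move(self):
        """Tally the weakest active term; replace it if the limit is reached.

        Returns (retired, introduced) on a replacement, otherwise None.
        """
        # Assumption: "lowest" is taken as lowest in magnitude.
        worst = min(self.active, key=lambda t: abs(self.active[t]["corr"]))
        self.active[worst]["low_tally"] += 1
        if self.active[worst]["low_tally"] < self.tally_limit:
            return None
        del self.active[worst]
        self.reserve.append(worst)        # bottom of the reserve list
        new = self.reserve.popleft()      # head of the reserve list
        self.active[new] = self._fresh()
        return worst, new
```

With 22 terms in reserve and one replacement every eight moves, a retired term waits behind 21 others and re-enters on the 22nd subsequent replacement, that is, 22 × 8 = 176 moves after it was retired, in agreement with the figure quoted in the text.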

It might be argued that this procedure of having the program select terms for the evaluation polynomial from a supplied list is much too simple and that the program should generate the terms for itself. Unfortunately, no satisfactory scheme for doing this has yet been devised. With a man-generated list one might at least ask that the terms be members of an orthogonal set, assuming that this has some meaning as applied to the evaluation of a checker position. Apparently, no one knows enough about checkers to define such a set. The only practical solution seems to be that of including a relatively large number of possible terms in the hope that all of the contributing parameters get covered somehow, even though in an involved and redundant way. This is not an undesirable state of affairs, however, since it simulates the situation which is likely to exist when an attempt is made to apply similar learning techniques to real-life situations.

Many of the terms in the existing list are related in some vague way to the parameters used by checker experts. Some of the concepts which checker experts appear to use have eluded the writer’s attempts at definition, and he has been unable to program them. Some of the terms are quite unrelated to the usual checker lore and have been discovered more or less by accident. The second moment about the diagonal axis through the double corners is an example. Twenty-seven different simple terms are now in use, the rest being combinational terms, as will be described later.

A word might be said about these terms with respect to the exact way in which they are defined and the general procedures used for their evaluation. Each term relates to the relative standings of the two sides, with respect to the parameter in question, and it is numerically equal to the difference between the ratings for the individual sides. A reversal of the sign obviously corresponds to a change of sides. As a further means of insuring symmetry the individual ratings of the respective sides are determined at corresponding times in the play as viewed by the side in question. For example, consider a parameter which relates to the board conditions as left after one side has moved. The rating of Black for such a parameter would be made after Black had moved, and the rating of White would not be made until after White had moved. During anticipation play, these individual ratings are made after each move and saved for future reference. When an evaluation is desired the program takes the differences between the most recent ratings and those made a move earlier. In general, an attempt has been made to define all parameters so that the individual-side ratings are expressible as small positive integers.

Binary Connective Terms

In addition to the simple terms of the type just described, a number of combinational terms have been introduced. Without these terms the scoring polynomial would, of course, be linear. A number of different ways of introducing nonlinear terms have been devised but only one of these has been tested in any detail. This scheme provides terms which have some of the properties of binary logical connectives. Four such terms are formed for each pair of simple terms which are to be related. This is done by making an arbitrary division of the range in values for each of the simple terms and assigning the binary values of 0 and 1 to these ranges. Since most of the simple terms are symmetrical about 0, this is easily done on a sign basis. The new terms are then of the form $A \cdot B$, $A \cdot \bar{B}$, $\bar{A} \cdot B$, and $\bar{A} \cdot \bar{B}$, yielding values either of 0 or 1. These terms are introduced into the scoring polynomial with adjustable coefficients and signs, and are thereafter indistinguishable from the other terms.

As it would require some 1404 such combinational terms to interrelate the 27 simple terms originally used, it was found desirable to limit the actual number of combinational terms used at any one time to a small fraction of these and to introduce new terms only as it became possible to retire older ineffectual terms. The terms actually used are given in Appendix C.
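A brief sketch makes both the construction and the count concrete. The helper names `binarize`, `connective_terms`, and `count_connective_terms` are hypothetical, and mapping a value of exactly zero to 0 is an assumption; the text says only that the division is made on a sign basis.

```python
from itertools import combinations


def binarize(value: int) -> int:
    """Sign-based split: positive -> 1, otherwise 0 (zero handling assumed)."""
    return 1 if value > 0 else 0


def connective_terms(a: int, b: int) -> tuple:
    """Return the four terms (A.B, A.not-B, not-A.B, not-A.not-B) for one pair."""
    A, B = binarize(a), binarize(b)
    return (A & B, A & (1 - B), (1 - A) & B, (1 - A) & (1 - B))


def count_connective_terms(n_simple: int) -> int:
    """Four connective terms for every unordered pair of simple terms."""
    return 4 * sum(1 for _ in combinations(range(n_simple), 2))
```

Exactly one of the four terms is 1 for any pair of inputs, and 27 simple terms give 351 unordered pairs, hence 4 × 351 = 1404 combinational terms, the figure quoted above.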

Preliminary Learning-by-generalization Tests

An idea of the learning ability of this procedure can be gained by analyzing an initial test series of 28 games¹² played with the program just described. At the start an arbitrary selection of 16 terms was chosen and all terms were assigned equal weights. During the first 14 games Alpha was assigned the White side, with Beta constrained as to its first move (two cycles of the seven different initial moves). Thereafter, Alpha was assigned Black and White alternately. During this time a total of 29 different terms was discarded and replaced, the majority of these on two different occasions.

Certain other figures obtained during these 28 games are of interest. At frequent intervals the program lists the 12 leading terms in Alpha’s scoring polynomial with their correlation coefficients and a running count of the number of times these coefficients have been altered. Based on these samplings, one observes that at least 20 different terms were assigned the largest coefficient at some time or other, some of these alternating with other terms a number of times, and two even reappearing at the top of the list with their signs reversed. While these variations were more violent at the start of the series of games and decreased as time went on, their presence indicated that the learning procedure was still not completely stable. During the first seven games there were at least 14 changes in occupancy at the top of the list involving 10 different terms. Alpha won three of these games and lost four. The quality of the play was extremely poor. During the next seven games there were at least eight changes made in the top listing involving five different terms. Alpha lost the first of these games and won the next six. Quality of play improved steadily but the machine still played rather badly. During Games 15 through 21 there were eight changes in the top listing involving five terms, Alpha winning five games and losing two. Some fairly good amateur players who played the machine during this period agreed that it was “tricky but beatable.” During Games 22 through 28 there were at least four changes involving three terms. Alpha won two games and lost five. The program appeared to be approaching a quality of play which caused it to be described as “a better-than-average player.” A detailed analysis of these results indicated that the learning procedure did work and that the rate of learning was surprisingly high, but that the learning was quite erratic and none too stable.

Second Series of Tests

Some of the more obvious reasons for this erratic behavior in the first series of tests have been identified. The program was modified in several

12 The games averaged 68 moves (34 to a side) of which approximately 20 caused changes to be made in the scoring polynomial.


respects to improve the situation, and additional tests were made. Four of these modifications are important enough to justify a detailed explanation.

In the first place, the program was frequently fooled by bad play on the part of its opponent. A simple solution was to change the correlation coefficients less drastically when delta was positive than when delta was negative. The procedure finally adopted for the positive delta case was to make corrections to selected terms in the polynomial only. When the scoring polynomial was positive, changes were made to coefficients associated with the negatively contributing terms, and when the polynomial was negative, changes were made to the coefficients associated with positively contributing terms. No changes were made to coefficients associated with terms which happened to be zero. For the negative delta case, changes were made to the coefficients of all contributing terms, just as before.
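The selection rule can be written down compactly. In the sketch below the function name `terms_to_adjust` and the dictionary of signed term contributions are hypothetical, and two cases the text leaves open are handled by assumption: a delta of exactly zero, and a positive delta with a polynomial that sums to zero, both select no terms.

```python
def terms_to_adjust(contributions: dict, delta: float) -> set:
    """Pick the terms whose correlation coefficients should be corrected.

    contributions: term name -> signed contribution to the initial score.
    delta: backed-up score minus the saved initial score.
    """
    nonzero = {t: v for t, v in contributions.items() if v != 0}
    if delta < 0:
        # Negative delta: all contributing terms are corrected, as before.
        return set(nonzero)
    if delta == 0:
        return set()  # assumption: nothing to correct
    total = sum(contributions.values())
    if total > 0:
        # Positive polynomial: only negatively contributing terms change.
        return {t for t, v in nonzero.items() if v < 0}
    if total < 0:
        # Negative polynomial: only positively contributing terms change.
        return {t for t, v in nonzero.items() if v > 0}
    return set()  # assumption: a zero polynomial selects no terms
```

The asymmetry is deliberate: a positive delta may only reflect a blunder by the opponent, so in that case the correction is confined to the terms that had argued against the sign of the initial evaluation.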

A second defect seemed to be connected with the too frequent introduction of new terms into the scoring polynomial and the tendency for these new terms to assume dominant positions on the basis of insufficient evidence. This was remedied by the simple expedient of decreasing the rate of introduction of new terms from one every eight moves to one every 32 moves.

The third defect had to do with the complete exclusion from consideration of many of the board positions encountered during play by reason of the minimum limit on delta. This resulted in the misassignment of credit to those board positions which permitted spectacular moves when the credit rightfully belonged to earlier board positions which had permitted the necessary ground-laying moves. Although no precise way has yet been devised to ensure the correct assignment of credit, a very simple expedient was found to be most effective in minimizing the adverse effects of earlier assignments. This expedient was to allow the span of remembered moves, over which delta is computed, to increase until delta exceeded the arbitrary minimum value, and then to apply the corrections to the coefficients as dictated by the terms in the retained polynomial for this earlier board position. In this case, the difficulty which was mentioned in the section on Instabilities in connection with an arbitrary increase in span does not occur after each correction, since no changes are made in the coefficients of the scoring polynomial as long as delta is below the minimum value. Of course, whenever delta does exceed the minimum value the program must then recompute the initial scoring polynomial for the then current board position and so restart the procedure with a span of a single remembered move pair. This over-all procedure rectifies the defect of assigning credit to a board position that lies too far along the move chain, but it introduces the possibility of assigning credit to a board position that is not far enough along.


As a partial expedient to compensate for this newly introduced danger, a change was made in the initial board evaluation. Instead of evaluating the initial board positions directly, as was done before, a standard but rudimentary tree search (terminated after the first nonjump move) was used. Errors due to impending jump situations were eliminated by this procedure, and because of the greater accuracy of the evaluation it was possible to reduce the minimum delta limit by a small amount.

Finally, to avoid the danger of having Beta adopt Alpha’s polynomial as a result of a chance win on Alpha’s part (or perhaps a situation in which Alpha had allowed its polynomial to degenerate after an early or midgame advantage had been gained), it was decided to require a majority of wins on Alpha’s part before Beta would adopt Alpha’s scoring polynomial.

With these modifications, a new series of tests was made. In order to reduce the learning time, the initial selection of terms was made on the basis of the results obtained during the earlier tests, but no attention was paid to their previously assigned weights. In contrast with the earlier erratic behavior, the revised program appeared to be extremely stable, perhaps at the expense of a somewhat lower initial learning rate. The way in which the character of the evaluation polynomial altered as learning progressed is shown in Fig. 4.

The most obvious change in behavior was in regard to the relative number of games won by Alpha and the prevalence of draws. During the first 28 games of the earlier series Alpha won 16 and lost 12. The corresponding figures for the first 28 games of the new series were 18 won by Alpha, and four lost, with six draws. In all cases the games were terminated, if not finished, in 70 moves and a judgment made in terms of the final positions. Unfortunately, these figures are not strictly comparable because of the decreased frequency with which Beta adopted Alpha’s polynomial during the second series, both by design and because a programming error immobilized the adoption procedure during part of the tests. Nevertheless, the great decrease in the number of losses and the prevalence of draws seemed to indicate that the learning process was much more stable. Some typical games from this second series are given in Appendix B.

As learning proceeds, it should become harder and harder for Alpha to improve its game, and one would expect the number of wins by Alpha to decrease with time. If secondary maxima in scoring space are encountered, one might even find situations in which Alpha wins less than half of the games. With Beta at such a maximum any minor change in Alpha’s polynomial would result in a degradation of its play, and several oscillations about the maximum might occur before Alpha landed at a point which would enable it to beat Beta. Some evidence of this trend is discernible in the play, although many more games will have to be played before it can be observed with certainty.


Figure 4. Second series of learning-by-generalization tests. Coefficients assigned by ... plotted as a function of the number of games played. Two regions of special ... found that the initial signs of many of the terms had been set incorrectly, and ... or 32 games.

The tentative conclusions which can be drawn from these tests are:

(1) A simple generalization scheme of the type here used can be an effective learning device for problems amenable to tree-searching procedures.

(2) The memory requirements of such schemes are quite modest and remain fixed with time.

(3) The operating times are also reasonable and remain fixed, independent of the amount of accumulated learning.

(4) Incipient forms of instability in the solution can be expected but, at least for the checker program, these can be dealt with by quite straightforward procedures.

(5) Even with the incomplete and redundant set of parameters which



Rote Learning vs. Generalization

Some interesting comparisons can be made between the playing style developed by the learning-by-generalization program and that developed by the earlier rote-learning procedure. The program with rote learning soon learned to imitate master play during the opening moves. It was always quite poor during the middle game, but it easily learned how to avoid most of the obvious traps during end-game play and could usually drive on toward a win when left with a piece advantage. The program with the generalization procedure has never learned to play in a conventional manner and its openings are apt to be weak. On the other hand, it soon learned to play a good middle game, and with a piece advantage it usually polishes off its opponent in short order. Interestingly enough, after 28 games it had still not learned how to win an end game with two kings against one in a double corner.

Apparently, rote learning is of the greatest help either under conditions when the results of any specific action are long delayed or in those situations where highly specialized techniques are required. Contrasting with this, the generalization procedure is most helpful in situations in which the available permutations of conditions are large in number and when the consequences of any specific action are not long delayed.

Procedures Involving Both Forms of Learning

The next obvious step is to combine the better features of the rote-learning procedure with a generalization scheme. This must be done with some care, since it is not practical to update the previously saved information after every change in the evaluation polynomial. A compromise solution might be to save only a very limited amount of information during the early stages of learning and to increase the amount as warranted by the increasing stability of the evaluation coefficient with learning. For example, the program could be arranged to save only the piece-advantage term at the start. At some stage in the learning process the next term could be added, perhaps when no change had been made in the parameter used for this term during some fairly long period, say for three complete games. If and when the program is able to play an additional period without changes in the next parameter, this could also be added, et cetera. Whenever a change does occur in a parameter previously assumed to be stable, the entire memory tape could be reviewed, all terms involving the changed parameter and those lower on the list could be expunged, and the program could drop back to the earlier condition with respect to its term-saving schedule.

Another solution would be to utilize the generalization scheme alone until it had become fairly stable and to introduce rote learning at this time. It is, of course, perfectly feasible to salvage much of the learning which has been accumulated by both of the programs studied to date. This could be done by appending an abridged form of the present memory tape to the generalization scheme in its present stage of learning and by proceeding from there in accordance with the first solution proposed above.


Future Development

While it is believed that these tests have reached the stage of diminishing returns, some effort might well be expended in an attempt to get the program to generate its own parameters for the evaluation polynomial. Lacking a perfectly general procedure, it might still be possible to generate terms based on theories as proposed by students of the game. This procedure would be at variance with the writer’s previous philosophy, but it is highly likely that similar compromises will have to be made when one attempts to apply learning procedures to problems of economic importance.

Conclusions

As a result of these experiments one can say with some certainty that it is now possible to devise learning schemes which will greatly outperform an average person and that such learning schemes may eventually be economically feasible as applied to real-life problems.

Appendix A: Programming Details

Approximate Size of Program

Basic checker-playing routine                    1100 instructions
Input, move verification and output              1400 instructions
Game starting and terminating routines            600 instructions
Loaders, table generators, dumping, et cetera     850 instructions
Statistical and analytical routines               700 instructions
Rote-learning routines                           1600 instructions
Generalization-learning routines                  650 instructions
Tables and constants for basic play               700 words
Working space for basic play                     2000 words
Working space for generalization learning         600 words
Working space for rote learning                  Balance of memory

Approximate Computation Times

To find all available moves from given board position                 2.6 milliseconds
To make a single move and find resulting board position               1.5 milliseconds
To evaluate a board position (4 terms)                                2.4 milliseconds
To find score for a saved board position (rote learning)              2.3 milliseconds
To evaluate position (with 16 terms for generalization learning)      7.5 milliseconds

Board Representations

The standard checkerboard numbering system (see Appendix B) is used in communicating with the machine. A modified numbering system is used for internal computations, the numbers shown on the squares in Fig. 5 corresponding to the bit positions in an IBM 704 word. Any given board position is represented by four such words; one word (FA) containing 1’s in those bit positions corresponding to squares containing pieces of the color whose turn it is to move and which normally move in a forward direction. To be specific, if it is Black’s turn to move (i.e., if Black is “active”) FA designates the location of all of Black’s pieces, both men and kings. Conversely, if White is active, FA designates the location of White’s kings only, since White’s men can only move in the direction arbitrarily called backward. The other words designate, respectively: BA, backward active pieces; FP, forward passive pieces; and BP, backward passive pieces.

Figure 5. Checkerboard notation for internal computations. [Board diagram; sides labeled White and Black.]
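As a rough illustration of the four-word scheme described above, the following Python sketch derives FA, BA, FP and BP from separate words for each side's men and kings. The function name, the argument names and the use of Python integers as bit words are assumptions made for the example. Only the rule for FA is stated explicitly in the text; the other three words are filled in here by analogy.

```python
def board_words(black_men, black_kings, white_men, white_kings, black_active):
    """Return (FA, BA, FP, BP) as integer bit words.

    Black's men move "forward" and White's men move "backward"; kings of
    either color can move both ways. Each argument is a word with 1's in
    the bit positions of the corresponding pieces (hypothetical inputs).
    """
    black_forward = black_men | black_kings   # all Black pieces can go forward
    black_backward = black_kings              # only Black kings can go backward
    white_forward = white_kings               # only White kings can go forward
    white_backward = white_men | white_kings  # all White pieces can go backward

    if black_active:
        return black_forward, black_backward, white_forward, white_backward
    return white_forward, white_backward, black_forward, black_backward
```

Calling `board_words(..., black_active=True)` and then again with `black_active=False` on the same piece words simply swaps the active and passive pairs, which is the reversal of roles that takes place after every move.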

To conserve space when writing on tape, three words are used to record board positions with kings, and only two words are used for board positions without kings. These are saved in a standardized form, as explained in the text.

Possible moves are designated by five words: one word to indicate by its sign (with the word itself containing other information) whether the moves are jumps or not. (If a jump is available, only jump moves are saved.) The other four words designate the location of those pieces which can move in the four different diagonal directions: RF, for right forward; LF, for left forward; LB, for left backward; and RB, for right backward, respectively.

By reference to Fig. 5, it will be observed that a right-forward move results in an increase of 4 in the square designation, while a left-forward move results in an increase of 5. Bit positions 9, 18 and 27 do not appear on the board. This notation makes it possible to compute available moves for all pieces simultaneously. Having previously computed a word called EMPTY, which contains 1’s in locations corresponding to all unoccupied squares, one can compute RF, for the normal move case, in four instructions, as listed below (in IBM 704 symbolic language).

    CAL  EMPTY   (puts word EMPTY into the accumulator)
    ALS  4       (shifts word to left by 4 positions)
    ANA  FA      (forms logical AND between EMPTY and FA)
    STO  RF      (stores word as newly computed RF)
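The same shift-and-mask idea can be tried out in Python. The sketch below is only an illustration: it models a 36-bit word as a Python integer, numbers bit positions from the left starting at 0, and treats positions 1 to 35 (less 9, 18 and 27) as the playable squares. The helper names `bit`, `word`, `squares` and `movers`, and the sample position, are assumptions made for the example and do not reproduce the layout of Fig. 5.

```python
WORD_BITS = 36
MASK = (1 << WORD_BITS) - 1
OFF_BOARD = {9, 18, 27}
ON_BOARD = [p for p in range(1, 36) if p not in OFF_BOARD]

def bit(position):
    """Word with a single 1 in the given bit position (0 is the leftmost bit)."""
    return 1 << (WORD_BITS - 1 - position)

def word(positions):
    """Word with 1's in all of the given bit positions."""
    result = 0
    for p in positions:
        result |= bit(p)
    return result

def squares(w):
    """Sorted list of the bit positions that hold a 1."""
    return [p for p in range(WORD_BITS) if w & bit(p)]

def movers(empty, fa, step):
    """Pieces in `fa` whose target square, `step` positions higher, is empty."""
    shifted = (empty << step) & MASK   # CAL EMPTY, then ALS step
    return shifted & fa                # ANA FA; the result would be stored as RF or LF

fa = word([1, 2, 14])                  # hypothetical forward-moving active pieces
occupied = word([1, 2, 14, 5])         # position 5 holds some other piece
empty = word(ON_BOARD) & ~occupied

rf = movers(empty, fa, 4)              # right-forward: square designation + 4
lf = movers(empty, fa, 5)              # left-forward:  square designation + 5
```

In this sample position the piece at 1 is blocked to the right-forward by the piece at 5, and the piece at 14 has no right-forward move because position 18 is off the board. Leaving the three unused positions out of EMPTY is what lets a single shift serve every piece at once without edge tests.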

Jump moves are computed by a simple extension of this procedure. Multiple jumps are handled as a sequence of single jumps separated by null-reply moves.
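The text does not spell out the extension. One plausible reading, sketched here in Python under the same assumptions as before (36-bit words as integers, bit positions numbered from the left, a right-forward step of 4), applies the shift twice: once to find passive pieces with an empty square beyond them, and once more to find the active pieces standing in front of those. The function and variable names are hypothetical, and this is not claimed to be the routine actually used.

```python
WORD_BITS = 36
MASK = (1 << WORD_BITS) - 1

def bit(position):
    """Word with a single 1 in the given bit position (0 is the leftmost bit)."""
    return 1 << (WORD_BITS - 1 - position)

def word(positions):
    """Word with 1's in all of the given bit positions."""
    result = 0
    for p in positions:
        result |= bit(p)
    return result

def right_forward_jumpers(empty, fa, passive):
    """Active pieces that can jump right-forward over a passive piece."""
    # Passive pieces that have an empty square 4 positions beyond them.
    capturable = ((empty << 4) & MASK) & passive
    # Active pieces standing 4 positions before such a passive piece.
    return ((capturable << 4) & MASK) & fa

on_board = word(p for p in range(1, 36) if p not in (9, 18, 27))
fa = word([1, 2])          # hypothetical active pieces
passive = word([5, 6])     # hypothetical passive pieces
empty = on_board & ~(fa | passive)
jumpers = right_forward_jumpers(empty, fa, passive)
```

Here only the piece at position 2 can jump (over 6, landing on 10). The piece at 1 cannot, because its landing position would be 9, which is not on the board and therefore never appears in EMPTY.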

Additional Timesaving Expedients

Bit counting is done by a table look-up procedure in a closed subroutine of 16 executed instructions (408 microseconds). This requires a 256-word table which is generated at the start by a 13-word program. Similar table look-up procedures are used to turn a word end for end, and to locate the 1’s in a word for move reporting.
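A small Python sketch may make the table look-up idea concrete. It builds a 256-entry table of bit counts, and a second table of reversed bytes, then processes a 36-bit word eight bits at a time. The table-building recurrence, the names `COUNT8`, `REVERSE8`, `count_ones` and `reverse36`, and the choice of 8-bit slices are assumptions made for illustration; the text states only that a 256-word table is used.

```python
# COUNT8[i] is the number of 1's in the 8-bit value i.
COUNT8 = [0] * 256
for i in range(1, 256):
    COUNT8[i] = COUNT8[i >> 1] + (i & 1)

# REVERSE8[i] is the 8-bit value i with its bits in the opposite order.
REVERSE8 = [int(f"{i:08b}"[::-1], 2) for i in range(256)]

def count_ones(word):
    """Count the 1's in a non-negative word, one table look-up per 8 bits."""
    total = 0
    while word:
        total += COUNT8[word & 0xFF]
        word >>= 8
    return total

def reverse36(word):
    """Turn a 36-bit word end for end using the byte-reversal table."""
    padded = 0
    for _ in range(5):                      # 5 bytes cover 40 bits
        padded = (padded << 8) | REVERSE8[word & 0xFF]
        word >>= 8
    return padded >> 4                      # drop the 4 padding bits
```

For example, `count_ones(0b1011)` gives 3, and `reverse36(1)` moves the lowest bit to the highest of the 36 positions.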

Multiplications are usually avoided. In several places where multiplication by small integers must be done, it is programmed in terms of shifts and logical operations.

During the look-ahead procedure a complete record is kept of the sequence of board positions currently under investigation. As a result, no computing is needed to retract moves.
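The effect of keeping such a record can be shown with a short Python sketch. Each new position is appended to a list as the search goes deeper, and retracting a move is nothing more than dropping the last entry. The negamax formulation and the toy counter-removal game are stand-ins chosen for brevity; they are assumptions of the example and not a description of the program's own look-ahead routine.

```python
def look_ahead(record, depth, successors, evaluate):
    """Negamax search; `record` lists the positions on the current line of play."""
    position = record[-1]
    replies = successors(position)
    if depth == 0 or not replies:
        return evaluate(position)
    best = float("-inf")
    for reply in replies:
        record.append(reply)    # advance: add the new position to the record
        score = -look_ahead(record, depth - 1, successors, evaluate)
        record.pop()            # retract: drop the last entry, nothing is recomputed
        best = max(best, score)
    return best

# Toy stand-in for checkers: players alternately remove 1 or 2 counters,
# and whoever removes the last counter wins.
def toy_successors(counters):
    return [counters - take for take in (1, 2) if counters - take >= 0]

def toy_evaluate(counters):
    # Scored for the side to move: facing an empty pile means the game is lost.
    return -1 if counters == 0 else 0

record = [4]
score = look_ahead(record, 10, toy_successors, toy_evaluate)
```

After the search returns, `record` again holds only the starting position, since every position added on the way down was removed on the way back up.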

Appendix B. Sample Games from the Second Series with Generalization Learning

Typical Openings

The first eight moves of selected games in which Alpha played Black against Beta, showing the way in which different types of play were tried.

G-4    G-6    G-1?   G-17   G-18   G-91   G-51   G-97   G-59   G-??   G-43

10-14  11-16  11-16  11-16  11-16  11-16  11-16  12-16  11-16  10-14  11-16
24-19  22-18  22-17  24-20  24-20  24-20  23-18  24-20  24-20  24-20  23-19
14-18  16-20  16-20  10-14   7-11   8-11   7-11   8-12  10-15  11-15  16-23
23-14  18-14  17-13  20-11  22-17  28-24  27-23  28-24  20-11  27-24  26-19
 9-18   9-18   9-14   8-15  10-14  10-14  16-20  10-14   7-16   7-10   8-11
22-15  23-14  23-18  22-17  17-10  23-18  23-19  23-18  21-17  23-18  22-17
11-18  10-17  14-23   7-11   6-15  14-23  20-27  14-23   6-10  14-23  10-14
21-17  21-14  27-18  17-10  28-24  27-18  31-24  27-18  23-19  26-19  17-10


Figure 6. Square designations used in reporting games. [Board diagram; sides labeled White and Black.]

Figure 7. Eight-move opening utilizing generalization learning. (See Appendix B, Game G-43.)


Typical Games

Sample games in which Alpha played White against forced Beta openings.

G- 1 - 12 16 24 19 8 12

22 18 10 14 20 22 16 20 30 26 11 1 G 28 24 7 11

22 17 3 8

17 10 G 15 22

26 17 9 13

17 14 2 7

23 18 16 23 14 10 7 14

18 9 5 14

27 18 9 20 27 31 24 12 16 21 17 13 22 25 18

1 5 9 6 5 9 8 1

G-18 G-SO G-4 0 - -- __ 12 10 12 1G 10 14 24 20 24 20 24 20 8 12 8 12 11 15

28 24 28 24 27 24 10 15 10 14 7 10 22 18 22 18 23 18 15 22 6 10 14 23 25 18 24 19 2G 19 7 10 1 6 10 14

18 14 32 28 19 10 10 17 3 8 6 15 21 14 2G 22 22 17 9 18 9 13 2 7

23 14 18 9 17 10 6 9 5 1 4 7 14

30 26 22 18 24 19 9 18 6 9 15 24

2G 23 25 22 28 19 3 8 2 6 1 4 1 7

23 14 30 25 21 14 1 0 14 17 9 18

27 23 21 14 5 25 22 G 9 6 9 1 8 2 5

14 10 18 15 29 22 $1 13 11 18 5 9

25 21 20 11 2 31 27 11 15 10 14 1 5 20 11 22 15 20 16 15 18 14 17 3 7 23 14 5 1 22 17 8 15 17 21 8 11

24 19 25 22 17 13

32 28 22 18 13 G 24 27 25 30 7 10 31 24 2 6 6 1

15 24 21 25 I1 20

G-1 G-18

9 13 12 16 1 6 24 20

13 17 10 19 32 27 29 25 16 20 13 17 18 14 10 7 11 15 2 11

G 10 14 10 15 18 19 23 14 9 21 14

terminated 23 26 manually 10 7

26 30 25 21 30 2G

7 3 11 15 14 10 5 9

10 G 15 19

6 1 26 22

1 0 !I 1 3

20 16 19 23 6 9

23 27 1 G 11 22 25 11 7 25 30

7 2 27 32

- - __ G-90

9 14 18 0 8 11

15 8 4 11

19 15 11 18 23 14 13 17 9 5

12 10 28 24 17 22 6 10

30 25 1 6

25 21 5 1

21 17 24 20 18 19 20 16 17 13 0 2

13 17 10 0 Beta concedes

G-40 - 4 8 1 G

10 14 0 10

14 17 10 15 17 21 32 28 5 9

27 24 20 27 10 16 12 19 15 22 31 9 14

31 20 14 18 28 24 8 11

24 10 21 26 30 21

Beta concedes

Appendix C. Evaluation Polynomial Details for Second Series

Method of Computing Terms

The 16 terms called for in the evaluation polynomial are computed, individually, by taking the value of the appropriate parameter, as defined below, for the board position under consideration and subtracting the value of this same parameter computed for the board position just prior to the last move (with the necessary reversal in the definitions of active and passive sides). This difference is then multiplied by the corresponding program-computed coefficient, which can vary between -2^18 and +2^18, and credited to the side which was passive on the board position under consideration.
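The computation of a single term can be sketched in Python as follows. The `Position` class, the function names and the sample coefficient of 2^4 are assumptions made for the example, and the parameter shown is a simplified stand-in for CENT that counts pieces rather than men only. The point illustrated is the order of operations: evaluate the parameter on the current position, evaluate it on the previous position with the roles of the two sides reversed, take the difference and scale it.

```python
from dataclasses import dataclass

CENTER = frozenset({11, 12, 15, 16, 20, 21, 24, 25})

@dataclass(frozen=True)
class Position:
    active: frozenset    # squares held by the side whose turn it is to move
    passive: frozenset   # squares held by the side that has just moved

    def reversed_roles(self):
        """Same board, with the definitions of active and passive exchanged."""
        return Position(active=self.passive, passive=self.active)

def cent(position):
    """Simplified center-control parameter: passive pieces on center squares."""
    return len(position.passive & CENTER)

def term(parameter, current, previous, coefficient):
    """Change in a parameter across the last move, times its coefficient."""
    change = parameter(current) - parameter(previous.reversed_roles())
    return coefficient * change

def evaluate(weighted_parameters, current, previous):
    """Sum of terms, credited to the side that is passive in `current`."""
    return sum(term(p, current, previous, c) for p, c in weighted_parameters)

# Black (to move in `previous`) plays 10-15 and so becomes passive in `current`.
previous = Position(active=frozenset({7, 10}), passive=frozenset({22}))
current = Position(active=frozenset({22}), passive=frozenset({7, 15}))
value = evaluate([(cent, 2 ** 4)], current, previous)
```

The role reversal matters because the side that is passive now was the active side one move earlier; without it the two parameter values would describe different players.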

Definitions of Parameters

ADV (Advancement). The parameter is credited with 1 for each passive man in the 5th and 6th rows (counting in passive’s direction) and debited with 1 for each passive man in the 3rd and 4th rows.
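As an illustration of how such a parameter might be computed, here is a minimal Python sketch of ADV using the standard 1 to 32 square numbering. It assumes that Black starts on squares 1 to 12, so that Black's first row is squares 1 to 4 and White's first row is squares 29 to 32; this is consistent with the back-row squares quoted in the BACK and OREO definitions but is otherwise an assumption, as are the function names.

```python
def row_from_own_side(square, color):
    """Row number (1 to 8) of a square, counted from the given side's back row."""
    if color == "black":
        return (square - 1) // 4 + 1
    return (32 - square) // 4 + 1

def adv(passive_men, passive_color):
    """ADV: +1 per passive man in rows 5 and 6, -1 per passive man in rows 3 and 4."""
    score = 0
    for square in passive_men:
        row = row_from_own_side(square, passive_color)
        if row in (5, 6):
            score += 1
        elif row in (3, 4):
            score -= 1
    return score
```

With Black passive and men on squares 1, 9, 18 and 22, the men on 18 and 22 each add 1, the man on 9 subtracts 1 and the man on 1 does not count, so `adv({1, 9, 18, 22}, "black")` is 1.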

APEX (Apex). The parameter is debited with 1 if there are no kings on the board, if either square 7 or 26 is occupied by an active man, and if neither of these squares is occupied by a passive man.

BACK (Back Row Bridge). The parameter is credited with 1 if there are no active kings on the board and if the two bridge squares (1 and 3, or 30 and 32) in the back row are occupied by passive pieces.

CENT (Center Control I). The parameter is credited with 1 for each of the following squares: 11, 12, 15, 16, 20, 21, 24 and 25 which is occupied by a passive man.

CNTR (Center Control II). The parameter is credited with 1 for each of the following squares: 11, 12, 15, 16, 20, 21, 24 and 25 that is either currently occupied by an active piece or to which an active piece can move.

CORN (Double-corner Credit). The parameter is credited with 1 if the material credit value for the active side is 6 or less, if the passive side is ahead in material credit, and if the active side can move into one of the double-corner squares.

CRAMP (Cramp). The parameter is credited with 2 if the passive side occupies the cramping square (13 for Black, and 20 for White) and at least one other nearby square (9 or 14 for Black, and 19 or 24 for White), while certain squares (17, 21, 22 and 25 for Black, and 8, 11, 12 and 16 for White) are all occupied by the active side.

DENY (Denial of Occupancy). The parameter is credited with 1 for each square defined in MOB if on the next move a piece occupying this square could be captured without an exchange.

DIA (Double Diagonal File). The parameter is credited with 1 for each passive piece located in the diagonal files terminating in the double-corner squares.

DIAV (Diagonal Moment Value). The parameter is credited with 1/2 for each passive piece located on squares 2 removed from the double-corner diagonal files, with 1 for each passive piece located on squares 1 removed from the double-corner files and with 3/2 for each passive piece in the double-corner files.

DYKE (Dyke). The parameter is credited with 1 for each string of passive pieces that occupy three adjacent diagonal squares.

EXCH (Exchange). The parameter is credited with 1 for each square to which the active side may advance a piece and, in so doing, force an exchange.

EXPOS (Exposure). The parameter is credited with 1 for each passive piece that is flanked along one or the other diagonal by two empty squares.

FORK (Threat of Fork). The parameter is credited with 1 for each situation in which passive pieces occupy two adjacent squares in one row and in which there are three empty squares so disposed that the active side could, by occupying one of them, threaten a sure capture of one or the other of the two pieces.

GAP (Gap). The parameter is credited with 1 for each single empty square that separates two passive pieces along a diagonal, or that separates a passive piece from the edge of the board.

GUARD (Back-row Control). The parameter is credited with 1 if there are no active kings and if either the Bridge or the Triangle of Oreo is occupied by passive pieces.

HOLE (Hole). The parameter is credited with 1 for each empty square that is surrounded by three or more passive pieces.
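Several of these parameters (HOLE among them) depend on which squares are diagonal neighbors of a given square. The Python sketch below shows one way to derive neighbors from the standard 1 to 32 numbering and then count holes. The coordinate convention (square 1 in the first row, offset one column from the edge) and the function names are assumptions made for the example, and "surrounded by" is read here as "diagonally adjacent to".

```python
def neighbors(square):
    """Diagonally adjacent playable squares, in standard 1-32 numbering."""
    row, index = divmod(square - 1, 4)
    # Playable squares sit on odd columns in even rows and even columns in odd rows.
    col = 2 * index + (1 if row % 2 == 0 else 0)
    result = []
    for d_row in (-1, 1):
        for d_col in (-1, 1):
            r, c = row + d_row, col + d_col
            if 0 <= r < 8 and 0 <= c < 8:
                result.append(r * 4 + c // 2 + 1)
    return result

def hole(passive, active):
    """HOLE: number of empty squares with three or more passive neighbors."""
    occupied = passive | active
    count = 0
    for square in range(1, 33):
        if square in occupied:
            continue
        passive_neighbors = sum(1 for n in neighbors(square) if n in passive)
        if passive_neighbors >= 3:
            count += 1
    return count
```

With passive pieces on 11, 12 and 19 and an active piece on 28, the empty square 16 touches three passive pieces, and no other empty square touches more than two, so the count is 1.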

KCENT (King Center Control). The parameter is credited with 1 for each of the following squares: 11, 12, 15, 16, 20, 21, 24 and 25 which is occupied by a passive king.

MOB (Total Mobility). The parameter is credited with 1 for each square to which the active side could move one or more pieces in the normal fashion, disregarding the fact that jump moves may or may not be available.

MOBIL (Undenied Mobility). The parameter is credited with the difference between MOB and DENY.

MOVE (Move). The parameter is credited with 1 if pieces are even with a total piece count (2 for men, and 3 for kings) of less than 24, and if an odd number of pieces are in the move system, defined as those vertical files starting with squares 1, 2, 3 and 4.

NODE (Node). The parameter is credited with 1 for each passive piece that is surrounded by at least three empty squares.

OREO (Triangle of Oreo). The parameter is credited with 1 if there are no passive kings and if the Triangle of Oreo (squares 2, 3 and 7 for Black, and squares 26, 30 and 31 for White) is occupied by passive pieces.

POLE (Pole). The parameter is credited with 1 for each passive man that is completely surrounded by empty squares.

RECAP (Recapture). This parameter is identical with Exchange, as defined above. (It was introduced to test the effects produced by the random times at which parameters are introduced and deleted from the evaluation polynomial.)

THRET (Threat). The parameter is credited with 1 for each square to which an active piece may be moved and in so doing threaten the capture of a passive piece on a subsequent move.

Binary Connective Terms

The abbreviations used for the terms of this type which have been employed are listed below, in the order of A·B, A·B̄, Ā·B, and Ā·B̄, where A and B are the two respective parameters heading the sublists of abbreviations.

Denial of occupancy-     Undenied mobility-       Undenied mobility-
total mobility           denial of occupancy      center control I

DEMO                     MODE 1                   MOC 1
DEMMO                    MODE 2                   MOC 2
DDEMO                    MODE 3                   MOC 3
DDMM                     MODE 4                   MOC 4


Evaluation Polynomial (First 12 Terms Only) after 42 Games, during Which a Total of 1,039 Different Sets of Adjustments Were Made to the Terms and Their Coefficients¹³

Term      Correlation    Sign of        Power of 2 used    Times
          coefficient    coefficient    as coefficient     adjusted

MOC 2     0.45           -              18                 84
KCENT     0.40           +              16                 127
MOC 4     0.35           -              14                 95
MODE 3    0.33           -              13                 210
DEMMO     0.27           -              11                 132
MOVE      0.19           +              8                  91
ADV       0.19                          8                  739
MODE 2    0.19                          8                  65
BACK      0.14           -              6                  6
CNTR      0.13           +              5                  12
THRET     0.13           +              5                  442
MOC 3     0.10           +              4                  89

Discarded Terms during 42 Games¹³

Term      Times adjusted        Term      Times adjusted
          before discard                  before discard

CORN      0                     MODE 1    1
CRAMP     0                     CENT      386
GUARD     0                     MODE 4    0
EXPOS     162                   FORK      400
DDMM      19                    MOBIL     707
DYKE      115                   POLE      11
MOC 1     1                     HOLE      508
EXCH      445                   GAP       792
DDEMO     53                    MOB       608

¹³ An additional 20 games have recently ...

Appendix D. Game Played by Mr. R. W. Nealey and the Samuel Checker Program

In the summer of 1962, at the request of the editors of this collection, Dr. Samuel arranged a match between his checker-playing program (on an IBM 7090 computer) and a human checker champion.

Mr. Robert W. Nealey is described in the IBM Research News for August, 1962, as “a former Connecticut checkers champion, and one of the nation’s foremost players.”

The Samuel program bested Mr. Nealey in the game reprinted below. The annotations were made by Dr. Samuel. Mr. Nealey’s comments, as quoted by the IBM Research News, are as follows:

Our game ... did have its points. Up to the 31st move, all of our play had been previously published, except where I evaded “the book” several times in a vain effort to throw the computer’s timing off. At the 32-27 loser and onwards, all the play is original with us, so far as I have been able to find. It is very interesting to me to note that the computer had to make several star moves in order to get the win, and that I had several opportunities to draw otherwise. That is why I kept the game going. The machine, therefore, played a perfect ending without one misstep. In the matter of the end game, I have not had such competition from any human being since 1954, when I lost my last game.

Nealey (WHITE) vs. Samuel Checker Program (BLACK)

Date: July 12, 1962. Place: Yorktown, New York. Mr. Nealey was given the option and chose to defend. The Old Fourteenth opening was followed.

11-15
23-19
8-11
22-17
4-8
17-13      25-22 would restrict Black’s variety of play a little more.
15-18
24-20
9-14       Lee’s Old Fourteenth, Var. 9. 11-15 is the trunk move.
26-23      Doran’s Var. 100, listed as an even game.
10-15
19-10
6-15
28-24      Doran lists 23-19 as giving an easier game for White.
15-19      An aggressive move for Black.
24-15
5-9
13-6
1-19-26
31-22-15
11-18      Still in Lee’s Var. 9 and Doran’s Var. 100.
30-26      This is probably a poor move on Mr. Nealey’s part.
8-11
25-22
18-25
29-22
11-15
27-23
15-19
23-16
12-19
32-27      White makes a losing move.
19-24      The obvious reply, guaranteeing Black a king.
27-23
24-27
22-18
27-31      Black now has his king.
18-9
31-22      A good reply maintaining control of the center.
9-5
22-26      A delaying move to force White to advance.
23-19
26-22
19-16
22-18
21-17
18-23
17-13
2-6        Le coup de maître. A Black win is now certain.
16-11
7-16
20-11
23-19      Le coup mortel.

White concedes.

Location of Black pieces: 3, 6, 19K. Location of White pieces: 5, 11, 13.
