
The Vigenere Cipher (Draft)

Can Li, Jun Ma, Jun Tao, Melissa Keranen, Jean Mayo, Ching-Kuang Shene and Chaoli Wang

Department of Computer Science

Michigan Technological University

Houghton, Michigan

Version 0.1 (March 10, 2014)

The Vigenere cipher first appeared in the 1586 book Traicte des Chiffres (A Treatise on Secret Writing) by Blaise de Vigenere (Figure 1). However, Giovan Batista Belaso discussed a similar technique in his 1553 booklet La cifra del. Sig. Giovan Batista Belaso [5, page 137]. Singh [11, pp. 45–51, Chapter 2] has a short and interesting discussion about Vigenere, which is quoted below, and Kahn [5, Chapter 4] has a longer and more detailed exposition. On the other hand, the book of Vigenere did present an auto-key system, which is perhaps his major contribution to cryptography in addition to the Vigenere cipher. This document will not discuss the auto-key system.

Vigenere became acquainted with the writings of Alberti, Trithemius and Porta when, at the age of twenty-six, he was sent to Rome on a two year diplomatic mission. To start with, his interest in cryptography was purely practical and was linked to his diplomatic work. Then, at the age of thirty-nine, Vigenere decided that he had accumulated enough money for him to be able to abandon his career and concentrate on a life of study. It was only then that he examined in detail the ideas of Alberti, Trithemius, and Porta, weaving them into a coherent and powerful new cipher [11, page 46]. . . . Although Alberti, Trithemius and Porta all made vital contributions, the cipher is known as the Vigenere cipher in honour of the man who developed it into its final form. The strength of the Vigenere cipher lies in its using not one, but 26 distinct cipher alphabets to encode a message [11, page 48]. . . . To unscramble the message, the intended receiver needs to know which row of the Vigenere square has been used to encipher each letter, so there must be an agreed system of switching between rows. This is achieved by using a keyword [11, page 49]. . . . Vigenere’s work culminated in his Traicte des Chiffres, published in 1586. Ironically, this was the same year that Thomas Phelippes was breaking the cipher of Mary Queen of Scots. If only Mary’s secretary had read this treatise, he would have known about the Vigenere cipher, Mary’s messages to Babington would have baffled Phelippes, and her life might have been spared [11, page 51].

This document focuses on the basics of the Vigenere cipher. Section 1 explains the cipher and the encryption and decryption processes. The Vigenere cipher is simple and easy to understand and implement. However, for nearly three centuries the Vigenere cipher remained unbroken, until Friedrich W. Kasiski published his 1863 book. Note that Charles Babbage also used a similar technique and successfully broke the Vigenere cipher in 1846, but he did not publish his work. Section 2 and Section 3 discuss the two well-known attacks on the Vigenere cipher. Section 2 focuses on Kasiski’s work and Section 3 presents the Index of Coincidence (IOC, IoC or IC) method proposed in 1922 by William F. Friedman. Both methods try to estimate the length of the unknown keyword (Section 4). Once a possible length of the unknown keyword is found, the χ² method is used to recover the keyword (Section 5). Since the estimation of keyword length may not be correct, a number of iterations may be needed. Hence, to break a ciphertext encrypted with the Vigenere cipher, one usually follows an iterative procedure as shown below. Section 6 provides several complete examples. Finally, Section 7 has some concluding remarks.

Figure 1: Blaise de Vigenere

while (the decryption is not satisfactory) do
    estimate a new keyword length;
    recover the keyword using the estimated length;
    decrypt the ciphertext using the recovered keyword;
end

1 The Vigenere Cipher

1.1 The Cipher

The Vigenere cipher uses a 26×26 table with A to Z as the row heading and column heading (Figure 2). This table is usually referred to as the Vigenere Tableau, Vigenere Table or Vigenere Square. We shall use Vigenere table throughout this document. The first row of this table has the 26 English letters. Starting with the second row, each row has the letters shifted to the left one position in a cyclic way. For example, when B (i.e., the second letter in the English alphabet) is shifted to the first position on the second row, the letter A moves to the end (i.e., the 26-th position).

Figure 2: The Vigenere Table
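The construction of the table can be sketched in a few lines of Python. This is only an illustration under our own naming: `ALPHABET` and `build_vigenere_table` are hypothetical names that do not appear in the original text.

```python
import string

ALPHABET = string.ascii_uppercase  # the 26 letters 'A'..'Z'


def build_vigenere_table():
    """Return the 26 rows of the table as strings.

    Row r is the alphabet shifted cyclically to the left by r positions,
    so row 0 is the alphabet itself and row 1 starts with B and ends with A.
    """
    return [ALPHABET[r:] + ALPHABET[:r] for r in range(26)]


table = build_vigenere_table()
print(table[0])  # first row: the alphabet
print(table[1])  # second row: shifted left by one position
```

Row `r` of the list corresponds to the row whose heading is the `r`-th letter of the alphabet (counting from 0).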

In addition to the plaintext, the Vigenere cipher also requires a keyword, which is repeated so that the length is equal to that of the plaintext. For example, suppose the plaintext is MICHIGAN TECHNOLOGICAL UNIVERSITY and the keyword is HOUGHTON. Then, the keyword must be repeated as follows:

MICHIGAN TECHNOLOGICAL UNIVERSITY

HOUGHTON HOUGHTONHOUGH TONHOUGHTO

In this document, we follow the tradition by removing all spaces and punctuation, converting all letters to upper case, and dividing the result into 5-letter blocks. As a result, the above plaintext and keyword become the following:

MICHI GANTE CHNOL OGICA LUNIV ERSIT Y

HOUGH TONHO UGHTO NHOUG HTONH OUGHT O
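The preprocessing just described (strip everything but letters, convert to upper case, repeat the keyword, and print in 5-letter blocks) might be sketched as follows. The helper names `normalize`, `repeat_keyword` and `blocks` are our own assumptions for illustration.

```python
from itertools import cycle


def normalize(text):
    """Keep only the letters A-Z, converted to upper case."""
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")


def repeat_keyword(keyword, length):
    """Repeat the keyword cyclically until it has the given length."""
    key = normalize(keyword)
    return "".join(k for k, _ in zip(cycle(key), range(length)))


def blocks(text, size=5):
    """Split text into space-separated blocks of `size` letters."""
    return " ".join(text[i:i + size] for i in range(0, len(text), size))


plaintext = normalize("MICHIGAN TECHNOLOGICAL UNIVERSITY")
keyword = repeat_keyword("HOUGHTON", len(plaintext))
print(blocks(plaintext))
print(blocks(keyword))
```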

To encrypt, pick a letter in the plaintext and its corresponding letter in the keyword, and use the keyword letter and the plaintext letter as the row index and column index, respectively. The entry at the row-column intersection is the letter in the ciphertext. For example, the first letter in the plaintext is M and its corresponding keyword letter is H. This means that the row of H and the column of M are used, and the entry T at the intersection is the encrypted result (Figure 3(a)). Similarly, since letter N in MICHIGAN corresponds to the letter N in the keyword, the entry at the intersection of row N and column N is A, which is the encrypted letter in the ciphertext (Figure 3(b)). Repeating this process until all plaintext letters are processed, the ciphertext is TWWNPZOA ASWNUHZBNWWGS NBVCSLYPMM. The following has the plaintext, repeated keyword and ciphertext aligned together.

MICHI GANTE CHNOL OGICA LUNIV ERSIT Y

HOUGH TONHO UGHTO NHOUG HTONH OUGHT O

TWWNP ZOAAS WNUHZ BNWWG SNBVC SLYPM M

(a) (b)

Figure 3: The Vigenere Table Examples

To decrypt, pick a letter in the ciphertext and its corresponding letter in the keyword, use the keyword letter to find the corresponding row, and the letter heading of the column that contains the ciphertext letter is the needed plaintext letter. Therefore, this is the reversed procedure of the encryption process. For example, to decrypt the first letter T in the ciphertext, we find the corresponding letter H in the keyword. Then, the row of H is used to find the corresponding letter T and the column that contains T provides the plaintext letter M (Figure 3(a)). Consider the fifth letter P in the ciphertext. This letter corresponds to the keyword letter H and row H is used to find P. Since P is on column I, the corresponding plaintext letter is I.
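The two table lookups (row and column for encryption, row and search for decryption) can be mimicked directly. The sketch below assumes a dictionary `TABLE` keyed by the row heading; the names `TABLE`, `encrypt_letter` and `decrypt_letter` are hypothetical.

```python
import string

ALPHABET = string.ascii_uppercase

# TABLE[k] is the row of the Vigenere table whose heading is the letter k.
TABLE = {k: ALPHABET[i:] + ALPHABET[:i] for i, k in enumerate(ALPHABET)}


def encrypt_letter(p, k):
    """Entry at the intersection of row k and column p."""
    return TABLE[k][ALPHABET.index(p)]


def decrypt_letter(c, k):
    """Heading of the column in which c appears on row k."""
    return ALPHABET[TABLE[k].index(c)]


print(encrypt_letter("M", "H"))  # row H, column M
print(decrypt_letter("P", "H"))  # find P on row H, read the column heading
```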

Example 1 Let us encrypt the first word MICHIGAN with the keyword HOUGHTON using the Vigenere table.

MICHI GAN

HOUGH TON

Table 1 shows a step-by-step encryption. In this table, each pair of rows has the plaintext letter on the first column, the corresponding keyword letter on the second column, and the alphabet (top) and the row corresponding to the keyword letter (bottom). The plaintext letter and the corresponding ciphertext letters are underlined.

From this table, we have the following result:

MICHI GAN

HOUGH TON

TWWNP ZOA

Table 1: Vigenere Table Example

Letter  Keyword Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
M       Row H           H I J K L M N O P Q R S T U V W X Y Z A B C D E F G

Letter  Keyword Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
I       Row O           O P Q R S T U V W X Y Z A B C D E F G H I J K L M N

Letter  Keyword Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
C       Row U           U V W X Y Z A B C D E F G H I J K L M N O P Q R S T

Letter  Keyword Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
H       Row G           G H I J K L M N O P Q R S T U V W X Y Z A B C D E F

Letter  Keyword Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
I       Row H           H I J K L M N O P Q R S T U V W X Y Z A B C D E F G

Letter  Keyword Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
G       Row T           T U V W X Y Z A B C D E F G H I J K L M N O P Q R S

Letter  Keyword Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A       Row O           O P Q R S T U V W X Y Z A B C D E F G H I J K L M N

Letter  Keyword Letter  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
N       Row N           N O P Q R S T U V W X Y Z A B C D E F G H I J K L M

Since the keyword is repeated and a keyword letter is used multiple times, we may break the plaintext into rows, each of which has all the plaintext letters that are encrypted with the same keyword letter. Each row is referred to as a coset (Section 4.1). Table 2 is an example. The first column of this table has the keyword letters. The plaintext is organized column by column (i.e., writing down the plaintext top-down). In this way, once we use row H all plaintext letters that should be encrypted by H can be encrypted quickly. The number of row movements (i.e., moving from row to row in encryption and decryption) is then the length of the keyword. This is more convenient than the original table lookup process, in which the number of row movements is equal to the length of the plaintext.

Table 2: Vigenere Table Example

Keyword Letter  Plaintext Coset  Ciphertext Coset
H               M T G V          T A N C
O               I E I E          W S W S
U               C C C R          W W W L
G               H H A S          N N G Y
H               I N L I          P U S P
T               G O U T          Z H N M
O               A L N Y          O Z B M
N               N O I            A B V

1.2 Other Cipher Devices

Since the Vigenere table is large and not very convenient, two portable devices were developed to make encryption and decryption easier. The first device, the cipher disk, was invented by Leon Battista Alberti (1404–1472). This cipher disk has two concentric circles, with the large bottom one fixed and the small top one rotatable (Figure 4(a)). The 26 English letters are shown along the perimeter of each disk. One can rotate the top disk to align any letter with the letter A on the bottom disk. The plaintext and ciphertext use the letters on the bottom and top disks, respectively.

The use of the cipher disk is very simple. Rotate the top disk so that the keyword letter being used aligns with the letter A on the bottom disk, and the corresponding plaintext and ciphertext letters are on the bottom and top disks, respectively. This alignment procedure is equivalent to shifting the rows down. For example, if two As are aligned together, we are using row A; if the B is aligned with A, we are using row B; and if the C is aligned with A, we are using row C. Therefore, the cipher disk uses a rotatable disk to replace a large table, and is more convenient.

Consider the 10-th letter E in the plaintext and the corresponding keyword letter O. Rotate the top disk until O aligns with A (Figure 4(b)) and, consequently, the plaintext letter E on the bottom disk aligns with the letter S on the top disk. Therefore, E is encrypted to S. Conversely, the 10-th ciphertext letter S corresponds to the keyword letter O. Rotate the top disk until O aligns with A, and, as a result, the ciphertext letter S on the top disk aligns with the letter E on the bottom one. Hence, the decrypted letter of S is E.

(a) (b)

Figure 4: The Cipher Disk

Another simple device is usually referred to as the Saint Cyr Slide or just simply slide. The top row of this slide is fixed and has the 26 English letters (Figure 5(a)). The bottom row can be slid left and right and has two sets of the 26 letters (i.e., repeating the 26 letters twice).¹ Its use is identical to the cipher disk with the top fixed portion for the plaintext letters and the bottom movable portion for the ciphertext letters. We just slide the bottom portion so that the keyword letter aligns with the letter A and the top and bottom rows provide the corresponding plaintext and ciphertext letters.

Consider the 5-th letter I in our example. Its corresponding keyword letter is H. Slide the bottom portion so that H aligns with the A of the fixed portion and the plaintext letter I corresponds to the letter P (Figure 5(b)). Therefore, I is encrypted by H to P. The 11-th ciphertext letter is W and the corresponding keyword letter is U. To decrypt, slide the bottom portion so that U aligns with A of the top portion (Figure 5(c)). The ciphertext letter W in the bottom corresponds to C in the top, and W is decrypted with U to C.

(a) (b) (c)

Figure 5: Saint Cyr Slide

¹ In fact, the set only needs 25 letters from A to Y. Why?

1.3 The Algebraic Nature of the Vigenere Cipher

In Section 1.1, we mentioned that the alphabet is shifted to the left one position repeatedly to build the 26 × 26 Vigenere table. This is equivalent to shifting the alphabet (i.e., the row heading of the Vigenere table) to the right one position at a time. For example, the row of B is obtained by shifting the row of A to the left one position (Figure 2). This is equivalent to shifting the alphabet to the right one position. For the row of B, A is shifted to B and B is shifted to C and, hence, A is encrypted to B and B is encrypted to C. Similarly, for the row of D which is three positions from A, A is shifted three positions to D, B is shifted three positions to E, and C is shifted three positions to F. Therefore, A, B and C are encrypted to D, E and F by shifting to the right three positions. In general, if a plaintext letter P is encrypted by a keyword letter K that is d positions from A, P is encrypted by K to the letter C that is d positions to the right of P. We have to take cyclic shifting into consideration. If the letters A, B, C, . . ., Z are assigned the values of 0, 1, 2, . . ., 25, each keyword letter is simply the distance from that letter to A. As a result, the ciphertext letter C is obtained as follows, where “mod” denotes the modulo operation:

C = (P + d) mod 26
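As a quick check of this formula against the table lookups of Section 1.1, take A = 0, B = 1, . . ., Z = 25, so that M = 12, H = 7 and N = 13:

```latex
\begin{align*}
\text{M encrypted by H:}\quad & (12 + 7) \bmod 26 = 19 \;\Rightarrow\; \text{T} \\
\text{N encrypted by N:}\quad & (13 + 13) \bmod 26 = 0 \;\Rightarrow\; \text{A}
\end{align*}
```

The second line illustrates the cyclic shift: the sum 26 wraps around to 0, i.e., back to the letter A.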

To sum it up, if the keyword is repeated enough times so that its length is equal to the length of the plaintext, then for plaintext $p_1 p_2 p_3 \cdots p_n$, keyword $k_1 k_2 \cdots k_n$ and ciphertext $c_1 c_2 \cdots c_n$, we have

$c_i = (p_i + k_i) \bmod 26$

Decryption is the reversed procedure by shifting the ciphertext to the left. Since shifting to the left is a subtraction, the decryption procedure is simply:

$p_i = (c_i - k_i) \bmod 26$

With this in mind, it is very easy to program a Vigenere cipher.²

Encryption
for i := 1 to n do
    c_i = (p_i + k_i) mod 26

Decryption
for i := 1 to n do
    p_i = (c_i − k_i) mod 26
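The two loops above translate almost directly into Python. The following is a minimal sketch, assuming the text and keyword have already been reduced to upper-case letters; the function name `vigenere` and its `decrypt` flag are our own choices. Letter values are obtained by indexing into an alphabet string rather than by character-code arithmetic.

```python
import string

ALPHABET = string.ascii_uppercase


def vigenere(text, keyword, decrypt=False):
    """Encrypt (or decrypt) upper-case text with the given keyword.

    Implements c_i = (p_i + k_i) mod 26, or p_i = (c_i - k_i) mod 26
    when decrypt is True. The keyword is repeated cyclically.
    """
    sign = -1 if decrypt else 1
    out = []
    for i, ch in enumerate(text):
        value = ALPHABET.index(ch)
        shift = ALPHABET.index(keyword[i % len(keyword)])
        # Python's % always returns a value in 0..25 here, even for negatives.
        out.append(ALPHABET[(value + sign * shift) % 26])
    return "".join(out)


ciphertext = vigenere("MICHIGANTECHNOLOGICALUNIVERSITY", "HOUGHTON")
print(ciphertext)
print(vigenere(ciphertext, "HOUGHTON", decrypt=True))
```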

2 Kasiski Test

Friedrich W. Kasiski, a German military officer (actually a major), published his book Die Geheimschriften und die Dechiffrirkunst (Cryptography and the Art of Decryption) in 1863 [6]. Figure 6(a) has the cover of Kasiski’s book. This book of slightly more than 100 pages is the first published work detailing Kasiski’s method to break the Vigenere cipher, although Charles Babbage used the same technique as early as 1846 but never published it.

Kasiski suggested that one may look for repeated fragments in the ciphertext and compile a list of the distances that separate the repetitions. Then, the keyword length is likely to divide many of these distances. More precisely, Kasiski observed the following [6, 7]:

1. If a repeated substring in a plaintext is encrypted by the same substring in the keyword, then the ciphertext contains a repeated substring and the distance of the two occurrences is a multiple of the keyword length.

2. Not every repeated string in the ciphertext arises in this way; but, the probability of a repetition by chance is noticeably smaller. See [9] for a simple and interesting discussion.

Consider the following example encrypted by the keyword ION. The substring BVR in the ciphertext repeats three times. The first two are encrypted from THE by ION. Since the keyword ION is shifted to the right repeatedly, the distance between the B in the first occurrence of BVR and the second is a multiple of the keyword length 3. The second and the third occurrences of BVR tell a different story. They are encrypted from THE and NIJ using different portions of the keyword (i.e., ION and ONI) and the distance between the two B’s in the second and third BVR may not be a multiple of the keyword length. Therefore, even if we find repeated substrings, the distance between them may or may not be a multiple of the length of the keyword and the repetitions may just be purely by chance.

Plaintext ......THE................THE.....................NIJ...........

Keyword ......ION................ION....................IONI...........

Ciphertext ......BVR................BVR.....................BVR...........

² In the ASCII code, letters A to Z are consecutive and K-’A’ is the distance from A to the letter K. However, the 26 letters in the EBCDIC code are not consecutive. Therefore, it would be better to save the letters in an array of 26 elements and shift the array index rather than using K-’A’.


(a) Kasiski’s Book (b) Friedman’s Booklet

Figure 6: The Two Landmark Publications

A long ciphertext is more likely to contain repeated substrings, whereas a short plaintext encrypted with a relatively long keyword may produce a ciphertext in which no repetition can be found. Additionally, long repeated substrings in a ciphertext are not likely to be by chance, whereas short repeated substrings may appear more often and some of them may be purely by chance. The following example shows the encryption of Michigan Technological University with keyword boy. There is no repeated substring of length at least 2! Of course, Kasiski’s method fails.

MICHI GANTE CHNOL OGICA LUNIV ERSIT Y

BOYBO YBOYB OYBOY BOYBO YBOYB OYBOY B

NWAIW EBBRF QFOCJ PUGDO JVBGW SPTWR Z

Consider a longer plaintext. The following is a quote from Charles Antony Richard Hoare (Tony Hoare or C. A. R. Hoare), the 1980 ACM Turing Award winner, on software design:

There are two ways of constructing a software design:

One way is to make it so simple that there are obviously

no deficiencies, and the other way is to make it so complicated

that there are no obvious deficiencies.

The first method is far more difficult.

After removing spaces and punctuation and converting to upper case, we have the following:

THERE ARETW OWAYS OFCON STRUC TINGA SOFTW AREDE SIGNO NEWAY

ISTOM AKEIT SOSIM PLETH ATTHE REARE OBVIO USLYN ODEFI CIENC

IESAN DTHEO THERW AYIST OMAKE ITSOC OMPLI CATED THATT HEREA

RENOO BVIOU SDEFI CIENC IESTH EFIRS TMETH ODISF ARMOR EDIFF

ICULT

Then, the above is encrypted with the 6-letter keyword SYSTEM as follows:

LFWKI MJCLP SISWK HJOGL KMVGU RAGKM KMXMA MJCVX WUYLG GIISW

ALXAE YCXMF KMKBQ BDCLA EFLFW KIMJC GUZUG SKECZ GBWYM OACFV

MQKYF WXTWM LAIDO YQBWF GKSDI ULQGV SYHJA VEFWB LAEFL FWKIM

JCFHS NNGGN WPWDA VMQFA AXWFZ CXBVE LKWML AVGKY EDEMJ XHUXD

AVYXL

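The following table has the plaintext, keyword and ciphertext aligned together. The repeated substrings of interest have length 8. These are the longest repeated substrings of length less than 10 in the ciphertext. The plaintext string THEREARE appears three times at positions 0, 72 and 144. The distance between two consecutive occurrences is 72. The repeated keyword portion and ciphertext are SYSTEMSY and LFWKIMJC, respectively. Therefore, these three occurrences are not by chance and 72 is a multiple of the keyword length 6.

THERE ARETW OWAYS OFCON STRUC TINGA SOFTW AREDE SIGNO NEWAY
SYSTE MSYST EMSYS TEMSY STEMS YSTEM SYSTE MSYST EMSYS TEMSY
LFWKI MJCLP SISWK HJOGL KMVGU RAGKM KMXMA MJCVX WUYLG GIISW

ISTOM AKEIT SOSIM PLETH ATTHE REARE OBVIO USLYN ODEFI CIENC
STEMS YSTEM SYSTE MSYST EMSYS TEMSY STEMS YSTEM SYSTE MSYST
ALXAE YCXMF KMKBQ BDCLA EFLFW KIMJC GUZUG SKECZ GBWYM OACFV

IESAN DTHEO THERW AYIST OMAKE ITSOC OMPLI CATED THATT HEREA
EMSYS TEMSY STEMS YSTEM SYSTE MSYST EMSYS TEMSY STEMS YSTEM
MQKYF WXTWM LAIDO YQBWF GKSDI ULQGV SYHJA VEFWB LAEFL FWKIM

RENOO BVIOU SDEFI CIENC IESTH EFIRS TMETH ODISF ARMOR EDIFF
SYSTE MSYST EMSYS TEMSY STEMS YSTEM SYSTE MSYST EMSYS TEMSY
JCFHS NNGGN WPWDA VMQFA AXWFZ CXBVE LKWML AVGKY EDEMJ XHUXD

ICULT
STEMS
AVYXL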

The next longest repeating substring WMLA in the ciphertext has length 4 and occurs at positions 108 and 182. The distance between these two positions is 74. At position 108, plaintext EOTH is encrypted to WMLA using SYST. At position 182, plaintext ETHO is encrypted to WMLA using STEM. In this case, even though we find the repeating substring WMLA, the two occurrences are not encrypted by the same portion of the keyword and they come from different plaintext sections. As a result, this repetition is purely by chance and the distance 74 is unlikely to be a multiple of the keyword length.

IESAN DTHEO THERW AYIST OMAKE ITSOC OMPLI CATED THATT HEREA
EMSYS TEMSY STEMS YSTEM SYSTE MSYST EMSYS TEMSY STEMS YSTEM
MQKYF WXTWM LAIDO YQBWF GKSDI ULQGV SYHJA VEFWB LAEFL FWKIM

RENOO BVIOU SDEFI CIENC IESTH EFIRS TMETH ODISF ARMOR EDIFF
SYSTE MSYST EMSYS TEMSY STEMS YSTEM SYSTE MSYST EMSYS TEMSY
JCFHS NNGGN WPWDA VMQFA AXWFZ CXBVE LKWML AVGKY EDEMJ XHUXD

ICULT
STEMS
AVYXL

There are five repeating substrings of length 3. They are MJC at positions 5 and 35 with a distance of 30, ISW at positions 11 and 47 with a distance of 36, KMK at positions 28 and 60 with a distance of 32, VMQ at positions 99 and 165 with a distance of 66, and DAV at positions 163 and 199 with a distance of 36. The following table is a summary. Note that the repeating ciphertext KMK is encrypted from two plaintext sections GAS and SOS with keyword portions of EMS and SYS, respectively. Therefore, this repetition is purely by chance.

Positions    5   35    11  47    28  60    99  165   163 199
Distance       30        36        32        66        36
Plaintext    ARE ARE   WAY WAY   GAS SOS   CIE CIE   FIC FIC
Keyword      MSY MSY   MSY MSY   EMS SYS   TEM TEM   YST YST
Ciphertext   MJC MJC   ISW ISW   KMK KMK   VMQ VMQ   DAV DAV
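The search for repeated substrings and their distances is easy to automate. The sketch below is one possible approach under our own naming (`repeated_substrings` and `distances` are hypothetical helpers); it records the starting positions, counted from 0, of every substring of a given length and keeps those that occur more than once.

```python
from collections import defaultdict


def repeated_substrings(ciphertext, length):
    """Map each substring of the given length that occurs at least twice
    to the list of its starting positions (0-based, in increasing order)."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - length + 1):
        positions[ciphertext[i:i + length]].append(i)
    return {s: pos for s, pos in positions.items() if len(pos) > 1}


def distances(positions):
    """Distances between consecutive occurrences."""
    return [b - a for a, b in zip(positions, positions[1:])]


# The short BOY example contains no repeated substring of length 2.
print(repeated_substrings("NWAIWEBBRFQFOCJPUGDOJVBGWSPTWRZ", 2))
```

Running `repeated_substrings` for lengths 3, 4, 5, and so on over a long ciphertext yields the raw material for tables such as the one above. Note that a long repetition also produces shorter repetitions inside it, which one may want to filter out.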

The following table shows the distances and their factors. Since a distance may be a multiple of the keyword length, a factor of a distance may be the length of the keyword. If a match is by pure chance, the factors of this distance may not be factors of the keyword length. In general, a good choice is the largest one that appears most often. Note that longer repeating substrings may offer better choices because these matches are less likely to be by chance.

Length  Distance  Factors
8       72        2 3 4 6 8 9 12 18 24 36 72
4       74        2 37 74
3       66        2 3 6 11 22 33 66
        36        2 3 4 6 9 12 18 36
        32        2 4 8 16 32
        30        2 3 5 6 10 15 30

The following table shows the distances and all factors no higher than 20 (X marks a factor). The last row of the table has the total count of each factor. It is clear that factors 2, 3 and 6 occur most often with counts 6, 4 and 4, respectively. Since keyword length 2 is too short to be used effectively, lengths 3 and 6 are more reasonable. As a result, we may use 3 and 6 as the initial estimates to recover the keyword and decrypt the ciphertext (Section 5).

                               Factors
Distance   2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
74         X  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
72         X  X  X  -  X  -  X  X  -  -  X  -  -  -  -  -  X  -  -
66         X  X  -  -  X  -  -  -  -  X  -  -  -  -  -  -  -  -  -
36         X  X  X  -  X  -  -  X  -  -  X  -  -  -  -  -  X  -  -
32         X  -  X  -  -  -  X  -  -  -  -  -  -  -  X  -  -  -  -
30         X  X  -  X  X  -  -  -  X  -  -  -  -  X  -  -  -  -  -
Total      6  4  3  1  4  0  2  2  1  1  2  0  0  1  1  0  2  0  0

If we are convinced that some distances are likely not to be by chance, we may compute the greatest common divisor (GCD) of these distances and use it as a possible keyword length. As mentioned earlier, distances 74 and 32 are likely to be by chance and the remaining distances are 72, 66, 36 and 30. Their GCD is GCD(72, 66, 36, 30) = 6.³ Since we know the keyword SYSTEM, 6 is the correct length.
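Both the factor tally and the GCD are mechanical computations. A small sketch using only the Python standard library follows; `factor_counts` and `gcd_of` are hypothetical names, and the cutoff of 20 simply mirrors the table above.

```python
from collections import Counter
from functools import reduce
from math import gcd


def factor_counts(distances, max_factor=20):
    """Count, for each f in 2..max_factor, how many distances f divides."""
    counts = Counter()
    for d in distances:
        for f in range(2, max_factor + 1):
            if d % f == 0:
                counts[f] += 1
    return counts


def gcd_of(distances):
    """GCD of a list, computed pairwise: GCD(a, b, c) = GCD(GCD(a, b), c)."""
    return reduce(gcd, distances)


counts = factor_counts([74, 72, 66, 36, 32, 30])
print(counts.most_common(3))
print(gcd_of([72, 66, 36, 30]))
```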

If we only have a ciphertext in hand, we have to do some guess work.

Example 2 The following is Hoare’s quote discussed earlier but encrypted with a different keyword.

VVQGY TVVVK ALURW FHQAC MMVLE HUCAT WFHHI PLXHV UWSCI GINCM

UHNHQ RMSUI MHWZO DXTNA EKVVQ GYTVV QPHXI NWCAB ASYYM TKSZR

CXWRP RFWYH XYGFI PSBWK QAMZY BXJQQ ABJEM TCHQS NAEKV VQGYT

VVPCA QPBSL URQUC VMVPQ UTMML VHWDH NFIKJ CPXMY EIOCD TXBJW

KQGAN

A search reveals the following repeating substrings and distances:

Length  Substring     Positions  Distance  Factors
12      NAEKVVQGYTVV  68 140     72        2 3 4 6 8 9 12 18 24 36 72
8       VVQGYTVV      0 72       72        2 3 4 6 8 9 12 18 24 36 72
                      72 144     72        2 3 4 6 8 9 12 18 24 36 72
3       VVQ           0 78       78        2 3 6 13 26 39 78
        LUR           11 159     148       2 4 37 74 148
        WFH           14 30      16        2 4 8 16
        WKQ           118 199    81        3 9 27 81

³ Since GCD(a, b, c, d) = GCD(GCD(a, b), c, d), we have GCD(72, 66, 36, 30) = GCD(GCD(72, 66), 36, 30) = GCD(6, 36, 30) = GCD(GCD(6, 36), 30) = GCD(6, 30) = 6.


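The following table shows the distances and their factors no higher than 20 (X marks a factor). Apart from 2, the most common factors between 2 and 20 are 3, 4, 6, 8 and 9. They all appear to be reasonable and other methods (Section 3) may be needed to narrow down the choice.

                               Factors
Distance   2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
148        X  -  X  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
81         -  X  -  -  -  -  -  X  -  -  -  -  -  -  -  -  -  -  -
78         X  X  -  -  X  -  -  -  -  -  -  X  -  -  -  -  -  -  -
72         X  X  X  -  X  -  X  X  -  -  X  -  -  -  -  -  X  -  -
16         X  -  X  -  -  -  X  -  -  -  -  -  -  -  X  -  -  -  -
Total      4  3  3  0  2  0  2  2  0  0  1  1  0  0  1  0  1  0  0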

3 Index of Coincidence

Given a text string, the index of coincidence (IC or IOC) is the probability of two randomly selected letters being equal. The use of index of coincidence was first proposed by William F. Friedman in 1922 [3, 7]. According to historian David Kahn [5, page 176], “Riverbank Publication No. 22, written in 1920, when Friedman was 28, must be regarded as the most important single publication in cryptography. It took the science into a new world.”⁴ Figure 6(b) has the cover of Riverbank Publication No. 22 [3]. It was printed in France in 1922 to save cost.

Let the length of the text be $N$ and let the size of the alphabet be $n$. Consider the $i$-th letter $a_i$ in the alphabet. Suppose $a_i$ appears in the given text $F_i$ times. Since the number of $a_i$’s in the text is $F_i$, picking the first $a_i$ has $F_i$ different choices and picking the second $a_i$ has only $F_i - 1$ different choices because one $a_i$ has been selected. Since there are $N(N - 1)$ different ways of picking two characters from the text, the probability of having two $a_i$’s is

\[ \frac{F_i(F_i - 1)}{N(N - 1)} \]

Since the alphabet has $n$ different letters and the above applies to each of them, the probability of having two identical letters from the text is

\[ \sum_{i=1}^{n} \frac{F_i(F_i - 1)}{N(N - 1)} = \frac{1}{N(N - 1)} \sum_{i=1}^{n} F_i(F_i - 1) \]

Therefore, the index of coincidence is:

\[ IC = \frac{1}{N(N - 1)} \sum_{i=1}^{n} F_i(F_i - 1) \qquad (1) \]

Note that English has n=26 letters.
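Equation (1) can be evaluated with a short function. The following is a minimal sketch, assuming that only the letters A to Z count and that everything else is ignored; the name `index_of_coincidence` is ours.

```python
from collections import Counter


def index_of_coincidence(text):
    """Index of coincidence of the letters A-Z in text, per Equation (1)."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    n = len(letters)
    if n < 2:
        raise ValueError("at least two letters are needed")
    counts = Counter(letters)  # F_i for every letter that occurs
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))


# Two A's and two B's: (2*1 + 2*1) / (4*3) = 1/3
print(index_of_coincidence("AABB"))
```

Letters that do not occur contribute $F_i(F_i - 1) = 0$, so summing over the observed letters only gives the same result as summing over the whole alphabet.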

⁴ See Clark [1] for more about Friedman’s personal life.


Example 3 Consider the plaintext of Hoare’s quote discussed in Section 2 (page 8).





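THERE ARETW OWAYS OFCON STRUC TINGA SOFTW AREDE SIGNO NEWAY

ISTOM AKEIT SOSIM PLETH ATTHE REARE OBVIO USLYN ODEFI CIENC

IESAN DTHEO THERW AYIST OMAKE ITSOC OMPLI CATED THATT HEREA

RENOO BVIOU SDEFI CIENC IESTH EFIRS TMETH ODISF ARMOR EDIFF

ICULT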

The frequency count is as follows:

A   B   C   D   E   F   G   H   I   J   K   L   M
15  2   9   7   27  8   2   9   20  0   2   4   6

N   O   P   Q   R   S   T   U   V   W   X   Y   Z
9   19  2   0   12  15  22  4   2   5   0   4   0

The index of coincidence in Equation (1) is 0.068101. The most frequently occurring letters are E, T, I and O, followed by A and S, with 27, 22, 20, 19 and 15 occurrences, respectively.

Table 3 shows the frequencies of the 26 English letters. The five most frequently used letters are E (13.11%), T (10.47%), A (8.15%), O (8.00%) and N (7.10%). The five least frequently used letters are Z (0.08%), Q (0.12%), J (0.13%), X (0.17%) and K (0.42%). Note that this table is generated from a sample of long English texts. Different samples yield slightly different results.

Table 3: Frequency (%) of Letters in English Text

A      B      C      D      E      F      G      H      I      J      K      L      M
8.15   1.44   2.76   3.79   13.11  2.92   1.99   5.26   6.35   0.13   0.42   3.39   2.54

N      O      P      Q      R      S      T      U      V      W      X      Y      Z
7.10   8.00   1.98   0.12   6.83   6.10   10.47  2.46   0.92   1.54   0.17   1.98   0.08

Given the frequency values as shown in Table 3, it is not difficult to calculate the index of coincidence of English, $IC_{English}$. Suppose the text has length $N$ and the percentage of letter $a_i$ is $p_i$. More precisely, $p_1$ is the probability to have an A (i.e., $p_1 = 8.15\% = 0.0815$), $p_2$ is the probability to have a B (i.e., $p_2 = 1.44\% = 0.0144$), etc. The number of occurrences of the $i$-th letter is simply $F_i = p_i \times N$. Therefore, we have

\[ IC_{English} = \frac{1}{N(N - 1)} \sum_{i=1}^{n} F_i(F_i - 1) = \frac{1}{N(N - 1)} \sum_{i=1}^{n} (p_i N)(p_i N - 1) = \sum_{i=1}^{n} p_i \, \frac{p_i N - 1}{N - 1} \]

If $N$ is large enough, $(p_i N - 1)/(N - 1)$ is approximately $p_i$,⁵ and we have

\[ IC_{English} \approx \sum_{i=1}^{n} p_i^2 \qquad (2) \]

Plugging the values in Table 3 we have $IC_{English} = 0.0686$. Note that the IC computed in Example 3 is 0.068101, which is close to the $IC_{English}$ of English text.

What if the text is a randomly generated one? In this case, the frequency of each letter is approximately equal to $p_i = 1/n$, where $n$ is the size of the alphabet. From Equation (2), we have the index of coincidence for randomly generated text $IC_{Random} \approx n \times (1/n)^2 = 1/n$. Since English has 26 letters, $n = 26$ and $IC_{Random} \approx 1/26 = 0.038462$.
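Both reference values follow from Equation (2) by direct summation. The sketch below hard-codes the percentages of Table 3 in a dictionary that we call `ENGLISH_FREQ_PERCENT` (a hypothetical name); different frequency tables would give slightly different results.

```python
# Letter frequencies in percent, copied from Table 3.
ENGLISH_FREQ_PERCENT = {
    "A": 8.15, "B": 1.44, "C": 2.76, "D": 3.79, "E": 13.11, "F": 2.92,
    "G": 1.99, "H": 5.26, "I": 6.35, "J": 0.13, "K": 0.42, "L": 3.39,
    "M": 2.54, "N": 7.10, "O": 8.00, "P": 1.98, "Q": 0.12, "R": 6.83,
    "S": 6.10, "T": 10.47, "U": 2.46, "V": 0.92, "W": 1.54, "X": 0.17,
    "Y": 1.98, "Z": 0.08,
}

# Equation (2): sum of squared probabilities.
ic_english = sum((p / 100) ** 2 for p in ENGLISH_FREQ_PERCENT.values())

# Uniformly random text over 26 letters: 26 * (1/26)^2 = 1/26.
ic_random = 1 / 26

print(round(ic_english, 4), round(ic_random, 6))
```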

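Example 4 The following is the ciphertext of Hoare’s quote in Example 3 (page 14) encrypted with the Vigenere cipher and an unknown keyword:

VVQGY TVVVK ALURW FHQAC MMVLE HUCAT WFHHI PLXHV UWSCI GINCM

UHNHQ RMSUI MHWZO DXTNA EKVVQ GYTVV QPHXI NWCAB ASYYM TKSZR

CXWRP RFWYH XYGFI PSBWK QAMZY BXJQQ ABJEM TCHQS NAEKV VQGYT

VVPCA QPBSL URQUC VMVPQ UTMML VHWDH NFIKJ CPXMY EIOCD TXBJW

KQGAN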

The frequency count is shown below:


A   B   C   D   E   F   G   H   I   J   K   L   M
11  6   11  3   5   5   6   13  8   4   7   5   12

N   O   P   Q   R   S   T   U   V   W   X   Y   Z
7   2   8   14  6   7   9   8   18  10  8   9   3

The IC drops to 0.041989, a value between $IC_{Random}$ and $IC_{English}$, because this ciphertext is neither plain English text nor random.

Example 5 The following is the ciphertext of Hoare’s quote in Example 3 (page 14) obtained by shifting the plaintext to the right three positions (i.e., encrypted with a mono-alphabetic cipher):

WKHUH DUHWZ RZDBV RIFRQ VWUXF WLQJD VRIWZ DUHGH VLJQR QHZDB

LVWRP DNHLW VRVLP SOHWK DWWKH UHDUH REYLR XVOBQ RGHIL FLHQF

LHVDQ GWKHR WKHUZ DBLVW RPDNH LWVRF RPSOL FDWHG WKDWW KHUHD

UHQRR EYLRX VGHIL FLHQF LHVWK HILUV WPHWK RGLVI DUPRU HGLII

LFXOW

The frequency count is shown below:


A   B   C   D   E   F   G   H   I   J   K   L   M
0   4   0   15  2   9   7   27  8   2   9   20  0

N   O   P   Q   R   S   T   U   V   W   X   Y   Z
2   4   6   9   19  2   0   12  15  22  4   2   5

⁵ Say, compute the limit of this term as N approaches infinity.


Comparing these values with those obtained in Example 3 for the same plaintext, we find that the table is also shifted to the right 3 positions. Therefore, the IC should be the same as the one computed in Example 3. Indeed, the result is exactly 0.068101!

Since $IC_{Random}$ is the smallest IC we can expect, which is the value for random text, most if not all English texts should have an IC between $IC_{Random} = 0.038462$ and $IC_{English} = 0.0686$. A text whose IC is close to $IC_{English}$ is likely a valid English text or an English text encrypted with a mono-alphabetic cipher.
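The invariance of the IC under a mono-alphabetic shift can be checked numerically. The sketch below is self-contained and uses hypothetical helpers `shift` and `ic`; it assumes upper-case input without spaces.

```python
import string
from collections import Counter

ALPHABET = string.ascii_uppercase


def shift(text, d):
    """Shift every letter d positions to the right, cyclically."""
    return "".join(ALPHABET[(ALPHABET.index(ch) + d) % 26] for ch in text)


def ic(text):
    """Index of coincidence per Equation (1)."""
    n = len(text)
    counts = Counter(text)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))


sample = "THEREARETWOWAYS"
shifted = shift(sample, 3)
print(shifted)
print(ic(sample) == ic(shifted))
```

A shift only renames the letters, so the multiset of counts $F_i$, and hence the IC, is unchanged.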

4 Estimating Keyword Length

In addition to the Kasiski test, there are several other ways to estimate the length of the unknown keyword. This section introduces two of them.

4.1 A Method Based on Index of Coincidence

The index of coincidence can be used to estimate the length of the unknown keyword. To this end, let us guess a length $l$ and divide the ciphertext $c = c_1 c_2 c_3 \cdots c_N$ into $l$ strings as follows:

• The first string $s_1$ starts with $c_1$ and includes every $l$-th letter: $s_1 = c_1 c_{1+l} c_{1+2l} c_{1+3l} \cdots$.

• The second string $s_2$ starts with $c_2$ and includes every $l$-th letter: $s_2 = c_2 c_{2+l} c_{2+2l} c_{2+3l} \cdots$.

• In general, the $i$-th string $s_i$ starts with $c_i$ and includes every $l$-th letter: $s_i = c_i c_{i+l} c_{i+2l} c_{i+3l} \cdots$.

• The $l$-th string $s_l$ starts with $c_l$ and includes every $l$-th letter: $s_l = c_l c_{2l} c_{3l} c_{4l} \cdots$.

In this way, we have $l$ strings $s_1, s_2, \ldots, s_l$, each of which is a subsequence of the ciphertext. These strings are usually referred to as cosets. Moreover, if the length of the keyword $l$ is correct, each coset is the encrypted result by the same letter of the keyword (Table 2, page 5). The “shorter” cosets have $\lfloor N/l \rfloor$ letters and the “full” cosets have $\lfloor N/l \rfloor + 1$ letters.

Example 6 Suppose we have the following ciphertext of 28 characters:

RSTCS JLSLR SLFEL GWLFI ISIKR MGL

If the length of the keyword is 3, this ciphertext is divided into three cosets as follows:

RCLRF GFSRL

SSSSE WIIM

TJLLL LIKG

The 28 letters are divided into 3 cosets with the second and third having 9 letters and the first having 10 letters.
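In Python, taking every $l$-th letter is exactly what an extended slice does, so the division into cosets is a one-liner. The function name `cosets` is our own; note that Python indexes from 0, so coset $s_i$ of the text is element `i - 1` of the returned list.

```python
def cosets(ciphertext, length):
    """Split the ciphertext into `length` cosets; coset i holds the letters
    at positions i, i + length, i + 2*length, ... (0-based)."""
    return [ciphertext[i::length] for i in range(length)]


for coset in cosets("RSTCSJLSLRSLFELGWLFIISIKRMGL", 3):
    print(coset, len(coset))
```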

If the length of the keyword is guessed right, each coset would preserve the English IC to some degree. Therefore, the IC’s of the cosets $s_1$ to $s_l$ should be closer to $IC_{English} = 0.068$, and the average of these IC’s would still be high. Otherwise, each coset would look more or less random, and the IC’s of these cosets would be closer to $IC_{Random} = 0.038$. Hence, the average of the IC’s would be low. Based on this observation, we may divide the ciphertext into 1 coset (the ciphertext itself), 2 cosets, 3 cosets, 4 cosets, etc., and compute the IC of each coset and the average. The length that yields the highest average IC is likely to be the correct length of the keyword.

Example 7 Let us continue with the previous example. The following table shows the IC’s and their averages of lengths 2, 3 and 4.

Length  Indices of Coincidence           Average
2       0.0769, 0.0659                   0.0714
3       0.1111, 0.1944, 0.1667           0.1574
4       0.0476, 0.0476, 0.0476, 0.0476   0.0476

Since length 3 yields the largest IC average, the ciphertext is very likely encrypted by a keyword of length 3.
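The whole estimation procedure (split into cosets, compute each IC, average, pick the largest) can be sketched as below. The names `ic`, `average_ic` and `best_length` are hypothetical, and the sketch assumes every coset has at least two letters.

```python
from collections import Counter


def ic(text):
    """Index of coincidence per Equation (1); 0.0 for very short strings."""
    n = len(text)
    if n < 2:
        return 0.0
    return sum(f * (f - 1) for f in Counter(text).values()) / (n * (n - 1))


def average_ic(ciphertext, length):
    """Average IC of the cosets obtained for a guessed keyword length."""
    parts = [ciphertext[i::length] for i in range(length)]
    return sum(ic(p) for p in parts) / length


def best_length(ciphertext, candidates):
    """Candidate length with the highest average IC."""
    return max(candidates, key=lambda l: average_ic(ciphertext, l))


text = "RSTCSJLSLRSLFELGWLFIISIKRMGL"
for l in (2, 3, 4):
    print(l, round(average_ic(text, l), 4))
print(best_length(text, range(2, 5)))
```

With so few letters per coset the individual IC values are noisy, so in practice the estimate is only a starting point for the iterative procedure described in the introduction.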

Example 8 Consider the ciphertext in Example 4 (page 15). The following shows a summary of possible keyword lengths from 1 to 10.

Length  Indices of Coincidence                             Average
1       0.041989                                           0.041989
2       0.046830, 0.044846                                 0.045838
3       0.040068, 0.045654, 0.046532                       0.044085
4       0.050528, 0.047059, 0.057255, 0.042353             0.049299
5       0.045122, 0.039024, 0.040244, 0.035366, 0.039024   0.039756
6       0.052101, 0.062389, 0.058824, 0.039216, 0.051693,  0.052058
        0.048128
7       0.039080, 0.041379, 0.041872, 0.032020, 0.068966,  0.042106
        0.041872, 0.029557
8       0.058462, 0.055385, 0.086154, 0.040000, 0.101538,  0.073109
        0.063333, 0.096667, 0.083333
9       0.031621, 0.043478, 0.075099, 0.047431, 0.043478,  0.043876
        0.043478, 0.023715, 0.043290, 0.043290
10      0.066667, 0.019048, 0.028571, 0.033333, 0.042857,  0.039048
        0.063158, 0.031579, 0.047368, 0.031579, 0.026316

The largest IC average 0.073109 corresponds to keyword length 8, and, hence, l = 8 is the most likely length of the keyword. Compared with Example 2 (page 12) using Kasiski’s method, keyword length 8 is very reasonable.

Example 9 Let us look at a longer ciphertext as follows:

TYWUR USHPO SLJNQ AYJLI FTMJY YZFPV EUZTS GAHTU WNSFW EEEVA

MYFFD CZTMJ WSQEJ VWXTU QNANT MTIAW AOOJS HPPIN TYDDM VKQUF

LGMLB XIXJU BQWXJ YQZJZ YMMZH DMFNQ VIAYE FLVZI ZQCSS AEEXV

SFRDS DLBQT YDTFQ NIVKU ZPJFJ HUSLK LUBQV JULAB XYWCD IEOWH

FTMXZ MMZHC AATFX YWGMF XYWZU QVPYF AIAFJ GEQCV KNATE MWGKX

SMWNA NIUSH PFSRJ CEQEE VJXGG BLBQI MEYMR DSDHU UZXVV VGFXV

JZXUI JLIRM RKZYY ASETY MYWWJ IYTMJ KFQQT ZFAQK IJFIP FSYAG

QXZVK UZPHF ZCYOS LJNQE MVK

The following shows a summary of possible keyword lengths from 1 to 10.

Length   1         2         3         4         5
Average  0.041180  0.044827  0.046035  0.045062  0.038385

Length   6         7         8         9         10
Average  0.062620  0.040136  0.044498  0.047109  0.041173

The largest IC average 0.062620 corresponds to keyword length 6, and, hence, l = 6 is the most likely length of the keyword.

Example 10 Here is one more example as follows:

WQXYM REOBP VWHTH QYEQV EDEXR BGSIZ SILGR TAJFZ OAMAV VXGRF

QGKCP IOZIJ BCBLU WYRWS TUGVQ PSUDI UWOES FMTBT ANCYZ TKTYB

VFDKD ERSIB JECAQ DWPDE RIEKG PRAQF BGTHQ KVVGR AXAVT HARQE

ELUEC GVVBJ EBXIJ AKNGE SWTKB EDXPB QOUDW VTXES MRUWW RPAWK

MTITK HFWTD AURRV FESFE STKSH FLZAE ONEXZ BWTIA RWWTT HQYEQ

VEDEX RBGSO REDMT ICM

The following shows a summary of possible keyword lengths from 1 to 10.

Length   1         2         3         4         5
Average  0.043351  0.041466  0.044919  0.039914  0.039010

Length   6         7         8         9         10
Average  0.041352  0.071317  0.037294  0.043934  0.040091

The largest IC average 0.071317 corresponds to keyword length 7, and, hence, l = 7 is the most likely length of the keyword.

To conclude this section, it is important to point out that sometimes you may have to look at a few more keyword lengths whose IC averages are close to the largest one, although the length with the largest IC average usually works well, and this method is simpler and easier than Kasiski’s method.

4.2 Expectation of Index of Coincidence

There is another way to estimate the length of the unknown keyword. Pick two letters from the ciphertext at random. They may be in the same coset or they may be in different cosets. Let us consider each case separately.

• The two letters are in the same coset: Since the coset length is N/l, there are $C_{N/l,2}$ different ways to choose two letters from the same coset. Since there are l cosets, the two letters can be in any one of them, so there are l different ways of choosing a coset. As a result, there are $l \times C_{N/l,2}$ different ways to pick two letters from the same coset. Moreover, there are $C_{N,2}$ different ways of picking any two letters. Therefore, the probability of the two randomly chosen letters being in the same coset is

\[
\frac{l \times C_{N/l,2}}{C_{N,2}}
= \frac{l \times \dfrac{\frac{N}{l}\left(\frac{N}{l}-1\right)}{2}}{\dfrac{N(N-1)}{2}}
= \frac{N\left(\frac{N}{l}-1\right)}{N(N-1)}
= \frac{1}{l}\cdot\frac{N-l}{N-1}
\]

• The two letters are in different cosets: Since we have l cosets, we have $C_{l,2}$ different ways to pick two cosets, each of which contains one chosen letter. Since each coset has length N/l, there are N/l different ways to pick a letter from each of the two cosets. Hence, there are $C_{l,2} \times (N/l) \times (N/l)$ different ways to pick these two letters. Since there are $C_{N,2}$ different ways of picking any two letters, the probability of the two randomly chosen letters being in different cosets is

\[
\frac{C_{l,2} \times \frac{N}{l} \times \frac{N}{l}}{C_{N,2}}
= \frac{\dfrac{l(l-1)}{2} \times \frac{N}{l} \times \frac{N}{l}}{\dfrac{N(N-1)}{2}}
= \frac{l-1}{l}\cdot\frac{N}{N-1}
\]

If the selected letters are in the same coset, they are encrypted using the same keyword letter. Therefore, the probability of their being equal is the IC of English, $IC_{English}$. If the selected letters are in different cosets, they are encrypted using different keyword letters. In this case, we may assume that the two letters behave like letters of a random text, so the probability of their being equal is $IC_{Random}$. The mathematical expectation of the probability of picking two identical letters, weighted by $IC_{English}$ and $IC_{Random}$, is:

\[
E[IC] = \frac{1}{l}\cdot\frac{N-l}{N-1}\cdot IC_{English} + \frac{l-1}{l}\cdot\frac{N}{N-1}\cdot IC_{Random}
\]

A rough correspondence between E[IC] and keyword length is shown in Table 4. Hence, from the IC of a ciphertext, we are able to estimate the corresponding keyword length.
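The expectation above is easy to evaluate numerically. The following is a minimal sketch with a hypothetical function name. The default values 0.066 and 0.038 are assumptions: with them, and with a very large N, the sketch gives 0.0660, 0.0520, 0.0473, ... for l = 1, 2, 3, ..., which agrees with the entries of Table 4. The text above quotes 0.068 for English, so pass that value if preferred.

```python
def expected_ic(length: int, n: int,
                ic_english: float = 0.066, ic_random: float = 0.038) -> float:
    """Expected IC of a ciphertext of n letters encrypted with a keyword of
    `length` letters, following the formula for E[IC] above."""
    p_same = (1 / length) * (n - length) / (n - 1)      # both letters in one coset
    p_diff = ((length - 1) / length) * n / (n - 1)      # letters in different cosets
    return p_same * ic_english + p_diff * ic_random


if __name__ == "__main__":
    n = 10**9  # stands in for "very long ciphertext"
    for l in range(1, 11):
        print(l, round(expected_ic(l, n), 4))
```

Note that the two probabilities always add up to 1, so E[IC] is a weighted average that moves from the English value toward the random value as the keyword gets longer.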

Example 11 Example 4 (page 15) calculated the IC of the following ciphertext of Hoare’s quote.





KQGAN

Table 4: Keyword Lengths and Associated Indices of Coincidence for English

Length  1       2       3       4       5
E[IC]   0.0660  0.0520  0.0473  0.0450  0.0436

Length  6       7       8       9       10
E[IC]   0.0427  0.0420  0.0415  0.0411  0.0408

Length  11      12      13      14      15
E[IC]   0.0405  0.0403  0.0402  0.0400  0.0399

Length  16      17      18      19      20
E[IC]   0.0397  0.0396  0.0396  0.0395  0.0394

The calculated IC is 0.041989. Since Table 4 shows that the closest E[IC] is that of length 7, the length of the unknown keyword is likely to be 7. We may also try l = 8 since the corresponding E[IC] of 0.0415 is also close to the calculated 0.041989.

Note that there are always some discrepancies between this and the previous method. The IC’s in Example 9 (page 18) and Example 10 (page 18) are 0.041180 and 0.043351, respectively. This method suggests that the keyword lengths are 9 for the ciphertext in Example 9 and 5 for the ciphertext in Example 10. Neither is as accurate as the result produced by the previous method. However, this method is an easy table lookup using one IC without complicated calculations.
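The table lookup can be sketched as below. The dictionary simply copies the entries of Table 4; the names `TABLE_4` and `closest_lengths` are hypothetical.

```python
# E[IC] values copied from Table 4, indexed by keyword length.
TABLE_4 = {
    1: 0.0660, 2: 0.0520, 3: 0.0473, 4: 0.0450, 5: 0.0436,
    6: 0.0427, 7: 0.0420, 8: 0.0415, 9: 0.0411, 10: 0.0408,
    11: 0.0405, 12: 0.0403, 13: 0.0402, 14: 0.0400, 15: 0.0399,
    16: 0.0397, 17: 0.0396, 18: 0.0396, 19: 0.0395, 20: 0.0394,
}


def closest_lengths(ic: float, how_many: int = 1) -> list[int]:
    """Keyword lengths whose tabulated E[IC] is closest to the measured IC."""
    ranked = sorted(TABLE_4, key=lambda length: abs(TABLE_4[length] - ic))
    return ranked[:how_many]


if __name__ == "__main__":
    for measured in (0.041989, 0.041180, 0.043351):
        print(measured, closest_lengths(measured, how_many=2))
```

Because neighbouring table entries differ by very little for longer keywords, asking for the two or three closest lengths is usually more useful than trusting the single best match.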

5 Keyword Recovery

If the estimated keyword length is correct, each coset constructed in Section 4.1 is encrypted with the same letter. The following is an encryption with keyword BOY.

MICHIGAN TECHNOLOGICAL UNIVERSITY

BOYBOYBO YBOYBOYBOYBOY BOYBOYBOYB

NWAIWEBB RFQFOCJPUGDOJ VBGWSPTWRZ

Since the keyword length is 3, we have the following cosets.


N I B F O P D V W T Z

W W B Q C U O B S W

A E R F J G J G P R

If A is considered to be the 0-th letter, the following table has the positions of all 26 English letters.

 0  1  2  3  4  5  6  7  8  9 10 11 12
 A  B  C  D  E  F  G  H  I  J  K  L  M

13 14 15 16 17 18 19 20 21 22 23 24 25
 N  O  P  Q  R  S  T  U  V  W  X  Y  Z

Since the first letter in the keyword is B, the plaintext letters corresponding to B are shifted to the right one position so that M becomes N and A becomes B. Since the second letter in the keyword is O, the plaintext letters corresponding to O are shifted to the right 14 positions so that I becomes W, N becomes B, and so on. Similarly, the third coset is obtained by shifting the plaintext letters corresponding to Y to the right 24 positions.
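The shifting just described can be sketched as a small routine. The function name `vigenere` is hypothetical, and the sketch assumes text and keyword consist of the letters A to Z; spaces are copied through and do not consume a keyword letter, which matches the alignment shown above.

```python
A = ord("A")


def vigenere(text: str, keyword: str, decrypt: bool = False) -> str:
    """Shift each letter right (encrypt) or left (decrypt) by the position
    of the matching keyword letter, where A = 0, B = 1, ..., Z = 25."""
    sign = -1 if decrypt else 1
    result = []
    used = 0  # number of keyword letters consumed so far
    for ch in text.upper():
        if ch.isalpha():
            shift = ord(keyword[used % len(keyword)].upper()) - A
            result.append(chr((ord(ch) - A + sign * shift) % 26 + A))
            used += 1
        else:
            result.append(ch)  # keep spaces and punctuation unchanged
    return "".join(result)


if __name__ == "__main__":
    cipher = vigenere("MICHIGAN TECHNOLOGICAL UNIVERSITY", "BOY")
    print(cipher)
    print(vigenere(cipher, "BOY", decrypt=True))
```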

In the case of only knowing the three cosets, we need to shift each of them to the left some positions to get the plaintext back. More precisely, each coset is shifted to the left by 1 position, 2 positions, ..., and 25 positions. A shift of 0 positions need not be performed because it leaves the coset unchanged. Since each shift, including the shift by 0, produces a possible decryption of the coset, there are 26 different possibilities. If we have k cosets, the total number of shift combinations is $26^k$, which can be very large even if k is small. For example, if the possible keyword length is 8, there are $26^8 = 208,827,064,576$ possible shift combinations (or possible keywords). With such a large number of combinations, it is very difficult to verify which shift of a coset can yield the correct keyword. Consequently, we need a better method than brute force.

There is a simple method based on the frequency of letters in English (Table 3, page 14). Since each coset is encrypted by the same letter, its letter frequency does not look like that of a typical English text. Shifting a coset changes its frequency. Of the 26 possible shifts, one yields the original plaintext, whose frequency should be very similar to the frequency of English. Therefore, we may compare the frequency of each shift against the frequency of English, and the shift that produces a frequency closest to the English frequency is likely to be the correct shift. But what is the meaning of “closest”? Fortunately, in statistics there are methods to measure goodness-of-fit, one of which is the χ² method. Given a set of observed values $f_1, f_2, \ldots, f_n$ and a set of corresponding known/expected values $F_1, F_2, \ldots, F_n$, the χ² is computed as follows:

\[
\chi^2 = \sum_{i=1}^{n} \frac{(f_i - F_i)^2}{F_i}
\]

In our case, the $F_i$’s are the values in Table 3 (page 14), which are known, and the $f_i$’s are the frequencies obtained from a shift. The shift of a coset that produces the smallest χ² value is the one whose frequency is the closest to that of the English language. However, the shift corresponding to the smallest χ² may not always be the correct choice. In general, we have to examine several shifts that correspond to some small χ² values.

Consider the second coset WWBQCUOBSW discussed earlier. The count, frequency $f_i$ and frequency $F_i$ from Table 3 are shown below. The computed χ² is 17.0130.

Letter  A      B      C      D      E      F      G      H      I      J      K      L      M
Count   0      2      1      0      0      0      0      0      0      0      0      0      0
f_i     0      0.2    0.1    0      0      0      0      0      0      0      0      0      0
F_i     0.082  0.014  0.028  0.038  0.131  0.029  0.020  0.053  0.064  0.001  0.004  0.034  0.025

Letter  N      O      P      Q      R      S      T      U      V      W      X      Y      Z
Count   0      1      0      1      0      1      0      1      0      3      0      0      0
f_i     0      0.1    0      0.1    0      0.1    0      0.1    0      0.3    0      0      0
F_i     0.071  0.080  0.020  0.001  0.068  0.061  0.105  0.025  0.009  0.015  0.002  0.020  0.001

χ² = 17.0130

If the coset WWBQCUOBSW is shifted to the left by one position, we have VVAPBTNARV and the following table. The computed χ² is 10.8557.

Letter  A      B      C      D      E      F      G      H      I      J      K      L      M
Count   2      1      0      0      0      0      0      0      0      0      0      0      0
f_i     0.2    0.1    0      0      0      0      0      0      0      0      0      0      0
F_i     0.082  0.014  0.028  0.038  0.131  0.029  0.020  0.053  0.064  0.001  0.004  0.034  0.025

Letter  N      O      P      Q      R      S      T      U      V      W      X      Y      Z
Count   1      0      1      0      1      0      1      0      3      0      0      0      0
f_i     0.1    0      0.1    0      0.1    0      0.1    0      0.3    0      0      0      0
F_i     0.071  0.080  0.020  0.001  0.068  0.061  0.105  0.025  0.009  0.015  0.002  0.020  0.001

χ² = 10.8557
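The χ² computation for a single coset and a single shift can be sketched as follows. The expected frequencies are the $F_i$ values as printed to three decimals in the tables above; because of this rounding (rare letters such as Q are especially sensitive), the values produced by the sketch need not agree to every digit with the figures 17.0130 and 10.8557 quoted in the text, although their ordering is the same. The names `shift_left` and `chi_squared` are hypothetical.

```python
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Expected English frequencies F_i, copied from the tables above (3 decimals).
ENGLISH_FREQ = dict(zip(ALPHABET, [
    0.082, 0.014, 0.028, 0.038, 0.131, 0.029, 0.020, 0.053, 0.064, 0.001,
    0.004, 0.034, 0.025, 0.071, 0.080, 0.020, 0.001, 0.068, 0.061, 0.105,
    0.025, 0.009, 0.015, 0.002, 0.020, 0.001]))


def shift_left(text: str, k: int) -> str:
    """Shift every letter of `text` to the left by k positions (mod 26)."""
    return "".join(ALPHABET[(ALPHABET.index(c) - k) % 26] for c in text)


def chi_squared(text: str) -> float:
    """Sum over all 26 letters of (f_i - F_i)^2 / F_i, where f_i is the
    relative frequency of letter i in `text`."""
    n = len(text)
    counts = Counter(text)  # missing letters count as 0
    return sum((counts[ch] / n - expected) ** 2 / expected
               for ch, expected in ENGLISH_FREQ.items())


if __name__ == "__main__":
    coset = "WWBQCUOBSW"
    for k in (0, 1):
        shifted = shift_left(coset, k)
        print(k, shifted, round(chi_squared(shifted), 4))
```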

Table 5 shows the 26 χ² values of each coset with the smallest one in boldface. The smallest χ² of coset 1 is 1.9532, which corresponds to the letter B (i.e., shifting to the left one position). The smallest χ² of coset 2 is 2.1695, which corresponds to the letter O (i.e., shifting to the left 14 positions). The smallest χ² of coset 3 is 2.3933, which corresponds to the letter Y (i.e., shifting to the left 24 positions). In other words, the first, second and third cosets are encrypted by B, O and Y, respectively. Therefore, we are lucky and find the correct keyword BOY.
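Ranking all 26 shifts of every coset, as Table 5 does, can be sketched as below. The frequency list again uses the three-decimal $F_i$ values printed in this section, so individual χ² values may differ slightly from Table 5. The names `rank_shifts` and `recover_keyword` are hypothetical.

```python
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
ENGLISH_FREQ = dict(zip(ALPHABET, [
    0.082, 0.014, 0.028, 0.038, 0.131, 0.029, 0.020, 0.053, 0.064, 0.001,
    0.004, 0.034, 0.025, 0.071, 0.080, 0.020, 0.001, 0.068, 0.061, 0.105,
    0.025, 0.009, 0.015, 0.002, 0.020, 0.001]))


def chi_squared_of_shift(coset: str, k: int) -> float:
    """Chi-squared of the coset after shifting it left by k positions."""
    shifted = Counter(ALPHABET[(ALPHABET.index(c) - k) % 26] for c in coset)
    n = len(coset)
    return sum((shifted[ch] / n - F) ** 2 / F for ch, F in ENGLISH_FREQ.items())


def rank_shifts(coset: str, top: int = 4) -> list[tuple[str, float]]:
    """The `top` keyword letters with the smallest chi-squared, best first."""
    scored = sorted((chi_squared_of_shift(coset, k), k) for k in range(26))
    return [(ALPHABET[k], round(score, 4)) for score, k in scored[:top]]


def recover_keyword(ciphertext: str, length: int) -> str:
    """Keyword built from the best-scoring letter of each coset."""
    letters = "".join(c for c in ciphertext.upper() if c in ALPHABET)
    return "".join(rank_shifts(letters[i::length], top=1)[0][0]
                   for i in range(length))


if __name__ == "__main__":
    text = "NWAIW EBBRF QFOCJ PUGDO JVBGW SPTWR Z"
    print(recover_keyword(text, 3))
    letters = text.replace(" ", "")
    for i in range(3):
        print("coset", i + 1, rank_shifts(letters[i::3]))
```

Printing several candidates per coset, rather than only the best one, is a deliberate choice: on short cosets the smallest χ² is not always the correct shift.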

Suppose MICHIGAN TECHNOLOGICAL UNIVERSITY is again encrypted to the following with an unknown keyword of length 4:

YITZU GRFFE TZZOC GSITS XUEAH EIKUT P

The χ² values of all shifts for each coset are shown in Table 6. To save space, this table only shows the four smallest χ² values of each coset.

The smallest χ² values suggest the keyword to be UAPS and the decrypted result is EIEHAGCNLEEHFONOYIEADUPINETSATA. This is certainly not correct. If we align the plaintext, ciphertext and decrypted text together as follows, we should be able to see the problem.

MICHIGAN TECHNOLOGICAL UNIVERSITY

YITZUGRF FETZZOCGSITSX UEAHEIKUTP

EIEHAGCN LEEHFONOYIEAD UPINETSATA

It is obvious that the second shift, which corresponds to A, and the fourth shift, which corresponds to S, are correct, because the letters in the corresponding positions of the plaintext and the decrypted text are identical. However, the first shift U and the third shift P are not. Therefore, the unknown keyword looks like the following:

{F, M, S, U}  A  {A, C, P, R}  S

The following has all 16 possible combinations:

Table 5: χ² Values for Ciphertext NWAIW EBBRF QFOCJ PUGDO JVBGW SPTWR Z

Shift  Corresponding Letter  Coset 1 χ²  Coset 2 χ²  Coset 3 χ²
0      A                     12.6808     17.0130     33.4114
1      B                     1.9532      10.8557     47.2982
2      C                     16.6228     61.7972     3.3558
3      D                     10.2763     15.4671     9.8983
4      E                     24.9700     35.7427     4.4140
5      F                     16.1760     17.4307     19.7483
6      G                     29.5341     82.8543     22.8300
7      H                     2.5481      14.5767     66.6135
8      I                     6.3800      4.3482      41.4530
9      J                     20.3966     9.5387      26.0354
10     K                     8.4236      6.4101      62.5271
11     L                     14.2454     43.6233     8.4614
12     M                     9.8439      31.8807     26.1736
13     N                     15.2270     70.7267     3.6981
14     O                     11.8107     2.1695      14.6269
15     P                     19.8472     16.2274     11.9170
16     Q                     22.7962     6.1626      51.0042
17     R                     8.1086      31.2416     10.6299
18     S                     19.9131     34.4761     56.6600
19     T                     4.6458      29.8607     36.4451
20     U                     18.6617     4.8624      36.3898
21     V                     3.3357      27.0150     14.0996
22     W                     21.9697     3.7015      21.4566
23     X                     18.5799     118.3588    33.7453
24     Y                     19.8023     14.2303     2.3933
25     Z                     19.8783     55.4882     19.8128

FAAS MAAS SAAS UAAS

FACS MACS SACS UACS

FAPS MAPS SAPS UAPS

FARS MARS SARS UARS

The following shows the decryption with each of the 16 possible keywords:

FAAS  TITHPGRN AETHUOCONITAS UEICEISPTP
FACS  TIRHPGPN AERHUOAONIRAS UCICEGSPTN
FAPS  TIEHPGCN AEEHUONONIEAS UPICETSPTA
FARS  TICHPGAN AECHUOLONICAS UNICERSPTY
MAAS  MITHIGRN TETHNOCOGITAL UEIVEISITP
MACS  MIRHIGPN TERHNOAOGIRAL UCIVEGSITN
MAPS  MIEHIGCN TEEHNONOGIEAL UPIVETSITA
MARS  MICHIGAN TECHNOLOGICAL UNIVERSITY
SAAS  GITHCGRN NETHHOCOAITAF UEIPEISCTP
SACS  GIRHCGPN NERHHOAOAIRAF UCIPEGSCTN
SAPS  GIEHCGCN NEEHHONOAIEAF UPIPETSCTA
SARS  GICHCGAN NECHHOLOAICAF UNIPERSCTY
UAAS  EITHAGRN LETHFOCOYITAD UEINEISATP
UACS  EIRHAGPN LERHFOAOYIRAD UCINEGSATN
UAPS  EIEHAGCN LEEHFONOYIEAD UPINETSATA
UARS  EICHAGAN LECHFOLOYICAD UNINERSATY

Table 6: χ² Values for Ciphertext YITZU GRFFE TZZOC GSITS XUEAH EIKUT P

Shift  Letter  Coset 1 χ²  Coset 2 χ²  Coset 3 χ²  Coset 4 χ²
0      A                   2.2259      2.2292
1      B                   2.9978
2      C                   5.6245      3.6112
5      F       4.1750
6      G                   5.2742
12     M       3.9131                              3.4856
14     O                                           4.3258
15     P                               1.9889
17     R                               5.9857
18     S       4.6828                              2.0008
20     U       2.3036
25     Z                                           3.6293

Therefore, the correct keyword is MARS. Compared with $26^4 = 456,976$, 16 is significantly smaller, so brute force can be used to recover the keyword and the plaintext. Note that the success rate is much higher if the ciphertext is long and the keyword is relatively short.
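The small brute-force search over the 16 candidate keywords can be sketched with `itertools.product`. The candidate letters per position are taken from the 16 combinations listed above; the names `CANDIDATES`, `decrypt` and `candidate_keywords` are hypothetical, and a human reader still has to pick the meaningful plaintext from the output.

```python
from itertools import product

CIPHERTEXT = "YITZUGRFFETZZOCGSITSXUEAHEIKUTP"

# Candidate letters for keyword positions 1 to 4, as in the 16 combinations above.
CANDIDATES = ["FMSU", "A", "ACPR", "S"]


def decrypt(ciphertext: str, keyword: str) -> str:
    """Shift each ciphertext letter left by the matching keyword letter."""
    return "".join(
        chr((ord(c) - ord(keyword[i % len(keyword)])) % 26 + ord("A"))
        for i, c in enumerate(ciphertext))


def candidate_keywords(candidates) -> list[str]:
    """All keywords obtained by choosing one letter per position."""
    return ["".join(choice) for choice in product(*candidates)]


if __name__ == "__main__":
    for keyword in candidate_keywords(CANDIDATES):
        print(keyword, decrypt(CIPHERTEXT, keyword))
```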

Example 12 The following ciphertext was discussed in Example 4 (page 15).





KQGAN

In Example 8 (page 17) we found the likely keyword length to be 8. The ciphertext is divided into eight cosets and the smallest χ² of each coset and its corresponding shift letters are shown below.

Letter  χ²
C       1.051619
O       1.147637
M       1.019016
P       0.702297
U       0.777985
T       0.684600
E       1.042904
R       0.645601

The possible keyword is COMPUTER with which we can decrypt the ciphertext correctly.

Example 13 Example 9 (page 18) determined the keyword length of the following ciphertext to be 6.

TYWUR USHPO SLJNQ AYJLI FTMJY YZFPV EUZTS GAHTU WNSFW EEEVA

MYFFD CZTMJ WSQEJ VWXTU QNANT MTIAW AOOJS HPPIN TYDDM VKQUF

LGMLB XIXJU BQWXJ YQZJZ YMMZH DMFNQ VIAYE FLVZI ZQCSS AEEXV

SFRDS DLBQT YDTFQ NIVKU ZPJFJ HUSLK LUBQV JULAB XYWCD IEOWH

FTMXZ MMZHC AATFX YWGMF XYWZU QVPYF AIAFJ GEQCV KNATE MWGKX

SMWNA NIUSH PFSRJ CEQEE VJXGG BLBQI MEYMR DSDHU UZXVV VGFXV

JZXUI JLIRM RKZYY ASETY MYWWJ IYTMJ KFQQT ZFAQK IJFIP FSYAG

QXZVK UZPHF ZCYOS LJNQE MVK

The ciphertext is divided into 6 cosets. The smallest χ² value of each coset is shown below.

Letter  χ²
S       0.598145
U       0.388866
M       0.450186
M       0.311680
E       0.249809
R       0.679312

Thus, the keyword is very likely to be SUMMER. The decrypted text, with spaces and punctuation added, is as follows:

BE KIND AND COURTEOUS TO THIS GENTLEMAN.

HOP IN HIS WALKS AND GAMBOL IN HIS EYES.

FEED HIM WITH APRICOCKS AND DEWBERRIES,

WITH PURPLE GRAPES, GREEN FIGS AND MULBERRIES.

THE HONEY BAGS STEAL FROM THE HUMBLEBEES,

AND FOR NIGHT TAPERS CROP THEIR WAXEN THIGHS

AND LIGHT THEM AT THE FIERY GLOWWORMS’ EYES

TO HAVE MY LOVE TO BED AND TO ARISE.

AND PLUCK THE WINGS FROM PAINTED BUTTERFLIES

TO FAN THE MOONBEAMS FROM HIS SLEEPING EYES.

NOD TO HIM, ELVES AND DO HIM COURTESIES.

This is what Titania said in Act 3, Scene 1 of William Shakespeare’s A Midsummer Night’s Dream. Of course, this is a correct decryption.

Example 14 The following ciphertext was discussed in Example 10 (page 18). The possible keyword length is 7.

WQXYM REOBP VWHTH QYEQV EDEXR BGSIZ SILGR TAJFZ OAMAV VXGRF

QGKCP IOZIJ BCBLU WYRWS TUGVQ PSUDI UWOES FMTBT ANCYZ TKTYB

VFDKD ERSIB JECAQ DWPDE RIEKG PRAQF BGTHQ KVVGR AXAVT HARQE

ELUEC GVVBJ EBXIJ AKNGE SWTKB EDXPB QOUDW VTXES MRUWW RPAWK

MTITK HFWTD AURRV FESFE STKSH FLZAE ONEXZ BWTIA RWWTT HQYEQ

VEDEX RBGSO REDMT ICM

This ciphertext is divided into 7 cosets. The smallest χ² value of each coset is shown below:

Letter  χ²
A       0.268109
M       0.975344
E       0.314517
R       0.278291
I       1.367654
C       0.591278
A       0.650083

Therefore, AMERICA is a possible keyword and the decrypted text, with spaces and punctuation added, is shown below:

WE THE PEOPLE OF THE UNITED STATES, IN ORDER TO FORM A MORE PERFECT UNION,

ESTABLISH JUSTICE, INSURE DOMESTIC TRANQUILITY, PROVIDE FOR THE COMMON

DEFENCE, PROMOTE THE GENERAL WELFARE, AND SECURE THE BLESSINGS OF LIBERTY

TO OURSELVES AND OUR POSTERITY, DO ORDAIN AND ESTABLISH THIS CONSTITUTION

FOR THE UNITED STATES OF AMERICA.

You certainly know what this quote is.

6 Complete Examples

This section discusses a few complete examples. Each example uses Kasiski’s method and the index of coincidence method to determine a possible keyword length, with which a possible keyword is constructed and used to decrypt the ciphertext. This procedure repeats until a meaningful plaintext is found. Keyword length search is limited to the range from 2 to 20 in this section. With a computer program, we can search for any length, so this restriction is purely artificial.
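The IC and χ² steps of this procedure can be chained into one short program. What follows is a rough sketch under several assumptions, not the tool used to produce the examples: it performs a single pass (length with the highest average coset IC, then the smallest χ² per coset), uses the three-decimal English frequencies printed in Section 5, and limits the search to lengths 1 to 10 as in Example 9. The name `break_vigenere` is hypothetical; the input is the ciphertext of Example 9.

```python
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
ENGLISH_FREQ = dict(zip(ALPHABET, [
    0.082, 0.014, 0.028, 0.038, 0.131, 0.029, 0.020, 0.053, 0.064, 0.001,
    0.004, 0.034, 0.025, 0.071, 0.080, 0.020, 0.001, 0.068, 0.061, 0.105,
    0.025, 0.009, 0.015, 0.002, 0.020, 0.001]))

# Ciphertext of Example 9.
CIPHERTEXT = """
TYWUR USHPO SLJNQ AYJLI FTMJY YZFPV EUZTS GAHTU WNSFW EEEVA
MYFFD CZTMJ WSQEJ VWXTU QNANT MTIAW AOOJS HPPIN TYDDM VKQUF
LGMLB XIXJU BQWXJ YQZJZ YMMZH DMFNQ VIAYE FLVZI ZQCSS AEEXV
SFRDS DLBQT YDTFQ NIVKU ZPJFJ HUSLK LUBQV JULAB XYWCD IEOWH
FTMXZ MMZHC AATFX YWGMF XYWZU QVPYF AIAFJ GEQCV KNATE MWGKX
SMWNA NIUSH PFSRJ CEQEE VJXGG BLBQI MEYMR DSDHU UZXVV VGFXV
JZXUI JLIRM RKZYY ASETY MYWWJ IYTMJ KFQQT ZFAQK IJFIP FSYAG
QXZVK UZPHF ZCYOS LJNQE MVK
"""


def ic(text: str) -> float:
    n = len(text)
    return sum(f * (f - 1) for f in Counter(text).values()) / (n * (n - 1))


def average_ic(text: str, length: int) -> float:
    return sum(ic(text[i::length]) for i in range(length)) / length


def shift_left(text: str, k: int) -> str:
    return "".join(ALPHABET[(ord(c) - 65 - k) % 26] for c in text)


def chi_squared(text: str) -> float:
    n, counts = len(text), Counter(text)
    return sum((counts[ch] / n - F) ** 2 / F for ch, F in ENGLISH_FREQ.items())


def break_vigenere(ciphertext: str, max_length: int = 10) -> tuple[str, str]:
    """Return (keyword, plaintext) from one pass of the IC and chi-squared steps."""
    text = "".join(c for c in ciphertext.upper() if c in ALPHABET)
    length = max(range(1, max_length + 1), key=lambda l: average_ic(text, l))
    shifts = [min(range(26), key=lambda k, coset=text[i::length]:
                  chi_squared(shift_left(coset, k)))
              for i in range(length)]
    keyword = "".join(ALPHABET[k] for k in shifts)
    plain = "".join(shift_left(c, shifts[i % length]) for i, c in enumerate(text))
    return keyword, plain


if __name__ == "__main__":
    keyword, plaintext = break_vigenere(CIPHERTEXT)
    print(keyword)
    print(plaintext)
```

Two caveats apply. Multiples of the true keyword length also give high average IC’s, so widening the search range may require preferring the smallest length among the high scorers. And when the output is not meaningful, the next-best lengths and shifts have to be tried, as described above.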

Example 15 The following is a ciphertext to be analyzed.

DAZFI SFSPA VQLSN PXYSZ WXALC DAFGQ UISMT PHZGA MKTTF TCCFX

KFCRG GLPFE TZMMM ZOZDE ADWVZ WMWKV GQSOH QSVHP WFKLS LEASE

PWHMJ EGKPU RVSXJ XVBWV POSDE TEQTX OBZIK WCXLW NUOVJ MJCLL

OEOFA ZENVM JILOW ZEKAZ EJAQD ILSWW ESGUG KTZGQ ZVRMN WTQSE

OTKTK PBSTA MQVER MJEGL JQRTL GFJYG SPTZP GTACM OECBX SESCI

YGUFP KVILL TWDKS ZODFW FWEAA PQTFS TQIRG MPMEL RYELH QSVWB

AWMOS DELHM UZGPG YEKZU KWTAM ZJMLS EVJQT GLAWV OVVXH KWQIL

IEUYS ZWXAH HUSZO GMUZQ CIMVZ UVWIF JJHPW VXFSE TZEDF

The first task is estimating the keyword length. Kasiski’s method found the following repeated strings and their positions.

Length  String  Positions  Distance
6       YSZWXA  17   353   336
4       HQSV    84   294   210
        MJEG    103  215   112
        OSDE    121  303   182
3       ETZ     59   389   330
        HPW     88   382   294
        AZE     154  168   14
        TAM     208  322   114
        SZO     264  362   98
        ELH     292  306   14
        MUZ     309  366   57
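The search for repeated strings can be sketched as follows. The sketch reports 0-based positions, which is the convention the positions in the table above appear to follow; the function names are hypothetical, and only strings of one fixed length n are considered per call.

```python
from collections import defaultdict


def repeated_ngrams(text: str, n: int = 3) -> dict[str, list[int]]:
    """Map every string of length n that occurs more than once in `text`
    to the list of its 0-based starting positions."""
    positions = defaultdict(list)
    for i in range(len(text) - n + 1):
        positions[text[i:i + n]].append(i)
    return {gram: where for gram, where in positions.items() if len(where) > 1}


def distances(positions: list[int]) -> list[int]:
    """Distances between consecutive occurrences of one repeated string."""
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    sample = "ABCXYZABCQRSABC"  # toy text; use the ciphertext without spaces in practice
    for gram, where in repeated_ngrams(sample, 3).items():
        print(gram, where, distances(where))
```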

The following table shows the distances and their factors. The most common factors are 2, 3, 7 and 14. Since a keyword of length 2 is unlikely, we have three estimates 3, 7 and 14. Of these three possible keyword lengths, 14 is the most likely because it has the highest count.


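Distance  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
336       X  X  X  .  X  X  X  .  .  .  X  .  X  .  X  .  .  .  .
330       X  X  .  X  X  .  .  .  X  X  .  .  .  X  .  .  .  .  .
294       X  X  .  .  X  X  .  .  .  .  .  .  X  .  .  .  .  .  .
210       X  X  .  X  X  X  .  .  X  .  .  .  X  X  .  .  .  .  .
182       X  .  .  .  .  X  .  .  .  .  .  X  X  .  .  .  .  .  .
114       X  X  .  .  X  .  .  .  .  .  .  .  .  .  .  .  .  X  .
112       X  .  X  .  .  X  X  .  .  .  .  .  X  .  X  .  .  .  .
98        X  .  .  .  .  X  .  .  .  .  .  .  X  .  .  .  .  .  .
57        .  X  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  X  .
14        X  .  .  .  .  .  .  .  .  .  .  .  X  .  .  .  .  .  .
Total     9  6  2  2  5  6  2  0  2  1  1  1  7  2  2  0  0  2  0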
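Tallying the factors of the distances is a matter of divisibility tests. The sketch below uses a hypothetical function name and the distinct distances listed above; since it counts every factor from 2 to 20 that divides a distance, it can also serve as a cross-check of a tally made by hand.

```python
def factor_counts(distances, lowest: int = 2, highest: int = 20) -> dict[int, int]:
    """For each candidate factor, count how many distances it divides."""
    return {f: sum(1 for d in distances if d % f == 0)
            for f in range(lowest, highest + 1)}


if __name__ == "__main__":
    # Distinct distances between repeated strings in Example 15.
    example_15 = [336, 330, 294, 210, 182, 114, 112, 98, 57, 14]
    counts = factor_counts(example_15)
    for factor, count in sorted(counts.items(), key=lambda item: -item[1]):
        print(factor, count)
```

For Example 15 the factors 2, 3, 7 and 14 come out on top.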

The following table has the average index of coincidence value for each length. The highest one is at length 14 with an average of 0.064378. Hence, we have strong evidence showing that the keyword length is 14.

Length   1         2         3         4         5
Average  0.042614  0.044759  0.042969  0.042926  0.041610

Length   6         7         8         9         10
Average  0.044117  0.050827  0.043529  0.041019  0.044318

Length   11        12        13        14        15
Average  0.042662  0.041208  0.040273  0.064378  0.040722

Length   16        17        18        19        20
Average  0.042582  0.047256  0.041294  0.042000  0.043187

The table below has the smallest χ² value of each coset and the corresponding letter:

Letter  χ²
A       0.556354
M       0.435045
B       0.738187
R       0.589010
O       1.061695
I       0.824027
S       1.322445
E       0.372434
T       1.188328
H       0.981700
O       1.096823
M       2.236836
A       0.726619
S       0.625189

Thus, the recovered keyword is AMBROISETHOMAS. The following is the decrypted text with spaces and punctuation added. We are lucky to decrypt it in one shot.

DO YOU KNOW THE LAND WHERE THE ORANGE TREE BLOSSOMS?

THE COUNTRY OF GOLDEN FRUITS AND MARVELOUS ROSES,

WHERE THE BREEZE IS SOFTER AND BIRDS LIGHTER,

WHERE BEES GATHER POLLEN IN EVERY SEASON,

AND WHERE SHINES AND SMILES, LIKE A GIFT FROM GOD,

AN ETERNAL SPRINGTIME UNDER AN EVER-BLUE SKY!

ALAS! BUT I CANNOT FOLLOW YOU

TO THAT HAPPY SHORE FROM WHICH FATE HAS EXILED ME!

THERE! IT IS THERE THAT I SHOULD LIKE TO LIVE,

TO LOVE, TO LOVE, AND TO DIE!

IT IS THERE THAT I SHOULD LIKE TO LIVE, IT IS THERE, YES, THERE!

This is part one of “Connais-tu le pays” from Ambroise Thomas’ opera Mignon. In the first act, Mignon speaks to Wilhelm and Lothario, who rescue her, tells of her abduction, and describes her past with this beautiful and well-known aria. The above is an English translation from French. “Connais-tu le pays” means “Do you know this country (or land)”.

Example 16 The following is the ciphertext to be analyzed.

QRBAI UWYOK ILBRZ XTUWL EGXSN VDXWR XMHXY FCGMW WWSME LSXUZ

MKMFS BNZIF YEIEG RFZRX WKUFA XQEDX DTTHY NTBRJ LHTAI KOCZX

QHBND ZIGZG PXARJ EDYSJ NUMKI FLBTN HWISW NVLFM EGXAI AAWSL

FMHXR SGRIG HEQTU MLGLV BRSIL AEZSG XCMHT OWHFM LWMRK HPRFB

ELWGF RUGPB HNBEM KBNVW HHUEA KILBN BMLHK XUGML YQKHP RFBEL

EJYNV WSIJB GAXGO TPMXR TXFKI WUALB RGWIE GHWHG AMEWW LTAEL

NUMRE UWTBL SDPRL YVRET LEEDF ROBEQ UXTHX ZYOZB XLKAC KSOHN

VWXKS MAEPH IYQMM FSECH RFYPB BSQTX TPIWH GPXQD FWTAI KNNBX

SIYKE TXTLV BTMQA LAGHG OTPMX RTXTH XSFYG WMVKH LOIVU ALMLD

LTSYV WYNVW MQVXP XRVYA BLXDL XSMLW SUIOI IMELI SOYEB HPHNR

WTVUI AKEYG WIETG WWBVM VDUMA EPAUA KXWHK MAUPA MUKHQ PWKCX

EFXGW WSDDE OMLWL NKMWD FWTAM FAFEA MFZBN WIHYA LXRWK MAMIK

GNGHJ UAZHM HGUAL YSULA ELYHJ BZMSI LAILH WWYIK EWAHN PMLBN

NBVPJ XLBEF WRWGX KWIRH XWWGQ HRRXW IOMFY CZHZL VXNVI OYZCM

YDDEY IPWXT MMSHS VHHXZ YEWNV OAOEL SMLSW KXXFX STRVI HZLEF

JXDAS FIE

Kasiski’s method found many repeating strings, as shown in Table 7 (page 30). Since there are potentially many repeating strings of length 3 (i.e., trigraphs), they usually can help determine the keyword length. Table 8 (page 31) shows the distances and their factors of repeating strings of length 3. The most common factors are 2, 3, 4, 6 and 12 with occurrences 23, 25, 18, 20 and 18, respectively. In this case, even though the factor 2 is ignored, the correct length of the keyword is not obvious. This is a common problem with Kasiski’s method.

The following table has the average index of coincidence value for each length. The highest one is at length 12 with an average of 0.067244. Hence, we have strong evidence showing that the keyword length is 12. Note that the length 12 does not have the highest count with Kasiski’s method (Table 8).

Length   1         2         3         4         5
Average  0.043760  0.044658  0.049472  0.050311  0.043309




The following table has the smallest χ² value of each coset and the corresponding letter:

Table 7: Repeated Strings, Positions and Distances in Example 16

Length  String     Positions  Distance
9       GOTPMXRTX  263  419   156
8       KHPRFBEL   194  242   48
5       DFWTA      389  569   180
4       KILB       9    225   216
        TAIK       92   392   300
        SILA       172  628   456
        YNVW       252  456   204
        GWIE       281  509   228
        XTHX       331  427   96
        HXZY       333  717   384
        MAEP       355  523   168
3       LBR        11   278   267
        EGX        20   140   120
        MHX        31   151   120
        WWS        40   554   514
        MEL        43   486   443
        ELS        44   728   684
        MFS        52   364   312
        IEG        62   283   221
        RXW        68   677   609
        GPX        109  385   276
        NUM        120  300   180
        WNV        134  722   588
        LFM        137  149   12
        LVB        168  408   240
        LAE        174  618   444
        MLW        189  477   288
                   477  561   84
        NVW        217  253   36
                   253  349   96
                   349  457   108
        LBN        227  647   420
        UAL        276  444   168
                   444  612   168
        WHG        287  383   96
        AEL        297  619   322
        TXT        378  405   27
                   405  426   21
        NNB        396  649   253
        YGW        433  508   75
        SML        476  730   254
        GWW        514  553   39
        KMA        534  594   60
        DDE        557  701   144
        AMF        573  579   6
        HZL        687  745   58

Letter  χ²
U       0.231760
N       0.317295
I       0.340955
T       0.529938
E       0.302274
D       0.395590
S       0.455630
T       0.393412
A       0.219040
T       0.354012
E       0.321425
S       0.404944

Thus, the recovered keyword is UNITEDSTATES. The following is the decrypted plaintext with spaces and punctuation added.

Table 8: Distances and Factors in Example 16


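Distance  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
684       X  X  X  .  X  .  .  X  .  .  X  .  .  .  .  .  X  X  .
609       .  X  .  .  .  X  .  .  .  .  .  .  .  .  .  .  .  .  .
588       X  X  X  .  X  X  .  .  .  .  X  .  X  .  .  .  .  .  .
514       X  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
444       X  X  X  .  X  .  .  .  .  .  X  .  .  .  .  .  .  .  .
443       .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
420       X  X  X  X  X  X  .  .  X  .  X  .  X  X  .  .  .  .  X
322       X  .  .  .  .  X  .  .  .  .  .  .  X  .  .  .  .  .  .
312       X  X  X  .  X  .  X  .  .  .  X  X  .  .  .  .  .  .  .
288       X  X  X  .  X  .  X  X  .  .  X  .  .  .  X  .  X  .  .
276       X  X  X  .  X  .  .  .  .  .  X  .  .  .  .  .  .  .  .
267       .  X  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
254       X  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
253       .  .  .  .  .  .  .  .  .  X  .  .  .  .  .  .  .  .  .
240       X  X  X  X  X  .  X  .  X  .  X  .  .  X  X  .  .  .  X
221       .  .  .  .  .  .  .  .  .  .  .  X  .  .  .  X  .  .  .
180       X  X  X  X  X  .  .  X  X  .  X  .  .  X  .  .  X  .  X
168       X  X  X  .  X  X  X  .  .  .  X  .  X  .  .  .  .  .  .
144       X  X  X  .  X  .  X  X  .  .  X  .  .  .  X  .  X  .  .
120       X  X  X  X  X  .  X  .  X  .  X  .  .  X  .  .  .  .  X
108       X  X  X  .  X  .  .  X  .  .  X  .  .  .  .  .  X  .  .
96        X  X  X  .  X  .  X  .  .  .  X  .  .  .  X  .  .  .  .
84        X  X  X  .  X  X  .  .  .  .  X  .  X  .  .  .  .  .  .
75        .  X  .  X  .  .  .  .  .  .  .  .  .  X  .  .  .  .  .
60        X  X  X  X  X  .  .  .  X  .  X  .  .  X  .  .  .  .  X
58        X  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
39        .  X  .  .  .  .  .  .  .  .  .  X  .  .  .  .  .  .  .
36        X  X  X  .  X  .  .  X  .  .  X  .  .  .  .  .  X  .  .
27        .  X  .  .  .  .  .  X  .  .  .  .  .  .  .  .  .  .  .
21        .  X  .  .  .  X  .  .  .  .  .  .  .  .  .  .  .  .  .
12        X  X  X  .  X  .  .  .  .  .  X  .  .  .  .  .  .  .  .
6         X  X  .  .  X  .  .  .  .  .  .  .  .  .  .  .  .  .  .
Total    23 25 18  7 20  5  7  7  5  1 18  3  5  6  4  1  6  1  5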

WE, THEREFORE, THE REPRESENTATIVES OF THE UNITED STATES OF AMERICA,

IN GENERAL CONGRESS, ASSEMBLED, APPEALING TO THE SUPREME JUDGE OF

THE WORLD FOR THE RECTITUDE OF OUR INTENTIONS, DO, IN THE NAME,

AND BY AUTHORITY OF THE GOOD PEOPLE OF THESE COLONIES, SOLEMNLY PUBLISH

AND DECLARE, THAT THESE UNITED COLONIES ARE, AND OF RIGHT OUGHT TO BE

FREE AND INDEPENDENT STATES, THAT THEY ARE ABSOLVED FROM ALL ALLEGIANCE

TO THE BRITISH CROWN, AND THAT ALL POLITICAL CONNECTION BETWEEN THEM AND

THE STATE OF GREAT BRITAIN, IS AND OUGHT TO BE TOTALLY DISSOLVED, AND THAT

AS FREE AND INDEPENDENT STATES, THEY HAVE FULL POWER TO LEVY WAR,

CONCLUDE PEACE, CONTRACT ALLIANCES, ESTABLISH COMMERCE, AND TO DO ALL

OTHER ACTS AND THINGS WHICH INDEPENDENT STATES MAY OF RIGHT DO. AND FOR

THE SUPPORT OF THIS DECLARATION, WITH A FIRM RELIANCE ON THE PROTECTION

OF DIVINE PROVIDENCE, WE MUTUALLY PLEDGE TO EACH OTHER OUR LIVES,

OUR FORTUNES AND OUR SACRED HONOR.

This is the last paragraph of the Declaration of Independence.

7 Concluding Remarks

In this document we discuss the Vigenere cipher from its algorithm and devices to keyword length estimation and recovery. However, we did not touch upon frequency analysis. Moreover, the auto-correlation method [8, page 250] for keyword length estimation (Section 4) is not covered. The use of a system of linear congruence equations to recover the keyword given a keyword length is not discussed because the readers may not know how to manipulate linear congruence equations. More information can be found in Stinson [12, pp. 34–36] and Hoffstein et al. [4, pp. 201–209]. The presentation here also assumes that the ciphertext is long enough and the keyword length is relatively short. Refer to [10] for further information and references on breaking short ciphertexts encrypted by the Vigenere cipher. We are currently developing an interactive visualization tool that implements the techniques presented in this document.⁶ Interested readers may check the availability of this tool at the following address:

www.cs.mtu.edu/~shene/NSF-4

Acknowledgments

This work is partially supported by the National Science Foundation under grants DUE-1140512, DUE-1245310, CNS-1229297, IIS-1017935 and IIS-1319363.

References

[1] Ronald Clark. The Man Who Broke Purple. Little, Brown and Company, 1977.

[2] Mehmet Emin Dalkilic and Cengiz Gungor. An Interactive Cryptanalysis Algorithm for the Vigenere Cipher. In ADVIS ’00 Proceedings of the First International Conference on Advances in Information Systems, pages 341–351, 2000.

[3] William F. Friedman. The Index of Coincidence and Its Applications in Cryptanalysis. Aegean Park Press, 1996. (This book was originally published in 1922 as Riverbank Publication No. 22, Riverbank Laboratories, Geneva, Illinois. It is also available from George C. Marshall Foundation at http://www.marshallfoundation.org/library/friedman/books/Methods II watermark.pdf.)

[4] Jeffrey Hoffstein, Jill Pipher, and Joseph H. Silverman. An Introduction to Mathematical Cryptography. Springer, 2008.

⁶ We only found one similar tool published in recent years [2]; however, this tool does not seem to be available to the public.

[5] David Kahn. The Code Breakers. Macmillan, 1967. (A revised and updated edition was published by Scribner in 1996.)

[6] Friedrich Kasiski. Die Geheimschriften und die Dechiffrirkunst. Mittler und Sohn, Berlin, 1863. (An unabridged facsimile version was published by Adamant Media Corporation in 2006.)

[7] Solomon Kullback. Statistical Methods in Cryptanalysis. Aegean Park Press, 1976. (This book was originally published in 1938 by the Government Printing Office.)

[8] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996.

[9] Klaus Pommerening. Kasiski’s Test: Couldn’t the Repetitions be by Accident? Cryptologia, 30(4):346–352, October 2006.

[10] Tobias Schrodel. Breaking Short Vigenere Ciphers. Cryptologia, 32(4):334–347, October 2008.

[11] Simon Singh. The Code Book. Anchor Books, 1999.

[12] Douglas R. Stinson. Cryptography Theory and Practice. CRC Press, 1995.
