Transcript

Matching Keys and Encrypted Manuscripts

Eva Pettersson and Beáta MegyesiDepartment of Linguistics and Philology

Uppsala [email protected]

Abstract

Historical cryptology is the study ofhistorical encrypted messages aimingat their decryption by analyzing themathematical, linguistic and other cod-ing patterns and their historical con-text. In libraries and archives we canfind quite a lot of ciphers, as well askeys describing the method used totransform the plaintext message into aciphertext. In this paper, we presentwork on automatically mapping keysto ciphers to reconstruct the originalplaintext message, and use languagemodels generated from historical textsto guess the underlying plaintext lan-guage.

1 Introduction

Hand-written historical records constitute animportant source, without which an under-standing of our society and culture wouldbe severely limited. A special type ofhand-written historical records are encryptedmanuscripts, so called ciphers, created withthe intention to keep the content of the mes-sage hidden from others than the intendedreceiver(s). Examples of such materials arepolitical, diplomatic or military correspon-dence and intelligence reports, scientific writ-ings, private letters and diaries, as well asmanuscripts related to secret societies.According to some historians’ estimates, one

percent of the material in archives and li-braries are encrypted sources, either encryptedmanuscripts called ciphertexts, decrypted ororiginal plaintext, and/or keys describing howthe encryption/decryption is performed. Themanuscripts are usually not indexed as en-crypted sources (Láng, 2018), which makes

it difficult to find them unless you are luckyto know a librarian with extensive knowledgeabout the selection of the particular libraryyou are digging in. In addition, related ci-phertexts, plaintexts and keys are usually notstored together, as information about how theencrypted sources are related to each other islost. If the key has not been destroyed overtime—unintentionally, or intentionally for se-curity reasons—it is probably kept in a differ-ent place than the corresponding ciphertext orplaintext given that these were probably pro-duced in different places by different persons,and eventually ended up in different archives.Information about the origin of the ciphertextand key, such as dating, place, sender and/orreceiver, or any cleartext in the manuscript,might give some important clues to the prob-ability that a key and a ciphertext originatefrom the same time, and persons. However, in-formation about metadata is far from enoughto link the related encrypted sources to eachother. It is a cumbersome, if not impossible,process for a historian to try to map a bunchof keys to a pile of ciphertexts scattered in thearchive in order to try to decrypt these on thebasis of the corresponding key. Further, thecryptanalyst might reconstruct a key given aciphertext, and the reconstructed key might beapplicable to other ciphertexts as well therebyproviding more decrypted source material.In this paper, we present work on automati-

cally mapping ciphertext sequences to keys toreturn the plaintext from the ciphertext basedon simple and homophonic substitution fromEarly Modern times. We measure the outputof the mapping by historical language modelsdeveloped for 14 European languages to makeeducated guesses about the correct decryptionof ciphertexts. The method is implementedin a publicly available on-line user interface

where users can upload a transcribed key anda ciphertext and the tool returns the plain-text output along with a probability measureof how well the decrypted plaintext matcheshistorical language models for these Europeanlanguages.In Section 2, we give a brief introduction

to historical cryptology with the main focuson encrypted sources and keys. Section 3 de-scribes the method for mapping ciphers totheir corresponding keys. The experimentalresults are presented in Section 4 and dis-cussed in Section 5. Finally, we conclude thepaper in Section 6.

2 Historical Cryptology

Ciphers use a secret method of writing, basedon an encryption algorithm to generate a ci-phertext, which in turn can be used to decryptthe message to retrieve the intended, underly-ing information, called the plaintext. A cipheris usually operated on the basis of a key. Thekey contains information about what outputthe cipher shall produce given the plaintextcharacters in some specific language.Historical cryptology is the study of encoded

or encrypted messages from our history aim-ing at the decryption by analyzing the math-ematical, linguistic and other coding patternsand their histories. One of the main and glo-rious goals is to develop algorithms for de-cryption of various types of historical ciphers,i.e. to reconstruct the key in order to re-trieve the corresponding plaintext from a ci-phertext. The main focus for cryptanalystshas been on specific ciphers, see e.g. (Bauer,2017; Singh, 2000) for nice summaries, whilesystematic decryption of various cipher typeson a larger scale has been paid less attentionto (see e.g. Knight et al. (2006); Nuhn andKnight (2014); Ravi and Knight (2008)). His-torians, on the other hand, are searching forciphertexts and keys in libraries to reveal new,important and hitherto hidden information tofind new facts and interpretations about ourhistory. Another, less observed goal withinhistorical cryptology is therefore to map theencrypted sources, the original keys and cor-responding ciphertexts.There are many different types of ciphers

used throughout our history (Kahn, 1996). In

early modern times, when encryption becamefrequently used in Europe, ciphers were typi-cally based on transposition, where the plain-text characters are reordered in a systematicway, or substitution of plaintext characters totransform each character in the plaintext toanother symbol from existing alphabets, dig-its, special symbols, or a mixture of these(Bauer, 2007). More advanced substitution ci-phers include homophonic, polygraphic, andpolyalphabetic substitution. In Figure 1, weshow a homophonic substitution cipher with akey, a short ciphertext and the correspondingplaintext generated by the key. Each plaintextcharacter, written in capital letter, has one orseveral corresponding symbol(s) by which theplaintext characters are substituted to encryptthe message. To make decryption difficult, themost frequently occurring plaintext charactersare usually substituted with one of several pos-sible symbols.

Figure 1: Ciphertext, key, and the correspond-ing plaintext for a homophonic substitution ci-pher.

Ciphertexts contain symbol sequences withspaces, or without any space to hide wordboundaries. Similar to historical text, punctu-ation marks are not frequent, sentence bound-aries are typically not marked, and capital-ized initial characters in the beginning of thesentence are usually missing. We can alsofind nulls in ciphertexts, i.e. symbols withoutany corresponding plaintext characters to con-fuse the cryptanalyst to make decryption evenharder.Keys might contain substitution of not only

characters in the plaintext alphabet, but alsonomenclatures where bigrams, trigrams, syl-lables, morphemes, common words, and/ornamed entities, typically referring to persons,

geographic areas, or dates, are substitutedwith certain symbol(s). Diacritics and dou-ble letters are usually not encoded. Each typeof entity to be encrypted might be encoded byone symbol only (unigragh), two symbols (di-graph), three symbols (trigraph), and so on.For example, the plaintext alphabet charactersmight be encrypted with codes using two-digitnumbers, the nomenclatures with three-digitnumbers, space with one-digit numbers, andthe nulls with two-digit numbers, etc. Fig-ure 2 illustrates a key based on homophonicsubstitution with nomenclature from the sec-ond half of the 17th century. Each letter inthe alphabet has at least one correspondingciphertext symbol, represented as a two-digitnumber (digraph), and the vowels and doubleconsonants have one additional graphical sign(unigraph). The key also contains encoded syl-lables with digraphs consisting of numerals orLatin characters, followed by a nomenclaturein the form of a list of Spanish words encodedwith three-digit numbers letters or graphicalsigns.

Figure 2: A key from the second half of the17th century (Algemeen Rijksarchief, 1647-1698) from the DECODE database (Megyesiet al., 2019).

One of the first steps, apart from digiti-zation of the encrypted source, is the tran-scription of keys and ciphertext images, beforecryptanalysis can be applied, aiming at the de-cryption of the ciphertext to recover the key.However, there are other challenges that alsoneed attention depending on what documenttypes that are available and what documents

that need to be recovered. These are:

1. generate ciphertext given a plaintext anda key (i.e. encryption)

2. reconstruct plaintext from a ciphertextand a key (less trivial due to unknownplaintext language and ambiguous codesequences and nulls)

3. map key and ciphertexts

Next, we will describe our experiments onretrieving plaintext from keys given a cipher-text with the goal to be able to automaticallyfind ciphertexts that belong to a particularkey.

3 Mapping Ciphers to Keys

3.1 DataFor our experiments on automatically map-ping ciphertext sequences to key-value pairs,we need access to four kinds of input files:

1. a transcribed ciphertext

2. its corresponding key

3. its corresponding plaintext (for evaluationpurposes)

4. a set of language models (for language de-tection)

A collection of several hundreds of ci-phers and keys from Early Modern timescan be found in the recently developed DE-CODE database (Megyesi et al., 2019).1 Thedatabase allows anyone to search in the col-lection, whereas upload of new encryptedmanuscripts may be done by registered usersonly. The collected ciphertexts and keys areannotated with metadata including informa-tion about the provenance and location ofthe manuscript, transcription, possible de-cryption(s) and translation(s) of the cipher-text, images, and any additional material ofrelevance to the particular manuscript.Currently, most of the ciphertexts in the

DECODE database are still unsolved. Eventhough the overall aim of the cipher-key map-ping algorithm (CKM) is to automatically tryto match these unsolved ciphertexts to existing

1https://cl.lingfil.uu.se/decode/

Name Cipher type Plaintext UseBarb.lat.6956 homophonic, nulls, nomenclature Italian trainingFrancia-64 homophonic, nulls, nomenclature Italian evaluationBorg.lat.898 simple substitution Latin evaluationCopiale homophonic German evaluation (lang. detection)

Table 1: Datasets used for training and evaluation of the CKM algorithm.

keys, we need previously solved ciphertexts,connected to a key and a plaintext file, to con-duct our experiments. In our experiments, wethus make use of three previously broken keysand their corresponding ciphertexts, writtenoriginally during Early Modern times. For allthree manuscripts, the transcribed ciphertextas well as the key and a plaintext version of thecontents are available through the DECODEdatabase. We also add a fourth decrypted file,the Copiale cipher (Knight et al., 2011),2 forwhich both the ciphertext and the plaintextare accessible through the DECODE database.This cipher will only be used for evaluation ofthe language detection part of the algorithm.The data collection used for the experiments issummarized in Table 1, where each manuscriptis described with regard to its cipher type andplaintext language.In our experiments, we will use the Barb ci-

pher for initial tests during the developmentof the CKM algorithm, hence-forth the train-ing set. This cipher is based on homophonicsubstitution with nomenclatures, and consistsof codes with numbers. The codes are 2-digit numbers representing plaintext charac-ters, syllables and some function words, andthree-digit codes for place and person namesand common words. The cipher also containstwo one-digit codes denoting word boundaries.The evaluation set on the other hand, consistsof two ciphers:

1. the Francia cipher, which has the same ci-pher type (homophonic substitution withnulls and nomenclature) and underlyingplaintext language (Italian) as the Barbcipher used during training

2. the more divergent Borg cipher, which isinstead a simple substitution cipher withLatin as the underlying plaintext lan-guage

2https://cl.lingfil.uu.se/~bea/copiale/

The transcription of the ciphertexts was alsoretrieved from the DECODE database. Eachtranscription file of a particular cipher (whichmay consist of one or multiple images) startswith comment lines (marked by "#") with in-formation about the name of the file, the im-age name, the transcriber’s id, the date of thetranscription, etc. The transcription is carriedout symbol by symbol and row by row keep-ing line breaks, spaces, punctuation marks,dots, underlined symbols, and cleartext words,phrases, sentences, paragraphs, as shown inthe original image. In case cleartext is embed-ded in the ciphertext, the cleartext sequenceis clearly marked as such with a language id.For a detailed description of the transcription,we refer to Megyesi et al. (2019).Original and reconstructed keys also follow a

common format, starting with metadata aboutthe key followed by a description of the codegroups. Each code is described in a separateline followed by the plaintext entity (character,syllable, word, etc), delimited by at least onespace character.Figure 3 shows a few lines from the key file

belonging to the Barb cipher, with codes 1 and8 denoting a word boundary, codes 00 and 02denoting the letter a, code 03 denoting theletter o, code 04 denoting either the letter uor the letter v (since there was often no dis-tinction between these two letters in histori-cal texts), and code 232 denoting the wholephrase ambasciatore di Spagna (ambassador ofSpain).

3.2 Storing code-value pairs

In the first step of the CKM algorithm, thekey file is processed and the code-value pairs,as well as the length of the longest code, arestored for future reference. The key file is re-quired to be in plain text format, and with onecode-value pair on each line. Furthermore, thecode and its value should be separated by at

#Key#homophonic with nomenclature#null = space (word boundary)1 <null>8 <null>00 a02 a03 o04 u/v...232 ambasciatore di Spagna

Figure 3: Example format for a transcribedkey file in the Decode database.

least one white space character. If the key filecontains values denoting word delimiters, thescript will recognise these as such if they arewritten as "null" (in any combination of upper-case and lower-case characters, so for example"Null" and "NULL" will also be recognised asword delimiters). Any lines initialized with ahashtag sign ("#") will be ignored, as contain-ing comments, in accordance with the formatillustrated in Figure 3.

3.3 Mapping ciphertexts tocode-value pairs

In the second step, the transcribed ciphertextis processed, and its contents matched againstthe code-value pairs stored in the previousstep, to reveal the underlying plaintext mes-sage. In this process, three types of input areignored:

1. text within angle brackets (presumed tocontain cleartext segments)

2. lines starting with a hashtag (presumedto be comments)

3. question marks (presumed to denote theannotator’s uncertainty about the tran-scription)

The first lines from a transcribed ciphertextfile in the Decode database is shown in Figure4, where the name of the image from whichthe transcription has been made is given as acomment (preceded by a hashtag), cleartext isgiven within angle brackets, and the annota-tor’s uncertainty about transcribing the last

#002r.jpg<IT De Inenunchi? 14 Maggio 1628.>6239675017378233236502343051822004623?

Figure 4: Example of a ciphertext segment inthe Decode database.

character on the line as the digit "3" is sig-nalled by a question mark.

If the transcribed ciphertext contains wordboundaries in the form of space characters, orif the code-value pairs stored in the previousstep contain codes for denoting word bound-aries, the ciphertext is split into words basedon this information, and the remaining map-ping process is performed word by word. If nosuch information exists, the whole ciphertextis treated as a single segment, to be processedcharacter by character.

If word boundaries have been detected, forevery word that is shorter than, or equal inlength to, the longest code in the key file, wecheck if the whole word can be matched to-wards a code. If so, we replace the cipher-text sequence with the value connected to thatcode. If not, or if the word is longer than thelongest code in the key file, we iterate over thesequence, character by character, and try tomatch each character against the key file. Ifnot successful, we merge the current charac-ter(s) with the succeeding character, and tryto match the longer sequence against the keyfile, until we reach a sequence equal in lengthto the longest code in the key file. If noth-ing can be matched in the key file for themaximum length sequence, we replace this se-quence by a question mark instead, and moveon to the next character. This non-greedysearch-and-replace mechanism is applied forthe whole word, except for the end of the word.Since we know that the key in the training filecontains code-value pairs for representing suf-fixes, the script checks, in each iteration, if theremaining part of the word is equal in lengthto the longest code in the key file. If so, we tryto match the whole sequence, and only if thisfails, we fall back to the character-by-charactermapping. The whole algorithm for the match-ing procedure is illustrated in Figure 5.

If sequence equals a code:Replace sequence by matched value

Else:While characters in sequence:

If char(s) equals a code:Replace char(s) by matched value

Else if char length equalslongest code length:

Replace char(s) by ? andmove on to next character

Else:Merge char with next charand try again

Figure 5: Algorithm for mapping ciphertextsequences to code-value pairs.

3.4 Language identificationWhen the plaintext has been recovered, thenext task is to guess what language the de-crypted text is written in, and present hy-potheses to the user. This is done basedon language models for historical text, de-rived from the HistCorp collection of histori-cal corpora and tools (Pettersson and Megyesi,2018).3 We use word-based language modelscreated using the IRSTLM package (Federicoet al., 2008), for the 14 languages currentlyavailable from the HistCorp webpage: Czech,Dutch, English, French, German, Greek, Hun-garian, Icelandic, Italian, Latin, Portuguese,Slovene, Spanish and Swedish. For this par-ticular task, we only make use of the unigrammodels, i.e. the single words in the histori-cal texts. The plaintext words generated inthe previous steps, by matching character se-quences in the ciphertext against code-valuepairs in the key file, are compared to the wordspresent in the language model for each of these14 languages. As output, the script producesa ranked list of these languages, presentingthe percentage of words in the plaintext filethat are also found in the word-based languagemodel for the language in question. The ideais that if a large percentage of the words inthe decrypted file are also present in a lan-guage model, there is a high chance that thekey tested against the ciphertext is actually anappropriate key for this particular text.

3.5 EvaluationWe evaluate our method for cipher-key map-ping based on the percentage of words in the

3https://cl.lingfil.uu.se/histcorp/

evaluation corpus that are identical in the au-tomatically deciphered text and in the man-ually deciphered gold standard version (tak-ing the position of the word into considera-tion). Casing is not considered in the com-parison, since lower-case and upper-case wordsare usually not represented by different codesin the key files used in our experiments. Fur-thermore, some codes may refer to several (re-lated) values, such as in the example given inFigure 3 (see further Section 3.2), where thecode 04 could correspond either to the letteru or to the letter v. This also holds for dif-ferent inflectional forms of the same lemma,such as the Italian word for ’this’, that couldbe questo, questa or questi, depending on thegender and number of the head word that it isconnected to, and therefore would typically berepresented by the same code. In these cases,we consider the automatic decipherment to becorrect, if any of the alternative mappings cor-responds to the form chosen in the gold stan-dard.The language identification task is evaluated

by investigating at what place in the rankedlist of 14 potential languages the target lan-guage is presented.

4 Results4.1 Cipher-Key mappingIn the first experiment, we tested the CKMalgorithm on the Francia cipher (see furtherSection 3.1); a cipher of the same type as thecipher used as a role model during the develop-ment of the method. This cipher shares severalcharacteristics with the training text: (i) it iswritten during the same time period, (ii) itis collected from the Vatican Secret Archives,(iii) it is a numerical cipher with homophonicsubstitution and nomenclatures, and (iv) worddelimiters are represented in the key.When comparing the automatically de-

crypted words to the words in the manuallydeciphered gold standard, approximately 79%of the words are identical. The cases wherethe words differ could be categorized into fourdifferent types:

1. Incomplete key: Diacritics(36 instances)The key only contains plain letters, with-out diacritics. The human transcriber

has however added diacritics in the man-ually deciphered text, where applicable.For example, the code 9318340841099344has been interpreted by the script as theword temerita. The human transcriberhas added a diacritic to the last lettera, resulting in the word temeritá (’bold-ness’), even though the key states a (with-out accent) as the value for the code "44".

2. Character not repeated(27 instances)In cases where a character should be re-peated in order for a word to be spelledcorrectly (at least according to present-day spelling conventions), the humantranscriber has in many cases chosen torepeat the character, even though this isnot stated in the key.

3. Human reinterpretation(2 instances)In a few cases where the text contains un-expected inflectional forms, such as a sin-gular ending where a plural ending wouldbe expected, the human transcriber haschosen the grammatically correct form,even though the code in the ciphertext ac-tually is different.

4. Wrong interpretation by the script(16 instances)Due to the non-greedy nature of the al-gorithm, it will sometimes fail to matchlonger codes in the key file, when it findsa match for a shorter code. This could beseen for example for prefixes such as buon-in buonissima (’very good’), and qual- inqualche (’some’).

In a second experiment, we tested the scriptagainst a cipher of another type, the Borgcipher,4 based on simple substitution with asmall nomenclature, encoded by 34 graphicalsigns with word boundaries marked by space.Since this cipher is based on simple substi-tution, rather than homophonic substitution,and with word boundaries already marked, itis less ambiguous than the Francia cipher. Ac-cordingly, approximately 97% of the words inthe output from the cipher-key mapping scriptare identical to the words in the gold standard.

4https://cl.lingfil.uu.se/~bea/borg/

The mismatches are mainly due to some word-initial upper-case letters in the ciphertext be-ing written as the plaintext letter, instead ofbeing encoded. As an example, the Latin wordnucem (inflectional form of the word ’nut’)would normally be enciphered as ’9diw1’ in theBorg cipher, but in one case it occurs insteadas ’Ndiw1’. There are several similar cases forother words throughout the cipher.

4.2 Language identificationFor the language identification task, we cansee from Table 2 that both the Barb cipherand the Francia cipher are correctly identi-fied as written in Italian by the CKM algo-rithm, and the Copiale cipher is correctly iden-tified as written in German. These guessesare based on the fact that 79.17% of the to-kens in the automatically recovered version ofthe Barb plaintext, and 80.05% of the tokensin the Francia text, could also be found in theItalian language model, whereas 86.55% of thetokens in the Copiale cipher could be matchedin the German language model. As could beexpected, the second best guess produced bythe algorithm for the Italian manuscripts is forthe closely related languages Spanish and Por-tuguese respectively. More surprisingly, thethird best guess for both these texts is Ger-man, with a substantial amount of the to-kens found in the German language model aswell. A closer look at the German languagemodel used in our experiments reveals a possi-ble explanation to this. The German languagemodel is based on data from the time period1050–1914, where the oldest texts contain asubstantial amount of citations and text blocksactually written in Latin, a language closelyrelated to Italian. This might also explain whythe Borg text, written in Latin, is identifiedby the script as written in German. The thirdguess for the Borg text is Swedish, for whichthe language model is also based on very oldtext (from 1350 and onwards), with blocks ofLatin text in it. The Latin language model onthe other hand is rather small, containing onlyabout 79,000 tokens extracted from the An-cient Greek and Latin Dependency Treebank.5Due to the small size of this language modelas compared to the language models for the

5https://perseusdl.github.io/treebank_data/

Name Top 3 language models Gold languageBarb-6956 Italian 79.17% Italian

Spanish 66.77%German 65.73%

Francia-64 Italian 80.05% ItalianPortuguese 72.40%German 70.22%

Borg.lat.898 German 56.73% LatinSpanish 55.01%Swedish 51.50%

Copiale German 86.55% GermanSlovene 69.95%Swedish 57.76%

Table 2: Language identification results.

other languages in this study, in combinationwith the fact that Latin words occur in oldertexts for many languages, it is hard for thescript to correctly identify Latin as the sourcelanguage.For the German Copiale cipher, the second

best guess is for Slovene, and the third bestguess is for Swedish. This could be due tothe fact that both Slovene and Swedish werestrongly influenced by the German languagein historical times, meaning that many Ger-man and German-like words would appear inhistorical Slovene and Swedish texts.

4.2.1 Present-day language modelsSo far, we have presumed that language mod-els based on historical text would be bestsuited for the task of language identificationin the context of historical cryptology. This isbased on the assumption that spelling and vo-cabulary were different in historical times thanin present-day text, meaning that some wordsand their particular spelling variants wouldonly occur in historical text. It could howeverbe argued that it is easier to find large amountsof present-day text to build language modelsfrom. As a small test to indicate whether ornot present-day text would be useful in thiscontext, we downloaded the Spacy languagemodels for present day Italian6 and German,7trained on Wikipedia text, and compared thecoverage in these language models to the cov-

6https://github.com/explosion/spacy-models/releases//tag/it_core_news_sm-2.1.0

7https://github.com/explosion/spacy-models/releases//tag/de_core_news_md-2.1.0

erage in the historical language models, whenapplied to the plaintexts of the Barb, Franciaand Copiale manuscripts.As seen from Table 3, the percentage of

word forms found in the language modelsbased on present-day data is considerablylower than for the historical language mod-els, even though the present-day data sets arelarger. The preliminary conclusion is that lan-guage models based on historical text is bettersuited for the task at hand, but present-daylanguage models could also be useful, in par-ticular in cases where it is hard to find suit-able historical data to train a language model.More thorough experiments would however beneeded to confirm this.

5 Discussion

From our experiments, we can conclude thatthe implemented algorithm makes it possibleto restore the hidden plaintext from cipher-texts and their corresponding key. For one ofthe ciphers evaluated, 79% of the words werecorrectly mapped to the gold standard plain-text words, and the mismatches were mainlydue to diacritics and repeated characters notbeing part of the key. This knowledge couldeasily be taken into consideration in furtherdevelopment of the algorithm, where the scriptcould test to add diacritics in strategic posi-tions and to repeat certain characters in caseswhere a specific word could not be found ina language model (provided that we alreadyhave an educated guess on what languagethe underlying plaintext is written in). For

Name Historical LM Present-day LMBarb-6956 79.17% 64.11%Francia-64 80.05% 66.23%Copiale 86.55% 81.44%

Table 3: Language identification results, comparing language models based on historical text tolanguage models based on present-day text.

the other cipher evaluated, being an out-of-domain manuscript of a different cipher typeand another underlying language than themanuscript used during training, we got veryencouraging results with 97% of the words inthe manuscript being correctly matched, andthe mismatches in the remaining words mainlybeing due to plaintext characters occurring aspart of ciphertext words.

For the language identification task, the re-sults are mixed. The German and Italianmanuscripts are correctly identified as beingwritten in German and Italian respectively,whereas the algorithm assumes the Latin textto be written in German. This indicates thatwe need to be careful about what texts to putinto the language models. In the current ex-periments, we have simply used the languagemodels at hand for historical texts for differentlanguages, without taking into account differ-ences in time periods and genres covered, northe size of the text material used as a basis forthe language model. Thus, the language mod-els used in the experiments are very different insize, where the Latin language model containsabout 79,000 tokens, as compared to approxi-mately 124 million tokens in the German lan-guage model. Furthermore, since the languagein very old texts is typically quite differentfrom the language in younger texts, languagemodels only containing texts from the time pe-riod in which the cipher is assumed to havebeen created would better suit our purposes.In addition, many old texts contain blocks ofLatin words, since this was the Lingua Francain large parts of the world in historical times.This results in many Latin words being foundin language models for other languages as well.

The language detection evaluation alsoshows that using language models based onhistorical text has a clear advantage over us-ing state-of-the-art language models based onpresent-day language.

6 Conclusion

In this paper, we have presented a studywithin the field of historical cryptology, anarea strongly related to digital humanities ingeneral, and digital philology in particular.More specifically, we have introduced an algo-rithm for automatically mapping encrypted ci-phertext sequences to their corresponding key,in order to reconstruct the plaintext describingthe underlying message of the cipher. Sinceciphertexts and their corresponding keys areoften stored in separate archives around theworld, without knowledge about which key be-longs to which ciphertext, such an algorithmcould help in connecting ciphertexts to theircorresponding keys, revealing the encipheredinformation to historians and other researcherswith an interest in historical sources.

Acknowledgments

This work has been supported by theSwedish Research Council, grant 2018-06074: DECRYPT - Decryption of historicalmanuscripts.

ReferencesBelgium Algemeen Rijk-

sarchief, Brussels. 1647-1698.cl.lingfil.uu.se/decode/database/record/960[link].

Craig Bauer. 2017. Unsolved! The Historyand Mystery of the World’s Greatest Ciphersfrom Ancient Egypt to Online Secret Societies.Princeton University Press, Princeton, USA.

Friedrich Bauer. 2007. Decrypted Secrets — Meth-ods and Maxims of Cryptology. 4th edition.Springer.

Marcello Federico, Nicola Bertoldi, and MauroCettolo. 2008. IRSTLM: an open source toolkitfor handling large scale language models. InProceedings of Interspeech 2008, pages 1618–1621.

David Kahn. 1996. The Codebreakers: The Com-prehensive History of Secret Communicationfrom Ancient Times to the Internet. New York.

Kevin Knight, Beáta Megyesi, and ChristianeSchaefer. 2011. The copiale cipher. In Invitedtalk at ACL Workshop on Building and UsingComparable Corpora (BUCC). Association forComputational Linguistics.

Kevin Knight, Anish Nair, NishitRathod, and Kenji Yamada. 2006.https://www.aclweb.org/anthology/P06-2065Unsupervised analysis for decipherment prob-lems. In Proceedings of the COLING/ACL2006 Main Conference Poster Sessions, pages499–506, Sydney, Australia. Association forComputational Linguistics.

Benedek Láng. 2018. Real Life Cryptology: Ci-phers and Secrets in Early Modern Hungary. At-lantis Press, Amsterdam University Press.

Beáta Megyesi, Nils Blomqvist, and Eva Petters-son. 2019. The DECODE Database: Collectionof Ciphers and Keys. In Proceedings of the 2ndInternational Conference on Historical Cryptol-ogy, HistoCrypt19, Mons, Belgium.

Malte Nuhn and Kevin Knight. 2014. Cipher typedetection. In Proceedings of the 2014 Confer-ence on Empirical Methods in Natural LanguageProcessing (EMNLP), pages 1769–1773. Associ-ation for Computational Linguistics.

Eva Pettersson and Beáta Megyesi. 2018. TheHistCorp Collection of Historical Corpora andResources. In Proceedings of the Digital Hu-manities in the Nordic Countries 3rd Confer-ence, Helsinki, Finland.

Sujith Ravi and Kevin Knight. 2008. Attacking de-cipherment problems optimally with low-ordern-gram models. In Proceedings of the 2008 Con-ference on Empirical Methods in Natural Lan-guage Processing, pages 812–819. Association forComputational Linguistics.

Simon Singh. 2000. The Code Book: The Sci-ence of Secrecy from Ancient Egypt to QuantumCryptography. Anchor Books.


Top Related