An exercise in conversion. Dirk Roorda @ eHumanities, 2012-01-26

TRANSCRIPT

  • Slide 1

An exercise in conversion
Dirk Roorda @ eHumanities, 2012-01-26

  • Slide 2

- the task
- the method
- the lessons
- the result
- demo

  • Slide 3

JapAM Descartes Correspondence
- ca. 700 letters
- 69,237 lines
- 600 formulas
- 4.2 MB (without the 311 pictures)

  • Slide 4 (no text)

  • Slide 5

CKCC corpus Descartes
XML: Text Encoding Initiative (TEI)
~ 35,000 elements, of which
- 7,200 metadata
- 7,700 paragraphs
- 6,200 formulas
- 6,000 text-formattings
- 4,200 structure
- 2,900 page-breaks
- 538 images

  • Slides 6-8 (no text)

  • Slide 9

- observation
- non-algorithmic changes
- consolidation
- proofs

  • Slide 10

use digital equipment:
- your text-editor
- your scripting language
- your regular expressions

  • Slide 11 (no text)

  • Slide 12

replace =(.*?)$ by match1 ??? Aargh!#@\]

  • Slides 13-16 (no text)

  • Slide 17

[Diagram labels: ... formulas, meta, closers ...; conversion process; canonical, initial, corrected, improved, checked; metadata; combining]

  • Slides 18-21 (no text)

  • Slide 22

convert.pl
- 100 KB of program code text
- = 25 densely typed pages
- = 3427 lines, of which 2175 real code lines
- Code/Input = 1/32

  • Slide 23 (no text)

  • Slide 24

1/3 of the tasks need 2/3 of the code
- formulas (2): 37 %
- headers, openers, closers (3): 16 %
- meta and images (3): 11 %

run time of same tasks
- formulas (2): 29 %
- headers, openers, closers (3): 6 %
- meta and images (3): 10 %
- total run time (25): 40 sec

  • Slide 25

1. Unicode is your friend
2. Split into many subtasks
3. task = configuration + workflow
4. Count and check
5. Performance matters
6. Do not give up automation

  • Slide 26 (no text)

  • Slide 27

(2a) that can be run separately
(2b) that can be reordered easily

  • Slides 28-29 (no text)

  • Slide 30

was 30+ seconds, is now 2.07 seconds
- many new subtasks based on same template (gain = 15 * 30 = 7.5 min per run)
- many, many runs before everything is OK (gain = 100 * 7.5 = 12.5 hours CPU-time)

  • Slide 31

we used a lot of expert knowledge, which has all been transferred to
- the source
- consolidated extra inputs
so the conversion is still repeatable and modifiable

[Diagram labels: source, formulas, meta, closers, results, corrections, hints, CKCC conversion program]
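Lessons 2 to 4 on slide 25 (split into many subtasks, task = configuration + workflow, count and check) can be illustrated with a minimal sketch. It is written in Python and is not taken from convert.pl. The input markup (`$...$` for formulas, `[p. N]` for page breaks), the subtask names, and the functions `run` and `check` are all assumptions made up for this example. Only the output element names `formula` and `pb` are real TEI elements.

```python
import re
from collections import Counter

# Configuration (hypothetical): each subtask is a name plus a list of
# (pattern, replacement) rules. Reordering the list reorders the subtasks.
SUBTASKS = [
    ("formulas", [(re.compile(r"\$(.+?)\$"), r"<formula>\1</formula>")]),
    ("pagebreaks", [(re.compile(r"^\[p\. (\d+)\]$", re.M), r'<pb n="\1"/>')]),
]

def run(text, subtasks, only=None):
    """Workflow: apply every selected subtask in order and count replacements."""
    counts = Counter()
    for name, rules in subtasks:
        if only is not None and name not in only:
            continue  # lets a subtask be run separately from the others
        for pattern, replacement in rules:
            text, n = pattern.subn(replacement, text)
            counts[name] += n
    return text, counts

def check(counts, expected):
    """Return the subtasks whose replacement count differs from the expected one."""
    return {name: (counts.get(name, 0), want)
            for name, want in expected.items()
            if counts.get(name, 0) != want}

sample = "Let $x^2$ and $y$ be given.\n[p. 12]\nNext line.\n"
converted, counts = run(sample, SUBTASKS)
mismatches = check(counts, {"formulas": 2, "pagebreaks": 1})
print(converted)
print(dict(counts), mismatches)
```

Keeping the rules in a data structure and the loop in one function means a new subtask is a new configuration entry rather than new control flow, and the per-subtask counts give a cheap check that a rule still matches as often as expected after the input or the rules change.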
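The figures on slides 22 and 30 can be worked out step by step. The gain figures follow directly from the numbers on slide 30, taking roughly 30 seconds saved per subtask. For the Code/Input ratio, the slides do not say which quantities are compared. The reading below, real code lines against the 69,237 input lines from slide 3, is an assumption that happens to reproduce 1/32.

```latex
\begin{aligned}
\text{gain per run} &= 15 \times 30\ \text{s} = 450\ \text{s} = 7.5\ \text{min} \\
\text{gain over 100 runs} &= 100 \times 7.5\ \text{min} = 750\ \text{min} = 12.5\ \text{h} \\
\frac{\text{code}}{\text{input}} &= \frac{2175\ \text{lines}}{69{,}237\ \text{lines}} \approx \frac{1}{31.8} \approx \frac{1}{32}
\end{aligned}
```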