how to teach a computer to read the internetc2.com/~ward/sao/techignite-v1/techignite.pdf ·...
TRANSCRIPT
How to Teach a Computer to Read
the Internet
An Open-Source Project on GitHub
StructuredSemi-
StructuredUnstructured
RSSXMLJSON
WikiTwitter
Meta Tags
NewsTranslations
Poetry
1. parse what you expect
2. see what else you get
3. repeat real fast
Methodology
1. parse what you expect
there are factsthat have keys made of wordsand values that aren’t keys
2. see what else you get
we found 19,498 keysone key was partners
3. repeat real fast
the run started at 18:02:32parsed 87,325,460,601 bytes
key
value
upper
lower
familia
r
Real Fast Iterating
Real Fast Parsing
The quick brown fox
b e b e b e b e
fox\0
yybuf
yythunk
yytext
yybuf
yybegin
yyend
yylimit
yybuflen
yypos
nounadj
Real Fast Data
50 min for every byte5 min for useful prefix5 sec for last sampling
(30GB of wikitext)
Compiler Writer:
Parsing Explorer:
grammartext
text
text
grammar
text
text
text
programs
wikitext
regex text
substitutions
extra code for nesting
primary = identifier ! ‘=‘ | ‘(‘ expression ‘)’
nesti
ng
lookahead
as in regexas
in ya
cc
Piumarta 2007
Ford 2004
read this
dev.AboutUs.org
http: //c2.com/~ward/sao/TechIgnite-v1/