how to teach a computer to read the internetc2.com/~ward/sao/techignite-v1/techignite.pdf ·...

20
How to Teach a Computer to Read the Internet An Open-Source Project on GitHub

Upload: vannhi

Post on 18-Mar-2018

228 views

Category:

Documents


10 download

TRANSCRIPT

Page 1: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

How to Teach a Computer to Read

the Internet

An Open-Source Project on GitHub

Page 2: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

StructuredSemi-

StructuredUnstructured

RSSXMLJSON

WikiTwitter

Meta Tags

NewsTranslations

Poetry

Page 3: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

1. parse what you expect

2. see what else you get

3. repeat real fast

Methodology

Page 4: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

1. parse what you expect

there are factsthat have keys made of wordsand values that aren’t keys

Page 5: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

2. see what else you get

we found 19,498 keysone key was partners

Page 6: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

3. repeat real fast

the run started at 18:02:32parsed 87,325,460,601 bytes

Page 7: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

key

value

Page 8: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

upper

lower

Page 9: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

familia

r

Page 10: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve
Page 11: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

Real Fast Iterating

Page 12: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

Real Fast Parsing

The quick brown fox

b e b e b e b e

fox\0

yybuf

yythunk

yytext

yybuf

yybegin

yyend

yylimit

yybuflen

yypos

nounadj

Page 13: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

Real Fast Data

50 min for every byte5 min for useful prefix5 sec for last sampling

(30GB of wikitext)

Page 14: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

Compiler Writer:

Parsing Explorer:

grammartext

text

text

grammar

text

text

text

programs

wikitext

Page 15: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

regex text

substitutions

extra code for nesting

Page 16: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

primary = identifier ! ‘=‘ | ‘(‘ expression ‘)’

nesti

ng

lookahead

as in regexas

in ya

cc

Page 17: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

Piumarta 2007

Ford 2004

Page 18: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

read this

dev.AboutUs.org

Page 19: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve

http: //c2.com/~ward/sao/TechIgnite-v1/

Page 20: How to Teach a Computer to Read the Internetc2.com/~ward/sao/TechIgnite-v1/TechIgnite.pdf · Industrial production : ... Consumer electroncs Dgital distributon April 1, 1976 Steve