Tokeniser
Francisco Miguel Pérez Romero
University of Sevilla
Roadmap
Introduction
Class Diagram
Libraries
Conclusions
Web Wrapping
Information retrieval
Verifier, Ontologiser, Extractor
Query
Navigator, Form Filler
Tokeniser
• Tokenisation Rules
• Configuration File
• Web Page
• Parser
Tokeniser Usage
• Web Page Classification
• Information Extraction Learners
• Information Extraction
Example
Config File
Token List
Web Page
Tokeniser
XML File
Token List
Concepts
• Configuration File
• Token
• Tokenisation types
Example
3 Token Classes: Word Space Digit Space Digit
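The three token classes named above can be illustrated with a minimal Java sketch. It is not the presented implementation: the class and method names (`TokeniserSketch`, `Token`, `defaultRules`, `tokenise`) and the regular expressions chosen for each class are assumptions made for illustration, and the rules that the presentation reads from a configuration file are hard-coded here. The sketch uses `java.util.regex`, one of the libraries compared in this deck.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokeniserSketch {

    /** A token: the name of its class plus the text it covers. */
    static final class Token {
        final String type;
        final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    /** Hypothetical rules; a real tokeniser would load them from its configuration file. */
    static Map<String, Pattern> defaultRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("Word", Pattern.compile("\\p{L}+"));
        rules.put("Space", Pattern.compile("\\s+"));
        rules.put("Digit", Pattern.compile("\\d+"));
        return rules;
    }

    /** Scans the input left to right, applying the first rule that matches at the current position. */
    static List<Token> tokenise(String input, Map<String, Pattern> rules) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors the match at the start of the region.
                if (m.lookingAt() && m.end() > pos) {
                    tokens.add(new Token(rule.getKey(), m.group()));
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // No rule applies: emit the single character as an "Unknown" token.
                tokens.add(new Token("Unknown", input.substring(pos, pos + 1)));
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Word(Page), Space( ), Digit(12), Space( ), Digit(34)]
        System.out.println(tokenise("Page 12 34", defaultRules()));
    }
}
```

For the input `Page 12 34` the sketch yields the sequence Word, Space, Digit, Space, Digit. Rule order matters because the first matching rule wins, which is why an insertion-ordered map is used.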
Class Diagram: Tokenisation
Tokenisation Example
Class Diagram: Tokeniser
Comparison Features 1
• Comparison Features:
• Javadoc documentation?
• Support for Unicode UTF-8?
• Support for Unicode UTF-16?
• Named groups?
• Indexable groups > 9?
• Negative groups?
• Nested groups?
• Lazy quantifiers?
Comparison Features 2
• Comparison Features:
• Fuzzy matching?
• Support for POSIX?
• Support for ignore case?
• Support for a new-line option?
• Uses a state machine?
• Support for accents?
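A few of the features compared above can be shown concretely with `java.util.regex`, one of the libraries in the comparison. The sketch below is illustrative only: the class `RegexFeaturesSketch`, its method names and the sample patterns are assumptions, and it says nothing about how the other libraries behave. It covers named groups, lazy quantifiers, ignore case (including accented letters) and a new-line option, taken here to mean the `MULTILINE` flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFeaturesSketch {

    /** Named group: returns the value of the first href attribute, or null if there is none. */
    static String namedGroup(String html) {
        Matcher m = Pattern.compile("href=\"(?<url>[^\"]*)\"").matcher(html);
        return m.find() ? m.group("url") : null;
    }

    /** Lazy quantifier: ".*?" stops at the first closing tag instead of the last one. */
    static String lazyMatch(String html) {
        Matcher m = Pattern.compile("<b>(.*?)</b>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Ignore case; UNICODE_CASE extends case folding to non-ASCII letters such as accented ones. */
    static boolean matchesIgnoringCase(String regex, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    /** New-line option: with MULTILINE, "^" matches at the start of every line. */
    static int countLinesStartingWithWord(String text) {
        Matcher m = Pattern.compile("^\\w+", Pattern.MULTILINE).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(namedGroup("<a href=\"http://example.com\">x</a>"));
        System.out.println(lazyMatch("<b>one</b> and <b>two</b>"));
        System.out.println(matchesIgnoringCase("p\u00e9rez", "P\u00c9REZ"));
        System.out.println(countLinesStartingWithWord("one\ntwo\nthree"));
    }
}
```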
Libraries
• Table 1
Libraries
• Table 2
Libraries
• Table 3
Benchmark 1
• Regular expression list
• String list
• Every regular expression matched against every string
• Time in ms
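The benchmark procedure described above (a list of regular expressions, a list of strings, every expression matched against every string, time reported in ms) might be sketched as follows for `java.util.regex`. This is a rough sketch under assumptions, not the harness used for the reported figures: the class `RegexBenchmarkSketch`, its method names and the sample expressions and strings are hypothetical, the patterns are compiled once outside the timed loop, and no JIT warm-up is performed, so timings from it are only indicative.

```java
import java.util.regex.Pattern;

class RegexBenchmarkSketch {

    /** Compiles every regular expression once, outside the timed loop. */
    static Pattern[] compileAll(String[] regexes) {
        Pattern[] patterns = new Pattern[regexes.length];
        for (int i = 0; i < regexes.length; i++) {
            patterns[i] = Pattern.compile(regexes[i]);
        }
        return patterns;
    }

    /** Matches every pattern against every input and counts the full matches. */
    static int countMatches(Pattern[] patterns, String[] inputs) {
        int matches = 0;
        for (Pattern pattern : patterns) {
            for (String input : inputs) {
                if (pattern.matcher(input).matches()) {
                    matches++;
                }
            }
        }
        return matches;
    }

    /** Repeats the all-against-all matching and returns the elapsed time in milliseconds. */
    static long timeMillis(String[] regexes, String[] inputs, int iterations) {
        Pattern[] patterns = compileAll(regexes);
        long checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            checksum += countMatches(patterns, inputs);
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        // Use the checksum so the loop cannot be optimised away as dead code.
        if (checksum < 0) {
            throw new IllegalStateException("unreachable");
        }
        return elapsedMillis;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; the lists used in the presentation are not given.
        String[] regexes = {"\\d+", "[a-z]+", "\\w+@\\w+\\.com"};
        String[] inputs = {"12345", "hello", "user@example.com", "Hello World"};
        for (int iterations : new int[] {10000, 20000, 50000}) {
            System.out.println(iterations + " iterations: "
                    + timeMillis(regexes, inputs, iterations) + " ms");
        }
    }
}
```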
Benchmark 1: 10000 Iterations
• org.apache: 7078 ms
• com.stevesoft: 19782 ms
• kmy.regex: 781 ms
• java.util: 1266 ms
• jregex.Pattern: 1000 ms
• org.apache.oro: 2156 ms
• dk.brics.automaton: 265 ms
• com.karneim.util.collection: 407 ms
Benchmark 1: 20000 Iterations
• org.apache: 11796 ms
• com.stevesoft: 26641 ms
• kmy.regex: 906 ms
• java.util: 1891 ms
• jregex.Pattern: 1422 ms
• org.apache.oro: 3375 ms
• dk.brics.automaton: 312 ms
• com.karneim.util.collection: 610 ms
Benchmark 1: 50000 Iterations
• org.apache: 28656 ms
• com.stevesoft: 63297 ms
• kmy.regex: 1781 ms
• java.util: 4281 ms
• jregex.Pattern: 3219 ms
• org.apache.oro: 7641 ms
• dk.brics.automaton: 531 ms
• com.karneim.util.collection: 1312 ms
Diagram
[Bar chart: Benchmark 1 time in ms (axis 0 to 70000) per library (org.apache, com.stevesoft, kmy.regex, java.util, jregex.Pattern, org.apache.oro, dk.brics, com.karneim) for 10000, 20000 and 50000 iterations]
Benchmark 2
• Source Code
• Matching tags
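Matching tags in a page's source code, as in this second benchmark, could look like the following sketch with `java.util.regex`. The class `TagMatchSketch`, the tag pattern and the inline HTML snippet are assumptions for illustration; the presentation does not give the expression it used, and the real benchmark ran on the source of Amazon, Marca and Ebay pages. The pattern is a simplification that does not handle comments or `>` characters inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMatchSketch {

    // Simplified tag pattern (an assumption): optional "/", a tag name, then anything up to ">".
    static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Returns the lower-cased name of every opening or closing tag, in document order. */
    static List<String> tagNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            names.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical page source standing in for a downloaded web page.
        String html = "<html><body><p class=\"x\">Hi</p><br/></body></html>";
        long start = System.nanoTime();
        List<String> names = tagNames(html);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(names.size() + " tags matched in " + elapsedMillis + " ms");
    }
}
```

On a page this small the elapsed time rounds down to 0 ms, so a value of 0 ms in a millisecond-resolution measurement does not by itself mean that no work was done.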
Benchmark 2: Amazon
• org.apache: 218 ms
• com.stevesoft: 63 ms
• kmy.regex: 94 ms
• java.util: 0 ms
• jregex.Pattern: 93 ms
• org.apache.oro: 32 ms
• dk.brics.automaton: 0 ms
• com.karneim.util.collection: 47 ms
Benchmark 2: Marca
• org.apache: 62 ms
• com.stevesoft: 47 ms
• kmy.regex: 93 ms
• java.util: 0 ms
• jregex.Pattern: 94 ms
• org.apache.oro: 16 ms
• dk.brics.automaton: 0 ms
• com.karneim.util.collection: 62 ms
Benchmark 2: Ebay
• org.apache: 31 ms
• com.stevesoft: 125 ms
• kmy.regex: 266 ms
• java.util: 0 ms
• jregex.Pattern: 156 ms
• org.apache.oro: 47 ms
• dk.brics.automaton: 0 ms
• com.karneim.util.collection: 172 ms
Diagram
[Bar chart: Benchmark 2 time in ms (axis 0 to 300) per library (org.apache, com.stevesoft, kmy.regex, java.util, jregex.Pattern, org.apache.oro, dk.brics, com.karneim) for the Amazon, Marca and Ebay pages]
To sum up…
• dk.brics.automaton is the fastest
• dk.brics and com.karneim fail with URLs
• kmy.regex or java.util
Conclusions
• Tokenisation test
• Searching information
• A real project
• Experience
Thanks!