Tokeniser Francisco Miguel Pérez Romero University of Sevilla

Post on 16-Jan-2016

Tokeniser

Francisco Miguel Pérez Romero

University of Sevilla

Roadmap

Introduction

Class Diagram

Libraries

Conclusions

Web Wrapping

[Diagram labels: Information retrieval; Verifier, Ontologiser, Extractor; Query; Navigator, Form Filler]

Tokeniser

• Tokenisation Rules
• Configuration File
• Web Page
• Parser

Tokeniser Usage

• Web Page Classification
• Information Extraction Learners
• Information Extraction

Example

[Diagram labels: Config File, Token List, Web Page, Tokeniser, XML File, Token List]

Concepts

• Configuration File
• Token
• Tokenisation types

Example

3 Token Classes: Word Space Digit Space Digit
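The example above can be sketched in code. This is a minimal illustration, not the tokeniser presented in the talk: the class name `TokeniserSketch`, the hard-coded rule table, and the regular expressions chosen for Word, Digit and Space are all assumptions. In the talk the rules come from a configuration file; here they are inlined, and `java.util.regex` is used only because it ships with the JDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokeniserSketch {
    // Hypothetical rule table: token class name -> regular expression.
    // Insertion order is preserved, so earlier rules take priority.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Word", Pattern.compile("\\p{L}+"));
        RULES.put("Digit", Pattern.compile("\\d+"));
        RULES.put("Space", Pattern.compile("\\s+"));
    }

    // Scans the text left to right and returns the class of each token found.
    static List<String> tokenise(String text) {
        List<String> classes = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(text);
                m.region(pos, text.length());
                if (m.lookingAt()) {          // rule must match exactly at pos
                    classes.add(rule.getKey());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {                   // no rule applies: emit a fallback token
                classes.add("Unknown");
                pos++;
            }
        }
        return classes;
    }

    public static void main(String[] args) {
        // prints [Word, Space, Digit, Space, Digit]
        System.out.println(tokenise("Room 12 34"));
    }
}
```

The input "Room 12 34" is an invented string; it yields five tokens drawn from three token classes, which is the shape of the example on the slide.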

Class Diagram: Tokenisation

Tokenisation Example

Class Diagram: Tokeniser

Comparison Features 1

Comparison features:
• Javadoc documentation?
• Unicode UTF-8 support
• Unicode UTF-16 support
• Named groups
• Indexable groups > 9
• Negative groups
• Nested groups
• Lazy quantifiers?

Comparison Features 2

Comparison features:
• Fuzzy matching?
• POSIX support?
• Ignore-case support?
• New-line option support?
• Uses a state machine?
• Accent support?
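To make a few of these features concrete, here is a small sketch using `java.util.regex`, one of the libraries compared later. It only shows what named groups, lazy quantifiers, case-insensitive matching with accented characters, and the multi-line option look like in that one library; the class name, method names and sample inputs are invented for illustration, and other libraries in the comparison may expose these features differently or not at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFeatureSketch {
    // Named groups: the year is read back by name rather than by index.
    static String yearOf(String input) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(input);
        return m.find() ? m.group("year") : null;
    }

    // Lazy vs greedy quantifier: ".+?" stops at the first '>', ".+" at the last.
    static String firstTag(String html, boolean lazy) {
        Matcher m = Pattern.compile(lazy ? "<.+?>" : "<.+>").matcher(html);
        return m.find() ? m.group() : null;
    }

    // Ignore case including accented letters: UNICODE_CASE is needed for
    // non-ASCII characters such as e-acute (written as \u00e9 here).
    static boolean matchesSurname(String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile("p\u00e9rez", flags).matcher(input).matches();
    }

    // New-line option: with MULTILINE, '^' matches at the start of every line.
    static int linesStartingWithDigit(String text) {
        Matcher m = Pattern.compile("^\\d", Pattern.MULTILINE).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(yearOf("Posted 2016-01"));
        System.out.println(firstTag("<b>bold</b>", true));
        System.out.println(firstTag("<b>bold</b>", false));
        System.out.println(matchesSurname("P\u00c9REZ"));
        System.out.println(linesStartingWithDigit("1 a\nb\n2 c"));
    }
}
```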

Libraries

• Table 1

Libraries

• Table 2

Libraries

• Table 3

Benchmark 1

• Regular expression list
• String list
• Every regular expression is matched against every string
• Time in ms
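A benchmark of this shape might look like the sketch below. It is not the harness used for the figures in this talk: the class name, the two sample patterns and the three sample strings are assumptions, and only `java.util.regex` is exercised. Timing uses `System.nanoTime()` because a millisecond clock can report 0 ms for very short runs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegexBenchmarkSketch {
    // Matches every pattern against every string, repeated `iterations` times,
    // and returns the number of successful full matches.
    static long countMatches(List<Pattern> patterns, List<String> inputs, int iterations) {
        long matches = 0;
        for (int i = 0; i < iterations; i++) {
            for (Pattern p : patterns) {
                for (String s : inputs) {
                    if (p.matcher(s).matches()) {
                        matches++;
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Illustrative data only; the lists used in the talk are not given.
        List<Pattern> patterns = Arrays.asList(
                Pattern.compile("\\d+"),
                Pattern.compile("[a-z]+"));
        List<String> inputs = Arrays.asList("123", "abc", "a1");

        for (int iterations : new int[] {10000, 20000, 50000}) {
            long start = System.nanoTime();
            long matches = countMatches(patterns, inputs, iterations);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(iterations + " iterations: "
                    + matches + " matches, " + elapsedMs + " ms");
        }
    }
}
```

Patterns are compiled once, outside the timed loop, so the measurement covers matching only. Whether the talk's figures include compilation time is not stated.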

Benchmark 1: 10000 Iterations

• org.apache: 7078 ms
• com.stevesoft: 19782 ms
• kmy.regex: 781 ms
• java.util: 1266 ms
• jregex.Pattern: 1000 ms
• org.apache.oro: 2156 ms
• dk.brics.automaton: 265 ms
• com.karneim.util.collection: 407 ms

Benchmark 1: 20000 Iterations

• org.apache: 11796 ms
• com.stevesoft: 26641 ms
• kmy.regex: 906 ms
• java.util: 1891 ms
• jregex.Pattern: 1422 ms
• org.apache.oro: 3375 ms
• dk.brics.automaton: 312 ms
• com.karneim.util.collection: 610 ms

Benchmark 1: 50000 Iterations

• org.apache: 28656 ms
• com.stevesoft: 63297 ms
• kmy.regex: 1781 ms
• java.util: 4281 ms
• jregex.Pattern: 3219 ms
• org.apache.oro: 7641 ms
• dk.brics.automaton: 531 ms
• com.karneim.util.collection: 1312 ms

Diagram

[Bar chart: Benchmark 1 time per library (org.apache, com.stevesoft, kmy.regex, java.util, jregex.Pattern, org.apache.oro, dk.brics, com.karneim); axis from 0 to 70000; series: 10000 It, 20000 It, 50000 It]

Benchmark 2

• Source Code
• Matching tags
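As a rough idea of what matching tags in page source can involve, the sketch below extracts tag names from an HTML string with `java.util.regex`. The regular expression, the class name and the sample markup are assumptions; the expression actually used in this benchmark is not given, and a regular expression like this one is only an approximation of HTML tag syntax (it ignores comments and `>` inside attribute values, for example).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMatchSketch {
    // Hypothetical tag pattern: '<', optional '/', a tag name, then anything up to '>'.
    static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns the lower-cased name of every opening, closing or
    // self-closing tag, in document order.
    static List<String> tagNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            names.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"x\">link</a><br/></body></html>";
        // prints [html, body, a, a, br, body, html]
        System.out.println(tagNames(html));
    }
}
```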

Benchmark 2: Amazon

• org.apache: 218 ms
• com.stevesoft: 63 ms
• kmy.regex: 94 ms
• java.util: 0 ms
• jregex.Pattern: 93 ms
• org.apache.oro: 32 ms
• dk.brics.automaton: 0 ms
• com.karneim.util.collection: 47 ms

Benchmark 2: Marca

• org.apache: 62 ms
• com.stevesoft: 47 ms
• kmy.regex: 93 ms
• java.util: 0 ms
• jregex.Pattern: 94 ms
• org.apache.oro: 16 ms
• dk.brics.automaton: 0 ms
• com.karneim.util.collection: 62 ms

Benchmark 2: Ebay

• org.apache: 31 ms
• com.stevesoft: 125 ms
• kmy.regex: 266 ms
• java.util: 0 ms
• jregex.Pattern: 156 ms
• org.apache.oro: 47 ms
• dk.brics.automaton: 0 ms
• com.karneim.util.collection: 172 ms

Diagram

[Bar chart: Benchmark 2 time per library (org.apache, com.stevesoft, kmy.regex, java.util, jregex.Pattern, org.apache.oro, dk.brics, com.karneim); axis from 0 to 300; series: Amazon, Marca, Ebay]

To sum up…

• dk.brics.automaton is the fastest
• dk.brics and com.karneim fail with URLs
• Choice: kmy.regex or java.util

Conclusions

• Tokenisation test
• Searching information
• A real project
• Experience

Thanks!