Content analysis with Apache Tika - Paolo Mottadelli

Post on 08-May-2015

DESCRIPTION

Apache Tika presentation, taken from Paolo Mottadelli's talk at ApacheCon US 2008

TRANSCRIPT

Page 1: Content Analysis with Apache Tika

Content analysis with Apache Tika

Paolo Mottadelli

Page 2: Content Analysis with Apache Tika

Main challenge

Lucene index

Page 3: Content Analysis with Apache Tika

Other challenges

Page 4: Content Analysis with Apache Tika

What is Tika?

Another Indian Lucene project? No.

Page 5: Content Analysis with Apache Tika

What is Tika?

It is a Toolkit

Page 6: Content Analysis with Apache Tika

Current coverage

Page 7: Content Analysis with Apache Tika

A brief history of Tika

Sponsored by the Apache Lucene PMC

Page 8: Content Analysis with Apache Tika

Tika organization

Changing after graduation

Page 9: Content Analysis with Apache Tika

Getting Tika

… and contributing

Page 10: Content Analysis with Apache Tika

Tika Design

Page 11: Content Analysis with Apache Tika

The Parser interface

void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;
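
The contract in this signature can be illustrated without Tika on the classpath. The following is a minimal sketch, not Tika's implementation: `SketchParser`, `PlainTextSketchParser` and `extractText` are hypothetical names, a `Map<String, String>` stands in for Tika's `Metadata`, and `TikaException` is left out of the throws clause. It only shows the shape of the call: read bytes from the stream, report the text as XHTML SAX events, and record what was learned about the document in the metadata.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class ParserSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical stand-in for the Parser interface shown on the slide. */
    interface SketchParser {
        void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException;
    }

    /** Treats the input as UTF-8 text and emits it as a single XHTML paragraph. */
    static class PlainTextSketchParser implements SketchParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException {
            String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
            metadata.put("Content-Type", "text/plain");
            AttributesImpl none = new AttributesImpl();
            handler.startDocument();
            handler.startElement(XHTML, "html", "html", none);
            handler.startElement(XHTML, "body", "body", none);
            handler.startElement(XHTML, "p", "p", none);
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement(XHTML, "p", "p");
            handler.endElement(XHTML, "body", "body");
            handler.endElement(XHTML, "html", "html");
            handler.endDocument();
        }
    }

    /** Runs the sketch parser and returns the character content the handler received. */
    static String extractText(String input, Map<String, String> metadata) {
        StringBuilder out = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        InputStream stream = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8));
        try {
            new PlainTextSketchParser().parse(stream, collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("sketch parser failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        System.out.println(extractText("Hello, Tika", metadata));
        System.out.println(metadata.get("Content-Type"));
    }
}
```

The caller chooses the handler, so the same parser can feed a text collector, an indexer, or a serializer without changes.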

Page 12: Content Analysis with Apache Tika

Tika Design

Page 13: Content Analysis with Apache Tika

Document input stream

Page 14: Content Analysis with Apache Tika

Tika Design

Page 15: Content Analysis with Apache Tika

XHTML SAX events

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>...</title>
  </head>
  <body> ... </body>
</html>
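
To make the event stream concrete, here is a small sketch that uses only the JDK's SAX parser. It feeds a document with the same shape as this slide's skeleton to a recording handler and lists the events in the order they fire. `XhtmlEventsSketch` and `events` are hypothetical names, and the sample title and paragraph are made up for the demonstration. A Tika parser would produce events like these from a binary document instead of from XHTML text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XhtmlEventsSketch {
    /** Parses an XHTML string and returns the SAX events in the order they were fired. */
    static List<String> events(String xhtml) {
        List<String> log = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start:" + localName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end:" + localName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A SAX parser may deliver one text run in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required for localName to be populated
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), recorder);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body><p>Hello</p></body></html>";
        events(xhtml).forEach(System.out::println);
    }
}
```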

Page 16: Content Analysis with Apache Tika

Why XHTML?

• Reflects the structured text content of the document

• Does not recreate low-level details

• For low-level details, use low-level parser libraries

Page 17: Content Analysis with Apache Tika

ContentHandler (CH) and Decorators (CHD)
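
The decorator idea can be sketched with the JDK's SAX classes alone. In the example below, all class names (`ForwardingHandler`, `BodyTextOnly`, `TextCollector`) are hypothetical and are not Tika's own decorator classes. One handler wraps another and decides which events to pass on. Here the wrapper lets text through only while inside the `body` element, so the title text never reaches the collector.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class DecoratorSketch {
    /** Base decorator: forwards element and text events to the wrapped handler. */
    static class ForwardingHandler extends DefaultHandler {
        protected final ContentHandler delegate;

        ForwardingHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            delegate.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            delegate.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            delegate.characters(ch, start, length);
        }
    }

    /** Decorator that lets text through only while inside the body element. */
    static class BodyTextOnly extends ForwardingHandler {
        private boolean inBody;

        BodyTextOnly(ContentHandler delegate) {
            super(delegate);
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if ("body".equals(local)) {
                inBody = true;
            }
            super.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if ("body".equals(local)) {
                inBody = false;
            }
            super.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (inBody) {
                super.characters(ch, start, length);
            }
        }
    }

    /** Innermost handler: collects whatever text reaches it. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Fires a hand-made event stream (title text, then body text) at the decorated handler. */
    static String demo() {
        TextCollector collector = new TextCollector();
        ContentHandler handler = new BodyTextOnly(collector);
        AttributesImpl none = new AttributesImpl();
        try {
            handler.startElement("", "title", "title", none);
            handler.characters("Title".toCharArray(), 0, 5);
            handler.endElement("", "title", "title");
            handler.startElement("", "body", "body", none);
            handler.characters("Body text".toCharArray(), 0, 9);
            handler.endElement("", "body", "body");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because every decorator is itself a `ContentHandler`, decorators can be stacked in any order around the handler that finally consumes the text.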

Page 18: Content Analysis with Apache Tika

Tika Design

Page 19: Content Analysis with Apache Tika

Document metadata

Page 20: Content Analysis with Apache Tika

… more metadata: HPSF

Page 21: Content Analysis with Apache Tika

Tika Design

Page 22: Content Analysis with Apache Tika

Parser implementations

Page 23: Content Analysis with Apache Tika

The AutoDetectParser

• Encapsulates all Tika functionality

• Can handle any type of document
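
The idea behind a parser that handles any document is detect first, then delegate. The sketch below illustrates that dispatch pattern in plain Java. It is an assumption-laden toy, not Tika's code: `AutoDetectSketch`, its registry of functions, and the two-type detector are all hypothetical, and the PDF entry only reports the byte count instead of extracting text.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class AutoDetectSketch {
    /** Hypothetical registry: media type -> function that turns raw bytes into text. */
    private final Map<String, Function<byte[], String>> parsers = new LinkedHashMap<>();

    AutoDetectSketch() {
        parsers.put("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Placeholder only: a real PDF parser would extract the document text.
        parsers.put("application/pdf", bytes -> "[PDF, " + bytes.length + " bytes]");
    }

    /** Toy detector: recognises the "%PDF" signature, otherwise assumes plain text. */
    static String detect(byte[] data) {
        boolean pdf = data.length >= 4
                && data[0] == '%' && data[1] == 'P' && data[2] == 'D' && data[3] == 'F';
        return pdf ? "application/pdf" : "text/plain";
    }

    /** Detects the type, then hands the bytes to the parser registered for it. */
    String parse(byte[] data) {
        return parsers.get(detect(data)).apply(data);
    }

    public static void main(String[] args) {
        AutoDetectSketch sketch = new AutoDetectSketch();
        System.out.println(sketch.parse("plain text".getBytes(StandardCharsets.UTF_8)));
        System.out.println(sketch.parse("%PDF-1.4".getBytes(StandardCharsets.US_ASCII)));
    }
}
```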

Page 24: Content Analysis with Apache Tika

Type Detection

MimeType type = types.getMimeType(…);

Page 25: Content Analysis with Apache Tika

tika-mimetypes.xml

An example: Gzip

<mime-type type="application/x-gzip">
  <magic priority="40">
    <match value="\037\213" type="string" offset="0" />
  </magic>
  <glob pattern="*.tgz" />
  <glob pattern="*.gz" />
  <glob pattern="*-gz" />
</mime-type>
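
The Gzip entry combines two kinds of evidence: a magic rule (the octal bytes \037\213, that is 0x1F 0x8B, at offset 0) and three file-name globs. The sketch below hard-codes those rules in plain Java to show what each one tests. `GzipDetectionSketch` is a hypothetical name. Treating either rule as sufficient and falling back to `application/octet-stream` are assumptions of this sketch, since the slide does not say how Tika weighs magic against globs.

```java
public class GzipDetectionSketch {
    static final String GZIP = "application/x-gzip";
    static final String UNKNOWN = "application/octet-stream"; // assumed fallback type

    /** Magic rule: bytes \037\213 (octal), i.e. 0x1F 0x8B, at offset 0. */
    static boolean matchesMagic(byte[] header) {
        return header != null && header.length >= 2
                && (header[0] & 0xFF) == 0x1F
                && (header[1] & 0xFF) == 0x8B;
    }

    /** Glob rules: *.tgz, *.gz and *-gz reduce to simple suffix checks. */
    static boolean matchesGlob(String name) {
        return name != null
                && (name.endsWith(".tgz") || name.endsWith(".gz") || name.endsWith("-gz"));
    }

    /** Assumption of this sketch: either rule alone is enough to report gzip. */
    static String detect(byte[] header, String name) {
        return (matchesMagic(header) || matchesGlob(name)) ? GZIP : UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] gzipHeader = {0x1F, (byte) 0x8B, 0x08};
        System.out.println(detect(gzipHeader, null));
        System.out.println(detect(null, "backup.tgz"));
        System.out.println(detect(new byte[] {'P', 'K'}, "archive.zip"));
    }
}
```

The `& 0xFF` mask matters because Java bytes are signed: 0x8B read as a `byte` is negative and would never equal the int literal 0x8B without it.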

Page 26: Content Analysis with Apache Tika

Supported formats

Page 27: Content Analysis with Apache Tika

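A really simple example

// Package names as in Tika releases that ship OfficeParser; adjust to your version.
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.microsoft.OfficeParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// Load a sample PowerPoint file from the classpath.
InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
Metadata metadata = new Metadata();
// BodyContentHandler collects the text found inside the XHTML body.
ContentHandler handler = new BodyContentHandler();
new OfficeParser().parse(input, handler, metadata);

String contentType = metadata.get(Metadata.CONTENT_TYPE);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();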

Page 28: Content Analysis with Apache Tika

Future Goals

Page 29: Content Analysis with Apache Tika

Who uses Tika?