Apache Tika presentation, taken from Paolo Mottadelli's talk at ApacheCon US 2008
Content analysis with Apache Tika
Paolo Mottadelli
Main challenge
Lucene index
Other challenges
What is Tika?
Another Indian Lucene project? No.
What is Tika?
It is a Toolkit
Current coverage
A brief history of Tika
Sponsored by the Apache Lucene PMC
Tika organization
Changing after graduation
Getting Tika
… and contributing
Tika Design
The Parser interface
void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;
Tika Design
Document input stream
Tika Design
XHTML SAX events
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>...</title>
</head>
<body> ... </body>
</html>
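The XHTML skeleton above reaches the caller as a sequence of SAX events rather than as a string. The following is a minimal sketch of what emitting that sequence could look like, using only the JDK's SAX classes. The class `XhtmlEventSketch` and its methods are hypothetical names made up for illustration; this is not Tika's own code.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): emits a minimal XHTML document as SAX events.
class XhtmlEventSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Sends html/head/title/body events, in document order, to the given handler.
    static void emit(ContentHandler handler, String title, String body) throws SAXException {
        handler.startDocument();
        handler.startPrefixMapping("", XHTML);
        start(handler, "html");
        start(handler, "head");
        start(handler, "title");
        text(handler, title);
        end(handler, "title");
        end(handler, "head");
        start(handler, "body");
        text(handler, body);
        end(handler, "body");
        end(handler, "html");
        handler.endPrefixMapping("");
        handler.endDocument();
    }

    private static void start(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    private static void text(ContentHandler h, String s) throws SAXException {
        char[] chars = s.toCharArray();
        h.characters(chars, 0, chars.length);
    }

    // Records the events as a compact tag trace so the order can be inspected.
    static String trace(String title, String body) {
        final StringBuilder out = new StringBuilder();
        try {
            emit(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    out.append('<').append(local).append('>');
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    out.append("</").append(local).append('>');
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    out.append(ch, start, length);
                }
            }, title, body);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Calling `XhtmlEventSketch.trace("T", "Hello")` replays the events into a recording handler, which makes the element nesting visible without building a DOM.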
Why XHTML?
• Reflects the structured text content of the document
• Not recreating the low-level details
• For low-level details, use low-level parser libraries
ContentHandler (CH) and Decorators (CHD)
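As a rough illustration of the decorator idea, the sketch below wraps one SAX handler in another that forwards text only while inside the body element. It uses JDK classes only. `BodyOnlyDecorator`, `TextCollector` and `DecoratorDemo` are hypothetical names chosen for this example and are not Tika classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical decorator: forwards character events to the wrapped handler
// only while the parser is inside the <body> element.
class BodyOnlyDecorator extends DefaultHandler {
    private final ContentHandler delegate;
    private int bodyDepth = 0; // 0 means "outside <body>"

    BodyOnlyDecorator(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(local)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (bodyDepth > 0) {
            delegate.characters(ch, start, length);
        }
    }
}

// Plain handler that accumulates whatever text reaches it.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class DecoratorDemo {
    // Parses an XHTML string and returns only the text found under <body>.
    static String bodyText(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so local names are reported
            TextCollector sink = new TextCollector();
            factory.newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                    new BodyOnlyDecorator(sink));
            return sink.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The collector stays unaware of document structure; the filtering lives entirely in the wrapper, so different decorators can be combined with the same sink.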
Tika Design
Document metadata
… more metadata: HPSF
Tika Design
Parser implementations
The AutoDetectParser
• Encapsulates all Tika functionality
• Can handle any type of document
Type Detection
MimeType type = types.getMimeType(…);
tika-mimetypes.xml
An example: Gzip
<mime-type type="application/x-gzip">
<magic priority="40">
<match value="\037\213" type="string" offset="0" />
</magic>
<glob pattern="*.tgz" />
<glob pattern="*.gz" />
<glob pattern="*-gz" />
</mime-type>
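To make the gzip entry concrete, here is a minimal sketch that applies the same two kinds of evidence by hand: the magic bytes at offset 0 and the file name globs. `GzipRuleSketch` is a hypothetical class written for this example and is not how Tika implements detection. In particular, treating either match as sufficient is an assumption of the sketch and says nothing about how Tika weighs magic priority against globs.

```java
// Hypothetical sketch (not Tika's implementation): applies the gzip rule by hand.
class GzipRuleSketch {
    static final String TYPE = "application/x-gzip";

    // "\037\213" in the XML is octal notation for the bytes 0x1F 0x8B.
    private static final byte[] MAGIC = {(byte) 0x1F, (byte) 0x8B};

    // The globs *.tgz, *.gz and *-gz are all plain suffix matches.
    private static final String[] SUFFIXES = {".tgz", ".gz", "-gz"};

    // True when the header starts with the gzip magic bytes at offset 0.
    static boolean matchesMagic(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // True when the file name matches one of the glob patterns.
    static boolean matchesGlob(String name) {
        if (name == null) {
            return false;
        }
        for (String suffix : SUFFIXES) {
            if (name.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // Assumption of this sketch: either kind of evidence is enough.
    static boolean isGzip(byte[] header, String name) {
        return matchesMagic(header) || matchesGlob(name);
    }
}
```

For instance, `GzipRuleSketch.isGzip(new byte[] {(byte) 0x1F, (byte) 0x8B}, "unknown.bin")` is true because of the magic bytes alone, while a file named `notes.txt` starting with other bytes is rejected.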
Supported formats
A really simple example
// Load a test PowerPoint file from the classpath
InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
// Parse: text goes to the handler, properties go to the metadata
new OfficeParser().parse(input, handler, metadata);
String contentType = metadata.get(Metadata.CONTENT_TYPE);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();
Future Goals
Who uses Tika?