Apache Tika: 1 point Oh!

Apache Tika: 1 point Oh! Chris A. Mattmann NASA JPL/Univ. Southern California/ASF [email protected] November 9, 2011

Page 1: Apache Tika: 1 point Oh!

Apache Tika: 1 point Oh!

Chris A. Mattmann
NASA JPL/Univ. Southern California/ASF

[email protected]
November 9, 2011

Page 2: Apache Tika: 1 point Oh!

And you are?

• Senior Computer Scientist at NASA JPL in Pasadena, CA USA
• Software Architecture/Engineering Prof at Univ. of Southern California
• Apache Member involved in
  – OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor)

Page 3: Apache Tika: 1 point Oh!

Roadmap

• 1st part of the talk
  – Why Tika?
  – What is Tika?
  – What are the current versions of Tika?
  – What can it do?
• 2nd part of the talk
  – NASA Earth Science Data Systems
  – Data System Needs and Requirements
  – How does Tika help?

Page 4: Apache Tika: 1 point Oh!

The Information Landscape

Page 5: Apache Tika: 1 point Oh!

Proliferation of content types available

• By some accounts, 16K to 51K content types*
• What to do with content types?
  – Parse them
    • How?
    • Extract their text and structure
  – Index their metadata
    • In an indexing technology like Lucene, Solr, or in Google Appliance
  – Identify what language they belong to
    • Ngrams

*http://filext.com/

Page 6: Apache Tika: 1 point Oh!

Importance of content types

Page 7: Apache Tika: 1 point Oh!

Importance of content type detection

Page 8: Apache Tika: 1 point Oh!

Search Engine Architecture

Page 9: Apache Tika: 1 point Oh!

Goals

• Identify and classify file types
  – MIME detection
    • Glob pattern
      – *.txt
      – *.pdf
    • URL
      – http://…pdf
      – ftp://myfile.txt
    • Magic bytes
    • Combination of the above means
• Classification means reaction can be targeted
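To make the detection strategies on this slide concrete, here is a minimal sketch in plain Java that combines a glob-style suffix check with a magic-byte check. It is not Tika's implementation: the class name MimeSketch, the three-entry rule table, and the choice to let magic bytes override the file name are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Illustrative only: a hand-rolled detector, not how Tika is implemented. */
final class MimeSketch {

    /** One hypothetical rule: a glob-style suffix, optional magic bytes, and the type they imply. */
    private static final class Rule {
        final String suffix;
        final byte[] magic; // null when the type has no reliable signature
        final String type;

        Rule(String suffix, byte[] magic, String type) {
            this.suffix = suffix;
            this.magic = magic;
            this.type = type;
        }
    }

    private static final Rule[] RULES = {
        new Rule(".pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"),
        new Rule(".png", new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png"),
        new Rule(".txt", null, "text/plain"),
    };

    /**
     * Combines both signals: magic bytes are checked first because a file
     * name can be misleading; the name (or URL) suffix is the fallback.
     */
    static String detect(String nameOrUrl, byte[] head) {
        if (head != null) {
            for (Rule r : RULES) {
                if (r.magic != null && startsWith(head, r.magic)) {
                    return r.type;
                }
            }
        }
        if (nameOrUrl != null) {
            String lower = nameOrUrl.toLowerCase(Locale.ROOT);
            for (Rule r : RULES) {
                if (lower.endsWith(r.suffix)) {
                    return r.type;
                }
            }
        }
        return "application/octet-stream"; // nothing matched
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With these rules, a file named report.txt whose first bytes are %PDF-1.4 is classified as application/pdf, while ftp://myfile.txt with no content available falls back to text/plain. The sketch does not strip URL query strings or handle signatures at non-zero offsets.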

Page 10: Apache Tika: 1 point Oh!

Tika is…

• A content analysis and detection toolkit
• A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries

• A rich Metadata API for representing different Metadata models

• A command line interface to the underlying Java code

• A GUI interface to the Java code

Page 11: Apache Tika: 1 point Oh!

Tika’s (Brief) History

• Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006
• Proposed as Lucene sub-project
  – Others interested, didn’t gain much traction
• Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit
  – A Content Management System
• Graduated from the Incubator to Lucene sub-project in 2008
• Graduated to Apache TLP in April 2010
• 40, 88 and 29 issues resolved in versions 1.0, 0.10, and 0.9

Page 12: Apache Tika: 1 point Oh!

Community

• Mailing lists
  – User: 125 peeps, ~70 msg/mo.
  – Dev: 210 peeps, ~250 msg/mo.
• Committers/PMC
  – 13 peeps
  – Large majority of them active
• Releases
  – 11 releases so far
  – Just pushed out 1 point OH
• http://s.apache.org/N0I

Credit: svnsearch.org

Page 13: Apache Tika: 1 point Oh!

Use in the classroom

• Have used Apache Tika for the past 2 years in both my Search Engines/Information Retrieval class and my Software Architecture class
  – Several student final projects have turned into contributions for the project and merit for the students

• Define data management projects that involve the use of OODT, and other technologies like Solr, Tika, Nutch, Hadoop, etc.

Page 14: Apache Tika: 1 point Oh!

Some recent 1 point oh press

Page 15: Apache Tika: 1 point Oh!

Getting started rapidly…like now!

• Download Tika from:
  – http://tika.apache.org/download.html
• Grab tika-app-1.0.jar
• alias tika "java -jar tika-app-1.0.jar"
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met
• Works on Windows too (alias only on UNIX)

Page 16: Apache Tika: 1 point Oh!

A quick NASA dataset

• Atmospheric Infrared Sounder Mission (AIRS)
  – Level 2 Cloud Clear Radiance Product
  – Grab it from here:
    • ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/
  – Just grab the first file
    • java -jar tika-app-1.0.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf
  – Hopefully this worked for you, if not, blame..
    • Windows
      – And Bill Gates


Page 17: Apache Tika: 1 point Oh!

Detecting MIME types from Java

• String type = new Tika().detect(…)
  – java.io.InputStream
  – java.io.File
  – java.net.URL
  – java.lang.String

Page 18: Apache Tika: 1 point Oh!

Adding new MIME types

• Got XML?

• Based on freedesktop.org spec (loosely)

Page 19: Apache Tika: 1 point Oh!

Many custom applications and tools

• You need this: to read this:

Page 20: Apache Tika: 1 point Oh!

Third-party parsing libraries

• Most of the custom applications come with software libraries and tools to read/write these files
  – Rather than re-invent the wheel, figure out a way to take advantage of them
• Parsing text and structure is a difficult problem
  – Not all libraries parse text in equivalent manners
  – Some are faster than others
  – Some are more reliable than others

Page 21: Apache Tika: 1 point Oh!

Parsing

• String content = new Tika().parseToString(…)
  – InputStream
  – File
  – URL

Page 22: Apache Tika: 1 point Oh!

Streaming Parsing

• Reader reader = new Tika().parse(…)
  – InputStream
  – File
  – URL

Page 23: Apache Tika: 1 point Oh!

Extraction of Metadata

• Important to follow common Metadata models
  – Dublin Core – any electronic resource
  – XMP – also general like Dublin Core
  – Word Metadata – specific to .doc, .ppt, etc.
  – EXIF – image related
• Lots of standards and models out there
  – The use and extraction of common models allows for content intercomparison
  – All standardize mechanisms for searching
  – You always know for X file type that field Y is there and of type String or Int or Date

Page 24: Apache Tika: 1 point Oh!

Cancer Research Example

Page 25: Apache Tika: 1 point Oh!

Cancer Research Example

Attributes

Relationships

Credit: A. Hart

Page 26: Apache Tika: 1 point Oh!

Tika Sponsoring the Any23 Project

• Tika PMC is sponsoring the Any23 project in the Incubator (entered: 10/1/2011)

• Any23 = “Anything to Triples”
• Semantic Toolkit for parsing, identification of all major semantic web content types (RDF, etc.)
• Related to Apache Jena
• Looking for synergies between 2 efforts

Page 27: Apache Tika: 1 point Oh!

Metadata

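• Metadata met = new Metadata();
  // Dublin Core
  met.set(Metadata.FORMAT, "text/html");
  // multi-valued: add() appends a second value; a second set() would replace the first
  met.add(Metadata.FORMAT, "text/plain");
  // getValues() returns a String[], so format it before printing
  System.out.println(java.util.Arrays.toString(met.getValues(Metadata.FORMAT)));
• Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.)
  – Run: tika --list-met-models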

Page 28: Apache Tika: 1 point Oh!

Methods for language identification

• N-grams
  – Method of detecting next character or set of characters in a sequence
  – Useful in determining whether small snippets of text come from a particular language, or character set
• Non-computational approaches
  – Tagging
  – Looking for common words or characters
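As a rough illustration of the N-gram idea, the sketch below builds character-trigram counts for a snippet and picks the reference text whose trigram counts are most similar. It is plain Java and not Tika's LanguageIdentifier: the class name NgramSketch, the trigram size, the cosine-similarity scoring, and the tiny reference texts in the usage note are assumptions chosen to keep the example short.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative character-trigram language guesser; not Tika's algorithm. */
final class NgramSketch {

    /** Counts overlapping character trigrams in the lower-cased text. */
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity of two trigram count vectors, in the range [0, 1]. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0; // text shorter than one trigram
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the label of the reference text most similar to the snippet. */
    static String identify(String snippet, Map<String, String> referenceTexts) {
        Map<String, Integer> snippetProfile = profile(snippet);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, String> e : referenceTexts.entrySet()) {
            double score = similarity(snippetProfile, profile(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Given one short English sentence and one short German sentence as references, identify("the dog and the fox", refs) selects the English label because far more of its trigrams ("the", "dog", "and", ...) appear in the English reference. Real profiles are trained on much larger corpora.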

Page 29: Apache Tika: 1 point Oh!

Language Detection

• LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(FileUtils.readFileToString(new File(filename))));
• System.out.println(lang.getLanguage());
• Uses Ngram analysis included with Tika
  – Originating from Nutch
  – Can be improved

Page 30: Apache Tika: 1 point Oh!

Running Tika in GUI form

• tika --gui

<html xmlns:html=“…”><body>…</body></html>

Page 31: Apache Tika: 1 point Oh!

Integrating Tika into your App

• Maven

• Ant

• Eclipse

• It’s just a set of jars
  – tika-core
  – tika-parsers
  – tika-app
  – tika-bundle
  – tika-server

Page 32: Apache Tika: 1 point Oh!

Some really great stuff in 1.0

• Super improved OSGi support
  – New tika-bundle module
• Improved RTF parsing support, OO support, and parsing of Outlook email attachments
• Language Detection for Belarusian, Catalan, Esperanto, Galician, Lithuanian, Romanian, Slovak, Slovenian, and Ukrainian

• Improved PDF parsing (extract annotation)

NICK ALREADY TALKED ABOUT THIS!!! Thunder stolen

Page 33: Apache Tika: 1 point Oh!

Things to watch out for

• Deprecated APIs -> gone
  – Recompile code
• No more JDK 1.4 version of Tika
  – Upgrade

Page 34: Apache Tika: 1 point Oh!

Improvements to Tika

• Adding more parsers for content types
• Improve the JAX-RS server support
• Expanding ability to handle random access file parsing
  – Scientific data file formats, some work on this
  – Leverage improvements in file representation
    TIKA-701, TIKA-654, TIKA-645, TIKA-153
• Geospatial parsing support through GDAL
• Improving language and charset detection

Page 35: Apache Tika: 1 point Oh!

Part 2

Science Data Systems at NASA

Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295

Page 36: Apache Tika: 1 point Oh!

NASA Ground Data Systems

Credit: D. Woollard

Page 37: Apache Tika: 1 point Oh!

Context

• NASA develops science data processing systems for multiple earth science missions
• These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research
• Typical characteristics
  – Remote sensing instruments that orbit the Earth multiple times daily
  – Data are acquired constantly
  – Complex algorithms convert instrument measurements to geophysical quantities

Page 38: Apache Tika: 1 point Oh!

The Square Kilometer Array

• 1 sq. km of antennas
• Never-before-seen resolution looking into the sky
• 700 TB
  – Per second!

Page 39: Apache Tika: 1 point Oh!

NASA DESDynI Mission

• 16 TB/day

• Geographically distributed

• 10s of 1000s of jobs per day

• Tier 1 Earth Science Decadal Mission
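To put the 16 TB/day figure in perspective, the small sketch below converts a daily volume into the average sustained rate a ground system would have to absorb. The class name ThroughputSketch is made up for illustration, and the conversion assumes decimal terabytes (10^12 bytes) and perfectly even arrival over the day, which real downlinks will not match.

```java
/** Back-of-the-envelope conversion from a daily data volume to an average rate. */
final class ThroughputSketch {
    static final double BYTES_PER_TB = 1e12;        // assumption: decimal terabytes
    static final double SECONDS_PER_DAY = 86_400.0; // 24 * 60 * 60

    /** Average megabytes per second needed to move the given volume in one day. */
    static double megabytesPerSecond(double terabytesPerDay) {
        return terabytesPerDay * BYTES_PER_TB / SECONDS_PER_DAY / 1e6;
    }
}
```

Under those assumptions, megabytesPerSecond(16) comes out at roughly 185 MB/s, sustained around the clock.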

Page 40: Apache Tika: 1 point Oh!

Some Considerations

• Scale
  – Data throughput rates
  – # of data types
  – # of metadata types
  – # of users to send the data to
• Federation
  – Must leave the data where it is
  – Socio/Economic/Political
• Heterogeneity
  – Technology, data formats, skills!

Page 41: Apache Tika: 1 point Oh!

Apache OODT

• We’ve got some components to deal with these issues

Page 42: Apache Tika: 1 point Oh!

How are we building these systems now?

- Allow for push/pull of data over arbitrary protocols
- Ingestion builds std catalog and archive
- Deliver product metadata to search, portal or GIS
- Plug in arbitrary met extractors

Page 43: Apache Tika: 1 point Oh!

How are we building these systems now?

- Separation of file management from workflow management
- Allow for heterogeneous computing resources
- Easily integrate PGEs
- Leverages same ingestion crawler

Page 44: Apache Tika: 1 point Oh!

What does this have to do with Tika?

Metadata Ext: TIKA!

Metadata Ext: TIKA!

MIME identification: TIKA!

MIME identification: TIKA!

Page 45: Apache Tika: 1 point Oh!

What does this have to do with Tika?

Metadata Ext: TIKA!

MIME identification: TIKA!

MIME identification: TIKA!

Page 46: Apache Tika: 1 point Oh!

Science Data File Formats

• Hierarchical Data Format (HDF)
  – http://www.hdfgroup.org
  – Versions 4 and 5
  – Lots of NASA data is in 4, newer NASA data in 5
  – Encapsulates
    • Observation (Scalars, Vectors, Matrices, NxMxZ…)
    • Metadata (Summary info, date/time ranges, spatial ranges)
  – Custom readers/writers/APIs in many languages
    • C/C++, Python, Java

Page 47: Apache Tika: 1 point Oh!

Science Data File Formats

• Network Common Data Form (netCDF)
  – www.unidata.ucar.edu/software/netcdf/
  – Versions 3 and 4
  – Heavily used in DOE, NOAA, etc.
  – Encapsulates
    • Observation (Scalars, Vectors, Matrices, NxMxZ…)
    • Metadata (Summary info, date/time ranges, spatial ranges)
  – Custom readers/writers/APIs in many languages
    • C/C++, Python, Java
  – Not Hierarchical representation: all flat

Page 48: Apache Tika: 1 point Oh!

So how does it work?

• Ingestion
  – Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format
  – Need to extract their met, catalog and archive them, etc.
    • Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability
• Processing
  – Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive

Page 49: Apache Tika: 1 point Oh!

Tool support

• Entire stacks of tools written around these formats
  – OPeNDAP, LAS, readers, writers, custom NASA mission toolkits
  – OGC
    • WMS, WCS, etc.
  – Unique, one of a kind software built around these data file formats
• Apache can contribute strongly in this area!

Page 50: Apache Tika: 1 point Oh!

Besides processing science files

• …Tika also helps with
• MIME identification
  – Useful in remote file acquisition
  – Useful in classification (catalog/archive) of existing content
  – Useful in crawling; see my Nutch talk last year: http://s.apache.org/UvU
• Language identification
  – Can be useful when data is coming from around the world, but need to quickly identify whether or not we can process it

Page 51: Apache Tika: 1 point Oh!

Big Goal

• More closely link OODT and Tika
  – Add new parser to Tika
  – Easily get OODT met extractor based on it
• Contribute back some features still baking in OODT
  – Configuration aspects of parsing
  – File types and extensions for science data files
• Spatial
  – Some work done in my CS572 class on spatial parser for Tika – would be great to integrate with Tika, OODT, SIS, and Solr

Page 52: Apache Tika: 1 point Oh!

NASA Geo Challenges

• Sometimes the data isn’t annotated with lat and lon
  – How to discover this?
• Even when the data is annotated with spatial information, computation of e.g., bounding box around the poles is difficult

• Efficiency and speed are difficult since data is at scale
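The bounding-box difficulty near the poles can be shown with a small plain-Java sketch. The class BoundingBoxSketch and its methods are hypothetical and unrelated to Tika, OODT, or SIS. It computes the naive min/max box of a footprint given as {lat, lon} pairs in degrees, and adds a simple check (summing wrapped longitude steps around the closed ring) for whether the footprint circles a pole, under the assumptions that consecutive vertices are less than 180 degrees of longitude apart and no edge passes exactly through a pole.

```java
/** Illustrative geometry helpers showing why naive bounding boxes fail near a pole. */
final class BoundingBoxSketch {

    /** Naive box from independent min/max, returned as {minLat, maxLat, minLon, maxLon}. */
    static double[] naiveBox(double[][] latLon) {
        double minLat = Double.POSITIVE_INFINITY;
        double maxLat = Double.NEGATIVE_INFINITY;
        double minLon = Double.POSITIVE_INFINITY;
        double maxLon = Double.NEGATIVE_INFINITY;
        for (double[] p : latLon) {
            minLat = Math.min(minLat, p[0]);
            maxLat = Math.max(maxLat, p[0]);
            minLon = Math.min(minLon, p[1]);
            maxLon = Math.max(maxLon, p[1]);
        }
        return new double[] {minLat, maxLat, minLon, maxLon};
    }

    static boolean contains(double[] box, double lat, double lon) {
        return lat >= box[0] && lat <= box[1] && lon >= box[2] && lon <= box[3];
    }

    /**
     * True when the closed ring winds once around a pole: the longitude steps,
     * each wrapped into (-180, 180], then sum to +360 or -360 instead of 0.
     * It does not say which pole; the sign of the latitudes would tell.
     */
    static boolean enclosesPole(double[][] ring) {
        double total = 0;
        for (int i = 0; i < ring.length; i++) {
            double step = ring[(i + 1) % ring.length][1] - ring[i][1];
            while (step > 180) {
                step -= 360;
            }
            while (step <= -180) {
                step += 360;
            }
            total += step;
        }
        return Math.abs(Math.abs(total) - 360) < 1e-6;
    }
}
```

For a footprint with vertices at latitude 85 and longitudes -135, -45, 45, and 135, the naive box spans latitude 85 to 85 only, so it misses the North Pole even though the footprint surrounds it; enclosesPole flags that case so a caller can widen the box.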

Page 53: Apache Tika: 1 point Oh!

Acknowledgements

• Some Tika material inspired by Jukka Zitting’s talks
  – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika
  – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630
• NASA Jet Propulsion Laboratory
  – OODT Team

Page 54: Apache Tika: 1 point Oh!

Book

• Jukka and I have finished the first definitive guide on Apache Tika
• Official release date: 11/17
• Early Access available through MEAP program

• http://manning.com/mattmann/

Page 55: Apache Tika: 1 point Oh!

Alright, I’ll shut up now

• Any questions?

• THANK YOU!
  – [email protected]
  – [email protected]
  – @chrismattmann on Twitter