ApacheCon NA 2011 Report

2011/12/19

@ijokarumawak

About myself

• Nutch
• Cloudera Certified
– Hadoop Developer
– Hadoop Administrator
• CouchDB JP

ApacheCon

• http://na11.apachecon.com/
• 2 days of training
• 3 days of sessions
– Keynotes, 5 tracks
– Over 80 sessions
• Slides and audio files
– http://lanyrd.com/2011/apachecon-north-america/

Why did I go there?

• Because I wanted to!

Nov 3: Left Japan
Nov 5,6: CouchHack
Nov 7: CouchConf Berlin
Nov 9-11: ApacheCon
Nov 12: Apache BarCamp
Nov 14: Came back

Image from: http://en.wikipedia.org/wiki/File:World_map_blank_gmt.svg

Keynote | Building in Security and Innovation

• David A. Wheeler
– A specialist in developing Secure Open Source Software
• The importance of developing secure software
• Do not make the same mistakes
• Learn how to make it secure before you start developing it

Keynote | The Apache Way Done Right: The Success of Hadoop

• Eric Baldeschwieler
– co-founder and the CEO of
• History of Hadoop
• Difficulty of leading a huge community
• “Being optimistic and good things will happen.”

Keynote | Watson, a Reasoning System: based on Apache Inside!

• David Boloker
– CTO of IBM's Emerging Internet Technology group
• IBM’s Watson won Jeopardy
• Commercialization of Watson
– Its target is the medical field

Lucene/Solr Meet up

• Discussion with core committers of Lucene/Solr
– Erik Hatcher
– Chris Hostetter
– Simon Willnauer

• We are supposed to drink beer, aren't we?

Sessions I attended

• Lucene 4.0 - next generation open source search
– Simon Willnauer
• Solr Flair
– Erik Hatcher
• And more… 20 sessions!
• http://www.atware.co.jp/category/column/apachecon-na-2011/

Lucene 4.0 - next generation open source search -

by Simon Willnauer

About the speaker

• Lucene core committer
• Project Management Committee (PMC) chair
• Berlin Buzzwords co-founder
– http://berlinbuzzwords.de/
• Community portal targeting open source search
– http://www.searchworkings.org/


Lucene 4.0

• The latest release is currently Lucene 3.5.0
• When will Lucene 4.0 come out?
– Any time. He doesn’t know.


IndexWriter & IndexReader

• They talk to a Directory (file system)
• A Directory is just a factory for input and output streams
• From Lucene 4
– Flex API on the Codec layer
• Codec
– Defines the file format
– Data structures
– Fields, term dictionaries
– You can use MySQL as a backend
• (it’s not a good idea though)
• 90% of users won’t touch it
– The other 10% might be researchers

• Backward compatibility

[Figure: layer stack of File System, Directory, Codec, Flex API, IndexWriter & Reader]


Storing Strings in UTF8

• Lucene 3 uses UTF16
• From Lucene 4, UTF8
• Performance will improve when you switch to Lucene 4
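
A rough illustration of why the encoding matters, not taken from the talk: for ASCII-heavy terms, UTF8 needs about half the bytes of UTF16. The helper name `encoded_sizes` and the sample terms are assumptions made for this sketch, and it only measures encoded size. It says nothing about how Lucene lays out terms internally.

```python
def encoded_sizes(term: str) -> dict:
    """Return the byte length of a term in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf8": len(term.encode("utf-8")),
        "utf16": len(term.encode("utf-16-le")),
    }


# Sample terms chosen for illustration only.
ascii_sizes = encoded_sizes("lucene")
cjk_sizes = encoded_sizes("検索")

print("lucene", ascii_sizes)
print("検索", cjk_sizes)
```

Note that the saving depends on the text: the two-character CJK term above is larger in UTF8 than in UTF16.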


PostingsFormat

• PostingsFormat can be defined per field
• field:uid = Pulsing PostingsFormat
– Usually 1 doc per uid
– Inlines postings into the term dictionary
– Saves an additional disk lookup
• field:spell = Memory PostingsFormat
– Spelling correction doesn’t need posting list traversal
– Large number of key lookups
– Loads terms into RAM
• field:body = Default PostingsFormat
• Primary key lookup
– 170K qps -> 550K qps with Memory PostingsFormat
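
The pulsing idea above can be sketched with a toy term dictionary. This is a minimal illustration under stated assumptions, not Lucene code: `ToyTermDict`, `inline_max`, and `postings_reads` are hypothetical names, and a Python list stands in for the separate postings file on disk.

```python
class ToyTermDict:
    """Toy term dictionary that inlines short posting lists (the 'pulsing' idea)."""

    def __init__(self, term_to_docs, inline_max=1):
        self.entries = {}
        self.postings_file = []  # stands in for a separate on-disk postings file
        self.postings_reads = 0  # counts the extra lookups we had to make
        for term, doc_ids in sorted(term_to_docs.items()):
            if len(doc_ids) <= inline_max:
                # Short list: store the doc ids directly in the dictionary entry.
                self.entries[term] = ("inline", tuple(doc_ids))
            else:
                # Long list: store only a pointer into the postings file.
                self.entries[term] = ("pointer", len(self.postings_file))
                self.postings_file.append(tuple(doc_ids))

    def lookup(self, term):
        kind, value = self.entries[term]
        if kind == "inline":
            return list(value)
        self.postings_reads += 1  # second lookup, avoided for inlined terms
        return list(self.postings_file[value])


term_dict = ToyTermDict({"uid-1": [0], "uid-2": [1], "lucene": [0, 1]})
uid_docs = term_dict.lookup("uid-1")
reads_after_uid = term_dict.postings_reads
body_docs = term_dict.lookup("lucene")
reads_after_body = term_dict.postings_reads
```

A unique key such as uid has one document per term, so every lookup is answered from the dictionary entry alone.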

[Figure: Term Dictionary, Posting List, and Terms in RAM]


IndexDocValues

• Lucene uses an inverted index (Term to Doc)
– It’s not good at getting the value of a certain field from a document
• Fast access to a certain field’s value for every document
– To sort documents, or to display a doc’s values and not only its ID
– Stored Fields
• It works but it’s not an efficient way
• It’s designed for bulk read
– FieldCache (in RAM)
• Undoes the entire work done at indexing time to make an array (un-inverting)
• It works well up to a certain index size
• It can be a problem in real-time or near-real-time use cases
– IndexDocValues
• 1 value per field, type safe
• It can reside on disk
• Reading 10M docs from disk
– FieldCache: 3161 ms
– DocValues: 90 ms
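
The difference between un-inverting and a per-document column can be sketched as follows. This is a toy model, not Lucene's implementation: `uninvert`, `inverted_price`, and `doc_values` are names invented for the example, and the data is made up.

```python
def uninvert(inverted, num_docs):
    """Rebuild a doc -> value array from a term -> doc ids mapping (FieldCache style)."""
    values = [None] * num_docs
    for term, doc_ids in inverted.items():  # walks every term and every posting
        for doc_id in doc_ids:
            values[doc_id] = term
    return values


# Inverted view of a hypothetical "price" field: term -> documents containing it.
inverted_price = {"10": [2], "25": [0, 3], "40": [1]}

# FieldCache approach: the array is derived at search time by un-inverting.
field_cache = uninvert(inverted_price, num_docs=4)

# DocValues approach: the same column is written at index time, so it can simply be read.
doc_values = ["25", "40", "10", "25"]

# Either array answers "how to sort docs?" with one lookup per document.
sorted_docs = sorted(range(4), key=lambda doc_id: int(doc_values[doc_id]))
```

Both arrays hold the same data. The cost difference is in how they come to exist: one is rebuilt from the whole inverted index, the other is stored as a column up front.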

[Figure: a Term pointing to several Docs, captioned “How to sort docs?”]


DWPT (DocumentsWriterPerThread)

• In Lucene 3
– IndexWriter merges segments and flushes them to disk
– While flushing data, the multi-threaded IndexWriter takes a break
• From Lucene 4
– IndexWriter doesn’t merge data anymore
– Each thread flushes its own segment to disk concurrently
– Less RAM, more concurrency
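
A minimal sketch of the per-thread idea, assuming a much simplified model: each worker keeps a private buffer and flushes its own segments, and the only shared state is the list of finished segments. `index_with_private_buffers` and `flush_every` are hypothetical names, and nothing here is Lucene's API.

```python
import threading


def index_with_private_buffers(batches, flush_every=2):
    """Index each batch on its own thread, each flushing its own segments."""
    segments = []
    publish_lock = threading.Lock()  # guards only the list of finished segments

    def publish(thread_id, buffer):
        with publish_lock:
            segments.append((thread_id, tuple(buffer)))

    def worker(thread_id, docs):
        buffer = []  # private to this thread, so no locking while buffering
        for doc in docs:
            buffer.append(doc)
            if len(buffer) >= flush_every:
                publish(thread_id, buffer)  # flush without stopping other threads
                buffer = []
        if buffer:
            publish(thread_id, buffer)

    threads = [
        threading.Thread(target=worker, args=(thread_id, docs))
        for thread_id, docs in enumerate(batches)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(segments)  # sorted so the result does not depend on timing


flushed = index_with_private_buffers([["a", "b", "c"], ["d", "e"]])
```

The design point is that a flush by one thread never forces the others to pause, which is the stall the Lucene 3 bullets describe.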


Automaton Query

• Automaton Query
– RegExp: (ftp|http).*
– Fuzzy: dogs~1
– Fuzzy-Prefix: (dogs~1).*
• Fuzzy query was too slow to use in production
– Prior to 4.0, fuzzy query took the simple yet horribly costly brute-force approach
– In Lucene 3 this is about 0.1 - 0.2 QPS
– Now it’s 50 QPS, 20k% improvement!
– http://java.dzone.com/news/lucenes-fuzzyquery-100-times
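
To show why the brute-force approach is costly, here is a sketch that compares the query against every term in a toy dictionary. It illustrates the pre-4.0 strategy only in spirit. The function names and the term list are assumptions, and the automaton-based approach of Lucene 4 is not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance using the classic row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # delete char_a
                cur[j - 1] + 1,                    # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute, or match for free
            ))
        prev = cur
    return prev[-1]


def brute_force_fuzzy(query, terms, max_edits=1):
    """Scan every term and keep those within max_edits of the query."""
    return [term for term in terms if edit_distance(query, term) <= max_edits]


# Toy term dictionary, corresponding to the query dogs~1 on the slide.
terms = ["dog", "dogs", "dots", "docs", "cats", "doges"]
matches = brute_force_fuzzy("dogs", terms)
```

The work grows with the number of terms in the index, since each one gets its own distance computation.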

Solr Flair

by Erik Hatcher

Solr

Flair

• User Interfaces
• User Interactions
• Ajax suggestion
• Did you mean? – Spell Checking
• Facet
• Cluster
• ... and so on


wt = velocity

• http://wiki.apache.org/solr/VelocityResponseWriter

• Solritas
• /browse
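
A small sketch of what a request using this response writer might look like. The host, port, and `/select` path are assumptions about a local example installation, and `build_velocity_url` is a hypothetical helper. Only the `wt=velocity` parameter comes from the slide; template options are described on the wiki page above.

```python
from urllib.parse import urlencode


def build_velocity_url(query, base="http://localhost:8983/solr/select"):
    """Build a Solr request URL that asks for the Velocity response writer."""
    params = {"q": query, "wt": "velocity"}
    return base + "?" + urlencode(params)


url = build_velocity_url("lucene")
print(url)
```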



Prism

• https://github.com/lucidimagination/Prism
• Requires
– LucidWorks Enterprise
– JRuby with the Sinatra gem installed
• Production use of LucidWorks Enterprise requires an annual subscription
– It’s free to try :’)



Blacklight

• http://projectblacklight.org/
• Ruby on Rails
• DEMO
– http://demo.projectblacklight.org/
• Being used by universities
– University of Virginia
• http://search.lib.virginia.edu/catalog?portal=all&q=lucene
– Stanford University
• http://searchworks.stanford.edu/?q=lucene+in+action&search_field=search



VuFind

• http://vufind.org/
• Blacklight competitor
• Library resource portal
• PHP
• DEMO
– http://vufind.org/demo/



TwigKit

• http://twigkit.com/
• JSP tag library
• Search UI components
• Samples
– http://twigkit.com/components.html



Ajax Solr

• https://github.com/evolvingweb/ajax-solr
• JavaScript library that works with jQuery
• DEMO
– http://evolvingweb.github.com/ajax-solr/examples/reuters/index.html


ApacheCon 2012

• ApacheCon EUROPE
• November 2012
• Germany!!?

Thank you!
