apachecon na 2011 report
Post on 11-May-2015
ApacheCon NA 2011 Report
2011/12/19
@ijokarumawak
About myself
• Nutch
• Cloudera Certified
  – Hadoop Developer
  – Hadoop Administrator
• CouchDB JP
ApacheCon
• http://na11.apachecon.com/
• 2 days training
• 3 days sessions
  – Keynotes, 5 tracks
  – Over 80 sessions
• Slide and audio files
  – http://lanyrd.com/2011/apachecon-north-america/
Why did I go there?
• Because I wanted to!
Nov 3: Left Japan
Nov 5,6: CouchHack
Nov 7: CouchConf Berlin
Nov 9-11: ApacheCon
Nov 12: Apache BarCamp
Nov 14: Came back
Image from: http://en.wikipedia.org/wiki/File:World_map_blank_gmt.svg
Keynote | Building in Security and Innovation
• David A. Wheeler
  – A specialist in developing Secure Open Source Software
• The importance of developing secure software
• Do not make the same mistake
• Learn how to make it secure before you start to develop it
Keynote | The Apache Way Done Right: The Success of Hadoop
• Eric Baldeschwieler
  – co-founder and the CEO of
• History of Hadoop
• Difficulty of leading a huge community
• “Being optimistic and good things will happen.”
Keynote | Watson, a Reasoning System: based on Apache Inside!
• David Boloker
  – CTO of IBM's Emerging Internet Technology group
• IBM’s Watson won Jeopardy
• Commercialization of Watson
  – Its target is the medical field
Lucene/Solr Meet up
• Discussion with core committers of Lucene/Solr
  – Erik Hatcher
  – Chris Hostetter
  – Simon Willnauer
• We are supposed to drink beer, aren't we?
Sessions I attended
• Lucene 4.0 - next generation open source search
  – Simon Willnauer
• Solr Flair
  – Erik Hatcher
• And more… 20 sessions!
• http://www.atware.co.jp/category/column/apachecon-na-2011/
Lucene 4.0- next generation open source search -
by Simon Willnauer
Lucene 4.0 by Simon Willnauer
About the author
• Lucene core committer
• Project Management Committee (PMC) chair
• Berlin Buzzwords co-founder
  – http://berlinbuzzwords.de/
• Community portal targeting OpenSource Search
  – http://www.searchworkings.org/
Lucene 4.0
• The latest is currently Lucene 3.5.0
• When will Lucene 4.0 come out?
  – Any time. He doesn’t know.
IndexWriter & IndexReader
• They talk to a Directory (file system)
• A Directory is just a factory for input and output streams
• From Lucene 4
  – Flex API on the Codec layer
• Codec
  – Defines the file format
  – Data structures
  – Fields, term dictionaries
  – You can use MySQL as a backend
    • (it’s not a good idea though)
• 90% of users won’t touch it
  – The 10% who do might be researchers
• Backward compatibility
(Diagram layers: File System, Directory, Codec, Flex API, IndexWriter & Reader)
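The split between Directory and Codec can be illustrated with a toy model. This is a minimal Python sketch and not Lucene's API: `InMemoryDirectory` and `LineCodec` are hypothetical names invented here, and the "file format" is simply one term per line.

```python
import io
from typing import Dict, List


class InMemoryDirectory:
    """Toy Directory: it only hands out named byte streams."""

    def __init__(self) -> None:
        self._files: Dict[str, io.BytesIO] = {}

    def create_output(self, name: str) -> io.BytesIO:
        stream = io.BytesIO()
        self._files[name] = stream
        return stream

    def open_input(self, name: str) -> io.BytesIO:
        # Return an independent reader positioned at the start of the data.
        return io.BytesIO(self._files[name].getvalue())


class LineCodec:
    """Toy Codec: it alone decides how terms are laid out in a stream."""

    def write_terms(self, out: io.BytesIO, terms: List[str]) -> None:
        out.write("\n".join(sorted(terms)).encode("utf-8"))

    def read_terms(self, inp: io.BytesIO) -> List[str]:
        data = inp.read().decode("utf-8")
        return data.split("\n") if data else []


directory = InMemoryDirectory()
codec = LineCodec()
codec.write_terms(directory.create_output("terms"), ["solr", "lucene"])
print(codec.read_terms(directory.open_input("terms")))
```

Swapping `LineCodec` for another class changes the on-disk layout without touching the directory, which is the kind of separation the slide describes.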
Storing Strings in UTF8
• Lucene 3 uses UTF16
• From Lucene 4, UTF8
• Performance will improve when you switch to Lucene 4
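One plausible contributor to that improvement is size: mostly-ASCII terms need about half as many bytes in UTF-8 as in UTF-16. The following is a minimal Python sketch of that arithmetic only. It does not use Lucene, and the sample terms are made up for illustration.

```python
# Compare the encoded size of index terms in UTF-16 and UTF-8.
# The term list is a made-up sample, not data from the talk.
terms = ["lucene", "solr", "search", "検索"]


def encoded_size(term: str, encoding: str) -> int:
    """Return the number of bytes the term occupies in the given encoding."""
    return len(term.encode(encoding))


for term in terms:
    utf16 = encoded_size(term, "utf-16-le")  # little-endian, no byte-order mark
    utf8 = encoded_size(term, "utf-8")
    print(f"{term}: UTF-16 = {utf16} bytes, UTF-8 = {utf8} bytes")
```

ASCII terms shrink by half, while CJK terms such as the last sample grow from 2 to 3 bytes per character, so the gain depends on the language of the indexed text.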
PostingsFormat
• PostingsFormat can be defined per field
• field:uid = Pulsing PostingsFormat
  – Usually 1 doc per uid
  – Inlines postings into term dictionary
  – Saves an additional disk lookup
• field:spell = Memory PostingsFormat
  – Spelling correction doesn’t need posting list traversal
  – Large amount of key lookups
  – Load terms into RAM
• field:body = Default PostingsFormat
• Primary Key lookup
  – 170K qps -> 550K qps with Memory PostingsFormat
(Diagram labels: Term Dictionary, Posting List, Term, RAM, Terms)
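The "pulsing" idea for the uid field can be sketched as a term dictionary that keeps a single posting inline instead of pointing into a separate posting file. This is a toy Python model, not Lucene's implementation: the class name `PulsingTermDict` and the `lookups` counter are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PulsingTermDict:
    """Toy term dictionary: single-doc terms keep their posting inline."""

    inline: Dict[str, int] = field(default_factory=dict)      # term -> doc id
    pointers: Dict[str, int] = field(default_factory=dict)    # term -> posting file offset
    postings: List[List[int]] = field(default_factory=list)   # stands in for the posting file
    lookups: int = 0                                           # simulated posting file reads

    def add(self, term: str, doc_ids: List[int]) -> None:
        if len(doc_ids) == 1:
            self.inline[term] = doc_ids[0]
        else:
            self.pointers[term] = len(self.postings)
            self.postings.append(list(doc_ids))

    def docs(self, term: str) -> List[int]:
        if term in self.inline:
            return [self.inline[term]]       # answered from the dictionary alone
        if term in self.pointers:
            self.lookups += 1                # one extra read of the posting file
            return self.postings[self.pointers[term]]
        return []


uid = PulsingTermDict()
uid.add("user-1", [0])
uid.add("user-2", [1])
body = PulsingTermDict()
body.add("lucene", [0, 1])

print(uid.docs("user-2"), "posting file reads:", uid.lookups)
print(body.docs("lucene"), "posting file reads:", body.lookups)
```

A primary-key field never triggers the second read in this model, which matches the slide's point that inlining saves a disk lookup when there is usually one document per uid.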
IndexDocValues
• Lucene uses an inverted index (Term to Doc)
  – It is not good at getting the value of a certain field from a document
• Fast access to a certain field’s value for every document
  – To sort documents, or to display a doc’s values and not only its ID
  – Stored Fields
    • It works but it’s not an efficient way
    • It’s designed for bulk read
  – FieldCache (on RAM)
    • Undoes the entire work done at indexing time to build an array (un-inverting)
    • It works well up to a certain index size
    • It can be a problem in real-time or near-real-time use cases
  – IndexDocValues
    • 1 value per field, type safe
    • It can reside on disk
• Reading 10M docs from disk
  – FieldCache: 3161 ms
  – DocValues: 90 ms
(Diagram: a Term pointing to several Docs, captioned “How to sort docs?”)
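The difference between un-inverting and a per-document column can be sketched in a few lines. This is a toy Python model under stated assumptions: the "price" field, its values, and the function name `uninvert` are invented for illustration and are not Lucene APIs.

```python
from typing import Dict, List, Optional

# Inverted index for a hypothetical "price" field: term -> ids of matching docs.
inverted: Dict[str, List[int]] = {"10": [2], "25": [0, 3], "40": [1]}
num_docs = 4


def uninvert(index: Dict[str, List[int]], num_docs: int) -> List[Optional[int]]:
    """Build a doc -> value array by walking every term (FieldCache-style)."""
    values: List[Optional[int]] = [None] * num_docs
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            values[doc_id] = int(term)
    return values


# FieldCache-style: the whole inverted index is walked before sorting can start.
field_cache = uninvert(inverted, num_docs)

# DocValues-style: the column is written at index time, one value per document,
# so opening the index requires no un-inverting step.
doc_values = [25, 40, 10, 25]

hits = [0, 1, 2, 3]
by_price = sorted(hits, key=lambda doc_id: doc_values[doc_id])
print(field_cache, by_price)
```

Both arrays end up identical here. The cost that differs is when the array is built: on every index reopen for the un-inverted version, once at indexing time for the column.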
DWPT (DocumentsWriterPerThread)
• In Lucene 3
  – IndexWriter merges segments and flushes them to disk
  – While flushing data, multi-threaded IndexWriter takes a break
• From Lucene 4
  – IndexWriter doesn’t merge data anymore
  – Each thread flushes its own segment to disk concurrently
  – Less RAM, more concurrency
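The Lucene 4 behaviour described above can be modelled roughly as threads that each own a private buffer and flush it as a separate segment. This is a toy Python sketch, not Lucene code: the `flush_every` threshold, the document names, and the in-memory `segments` list are assumptions made for illustration, and the real flush policy is more involved.

```python
import threading
from typing import List

# Stands in for the segment files on disk; each entry is one flushed segment.
segments: List[List[str]] = []
segments_lock = threading.Lock()


def flush(buffer: List[str]) -> None:
    """Write one thread's private buffer out as its own segment."""
    with segments_lock:  # guards only the shared list in this toy model
        segments.append(list(buffer))


def index_documents(docs: List[str], flush_every: int = 2) -> None:
    """Buffer documents privately and flush without merging with other threads."""
    buffer: List[str] = []
    for doc in docs:
        buffer.append(doc)
        if len(buffer) >= flush_every:
            flush(buffer)
            buffer = []
    if buffer:
        flush(buffer)


threads = [
    threading.Thread(
        target=index_documents,
        args=([f"t{n}-doc{i}" for i in range(3)],),
    )
    for n in range(2)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(segments), "segments,", sum(len(s) for s in segments), "documents")
```

No segment ever mixes documents from two threads, so one thread's flush does not require the other to stop buffering.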
Automaton Query
• Automaton Query
  – RegExp: (ftp|http).*
  – Fuzzy: dogs~1
  – Fuzzy-Prefix: (dogs~1).*
• Fuzzy query was too slow to use in production
  – Prior to 4.0, Fuzzy query took the simple yet horribly costly brute force approach
  – In Lucene 3 this is about 0.1 - 0.2 QPS
  – Now it’s 50 QPS, 20k% improvement!
  – http://java.dzone.com/news/lucenes-fuzzyquery-100-times
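The idea behind a fuzzy query such as dogs~1 can be sketched with a small Levenshtein automaton. This is a minimal Python sketch that uses the dynamic-programming-row formulation, where each state is one row of the edit-distance table. It is not Lucene's implementation, which as far as I know compiles a deterministic automaton and intersects it with the term dictionary. The class, the function `fuzzy_terms`, and the term list are hypothetical.

```python
from typing import List


class LevenshteinAutomaton:
    """Accepts every string within max_edits edits of word."""

    def __init__(self, word: str, max_edits: int) -> None:
        self.word = word
        self.max_edits = max_edits

    def start(self) -> List[int]:
        return list(range(len(self.word) + 1))

    def step(self, state: List[int], ch: str) -> List[int]:
        new_state = [state[0] + 1]
        for i in range(len(state) - 1):
            cost = 0 if self.word[i] == ch else 1
            new_state.append(
                min(new_state[i] + 1, state[i] + cost, state[i + 1] + 1)
            )
        # Values above max_edits are all equivalent, so clamp them.
        return [min(v, self.max_edits + 1) for v in new_state]

    def is_match(self, state: List[int]) -> bool:
        return state[-1] <= self.max_edits

    def can_match(self, state: List[int]) -> bool:
        return min(state) <= self.max_edits


def fuzzy_terms(terms: List[str], word: str, max_edits: int) -> List[str]:
    """Return the terms accepted by the automaton, abandoning hopeless ones early."""
    automaton = LevenshteinAutomaton(word, max_edits)
    matches = []
    for term in terms:
        state = automaton.start()
        for ch in term:
            state = automaton.step(state, ch)
            if not automaton.can_match(state):
                break  # no suffix can bring this term back within range
        else:
            if automaton.is_match(state):
                matches.append(term)
    return matches


terms = ["dog", "dogs", "dots", "docs", "cats", "doges", "frogs"]
print(fuzzy_terms(terms, "dogs", 1))
```

This sketch still visits every term, so it only shows the acceptance test and the early exit. The speedup reported in the talk comes from not enumerating all terms in the first place.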
Solr Flair
by Erik Hatcher
Solr Flair by Erik Hatcher
Solr
Flair
• User Interfaces
• User Interactions
• Ajax suggestion
• Did you mean? – Spell Checking
• Facet
• Cluster
• ... and so on
wt = velocity
• http://wiki.apache.org/solr/VelocityResponseWriter
• Solritas
• /browse
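A request that selects the Velocity response writer is an ordinary Solr query with wt=velocity. The following Python sketch only builds such a URL. The host, port, and `/solr/select` path are assumptions about a local default install, and `v.template` is the template parameter as described on the VelocityResponseWriter wiki page linked above.

```python
from urllib.parse import urlencode


def browse_url(base_url: str, query: str, template: str = "browse") -> str:
    """Build a Solr query URL that asks for Velocity-rendered output."""
    params = {
        "q": query,
        "wt": "velocity",        # select the Velocity response writer
        "v.template": template,  # name of the template to render
    }
    return f"{base_url}?{urlencode(params)}"


# Assumed local Solr endpoint; adjust to your own installation.
url = browse_url("http://localhost:8983/solr/select", "lucene")
print(url)
```

The /browse handler mentioned on the slide bundles these parameters as defaults, so users only need to open that path in a browser.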
Prism
• https://github.com/lucidimagination/Prism
• Requires
  – LucidWorks Enterprise
  – JRuby with Sinatra gem installed
• Production use of LucidWorks Enterprise requires an annual subscription
  – It’s free to play :’)
Blacklight
• http://projectblacklight.org/
• Ruby on Rails
• DEMO
  – http://demo.projectblacklight.org/
• Being used by universities
  – University of Virginia
    • http://search.lib.virginia.edu/catalog?portal=all&q=lucene
  – Stanford University
    • http://searchworks.stanford.edu/?q=lucene+in+action&search_field=search
VuFind
• http://vufind.org/
• Blacklight competitor
• Library resource portal
• PHP
• DEMO
  – http://vufind.org/demo/
TwigKit
• http://twigkit.com/
• JSP tag library
• Search UI components
• Samples
  – http://twigkit.com/components.html
Ajax Solr
• https://github.com/evolvingweb/ajax-solr
• JavaScript library that works with jQuery
• DEMO
  – http://evolvingweb.github.com/ajax-solr/examples/reuters/index.html
ApacheCon 2012
• ApacheCon EUROPE
• November 2012
• Germany!!?
Thank you!