apachecon na 2011 report

ApacheCon NA 2011 Report 2011/12/19 @ijokarumawak

Upload: koji-kawamura

Post on 11-May-2015


TRANSCRIPT

Page 1: ApacheCon NA 2011 report

ApacheCon NA 2011 Report

2011/12/19

@ijokarumawak

Page 2: ApacheCon NA 2011 report

About myself

• Nutch
• Cloudera Certified
  – Hadoop Developer
  – Hadoop Administrator
• CouchDB JP

Page 3: ApacheCon NA 2011 report

ApacheCon

• http://na11.apachecon.com/
• 2 days of training
• 3 days of sessions
  – Keynotes, 5 tracks
  – Over 80 sessions
• Slide and audio files
  – http://lanyrd.com/2011/apachecon-north-america/

Page 4: ApacheCon NA 2011 report

Why did I go there?

• Because I wanted to!

Nov 3: Left Japan
Nov 5, 6: CouchHack
Nov 7: CouchConf Berlin
Nov 9-11: ApacheCon
Nov 12: Apache BarCamp
Nov 14: Came back

Image from: http://en.wikipedia.org/wiki/File:World_map_blank_gmt.svg

Page 5: ApacheCon NA 2011 report

Keynote | Building in Security and Innovation

• David A. Wheeler
  – A specialist in developing secure open source software
• The importance of developing secure software
• Do not make the same mistakes
• Learn how to make software secure before you start developing it

Page 6: ApacheCon NA 2011 report

Keynote | The Apache Way Done Right: The Success of Hadoop

• Eric Baldeschwieler
  – co-founder and the CEO of
• History of Hadoop
• Difficulty of leading a huge community
• “Being optimistic and good things will happen.”

Page 7: ApacheCon NA 2011 report

Keynote | Watson, a Reasoning System: based on Apache Inside!

• David Boloker
  – CTO of IBM's Emerging Internet Technology group
• IBM’s Watson won Jeopardy
• Commercialization of Watson
  – Its target is the medical field

Page 8: ApacheCon NA 2011 report

Lucene/Solr Meetup

• Discussion with core committers of Lucene/Solr
  – Erik Hatcher
  – Chris Hostetter
  – Simon Willnauer
• We are supposed to drink beer, aren't we?

Page 9: ApacheCon NA 2011 report

Sessions I attended

• Lucene 4.0 - next generation open source search
  – Simon Willnauer
• Solr Flair
  – Erik Hatcher
• And more… 20 sessions!
• http://www.atware.co.jp/category/column/apachecon-na-2011/

Page 10: ApacheCon NA 2011 report

Lucene 4.0 - next generation open source search

by Simon Willnauer

Page 11: ApacheCon NA 2011 report

Lucene 4.0 by Simon Willnauer

About the speaker

• Lucene core committer
• Project Management Committee (PMC) chair
• Berlin Buzzwords co-founder
  – http://berlinbuzzwords.de/
• Community portal targeting open source search
  – http://www.searchworkings.org/

Page 12: ApacheCon NA 2011 report

Lucene 4.0 by Simon Willnauer

Lucene 4.0

• The latest release is currently Lucene 3.5.0
• When will Lucene 4.0 come out?
  – Any time. He doesn’t know.

Page 13: ApacheCon NA 2011 report

Lucene 4.0 by Simon Willnauer

IndexWriter & IndexReader

• IndexWriter and IndexReader talk to a Directory (file system)
• A Directory is just a factory for input and output streams
• From Lucene 4
  – Flex API on the Codec layer
• Codec
  – Defines the file format
  – Data structures
  – Fields, term dictionaries
  – You can use MySQL as a backup
    • (it’s not a good idea though)
• 90% of users won’t touch this layer
  – The other 10% might be researchers
• Backward compatibility

Diagram layers: File System / Directory / Codec / Flex API / IndexWriter & Reader
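
The split between a Directory (only hands out streams) and a Codec (decides the file format) can be sketched in a few lines. This is a toy illustration in Python, not Lucene code: `RamDirectory`, `LengthPrefixedCodec`, and `round_trip` are hypothetical names, and the length-prefixed layout is an assumption chosen for brevity.

```python
import io
import struct


class RamDirectory:
    """Toy directory: a factory for byte streams, unaware of any file format."""

    def __init__(self):
        self._files = {}

    def create_output(self, name):
        directory = self

        class _Output(io.BytesIO):
            def close(self):
                # Store the written bytes under the file name on first close.
                if not self.closed:
                    directory._files[name] = self.getvalue()
                super().close()

        return _Output()

    def open_input(self, name):
        return io.BytesIO(self._files[name])


class LengthPrefixedCodec:
    """Toy codec: it alone decides how terms are laid out in the file."""

    def write_terms(self, out, terms):
        out.write(struct.pack(">I", len(terms)))       # term count
        for term in terms:
            data = term.encode("utf-8")
            out.write(struct.pack(">H", len(data)))    # byte length of term
            out.write(data)

    def read_terms(self, inp):
        (count,) = struct.unpack(">I", inp.read(4))
        terms = []
        for _ in range(count):
            (size,) = struct.unpack(">H", inp.read(2))
            terms.append(inp.read(size).decode("utf-8"))
        return terms


def round_trip(terms):
    """Write terms through the codec into the directory and read them back."""
    directory = RamDirectory()
    codec = LengthPrefixedCodec()
    with directory.create_output("terms.bin") as out:
        codec.write_terms(out, terms)
    with directory.open_input("terms.bin") as inp:
        return codec.read_terms(inp)
```

Swapping in a different codec class changes the bytes on "disk" without touching the directory, which is the point of the layering on this slide.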

Page 14: ApacheCon NA 2011 report

Lucene 4.0 by Simon Willnauer

Storing Strings in UTF8

• Lucene 3 uses UTF-16
• From Lucene 4, UTF-8
• Performance will improve when you switch to Lucene 4
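
One visible effect of the encoding change is the size of each term in bytes. The Python sketch below only compares encoded lengths; it does not reproduce Lucene's internals, and `encoded_sizes` is a hypothetical helper name.

```python
def encoded_sizes(term):
    """Return (utf8_bytes, utf16_bytes) for one term."""
    return len(term.encode("utf-8")), len(term.encode("utf-16-le"))


for term in ["lucene", "solr", "検索"]:
    utf8, utf16 = encoded_sizes(term)
    print(f"{term!r}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

ASCII-heavy terms take half the space in UTF-8, while CJK terms can take more, so the benefit depends on the content being indexed.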

Page 15: ApacheCon NA 2011 report

Lucene 4.0 by Simon Willnauer

PostingsFormat

• PostingsFormat can be defined per field
• field:uid = Pulsing PostingsFormat
  – Usually 1 doc per uid
  – Inlines postings into the term dictionary
  – Saves an additional disk lookup
• field:spell = Memory PostingsFormat
  – Spelling correction doesn’t need posting list traversal
  – Large number of key lookups
  – Loads terms into RAM
• field:body = Default PostingsFormat
• Primary key lookup
  – 170K qps -> 550K qps with Memory PostingsFormat

Diagram labels: Term Dictionary, Posting List, Term, RAM, Terms
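
The idea behind the Pulsing format (keep a single posting inside the term dictionary so that no second lookup is needed) can be shown with a toy model. This Python sketch is an illustration under simplifying assumptions, not Lucene's implementation; the class names, the `lookups` counter, and the `FORMATS` mapping are all hypothetical.

```python
class DefaultPostings:
    """Term dictionary holds a pointer; posting lists live in a separate store."""

    def __init__(self):
        self.term_dict = {}   # term -> index into self.postings
        self.postings = []    # separate store of posting lists
        self.lookups = 0      # counts hops from dictionary to posting store

    def add(self, term, doc_id):
        if term not in self.term_dict:
            self.term_dict[term] = len(self.postings)
            self.postings.append([])
        self.postings[self.term_dict[term]].append(doc_id)

    def docs(self, term):
        pointer = self.term_dict.get(term)
        if pointer is None:
            return []
        self.lookups += 1     # second hop: follow the pointer
        return list(self.postings[pointer])


class PulsingPostings(DefaultPostings):
    """Inline the posting into the dictionary while a term has only one doc."""

    def __init__(self):
        super().__init__()
        self.inlined = {}     # term -> single doc id kept in the dictionary

    def add(self, term, doc_id):
        if term not in self.inlined and term not in self.term_dict:
            self.inlined[term] = doc_id
            return
        if term in self.inlined:
            # A second doc arrived: fall back to a regular posting list.
            super().add(term, self.inlined.pop(term))
        super().add(term, doc_id)

    def docs(self, term):
        if term in self.inlined:
            return [self.inlined[term]]   # answered from the dictionary alone
        return super().docs(term)


# Hypothetical per-field configuration, mirroring two fields from the slide.
FORMATS = {"uid": PulsingPostings, "body": DefaultPostings}


def build_index(docs):
    index = {field: fmt() for field, fmt in FORMATS.items()}
    for doc_id, doc in enumerate(docs):
        index["uid"].add(doc["uid"], doc_id)
        for word in sorted(set(doc["body"].split())):
            index["body"].add(word, doc_id)
    return index
```

A primary-key lookup on `uid` never touches the posting store in this model, while a `body` term always pays one extra hop.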

Page 16: ApacheCon NA 2011 report

Lucene 4.0 by Simon Willnauer

IndexDocValues

• Lucene uses an inverted index (Term to Doc)
  – It is not good at getting the value of a certain field from a document
• Fast access to a certain field’s value for every document
  – To sort documents, or to display a doc’s values and not only its ID
  – Stored Fields
    • It works, but it is not an efficient way
    • It is designed for bulk reads
  – FieldCache (in RAM)
    • Undoes the entire work done at indexing time to make an array (un-inverting)
    • It works well up to a certain index size
    • It can be a problem in real-time or near-real-time use cases
  – IndexDocValues
    • 1 value per field, type safe
    • It can reside on disk
• Reading 10M docs from disk
  – FieldCache: 3161 ms
  – DocValues: 90 ms

Diagram: one Term pointing to Doc, Doc, Doc. How to sort docs?
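
The difference between un-inverting (FieldCache) and a per-document column written at index time (DocValues) can be sketched as follows. This is a minimal Python model with hypothetical function names, assuming one value per document for the field; it says nothing about Lucene's actual on-disk layout.

```python
def build_inverted(docs, field):
    """Inverted index: term -> list of doc ids containing it."""
    inverted = {}
    for doc_id, doc in enumerate(docs):
        inverted.setdefault(doc[field], []).append(doc_id)
    return inverted


def uninvert(inverted, num_docs):
    """FieldCache-style: walk every term to rebuild a doc -> value array."""
    values = [None] * num_docs
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            values[doc_id] = term
    return values


def build_doc_values(docs, field):
    """DocValues-style: the doc -> value column is produced at index time."""
    return [doc[field] for doc in docs]


def sort_by_field(doc_values):
    """Sort doc ids by field value, reading the column directly."""
    return sorted(range(len(doc_values)), key=lambda doc_id: doc_values[doc_id])
```

Both paths end with the same array; the un-inverting path has to touch every term and posting first, which is the cost the slide describes.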

Page 17: ApacheCon NA 2011 report

Lucene 4.0 by Simon Willnauer

DWPT (DocumentsWriterPerThread)

• In Lucene 3
  – IndexWriter merges segments and flushes them to disk
  – While data is being flushed, the multi-threaded IndexWriter takes a break
• From Lucene 4
  – IndexWriter doesn’t merge data anymore
  – Each writer thread flushes its own segment to disk concurrently
  – Less RAM, more concurrency
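
The per-thread idea can be modeled with plain threads: each writer owns a private buffer and publishes it as its own segment, so threads only meet when a finished segment is handed over. This Python sketch is a toy under those assumptions; `PerThreadWriter`, `index_concurrently`, and `max_buffered` are hypothetical names, and an in-memory list stands in for the disk.

```python
import threading


class PerThreadWriter:
    """Toy DWPT: a private buffer that is flushed as its own segment."""

    def __init__(self, segments, segments_lock, max_buffered):
        self._segments = segments
        self._lock = segments_lock
        self._max = max_buffered
        self._buffer = []

    def add(self, doc):
        self._buffer.append(doc)          # no shared state is touched here
        if len(self._buffer) >= self._max:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        segment, self._buffer = self._buffer, []
        with self._lock:                  # only publishing the segment is shared
            self._segments.append(segment)


def index_concurrently(batches, max_buffered=2):
    """Index each batch on its own thread and return the flushed segments."""
    segments, lock = [], threading.Lock()

    def work(batch):
        writer = PerThreadWriter(segments, lock, max_buffered)
        for doc in batch:
            writer.add(doc)
        writer.flush()                    # flush whatever is left in the buffer

    threads = [threading.Thread(target=work, args=(batch,)) for batch in batches]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return segments
```

The order of segments in the result depends on thread scheduling, so callers should not rely on it.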

Page 18: ApacheCon NA 2011 report

Lucene 4.0 by Simon Willnauer

Automaton Query

• Automaton Query
  – RegExp: (ftp|http).*
  – Fuzzy: dogs~1
  – Fuzzy-Prefix: (dogs~1).*
• Fuzzy query was too slow to use in production
  – Prior to 4.0, fuzzy query took the simple yet horribly costly brute-force approach
  – In Lucene 3 this is about 0.1 - 0.2 QPS
  – Now it’s 50 QPS, 20k% improvement!
  – http://java.dzone.com/news/lucenes-fuzzyquery-100-times
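
To see why the brute-force approach is costly, here is a minimal Python sketch that compares the query against every term in the dictionary using edit distance, plus a whole-term regular expression match for the RegExp example. It deliberately does not build an automaton; Lucene 4 instead compiles the query into an automaton and intersects it with the term dictionary. All function names and the sample term list are hypothetical.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_brute_force(terms, query, max_edits):
    """Brute force: compute the distance to every term in the dictionary."""
    return [t for t in terms if edit_distance(t, query) <= max_edits]


def regexp_terms(terms, pattern):
    """Keep terms that match the whole pattern, e.g. (ftp|http).*"""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]
```

For example, `fuzzy_brute_force(terms, "dogs", 1)` corresponds to `dogs~1`; its cost grows with the number of distinct terms, because every term is visited.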

Page 19: ApacheCon NA 2011 report

Solr Flair

by Erik Hatcher

Page 20: ApacheCon NA 2011 report

Solr Flair by Erik Hatcher

Solr

Flair

User Interfaces
User Interactions
Ajax suggestion
Did you mean? – Spell Checking
Facet
Cluster
... and so on

Page 21: ApacheCon NA 2011 report

Solr Flair by Erik Hatcher

wt = velocity

• http://wiki.apache.org/solr/VelocityResponseWriter
• Solritas
• /browse
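
A request that selects the Velocity response writer is an ordinary Solr query with `wt=velocity`. The Python sketch below only builds such a URL. The host, port, and core path are assumptions for a local install, and `v.template=browse` assumes a template named `browse` exists in that setup; `velocity_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def velocity_url(base, query, template="browse"):
    """Build a select URL that asks Solr to render results through Velocity."""
    params = {"q": query, "wt": "velocity", "v.template": template}
    return f"{base}/select?{urlencode(params)}"


# Assumed local Solr address, for illustration only.
print(velocity_url("http://localhost:8983/solr", "lucene"))
```

The `/browse` handler mentioned on the slide bundles these parameters as defaults, so users do not have to type them.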

Page 22: ApacheCon NA 2011 report

Solr Flair by Erik Hatcher

Prism

• https://github.com/lucidimagination/Prism
• Requires
  – LucidWorks Enterprise
  – JRuby with the Sinatra gem installed
• Production use of LucidWorks Enterprise requires an annual subscription
  – It’s free to play :’)

Page 23: ApacheCon NA 2011 report

Solr Flair by Erik Hatcher

blacklight

• http://projectblacklight.org/
• Ruby on Rails
• DEMO
  – http://demo.projectblacklight.org/
• Being used by universities
  – University of Virginia
    • http://search.lib.virginia.edu/catalog?portal=all&q=lucene
  – Stanford University
    • http://searchworks.stanford.edu/?q=lucene+in+action&search_field=search

Page 24: ApacheCon NA 2011 report

Solr Flair by Erik Hatcher

VUFind

• http://vufind.org/
• blacklight competitor
• Library resource portal
• PHP
• DEMO
  – http://vufind.org/demo/

Page 25: ApacheCon NA 2011 report

Solr Flair by Erik Hatcher

TwigKit

• http://twigkit.com/
• JSP tag library
• Search UI components
• Samples
  – http://twigkit.com/components.html

Page 26: ApacheCon NA 2011 report

Solr Flair by Erik Hatcher

Ajax Solr

• https://github.com/evolvingweb/ajax-solr
• JavaScript library that works with jQuery
• DEMO
  – http://evolvingweb.github.com/ajax-solr/examples/reuters/index.html

Page 27: ApacheCon NA 2011 report

ApacheCon 2012

• ApacheCon EUROPE
• November 2012
• Germany!!?

Page 28: ApacheCon NA 2011 report

Thank you!