IBM Watson & Open Source Software - LinuxCon 2012
Presented at LinuxCon 2012 - San Diego, CA
Watson & Open Source Software
Ivan Portilla, IT Architect, 8/29/12
If I have seen further it is by standing on the shoulders of giants.
Isaac Newton, Letter to Robert Hooke, February 5, 1675
Objectives
By the end of this session, you should be able to:
✓ Describe the main characteristics of the Watson QA system.
✓ Identify the key open source software used in Watson.
✓ Recognize examples of Agile development best practices.
Disclaimers
Disclaimer 1
✓ This presentation represents the view of the author and does not represent the view of IBM.
✓ All opinions expressed in this presentation are strictly those of the speaker, and do NOT represent those of IBM, IBM management, or anyone else.
✓ IBM and IBM (logo) are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.
Disclaimer 2
I (We) do not work for the Watson team.
Jeopardy! The game
Game View
Let’s Play Jeopardy
BEFORE & AFTER: The Jerry Maguire star who automatically maintains your vehicle’s speed.
COMMON BONDS: trout, loose change in your pocket, and compliments.
Diplomatic Relations: Of the four countries in the world that the United States does not have diplomatic relations with, the one that’s farthest north.
Geography: Chile shares its longest land border with this country.
Natural Language Processing
Understanding natural language is hard!
Watson education
A Brief History of Watson
• Deep Blue ended in 1997
• Looking for a new research challenge
• 2004, IBM Research manager Charles Lickel
• Ken Jennings
• Started in 2005
• David Ferrucci
• DeepQA in 2007
• Won Jeopardy! match, Feb 2011
[Figure: Precision (y-axis, 0% to 100%) vs. % Answered (x-axis, 0% to 100%), with one curve each for the Baseline and for 12/2007, 5/2008, 8/2008, 12/2008, 5/2009, 10/2009, 4/2010, and 11/2010]
What is Watson?
✓ Understands natural language.
✓ Generates & evaluates hypotheses for better outcomes.
✓ Adapts & learns from user selections and responses.
http://www.ibm.com/innovation/us/watson/
Watson metrics
Development Team: 25 people
Project Duration: 4 years
Software: 1,000,000 SLOC (700K Java, 300K C++), ~130 components
Hardware: 90 IBM Power 750 servers; 2880 POWER7 cores @ 80+ TFLOPS; 20 TB disk, 16 TB RAM (memory); 10 Gbps network
http://na11.apachecon.com/talks/19932
Open Source Software
OSS - Linux
http://video.linux.com/videos/linuxcon-vancouver-day-2-1/
Open Source too
http://ti.arc.nasa.gov/opensource/
http://ocw.mit.edu/index.htm
How does it work?
Learning To Rank (Basic Architecture)
Watson Architecture
http://en.wikipedia.org/wiki/Watson_(computer)
Who is the 44th President of the United States?
Watson by R. Yates
Question & Topic Analysis
Who is the 44th President of the United States?
'Who' is the '44th' 'President' of the 'United States'?
Lexical Answer Type → Person
Focus: can be replaced by the correct answer to make a true statement
Keywords
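The analysis step above can be sketched in a few lines of Python. This is a toy illustration and not Watson's implementation: the `WH_WORD_TYPES` table, the `STOP_WORDS` list, and the `analyze_question` function are hypothetical names introduced here, and the sketch assumes the question word is both the focus and the only clue to the lexical answer type.

```python
import re

# Hypothetical lookup from a question word to a lexical answer type (LAT).
WH_WORD_TYPES = {"who": "Person", "where": "Place", "when": "Date"}

# Hypothetical stop-word list used to separate keywords from filler words.
STOP_WORDS = {"is", "the", "of", "a", "an"}


def analyze_question(question):
    """Return the focus, lexical answer type, and keywords of a question."""
    tokens = re.findall(r"[A-Za-z0-9]+", question)
    focus = tokens[0]  # assumption: the leading question word is the focus
    answer_type = WH_WORD_TYPES.get(focus.lower(), "Unknown")
    keywords = [t for t in tokens[1:] if t.lower() not in STOP_WORDS]
    return {"focus": focus, "lat": answer_type, "keywords": keywords}


analysis = analyze_question("Who is the 44th President of the United States?")
print(analysis)
```

A lookup table only works for simple questions like this one; Jeopardy! clues rarely start with a question word, so a real system would need parsing to find the focus.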
Primary Search
Candidate answers: Barack Obama, George W. Bush, Harvard Law School, Illinois
Who is the 44th President of the United States? Barack Obama
Who is the 44th President of the United States? George W. Bush
Who is the 44th President of the United States? Harvard Law School
Who is the 44th President of the United States? Illinois
Who is the 44th President of the United States? → Person
Is Barack Obama a Person? .90
Is George W. Bush a Person? .90
Is Harvard Law School a Person? .10
Is Illinois a Person? .15
Barack Obama is the 44th President of the United States
George W. Bush is the 44th President of the United States
Harvard Law School is the 44th President of the United States
Illinois is the 44th President of the United States
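The statements above come from replacing the focus of the question with each candidate answer. A minimal sketch of that substitution, assuming the focus is the literal word "Who" and using a hypothetical helper named `make_hypotheses`:

```python
def make_hypotheses(question, focus, candidates):
    """Replace the focus with each candidate to form a declarative statement."""
    # Drop the trailing question mark, then substitute the first occurrence
    # of the focus only.
    template = question.rstrip("?").strip()
    return [template.replace(focus, candidate, 1) for candidate in candidates]


question = "Who is the 44th President of the United States?"
candidates = ["Barack Obama", "George W. Bush", "Harvard Law School", "Illinois"]
hypotheses = make_hypotheses(question, "Who", candidates)
for hypothesis in hypotheses:
    print(hypothesis)
```

Each resulting statement can then be checked against evidence passages, such as the encyclopedia sentences that follow.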
Barack Hussein Obama II (/bəˈrɑːk huːˈseɪn oʊˈbɑːmə/; born August 4, 1961) is the 44th and current President of the United States.
George Walker Bush (born July 6, 1946) is an American politician who served as the 43rd President of the United States from 2001 to 2009 and the 46th Governor of Texas from 1995 to 2000.
Barack Obama .95
George W. Bush .80
Harvard Law School .05
Illinois .10
Evidence Retrieval
Who is the 44th President of the United States?

Candidate Answer      Answer Scoring   Evidence Retrieval & Scoring   Confidence
Barack Obama          0.90             0.90                           .95
George W. Bush        0.90             0.80                           .65
Harvard Law School    0.10             0.05                           .05
Illinois              0.15             0.10                           .10
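The scores above can be held in a small data structure and ranked by final confidence. The sketch below only sorts the numbers shown on the slide; the `Candidate` type and its field names are hypothetical, and how Watson derives the confidence column from the other two is not reproduced here.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    answer: str
    answer_score: float    # "Answer Scoring" column
    evidence_score: float  # "Evidence retrieval & scoring" column
    confidence: float      # "Confidence" column


# Values copied from the slide.
candidates = [
    Candidate("Barack Obama", 0.90, 0.90, 0.95),
    Candidate("George W. Bush", 0.90, 0.80, 0.65),
    Candidate("Harvard Law School", 0.10, 0.05, 0.05),
    Candidate("Illinois", 0.15, 0.10, 0.10),
]

# Rank candidates from most to least confident and pick the top one.
ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
best = ranked[0]
print(best.answer, best.confidence)
```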
[Figure: DeepQA pipeline. Question → Question & Topic Analysis → Question Decomposition → Hypothesis Generation (Primary Search and Candidate Answer Generation over Answer Sources) → Hypothesis and Evidence Scoring (Answer Scoring, Evidence Retrieval & Scoring over Evidence Sources) → Synthesis → Final Confidence Merging & Ranking → Answer & Confidence. Learned Models help combine and weigh the Evidence (Balance & Combine).]
DeepQA Massively Parallel Probabilistic Evidence-Based Architecture
ApacheCon 2011, Watson, a Reasoning System: based on Apache Inside!, David Boloker
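The slide's note that learned models help combine and weigh the evidence can be illustrated with a small scoring function. This is a sketch under stated assumptions, not Watson's actual model: a logistic function is assumed for illustration, and the `WEIGHTS` and `BIAS` values are made up rather than learned from data.

```python
import math

# Hypothetical weights and bias. In a learned model these would be fit on
# training questions with known answers; the values here are made up.
WEIGHTS = {"answer_score": 2.0, "evidence_score": 3.0}
BIAS = -2.5


def confidence(features):
    """Combine feature scores into one confidence with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


scores = {
    "Barack Obama": {"answer_score": 0.90, "evidence_score": 0.90},
    "Illinois": {"answer_score": 0.15, "evidence_score": 0.10},
}
confidences = {answer: confidence(f) for answer, f in scores.items()}
ranking = sorted(confidences, key=confidences.get, reverse=True)
print(ranking)
```

Because the weights are invented, the confidences this prints will not match the values on the slides; the point is only that many per-feature scores collapse into one number per candidate, which is then used for ranking.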
OSS in Watson 1.0
OSS in Watson
✓ UIMA, UIMA-AS
✓ Hadoop, MapReduce
✓ Lucene, Indri
ApacheCon 2011, Watson, a Reasoning System: based on Apache Inside!, David Boloker
UIMA
http://uima.apache.org/
UIMA-Asynchronous Scaleout
UIMA AS provides more flexible and powerful scale-out capability.
Innovate 2011, How Does It Work? The Architecture of Watson. Grady Booch
Think Hadoop
A framework for storing & processing big data.
✓ 4,000 machines
✓ 20 PB
High reliability done in software:
✓ Automated failover for data & computation
✓ Implemented in Java
http://hadoop.apache.org/mapreduce/
MapReduce
http://hadoop.apache.org/mapreduce/
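The MapReduce model can be sketched without a cluster. The example below is plain Python and does not use the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and word counting is used only because it is the customary introductory example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


documents = ["Watson uses Hadoop", "Hadoop uses MapReduce"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In Hadoop the map and reduce calls run on different machines and the framework performs the grouping step, which is what lets the same program scale to thousands of nodes.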
UIMA pipelines in Hadoop
http://blogs.apache.org/foundation/entry/apache_innovation_bolsters_ibm_s
[Figure: Input in the Hadoop Distributed File System is divided into ~5000 "splits" and fed to ~50-100 "mappers". Each mapper runs multiple threads, each running a UIMA pipeline (~400-800 threads in total). Mapper output passes through Shuffle/Sort to the Reducers, which write Output to the Hadoop Distributed File System.]
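The idea of one mapper running several threads, each applying an analysis pipeline, can be sketched with Python's standard thread pool. This is not UIMA or Hadoop code: `run_pipeline` is a hypothetical stand-in that only counts tokens, and the thread count of 4 is an arbitrary choice for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(document):
    """Hypothetical stand-in for an analysis pipeline: tokenize and count."""
    tokens = document.split()
    return {"text": document, "token_count": len(tokens)}


def mapper(split, num_threads=4):
    """Process one input split with several threads, one pipeline call per document."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(run_pipeline, split))


split = ["Who is the 44th President", "Chile shares its longest land border"]
results = mapper(split)
print([r["token_count"] for r in results])
```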
Architecture
http://lucene.apache.org
Indri & Lemur
Indri is a text search engine developed at UMass & CMU. Indri is part of the Lemur project.
[Figure: Indri architecture. runquery submits a query such as #combine(#2(george bush).title) to a QueryEnvironment, which reaches a LocalServer directly and remote indrid daemons (each a NetworkServerStub in front of a LocalServer) through NetworkServerProxy objects.]
http://lemurproject.org/indri/
© 2011 IBM Corporation
Watson answers in 2-6 seconds
[Figure: DeepQA pipeline annotated with scale. One Question has Multiple Interpretations and draws on 100s of sources to produce 100s of Possible Answers, 1000s of Pieces of Evidence, and 100,000s of scores from many simultaneous Text Analysis Algorithms. Stages: Question & Topic Analysis, Question Decomposition, Hypothesis Generation, Hypothesis and Evidence Scoring, Synthesis, Final Confidence Merging & Ranking, Answer & Confidence.]
ApacheCon 2011, Watson, a Reasoning System: based on Apache Inside!, David Boloker
Other OSS in Watson
https://www.ibm.com/developerworks/mydeveloperworks/blogs/InsideSystemStorage/entry/ibm_watson_how_to_build_your_own_watson_jr_in_your_basement7?lang=en
J-Archive data
http://www.j-archive.com/showgame.php?game_id=3577
Development Process
✓ War room setting with continuous collaboration.
✓ Weekly integration.
✓ Results driven with E2E regression testing.
✓ About 8,000 experiments
✓ 10 GB of test data per week
✓ Agile development
Innovate 2011, How Does It Work? The Architecture of Watson. Grady Booch
Related Materials
http://www.apache.org
http://manning.com/
http://oreilly.com/
www.caltech.edu
Take Away
OSS is powerful and scalable enough for the Watson team. What about your project?
Resources
IBM Journal of Research and Development
http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6177717
IBM Watson
http://www.ibm.com/innovation/us/watson/
http://www.research.ibm.com/deepqa/index.shtml
Nova
http://www.pbs.org/wgbh/nova/tech/smartest-machine-on-earth.html
Review of Objectives
Now that you have completed this session, you are able to:
✓ Describe the main characteristics of the Watson QA system.
✓ Identify the key open source tools used in Watson.
✓ Recognize examples of Agile development best practices.
…any final questions?