IBM Watson & Open Source Software - LinuxCon 2012

Watson & Open Source Software Ivan Portilla IT Architect 8/29/12 [email protected]

Upload: iportilla

Post on 21-Nov-2014


DESCRIPTION

Presented at LinuxCon 2012 - San Diego, CA

TRANSCRIPT

Page 1: IBM Watson & Open Source Software - LinuxCon 2012

Watson & Open Source Software

Ivan Portilla IT Architect 8/29/12 [email protected]

Page 2: IBM Watson & Open Source Software - LinuxCon 2012

If I have seen further it is by standing on the shoulders of giants.

Isaac Newton, Letter to Robert Hooke, February 5, 1675

Page 3: IBM Watson & Open Source Software - LinuxCon 2012


Objectives

By the end of this session, you should be able to:
✓ Describe the main characteristics of the Watson QA system.
✓ Identify the key open source SW used in Watson.
✓ Recognize examples of Agile development best practices.

Page 4: IBM Watson & Open Source Software - LinuxCon 2012

Disclaimers

Page 5: IBM Watson & Open Source Software - LinuxCon 2012

Disclaimer 1

✓ This presentation represents the view of the author and does not represent the view of IBM.
✓ All opinions expressed in this presentation are strictly those of the speaker, and do NOT represent those of IBM, IBM management, or anyone else.
✓ IBM and IBM (logo) are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Page 6: IBM Watson & Open Source Software - LinuxCon 2012

Disclaimer 2

I (We) do not work for the Watson team.

Page 7: IBM Watson & Open Source Software - LinuxCon 2012

Jeopardy! The game

Page 8: IBM Watson & Open Source Software - LinuxCon 2012

Game View

Page 9: IBM Watson & Open Source Software - LinuxCon 2012

Let’s Play Jeopardy

BEFORE & AFTER: The Jerry Maguire star who automatically maintains your vehicle’s speed.

COMMON BONDS: Trout, loose change in your pocket, and compliments.

DIPLOMATIC RELATIONS: Of the four countries in the world that the United States does not have diplomatic relations with, the one that’s farthest north.

GEOGRAPHY: Chile shares its longest land border with this country.

Page 10: IBM Watson & Open Source Software - LinuxCon 2012

Natural Language Processing

Understanding natural language is hard!

Page 11: IBM Watson & Open Source Software - LinuxCon 2012

Watson education

Page 12: IBM Watson & Open Source Software - LinuxCon 2012

A Brief History of Watson

Page 13: IBM Watson & Open Source Software - LinuxCon 2012

A Brief History of Watson

▪ Deep Blue ended in 1997
▪ Looking for a new research challenge
▪ 2004, IBM Research manager Charles Lickel,
▪ Ken Jennings
▪ Started in 2005
  • David Ferrucci
  • DeepQA in 2007
  • Won Jeopardy Match, Feb 2011

Page 14: IBM Watson & Open Source Software - LinuxCon 2012

[Chart: Precision (y-axis, 0% to 100%) versus % Answered (x-axis, 0% to 100%), with one curve each for Baseline, 12/2007, 5/2008, 8/2008, 12/2008, 5/2009, 10/2009, 4/2010, and 11/2010]
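The chart plots precision against the percentage of questions answered. A minimal sketch of how one point on such a curve can be computed, assuming each question yields a top answer with a confidence and a correctness flag. The function name and the sample data are made up for illustration and are not Watson's results:

```python
def precision_at_answered(results, fraction):
    """results: list of (confidence, is_correct) pairs, one per question.
    fraction: share of questions the system chooses to answer (0 < fraction <= 1).
    The system answers only the questions it is most confident about."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    n_answered = max(1, round(len(ranked) * fraction))
    answered = ranked[:n_answered]
    correct = sum(1 for _, ok in answered if ok)
    return correct / n_answered


# Hypothetical data, for illustration only.
sample = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.20, False)]
curve = {f: precision_at_answered(sample, f) for f in (0.2, 0.4, 0.6, 0.8, 1.0)}
```

Answering fewer questions, and only the most confident ones, generally raises precision, which is why each curve slopes downward as % Answered grows.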

Page 15: IBM Watson & Open Source Software - LinuxCon 2012

What is Watson?

✓ Understands natural language.

✓ Generates & evaluates hypotheses for better outcomes.

✓ Adapts & learns from user selections and responses.

http://www.ibm.com/innovation/us/watson/

Page 16: IBM Watson & Open Source Software - LinuxCon 2012

Watson metrics

Development Team: 25 people
Project Duration: 4 years
Software: 1,000,000 SLOC
  700K Java, 300K C++
  ~130 components
Hardware: 90 IBM Power-750 servers
  2880 Power7 cores @ 80+ TFLOPS
  20 TB Disk, 16 TB RAM (memory)
  10 Gbps network

http://na11.apachecon.com/talks/19932

Page 17: IBM Watson & Open Source Software - LinuxCon 2012

Open Source Software

Page 18: IBM Watson & Open Source Software - LinuxCon 2012

OSS - Linux

http://video.linux.com/videos/linuxcon-vancouver-day-2-1/

Page 19: IBM Watson & Open Source Software - LinuxCon 2012

Open Source too

http://ti.arc.nasa.gov/opensource/

http://ocw.mit.edu/index.htm

Page 20: IBM Watson & Open Source Software - LinuxCon 2012

How does it work?

Page 21: IBM Watson & Open Source Software - LinuxCon 2012

Learning To Rank (Basic Architecture)

Page 22: IBM Watson & Open Source Software - LinuxCon 2012

Watson Architecture

http://en.wikipedia.org/wiki/Watson_(computer)

Page 23: IBM Watson & Open Source Software - LinuxCon 2012

Who is the 44th President of the United States?

Page 24: IBM Watson & Open Source Software - LinuxCon 2012

Who is the 44th President of the United States?

Watson by R. Yates

Question & Topic Analysis

Page 25: IBM Watson & Open Source Software - LinuxCon 2012

Who is the 44th President of the United States?

'Who' is the '44th' 'President' of the 'United States'?

Lexical Answer Type → Person

Focus * Can be replaced by the correct answer to make a true statement

Keywords

Watson by R. Yates

Question & Topic Analysis
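The slide splits the question into a lexical answer type, a focus, and keywords. A toy sketch of that decomposition follows. The wh-word table, the stop-word list, and the function name `analyze_question` are assumptions made for illustration, not Watson's actual analysis components:

```python
import re

# Hypothetical mapping from a leading wh-word to a lexical answer type (LAT).
WH_TO_LAT = {"who": "Person", "where": "Place", "when": "Date"}
# Tiny illustrative stop-word list.
STOP_WORDS = {"is", "the", "of", "a", "an"}


def analyze_question(question):
    tokens = re.findall(r"[A-Za-z0-9]+", question)
    focus = tokens[0]  # the wh-word that the answer will replace
    lat = WH_TO_LAT.get(focus.lower(), "Unknown")
    keywords = [t for t in tokens[1:] if t.lower() not in STOP_WORDS]
    return {"focus": focus, "lat": lat, "keywords": keywords}


analysis = analyze_question("Who is the 44th President of the United States?")
```

This sketch keeps 'United' and 'States' as separate keywords; treating 'United States' as one unit, as the slide does, would need phrase or entity detection on top.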

Page 26: IBM Watson & Open Source Software - LinuxCon 2012

Who is the 44th President of the United States?

Page 27: IBM Watson & Open Source Software - LinuxCon 2012

Who is the 44th President of the United States?

Page 28: IBM Watson & Open Source Software - LinuxCon 2012

Primary Search

Who is the 44th President of the United States?

Watson by R. Yates

Page 29: IBM Watson & Open Source Software - LinuxCon 2012

Primary Search

Who is the 44th President of the United States?

Watson by R. Yates

Barack Obama
George W. Bush
Harvard Law School
Illinois

Page 30: IBM Watson & Open Source Software - LinuxCon 2012

Who is the 44th President of the United States?

Who is the 44th President of the United States? Barack Obama
Who is the 44th President of the United States? George W. Bush
Who is the 44th President of the United States? Harvard Law School
Who is the 44th President of the United States? Illinois

Watson by R. Yates

Page 31: IBM Watson & Open Source Software - LinuxCon 2012

Who is the 44th President of the United States?

Who is the 44th President of the United States? Barack Obama
Who is the 44th President of the United States? George W. Bush
Who is the 44th President of the United States? Harvard Law School
Who is the 44th President of the United States? Illinois

Watson by R. Yates

Who is the 44th President of the United States? → Person
Is Barack Obama a Person? .90
Is George W. Bush a Person? .90
Is Harvard Law School a Person? .10
Is Illinois a Person? .15

Page 32: IBM Watson & Open Source Software - LinuxCon 2012

Who is the 44th President of the United States?

Barack Obama is the 44th President of the United States
George W. Bush is the 44th President of the United States
Harvard Law School is the 44th President of the United States
Illinois is the 44th President of the United States

Watson by R. Yates
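Each candidate replaces the focus of the question to form a statement that can then be checked against evidence. A minimal sketch of that substitution, assuming the focus is the leading word 'Who'; the helper name `to_hypothesis` is hypothetical:

```python
def to_hypothesis(question, focus, candidate):
    """Replace the focus with a candidate answer and drop the question mark."""
    statement = question.replace(focus, candidate, 1)
    return statement.rstrip("?")


question = "Who is the 44th President of the United States?"
candidates = ["Barack Obama", "George W. Bush", "Harvard Law School", "Illinois"]
hypotheses = [to_hypothesis(question, "Who", c) for c in candidates]
# hypotheses[0] == "Barack Obama is the 44th President of the United States"
```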

Page 33: IBM Watson & Open Source Software - LinuxCon 2012

Who is the 44th President of the United States?

Barack Obama is the 44th President of the United States
George W. Bush is the 44th President of the United States
Harvard Law School is the 44th President of the United States
Illinois is the 44th President of the United States

Watson by R. Yates

Barack Hussein Obama II (/bəˈrɑːk huːˈseɪn oʊˈbɑːmə/; born August 4, 1961) is the 44th and current President of the United States.

George Walker Bush (born July 6, 1946) is an American politician who served as the 43rd President of the United States from 2001 to 2009 and the 46th Governor of Texas from 1995 to 2000.

Barack Obama .95
George W. Bush .80
Harvard Law School .05
Illinois .10

Page 34: IBM Watson & Open Source Software - LinuxCon 2012

Evidence Retrieval

Who is the 44th President of the United States?

Candidate Answer      Answer Scoring   Evidence Retrieval & Scoring   Confidence
Barack Obama          0.90             0.90                           .95
George W. Bush        0.90             0.80                           .65
Harvard Law School    0.10             0.05                           .05
Illinois              0.15             0.10                           .10

Watson by R. Yates
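The slide pairs each candidate with an answer score, an evidence score, and a final confidence. A minimal sketch of how a learned model might combine such scores, here with a logistic function. The weights and bias are made-up assumptions, not Watson's trained model, so the output is not expected to reproduce the slide's confidence column:

```python
import math

# (answer scoring, evidence retrieval & scoring) per candidate, from the slide.
scores = {
    "Barack Obama": (0.90, 0.90),
    "George W. Bush": (0.90, 0.80),
    "Harvard Law School": (0.10, 0.05),
    "Illinois": (0.15, 0.10),
}

# Hypothetical weights; a real system would learn these from training questions.
WEIGHTS = (2.0, 4.0)
BIAS = -3.0


def confidence(features):
    # Weighted sum of the feature scores squashed into the range (0, 1).
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


ranked = sorted(scores, key=lambda name: confidence(scores[name]), reverse=True)
best = ranked[0]
```

The ranking, rather than the exact numbers, is the point of the sketch: the candidate with strong type and evidence scores ends up first.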

Page 35: IBM Watson & Open Source Software - LinuxCon 2012

DeepQA: Massively Parallel Probabilistic Evidence-Based Architecture

[Diagram: Question → Question & Topic Analysis → Question Decomposition → Hypothesis Generation (Primary Search and Candidate Answer Generation, drawing on Answer Sources) → Hypothesis and Evidence Scoring (Answer Scoring, Evidence Retrieval & Scoring, drawing on Evidence Sources) → Synthesis → Final Confidence Merging & Ranking (Models, Balance & Combine) → Answer & Confidence]

Learned Models help combine and weigh the Evidence

ApacheCon 2011, Watson, a Reasoning System: based on Apache Inside!, David Boloker

Page 36: IBM Watson & Open Source Software - LinuxCon 2012

OSS in Watson 1.0

Page 37: IBM Watson & Open Source Software - LinuxCon 2012

OSS in Watson

✓ UIMA, UIMA-AS
✓ Hadoop, Map Reduce
✓ Lucene, Indri

ApacheCon 2011, Watson, a Reasoning System: based on Apache Inside!, David Boloker

Page 38: IBM Watson & Open Source Software - LinuxCon 2012

UIMA

http://uima.apache.org/

Page 39: IBM Watson & Open Source Software - LinuxCon 2012

UIMA-Asynchronous Scaleout

UIMA AS provides more flexible and powerful scale out capability.

Innovate 2011, How Does It Work? The Architecture of Watson. Grady Booch

Page 40: IBM Watson & Open Source Software - LinuxCon 2012

Think Hadoop

A framework for storing & processing big data.

✓ 4,000 machines
✓ 20 PB

High reliability done in software:
✓ Automated failover for data & computation
✓ Implemented in Java

http://hadoop.apache.org/mapreduce/

Page 41: IBM Watson & Open Source Software - LinuxCon 2012

Map reduce

http://hadoop.apache.org/mapreduce/
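The map, shuffle/sort, and reduce phases can be illustrated without a cluster. The following is a single-process sketch of the programming model in plain Python, counting words. It does not use the Hadoop API, and the function names and sample documents are chosen for illustration:

```python
from collections import defaultdict


def map_phase(document):
    # Emit one (key, value) pair per word.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Group values by key and sort the keys, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)


documents = ["Watson plays Jeopardy", "Watson uses Hadoop", "Hadoop uses Java"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map calls run on many machines in parallel and the shuffle moves data over the network; the data flow is the same.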

Page 42: IBM Watson & Open Source Software - LinuxCon 2012

UIMA pipelines in Hadoop

http://blogs.apache.org/foundation/entry/apache_innovation_bolsters_ibm_s

[Diagram: Input from the Hadoop Distributed File System is divided into ~5000 "splits" and fed to ~50-100 "mappers". Each mapper runs multiple threads, each running a UIMA pipeline (~400-800 threads). Mapper output passes through Shuffle/Sort to the Reducers, which write Output to the Hadoop Distributed File System.]
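The diagram describes input splits handed to mappers, where each mapper runs several threads and each thread runs one analysis pipeline. A rough sketch of that shape using a thread pool follows. `run_pipeline` is a stand-in for a UIMA pipeline, and the figure of 8 threads per mapper is only an inference from the slide's numbers (~50-100 mappers, ~400-800 threads):

```python
from concurrent.futures import ThreadPoolExecutor

THREADS_PER_MAPPER = 8  # assumption: 400 / 50 == 800 / 100 == 8


def run_pipeline(document):
    # Placeholder for a UIMA analysis pipeline: here it only counts tokens.
    return document, len(document.split())


def mapper(split):
    # One mapper processes the documents of its split on several threads.
    with ThreadPoolExecutor(max_workers=THREADS_PER_MAPPER) as pool:
        return list(pool.map(run_pipeline, split))


# Two tiny hypothetical splits; the slide mentions ~5000 of them.
splits = [["a b c", "d e"], ["f", "g h i j"]]
results = [record for split in splits for record in mapper(split)]
```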

Page 43: IBM Watson & Open Source Software - LinuxCon 2012

Architecture

http://lucene.apache.org

Page 44: IBM Watson & Open Source Software - LinuxCon 2012

Indri & Lemur

Indri is a text search engine developed at UMass & CMU. Indri is part of the Lemur project.

[Diagram: runquery passes the query below to a QueryEnvironment, which talks to a LocalServer directly and to indrid daemons through NetworkServerProxy and NetworkServerStub pairs, each stub in front of its own LocalServer.]

#combine(#2(george bush).title)

http://lemurproject.org/indri/
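The query on the slide nests a window operator with a field restriction inside `#combine`. A small sketch that assembles such a query string follows. The helper names are hypothetical, and the code only builds text in the Indri query language; it does not call the Indri API:

```python
def ordered_window(width, terms):
    # In the Indri query language, #N(...) is an ordered window:
    # the terms must appear in order, close together.
    return "#{}({})".format(width, " ".join(terms))


def in_field(expression, field):
    # A trailing .field restricts matching to that document field.
    return "{}.{}".format(expression, field)


def combine(*expressions):
    # #combine merges the beliefs of its arguments into one score.
    return "#combine({})".format(" ".join(expressions))


query = combine(in_field(ordered_window(2, ["george", "bush"]), "title"))
# query == "#combine(#2(george bush).title)"
```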

Page 45: IBM Watson & Open Source Software - LinuxCon 2012

© 2011 IBM Corporation

Watson answers in 2-6 seconds

[Diagram: a Question fans out into Multiple Interpretations, 100s of sources, 100s of Possible Answers, 1000's of Pieces of Evidence, and 100,000's of scores from many simultaneous Text Analysis Algorithms. Stages: Question & Topic Analysis → Question Decomposition → Hypothesis Generation → Hypothesis and Evidence Scoring → Synthesis → Final Confidence Merging & Ranking → Answer & Confidence]

ApacheCon 2011, Watson, a Reasoning System: based on Apache Inside!, David Boloker

Page 46: IBM Watson & Open Source Software - LinuxCon 2012

Other OSS in Watson

https://www.ibm.com/developerworks/mydeveloperworks/blogs/InsideSystemStorage/entry/ibm_watson_how_to_build_your_own_watson_jr_in_your_basement7?lang=en

Page 47: IBM Watson & Open Source Software - LinuxCon 2012

J-Archive data

http://www.j-archive.com/showgame.php?game_id=3577

Page 48: IBM Watson & Open Source Software - LinuxCon 2012

Development Process

✓ War room setting with continuous collaboration.
✓ Weekly integration.
✓ Results driven with E2E regression testing.
✓ About 8,000 experiments
✓ 10 GBs of test data/wk.
✓ Agile development

Innovate 2011, How Does It Work? The Architecture of Watson. Grady Booch

Page 49: IBM Watson & Open Source Software - LinuxCon 2012

Related Materials

http://www.apache.org

http://manning.com/

http://oreilly.com/

www.caltech.edu

Page 50: IBM Watson & Open Source Software - LinuxCon 2012

Take Away

OSS is powerful and scalable enough for the Watson team. What about your project?

Page 51: IBM Watson & Open Source Software - LinuxCon 2012

Resources

IBM Journal of Research and Development
http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6177717

IBM Watson
http://www.ibm.com/innovation/us/watson/
http://www.research.ibm.com/deepqa/index.shtml

Nova
http://www.pbs.org/wgbh/nova/tech/smartest-machine-on-earth.html

Page 52: IBM Watson & Open Source Software - LinuxCon 2012


Review of Objectives

Now that you have completed this session, you are able to:
✓ Describe the main characteristics of the Watson QA system.
✓ Identify the key open source tools used in Watson.
✓ Recognize examples of Agile development best practices.

Page 53: IBM Watson & Open Source Software - LinuxCon 2012


…any final questions?