titanic linkedin presentation - 20022015

Tackling the Titanic: Approach in an internal analytics competition. Alex Akulov, Carlos Hernandez. 20 February 2015.

Upload: carlos-hernandez

Post on 18-Aug-2015

TRANSCRIPT

Page 1: Titanic LinkedIn Presentation - 20022015

Tackling the Titanic

Alex Akulov

Carlos Hernandez

20 February 2015

Approach in an internal analytics competition

Page 2: Titanic LinkedIn Presentation - 20022015

Deloitte Titanic Analytics Competition | Walkthrough

Aussies challenge the world
Deloitte Australia called out the global member firms to an analytics competition - the response was loud and clear.

Nations: 31
Teams: 192
Practitioners: 359
Submissions: 1,954

Page 3: Titanic LinkedIn Presentation - 20022015


Deloitte tackles the Titanic
The task was to predict the fate of half the passengers aboard the ship, based on the outcomes for the other half.

Legend: Survived / Died

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
31 | 0 | 1 | Uruchurtu, Don. Manuel E | male | 40 | 0 | 0 | PC 17601 | 27.7208 | | C
246 | 0 | 1 | Minahan, Dr. William Edward | male | 44 | 2 | 0 | 19928 | 90 | C78 | Q
746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70 | 1 | 1 | WE/P 5735 | 71 | B22 | S
17 | 0 | 3 | Rice, Master. Eugene | male | 2 | 4 | 1 | 382652 | 29.125 | | Q
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S

Page 4: Titanic LinkedIn Presentation - 20022015


Vancouver Data Divers tackle the Titanic
An iterative approach to modeling and tuning proved imperative to achieving a high score.

[Process diagram: Feature Engineering, Data Visualization, Model Development, Kaggle Submission, Model Tuning, with Feedback loops]

Page 5: Titanic LinkedIn Presentation - 20022015


From raw data to usable features:

• Deterministic: deriving values from existing features

• Probabilistic: filling in the blanks with a predictive model (e.g. age based on class and title)
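
The probabilistic case above can be sketched as follows. This is a minimal illustration, not the team's actual model: it fills a missing age with the median age of passengers who share the same class and title, which is a simple stand-in for a predictive model. The field names `Pclass`, `Title` and `Age`, the helper names, and the toy rows are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def fit_age_lookup(rows):
    """Return (median age per (class, title) group, overall median age)."""
    groups = defaultdict(list)
    for row in rows:
        if row["Age"] is not None:
            groups[(row["Pclass"], row["Title"])].append(row["Age"])
    known_ages = [age for ages in groups.values() for age in ages]
    lookup = {key: median(ages) for key, ages in groups.items()}
    return lookup, median(known_ages)


def estimate_age(row, lookup, overall):
    """Keep a known age; otherwise use the group median, then the overall median."""
    if row["Age"] is not None:
        return row["Age"]
    return lookup.get((row["Pclass"], row["Title"]), overall)


# Toy rows for illustration only (not the competition data).
passengers = [
    {"Pclass": 3, "Title": "Master", "Age": 2},
    {"Pclass": 3, "Title": "Master", "Age": 4},
    {"Pclass": 1, "Title": "Mrs", "Age": 38},
    {"Pclass": 3, "Title": "Master", "Age": None},
    {"Pclass": 2, "Title": "Dr", "Age": None},
]
lookup, overall = fit_age_lookup(passengers)
ages = [estimate_age(p, lookup, overall) for p in passengers]
print(ages)
```

The fallback to the overall median covers groups that have no known ages at all, such as the second-class "Dr" in the toy rows.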

Page 6: Titanic LinkedIn Presentation - 20022015


Feature Engineering
Attributes were derived from existing data to generate an enhanced view of passengers and help with model accuracy.

Name: Rice, Master. Eugene
Family Name: Rice
Given Name: Eugene
Passenger Type: Master

Class: 1
Sex: F
Spouses: 0

Age Estimate: 46

Passenger ID, Survived? (Y/N), Passenger Class
Name, Given Name, Family Name, Passenger Type
Gender, Gender Code * Age, Gender Code * Passenger Class
Age, Age Estimate (Regression), Age Estimate (Distribution)
SibSp, Siblings, Spouses
Parch, Parents, Children
Wife? (Y/N), Husband? (Y/N), Father? (Y/N), Mother? (Y/N)
Travel Type 1, Travel Type 2
Ticket, Death in Group? (Y/N), Father * Death in Group, Group Size, Ticket First Character, Ticket First Letter
Fare, Fare per Passenger, Fare (log)
Cabin, Deck, Cabin Number
Embarked
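
A few of the deterministic derivations listed above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the team's code: the output keys, the regular expression, and the use of `log1p` for "Fare (log)" (chosen here so that a zero fare does not fail) are all assumptions.

```python
import math
import re

# "Family, Title. Given names" as in "Rice, Master. Eugene".
NAME_PATTERN = re.compile(
    r"^(?P<family>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<given>.*)$"
)


def derive_features(row):
    """Derive name, cabin and fare features from one raw passenger record."""
    match = NAME_PATTERN.match(row["Name"])
    features = {
        "FamilyName": match.group("family").strip() if match else None,
        "PassengerType": match.group("title").strip() if match else None,
        "GivenName": match.group("given").strip() if match else None,
    }
    cabin = row.get("Cabin") or ""
    # Deck is the leading letter; the number is the first run of digits.
    features["Deck"] = cabin[0] if cabin else None
    digits = re.search(r"\d+", cabin)
    features["CabinNumber"] = int(digits.group()) if digits else None
    # Assumption: log(1 + fare), so that a fare of 0 stays defined.
    features["FareLog"] = math.log1p(row["Fare"])
    return features


rice = derive_features({"Name": "Rice, Master. Eugene", "Cabin": "", "Fare": 29.125})
cumings = derive_features({
    "Name": "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Cabin": "C85",
    "Fare": 71.2833,
})
print(rice)
print(cumings)
```

Both example records come from the sample rows in the deck; a passenger with no recorded cabin simply gets `None` for the deck and cabin number.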

Page 7: Titanic LinkedIn Presentation - 20022015


Visualization drives feature selection for model development.

Assumptions can be tested to save time in model development.

Page 8: Titanic LinkedIn Presentation - 20022015


Data visualization
Tableau rapid-fire visualizations enabled us to get to know the data and segment it better for analysis.

[Chart: Count of Name by Sex / Pclass (female, male; classes 1, 2, 3), split by Survived (No / Yes)]

Females in 3rd class need independent analysis.

Females in 1st/2nd class should be grouped.

Males in 3rd class could skew results. Best to analyze independently.
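
The segment view behind these observations can also be reproduced outside Tableau. The sketch below is a hedged illustration, not the team's workflow: it counts survivors per (sex, class) segment, and the function name is hypothetical. The input is the seven sample rows shown earlier in the deck, reduced to three fields.

```python
from collections import defaultdict


def survival_by_segment(rows):
    """Return {(sex, pclass): (survivors, total, rate)} for each segment."""
    counts = defaultdict(lambda: [0, 0])  # [survivors, total]
    for row in rows:
        key = (row["Sex"], row["Pclass"])
        counts[key][0] += row["Survived"]
        counts[key][1] += 1
    return {key: (s, n, s / n) for key, (s, n) in counts.items()}


# The seven sample rows from the data slide (Sex, Pclass, Survived only).
sample = [
    {"Sex": "male", "Pclass": 1, "Survived": 0},
    {"Sex": "male", "Pclass": 1, "Survived": 0},
    {"Sex": "male", "Pclass": 1, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
]
table = survival_by_segment(sample)
for (sex, pclass), (survivors, total, rate) in sorted(table.items()):
    print(f"{sex:6} class {pclass}: {survivors}/{total} survived ({rate:.0%})")
```

Seven rows are far too few to draw conclusions from; the point is only the shape of the table that the chart summarizes.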

[Chart: Fare by Pclass (1, 2, 3)]

Passenger class cannot be determined based on fares.

There was an overlap of class cabins.

Classes are distributed across decks.

[Chart: # Survived by Pclass (1, 2, 3), travel type (Alone, Family, Group) and sex (female, male), split by No / Yes]

Contrary to what was previously theorized, there is no direct correlation between a passenger’s travel type (Alone/Family/Group) and survival rate.

Page 9: Titanic LinkedIn Presentation - 20022015


Use statistical and machine learning models suitable for the task.

Determine which features are useful and derive new ones if necessary.

Page 10: Titanic LinkedIn Presentation - 20022015


Model Development
Multiple statistical tools were used since some proved better than others in predicting outcome for groups of passengers.

KNIME Analytics Platform was used for prototyping and testing of various modeling approaches:

• Splitting data into groups

• Decision Trees

• Random Forest

• Logistic Regression

• Support Vector Machines (SVM)
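
The first item in the list, splitting data into groups so that each group gets its own model, can be sketched in Python. The team prototyped in KNIME, so this is only an assumed equivalent: the segment boundaries follow the visualization notes (females in 1st/2nd class grouped, 3rd-class females and males handled separately), the remaining male segment is an assumption, and a majority-class rule stands in for the real per-group models.

```python
from collections import Counter, defaultdict


def segment(row):
    """Assign a passenger to a segment (hypothetical grouping by sex and class)."""
    if row["Sex"] == "female":
        return "female_1st_2nd" if row["Pclass"] in (1, 2) else "female_3rd"
    return "male_3rd" if row["Pclass"] == 3 else "male_1st_2nd"


def fit_majority_models(rows):
    """Fit one majority-class rule per segment (stand-in for a real model)."""
    outcomes = defaultdict(Counter)
    for row in rows:
        outcomes[segment(row)][row["Survived"]] += 1
    return {seg: counts.most_common(1)[0][0] for seg, counts in outcomes.items()}


def predict(row, models, default=0):
    """Route the passenger to its segment's rule; fall back to `default`."""
    return models.get(segment(row), default)


# Toy training rows for illustration only.
train = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 2, "Survived": 1},
    {"Sex": "female", "Pclass": 2, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 1},
]
models = fit_majority_models(train)
print(models)
print(predict({"Sex": "female", "Pclass": 1}, models))
```

Swapping the majority rule for a fitted classifier per segment keeps the same routing structure; only `fit_majority_models` and the lookup in `predict` would change.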

Page 11: Titanic LinkedIn Presentation - 20022015


Submission is scored and ranked against other teams.

Once a certain threshold is met, the model is ready for tuning.

Page 12: Titanic LinkedIn Presentation - 20022015


Kaggle Submission
Vancouver Data Divers achieved the goal of being in the top 10% two weeks before the competition deadline.

[Decision diagram: passengers split by Gender (M, F), then by Class (1st or 2nd / 3rd; 1st / 2nd 3rd). Branch labels: Survived; Logistic Regression; Decision Tree on 'Master?', 'Fare' (If Master? = 1: Survived); Logistic Regression]

Page 13: Titanic LinkedIn Presentation - 20022015


Adjusting model parameters can increase the model accuracy without changing the input variables.

Cross-validation is one way to test the model after each tuning cycle.
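
As a rough sketch of that check, k-fold cross-validation holds out each fold in turn, fits on the rest, and averages the held-out accuracy. The code below is an assumed illustration using only the standard library; the function names, the toy rows, and the per-sex majority model are hypothetical and are not the team's tuned model.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, fit, predict, k=5, seed=0):
    """Return the mean accuracy over k held-out folds (requires k <= len(rows))."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        test = [rows[i] for i in held_out]
        model = fit(train)
        correct = sum(predict(model, r) == r["Survived"] for r in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def fit_sex_majority(train):
    """Hypothetical model: the majority outcome for each sex in the training rows."""
    counts = {}
    for r in train:
        counts.setdefault(r["Sex"], [0, 0])[r["Survived"]] += 1
    return {sex: int(c[1] > c[0]) for sex, c in counts.items()}


def predict_sex_majority(model, row):
    return model.get(row["Sex"], 0)


# Toy rows for illustration only.
rows = (
    [{"Sex": "female", "Survived": 1}] * 8
    + [{"Sex": "male", "Survived": 0}] * 8
    + [{"Sex": "female", "Survived": 0}] * 2
    + [{"Sex": "male", "Survived": 1}] * 2
)
accuracy = cross_validate(rows, fit_sex_majority, predict_sex_majority, k=5)
print(f"mean cross-validated accuracy: {accuracy:.2f}")
```

Re-running the same call after each parameter change gives a like-for-like score without spending a leaderboard submission.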

Page 14: Titanic LinkedIn Presentation - 20022015


Model Tuning
Adjusting model parameters improved the predictive results and moved the team from 19th to 13th spot.

Python was used to tune the model by:

• Choosing the optimal features

• Adjusting model parameters

• Reducing manual effort

32,768 combinations of 15 features

2,000 attempts per hour

17 hours on a Deloitte laptop
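
The figures above are consistent with an exhaustive search over feature subsets: 15 features give 2^15 = 32,768 subsets (counting the empty set), and at 2,000 attempts per hour that is about 16.4 hours, in line with the roughly 17 hours quoted. The sketch below reproduces the arithmetic and shows one way such a search could be written; the function names and placeholder feature names are assumptions, not the team's script.

```python
from itertools import combinations


def feature_subsets(features):
    """Yield every subset of `features`, from the empty set to the full set."""
    for size in range(len(features) + 1):
        yield from combinations(features, size)


def best_subset(features, score):
    """Exhaustively score every non-empty subset and return the best one."""
    candidates = (subset for subset in feature_subsets(features) if subset)
    return max(candidates, key=score)


features = [f"feature_{i}" for i in range(15)]  # placeholder names
n_subsets = sum(1 for _ in feature_subsets(features))
hours = n_subsets / 2000  # at the quoted 2,000 attempts per hour
print(n_subsets, round(hours, 1))
```

In practice `score` would be something like a cross-validated accuracy for a model trained on the given subset, which is what makes each attempt expensive.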

Page 15: Titanic LinkedIn Presentation - 20022015


Vancouver Data Divers Placed 13th Overall

The local IM&AT talent is capable of tackling predictive modeling projects

Vancouver office is on the map for global Analytics talent

Our story got one client excited thinking about predictive modeling opportunities

Developed stronger predictive analytics capabilities that can be shared within the practice