titanic linkedin presentation - 20022015
Tackling the Titanic
Alex Akulov
Carlos Hernandez
20 February 2015
Approach in an internal analytics competition
Deloitte Titanic Analytics Competition | Walkthrough
Aussies challenge the world
Deloitte Australia called out the global member firms to an analytics competition - the response was loud and clear.
Nations: 31
Teams: 192
Practitioners: 359
Submissions: 1,954
Deloitte tackles the Titanic
The task was to predict the fate of half the passengers aboard the ship, based on the outcomes for the other half.
Survived column: 1 = Survived, 0 = Died
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
31 | 0 | 1 | Uruchurtu, Don. Manuel E | male | 40 | 0 | 0 | PC 17601 | 27.7208 | | C
246 | 0 | 1 | Minahan, Dr. William Edward | male | 44 | 2 | 0 | 19928 | 90 | C78 | Q
746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70 | 1 | 1 | WE/P 5735 | 71 | B22 | S
17 | 0 | 3 | Rice, Master. Eugene | male | 2 | 4 | 1 | 382652 | 29.125 | | Q
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S
Vancouver Data Divers tackle the Titanic
An iterative approach to modeling and tuning proved imperative to achieving a high score.
Process diagram stages: Feature Engineering, Data Visualization, Model Development, Kaggle Submission, Model Tuning, connected by feedback loops.
From raw data to usable features:
• Deterministic: deriving values from existing features
• Probabilistic: filling in the blanks with a predictive model (e.g. age based on class and title)
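The probabilistic step can be illustrated with a small sketch. It assumes the simplest possible stand-in for a predictive model: a missing age is replaced by the median age of passengers with the same class and title. The toy records and the function name `fill_missing_ages` are hypothetical; the deck does not say which model the team actually used for its age estimates.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records: (pclass, title, age); None marks a missing age.
passengers = [
    (1, "Mr", 40.0),
    (1, "Mr", 44.0),
    (1, "Mr", None),
    (3, "Master", 2.0),
    (3, "Master", 4.0),
    (3, "Master", None),
    (3, "Miss", 26.0),
]

def fill_missing_ages(rows):
    """Fill each missing age with the median age of the same (class, title) group.

    Falls back to the overall median when a group has no known ages.
    This is an assumed stand-in for a predictive model, not the team's method.
    """
    known = defaultdict(list)
    for pclass, title, age in rows:
        if age is not None:
            known[(pclass, title)].append(age)
    overall = median(a for ages in known.values() for a in ages)

    filled = []
    for pclass, title, age in rows:
        if age is None:
            group = known.get((pclass, title))
            age = median(group) if group else overall
        filled.append((pclass, title, age))
    return filled

ages = fill_missing_ages(passengers)
```

A group median is only a baseline; a regression on class and title would follow the same fill-in pattern with a different estimator.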
Feature Engineering
Attributes were derived from existing data to generate an enhanced view of passengers and help with model accuracy.
Name: Rice, Master. Eugene → Family Name: Rice; Given Name: Eugene; Passenger Type: Master
Class: 1; Sex: F; Spouses: 0 → Age Estimate: 46
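The name split is a deterministic derivation and can be sketched with a regular expression. This is a minimal illustration that assumes names follow the "Family, Title. Given" layout seen in the sample rows; the pattern and the function name `split_name` are illustrative choices, not taken from the team's workflow.

```python
import re

# Assumed layout: "<family name>, <title>. <given names>"
NAME_PATTERN = re.compile(
    r"^\s*(?P<family>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<given>.*)$"
)

def split_name(name):
    """Split a raw Name value into family name, passenger type (title) and given name."""
    match = NAME_PATTERN.match(name)
    if match is None:
        # Names that do not fit the assumed layout are kept whole.
        return {"family_name": name.strip(), "passenger_type": None, "given_name": None}
    return {
        "family_name": match.group("family").strip(),
        "passenger_type": match.group("title").strip(),
        "given_name": match.group("given").strip(),
    }

example = split_name("Rice, Master. Eugene")
```

For "Rice, Master. Eugene" this yields the family name "Rice", the passenger type "Master" and the given name "Eugene".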
Feature list: Passenger ID, Survived? (Y/N), Passenger Class, Name, Given Name, Family Name, Passenger Type, Gender, Gender Code * Age, Gender Code * Passenger Class, Age, Age Estimate (Regression), Age Estimate (Distribution), SibSp, Siblings, Spouses, Parch, Parents, Children, Wife? (Y/N), Husband? (Y/N), Father? (Y/N), Mother? (Y/N), Travel Type 1, Travel Type 2, Ticket, Death in Group? (Y/N), Father * Death in Group, Group Size, Ticket First Character, Ticket First Letter, Fare, Fare per Passenger, Fare (log), Cabin, Deck, Cabin Number, Embarked
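A few of the listed features can be derived as sketched below. The definitions are assumptions, since the deck only names the features: Group Size is taken as the number of passengers sharing a ticket, Fare per Passenger as the fare divided by that group size, Fare (log) as log(1 + fare), and Deck and Cabin Number as the letter and digits of the first cabin listed.

```python
import math
from collections import Counter

# (ticket, fare, cabin) for four passengers from the sample rows shown earlier;
# an empty string means no cabin was recorded.
rows = [
    ("19928", 90.0, "C78"),
    ("WE/P 5735", 71.0, "B22"),
    ("A/5 21171", 7.25, ""),
    ("PC 17599", 71.2833, "C85"),
]

def derive_features(ticket, fare, cabin, group_size):
    """Derive ticket, fare and cabin features under the assumed definitions."""
    first_cabin = cabin.split()[0] if cabin.strip() else ""
    digits = "".join(ch for ch in first_cabin if ch.isdigit())
    return {
        "ticket_first_character": ticket[:1],
        "group_size": group_size,
        "fare_per_passenger": fare / group_size,
        "fare_log": math.log1p(fare),  # assumed: log(1 + fare), safe for a zero fare
        "deck": first_cabin[:1] or None,
        "cabin_number": int(digits) if digits else None,
    }

# Assumed definition: group size = number of passengers on the same ticket.
ticket_counts = Counter(ticket for ticket, _, _ in rows)
features = [derive_features(t, f, c, ticket_counts[t]) for t, f, c in rows]
```

In this small sample every ticket is unique, so each group size is 1 and the fare per passenger equals the fare.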
Visualization drives feature selection for model development.
Assumptions can be tested to save time in model development.
Data visualization
Tableau rapid-fire visualizations enabled us to get to know the data and segment it better for analysis.
Chart: Count of Name by Sex (female, male) and Pclass (1, 2, 3), split by Survived (No/Yes).
Females in 3rd class need independent analysis.
Females in 1st/2nd class should be grouped.
Males in 3rd class could skew results. Best to analyze independently.
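The kind of breakdown behind these observations can be reproduced without a visualization tool. The sketch below counts outcomes by sex and class for the seven sample passengers shown earlier; it only illustrates the tabulation and is not the team's Tableau workbook.

```python
from collections import defaultdict

# (sex, pclass, survived) for the seven sample passengers shown earlier.
sample = [
    ("male", 1, 0),
    ("male", 1, 0),
    ("male", 1, 0),
    ("male", 3, 0),
    ("male", 3, 0),
    ("female", 1, 1),
    ("female", 3, 1),
]

def survival_table(rows):
    """Count [died, survived] for every (sex, pclass) combination."""
    counts = defaultdict(lambda: [0, 0])
    for sex, pclass, survived in rows:
        counts[(sex, pclass)][survived] += 1
    return dict(counts)

table = survival_table(sample)
for (sex, pclass), (died, survived) in sorted(table.items()):
    print(f"{sex:6} class {pclass}: died={died} survived={survived}")
```

Seven rows are far too few to support any conclusion; the same function applied to the full training set would give the counts plotted in the chart.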
Chart: Fare by Pclass (1, 2, 3).
Passenger class cannot be determined based on fares.
There was an overlap of class cabins.
Classes are distributed across decks.
Chart: # Survived by travel type (Alone, Family, Group), sex (female, male) and Survived (No/Yes), one panel per Pclass (1, 2, 3).
Contrary to the earlier theory, there is no direct correlation between a passenger’s travel type (Alone/Family/Group) and survival rate.
Use statistical and machine learning models suited to the task.
Determine which features are useful and derive new ones if necessary.
Model Development
Multiple statistical tools were used, since some proved better than others at predicting outcomes for particular groups of passengers.
KNIME Analytics Platform was used for prototyping and testing of various modeling approaches:
• Splitting data into groups
• Decision Trees
• Random Forest
• Logistic Regression
• Support Vector Machines (SVM)
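The idea of splitting the data into groups and fitting a model per group can be sketched in plain Python. The segments follow the visualization notes (females in 1st/2nd class grouped, females in 3rd class and males in 3rd class on their own); the remaining bucket for males in 1st/2nd class is an assumption. The per-segment "model" here is a majority-vote placeholder standing in for a decision tree, logistic regression or other classifier, and the training rows are hypothetical.

```python
from collections import Counter, defaultdict

def segment(sex, pclass):
    """Assign a passenger to a segment based on sex and class."""
    if sex == "female":
        return "female_1st_2nd" if pclass in (1, 2) else "female_3rd"
    return "male_3rd" if pclass == 3 else "male_1st_2nd"  # last bucket is assumed

def fit_per_segment(rows):
    """Fit one placeholder model per segment: the most common outcome."""
    votes = defaultdict(Counter)
    for sex, pclass, survived in rows:
        votes[segment(sex, pclass)][survived] += 1
    return {name: counts.most_common(1)[0][0] for name, counts in votes.items()}

def predict(model, sex, pclass, default=0):
    """Predict with the model of the passenger's segment."""
    return model.get(segment(sex, pclass), default)

# Hypothetical training rows: (sex, pclass, survived).
training = [
    ("female", 1, 1), ("female", 2, 1), ("female", 1, 1),
    ("female", 3, 0), ("female", 3, 1), ("female", 3, 0),
    ("male", 3, 0), ("male", 3, 0),
    ("male", 1, 0), ("male", 2, 0), ("male", 1, 1),
]

model = fit_per_segment(training)
```

Swapping the majority vote for a real classifier per segment keeps the same routing structure: only `fit_per_segment` and `predict` would change.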
Submission is scored and ranked against other teams.
Once a certain threshold is met, the model is ready for tuning.
Kaggle Submission
Vancouver Data Divers achieved the goal of being in the top 10% two weeks before the competition deadline.
Model flow diagram. Branches: Gender (M / F), then Class (1st or 2nd / 3rd on one branch; 1st / 2nd / 3rd on the other). Outcomes: Survived; Logistic Regression; Decision Tree on 'Master?', 'Fare' (if Master? = 1: Survived).
Adjusting model parameters can increase model accuracy without changing the input variables.
Cross-validation is one way to test the model after each tuning cycle.
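Cross-validation can be sketched in a few lines: the data is split into k folds, the model is fitted on k-1 of them and scored on the held-out fold, and the scores are averaged. The helper names, the toy data and the majority-vote model below are hypothetical and only serve to make the example runnable; any fit/predict pair could be plugged in.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, labels, fit, predict, k=5, seed=0):
    """Return the mean accuracy over k held-out folds (requires len(rows) >= k)."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(rows) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = fit(train_x, train_y)
        correct = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

def fit_majority(xs, ys):
    """Placeholder model: the most common label seen for each input value."""
    seen = {}
    for x, y in zip(xs, ys):
        seen.setdefault(x, []).append(y)
    return {x: max(set(v), key=v.count) for x, v in seen.items()}

def predict_majority(model, x):
    return model.get(x, 0)

# Hypothetical toy data in which the label depends only on sex.
rows = ["female", "male"] * 5
labels = [1, 0] * 5
score = cross_validate(rows, labels, fit_majority, predict_majority, k=5)
```

Running the same routine after each tuning cycle gives a comparable score without spending a leaderboard submission.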
Model Tuning
Model parameters can be adjusted to achieve better predictive results; tuning moved the team from 19th to 13th spot.
Python was used to tune the model by:
• Choosing the optimal features
• Adjusting model parameters
• Reducing manual effort
32,768 combinations of 15 features
2,000 attempts per hour
17 hours on a Deloitte laptop
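The figures above are consistent with an exhaustive search over feature subsets: 15 features give 2^15 = 32,768 subsets (counting the empty one), and at 2,000 attempts per hour that takes about 16.4 hours, in line with the 17 hours quoted. The sketch below reproduces the arithmetic and shows the search pattern; the feature names and the scoring function are placeholders, since the deck does not describe how each attempt was scored.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every subset of the feature list, from empty to full."""
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            yield subset

def best_subset(features, score):
    """Return the subset with the highest score (first one wins on ties)."""
    return max(all_feature_subsets(features), key=score)

# Arithmetic from the slide: 15 features, 2,000 attempts per hour.
placeholder_features = [f"feature_{i}" for i in range(15)]
n_combinations = sum(1 for _ in all_feature_subsets(placeholder_features))
hours = n_combinations / 2000

# Toy search with a hypothetical score: reward two useful features,
# penalize subset size slightly.
def toy_score(subset):
    return len(set(subset) & {"Sex", "Pclass"}) - 0.1 * len(subset)

best = best_subset(["Sex", "Pclass", "Embarked"], toy_score)
```

In practice `toy_score` would be replaced by a cross-validated accuracy, which is what makes each attempt expensive.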
Vancouver Data Divers Placed 13th Overall
The local IM&AT talent is capable of tackling predictive modeling projects
Vancouver office is on the map for global Analytics talent
Our story got one client excited thinking about predictive modeling opportunities
Developed stronger predictive analytics capabilities that can be shared within the practice