horse racing assignment

Thoroughbred Horse Racing Data Insight *Data Exploration

Upload: gaurav-tiwari

Post on 18-Jan-2017

Category:

Data & Analytics



Thoroughbred Horse Racing

Data Insight

*Data Exploration

Basic Insight of Data

• The horse racing data covers three tracks in the USA (track IDs XXX, YYY and ZZZ) over four months, from September 2004 to December 2004.

• Track XXX is in Florida, track YYY is in Louisiana and track ZZZ is in Kentucky.

• The data set contains 17028 observations and 23 variables.

• Some variables in the data set are not defined in the data dictionary.

• Not all variables defined in the data dictionary are present in the final data set.

• There are 11 quantitative variables and 12 qualitative variables.

• All variables take the values expected from the data dictionary.

• Some variables have coded values, which are explained in the data dictionary.

Data Validation

• There are no missing values in the variables.

• The minimum value of Minimum_claim_price and Maximum_claim_price is zero for non-claiming races, i.e. for ALW, MSW, STR and STK races.

• The unit of Race_time is not defined in the data dictionary.

• Nothing unusual was found during basic data validation.
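
The two checks described above (missing values, and zero claim prices for non-claiming races) could be scripted roughly as follows. This is only a sketch: the `race_type` column name, the `CLM` code and the sample rows are assumptions made for illustration; only the claim-price column names and the four non-claiming codes come from the slides.

```python
# Race-type codes the slides list as non-claiming.
NON_CLAIMING = {"ALW", "MSW", "STR", "STK"}

def validate(rows):
    """Return (missing_cells, bad_rows).

    missing_cells: number of cells that are None or an empty string.
    bad_rows: indexes of non-claiming rows whose claim prices are not both zero.
    """
    missing = 0
    bad_rows = []
    for i, row in enumerate(rows):
        missing += sum(1 for v in row.values() if v is None or v == "")
        if row.get("race_type") in NON_CLAIMING:
            if row.get("Minimum_claim_price") != 0 or row.get("Maximum_claim_price") != 0:
                bad_rows.append(i)
    return missing, bad_rows

# Hypothetical rows; "race_type" and "CLM" are illustrative names, not from the data dictionary.
rows = [
    {"race_type": "ALW", "Minimum_claim_price": 0, "Maximum_claim_price": 0},
    {"race_type": "CLM", "Minimum_claim_price": 5000, "Maximum_claim_price": 6250},
    {"race_type": "STK", "Minimum_claim_price": 0, "Maximum_claim_price": 0},
]
missing, bad_rows = validate(rows)
print(missing, bad_rows)
```

A clean data set should report zero missing cells and an empty list of offending rows.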

Analysis of Variable Handle

• The mean of Handle is 410117.981 and the median is 330726.

• The large gap between the mean and the median of Handle indicates the presence of outliers.

• The maximum value of Handle is 1800225.

• The first quartile (Q1) of Handle is 245911 and the third quartile (Q3) is 498734.

• The interquartile range (IQR = Q3 - Q1) of Handle is 252823.

• Any value greater than Q3 + 1.5 × IQR or less than Q1 - 1.5 × IQR may be an outlier.

• The upper bound is 498734 + 1.5 × 252823 = 877968.5, so any value greater than 877968.5 is flagged as an outlier. By this rule some outliers are present, and the maximum of 1800225 is definitely one (this needs to be checked against the business scenario).
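
The arithmetic behind the 1.5 × IQR bounds can be reproduced from the quartiles quoted above. The function name `tukey_fences` is chosen here for illustration; the quartile and maximum values are the ones reported for Handle.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Q1 and Q3 of Handle as reported on the slide.
lower, upper = tukey_fences(245911, 498734)
print(lower, upper)

# The reported maximum of Handle lies above the upper bound.
print(1800225 > upper)
```

The lower bound comes out negative, so for a non-negative quantity such as Handle only the upper bound can flag anything.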

Analysis of Variable Prize (Assuming purse as Prize)

• The mean of Prize is 29736.1757 and the median is 22500.00.

• The significant gap between the mean and the median of Prize indicates the presence of outliers.

• The standard deviation is high, which indicates a wide spread in the data.

• The maximum value of Prize is 500000.

• The first quartile (Q1) of Prize is 12000 and the third quartile (Q3) is 29000.

• The interquartile range of Prize is 17000, and according to the 1.5 × IQR rule there are some outliers (possibly 6) in this variable (to be verified against the business scenario).
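
Counting the outliers, rather than only computing the bounds, needs the raw values. A minimal sketch with Python's `statistics` module is shown below; the purse values are made up for illustration, and the quartile method (`inclusive`) is an assumption, since the slides do not say how the quartiles were computed.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the k * IQR bounds."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical purses, not taken from the data set.
purses = [10000, 12000, 15000, 20000, 22500, 25000, 29000, 30000, 500000]
outliers = iqr_outliers(purses)
print(len(outliers), outliers)
```

Different quartile conventions can shift the bounds slightly, so the count should be read together with the method used.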

Analysis of Variable distance_id

• The mean of distance_id is 710.382429 and the median is 700.0000.

• The small gap between the mean and the median of distance_id suggests that there are no outliers.

• The maximum value of distance_id is 1100.

• The first quartile (Q1) of distance_id is 600 and the third quartile (Q3) is 850.

• The interquartile range of distance_id is 250, and according to the 1.5 × IQR rule there are no outliers in distance_id.

Analysis of Variable race_number

• The mean of race_number is 5.70801034 and the median is 6.000000.

• The small gap between the mean and the median of race_number suggests that there are no outliers.

• The standard deviation is small, so the data is not widely spread.

• The maximum value of race_number is 15.

Analysis of Variable number_of_runners

• The mean of number_of_runners is 8.44832041 and the median is 8.000000.

• The small gap between the mean and the median of number_of_runners suggests that there are no outliers.

• The standard deviation (2) is small, so the data is not widely spread.

• The maximum value of number_of_runners is 567.
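
The "large gap" and "small gap" judgements in the five analyses above can be put on a common scale by expressing the gap relative to the median. The sketch below uses the means and medians quoted on the slides; the relative-gap measure itself is an illustrative choice, not something the slides define.

```python
# (mean, median) pairs as reported on the slides.
summary = {
    "Handle": (410117.981, 330726),
    "Prize": (29736.1757, 22500.00),
    "distance_id": (710.382429, 700.0),
    "race_number": (5.70801034, 6.0),
    "number_of_runners": (8.44832041, 8.0),
}

def relative_gap(mean, median):
    """Gap between mean and median as a fraction of the median."""
    return (mean - median) / median

for name, (mean, median) in summary.items():
    print(f"{name}: {relative_gap(mean, median):+.1%}")
```

A mean well above the median points to a long right tail. The gap is only a rough indicator, so it is best read alongside the 1.5 × IQR check.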

Business Insight of Data

• The main income of any horse race track management company is the handle, which is governed by independent factors such as attendance, the number of races, etc. at each race track.

• Hence our target variable is Handle.

• In this presentation I have analyzed the Handle variable against the different independent variables for all three tracks in order to determine the track that generates the maximum income for the horse race track management company.

Average Handle by days of week

• The total number of races across all three tracks is 774.

• The highest average handle across the tracks is on Wednesday, at 616678; the total number of races on Wednesday is 40.

• There is no race data for Wednesday and Thursday on track ID XXX.

• There is no race data for Monday, Tuesday and Wednesday on track ID YYY.

• There is no race data for Monday on track ID ZZZ.

• Average handle on weekends is higher than on weekdays.

• The largest number of races is held on Friday.

[Figure: two charts, "Average Handle by days of week" (average handle, 0 to 700000, for each day of the week) and "Number of races on days of week" (data labels 150, 84, 10, 40, 154, 162 and 174).]
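
The per-day figures in this section amount to a group-by on the day of the week. A minimal sketch follows, assuming one `(race_date, handle)` pair per race; the sample records are invented for illustration and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def handle_by_weekday(races):
    """races: iterable of (race_date, handle) pairs, one pair per race.

    Returns {day name: (number of races, average handle)}.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for race_date, handle in races:
        entry = totals[DAY_NAMES[race_date.weekday()]]
        entry[0] += 1
        entry[1] += handle
    return {day: (n, total / n) for day, (n, total) in totals.items()}

# Hypothetical records within the September to December 2004 window.
races = [
    (date(2004, 9, 1), 600000),  # a Wednesday
    (date(2004, 9, 1), 640000),
    (date(2004, 9, 3), 300000),  # a Friday
]
result = handle_by_weekday(races)
print(result)
```

Reporting the race count next to each average matters here, because a day with few races (such as Wednesday with 40) can show a high average that rests on a small sample.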

Average Handle by Number of Runners

• The maximum number of runners in a race is 14, and the average handle is also highest (733174) for those races.

• Average handle rises with the number of runners up to 12, but drops sharply for races with 13 runners.

• Races with 7 runners are the most common (153 observations).

[Figure: Average handle (0 to 800000) against Number of runners (4 to 14).]
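
A sharp fall such as the one reported at 13 runners can be located programmatically by comparing consecutive group averages. In the sketch below only the value for 14 runners (733174) comes from the slides; the other averages are invented for illustration, and `largest_drop` is a hypothetical helper.

```python
def largest_drop(avg_by_group):
    """avg_by_group: {group value: average handle}.

    Returns (from_group, to_group, change) for the steepest fall
    between consecutive groups (change is negative for a fall).
    """
    keys = sorted(avg_by_group)
    steps = [(a, b, avg_by_group[b] - avg_by_group[a]) for a, b in zip(keys, keys[1:])]
    return min(steps, key=lambda step: step[2])

# Hypothetical averages, except the reported 733174 for 14 runners.
avg_handle = {10: 450000, 11: 480000, 12: 520000, 13: 390000, 14: 733174}
drop = largest_drop(avg_handle)
print(drop)
```

The same helper works for any ordered grouping variable, for example race number.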

Average Handle by Race Number

• The highest average handle across the tracks is 646100.00, for race number 12.

• Race numbers 1, 2, 3, 4 and 5 are run most often (76 races each).

• Average handle rises with the race number, but there is a sharp decrease after race number 12; hence the customer should restrict the number of races per track to 12.

[Figure: Average handle (0 to 700000) against Race Number (1 to 15).]

Average Handle by Purse (Prize)

• Average handle increases sharply once the prize amount crosses 400000.

• The maximum average handle is 1800225, at a prize value of 500000.

[Figure: Average handle (0 to 2000000) against Purse (0 to 600000).]
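
One way to test a purse threshold is to compare the mean handle on either side of it. The sketch below assumes one `(purse, handle)` pair per race. Only the last pair (purse 500000, handle 1800225) is taken from the slides; the rest are invented, and the function name is hypothetical.

```python
from statistics import mean

def mean_handle_by_threshold(races, threshold):
    """races: iterable of (purse, handle) pairs.

    Returns (mean handle for purse <= threshold, mean handle for purse > threshold);
    a side with no races yields None.
    """
    low = [handle for purse, handle in races if purse <= threshold]
    high = [handle for purse, handle in races if purse > threshold]
    return (mean(low) if low else None, mean(high) if high else None)

# Hypothetical races, except the last pair, which is reported on the slide.
races = [(12000, 250000), (22500, 330000), (29000, 410000), (500000, 1800225)]
below, above = mean_handle_by_threshold(races, 400000)
print(below, above)
```

Because the threshold is a parameter, other candidate cut-offs can be compared on the same data.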

Suggestions

1. Wednesday, Friday, Saturday and Sunday are the highest gross handle days of the week.

2. There is a steep increase in handle when the purse is higher than 150000.

3. Restrict the number of races to 11 per day.

4. Average handle increases with the number of runners in the 4-12 range.