harness racing and sas
Post on 14-Dec-2014
HARNESS RACING AND SAS
USING SAS TO MODEL HORSE RACES
DATA SET
• “Past Performance” data from TrackMaster for the September 26, 2013 races at Yonkers Raceway
• Published in advance of the race
• Cost: $1.50
• Comes in XML format; parsed using Python
• Contains the 10 most recent past performances (PPs) for each horse racing that day
• 12 races x 8 horses x 10 past performances = 960 records
• Variables of use: lengths back at each quarter, final time, lead final time, gait, age (meta), track condition, track name, track length
• Created race-level, horse-race-level, and longitudinal data sets for different aspects of this analysis
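The XML-to-records step can be sketched roughly as below. The real TrackMaster schema is not shown in these slides, so every tag and attribute name here (card, race, horse, pp, final_time, and so on) is a hypothetical stand-in, and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: the actual TrackMaster schema is not shown in the
# slides, so all tag and attribute names below are assumptions.
SAMPLE_XML = """
<card track="Yonkers Raceway" date="2013-09-26">
  <race number="1">
    <horse name="Horse A" age="5">
      <pp final_time="118.2" gait="P" condition="FT"/>
      <pp final_time="119.0" gait="P" condition="GD"/>
    </horse>
    <horse name="Horse B" age="4">
      <pp final_time="120.4" gait="P" condition="FT"/>
    </horse>
  </race>
</card>
"""


def flatten_past_performances(xml_text):
    """Return one dict per past-performance line (a longitudinal layout)."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for race in root.iter("race"):
        for horse in race.iter("horse"):
            for pp in horse.iter("pp"):
                rows.append({
                    "race": int(race.get("number")),
                    "horse": horse.get("name"),
                    "age": int(horse.get("age")),
                    "final_time": float(pp.get("final_time")),
                    "gait": pp.get("gait"),
                    "condition": pp.get("condition"),
                })
    return rows


rows = flatten_past_performances(SAMPLE_XML)
```

With the real file, 12 races x 8 horses x 10 past performances would give the 960 rows mentioned above. The race-level and horse-race-level data sets could then be built by aggregating these rows.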
GAIT AND CONDITION
• Hypothesis: Gait and track condition influence race time
• Gait
  • Binary: Pacers and Trotters
  • Each race is one or the other
  • Each horse is one or the other
• Condition
  • Categorical: Fast, Good, or Sloppy
  • Each race categorized into one
• Created and cleaned race-level data set
• Comparison of means showed that mean race times differ across the levels of both variables
• T-test showed these differences are statistically significant
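As a rough illustration of the comparison behind the gait t-test, here is a standard-library sketch of Welch's t statistic. The slides do not say which t-test variant was run in SAS, so the choice of Welch is an assumption, and the race times are invented. A p-value would need a t distribution, which the Python standard library does not provide, so only the statistic and degrees of freedom are computed.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances
    se2 = va / na + vb / nb                          # squared standard error
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical final times in seconds; not values from the TrackMaster data.
pacer_times = [116.8, 117.4, 118.0, 117.1, 116.5]
trotter_times = [120.2, 121.0, 119.6, 120.8, 121.5]

t_stat, dof = welch_t(pacer_times, trotter_times)
```

A negative t statistic here means the first group (pacers) has the lower mean time.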
REMOVING OUTLIERS
GAIT T-TEST
CONDITION T-TEST
CORRELATION: LENGTHS BACK AT CALLS
• Some horses pull away early, others seem to wait for the last quarter to go to the front
• TrackMaster reports lengths back from the lead at the call for each quarter
• Lengths are recorded as fractional numbers (to the quarter) and as parts of a horse
  • Nose
  • Head
  • Neck
• Additional complication: “costly breaks” of pace and disqualifications
• Still not happy with the result: winners show strange lengths-back values at the final call
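Turning the mixed lengths-back notation into numbers might look like the sketch below. The exact string format in the TrackMaster file is not shown in the slides, and the numeric values assigned to nose, head, and neck are assumptions chosen only for illustration.

```python
from fractions import Fraction

# Assumed conversions for "parts of a horse"; the slides do not give values.
PART_OF_HORSE = {"nose": 0.05, "head": 0.1, "neck": 0.25}


def lengths_to_float(text):
    """Convert strings such as '2 1/4', '3', '1/2' or 'neck' to lengths."""
    text = text.strip().lower()
    if text in PART_OF_HORSE:
        return PART_OF_HORSE[text]
    # Sum whole-number and fractional tokens, e.g. '2 1/4' -> 2 + 0.25.
    return sum(float(Fraction(token)) for token in text.split())


examples = [lengths_to_float(s) for s in ["2 1/4", "3", "1/2", "Neck"]]
```

Breaks of pace and disqualifications would still need their own handling (for example, dropping those rows) before any correlation is computed.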
CORRELATION OF LENGTHS BACK BY QUARTER
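The correlation between lengths back at two calls can be computed with a plain Pearson coefficient, sketched below. The lengths-back values are hypothetical, not taken from the TrackMaster data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical lengths back at the first-quarter call and at the finish.
first_quarter = [0.0, 1.5, 3.0, 4.25, 6.0, 2.0]
finish = [1.0, 0.0, 2.5, 5.0, 7.5, 0.25]

r = pearson_r(first_quarter, finish)
```

Repeating this for every pair of calls gives a correlation matrix by quarter.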
AGE AND SPEED
• Goal: Quantify how much horses slow down with age
• Merged metadata for each horse with past performance data
• Single-variable regression analysis of the mean data set
• Found that age is not a great predictor of speed
• Age: Discrete, yet not categorical
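A single-variable regression of mean final time on age can be sketched with ordinary least squares as follows. The ages and mean times are invented, and treating the "mean data set" as one mean final time per horse is an assumption.

```python
def simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns R^2 as well."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical data: horse age in years and mean final time in seconds.
ages = [3, 4, 5, 6, 7, 8]
mean_times = [118.9, 118.2, 119.4, 118.6, 119.1, 118.8]

slope, intercept, r_squared = simple_ols(ages, mean_times)
```

A slope near zero with a low R^2 would match the finding that age is not a great predictor of speed.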
MULTIVARIATE REGRESSION
• Longitudinal data set
• Created dummy variables for past and present track conditions, gaits, and track sizes
• Used SAS’s LAG and LAST features
• Removed disqualified races
• Modeled race time on current race conditions and on the two prior races
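The lagging step was done in SAS DATA steps. A rough Python equivalent of building first and second lags within each horse is sketched below. The field names (horse, race_date, final_time) and the sample rows are hypothetical.

```python
from collections import defaultdict


def add_lags(rows, key="horse", order="race_date", value="final_time"):
    """Add lag and lag2 of `value` within each `key` group, ordered by `order`.

    Lags never cross from one horse to the next; a horse's first races get
    None where no earlier race exists.
    """
    history = defaultdict(list)
    out = []
    for row in sorted(rows, key=lambda r: (r[key], r[order])):
        past = history[row[key]]
        new_row = dict(row)
        new_row["lag_" + value] = past[-1] if len(past) >= 1 else None
        new_row["lag2_" + value] = past[-2] if len(past) >= 2 else None
        past.append(row[value])
        out.append(new_row)
    return out


# Hypothetical past performances (ISO dates sort correctly as strings).
races = [
    {"horse": "A", "race_date": "2013-09-01", "final_time": 117.0},
    {"horse": "A", "race_date": "2013-08-01", "final_time": 118.0},
    {"horse": "B", "race_date": "2013-08-20", "final_time": 121.0},
    {"horse": "A", "race_date": "2013-08-15", "final_time": 119.0},
]

lagged = add_lags(races)
```

Rows with a missing lag2 value would be dropped before fitting a model that uses the two prior races.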
Variables of Interest

Label             Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept                  104.67788          4.81142     21.76     <.0001
Lag final time               0.01412          0.03120      0.45     0.6510
Lag2 final time              0.11361          0.02975      3.82     0.0001
Pacer                       -3.68185          0.21247    -17.33     <.0001
Fast                        -0.77005          0.38954     -1.98     0.0484
Sloppy                       0.86942          0.43605      1.99     0.0465
Age                          0.05312          0.04023      1.32     0.1871
5/8 Track                   -2.74052          0.20313    -13.49     <.0001
1 Track                     -3.18411          0.47824     -6.66     <.0001

Control Variables

Label             Parameter Estimate   Standard Error   t Value   Pr > |t|
Fast lag                     0.35883          0.38598      0.93     0.3528
Sloppy lag                   0.48532          0.43151      1.12     0.2610
Fast lag2                    0.09472          0.37245      0.25     0.7993
Sloppy lag2                 -0.39904          0.42068     -0.95     0.3431
5/8 Track lag                0.14639          0.23680      0.62     0.5366
1 Track lag                  0.40192          0.51792      0.78     0.4379
5/8 Track lag2               0.58564          0.21764      2.69     0.0073
1 Track lag2                 0.67260          0.49172      1.37     0.1717
Final times from previous races are not great determinants of this race's final time!
PREDICTION OF SEPTEMBER 26 RACES
• Used the coefficients from my multivariate regression and the two most recent races for each horse
• Ranked horses by predicted race times
• My bets weren’t great, but they were better than choosing at random!
• Reason: Very low variance in race times among horses. Not enough predictive power in the model, even with R^2 > 0.5
[Chart: Predicting the Winner (Right vs. Wrong)]
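The ranking step can be sketched as follows. The coefficients are copied from the regression output in these slides, but only the variables of interest are used. The lagged-condition controls are left out for brevity, so this is a simplification rather than the full model. The three horses and their inputs are hypothetical, and reading the omitted dummy categories as trotter, good condition, and half-mile track is an assumption.

```python
# Coefficients from the regression table in the slides (variables of interest
# only; the lagged control variables are omitted here as a simplification).
COEFFICIENTS = {
    "lag_final_time": 0.01412,
    "lag2_final_time": 0.11361,
    "pacer": -3.68185,
    "fast": -0.77005,
    "sloppy": 0.86942,
    "age": 0.05312,
    "track_5_8": -2.74052,
    "track_1": -3.18411,
}
INTERCEPT = 104.67788


def predict_final_time(features):
    """Linear prediction; any feature not supplied is treated as 0."""
    return INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )


# Hypothetical field: three pacers on a fast half-mile track.
field = {
    "Horse A": {"lag_final_time": 118.0, "lag2_final_time": 117.5,
                "pacer": 1, "fast": 1, "age": 5},
    "Horse B": {"lag_final_time": 119.2, "lag2_final_time": 119.0,
                "pacer": 1, "fast": 1, "age": 4},
    "Horse C": {"lag_final_time": 117.8, "lag2_final_time": 120.5,
                "pacer": 1, "fast": 1, "age": 7},
}

predictions = {name: predict_final_time(f) for name, f in field.items()}
# Lowest predicted time first: the predicted winner heads the list.
ranking = sorted(predictions, key=predictions.get)
```

Because horses in the same race share gait, condition, and track, only the lagged times and age separate them, and those coefficients are small. That fits the low variance in predicted times noted above.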
FINAL THOUGHTS
• SAS’s LAG and LAST features are great for dealing with longitudinal data
• Most work was on the DATA steps, not the PROC steps
• My model was based on only 960 occurrences, 96 horses
• With more data, might model Pacers and Trotters separately, and Conditions separately
• Still want to investigate lengths back for winning horses
• Learned much about SAS and about harness racing