Air Travel Analytics in SAS

Shivani Kumar, Navin Lalwani, Rohan Nanda, Han Ni, Ying Zhu

Upload: rohan-nanda

Post on 25-Jun-2015


Table of Contents

○ Introduction

○ Business Question

○ Description of the Data

○ Exploratory Plots and Tables

○ Unsupervised and Supervised Analytics Models

○ Recommendations and Conclusion

○ Possible next steps


Introduction

Flight cancellation has always been a universal problem. As economic ties between countries grow, cancellations cause serious trouble for frequent travellers, especially long-distance travellers such as international students and business people. Our group members come from different parts of the world, so this question is of key interest to us. We therefore based our project on data from the Bureau of Transportation Statistics of the United States, hoping to generate insights into air travel cancellation that are useful to the frequent travellers mentioned above.

Cancellations cause a series of problems for various stakeholders in the tourism industry: customers' plans are delayed, airports become crowded, and demand for hotel rooms rockets if a large number of flights are cancelled on the same day due to severe weather. With our insights, travellers can plan ahead, airlines and airports can work to reduce cancellations based on our findings, and hotels can plan their marketing and sales around flight cancellation patterns.


Business Question

Flight cancellation can happen due to a variety of reasons. The most common causes are as follows:

1. Weather
2. Natural Disasters
3. Mechanical Errors
4. Monopoly Routes
5. Aircraft Size

Our team is interested in identifying the factors that lead to a flight cancellation. After choosing our datasets and running an initial analysis on them, we decided to focus on the following domains:

1. Segments - by the pair of Origin Airport ID and Destination Airport ID
2. Airport - by every Origin Airport ID
3. Airlines - by Airline ID

We learned to analyze data with the Decision Tree model and the Regression model in our Business Intelligence and Data Mining class. We therefore tried both models on the factors above and, at the initial stage of our analysis, chose the model with the smallest average squared error.

*To work with the two datasets, we first used SQL to combine them before conducting the analysis in SAS Enterprise Miner.


Description of the Data

After careful consideration, we chose two datasets:
(1) T100 Domestic Airline Segment Data
(2) Airline On-Time Performance Data

Both datasets come from the Bureau of Transportation Statistics of the Research and Innovative Technology Administration (RITA). The first dataset has more than 70k rows and contains domestic market data reported by U.S. air carriers, including carrier, origin, destination, and service class for enplaned passengers, freight and mail when both origin and destination airports are located within the boundaries of the United States and its territories. Each month, every certificated U.S. air carrier reports its traffic information to the Office of Airline Information [1], using an internal normalized form named T-100, and this dataset summarizes T-100 data from 1993 to 2013.

The dataset named Airline On-Time Performance Data has more than a million rows. It is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS), and contains on-time arrival data for non-stop domestic flights by major air carriers. It also provides additional items such as departure and arrival delays, origin and destination airports, flight numbers, scheduled and actual departure and arrival times, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance [2].

Variables Available
These two datasets have sufficient data volume and variables for analyzing the relationship between air traffic patterns and externalities, which we define here as airports and airlines.

(1) T100 Domestic Airline Segment Data
This dataset supplied key insights on the factors that result in flight cancellations. The key measures of this dataset are listed below:

Variables       Definition
DepScheduled    Departures Scheduled
DepPerformed    Departures Performed
Payload         Available Payload (pounds)
Seats           Available Seats
Passengers      Non-Stop Segment Passengers Transported
Freight         Non-Stop Segment Freight Transported (pounds)
Mail            Non-Stop Segment Mail Transported (pounds)
Distance        Distance between airports (miles)
LoadFactor      Load Factor: Ratio of Passenger Miles to Available Seat Miles
RampTime        Ramp to Ramp Time (minutes)
AirTime         Airborne Time (minutes)

[1] Source: http://www.transtats.bts.gov/Fields.asp?Table_ID=259
[2] Source: http://www.transtats.bts.gov/Fields.asp?Table_ID=236

(2) Airline On-Time Performance Data
This dataset supplied the factors that affect delays and the causes of the different types of delay. The key measures of this dataset are listed below:

Variables            Definition
CarrierDelay         Carrier Delay, in Minutes
WeatherDelay         Weather Delay, in Minutes
NASDelay             National Air System Delay, in Minutes
SecurityDelay        Security Delay, in Minutes
LateAircraftDelay    Late Aircraft Delay, in Minutes

Analysis Methodology:

1. Consolidated the data for the months of May, June and July

The first dataset contains T-100 data from 1993 to 2013 and more than 10 million records. To get a manageable and relevant subset, we consolidated the data from May 2013 through July 2013, which gave 70,000+ records.
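The consolidation step can be pictured as a simple filter on year and month. The sketch below is only an illustration in Python: the field names `year` and `month` and the sample rows are assumptions, not the actual T-100 column names or data.

```python
# Hypothetical T-100-style rows; field names and values are illustrative only.
records = [
    {"year": 2012, "month": 6, "dep_scheduled": 30, "dep_performed": 30},
    {"year": 2013, "month": 4, "dep_scheduled": 20, "dep_performed": 19},
    {"year": 2013, "month": 5, "dep_scheduled": 31, "dep_performed": 29},
    {"year": 2013, "month": 6, "dep_scheduled": 30, "dep_performed": 30},
    {"year": 2013, "month": 7, "dep_scheduled": 31, "dep_performed": 33},
    {"year": 2013, "month": 8, "dep_scheduled": 31, "dep_performed": 31},
]


def consolidate(rows, year=2013, months=(5, 6, 7)):
    """Keep only the rows that fall in the chosen year and months."""
    return [r for r in rows if r["year"] == year and r["month"] in months]


subset = consolidate(records)
print(len(subset), "rows kept for May to July 2013")
```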

2. Clean and construct new variables

a) Generated variables: Flights_Cancelled, Flights_Adhoc, Adhoc?, Cancellation?

The original first dataset has no direct indicator of the number of cancellations, but it contains Flights_Scheduled and Flights_Performed. We subtract Flights_Performed from Flights_Scheduled to get the number of flights with unexpected changes, covering both cancellations and ad hoc flights. If the difference is positive (fewer flights performed than scheduled), we store it in a new variable named “Flights_Cancelled”; if it is negative (more flights performed than scheduled), we store its magnitude in another new variable named “Flights_Adhoc”. We also created binary variables to show the occurrence of cancellations and ad hoc flights, named “Cancellation?” and “Adhoc?”.

Variables            Definition
Flights_Cancelled    Number of flights cancelled (Scheduled - Performed)
Flights_Adhoc        Number of flights which took off ad hoc (Performed - Scheduled)
Adhoc?               Binary variable to depict ad hoc flights
Cancellation?        Binary variable to depict cancellations
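As a rough illustration of how these four variables could be derived from the two departure counts, here is a minimal Python sketch. It follows the table's definition of Flights_Cancelled as Scheduled - Performed and assumes that a surplus of performed flights is what counts as ad hoc; the function name is hypothetical.

```python
def derive_cancellation_fields(dep_scheduled, dep_performed):
    """Derive cancellation and ad hoc counts plus binary flags for one record."""
    diff = dep_scheduled - dep_performed
    cancelled = max(diff, 0)   # scheduled flights that were not performed
    adhoc = max(-diff, 0)      # performed flights beyond the schedule (assumption)
    return {
        "Flights_Cancelled": cancelled,
        "Flights_Adhoc": adhoc,
        "Cancellation?": int(cancelled > 0),
        "Adhoc?": int(adhoc > 0),
    }


print(derive_cancellation_fields(30, 27))
print(derive_cancellation_fields(10, 12))
```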

b) Converted sum to average for: Passengers, Seats, Payload, Freight, Mail, Ramp_to_Ramp, AirTime

Several key indicators that could be potential externalities affecting cancellation rates are reported as totals summed over all flights in a record. The number of flights performed therefore influences those indicators. To remove this bias, we calculated the per-flight average of each indicator (total amount / number of flights performed) and generated new variables to store the results.

Variables           Definition
Avg_Passengers      Avg_Passengers = Passengers / Departures Performed
Avg_Seats           Avg_Seats = Seats / Departures Performed
Avg_Freight         Avg_Freight = Freight / Departures Performed
Avg_Mail            Avg_Mail = Mail / Departures Performed
Avg_Ramp_to_Ramp    Avg_Ramp_to_Ramp = Ramp_to_Ramp / Departures Performed
Avg_AirTime         Avg_AirTime = AirTime / Departures Performed
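The per-departure averages in the table can be sketched in Python as below. The report does not say how records with zero departures performed were handled, so returning `None` for them is purely an assumption, as are the function name and the sample row.

```python
FIELDS = ("Passengers", "Seats", "Freight", "Mail", "Ramp_to_Ramp", "AirTime")


def per_departure_averages(row, fields=FIELDS):
    """Divide each total by Departures Performed; None when no flights were performed."""
    departures = row["DepPerformed"]
    averages = {}
    for name in fields:
        # Assumption: records with zero departures get a missing value.
        averages["Avg_" + name] = row[name] / departures if departures else None
    return averages


sample_row = {"DepPerformed": 10, "Passengers": 1500, "Seats": 1800,
              "Freight": 2000, "Mail": 50, "Ramp_to_Ramp": 900, "AirTime": 700}
sample_averages = per_departure_averages(sample_row)
print(sample_averages)
```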

3. Analyzed data individually for each of the datasets

The two datasets that we are interested in relate to flight cancellations and delays. They have different primary keys, and the internal calculation logic differs between them. Therefore, we decided not to merge them and analyzed them individually.

Exploratory Plots and Tables

We explored both datasets to find relations between variables. We also looked for interesting patterns related to flight cancellations using Tableau.

Interesting Relationships
Using a scatter plot in the data exploration menu in SAS, we arrived at some interesting relationships between key variables in our dataset.

a) Departures Performed:

We plotted the variable “departures_performed” against the variable “Airline_ID” with respect to “Flight_Cancelled”. Blue indicates that a flight was not cancelled and red indicates that a flight was cancelled. The graph shows that the density of red pixels is very high for departures exceeding 150. More specifically, airlines that had a higher number of departures also had flight cancellations.

The departures_performed variable was noted for further investigation.

b) Number of Passengers:

We plotted the variable “Total Passengers” against the variable “Airport_ID” with respect to “Flight_Cancelled”. Blue indicates that a flight was not cancelled and red indicates that a flight was cancelled. An increase in the number of red pixels above the 2500 passenger mark can be observed. More specifically, airports that handled more passengers also had flight cancellations.

The total_passengers variable was noted for further investigation.

c) Distance:

We plotted the variable “Distance from Origin” against the variable “Dest_Airport_ID” with respect to “Flight_Cancelled”. Blue indicates that a flight was not cancelled and red indicates that a flight was cancelled. Distances between the 500 and 750 mile marks show a larger density of red pixels. It can be observed that shorter-distance flights see more cancellations.

The distance variable was noted for further investigation.

Using Tableau, we tried to find interesting facts about key variables.

a) Monthly Distribution of cancellations:

The charts above show that June and July are the months with the highest flight delays and cancellations. The number of diverted flights also increases in June and July.


b) Geographic distribution of flight delays

The three graphs above show that:
1. Georgia had the maximum flights delayed due to weather.
2. Texas had the maximum flights delayed due to security checks.
3. Thursday sees the maximum amount of flight delays.


Unsupervised and Supervised Analytics Models

For this project, we used k-means clustering as our unsupervised model, and tried decision trees and regression models for each of the three domains: airports, airlines and segments.

Unsupervised Learning Model
In the segments domain, running a k-means cluster analysis gave the following:

We had 46 clusters of segments. We were primarily interested in grouping segments based on the departures performed and the total flights cancelled in that segment.

We determined 5 major clusters. The departures performed in these clusters ranged from 6 to 864, and the flights cancelled for segments in these clusters ranged from 0 to 75. The five clusters, in decreasing order of frequency, are:


● The largest cluster comprised segments with an average of approximately 9 departures and an average of 0.05 flight cancellations.
● The next cluster comprised segments with an average of approximately 55 departures and an average of 0.21 flight cancellations.
● The next cluster comprised segments with an average of approximately 37.4 departures and an average of 3 flight cancellations.
● The next cluster comprised segments with an average of approximately 119 departures and an average of 0.39 flight cancellations.
● The next cluster comprised segments with an average of approximately 88 departures and an average of 2.2 flight cancellations.

We weren’t able to identify a significant trend with this model, so we continued with predictive modelling.
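To make the clustering step concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on (departures performed, flights cancelled) pairs. It is not the SAS Enterprise Miner implementation: the sample points, the starting centroids and the lack of any variable scaling are all assumptions made for illustration.

```python
def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical segments: (departures performed, flights cancelled).
segments = [(8, 0), (10, 0), (9, 0), (55, 0), (57, 1), (120, 0), (118, 1)]
centers, labels = kmeans(segments, centroids=[(8, 0), (55, 0), (120, 0)])
print(centers, labels)
```

Each resulting centroid is the pair of cluster averages, which is the form in which the clusters above are described.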

Supervised Learning Models

The two models that we looked at were:

1. Regression
2. Decision Tree

We base our final analysis on whichever of these two models has the lower average squared error (ASE).
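For reference, average squared error can be sketched as the mean of the squared differences between the predicted probability of cancellation and the actual 0/1 outcome. This is a simplified illustration; SAS Enterprise Miner may average over target levels differently, and the sample values below are made up.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between 0/1 outcomes and predicted probabilities."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical outcomes (1 = cancelled) and predicted cancellation probabilities.
outcomes = [0, 0, 1, 0]
probabilities = [0.1, 0.2, 0.7, 0.0]
ase = average_squared_error(outcomes, probabilities)
print(round(ase, 4))
```

A lower value means the predicted probabilities sit closer to what actually happened, which is why the model with the lower ASE is preferred.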

Regression Analysis
We conducted regression analysis to determine the significant factors that influence flight cancellations. We performed backward, forward and stepwise regression. The diagram below shows the regression diagram:

The following actions were performed on the data:

1. Data Partition: The data was partitioned into training and validation sets for basic model fitting and to prevent overfitting the training data.
2. Impute: Missing values in the data were imputed.
3. Regression Snapshots:

Stepwise Regression (With Airline ID as Target):

The ASE for Validation (Stepwise) : 0.100689


We looked at the regressions for the other selection methods too, and decided to go ahead with Stepwise as it had the lowest average squared error.

Output of the stepwise regression, depicting all significant variables:

Stepwise Regression (With Origin Airport ID as Target):

The ASE for this model was 0.112633

Similarly, for the segment-wise regression model analysis, we got an ASE of 0.090134.
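The Impute step that precedes these regressions fills in missing values before fitting. The report does not state which imputation method was used, so the sketch below assumes simple mean imputation for an interval variable; the function name and sample values are hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values (assumed method)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to base an estimate on
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


avg_payload = [10.0, None, 20.0]
print(impute_mean(avg_payload))
```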


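The errors we saw with the regression model were higher than those the decision tree gave us (ASE of 0.100689 vs 0.078363 by airline, 0.112633 vs 0.0987131 by airport, and 0.090134 vs 0.081963 by segment), so we rejected the regression model and based our analysis on the decision tree.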
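The model choice can be reproduced from the ASE values reported in this document for each domain. The short Python sketch below simply picks the model with the lower ASE per domain; only the dictionary layout and names are assumptions.

```python
# ASE values as reported in this report for each domain and model.
reported_ase = {
    "airline": {"regression": 0.100689, "decision_tree": 0.078363},
    "airport": {"regression": 0.112633, "decision_tree": 0.0987131},
    "segment": {"regression": 0.090134, "decision_tree": 0.081963},
}

# For each domain, keep the model with the smallest ASE.
best_model = {domain: min(models, key=models.get)
              for domain, models in reported_ase.items()}

for domain, model in best_model.items():
    print(domain, "->", model)
```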

Decision Tree Analysis
Decision trees are a simple but powerful form of multiple-variable analysis. They provide unique capabilities to supplement, complement, and substitute for traditional statistical forms of analysis. To assess the important variables in this study, we applied the decision tree model in SAS to identify the critical variables in our dataset. Using cross validation, we found the most important variables for our target and conducted further analysis to provide business suggestions on the factors that affect flight cancellations.

A) Based on Airline ID domain

Experiment Methodology:

1. Import the following dataset: T-100 Segment data for the months of May, June and July (84,232 rows).

2. Edit variables and assign a role to each variable

Variable                               Role      Level
Airline ID                             ID        Nominal
Aircraft Config                        Input     Interval
Aircraft Group                         Input     Interval
Aircraft Categorization                Input     Nominal
Departure Performed                    Input     Interval
Class                                  Input     Nominal
Average Freight                        Input     Interval
Average Airtime                        Input     Interval
Average Total Time at ground on bot    Input     Interval
Average Mail                           Input     Interval
Average Passengers                     Input     Interval
Average Payload                        Input     Interval
Average Ramp to Ramp                   Input     Interval
Distance                               Input     Interval
Month                                  Input     Interval
Flight Cancelled                       Target    Nominal

The other variables, which are not important for this analysis, were rejected.

3. Data Partition
70% of the data is used for training and 30% for validation; all other settings follow the defaults.
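A 70/30 partition of this kind can be sketched in Python as a seeded shuffle followed by a cut. This is a simple random split for illustration only; whether the SAS Enterprise Miner default stratifies on the target is not stated in the report, and the seed and function name are assumptions.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, valid = partition(range(100))
print(len(train), "training rows,", len(valid), "validation rows")
```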

4. Transformation
Variable transformations can be used to stabilize variance, remove nonlinearity, improve additivity, and counter non-normality. The following variables were transformed to address these irregularities:

Variable                   Method
Average Ramp to Ramp       Log
Average Payload            Log
Average Passengers         Log
Average Airtime            Log
Aircraft Categorization    Dummy Indicator
Class                      Dummy Indicator

After transformation, the skewness of the variables reduced considerably, as seen in the figures below:
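The effect of a log transformation on skewness can be checked with a small Python sketch. The sample values are made up, the skewness formula is the plain moment-based one, and the natural log assumes strictly positive inputs (zero averages would need an offset, which the report does not discuss).

```python
import math


def skewness(values):
    """Moment-based sample skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed variable, e.g. an average payload.
raw = [1, 10, 100, 1000, 10000]
logged = [math.log(v) for v in raw]  # assumes every value is > 0

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(logged))
```

In this constructed example the logged values are evenly spaced, so the skewness drops from clearly positive to essentially zero.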


5. Decision Tree Analysis
Cross validation is applied; all other settings follow the defaults.

6. Results
The ASE for Validation data is: 0.078363


Decision Tree:

We also looked at the various important variables for this dataset:

The subtree assessment plot depicted that the tree was pruned such that there are 45 leaves.


7. Outcomes

For a given airline, if:
● the number of departures performed is more than approximately 3,
● the average number of passengers travelling is less than approximately 3

then there is a 99.6% probability that a flight of that airline will not be cancelled.


For a given airline, if:
● the average payload is less than 10,
● the Class is F,
● the departures performed are less than 49

then there is an 82.4% probability that the flight would get cancelled.

For a given airline, if:
● the departures performed are more than 70,
● the average payload is more than 9 pounds,
● the average total time on ground is more than 18 minutes

then there is an 83.3% probability that the flight would get cancelled.
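The three airline outcomes above can be written down as explicit rules and checked against a record. The Python sketch below encodes only the splits quoted in this report; the full tree has 45 leaves, so these rules are partial paths and more than one may match a record. The field names and the example record are assumptions.

```python
# Each rule: (description, predicate, outcome, probability quoted in the report).
AIRLINE_RULES = [
    ("frequent, very few passengers",
     lambda r: r["departures_performed"] > 3 and r["avg_passengers"] < 3,
     "not cancelled", 0.996),
    ("light payload, Class F, under 49 departures",
     lambda r: (r["avg_payload"] < 10 and r["flight_class"] == "F"
                and r["departures_performed"] < 49),
     "cancelled", 0.824),
    ("busy, heavier payload, long ground time",
     lambda r: (r["departures_performed"] > 70 and r["avg_payload"] > 9
                and r["avg_ground_time"] > 18),
     "cancelled", 0.833),
]


def matching_rules(record):
    """Return every quoted rule whose conditions the record satisfies."""
    return [(name, outcome, prob)
            for name, test, outcome, prob in AIRLINE_RULES if test(record)]


# Hypothetical airline record.
example = {"departures_performed": 80, "avg_passengers": 50, "avg_payload": 12,
           "flight_class": "F", "avg_ground_time": 20}
matches = matching_rules(example)
print(matches)
```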

B) Based on Airport ID
Changing the ID variable to Origin Airport ID and keeping the other configurations the same, we see the following results:

The ASE for Validation data is 0.0987131


The decision tree:

We see that the same set of variables were important for this analysis as well:

The subtree assessment plot with the average square errors:


Outcomes

For a given airport, if:
● the departures performed are more than 42,
● the average payload is less than 10 pounds,
● the average mail sent is more than 1,

then it is very unlikely that the flight would get cancelled (100% probability of no cancellation).

For a particular Airport ID, if:
● the departures performed are more than 70,
● the flights belong to Class F,
● the average payload is less than 10 pounds and Aircraft Config is less than 2

then it is 83.6% likely that the flight would get cancelled.


C) Based on Segments (Origin Airport ID and Destination Airport ID pairs)

Experiment Methodology:

1. Import the following dataset: T-100 Segment data for the months of May, June and July (84,232 rows).

2. Edit variables and assign a role to each variable

Variable                               Role      Level
Origin_Airport_ID                      ID        Nominal
Dest_Airport_ID                        ID        Nominal
flightAdHoc?                           Input     Binary
Aircraft Config                        Input     Interval
Aircraft Group                         Input     Interval
Aircraft Categorization                Input     Nominal
Departure Performed                    Input     Interval
Class                                  Input     Nominal
Average Freight                        Input     Interval
Average Airtime                        Input     Interval
Average Total Time at ground on bot    Input     Interval
Average Mail                           Input     Interval
Average Passengers                     Input     Interval
Average Payload                        Input     Interval
Distance                               Input     Interval
Month                                  Input     Interval
Flight Cancelled?                      Target    Nominal

The other variables, which are not important for this analysis, were rejected.

3. Data Partition
70% of the data is used for training and 30% for validation; all other settings follow the defaults.

4. Transformation

Variable                   Method
Average Payload            Log
Average Passengers         Log
Average Airtime            Log
Aircraft Categorization    Dummy Indicator
Class                      Dummy Indicator

After transformation, the skewness of the variables reduced considerably, as seen in the figures in the airline-based analysis above.
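The Dummy Indicator method applied to the nominal variables turns each level into its own 0/1 column. A minimal Python sketch follows; the level "F" appears in this report, while "G" and the column naming scheme are hypothetical.

```python
def dummy_indicators(values, prefix):
    """Create one 0/1 indicator per distinct level of a nominal variable."""
    levels = sorted(set(values))
    return [{f"{prefix}_{level}": int(value == level) for level in levels}
            for value in values]


# Hypothetical Class values for three records.
encoded = dummy_indicators(["F", "G", "F"], prefix="Class")
print(encoded)
```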

5. Decision Tree Analysis
Cross validation is applied; all other settings follow the defaults.

6. Results
The ASE for Validation data is: 0.081963


Decision Tree:

We also looked at the various important variables for this dataset:

The subtree assessment plot depicted that the tree was pruned such that there are 36 leaves.


7. Outcomes

For a given segment, if:
● The number of departures performed is more than approximately 70,
● The average allotted payload is less than approximately 9 pounds,

then there is an 88% probability that flights in that segment will get cancelled.


For a given segment, if:
● The number of departures performed is more than approximately 70,
● The average allotted payload is more than approximately 9 pounds,
● The average total time on ground for both source airport and destination airport is greater than approximately 19 minutes,

then there is an 83.3% probability that flights in that segment will get cancelled.

For a given segment, if:
● The number of departures performed is less than approximately 10 and greater than 2,
● The flights took off ad hoc, without a schedule,

then there is a 94.7% probability that flights in that segment will get cancelled.


Recommendations and Conclusion

Important Variables Venn Analysis
We performed a Venn analysis on the important variables in each of the three domains and plotted them, considering the ones that were important in arriving at our recommendations.

● Departures Performed and Avg. Payload are the most important variables in our analysis for all three domains. They are the game-changing decider variables that determine cancellations for segments, airlines and airports.
● Airlines and Segments share avg total time on ground at both source and destination as an important variable. This is interesting because it is counter-intuitive: one would expect this to appear as a decider variable for airports.
● Airlines and airports share the aircraft_class variable.
● FlightAdHoc, Avg. Passengers, and Airport Config and Avg Mails are important for segments, airlines and airports respectively.
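The Venn analysis amounts to set intersections over the important variables of each domain. The Python sketch below transcribes the membership from the bullets above; the exact variable labels are taken loosely from the text and the dictionary layout is an assumption.

```python
# Important variables per domain, as described in the bullets above.
important = {
    "segments": {"Departures Performed", "Avg Payload",
                 "Avg Total Time on Ground", "FlightAdHoc"},
    "airlines": {"Departures Performed", "Avg Payload",
                 "Avg Total Time on Ground", "Aircraft Class", "Avg Passengers"},
    "airports": {"Departures Performed", "Avg Payload",
                 "Aircraft Class", "Airport Config", "Avg Mails"},
}

# Variables important in every domain.
shared_by_all = set.intersection(*important.values())

# Variables shared by exactly two domains.
airlines_segments_only = (important["airlines"] & important["segments"]) - important["airports"]
airlines_airports_only = (important["airlines"] & important["airports"]) - important["segments"]

# Variables unique to the segments domain.
segments_only = important["segments"] - important["airlines"] - important["airports"]

print(sorted(shared_by_all))
```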

Findings and Recommendations

Segments:
Findings:
● Segments whose flights carry very little payload on average (< 8 pounds) but fly frequently are likely to see cancellations. Moreover, segments whose flights carry higher payloads and fly frequently, but spend more than 18 minutes at both the source and destination airports, are also likely to see cancellations.
● Segments whose flights have few departures and take off without being scheduled see few or no cancellations.

Recommendations:
● The airport should pilot a program to redirect some of the congested segments’ traffic to runways that handle the non-scheduled flights. Based on the results, it can determine whether priority given to non-scheduled aircraft was causing cancellations.
● A new runway should be opened to speed up ground handling and reduce the average time that higher-payload aircraft spend on the ground at both source and destination.
● The airport is accommodating flights of non-congested segments, including flights that are not scheduled. However, flights in congested, heavy-traffic segments with few or no passengers are being cancelled, as are those with passengers and cargo that take time on the ground at both source and destination.

Airlines:
Findings:
● Small flights (carrying three or fewer people) that fly more often (more than 3 departures) have very little chance of getting cancelled.
● Flights that fly more often with little payload (less than 9 pounds) tend to get cancelled more often. They also spend a considerable amount of time at the airports (18 minutes).

Recommendations:
● The last recommendation for segments ties into the airlines domain as well. Ground crews of airline companies should make sure that ground handling at the airport is quick for higher-payload aircraft at both source and destination.
● The payload analysis from segments agrees with our finding for aircraft with fewer passengers. Just as flights in low-payload, high-departure segments were getting cancelled, the same holds true for airlines. Airline ground staff at airports should be alert when these flights are scheduled to arrive and depart, to make sure that handling time is fast.

Airports:
Findings:
● For airports with frequent departures (more than 70), relatively low payload (10 pounds or less), Class F flights, and mail being loaded into the aircraft, it is very likely that these flights would get cancelled.

Recommendations:
● As these delays affect a large population, the airports should study Scheduled Passenger/Cargo service flights to understand why they result in frequent cancellations. From our findings, it appears that handling time, in terms of baggage and mail loading into the aircraft, is deciding the cancellations, apart from other important variables. In conclusion, handling at the airports is taking time.


Possible next steps

According to the Wall Street Journal, illness, family emergencies, and rescheduled business meetings are big business for airline companies. At some airlines, the resulting change fees and penalties that passengers ended up paying added up to $2 billion a year [3], which is even higher than the total baggage fees. If airlines delve further into seasonal client data to find cancellation patterns on the passenger side, and adjust change fees and penalties according to the patterns discovered, they can generate higher revenue.

[3] Source: http://online.wsj.com/news/articles/SB10001424052970204563304574318212311819146
