Exploratory Analysis of US Domestic Flight Data
In this article I will walk through an exploratory data analysis in Python using numpy, pandas, matplotlib and seaborn. I am going to use US domestic flight data for the month of January in the years 2017 through 2019, downloaded from here.
If you would like to go through this project yourself, you can use a Jupyter notebook and download the three .csv files from this repository.
Let’s go ahead and import the packages, which we are going to use for our project.
In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
Make sure the .csv files are in the same directory as your Jupyter notebook, then run the following code to load the datasets into pandas dataframes and print out their sizes. The three files contain data for January of 2017, 2018 and 2019, respectively.
In[2]:
flight_data1 = pd.read_csv("flight_data1.csv", dtype={'DEP_TIME': object, 'ARR_TIME': object})
flight_data2 = pd.read_csv("flight_data2.csv", dtype={'DEP_TIME': object, 'ARR_TIME': object})
flight_data3 = pd.read_csv("flight_data3.csv", dtype={'DEP_TIME': object, 'ARR_TIME': object})
print(flight_data1.shape, flight_data2.shape, flight_data3.shape)
Out [2]:
(450017, 27) (570118, 27) (583985, 27)
Each pair in the output above gives the number of rows and columns in one of the datasets; for example, flight_data1 contains 450,017 rows and 27 columns. Together the three datasets contain about 1.6 million rows, so to avoid slow runtimes we are going to sample 10,000 random rows from each dataframe for our analysis.
In[3]:
sample_data1 = flight_data1.sample(n=10000, random_state=1)
sample_data2 = flight_data2.sample(n=10000, random_state=1)
sample_data3 = flight_data3.sample(n=10000, random_state=1)
Now let’s combine the three dataframes created above into one and look at the first 5 rows of the data.
In[4]:
combined_df = pd.concat([sample_data1, sample_data2, sample_data3])
print(combined_df.head())
Out[4]:
As you can see in the output above, a lot of the data is messy and unclear. Before using this data for our analysis, we will need to perform a few cleaning steps.
Data Cleaning
The DAY_OF_WEEK column encodes the days of the week as numbers 1 through 7. I am going to convert it into an ordered categorical variable with the day names as values.
In[5]:
def to_categorical(col, ordered_categories):
    ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories=ordered_categories)
    combined_df[col] = combined_df[col].astype(ordered_var)

col = 'DAY_OF_WEEK'
ordered_categories = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                      'Friday', 'Saturday', 'Sunday']
combined_df[col] = combined_df[col].astype(str)
weekday_dict = {'1': 'Monday', '2': 'Tuesday', '3': 'Wednesday',
                '4': 'Thursday', '5': 'Friday', '6': 'Saturday',
                '7': 'Sunday'}
combined_df[col] = combined_df[col].replace(weekday_dict)
to_categorical(col, ordered_categories)
combined_df.DAY_OF_WEEK.unique()
Out[5]:
[Monday, Saturday, Thursday, Wednesday, Sunday, Tuesday, Friday]
Categories (7, object): [Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday]
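A quick aside on why the ordered categorical is worth the trouble: order-aware operations now follow the weekday order instead of alphabetical order, which would put Friday first. A small sanity check, assuming the conversion above ran as shown:
# order-aware operations use the declared category order
print(combined_df['DAY_OF_WEEK'].min())                  # Monday, not the alphabetical minimum
print((combined_df['DAY_OF_WEEK'] <= 'Friday').mean())   # share of weekday flights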
The DEP_DEL15 column has a value of 0.0 if a flight is delayed by less than 15 minutes and a value of 1.0 if it is delayed by 15 minutes or more. Let's convert these values so that they express the meaning directly.
In[6]:
combined_df['DEP_DEL15'] = combined_df['DEP_DEL15'].astype(str)
delay_dict = {'0.0': '< 15 min', '1.0': '>= 15 min'}
combined_df['DEP_DEL15'] = combined_df['DEP_DEL15'].replace(delay_dict)
Similarly, I am going to convert the values in the CANCELLATION_CODE column to represent their meanings.
In[7]:
col = "CANCELLATION_CODE"cancellation_code_dict = {'A': 'carrier', 'B' : "weather", 'C': 'NAS', 'D': 'security'}combined_df[col] = combined_df[col].replace(cancellation_code_dict)
The DEP_TIME and ARR_TIME columns hold times as HHMM strings (which is why we read them in as objects) and need to be converted to actual time values. Let's do that and print out a few rows to see how the values look after the change.
In[8]:
def float_to_datetime(col):
    combined_df[col] = pd.to_datetime(combined_df[col], format='%H%M', errors='coerce').dt.time
float_to_datetime('DEP_TIME')
float_to_datetime('ARR_TIME')
print(combined_df.DEP_TIME.head())
print(combined_df.ARR_TIME.head())
Out[8]:
217132 19:26:00
188169 07:22:00
159183 11:01:00
281252 06:58:00
200261 14:52:00
Name: DEP_TIME, dtype: object
217132 21:58:00
188169 07:28:00
159183 17:27:00
281252 11:54:00
200261 17:48:00
Name: ARR_TIME, dtype: object
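One caveat worth checking: errors='coerce' silently turns anything it cannot parse, including the missing departure and arrival times of cancelled flights, into missing values (NaT). A quick count of how many rows were affected, added here as a sanity check:
# count values that did not survive the conversion
print(combined_df['DEP_TIME'].isna().sum())
print(combined_df['ARR_TIME'].isna().sum())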
Now that our data is relatively clean, we can proceed to exploration.
Univariate Exploration
As the title suggests, we are going to look at the distribution of individual variables in our dataset. Univariate exploration helps us determine which variables contain insights and are worth exploring further in relation to other dimensions. Keep in mind, though, that the counts are not absolute, since we are working with a sample of the data.
Let’s take a look at the distribution of flights by days of week.
In[9]:
default_color = sb.color_palette()[0]
sb.countplot(data = combined_df, x = 'DAY_OF_WEEK', color = default_color)
plt.xticks(rotation=20)
Out[9]:
As you can see in the graph above, there are far fewer flights on Saturdays than on the other days of the week.
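If you want the exact counts behind the bars, value_counts gives them directly; sort_index keeps the Monday-to-Sunday order thanks to the ordered categorical:
# exact flight counts per day of week, in weekday order
print(combined_df['DAY_OF_WEEK'].value_counts().sort_index())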
Now let's look at the distribution of flights by destination state to determine which states make the top 3.
In[10]:
default_color = sb.color_palette()[0]
plt.figure(figsize = [20,5])
sb.countplot(data = combined_df, x = 'DEST_STATE_NM', color = default_color)
plt.xticks(rotation=90)
Out[10]:
The top 3 destination states are Texas, Florida and California. Interestingly, those states are also the top 3 states by population.
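To read the exact numbers rather than eyeballing the bars, the same top 3 can be pulled with a one-liner:
# the three most common destination states in the sample
print(combined_df['DEST_STATE_NM'].value_counts().head(3))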
An interesting question we could ask is during which part of the day most flights depart. To answer this, let's look at the distribution of departures by hour of day.
In[11]:
def chart(occurrence_list):
    # drop missing times (NaT) before extracting the hour,
    # otherwise the histogram call fails on non-finite values
    hour_list = [t.hour for t in occurrence_list.dropna()]
    numbers = [x for x in range(0, 24)]
    labels = map(lambda x: str(x), numbers)
    plt.xticks(numbers, labels)
    plt.xlim(0, 24)
    plt.hist(hour_list)
    plt.show()
chart(combined_df['DEP_TIME'])
Out[11]:
Departures peak early in the morning and late in the afternoon, and there are very few flights at night.
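For the exact hourly counts behind the histogram, something like the following works (dropna skips flights with no recorded departure time):
# departures per hour of day, in hour order
hourly = combined_df['DEP_TIME'].dropna().map(lambda t: t.hour).value_counts().sort_index()
print(hourly)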
Let's now look at the distribution of flight durations, on a standard scale first.
In[12]:
binsize = 20
bins = np.arange(0, combined_df['AIR_TIME'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = combined_df, x = 'AIR_TIME', bins = bins)
plt.xlabel('Air Time (min)')
plt.show()
Out[12]:
Air time has a long-tailed distribution: many flights are around 100 minutes long, and far fewer last more than 200 minutes.
Due to the long tail on the right, we might miss some insights if we only look at the standard scale distribution. So let’s transform the axis and look at the distribution on a log scale.
In[13]:
log_binsize = 0.025
bins = 10 ** np.arange(1, np.log10(combined_df['AIR_TIME'].max()) + log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = combined_df, x = 'AIR_TIME', bins = bins)
plt.xscale('log')
plt.xticks([10, 20, 100, 200, 1000], ['10', '20', '100', '200', '1000'])
plt.xlabel('Air Time (min)')
plt.show()
Out[13]:
We can clearly see three peaks, at around the 80-, 120- and 300-minute marks.
Now I am going to explore the distribution of departure delay durations on a log scaled plot.
In[14]:
log_binsize = 0.05
bins = 10 ** np.arange(0, np.log10(combined_df['DEP_DELAY_NEW'].max()) + log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = combined_df, x = 'DEP_DELAY_NEW', bins = bins)
plt.xscale('log')
plt.xticks([1, 4, 10, 40, 100, 400, 1000], ['1', '4','10', '40', '100', '400', '1000'])
plt.xlabel('Departure Delay (min)')
plt.show()
Out[14]:
Clearly the majority of delays are less than 100 minutes long. Most delays are short, within 40 minutes, but there are also some extreme cases above 1,000 minutes.
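To put rough numbers on "most delays are short", quantiles of the delayed flights are a handy check; this is a quick sketch, not part of the original analysis:
# delay-duration quantiles for delayed flights only (zeros excluded)
delays = combined_df.loc[combined_df['DEP_DELAY_NEW'] > 0, 'DEP_DELAY_NEW']
print(delays.quantile([0.5, 0.9, 0.99]))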
Next, I would like to know the distribution of flight distances within the US, this time on a standard scale.
In[15]:
binsize = 50
bins = np.arange(0, combined_df['DISTANCE'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = combined_df, x = 'DISTANCE', bins = bins)
plt.xlabel('Flight distance (miles)')
plt.show()
Out[15]:
Most flights are shorter than 1,000 miles, but there are also extreme values ranging from 3,000 to 5,000 miles. It is important to determine whether the outliers in your data make sense. After googling flight distances between different parts of the US, it becomes clear that these outliers are reasonable. It is also good practice to look at the individual records containing the outliers and check whether they make sense on their own, but I am going to mostly skip that step in this analysis, since flight distances are not its focus.
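For readers who do want to inspect those records, a minimal sketch could look like this; the 3,000-mile cutoff is simply read off the histogram above:
# look at the route pairs behind the long-distance outliers
outliers = combined_df[combined_df['DISTANCE'] > 3000]
print(outliers[['ORIGIN', 'DEST', 'DISTANCE']].drop_duplicates().head(10))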
Bivariate Exploration
Every data analysis starts with finding the right questions to ask. That is a difficult task in its own right, and one way to approach it is to look at the big picture of your data, for instance with a correlation plot that highlights the relationships worth a closer look. The diagonal of every correlation plot is always 1, since those cells hold the correlation of a variable with itself. Let's create a correlation plot for our data and see if we can find any surprising correlations.
In[16]:
numeric_vars = ['DEP_DELAY_NEW', 'AIR_TIME', 'DISTANCE', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']
plt.figure(figsize = [8, 5])
sb.heatmap(combined_df[numeric_vars].corr(), annot=True, fmt='.3f',
           cmap='vlag_r', center=0)
plt.show()
Out[16]:
The strongest correlation is between air time and flight distance, which makes sense. There are also varying degrees of correlation between the individual delay reasons and the overall departure delay, which is also expected, since each delay reason contributes to the overall delay; judging from the plot, late aircraft delay contributes the most. The heatmap reveals no surprising correlations, but it confirms that the data behaves sensibly, and we can now move on and explore the dataset further.
One thing we would like to know is whether some US airports are home to longer delays than others. Since there are about 330 different airports in our dataset, we will only look at the 5 busiest ones. We will create a violin plot showing the distribution of total delay durations for each of those 5 airports, and we will exclude flights with no delays so that we can concentrate on the durations of actual delays.
In[17]:
# get the list of top 5 airports from the dataset.
top_5_origin = combined_df['ORIGIN'].value_counts().head(5).keys()
# create a dataframe with the top 5 airports, keeping delayed flights only;
# the parentheses around the comparison matter, because & binds tighter than >
top_5_origin_df = combined_df[combined_df['ORIGIN'].isin(top_5_origin) & (combined_df['DEP_DELAY_NEW'] > 0)]
# order origins by descending mean departure delay
my_order = top_5_origin_df.groupby(by=['ORIGIN'])['DEP_DELAY_NEW'].mean().sort_values(ascending=False).index
# create the violin plot, ordered by mean delay duration
default_color = sb.color_palette()[0]
plt.figure(figsize = [5,5])
sb.violinplot(data = top_5_origin_df, y= 'ORIGIN', x= 'DEP_DELAY_NEW', color= default_color, order= my_order)
Out[17]:
The plot shows fairly similar distributions across the 5 busiest airports, although O'Hare International Airport (ORD) clearly has more long delays than the others, since its distribution is wider towards the right. It is difficult to draw conclusions from this plot alone, because the violins look very much alike.
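When violins look this similar, numeric summaries can separate them where the eye cannot. An optional check on the same dataframe:
# per-airport summary statistics of delay durations
print(top_5_origin_df.groupby('ORIGIN')['DEP_DELAY_NEW'].describe())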
Now I want to explore the distribution of delays by type for each of the top 5 busiest airports.
In[18]:
yvars = ['DEP_DELAY_NEW', 'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY',
         'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']
plt.subplots(2, 3, figsize=(12, 7))
count = 1
default_color = sb.color_palette()[0]
for yvar in yvars:
    plt.subplot(2, 3, count)
    sb.boxplot(x="ORIGIN", y=yvar, data=top_5_origin_df, color=default_color)
    plt.xlabel('ORIGIN')
    plt.ylabel(yvar)
    plt.tight_layout()
    count += 1
Out[18]:
Just by looking at the number of data points in each plot above, we can say that late aircraft delay is the most frequent delay reason, while security delay is the least frequent. In fact, in this particular sample there are only 2 security delay incidents, both at LAX. Not surprisingly, LAX has fewer and shorter weather delays than the other airports, but surprisingly DEN has about the same weather delay distribution as LAX. This is unexpected, since Denver is a cold city with snowy winters, much like Chicago (ORD). DFW's weather delay distribution is mostly similar to LAX's, with just one 200+ minute delay. NAS delays are fewer and shorter at LAX and DFW, and carrier delays are fewer and shorter at DEN.
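The claim about the two LAX security incidents is easy to verify directly. SECURITY_DELAY is only populated for flights with a reported delay breakdown, hence the > 0 filter:
# count security-delay incidents per airport in the top 5 sample
security = top_5_origin_df[top_5_origin_df['SECURITY_DELAY'] > 0]
print(security['ORIGIN'].value_counts())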
I would like to know if any of the destinations have higher diversion rates than others, so let's run the code below.
In[19]:
# get the list of top 30 destination airports from the dataset
top_30_dest = combined_df['DEST'].value_counts().head(30).keys()
# create a dataframe with the top 30 destination airports
top_30_dest_df = combined_df[combined_df['DEST'].isin(top_30_dest)]
# order destinations by mean diversion rate
my_order = top_30_dest_df.groupby(by=['DEST'])['DIVERTED'].mean().sort_values(ascending=False)
# plot the bar chart
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
dest = my_order.index
mean_diverted = my_order
ax.bar(dest,mean_diverted)
plt.xticks(rotation = 90)
plt.xlabel('destination airport')
plt.ylabel('diversion rate')
plt.show()
Out[19]:
MDW is located in Chicago, and given that we are dealing with January data, it is not surprising that this destination airport has a high diversion rate. SAN is in San Diego; although San Diego is sunny all year round, SAN is listed, along with LGA, among the top 10 most challenging airports in the USA according to this article.
Let’s now take a look at the origin airports with the highest cancellation rates.
In[20]:
# get the list of top 30 origin airports from the dataset
top_30_origin = combined_df['ORIGIN'].value_counts().head(30).keys()
# create a dataframe with the top 30 airports
top_30_origin_df = combined_df[combined_df['ORIGIN'].isin(top_30_origin)]
# order origins by mean cancellation rate
my_order = top_30_origin_df.groupby(by=['ORIGIN'])['CANCELLED'].mean().sort_values(ascending=False)
# plot the bar chart
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
origin = my_order.index
mean_cancellation = my_order
ax.bar(origin,mean_cancellation)
plt.xticks(rotation = 90)
plt.xlabel('origin airport')
plt.ylabel('cancellation rate')
plt.show()
Out[20]:
MDW has roughly 7 times the cancellation rate of MIA. Again, this may be due to particularly bad weather, since the airport is located in snowy Chicago. ORD is also in Chicago, so it is not surprising that it has the second highest cancellation rate in this plot. The next 2 airports, JFK and BOS, are both located in cities with cold climates and a lot of snow in January. Overall, airports in colder cities cluster on the left side of the plot, while airports in warmer cities sit on the right.
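The 7x figure quoted above can be computed straight from the my_order series built in the previous cell, assuming both MDW and MIA made the top 30, as they do in this sample:
# ratio of MDW's cancellation rate to MIA's
print(my_order['MDW'] / my_order['MIA'])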
Now I want to see the cancellation rates broken down by day of week.
In[21]:
# order days of week by mean cancellation rate
my_order = combined_df.groupby(by=['DAY_OF_WEEK'])['CANCELLED'].mean().sort_values(ascending=False)
# plot the bar chart
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
origin = my_order.index
mean_cancellation = my_order
ax.bar(origin, mean_cancellation)
plt.xticks(rotation = 90)
plt.xlabel('day of week')
plt.ylabel('cancellation rate')
plt.show()
Out[21]:
The plot above is surprising, as one would expect roughly the same cancellation rate throughout the week. The differences between the days are not very large, though.
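One way to put "not very large" on firmer footing is a chi-square test of independence between day of week and cancellation. This is a sketch that assumes scipy is available (it was not imported at the top); a large p-value would support the claim that the differences are not significant:
from scipy import stats

# contingency table of cancellations by day of week
table = pd.crosstab(combined_df['DAY_OF_WEEK'], combined_df['CANCELLED'])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)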
Multivariate Exploration
In this part I will take a closer look at the top 10 busiest origin airports, their cancellation rates, cancellation types and delay durations.
First, I want to see how much each cancellation reason contributes to overall cancellations.
In[22]:
# get the list of top 10 busiest origin airports from the dataset
top_10_origin = combined_df['ORIGIN'].value_counts().head(10).keys()
# create a dataframe with the top 10 busiest origin airports
top_10_origin_df = combined_df[combined_df['ORIGIN'].isin(top_10_origin)]

def cancellation_code_rate(airport, cancellation_code):
    # share of all top-10 flights cancelled at this airport for this reason
    code_occurrence = top_10_origin_df[(top_10_origin_df['ORIGIN'] == airport) &
                                       (top_10_origin_df['CANCELLATION_CODE'] == cancellation_code)].shape[0]
    return code_occurrence / top_10_origin_df.shape[0]

# values of each group
bars1 = [cancellation_code_rate(airport, 'carrier') for airport in top_10_origin]
bars2 = [cancellation_code_rate(airport, 'weather') for airport in top_10_origin]
bars3 = [cancellation_code_rate(airport, 'NAS') for airport in top_10_origin]
bars4 = [cancellation_code_rate(airport, 'security') for airport in top_10_origin]
#cancellation reasons
cancellation_reasons = ['carrier', 'weather', 'NAS', 'security']
# Heights of bars1 + bars2
bars12 = np.add(bars1, bars2).tolist()
# Heights of bars1 + bars2 + bars3
bars123 = np.add(bars12, bars3).tolist()
# The position of the bars on the x-axis
r = [0,1,2,3,4,5,6,7,8,9]
# Names of group and bar width
names = list(top_10_origin)
barWidth = 1
# Create green bars
plt.bar(r, bars1, color='#565d47', edgecolor='white', width=barWidth)
# Create brown bars
plt.bar(r, bars2, bottom=bars1, color='#b49c73', edgecolor='white', width=barWidth)
# Create pink bars
plt.bar(r, bars3, bottom=bars12, color='#eaac9d', edgecolor='white', width=barWidth)
# Create yellow bars
plt.bar(r, bars4, bottom=bars123, color='#f0daa4', edgecolor='white', width=barWidth)
# axis, title, legend
plt.xticks(r, names, fontweight='bold')
plt.xlabel("origin airport")
plt.ylabel('cancellation rate')
plt.legend(cancellation_reasons)
# Show graphic
plt.show()
Out[22]:
Clearly, the predominant cancellation reason at every airport is weather. The proportions of NAS and carrier cancellations vary more between airports: PHX has no NAS cancellations at all, whereas ORD and SFO have far more NAS cancellations than carrier cancellations. Cancellations due to security appear only at LAX; in previous plots we saw that delays due to security reasons are likewise unique to LAX.
What about the distribution of short (< 15 minute) and long (≥ 15 minute) delays across those top 10 airports?
In[23]:
# Grouped boxplot
filtered_data = top_10_origin_df[top_10_origin_df['DEP_DEL15'] != 'nan']
plt.figure(figsize=(15,5))
sb.boxplot(x= "ORIGIN", y="DEP_DELAY_NEW", hue = "DEP_DEL15", data=filtered_data, palette = 'Set1')
plt.ylabel('Departure Delay (min)')
plt.xlabel('Origin Airport')
Out[23]:
As shown in the departure delay distribution plots in the bivariate exploration, a lot of delays are very short. The plot above separates out delays of 15 minutes or more, which makes it easier to examine the distribution of the longer delays. We can clearly see that the distribution of delays under 15 minutes is almost identical across all 10 airports. The picture is completely different for longer delays. Looking at the points outside the boxes, there are some very long delays, above 400 minutes, at DEN, LAX, LAS, ATL, DFW, PHX, ORD and IAH, while SFO and CLT do not exceed 400 minutes. Looking at the distribution of points within the red boxes, ORD is clearly the airport with the longest delays among the 10, whereas CLT has the shortest.