This document introduces the package DataExplorer, and shows how it can help you with different tasks throughout your data exploration process.
There are 3 main goals for DataExplorer:
The remainder of this guide is organized according to these goals. As the package evolves, more content will be added.
We will be using the nycflights13 datasets for this document. If you have not installed the package, please do the following:
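A minimal sketch of the installation step, assuming you install from CRAN. The requireNamespace() guard is an addition here so that the package is only downloaded when it is missing.

```r
# Install nycflights13 from CRAN only if it is not already available
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13")
}

# Attach the package so its datasets can be referenced by name
library(nycflights13)
```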
There are 5 datasets in this package: airlines, airports, flights, planes and weather.
To quickly visualize the structure of all of them, you may do the following:
library(DataExplorer)
data_list <- list(airlines, airports, flights, planes, weather)
plot_str(data_list)
You may also try plot_str(data_list, type = "r") for a radial network.
Now let’s merge the flights, airlines, planes and airports tables into a single, richer dataset for later sections.
merge_airlines <- merge(flights, airlines, by = "carrier", all.x = TRUE)
merge_planes <- merge(merge_airlines, planes, by = "tailnum", all.x = TRUE, suffixes = c("_flights", "_planes"))
merge_airports_origin <- merge(merge_planes, airports, by.x = "origin", by.y = "faa", all.x = TRUE, suffixes = c("_carrier", "_origin"))
final_data <- merge(merge_airports_origin, airports, by.x = "dest", by.y = "faa", all.x = TRUE, suffixes = c("_origin", "_dest"))
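Because every merge above uses all.x = TRUE, each step is a left join that should keep exactly one row per flight, provided the join keys are unique in the lookup tables. The following sketch checks that assumption; it is an optional sanity check added here and assumes the objects created above are in your session.

```r
library(nycflights13)

# Join keys must be unique in the lookup tables,
# otherwise the left joins would multiply rows
stopifnot(anyDuplicated(airlines$carrier) == 0)
stopifnot(anyDuplicated(planes$tailnum) == 0)
stopifnot(anyDuplicated(airports$faa) == 0)

# With unique keys, the merged data has one row per original flight
stopifnot(nrow(final_data) == nrow(flights))

# Rows and columns of the merged dataset
dim(final_data)
```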
Exploratory data analysis is the process of getting to know your data, so that you can generate and test hypotheses. Visualization techniques are usually applied.
To get introduced to your newly created dataset:
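A minimal sketch of the call, assuming final_data from the merge step is in your session. The t() call is only a presentation choice made here, so that each metric appears on its own row as in the table below.

```r
library(DataExplorer)

# introduce() returns a one-row data frame of basic metadata
intro <- introduce(final_data)

# Transpose so that each metric sits on its own row
t(intro)
```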
rows | 336,776 |
columns | 42 |
discrete_columns | 16 |
continuous_columns | 26 |
all_missing_columns | 0 |
total_missing_values | 809,170 |
complete_rows | 906 |
total_observations | 14,144,592 |
memory_usage | 97,254,656 |
To visualize the table above (with some light analysis):
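The corresponding call is sketched below, again assuming final_data is available in your session.

```r
library(DataExplorer)

# Plot the basic metadata (column types, missingness, complete rows)
# as percentages
plot_intro(final_data)
```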
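You should immediately notice some surprises: only 0.27% of the rows are complete (906 of 336,776), even though just 5.7% of all observations are missing (809,170 of 14,144,592).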
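The two percentages behind these surprises can be worked out directly from the summary table. The sketch below copies the figures from that table into hypothetical variable names and uses base R only.

```r
# Figures copied from the summary table above
rows <- 336776
complete_rows <- 906
total_missing_values <- 809170
total_observations <- 14144592

# Share of rows with no missing value at all
pct_complete_rows <- round(100 * complete_rows / rows, 2)

# Share of individual cells that are missing
pct_missing_obs <- round(100 * total_missing_values / total_observations, 2)

pct_complete_rows  # 0.27
pct_missing_obs    # 5.72
```

A small share of missing cells can still leave almost no complete rows when the missing values are concentrated in one or two columns.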
Missing values are definitely creating problems. Let’s take a look at the missing profiles.
Real-world data is messy, and you can simply use the plot_missing function to visualize the missing profile for each feature.
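A minimal sketch of the call, assuming final_data from the merge step is in your session:

```r
library(DataExplorer)

# One bar per feature, showing the share of missing values
plot_missing(final_data)
```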
The chart shows that the speed variable is mostly missing and probably not informative. It looks like we have found the culprit behind the 0.3% of complete rows. Let’s drop it:
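One way to drop the column is with drop_columns, which is also used later in this guide. The sketch assumes final_data is a plain data.frame (as returned by merge), so the result has to be assigned back.

```r
library(DataExplorer)

# Remove the mostly-missing speed column and keep the result
final_data <- drop_columns(final_data, "speed")
```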
Note: You may store the missing data profile with profile_missing(final_data) for additional analysis.
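As a sketch of such additional analysis, the stored profile can be sorted to list the features with the most missing values first. The name missing_profile is hypothetical; the columns feature, num_missing and pct_missing are the ones profile_missing returns.

```r
library(DataExplorer)

# Store the profile: one row per feature
missing_profile <- profile_missing(final_data)

# Show the features with the highest share of missing values first
head(missing_profile[order(-missing_profile$pct_missing), ])
```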
To visualize frequency distributions for all discrete features:
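A minimal sketch of the call, assuming final_data is in your session. Features with more than 50 categories are skipped by default, which explains the message that follows.

```r
library(DataExplorer)

# One bar chart per discrete feature; high-cardinality features are ignored
plot_bar(final_data)
```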
## 5 columns ignored with more than 50 categories.
## dest: 105 categories
## tailnum: 4044 categories
## time_hour: 6936 categories
## model: 128 categories
## name: 102 categories
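Upon closer inspection of the manufacturer variable, it is not hard to identify the following duplications: AIRBUS and AIRBUS INDUSTRIE; CANADAIR and CANADAIR LTD; MCDONNELL DOUGLAS, MCDONNELL DOUGLAS AIRCRAFT CO and MCDONNELL DOUGLAS CORPORATION.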
Let’s clean it up and look at the manufacturer distribution again:
final_data[which(final_data$manufacturer == "AIRBUS INDUSTRIE"),]$manufacturer <- "AIRBUS"
final_data[which(final_data$manufacturer == "CANADAIR LTD"),]$manufacturer <- "CANADAIR"
final_data[which(final_data$manufacturer %in% c("MCDONNELL DOUGLAS AIRCRAFT CO", "MCDONNELL DOUGLAS CORPORATION")),]$manufacturer <- "MCDONNELL DOUGLAS"
plot_bar(final_data$manufacturer)
Features dst_origin, tzone_origin, year_flights and tz_origin contain only one value each, so we should drop them:
final_data <- drop_columns(final_data, c("dst_origin", "tzone_origin", "year_flights", "tz_origin"))
Frequently, it is beneficial to look at a bivariate frequency distribution. For example, to look at discrete features by arr_delay:
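A sketch of the call, assuming final_data is in your session. The with argument names a continuous feature to be summed within each category, so bar lengths reflect total arr_delay rather than row counts.

```r
library(DataExplorer)

# Sum arr_delay within each category instead of counting rows
plot_bar(final_data, with = "arr_delay")
```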
## 5 columns ignored with more than 50 categories.
## dest: 105 categories
## tailnum: 4044 categories
## time_hour: 6936 categories
## model: 128 categories
## name: 102 categories
The resulting distribution looks quite different from the regular frequency distribution.
You may choose to break out all frequencies by a discrete variable:
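A sketch of such a call is shown below. Using origin as the grouping variable is an assumption made for illustration; any discrete feature in final_data should work. The by argument is only available in DataExplorer versions that support it.

```r
library(DataExplorer)

# Split every bar by the values of a discrete feature (here: origin)
plot_bar(final_data, by = "origin")
```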
## 5 columns ignored with more than 50 categories.
## dest: 105 categories
## tailnum: 4044 categories
## time_hour: 6936 categories
## model: 128 categories
## name: 102 categories