Not that anyone was holding their breath, but I’ve been exploring another data project that’s taken up a lot of time[^1]. Plus it’s summer, so I have been traveling. That said, I’m back.

Here’s Part 1 in case you missed it or can’t remember it.

Anyway, as you could probably tell from Exploratory Data Analysis - Part 1, we aren’t going to dwell on Exploratory Data Analysis (EDA) too much. While EDA is necessary and important, I think there are quickly diminishing returns regarding your time spent. Other practitioners may differ, but I generally do a quick glance for outliers and missingness, get a sense of the shape of the variables, then move on.

The three tools I use are:

`summary`
: a function within base R (i.e., built into R as opposed to being a package).

`DataExplorer`
: an R package that creates an HTML report describing your data.

`ggplot`
: the go-to package for data visualization in the R ecosystem. It can be installed on its own, though it’s also part of the `tidyverse` package.

From a code standpoint we’re going to do things a bit out of order, as I believe it will be helpful from a pedagogical standpoint not to get bogged down in the code too early. We’ll get to it very soon, but for now just accept that we have downloaded both the historical seasons and the current one, combined them, created some new features (e.g., fantasy points), and saved them to disk by position (e.g., QB, RB, etc.). It’s basically Rob Lowe’s character’s quote in Thank You for Smoking[^2]:

> **Nick Naylor:** But wouldn't they blow up in an all oxygen environment?
>
> **Jeff Megall:** Probably. But it's an easy fix. One line of dialogue. 'Thank God we invented the... you know, whatever device.'

For the purposes of this post I’ll only analyze the QB data we’ve created but the process can be applied to any data set, including the other positions.

We’ll first import the data set we want to perform an EDA on.

```
library(arrow)  # provides read_parquet

qbs_by_season_qualified <-
  read_parquet('000_a_data_science_journey/results/qbs_by_season_qualified.parquet')
```

### summary and quantile

The first EDA technique I use is the `summary` function. This will give you a quick understanding of min/max/median/mean stats per variable, as well as counts for any factor. It won’t tell you anything if the variable is a character/string[^3], but if a variable is truly a string this doesn’t matter.

To use `summary` on the `qbs_by_season_qualified` data frame, simply run:

`summary(qbs_by_season_qualified)`

Which gives you the following results:

```
season_type season player_id
REG :928 Min. :1999 00-0019596: 37
POST:199 1st Qu.:2004 00-0010346: 27
Median :2010 00-0020531: 27
Mean :2010 00-0022924: 27
3rd Qu.:2016 00-0023459: 22
Max. :2021 00-0022942: 20
(Other) :967
player_name position team
T.Brady : 37 QB:1127 Length:1127
P.Manning : 27 Class :character
D.Brees : 27 Mode :character
B.Roethlisberger: 27
A.Rodgers : 22
P.Rivers : 20
(Other) :967
games fantasy_points_dk_next
Min. : 1.00 Min. : -0.78
1st Qu.: 6.00 1st Qu.: 45.56
Median :12.00 Median :171.26
Mean :10.67 Mean :174.13
3rd Qu.:16.00 3rd Qu.:288.50
Max. :17.00 Max. :492.98
NA's :159
...
```

What is this output telling us?

First up is `season_type`, which is a factor. Factors are analyzed as a count of each level. In this example we see there are 928 rows/observations of `REG` and 199 of `POST`. If we had more than six factor levels (i.e., distinct factors) it would show the top six and then the count of `(Other)`, as seen in `player_id` and `player_name`.

Any numbers or dates (which I don’t have in this example) are summarized as quantiles. Quantiles give us the minimum and maximum values, as well as the mean and median. Additionally, and hence the name, quantiles show us the value at the 25th percentile (shown as 1st Qu.) and the 75th percentile (3rd Qu.). We also get a count of NA’s, as seen in `fantasy_points_dk_next`.

In short, `summary` does the following:

- Factors are handled as counts
- Numbers are handled as quantiles (more below)
- Characters are simply a row count
- NA’s are also handled as a count
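To see those four behaviors in isolation, here’s a tiny, hypothetical data frame (not our QB data) run through `summary`:

```r
# Hypothetical toy data to show how summary() treats each variable type
toy <- data.frame(
  season_type = factor(c('REG', 'REG', 'POST')),  # factor  -> level counts
  points      = c(10.5, NA, 22.1),                # numeric -> quantiles + NA count
  team        = c('NE', 'NE', 'GB'),              # string  -> just a row count
  stringsAsFactors = FALSE
)

summary(toy)
```

The factor column comes back as `REG :2` / `POST:1`, the numeric column as quantiles plus `NA's :1`, and the character column as nothing more than its length.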

If we want to inspect our own quantiles there’s a base R function to do this named… `quantile`. For `quantile` you need two arguments: the variable you’re interested in and the quantiles/percentiles you want to know. Let’s try it out on `fantasy_points_dk_next` for the 50th, 60th, 70th, 80th, 90th, and 100th percentiles. We’ll also ignore NA’s with the argument `na.rm = TRUE`.

`quantile(qbs_by_season_qualified$fantasy_points_dk_next, c(0.5, 0.6, 0.7, 0.8, 0.9, 1), na.rm = TRUE)`

This gives us the following output. You’ll notice the 50th percentile (also known as the median) matches the median from `summary`, and the 100th percentile equals the max.

```
50% 60% 70% 80% 90% 100%
171.260 230.824 271.604 306.160 343.464 492.980
```

This is also the first time we’ve come across an explicit sequence of values. Previously we created a range of seasons (`1999:2021`), but that shorthand only works with integers. Here we’re using decimals, so we spell out each value and wrap them in `c(…)`, which tells R we’re feeding it a vector.

Use percentile ranges that make sense for your data set and the problem you’re trying to solve.
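One convenience worth knowing: when your percentiles are evenly spaced, base R’s `seq` can build the vector for you instead of typing each value into `c(...)`. A sketch, using a hypothetical stand-in for the real column:

```r
# seq(from, to, by) generates evenly spaced values,
# equivalent to c(0.5, 0.6, 0.7, 0.8, 0.9, 1) here
probs <- seq(0.5, 1, by = 0.1)

# Drop-in replacement for the c(...) argument in the quantile() call
# (x is a hypothetical stand-in for the fantasy_points_dk_next column)
x <- c(-0.78, 45.56, 171.26, 288.50, 492.98, NA)
quantile(x, probs = probs, na.rm = TRUE)
```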

### DataExplorer

`DataExplorer` is a great R package that really expedites EDA.

The main function/feature in the package is `create_report`, which, as you can probably guess, creates an EDA report about the data.

```
create_report(
  data =
    qbs_by_season_qualified %>%
      select(
        fantasy_points_dk_next,
        fantasy_points_dk,
        starts_with('team_'),
        starts_with('passing_'),
        starts_with('rushing_')),
  output_file = 'qbs_by_season_qualified.html',
  output_dir = 'results/')
```

The three arguments we provide are `data` (what we’re analyzing), `output_file` (the file name for the report), and `output_dir` (where to save the file). Our data set has a lot of variables we don’t care about, so I’m selecting only the ones we do.

There are two things to note about the `select` portion of this code: one I might expand on in a later post and one I likely won’t. The one I won’t: the `tidyverse` provides helper functions like `starts_with`, where you give it the first few characters of a column name and it pulls anything that matches. There are a few similar functions that are worth checking out.

The one I might touch on in a future post: any time you can use a syntax/format for file names, column names, etc. that you can later match with regex or helpers like `starts_with`, the benefits are massive. If you know every file you’re saving is broken into categories (e.g., passing, rushing, etc.), a naming convention will help you SO much down the road when you’re reading them back in.
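Here’s a quick sketch of those helpers on a hypothetical data frame whose columns follow a `category_stat` naming convention (`starts_with`, `ends_with`, and `matches` all ship with the `tidyverse`; the column names below are made up for illustration):

```r
library(dplyr)

# Hypothetical QB stat line using a consistent naming convention
stat_line <- tibble(
  player_name   = 'T.Brady',
  passing_yards = 4500,
  passing_tds   = 35,
  rushing_yards = 25,
  rushing_tds   = 1
)

stat_line %>% select(starts_with('passing_'))  # passing_yards, passing_tds
stat_line %>% select(ends_with('_tds'))        # passing_tds, rushing_tds
stat_line %>% select(matches('^rushing_'))     # regex flavor of starts_with
```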

Back to `DataExplorer`. There are five main sections (IMO) of the report we can inspect:

- Basic Statistics
- Univariate Distribution
- Bar Chart
- Correlation Analysis
- Principal Component Analysis

Basic Statistics is essentially the `summary` function in base R. A nice thing it includes that isn’t in `summary` is how much memory your data set is taking up. If you are resource-constrained on the hardware side this is nice to know.

Univariate Distribution shows the distribution of all of your numeric variables. This is very helpful as a quick diagnostic for outliers and distribution shapes.

Bar Chart is similar to the Univariate Distribution section, but it deals with factors. Note that if a factor has more than 50 categories/levels it is ignored.

Correlation Analysis is especially helpful if you plan on running a linear regression on your data. Linear regressions are very powerful and computationally cheap, but the math breaks down if you feed it highly correlated variables.

Note that correlation works in both directions, positive and negative. Before running a linear regression you want the correlations among your predictors to be as close to 0 as possible, not 1 or -1.
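If you’d rather check correlations directly rather than through the report, base R’s `cor` does it. A sketch on hypothetical, simulated columns:

```r
# Hypothetical simulation: pass attempts largely determine pass yards,
# while a third variable is pure noise
set.seed(42)
attempts <- rnorm(100, mean = 550, sd = 50)
yards    <- attempts * 7 + rnorm(100, sd = 100)  # strongly tied to attempts
noise    <- rnorm(100)                           # unrelated

cor(attempts, yards)  # near 1: don't feed both to a linear regression
cor(attempts, noise)  # near 0: safe to include together
```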

Principal Component Analysis (PCA) is a very important technique that I may get into in detail down the road. For now it does two things for us, both of which can be very powerful.

The first is it tells us how much variance there is in our data across all of the (numeric) variables. PCA tries to pack as much variance as possible into the first principal component (PC1), then with the remaining variance pack as much as possible into the second principal component (PC2), and so on. In the example below, 20.8% of the variance of the entire (numeric) data set can be crammed into one variable. While it isn’t statistically or mathematically true, think of variance as information. Put another way, almost 21% of the data can be assigned to a single variable (PC1). PC2 holds 8.3% (29.1% - 20.8%) of the variance, but is shown here as a cumulative value; if you used PC1 and PC2 you’d have 29.1% of the “information” from this data set.

Knowing how the variance is distributed tells us what algorithms may work best with our data. Some algorithms focus on the variables that have the most information and make decisions/rules based on them, even if they aren’t optimal for the overall solution. These algorithms are known as “greedy” algorithms. If you have a data set where the first few PCs are explaining the vast majority of the data you may want to look at algorithms that account for this (which we’ll explore when we start modeling).

The second thing PCA can be useful for, though not in this example, is dimensionality reduction. If you have hundreds, thousands, or tens of thousands of features, your analysis can be both computationally expensive and break the math of certain algorithms.

Let’s say you have a data set with 1,000 variables. Many of them don’t really give you any information, and the ones that do might be highly correlated with other variables (e.g., distance from a shoreline and risk of flood). Not only will these highly correlated variables make a linear regression far less reliable, you’re also computing a lot of variables that aren’t just redundant, they’re actually harmful.

Instead of running your analysis on the full 1,000-variable data set, you could run a PCA on it and pick the first n PCs that account for a specific amount of variance (I tend to aim for ~80%). This not only allows you to run your analysis on a much smaller data set (typically single digits of variables) but also avoids the algorithmic issues. In our case below I wouldn’t actually employ PCA for the latter benefit, given the variance is pretty well distributed, but knowing how the variance is distributed is nonetheless useful information.
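A minimal sketch of that workflow with base R’s `prcomp`, assuming hypothetical simulated data and an illustrative 80% cutoff:

```r
# Hypothetical wide data set: 200 rows, 50 numeric columns
set.seed(1)
wide <- as.data.frame(matrix(rnorm(200 * 50), nrow = 200))

# Center and scale the variables, then run PCA
pca <- prcomp(wide, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained, PC by PC
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the first n PCs that together reach ~80% of the variance
n_pcs   <- which(cum_var >= 0.80)[1]
reduced <- pca$x[, 1:n_pcs, drop = FALSE]
dim(reduced)  # 200 rows by n_pcs columns
```

On pure noise like this the variance is spread almost evenly, so `n_pcs` stays large; on real data with highly correlated columns, the first handful of PCs would clear the cutoff.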

The `DataExplorer` package is an **amazing** tool for a quick look at what your data looks like and what potential steps you will need to take given the shape of your data.

### ggplot

Much like tidyverse, ggplot is a universe of its own. You could literally make a career focusing on ggplot, but obviously that is outside the scope of what we are trying to do here.

Suffice it to say, it is the go-to visualization package in R, and it is extremely powerful and robust. While I don’t know if we’ll do a separate post about it, it will (I think) come up naturally throughout this series, so I won’t dwell on it here. If you want to jump ahead a bit, there are a bunch of amazing ggplot resources, such as the R Graph Gallery.

### What’s Next?

Next we get into (for me anyway) the really fun stuff. We start cleaning the data, creating new variables, and getting the data prepped for some machine learning. We’ll dig into the code referenced before that gave us the positional data we performed EDA on in this post.

The next post will be **much more** of a deep dive into the tidyverse, as well as why we do (and don’t) create new features for our data set.

That’s what’s next!

[^1]: If this pops it’ll be an awesome series of posts, but it’s too soon to tell.

[^2]: A criminally underrated film.

[^3]: Even though the `team` variable is a factor in the real data, I converted it to a character to show how `summary` works for characters. Going forward it will be a factor.