1: Exploratory data analysis

We start by looking at how to explore our data. We will cover

  • Categorical variables

  • Numeric variables

  • Looking for outliers

  • Exploring correlations in the data

Some extrnal resources that you might find useful:

  • https://towardsdatascience.com/exploratory-data-analysis-in-python-a-step-by-step-process-d0dfa6bf94ee

# data manipulation
import pandas as pd
import numpy as np

# data viz
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns

# apply some cool styling
plt.style.use("ggplot")
rcParams['figure.figsize'] = (12,  6)

# use sklearn to import a dataset
from sklearn.datasets import load_wine

Import the data

Importing a dataset is simple with Pandas through functions dedicated to reading the data. If our dataset is a .csv file, we can just use

df = pd.read_csv("path/to/my/file.csv")

df stands for dataframe, which is Pandas’s object similar to an Excel sheet. This nomenclature is often used in the field. The read_csv function takes as input the path of the file we want to read. There are many other arguments that we can specify.

The .csv format is not the only one we can import — there are in fact many others such as Excel, Parquet and Feather.

For ease, in this example we will use Sklearn to import the wine dataset.

We eill set a new column called target which is a class for the wines.

# carichiamo il dataset
wine = load_wine()

# convertiamo il dataset in un dataframe Pandas
df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
# creiamo la colonna per il target
df["target"] = wine.target

Take a quick look using Pandas

Two of the most commonly used functions in Pandas are .head() and .tail(). These two allow us to view an arbitrary number of rows (by default 5) from the beginning or end of the dataset. Very useful for accessing a small part of the dataframe quickly.

We can also use:

  • .shape

  • .describe()

  • .info()

Note the .shape call is not followed by parentheses (); this is beacause shape is an attribute of the dataset, whereas describe for example is a function that acts on the dataset.

df.tail(3)
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline target
175 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35 10.2 0.59 1.56 835.0 2
176 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46 9.3 0.60 1.62 840.0 2
177 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35 9.2 0.61 1.60 560.0 2
df.shape
(178, 14)
df.describe()
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline target
count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000
mean 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270 0.361854 1.590899 5.058090 0.957449 2.611685 746.893258 0.938202
std 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859 0.124453 0.572359 2.318286 0.228572 0.709990 314.907474 0.775035
min 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000 0.130000 0.410000 1.280000 0.480000 1.270000 278.000000 0.000000
25% 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000 0.270000 1.250000 3.220000 0.782500 1.937500 500.500000 0.000000
50% 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000 0.340000 1.555000 4.690000 0.965000 2.780000 673.500000 1.000000
75% 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000 0.437500 1.950000 6.200000 1.120000 3.170000 985.000000 2.000000
max 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000 0.660000 3.580000 13.000000 1.710000 4.000000 1680.000000 2.000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  target                        178 non-null    int64  
dtypes: float64(13), int64(1)
memory usage: 19.6 KB

Notice that info gives quite different results to describe. Info tells us about the data types - describe gives us some summary statistics.

Make names more sensible

The name od280/od315_of_diluted_wines refers to a test for protein content. But it is not very descriptive to us, let’s chance it to make life easier for ourselves later.

df.rename(columns={"od280/od315_of_diluted_wines": "protein_concentration"}, inplace=True)

Undestanding the variables

We will look at two main types of variable discussed in the lecture: categorical and numeric.

Categorical variables

Categorical variables are those where the data are labelled by class, for example it could be data on something like postcode. If we had labelled houses by postcode, then this is said to be categorical data.

Let’s look at the distribution of the types for the wines, the target column

df.target.value_counts()
1    71
0    59
2    48
Name: target, dtype: int64
df.target.value_counts(normalize=True)
1    0.398876
0    0.331461
2    0.269663
Name: target, dtype: float64
df.target.value_counts().plot(kind="bar")
plt.title("Value counts of the target variable")
plt.xlabel("Wine type")
plt.xticks(rotation=0)
plt.ylabel("Count")
plt.show()
_images/lecture-1-exploratory-data-analysis_15_0.png

Numeric values

Numeric data is when we assign a numerical value as the label of an instance. To take the houses example again, we might label the houses by the distance to the nearest bus stop, this would then be a numeric data set.

We can perform exploratory analysis of the numeric values using the .describe() function.

df.magnesium.describe()
count    178.000000
mean      99.741573
std       14.282484
min       70.000000
25%       88.000000
50%       98.000000
75%      107.000000
max      162.000000
Name: magnesium, dtype: float64
df.magnesium.hist()
<AxesSubplot: >
_images/lecture-1-exploratory-data-analysis_18_1.png

Question do you think this has high/low skew - or high/low kurotis?

print(f"Skewness: {df['magnesium'].skew()}")
print(f"Kurtosis: {df['magnesium'].kurt()}")
Skewness: 1.098191054755161
Kurtosis: 2.1049913235905557
sns.catplot(x="target", y="proline", data=df, kind="box", aspect=1.5)
plt.title("Boxplot for target vs proline")
plt.show()
_images/lecture-1-exploratory-data-analysis_21_0.png

Question which of the datasets might have an outlier

sns.catplot(x="target", y="flavanoids", data=df, kind="box", aspect=1.5)
plt.title("Boxplot for target vs proline")
plt.show()
_images/lecture-1-exploratory-data-analysis_23_0.png

Take a closer look at the outlier

df[df['flavanoids'] > 5]
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue protein_concentration proline target
121 11.56 2.05 3.23 28.5 119.0 3.18 5.08 0.47 1.87 6.0 0.93 3.69 465.0 1

We can then remove that data from the data set, if we choose to.

df = df.drop(121)

Explore correlations in the data

As we discussed in the lecture we might want to remove features in the data that are very closely related to one another. Imagine a data set that had one feature with temperature in F and one with temperature in C. Now these two features tell us exactly the same thing, but on a different scale. Later we will see that adding both to a model would be of no benifit, as they do not carry any extra information. To look for redundant information between features ($x$ and $y$) we can use the correlation. $$ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2(y_i - \bar{y})^2}} $$

corrmat = df.corr()
corrmat
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue protein_concentration proline target
alcohol 1.000000 0.094397 0.211545 -0.310235 0.270798 0.289101 0.236815 -0.155929 0.136698 0.546364 -0.071747 0.072343 0.643720 -0.328222
malic_acid 0.094397 1.000000 0.164045 0.288500 -0.054575 -0.335167 -0.411007 0.292977 -0.220746 0.248985 -0.561296 -0.368710 -0.192011 0.437776
ash 0.211545 0.164045 1.000000 0.443367 0.286587 0.128980 0.115077 0.186230 0.009652 0.258887 -0.074667 0.003911 0.223626 -0.049643
alcalinity_of_ash -0.310235 0.288500 0.443367 1.000000 -0.083333 -0.321113 -0.351370 0.361922 -0.197327 0.018732 -0.273955 -0.276769 -0.440597 0.517859
magnesium 0.270798 -0.054575 0.286587 -0.083333 1.000000 0.214401 0.195784 -0.256294 0.236441 0.199950 0.055398 0.066004 0.393351 -0.209179
total_phenols 0.289101 -0.335167 0.128980 -0.321113 0.214401 1.000000 0.864564 -0.449935 0.612413 -0.055136 0.433681 0.699949 0.498115 -0.719163
flavanoids 0.236815 -0.411007 0.115077 -0.351370 0.195784 0.864564 1.000000 -0.537900 0.652692 -0.172379 0.543479 0.787194 0.494193 -0.847498
nonflavanoid_phenols -0.155929 0.292977 0.186230 0.361922 -0.256294 -0.449935 -0.537900 1.000000 -0.365845 0.139057 -0.262640 -0.503270 -0.311385 0.489109
proanthocyanins 0.136698 -0.220746 0.009652 -0.197327 0.236441 0.612413 0.652692 -0.365845 1.000000 -0.025250 0.295544 0.519067 0.330417 -0.499130
color_intensity 0.546364 0.248985 0.258887 0.018732 0.199950 -0.055136 -0.172379 0.139057 -0.025250 1.000000 -0.521813 -0.428815 0.316100 0.265668
hue -0.071747 -0.561296 -0.074667 -0.273955 0.055398 0.433681 0.543479 -0.262640 0.295544 -0.521813 1.000000 0.565468 0.236183 -0.617369
protein_concentration 0.072343 -0.368710 0.003911 -0.276769 0.066004 0.699949 0.787194 -0.503270 0.519067 -0.428815 0.565468 1.000000 0.312761 -0.788230
proline 0.643720 -0.192011 0.223626 -0.440597 0.393351 0.498115 0.494193 -0.311385 0.330417 0.316100 0.236183 0.312761 1.000000 -0.633717
target -0.328222 0.437776 -0.049643 0.517859 -0.209179 -0.719163 -0.847498 0.489109 -0.499130 0.265668 -0.617369 -0.788230 -0.633717 1.000000
hm = sns.heatmap(corrmat, 
                 cbar=True, 
                 annot=True, 
                 square=True, 
                 fmt='.2f', 
                 annot_kws={'size': 10}, 
                 yticklabels=df.columns, 
                 xticklabels=df.columns, 
                 cmap="Spectral_r")
plt.show()
_images/lecture-1-exploratory-data-analysis_30_0.png

Question - which data are most correlated. If you had to choose to drop one column based on the relation to the target, which would it be? If you had to drop one of the strongly correlated columns, which would it be?