Data

Data#

In God we trust, all others bring data.

—William Edwards Deming

Data is a broad term that refers to facts, statistics, or information in a raw, unprocessed, or organized form. Data can take many forms, including numbers, text, images, audio recordings, and more.

Data processing#

The process of preparing raw data for machine learning involves several stages of data processing and manipulation to transform it into a structured and suitable format. The most common stages are:

data collection;
data cleaning:
- handling missing values;
- remove duplicates;
- outlier detection;
- data type conversions;
data exploration and visualization;
feature engineering.

The result of these manipulation is what is usually called a dataset: a specific collection of data that is organized and structured in a way that makes it suitable for analysis, processing, or machine learning tasks.

Data types#

Numerical continuous data#

Continuous data can take on any real[1] value within a range and often involves measurements. For instance:

height
temperature
distance
time

Numerical discrete data#

Discrete data consists of distinct, separate values and often involves counts or categorizations, e.g.

number of children
shoe size
test scores

Important

The distiction between continuous and discrete data can be occasionally ambiguous. For example, age in years probably should be considered as a discrete variable. However, if we allow fractional ages, e.g. \(30.2\) years, it becomes a continuous variable.

Categorical nominal variables#

Nominal data consists of categories with no inherent order or ranking. For example:

colors
fruits
gender
countries

Categorical ordinal variables#

Ordinal data includes categories with a meaningful order or ranking. Examples:

education level
customer satisfaction
movie rating
top-10 items suggested by a search engine

Examples of datasets#

There are several way how you can import some famous datasets in Python.

Tip

To install Python library scikit-learn (aka sklearn), run the command pip install scikit-learn

For instance, we can use helpers from sklearn.datasets module.

Iris dataset#

from sklearn.datasets import load_iris
iris_data = load_iris(as_frame=True)
iris_data['data']

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/sklearn/utils/__init__.py:1193, in check_pandas_support(caller_name)
   1192 try:
-> 1193     import pandas  # noqa
   1195     return pandas

ModuleNotFoundError: No module named 'pandas'

The above exception was the direct cause of the following exception:

ImportError                               Traceback (most recent call last)
Cell In[2], line 2
      1 from sklearn.datasets import load_iris
----> 2 iris_data = load_iris(as_frame=True)
      3 iris_data['data']

File ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/sklearn/utils/_param_validation.py:214, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    208 try:
    209     with config_context(
    210         skip_parameter_validation=(
    211             prefer_skip_nested_validation or global_skip_validation
    212         )
    213     ):
--> 214         return func(*args, **kwargs)
    215 except InvalidParameterError as e:
    216     # When the function is just a wrapper around an estimator, we allow
    217     # the function to delegate validation to the estimator, but we replace
    218     # the name of the estimator by the name of the function in the error
    219     # message to avoid confusion.
    220     msg = re.sub(
    221         r"parameter of \w+ must be",
    222         f"parameter of {func.__qualname__} must be",
    223         str(e),
    224     )

File ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/sklearn/datasets/_base.py:691, in load_iris(return_X_y, as_frame)
    687 target_columns = [
    688     "target",
    689 ]
    690 if as_frame:
--> 691     frame, data, target = _convert_data_dataframe(
    692         "load_iris", data, target, feature_names, target_columns
    693     )
    695 if return_X_y:
    696     return data, target

File ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/sklearn/datasets/_base.py:96, in _convert_data_dataframe(caller_name, data, target, feature_names, target_names, sparse_data)
     93 def _convert_data_dataframe(
     94     caller_name, data, target, feature_names, target_names, sparse_data=False
     95 ):
---> 96     pd = check_pandas_support("{} with as_frame=True".format(caller_name))
     97     if not sparse_data:
     98         data_df = pd.DataFrame(data, columns=feature_names, copy=False)

File ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/sklearn/utils/__init__.py:1197, in check_pandas_support(caller_name)
   1195     return pandas
   1196 except ImportError as e:
-> 1197     raise ImportError("{} requires pandas.".format(caller_name)) from e

ImportError: load_iris with as_frame=True requires pandas.

This is a tabular dataset. The targets are encoded by digits \(0\), \(1\), \(2\):

iris_data['target'].value_counts()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 iris_data['target'].value_counts()

NameError: name 'iris_data' is not defined

What does these values mean?

iris_data.target_names  

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 iris_data.target_names  

NameError: name 'iris_data' is not defined

Here is how they look like in the wild (figure 1.1 from [Murphy, 2022])

setosa	versicolor	virginica

Bikeshare dataset#

Tip

To install Python library scikit-learn (aka sklearn), run the command pip install scikit-learn

import pandas as pd
auto_df = pd.read_csv("../ISLP_datsets/Bikeshare.csv")
auto_df.drop("Unnamed: 0", axis=1, inplace=True)
auto_df.head()

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[5], line 1
----> 1 import pandas as pd
      2 auto_df = pd.read_csv("../ISLP_datsets/Bikeshare.csv")
      3 auto_df.drop("Unnamed: 0", axis=1, inplace=True)

ModuleNotFoundError: No module named 'pandas'

Q. Which features are categorical and which are numeric?

MNIST dataset#

A classical dataset of handwritten digits.

from sklearn.datasets import fetch_openml

X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')
X.shape, Y.shape

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/sklearn/utils/__init__.py:1193, in check_pandas_support(caller_name)
   1192 try:
-> 1193     import pandas  # noqa
   1195     return pandas

ModuleNotFoundError: No module named 'pandas'

The above exception was the direct cause of the following exception:

ImportError                               Traceback (most recent call last)
File ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/sklearn/datasets/_openml.py:1043, in fetch_openml(name, version, data_id, data_home, target_column, cache, return_X_y, as_frame, n_retries, delay, parser, read_csv_kwargs)
   1042 try:
-> 1043     check_pandas_support("`fetch_openml`")
   1044 except ImportError as exc:

File ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/sklearn/utils/__init__.py:1197, in check_pandas_support(caller_name)
   1196 except ImportError as e:
-> 1197     raise ImportError("{} requires pandas.".format(caller_name)) from e

ImportError: `fetch_openml` requires pandas.

The above exception was the direct cause of the following exception:

ImportError                               Traceback (most recent call last)
Cell In[6], line 3
      1 from sklearn.datasets import fetch_openml
----> 3 X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')
      4 X.shape, Y.shape

File ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/sklearn/utils/_param_validation.py:214, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    208 try:
    209     with config_context(
    210         skip_parameter_validation=(
    211             prefer_skip_nested_validation or global_skip_validation
    212         )
    213     ):
--> 214         return func(*args, **kwargs)
    215 except InvalidParameterError as e:
    216     # When the function is just a wrapper around an estimator, we allow
    217     # the function to delegate validation to the estimator, but we replace
    218     # the name of the estimator by the name of the function in the error
    219     # message to avoid confusion.
    220     msg = re.sub(
    221         r"parameter of \w+ must be",
    222         f"parameter of {func.__qualname__} must be",
    223         str(e),
    224     )

File ~/.pyenv/versions/3.10.12/lib/python3.10/site-packages/sklearn/datasets/_openml.py:1051, in fetch_openml(name, version, data_id, data_home, target_column, cache, return_X_y, as_frame, n_retries, delay, parser, read_csv_kwargs)
   1045 if as_frame:
   1046     err_msg = (
   1047         "Returning pandas objects requires pandas to be installed. "
   1048         "Alternatively, explicitly set `as_frame=False` and "
   1049         "`parser='liac-arff'`."
   1050     )
-> 1051     raise ImportError(err_msg) from exc
   1052 else:
   1053     err_msg = (
   1054         f"Using `parser={parser_!r}` requires pandas to be installed. "
   1055         "Alternatively, explicitly set `parser='liac-arff'`."
   1056     )

ImportError: Returning pandas objects requires pandas to be installed. Alternatively, explicitly set `as_frame=False` and `parser='liac-arff'`.

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[7], line 4
      1 import matplotlib.pyplot as plt
      2 import numpy as np
----> 4 X = X.astype(float).values / 255
      5 Y = Y.astype(int).values
      7 def plot_digits(X, y_true, y_pred=None, n=4, random_state=123):

NameError: name 'X' is not defined

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 plot_digits(X, Y, random_state=12)

NameError: name 'plot_digits' is not defined

Q. What type of data is MNIST dataset?

One-hot encoding#

Before feeding categorical data into machine learning models, we need to convert them to a numerical scale. The standard way to do it is to use a one-hot encoding, also called a dummy encoding.

If a feature belongs to the final set \( \{1, \ldots, K\}\), it is encoded by a binary vector

\[ (\delta_1, \ldots, \delta_K) \in \{0, 1\}^K, \quad \sum\limits_{k=1}^K \delta _k = 1. \]

Thus each categorical variable, which takes \(K\) different values, is converted to \(K\) numeric variables.

Note

In fact, it is enough to have \(K-1\) dummy variables since the value of \(\delta_K\) can be automatically deduced from the values of \(\delta_1, \ldots, \delta_{K-1}\).

Feature matrix#

A tabular numerical dataset can be represented as a feature matrix (or design matrix) \(\boldsymbol X\) of shape \(N\times D\) where

\(N\) — number of samples (rows)
\(D\) — number of features (columns)

Each sample \(\boldsymbol x_i\) is therefore represented by \(i\)-th row of the feature matrix \(\boldsymbol X\).