Pandas#


Pandas is a Python-based tool for data analysis and manipulation:

  • designed for working with heterogeneous data

  • well suited for data importing, aggregation and cleaning

  • handy for quick visualizations of data

The best of pandas#

import pandas as pd
import numpy as np
df = pd.read_csv("titanic.csv", sep="\t")
type(df), df.shape
df.head(10)
df.describe()
df.info()
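If titanic.csv is not at hand, the same first-look calls can be tried on a tiny inline table. This is only a minimal sketch: the three rows below are made-up illustration data (not taken from the real dataset), `sample` is a hypothetical name, and io.StringIO stands in for the file on disk.

```python
import io

import pandas as pd

# Made-up, tab-separated sample standing in for titanic.csv (illustration only)
raw = (
    "PassengerId\tSex\tAge\tFare\tCabin\n"
    "1\tmale\t22\t7.25\t\n"
    "2\tfemale\t38\t71.28\tC85\n"
    "3\tfemale\t\t7.92\t\n"
)

# read_csv accepts any file-like object; sep="\t" matches the tab delimiter
sample = pd.read_csv(io.StringIO(raw), sep="\t")

print(type(sample), sample.shape)  # container type and (rows, columns)
print(sample.head(2))              # first two rows
print(sample.describe())           # summary statistics of the numeric columns
sample.info()                      # dtypes and non-null counts per column
```

Empty fields in the text are read as missing values, which is why info() reports fewer non-null entries for Age and Cabin than there are rows.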

Select columns#

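Use the syntax df[[col1, ..., colN]] to select several columns as a DataFrame. A single label, df[col], returns a pd.Series instead.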

df['Age']
df[['Age']]
type(df), type(df['Age']), type(df[['Age']])
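The difference between single and double brackets is easiest to see on a small frame. A sketch under the assumption of a hypothetical two-column table named `people` (not the Titanic data):

```python
import pandas as pd

# Hypothetical two-column frame used only for illustration
people = pd.DataFrame({"Age": [22, 38, 26], "Sex": ["male", "female", "female"]})

age_series = people["Age"]    # single label -> pd.Series
age_frame = people[["Age"]]   # list of labels -> pd.DataFrame with one column

print(type(age_series).__name__, age_series.shape)
print(type(age_frame).__name__, age_frame.shape)
```

The Series is one-dimensional, while the one-column DataFrame keeps two dimensions and its column name.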

Indexing#

df.sort_values("Age", inplace=True)
df.head(10)
df.tail(8)
# access by integer position (0-based row number in the current order)
df.iloc[78]
# access by label
df.loc[78]
# select several rows and several columns by label
df.loc[[78, 79, 100], ["Age", "Cabin"]]
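After sorting, positions and labels no longer coincide, which is where iloc and loc start to differ. A small sketch, assuming a hypothetical three-row frame named `ages` with the default labels 0, 1, 2:

```python
import pandas as pd

# Hypothetical frame; the default index labels are 0, 1, 2
ages = pd.DataFrame({"Age": [38, 22, 26], "Cabin": ["C85", None, "E46"]})
by_age = ages.sort_values("Age")    # rows are reordered, labels travel with their rows

first_by_position = by_age.iloc[0]  # first row in the current order (Age 22)
row_with_label_0 = by_age.loc[0]    # row whose index label is 0 (Age 38)

# loc also accepts a list of row labels and a list of column labels
subset = by_age.loc[[0, 2], ["Age", "Cabin"]]

print(first_by_position["Age"], row_with_label_0["Age"])
print(subset)
```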

pd.Series#

A one-dimensional slice of a DataFrame (a single column or a single row) has type pd.Series.

df["Age"].head(5).values

Access the index:

df["Age"].head(5).index
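A Series pairs an array of values with an index of labels, and both parts can be inspected separately. A minimal sketch on a hypothetical Series (the labels 10, 11, 12 are arbitrary):

```python
import pandas as pd

# Hypothetical Series with explicit labels
ages = pd.Series([22.0, 38.0, 26.0], index=[10, 11, 12], name="Age")

print(ages.values)      # underlying NumPy array, labels dropped
print(ages.index)       # the labels as a pandas Index
print(ages.to_numpy())  # method form of the same conversion to a NumPy array
```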

Creating pd.Series#

pd.Series([1, 2, 3], index=["Red", "Green", "Blue"])
pd.Series(1, index=["Red", "Green", "Blue"])

Convert a Series to a DataFrame:

s = pd.Series([1, 2, 3], index=["Red", "Green", "Blue"])
type(s.to_frame("Values"))
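The three constructions above can be run together in one place. A short sketch that reuses the same labels; the variable names are chosen here for illustration:

```python
import pandas as pd

labels = ["Red", "Green", "Blue"]

from_list = pd.Series([1, 2, 3], index=labels)  # one value per label
from_scalar = pd.Series(1, index=labels)        # the scalar is repeated for every label

frame = from_list.to_frame("Values")            # one-column DataFrame named "Values"

print(from_scalar.tolist())
print(frame.columns.tolist(), frame.shape)
```

The index of the Series becomes the row index of the resulting DataFrame.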

NaN’s#

df["Cabin"].head(10)
df["Cabin"].dropna().head(10)
df["Cabin"].fillna(3).head(10)
# backward fill: each gap takes the next valid value (fillna(method=...) is deprecated in recent pandas)
df["Cabin"].bfill().head(10)
pd.isna(df["Cabin"]).head(10)
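The effect of each missing-value call is easier to follow on a column short enough to read in full. A sketch assuming a hypothetical four-element Series standing in for df["Cabin"]; the fill value "Unknown" is likewise just an illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps, standing in for df["Cabin"]
cabin = pd.Series(["C85", np.nan, np.nan, "E46"], name="Cabin")

print(pd.isna(cabin).tolist())           # True where a value is missing
print(cabin.dropna().tolist())           # missing entries removed
print(cabin.fillna("Unknown").tolist())  # missing entries replaced by a constant
print(cabin.bfill().tolist())            # each gap takes the next valid value below it
```

None of these calls modifies `cabin` itself; each returns a new Series.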

Visualization#

df.sort_index()["Fare"].plot();
df["Sex"].hist();
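For a categorical column such as Sex, one alternative to hist() is to count the categories first and draw a bar chart of the counts. A minimal sketch, assuming matplotlib is installed (pandas plotting depends on it); the values are made up, and the Agg backend is selected only so the code runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window is opened

import pandas as pd

# Hypothetical values standing in for df["Sex"]
sex = pd.Series(["male", "female", "female", "male", "female"], name="Sex")

counts = sex.value_counts()       # number of rows per category
bar_ax = counts.plot(kind="bar")  # one bar per category, returns a matplotlib Axes
bar_ax.set_ylabel("Passengers")
```

In a notebook the chart is rendered inline, so the backend line can be dropped there.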