Pandas

Pandas is a Python-based tool for data analysis and manipulation. It is designed for working with heterogeneous data, is well suited for importing, aggregating and cleaning data, and supports quick visualizations.
The best of pandas
import pandas as pd
import numpy as np
df = pd.read_csv("titanic.csv", sep="\t")
type(df), df.shape
df.head(10)
df.describe()
df.info()
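The loading and inspection calls above can be tried without the real file. The following is a minimal sketch that assumes a small, made-up tab-separated sample in place of titanic.csv; the name `sample` and the data values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for titanic.csv.
raw = (
    "PassengerId\tName\tAge\tFare\n"
    "1\tAnn\t22\t7.25\n"
    "2\tBob\t\t71.28\n"   # empty Age field is parsed as NaN
    "3\tCid\t35\t8.05\n"
)

# sep="\t" tells read_csv that fields are separated by tabs, not commas.
sample = pd.read_csv(io.StringIO(raw), sep="\t")

print(type(sample), sample.shape)  # object type and (rows, columns)
print(sample.head(2))              # first two rows
print(sample.describe())           # summary statistics for numeric columns
sample.info()                      # dtypes and non-null counts per column
```

Passing the wrong separator usually does not raise an error; it tends to yield a single wide column, so checking `shape` right after loading is a cheap sanity check.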
Select columns
Use the syntax df[[col1, ..., colN]] to select several columns as a DataFrame. A single label, df[col], returns a pd.Series instead.
df['Age']
df[['Age']]
type(df), type(df['Age']), type(df[['Age']])
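The difference between single and double brackets can be checked on a tiny frame. This is a sketch with a hypothetical `people` DataFrame, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame used only for illustration.
people = pd.DataFrame(
    {"Name": ["Ann", "Bob"], "Age": [22, 35], "Fare": [7.25, 8.05]}
)

age_series = people["Age"]          # single label -> pd.Series
age_frame = people[["Age"]]         # list with one label -> one-column pd.DataFrame
subset = people[["Name", "Fare"]]   # list of labels -> DataFrame with those columns

print(type(age_series), type(age_frame))
print(subset)
```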
Indexing
df.sort_values("Age", inplace=True)
df.head(10)
df.tail(8)
# access by integer position
df.iloc[78]
# access by label
df.loc[78]
# multiple indexing
df.loc[[78, 79, 100], ["Age", "Cabin"]]
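After sorting, row positions and row labels no longer coincide, which is why iloc and loc can return different rows for the same number. A minimal sketch on a hypothetical three-row frame (`toy` and its values are made up):

```python
import pandas as pd

# Hypothetical frame; the default labels are 0, 1, 2.
toy = pd.DataFrame({"Age": [30, 10, 20]}, index=[0, 1, 2])

# Sorting reorders the rows but keeps their labels: the label order is now 1, 2, 0.
by_age = toy.sort_values("Age")

first_by_position = by_age.iloc[0]["Age"]  # row at position 0 (the youngest)
first_by_label = by_age.loc[0]["Age"]      # row whose label is 0

# Several rows and columns at once, selected by label.
picked = by_age.loc[[2, 0], ["Age"]]

print(first_by_position, first_by_label)
print(picked)
```

The sketch uses the non-mutating form of sort_values so that `toy` keeps its original order for comparison.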
pd.Series
A one-dimensional slice of a DataFrame has type pd.Series.
df["Age"].head(5).values
Access the index of a Series:
df["Age"].head(5).index
Creating pd.Series
pd.Series([1, 2, 3], index=["Red", "Green", "Blue"])
pd.Series(1, index=["Red", "Green", "Blue"])
Convert Series to DataFrame
s = pd.Series([1, 2, 3], index=["Red", "Green", "Blue"])
type(s.to_frame("Values"))
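The Series constructors and accessors above can be combined in one short, self-contained sketch. The variable names (`colors`, `constant`, `frame`) are illustrative only.

```python
import pandas as pd

# One value per label.
colors = pd.Series([1, 2, 3], index=["Red", "Green", "Blue"])

# A scalar is repeated for every label in the index.
constant = pd.Series(1, index=["Red", "Green", "Blue"])

values = colors.values   # underlying NumPy array
labels = colors.index    # pd.Index holding the labels

# to_frame turns the Series into a one-column DataFrame named "Values".
frame = colors.to_frame("Values")

print(values, list(labels))
print(frame)
```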
NaN’s
df["Cabin"].head(10)
df["Cabin"].dropna().head(10)
df["Cabin"].fillna(3).head(10)
# back-fill: replace each NaN with the next valid value
df["Cabin"].bfill().head(10)
pd.isna(df["Cabin"]).head(10)
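The missing-value helpers in this section behave as follows on a small, made-up column. The `cabin` Series and the "unknown" placeholder are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries in the middle.
cabin = pd.Series(["C85", np.nan, np.nan, "E46"])

mask = pd.isna(cabin)             # True where the value is missing
dropped = cabin.dropna()          # rows with NaN removed, labels preserved
filled = cabin.fillna("unknown")  # every NaN replaced by a constant
backfilled = cabin.bfill()        # every NaN replaced by the next valid value

print(mask.tolist())
print(dropped.tolist(), filled.tolist(), backfilled.tolist())
```

None of these calls modifies `cabin`; each returns a new Series, so the result has to be assigned if it is needed later.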
Visualization
df.sort_index()["Fare"].plot();
df["Sex"].hist();