After finishing my Udacity Data Analyst Nanodegree, I wanted to preserve the pandas skills I had acquired, so I created a mixture of cheat sheet and cookbook covering several use cases. Maybe you will find it useful. This is Part 2 and is about assessing and wrangling data in pandas. I'm aiming to add more content whenever I find something interesting. By the way, although it's labeled Part 2, it's the first part I published, as it was the easiest to collect and present in a readable way.
import pandas as pd
import numpy as np
Assessing
wrangling_df = pd.read_csv("patients.csv")
wrangling_df.shape
(503, 14)
wrangling_df.head()
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 |
1 | 2 | female | Pamela | Hill | 2370 University Hill Road | Armstrong | Illinois | 61812.0 | United States | PamelaSHill@cuvox.de+1 (217) 569-3204 | 4/3/1967 | 118.8 | 66 | 19.2 |
2 | 3 | male | Jae | Debord | 1493 Poling Farm Road | York | Nebraska | 68467.0 | United States | 402-363-6804JaeMDebord@gustr.com | 2/19/1980 | 177.8 | 71 | 24.8 |
3 | 4 | male | Liêm | Phan | 2335 Webster Street | Woodbridge | NJ | 7095.0 | United States | PhanBaLiem@jourrapide.com+1 (732) 636-8246 | 7/26/1951 | 220.9 | 70 | 31.7 |
4 | 5 | male | Tim | Neudorf | 1428 Turkey Pen Lane | Dothan | AL | 36303.0 | United States | 334-515-7487TimNeudorf@cuvox.de | 2/18/1928 | 192.3 | 27 | 26.1 |
Summarize boolean occurrences
By multiplying a boolean with one, we can turn it into an integer and sum up all occurrences:
(1*(wrangling_df['given_name'] == "Jake")).sum()
1
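The multiplication is actually optional, since Python already treats booleans as 0/1 when summing. A minimal sketch on a toy frame standing in for wrangling_df:

```python
import pandas as pd

# toy stand-in for wrangling_df
df = pd.DataFrame({'given_name': ['Jake', 'Zoe', 'Jake', 'Pamela']})

# sum() counts True as 1 and False as 0 directly
count = (df['given_name'] == "Jake").sum()
```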
Get Column Keys
Save column names into a list:
wrangling_df.keys()[-3:].tolist()
[‘weight’, ‘height’, ‘bmi’]
Filtering
Retrieve categorical values based on an occurrence threshold in the DataFrame (iteritems was removed in pandas 2.0; items is the replacement):
[x for x, y in wrangling_df['height'].value_counts().items() if y > 30]
# the result lists the heights that occurred more than thirty times in the dataframe
[67, 69, 65, 63, 66, 70, 72, 61]
Return the values of a column that start with a specific string, here using a list comprehension with a classic for loop over the column:
[row for row in wrangling_df['given_name'] if row.startswith('David')]
[‘David’, ‘David’]
Pandas' built-in iterrows yields an (index, row) pair on each iteration, so columns can be accessed by name inside the loop, which is handy when the condition involves more than one column:
[row['given_name'] for _, row in wrangling_df.iterrows() if row['given_name'].startswith('David')]
[‘David’, ‘David’]
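For a single-column condition, the vectorized `.str.startswith` accessor typically beats both loop variants; a sketch on toy data standing in for wrangling_df:

```python
import pandas as pd

# toy stand-in for wrangling_df
df = pd.DataFrame({'given_name': ['David', 'Zoe', 'Davide', 'Pamela']})

# vectorized string matching avoids a Python-level loop entirely
matches = df.loc[df['given_name'].str.startswith('David'), 'given_name'].tolist()
```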
Wrangling
Cleaning
Duplicates and NaN values
Identify duplicated, non-null values in the address column:
wrangling_df[((wrangling_df.address.duplicated()) & wrangling_df.address.notnull())].head(1)
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | 30 | male | Jake | Jakobsen | 648 Old Dear Lane | Port Jervis | New York | 12771.0 | United States | JakobCJakobsen@einrot.com+1 (845) 858-7707 | 8/1/1985 | 155.8 | 67 | 24.4 |
Remove these values:
clean_df = wrangling_df[~((wrangling_df.address.duplicated()) & wrangling_df.address.notnull())]
Remove Columns/Values
Drop
Drop rows containing NaN values, optionally restricted to a subset of columns:
wrangling_df.dropna().shape
(491, 14)
wrangling_df.dropna(subset = ['surname']).shape
(503, 14)
Drop whole columns:
wrangling_df.drop(columns=['contact','birthdate']).shape
(503, 12)
Create Columns
Categorical Column
We can create a categorical column that will only accept a fixed set of values:
wrangling_df['illness'].index
KeyError: 'illness'
wrangling_df['illness'] = pd.Series(pd.Categorical(values=["none"]*len(wrangling_df),categories=["none","a bit","intermediate","very ill"]))
wrangling_df.loc[0, 'illness'] = "bad"
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
wrangling_df.loc[0, 'illness'] = "intermediate"
Rename Columns
wrangling_df_2 = wrangling_df.copy()
Different methods to add a suffix or prefix to column names
Use built-in pandas functions:
wrangling_df_2 = wrangling_df_2.add_suffix('X')
wrangling_df_2 = wrangling_df_2.add_prefix('X')
wrangling_df_2.head(1)
Xpatient_idX | Xassigned_sexX | Xgiven_nameX | XsurnameX | XaddressX | XcityX | XstateX | Xzip_codeX | XcountryX | XcontactX | XbirthdateX | XweightX | XheightX | XbmiX | XillnessX | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 | intermediate |
Use rename with a format string or a lambda to rename all columns:
wrangling_df_2.rename(columns='{}Y'.format, inplace=True)
wrangling_df_2.rename(columns = lambda x: x+'Z', inplace=True)
wrangling_df_2.head(1)
Xpatient_idXYZ | Xassigned_sexXYZ | Xgiven_nameXYZ | XsurnameXYZ | XaddressXYZ | XcityXYZ | XstateXYZ | Xzip_codeXYZ | XcountryXYZ | XcontactXYZ | XbirthdateXYZ | XweightXYZ | XheightXYZ | XbmiXYZ | XillnessXYZ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 | intermediate |
wrangling_df_2 = wrangling_df.copy()
Use list comprehensions or the map function to do the job:
wrangling_df_2.columns = [column + 'X' for column in wrangling_df_2.columns]
wrangling_df_2.columns = list(map(lambda s: s+'Y', wrangling_df_2.columns))
wrangling_df_2.head(1)
patient_idXY | assigned_sexXY | given_nameXY | surnameXY | addressXY | cityXY | stateXY | zip_codeXY | countryXY | contactXY | birthdateXY | weightXY | heightXY | bmiXY | illnessXY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 | intermediate |
Change Values
Lambda/Apply Functions
Using an if-else expression. Remember that a lambda has to produce a return value for every element, so the else branch is mandatory:
wrangling_df['height_result'] = wrangling_df['height'].apply(lambda x: "extrem height" if (x > 67) or (x < 60) else "normal height")
wrangling_df.head(5)
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | illness | height_result | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 | intermediate | normal height |
1 | 2 | female | Pamela | Hill | 2370 University Hill Road | Armstrong | Illinois | 61812.0 | United States | PamelaSHill@cuvox.de+1 (217) 569-3204 | 4/3/1967 | 118.8 | 66 | 19.2 | none | normal height |
2 | 3 | male | Jae | Debord | 1493 Poling Farm Road | York | Nebraska | 68467.0 | United States | 402-363-6804JaeMDebord@gustr.com | 2/19/1980 | 177.8 | 71 | 24.8 | none | extrem height |
3 | 4 | male | Liêm | Phan | 2335 Webster Street | Woodbridge | NJ | 7095.0 | United States | PhanBaLiem@jourrapide.com+1 (732) 636-8246 | 7/26/1951 | 220.9 | 70 | 31.7 | none | extrem height |
4 | 5 | male | Tim | Neudorf | 1428 Turkey Pen Lane | Dothan | AL | 36303.0 | United States | 334-515-7487TimNeudorf@cuvox.de | 2/18/1928 | 192.3 | 27 | 26.1 | none | extrem height |
Use a dict to change values programmatically
For better performance it is usually advisable to use an apply function instead of a for loop:
# Mapping from full state name to abbreviation
state_abbrev = {'California': 'CA',
'New York': 'NY',
'Illinois': 'IL',
'Florida': 'FL',
'Nebraska': 'NE'}
# Function to apply
def abbreviate_state(patient):
    if patient['state'] in state_abbrev:
        return state_abbrev[patient['state']]
    else:
        return patient['state']
wrangling_df['state'] = wrangling_df.apply(abbreviate_state, axis=1)
wrangling_df.head(5)
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | illness | height_result | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | CA | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 | intermediate | normal height |
1 | 2 | female | Pamela | Hill | 2370 University Hill Road | Armstrong | IL | 61812.0 | United States | PamelaSHill@cuvox.de+1 (217) 569-3204 | 4/3/1967 | 118.8 | 66 | 19.2 | none | normal height |
2 | 3 | male | Jae | Debord | 1493 Poling Farm Road | York | NE | 68467.0 | United States | 402-363-6804JaeMDebord@gustr.com | 2/19/1980 | 177.8 | 71 | 24.8 | none | extrem height |
3 | 4 | male | Liêm | Phan | 2335 Webster Street | Woodbridge | NJ | 7095.0 | United States | PhanBaLiem@jourrapide.com+1 (732) 636-8246 | 7/26/1951 | 220.9 | 70 | 31.7 | none | extrem height |
4 | 5 | male | Tim | Neudorf | 1428 Turkey Pen Lane | Dothan | AL | 36303.0 | United States | 334-515-7487TimNeudorf@cuvox.de | 2/18/1928 | 192.3 | 27 | 26.1 | none | extrem height |
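The same lookup can also be written without a helper function: `Series.map` with the dict, plus `fillna` to keep values that have no match. A sketch on a toy state column:

```python
import pandas as pd

state_abbrev = {'California': 'CA', 'New York': 'NY'}
states = pd.Series(['California', 'NJ', 'New York'])

# map() looks every value up in the dict; unmatched values become NaN,
# so fillna restores the original entry for them
abbreviated = states.map(state_abbrev).fillna(states)
```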
Normalize, Extrapolate, Interpolate
Normalize each row by its row sum:
df = pd.DataFrame([[1,2,3],[3,4,5]])
df.div(df.sum(axis=1), axis=0)
0 | 1 | 2 | |
---|---|---|---|
0 | 0.166667 | 0.333333 | 0.500000 |
1 | 0.250000 | 0.333333 | 0.416667 |
Interpolate automatically using SciPy:
from scipy.interpolate import interp1d
m = interp1d([1,10],[1,100])
int(m(10))
100
Normalize manually to a 0–100 scale (min–max scaling):
ser = pd.Series([1,2,3,4])
100*(ser - ser.min()) / (ser.max() - ser.min())
0 0.000000
1 33.333333
2 66.666667
3 100.000000
dtype: float64
Extrapolate automatically using SciPy:
m = interp1d([1,100],[1,10],fill_value="extrapolate")
int(m(95))
9
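For plain interpolation, `numpy.interp` covers the same case without SciPy (note it clips rather than extrapolates outside the given range); a sketch:

```python
import numpy as np

# linear interpolation between the points (1, 1) and (10, 100)
y = float(np.interp(5.5, [1, 10], [1, 100]))
```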
Insert Values
Appending, Concat
Append a row to the DataFrame from an incomplete dict:
app_dict = {'patient_id':wrangling_df['patient_id'].max()+1,
'given_name':'testuser',
'surname':'blahblah'}
new_row = pd.DataFrame([app_dict],columns=app_dict.keys())
pd.concat([wrangling_df,new_row], sort=True).reset_index().tail(3)
index | address | assigned_sex | birthdate | bmi | city | contact | country | given_name | height | height_result | illness | patient_id | state | surname | weight | zip_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
501 | 501 | 3652 Boone Crockett Lane | female | 2/13/1952 | 27.7 | Seattle | ChidaluOnyekaozulu@jourrapide.com1 360 443 2060 | United States | Chidalu | 67.0 | normal height | none | 502 | WA | Onyekaozulu | 176.9 | 98109.0 |
502 | 502 | 2778 North Avenue | male | 5/3/1954 | 19.3 | Burr | PatrickGersten@rhyta.com402-848-4923 | United States | Pat | 71.0 | extrem height | none | 503 | NE | Gersten | 138.2 | 68324.0 |
503 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | testuser | NaN | NaN | NaN | 504 | NaN | blahblah | NaN | NaN |
loc
Create a boolean mask to set specific values:
height_mean = wrangling_df.height.mean()
mask = wrangling_df.patient_id == 4509
column_name = 'height'
wrangling_df.loc[mask, column_name] = height_mean * 1.5
Cut and Separate
iloc
We can extract multiple columns and rows at once by combining iloc and numpy.r_:
df_cut = wrangling_df.iloc[np.r_[:5,10:15], np.r_[1, 3]]
df_cut.shape
(10, 2)
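What `np.r_` actually produces is a single index array built from the slice ranges; a quick standalone look:

```python
import numpy as np

# slices are concatenated into one flat integer array
idx = np.r_[:5, 10:15]
```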
Merge, Group And Sum Values
Summarize categorical values
Find, across all combinations of four categorical columns, the rows where all four are positive (umfrage_df is a survey DataFrame not shown here; DataFrame.append was removed in pandas 2.0, so the parts are collected and concatenated):
import itertools
categories = ["C006_01","C006_02","C006_04","C006_06","C006_07","C006_10"]
all_combs = list(itertools.combinations(categories, 4))
f_s = 'is correct'
result_parts = []
for m in all_combs:
    temp_df = umfrage_df[(umfrage_df[m[0]] == f_s) & (umfrage_df[m[1]] == f_s) &
                         (umfrage_df[m[2]] == f_s) & (umfrage_df[m[3]] == f_s)]
    result_parts.append(temp_df)
result_df = pd.concat(result_parts)
Sum multiple categorical questions with the same answer scale into one continuous value:
def merge_categorical(df, col_str, rating_dict):
    filter_col = [col for col in df if col.startswith(col_str)]
    subset_df = df[filter_col].fillna(0)
    norm_value = len(filter_col)
    subset_df['sum'] = subset_df.apply(lambda x: sum(rating_dict[y] for y in x[filter_col]) / norm_value, axis=1)
    return subset_df
rating_dict_freq_lesson = {"Every Lesson":4,"Most Lessons":3,"Some Lessons":2,"Never or Hardly Ever":1,0:0}
teacher_practice = merge_categorical(pisa_df,"ST79Q",rating_dict_freq_lesson)
Merging
Merge two DataFrames on multiple keys:
wrangling_df.shape
(503, 16)
treatment_df = pd.read_csv("treatments.csv")
wrangling_df.merge(treatment_df,how='outer',left_on=['given_name','surname'],right_on=['given_name','surname']).shape
(783, 21)
It turns out that none of the name pairs match between the two DataFrames, so every treatment row is appended and the length of the result is the sum of both DataFrames' lengths.
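To diagnose such mismatches, merge's `indicator=True` adds a `_merge` column showing where each row came from. A sketch on toy frames, using a letter-case mismatch as a hypothetical cause:

```python
import pandas as pd

left = pd.DataFrame({'given_name': ['Zoe', 'Jae'],
                     'surname': ['Wellish', 'Debord'],
                     'height': [66, 71]})
right = pd.DataFrame({'given_name': ['zoe', 'Jae'],
                      'surname': ['wellish', 'Debord'],
                      'dose': ['10u', '20u']})

# _merge is 'left_only', 'right_only' or 'both' per row
merged = pd.merge(left, right, how='outer',
                  on=['given_name', 'surname'], indicator=True)
```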
Melting
The powerful pandas function melt makes it possible to gather multiple columns into one key column and one value column:
melt_df = wrangling_df.melt(id_vars = ['patient_id','surname'],
value_vars = ['height', 'weight'],
var_name = 'body_indicator',
value_name = 'body')
melt_df[melt_df.patient_id == 1]
patient_id | surname | body_indicator | body | |
---|---|---|---|---|
0 | 1 | Wellish | height | 66.0 |
503 | 1 | Wellish | weight | 121.7 |
This particular transformation isn't really useful, but it illustrates the purpose of melt nicely.
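The rough inverse of melt is `pivot`, which spreads the key column back into separate columns; a sketch on the two melted rows from above:

```python
import pandas as pd

melt_df = pd.DataFrame({'patient_id': [1, 1],
                        'body_indicator': ['height', 'weight'],
                        'body': [66.0, 121.7]})

# each body_indicator category becomes its own column again
wide = melt_df.pivot(index='patient_id', columns='body_indicator', values='body')
```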
Groupby
Create a DataFrame that counts occurrences grouped by another column. The resulting Series with a MultiIndex can be converted to a DataFrame using unstack:
patients_country = wrangling_df.groupby(by=['country'])['state'].value_counts().sort_index()
patients_country.unstack(1)
state | AK | AL | AR | AZ | CA | CO | CT | DC | DE | FL | ... | SC | SD | TN | TX | VA | VT | WA | WI | WV | WY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | |||||||||||||||||||||
United States | 1 | 9 | 4 | 4 | 60 | 4 | 5 | 2 | 3 | 22 | ... | 5 | 3 | 9 | 32 | 11 | 2 | 8 | 10 | 3 | 1 |
1 rows × 49 columns
Regex Use-Cases
Extract
Often multiple interesting values are concatenated in one string column. We can use regex to extract them (note the raw-string prefix to avoid escape warnings):
wrangling_df['phone_number'] = wrangling_df.contact.str.extract(r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=True)
wrangling_df['email'] = wrangling_df.contact.str.extract(r'([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.][a-zA-Z]+)', expand=True)
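The same two patterns applied to a single toy contact string, to make the split visible:

```python
import pandas as pd

contact = pd.Series(['951-719-9170ZoeWellish@superrito.com'])

phone = contact.str.extract(
    r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=False)
email = contact.str.extract(
    r'([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.][a-zA-Z]+)', expand=False)
```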
Find and Replace
Reverse every lowercase alphabetic word (a callable replacement requires regex=True):
repl = lambda m: m.group(0)[::-1]
pd.Series(['foo 123', 'bar baz', np.nan]).str.replace(r'[a-z]+', repl, regex=True)
0 oof 123
1 rab zab
2 NaN
dtype: object
Use Regex Groups:
pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
repl = lambda m: m.group('one').upper()
pd.Series(['one two three', 'foo bar baz']).str.replace(pat, repl, regex=True)
0 ONE
1 FOO
dtype: object
Use a filter function that returns a clean boolean for every regex outcome:
import re
reg_df = pd.DataFrame(["blah","blubb","blach","blhab","asdblaheh"], columns=['test'])
def regex_filter(val):
    if val:
        return bool(re.search("blah", val))
    return False
reg_df[reg_df['test'].apply(regex_filter)]
test | |
---|---|
0 | blah |
4 | asdblaheh |
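When only the boolean outcome matters, the vectorized `str.contains` does the same filtering in one call; `na=False` guards against missing values:

```python
import pandas as pd

reg_df = pd.DataFrame(["blah", "blubb", "blach", "blhab", "asdblaheh"],
                      columns=['test'])

# keep rows whose 'test' value contains the pattern anywhere
filtered = reg_df[reg_df['test'].str.contains("blah", na=False)]
```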
Change Phone Number Format with Replace and Padding (Before):
wrangling_df.phone_number[:5]
0 951-719-9170
1 +1 (217) 569-3204
2 402-363-6804
3 +1 (732) 636-8246
4 334-515-7487
Name: phone_number, dtype: object
(After):
wrangling_df.phone_number.str.replace(r'\D+', '', regex=True).str.pad(11, fillchar='1')[:5]
0 19517199170
1 12175693204
2 14023636804
3 17326368246
4 13345157487
Name: phone_number, dtype: object
Change Zip Code Format using Replace and Padding (Before):
wrangling_df.zip_code[:5]
0 92390.0
1 61812.0
2 68467.0
3 7095.0
4 36303.0
Name: zip_code, dtype: float64
# astype(str) renders NaN as 'nan'; slicing and padding turn it into '0000n', which is mapped back to NaN
wrangling_df.zip_code.astype(str).\
    str[:-2].\
    str.pad(5, fillchar='0').\
    replace('0000n', np.nan)[:5]
0 92390
1 61812
2 68467
3 07095
4 36303
Name: zip_code, dtype: object
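An alternative sketch goes through an integer round trip and `str.zfill`, assuming missing zip codes have been handled beforehand (otherwise the astype(int) cast raises):

```python
import pandas as pd

zips = pd.Series([92390.0, 7095.0])

# cast float -> int -> str to drop the '.0', then left-pad with zeros
padded = zips.astype(int).astype(str).str.zfill(5)
```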