H4CK AND THAT
herrfeder
Apr 12, 2019

After finishing my Udacity DataAnalyst Nanodegree I want to preserve my obtained skills using Pandas. Therefore I created a mixture of Cheat Sheet and Cookbook to go over several usecases. Maybe you will find it useful. This is Part 2 and is about Assessing and Wrangling in Pandas. I’m aiming at adding more content if I find something interesting. By The Way, although it’s Part 2 it’s the first part, as it was the easiest to collect and present the content in a readable way.

import pandas as pd
import numpy as np

Assessing

wrangling_df = pd.read_csv("patients.csv")
wrangling_df.shape
(503, 14)
wrangling_df.head()
patient_id assigned_sex given_name surname address city state zip_code country contact birthdate weight height bmi
0 1 female Zoe Wellish 576 Brown Bear Drive Rancho California California 92390.0 United States 951-719-9170ZoeWellish@superrito.com 7/10/1976 121.7 66 19.6
1 2 female Pamela Hill 2370 University Hill Road Armstrong Illinois 61812.0 United States PamelaSHill@cuvox.de+1 (217) 569-3204 4/3/1967 118.8 66 19.2
2 3 male Jae Debord 1493 Poling Farm Road York Nebraska 68467.0 United States 402-363-6804JaeMDebord@gustr.com 2/19/1980 177.8 71 24.8
3 4 male Liêm Phan 2335 Webster Street Woodbridge NJ 7095.0 United States PhanBaLiem@jourrapide.com+1 (732) 636-8246 7/26/1951 220.9 70 31.7
4 5 male Tim Neudorf 1428 Turkey Pen Lane Dothan AL 36303.0 United States 334-515-7487TimNeudorf@cuvox.de 2/18/1928 192.3 27 26.1

Summarize boolean occurences

By multiplying a boolean with one, we can turn it into a integer and are able to sum up all occurences:

(1*(wrangling_df['given_name'] == "Jake")).sum()

1

Get Column Keys

Save column names into a list:

wrangling_df.keys()[-3:].tolist()

[‘weight’, ‘height’, ‘bmi’]

Filtering

Retrieve categorical values based on occurence limit in dataframe:

[x for x,y in list(wrangling_df['height'].value_counts().iteritems()) if y > 30]
# the result will express the heights, that occured more than thirty times in the dataframe

[67, 69, 65, 63, 66, 70, 72, 61]

Return values for column where value starts with specific string. Using List Comprehension and a classical for loop to go over a DataFrame column:

[row for row in wrangling_df['given_name'] if row.startswith('David')]

[‘David’, ‘David’]

Using pandas built-in iterrows will run much faster, as it won’t call the complete Pandas Object on every loop iteration:

[row[1][2] for row in wrangling_df.iterrows() if row[1][2].startswith('David')]

[‘David’, ‘David’]

Wrangling

Cleaning

Duplicates and NaN values

Identify duplicated and null values in address column:

wrangling_df[((wrangling_df.address.duplicated()) & wrangling_df.address.notnull())].head(1)
patient_id assigned_sex given_name surname address city state zip_code country contact birthdate weight height bmi
29 30 male Jake Jakobsen 648 Old Dear Lane Port Jervis New York 12771.0 United States JakobCJakobsen@einrot.com+1 (845) 858-7707 8/1/1985 155.8 67 24.4

Remove these values:

clean_df = wrangling_df[~((wrangling_df.address.duplicated()) & wrangling_df.address.notnull())].head(1)

Remove Columns/Values

Drop

Drop rows with NaN values depending on subset of columns:

wrangling_df.dropna().shape
(491, 14)
wrangling_df.dropna(subset = ['surname']).shape
(503, 14)

Drop whole columns:

wrangling_df.drop(columns=['contact','birthdate']).shape
(503, 12)

Create Columns

Categorical Column

We can create a categorical column that will only allow these values to insert:

wrangling_df['illness'].index
KeyError: 'illness'
wrangling_df['illness'] = pd.Series(pd.Categorical(values=["none"]*len(wrangling_df),categories=["none","a bit","intermediate","very ill"]))
wrangling_df['illness'][0] = "bad"
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
wrangling_df['illness'][0] = "intermediate"

Rename Columns

wrangling_df_2 = wrangling_df.copy()

Different Methods to add suffix to column names

Use built-in pandas functions:

wrangling_df_2 = wrangling_df_2.add_suffix('X')
wrangling_df_2 = wrangling_df_2.add_prefix('X')
wrangling_df_2.head(1)
Xpatient_idX Xassigned_sexX Xgiven_nameX XsurnameX XaddressX XcityX XstateX Xzip_codeX XcountryX XcontactX XbirthdateX XweightX XheightX XbmiX XillnessX
0 1 female Zoe Wellish 576 Brown Bear Drive Rancho California California 92390.0 United States 951-719-9170ZoeWellish@superrito.com 7/10/1976 121.7 66 19.6 intermediate

Use rename and formatter or lambda for renaming all columns:

wrangling_df_2.rename(columns='{}Y'.format, inplace=True)
wrangling_df_2.rename(columns = lambda x: x+'Z', inplace=True)
wrangling_df_2.head(1)
Xpatient_idXYZ Xassigned_sexXYZ Xgiven_nameXYZ XsurnameXYZ XaddressXYZ XcityXYZ XstateXYZ Xzip_codeXYZ XcountryXYZ XcontactXYZ XbirthdateXYZ XweightXYZ XheightXYZ XbmiXYZ XillnessXYZ
0 1 female Zoe Wellish 576 Brown Bear Drive Rancho California California 92390.0 United States 951-719-9170ZoeWellish@superrito.com 7/10/1976 121.7 66 19.6 intermediate
wrangling_df_2 = wrangling_df.copy()

Use list comprehensions or the map function to do the job:

wrangling_df_2.columns = [column + 'X' for column in wrangling_df_2.columns]
wrangling_df_2.columns = list(map(lambda s: s+'Y', wrangling_df_2.columns))
wrangling_df_2.head(1)
patient_idXY assigned_sexXY given_nameXY surnameXY addressXY cityXY stateXY zip_codeXY countryXY contactXY birthdateXY weightXY heightXY bmiXY illnessXY
0 1 female Zoe Wellish 576 Brown Bear Drive Rancho California California 92390.0 United States 951-719-9170ZoeWellish@superrito.com 7/10/1976 121.7 66 19.6 intermediate

Change Values

Lamdba/Apply Functions

Using a If-Else function. We have to remember, a lambda function has to generate a return value for every iteration, therefore the else is mandatory:

wrangling_df['height_result'] = wrangling_df['height'].apply(lambda x: "extrem height" if (x > 67) or (x < 60) else "normal height")
wrangling_df.head(5)
patient_id assigned_sex given_name surname address city state zip_code country contact birthdate weight height bmi illness height_result
0 1 female Zoe Wellish 576 Brown Bear Drive Rancho California California 92390.0 United States 951-719-9170ZoeWellish@superrito.com 7/10/1976 121.7 66 19.6 intermediate normal height
1 2 female Pamela Hill 2370 University Hill Road Armstrong Illinois 61812.0 United States PamelaSHill@cuvox.de+1 (217) 569-3204 4/3/1967 118.8 66 19.2 none normal height
2 3 male Jae Debord 1493 Poling Farm Road York Nebraska 68467.0 United States 402-363-6804JaeMDebord@gustr.com 2/19/1980 177.8 71 24.8 none extrem height
3 4 male Liêm Phan 2335 Webster Street Woodbridge NJ 7095.0 United States PhanBaLiem@jourrapide.com+1 (732) 636-8246 7/26/1951 220.9 70 31.7 none extrem height
4 5 male Tim Neudorf 1428 Turkey Pen Lane Dothan AL 36303.0 United States 334-515-7487TimNeudorf@cuvox.de 2/18/1928 192.3 27 26.1 none extrem height

Use dict to change values programmatically

For better performance it’s always advisable to use a apply function, instead a for loop:

# Mapping from full state name to abbreviation
state_abbrev = {'California': 'CA',
                'New York': 'NY',
                'Illinois': 'IL',
                'Florida': 'FL',
                'Nebraska': 'NE'}

# Function to apply
def abbreviate_state(patient):
    if patient['state'] in state_abbrev.keys():
        abbrev = state_abbrev[patient['state']]
        return abbrev
    else:
        return patient['state']
    
wrangling_df['state'] = wrangling_df.apply(abbreviate_state, axis=1)
wrangling_df.head(5)
patient_id assigned_sex given_name surname address city state zip_code country contact birthdate weight height bmi illness height_result
0 1 female Zoe Wellish 576 Brown Bear Drive Rancho California CA 92390.0 United States 951-719-9170ZoeWellish@superrito.com 7/10/1976 121.7 66 19.6 intermediate normal height
1 2 female Pamela Hill 2370 University Hill Road Armstrong IL 61812.0 United States PamelaSHill@cuvox.de+1 (217) 569-3204 4/3/1967 118.8 66 19.2 none normal height
2 3 male Jae Debord 1493 Poling Farm Road York NE 68467.0 United States 402-363-6804JaeMDebord@gustr.com 2/19/1980 177.8 71 24.8 none extrem height
3 4 male Liêm Phan 2335 Webster Street Woodbridge NJ 7095.0 United States PhanBaLiem@jourrapide.com+1 (732) 636-8246 7/26/1951 220.9 70 31.7 none extrem height
4 5 male Tim Neudorf 1428 Turkey Pen Lane Dothan AL 36303.0 United States 334-515-7487TimNeudorf@cuvox.de 2/18/1928 192.3 27 26.1 none extrem height

Normalize, Extrapolate, Interpolate

Normalize each row by the sum of each row:

df = pd.DataFrame([[1,2,3],[3,4,5]])
df.div(df.sum(axis=1), axis=0)
0 1 2
0 0.166667 0.333333 0.500000
1 0.250000 0.333333 0.416667

Interpolate automatically using SciPy:

from scipy.interpolate import interp1d

m = interp1d([1,10],[1,100])
int(m(10))
100

Interpolate manually:

ser = pd.Series([1,2,3,4])
100*(ser - ser.min()) / (ser.max() - ser.min())

0      0.000000
1     33.333333
2     66.666667
3    100.000000
dtype: float64

Extrapolate automatically using SciPy:

m = interp1d([1,100],[1,10],fill_value="extrapolate")
int(m(95))
9

Insert Values

Appending, Concat

Append rows to dataframe from noncomplete dict:

app_dict = {'patient_id':wrangling_df['patient_id'].max()+1,
            'given_name':'testuser',
            'surname':'blahblah'}

new_row = pd.DataFrame([app_dict],columns=app_dict.keys())

pd.concat([wrangling_df,new_row], sort=True).reset_index().tail(3)
index address assigned_sex birthdate bmi city contact country given_name height height_result illness patient_id state surname weight zip_code
501 501 3652 Boone Crockett Lane female 2/13/1952 27.7 Seattle ChidaluOnyekaozulu@jourrapide.com1 360 443 2060 United States Chidalu 67.0 normal height none 502 WA Onyekaozulu 176.9 98109.0
502 502 2778 North Avenue male 5/3/1954 19.3 Burr PatrickGersten@rhyta.com402-848-4923 United States Pat 71.0 extrem height none 503 NE Gersten 138.2 68324.0
503 0 NaN NaN NaN NaN NaN NaN NaN testuser NaN NaN NaN 504 NaN blahblah NaN NaN

loc

Create masks to insert specific values:

height_mean = wrangling_df.height.mean()
mask = wrangling_df.patient_id == 4509
column_name = 'height'
wrangling_df.loc[mask, column_name] = height_mean * 1.5

Cut and Seperate

iloc

We can extract multiple columns and rows at once by combining iloc and numpy.r_:

df_cut = wrangling_df.iloc[np.r_[:5,10:15], np.r_[1, 3]]
df_cut.shape
(10, 2)

Merge, Group And Sum Values

Summarize categorical values

Find from a list of mutations of all combinations of categorical values where at least 4 are positive:

categories = ["C006_01","C006_02","C006_04","C006_06","C006_07","C006_10"]
all_muts = list(itertools.combinations(categories, 4))

f_s = 'is correct'
result_df = pd.DataFrame()

for m in all_muts:
    temp_df = umfrage_df[(umfrage_df[m[0]] == f_s) & (umfrage_df[m[1]] == f_s) & (umfrage_df[m[2]] == f_s) & (umfrage_df[m[3]] == f_s)]
    result_df = result_df.append(temp_df)

Sum multiple categorical questions with the same range into one continuous value:

def merge_categorical(df,col_str,rating_dict):
    
    filter_col = [col for col in df if col.startswith(col_str)]
    subset_df = df[filter_col].fillna(0)
    norm_value = len(filter_col)
    subset_df['sum'] = subset_df.apply(lambda x: (sum([rating_dict[y] for y in x[filter_col]]))/norm_value , axis=1)
    
    return subset_df

rating_dict_freq_lesson = {"Every Lesson":4,"Most Lessons":3,"Some Lessons":2,"Never or Hardly Ever":1,0:0}

teacher_practice = merge_categorical(pisa_df,"ST79Q",rating_dict_freq_lesson)

Merging

Merging two dataframes depending on multiple keys:

wrangling_df.shape
(503, 16)
treatment_df = pd.read_csv("treatments.csv")
wrangling_df.merge(treatment_df,how='outer',left_on=['given_name','surname'],right_on=['given_name','surname']).shape
(783, 21)

It seems that none of the group of keys does match the new Dataframe. Therefore the length of the treatments will be appended and the resulting dataframe will be the sum of both Dataframes.

Melting

The powerful function from pandas melt, makes it possible to unite multiple categorical columns into one:

melt_df = wrangling_df.melt(id_vars = ['patient_id','surname'], 
                  value_vars = ['height', 'weight'], 
                  var_name = 'body_indicator', 
                  value_name = 'body')

melt_df[melt_df.patient_id == 1]
patient_id surname body_indicator body
0 1 Wellish height 66.0
503 1 Wellish weight 121.7

It doesn’t really make sense to do this transformation, but it shows the purpose of melt in a good way.

Groupby

Create Dataframe that will count occurences grouped by another value. The resulting construct can be converted to DataFrame using unstack:

patients_country = wrangling_df.groupby(by=['country'])['state'].value_counts().sort_index()
patients_country.unstack(1)
state AK AL AR AZ CA CO CT DC DE FL ... SC SD TN TX VA VT WA WI WV WY
country
United States 1 9 4 4 60 4 5 2 3 22 ... 5 3 9 32 11 2 8 10 3 1

1 rows × 49 columns

Regex Use-Cases

Extract

Often multiple interesting values are concatenated in one string column. We can use Regex to extract them:

wrangling_df['phone_number'] = wrangling_df.contact.str.extract('((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=True)
wrangling_df['email'] = wrangling_df.contact.str.extract('([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.][a-zA-Z]+)', expand=True)

Find and Replace

Reverse every lowercase alphabetic word:

repl = lambda m: m.group(0)[::-1]
pd.Series(['foo 123', 'bar baz', np.nan]).str.replace(r'[a-z]+', repl)
0    oof 123
1    rab zab
2        NaN
dtype: object

Use Regex Groups:

pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
repl = lambda m: m.group('one').upper()
pd.Series(['one two three', 'foo bar baz']).str.replace(pat, repl)
0    ONE
1    FOO
dtype: object

Return function to return clean values for every Regex outcome:

import re
reg_df = pd.DataFrame(["blah","blubb","blach","blhab","asdblaheh"], columns=['test'])

def regex_filter(val):
    if val:
        mo = re.search("blah",val)
        if mo:
            return True
        else:
            return False
    else:
        return False

reg_df[reg_df['test'].apply(regex_filter)]
test
0 blah
4 asdblaheh

Change Phone Number Format with Replace and Padding (Before):

wrangling_df.phone_number[:5]
0         951-719-9170
1    +1 (217) 569-3204
2         402-363-6804
3    +1 (732) 636-8246
4         334-515-7487
Name: phone_number, dtype: object

(After):

wrangling_df.phone_number.str.replace(r'\D+', '').str.pad(11, fillchar='1')[:5]
0    19517199170
1    12175693204
2    14023636804
3    17326368246
4    13345157487
Name: phone_number, dtype: object

Change Zip Code Format using Replace and Padding (Before):

wrangling_df.zip_code[:5]
0    92390.0
1    61812.0
2    68467.0
3     7095.0
4    36303.0
Name: zip_code, dtype: float64
wrangling_df.zip_code.astype(str).\
str[:-2].\
str.pad(5, fillchar='0').\
replace('0000n', np.nan)[:5] 
0    92390
1    61812
2    68467
3    07095
4    36303
Name: zip_code, dtype: object