After finishing my Udacity Data Analyst Nanodegree, I wanted to preserve the pandas skills I had acquired, so I created a mixture of cheat sheet and cookbook covering several use cases. Maybe you will find it useful. This is Part 2 and is about assessing and wrangling data in pandas. I'm aiming to add more content whenever I find something interesting. By the way, although it's labeled Part 2, it's the first part I published, as it was the easiest to collect and present in a readable way.
import pandas as pd
import numpy as np
Assessing
wrangling_df = pd.read_csv("patients.csv")
wrangling_df.shape
(503, 14)
wrangling_df.head()
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 |
1 | 2 | female | Pamela | Hill | 2370 University Hill Road | Armstrong | Illinois | 61812.0 | United States | PamelaSHill@cuvox.de+1 (217) 569-3204 | 4/3/1967 | 118.8 | 66 | 19.2 |
2 | 3 | male | Jae | Debord | 1493 Poling Farm Road | York | Nebraska | 68467.0 | United States | 402-363-6804JaeMDebord@gustr.com | 2/19/1980 | 177.8 | 71 | 24.8 |
3 | 4 | male | Liêm | Phan | 2335 Webster Street | Woodbridge | NJ | 7095.0 | United States | PhanBaLiem@jourrapide.com+1 (732) 636-8246 | 7/26/1951 | 220.9 | 70 | 31.7 |
4 | 5 | male | Tim | Neudorf | 1428 Turkey Pen Lane | Dothan | AL | 36303.0 | United States | 334-515-7487TimNeudorf@cuvox.de | 2/18/1928 | 192.3 | 27 | 26.1 |
Summarize boolean occurrences
By multiplying a boolean with one, we can turn it into an integer and sum up all occurrences:
(1*(wrangling_df['given_name'] == "Jake")).sum()
1
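The multiplication is actually optional, since Python already treats booleans as 0/1 when summing. A minimal sketch on a toy frame standing in for wrangling_df:

```python
import pandas as pd

# toy stand-in for wrangling_df
df = pd.DataFrame({'given_name': ['Jake', 'Zoe', 'Jake', 'Pamela']})

# sum() counts True as 1 and False as 0 directly
count = (df['given_name'] == "Jake").sum()
```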
Get Column Keys
Save column names into a list:
wrangling_df.keys()[-3:].tolist()
[‘weight’, ‘height’, ‘bmi’]
Filtering
Retrieve categorical values based on an occurrence threshold in the DataFrame (iteritems was removed in pandas 2.0; items is the replacement):
[x for x, y in wrangling_df['height'].value_counts().items() if y > 30]
# the result lists the heights that occurred more than thirty times in the dataframe
[67, 69, 65, 63, 66, 70, 72, 61]
Return the values of a column that start with a specific string, here using a list comprehension with a classic for loop over the column:
[row for row in wrangling_df['given_name'] if row.startswith('David')]
[‘David’, ‘David’]
Pandas' built-in iterrows yields an (index, row) pair on each iteration, so columns can be accessed by name inside the loop, which is handy when the condition involves more than one column:
[row['given_name'] for _, row in wrangling_df.iterrows() if row['given_name'].startswith('David')]
[‘David’, ‘David’]
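For a single-column condition, the vectorized `.str.startswith` accessor typically beats both loop variants; a sketch on toy data standing in for wrangling_df:

```python
import pandas as pd

# toy stand-in for wrangling_df
df = pd.DataFrame({'given_name': ['David', 'Zoe', 'Davide', 'Pamela']})

# vectorized string matching avoids a Python-level loop entirely
matches = df.loc[df['given_name'].str.startswith('David'), 'given_name'].tolist()
```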
Wrangling
Cleaning
Duplicates and NaN values
Identify duplicated, non-null values in the address column:
wrangling_df[((wrangling_df.address.duplicated()) & wrangling_df.address.notnull())].head(1)
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | 30 | male | Jake | Jakobsen | 648 Old Dear Lane | Port Jervis | New York | 12771.0 | United States | JakobCJakobsen@einrot.com+1 (845) 858-7707 | 8/1/1985 | 155.8 | 67 | 24.4 |
Remove these values:
clean_df = wrangling_df[~((wrangling_df.address.duplicated()) & wrangling_df.address.notnull())]
Remove Columns/Values
Drop
Drop rows containing NaN values, optionally restricted to a subset of columns:
wrangling_df.dropna().shape
(491, 14)
wrangling_df.dropna(subset = ['surname']).shape
(503, 14)
Drop whole columns:
wrangling_df.drop(columns=['contact','birthdate']).shape
(503, 12)
Create Columns
Categorical Column
We can create a categorical column that will only accept a fixed set of values:
wrangling_df['illness'].index
KeyError: 'illness'
wrangling_df['illness'] = pd.Series(pd.Categorical(values=["none"]*len(wrangling_df),categories=["none","a bit","intermediate","very ill"]))
wrangling_df.loc[0, 'illness'] = "bad"
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
wrangling_df.loc[0, 'illness'] = "intermediate"
Rename Columns
wrangling_df_2 = wrangling_df.copy()
Different methods to add a suffix or prefix to column names
Use built-in pandas functions:
wrangling_df_2 = wrangling_df_2.add_suffix('X')
wrangling_df_2 = wrangling_df_2.add_prefix('X')
wrangling_df_2.head(1)
Xpatient_idX | Xassigned_sexX | Xgiven_nameX | XsurnameX | XaddressX | XcityX | XstateX | Xzip_codeX | XcountryX | XcontactX | XbirthdateX | XweightX | XheightX | XbmiX | XillnessX | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 | intermediate |
Use rename with a format string or a lambda to rename all columns:
wrangling_df_2.rename(columns='{}Y'.format, inplace=True)
wrangling_df_2.rename(columns = lambda x: x+'Z', inplace=True)
wrangling_df_2.head(1)
Xpatient_idXYZ | Xassigned_sexXYZ | Xgiven_nameXYZ | XsurnameXYZ | XaddressXYZ | XcityXYZ | XstateXYZ | Xzip_codeXYZ | XcountryXYZ | XcontactXYZ | XbirthdateXYZ | XweightXYZ | XheightXYZ | XbmiXYZ | XillnessXYZ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 | intermediate |
wrangling_df_2 = wrangling_df.copy()
Use list comprehensions or the map function to do the job:
wrangling_df_2.columns = [column + 'X' for column in wrangling_df_2.columns]
wrangling_df_2.columns = list(map(lambda s: s+'Y', wrangling_df_2.columns))
wrangling_df_2.head(1)
patient_idXY | assigned_sexXY | given_nameXY | surnameXY | addressXY | cityXY | stateXY | zip_codeXY | countryXY | contactXY | birthdateXY | weightXY | heightXY | bmiXY | illnessXY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 | intermediate |
Change Values
Lambda/Apply Functions
Using an if-else expression. Remember that a lambda has to produce a return value for every element, so the else branch is mandatory:
wrangling_df['height_result'] = wrangling_df['height'].apply(lambda x: "extrem height" if (x > 67) or (x < 60) else "normal height")
wrangling_df.head(5)
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | illness | height_result | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 | intermediate | normal height |
1 | 2 | female | Pamela | Hill | 2370 University Hill Road | Armstrong | Illinois | 61812.0 | United States | PamelaSHill@cuvox.de+1 (217) 569-3204 | 4/3/1967 | 118.8 | 66 | 19.2 | none | normal height |
2 | 3 | male | Jae | Debord | 1493 Poling Farm Road | York | Nebraska | 68467.0 | United States | 402-363-6804JaeMDebord@gustr.com | 2/19/1980 | 177.8 | 71 | 24.8 | none | extrem height |
3 | 4 | male | Liêm | Phan | 2335 Webster Street | Woodbridge | NJ | 7095.0 | United States | PhanBaLiem@jourrapide.com+1 (732) 636-8246 | 7/26/1951 | 220.9 | 70 | 31.7 | none | extrem height |
4 | 5 | male | Tim | Neudorf | 1428 Turkey Pen Lane | Dothan | AL | 36303.0 | United States | 334-515-7487TimNeudorf@cuvox.de | 2/18/1928 | 192.3 | 27 | 26.1 | none | extrem height |
Use a dict to change values programmatically
For better performance it is usually advisable to use an apply function instead of a for loop:
# Mapping from full state name to abbreviation
state_abbrev = {'California': 'CA',
'New York': 'NY',
'Illinois': 'IL',
'Florida': 'FL',
'Nebraska': 'NE'}
# Function to apply
def abbreviate_state(patient):
    if patient['state'] in state_abbrev:
        return state_abbrev[patient['state']]
    else:
        return patient['state']
wrangling_df['state'] = wrangling_df.apply(abbreviate_state, axis=1)
wrangling_df.head(5)
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | illness | height_result | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | CA | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 | intermediate | normal height |
1 | 2 | female | Pamela | Hill | 2370 University Hill Road | Armstrong | IL | 61812.0 | United States | PamelaSHill@cuvox.de+1 (217) 569-3204 | 4/3/1967 | 118.8 | 66 | 19.2 | none | normal height |
2 | 3 | male | Jae | Debord | 1493 Poling Farm Road | York | NE | 68467.0 | United States | 402-363-6804JaeMDebord@gustr.com | 2/19/1980 | 177.8 | 71 | 24.8 | none | extrem height |
3 | 4 | male | Liêm | Phan | 2335 Webster Street | Woodbridge | NJ | 7095.0 | United States | PhanBaLiem@jourrapide.com+1 (732) 636-8246 | 7/26/1951 | 220.9 | 70 | 31.7 | none | extrem height |
4 | 5 | male | Tim | Neudorf | 1428 Turkey Pen Lane | Dothan | AL | 36303.0 | United States | 334-515-7487TimNeudorf@cuvox.de | 2/18/1928 | 192.3 | 27 | 26.1 | none | extrem height |
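The same lookup can also be written without a helper function: `Series.map` with the dict, plus `fillna` to keep values that have no match. A sketch on a toy state column:

```python
import pandas as pd

state_abbrev = {'California': 'CA', 'New York': 'NY'}
states = pd.Series(['California', 'NJ', 'New York'])

# map() looks every value up in the dict; unmatched values become NaN,
# so fillna restores the original entry for them
abbreviated = states.map(state_abbrev).fillna(states)
```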
Normalize, Extrapolate, Interpolate
Normalize each row by its row sum:
df = pd.DataFrame([[1,2,3],[3,4,5]])
df.div(df.sum(axis=1), axis=0)
0 | 1 | 2 | |
---|---|---|---|
0 | 0.166667 | 0.333333 | 0.500000 |
1 | 0.250000 | 0.333333 | 0.416667 |
Interpolate automatically using SciPy:
from scipy.interpolate import interp1d
m = interp1d([1,10],[1,100])
int(m(10))
100
Normalize manually to a 0–100 scale (min–max scaling):
ser = pd.Series([1,2,3,4])
100*(ser - ser.min()) / (ser.max() - ser.min())
0 0.000000
1 33.333333
2 66.666667
3 100.000000
dtype: float64
Extrapolate automatically using SciPy:
m = interp1d([1,100],[1,10],fill_value="extrapolate")
int(m(95))
9
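For plain interpolation, `numpy.interp` covers the same case without SciPy (note it clips rather than extrapolates outside the given range); a sketch:

```python
import numpy as np

# linear interpolation between the points (1, 1) and (10, 100)
y = float(np.interp(5.5, [1, 10], [1, 100]))
```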
Insert Values
Appending, Concat
Append a row to the DataFrame from an incomplete dict:
app_dict = {'patient_id':wrangling_df['patient_id'].max()+1,
'given_name':'testuser',
'surname':'blahblah'}
new_row = pd.DataFrame([app_dict],columns=app_dict.keys())
pd.concat([wrangling_df,new_row], sort=True).reset_index().tail(3)
index | address | assigned_sex | birthdate | bmi | city | contact | country | given_name | height | height_result | illness | patient_id | state | surname | weight | zip_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
501 | 501 | 3652 Boone Crockett Lane | female | 2/13/1952 | 27.7 | Seattle | ChidaluOnyekaozulu@jourrapide.com1 360 443 2060 | United States | Chidalu | 67.0 | normal height | none | 502 | WA | Onyekaozulu | 176.9 | 98109.0 |
502 | 502 | 2778 North Avenue | male | 5/3/1954 | 19.3 | Burr | PatrickGersten@rhyta.com402-848-4923 | United States | Pat | 71.0 | extrem height | none | 503 | NE | Gersten | 138.2 | 68324.0 |
503 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | testuser | NaN | NaN | NaN | 504 | NaN | blahblah | NaN | NaN |
loc
Create a boolean mask to set specific values:
height_mean = wrangling_df.height.mean()
mask = wrangling_df.patient_id == 4509
column_name = 'height'
wrangling_df.loc[mask, column_name] = height_mean * 1.5
Cut and Separate
iloc
We can extract multiple columns and rows at once by combining iloc and numpy.r_:
df_cut = wrangling_df.iloc[np.r_[:5,10:15], np.r_[1, 3]]
df_cut.shape
(10, 2)
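What `np.r_` actually produces is a single index array built from the slice ranges; a quick standalone look:

```python
import numpy as np

# slices are concatenated into one flat integer array
idx = np.r_[:5, 10:15]
```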
Merge, Group And Sum Values
Summarize categorical values
Find, across all combinations of four categorical columns, the rows where all four are positive (umfrage_df is a survey DataFrame not shown here; DataFrame.append was removed in pandas 2.0, so the parts are collected and concatenated):
import itertools
categories = ["C006_01","C006_02","C006_04","C006_06","C006_07","C006_10"]
all_combs = list(itertools.combinations(categories, 4))
f_s = 'is correct'
result_parts = []
for m in all_combs:
    temp_df = umfrage_df[(umfrage_df[m[0]] == f_s) & (umfrage_df[m[1]] == f_s) &
                         (umfrage_df[m[2]] == f_s) & (umfrage_df[m[3]] == f_s)]
    result_parts.append(temp_df)
result_df = pd.concat(result_parts)
Sum multiple categorical questions with the same answer scale into one continuous value:
def merge_categorical(df, col_str, rating_dict):
    filter_col = [col for col in df if col.startswith(col_str)]
    subset_df = df[filter_col].fillna(0)
    norm_value = len(filter_col)
    subset_df['sum'] = subset_df.apply(lambda x: sum(rating_dict[y] for y in x[filter_col]) / norm_value, axis=1)
    return subset_df
rating_dict_freq_lesson = {"Every Lesson":4,"Most Lessons":3,"Some Lessons":2,"Never or Hardly Ever":1,0:0}
teacher_practice = merge_categorical(pisa_df,"ST79Q",rating_dict_freq_lesson)
Merging
Merge two DataFrames on multiple keys:
wrangling_df.shape
(503, 16)
treatment_df = pd.read_csv("treatments.csv")
wrangling_df.merge(treatment_df,how='outer',left_on=['given_name','surname'],right_on=['given_name','surname']).shape
(783, 21)
It turns out that none of the name pairs match between the two DataFrames, so every treatment row is appended and the length of the result is the sum of both DataFrames' lengths.
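To diagnose such mismatches, merge's `indicator=True` adds a `_merge` column showing where each row came from. A sketch on toy frames, using a letter-case mismatch as a hypothetical cause:

```python
import pandas as pd

left = pd.DataFrame({'given_name': ['Zoe', 'Jae'],
                     'surname': ['Wellish', 'Debord'],
                     'height': [66, 71]})
right = pd.DataFrame({'given_name': ['zoe', 'Jae'],
                      'surname': ['wellish', 'Debord'],
                      'dose': ['10u', '20u']})

# _merge is 'left_only', 'right_only' or 'both' per row
merged = pd.merge(left, right, how='outer',
                  on=['given_name', 'surname'], indicator=True)
```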
Melting
The powerful pandas function melt makes it possible to gather multiple columns into one key column and one value column:
melt_df = wrangling_df.melt(id_vars = ['patient_id','surname'],
value_vars = ['height', 'weight'],
var_name = 'body_indicator',
value_name = 'body')
melt_df[melt_df.patient_id == 1]
patient_id | surname | body_indicator | body | |
---|---|---|---|---|
0 | 1 | Wellish | height | 66.0 |
503 | 1 | Wellish | weight | 121.7 |
This particular transformation isn't really useful, but it illustrates the purpose of melt nicely.
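The rough inverse of melt is `pivot`, which spreads the key column back into separate columns; a sketch on the two melted rows from above:

```python
import pandas as pd

melt_df = pd.DataFrame({'patient_id': [1, 1],
                        'body_indicator': ['height', 'weight'],
                        'body': [66.0, 121.7]})

# each body_indicator category becomes its own column again
wide = melt_df.pivot(index='patient_id', columns='body_indicator', values='body')
```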
Groupby
Create a DataFrame that counts occurrences grouped by another column. The resulting Series with a MultiIndex can be converted to a DataFrame using unstack:
patients_country = wrangling_df.groupby(by=['country'])['state'].value_counts().sort_index()
patients_country.unstack(1)
state | AK | AL | AR | AZ | CA | CO | CT | DC | DE | FL | ... | SC | SD | TN | TX | VA | VT | WA | WI | WV | WY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | |||||||||||||||||||||
United States | 1 | 9 | 4 | 4 | 60 | 4 | 5 | 2 | 3 | 22 | ... | 5 | 3 | 9 | 32 | 11 | 2 | 8 | 10 | 3 | 1 |
1 rows × 49 columns
Regex Use-Cases
Extract
Often multiple interesting values are concatenated in one string column. We can use regex to extract them (note the raw-string prefix to avoid escape warnings):
wrangling_df['phone_number'] = wrangling_df.contact.str.extract(r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=True)
wrangling_df['email'] = wrangling_df.contact.str.extract(r'([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.][a-zA-Z]+)', expand=True)
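The same two patterns applied to a single toy contact string, to make the split visible:

```python
import pandas as pd

contact = pd.Series(['951-719-9170ZoeWellish@superrito.com'])

phone = contact.str.extract(
    r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=False)
email = contact.str.extract(
    r'([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.][a-zA-Z]+)', expand=False)
```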
Find and Replace
Reverse every lowercase alphabetic word (a callable replacement requires regex=True):
repl = lambda m: m.group(0)[::-1]
pd.Series(['foo 123', 'bar baz', np.nan]).str.replace(r'[a-z]+', repl, regex=True)
0 oof 123
1 rab zab
2 NaN
dtype: object
Use Regex Groups:
pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
repl = lambda m: m.group('one').upper()
pd.Series(['one two three', 'foo bar baz']).str.replace(pat, repl, regex=True)
0 ONE
1 FOO
dtype: object
Use a filter function that returns a clean boolean for every regex outcome:
import re
reg_df = pd.DataFrame(["blah","blubb","blach","blhab","asdblaheh"], columns=['test'])
def regex_filter(val):
    if val:
        return bool(re.search("blah", val))
    return False
reg_df[reg_df['test'].apply(regex_filter)]
test | |
---|---|
0 | blah |
4 | asdblaheh |
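When only the boolean outcome matters, the vectorized `str.contains` does the same filtering in one call; `na=False` guards against missing values:

```python
import pandas as pd

reg_df = pd.DataFrame(["blah", "blubb", "blach", "blhab", "asdblaheh"],
                      columns=['test'])

# keep rows whose 'test' value contains the pattern anywhere
filtered = reg_df[reg_df['test'].str.contains("blah", na=False)]
```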
Change Phone Number Format with Replace and Padding (Before):
wrangling_df.phone_number[:5]
0 951-719-9170
1 +1 (217) 569-3204
2 402-363-6804
3 +1 (732) 636-8246
4 334-515-7487
Name: phone_number, dtype: object
(After):
wrangling_df.phone_number.str.replace(r'\D+', '', regex=True).str.pad(11, fillchar='1')[:5]
0 19517199170
1 12175693204
2 14023636804
3 17326368246
4 13345157487
Name: phone_number, dtype: object
Change Zip Code Format using Replace and Padding (Before):
wrangling_df.zip_code[:5]
0 92390.0
1 61812.0
2 68467.0
3 7095.0
4 36303.0
Name: zip_code, dtype: float64
# astype(str) renders NaN as 'nan'; slicing and padding turn it into '0000n', which is mapped back to NaN
wrangling_df.zip_code.astype(str).\
    str[:-2].\
    str.pad(5, fillchar='0').\
    replace('0000n', np.nan)[:5]
0 92390
1 61812
2 68467
3 07095
4 36303
Name: zip_code, dtype: object
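An alternative sketch goes through an integer round trip and `str.zfill`, assuming missing zip codes have been handled beforehand (otherwise the astype(int) cast raises):

```python
import pandas as pd

zips = pd.Series([92390.0, 7095.0])

# cast float -> int -> str to drop the '.0', then left-pad with zeros
padded = zips.astype(int).astype(str).str.zfill(5)
```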