Pandas is a powerful tool for data analysis. Writing in Pandas is a lot of fun because it lets you manipulate data quickly. While working magic in a Jupyter notebook with Pandas can make you feel like a data wizard, it sometimes comes at the cost of code readability and modularity. If the code will eventually run in production (and it will), it is good practice to start cleaning it up after the exploration phase.

There are some more advanced, lesser-known Pandas methods that come in handy for making code cleaner. I am not saying all Pandas code should follow this style, but I would like to share it so people can decide whether they prefer it.

First things first, let’s spin up a sample DataFrame for demonstration purposes. Here I use the tips example dataset from Seaborn. Since we are focusing on coding style, any dataset will do.

import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips").head()
tips.head()
   total_bill    tip     sex  smoker  day    time  size
0      16.990  1.010  Female      No  Sun  Dinner     2
1      10.340  1.660    Male      No  Sun  Dinner     3
2      21.010  3.500    Male      No  Sun  Dinner     3
3      23.680  3.310    Male      No  Sun  Dinner     2
4      24.590  3.610  Female      No  Sun  Dinner     4
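
If Seaborn is not installed, or load_dataset cannot fetch the data, the five rows printed above can be typed in by hand. This is only a convenience sketch: the values are copied from the output above, and the text columns will be plain strings rather than whatever dtypes Seaborn assigns.

```python
import pandas as pd

# Hand-built copy of the five rows shown above (values taken from the printed output).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

print(tips)
```

Everything that follows only touches total_bill and tip, so this stand-in should behave the same for the purposes of this post.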

Now suppose we want to do the following manipulations:

tips['bill_before_tip'] = tips['total_bill'] - tips['tip']

tips['tip_percentage'] = tips['tip'] / tips['total_bill']

tips['reduced_tip'] = tips['tip'] / 2

tips['tip_percentage'] = tips['reduced_tip'] / tips['total_bill']

tips['date'] = pd.to_datetime('2019-12-12')

and end up with the following DataFrame:

tips.head()
   total_bill    tip     sex  smoker  day    time  size  bill_before_tip  tip_percentage  reduced_tip        date
0      16.990  1.010  Female      No  Sun  Dinner     2           15.980           0.030        0.505  2019-12-12
1      10.340  1.660    Male      No  Sun  Dinner     3            8.680           0.080        0.830  2019-12-12
2      21.010  3.500    Male      No  Sun  Dinner     3           17.510           0.083        1.750  2019-12-12
3      23.680  3.310    Male      No  Sun  Dinner     2           20.370           0.070        1.655  2019-12-12
4      24.590  3.610  Female      No  Sun  Dinner     4           20.980           0.073        1.805  2019-12-12

The above code looks pretty straightforward and easily digestible. The only nit-picky comments are that one should not modify tips itself, since one might want to refer to the original later, and that there are no inline comments explaining why tip_percentage is calculated twice (which is my fault for not including them). I think almost everyone would agree the code itself is fine. Here I want to offer an alternative way to write it, which may or may not be more readable and modular depending on your perspective. Either way, it should demonstrate that Pandas offers methods that accommodate different styles, and it may even inspire you to structure your code differently.

assign method

First let’s look at the assign method. According to the official Pandas documentation, assign returns a new DataFrame with the new columns added, leaving the original untouched. You can

  • pass in a callable (i.e. a function), which is called on the DataFrame and whose result is assigned as the new column,
  • or pass in a Series, which is assigned as the new column.

For example, the line that calculates bill_before_tip

tips['bill_before_tip'] = tips['total_bill'] - tips['tip']

can be written in terms of assign in the following manner

tips = tips.assign(bill_before_tip=tips['total_bill'] - tips['tip'])

where the keyword argument name becomes the new column name.
There are a couple of different ways to use assign:

def calculate_bill_before_tip(df):
    return df.assign(bill_before_tip=df['total_bill'] - df['tip'])


def calculate_bill_before_tip_using_dict(df):
    return df.assign(**{'bill_before_tip': df['total_bill'] - df['tip']})


def calculate_bill_before_tip_using_lambda(df):
    return df.assign(bill_before_tip=lambda x: x['total_bill'] - x['tip'])

Depending on the operation you are doing, one form may suit better than the others.
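
To make that choice a bit more concrete, here is a small sketch of where each form tends to fit. The two-row DataFrame and the column name with spaces are made up for illustration. The keyword form works when the input columns already exist, the dict form handles names that are not valid Python identifiers, and the lambda form can see columns created earlier in a chain.

```python
import pandas as pd

df = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

# Keyword form: fine when the input columns already exist on df.
by_keyword = df.assign(bill_before_tip=df["total_bill"] - df["tip"])

# Dict form: needed when the new name is not a valid Python identifier.
by_dict = df.assign(**{"bill before tip": df["total_bill"] - df["tip"]})

# Lambda form: the callable receives the DataFrame as it exists at that point
# in the chain, so it can read reduced_tip, which df itself does not have.
by_lambda = (
    df.assign(reduced_tip=df["tip"] / 2)
      .assign(tip_percentage=lambda x: x["reduced_tip"] / x["total_bill"])
)

print(by_lambda)
```

Writing df['reduced_tip'] in the last step instead of the lambda would raise a KeyError, because df never gains that column.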

Now, the good thing about assign is it does not modify the original DataFrame. Rather, it creates a copy of the original DataFrame, and modifies the copy. It saves you the trouble of doing

temp_df = df.copy()
temp_df['blah'] = [something]

and packs the copy and the column creation into a single line. Computationally, it is slower than a plain column assignment since it needs to create a copy, but it is a nice way to keep all column creations in a consistent style.
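
A quick way to convince yourself that the original is left alone is to compare the columns before and after. The tiny DataFrame below is a made-up stand-in for tips.

```python
import pandas as pd

df = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

with_reduced_tip = df.assign(reduced_tip=df["tip"] / 2)

# The original keeps its two columns; only the returned object has the new one.
print(list(df.columns))
print(list(with_reduced_tip.columns))
```

This is why the result of assign has to be bound to a name (or chained further). Calling df.assign(...) on its own line and discarding the result changes nothing.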

You can also assign multiple columns in one assign statement

df.assign(bill_before_tip=df['total_bill'] - df['tip'],
          tip_percentage=df['tip'] / df['total_bill'])

and assign also allows easy chaining of methods

df.assign(bill_before_tip=df['total_bill'] - df['tip']).groupby('smoker').bill_before_tip.mean()
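
The two ideas combine nicely. In the sketch below, which uses a made-up three-row DataFrame, one assign call creates two columns and the second column is computed from the first. This relies on keyword arguments being evaluated in order, which I understand to be the documented behaviour in recent Pandas versions. If you are on an old version, treat it as an assumption and check the documentation.

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "smoker": ["No", "Yes", "No"],
})

result = (
    df.assign(
        reduced_tip=lambda x: x["tip"] / 2,
        # Evaluated after reduced_tip, so the callable can read it.
        tip_percentage=lambda x: x["reduced_tip"] / x["total_bill"],
    )
    .groupby("smoker")["tip_percentage"]
    .mean()
)

print(result)
```

The lambdas matter here: a plain df['reduced_tip'] would be evaluated before assign runs and fail, since df has no such column yet.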

So, by rewriting every line using assign and wrapping them in functions, we get to the following code

def calculate_tip_percentage(df):
    return df.assign(tip_percentage=df['tip'] / df['total_bill'])


def calculate_reduced_tip(df):
    return df.assign(reduced_tip=df['tip'] / 2)


def update_tip_percentage_with_reduced_tip(df):
    return df.assign(tip_percentage=df['reduced_tip'] / df['total_bill'])


def add_date_column(df, date_str):
    return df.assign(date=pd.to_datetime(date_str))


tips_copy = calculate_tip_percentage(tips)

tips_copy = calculate_reduced_tip(tips_copy)

tips_copy = update_tip_percentage_with_reduced_tip(tips_copy)

tips_copy = add_date_column(tips_copy, date_str='2019-12-12')

Nice! The code is looking pretty good. But I have to say that starting every line with tips_copy = seems kind of redundant. Can we simplify even more? Sure we can, and the answer is pipe!

pipe method

Pandas has another nice method named pipe for when you want to chain functions. df.pipe(func) is equivalent to func(df). Extra arguments for the function can be passed explicitly:

df.pipe(func, arg1=[something])

pipe allows easy chaining of methods, which can be a double-edged sword. It allows shorter and potentially more concise code, but if abused it can make the code hard to read and debug.
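
Here is a minimal sketch of both points. It reuses add_date_column from the previous section on a made-up two-row DataFrame to show that the direct call and the piped call give the same result. It also adds log_shape, a hypothetical pass-through helper (not part of Pandas), as one possible way to keep a longer chain debuggable.

```python
import pandas as pd


def add_date_column(df, date_str):
    return df.assign(date=pd.to_datetime(date_str))


def log_shape(df, label):
    # Hypothetical debugging helper: print the shape, then return the frame unchanged.
    print(f"{label}: {df.shape}")
    return df


df = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

# Calling the function directly and going through pipe give the same result.
direct = add_date_column(df, date_str="2019-12-12")
piped = df.pipe(add_date_column, date_str="2019-12-12")
print(direct.equals(piped))

# A pass-through step can be dropped between stages while debugging.
checked = (
    df.pipe(log_shape, label="before")
      .pipe(add_date_column, date_str="2019-12-12")
      .pipe(log_shape, label="after")
)
```

Because log_shape returns its input untouched, it can be removed again without changing the result of the chain.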

With pipe, we can rewrite our code one more time, and this is what the final version looks like

tips_processed = (tips
                  .pipe(calculate_bill_before_tip)
                  .pipe(calculate_tip_percentage)
                  .pipe(calculate_reduced_tip)
                  .pipe(update_tip_percentage_with_reduced_tip)
                  .pipe(add_date_column, date_str='2019-12-12'))

Looking good! With self-explanatory function names, one can easily understand what each manipulation does without having to dive into the details. This eliminates the temporary variables and avoids exposing the complicated details of each manipulation to the reader. I especially love the consistency of pipe and assign, which makes it clear which operations are being done on which DataFrame. That being said, this is just one way of writing code. The ultimate goal is still to be consistent with your team’s coding guidelines while producing clean, readable, and efficient code. I hope this has provided some inspiration on how Pandas code can be made more consistent, and may assign and pipe make their way into your code one day!