Another way to better organize Pandas code
Pandas is a powerful tool for data analysis. Writing in Pandas is a lot of fun because it lets you manipulate data quickly. While working magic in a Jupyter notebook with Pandas can make you feel like a data wizard, it sometimes comes at the cost of code readability and modularity. If the code will be in production eventually (and it will be), it is good practice to start cleaning up your code after the exploration phase.
There are some more advanced and lesser-known Pandas methods that come in handy for making code cleaner. I am not saying all Pandas code should follow this style, but I would like to share it so people can decide whether they prefer it or not.
First things first, let’s spin up a sample DataFrame for demonstration purposes. Here I use the preloaded tips dataset from Seaborn; since we are focusing on coding style, any dataset will do.
    import pandas as pd
    total_bill tip sex smoker day time size
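If Seaborn or a network connection is not at hand, a small stand-in DataFrame with the same columns works just as well for following along. The rows below are hand-typed for illustration and are an assumption of this sketch, not a faithful copy of the dataset.

```python
import pandas as pd

# Hand-typed stand-in with the same columns as the tips data.
# The values are illustrative only.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "No", "Yes", "Yes"],
    "day": ["Sun", "Sun", "Sun", "Sat", "Sat"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 3, 3, 2, 4],
})

print(tips.head())
```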
And suppose we want to do the following manipulations
    tips['bill_before_tip'] = tips['total_bill'] - tips['tip']
and result in the following DataFrame
    tips.head()
    total_bill tip sex smoker day time size bill_before_tip tip_percentage reduced_tip date
The above code looks pretty straightforward and easily digestible. The only nit-picky comments might be that one should not modify tips itself, since one might want to refer to the original later, and that it lacks inline comments explaining why tip_percentage is calculated twice (which is my fault for not including the comments). I think almost everyone would agree that the code itself is fine. Here I want to provide an alternative way to write it, which may or may not be more readable and modular depending on your perspective. Nonetheless, it should demonstrate that Pandas offers various methods that accommodate different styles, and it may even inspire you to structure your code differently.
assign method
First let’s look at the assign method. According to the official Pandas documentation, assign lets you create new columns on a new copy of the original DataFrame. You can
- pass in a callable (i.e. a function), which is applied to the DataFrame and whose result is assigned as the new column,
- or pass in a Series, which will be assigned as the new column.
For example, the line that calculates bill_before_tip
    tips['bill_before_tip'] = tips['total_bill'] - tips['tip']
can be written in terms of assign in the following manner
    tips = tips.assign(bill_before_tip=tips['total_bill'] - tips['tip'])
where the keyword argument name becomes the new column name.
There are a couple of different ways to use assign
    def calculate_bill_before_tip(df):
and depending on the operation you are doing, one might be better than the others.
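As a rough sketch of those options, the three calls below add the same column by passing a precomputed Series, a lambda, and a named function. The body of calculate_bill_before_tip and the sample values are assumptions made for this illustration.

```python
import pandas as pd

tips = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01],
                     "tip": [1.01, 1.66, 3.50]})

def calculate_bill_before_tip(df):
    # Receives the DataFrame that assign is called on and returns a Series.
    return df["total_bill"] - df["tip"]

# 1. Pass a Series that was computed up front.
by_series = tips.assign(bill_before_tip=tips["total_bill"] - tips["tip"])

# 2. Pass a lambda; assign calls it with the DataFrame.
by_lambda = tips.assign(bill_before_tip=lambda df: df["total_bill"] - df["tip"])

# 3. Pass a named function; handy when the logic deserves a name or a test.
by_function = tips.assign(bill_before_tip=calculate_bill_before_tip)

# All three produce the same DataFrame.
print(by_series.equals(by_lambda) and by_lambda.equals(by_function))
```

The callable forms tend to matter most in the middle of a method chain, where the intermediate DataFrame has no variable name to index into.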
Now, the good thing about assign is that it does not modify the original DataFrame. Rather, it creates a copy of the original DataFrame and adds the column to the copy. It saves you the trouble of doing
    temp_df = df.copy()
and packs the copying and the column creation into a single line. Computationally, it is slower than a simple in-place column assignment since a copy has to be created, but it is a nice way to keep all column creations in a consistent style.
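To make the comparison concrete, here is a minimal sketch that builds the column both ways on a made-up two-row DataFrame and checks that the original is left alone.

```python
import pandas as pd

df = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

# Two-step version: copy explicitly, then add the column to the copy.
temp_df = df.copy()
temp_df["bill_before_tip"] = temp_df["total_bill"] - temp_df["tip"]

# One-step version: assign returns a new DataFrame with the extra column.
assigned_df = df.assign(bill_before_tip=df["total_bill"] - df["tip"])

# The original DataFrame keeps only its two columns in both cases.
print(list(df.columns))
print(temp_df.equals(assigned_df))
```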
You can also assign multiple columns in one assign statement
    df.assign(bill_before_tip=df['total_bill'] - df['tip'],
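One detail worth knowing when assigning several columns at once: a column created earlier in the same call can be used by a later one, but only through a callable. The sketch below assumes a tip_percentage defined as tip divided by the pre-tip bill, which is a guess made for illustration, and uses made-up numbers.

```python
import pandas as pd

df = pd.DataFrame({"total_bill": [20.0, 50.0], "tip": [4.0, 10.0]})

result = df.assign(
    bill_before_tip=df["total_bill"] - df["tip"],
    # The lambda receives the intermediate DataFrame, which already has
    # bill_before_tip. Writing df["bill_before_tip"] here would raise a
    # KeyError, because df itself never gets the new column.
    tip_percentage=lambda d: d["tip"] / d["bill_before_tip"],  # assumed formula
)

print(result)
```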
and assign also allows easy chaining of methods
    df.assign(bill_before_tip=df['total_bill'] - df['tip']).groupby('smoker').bill_before_tip.mean()
So, by rewriting every line using assign and wrapping them in functions, we get to the following code
    def calculate_tip_percentage(df):
Nice! The code is looking pretty good. But I have to say that starting every line with tips_copy = seems kind of redundant. Can we simplify even more? Sure we can, and the answer is pipe!
pipe method
Pandas has another nice method named pipe for when you want to chain your own functions. In other words, func(df) is equivalent to df.pipe(func). For functions that take extra arguments, these can be passed explicitly by doing
    df.pipe(func, arg1=[something])
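A small sketch of that equivalence follows. The helper add_reduced_tip and its rate parameter are hypothetical names invented for this example.

```python
import pandas as pd

def add_reduced_tip(df, rate=0.9):
    # Hypothetical helper: scales the tip by `rate` and stores it in a new column.
    return df.assign(reduced_tip=df["tip"] * rate)

df = pd.DataFrame({"total_bill": [20.0, 50.0], "tip": [4.0, 10.0]})

# Calling the function directly...
direct = add_reduced_tip(df, rate=0.5)

# ...is the same as piping the DataFrame into it; extra keyword
# arguments given to pipe are forwarded to the function.
piped = df.pipe(add_reduced_tip, rate=0.5)

print(direct.equals(piped))
```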
pipe allows easy chaining of methods, which can be a double-edged sword. It allows shorter and potentially more concise code, but if abused it can make the code hard to read and debug.
With pipe, we can rewrite our code one more time, and this is what the final version looks like
    tips_processed = (tips
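For a self-contained picture of this style, here is a sketch of what such a pipeline can look like. The formulas inside each function, the calculate_reduced_tip helper with its rate parameter, and the sample numbers are all assumptions made for illustration; only the overall shape (small functions built on assign, chained with pipe) is the point.

```python
import pandas as pd

def calculate_bill_before_tip(df):
    return df.assign(bill_before_tip=df["total_bill"] - df["tip"])

def calculate_tip_percentage(df):
    # Assumed definition: tip as a fraction of the pre-tip bill.
    return df.assign(tip_percentage=df["tip"] / df["bill_before_tip"])

def calculate_reduced_tip(df, rate=0.9):
    # Hypothetical step: shrink every tip by a fixed rate.
    return df.assign(reduced_tip=df["tip"] * rate)

tips = pd.DataFrame({"total_bill": [20.0, 50.0], "tip": [4.0, 10.0]})

# Each step takes a DataFrame and returns a new one, so the steps
# read top to bottom in the order they are applied.
tips_processed = (
    tips
    .pipe(calculate_bill_before_tip)
    .pipe(calculate_tip_percentage)
    .pipe(calculate_reduced_tip, rate=0.5)
)

# The same result written as nested calls, which reads inside-out.
nested = calculate_reduced_tip(
    calculate_tip_percentage(calculate_bill_before_tip(tips)), rate=0.5
)

print(tips_processed.equals(nested))
```

Because every step goes through assign, the input tips DataFrame is left unchanged, and each function can be unit-tested on its own.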
Looking good! With self-explanatory function names, one can easily understand what each manipulation is doing without having to dive into the details. This eliminates the temporary variables and explicit copies, and avoids exposing the complicated details of each manipulation to the reader. I especially love the consistency of pipe and assign, which makes it clear what operations are being done on which DataFrame. That being said, this is just one way of writing code. The ultimate goal is still to be consistent with your team’s coding guidelines and to produce clean, readable, and efficient code. I hope this has provided some inspiration on how Pandas code can be made more consistent, and may assign and pipe make their way into your code one day!