How to efficiently loop through Pandas DataFrame
If working with data is part of your daily job, you will likely run into situations where you realize you have to loop through a Pandas Dataframe and process each row. I recently find myself in this situation where I need to loop through each row of a large DataFrame, do some complex computation to each row, and recreate a new DataFrame base on the computation results. Savvy data scientists know immediately that this is one of the bad situations to be in, as looping through pandas DataFrame can be cumbersome and time consuming.
However, in the event where you have no option other than to loop through a DataFrame by rows, what is the most efficeint way? Let’s look at the usual suspects:
- for loop with
.iloc
- iterrows
- itertuple
- apply
- python zip
- pandas vectorization
- numpy vectorization
When I wrote my piece of code I had a vague sense that I should stay away from iloc
, iterrows
and try using pandas builtin functions or apply. But I ended up using itertuples
because what I was trying to do is fairly complex and could not be written in the form that utilizes apply. It turned out that using itertuple did not satisfy my time constraint, so I Googled around and found the above list of candidates I could potentially try out. I decided to try each of them out and record my findings as well as the reason why some options are more efficient that others.
Experiment results with %timeit
To compair the performance of each apprach fairly, I implemented each approach in a Jupyter notebook, and used the magic % timeit
function to measure the efficiency of each on a randomly generated DataFrame. For simplicity, each approach is trying to compute the sum of all elements of two of the columns of the DataFrame.
First let’s generate a DataFrame large enough with random integers
1 | import timeit |
The DataFrame df
has a shape of (100000, 4), where the first 5 rows look like
1 | df.head() |
1 | A B C D |
For the experiment I am using pandas 0.25.3.
Standard python for loop with iloc
A very basic way to achieve what we want to do is to use a standard for loop, and retrieve value using DataFrame’s iloc
method
1 | def loop_with_for(df): |
Check performance using timeit
1 | %timeit loop_with_for(df) |
2.01 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using Jupyter’s prune
function we get a detailed analysis on number of function calls and time consumed on each step
1 | %prun -l 4 loop_with_for(df) |
1 | 10200009 function calls (10000007 primitive calls) in 4.566 seconds |
Seems like with the for loop + iloc approach, most of the time is spent on accessing values of each cell of the DataFrame, and checking data type with python’s isinstance
function. Let’s see if we can get some speed up if we switch to using one of the functions pandas provides.
Using pandas iterrows function
The pandas iterrows
function returns a pandas Series for each row, with the down side of not preserving dtypes across rows.
1 | def loop_with_iterrows(df): |
Check performance using timeit
1 | %timeit loop_with_iterrows(df) |
9.14 s ± 59.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1 | %prun -l 4 loop_with_iterrows(df) |
1 | 30000101 function calls (29500097 primitive calls) in 17.009 seconds |
Surprisingly, the iterrows
approach is almost 5 times slow than using standard for loop! The reason, suggested by the above log, is that iterrows
spends a lot of time creating pandas Series object, which is known to incur a fair amount of overhead. And yet, the Series it created does not preserve dtypes across rows, which is why it is always recommended to use itertuples
over iterrows
, if you have to choose between one of them.
Using pandas itertuples function
The pandas itertuples
function is similar to iterrows
, except it returns a namedtuple
for each row, and preserves dtypes across rows.
1 | def loop_with_itertuples(df): |
Check performance using timeit
1 | %timeit loop_with_itertuples(df) |
110 ms ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1 | %prun -l 4 loop_with_itertuples(df) |
1 | 301143 function calls (301120 primitive calls) in 0.200 seconds |
It is exciting that we are finally getting into miliseconds per loop now! itertuples
saves the overhead of creating Series each row by creating namedtuple
instead. This is efficient, yet we are still paying for overhead for creating namedtuple.
Using python zip
There is another interesting way to loop through the DataFrame, which is to use the python zip
function. The way it works is it takes a number of iterables, and makes an iterator that aggragates elements from each of the iterables. Since a column of a Pandas DataFrame is an iterable, we can utilize zip
to produce a tuple for each row just like itertuples
, without all the pandas overhead! Personally I find the approach using zip
very clever and clean.
1 | def loop_with_zip(df): |
Check performance using timeit
1 | %timeit loop_with_zip(df) |
27.1 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1 | %prun -l 4 loop_with_zip(df) |
1 | 146 function calls in 0.049 seconds |
Now we are talking! We just saved about 4 times by using zip
. The efficiency depends on the fact that we are not creating namedtuple for every row. zip
simply returns an iterator of tuples.
Using pandas apply function
Of course we can always use the well-known pandas apply function, which is commonly used to do complex operations on DataFrame rows and columns.
1 | def using_apply(df): |
Check performance using timeit
1 | %timeit using_apply(df) |
2.6 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1 | %prun -l 4 using_apply(df) |
1 | 8401975 function calls (8001949 primitive calls) in 4.559 seconds |
We are seeing about the same performance as using standard loops. According to this stack overflow post, apply is still doing row operations and creating Series, which explains why the function calls are mostly getting values from Series. This is surprising as I always thought apply is one of the more efficient functions if one needs to do row operations.
Using pandas builtin add function
If what we are actually doing is just adding two columns and computing total sum, using the pandas built-in add and sum function would have been the obvious way. Unfortunately many computations we do does not have a simple built-in operation in Pandas. But this approach gives us a good indicator of how efficient these Pandas built-in functions are in practice.
1 | def using_pandas_builtin(df): |
Check performance using timeit
1 | %timeit using_pandas_builtin(df) |
567 µs ± 81.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1 | %prun -l 4 using_pandas_builtin(df) |
1 | 645 function calls (642 primitive calls) in 0.002 seconds |
Since pandas functions are highly optimized, it is expected to be very efficient. This also shows that if your complex operation can be broken down to a series of pandas builtin functions, then it might make more sense to go that route than trying to jam all the operations in a lambda function then use apply.
Using numpy builtin function
And eventually, if you are really looking for efficiency and speed, always go to numpy. Here we convert each column into a numpy array, and does all the heavy lifting utilizing numpy’s builtin functionalities. It’s known to be more efficient than Pandas Dataframe operations.
1 | def using_numpy_builtin(df): |
Check performance using timeit
1 | %timeit using_numpy_builtin(df) |
272 µs ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1 | %prun -l 4 using_numpy_builtin(df) |
1 | 37 function calls in 0.001 seconds |
And of course, the clear winner of this contest is the approach using numpy.
What about DataFrame with different size
Now we have a good understanding of the efficiency of each approach, a natural question to ask is: how large the DataFrame needs to be for us to start considering trying a more efficient approach? To answer that question, I ran an experiment on DataFrames with different sizes, from 1000 rows to 40000 rows. The results are summarized in the plot below.
From the plot we can see that iterrows
is the least efficient and computation time grows the fastest. Then on second tier we have the apply
function and standard for loop, which have almost same performance. Then the most efficient approaches are pandas and numpy built-in functions, the performance of which are very consistent despite increasing number of rows. Close seconds are zip
and itertuples
, with zip
approach slightly faster than itertuples
. (numpy approach is not drawn on the plot due to its running time being miniscule.)
Here is a table with the performance details of the experiment.
For simple operations like what we do here (adding two columns), the difference in performance starts to show once we get to 10000 rows-ish. For more complicated operations, it seems reasonable to start thinking efficiency once you get to about 5000 rows.
Additional Insights
how to use apply
I noticed that there are different ways one can use the apply function in our context, and how we use them actually makes a difference in terms of performance. Let’s take a look.
For the sake of illustration, I modified the using_apply
function from above to compute sum of all columns instead of just columns ‘A’ and ‘B’.
1 | def using_apply(df): |
The main difference between the above two function is the way we access each column value. In using_apply
, we does apply on each row, then access each column value separately, whereas in the other function, we only pass in the relevant columns, and unpack the row to get all columns at the same time. On first look, I did not think this would make a difference, and boy I was wrong. Running both function on a DataFrame of size (100000, 4) yields the following results.
1 | %timeit using_apply(df) |
4.05 s ± 502 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1 | %timeit using_apply_unpack(df) |
1.81 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The difference it more than 2 times! We get some savings of accessing all columns by unpacking rather than accessing one by one. If you need to use apply, then bear in mind that these little details could give you a fair amount of efficiency boost!
source code comparison
Sometimes, to figure out what is going on, it is helpful to look at the source codes. I included some of the source codes of the functions here for readers who are interested.
iterrows
For readers who are interested, this is the iterrows source code for iterrows
1 | columns = self.columns |
The klass
object here is actually the Series class. And it is interesting to see that iterrows
also uses zip
, except it is using it to create Series by zipping index and values together. This partly explains why zip
beats iterrows
by so much.
itertuples
For readers who are interested, this is the itertuples source code for itertuples
1 | arrays = [] |
zip
For readers who are interested, this is the zip source code for python’s zip
function
1 | def zip(*iterables): |