Pandas performance tips

January 26, 2023

Working with pandas can be painful if you don't know how to read and process data efficiently. This is especially true on Kaggle, where a command that takes several minutes to run costs you precious time.

For iterations, fastest to slowest: dict > .values > itertuples > iterrows > range(len(df))

A common way to iterate through a DataFrame is to go row by row with an index loop, which can be extremely slow.

import pandas as pd

def apply_loop(df):
    salary_sum = 0
    
    for i in range(len(df)):
        # df.iloc[i] builds a whole row Series on every pass, which is what makes this slow
        salary_sum += df.iloc[i]['Employee Salary']

    return salary_sum/df.shape[0]
  
%%timeit
apply_loop(data)
## 46.1 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using df.iterrows() and df.itertuples()

import pandas as pd

def salary_iterrows(df):
    salary_sum = 0
    
    for index, row in df.iterrows():
        salary_sum += row['Employee Salary']
        
    return salary_sum/df.shape[0]

%%timeit
salary_iterrows(data)
## 18.2 ms ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

import pandas as pd

def salary_itertuples(df):
    salary_sum = 0
    
    for row in df.itertuples(): 
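        # 'Employee Salary' is not a valid attribute name, so itertuples() renames it
        # by position: _4 is the field at position 4 (the index is position 0)
        salary_sum += row._4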
        
    return salary_sum/df.shape[0]

%%timeit
salary_itertuples(data)
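
The ranking at the top also lists .values and dict, which the functions above do not cover. Here is a minimal sketch of what those two variants could look like. The small `data` frame and the 'Employee Name' column are made up for illustration (only 'Employee Salary' comes from the examples above), and `salary_values` and `salary_dict` are hypothetical names. Timings will depend on your own data, so none are quoted here.

```python
import pandas as pd

# Hypothetical stand-in for the post's `data` frame; only the
# 'Employee Salary' column name is taken from the examples above.
data = pd.DataFrame({
    "Employee Name": ["Ann", "Bob", "Cy", "Dee"],
    "Employee Salary": [50000, 62000, 58000, 70000],
})

def salary_values(df):
    salary_sum = 0

    # .values exposes the column as a NumPy array, so no row Series is built
    for salary in df["Employee Salary"].values:
        salary_sum += salary

    return salary_sum / df.shape[0]

def salary_dict(df):
    salary_sum = 0

    # to_dict("records") converts every row to a plain Python dict up front
    for row in df.to_dict("records"):
        salary_sum += row["Employee Salary"]

    return salary_sum / df.shape[0]
```

Both functions return the same mean as the loops above, so you can drop them into a %%timeit cell and compare on your own frame.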

Having a numpy array can help, but at the same time this re

value_counts() vs groupby()

A lot of people use groupby() on a column to count how often each value occurs, for example to get a histogram of names.
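
As a rough illustration of the two approaches side by side, the sketch below counts names both ways. The frame and its 'name' column are assumptions made up for this example, not data from the post.

```python
import pandas as pd

# Hypothetical data; the 'name' column is an assumption for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann", "Cy", "Ann", "Bob"]})

# groupby route: group the rows by name, then count the rows in each group
counts_groupby = df.groupby("name").size()

# value_counts route: count occurrences directly on the column
counts_value_counts = df["name"].value_counts()
```

The counts are the same either way. The ordering can differ: groupby() sorts by the group key by default, while value_counts() sorts by count in descending order.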
