The Problem
A team of data analysts working with Python and Pandas had run into productivity issues in their data exploration tasks on large, important datasets. Analyses took on the order of minutes to run, which was unacceptable for interactive analysis.
The Analysis
Our analysis uncovered several technical difficulties that reduced the productivity of the team:
- Group-by operations with many groups put analysis jobs under memory pressure or caused them to fail.
- Custom aggregators in Python slowed down analyses.
- Inconsistent result types returned by Pandas, in particular scalar vs. series vs. dataframe, forced the team to include inefficient data normalization code in their analyses.
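To illustrate the last point, here is a minimal sketch of one well-known case in which Pandas returns different types for the same kind of lookup, depending on whether the index label is unique. The data and the `as_array` helper are hypothetical and only stand in for the kind of normalization code the team had to write; the actual cases they ran into are not described here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the same label lookup on a unique vs. a duplicated index.
s_unique = pd.Series([1.0, 2.0], index=["a", "b"])
s_dup = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "b"])
df_dup = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "b"])

one = s_unique.loc["b"]   # a scalar, because the label occurs once
many = s_dup.loc["b"]     # a Series, because the label occurs twice
rows = df_dup.loc["b"]    # a DataFrame, because the label occurs twice


def as_array(result):
    """Hypothetical normalization helper: always return a NumPy array."""
    if isinstance(result, (pd.Series, pd.DataFrame)):
        return result.to_numpy()
    # Wrap a bare scalar in a 1-element array.
    return np.atleast_1d(result)
```

Calling a helper like `as_array` after every lookup costs both code and time, which is why removing the need for it matters.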
The Solution
We provided tailored solutions to the common computational tasks that were responsible for the technical issues and reduced performance:
- Group-by operations were ported to NumPy, allowing for much faster processing with far fewer computational resources. The new functionality was wrapped in an interface similar to that of Pandas, which reduced the effort of using it in existing analysis code.
- The set of custom aggregators was ported to Numba, simplifying the code and making it faster. The custom aggregators were packaged for easy integration into group-by operations, either using Pandas or using NumPy.
- Inefficient normalization code became unnecessary and was removed.
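The first two points can be sketched as follows. This is not the team's implementation, only a minimal illustration under stated assumptions: the names `group_reduce` and `value_range` are hypothetical, the group-by uses a sort-based approach with `np.argsort` and `np.unique`, and Numba is treated as optional so that the sketch also runs as plain Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # optional: JIT-compiles the aggregator
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs without Numba (assumption).
        return func


@njit
def value_range(values):
    """Hypothetical custom aggregator: max minus min of a non-empty 1-D array."""
    lo = values[0]
    hi = values[0]
    for v in values:
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return hi - lo


def group_reduce(keys, values, aggregator):
    """Sort-based group-by: apply `aggregator` to the values of each key.

    Always returns two arrays (unique keys, one result per key), so the
    result type does not depend on the data.
    """
    keys = np.asarray(keys)
    values = np.asarray(values)
    # Sort once so that each group is a contiguous slice.
    order = np.argsort(keys, kind="stable")
    sorted_keys = keys[order]
    sorted_values = values[order]
    # Start offset of each group in the sorted arrays.
    unique_keys, starts = np.unique(sorted_keys, return_index=True)
    ends = np.append(starts[1:], len(sorted_keys))
    results = np.array(
        [aggregator(sorted_values[s:e]) for s, e in zip(starts, ends)]
    )
    return unique_keys, results


keys = np.array([2, 1, 2, 1, 3])
values = np.array([10.0, 1.0, 4.0, 5.0, 7.0])
groups, ranges = group_reduce(keys, values, value_range)
```

In this sketch the loop over groups still runs in Python; only the per-group aggregation is compiled. For built-in reductions such as sums, a call like `np.add.reduceat(sorted_values, starts)` would avoid that loop entirely.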
The Impact
The running time of analyses went down from minutes to seconds, enabling the team to conduct interactive analysis of their large datasets and making them more productive. At the same time, only a small effort was needed to adapt existing analysis code to the new interface, which was similar to the original one.