Optimization, NumPy, Pandas, Vectorization, Parallelism, ML Systems

ML Pipeline Runtime Optimization

Refactored a feature-heavy ML pipeline around vectorized operations, matrix-based computation, and memory-aware transformations.

Runtime

40m -> 4m

End-to-end pipeline

Rows

~3M

Feature computation scale

Data

~1 GB

Tabular workload

Speedup

10x

Iteration velocity

Problem

Pipeline runtime and memory pressure made iteration slow, expensive, and difficult to scale for production-like data sizes.

Challenge

The pipeline mixed repeated dataframe operations, Python loops, redundant intermediate objects, and compute paths that did not match the shape of the data.

Architecture

How the pieces fit together.

The optimized workflow batches feature transforms, replaces row-wise loops with NumPy/Pandas vectorization, parallelizes independent computation, and controls temporary allocations.
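As a rough illustration of the "parallelizes independent computation" step, the sketch below splits an array into row chunks and computes a feature on each chunk in a thread pool. The feature (`chunk_feature`, a per-row L2 norm), the chunk count, and the worker count are all hypothetical and not taken from the original pipeline. Threads are used here on the assumption that the heavy work happens inside NumPy calls, which commonly release the GIL. A process pool would be the alternative for pure-Python work.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunk_feature(block: np.ndarray) -> np.ndarray:
    """Hypothetical feature: L2 norm of each row in the block."""
    return np.sqrt(np.einsum("ij,ij->i", block, block))


def parallel_feature(values: np.ndarray, n_chunks: int = 4, max_workers: int = 4) -> np.ndarray:
    # Split along rows; each chunk is independent of the others.
    chunks = np.array_split(values, n_chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the rows.
        parts = list(pool.map(chunk_feature, chunks))
    return np.concatenate(parts)


rng = np.random.default_rng(0)
values = rng.normal(size=(100_000, 8))
result = parallel_feature(values)
```

Bounding `n_chunks` and `max_workers` also bounds how many chunk-sized temporaries exist at once, which is one way to keep a clear memory boundary around the parallel section.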

Architecture View: system structure and decision flow

Before: 40m. After: 4m.

Python loops: interpreter overhead
Copies: memory pressure
Repeated joins: redundant work
Unbatched features: slow inference

Dataset / Inputs

Approximately 1 GB of tabular data and 3 million rows flowing through feature engineering and inference steps.

Technical Decisions

Profile first, then optimize the hottest compute paths.
Replace row-level Python operations with vectorized NumPy/Pandas transformations.
Batch repeated feature calculations and remove redundant intermediate frames.
Use parallelism only for independent workloads with clear memory boundaries.
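The "profile first" decision can be sketched with the standard library's `cProfile` and `pstats`. The functions `pipeline` and `slow_feature` below are hypothetical stand-ins, not the project's code. The point is only to show how a cumulative-time report surfaces the hottest call path before any rewriting starts.

```python
import cProfile
import io
import pstats


def slow_feature(rows):
    # Deliberately row-wise: one Python-level operation per element.
    return [r * 2.0 + 1.0 for r in rows]


def pipeline(n_rows=200_000):
    rows = list(range(n_rows))
    return slow_feature(rows)


profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Write the report to a string so it can be printed, logged, or stored.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
report = buffer.getvalue()
print(report)
```

Capturing the report as text makes it straightforward to archive per run, which is one possible starting point for the CI profiling snapshots listed under Future Improvements.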

Implementation Details

Converted repeated loops into matrix-oriented operations.
Reduced dataframe copies and temporary object growth.
Moved computation toward C-backed array operations.
Kept output parity checks around key feature columns.
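A minimal sketch of the loop-to-array conversion together with a parity check, assuming a hypothetical ratio feature (`a / b`, defaulting to 0.0 where `b` is zero). The feature, function names, and tolerances are illustrative only. The loop version is kept as the reference so the vectorized version can be checked against it.

```python
import numpy as np


def ratio_feature_loop(a, b):
    """Row-wise reference: a / b, with 0.0 where b is zero."""
    out = []
    for x, y in zip(a, b):
        out.append(x / y if y != 0 else 0.0)
    return np.array(out)


def ratio_feature_vectorized(a, b):
    """Same feature as a single C-backed array operation."""
    out = np.zeros_like(a, dtype=np.float64)
    # where= skips zero denominators; those slots keep the 0.0 from `out`.
    np.divide(a, b, out=out, where=b != 0)
    return out


rng = np.random.default_rng(42)
a = rng.normal(size=10_000)
b = rng.integers(0, 5, size=10_000).astype(np.float64)

expected = ratio_feature_loop(a, b)
actual = ratio_feature_vectorized(a, b)

# Parity check: raises AssertionError if the two paths disagree.
np.testing.assert_allclose(actual, expected, rtol=1e-12, atol=0.0)
```

Running a check like this on a sample of real rows for each key feature column is one way to keep parity visible while the compute path changes underneath.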

Metrics / Results

Runtime dropped from 40 minutes to 4 minutes (a 10x speedup) on roughly 1 GB of tabular data (about 3 million rows), with parity checks on key feature columns confirming that modeling behavior stayed intact.

Lessons Learned

Runtime optimization is usually data-layout work before it is model work.
Memory bloat can hide inside convenient dataframe chains.
Small Python operations become expensive when repeated millions of times.
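To make the memory point concrete, the sketch below measures a small frame before and after choosing tighter dtypes. The column names and data are made up for illustration. Casting to `float32` trades precision for space, so it is an assumption that would need checking against the parity requirements of any real feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east", "west"], size=n),
    "value": rng.normal(size=n),
})

# deep=True counts the string payloads, not just the column's pointers.
before = df.memory_usage(deep=True).sum()

# Low-cardinality strings become integer codes; floats drop to 32 bits.
compact = df.astype({"region": "category", "value": "float32"})
after = compact.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.2f} MB, after: {after / 1e6:.2f} MB")
```

Measuring with `memory_usage(deep=True)` at a few points in a long dataframe chain is a cheap way to see which intermediate step is responsible for the growth.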

Future Improvements

Add automated profiling snapshots to CI.
Explore Polars or DuckDB for larger-than-memory feature workloads.
Track memory high-water marks in production runs.
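One possible way to record a memory high-water mark is the standard library's `tracemalloc`, sketched below around a hypothetical `run_step` function. This measures allocations that go through Python's tracked allocators (NumPy array buffers are normally included), so it is not the same number as process-level resident memory. A production setup might prefer an OS-level metric instead.

```python
import tracemalloc

import numpy as np


def run_step(n_rows: int = 200_000) -> float:
    """Hypothetical pipeline step that allocates a temporary array."""
    values = np.ones(n_rows, dtype=np.float64)
    scaled = values * 2.0  # second array of the same size as `values`
    return float(scaled.sum())


tracemalloc.start()
result = run_step()
# get_traced_memory returns (current, peak) in bytes since start().
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak traced memory: {peak_bytes / 1e6:.1f} MB")
```

Logging `peak_bytes` per step alongside runtime would give a simple trend line for spotting regressions in temporary allocations.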