ML Pipeline Runtime Optimization
Refactored a feature-heavy ML pipeline around vectorized operations, matrix-based computation, and memory-aware transformations.
- Runtime: 40m -> 4m (end-to-end pipeline)
- Rows: 3M (feature computation scale)
- Data: ~1 GB (tabular workload)
- Speedup: 10x (iteration velocity)
Problem
Long pipeline runtimes and memory pressure made iteration slow and expensive, and made the pipeline difficult to scale to production-like data sizes.
Challenge
The pipeline mixed repeated dataframe operations, Python loops, redundant intermediate objects, and compute paths that did not match the shape of the data.
Architecture
How the pieces fit together.
The optimized workflow batches feature transforms, replaces row-wise loops with NumPy/Pandas vectorization, parallelizes independent computation, and controls temporary allocations.
Architecture View: system structure and decision flow
- Python loops -> interpreter overhead
- Copies -> memory pressure
- Repeated joins -> redundant work
- Unbatched features -> slow inference
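To make the "parallelizes independent computation" part of the workflow concrete, here is a minimal sketch of chunked, thread-based feature computation. The original pipeline's parallelism mechanism is not described, so the thread pool, the chunk count, and the names `chunk_features` and `parallel_features` are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunk_features(block: np.ndarray) -> np.ndarray:
    """Hypothetical per-chunk transform: row-wise mean and max as two features."""
    return np.column_stack([block.mean(axis=1), block.max(axis=1)])


def parallel_features(matrix: np.ndarray, n_chunks: int = 4) -> np.ndarray:
    """Compute features on independent row chunks and stitch them back together."""
    # Row chunks do not depend on each other, so each worker reads only its slice.
    chunks = np.array_split(matrix, n_chunks, axis=0)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # map() preserves input order, so the parts line up with the original rows.
        parts = list(pool.map(chunk_features, chunks))
    return np.vstack(parts)


rng = np.random.default_rng(0)
matrix = rng.normal(size=(10_000, 8))
features = parallel_features(matrix)

# The chunked result should match the single-pass result row for row.
assert features.shape == (10_000, 2)
assert np.allclose(features, chunk_features(matrix))
```

Chunking by rows keeps the memory boundary clear: each worker produces one small output block, and no worker writes into shared state. Whether threads actually help depends on how much time is spent inside NumPy rather than in the interpreter, so measuring before adopting this pattern is advisable.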
Dataset / Inputs
- Approximately 1 GB of tabular data (about 3 million rows) flowing through feature engineering and inference steps.
Technical Decisions
- Profile first, then optimize the hottest compute paths.
- Replace row-level Python operations with vectorized NumPy/Pandas transformations.
- Batch repeated feature calculations and remove redundant intermediate frames.
- Use parallelism only for independent workloads with clear memory boundaries.
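The "profile first" decision above can be sketched with the standard-library profiler. This is an illustrative example, not the profiling setup used on the project: `slow_feature` and `run_pipeline` are hypothetical stand-ins for real pipeline steps.

```python
import cProfile
import io
import pstats


def slow_feature(values):
    # Hypothetical hot path: per-element work done in the interpreter.
    return [v * v + 1 for v in values]


def run_pipeline():
    # Hypothetical stand-in for the real pipeline entry point.
    data = list(range(200_000))
    return sum(slow_feature(data))


profiler = cProfile.Profile()
profiler.enable()
total = run_pipeline()
profiler.disable()

# Sort by cumulative time so the most expensive call paths appear first.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Reading the report top-down shows which functions dominate cumulative time, which is where vectorization or batching is most likely to pay off.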
Implementation Details
- Converted repeated loops into matrix-oriented operations.
- Reduced dataframe copies and temporary object growth.
- Moved computation toward C-backed array operations.
- Kept output parity checks around key feature columns.
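The loop-to-matrix conversion and the parity checks listed above might look roughly like the following. The feature (a weighted score per row), the function names, and the tolerances are assumptions chosen for illustration; the original feature definitions are not given.

```python
import numpy as np


def weighted_scores_loop(features, weights):
    """Baseline: nested Python loops over rows and columns."""
    scores = []
    for row in features:
        total = 0.0
        for value, weight in zip(row, weights):
            total += value * weight
        scores.append(total)
    return np.array(scores)


def weighted_scores_matrix(features, weights):
    """Same computation as a single matrix-vector product in C-backed NumPy."""
    return features @ weights


def check_parity(old, new, rtol=1e-9, atol=1e-12):
    """Raise AssertionError if the refactored output drifts from the baseline."""
    np.testing.assert_allclose(new, old, rtol=rtol, atol=atol)


rng = np.random.default_rng(42)
features = rng.normal(size=(1_000, 5))
weights = rng.normal(size=5)

check_parity(
    weighted_scores_loop(features, weights),
    weighted_scores_matrix(features, weights),
)
```

A tolerance-based comparison is used instead of exact equality because the matrix product may sum terms in a different order than the loop, which can change the last few bits of a floating-point result.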
Metrics / Results
- Runtime dropped from 40 minutes to 4 minutes (a 10x speedup) on roughly 1 GB of data and 3 million rows, with modeling behavior kept intact.
Lessons Learned
- Runtime optimization is usually data-layout work before it is model work.
- Memory bloat can hide inside convenient dataframe chains.
- Small Python operations become expensive when repeated millions of times.
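One way to see where memory hides in a dataframe is to inspect per-column usage before and after a change. The sketch below is an assumption-laden illustration: the column names are hypothetical, and converting a repetitive string column to a categorical dtype is just one example of a memory-aware transformation, not necessarily one used in this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000

# Hypothetical frame: one low-cardinality string column and one numeric column.
df = pd.DataFrame({
    "segment": rng.choice(["segment_a", "segment_b", "segment_c"], size=n_rows),
    "value": rng.normal(size=n_rows),
})


def column_bytes(frame):
    """Per-column memory in bytes, including the contents of string columns."""
    return frame.memory_usage(deep=True, index=False)


before = column_bytes(df)

# Store each distinct string once and keep small integer codes per row.
compact = df.assign(segment=df["segment"].astype("category"))
after = column_bytes(compact)

print(pd.DataFrame({"before": before, "after": after}))
```

Passing `deep=True` matters here: without it, pandas may report only the pointer size for string columns and understate their real footprint.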
Future Improvements
- Add automated profiling snapshots to CI.
- Explore Polars or DuckDB for larger-than-memory feature workloads.
- Track memory high-water marks in production runs.
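Tracking memory high-water marks could start with something as small as the context manager below. This is a sketch under stated assumptions: `track_peak_memory` and the `build_features` label are hypothetical names, and the standard-library `tracemalloc` module is used here only because it needs no extra dependencies.

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def track_peak_memory(label, log):
    """Record the peak traced allocation (in bytes) for the wrapped step."""
    tracemalloc.start()
    try:
        yield
    finally:
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        log[label] = peak


peaks = {}
with track_peak_memory("build_features", peaks):
    # Hypothetical stand-in for a real feature-engineering step.
    rows = [[float(i), float(i) * 2.0] for i in range(50_000)]
    totals = [a + b for a, b in rows]

print({name: f"{size / 1e6:.1f} MB" for name, size in peaks.items()})
```

`tracemalloc` reports allocations made through Python's memory allocators, so memory allocated directly by native libraries may not be counted. For production runs, a process-level measurement would likely be needed alongside it.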