Description
Conceptually, create a pipeline processor that performs out-of-core computation.
This is easily parallelizable (across cores or across machines); in theory cython / ipython / joblib / hadoop could operate with this
requirements
the data set must support chunking, and the function must operate only on a single chunk at a time
Useful for data sets with a large number of rows, or for a problem that you want to parallelize.
input
a chunking iterator that reads from disk (it could take a chunksize parameter,
or take a handle and simply call its iterator; a sketch follows the list below)
- read_csv
- HDFStore
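A minimal sketch of both readers, assuming hypothetical paths `data.csv` / `store.h5` and an arbitrary chunksize:

```python
import pandas as pd

# CSV: passing chunksize makes read_csv return an iterator of
# DataFrames instead of one big frame ("data.csv" is hypothetical)
csv_chunks = pd.read_csv("data.csv", chunksize=50000)

# HDFStore: select on a table-format store yields chunks the same way
# ("store.h5" and the key "df" are hypothetical)
store = pd.HDFStore("store.h5")
hdf_chunks = store.select("df", chunksize=50000)
```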
function
takes an iterated chunk and an axis, and returns another pandas object
(could be a reduction, a transformation, whatever)
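A minimal sketch of such a function; the body is a stand-in, since any per-chunk transformation or reduction fits:

```python
def func(chunk, axis=0):
    # must touch only the chunk it is handed; returns another pandas
    # object (here a trivial transformation, but a reduction such as
    # chunk.sum(axis=axis) fits the same signature)
    return chunk.fillna(0)
```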
output
an output mechanism that takes the function's result; it must support appending (a sketch follows the list below)
- to_csv (with appending)
- HDFStore (table)
- another pipeline
- in memory
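A sketch of two appending sinks plus a driver loop, reusing the `func` sketched above and assuming hypothetical output paths (`to_hdf` in table format needs PyTables):

```python
import os
import pandas as pd

def append_csv(result, path="out.csv"):
    # write the header only for the first chunk, then keep appending
    result.to_csv(path, mode="a", header=not os.path.exists(path))

def append_hdf(result, path="out.h5"):
    # table format supports append=True
    result.to_hdf(path, key="result", format="table", append=True)

# drive the pipeline: read -> apply -> append, one chunk at a time
for chunk in pd.read_csv("data.csv", chunksize=50000):
    append_csv(func(chunk))
```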
Map-reduce is an example of this type of pipelining.
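For instance, a map-reduce-style reduction over chunks could map each chunk to a partial per-key sum and then reduce the partials (assuming a hypothetical `data.csv` with `key` and `value` columns):

```python
import pandas as pd

# map: partial per-key sums per chunk; reduce: combine the partials
partials = [chunk.groupby("key")["value"].sum()
            for chunk in pd.read_csv("data.csv", chunksize=50000)]
result = pd.concat(partials).groupby(level=0).sum()
```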
Interesting Library
https://github.com/enthought/distarray