
ENH: create out-of-core processing module #3202

Closed
@jreback

Description


Conceptually, create a pipeline processor that performs out-of-core computation.
This is easily parallelizable (across cores or machines); in theory cython / ipython / joblib / hadoop could drive it.

requirements

the data set must support chunking, and the function must operate only on each chunk
Useful for data sets with a large number of rows, or for a problem that you want to parallelize.

input

a chunking iterator that reads from disk (could take a chunksize parameter, or
accept a handle and just call the iterator directly)

  • read_csv
  • HDFStore
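For example, passing chunksize to read_csv turns the reader into an iterator of DataFrame chunks. A minimal sketch (the in-memory CSV and the sizes are illustrative, not from the issue):

```python
import io
import pandas as pd

# A small in-memory CSV stands in for a large on-disk file.
csv = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

# chunksize makes read_csv return an iterator of DataFrame chunks.
chunks = pd.read_csv(csv, chunksize=4)
sizes = [len(c) for c in chunks]
print(sizes)  # [4, 4, 2]
```

HDFStore offers the same pattern via its select(..., chunksize=...) / iterator interface on table-format stores.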

function

take an iterated chunk and an axis, and return another pandas object
(could be a reduction, a transformation, whatever)
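A sketch of what such a per-chunk function might look like (the name chunk_sum is hypothetical): it depends only on the chunk it receives and returns another pandas object, here a reduction.

```python
import pandas as pd

def chunk_sum(chunk: pd.DataFrame, axis: int = 0) -> pd.Series:
    # Reduction over this chunk alone -- no state shared across chunks,
    # which is what makes the pipeline parallelizable.
    return chunk.sum(axis=axis)

part = chunk_sum(pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
print(part.to_dict())  # {'a': 3, 'b': 7}
```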

output

an output mechanism that takes the function's result; must support appending

  • to_csv (with appending)
  • HDFStore (table)
  • another pipeline
  • in memory

Map-reduce is an example of this type of pipelining.
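In map-reduce terms, a minimal end-to-end sketch (data and chunk size are illustrative): the map step applies the chunk function to each chunk from the reader, and the reduce step combines the partial results.

```python
import io
import pandas as pd

csv = io.StringIO("x\n" + "\n".join(str(i) for i in range(1, 101)))

# map: apply the per-chunk function to each chunk from the iterator
partials = [chunk["x"].sum() for chunk in pd.read_csv(csv, chunksize=25)]

# reduce: combine the partial results into the final answer
total = sum(partials)
print(total)  # 5050
```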

Interesting Library
https://github.com/enthought/distarray

Labels

Enhancement, IO Data, Ideas, Performance
