
Design Notes


Notes from @mrocklin:

Thoughts on Python + HDFS

I think that there are a few open questions:

  • What technology is best to use when wrapping the C library?
  • How should we design our API?
  • What are the objectives / use cases for this library?
  • How should we all develop collaboratively?

What technology is best to use when wrapping the C library?

  • C Python API: Direct, requires a compiler (but so does libhdfs3), and has the virtue of being done (maybe?). Less accessible to the developer pool within Continuum; possibly more accessible to the developer pool within Pivotal.
  • ctypes: Direct, no compiler needed (other than for libhdfs3). More accessible to the Continuum developer pool.
  • Cython: Are we still considering this? Daniel seems to have stopped his efforts here, but should we reconsider it?

I have a slight preference for ctypes over the C Python API for social reasons, but I don't really care that much which technology we use.
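
As a rough sketch of what the ctypes route might look like, assuming libhdfs3 is installed as libhdfs3.so and exposes the usual libhdfs-style hdfs.h entry points (the exact signatures below are my reading of that header, not verified against our build):

import ctypes

# Sketch only: load the shared library and declare two of its entry points.
lib = ctypes.cdll.LoadLibrary('libhdfs3.so')

# hdfsFS hdfsConnect(const char *nn, tPort port);  (tPort is a uint16)
lib.hdfsConnect.argtypes = [ctypes.c_char_p, ctypes.c_uint16]
lib.hdfsConnect.restype = ctypes.c_void_p          # opaque hdfsFS handle

# int hdfsDisconnect(hdfsFS fs);
lib.hdfsDisconnect.argtypes = [ctypes.c_void_p]
lib.hdfsDisconnect.restype = ctypes.c_int

fs = lib.hdfsConnect(b'namenode-hostname', 8020)   # placeholder host/port
# ... call other declared functions with fs ...
lib.hdfsDisconnect(fs)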

How should we design our API?

I like the idea of a two-level API:

  1. A direct, one-to-one mapping of the C API using one of the technologies above
  2. A Pythonic API on top of this, written in pure Python

We know that the libhdfs3 API is complete and stable. It would be nice to shift this very solid bedrock up to the Python level, where we can then play around with fun interfaces easily in pure Python.

This two-level approach is also taken by h5py, which faced a similar problem with the HDF5 C library. It seems to have worked well for them: there is nothing the C library can do that you can't get done with h5py and some elbow grease.
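
To make the two levels concrete, the second level could be a thin pure-Python class sitting on top of the first; everything below (the _lib module, the class and method names) is illustrative, not an existing API:

# Level 2 sketch: a Pythonic wrapper over a hypothetical level-1 module
# `_lib` that holds the one-to-one ctypes/C-API mapping of libhdfs3.
import _lib  # hypothetical; the direct mapping from level 1

class HDFileSystem:
    def __init__(self, host, port=8020):
        # hold the opaque hdfsFS handle returned by the raw connect call
        self._fs = _lib.hdfsConnect(host.encode(), port)

    def open(self, path, mode='rb'):
        # would return a file-like object wrapping hdfsOpenFile/hdfsRead/...
        raise NotImplementedError("sketch only")

    def close(self):
        _lib.hdfsDisconnect(self._fs)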

What are the objectives / use cases for this library?

HDFS is pervasive and useful enough that I hope that most of our value will come from enabling other non-Continuum technologies. I think that we want to make Python a first-class player in HDFS land, and that a convenient, complete, standalone library for HDFS is key to accomplishing that.

For me personally, I see three needs:

  • Namenode-level management of data within HDFS (copy, move, list, navigate, ...)
  • Datanode-level reading/writing of blocks for distributed
  • Streaming data into and out of Python applications from the namenode, e.g.:
hdfs = HDFS(ip, **auth)
with hdfs.open('/path/to/myfile.csv', 'rt') as f:
    for line in f:
        ...

This last item seems lame, but it may be handy. It is at least a good example of a successful Python API that could be applied to HDFS. There may be value in marrying HDFS to familiar Python experiences.
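
As a concrete example of that last point, a file-like object like the one above should drop straight into libraries people already use; hdfs.open here is the hypothetical API from the snippet above:

import csv
import pandas as pd

# Neither library needs to know anything about HDFS; each only requires a
# standard file-like object.
with hdfs.open('/path/to/myfile.csv', 'rt') as f:
    reader = csv.reader(f)
    header = next(reader)

with hdfs.open('/path/to/myfile.csv', 'rb') as f:
    df = pd.read_csv(f)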

How should we develop collaboratively?

HDFS is a bear. It's hard for us to communicate and help each other when we experiment on different hardware. It's also hard to robustly test our software. I'm pretty ignorant here. Some questions:

  • What are our options? acluster? docker? Local CDH solution?
  • Is there a solution here that is pain free enough and valuable enough to agree on?
  • What's our status with Kerberos and friends?

Desired API for distributed

I need two operations:

  1. For a file or directory, list the sequence of blocks and the hosts on which each block lives:

filename -> [(block, {host})]

>>> hdfs.list_blocks('/path/to/myfile.*.csv')   # or: hdfs.list_blocks('/path/')
[(block-identifier, {'192.168.0.1', '192.168.0.2', '192.168.0.3'}),
 (block-identifier, {'192.168.0.2', '192.168.0.3', '192.168.0.4'}),
 ...
 ]

  2. For a particular block, load its bytes from the local data node, assuming that we are on a data node that owns the block. No inter-node communication should occur:

block -> bytes

>>> data = hdfs.load_block(block_identifier)
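
Put together, a scheduler like distributed could use these two calls to keep computation next to the data. The executor object and the worker-targeting keyword below are assumptions for the sake of illustration, not a settled interface:

blocks = hdfs.list_blocks('/path/to/myfile.*.csv')

futures = []
for block, hosts in blocks:
    # ask the scheduler to run load_block on a machine that already owns the
    # block, so that the read stays local to that data node
    futures.append(executor.submit(hdfs.load_block, block, workers=hosts))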