Streaming filters #2911
Conversation
Add structures and preliminary functions to take a buffer, file or blob and write the contents in chunks through an arbitrary number of chained filters, finally writing into a user-provided function that accepts the contents.
Migrate the `git_filter_list_apply_*` functions over to using the new filter streams.
Use the new streaming filter API during checkout.
Let the filters use the checkout data's temporary buffer, instead of having to allocate new buffers each time.
Test pushing a file on-disk into a streaming filter that compresses it into the ODB, and inflates it back into the working directory.
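A minimal sketch of the chained-stream shape these commits describe; the PR's type is `git_filter_stream`, but the field layout and names below are assumptions for illustration:

```c
#include <stddef.h>

/* Illustrative chainable stream: each stage exposes write/close and
 * forwards output to the next stage; the terminal stage writes into
 * the real target (ODB or working directory) instead of forwarding. */
typedef struct example_stream {
	int (*write)(struct example_stream *s, const char *buf, size_t len);
	int (*close)(struct example_stream *s);
	struct example_stream *next;   /* NULL for the terminal writer */
} example_stream;

/* A pass-through stage transforms each chunk, then hands the result
 * to the next stream in the chain. */
static int passthrough_write(example_stream *s, const char *buf, size_t len)
{
	/* ... apply this filter's transformation to (buf, len) ... */
	return s->next->write(s->next, buf, len);
}
```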
Force-pushed from e455d9f to 8c2dfb3
git's crlf filter is streaming, so it's possible to keep git's semantics that way, though I don't know off-hand how much of a stream the rest is. Aborting might be doable if we can recognise that there's a single filter in effect, so we can replay the stream from a blob or file, but we don't have to worry about that right now.
As for the checkout stream being a `git_filter_stream` itself, …
```c
if ((error = mkpath2file(data, path, data->opts.dir_mode)) < 0)
	return error;
```

```c
struct checkout_stream {
	git_filter_stream base;
```
Just a nitpick, but we do have a preference for calling this `parent` rather than `base` (I think it's just the iterators which use this name).
Fixed.
Yep, I changed this to `parent`.
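For reference, the snippet above with the rename applied (anything beyond the `parent` member is assumed):

```c
struct checkout_stream {
	git_filter_stream parent;   /* was `base`; renamed per review */
	/* ... checkout-specific state ... */
};
```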
Force-pushed from e7f3531 to b49eddd
The existence of … I think we should extract … We should activate … Other than the buffer thing, the code looks pretty good. We should merge this in soon so we can see how the bindings can work with it.
Provide a convenience function that creates a buffer which can be handed to callers but will not be freed via `git_buf_free`, so the buffer's creator maintains the allocation lifecycle of the buffer's contents.
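A sketch of how such a constructor can work, assuming libgit2's internal `git_buf` convention that `asize == 0` marks memory the buffer does not own; the helper name here is hypothetical:

```c
/* Hypothetical helper: point a git_buf at caller-owned memory.  With
 * asize == 0 the buffer is marked as not owning `ptr`, so a later
 * git_buf_free leaves the contents for the creator to deallocate. */
static void buf_attach_notowned(git_buf *buf, const char *ptr, size_t size)
{
	buf->ptr = (char *)ptr;
	buf->size = size;
	buf->asize = 0;   /* zero allocated size == borrowed memory */
}
```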
For consistency with the rest of the library, where an opt is an options *structure*.
Refactor `git_filter_list__load_with_attr_reader` into `git_filter_list__load_ext`, which takes a `git_filter_options`.
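The shape this refactor implies, roughly; only the `git_filter_options` and `git_filter_list__load_ext` names come from the commit message, while the fields and parameter list are assumptions:

```c
#include <stdint.h>

/* Assumed options structure consumed by the internal loader. */
typedef struct {
	unsigned int version;
	uint32_t flags;        /* replaces the old bare `opt` flag word */
} git_filter_options;

extern int git_filter_list__load_ext(
	git_filter_list **filters,
	git_repository *repo,
	git_blob *blob,
	const char *path,
	git_filter_mode_t mode,
	git_filter_options *opts);
```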
Force-pushed from 17f4374 to 9c9aa1b
Yeah, I zapped … The only really breaking change here is changing …
Introduce `GITTEST_INVASIVE_FS_STRUCTURE` for things that are invasive to your filesystem structure (like creating folders at your filesystem root) and `GITTEST_INVASIVE_FS_SIZE` for things that write lots of data.
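A test can then skip itself unless the user opts in; this sketch follows the style of libgit2's clar harness (the exact `cl_getenv`/`cl_skip` guard is my assumption):

```c
void test_filter_stream__big_file(void)
{
	/* Skip unless the user explicitly allows tests that write
	 * large amounts of data. */
	if (!cl_getenv("GITTEST_INVASIVE_FS_SIZE"))
		cl_skip();

	/* ... push a very large file through the filter chain ... */
}
```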
😍 I don't feel bad about breaking the API if we're making it more consistent. Unless someone has objections, I'm going to merge this soon.
This is an attempt at providing streaming functionality for filters. Generally speaking, the streaming interface is fairly straightforward: we set up a chain of filters such that each writes into the next filter, and the final filter will use the same interface to write into the actual writing target (something that puts it into the ODB or the working directory). A consumer will ask the `filter_list` for a stream, and then will simply write to it as it sees fit, and that filter will be responsible for doing a transformation and then passing the data along to the next stream.

I have updated the internals of the library (checkout, for example) to use the new streams. Filters using the existing `apply` API are still supported (we create a "proxy" stream that loads the entire data and then passes it to `apply`).

There is also a test that optionally (when the `GITTEST_INVASIVE_FILESYSTEM` environment variable is set) pushes a 500 MB file through filters that compress it into the ODB (and decompress it into the working directory) so that you can see that filters are operational without loading the entire file into RAM.

Two things I want to call out, though I'm certain that reviews will find more:

1. The checkout stream is a `git_filter_stream` itself, which is the terminal writer, not an actual pass-through stream. This is semantically a little weird, I think. It might make more sense to simply have another base with `write` and `close` semantics (and presumably `git_filter_stream` would extend this), though we would have to trade the additional layer of abstraction versus the naming. Feedback is welcome.
2. Filters now need to either handle writes in chunks of *n* bytes (which seems very reasonable) or have some ability to abort and rewind the filters (which seems difficult, to say the least).
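To make the consumer side concrete, a usage sketch: `git_filter_list_stream()` and its signature are assumptions for illustration, while the `write`/`close` members follow the semantics described above.

```c
#include <unistd.h>

/* Hypothetical consumer loop: ask the filter list for the head of the
 * stream chain, feed it chunks from a file descriptor, then close it. */
static int stream_file_through_filters(
	git_filter_list *filters, git_filter_stream *target, int fd)
{
	git_filter_stream *stream;
	char chunk[8192];
	ssize_t n;
	int error;

	/* Assumed constructor: wires each filter to write into the next,
	 * terminating in the caller-provided target writer. */
	if ((error = git_filter_list_stream(&stream, filters, target)) < 0)
		return error;

	while ((n = read(fd, chunk, sizeof(chunk))) > 0)
		if ((error = stream->write(stream, chunk, (size_t)n)) < 0)
			break;

	stream->close(stream);
	return error;
}
```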