Skip to content

Streaming filters #2911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Feb 19, 2015
Merged

Streaming filters #2911

merged 13 commits into from
Feb 19, 2015

Conversation

ethomson
Copy link
Member

This is an attempt at providing a streaming functionality for filters. Generally speaking, the streaming interface is fairly straightforward: we set up a chain of filters such that each writes into the next filter, and the final filter will use the same interface to write into the actual writing target (something that puts it into the ODB or the working directory). A consumer will ask the filter_list for a stream, and then will simply write to it as it sees fit, and that filter will be responsible for doing a transformation and then passing the data along to the next stream.

I have updated the internals of the library (checkout, for example) to use the new streams. Filters using the existing apply API are still supported (we create a "proxy" stream that loads the entire data and then passes it to apply).

There is also a test that optionally (when the GITTEST_INVASIVE_FILESYSTEM environment variable is set) pushes a 500 MB file through filters that compress it into the ODB (and decompress it into the working directory) so that you can see that filters are operational without loading the entire file into RAM.

Two things I want to call out, though I'm certain that reviews will find more:

  1. Right now the ultimate writer uses the same interface as the streaming filter. So checkout, for instance, passes a git_filter_stream itself, which is the terminal writer, not an actual pass-through stream. This is semantically a little weird, I think. It might make more sense to simply have another base with write and close semantics (and presumably git_filter_stream would extend this) though we would have to trade the additional layer of abstraction versus the naming. Feedback is welcome.
  2. I did not change the existing filters (ident, CR/LF) to stream. Streaming CR/LF filters would actually break the existing semantics, as they load the entire file to do the stats. This means that if you had poorly configured your .gitattributes then you might try to CR/LF filter some giant file and we would OOM. We could either: change the CR/LF filter to only perform stats on the first n bytes (which seems very reasonable) or have some ability to abort and rewind the filters (which seems difficult, to say the least.)

Edward Thomson and others added 5 commits February 17, 2015 02:19
Add structures and preliminary functions to take a buffer, file or
blob and write the contents in chunks through an arbitrary number
of chained filters, finally writing into a user-provided function
accept the contents.
Migrate the `git_filter_list_apply_*` functions over to using the
new filter streams.
Use the new streaming filter API during checkout.
Let the filters use the checkout data's temporary buffer, instead
of having to allocate new buffers each time.
Test pushing a file on-disk into a streaming filter that compresses
it into the ODB, and inflates it back into the working directory.
@carlosmn
Copy link
Member

git's crlf filter is streaming, so it's possible to keep git's semantics that way, though I don't know off-hand how much of a stream the rest is. Aborting might be doable if we can recognise that there's a single filter in effect, so we can replay the stream from a blob or file, but we don't have to worry about that right now.

@carlosmn
Copy link
Member

As for the checkout stream being a filter_stream, this might just come down to naming. The interface you're defining there is just a WriteStream or WriteCloser (depending on your favourite language this week) so I don't see an issue with checkout defining a sink which writes out to disk with this interface. Maybe if you named it git_writestream it'd look just fine.

if ((error = mkpath2file(data, path, data->opts.dir_mode)) < 0)
return error;
struct checkout_stream {
git_filter_stream base;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nitpick, but we do have a preference for calling this 'parent', rather than base (I think it's just the iterators which use this name).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@ethomson
Copy link
Member Author

Yep, I changed this to git_writestream. I was thinking that there was something in git_filter_stream that was unique to filters. There's not. I dropped it in favor of git_writestream. Thanks!

@carlosmn
Copy link
Member

The existence of git_filter_list__set_temp_buf() seems a bit suspicious. Why is there a per-filte_list buffer which we set from just the one place? Can we instead pass it as an argument somewhere?

I think we should extract filter_list_out_buffer_from_raw() as git_buf_attach_nonowned() or a better name, as it's not about filters but making a git_buf point to memory which outlives it. We don't have to do it in this PR but there was another PR where the code does this manually and it would help readability if we could delegate to a function.

We should activate GITTEST_INVASIVE_FILESYSTEM on travis.

Other than the buffer thing, the code looks pretty good. We should merge this in soon so we can see how the bindings can work with it.

Edward Thomson added 4 commits February 19, 2015 10:05
Provide a convenience function that creates a buffer that can be provided
to callers but will not be freed via `git_buf_free`, so the buffer
creator maintains the allocation lifecycle of the buffer's contents.
For consistency with the rest of the library, where an opt is an
options *structure*.
Refactor `git_filter_list__load_with_attr_reader` into
`git_filter_list__load_ext`, which takes a `git_filter_options`.
@ethomson
Copy link
Member Author

Yeah, I zapped set_temp_buf and instead fleshed out a more libgit2-idiomatic filter_list__load_ext that takes a filter_options struct. (In doing so, I renamed git_libgit2_opt_t to git_libgit2_flag_t to better match our pattern of "options" being structs.)

The only really breaking change here is changing GIT_FILTER_OPT_DEFAULT -> GIT_FILTER_DEFAULT, which I can revert if that seems needlessly gratuitous.

Introduce GITTEST_INVASIVE_FS_STRUCTURE for things that are invasive
to your filesystem structure (like creating folders at your filesystem
root) and GITTEST_INVASIVE_FS_SIZE for things that write lots of data.
@carlosmn
Copy link
Member

😍

I don't feel bad about breaking the API if we're making it more consistent. Unless someone has objections, I'm going to merge this soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants