Streaming filters #2911

ethomson · 2015-02-17T21:40:24Z

This is an attempt at providing streaming functionality for filters. Generally speaking, the streaming interface is fairly straightforward: we set up a chain of filters such that each writes into the next filter, and the final filter uses the same interface to write into the actual writing target (something that puts it into the ODB or the working directory). A consumer asks the filter_list for a stream and then simply writes to it as it sees fit; each filter is responsible for doing its transformation and then passing the data along to the next stream.
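As a rough illustration of the chaining described above, here is a minimal, self-contained sketch. The names (`sketch_stream`, `upper_stream`, `memory_sink`) are hypothetical and are not the types added by this PR; the sketch only assumes an interface with `write` and `close` callbacks, and uses an upper-casing transformation as a stand-in for a real filter.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical write/close interface; not libgit2's actual declaration. */
typedef struct sketch_stream sketch_stream;
struct sketch_stream {
	int (*write)(sketch_stream *s, const char *data, size_t len);
	int (*close)(sketch_stream *s);
};

/* A filter stream: transforms each chunk and forwards it to `next`. */
typedef struct {
	sketch_stream parent; /* first member, so the casts below are valid */
	sketch_stream *next;
} upper_stream;

static int upper_write(sketch_stream *s, const char *data, size_t len)
{
	upper_stream *u = (upper_stream *)s;
	char chunk[64];

	/* Work in small fixed-size pieces; the whole input is never held. */
	while (len > 0) {
		size_t i, n = len < sizeof(chunk) ? len : sizeof(chunk);
		int error;

		for (i = 0; i < n; i++)
			chunk[i] = (char)toupper((unsigned char)data[i]);
		if ((error = u->next->write(u->next, chunk, n)) < 0)
			return error;
		data += n;
		len -= n;
	}
	return 0;
}

static int upper_close(sketch_stream *s)
{
	upper_stream *u = (upper_stream *)s;
	return u->next->close(u->next); /* closing propagates down the chain */
}

void upper_stream_init(upper_stream *u, sketch_stream *next)
{
	u->parent.write = upper_write;
	u->parent.close = upper_close;
	u->next = next;
}

/* Terminal writer: same interface, but it stores data instead of forwarding. */
typedef struct {
	sketch_stream parent;
	char data[256];
	size_t len;
	int closed;
} memory_sink;

static int sink_write(sketch_stream *s, const char *data, size_t len)
{
	memory_sink *m = (memory_sink *)s;

	if (len > sizeof(m->data) - m->len)
		return -1;
	memcpy(m->data + m->len, data, len);
	m->len += len;
	return 0;
}

static int sink_close(sketch_stream *s)
{
	((memory_sink *)s)->closed = 1;
	return 0;
}

void memory_sink_init(memory_sink *m)
{
	m->parent.write = sink_write;
	m->parent.close = sink_close;
	m->len = 0;
	m->closed = 0;
}
```

A caller would initialize the sink first, then each filter back to front (each one pointing at its successor), and finally write to and close only the head of the chain.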

I have updated the internals of the library (checkout, for example) to use the new streams. Filters using the existing apply API are still supported (we create a "proxy" stream that loads the entire data and then passes it to apply).
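The "proxy" idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `sketch_stream`, `proxy_stream`, `apply_fn` and `record_sink` are hypothetical names, and the whole-buffer filter is assumed to transform its input in place without changing its length. Byte reversal is used as the example because it cannot be done without seeing all of the input.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical write/close interface; not libgit2's actual declaration. */
typedef struct sketch_stream sketch_stream;
struct sketch_stream {
	int (*write)(sketch_stream *s, const char *data, size_t len);
	int (*close)(sketch_stream *s);
};

/* Hypothetical whole-buffer filter: transforms `buf` in place. */
typedef int (*apply_fn)(char *buf, size_t len);

typedef struct {
	sketch_stream parent;
	sketch_stream *next;
	apply_fn apply;
	char *buf;
	size_t len, cap;
} proxy_stream;

/* write() only accumulates; nothing reaches `next` until close(). */
static int proxy_write(sketch_stream *s, const char *data, size_t len)
{
	proxy_stream *p = (proxy_stream *)s;

	if (len == 0)
		return 0;
	if (len > p->cap - p->len) {
		size_t cap = p->cap ? p->cap : 64;
		char *grown;

		while (cap - p->len < len)
			cap *= 2;
		if ((grown = realloc(p->buf, cap)) == NULL)
			return -1;
		p->buf = grown;
		p->cap = cap;
	}
	memcpy(p->buf + p->len, data, len);
	p->len += len;
	return 0;
}

/* close() runs the whole-buffer filter once and passes the result along. */
static int proxy_close(sketch_stream *s)
{
	proxy_stream *p = (proxy_stream *)s;
	int error = 0;

	if (p->len > 0 && (error = p->apply(p->buf, p->len)) == 0)
		error = p->next->write(p->next, p->buf, p->len);
	free(p->buf);
	p->buf = NULL;
	p->len = p->cap = 0;
	return error < 0 ? error : p->next->close(p->next);
}

void proxy_stream_init(proxy_stream *p, sketch_stream *next, apply_fn apply)
{
	p->parent.write = proxy_write;
	p->parent.close = proxy_close;
	p->next = next;
	p->apply = apply;
	p->buf = NULL;
	p->len = p->cap = 0;
}

/* Example whole-buffer transformation: reverse the bytes. */
int reverse_bytes(char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len / 2; i++) {
		char tmp = buf[i];
		buf[i] = buf[len - 1 - i];
		buf[len - 1 - i] = tmp;
	}
	return 0;
}

/* Recording sink, used only to observe what the proxy emits. */
typedef struct {
	sketch_stream parent;
	char data[128];
	size_t len;
	int writes, closed;
} record_sink;

static int record_write(sketch_stream *s, const char *data, size_t len)
{
	record_sink *r = (record_sink *)s;

	if (len > sizeof(r->data) - r->len)
		return -1;
	memcpy(r->data + r->len, data, len);
	r->len += len;
	r->writes++;
	return 0;
}

static int record_close(sketch_stream *s)
{
	((record_sink *)s)->closed = 1;
	return 0;
}

void record_sink_init(record_sink *r)
{
	r->parent.write = record_write;
	r->parent.close = record_close;
	r->len = 0;
	r->writes = r->closed = 0;
}
```

The trade-off is visible in the sketch: the proxy keeps old-style filters working inside a chain, but any chain that contains one still buffers the entire content at that point.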

There is also a test that optionally (when the GITTEST_INVASIVE_FILESYSTEM environment variable is set) pushes a 500 MB file through filters that compress it into the ODB (and decompress it into the working directory) so that you can see that filters are operational without loading the entire file into RAM.
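To make the memory claim concrete, here is a small sketch of feeding an on-disk file into a stream in fixed-size chunks. It is not the test from this PR: `stream_file`, `counting_sink` and the 4096-byte chunk size are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical write/close interface; not libgit2's actual declaration. */
typedef struct sketch_stream sketch_stream;
struct sketch_stream {
	int (*write)(sketch_stream *s, const char *data, size_t len);
	int (*close)(sketch_stream *s);
};

#define SKETCH_CHUNK_SIZE 4096 /* arbitrary; only one chunk is in memory */

/* Read `in` chunk by chunk, write each chunk to `target`, then close it. */
int stream_file(FILE *in, sketch_stream *target)
{
	char chunk[SKETCH_CHUNK_SIZE];
	size_t n;
	int error = 0;

	while ((n = fread(chunk, 1, sizeof(chunk), in)) > 0) {
		if ((error = target->write(target, chunk, n)) < 0)
			break;
	}
	if (error == 0 && ferror(in))
		error = -1;
	return error < 0 ? error : target->close(target);
}

/* Sink that only counts, to show how the data arrived. */
typedef struct {
	sketch_stream parent;
	size_t total, writes, largest;
	int closed;
} counting_sink;

static int counting_write(sketch_stream *s, const char *data, size_t len)
{
	counting_sink *c = (counting_sink *)s;

	(void)data;
	c->total += len;
	c->writes++;
	if (len > c->largest)
		c->largest = len;
	return 0;
}

static int counting_close(sketch_stream *s)
{
	((counting_sink *)s)->closed = 1;
	return 0;
}

void counting_sink_init(counting_sink *c)
{
	c->parent.write = counting_write;
	c->parent.close = counting_close;
	c->total = c->writes = c->largest = 0;
	c->closed = 0;
}
```

However large the input file is, the largest single write the sink sees is bounded by the chunk size, which is the property the large-file test is meant to demonstrate.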

Two things I want to call out, though I'm certain that reviews will find more:

- Right now the ultimate writer uses the same interface as the streaming filter. So checkout, for instance, passes a git_filter_stream itself, which is the terminal writer, not an actual pass-through stream. This is semantically a little weird, I think. It might make more sense to simply have another base with write and close semantics (and presumably git_filter_stream would extend this), though we would have to weigh the additional layer of abstraction against the naming. Feedback is welcome.
- I did not change the existing filters (ident, CR/LF) to stream. Streaming CR/LF filters would actually break the existing semantics, as they load the entire file to do the stats. This means that if you had poorly configured your .gitattributes, you might try to CR/LF filter some giant file and we would OOM. We could either change the CR/LF filter to only perform stats on the first n bytes (which seems very reasonable) or have some ability to abort and rewind the filters (which seems difficult, to say the least).
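The "stats on the first n bytes" option could look roughly like the sketch below. It is only an illustration: `text_stats`, `gather_prefix_stats` and `looks_binary` are hypothetical names, and the "any NUL byte means binary" rule is an assumed heuristic, not the logic of the existing CR/LF filter.

```c
#include <stddef.h>

/* Counts gathered from the inspected prefix only. */
typedef struct {
	size_t cr;   /* '\r' bytes, including those that start a CRLF pair */
	size_t lf;   /* '\n' bytes, including those that end a CRLF pair */
	size_t crlf; /* "\r\n" pairs fully inside the prefix */
	size_t nul;  /* '\0' bytes */
} text_stats;

/* Gather statistics from at most `limit` bytes of `data`. */
void gather_prefix_stats(text_stats *st, const char *data, size_t len, size_t limit)
{
	size_t i, n = len < limit ? len : limit;

	st->cr = st->lf = st->crlf = st->nul = 0;

	for (i = 0; i < n; i++) {
		switch (data[i]) {
		case '\r':
			st->cr++;
			if (i + 1 < n && data[i + 1] == '\n')
				st->crlf++;
			break;
		case '\n':
			st->lf++;
			break;
		case '\0':
			st->nul++;
			break;
		default:
			break;
		}
	}
}

/* Assumed heuristic: a NUL byte in the prefix marks the content as binary. */
int looks_binary(const text_stats *st)
{
	return st->nul > 0;
}
```

With a bounded prefix, the decision whether to convert line endings can be made after the first chunk and the rest of the file can then be streamed. The cost is that anything past the limit (a late NUL byte, for example) no longer influences the decision.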

carlosmn · 2015-02-18T06:05:42Z

git's crlf filter is streaming, so it's possible to keep git's semantics that way, though I don't know off-hand how much of a stream the rest is. Aborting might be doable if we can recognise that there's a single filter in effect, so we can replay the stream from a blob or file, but we don't have to worry about that right now.

carlosmn · 2015-02-18T06:15:40Z

As for the checkout stream being a filter_stream, this might just come down to naming. The interface you're defining there is just a WriteStream or WriteCloser (depending on your favourite language this week) so I don't see an issue with checkout defining a sink which writes out to disk with this interface. Maybe if you named it git_writestream it'd look just fine.
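A sink of that kind might be sketched as below. This is an assumption-laden illustration rather than the checkout code: `sketch_writestream` and `file_sink` are hypothetical names, and a `FILE *` is used in place of a raw file descriptor to keep the example portable.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical writer interface with write and close semantics. */
typedef struct sketch_writestream sketch_writestream;
struct sketch_writestream {
	int (*write)(sketch_writestream *s, const char *data, size_t len);
	int (*close)(sketch_writestream *s);
};

/* Terminal sink: everything written to it goes straight to a file. */
typedef struct {
	sketch_writestream parent;
	FILE *fp;
	int owns_fp; /* when nonzero, close() also closes the file */
} file_sink;

static int file_sink_write(sketch_writestream *s, const char *data, size_t len)
{
	file_sink *f = (file_sink *)s;

	return fwrite(data, 1, len, f->fp) == len ? 0 : -1;
}

static int file_sink_close(sketch_writestream *s)
{
	file_sink *f = (file_sink *)s;
	int error = fflush(f->fp) == 0 ? 0 : -1;

	if (f->owns_fp) {
		if (fclose(f->fp) != 0)
			error = -1;
		f->fp = NULL;
	}
	return error;
}

void file_sink_init(file_sink *f, FILE *fp, int owns_fp)
{
	f->parent.write = file_sink_write;
	f->parent.close = file_sink_close;
	f->fp = fp;
	f->owns_fp = owns_fp;
}
```

Nothing in the sink refers to filters, which is the point being made: a filter chain can end in it, but so could any other producer that only needs to write and close.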

carlosmn · 2015-02-18T08:45:11Z

src/checkout.c

-	if ((error = mkpath2file(data, path, data->opts.dir_mode)) < 0)
-		return error;
+struct checkout_stream {
+	git_filter_stream base;


Just a nitpick, but we do have a preference for calling this 'parent', rather than base (I think it's just the iterators which use this name).

ethomson · 2015-02-18T14:49:19Z

Yep, I changed this to git_writestream. I was thinking that there was something in git_filter_stream that was unique to filters. There's not. I dropped it in favor of git_writestream. Thanks!

carlosmn · 2015-02-19T11:19:23Z

The existence of git_filter_list__set_temp_buf() seems a bit suspicious. Why is there a per-filter_list buffer which we set from just the one place? Can we instead pass it as an argument somewhere?

I think we should extract filter_list_out_buffer_from_raw() as git_buf_attach_nonowned() or a better name, as it's not about filters but making a git_buf point to memory which outlives it. We don't have to do it in this PR but there was another PR where the code does this manually and it would help readability if we could delegate to a function.
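The idea of a buffer that points at memory it does not own can be sketched like this. The struct and function names (`sketch_buf`, `sketch_buf_attach_notowned`, and so on) are hypothetical, and marking a borrowed buffer with an allocated size of zero is an assumption made for the sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer type for illustration only. */
typedef struct {
	char *ptr;
	size_t asize; /* bytes allocated by the buffer; 0 means "not owned" */
	size_t size;  /* bytes of content */
} sketch_buf;

/* Point the buffer at memory that outlives it; nothing is copied. */
void sketch_buf_attach_notowned(sketch_buf *buf, const char *ptr, size_t size)
{
	buf->ptr = (char *)ptr;
	buf->asize = 0;
	buf->size = size;
}

/* Give the buffer its own heap copy of `data`. */
int sketch_buf_set(sketch_buf *buf, const char *data, size_t size)
{
	char *copy = malloc(size + 1);

	if (copy == NULL)
		return -1;
	memcpy(copy, data, size);
	copy[size] = '\0';
	buf->ptr = copy;
	buf->asize = size + 1;
	buf->size = size;
	return 0;
}

/* Free only what the buffer allocated itself; borrowed memory is untouched. */
void sketch_buf_free(sketch_buf *buf)
{
	if (buf->asize > 0)
		free(buf->ptr);
	buf->ptr = NULL;
	buf->asize = 0;
	buf->size = 0;
}
```

Because the free function checks ownership, callers can treat owned and borrowed buffers the same way, while whoever created the underlying memory stays responsible for its lifetime.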

We should activate GITTEST_INVASIVE_FILESYSTEM on travis.

Other than the buffer thing, the code looks pretty good. We should merge this in soon so we can see how the bindings can work with it.

ethomson · 2015-02-19T17:08:15Z

Yeah, I zapped set_temp_buf and instead fleshed out a more libgit2-idiomatic filter_list__load_ext that takes a filter_options struct. (In doing so, I renamed git_filter_opt_t to git_filter_flag_t to better match our pattern of "options" being structs.)

The only really breaking change here is changing GIT_FILTER_OPT_DEFAULT -> GIT_FILTER_DEFAULT, which I can revert if that seems needlessly gratuitous.

carlosmn · 2015-02-19T17:53:29Z

😍

I don't feel bad about breaking the API if we're making it more consistent. Unless someone has objections, I'm going to merge this soon.

Edward Thomson and others added 5 commits February 17, 2015 02:19

filters: introduce streaming filters

fbdc9db

Add structures and preliminary functions to take a buffer, file or blob and write the contents in chunks through an arbitrary number of chained filters, finally writing into a user-provided function that accepts the contents.

filters: stream internally

5555696

Migrate the `git_filter_list_apply_*` functions over to using the new filter streams.

checkout: stream the blob into the filters

e78f5c9

Use the new streaming filter API during checkout.

checkout: maintain temporary buffer for filters

646364e

Let the filters use the checkout data's temporary buffer, instead of having to allocate new buffers each time.

filter: test a large file through the stream

8c2dfb3

Test pushing a file on-disk into a streaming filter that compresses it into the ODB, and inflates it back into the working directory.

ethomson force-pushed the streaming_filters branch from e455d9f to 8c2dfb3 on February 17, 2015 22:01

carlosmn reviewed Feb 18, 2015

Edward Thomson added 3 commits February 18, 2015 10:24

git_writestream: from git_filter_stream

b75f15a

filter streams: base -> parent

f7c0125

checkout: let the stream writer close the fd

b49eddd

ethomson force-pushed the streaming_filters branch from e7f3531 to b49eddd on February 18, 2015 15:31

Edward Thomson added 4 commits February 19, 2015 10:05

buffer: introduce git_buf_attach_notowned

d4cf167

Provide a convenience function that creates a buffer that can be provided to callers but will not be freed via `git_buf_free`, so the buffer creator maintains the allocation lifecycle of the buffer's contents.

git_filter_opt_t -> git_filter_flag_t

795eacc

For consistency with the rest of the library, where an opt is an options *structure*.

filter: add git_filter_list__load_ext

d05218b

Refactor `git_filter_list__load_with_attr_reader` into `git_filter_list__load_ext`, which takes a `git_filter_options`.

filter: take temp_buf in git_filter_options

9c9aa1b

ethomson force-pushed the streaming_filters branch from 17f4374 to 9c9aa1b on February 19, 2015 16:46

tests: separate INVASIVE filesystem tests

feb0e02

Introduce GITTEST_INVASIVE_FS_STRUCTURE for things that are invasive to your filesystem structure (like creating folders at your filesystem root) and GITTEST_INVASIVE_FS_SIZE for things that write lots of data.

carlosmn added a commit that referenced this pull request Feb 19, 2015

Merge pull request #2911 from ethomson/streaming_filters

d15884c

Streaming filters

carlosmn merged commit d15884c into libgit2:master Feb 19, 2015

nulltoken mentioned this pull request Feb 19, 2015

buffer: Expose git_buf_put() to bindings #2892

Closed

This was referenced Feb 20, 2015

Allow custom filters to directly load working directory data #2786

Closed

Filter API should support streaming #2757

Closed

shiftkey mentioned this pull request Apr 20, 2015

Filter Support Redux libgit2/libgit2sharp#1025

Closed

nulltoken mentioned this pull request May 8, 2015

Streaming Filter Support libgit2/libgit2sharp#1030

Merged
