Rename PackBuilder -> PackDefinition, improve PackOptions, introduce WriteTo #1213


Closed
wants to merge 8 commits into from

Conversation

kevin-david
Contributor

The options make sense to me as immutable, and a struct required one less null check :)

The idea here is to enable writing a pack to an arbitrary stream without the additional overhead of indexing - looking at libgit2, there's some unnecessary work done if you're building a packfile to send to a remote server, for example.

Thanks to @whoisj for the guidance on interop with libgit2

/cc @AMSadek

Consequently, add a new PackBuilderOptions overload which takes an output stream
@whoisj

whoisj commented Oct 24, 2015

👍 ✨ thanks @kevin-david options as a struct has been bugging me 😉

@nulltoken
Member

@kevin-david Thanks for this. A few comments below:

  • I would tend to agree with @dahlbyk in PackBuilder Fixup #1205 (comment). Unless there's a strong case where this optimization adds a lot of value, I'd prefer to stick with a class and favor consistency.
  • I feel there's some kind of discrepancy between the new overload accepting a stream and the original one. Let me try to further explain my thoughts:
    • A directory is a link to a place where several pack files could be stored, whereas a stream would only allow the user to create one.
    • We're not settled yet on the API, and there's a potential pending use case (see PackBuilder Fixup #1205 (comment)) where one would like to make the packing process distribute the output into several different new pack files. Making the ctor accept only one stream would actually hinder this.

One potential approach to work around this would be to change the new ctor to accept a Func<Stream> writeableStreamsProvider. This provider could be invoked many times and would be responsible for handing the caller a new stream each time it's invoked.
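A minimal sketch of how such a provider might be consumed; `PackWriterSketch`, `WritePacks`, and `packCount` are hypothetical names for illustration, not LibGit2Sharp API:

```csharp
using System;
using System.IO;

// The caller hands out a fresh stream per invocation; here, in-memory streams:
int packsStarted = 0;
PackWriterSketch.WritePacks(() => { packsStarted++; return new MemoryStream(); }, 3);
Console.WriteLine(packsStarted); // prints 3

static class PackWriterSketch
{
    // The builder asks the provider for a new stream each time it starts a new
    // pack, and disposes each stream it receives once that pack is written.
    public static void WritePacks(Func<Stream> writeableStreamsProvider, int packCount)
    {
        for (int i = 0; i < packCount; i++)
        {
            using (Stream s = writeableStreamsProvider())
            {
                // ... write one pack's bytes into s ...
            }
        }
    }
}
```

A file-based caller could likewise hand out one file per invocation, e.g. `int n = 0; PackWriterSketch.WritePacks(() => File.Create("pack-" + n++ + ".pack"), 3);`.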

Additional question: would it also make sense to write the generated index into a different Stream?

Thoughts?

/cc @carlosmn

@AMSadek
Contributor

AMSadek commented Oct 26, 2015

@kevin-david there's a potential problem here when converting PackBuilderOptions to a struct: it will have a default constructor of its own, setting the member values to null (or default values, more precisely).
The problem will show up in

// A valid stream or path is guaranteed by the PackBuilderOptions constructor
if (options.OutputPackStream != null)
{
    builder.WriteTo(options.OutputPackStream);
}
else
{
    builder.Write(options.PackDirectory);
}

Maybe both of them are null due to creating a PackBuilderOptions with its default constructor.

My suggestion is either to check again here, or to keep it as a class with no default constructor :)
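To illustrate the concern (the struct here is an illustrative stand-in, not the real PackBuilderOptions): a C# struct always has an implicit parameterless constructor, so `default(T)` sidesteps any validation a custom constructor performs.

```csharp
using System;
using System.IO;

// default(T) skips the validating constructor entirely,
// so both members end up null despite the ctor guard:
var opts = default(PackBuilderOptionsStruct);
Console.WriteLine(opts.PackDirectory == null && opts.OutputPackStream == null); // prints True

struct PackBuilderOptionsStruct
{
    public string PackDirectory;
    public Stream OutputPackStream;

    // A validating constructor cannot prevent default instances of a struct.
    public PackBuilderOptionsStruct(string packDirectory)
    {
        if (packDirectory == null) throw new ArgumentNullException(nameof(packDirectory));
        PackDirectory = packDirectory;
        OutputPackStream = null;
    }
}
```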

@@ -695,8 +694,15 @@ private PackBuilderResults InternalPack(PackBuilderOptions options, Action<PackB
// call the provided action
packDelegate(builder);

// writing the pack and index files
builder.Write(options.PackDirectoryPath);
// A valid stream or path is guaranteed by the PackBuilderOptions constructor
Contributor

Not necessarily, see my comment.

@kevin-david
Contributor Author

OK, back to a class we go. I'll fix up the commit message once I rebase at the end. I still think it makes sense to make PackBuilderOptions immutable (and I didn't hear any objections 😄), but I can drop the commit entirely if need be.

@nulltoken I see what you mean about a single Stream being restrictive, but am wondering how a consumer would take advantage of a Func<Stream> writeableStreamsProvider. You said:

This provider could be invoked many times and would be responsible for handing the caller a new stream each time it's invoked.

How could we determine when the writeableStreamsProvider is invoked?

I did some playing around and it wasn't clear to me how we could do that without extracting PackBuilder.Write out into whatever packDelegate is provided - and even then it doesn't seem obvious to me (as a caller) when I should call it. Also, how would the underlying Streams get disposed by such a function?

@kevin-david
Contributor Author

Also, re:

Additional question: Would that make sense to also write the generated index into a different Stream?

I definitely agree - and null would be an easy way to say "no indexing" - but I didn't see an easy way to make this happen using the existing libgit2 interface. If I missed something, let me know.

@dahlbyk
Member

dahlbyk commented Oct 26, 2015

I still think it makes sense to make PackBuilderOptions immutable, but I can drop the commit entirely if need be.

The argument for mutable options simply comes down to API stability. Having moved away from optional parameters, making options immutable means we have to add a new constructor every time we expose a new option.

F# records are perfect for immutable Options classes, but we're in C# so we have to make do.

@kevin-david
Contributor Author

The argument for mutable options simply comes down to API stability. Having moved away from optional parameters, making options immutable means we have to add a new constructor every time we expose a new option.

As I was doing it, I was thinking what a pain it was to get rid of the optional parameters I added (which promptly broke the tests) and replace them with overloads. It was less code at first (before WriteTo), but I think it ended up being more afterwards because of the mutual exclusivity of packDirectory and the output Stream - however, if I wanted to keep it that way I'd need to add checks in the property setters like before anyway.

So, in general, is the pattern you guys follow taking "required" parameters in the constructor (packDir or outputStream) and things with sensible defaults (number of threads) as mutable properties with defaults? In that case, should packDirectory be readonly, or should everything be mutable for consistency?

It still seems weird to me that I can change the max number of threads or packDirectory in the middle of repacking - but in practice I doubt it will matter. I'm fairly new to the practice of avoiding breaking changes / maintaining API stability, so I completely understand if we'd rather stick to mutable options.

@dahlbyk
Member

dahlbyk commented Oct 27, 2015

So, in general, is the pattern you guys follow taking "required" parameters in the constructor (packDir or outputStream) and things with sensible defaults (number of threads) as mutable properties with defaults?

On further consideration, required parameters are typically provided in the method itself, leaving options as truly optional (and generally offering overloads with and without options). That wouldn't be a bad adjustment to make here, actually.

I wonder if we do have any cases of options types having a non-default ctor. This might be another useful meta test to enforce our "taste"? What do you think, @nulltoken?

In that case, should packDirectory be readonly, or should everything be mutable for consistency?

readonly, if they're left in PackBuilderOptions.
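A sketch of that shape (names here are assumptions, not the shipped API): the required value is readonly and set in the constructor, while optional knobs stay mutable with sensible defaults.

```csharp
using System;

var opts = new PackBuilderOptionsSketch("/tmp/packs") { MaximumNumberOfThreads = 4 };
Console.WriteLine(opts.PackDirectory); // prints /tmp/packs
// opts.PackDirectory = "elsewhere";   // would not compile: getter-only property

class PackBuilderOptionsSketch
{
    public string PackDirectory { get; }            // required; set once in the ctor
    public int MaximumNumberOfThreads { get; set; } // optional; 0 = let the library decide

    public PackBuilderOptionsSketch(string packDirectory)
    {
        if (packDirectory == null)
            throw new ArgumentNullException(nameof(packDirectory));
        PackDirectory = packDirectory;
    }
}
```

Adding a new optional property later is then non-breaking, which is the API-stability point @dahlbyk raises above.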

@whoisj

whoisj commented Oct 28, 2015

@dahlbyk 👍

@dahlbyk
Member

dahlbyk commented Oct 29, 2015

That wouldn't be a bad adjustment to make here, actually.

And actually, the Action<PackBuilder> is technically optional and so could be moved to PackBuilderOptions, reducing the overload count to two (stream + options and path + options).

@kevin-david
Contributor Author

OK, updated once again with two overloads and making PackBuilderOptions only have a default constructor. Let me know what you think.

I have a feeling this will end up being one commit in the end because separating them out will be extraordinarily painful :)

else
{
-    results = orgRepo.ObjectDatabase.Pack(packBuilderOptions);
+    results = orgRepo.ObjectDatabase.Pack(packDir, packBuilderOptions);
Member

Might as well move this up out of the if/else since the calls are now identical.

Contributor Author

Good call, oops.

@dahlbyk
Member

dahlbyk commented Nov 4, 2015

I like the move of the delegate to options, but PackDelegate doesn't feel right. My first thought for a name is PackBuilder, since you're essentially building the pack in the action, but that may be weird in relation to the PackBuilder type. It's a bigger change, but I'd almost advocate for renaming the type itself to PackDefinition, in imitation of TreeDefinition.

{
Assert.Throws<ArgumentNullException>(() =>
{
-    orgRepo.ObjectDatabase.Pack(null);
+    repo.ObjectDatabase.Pack(outputPackStream: null, options: new PackBuilderOptions());
Member

Could also use Pack(default(Stream), …) or Pack(default(string), …) to differentiate between the overloads. Not sure which is more readable.

@kevin-david
Contributor Author

I like your thinking!

Two new commits. One for small fixes, another for PackBuilder -> PackDefinition

@dahlbyk
Member

dahlbyk commented Nov 5, 2015

I like the removal of Builder elsewhere as well. Would it be difficult to drop Builder in the test variable names too, for consistency?

Also remove `Builder` from results and options.
Finish eradication of `PackBuilder` other than its use when providing an
Action<PackDefinition> to be performed during packing.
@kevin-david
Contributor Author

One day I'll learn to run the tests!

I think I got rid of everything PackBuilder now (except the Action<PackDefinition>), including in the tests.

@dahlbyk
Member

dahlbyk commented Nov 5, 2015

This works for me! @nulltoken?

@kevin-david kevin-david changed the title PackBuilder: convert Options to a struct, introduce WriteTo Rename PackBuilder -> PackDefinition, improve PackOptions, introduce WriteTo Nov 5, 2015
@nulltoken
Member

@dahlbyk @kevin-david I do ❤️ the way the code evolved! However, something is bothering me, and I'm starting to feel that @carlosmn was right from the beginning.

The current closure-based design compels #1216 to rely on very weird (and barely readable) code constructs.

For instance, it looks awfully complex (and without any real pros) to do one revwalk and split the objects into different packs.

Is it only me being overly moody or do you guys also feel some disturbance in the force? (cc @AMSadek)

@nulltoken
Member

In order to move this discussion forward, I can think of two different designs:

  • An autonomous PackBuilder type that would accept an IRepository in its constructor and expose WriteTo() overloads (one writing to a path, and another writing to Streams). This type would be IDisposable, and the user would be responsible for disposing it. It would live in a LibGit2Sharp.Advanced namespace.
  • An ObjectDatabase instance method which would return a PackBuilder. This type would also expose WriteTo() overloads, would also be IDisposable, and the user would be responsible for disposing it.

After having thought about it, I'm not sure we could go with a PackDefinition type. Unlike a Tree, a pack can be split (we may want to split it to keep the number of written objects more or less evenly distributed).

Just my $0.02... Thoughts?

@kevin-david
Contributor Author

I agree that the current design seems a bit funky - even looking at the tests, supplying a delegate to override the default packing behavior requires some trickery. I think both options you proposed would work better than the current solution and obviate the need to rename the PackBuilder type.

With my limited experience, it seems like the instance method hanging off of ObjectDatabase is more consistent with the rest of the implementation - how many other places do you allow users of the library to directly instantiate complex types, other than Repository?

On the other hand, someone might use a PackBuilder without ever using the ObjectDatabase at all, if they wanted to pack some arbitrary set of objects (maybe just one commit, for example) or had a custom backend of some kind that duplicated a lot of the behavior of ObjectDatabase.

@nulltoken
Member

With my limited experience, it seems like the instance method hanging off of ObjectDatabase is more consistent with the rest of the implementation - how many other places do you allow users of the library to directly instantiate complex types, other than Repository?

Only plain DTOs from the user perspective (Options, TreeDefinition, ...)

On the other hand, someone might use a PackBuilder without ever using the ObjectDatabase at all, if they wanted to pack some arbitrary set of objects (maybe just one commit, for example) or had a custom backend of some kind that duplicated a lot of the behavior of ObjectDatabase.

Well, they would still need a Repository instance, right? And the definition of custom backends belongs to an instance, doesn't it?

@kevin-david
Contributor Author

Only plain DTOs from the user perspective (Options, TreeDefinition, ...)

That sounds like a decent argument for the ObjectDatabase approach then, unless the new "advanced" namespace will have a different access pattern - did anyone have other "advanced" capabilities in mind?

Well, they would still need a Repository instance, right? And the definition of custom backends belongs to an instance, doesn't it?

Absolutely, and thinking about it more, AddBackend lives off of ObjectDatabase, so having the PackBuilder live there too seems logical.

@nulltoken
Member

That sounds like a decent argument for the ObjectDatabase approach then, unless the new "advanced" namespace will have a different access pattern - did anyone have other "advanced" capabilities in mind?

@carlosmn in #192

Absolutely, and thinking about it more: the AddBackend lives off of ObjectDatabase, so having the PackBuilder live there too seems logical.

Ok. I'll send a PR (unless someone beats me to it) to make the PackBuilder instantiating done through an ObjectDatabase method. We'll discuss then based on some real code.

@dahlbyk
Member

dahlbyk commented Nov 9, 2015

I think there are two separate concepts at play here:

  1. PackBuilder lifecycle - IDisposable vs closure. In my mind, these are equivalent:

    using (var pb = repo.ObjectDatabase.Pack()) { ... }
    repo.ObjectDatabase.Pack(pb => { ... });
  2. Exposing ad-hoc writes. As I understand it, the usability gap here is the lack of:

    PackResult WriteTo(string packDirectory);
    PackResult WriteTo(Stream outputPackStream);

Unless I'm missing something, addressing the latter does not require changing the former - we replace a write on Dispose()/return with explicit writes.

Now the real question is: what's the best API to handle the latter? The downside of explicit writes is that you get unexpected behavior if you forget to "flush", and we can't auto-write if we haven't been given a directory or stream. This is indeed Advanced, so we may be fine with that?

@nulltoken
Member

Now the real question is: what's the best API to handle the latter? The downside of explicit writes is that you get unexpected behavior if you forget to "flush", and we can't auto-write if we haven't been given a directory or stream. This is indeed Advanced, so we may be fine with that?

@dahlbyk There's also an additional concern: allowing an easy way to split a pack build among several pack files, based on some threshold of objects per pack file.

IOW the use case I'd like to be sure is covered would be something like:

  • Perform one revwalk
  • AddRecursively each object
  • Each time a certain number of objects has been written, flush the packfile to disk

@AMSadek used an example in which objects were split based on their types. I'm not sure if it's a real production use case or a somewhat contrived example. Were that a real need, I can't think of a user-friendly way to automate the splitting.

@dahlbyk
Member

dahlbyk commented Nov 11, 2015

The more I think about this, the more I'm realizing we aren't going to get around PackDefinition needing an internal git_packbuilder swap mechanism after a Write/WriteTo, otherwise the multi-pack scenario requires a strange interweaving of foreach and using that feels gross. Am I reading the LG2 code/docs correctly that a git_packbuilder is not meant to be reused?

Proposal:

  1. Return a PackResults from Write and WriteTo. (Aside: that should probably be PackResult for consistency with other result types.)

  2. Promote PackOptions.PackBuilder to a parameter:

    void Pack(Action<PackDefinition> packBuilder, PackOptions options)
    {
      using (var pd = new PackDefinition(repo)) {
        pd.Apply(options);
        packBuilder(pd);
      } // If unwritten changes, throw
    }
  3. Optional: preserve one or both pack-all convenience overloads:

    PackResults Pack(string packDirectory, PackOptions options)
    {
      PackResults results = new PackResults();
      Pack(pb =>
      {
        pb.Add(repo.ObjectDatabase); // I still think we should add this
        results = pb.Write(packDirectory);
      }, options);
      return results;
    }

Multi-pack usage would look something like...

repo.ObjectDatabase.Pack(pd =>
{
  var fileNum = 0;
  foreach (var o in repo.ObjectDatabase)
  {
    pd.AddRecursively(o);

    if (pd.ObjectsCount > 99) // Add could return `PackStatus` or equivalent instead
      pd.Write(".git/objects/pack" + (fileNum++));
  }

  if (pd.ObjectsCount > 0) // Is `Write()` a no-op or error if nothing added?
    pd.Write(".git/objects/pack" + fileNum);
}, null);

@AMSadek
Contributor

AMSadek commented Nov 12, 2015

Hey all, sorry for not following the conversation; I have a couple of questions here to catch the details I missed.

@nulltoken

it looks awfully complex (and without any real pros) to do one revwalk and split the objects to different packs.

I agree with you here; I don't see much benefit from introducing PackDefinition either as a replacement for PackBuilder or as a new type.

In order to move this discussion forward,I can think of two different designs:

So why is the current design no longer valid? Is it because of introducing .WriteTo(Stream s), or because of the use cases where we do not just write one pack file per repo?
If it's the former, I don't see the difference between writing to a stream or to a file. The way of creating PackOptions with a mutual choice of either a directory path or a stream would do it.

If it's the latter, I think this discussion should happen on the other PR #1216 so that other contributors can follow up on it ... (and not get confused like me :) ) .. but anyway,
I have a couple of tests on the other PR that do it with the current design. And maybe, yeah, it's not the most straightforward. But this would be the way to do it if we want to keep .Write and the constructors internal.

Otherwise, my best bet would be what @nulltoken proposed first =>

An autonomous PackBuilder type that would accept an IRepository in its constructor and expose WriteTo() overloads (one writing to a path, and another one writing to Streams).

and it should be as simple as the following for writing every 100 objects to a pack for a Repository repo:

using (PackBuilder packBuilder = new PackBuilder(repo))
{
    int currentObject = 0;
    foreach (GitObject gitObject in repo.ObjectDatabase)
    {
        packBuilder.Add(gitObject);
        if (++currentObject % 100 == 0) // pre-increment so the first write happens after 100 objects, not 1
        {
            results = packBuilder.Write(packDirectoryPath);
            packBuilder.Clear();
        }
    }
    if (!packBuilder.Empty())
    {
        results = packBuilder.Write(packDirectoryPath);
        packBuilder.Clear();
    }
}

where packBuilder.Clear(); clears the type, maybe simply by just releasing its internal handle and creating a new one.

@AMSadek
Contributor

AMSadek commented Nov 12, 2015

@nulltoken

@AMSadek used an example in which objects were split based on their types. I'm not sure if it's a real production use case or a somewhat contrived example. Were that a real need, I can't think of a user-friendly way to automate the splitting.

Well, I don't have a real-world case in mind right now, but the way I see the whole pack usage is:
Pack: a process that takes some git objects as input and writes them to a pack file.
It doesn't have to be all the objects at once, to a single file, nor at the same time. I just see it as simple as "Hey, put objects x, y, z in a file. Write the file. Terminate."
That enables me (the API user) to create pack files depending on type, count, number of bytes written, etc., and in any order: creation, topological, by a commit filter or revwalk, etc.

I can't think of a user-friendly way to automate the splitting.

I wouldn't worry much about that; the splitting criteria will be determined by the API user, and we can't automate all possible scenarios for choosing objects to pack.
I would rather provide a simple way of choosing objects to be packed and writing them.

I would like to hear what you guys think of that :)

@whoisj

whoisj commented Nov 17, 2015

I do have a real-world case for creating packs with one object type. In the case of data mining repos, it is far more efficient to feed just the commit data to the parse engine. The data miner has no need for blob or tree data... well, maybe tree data, but absolutely not blobs.

Also, the write-to-file API is weird to me. Why ask a pack builder object to understand the file system? WriteTo(Stream) is more elegant and SOLID.

@AMSadek
Contributor

AMSadek commented Nov 17, 2015

@whoisj

Why ask a pack builder object to understand the file system?

What is meant by "understand the file system"? Is that because we are passing a directory path to write files to? ... I can see that in other places in the API, file paths are passed in a similar way. Do you suggest a better way to write to files?

Also, the write-to-file API is weird to me. WriteTo(Stream) is more elegant and SOLID.

Are you suggesting dropping the write-to-file from the API?
Well, two things: first, I assume it's faster if we're going to write the pack file to disk at the end anyway, because it's native. Second, WriteTo(Stream) doesn't write the index (.idx) file, only the pack file (.pack).

@whoisj

whoisj commented Nov 17, 2015

@AMSadek

What is meant by "understand the file system"?

I'm a strong proponent of the SOLID design principles, which imply that PackBuilder should build packs, not write to file systems. I get that the API is potentially easier to consume/understand if there's a WriteTo(string) entry point, but the ability to mount, parse, and understand a file system seems outside the expected behavior of an object designed to pack git blob deltas into a single file. The necessary logic for testing directory and file existence seems foreign here - though, again, I understand the natural tendency to place it here.

Are you suggesting dropping the write-to-file from the API?

Honestly, I would rather have the Odb itself take care of writing the pack file back to wherever the Odb wants to put these kinds of things. To assume the Odb is inside the .git/objects folder is to assume too much; even to assume the Odb is represented as the layout on disk at all is likely a bad assumption in some cases.

I would much rather see the pack file reference passed back to the Odb for managing.

Again, I do not own this API - @nulltoken is the best arbiter of these kinds of things.

@AMSadek
Contributor

AMSadek commented Nov 17, 2015

@whoisj
What do you think of
PackDefinition Pack(Action<PackBuilder> buildAction, PackOptions options) so that

  • buildAction does the selection of the objects to be packed. The Pack method itself will own the handle of the PackBuilder and dispose of it by the end of the call.
  • PackDefinition will have the overloads of the .WriteTo methods.
  • Add the overload PackDefinition PackAll(PackOptions options) to pack all objects.

For the disposal of PackDefinition, we can either

  • Leave it to the API user to call Dispose() after they call .WriteTo, or use a using () {} block,
  • Or release the internal handle ourselves by the end of the .WriteTo call, making the PackDefinition usable only once, and throw an exception if .WriteTo is called again.

Pros of the first: we could write the same PackDefinition instance multiple times or to multiple streams.
Pros of the second: the usage would be easier and look nicer.

So yeah, just thinking out loud :)!
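A sketch of the second option above (all names here are hypothetical, not existing LibGit2Sharp types): WriteTo releases the native handle, making the instance single-use, and a second call throws.

```csharp
using System;
using System.IO;

var pd = new PackDefinitionSketch();
pd.WriteTo(new MemoryStream());       // first write succeeds
try { pd.WriteTo(new MemoryStream()); }
catch (InvalidOperationException) { Console.WriteLine("second write throws"); }

class PackDefinitionSketch
{
    bool written;

    // Option 2: WriteTo frees the (imagined) native handle on completion,
    // so the instance is usable only once.
    public void WriteTo(Stream output)
    {
        if (written)
            throw new InvalidOperationException("Pack already written; handle released.");
        // ... stream the pack bytes into output, then free the native handle ...
        written = true;
    }
}
```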

@carlosmn
Member

Honestly, I would rather have the Odb itself take care of writing the pack file back to where ever the Odb wants to put these kind of things.

We do have this in libgit2 proper, precisely because the .pack+.idx is just the default. An odb backend can provide a function to call to generate whatever it uses to accept data from the network (the default packed backend just gives you something with which you call the indexer).

While the core of the packbuilder is the foreach, when doing C it's not as convenient to pipe its output into an indexer, so we provide a function for this common use case. But having one shouldn't detract from having the other. If you want to write into a repository, you should be passing the stream into whatever the object db gave you, instead of assuming you know what storage format it's using. This is where a WriteTo(ObjectDatabase) or WriteTo(Repository) method would be useful: to grab the indexer for the particular odb on behalf of the user and push the stuff there. This might even make sense to include in libgit2 itself.

Even with such a method, a WriteTo(Stream) is still independently useful. If we can get the indexer to expose a Stream, then we can also claim that packBuilder.WriteTo(new Indexer("some/dir")) is the OO way to do this (but you'd have to be careful not to pay the marshalling cost for the data buffer twice).
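A rough sketch of the shape being described, under the assumption that the odb's indexer can be surfaced as a writable Stream; the interface and class names here are hypothetical (libgit2's git_odb_write_pack is the real entry point being alluded to):

```csharp
using System;
using System.IO;

var odb = new MemoryOdbSketch();
PackStreamSketch.WriteTo(new MemoryStream(new byte[] { 1, 2, 3 }), odb);
Console.WriteLine(odb.Received.ToArray().Length); // prints 3

// The odb backend hands back a writable "indexer" stream; the pack builder pushes
// raw pack bytes into it and never assumes a .pack/.idx layout on disk.
interface IOdbPackWriterSketch
{
    Stream OpenPackWriter();
}

static class PackStreamSketch
{
    public static void WriteTo(Stream packData, IOdbPackWriterSketch odb)
    {
        using (Stream indexer = odb.OpenPackWriter())
        {
            packData.CopyTo(indexer); // the indexer consumes the bytes as they arrive
        }
    }
}

// Test double standing in for an odb backend that buffers what it receives.
class MemoryOdbSketch : IOdbPackWriterSketch
{
    public MemoryStream Received = new MemoryStream();
    public Stream OpenPackWriter() => Received;
}
```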

@dahlbyk
Member

dahlbyk commented Nov 18, 2015

I just want to thank everyone here for reminding me how much I don't know about Git.

@whoisj

whoisj commented Nov 18, 2015

@AMSadek I think I can live with that.

For the disposal of PackDefinition, we can either

Personally, I prefer using the disposable pattern. However, this library has generally avoided its use, so terminating on .WriteTo makes sense ~ kind of. What happens if WriteTo is never called? How does the managed layer release its native handles?

Finally, if WriteTo does release the handles and (effectively) disposes the object, the method name might be better served as Close or Finalize or similar.

@dahlbyk
Member

dahlbyk commented Nov 18, 2015

Unless I'm missing something, WriteTo not being called is equivalently challenging for the IDisposable and Action patterns. Either we specify up front where we're writing and flush when "done" (Dispose() or Action returns), or we require explicit writes (c.f. #1213 (comment)).

@whoisj

whoisj commented Nov 18, 2015

Having the PackBuilder stream out as it writes makes just as much sense as anything, and is resilient to cleanup burdens as well. The "natural" C# convention is to do this at Dispose, but IO operations in a Dispose call seem dangerous, and LibGit2Sharp traditionally avoids IDisposable.

@AMSadek
Contributor

AMSadek commented Nov 21, 2015

I've added a new commit to the original PR #1216 thread.

@whoisj

What happens if WriteTo is never called? How does the managed layer release its native handles?

You're right, it's gonna be tricky this way. I gave up on this idea in my latest commit.

@dahlbyk

Unless I'm missing something, WriteTo not being called is equivalently challenging for the IDisposable and Action patterns.

I agree with you on that.

Either we specify up front where we're writing and flush when "done" (Dispose() or Action returns), or we require explicit writes

With the flexibility we give to write multiple packs and to different types of streams, I think allowing explicit write and disposal will be more convenient.

@bording bording closed this Apr 21, 2019