Rename PackBuilder -> PackDefinition, improve PackOptions, introduce WriteTo #1213
Conversation
Consequently, add new PackBuilderOptions overload which takes an output stream
👍 ✨ thanks @kevin-david … options as a …
@kevin-david Thanks for this. A few comments below.
One potential approach to work around this would be to change the new ctor to accept a … Additional question: would it also make sense to write the generated index into a different Stream? Thoughts? /cc @carlosmn
@kevin-david there's a potential problem here when converting …
Maybe both of them are null due to creating an object of … My suggestion is either to check again here, or to keep it as a …
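For readers following the struct-vs-class concern above: a C# struct always has an implicit parameterless constructor, so any argument validation done in a parameterized constructor can be bypassed. A minimal sketch with made-up names (this is not the PR's actual options type):

using System;

// Hypothetical type for illustration only.
public struct HypotheticalPackOptions
{
    public HypotheticalPackOptions(string packDirectoryPath)
    {
        if (string.IsNullOrEmpty(packDirectoryPath))
        {
            throw new ArgumentNullException(nameof(packDirectoryPath));
        }

        PackDirectoryPath = packDirectoryPath;
    }

    public string PackDirectoryPath { get; }
}

public static class StructDefaultDemo
{
    public static void Main()
    {
        // Both of these skip the validating constructor entirely,
        // so PackDirectoryPath ends up null despite the guard above.
        var fromDefault = default(HypotheticalPackOptions);
        var fromNew = new HypotheticalPackOptions();

        Console.WriteLine(fromDefault.PackDirectoryPath == null); // True
        Console.WriteLine(fromNew.PackDirectoryPath == null);     // True
    }
}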
@@ -695,8 +694,15 @@ private PackBuilderResults InternalPack(PackBuilderOptions options, Action<PackB
    // call the provided action
    packDelegate(builder);

    // writing the pack and index files
    builder.Write(options.PackDirectoryPath);
    // A valid stream or path is guaranteed by the PackBuilderOptions constructor
Not necessarily, see my comment.
OK, back to a class we go. I'll fix up the commit message once I rebase at the end. I still think it makes sense to make the … @nulltoken I see what you mean about a single …
How could we determine when the …? I did some playing around and it wasn't clear to me how we could do that without extracting …
Also, re: …
I definitely agree - and …
The argument for mutable options simply comes down to API stability. Having moved away from optional parameters, making options immutable means we have to add a new constructor every time we expose a new option. F# records are perfect for immutable Options classes, but we're in C# so we have to make do.
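To make that trade-off concrete (the option names below are invented for illustration and are not LibGit2Sharp's actual API): with an immutable options type, each newly exposed option tends to mean another constructor overload, while a mutable options type grows by adding a settable property without touching existing call sites.

// Immutable: adding CompressionLevel later means adding another constructor.
public sealed class ImmutablePackOptionsSketch
{
    public ImmutablePackOptionsSketch(int maximumNumberOfThreads)
        : this(maximumNumberOfThreads, compressionLevel: 0)
    {
    }

    public ImmutablePackOptionsSketch(int maximumNumberOfThreads, int compressionLevel)
    {
        MaximumNumberOfThreads = maximumNumberOfThreads;
        CompressionLevel = compressionLevel;
    }

    public int MaximumNumberOfThreads { get; }
    public int CompressionLevel { get; }
}

// Mutable: a new option is just a new property; existing callers keep compiling.
public sealed class MutablePackOptionsSketch
{
    public int MaximumNumberOfThreads { get; set; }
    public int CompressionLevel { get; set; } // added later, no new constructor needed
}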
I was thinking what a pain it was to get rid of the optional parameters I added (which promptly broke the tests) and replace them with overloads as I was doing it. It was less code at first (before …). So, in general, is the pattern you guys follow taking "required" parameters in the constructor (…)? It still seems weird to me that I can change the max number of threads or …
On further consideration, required parameters are typically provided in the method itself, leaving … I wonder if we do have any cases of options types having a non-default ctor. This might be another useful meta test to enforce our "taste"? What do you think, @nulltoken?
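A rough sketch of what such a meta test could look like, assuming xunit and the `*Options` naming convention (this is an illustration, not the project's existing meta-test fixture):

using System;
using System.Linq;
using Xunit;

public class OptionsMetaTestSketch
{
    [Fact]
    public void OptionsTypesExposeADefaultConstructor()
    {
        // Scan the public surface of the LibGit2Sharp assembly for *Options types...
        var optionsTypes = typeof(LibGit2Sharp.Repository).Assembly
            .GetExportedTypes()
            .Where(t => t.IsClass && !t.IsAbstract
                        && t.Name.EndsWith("Options", StringComparison.Ordinal));

        // ...and flag any that cannot be created without constructor arguments.
        var offenders = optionsTypes
            .Where(t => t.GetConstructor(Type.EmptyTypes) == null)
            .Select(t => t.FullName)
            .ToList();

        Assert.Empty(offenders);
    }
}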
@dahlbyk 👍
And actually, the …
Just kidding, use an immutable class instead.
OK, updated once again with two overloads and making … I have a feeling this will end up being one commit in the end because separating them out will be extraordinarily painful :)
else
    results = orgRepo.ObjectDatabase.Pack(packBuilderOptions);
{
    results = orgRepo.ObjectDatabase.Pack(packDir, packBuilderOptions);
Might as well move this up out of the if/else since the calls are now identical.
Good call, oops.
I like the move of the delegate to options, but …
{
    Assert.Throws<ArgumentNullException>(() =>
    {
        orgRepo.ObjectDatabase.Pack(null);
        repo.ObjectDatabase.Pack(outputPackStream: null, options: new PackBuilderOptions());
Could also use Pack(default(Stream), …) or Pack(default(string), …) to differentiate between the overloads. Not sure which is more readable.
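For comparison, a snippet of the two styles side by side, assuming the test context above and the two overloads under discussion (one taking a Stream, one taking a string path; parameter names are taken from the diff and may not match the final API):

// Named argument picks the Stream overload and documents the intent:
Assert.Throws<ArgumentNullException>(() =>
    repo.ObjectDatabase.Pack(outputPackStream: null, options: new PackBuilderOptions()));

// default(T) makes the chosen overload explicit at the call site instead:
Assert.Throws<ArgumentNullException>(() =>
    repo.ObjectDatabase.Pack(default(Stream), new PackBuilderOptions()));

Assert.Throws<ArgumentNullException>(() =>
    repo.ObjectDatabase.Pack(default(string), new PackBuilderOptions()));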
I like your thinking! Two new commits. One for small fixes, another for …
I like the removal of Builder elsewhere as well. Would it be difficult to also drop Builder from the test variable names, for consistency?
Also remove `Builder` from results and options.
Finish eradication of `PackBuilder` other than its use when providing an Action<PackDefinition> to be performed during packing.
One day I'll learn to run the tests! I think I got rid of everything.
This works for me! @nulltoken?
@dahlbyk @kevin-david I do ❤️ the way the code evolved! However, something is bothering me and I'm starting to feel that @carlosmn was right in the beginning. The current closure-based design compels #1216 to rely on very weird (and barely readable) code constructs. For instance, it looks awfully complex (and without any real pros) to do one revwalk and split the objects into different packs. Is it only me being overly moody, or do you guys also feel some disturbance in the force? (cc @AMSadek)
In order to move this discussion forward, I can think of two different designs:
After having thought about it, I'm not sure we could go with a … Just my $0.02... Thoughts?
I agree that the current design seems a bit funky - even looking at the tests, supplying a delegate to override the default packing behavior requires some trickery. I think both options you proposed would work better than the current solution and obviate the need to rename the … With my limited experience, it seems like the instance method hanging off of … On the other hand, someone might use a …
Only plain DTOs from the user perspective (Options, TreeDefinition, ...)
Well, they would still need a …
That sounds like a decent argument for the …
Absolutely, and thinking about it more, …
Ok. I'll send a PR (unless someone beats me to it) to make the …
I think there are two separate concepts at play here:
Unless I'm missing something, addressing the latter does not require changing the former - we replace a write on … Now the real question is: what's the best API to handle the latter? The downside of explicit writes is that you get unexpected behavior if you forget to "flush", and we can't auto-write if we haven't been given a directory or stream. This is indeed Advanced, so we may be fine with that?
@dahlbyk There's also an additional concern: allow an easy way of splitting a pack build among several pack files, based on some threshold of objects per pack file. IOW, the use case I'd like to be sure is covered would be something like …
@AMSadek used an example in which objects were split based on their types. I'm not sure if it's a real production use case or a somewhat contrived example. Were that a real need, I can't think of a user-friendly way to make the splitting automated.
The more I think about this, the more I'm realizing we aren't going to get around … Proposal:
Multi-pack usage would look something like...

repo.ObjectDatabase.Pack(pd =>
{
    var fileNum = 0;
    foreach (var o in repo.ObjectDatabase)
    {
        pd.AddRecursively(o);
        if (pd.ObjectsCount > 99) // Add could return `PackStatus` or equivalent instead
            pd.Write(".git/objects/pack" + (fileNum++));
    }
    if (pd.ObjectsCount > 0) // Is `Write()` a no-op or error if nothing added?
        pd.Write(".git/objects/pack" + fileNum);
}, null);
Hey all, sorry for not following the conversation; I've got a couple of questions here to catch up on the details I missed.
I agree with you here, I don't see much benefit from introducing …
So why is the current design no longer valid? Is it because of introducing …? If it's the latter, I think this discussion should be on the other PR #1216 so that other contributors can follow up on it (and not get confused like me :))... but anyways. Otherwise, my best bet would be what @nulltoken proposed first =>
And it should be as simple as the following for writing each 100 objects into a pack for a …

using (PackBuilder packBuilder = new PackBuilder(repo))
{
    int currentObject = 0;
    foreach (GitObject gitObject in repo.ObjectDatabase)
    {
        packBuilder.Add(gitObject);

        if (currentObject++ % 100 == 0)
        {
            results = packBuilder.Write(packDirectoryPath);
            packBuilder.Clear();
        }
    }

    if (!packBuilder.Empty())
    {
        results = packBuilder.Write(packDirectoryPath);
        packBuilder.Clear();
    }
}

where …
Well, I don't have a real-world case in mind right now, but the way I see the whole pack usage is …
I wouldn't worry much about that; the splitting criteria will be determined by the API user, and we can't automate all possible scenarios of how to choose objects to pack. I would like to hear what you guys think of that :)
I do have a real-world case for creating packs with one object type. In the case of data-mining repos, it is far more efficient to feed just the commit data to the parse engine. The data miner has no need for blob or tree data... Well, maybe tree data, but absolutely not blobs. Also, the write-file API is weird to me. Why ask a pack builder object to understand the file system? WriteTo(Stream) is more elegant and SOLID.
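For concreteness, a sketch of that commits-only case under the delegate-based API shown earlier in the thread (the Add method and the Pack signature are taken from the snippets above and are not a settled API):

// Sketch: pack only commits, skipping blobs and trees entirely.
repo.ObjectDatabase.Pack(pd =>
{
    foreach (var obj in repo.ObjectDatabase)
    {
        if (obj is Commit)
        {
            pd.Add(obj); // hypothetical non-recursive Add, so no trees/blobs are pulled in
        }
    }
}, packBuilderOptions);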
What is meant by "understand the file system"? Is that because we are passing a directory path to write files to? ... I can see that in other places in the API, file paths are passed in a similar way. Do you suggest a better way to write to files?
Are you suggesting we drop the write-to-file from the API?
I'm a strong proponent of the SOLID design principles, which imply that PackBuilder should build packs, not write to file systems. I get that the API is potentially easier to consume/understand if there's a WriteTo(string) entry point, but the ability to mount, parse, and understand a file system seems outside the expected behavior of an object designed to pack git blob deltas into a single file. The necessary logic for testing directory and file existence seems foreign here - though, again, I understand the natural tendency to place it here.
Honestly, I would rather have the Odb itself take care of writing the pack file back to wherever the Odb wants to put these kinds of things. To assume the Odb is inside the … I would much rather see the pack file reference passed back to the Odb for managing. Again, I do not own this API - @nulltoken is the best arbiter of these kinds of things.
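One way to picture the "caller owns the destination" idea, using the Stream-based overload from the diff above (the file name and the packDirectory variable are illustrative, and whether an index also needs to be produced is still an open question in this thread):

// The caller decides where the bytes go - a file here, but it could just as
// well be a network stream - and the pack API only ever sees a Stream.
// File and Path come from System.IO.
using (var output = File.OpenWrite(Path.Combine(packDirectory, "sketch.pack")))
{
    repo.ObjectDatabase.Pack(outputPackStream: output, options: new PackBuilderOptions());
}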
@whoisj
For the disposal of …
Pros in the first: we could write the same … So yeah, just thinking out loud :)!
We do have this in libgit2 proper, precisely because the … While the core of the packbuilder is the foreach, it's not as convenient when doing C to pipe its output into an indexer, so we provide a function for this common use-case. But having one shouldn't detract from having the other. If you want to write into a repository, you should be passing the stream into whatever the object db gave you instead of assuming that you know what storage format it's using. This is where a … Even with such a method, a …
I just want to thank everyone here for reminding me how much I don't know about Git. |
@AMSadek I think I can live with that.
Personally, I prefer using the disposable pattern. However, this library has generally avoided its use, so terminate on … Finally, if …
Unless I'm missing something, …
Having the …
I've added a new commit to the original PR #1216 thread.
You're right, it's gonna be tricky this way. I've given up on this idea in my latest commit.
I agree with you on that.
With the flexibility we give to write multiple packs and to different types of streams, I think allowing explicit write and disposal will be more convenient.
The options make sense to me as immutable, and a struct required one less null check :)
The idea here is to enable writing a pack to an arbitrary stream without the additional overhead of indexing - looking at libgit2, there's some unnecessary work done if you're building a packfile to send to a remote server, for example.
Thanks to @whoisj for the guidance on interop with libgit2
/cc @AMSadek