Skip to content

Exposing Packbuilder capabilities of libgit2 #1183

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 20, 2015

Conversation

AMSadek
Copy link
Contributor

@AMSadek AMSadek commented Aug 28, 2015

This pull request is intended to expose the the pack builder capabilities of libgit2, which is one way to produce the the .pack and .idx files for a repository. Some sample scenarios would be: Creating a pack file for a new or existing repo, splitting a repo over multiple pack files, build a pack file for certain branches/commits/trees if the entire repo is quite large, or build for commits returned by a revision walker.

@nulltoken
Copy link
Member

Thanks for this.It looks like an interesting feature to expose.

A few nitpicks, though.

  • It's generally advised to add a description to explain the intent behind the PR, provide the reader with a sample scenario...
  • We generally won't merge a Pull Request if there aren't a sufficient test coverage. Besides protecting us from potential future regressions, this also show cases the proposed API usage
  • Could you please fix the private member's name and location so that it fits with the codebase?
  • Please make sure that every file ends with a final trailing newline

Regarding the API, I'm not sure exposing a PackBuilder type would fit (we tend to not expose IDiposable types). How about something along the lines of PackBuilderResults repo.ObjectDatabase.Pack(Action<PackBuilder>, PackOptions options)?

From a different perspective, by taking a look at the libgit2 source code, it looks like most of the exposed functions actually delegate to git_packbuilder_insert_recur, which actually detects the type of the object being written. Is there any compelling use case that would require to expose all the git_packbuilder_insert_*() functions? Maybe something as simple as PacBuilder.Add<T>(T object) where T : GitObject could work?

/cc @carlosmn

Note: Your commits do not seem to be linked with your GitHub account, you may be willing to add the email address you used to commit to your GitHub account.

Cheers!

@carlosmn
Copy link
Member

git_packbuilder_insert_recur() is the newer version of the underlying logic and everything was rewritten to take advantage of some of the shortcuts it can take. The rest of the function were left so as not to break existing consumers and to provide a way to assert the type of object you're inserting so the compiler can warn you. The first reason is irrelevant here, but the second might be interesting (or maybe not).

Are you thinking of the Action<PackBuilder> as a way to force the using() onto the user? It seems awkward to force it when it's just memory that we're holding on to.

PacKBuilder.Add<T>(T object) where T : GitObject has the issue of not expressing whether you want to include a single object or include the objects it references (or even worse, it looks like single-object insert but you wanted to make it recursive). But a pair for Add<T> and AddRecursively<T> might work.

@nulltoken
Copy link
Member

Are you thinking of the Action<PackBuilder> as a way to force the using() onto the user? It seems awkward to force it when it's just memory that we're holding on to.

No the idea would be the opposite, the Packbuilder type would be instantiated and disposed by the ObjectDatabase.Pack() method. The consumer would work on the passed instance of it (more or less similarly to the RemoteUpdater.

PackBuilder.Add<T>(T object) where T : GitObject has the issue of not expressing whether you want to include a single object or include the objects it references (or even worse, it looks like single-object insert but you wanted to make it recursive). But a pair for Add<T> and AddRecursively<T> might work.

I'm not very familiar with the usage of the packbuilder. When would one want to only had one object and not its dependencies (eg. the Tree, but not the TreeEntrys). What would be the downside of always add recursively? Besides, isn't there a hashtable that prevents the same object from being added twice?

@carlosmn
Copy link
Member

No the idea would be the opposite, the Packbuilder type would be instantiated and disposed by the ObjectDatabase.Pack() method.

That sounds just like what I was trying to describe.

The thing with the object packer is that you may want to pass it around a bit, as you build up the list of objects that you want to pass.

I'm not very familiar with the usage of the packbuilder. When would one want to only had one object and not its dependencies

If you know that the other repository already has the objects in question, there's no need to send them. Or you may have a plan for which object goes into which packfile if you're creating a bunch of them.

@nulltoken
Copy link
Member

No the idea would be the opposite, the Packbuilder type would be instantiated and disposed by the ObjectDatabase.Pack() method.

That sounds just like what I was trying to describe.

Sorry. I misunderstood you.

The thing with the object packer is that you may want to pass it around a bit, as you build up the list of objects that you want to pass.

The idea was, instead of...

using (var pb = new PackBuilder())
{
  // Do stuff with pb
}

...rather promote something like

var packBuilderResult = repo.ObjectDatabase.Pack( pb => { DoStuff(pb); }, opts);

But a pair for Add<T> and AddRecursively<T> might work.

If you know that the other repository already has the objects in question, there's no need to send them. Or you may have a plan for which object goes into which packfile if you're creating a bunch of them.

👍 Thanks for the explanation. The xml doc should be explicit about the behavior of each method.

@AMSadek
Copy link
Contributor Author

AMSadek commented Sep 2, 2015

Thanks @nulltoken @carlosmn for the insights.

@nulltoken

Is there any compelling use case that would require to expose all the git_packbuilder_insert_*()

As Carlos mentioned, for cases if you don't want to add all objects or build multiple pack files.
My next modification would be add support for pack.packSizeLimit as in git.exe config ... so that I can limit the size of the pack files for very large repos.

How about something along the lines of 'PackBuilderResults repo.ObjectDatabase.Pack(Action, PackOptions options)'?

This works too however I don't see a big advantage of it other than using 'using (var pb = new PackBuilder())' ... if I understand you correctly, then 'PackBuilder' code will stay pretty much the same, but I need to add 'Pack(Action)' method in the 'ObjectDatabase' class

  • We generally won't merge a Pull Request if there aren't a sufficient test coverage. Besides protecting us from potential future regressions, this also show cases the proposed API usage

I'm working on that, if you can direct me to a concrete example showing the style used in here that will be great,

@nulltoken
Copy link
Member

if I understand you correctly, then 'PackBuilder' code will stay pretty much the same, but I need to add 'Pack(Action)' method in the 'ObjectDatabase' class

Mostly. But,

  • We would make the constructor internal
  • We would only expose Add<T> and AddRecursively<T>
  • WrittenObjectsCount would returned as a property of the PackBuilderResults type
  • SetMaximumNumberOfThreads would become a property of the PackBuilderOptions parameter (this would also accept the progress callback delegate).

Few thoughts:

  • I'm not sure about exposing ObjectsCount
  • Should Write() be exposed or should it be internally called by ObjectDatabase.Pack? What would happen would we call Write() twice in a row with different paths or with the same path?

I'm working on that, if you can direct me to a concrete example showing the style used in here that will be great,

The test project contains lots of tests that you get some inspiration from. I'm not sure to understand what kind of guidance you're looking for. Could you please elaborate a little bit more how we could help you with regards to writing tests?

@carlosmn
Copy link
Member

carlosmn commented Sep 3, 2015

If you're making the user provide a closure to include objects, it only makes sense that you'd do the writing when the closure has finished. Otherwise you are just making them do a really weird incantation of using().

There is then however the issue of what to do to support the _foreach() way of creating the packfile which allows the user to decide what to do with the output.

@nulltoken
Copy link
Member

@carlosmn Thanks for the feedback.

Let's not expose Write() either, then and pass in the path as an additional parameter to the Pack() method.

@AMSadek
Copy link
Contributor Author

AMSadek commented Sep 3, 2015

@nulltoken thanks for your suggestions, I'm considering them all in my next commit.

One a side note, I would like to be able to build a pack file from a rev walk by exposing int git_packbuilder_insert_walk(git_packbuilder *pb, git_revwalk *walk) method from libgit2.
My initial thought is to create a RevWalk type and expose its handle.
But by looking at the code, most of the functionality is already implemented in CommitLog and CommitFilter types.
My best bet now is to expose CommitEnumerator and its RevWalkerSafeHandle internally to be able to pass that to the Pack() method. ... may be as a parameter or within PackBuilderOptions type.
What do you think about that ?

The test project contains lots of tests that you get some inspiration from.

Regarding testing, that's what I meant. I'll look at some of these tests and I will follow its style. Thanks!

@nulltoken
Copy link
Member

One a side note, I would like to be able to build a pack file from a rev walk by exposing int git_packbuilder_insert_walk(git_packbuilder *pb, git_revwalk *walk) method from libgit2.
My initial thought is to create a RevWalk type and expose its handle.
But by looking at the code, most of the functionality is already implemented in CommitLog and CommitFilter types.

How about a Add(CommitFilter) overload to the proposed API?

@AMSadek
Copy link
Contributor Author

AMSadek commented Sep 3, 2015

I was hoping to do something more like what's in http://ben.straub.cc/2013/10/02/revwalk/ that libgit2 can do.

How about a Add(CommitFilter) overload to the proposed API?

Not sure if this going to do what we want, I think I better have PackBuilder.Add(CommitEnumerator)
and if I can internally expose its handle I will use git_packbuilder_insert_walk ... and if I can't expose it, then I will enumerate the commits and insert them one by one using git_packbuilder_insert

I would prefer to use git_packbuilder_insert_walk as I think it will be more efficient (faster to run) than inserting commits one by one using git_packbuilder_insert

@nulltoken
Copy link
Member

@AMSadek Let's start without the revwalk for now and make sure the API is clean and covered with tests. We'll then tackle this additional requirement.

@AMSadek
Copy link
Contributor Author

AMSadek commented Sep 3, 2015

@nulltoken yeah that's good too.

@AMSadek
Copy link
Contributor Author

AMSadek commented Sep 4, 2015

@nulltoken I've made some modification on my latest commit (it has some build issues still). The main change here is adding a method Add in ObjectDatabase type.
However, I don't see a way to write multiple pack files in this approach unlike the older one.
More specifically, I would like to be able to do something like this.

private static void TestMultiplePackFiles()
{
    int numObjects = 0;
    using (var repository = new Repository(@"\TestRepo5\.git"))
    {
        PackBuilder packBuilder = new PackBuilder(repository);

        foreach (GitObject obj in repository.ObjectDatabase)
        {
            packBuilder.InsertObject(obj);
            numObjects++;

            if (numObjects % 100 == 0)
            {
                packBuilder.Write(@"\TestRepo5\.git\objects\pack-" + (numObjects/100));
                packBuilder.Dispose();
                packBuilder = new PackBuilder(repository);
            }
        }

        if (numObjects % 100 != 0)
        {
            packBuilder.Write(@"\TestRepo5\.git\objects\pack");
            packBuilder.Dispose();
        }
    }
}

which is basically writing a separate pack file for each 100 objects. ... without exposing a write() method this is gonna be tricky to implement.
Please let me know what you think ...
Thanks,

@nulltoken
Copy link
Member

I've made some modification on my latest commit (it has some build issues still)

@AMSadek Sealing the type should fix the build.

However, I don't see a way to write multiple pack files in this approach unlike the older one.

As discussed above, could we first cover with tests the first scope, then extend it?

FWIW, this is why a description describing the high-level scenario you had in mind and the use cases would have been helpful in the beginning. 😉

@nulltoken
Copy link
Member

@AMSadek Do you need some help pushing this forward?

@AMSadek
Copy link
Contributor Author

AMSadek commented Sep 9, 2015

@nulltoken I wrote couple of tests already, just trying to figure a smart way to add and use the test repos I'm using for packbuilder tests. More specifically, copying the repo around, changing it's pack files, reading it again, and making sure it matches the old one ... if that makes sense.

@AMSadek
Copy link
Contributor Author

AMSadek commented Sep 10, 2015

@nulltoken so I have have added a PackBuilderFixture as the starting point for testing the PackBuilder, I've also copied a temp repo to use for these kind of test. This shows a use case when the user want to create a single pack file for a repo either using the default action or a custom action.

My next thoughts is how to enable to write multiple pack files with certain specifics or sizes.

FWIW, this is why a description describing the high-level scenario you had in mind and the use cases

I've updated it, have a look and let me know if it's unclear or confusing. Thanks! :)

@carlosmn
Copy link
Member

NB: this may be related: #192 as you need to create the .idx file in order to be able to do anything with the packfile.

@AMSadek
Copy link
Contributor Author

AMSadek commented Sep 10, 2015

@carlosmn thanks for referring to that

as you need to create the .idx file in order to be able to do anything with the packfile.

PackBuilder.Write() actually writes both the .pack file and it's associated .idx file, so I believe we don't need to worry about that.

/// <param name="gitObject">The object to be inserted.</param>
public void Add(GitObject gitObject)
{
Proxy.git_packbuilder_insert(packBuilderHandle, gitObject.Id, null);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're dereferencing gitObject here without first ensuring that it is not null.

@AMSadek
Copy link
Contributor Author

AMSadek commented Sep 16, 2015

@whoisj Thanks J for the comments, I did most of it. I will leave struct vs class comment till we settle on the design.

@AMSadek
Copy link
Contributor Author

AMSadek commented Sep 23, 2015

Hey @nulltoken , I've done couple of cleanups and fixes to the code. Please let me know what you think, and if you suggest more testing.

@carlosmn and @whoisj Also please let me know what you guys think and if you suggest more testing or modifications :).

@nulltoken
Copy link
Member

@AMSadek In order to move this forward, I took the liberty to update the code to try and showcase what I had in mind. You'll find below the result of this.

I also slightly reworked the test initialization, which clears some warnings from the output.

diff --git a/LibGit2Sharp.Tests/PackBuilderFixture.cs b/LibGit2Sharp.Tests/PackBuilderFixture.cs
index eaac988..bc097e9 100644
--- a/LibGit2Sharp.Tests/PackBuilderFixture.cs
+++ b/LibGit2Sharp.Tests/PackBuilderFixture.cs
@@ -35,13 +35,10 @@ internal void TestIfSameRepoAfterPacking(Action<IRepository, PackBuilder> packDe
             // compare

             string orgRepoPath = SandboxPackBuilderTestRepo();
-            SelfCleaningDirectory scd = BuildSelfCleaningDirectory(orgRepoPath + "_mrr");
-            string mrrRepoPath = scd.DirectoryPath;
-            string mrrRepoPathPackDirPath = mrrRepoPath + "/.git/objects";
+            string mrrRepoPath = SandboxPackBuilderTestRepo();
+            string mrrRepoPathPackDirPath = Path.Combine(mrrRepoPath,  "/.git/objects");

-            DirectoryHelper.CopyFilesRecursively(new DirectoryInfo(orgRepoPath), new DirectoryInfo(mrrRepoPath));
             DirectoryHelper.DeleteDirectory(mrrRepoPathPackDirPath);
-            Directory.CreateDirectory(mrrRepoPathPackDirPath);
             Directory.CreateDirectory(mrrRepoPathPackDirPath + "/pack");

             PackBuilderOptions packBuilderOptions = new PackBuilderOptions(mrrRepoPathPackDirPath + "/pack");
@@ -50,7 +47,7 @@ internal void TestIfSameRepoAfterPacking(Action<IRepository, PackBuilder> packDe
             {
                 PackBuilderResults results;
                 if (packDelegate != null)
-                    results = orgRepo.ObjectDatabase.Pack(packBuilderOptions, packDelegate);
+                    results = orgRepo.ObjectDatabase.Pack(packBuilderOptions, b => packDelegate(orgRepo, b));
                 else
                     results = orgRepo.ObjectDatabase.Pack(packBuilderOptions);

@@ -93,13 +90,13 @@ internal void AddingObjectsTestDelegate(IRepository repo, PackBuilder builder)
             {
                 foreach (Commit commit in branch.Commits)
                 {
-                    builder.AddRecursively<GitObject>(commit);
+                    builder.AddRecursively(commit);
                 }
             }

             foreach (Tag tag in repo.Tags)
             {
-                builder.Add<GitObject>(tag.Target);
+                builder.Add(tag.Target);
             }
         }

@@ -176,7 +173,7 @@ public void ExceptionIfAddNullObjectID()
             {
                 Assert.Throws<ArgumentNullException>(() =>
                 {
-                    orgRepo.ObjectDatabase.Pack(packBuilderOptions, (IRepository repo, PackBuilder builder) =>
+                    orgRepo.ObjectDatabase.Pack(packBuilderOptions, builder =>
                     {
                         builder.Add(null);
                     });
@@ -194,7 +191,7 @@ public void ExceptionIfAddRecursivelyNullObjectID()
             {
                 Assert.Throws<ArgumentNullException>(() =>
                 {
-                    orgRepo.ObjectDatabase.Pack(packBuilderOptions, (IRepository repo, PackBuilder builder) =>
+                    orgRepo.ObjectDatabase.Pack(packBuilderOptions, builder =>
                     {
                         builder.AddRecursively(null);
                     });
diff --git a/LibGit2Sharp/ObjectDatabase.cs b/LibGit2Sharp/ObjectDatabase.cs
index 1ea0b72..1ee33f8 100644
--- a/LibGit2Sharp/ObjectDatabase.cs
+++ b/LibGit2Sharp/ObjectDatabase.cs
@@ -652,7 +652,7 @@ public virtual MergeTreeResult MergeCommits(Commit ours, Commit theirs, MergeTre
         /// <returns>Packing results</returns>
         public virtual PackBuilderResults Pack(PackBuilderOptions options)
         {
-            return InternalPack(options, (IRepository repo, PackBuilder builder) =>
+            return InternalPack(options, builder =>
             {
                 foreach (GitObject obj in repo.ObjectDatabase)
                 {
@@ -667,7 +667,7 @@ public virtual PackBuilderResults Pack(PackBuilderOptions options)
         /// <param name="options">Packing options</param>
         /// <param name="packDelegate">Packing action</param>
         /// <returns>Packing results</returns>
-        public virtual PackBuilderResults Pack(PackBuilderOptions options, Action<IRepository, PackBuilder> packDelegate)
+        public virtual PackBuilderResults Pack(PackBuilderOptions options, Action<PackBuilder> packDelegate)
         {
             return InternalPack(options, packDelegate);
         }
@@ -679,7 +679,7 @@ public virtual PackBuilderResults Pack(PackBuilderOptions options, Action<IRepos
         /// <param name="packDelegate">Packing action. If null is passed, the pack method will invoke the default action of packing
         /// all objects in an arbitrary order.</param>
         /// <returns>Packing results</returns>
-        private PackBuilderResults InternalPack(PackBuilderOptions options, Action<IRepository, PackBuilder> packDelegate)
+        private PackBuilderResults InternalPack(PackBuilderOptions options, Action<PackBuilder> packDelegate)
         {
             Ensure.ArgumentNotNull(options, "options");
             Ensure.ArgumentNotNull(packDelegate, "buildDelegate");
@@ -692,7 +692,7 @@ private PackBuilderResults InternalPack(PackBuilderOptions options, Action<IRepo
                 builder.SetMaximumNumberOfThreads(options.MaximumNumberOfThreads);

                 // call the provided action
-                packDelegate(repo, builder);
+                packDelegate(builder);

                 // writing the pack and index files
                 builder.Write(options.PackDirectoryPath);

@AMSadek
Copy link
Contributor Author

AMSadek commented Oct 16, 2015

@nulltoken Thanks for doing this, I don't object to these changes. If you could, please push them to the branch or let me know if I need to reapply them on my end.

@nulltoken
Copy link
Member

@AMSadek Sadly, I can't push to your branch.

@AMSadek
Copy link
Contributor Author

AMSadek commented Oct 16, 2015

@nulltoken np, I will do these changes and commit them ... probably I will revert the commit where we introduced the Action<IRepository, PackBuilder> packDelegate and just returned to what it was :)
Thanks so much!

@AMSadek AMSadek force-pushed the AMSadek/packbuilder branch 2 times, most recently from e0c3581 to 0e3b2eb Compare October 16, 2015 23:55
@AMSadek
Copy link
Contributor Author

AMSadek commented Oct 17, 2015

@nulltoken I've done the changes, and rebased on vNext. Thank you!

@nulltoken
Copy link
Member

@whoisj @ethomson Any more comment?

@whoisj
Copy link

whoisj commented Oct 20, 2015

@nulltoken I think we're good, thanks for you feedback.

@AMSadek thanks for the good work.

I'll merge this by end of day if there are no more concerns/comments.

@nulltoken
Copy link
Member

@AMSadek Last request before merge: Could you please squash those commits into one cohesive one?

@AMSadek AMSadek force-pushed the AMSadek/packbuilder branch from 0e3b2eb to 8b5b890 Compare October 20, 2015 19:02
@AMSadek
Copy link
Contributor Author

AMSadek commented Oct 20, 2015

@whoisj @nulltoken Thanks! ... I've don't the squash, just need to rerun the build.

@nulltoken
Copy link
Member

@AMSadek I've just kicked the build... hard

nulltoken added a commit that referenced this pull request Oct 20, 2015
Exposing Packbuilder capabilities of libgit2
@nulltoken nulltoken merged commit 7b8e982 into libgit2:vNext Oct 20, 2015
@nulltoken
Copy link
Member

Thanks for this 🆒 addition!

✨ ✨ ✨ ✨ ✨

@nulltoken nulltoken added this to the v0.22 milestone Oct 20, 2015
@nulltoken
Copy link
Member

Published as NuGet pre-release package LibGit2Sharp.0.22.0-pre20151020205438

@AMSadek
Copy link
Contributor Author

AMSadek commented Oct 20, 2015

AWESOME!! Thanks so much guys for your great advice and feedback, I learned much from this PR.
@whoisj @nulltoken @ethomson @carlosmn @kevin-david

@AMSadek AMSadek deleted the AMSadek/packbuilder branch October 20, 2015 23:45
@dahlbyk dahlbyk mentioned this pull request Oct 21, 2015
@dahlbyk
Copy link
Member

dahlbyk commented Oct 22, 2015

which is basically writing a separate pack file for each 100 objects. ... without exposing a write() method this is gonna be tricky to implement.

Maybe something like this?

private static void TestMultiplePackFiles()
{
    int packCount = 0;
    using (var repository = new Repository(@"\TestRepo5\.git"))
    {
        var toPack = new List<GitObject>();

        foreach (GitObject obj in repository.ObjectDatabase)
        {
            toPack.Add(obj);

            if (toPack.Length == 100)
            {
                Pack(repository, toPack, ref packCount);
                toPack = new List<GitObject>();
            }
        }

        Pack(repository, toPack, ref packCount);
    }
}

private static void Pack(IRepository repository, List<GitObject> toPack, ref int packCount)
{
    if (toPack.Length == 0) return;

    var pbo = new PackBuilderOptions(@"\TestRepo5\.git\objects\pack-" + packCount++);
    repository.ObjectBuilder.Pack(pbo, pb =>
        {
            foreach (var o in toPack)
                pb.Add(o);
        });
}

@nulltoken
Copy link
Member

@dahlbyk How would we handle AddRecursive<T>() which would mark the object as interesting along with all the object referenced objects that are unknown from the pack? IOW, adding one object could lead to 50 objects being written.

I have the feeling that we should rather rely on the git_packbuilder_object_count() method if we want to more or less evenly distribute the number of objects per pack file.

@dahlbyk
Copy link
Member

dahlbyk commented Oct 22, 2015

That had occurred to me, but wasn't part of the sample code so I didn't account for it. 😈

Biggest problem at first thought is that our pack path is specified once as an option - would need some way to iterate on that.

@dahlbyk dahlbyk mentioned this pull request Oct 22, 2015
@nulltoken
Copy link
Member

Biggest problem at first thought is that our pack path is specified once as an option - would need some way to iterate on that.

Agreed. Let's wait for @AMSadek's feedback

@AMSadek
Copy link
Contributor Author

AMSadek commented Oct 26, 2015

Kindly find my answers in the recent PR #1216 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants