Skip to content

Add Sink that implements git's diffing improvement heuristics #2011

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 18, 2025

Conversation

blinxen
Copy link
Contributor

@blinxen blinxen commented May 14, 2025

This PR was initially targeted at image-diff but the maintainer seems to not have enough time at the moment so I moved it here.

What does this PR do?

This PR introduces a new Sink that can be used to optimize diffs created with imara-diff. I tried to implement the same heuristics that git uses. The code is also structured very similar.

While trying to add it to gix-diff, I was unsure how it should be included. Is it just a normal Sink that can be used with imara_diff::diffor do we force users to use it by not re-exporting the imara_diff::diff method but creating a custom diff method in the gix-diff crate? What do the maintainers think?

Things that still need to be done

  • write some more tests
  • optimize the code
  • (maybe find better names for variables, structs and the Sink itself)
  • check the licensing since it was heavily inspired by the git implementation

@Byron
Copy link
Member

Byron commented May 15, 2025

Thanks so much for contributing, this will be such a great addition to gitoxide!

While trying to add it to gix-diff, I was unsure how it should be included. Is it just a normal Sink that can be used with imara_diff::diffor do we force users to use it by not re-exporting the imara_diff::diff method but creating a custom diff method in the gix-diff crate? What do the maintainers think?

As this is plumbing, and as gix-diff openly exposes imara-diff, it will be best to keep it as Sink that people can opt-in to.

As a matter of fact, it would be very nice if the algorithm could be used and I propose adding it as an option (probably even the default) to gix diff file in a follow-up.

Meanwhile I wonder if this PR should be ready for review, or if it should rather be a draft as you plan to make some more changes?
Every answer is a good answer, and I'd be happy to review it now and make it mergeable.

@blinxen
Copy link
Contributor Author

blinxen commented May 15, 2025

write some more tests
optimize the code
(maybe find better names for variables, structs and the Sink itself)

I think these three should be OK.

check the licensing since it was heavily inspired by the git implementation

For this one I am not sure how to proceed. I found contradicting information online regarding the licensing. The original code was written in C. The Rust code in this PR was inspired by it but I used a different way of doing the optimizations. The weight factors and the algorithm (I don't mean code here but the theory behind the algorithm) are 100% from the xdiff library used by git but the rest is my own work. How does LGPL-2.1 apply here 🤔 ? Maybe you know more than me on this topic :D.

@Byron
Copy link
Member

Byron commented May 16, 2025

Great, then I will take a look!

For this one I am not sure how to proceed. I found contradicting information online regarding the licensing. The original code was written in C. The Rust code in this PR was inspired by it but I used a different way of doing the optimizations. The weight factors and the algorithm (I don't mean code here but the theory behind the algorithm) are 100% from the xdiff library used by git but the rest is my own work. How does LGPL-2.1 apply here 🤔 ? Maybe you know more than me on this topic :D.

I certainly don't, but I wouldn't fret it. My intuition here is that the work itself is copyrighted, i.e. the source code as is, but not underlying knowledge. Besides, this wouldn't be the first reimplementation of semantics of the xdiff library, as I am sure more mature Git libraries already have something just like it.
And finally, if someone would be after this, I am sure they'd come after gitoxide as a means of distribution, and not the original author, so I guess I am feeling lucky :).

@EliahKagan
Copy link
Member

I think it may be preferable for the added 2c31d30 commit message to be conventional with fix:. As described in #2013 (comment), this is fixing a bug that breaks installation, so various crates that depend on gix-archive (of which there are few that do so directly, but I think more that do so through gix) can be able to build again. It will therefore be of equal or greater than usual interest to users reading changelogs.

@EliahKagan EliahKagan linked an issue May 16, 2025 that may be closed by this pull request
@Byron
Copy link
Member

Byron commented May 16, 2025

That is true! I missed that opportunity but fixed up the gix-archive changelog to prominently feature that information anyway.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, this is a great start!

I have reworked it a little to be more flexible, and added unified diffs to the test to see the sliding in action.

Something I'd love to have is actual baseline tests, i.e. tests taken from the diff-slider-tools, that are then run through Git which produces the expectation for the algorithm itself.

Also I have a feeling that the direct slice indexing can panic when the conditions are right, and such a baseline could possibly help triggering such failures. Maybe one could also use a fuzzer to generate a wide variety of inputs to help battle-test this.

I hope you find such prospects intriguing, too :).

@Byron Byron enabled auto-merge May 18, 2025 20:01
@Byron Byron merged commit fd49eee into GitoxideLabs:main May 18, 2025
22 checks passed
@blinxen
Copy link
Contributor Author

blinxen commented May 18, 2025

I have reworked it a little to be more flexible, and added unified diffs to the test to see the sliding in action.

Thanks for the improvements!

Also I have a feeling that the direct slice indexing can panic when the conditions are right, and such a baseline could possibly help triggering such failures. Maybe one could also use a fuzzer to generate a wide variety of inputs to help battle-test this.
I hope you find such prospects intriguing, too :).

Definitely, will try to propose some tests when I have time.

Byron added a commit that referenced this pull request May 19, 2025
)

Against better judgement I was compelled to merge this implementation in,
assuming one test must be enough given that it is a transcription.

This, however, wasn't the case and there is strong evidence that it produces
diffs that aren't correct.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants