Move claimable_htlcs to separate lock #1772

Conversation

ViktorTigerstrom
Contributor

Closes #1732

Moves claimable_htlcs out from the channel_state lock into a separate lock. This simplifies the removal of the channel_state lock, as well as simplifies changing its lock order for #1507.

@codecov-commenter

codecov-commenter commented Oct 15, 2022

Codecov Report

Base: 90.77% // Head: 90.76% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (c2037a1) compared to base (505102d).
Patch coverage: 94.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1772      +/-   ##
==========================================
- Coverage   90.77%   90.76%   -0.01%     
==========================================
  Files          87       87              
  Lines       47595    47597       +2     
==========================================
- Hits        43204    43203       -1     
- Misses       4391     4394       +3     
Impacted Files Coverage Δ
lightning/src/ln/channelmanager.rs 85.38% <94.00%> (-0.02%) ⬇️
lightning-net-tokio/src/lib.rs 76.73% <0.00%> (-0.31%) ⬇️
lightning/src/ln/functional_tests.rs 96.94% <0.00%> (-0.08%) ⬇️
lightning/src/chain/channelmonitor.rs 90.70% <0.00%> (+0.05%) ⬆️
lightning/src/chain/onchaintx.rs 95.01% <0.00%> (+0.65%) ⬆️

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2022-10-move-claimable-htlcs-to-seperate-lock branch from f048a08 to f76d011 on October 16, 2022 22:30

@TheBlueMatt TheBlueMatt left a comment

Basically LGTM.

@@ -3650,7 +3655,9 @@ impl<Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> ChannelMana
true
});

channel_state.claimable_htlcs.retain(|payment_hash, (_, htlcs)| {
mem::drop(channel_state_lock);
Collaborator

Rather than drop'ing here, note that the scope this is in was explicitly for the channel_state lock, so we should just move the next hunk out of the scope.

Contributor Author

Oh yeah, thanks! 736d403
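For context, a rough sketch of the restructuring agreed on above, with heavily simplified types and a hypothetical function name (this is not the literal rust-lightning code):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct Channel;
struct ClaimableHTLC;

struct ChannelHolder {
    by_id: HashMap<[u8; 32], Channel>,
}

struct ChannelManager {
    channel_state: Mutex<ChannelHolder>,
    claimable_htlcs: Mutex<HashMap<[u8; 32], Vec<ClaimableHTLC>>>,
}

fn timer_tick_sketch(mgr: &ChannelManager) {
    {
        // This scope exists specifically to bound the channel_state lock.
        let mut channel_state = mgr.channel_state.lock().unwrap();
        channel_state.by_id.retain(|_, _chan| {
            // ... per-channel timer work elided ...
            true
        });
    } // channel_state guard dropped here; no explicit mem::drop needed.

    // The claimable_htlcs pass now lives outside that scope, since it only
    // needs its own lock.
    mgr.claimable_htlcs.lock().unwrap().retain(|_payment_hash, _htlcs| {
        // ... HTLC timeout handling elided ...
        true
    });
}
```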

let peer_state = peer_state_mutex.lock().unwrap();
peer_state.latest_features.write(writer)?;
}
{
Collaborator

I don't think it's worth bothering with a bunch of scopes here.

Contributor Author

Thanks, those were an attempt not to violate the "lock order branches", given my previous understanding of how we handle them. But given your clarification above, there's no need for these scopes :)!

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2022-10-move-claimable-htlcs-to-seperate-lock branch 2 times, most recently from efbd2f2 to 01e4912 on October 31, 2022 01:17
@ViktorTigerstrom
Contributor Author

Snap, will look into fixing the CI Windows build failure tomorrow.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2022-10-move-claimable-htlcs-to-seperate-lock branch 5 times, most recently from 0c48ede to 938769c on October 31, 2022 12:31
@ViktorTigerstrom
Contributor Author

Super sorry for the long and complicated change log here.

Added a separate commit that acquires all of the locks at the beginning of the ChannelManager::write function.

Still haven't fixed the CI issues on Windows.

let forward_htlcs = self.forward_htlcs.lock().unwrap();
let channel_state = self.channel_state.lock().unwrap();
let per_peer_state = self.per_peer_state.write().unwrap();
let peer_states = per_peer_state.iter().map(|(peer_pubkey, peer_state_mutex)| (peer_pubkey, peer_state_mutex.lock().unwrap())).collect::<Vec<(&PublicKey, MutexGuard<'_, PeerState>)>>();
Contributor Author

@ViktorTigerstrom ViktorTigerstrom Oct 31, 2022

@TheBlueMatt are we okay with this creating a lock order for all the inner locks of the PeerState mutexes? In the "non-serialization" state (i.e. outside of ChannelManager::write) we should never acquire two PeerState locks at the same time, due to the deadlock possibility. IIUC this "debug lockorder" will be violated in production runtime when we move towards parallelization, but that should be ok as long as the same thread doesn't acquire multiple PeerStates at the same time?

If we're not ok with this, I don't think we can take the locks at the beginning of the ChannelManager::write function, and we'd have to revert back to how this PR originally acquired the locks.

Collaborator

Hmmm, yea, that's super gross. Sorry for the confusion, I hadn't intended to suggest that we should always take all the locks at the top, only that it doesn't really matter where the locks go, and if it's easier to take them in the right order, don't bother trying to super manually scope things but rather just move the locks around/up until we get the right order. We can scope (or drop) if we have to, but there's no reason to do it "by default" to try to ensure we meet "lock-order requirements" (unless it's gonna panic in tests).

Contributor Author

Ok :)! I'll go ahead and remove the commit that takes all the locks in the beginning of the function.
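For anyone following along, the hazard behind this exchange is the classic lock-ordering deadlock. A toy illustration with plain std mutexes (nothing here is rust-lightning specific):

```rust
use std::sync::Mutex;

// If one thread locks `a` then `b` while another locks `b` then `a`, each
// can end up waiting forever on the lock the other already holds. The
// usual fixes are to acquire sibling locks in one agreed order everywhere,
// or to never hold two of them at once -- which is why an order between
// PeerState locks established only during serialization was considered
// something ordinary (potentially parallel) code must not rely on.
fn consistent_order(a: &Mutex<u32>, b: &Mutex<u32>) {
    let _a = a.lock().unwrap(); // always `a` first...
    let _b = b.lock().unwrap(); // ...then `b`
}

fn deadlock_prone(a: &Mutex<u32>, b: &Mutex<u32>) {
    let _b = b.lock().unwrap(); // reversed order: can deadlock against a
    let _a = a.lock().unwrap(); // concurrent call to `consistent_order`
}
```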

@ariard ariard left a comment

Looks good.

/// failed/claimed by the user.
///
/// Note that, no consistency guarantees are made about the channels given here actually
/// existing anymore by the time you go to read them!

I think it could be made more precise what is meant by "channels given here actually existing anymore by the time". It could mean a) the channel has been closed on-chain or funding has been aborted, or b) we no longer have any reference to it in our local map (i.e. ChannelHolder::by_id).

Collaborator

Aren't those two equivalent? I'm not sure what would be more specific.

@@ -691,6 +684,8 @@ pub type SimpleRefChannelManager<'a, 'b, 'c, 'd, 'e, M, T, F, L> = ChannelManage
// | |
// | |__`pending_inbound_payments`
// | |
// | |__`claimable_htlcs`

From my understanding this lock order tree is currently prescriptive (where we would like to go) but not descriptive (where we are today), in the sense that the claimable_htlcs lock is currently taken in process_pending_htlc_forwards() without taking per_peer_state. Or what do you aim to convey?

Contributor Author

@ViktorTigerstrom ViktorTigerstrom Nov 2, 2022

So the tree is "descriptive" today in the sense that none of the illustrated ordering is violated anywhere in our code (or well, after this PR we will violate the branch illustration, but only in the ChannelManager::write function).
But you are correct that, for example, the claimable_htlcs lock is never acquired at the same time as the per_peer_state lock in our codebase.

However, the tree illustration is also meant to give guidance on which order you are supposed to take the locks in, in case you implement new code in the future that needs to acquire both locks at the same time.
I deliberately chose to illustrate per_peer_state before pending_inbound_payments (and therefore also claimable_htlcs), even though that's not actually enforced by the "real" lock order today, as pending_inbound_payments currently has a "real" lock order after channel_state. I also wanted to avoid new PRs landing before #1507 that create "real" lock orders placing pending_inbound_payments and pending_outbound_payments before per_peer_state but after channel_state, as those would need to be changed once channels are moved to per_peer_state with #1507.

Though I've realized now that most of the places in the code that require pending_inbound_payments to be locked before channel_state are actually removed with this PR :). I created a draft PR #1773 that moves the pending_inbound_payments and pending_outbound_payments lock order to before channel_state.

Now that I think about it again, we should actually include that lock order change with this PR, as doing so will solve most of the issues I've faced in this PR with the ChannelManager::write function, and will allow us to take the pending_inbound_payments, pending_outbound_payments and claimable_htlcs locks and hold them throughout the function 🎉
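As a rough illustration of what that reordering enables in ChannelManager::write (the struct shape and field names here are assumptions for the sketch, not the actual definition, and the exact tree position of claimable_htlcs is not implied):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct ChannelHolder;
struct InboundPayments;
struct OutboundPayments;
struct ClaimableHTLC;

struct ManagerSketch {
    channel_state: Mutex<ChannelHolder>,
    pending_inbound_payments: Mutex<InboundPayments>,
    pending_outbound_payments: Mutex<OutboundPayments>,
    claimable_htlcs: Mutex<HashMap<[u8; 32], Vec<ClaimableHTLC>>>,
}

// With the payment maps ordered before channel_state, serialization can
// take these guards once and keep them alive for the rest of the function
// instead of re-locking (or cloning data such as PaymentPurpose) later.
fn write_sketch(mgr: &ManagerSketch) {
    let _inbound = mgr.pending_inbound_payments.lock().unwrap();
    let _outbound = mgr.pending_outbound_payments.lock().unwrap();
    let _claimable = mgr.claimable_htlcs.lock().unwrap();
    let _channel_state = mgr.channel_state.lock().unwrap();
    // ... serialize channels and payment state while all guards are held ...
}
```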

@valentinewallace valentinewallace left a comment

Last two commits look good! Will check out the first commit once it's updated

@ViktorTigerstrom
Contributor Author

ViktorTigerstrom commented Nov 2, 2022

Thanks for the reviews @ariard and @valentinewallace :)!

Dropped the first commit that acquired all of the locks in the top of the ChannelManager::write function.
Added a new commit 6c94add that moves the lock order for pending_inbound_payments and pending_outbound_payments to before the channel_state lock (and therefore also the per_peer_state which was troublesome before). This now allows us to hold the locks throughout much of the ChannelManager::write function and therefore removes the need to clone the PaymentPurpose :)

Edit: So locks are still getting poisoned on Windows builds only... I'm not really sure why this is yet. I would have hoped that this was fixed by the latest version. I'll look into it.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2022-10-move-claimable-htlcs-to-seperate-lock branch 2 times, most recently from c0813d3 to 6c94add on November 2, 2022 23:47
@@ -681,17 +674,19 @@ pub type SimpleRefChannelManager<'a, 'b, 'c, 'd, 'e, M, T, F, L> = ChannelManage
// |
// |__`forward_htlcs`
// |
// |__`channel_state`
// |__`pending_inbound_payments`
Contributor Author

@ViktorTigerstrom ViktorTigerstrom Nov 3, 2022

Just a general question: would we actually, in the long run, like pending_inbound_payments and pending_outbound_payments to be peer-specific (i.e. stored in the PeerState)? I haven't really researched all the implications that would have yet. My first hunch is that it'd be pretty beneficial for parallelization (especially pending_outbound_payments), but hard in terms of backwards compatibility. But if we would like it, maybe my last commit isn't such a great idea in the long run, though it might be ok for now.

Edit: Oh actually, as they'd be part of the PeerState mutex lock and not individual locks any longer, the lock order change in the last commit should be ok :)

@ViktorTigerstrom
Contributor Author

So none of the tests fail on a local Windows build I set up. I'm thinking that it might help if I rebase on main.

Please let me know if you're okay with me rebasing this on upstream :)

@valentinewallace
Contributor

So none of the tests fail on a local Windows build I set up. I'm thinking that it might help if I rebase on main.

Please let me know if you're okay with me rebasing this on upstream :)

Go for it IMO. I tried to look into that error and it baffles me 😬

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2022-10-move-claimable-htlcs-to-seperate-lock branch from 6c94add to 8318961 on November 3, 2022 21:01
@ViktorTigerstrom
Contributor Author

Thanks @valentinewallace.

Rebased on upstream with no changes. Hopefully CI passes now 😅

@ViktorTigerstrom
Contributor Author

Hmmm ok so the CI still doesn't pass. I'm really baffled as well @valentinewallace 😅... Will try to figure out what's going on 😁

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2022-10-move-claimable-htlcs-to-seperate-lock branch from 8318961 to 8ce4756 on November 5, 2022 15:08
@ViktorTigerstrom
Contributor Author

Rebased on upstream to fix merge conflicts

@ViktorTigerstrom
Contributor Author

As I'm unable to reproduce the CI problems locally, I'm pushing a test commit to see if this is somehow related to Windows treating the acquiring of the RwLock for per_peer_state differently compared to the other builds.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2022-10-move-claimable-htlcs-to-seperate-lock branch 2 times, most recently from 93a64fc to c2037a1 on November 7, 2022 00:13

@TheBlueMatt TheBlueMatt left a comment

Feel free to squash.

/// Map from payment hash to the payment data and any HTLCs which are to us and can be
/// failed/claimed by the user.
///
/// Note that, no consistency guarantees are made about the channels given here actually
Collaborator

nit: s/,//

// | |__`per_peer_state`
// | |
// | |__`outbound_scid_aliases`
// | |__`pending_outbound_payments`
Collaborator

I don't think this is true - claim_funds_internal takes a pending_outbound_payments lock while holding the channel_state mutex. Maybe just drop the mutex-order-swap commit for now and we can do it separately later?

Contributor Author

@ViktorTigerstrom ViktorTigerstrom Nov 15, 2022

Hmm, the channel_state lock is dropped (mem::drop) before pending_outbound_payments is locked in claim_funds_internal, no?

I can of course instead drop the mutex-order-swap commit for now if we'd prefer though :). That will require that we clone the PaymentPurpose for the claimable_htlcs in the ChannelManager::write function as per the #1772 (comment) discussion, or that we do the weird scoping I did with 266175d. Are we ok with either of those alternatives?
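For reference, the pattern being described is roughly the following; this is a simplified sketch with placeholder types, not the real claim_funds_internal:

```rust
use std::sync::{Mutex, MutexGuard};

struct ChannelHolder;
struct OutboundPayments;

struct ManagerSketch {
    pending_outbound_payments: Mutex<OutboundPayments>,
}

// The channel_state guard is dropped explicitly before
// pending_outbound_payments is locked, so no
// channel_state -> pending_outbound_payments ordering is ever established.
fn claim_funds_internal_sketch(
    mgr: &ManagerSketch,
    channel_state_lock: MutexGuard<'_, ChannelHolder>,
) {
    // ... work that needs channel_state elided ...
    std::mem::drop(channel_state_lock);

    let _outbounds = mgr.pending_outbound_payments.lock().unwrap();
    // ... update the outbound payment tracking ...
}
```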

@valentinewallace
Contributor

@ViktorTigerstrom OOC, how did you debug the Windows poison error? The fix seems totally unrelated to the PR..

@ViktorTigerstrom
Contributor Author

ViktorTigerstrom commented Nov 15, 2022

Sorry for taking some time to continue this PR; I'm having visitors this week so I'm a bit strapped for time.

@ViktorTigerstrom OOC, how did you debug the Windows poison error? The fix seems totally unrelated to the PR..

Well, I actually never managed to debug the error 😅... I tried debugging it locally on a Windows machine, but weirdly enough no tests failed locally on that machine. I even tried both Windows 10 and 11 in case it was some weird OS-version-specific issue.
However, I found deep in the trace that somehow TestPersister::chain_sync_monitor_persistences had gotten a lock order before ChannelManager::per_peer_state on the CI Windows build, which is why the fix commit above helped the issue. I'm planning to dig more into what's actually going on, but my best guess so far is that it's due to some difference in code under the Windows cfg, possibly in lightning-persister/src/util.rs, though I'm not sure yet.

My best guess so far for why the errors don't appear locally on my Windows machine is that lock orders assigned across different tests aren't checked on a local build, but only through CI (which may be related to some incorrect local configuration on my part or similar), as we have some tests that acquire the TestPersister::chain_sync_monitor_persistences lock in the actual test code.

Will let you know what I find once I've had time to dig into it deeper :)!
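For readers unfamiliar with how a lock order can be "assigned across different tests" in the first place, here is a heavily simplified sketch of that style of checking. It is only inspired by the idea behind rust-lightning's debug_sync and is not its actual code: because the observed "taken-while-holding" pairs live in process-global state, two different tests running in the same test binary can establish conflicting orders.

```rust
use std::cell::RefCell;
use std::sync::Mutex;

// Toy checker, not rust-lightning's debug_sync.
// Process-global record of observed orderings: (held_first, taken_second).
static RECORDED: Mutex<Vec<(&'static str, &'static str)>> = Mutex::new(Vec::new());

thread_local! {
    // Locks currently held by this thread, in acquisition order.
    static HELD: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

/// Record that the lock named `name` was just acquired; panic if some
/// earlier acquisition (possibly in another test) used the reverse order.
fn on_lock(name: &'static str) {
    let mut recorded = RECORDED.lock().unwrap();
    HELD.with(|held| {
        for &prev in held.borrow().iter() {
            if recorded.contains(&(name, prev)) {
                panic!("lock order violation between `{prev}` and `{name}`");
            }
            if !recorded.contains(&(prev, name)) {
                recorded.push((prev, name));
            }
        }
        held.borrow_mut().push(name);
    });
}

/// Record that the lock named `name` was released.
fn on_unlock(name: &'static str) {
    HELD.with(|held| {
        let mut held = held.borrow_mut();
        if let Some(pos) = held.iter().rposition(|&n| n == name) {
            held.remove(pos);
        }
    });
}
```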

@valentinewallace
Contributor

Thanks for the report, what a strange error. Can't see why this PR brought it out.

@TheBlueMatt
Collaborator

Feel free to squash. Re: things different on CI vs local - did you try cd lightning && cargo test --features backtrace?

As the `channel_state` lock will be removed, we prepare for that by
flipping the lock order of the `pending_inbound_payments` and
`pending_outbound_payments` locks to come before the `channel_state` lock.
For Windows builds only, the
`TestPersister::chain_sync_monitor_persistences` lock has a lock order
before the `ChannelManager::per_peer_state` lock. This fix ensures that
the `per_peer_state` lock isn't held before the
`TestPersister::chain_sync_monitor_persistences` lock is acquired.
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2022-10-move-claimable-htlcs-to-seperate-lock branch from c2037a1 to 782eb36 on November 21, 2022 20:58
@ViktorTigerstrom
Contributor Author

Squashed without changes.

@TheBlueMatt are we okay with keeping 6b12117 as discussed in #1772 (comment), or would you prefer cloning the PaymentPurpose for the claimable_htlcs in the ChannelManager::write function?

@valentinewallace valentinewallace left a comment

Otherwise LGTM

@TheBlueMatt TheBlueMatt merged commit 32fdeb7 into lightningdevkit:main Nov 22, 2022
Successfully merging this pull request may close these issues.

Avoid taking and immediately releasing and re-taking channel_state in claim_funds