
Add doc about mem.*Unsafe consuming the input buffer #8209


Closed · wants to merge 1 commit

Conversation


@Groxx Groxx commented Mar 29, 2025

I've been reading the new buffer-pool-related code due to recent OOM problems with other libraries misusing sync.Pool, and the varying behavior between Buffer and SliceBuffer had me confused for quite a while.

Buffer mutates itself when read, and implicitly calls Free when fully consumed:

grpc-go/mem/buffers.go

Lines 194 to 200 in 5edab9e

n := copy(buf, b.data)
if n == len(b.data) {
b.Free()
return n, nil
}
b.data = b.data[n:]

While SliceBuffer returns a sub-slice but does not modify itself:

grpc-go/mem/buffers.go

Lines 263 to 267 in 5edab9e

n := copy(buf, s)
if n == len(s) {
return n, nil
}
return n, s[n:]

The only way these *Unsafe funcs can be used correctly is by replacing the input buffer with the result buffer(s), as the current code does:

n, r.last = mem.ReadUnsafe(header, r.last)

... which seems worth documenting since it feels like a huge footgun otherwise: you only trigger the self-mutating-and-freeing behavior for relatively large data, which is less common in tests.

Granted, this isn't a technical barrier at all, but the behavior differences between implementations misled me for a while, and this "must not use the input Buffer" requirement doesn't seem obviously necessary when reading the code. Might as well save future people that effort.
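The replace-the-input contract can be sketched with simplified stand-ins (the type and function here are illustrative, not the real grpc-go API):

```go
package main

import "fmt"

// Simplified stand-ins for mem.Buffer and mem.ReadUnsafe; names and
// signatures are assumptions for illustration, not the real grpc-go API.
type Buffer []byte

// readUnsafe copies into dst and returns the unread remainder. A nil
// remainder means the input was fully consumed (the real Buffer would
// also Free itself at that point).
func readUnsafe(dst []byte, buf Buffer) (int, Buffer) {
	n := copy(dst, buf)
	if n == len(buf) {
		return n, nil
	}
	return n, buf[n:]
}

func main() {
	buf := Buffer("hello world")
	dst := make([]byte, 5)
	var n int
	// The only safe calling pattern: overwrite the input with the result,
	// mirroring `n, r.last = mem.ReadUnsafe(header, r.last)` above.
	n, buf = readUnsafe(dst, buf)
	fmt.Println(n, string(dst), string(buf)) // 5 hello  world
}
```

After the call, `buf` is the remainder and the original value must not be touched again, since the real implementation may have freed it.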


On a related note, Buffer feels... a bit concerningly strange in regards to concurrency, and I didn't see it called out in the PR that introduced it so I think it might not have been noticed.

E.g. Buffers are not concurrency-safe:

grpc-go/mem/buffers.go

Lines 39 to 41 in 5edab9e

// Note that a Buffer is not safe for concurrent access and instead each
// goroutine should use its own reference to the data, which can be acquired via
// a call to Ref().

But it uses an atomic refcount:

grpc-go/mem/buffers.go

Lines 75 to 80 in 5edab9e

type buffer struct {
origData *[]byte
data []byte
refs *atomic.Int32
pool BufferPool
}

... but it doesn't use it in an atomically-safe way, as it sets the value to nil when freeing:
b.refs = nil

And if you include the fact that the ref came from a pool:

refObjectPool.Put(b.refs)

it seems like this code cannot possibly be correct:

  • non-racing code is paying for atomics it does not need
  • racing code can avoid some checks by using a type that is race-safe
    • this might not be reachable in practice though
  • racing code can double-Put refcounts and buffers (with a racing Ref, Free, Ref, Free sequence)
  • racing code can leak refcount changes across Buffer instances due to pooled reuse

So this is introducing difficult-to-investigate failure modes that non-atomic or non-pooled instances would not have. It seems like it'd be better to either just use an int field or not reuse atomics (if the goal is to detect races).

Also I would be quite surprised if pooled *int values perform better than non-pooled, given the cheap initialization and sync.Pool's overhead, but I don't have any evidence either way.

Is there some benefit to this I'm missing? Maybe something about it can catch tricky races or misuse that would otherwise be missed? It seems misleading and dangerous to me otherwise.

RELEASE NOTES: none

CLA Not Signed

@Groxx Groxx changed the title Add note about mem.*Unsafe consuming the input buffer Add doc about mem.*Unsafe consuming the input buffer Mar 29, 2025

codecov bot commented Mar 29, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.01%. Comparing base (5edab9e) to head (0e114fe).
Report is 71 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8209      +/-   ##
==========================================
- Coverage   82.17%   82.01%   -0.17%     
==========================================
  Files         410      410              
  Lines       40236    40236              
==========================================
- Hits        33065    32998      -67     
- Misses       5822     5873      +51     
- Partials     1349     1365      +16     
Files with missing lines Coverage Δ
mem/buffers.go 88.18% <ø> (ø)

... and 24 files with indirect coverage changes


@eshitachandwani
Member

@Groxx Can you please sign the CLA?

@eshitachandwani
Member

@Groxx, the PR looks good to me, but for the other part, can you open an issue and we can have a discussion there?
Also adding Purnesh for a review.

@@ -207,13 +207,19 @@ func (b *buffer) String() string {

// ReadUnsafe reads bytes from the given Buffer into the provided slice.
// It does not perform safety checks.
//
// The returned Buffer MUST be used in place of the one passed in, as it may
Contributor

@purnesh42H purnesh42H Apr 15, 2025


We have two implementations now: buffer and SliceBuffer, and buffer can free itself when fully consumed. Because we are not doing a type check, we can't be sure whether the original buffer passed in is okay to use again. So we can just put that as the note, because acting on the returned buffer is expected anyway.

Note: provided Buffer `buf` must not be used again. For any remaining data, use the returned Buffer.

@purnesh42H
Contributor

Defer to @dfawley for concurrency concern mentioned.

@purnesh42H purnesh42H assigned Groxx and dfawley and unassigned purnesh42H Apr 15, 2025
@dfawley
Member

dfawley commented Apr 16, 2025

@PapaCharlie

Do you have any response to the concurrency issues raised in the first comment? Do you think we could remove the atomics, or maybe something is needed to address the concerns? Thanks!

@dfawley dfawley added the Area: Documentation Includes examples and docs. label Apr 16, 2025
@dfawley dfawley added this to the 1.73 Release milestone Apr 16, 2025
@dfawley
Member

dfawley commented Apr 16, 2025

@Groxx we will need you to complete the CLA in order to merge this, thanks!

@PapaCharlie
Contributor

Regarding the atomics, each goroutine is intended to have its own Buffer. In other words, a single buffer is not meant to be accessed concurrently. The docs mention the following:

// Note that a Buffer is not safe for concurrent access and instead each
// goroutine should use its own reference to the data, which can be acquired via
// a call to Ref().

However, you are right that it doesn't explicitly mention that the Ref call should still be synchronous. Namely, the code should look like this:

ref := buf.Ref()
go func() {
  // use ref here
}()
buf.Free()

Rather than like this:

go func() {
  ref := buf.Ref()
  // use ref here
}()
buf.Free()

Though I would argue that the second form counts as "concurrent access" :)

The atomic counters are used to determine when the buffer can be safely freed, as the references get counted down.
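The counted-down free can be sketched roughly like this (a simplified model with assumed names, not the real mem.Buffer; in particular, here Ref returns the same pointer, whereas the real Buffer hands out distinct handles sharing one counter and returns memory to a pool):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// refBuf is a minimal model of a refcounted buffer.
type refBuf struct {
	data []byte
	refs *atomic.Int32
}

func newRefBuf(data []byte) *refBuf {
	r := new(atomic.Int32)
	r.Store(1)
	return &refBuf{data: data, refs: r}
}

// Ref adds a reference; the caller should take it synchronously,
// before handing the buffer to another goroutine.
func (b *refBuf) Ref() *refBuf {
	b.refs.Add(1)
	return b
}

// Free decrements the shared counter; whichever holder reaches zero
// releases the data.
func (b *refBuf) Free() bool {
	if b.refs.Add(-1) == 0 {
		b.data = nil
		return true
	}
	return false
}

func main() {
	b := newRefBuf([]byte("payload"))
	ref := b.Ref() // taken synchronously, before the goroutine starts
	done := make(chan struct{})
	go func() {
		_ = ref.data // use the reference
		ref.Free()
		close(done)
	}()
	b.Free()
	<-done
	fmt.Println("released:", b.data == nil) // released: true
}
```

With the Ref taken before the goroutine launches, both Frees are safe in either order, and exactly one of them observes zero and releases the data.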

@PapaCharlie
Contributor

As for the proposed change in general, the doc looks good to me. The original motivation for ReadUnsafe and SplitUnsafe was that they should only be used in the internal/transport package, where they would be used as expected, but it definitely pays to have better documentation around it!

since it feels like a huge footgun otherwise

Absolutely, hence calling them Unsafe in the first place. However, without that specific unsafe usage being made available by the mem package, it wasn't really possible to introduce this feature without tanking the performance.

@dfawley
Member

dfawley commented Apr 16, 2025

Absolutely, hence calling them Unsafe in the first place. However, without that specific unsafe usage being made available by the mem package, it wasn't really possible to introduce this feature without tanking the performance.

If we don't want anyone else using it, we could have an internal package that exposes it to the rest of gRPC instead:

package grpc/internal/mem

var ReadUnsafe func(dst []byte, buf Buffer) (int, Buffer)
--
package grpc/mem

internalmem.ReadUnsafe = func(dst []byte, buf Buffer) (int, Buffer) {
	return buf.read(dst)
}

But if this is turning out to be useful for others then I'm fine with leaving it exposed.
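That cross-package wiring can be sketched in a single file (a simplified analog; the real layout would use separate grpc/internal/mem and grpc/mem packages, as in the snippet above):

```go
package main

import "fmt"

// In grpc/internal/mem this would be an exported, initially-nil var
// that internal callers invoke.
var readUnsafe func(dst []byte, buf []byte) (int, []byte)

// In grpc/mem, an init func would assign the real (now unexported)
// implementation, so only gRPC-internal code can reach it.
func init() {
	readUnsafe = func(dst, buf []byte) (int, []byte) {
		n := copy(dst, buf)
		return n, buf[n:]
	}
}

func main() {
	dst := make([]byte, 3)
	n, rest := readUnsafe(dst, []byte("hello"))
	fmt.Println(n, string(dst), string(rest)) // 3 hel lo
}
```

This is the same "export through an internal package variable" pattern gRPC uses elsewhere to share unexported functionality across its own packages without widening the public API.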

@dfawley dfawley added Type: Documentation Documentation or examples and removed Area: Documentation Includes examples and docs. labels Apr 16, 2025
@Groxx
Author

Groxx commented Apr 16, 2025

Regarding the atomics, each goroutine is intended to have its own Buffer. In other words, a single buffer is not meant to be accessed concurrently. The docs mention the following:

// Note that a Buffer is not safe for concurrent access and instead each
// goroutine should use its own reference to the data, which can be acquired via
// a call to Ref().

However, you are right that it doesn't explicitly mention that the Ref call should still be synchronous. Namely, the code should look like this:

ref := buf.Ref()
go func() {
  // use ref here
}()
buf.Free()

Rather than like this:

go func() {
  ref := buf.Ref()
  // use ref here
}()
buf.Free()

Though I would argue that the second form counts as "concurrent access" :)

The atomic counters are used to determine when the buffer can be safely freed, as the references get counted down.

So that's actually a good sample of why I think trying to be race-safer here with atomics is actually worse than not using atomics, even just from a safety standpoint.

(as an up-front note here: all of these can be made safe and can be made to race. I'm just pointing out that a plausible misuse exists)

go func() {
  ref := buf.Ref()
  // use ref here
}()
buf.Free()

^ this kind of construct is a common enough mistake that golang/go#18022 has led to adding a go vet check to try to prevent it. So yea, agreed - it shouldn't be done.

Forgetting to wait for the goroutine is also fairly common, so that example should have been something like this:

done := make(chan struct{})
go func() {
  ref := buf.Ref()
  // use ref here
  close(done)
}()
<-done
buf.Free()

(I fully realize your example was just sample code and this is reasonable to omit, I'm just showing it to be more concrete)
This is in principle safe with synchronous use, so I think we can ignore it. It's just dubious compared to the obviously-correct Ref() outside the goroutine.

One risk that atomics bring is that in a -race build, code that makes both mistakes will be considered safe:

done := make(chan struct{})
b := buffer()
go func() {
	bb := b.Ref()
	time.Sleep(time.Second)
	// and then do not do anything with the value for some reason,
	// maybe some other check no-opped it.
	// or just don't use it in a racy way, e.g. maybe only read,
	// or only racy-write to different locations
	_ = bb
	close(done)
}()
b.Free()
<-done // note incorrect waiting, `Ref` and `Free` race

while without atomics it'll be a race violation:

WARNING: DATA RACE
Read at 0x00c000090308 by goroutine 11:
  scratch/demos.(*normalbuf).Ref()
      .../go/src/scratch/demos/grpc_test.go:33 +0x3c
  scratch/demos.TestNormal.func2.1()
      .../go/src/scratch/demos/grpc_test.go:86 +0x38

Previous write at 0x00c000090308 by goroutine 10:
  scratch/demos.(*normalbuf).Free()
      .../go/src/scratch/demos/grpc_test.go:37 +0x108
  scratch/demos.TestNormal.func2()
      .../go/src/scratch/demos/grpc_test.go:92 +0xf0
Full test:
package demos

import (
	"sync/atomic"
	"testing"
	"time"
)

type atomicbuf struct {
	count int64 // must be used atomically
}

func (b *atomicbuf) Ref() *atomicbuf {
	atomic.AddInt64(&b.count, 1)
	return b
}
func (b *atomicbuf) Free() {
	atomic.AddInt64(&b.count, -1)
}

type normalbuf struct {
	count int64
}

func (b *normalbuf) Ref() *normalbuf {
	b.count++
	return b
}
func (b *normalbuf) Free() {
	b.count--
}

func TestAtomic(t *testing.T) {
	t.Run("safe in both constructs", func(t *testing.T) {
		done := make(chan struct{})
		b := &atomicbuf{}
		bb := b.Ref()
		go func() {
			time.Sleep(time.Second)
			// and then do not do anything with `bb` for some reason
			_ = bb
			close(done)
		}()
		b.Free()
		// ensuring the goroutine finishes, rather than ending the process early
		<-done
	})
	t.Run("unsafe but allowed with atomics", func(t *testing.T) {
		done := make(chan struct{})
		b := &atomicbuf{}
		go func() {
			bb := b.Ref() // moved inside the goroutine
			time.Sleep(time.Second)
			// and then do not do anything with the value for some reason
			_ = bb
			close(done)
		}()
		b.Free()
		// ensuring the race occurs, rather than ending the process early
		<-done
	})
}
func TestNormal(t *testing.T) {
	t.Run("safe in both constructs", func(t *testing.T) {
		done := make(chan struct{})
		b := &normalbuf{}
		bb := b.Ref()
		go func() {
			_ = bb // capture it
			time.Sleep(time.Second)
			// and then do not do anything with `bb` for some reason
			close(done)
		}()
		b.Free()
		// ensuring the goroutine finishes, rather than ending the process early
		<-done
	})
	t.Run("unsafe and blocked by race detector", func(t *testing.T) {
		done := make(chan struct{})
		b := &normalbuf{}
		go func() {
			bb := b.Ref() // moved inside the goroutine
			time.Sleep(time.Second)
			// and then do not do anything with the value for some reason
			_ = bb
			close(done)
		}()
		b.Free()
		// ensuring the race occurs, rather than ending the process early
		<-done
	})
}

With racy atomic code like that, it's also possible to trigger double-frees due to the non-atomic b.refs = nil line: that may not have propagated across threads, so old buf references may Ref(); Free() without triggering the < 0 case with panic("Cannot free freed buffer").

So it's hiding some misuse, and not reliably preventing misuse. I'd argue it should either be safe (with added runtime cost) or not (without), not something in the middle.
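The cross-Buffer leak from pooled reuse can be shown deterministically with a simple freelist standing in for sync.Pool (a sketch with made-up names; the real code's refObjectPool has the same hazard, just nondeterministically):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// freelist is a deterministic stand-in for sync.Pool so the reuse
// below is guaranteed to happen.
type freelist struct{ items []*atomic.Int32 }

func (f *freelist) Get() *atomic.Int32 {
	if n := len(f.items); n > 0 {
		it := f.items[n-1]
		f.items = f.items[:n-1]
		return it
	}
	return new(atomic.Int32)
}

func (f *freelist) Put(r *atomic.Int32) { f.items = append(f.items, r) }

func main() {
	pool := &freelist{}

	// Buffer A is created, then freed: its counter returns to the pool.
	refsA := pool.Get()
	refsA.Store(1)
	stale := refsA // a racing goroutine's leftover view of A's counter
	refsA.Add(-1)
	pool.Put(refsA)

	// Buffer B is created and reuses the same counter object.
	refsB := pool.Get()
	refsB.Store(1)

	// A stale Ref() on buffer A now bumps buffer B's count instead:
	stale.Add(1)
	fmt.Println(refsB.Load()) // 2, though B has only one real reference
}
```

Buffer B now needs two Frees before it actually releases, or conversely a stale Free on A could release B early, which is exactly the pooled-reuse leak described above.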

@Groxx
Author

Groxx commented Apr 16, 2025

re CLA...

... tbh I don't think I can agree with a license that requires someone to share their mailing address. That seems like an absurd overreach.

@dfawley
Member

dfawley commented Apr 16, 2025

This is the standard Linux Foundation / CNCF CLA, and the repo requires signing the agreement in order to merge pull requests. If you don't want to sign it, that's fine - we will just close this PR.

@dfawley
Member

dfawley commented Apr 16, 2025

There's also some chance you may have attempted to go through the "corporate" license workflow instead of the "individual" one. Asking around, nobody was aware that you'd need to provide a mailing address to agree to the CLA. Unfortunately, there's no easy way for me to test this for myself.

@Groxx
Author

Groxx commented Apr 21, 2025

There's also some chance you may have attempted to go through the "corporate" license workflow instead of the "individual" one. Asking around, nobody was aware that you'd need to provide a mailing address to agree to the CLA. Unfortunately, there's no easy way for me to test this for myself.

No, it's the individual one.
Individual_Contributor_License_Agreement.pdf
And the signing UI has that field required.

I see that a fair number of others also have that clause, but nothing seems to describe why it's necessary... and there are also a fair number that do not, e.g. Google's OpenSource CLA needs just a name and username.

So yea, that seems like an absurd overreach. Surely only legal jurisdiction is relevant, it's not like CNCF is going to mail me a letter at an unverified address if they want to change a project's license or something... and I thought the whole point of a CLA was to remove that need, by retaining rights for the project.


This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Apr 27, 2025
@dfawley
Member

dfawley commented Apr 28, 2025

Blocked on the CLA issue; being investigated in communitybridge/easycla#4638

@dfawley
Member

dfawley commented May 14, 2025

@Groxx can you try again? IIUC they removed the address requirement.


This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added stale and removed stale labels May 20, 2025

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label May 26, 2025
@eshitachandwani
Member

@Groxx Closing the PR as it has been 2 weeks since any activity. Feel free to re-open.

5 participants