
Add doc about mem.*Unsafe consuming the input buffer #8209


Closed · wants to merge 1 commit

Conversation


@Groxx Groxx commented Mar 29, 2025

I've been reading the new buffer-pool-related code due to recent OOM problems with other libraries misusing sync.Pool, and the varying behavior between Buffer and SliceBuffer had me confused for quite a while.

Buffer mutates itself when read, and implicitly calls Free when fully consumed:

grpc-go/mem/buffers.go

Lines 194 to 200 in 5edab9e

n := copy(buf, b.data)
if n == len(b.data) {
b.Free()
return n, nil
}
b.data = b.data[n:]

While SliceBuffer returns a sub-slice but does not modify itself:

grpc-go/mem/buffers.go

Lines 263 to 267 in 5edab9e

n := copy(buf, s)
if n == len(s) {
return n, nil
}
return n, s[n:]

The only way these *Unsafe funcs can be used correctly is by replacing the input buffer with the result buffer(s), as the current code does:

n, r.last = mem.ReadUnsafe(header, r.last)

... which seems worth documenting since it feels like a huge footgun otherwise: you only trigger the self-mutating-and-freeing behavior for relatively large data, which is less common in tests.

Granted, this isn't a technical barrier at all, but the behavior differences between implementations misled me for a while, and this "must not use the input Buffer" requirement doesn't seem obviously necessary when reading the code. Might as well save future people that effort.
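The replace-the-input contract can be sketched with simplified stand-ins (the type and function here are illustrative, not the real grpc-go API):

```go
package main

import "fmt"

// Simplified stand-ins for mem.Buffer and mem.ReadUnsafe; names and
// signatures are assumptions for illustration, not the real grpc-go API.
type Buffer []byte

// readUnsafe copies into dst and returns the unread remainder. A nil
// remainder means the input was fully consumed (the real Buffer would
// also Free itself at that point).
func readUnsafe(dst []byte, buf Buffer) (int, Buffer) {
	n := copy(dst, buf)
	if n == len(buf) {
		return n, nil
	}
	return n, buf[n:]
}

func main() {
	buf := Buffer("hello world")
	dst := make([]byte, 5)
	var n int
	// The only safe calling pattern: overwrite the input with the result,
	// mirroring `n, r.last = mem.ReadUnsafe(header, r.last)` above.
	n, buf = readUnsafe(dst, buf)
	fmt.Println(n, string(dst), string(buf)) // 5 hello  world
}
```

After the call, `buf` is the remainder and the original value must not be touched again, since the real implementation may have freed it.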


On a related note, Buffer feels... a bit concerningly strange in regards to concurrency, and I didn't see it called out in the PR that introduced it so I think it might not have been noticed.

E.g. Buffers are not concurrency-safe:

grpc-go/mem/buffers.go

Lines 39 to 41 in 5edab9e

// Note that a Buffer is not safe for concurrent access and instead each
// goroutine should use its own reference to the data, which can be acquired via
// a call to Ref().

But it uses an atomic refcount:

grpc-go/mem/buffers.go

Lines 75 to 80 in 5edab9e

type buffer struct {
origData *[]byte
data []byte
refs *atomic.Int32
pool BufferPool
}

... but it doesn't use it in an atomically-safe way, as it sets the value to nil when freeing:
b.refs = nil

And if you include the fact that the ref came from a pool:

refObjectPool.Put(b.refs)

it seems like this code cannot possibly be correct:

  • non-racing code is paying for atomics it does not need
  • racing code can avoid some checks by using a type that is race-safe
    • this might not be reachable in practice though
  • racing code can double-Put refcounts and buffers (with a racing Ref, Free, Ref, Free sequence)
  • racing code can leak refcount changes across Buffer instances due to pooled reuse

So this is introducing difficult-to-investigate failure modes that non-atomic or non-pooled instances would not have. It seems like it'd be better to either just use an int field or not reuse atomics (if the goal is to detect races).

Also I would be quite surprised if pooled *int values perform better than non-pooled, given the cheap initialization and sync.Pool's overhead, but I don't have any evidence either way.

Is there some benefit to this I'm missing? Maybe something about it can catch tricky races or misuse that would otherwise be missed? It seems misleading and dangerous to me otherwise.

RELEASE NOTES: none

CLA Not Signed

@Groxx Groxx changed the title Add note about mem.*Unsafe consuming the input buffer Add doc about mem.*Unsafe consuming the input buffer Mar 29, 2025

codecov bot commented Mar 29, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.01%. Comparing base (5edab9e) to head (0e114fe).
Report is 71 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8209      +/-   ##
==========================================
- Coverage   82.17%   82.01%   -0.17%     
==========================================
  Files         410      410              
  Lines       40236    40236              
==========================================
- Hits        33065    32998      -67     
- Misses       5822     5873      +51     
- Partials     1349     1365      +16     
Files with missing lines Coverage Δ
mem/buffers.go 88.18% <ø> (ø)

... and 24 files with indirect coverage changes


@eshitachandwani
Member

@Groxx Can you please sign the CLA?

@eshitachandwani
Member

@Groxx, the PR looks good to me, but for the other part, can you open an issue and we can have a discussion there?
Also adding Purnesh for a review.

@@ -207,13 +207,19 @@ func (b *buffer) String() string {

// ReadUnsafe reads bytes from the given Buffer into the provided slice.
// It does not perform safety checks.
//
// The returned Buffer MUST be used in place of the one passed in, as it may
Contributor

@purnesh42H purnesh42H Apr 15, 2025


We have two implementations now: buffer and SliceBuffer, and buffer can free itself when fully consumed. Because we are not doing a type check, we can't be sure whether the original buffer passed in is okay to use again. So we can just put that as the note, because acting on the returned buffer is expected anyway.

Note: provided Buffer `buf` must not be used again. For any remaining data, use the returned Buffer.

@purnesh42H
Contributor

Defer to @dfawley for concurrency concern mentioned.

@purnesh42H purnesh42H assigned Groxx and dfawley and unassigned purnesh42H Apr 15, 2025
@dfawley
Member

dfawley commented Apr 16, 2025

@PapaCharlie

Do you have any response to the concurrency issues raised in the first comment? Do you think we could remove the atomics, or maybe something is needed to address the concerns? Thanks!

@dfawley dfawley added the Area: Documentation Includes examples and docs. label Apr 16, 2025
@dfawley dfawley added this to the 1.73 Release milestone Apr 16, 2025
@dfawley
Member

dfawley commented Apr 16, 2025

@Groxx we will need you to complete the CLA in order to merge this, thanks!

@PapaCharlie
Contributor

Regarding the atomics, each goroutine is intended to have its own Buffer. In other words, a single buffer is not meant to be accessed concurrently. The docs mention the following:

// Note that a Buffer is not safe for concurrent access and instead each
// goroutine should use its own reference to the data, which can be acquired via
// a call to Ref().

However, you are right that it doesn't explicitly mention that the Ref call should still be synchronous. Namely, the code should look like this:

ref := buf.Ref()
go func() {
  // use ref here
}()
buf.Free()

Rather than like this:

go func() {
  ref := buf.Ref()
  // use ref here
}()
buf.Free()

Though I would argue that the second form counts as "concurrent access" :)

The atomic counters are used to determine when the buffer can be safely freed, as the references get counted down.
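The counted-down free can be sketched roughly like this (a simplified model with assumed names, not the real mem.Buffer; in particular, here Ref returns the same pointer, whereas the real Buffer hands out distinct handles sharing one counter and returns memory to a pool):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// refBuf is a minimal model of a refcounted buffer.
type refBuf struct {
	data []byte
	refs *atomic.Int32
}

func newRefBuf(data []byte) *refBuf {
	r := new(atomic.Int32)
	r.Store(1)
	return &refBuf{data: data, refs: r}
}

// Ref adds a reference; the caller should take it synchronously,
// before handing the buffer to another goroutine.
func (b *refBuf) Ref() *refBuf {
	b.refs.Add(1)
	return b
}

// Free decrements the shared counter; whichever holder reaches zero
// releases the data.
func (b *refBuf) Free() bool {
	if b.refs.Add(-1) == 0 {
		b.data = nil
		return true
	}
	return false
}

func main() {
	b := newRefBuf([]byte("payload"))
	ref := b.Ref() // taken synchronously, before the goroutine starts
	done := make(chan struct{})
	go func() {
		_ = ref.data // use the reference
		ref.Free()
		close(done)
	}()
	b.Free()
	<-done
	fmt.Println("released:", b.data == nil) // released: true
}
```

With the Ref taken before the goroutine launches, both Frees are safe in either order, and exactly one of them observes zero and releases the data.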

@PapaCharlie
Contributor

As for the proposed change in general, the doc looks good to me. The original motivation for ReadUnsafe and SplitUnsafe was that they should only be used in the internal/transport package, where they would be used as expected, but it definitely pays to have better documentation around it!

since it feels like a huge footgun otherwise

Absolutely, hence calling them Unsafe in the first place. However, without that specific unsafe usage being made available by the mem package, it wasn't really possible to introduce this feature without tanking the performance.

@dfawley
Member

dfawley commented Apr 16, 2025

Absolutely, hence calling them Unsafe in the first place. However, without that specific unsafe usage being made available by the mem package, it wasn't really possible to introduce this feature without tanking the performance.

If we don't want anyone else using it, we could have an internal package that exposes it to the rest of gRPC instead:

package grpc/internal/mem

var ReadUnsafe func(dst []byte, buf Buffer) (int, Buffer)
--
package grpc/mem

internalmem.ReadUnsafe = func(dst []byte, buf Buffer) (int, Buffer) {
	return buf.read(dst)
}

But if this is turning out to be useful for others then I'm fine with leaving it exposed.
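That cross-package wiring can be sketched in a single file (a simplified analog; the real layout would use separate grpc/internal/mem and grpc/mem packages, as in the snippet above):

```go
package main

import "fmt"

// In grpc/internal/mem this would be an exported, initially-nil var
// that internal callers invoke.
var readUnsafe func(dst []byte, buf []byte) (int, []byte)

// In grpc/mem, an init func would assign the real (now unexported)
// implementation, so only gRPC-internal code can reach it.
func init() {
	readUnsafe = func(dst, buf []byte) (int, []byte) {
		n := copy(dst, buf)
		return n, buf[n:]
	}
}

func main() {
	dst := make([]byte, 3)
	n, rest := readUnsafe(dst, []byte("hello"))
	fmt.Println(n, string(dst), string(rest)) // 3 hel lo
}
```

This is the same "export through an internal package variable" pattern gRPC uses elsewhere to share unexported functionality across its own packages without widening the public API.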

@dfawley dfawley added Type: Documentation Documentation or examples and removed Area: Documentation Includes examples and docs. labels Apr 16, 2025
@Groxx
Author

Groxx commented Apr 16, 2025

Regarding the atomics, each goroutine is intended to have its own Buffer. In other words, a single buffer is not meant to be accessed concurrently. The docs mention the following:

// Note that a Buffer is not safe for concurrent access and instead each
// goroutine should use its own reference to the data, which can be acquired via
// a call to Ref().

However, you are right that it doesn't explicitly mention that the Ref call should still be synchronous. Namely, the code should look like this:

ref := buf.Ref()
go func() {
  // use ref here
}()
buf.Free()

Rather than like this:

go func() {
  ref := buf.Ref()
  // use ref here
}()
buf.Free()

Though I would argue that the second form counts as "concurrent access" :)

The atomic counters are used to determine when the buffer can be safely freed, as the references get counted down.

So that's actually a good sample of why I think trying to be race-safer here with atomics is actually worse than not using atomics, even just from a safety standpoint.

(as an up-front note here: all of these can be made safe and can be made to race. I'm just pointing out that a plausible misuse exists)

go func() {
  ref := buf.Ref()
  // use ref here
}()
buf.Free()

^ this kind of construct is a common enough mistake that golang/go#18022 has led to adding a go vet check to try to prevent it. So yea, agreed - it shouldn't be done.

Forgetting to wait for the goroutine is also fairly common, so that example should have been something like this:

done := make(chan struct{})
go func() {
  ref := buf.Ref()
  // use ref here
  close(done)
}()
<-done
buf.Free()

(I fully realize your example was just sample code and this is reasonable to omit, I'm just showing it to be more concrete)
This is in principle safe with synchronous use, so I think we can ignore it. It's just dubious compared to the obviously-correct Ref() outside the goroutine.

One risk that atomics bring is that in a -race build, code that makes both mistakes will be considered safe:

done := make(chan struct{})
b := buffer()
go func() {
	bb := b.Ref()
	time.Sleep(time.Second)
	// and then do not do anything with the value for some reason,
	// maybe some other check no-opped it.
	// or just don't use it in a racy way, e.g. maybe only read,
	// or only racy-write to different locations
	_ = bb
	close(done)
}()
b.Free()
<-done // note incorrect waiting, `Ref` and `Free` race

while without atomics it'll be a race violation:

WARNING: DATA RACE
Read at 0x00c000090308 by goroutine 11:
  scratch/demos.(*normalbuf).Ref()
      .../go/src/scratch/demos/grpc_test.go:33 +0x3c
  scratch/demos.TestNormal.func2.1()
      .../go/src/scratch/demos/grpc_test.go:86 +0x38

Previous write at 0x00c000090308 by goroutine 10:
  scratch/demos.(*normalbuf).Free()
      .../go/src/scratch/demos/grpc_test.go:37 +0x108
  scratch/demos.TestNormal.func2()
      .../go/src/scratch/demos/grpc_test.go:92 +0xf0
Full test:
package demos

import (
	"sync/atomic"
	"testing"
	"time"
)

type atomicbuf struct {
	count int64 // must be used atomically
}

func (b *atomicbuf) Ref() *atomicbuf {
	atomic.AddInt64(&b.count, 1)
	return b
}
func (b *atomicbuf) Free() {
	atomic.AddInt64(&b.count, -1)
}

type normalbuf struct {
	count int64
}

func (b *normalbuf) Ref() *normalbuf {
	b.count++
	return b
}
func (b *normalbuf) Free() {
	b.count--
}

func TestAtomic(t *testing.T) {
	t.Run("safe in both constructs", func(t *testing.T) {
		done := make(chan struct{})
		b := &atomicbuf{}
		bb := b.Ref()
		go func() {
			time.Sleep(time.Second)
			// and then do not do anything with `bb` for some reason
			_ = bb
			close(done)
		}()
		b.Free()
		// ensuring the goroutine finishes, rather than ending the process early
		<-done
	})
	t.Run("unsafe but allowed with atomics", func(t *testing.T) {
		done := make(chan struct{})
		b := &atomicbuf{}
		go func() {
			bb := b.Ref() // moved inside the goroutine
			time.Sleep(time.Second)
			// and then do not do anything with the value for some reason
			_ = bb
			close(done)
		}()
		b.Free()
		// ensuring the race occurs, rather than ending the process early
		<-done
	})
}
func TestNormal(t *testing.T) {
	t.Run("safe in both constructs", func(t *testing.T) {
		done := make(chan struct{})
		b := &normalbuf{}
		bb := b.Ref()
		go func() {
			_ = bb // capture it
			time.Sleep(time.Second)
			// and then do not do anything with `bb` for some reason
			close(done)
		}()
		b.Free()
		// ensuring the goroutine finishes, rather than ending the process early
		<-done
	})
	t.Run("unsafe and blocked by race detector", func(t *testing.T) {
		done := make(chan struct{})
		b := &normalbuf{}
		go func() {
			bb := b.Ref() // moved inside the goroutine
			time.Sleep(time.Second)
			// and then do not do anything with the value for some reason
			_ = bb
			close(done)
		}()
		b.Free()
		// ensuring the race occurs, rather than ending the process early
		<-done
	})
}

With racy atomic code like that, it's also possible to trigger double-frees due to the non-atomic b.refs = nil line: that may not have propagated across threads, so old buf references may Ref(); Free() without triggering the < 0 case with panic("Cannot free freed buffer").

So it's hiding some misuse, and not reliably preventing misuse. I'd argue it should either be safe (with added runtime cost) or not (without), not something in the middle.
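The cross-Buffer leak from pooled reuse can be shown deterministically with a simple freelist standing in for sync.Pool (a sketch with made-up names; the real code's refObjectPool has the same hazard, just nondeterministically):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// freelist is a deterministic stand-in for sync.Pool so the reuse
// below is guaranteed to happen.
type freelist struct{ items []*atomic.Int32 }

func (f *freelist) Get() *atomic.Int32 {
	if n := len(f.items); n > 0 {
		it := f.items[n-1]
		f.items = f.items[:n-1]
		return it
	}
	return new(atomic.Int32)
}

func (f *freelist) Put(r *atomic.Int32) { f.items = append(f.items, r) }

func main() {
	pool := &freelist{}

	// Buffer A is created, then freed: its counter returns to the pool.
	refsA := pool.Get()
	refsA.Store(1)
	stale := refsA // a racing goroutine's leftover view of A's counter
	refsA.Add(-1)
	pool.Put(refsA)

	// Buffer B is created and reuses the same counter object.
	refsB := pool.Get()
	refsB.Store(1)

	// A stale Ref() on buffer A now bumps buffer B's count instead:
	stale.Add(1)
	fmt.Println(refsB.Load()) // 2, though B has only one real reference
}
```

Buffer B now needs two Frees before it actually releases, or conversely a stale Free on A could release B early, which is exactly the pooled-reuse leak described above.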

@Groxx
Author

Groxx commented Apr 16, 2025

re CLA...

... tbh I don't think I can agree with a license that requires someone to share their mailing address. That seems like an absurd overreach.

@dfawley
Member

dfawley commented Apr 16, 2025

This is the standard Linux Foundation / CNCF CLA, and the repo requires signing the agreement in order to merge pull requests. If you don't want to sign it, that's fine - we will just close this PR.

@dfawley
Member

dfawley commented Apr 16, 2025

There's also some chance you may have attempted to go through the "corporate" license workflow instead of the "individual" one. Asking around, nobody was aware that you'd need to provide a mailing address to agree to the CLA. Unfortunately, there's no easy way for me to test this for myself.

@Groxx
Author

Groxx commented Apr 21, 2025

There's also some chance you may have attempted to go through the "corporate" license workflow instead of the "individual" one. Asking around, nobody was aware that you'd need to provide a mailing address to agree to the CLA. Unfortunately, there's no easy way for me to test this for myself.

No, it's the individual one.
Individual_Contributor_License_Agreement.pdf
And the signing UI has that field required.

I see that a fair number of others also have that clause, but nothing seems to describe why it's necessary... and there are also a fair number that do not, e.g. Google's OpenSource CLA needs just a name and username.

So yea, that seems like an absurd overreach. Surely only legal jurisdiction is relevant, it's not like CNCF is going to mail me a letter at an unverified address if they want to change a project's license or something... and I thought the whole point of a CLA was to remove that need, by retaining rights for the project.


This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Apr 27, 2025
@dfawley
Member

dfawley commented Apr 28, 2025

Blocked on the CLA issue; being investigated in communitybridge/easycla#4638

@dfawley
Member

dfawley commented May 14, 2025

@Groxx can you try again? IIUC they removed the address requirement.


This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added stale and removed stale labels May 20, 2025

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label May 26, 2025
@eshitachandwani
Member

@Groxx Closing the PR as it has been 2 weeks since any activity. Feel free to re-open.

5 participants