
proposal: encoding/json: garbage-free reading of tokens #40128

Open

@rogpeppe

Description

As @bradfitz noted in the reviews of the original API, the Decoder.ReadToken API is a garbage factory. Although, as @rsc noted at the time, "a clean API ... is more important here. I expect people to use it to get to the position they want in the stream and then call Decode", the inefficiency is a problem in practice for anyone who wishes to use the encoding/json tokenizer as a basis for some other kind of decoder.

Dave Cheney's "Building a high performance JSON parser" details some of the issues involved. He comes to the conclusion that the interface-based nature of json.Token is a fundamental obstacle. I like the current interface-based API, but it does indeed make it impossible to return arbitrary tokens without creating garbage. Dave suggests a new, somewhat more complex Scanner API, which is also not backward compatible with the current API in encoding/json.

I propose instead that the following method be added to the encoding/json package:

// TokenBytes is like Token, except that for strings and numbers it returns
// a static Token value with the actual data payload in the returned []byte,
// which is only valid until the next call to Token, TokenBytes, or Decode.
// For strings, the returned Token is ""; for a number, it is
// Number("0"); for all other kinds of token, the Token is returned as by the
// Token method and the []byte value is nil.
//
// This is more efficient than using Token because it avoids the
// allocations required by that API.
func (dec *Decoder) TokenBytes() (Token, []byte, error)

Token can be implemented in terms of TokenBytes as follows:

func (dec *Decoder) Token() (Token, error) {
	tok, data, err := dec.TokenBytes()
	if err != nil || data == nil {
		return tok, err
	}
	switch tok {
	case "":
		// String token: the payload holds the decoded string.
		return string(data), nil
	case Number("0"):
		// Number token: the payload holds the literal digits.
		return Number(data), nil
	}
	panic("unreachable")
}
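
For illustration, here is a minimal sketch of a consumer built on the proposed method: it counts the string tokens in a stream without allocating per token. It is hypothetical, since TokenBytes does not exist yet, and countStrings is an invented name.

package main

import (
	"encoding/json"
	"io"
)

// countStrings counts string tokens in a JSON stream. With the proposed
// TokenBytes, the loop itself performs no per-token allocations.
func countStrings(r io.Reader) (int, error) {
	dec := json.NewDecoder(r)
	n := 0
	for {
		tok, data, err := dec.TokenBytes()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		// A string token is reported as tok == "" with the payload in
		// data, valid only until the next Token/TokenBytes/Decode call.
		if data != nil && tok == "" {
			n++
		}
	}
}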

Discussion

This proposal relies on the observation that the Decoder.Token API only generates garbage for two kinds of tokens: numbers and strings. For all other token types, no garbage need be generated, because small values (json.Delim, which is a rune, and bool) do not incur an allocation when boxed in an interface.
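
The no-allocation claim for the other token kinds can be checked with testing.AllocsPerRun; a small sketch, assuming the runtime's usual static boxing of small constants (sink is an invented name):

package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

var sink json.Token // package-level so escape analysis cannot hide allocations

func main() {
	data := []byte("3.14")
	// Delims and bools box into static interface values: zero allocations.
	fmt.Println(testing.AllocsPerRun(100, func() { sink = json.Delim('{') }))
	fmt.Println(testing.AllocsPerRun(100, func() { sink = true }))
	// Strings and numbers must be copied to the heap: at least one allocation.
	fmt.Println(testing.AllocsPerRun(100, func() { sink = json.Number(data) }))
}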

It maintains the current API as-is. Users can opt in to the new API if they require efficiency, at some risk of incorrectness (the caller could hold onto the data slice after the next call to Token, TokenBytes, or Decode). The cognitive overhead of TokenBytes is arguably low because of its similarity to the existing API.
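
When a caller does need to keep the payload, copying it restores safety, at exactly the cost the API otherwise avoids. A hypothetical helper (readString is an invented name; it assumes the proposed TokenBytes):

package main

import (
	"encoding/json"
	"fmt"
)

// readString returns the next token, which must be a string, copying the
// payload so that it remains valid across later decoder calls.
func readString(dec *json.Decoder) (string, error) {
	tok, data, err := dec.TokenBytes()
	if err != nil {
		return "", err
	}
	if data == nil || tok != "" {
		return "", fmt.Errorf("expected string token, got %v", tok)
	}
	return string(data), nil // string(data) copies, so it is safe to retain
}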

If this proposal is accepted, an Encoder.EncodeTokenBytes could easily be added to provide garbage-free streaming JSON generation too.
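
For symmetry with TokenBytes, such a method might take the following shape. This is purely illustrative; the proposal does not define the encoder side:

// EncodeTokenBytes is a hypothetical counterpart to TokenBytes: it writes
// tok to the stream, except that for string and number tokens the payload
// is taken from data rather than from tok itself, allowing callers to
// reuse a single buffer across calls.
func (enc *Encoder) EncodeTokenBytes(tok Token, data []byte) error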
