Schemas and Coders and Benchmarks

This weekend I got nerd snipped into working on a part of my hobby SDK, that I didn’t have too much motivation to do. Beam Schema Row coders.

But they do need to be done eventually, so I implemented them.

In particular, what I needed was a way to take the Beam Schema Proto, and turn it into coder that could produce a dynamic row value. The existing Go SDK in principle could do it, but it’s not easy due to a choice I made years and years ago. I’m not convinced it’s the wrong choice though. Besides, adding the new handling to the API would take longer than adding it to the hobby SDK, or for what I needed it for from scratch.

Perhaps that’s not entirely true. I’m being paranoid about continuing to broaden the API surface of the existing SDK. It’s already quite large and complex, and the interactions are already quite subtle.

Anyway, I wrote a quick naive implementation of the code, added tests, made them all pass, and wrote a benchmark.

func BenchmarkRoundtrip(b *testing.B) {
	for _, test := range suite {
		b.Run(test.name, func(b *testing.B) {
			c := ToCoder(test.schema)
			b.ReportAllocs()
			b.ResetTimer()
			for range b.N {
				r := coders.Decode(c, test.data)
				if got, want := coders.Encode(c, r), test.data; !cmp.Equal(got, want) {
					b.Errorf("round trip decode-encode not equal: want %v, got %v", want, got)
				}
			}
		})
	}
}

It’s a pretty straightforward thing. I took the implementation straight from the above tests, set it to report allocations, and restart the benchmark timer, and then get to iterating.

This gives bad results though. The problem is here:

if got, want := coders.Encode(c, r), test.data; !cmp.Equal(got, want) {
    b.Errorf("round trip decode-encode not equal: want %v, got %v", want, got)
}

I kept the comparison in, to validate that everything continues to work as desired through the thousands of runs the coders would be put though. Or I was lazy about it. cmp is not intended to be high performance, it’s intended to be convenient and correct.

Had I used bytes.Equal instead of the general cmp.Equal, I’d have spent less CPU, and probably wouldn’t have noticed.

For this task, I didn’t much care about CPU. I cared about allocations, which do directly affect CPU and memory usage. And cmp is allocation heavy by comparison to a straight byte slice equality check.

That’s not all though. I had used my convenience functions for the benchmark too, to trivially get the through the decode and encode cycle. coders.Decode and coders.Encode are just small wrappers to simplify quick one off encodings, such as for testing.

But like cmp, they are convenient, not high performance.

The test body now looks like this:

c := ToCoder(test.schema)
// Mild shenanigans to prevent unnecessary allocations.
enc := coders.NewEncoder()
dec := *coders.NewDecoder(test.data)
n := len(test.data)
want := test.data

b.ReportAllocs()
b.ResetTimer()
for range b.N {
    enc.Reset(n)
    dec = *coders.NewDecoder(test.data)
    r := c.Decode(&dec)
    c.Encode(enc, r)
    if got := enc.Data(); !bytes.Equal(got, want) {
        b.Errorf("encoding not equal: want %v, got %v", want, got)
    }
}

Moving the Encoder and Decoder allocations out of the hot loop, removed them from the profile graph. There’s a bit of “fun” with the Decoder to avoid it getting heap allocated anew for each loop when the test data is being reset. Having the comparison back in adds a few nanoseconds per run, but helps keep the code staying robust.

All of this was a problem largely because I wanted to collect a nice and clean “before” and “after” versions of the metrics as I cleaned up and changed the implementation. I like seeing performance improvements. But now the results are incomparable, and not apples to apples, so I’ve cleared them away.

I am happy where this has ended up though. I incorporated the unsafe tricks protocol buffers use to minimize allocations and space in their protoreflect package: https://github.com/protocolbuffers/protobuf-go/blob/master/reflect/protoreflect/value_unsafe_go121.go.

The idea is the same, really. Be able to refer to and mutate values and fields efficiently, against a known schema. I’d use their implementation directly if I were able to, but they have it quite locked down for the same reasons why I’ve put this stuff in an internal package for the time being. I want it to be able to change.

This is basically half of a useful article, other than the lesson about being certain you know what you’re measuring in your benchmarks. We’ll see if I turn back the clock and collect clean measurements of the implementations.