A couple of years ago I worked on a project that needed to transport a large dataset over the wire. I looked at a number of technologies, and Google Protocol Buffers looked very interesting. Over the past week I’ve been asked about my experience a couple of times, so I hope this provides a little insight into using Protocol Buffers in Python when performance matters.
I wrote a little test case to model the serialization of the data I wanted to send, a list of 100 pairs of arrays, where each array contained 250,000 elements. The raw data size was 381 MB.
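The post doesn’t include the schema, but 100 pairs × 2 arrays × 250,000 elements × 8 bytes comes to 400,000,000 bytes ≈ 381 MiB, which matches the stated raw size and suggests the elements were 64-bit doubles. Under that assumption, a hypothetical `.proto` definition for the test data (message and field names are my own) might look like:

```proto
// Hypothetical schema for the test data: a Dataset of 100 ArrayPair
// messages, each holding two 250,000-element arrays of doubles.
message ArrayPair {
  repeated double first  = 1 [packed = true];
  repeated double second = 2 [packed = true];
}

message Dataset {
  repeated ArrayPair pairs = 1;
}
```

Marking the repeated numeric fields `packed` keeps the encoding compact, since the elements are stored back-to-back without a tag per value.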
First, I ran the pure-Python test: the write took 83 seconds and the read took 202 seconds. Not good.
Next I tested the same data in C++: the write took 4.4 seconds and the read took 2.8 seconds. Impressive.
The obvious path, then, was to write the serialization code in C++ and expose it to Python as an extension module. The read function, including putting all of the data into numpy arrays, now takes 7.5 seconds. I only needed the read path from Python, but the write path should take about the same time.
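The extension itself isn’t shown in the post, but much of the win on the read side comes from handing numpy a contiguous buffer rather than appending Python floats one at a time. A rough sketch of that idea, assuming 64-bit doubles and omitting the protobuf framing entirely:

```python
import numpy as np

# Stand-in for decoded payload bytes: one 250,000-element array of
# float64 values (sizes taken from the post; the protobuf message
# layer is not reproduced here).
raw = np.arange(250_000, dtype=np.float64).tobytes()

# Zero-copy view onto the buffer -- the kind of hand-off a C++
# extension can perform once it has decoded the repeated field
# into contiguous memory.
arr = np.frombuffer(raw, dtype=np.float64)
assert arr.shape == (250_000,)
```

Building 50 million Python float objects and appending them to lists is exactly the overhead the pure-Python read pays; a bulk copy into a typed array avoids it.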