A couple of years ago I worked on a project which needed to transport a large dataset over the wire. I looked at a number of technologies, and Google Protocol Buffers looked very interesting. Over the past week, I’ve been asked about my experience a couple of times, so I hope this provides a little bit of insight into how to use Protocol Buffers in Python when performance matters.
I wrote a little test case to model the serialization of the data I wanted to send, a list of 100 pairs of arrays, where each array contained 250,000 elements. The raw data size was 381 MB.
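As a quick sanity check on that figure, the arithmetic can be sketched as below. The 8-byte element size is an assumption (the post does not state the element type), but it is the size that reproduces the quoted 381 MB when a megabyte is counted as 2**20 bytes.

```python
# Back-of-the-envelope size of the test data set described above.
N_PAIRS = 100            # pairs of arrays in the list
ARRAYS_PER_PAIR = 2
N_ELEMENTS = 250_000     # elements per array
BYTES_PER_ELEMENT = 8    # assumption: 64-bit values (e.g. doubles)

raw_bytes = N_PAIRS * ARRAYS_PER_PAIR * N_ELEMENTS * BYTES_PER_ELEMENT
raw_mib = raw_bytes / 2**20  # megabytes counted as 2**20 bytes

print(f"{raw_bytes:,} bytes = {raw_mib:.1f} MiB")
```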
First, I ran the pure Python test: the write took 83 seconds, the read took 202 seconds. Not good.
Next I tested the same data in C++: the write took 4.4 seconds and the read took 2.8 seconds. Impressive.
The obvious path then was to write the serialization code in C++ and expose it to Python through an extension point. The read function, including putting all of the data into numpy arrays, now takes 7.5 seconds. I only needed the read function from Python, but the write function should take about the same time.
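To illustrate only the last step, getting decoded data into numpy arrays without per-element Python work, here is a minimal sketch. It is not the code from this project: the function name `arrays_from_buffer`, the flat buffer layout, and the 8-byte float element type are all assumptions made for the example. The idea is that if the C++ side hands back one contiguous block of bytes, `numpy.frombuffer` can wrap it in bulk.

```python
import numpy as np

def arrays_from_buffer(raw, n_pairs, n_elements):
    """Wrap a contiguous byte buffer as an (n_pairs, 2, n_elements) array.

    Hypothetical helper: assumes `raw` holds native-endian 8-byte floats,
    laid out pair by pair, array by array.
    """
    flat = np.frombuffer(raw, dtype=np.float64)  # no per-element copying
    expected = n_pairs * 2 * n_elements
    if flat.size != expected:
        raise ValueError(f"expected {expected} elements, got {flat.size}")
    return flat.reshape(n_pairs, 2, n_elements)

# Small stand-in for the real payload: 2 pairs of 5-element arrays.
source = np.arange(2 * 2 * 5, dtype=np.float64)
pairs = arrays_from_buffer(source.tobytes(), n_pairs=2, n_elements=5)
first_of_second_pair = pairs[1, 0]  # first array of the second pair
```

The array returned by `frombuffer` over a `bytes` object is read-only, so a caller that needs to modify the data would have to copy it first.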
Why not put the code somewhere?
It looks like a common use case (numpy+pbuffer).
L.
What does the code look like? Could you post it?
I fully intend to post the code, but I was at a conference and didn’t have the code handy. Look for it next week.
It would be interesting to see how the Cython equivalent of this code performs (when the code is out).
What version of protobuf were you using when you saw these results? The latest (2.3) is supposed to be 10-25 times faster for Python than previous versions. I haven’t benchmarked this myself, though.
I used protobuf 2.3. I see the notes in their change log, but I didn’t see any significant improvement. Maybe it depends on the message, and they didn’t optimize for large array-like data structures?
Did you ever figure out the reason you didn’t see any improvement? I posted a link to your blog and subsequent entry with the code to the protobuf mailing list, since I am very curious. No replies yet, so wondering if you’ve discovered anything else.
I don’t use pure Python protobufs in a production environment. We use Java-compiled messages. I use Jython to concoct test messages based on the Java versions, and started playing around with pure Python out of curiosity.
Shelia- no, I never found the cause. To be honest, I didn’t spend too much time on it since I haven’t needed to update the code in the last 12-18 months.