Fast Numpy/Protobuf Deserialization Example

In my last post I wrote about how I was able to improve protobuf deserialization using a C++ extension. I’ve boiled down the code to its essence to show how I did it. The code is short enough to show in its entirety, so I haven’t zipped it up into a file.

Here’s the simplified protobuf message which is used to represent a time series as 2 arrays:

[sourcecode language="python"]
package timeseries;

message TimeSeries {
  repeated double times = 2;
  repeated double values = 3;
}
[/sourcecode]
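
To get a feel for what ends up in the file, here is a rough sketch that hand-encodes a tiny TimeSeries using only the standard library. It assumes the message is compiled as proto2, where a repeated double field is not packed unless you ask for it, so every element is written as a one-byte key followed by 8 little-endian bytes. The helper names (`key_byte`, `encode_unpacked_doubles`, `encode_time_series`) are made up for illustration and are not part of the protobuf library.

```python
import struct

WIRE_TYPE_FIXED64 = 1  # wire type used for double fields


def key_byte(field_number, wire_type):
    # A field key is (field_number << 3) | wire_type.
    # It fits in a single byte for field numbers below 16.
    return struct.pack("B", (field_number << 3) | wire_type)


def encode_unpacked_doubles(field_number, values):
    # Assumption: unpacked encoding, so each element carries its own key.
    key = key_byte(field_number, WIRE_TYPE_FIXED64)
    return b"".join(key + struct.pack("<d", v) for v in values)


def encode_time_series(times, values):
    # times is field 2 and values is field 3 in the message definition.
    return encode_unpacked_doubles(2, times) + encode_unpacked_doubles(3, values)


encoded = encode_time_series([0.0, 1.0], [0.0, 10.0])
print(len(encoded))  # 2 fields * 2 elements * 9 bytes = 36
```

Under the same assumption, the 10,000,000-element benchmark file used later should come out at roughly 2 * 10,000,000 * 9 = 180,000,000 bytes.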

I then wrote a test app in Python and C++ to provide a benchmark. Here is the Python version:

[sourcecode language="python"]
import sys
import time

import numpy
import time_series_pb2

def write_test():
    ts = time_series_pb2.TimeSeries()
    for i in range(10000000):
        ts.times.append(i)
        ts.values.append(i*10.0)

    start = time.time()

    f = open("ts.bin", "wb")
    f.write(ts.SerializeToString())
    f.close()

    print(time.time() - start)

def read_test():
    ts = time_series_pb2.TimeSeries()

    start = time.time()

    f = open("ts.bin", "rb")
    ts.ParseFromString(f.read())
    f.close()

    # _values is a private attribute of the pure-Python repeated field container
    t = numpy.array(ts.times._values)
    v = numpy.array(ts.values._values)

    print("Read time: %f" % (time.time() - start))
    print("Read time series of length %d" % len(ts.times))

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: %s <--read> | <--write>" % sys.argv[0])
    elif sys.argv[1] == "--read":
        read_test()
    else:
        write_test()
[/sourcecode]

I will spare you the C++ standalone code, since it was only a stepping stone. Instead, here is the C++ extension with two exposed methods: one deserializes a string and the other operates on a file.

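[sourcecode language="cpp"]
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#include <Python.h>
#include <numpy/arrayobject.h>

#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>

#include "time_series.pb.h"

using google::protobuf::io::CodedInputStream;
using google::protobuf::io::FileInputStream;

static PyObject* construct_numpy_arrays(timeseries::TimeSeries* ts)
{
    // returns a tuple (t,v) where t and v are double arrays of the same length

    // let numpy allocate and own the buffers so they are freed with the arrays
    npy_intp time_dims[1] = {ts->times_size()};
    npy_intp value_dims[1] = {ts->values_size()};
    PyObject* time_array = PyArray_SimpleNew(1, time_dims, NPY_DOUBLE);
    PyObject* value_array = PyArray_SimpleNew(1, value_dims, NPY_DOUBLE);
    if (time_array == NULL || value_array == NULL)
    {
        Py_XDECREF(time_array);
        Py_XDECREF(value_array);
        return NULL;
    }

    // the data must be copied because ts will go away and its mutable data
    // will too
    memcpy(PyArray_DATA((PyArrayObject*) time_array), ts->times().data(),
           ts->times_size()*sizeof(double));
    memcpy(PyArray_DATA((PyArrayObject*) value_array), ts->values().data(),
           ts->values_size()*sizeof(double));

    // put the arrays into the result tuple (PyTuple_SetItem steals the references)
    PyObject* data_tuple = PyTuple_New(2);
    PyTuple_SetItem(data_tuple, 0, time_array);
    PyTuple_SetItem(data_tuple, 1, value_array);

    return data_tuple;
}

static PyObject* TimeSeries_load(PyObject* self, PyObject* args)
{
    char* filename = NULL;

    if (! PyArg_ParseTuple(args, "s", &filename))
    {
        return NULL;
    }

    int fd = open(filename, O_RDONLY);
    if (fd < 0)
    {
        return PyErr_SetFromErrnoWithFilename(PyExc_IOError, filename);
    }

    timeseries::TimeSeries ts;
    bool parsed;
    {
        // the streams are scoped so they are destroyed before the descriptor is closed
        FileInputStream fs(fd);
        CodedInputStream coded_fs(&fs);
        // raise the message size limit to 500 MB
        coded_fs.SetTotalBytesLimit(500*1024*1024, -1);
        parsed = ts.ParseFromCodedStream(&coded_fs);
    }
    close(fd);

    if (! parsed)
    {
        PyErr_SetString(PyExc_ValueError, "could not parse TimeSeries");
        return NULL;
    }

    return construct_numpy_arrays(&ts);
}

static PyObject* TimeSeries_deserialize(PyObject* self, PyObject* args)
{
    int buffer_length;
    char* serialization = NULL;

    if (! PyArg_ParseTuple(args, "t#", &serialization, &buffer_length))
    {
        return NULL;
    }

    // read directly from the Python string's buffer
    CodedInputStream coded_fs(
        reinterpret_cast<const unsigned char*>(serialization), buffer_length);
    coded_fs.SetTotalBytesLimit(500*1024*1024, -1);

    timeseries::TimeSeries ts;
    if (! ts.ParseFromCodedStream(&coded_fs))
    {
        PyErr_SetString(PyExc_ValueError, "could not parse TimeSeries");
        return NULL;
    }
    return construct_numpy_arrays(&ts);
}

static PyMethodDef TSMethods[] = {
    {"load", TimeSeries_load, METH_VARARGS, "loads a TimeSeries from a file"},
    {"deserialize", TimeSeries_deserialize, METH_VARARGS, "loads a TimeSeries from a string"},
    {NULL, NULL, 0, NULL} /* sentinel */
};

/* Python 2 module initialization */
#ifndef PyMODINIT_FUNC /* declarations for DLL import/export */
#define PyMODINIT_FUNC void
#endif
PyMODINIT_FUNC inittimeseries(void)
{
    import_array();
    (void) Py_InitModule("timeseries", TSMethods);
}
[/sourcecode]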
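
The reason the extension can be fast is that the doubles sit next to each other in memory, so they can be moved with one bulk copy instead of being touched one at a time. As a loose analogy only (this is not how the protobuf or numpy libraries work internally), the sketch below contrasts the two approaches using just the standard library. The function names are invented for the example, and it assumes Python 3 for `array.frombytes`.

```python
import array
import struct


def unpack_one_by_one(raw):
    # Per-element path: one struct call and one Python float per value.
    count = len(raw) // 8
    return [struct.unpack_from("=d", raw, i * 8)[0] for i in range(count)]


def unpack_in_bulk(raw):
    # Bulk path: a single copy of the whole buffer into a typed array.
    values = array.array("d")
    values.frombytes(raw)
    return values


# Four doubles in native byte order, laid out contiguously.
raw = struct.pack("=4d", 0.0, 10.0, 20.0, 30.0)

# Both paths recover the same numbers; only the amount of per-element work differs.
print(unpack_one_by_one(raw) == list(unpack_in_bulk(raw)))  # True
```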

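Calling the extension from Python is trivial:

[sourcecode language="python"]
import time
import timeseries

start = time.time()
f = open("ts.bin", "rb")
# deserialize returns a tuple of two numpy arrays: (times, values)
t, v = timeseries.deserialize(f.read())
f.close()
print("read and converted to numpy array in %f" % (time.time()-start))
print("timeseries contained %d values" % len(v))
[/sourcecode]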

Fast Protocol Buffers in Python

A couple of years ago I worked on a project which needed to transport a large dataset over the wire. I looked at a number of technologies, and Google Protocol Buffers looked very interesting. Over the past week, I’ve been asked about my experience a couple of times, so I hope this provides a little bit of insight into how to use Protocol Buffers in Python when performance matters.

I wrote a little test case to model the serialization of the data I wanted to send, a list of 100 pairs of arrays, where each array contained 250,000 elements. The raw data size was 381 MB.
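
The raw size can be checked with a few lines of arithmetic. This assumes every element is an 8-byte double, which the figure of 381 MB implies if MB is read as 1024 * 1024 bytes.

```python
PAIRS = 100
ARRAYS_PER_PAIR = 2
ELEMENTS_PER_ARRAY = 250000
BYTES_PER_DOUBLE = 8  # assumption: each element is a 64-bit double

raw_bytes = PAIRS * ARRAYS_PER_PAIR * ELEMENTS_PER_ARRAY * BYTES_PER_DOUBLE
raw_megabytes = raw_bytes / (1024.0 * 1024.0)

print(raw_bytes)                # 400000000
print("%.1f" % raw_megabytes)   # 381.5
```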

First, I ran the pure Python test: the write took 83 seconds, the read took 202 seconds. Not good.

Next I tested the same data in C++: the write took 4.4 seconds and the read took 2.8 seconds. Impressive.

The obvious path then was to write the serialization code in C++ and expose it through an extension point. The read function, including putting all of the data into numpy arrays, now takes 7.5 seconds. I only needed the read function from Python, but the write function should take about the same time.
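
To put the timings quoted above side by side, here is the arithmetic as a small sketch. The numbers are the ones from this post; the `speedup` helper is just for illustration.

```python
# Timings in seconds, as reported above.
python_write, python_read = 83.0, 202.0
cpp_write, cpp_read = 4.4, 2.8
extension_read = 7.5


def speedup(slow, fast):
    # How many times faster the second timing is than the first.
    return slow / fast


print("%.1f" % speedup(python_read, extension_read))  # 26.9
print("%.1f" % speedup(python_read, cpp_read))        # 72.1
print("%.1f" % speedup(python_write, cpp_write))      # 18.9
```

So the extension reads about 27 times faster than pure Python, while still taking a bit under 3 times as long as the standalone C++ read.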

Travis announces project to extend NumPy/SciPy to .Net

Travis Oliphant kicked off today’s SciPy 2010 Day 2 with a great keynote talk. He told the story of his own path to Python, filling his slides with the faces and work of other developers, scientists, and mathematicians: his inspirations, teachers, and collaborators. He explained how his academic trajectory, from electrical engineering, through a brief affair with neuroscience, to a biomedical engineering PhD, both drove and competed with his work creating NumPy.
Last, but not least, Travis closed his talk with a rather large announcement: Enthought has undertaken the extension of NumPy and SciPy to the .NET framework. For details on the project refer to the official release.

SciPy 2010 underway!

Everyone minus Ian, the most valiant photographer!

We were thrilled to host SciPy 2010 in Austin this year. Everyone seems to be enjoying the cool weather (so what if it’s born of thunderstorms?) and the plush conference center/hotel (even if we had to retrain their A/V team).
After two days of immensely informative Tutorials, the General Session began yesterday with speaker Dave Beazley’s awesome keynote on Python concurrency. In addition to the solid line-up of talks at the main conference, we had two very well-attended specialized tracks: Glen Otero chaired the Bioinformatics track, while Brian Granger and Ken Elkabany coordinated the Parallel Processing & Cloud Computing talks. The day then closed with a conference reception and guacamole-fueled Birds of a Feather sessions.