Fast Numpy/Protobuf Deserialization Example

Jul 07 2010 Published by under NumPy

In my last post I wrote about how I sped up protobuf deserialization in Python by using a C++ extension. I’ve boiled the code down to its essence to show how I did it. Rather than zip everything up in a file, I'm showing the code in its entirety, since it is short enough.

Here’s the simplified protobuf message which is used to represent a time series as 2 arrays:

package timeseries;

message TimeSeries {
    repeated double times = 2;
    repeated double values = 3;
}
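
For intuition about what the parser has to walk through, here is a rough sketch of the bytes this message produces. It assumes the proto2 default (non-packed) encoding for `repeated double`, where every element is written as a one-byte key followed by eight little-endian bytes. The `encode` and `decode` helpers are hypothetical names for illustration only and are not part of the protobuf library or of the code in this post.

```python
import struct

# Assumption: proto2 default (non-packed) encoding for `repeated double`.
# Each element is a 1-byte key followed by an 8-byte little-endian double.
TIMES_KEY = (2 << 3) | 1   # field 2, wire type 1 (64-bit) -> 0x11
VALUES_KEY = (3 << 3) | 1  # field 3, wire type 1 (64-bit) -> 0x19


def encode(times, values):
    """Hand-encode a TimeSeries message (illustration only)."""
    out = bytearray()
    for t in times:
        out += struct.pack("<Bd", TIMES_KEY, t)
    for v in values:
        out += struct.pack("<Bd", VALUES_KEY, v)
    return bytes(out)


def decode(buf):
    """Decode bytes produced by encode(); assumes only fields 2 and 3 occur."""
    times, values = [], []
    for offset in range(0, len(buf), 9):
        key, x = struct.unpack_from("<Bd", buf, offset)
        (times if key == TIMES_KEY else values).append(x)
    return times, values


payload = encode([0.0, 1.0, 2.0], [0.0, 10.0, 20.0])
print(len(payload))     # 6 elements * 9 bytes each
print(decode(payload))
```

Under that assumption each double costs nine bytes on the wire, and a pure-Python parser has to handle the elements one at a time, which is where a compiled parser has room to win.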

I then wrote a test app in Python and C++ to provide a benchmark. Here is the Python version:

import numpy
import time_series_pb2

def write_test():
    ts = time_series_pb2.TimeSeries()
    for i in range(10000000):
        ts.times.append(i)
        ts.values.append(i*10.0)

    import time
    start = time.time()

    # the with-block closes the file even if the write fails
    with open("ts.bin", "wb") as f:
        f.write(ts.SerializeToString())

    print 'Write time:', time.time() - start

def read_test():
    ts = time_series_pb2.TimeSeries()

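    import time
    start = time.time()

    with open("ts.bin", "rb") as f:
        ts.ParseFromString(f.read())

    # _values is a private attribute of the pure-Python repeated-field
    # container, so this may break with other protobuf versions
    t = numpy.array(ts.times._values)
    v = numpy.array(ts.values._values)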

    print 'Read time:', time.time() - start
    print "Read time series of length %d" % len(ts.times)

if __name__ == "__main__":
    import sys
    if len(sys.argv) < 2:
        print "usage:   %s <--read> | <--write>" % sys.argv[0]
        sys.exit(1)  # without this, sys.argv[1] below raises IndexError

    if sys.argv[1] == "--read":
        read_test()
    else:
        write_test()

I will spare you the standalone C++ code, since it was only a stepping stone. Instead, here is the C++ extension. It exposes two functions: one deserializes a string, and the other reads from a file.

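#include <Python.h>  // must come before any standard headers

#include <fcntl.h>
#include <unistd.h>  // close()
#include <cstring>   // memcpy()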
#include <numpy/arrayobject.h>
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl_lite.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>

#include "time_series.pb.h"

static PyObject* construct_numpy_arrays(timeseries::TimeSeries* ts)
{
    // returns a tuple (t,v) where t and v are double arrays of the same length

    // let numpy allocate and own the buffers so they are freed with the arrays
    // (PyArray_SimpleNewFromData does not take ownership of new[]'d memory,
    // so buffers allocated that way would never be released)
    npy_intp time_dims[1] = {ts->times_size()};
    npy_intp value_dims[1] = {ts->values_size()};
    PyObject* time_array = PyArray_SimpleNew(1, time_dims, PyArray_DOUBLE);
    PyObject* value_array = PyArray_SimpleNew(1, value_dims, PyArray_DOUBLE);
    PyObject* data_tuple = PyTuple_New(2);
    if (time_array == NULL || value_array == NULL || data_tuple == NULL)
    {
        Py_XDECREF(time_array);
        Py_XDECREF(value_array);
        Py_XDECREF(data_tuple);
        return NULL;
    }

    // the data must be copied because ts will go away and its mutable data
    // will too
    memcpy(PyArray_DATA((PyArrayObject*)time_array), ts->times().data(),
           ts->times_size()*sizeof(double));
    memcpy(PyArray_DATA((PyArrayObject*)value_array), ts->values().data(),
           ts->values_size()*sizeof(double));

    // PyTuple_SetItem steals the references to both arrays
    PyTuple_SetItem(data_tuple, 0, time_array);
    PyTuple_SetItem(data_tuple, 1, value_array);

    return data_tuple;
}

static PyObject* TimeSeries_load(PyObject* self, PyObject* args)
{
    char* filename = NULL;

    if (! PyArg_ParseTuple(args, "s", &filename))
    {
        return NULL;
    }

    timeseries::TimeSeries ts;

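    int fd = open(filename, O_RDONLY);
    if (fd < 0)
    {
        // raise IOError with errno and the filename instead of parsing garbage
        return PyErr_SetFromErrnoWithFilename(PyExc_IOError, filename);
    }

    google::protobuf::io::FileInputStream fs(fd);
    google::protobuf::io::CodedInputStream coded_fs(&fs);
    coded_fs.SetTotalBytesLimit(500*1024*1024, -1);
    bool ok = ts.ParseFromCodedStream(&coded_fs);
    fs.Close();  // also closes fd, so no separate close(fd) is needed

    if (!ok)
    {
        PyErr_SetString(PyExc_ValueError, "could not parse TimeSeries message");
        return NULL;
    }

    return construct_numpy_arrays(&ts);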
}

static PyObject* TimeSeries_deserialize(PyObject* self, PyObject* args)
{
    int buffer_length;
    char* serialization = NULL;

    if (! PyArg_ParseTuple(args, "t#", &serialization, &buffer_length))
    {
        return NULL;
    }
    google::protobuf::io::ArrayInputStream input(serialization, buffer_length);

    google::protobuf::io::CodedInputStream coded_fs(&input);
    coded_fs.SetTotalBytesLimit(500*1024*1024, -1);

    timeseries::TimeSeries ts;
    if (! ts.ParseFromCodedStream(&coded_fs))
    {
        PyErr_SetString(PyExc_ValueError, "could not parse TimeSeries message");
        return NULL;
    }
    return construct_numpy_arrays(&ts);
}

static PyMethodDef TSMethods[] = {
    {"load", TimeSeries_load, METH_VARARGS, "loads a TimeSeries from a file"},
    {"deserialize", TimeSeries_deserialize, METH_VARARGS, "loads a TimeSeries from a string"},
    {NULL, NULL, 0, NULL}  /* sentinel: marks the end of the method table */
};

#ifndef PyMODINIT_FUNC  /* declarations for DLL import/export */
#define PyMODINIT_FUNC void
#endif
PyMODINIT_FUNC inittimeseries(void)
{
    import_array();
    (void) Py_InitModule("timeseries", TSMethods);
}

Calling the extension from Python is trivial:

import time
import timeseries
start = time.time()
t, v = timeseries.load('ts.bin')
print "read and converted to numpy array in %f" % (time.time()-start)
print "timeseries contained %d values" % len(v)

Fast Protocol Buffers in Python

Jul 01 2010 Published by under NumPy

A couple of years ago I worked on a project which needed to transport a large dataset over the wire. I looked at a number of technologies, and Google Protocol Buffers looked very interesting. Over the past week, I’ve been asked about my experience a couple of times, so I hope this provides a little bit of insight into how to use Protocol Buffers in Python when performance matters.

I wrote a little test case to model the serialization of the data I wanted to send, a list of 100 pairs of arrays, where each array contained 250,000 elements. The raw data size was 381 MB.
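
As a quick sanity check on the quoted raw size, the arithmetic can be sketched as below. It assumes each element is an 8-byte double and that "MB" means 2^20 bytes; the post does not state either explicitly, but the result matches the figure given.

```python
# Assumptions: 8-byte doubles, and "MB" meaning 1024 * 1024 bytes.
pairs = 100
arrays_per_pair = 2
elements_per_array = 250000
bytes_per_double = 8

raw_bytes = pairs * arrays_per_pair * elements_per_array * bytes_per_double
raw_mib = raw_bytes / float(1024 * 1024)

print("%d bytes = %.1f MiB" % (raw_bytes, raw_mib))
```

That works out to 400,000,000 bytes, or roughly 381 MB.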

First, I ran the pure Python test: the write took 83 seconds and the read took 202 seconds. Not good.

Next I tested the same data in C++: the write took 4.4 seconds and the read took 2.8 seconds. Impressive.

The obvious path, then, was to write the serialization code in C++ and expose it to Python through an extension module. The read function, including putting all of the data into numpy arrays, now takes 7.5 seconds. I only needed the read function from Python, but the write function should take about the same time.
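
To put the timings side by side, here is a small sketch that turns them into speedup factors. The numbers are copied from the measurements above, and `speedup` is just a hypothetical helper name.

```python
# Timings in seconds, copied from the measurements quoted above.
python_write, python_read = 83.0, 202.0
cpp_write, cpp_read = 4.4, 2.8
extension_read = 7.5  # C++ extension, including conversion to numpy arrays


def speedup(slow, fast):
    """Return how many times faster `fast` is than `slow`."""
    return slow / fast


print("C++ write vs. Python write: %.1fx" % speedup(python_write, cpp_write))
print("C++ read vs. Python read: %.1fx" % speedup(python_read, cpp_read))
print("Extension read vs. Python read: %.1fx" % speedup(python_read, extension_read))
```

By these numbers the extension read is roughly 27 times faster than the pure Python read, while still being a few times slower than the standalone C++ read.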
