Compressing text using Python’s unicode support

Python’s Unicode documentation could use a little bit of love. It does a great job of discussing UTF-8 support, but doesn’t really discuss any other codecs. Yesterday I found myself wondering what other codecs were supported and eventually found the codecs page. There are the usual ones, ‘utf-8’, ‘utf-16’, ‘latin-1’, etc., but at the bottom of the page is the bit that interested me.

Using the “encode” and “decode” methods, which are meant to support Unicode encoding, you can convert your data into all sorts of useful serializations. The two that piqued my interest the most are ‘base64’ and ‘zlib’.

Sometimes it’s useful to compress data being sent back and forth through a web session. When handling a request, you will want to base64-decode the data, then decompress it:

request_data = request.POST['data']
deserialization = request_data.decode('base64_codec').decode('zlib_codec')

Then, when replying to the request, the data should be compressed, then base64-encoded:

reply_data = '...'
serialization = reply_data.encode('zlib_codec').encode('base64_codec')
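
The snippets above are Python 2 only. As a rough sketch of the same round trip on Python 3 (3.2 or later), assuming the payload is text that should travel as an ASCII string, the codecs can be reached through the codecs module. The function names serialize and deserialize are hypothetical, chosen only for this illustration:

```python
import codecs


def serialize(text: str) -> str:
    """Compress text, then base64-encode it into an ASCII string."""
    raw = text.encode("utf-8")                            # str -> bytes (a real text codec)
    compressed = codecs.encode(raw, "zlib_codec")         # bytes -> bytes
    armored = codecs.encode(compressed, "base64_codec")   # bytes -> bytes
    return armored.decode("ascii")                        # base64 output is plain ASCII


def deserialize(payload: str) -> str:
    """Reverse serialize(): base64-decode first, then decompress."""
    armored = payload.encode("ascii")
    compressed = codecs.decode(armored, "base64_codec")
    raw = codecs.decode(compressed, "zlib_codec")
    return raw.decode("utf-8")
```

Note that the order is reversed on the way back in: whatever was applied last when encoding has to be undone first when decoding.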

This example is rather contrived, and relies on under-documented features of Python, but I thought it was pretty cool.

[edited to fix the sample code]

11 thoughts on “Compressing text using Python’s unicode support”

  1. Shawn Wheatley

    I believe the response encoding should look like: reply_data.encode(‘zlib_codec’).encode(‘base64_codec’)

  2. Nick Coghlan

    Since these are “text-to-text” or “binary-to-binary” transforms, though, the encode()/decode() methods in Python 3.x don’t support this style of usage (it’s a Python 2.x only feature).

    The codecs themselves are back in 3.2, but you need to go through the codecs module API in order to use them – they aren’t available via the object method shorthand.

  3. Jean-Paul

    You can also just use the names “zlib” and “base64”. Also, technically this really isn’t very related to unicode. Bytes-to-bytes encoding just shares an API with Unicode-to-bytes encoding. Unicode doesn’t ever get involved when you’re using codecs like zlib and base64.

  4. pps

    Small mistake in second example. There should be:
    serialization = reply_data.encode(‘zlib_codec’).encode(‘base64_codec’)

  5. Masklinn

    Note a few things for Python 3 though:

    * zlib_codec and base64_codec are not available in Python 3.1

    * they are mappings from byte to byte, so you still need to decode them using a text-encoding codec to get strings

  6. bartek

    Cool stuff. I think the encoding part has a typo, as it would be more sensible to encode with base_64 and zip instead of zipping twice

  7. Brandon Craig Rhodes

    The reason that those codecs are under-documented is that their placement in the codecs module was a mistake, and today we are encouraged to use the routines dedicated to base64 and zlib elsewhere in the Standard Library, specifically in the modules that go by those names.

    Why is it important to not treat these transforms as codecs? Because, strictly speaking, a codec takes a Unicode string (what will be known in Python 3 as strings plain and simple!) and converts it to a byte sequence that can be written to disk or transmitted over a network, and each codec also implements the opposite conversion. Converting strings back and forth to byte objects is the semantic definition of a codec.

    But both base64 and zlib do something quite different: they take a raw byte string, and convert it to yet another raw byte stream. They are not encoding or decoding any set of symbols; they are agnostic about the meaning of the byte streams, and are simply converting them to an ASCII or compressed form that makes those bytes easier to transmit across certain media, or to store on space-limited media. As the Python community tries to educate its programmers, especially newcomers, about why byte strings and Unicode strings serve quite different functions, it will be increasingly important for us to communicate that binary transforms are not, in fact, codecs.

  8. David Brailovsky

    Also notice that “string”.encode(‘base64’) produces a different output than base64.b64encode(“string”).
    That can be quite annoying when some API requires you to encrypt a base64 encoded string and you use str.encode instead of the base64 module.
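
Two points from the comments above, the dedicated base64 and zlib modules and the differing base64 output, can be sketched together for Python 3. The function names pack and unpack are hypothetical, picked only for this illustration; the library calls themselves are the standard ones:

```python
import base64
import codecs
import zlib


def pack(data: bytes) -> bytes:
    """Compress bytes, then base64-encode them, using the dedicated modules."""
    return base64.b64encode(zlib.compress(data))


def unpack(payload: bytes) -> bytes:
    """Reverse pack(): base64-decode first, then decompress."""
    return zlib.decompress(base64.b64decode(payload))


# The codec form and the module function are not byte-for-byte identical:
# the codec appends a trailing newline (and wraps long output into lines),
# while base64.b64encode returns one unbroken run of characters.
via_codec = codecs.encode(b"string", "base64_codec")
via_module = base64.b64encode(b"string")
print(repr(via_codec), repr(via_module))
```

If the encoded value is going to be compared, signed, or encrypted by another system, that extra newline is enough to make the results disagree, so it is worth settling on one of the two forms.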

