Posted on May 10, 2009.
Take one arbitrary limitation (Twitter only lets you send 140 characters). Note that Twitter allows you to use UTF-8 characters. Add a course in information theory. Roast under the heat of procrastination for several hours until juicy.
In english: I wondered just how much information you could send in a single tweet, and decided to find out.
Armed with the knowledge from an Information Theory course given by the singular David MacKay, I decided to see. So, it's common knowledge that Arithmetic Coding is the best way to compress information, getting you to within a couple of bits of the Shannon Limit so long as you have an accurate modelling system. The 'best' bit of the compression comes from having suitably subtle and ingenious models of the text that you're encoding; without such a model, you're probably not going to see the benefits of Arithmetic Coding. However, I'm always up for a challenge - especially one set by someone with an Erdos Number of two.
Admittedly, I'd attempted to make an arithmetic encoder for a homework, which I almost achieved. However, I cheated by using the arbitrary precision maths module in PHP which would break after a suitably long string. As I presented it to him, I felt a pang of shame even before I explained... so reinvigorated with the fire of project procrastination, I thought I'd try and get one working properly. Admittedly, my first instinct was to just find one written in PHP, to get onto the meat of this project. But there didn't seem to be one on the internet - until now! And it really does work properly, much to my amazement. Technically, it's a range encoder, but the two are mathematically identical. As an added bonus, a range encoder isn't protected by IBM patents, whereas an arithmetic coder is.
So, how do we transmit our bitstream over Twitter? Happily, Twitter supports UTF-8 characters, which can occupy up to four bytes each. For a four byte character, 21 are under control of the user. If we were to use just uppercase characters and space, for English this corresponds to about four bits per character we wish to send - so we could send up to 700 characters in such a case! We like punctuation though, so this means we can send fewer total characters, but it's probably worth it. And so began my trial with Unicode.
- First attempt: split the binary string into 21 bit blocks, pad the last one and send the corresponding UTF-8 characters. Unfortunately, not all 21 bit long binary strings map to valid characters - those above 0x10ffff are invalid. So drop down to 20 bits per UTF-8 character, and have the first bit as 0 - we're now definitely going to map onto a valid unicode character, right?
- Turns out, no. The points 0xD800 to 0xDFFF are UTF-16 surrogate pairs. So look for these, and if we would be trying to send one of these, send a three byte character instead (which will only be 16 bytes, but still a bargain).
- Unicode has canonical decompositions - so the character á is canonically identical to the characters representing a´ in sequence, for example. These are defined to be identical, and no application should function differently when presented by one sequence or the other. So now check to see that a character we are trying to send is fully decomposed - if not, drop to a three byte character if we were four, and a two byte character if we were three. I'm not convinced that there's a PHP library that does a complete job of this anywhere, so this is currently only a 'probably' step of the process - if a message fails to send, it's probably failed here.
- Line Feed / Carriage Return. Different OSes introduce the other - or not - after seeing one. So if the character we're trying to send is actually in the bottom 128 of characters, only take seven bits. Set the 32s bit to 1, so we're not using the awkward part of the ASCII table.
The upshot is that there is now a proof of concept of the compressor up along with the source. I've included the UTF-8 libraries with it which are from a couple of sources, but mostly MediaWiki. I based the range encoder on the pseudocode over at the wiki article, changing it to use binary rather than base 10. It's entirely for use at your own risk.
Lastly, it would seem that I am not the first to have an idea along these lines, and some have even gone much, much further. Arguably too far. Truly, the internet is a wonderful enabler for people who want to do things simply to see if they can be done. I find it amusing how all of these seem to have sprung up in the last month!