On BitTorrent

Apart from moving house and playing WEBoggle, I’ve also spent much of the past two weeks working on a BitTorrent tracker. It might seem like a strange thing to be doing, particularly in a week in which one person was fined USD 1 million for running a BitTorrent tracker/aggregator site. But don’t worry: I’m not in any danger. Contrary to what some people appear to believe, peer-to-peer technology does have substantial legitimate uses.

The project that I’m working on involves the legal online distribution of movies. These are big files, which means that we either have to (a) pay a huge amount for a fat pipe and server farm, or (b) use P2P and set it up with minimal initial investment. Since the target market is in Japan, where internet access is fast, cheap, and unmetered, P2P distribution is an ideal fit.

Setting up a tracker was initially very easy, but we needed to add some IP-address-based authentication in order to restrict it to paying customers. I could have tacked the necessary functionality onto an existing tracker, but two things prompted me to write my own. First, a tracker is functionally very simple, and it was as easy to build one from scratch as to understand an existing codebase. Second, most of the available trackers are very inefficient. This is partly the fault of the protocol (I have some ideas for improvement which I intend to suggest), but mostly due to naïve indexing systems.

The tracker is a sort of dating agency for peers. It stores no data; indeed, it knows nothing about the data being tracked beyond a simple hash that uniquely identifies each “torrent”, or collection of data to be shared. However, with multiple and/or large swarms of peers, there is a large corpus of peer data to be stored and constantly updated. Defunct peers (those that, for whatever reason, did not sign off when disconnecting) have to be removed from the pool from time to time. In our case, there is also a dynamic whitelist of IP addresses associated with authenticated users. These addresses are retired after a period of inactivity.
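To make that bookkeeping concrete, here is a rough in-memory sketch in Ruby. It is not the tracker described in this post, which keeps its state in PostgreSQL; the class name `PeerRegistry`, its method names, and the timeout values are all made up for illustration. It only shows the three jobs mentioned above: recording peers per torrent hash, sweeping out peers that have gone quiet, and retiring whitelisted IP addresses after a period of inactivity.

```ruby
# Hypothetical in-memory stand-in for a tracker's peer store.
# A real deployment would keep this state in a proper data store.
class PeerRegistry
  Peer = Struct.new(:ip, :port, :last_seen)

  # Both timeouts are in seconds; the defaults are assumed values.
  def initialize(peer_ttl: 1800, whitelist_ttl: 3600)
    @peer_ttl = peer_ttl            # silence after which a peer counts as defunct
    @whitelist_ttl = whitelist_ttl  # inactivity after which an IP is retired
    @swarms = Hash.new { |h, k| h[k] = {} } # info_hash => { peer_id => Peer }
    @whitelist = {}                         # ip => time of last activity
  end

  # Called when a paying customer authenticates from this address.
  def authorise(ip, now: Time.now)
    @whitelist[ip] = now
  end

  def authorised?(ip, now: Time.now)
    seen = @whitelist[ip]
    !seen.nil? && now - seen <= @whitelist_ttl
  end

  # Record an announce and return the other peers in the swarm,
  # or nil if the address is not on the whitelist.
  def announce(info_hash, peer_id, ip, port, now: Time.now)
    return nil unless authorised?(ip, now: now)
    @whitelist[ip] = now # an announce counts as activity
    swarm = @swarms[info_hash]
    swarm[peer_id] = Peer.new(ip, port, now)
    swarm.reject { |id, _| id == peer_id }.values
  end

  # A peer that signs off cleanly is removed straight away.
  def depart(info_hash, peer_id)
    @swarms.fetch(info_hash, {}).delete(peer_id)
  end

  # Periodic housekeeping: drop defunct peers, empty swarms, and idle IPs.
  def sweep(now: Time.now)
    @swarms.each_value do |swarm|
      swarm.delete_if { |_, peer| now - peer.last_seen > @peer_ttl }
    end
    @swarms.delete_if { |_, swarm| swarm.empty? }
    @whitelist.delete_if { |_, seen| now - seen > @whitelist_ttl }
  end

  def peer_count(info_hash)
    @swarms.key?(info_hash) ? @swarms[info_hash].size : 0
  end
end
```

Every method takes an optional `now:` argument so that expiry can be exercised without waiting for real time to pass. In a database-backed tracker, the `sweep` step would presumably become a single delete keyed on an indexed timestamp column, which is exactly the kind of work an existing data store already does well.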

The point is, however, that none of this is technically hard: indexing has been extensively researched in computer science. The trick is to build on this by making use of existing code. In other words, use a well-optimised data store rather than a bit of roll-your-own. And that’s what I’ve done.

I used a combination of Ruby and PostgreSQL to develop the tracker; the ratio of code is approximately 3:2 PostgreSQL:Ruby. Essentially, Ruby is handling the wire protocol via WEBrick servlets, whilst all the peer data is stored and manipulated by PostgreSQL stored procedures. It’s hard to say until further testing is completed, but it ought to be fairly efficient in terms both of speed and of memory required. We’ll see if that really is the case!
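For a flavour of what the Ruby half has to do, here is a small sketch of the encoding side of the wire protocol. BitTorrent trackers reply in “bencoded” form, a simple serialisation of integers, strings, lists, and dictionaries. The code below is not taken from the tracker itself: `bencode` and `announce_response` are illustrative names, the default interval is an assumed value, and the servlet plumbing is left out entirely so that the example runs on plain Ruby.

```ruby
# Minimal bencoder covering the four bencoded types:
# integers, byte strings, lists, and dictionaries.
def bencode(value)
  case value
  when Integer
    "i#{value}e"
  when String, Symbol
    s = value.to_s
    "#{s.bytesize}:#{s}" # length prefix counts bytes, not characters
  when Array
    "l#{value.map { |v| bencode(v) }.join}e"
  when Hash
    # Dictionary keys are strings and must appear in sorted order.
    pairs = value.map { |k, v| [k.to_s, v] }.sort_by(&:first)
    "d#{pairs.map { |k, v| bencode(k) + bencode(v) }.join}e"
  else
    raise ArgumentError, "cannot bencode #{value.class}"
  end
end

# Hypothetical helper: build an announce reply from [ip, port] pairs.
# `interval` tells clients how many seconds to wait before re-announcing.
def announce_response(peers, interval: 1800)
  bencode({
    "interval" => interval,
    "peers" => peers.map { |ip, port| { "ip" => ip, "port" => port } }
  })
end

puts announce_response([["10.0.0.1", 6881]], interval: 900)
# prints d8:intervali900e5:peersld2:ip8:10.0.0.14:porti6881eeee
```

In a split like the one described above, the list of peers would come back from a stored procedure, and the Ruby side would do little more than turn those rows into a reply of this shape.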