Don’t bother buffering

I started out programming on 8- and 16-bit computers, and it was hard back then. Uphill both ways in the snow and all that. My first steps in C were taken on a 16-bit platform, on which using more than 64 KiB in a program required serious attention. I don’t have much cause to write in C these days, but when I do, I’m pleasantly surprised by how much easier it seems. Partly, that’s because memory availability and bus width have outstripped the requirements of most tasks, but much of it is due to the huge improvements in operating systems (at least, in the operating systems that I’m using) over the past few decades. However, I sometimes forget just how much is being done for me in the background.

Today, I needed to speed up working out which of a few possible character encodings a 2 GiB file was in. Long story short, using Ruby and libiconv resulted in excessive memory use, poor performance, or both.

I took a different tack, and tried the Unix approach of writing a small tool to do one task well: ‘given a file, tell me whether it is UTF-8, ISO-8859-15 (Latin-9), Windows-1252, or something else unknown.’ I implemented it as a state machine, reading a byte at a time with fgetc(), using it to choose the next state, and eventually reaching either the end of the stream or the error state.
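To make the shape of that loop concrete, here is a minimal sketch. It is not the tool described here: it only answers the UTF-8 question, the state and function names are made up for illustration, and the check is deliberately simplified (it does not reject overlong three- and four-byte forms, surrogates, or code points above U+10FFFF).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical states: NEEDn means "n continuation bytes still expected". */
enum state { START, NEED1, NEED2, NEED3, ERROR };

/* Choose the next state from the current state and one input byte. */
static enum state next_state(enum state s, unsigned char b)
{
    switch (s) {
    case START:
        if (b < 0x80) return START;               /* plain ASCII */
        if (b >= 0xC2 && b <= 0xDF) return NEED1; /* lead byte, 2-byte sequence */
        if (b >= 0xE0 && b <= 0xEF) return NEED2; /* lead byte, 3-byte sequence */
        if (b >= 0xF0 && b <= 0xF4) return NEED3; /* lead byte, 4-byte sequence */
        return ERROR;                             /* stray continuation or invalid lead */
    case NEED1:
        return (b & 0xC0) == 0x80 ? START : ERROR;
    case NEED2:
        return (b & 0xC0) == 0x80 ? NEED1 : ERROR;
    case NEED3:
        return (b & 0xC0) == 0x80 ? NEED2 : ERROR;
    default:
        return ERROR;                             /* the error state is sticky */
    }
}

/* Returns 1 if the whole stream passes this simplified check, else 0. */
static int looks_like_utf8(FILE *in)
{
    enum state s = START;
    int c;

    assert(in != NULL);
    /* One byte at a time, until end of stream or the error state. */
    while (s != ERROR && (c = fgetc(in)) != EOF)
        s = next_state(s, (unsigned char)c);

    /* Ending in the middle of a multi-byte sequence also counts as a failure. */
    return s == START;
}
```

A real detector would need more states, or a second pass, to tell ISO-8859-15 from Windows-1252, since a byte sequence can be valid in both.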

I assumed that using fgetc() like this would be slow. In my dim recollection, reading bytes off a stream one at a time was inefficient, and it was my job as the C programmer to implement buffering. So I dutifully did so. I allocated a megabyte of memory and told the stream to use it. And then I tested the effect on the 2 GiB file.
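For anyone who hasn’t done this before, handing a stream your own buffer might look roughly like the sketch below. It assumes the standard setvbuf() call; the post doesn’t say which mechanism was actually used, and the helper name use_big_buffer is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Give an already-open stream a fully buffered, caller-owned buffer.
 * This must happen before any read or write on the stream.
 * Returns the buffer (free it only after fclose), or NULL on failure.
 */
static char *use_big_buffer(FILE *stream, size_t size)
{
    char *buf;

    assert(stream != NULL && size > 0);

    buf = malloc(size);
    if (buf == NULL)
        return NULL;

    /* _IOFBF asks for full buffering using buf as the backing store. */
    if (setvbuf(stream, buf, _IOFBF, size) != 0) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

A caller would open the file with fopen(path, "rb"), call use_big_buffer(in, 1 << 20) straight away, run the fgetc() loop, and then fclose(in) before freeing the buffer, in that order.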

A one-megabyte buffer didn’t make a difference, except, of course, that the program now took an extra megabyte of RAM when running. It still finished in around 45 seconds. In hindsight, that makes sense: stdio streams on regular files are normally buffered by default, so fgetc() was never going to the operating system once per byte in the first place. I tried a 100-megabyte buffer instead. This time, it was actually slower by about 50%. The process would stop and wait for the buffer to be refilled every time it ran out of bytes, and this took time.

I ripped out the useless buffering code. Lesson: the operating system is smarter than I am. It’s a good lesson to learn.

Incidentally, I’ll be releasing the code on reevoolabs soon. Tomorrow, if I can.