Fixing invalid UTF-8 in Ruby, revisited

When working with UTF-8-encoded text from an untrusted source like a web form, it’s a good idea to fix any invalid byte sequences at the first stage, to avoid breaking later processing steps that depend on valid input.

For a long while, the Ruby idiom that I’ve been using and recommending to others is this:

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)

IGNORE is supposed to tell the processor to silently discard bytes that it can’t convert. The output thus contains only valid byte sequences from the input—exactly what we want.

Today, quite by accident, I discovered a problem with it. Iconv in all its forms (library and command-line, on Linux and on Mac OS X) will ignore invalid byte sequences unless they occur right at the end of the string; compare this:

ic.iconv("foo303bar") # => "foobar"

and this:

ic.iconv("foo303") # Iconv::InvalidCharacter: "303"

What’s more, it’s only a certain range of bytes that break the conversion:

(128..255).inject([]){ |acc, b|
  begin
    ic.iconv("foo%c" % b)
    acc
  rescue
    acc << b
  end
}

The ‘dangerous’ bytes are those in the range 194-253. To put it another way, that’s all bytes of the binary pattern /^1{2,6}0/—the leading bytes from a UTF-8 byte sequence. (Incidentally, it’s interesting to see that, at least on OS X, it recognises the never-used and since-withdrawn five- and six-byte sequences from the original UTF-8 specification).

All of this is useful in explaining why it happens, but not how to fix it. The fix, however, is simple:

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]

Add a valid byte before converting, and remove it afterwards, and voilà—there’s never an invalid sequence at the end of the buffer. (It’s possible to improve the efficiency of this implementation if you don’t care about preserving the original string: use << instead of + to add the space.)

As to why //IGNORE doesn’t ignore this situation, I don’t know. As far as I can tell, the POSIX specification doesn’t specifically address the //IGNORE flag, so it’s hard to say what it should be doing.