Fixing invalid UTF-8 in Ruby, revisited
When working with UTF-8-encoded text from an untrusted source like a web form, it’s a good idea to fix any invalid byte sequences at the first stage, to avoid breaking later processing steps that depend on valid input.
For a long while, the Ruby idiom that I’ve been using and recommending to others is this:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)
IGNORE is supposed to tell the processor to silently discard bytes that it can’t convert. The output thus contains only valid byte sequences from the input, which is exactly what we want.
Today, quite by accident, I discovered a problem with it. Iconv in all its forms (library and command-line, on Linux and on Mac OS X) will ignore invalid byte sequences unless they occur right at the end of the string; compare this:
ic.iconv("foo\303bar") # => "foobar"
and this:
ic.iconv("foo\303") # Iconv::InvalidCharacter: "\303"
What’s more, it’s only a certain range of bytes that break the conversion:
# Collect every high byte that makes the conversion fail when it ends the string.
(128..255).inject([]) { |acc, b|
  begin
    ic.iconv("foo%c" % b)
    acc
  rescue
    acc << b
  end
}
The ‘dangerous’ bytes are those in the range 194-253. To put it another way, those are the bytes of the binary pattern /^1{2,6}0/, the leading bytes from a UTF-8 byte sequence, minus 192 and 193, which match the pattern but could only begin overlong encodings and so are never valid. (Incidentally, it’s interesting to see that, at least on OS X, it recognises the never-used and since-withdrawn five- and six-byte sequences from the original UTF-8 specification.)
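The pattern is easy to check against the observed range. What follows is only a minimal sketch: the constant names are made up for illustration, and it tests each byte’s 8-bit binary form against the regular expression rather than calling Iconv at all.

```ruby
# Leading-byte pattern from the text, applied to each byte's 8-bit binary form.
LEADING_BYTE_PATTERN = /^1{2,6}0/

# Hypothetical constant name: every byte from 128 to 255 whose bits match.
MATCHING_BYTES = (128..255).select { |b| ('%08b' % b) =~ LEADING_BYTE_PATTERN }

puts "#{MATCHING_BYTES.first}..#{MATCHING_BYTES.last} (#{MATCHING_BYTES.size} bytes)"
# prints: 192..253 (62 bytes)
```

Set beside the 194-253 that actually break the conversion, this shows the pattern to be a close description of the failing bytes rather than an exact one.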
All of this is useful in explaining why it happens, but not how to fix it. The fix, however, is simple:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
Add a valid byte before converting, and remove it afterwards, and voilà: there’s never an invalid sequence at the end of the buffer. (It’s possible to improve the efficiency of this implementation if you don’t care about preserving the original string: use << instead of + to add the space.)
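The trade-off between + and << comes down to whether the caller’s string is modified. Here is a small sketch of the difference, using two hypothetical helper methods (pad_copy and pad_in_place are names invented for this example) and leaving the Iconv call out entirely:

```ruby
# Hypothetical helpers illustrating the two ways of adding the padding space.

# Non-destructive: + builds a new string, so the caller's string is untouched.
def pad_copy(str)
  str + ' '
end

# Destructive: << appends to the receiver in place and returns that same object.
def pad_in_place(str)
  str << ' '
end

original = 'foo'.dup        # dup so the string is mutable even if literals are frozen
padded = pad_copy(original)
puts original.inspect       # "foo"  (unchanged)
puts padded.inspect         # "foo "

pad_in_place(original)
puts original.inspect       # "foo " (the caller's string now carries the space)
```

With <<, the [0..-2] slice only trims the converted result; the input string keeps its trailing space, which is why this variant only suits cases where the original is disposable.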
As to why //IGNORE doesn’t ignore this situation, I don’t know. As far as I can tell, the POSIX specification doesn’t specifically address the //IGNORE flag, so it’s hard to say what it should be doing.