Fixing invalid UTF-8 in Ruby, revisited

When working with UTF-8-encoded text from an untrusted source like a web form, it’s a good idea to fix any invalid byte sequences at the first stage, to avoid breaking later processing steps that depend on valid input.

For a long while, the Ruby idiom that I’ve been using and recommending to others is this:

require 'iconv'

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)

//IGNORE is supposed to tell iconv to silently discard bytes that it can’t convert. The output thus contains only valid byte sequences from the input—exactly what we want.

Today, quite by accident, I discovered a problem with it. Iconv in all its forms (library and command-line, on Linux and on Mac OS X) will ignore invalid byte sequences unless they occur right at the end of the string; compare this:

ic.iconv("foo\303bar") # => "foobar"

and this:

ic.iconv("foo\303") # Iconv::InvalidCharacter: "\303"

What’s more, it’s only a certain range of bytes that break the conversion:

(128..255).inject([]){ |acc, b|
    begin
        ic.iconv("foo%c" % b)
    rescue Iconv::InvalidCharacter
        acc << b
    end
    acc
}
# => the bytes that raise Iconv::InvalidCharacter: 194 to 253

The ‘dangerous’ bytes are those in the range 194-253. To put it another way, that’s all bytes of the binary pattern /^1{2,6}0/—the leading bytes from a UTF-8 byte sequence. (Incidentally, it’s interesting to see that, at least on OS X, it recognises the never-used and since-withdrawn five- and six-byte sequences from the original UTF-8 specification).

All of this is useful in explaining why it happens, but not how to fix it. The fix, however, is simple:

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]

Add a valid byte before converting, and remove it afterwards, and voilà—there’s never an invalid sequence at the end of the buffer. (It’s possible to improve the efficiency of this implementation if you don’t care about preserving the original string: use << instead of + to add the space.)

As to why //IGNORE doesn’t ignore this situation, I don’t know. As far as I can tell, the POSIX specification doesn’t specifically address the //IGNORE flag, so it’s hard to say what it should be doing.
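For readers on newer Rubies: since Ruby 2.1 (long after this post was written), String#scrub does the whole job without Iconv, trailing bytes included. A minimal sketch:

```ruby
# "\303" is a UTF-8 lead byte with no continuation byte after it,
# i.e. exactly the end-of-string case that trips up Iconv above.
s = "foo\303".force_encoding('UTF-8')
s.valid_encoding?  # => false
s.scrub('')        # => "foo" (invalid sequences dropped, even at the end)
```

Unlike the Iconv trick, no padding byte is needed: scrub handles an invalid sequence at the end of the buffer just like one in the middle.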


  1. Vrensk

    Wrote at 2006-07-05 23:19 UTC using Safari 417.9.3 on Mac OS X:

    This sounds really useful in a world with broken browsers. Do you have a tip on where to put this to make sure that all input is filtered? A before_filter that processes params[]?
  2. Paul

    Wrote at 2006-07-06 06:07 UTC using Firefox on Mac OS X:

    You could, but it would probably be more efficient to fit it in somewhere deeper, before Rails decodes the query string (or POST data) into the params hash. I’m not sure how easy that would be to achieve, however.
  3. Sanjeev

    Wrote at 2007-06-19 13:20 UTC using Firefox on Windows XP:

    Can we use this, also?

valid_string = untrusted_string.unpack('C*').pack('U*')

    Is there any limitation to this method compared to yours?
  4. Paul Battley

    Wrote at 2007-06-19 13:30 UTC using Firefox on Mac OS X:

    Is there any limitation? Well, yes: it doesn’t appear to work!

    >> $KCODE = 'u'
    => "u"
    >> str = 'paté'
    => "paté"
    >> str.unpack('C*').pack('U*')
    => "patÃ©"
  5. Sanjeev

    Wrote at 2007-06-19 13:54 UTC using Firefox on Windows XP:

    I get “pat\302\202”, which is correctly displayed in the browser. It is valid UTF-8.

    (The function is the one you have given.) It returns just “pat” at my end. The string is an e with accents.
  6. Sanjeev

    Wrote at 2007-06-19 13:58 UTC using Firefox on Windows XP:

    It knocks off the e with accents.

    I am working on Windows. Is this specific to the platform?
  7. Sanjeev

    Wrote at 2007-06-19 14:02 UTC using Firefox on Windows XP:

    If I give “pat\302\202” to the valid_utf_8 function, it returns the original string.

    It is just that the Win32 console is not able to display it properly on my end, but IE and Mozilla are both rendering it correctly.
  8. Paul Battley

    Wrote at 2007-06-19 15:40 UTC using Firefox on Mac OS X:

    Yep, that’s entirely a problem with the Windows console.

    I believe that it’s possible to set it to display in UTF-8, but the default setting is Windows 1252, at least on English versions.
  9. Sanjeev

    Wrote at 2007-06-19 15:43 UTC using Firefox on Windows XP:

    I think the problem is that my code “paté”, when I copy it to the shell, is encoded as ASCII.

    So unpack('C*') works in my case; pack('U*') packs the array into Unicode.

    If the string itself is in Unicode (proper and not malformed), unpack('C*') will not unpack it correctly, and hence I will not get back the same UTF-8 string.
  10. Paul Battley

    Wrote at 2007-06-19 16:49 UTC using Firefox on Mac OS X:

    Don’t treat ASCII as a synonym for 8-bit (one byte per character) encodings. It’s not correct. ASCII is a 7-bit encoding, and can’t handle characters like ‘é’.

    In fact, what you’re pasting in is probably Windows 1252, an 8-bit encoding which is mostly the same as ISO-8859-1.

    Now it just so happens that the first 256 codepoints in Unicode are the same as ISO-8859-1, so your code works for your example.

    That’s not a solution for fixing broken Unicode, though: it’s just a way of translating ISO-8859-1 into UTF-8.
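    The point above can be illustrated with a small sketch (using Ruby 1.9+ encoding methods for the comparison): reading each byte as a codepoint is exactly an ISO-8859-1 to UTF-8 conversion.

    ```ruby
    latin1 = "pat\xE9".force_encoding('ISO-8859-1')  # 0xE9 is 'é' in ISO-8859-1
    utf8   = latin1.unpack('C*').pack('U*')          # each byte becomes a codepoint
    utf8 == latin1.encode('UTF-8')                   # => true
    ```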
  11. Sanjeev

    Wrote at 2007-06-20 09:01 UTC using Firefox on Windows XP:

    Thanks for your responses. It really cleared many doubts I was having.

    [From Wikipedia]
    The exact response required of a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

    1. Insert a replacement character (usually ’?’).
    2. Ignore the bytes.
    3. Interpret each byte according to a legacy encoding (often ISO-8859-1 or CP1252).
    4. Not notice and decode as if the bytes were some similar bit of UTF-8.
    5. Stop decoding and report an error (possibly giving the caller the option to continue).

    So we can do all types of stuff in case of malformed unicode strings. Ignoring is one, translating is one.
    Yahoo search uses translating behavior.
  12. Pavel Šmerk

    Wrote at 2009-01-18 16:30 UTC using Chrome on Windows XP:

    What about the following (without the use of Iconv):

    untrusted_string.unpack('U*') rescue nil

    (=> only a test, whether the string is valid)

    untrusted_string.unpack('U*').pack('U*') rescue nil

    (=> returns string if it is valid or nil otherwise)
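    On Ruby 1.9 and later, String#valid_encoding? answers the same yes/no question without pack/unpack; a minimal sketch:

    ```ruby
    "foobar".valid_encoding?                              # => true
    "foo\303bar".force_encoding('UTF-8').valid_encoding?  # => false
    ```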
  13. Pavel Šmerk

    Wrote at 2009-01-18 16:42 UTC using Chrome on Windows XP:

    Oops, it’s a solution to another problem, I’m sorry! Feel free to remove my posts. :-)

    (While I was trying to find out how to recognise invalid UTF-8, I found this article and read the comments, and the pack/unpack stuff inspired my solution. But it does not, of course, repair broken UTF-8 strings in any way. :-)
  14. Tor Erik

    Wrote at 2009-04-23 15:43 UTC using Safari 525.27.1 on Mac OS X:

    Just noticed that in the case of “\351” you have to add TWO spaces, or it will still throw Iconv::InvalidCharacter.
  15. Jade Rubick

    Wrote at 2011-02-16 17:08 UTC using Chrome 9.0.597.102 on Mac OS X:

    Hi Paul:

    This is super helpful. Thank you so much for posting it!

  16. Bogdan Gusiev

    Wrote at 2011-12-05 10:19 UTC using Chrome 14.0.835.202 on Linux:

    Works well for me.

    But Iconv is deprecated in Ruby 1.9.3:

    :in `block in require': iconv will be deprecated in the future, use String#encode instead.

    Can you suggest how to reach the same with #encode method?
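    One possibility, sketched here assuming Ruby 1.9+ (clean_utf8 is a name chosen for illustration, not part of any API): round-trip the string through another encoding, replacing invalid and unconvertible sequences with an empty string, which drops them much as //IGNORE does.

    ```ruby
    def clean_utf8(untrusted)
      # Converting to UTF-16 forces every byte sequence to be checked;
      # invalid/undefined sequences are replaced with '' (i.e. dropped),
      # then the clean string is converted back to UTF-8.
      untrusted.encode('UTF-16', invalid: :replace, undef: :replace, replace: '')
               .encode('UTF-8')
    end

    clean_utf8("foo\303bar".force_encoding('UTF-8'))  # => "foobar"
    clean_utf8("foo\303".force_encoding('UTF-8'))     # => "foo"
    ```

    Since Ruby 2.1 there is also String#scrub, which does this in one call.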
  17. Daniel

    Wrote at 2012-02-17 22:32 UTC using Chrome 19.0.1041.0 on Mac OS X:

    Like Bogdan, I’m not making headway on this. Here’s a test-case
  18. Byung

    Wrote at 2012-03-25 12:08 UTC using Firefox 7.0.1 on Linux:

    It’s very helpful. Thank you!
  19. woto

    Wrote at 2012-05-15 01:13 UTC using Safari 534.55.3 on Mac OS X:

    It finally works, thanks. Best wishes from Russia.