Fixing invalid UTF-8 in Ruby, revisited

When working with UTF-8-encoded text from an untrusted source like a web form, it’s a good idea to fix any invalid byte sequences at the first stage, to avoid breaking later processing steps that depend on valid input.

For a long while, the Ruby idiom that I’ve been using and recommending to others is this:

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)

IGNORE is supposed to tell the processor to silently discard bytes that it can’t convert. The output thus contains only valid byte sequences from the input—exactly what we want.

Today, quite by accident, I discovered a problem with it. Iconv in all its forms (library and command-line, on Linux and on Mac OS X) will ignore invalid byte sequences unless they occur right at the end of the string; compare this:

ic.iconv("foo\303bar") # => "foobar"

and this:

ic.iconv("foo\303") # Iconv::InvalidCharacter: "\303"

What’s more, it’s only a certain range of bytes that break the conversion:

(128..255).inject([]){ |acc, b|
  begin
    ic.iconv("foo%c" % b)
    acc
  rescue
    acc << b
  end
}

The ‘dangerous’ bytes are those in the range 194-253. To put it another way, that’s nearly every byte matching the binary pattern /^1{2,6}0/, the leading bytes of a UTF-8 byte sequence; 192 and 193 also match the pattern, but they can never appear in valid UTF-8. (Incidentally, it’s interesting to see that, at least on OS X, it recognises the never-used and since-withdrawn five- and six-byte sequences from the original UTF-8 specification.)
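
The bit pattern can be checked without Iconv at all. The following is only a sketch: `leading` and `announced_length` are names made up for this illustration, and it says nothing about how Iconv itself classifies bytes. It lists the bytes that match the pattern and counts the leading 1 bits, which is the length of the sequence that a leading byte announces under the original six-byte definition.

```ruby
# Bytes whose 8-bit binary form starts with two to six 1s followed by a 0.
leading = (128..255).select { |b| b.to_s(2) =~ /\A1{2,6}0/ }

leading.first  # => 192
leading.last   # => 253

# Hypothetical helper: the number of leading 1 bits is the total length
# of the sequence that this byte claims to start.
def announced_length(byte)
  byte.to_s(2)[/\A1+/].length
end

announced_length(0xC3)  # => 2 (0xC3 is octal 303)
announced_length(0xE9)  # => 3 (0xE9 is octal 351)
```

The pattern matches 192 to 253, so it is slightly wider than the 194-253 range found by the experiment: 192 and 193 look like leading bytes but could only begin overlong two-byte encodings.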

All of this is useful in explaining why it happens, but not how to fix it. The fix, however, is simple:

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]

Add a valid byte before converting, and remove it afterwards, and voilà—there’s never an invalid sequence at the end of the buffer. (It’s possible to improve the efficiency of this implementation if you don’t care about preserving the original string: use << instead of + to add the space.)

As to why //IGNORE doesn’t ignore this situation, I don’t know. As far as I can tell, the POSIX specification doesn’t specifically address the //IGNORE flag, so it’s hard to say what it should be doing.
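
Iconv was deprecated in Ruby 1.9.3 and later dropped from the standard library, so the snippets above will not run on a current Ruby without extra setup. As a rough equivalent, and not the Iconv technique described in this post, String#scrub (Ruby 2.1 and later) can discard invalid sequences wherever they occur, including at the end of the string. The helper name `to_valid_utf8` is an assumption chosen for this sketch.

```ruby
# Hypothetical helper, not the Iconv approach from the article:
# drops invalid byte sequences using String#scrub (Ruby 2.1+).
def to_valid_utf8(untrusted_string)
  # dup leaves the caller's string untouched; force_encoding only
  # relabels the bytes as UTF-8, and scrub('') removes invalid ones.
  untrusted_string.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# .b yields a binary-tagged copy, standing in for raw untrusted input.
to_valid_utf8("foo\303bar".b)  # => "foobar"
to_valid_utf8("foo\303".b)     # => "foo"
```

Calling scrub with no argument would substitute U+FFFD for each invalid sequence instead of removing it, which may be preferable when silently dropping bytes is a concern.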

Comments

  1. Vrensk

    Wrote at 2006-07-05 23:19 UTC using Safari 417.9.3 on Mac OS X:

    This sounds really useful in a world with broken browsers. Do you have a tip on where to put this to make sure that all input is filtered? A before_filter that processes params[]?
  2. Paul

    Wrote at 2006-07-06 06:07 UTC using Firefox 1.5.0.4 on Mac OS X:

    You could, but it would probably be more efficient to fit it in somewhere deeper, before Rails decodes the query string (or POST data) into the params hash. I’m not sure how easy that would be to achieve, however.
  3. Sanjeev

    Wrote at 2007-06-19 13:20 UTC using Firefox 2.0.0.4 on Windows XP:

    Can we use this, also?

    valid_string = untrusted_string.unpack('C*').pack('U*')

    Is there any limitation to this method compared to yours?
  4. Paul Battley

    Wrote at 2007-06-19 13:30 UTC using Firefox 2.0.0.4 on Mac OS X:

    Is there any limitation? Well, yes: it doesn’t appear to work!

    >> $KCODE = 'u'
    => "u"
    >> str = 'paté'
    => "paté"
    >> str.unpack('C*').pack('U*')
    => "patÃ©"
  5. Sanjeev

    Wrote at 2007-06-19 13:54 UTC using Firefox 2.0.0.4 on Windows XP:

    I get “pat\302\202”, which is displayed correctly in the browser. It is valid UTF-8.

    # the function is what you have given
    'paté'.to_valid_utf8
    returns just “pat” at my end. It is string e with accents.
  6. Sanjeev

    Wrote at 2007-06-19 13:58 UTC using Firefox 2.0.0.4 on Windows XP:

    it knocks off e with accents. *

    I am working on Windows. Is it specific to platform?
  7. Sanjeev

    Wrote at 2007-06-19 14:02 UTC using Firefox 2.0.0.4 on Windows XP:

    If I give “pat\302\202” to the valid_utf_8 function, it returns the original string.

    It is just that the win32 console is not able to display it properly on my end, but IE and Mozilla both render it correctly.
  8. Paul Battley

    Wrote at 2007-06-19 15:40 UTC using Firefox 2.0.0.4 on Mac OS X:

    Yep, that’s entirely a problem with the Windows console.

    I believe that it’s possible to set it to display in UTF-8, but the default setting is Windows 1252, at least on English versions.
  9. Sanjeev

    Wrote at 2007-06-19 15:43 UTC using Firefox 2.0.0.4 on Windows XP:

    I think the problem is that my string “paté”, when I copy it into the shell, is encoded as ASCII.

    So unpack('C*') works in my case, and pack('U*') packs the array into Unicode.

    If the string itself is already in Unicode (proper and not malformed),

    unpack('C*') will not unpack it correctly, hence I will not get back the same UTF-8 string.
  10. Paul Battley

    Wrote at 2007-06-19 16:49 UTC using Firefox 2.0.0.4 on Mac OS X:

    Don’t treat ASCII as a synonym for 8-bit (one byte per character) encodings: it isn’t one. ASCII is a 7-bit encoding, and can’t handle characters like ‘é’.

    In fact, what you’re pasting in is probably Windows 1252, an 8-bit encoding which is mostly the same as ISO-8859-1.

    Now it just so happens that the first 256 codepoints in Unicode are the same as ISO-8859-1, so your code works for your example.

    That’s not a solution for fixing broken Unicode, though: it’s just a way of translating ISO-8859-1 into UTF-8.
  11. Sanjeev

    Wrote at 2007-06-20 09:01 UTC using Firefox 2.0.0.4 on Windows XP:

    Thanks for your responses. It really cleared many doubts I was having.

    [From Wikipedia]
    The exact response required of a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

    1. Insert a replacement character (usually ‘?’).
    2. Ignore the bytes.
    3. Interpret each byte according to a legacy encoding (often ISO-8859-1 or CP1252).
    4. Not notice and decode as if the bytes were some similar bit of UTF-8.
    5. Stop decoding and report an error (possibly giving the caller the option to continue).

    So we can do all types of stuff in case of malformed unicode strings. Ignoring is one, translating is one.
    Yahoo search uses translating behavior.
  12. Pavel Šmerk

    Wrote at 2009-01-18 16:30 UTC using Chrome 1.0.154.36 on Windows XP:

    What about the following (without the use of Iconv):

    untrusted_string.unpack('U*') rescue nil

    (=> only a test of whether the string is valid)

    untrusted_string.unpack('U*').pack('U*') rescue nil

    (=> returns the string if it is valid, or nil otherwise)
  13. Pavel Šmerk

    Wrote at 2009-01-18 16:42 UTC using Chrome 1.0.154.36 on Windows XP:

    Ooops, it’s a solution to a different problem, I’m sorry! Feel free to remove my posts. :-)

    (While I was looking for a way to recognize invalid UTF-8, I found this article and read the comments, and the pack/unpack stuff inspired my solution. But it does not, of course, repair broken UTF-8 strings in any way. :-)
  14. Tor Erik

    Wrote at 2009-04-23 15:43 UTC using Safari 525.27.1 on Mac OS X:

    Just noticed that in the case of “\351” you have to add TWO spaces, or it will still throw Iconv::InvalidCharacter.
  15. Jade Rubick

    Wrote at 2011-02-16 17:08 UTC using Chrome 9.0.597.102 on Mac OS X:

    Hi Paul:

    This is super helpful. Thank you so much for posting it!

    Jade
  16. Bogdan Gusiev

    Wrote at 2011-12-05 10:19 UTC using Chrome 14.0.835.202 on Linux:

    Works well for me.

    But Iconv is deprecated in Ruby 1.9.3:

    :in `block in require': iconv will be deprecated in the future, use String#encode instead.

    Can you suggest how to reach the same with #encode method?
  17. Daniel

    Wrote at 2012-02-17 22:32 UTC using Chrome 19.0.1041.0 on Mac OS X:

    Like Bogdan, I’m not making headway on this. Here’s a test case:

    https://gist.github.com/e9d4b390bf11b1689c73
  18. Byung

    Wrote at 2012-03-25 12:08 UTC using Firefox 7.0.1 on Linux:

    It’s very helpful. Thank you!
  19. woto

    Wrote at 2012-05-15 01:13 UTC using Safari 534.55.3 on Mac OS X:

    It finally works, thanks. Best wishes from Russia.
