HTMLEntities now works with Ruby 1.9 and JRuby
If you’re using my HTMLEntities library—and it seems that quite a lot of people are—you may be glad to know that it now (as of version 4.1.0) works with both Ruby 1.9.1 and JRuby 1.3.1.
I’ve been aware for a while that it wasn’t compatible with Ruby 1.9. That’s not really surprising, given the new regular expression engine (Oniguruma) and the significant changes in the way that character encoding is handled between 1.8 and 1.9. I have now finally done something about it.
There were two things I had to do to get regular expressions working in Ruby 1.9. One was to specify the encoding of the test files, which contain verbatim UTF-8 strings. I simply added the relevant directive at the top of those files:
# encoding: UTF-8
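As a rough illustration of what the directive does (a sketch, not code from the library or its tests), the snippet below assumes a file saved as UTF-8 and a made-up sample string. On Ruby 1.9 a non-ASCII literal without the directive is rejected as an invalid multibyte character; later Ruby versions default to UTF-8 source, so the comment is harmless there.

```ruby
# encoding: UTF-8

# SAMPLE is a hypothetical string, standing in for the verbatim UTF-8
# literals that the test files contain.
SAMPLE = "café"

puts __ENCODING__             # the source encoding Ruby assumes for this file
puts SAMPLE.encoding          # the encoding attached to the string literal
puts SAMPLE.valid_encoding?   # whether the bytes are well-formed in that encoding
```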
The second issue was that, as Oniguruma understands Unicode codepoints, I needed to use codepoint ranges instead of byte ranges. This was a bit tricky as it’s not documented in the Oniguruma syntax reference. I had to find it by trial and error. For future reference, you use \u{N}, where N is the hexadecimal codepoint. For example, this matches codepoints outside the printable ASCII range:
/[^\u{20}-\u{7E}]/
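To show the pattern in use, here is a minimal sketch that replaces every character outside printable ASCII with a decimal numeric character reference. The constant and the escape_non_ascii helper are hypothetical names invented for this example; they are not HTMLEntities' API and this is not necessarily how the library does it.

```ruby
# encoding: UTF-8

# Any single character outside U+0020..U+007E (printable ASCII).
NON_PRINTABLE_ASCII = /[^\u{20}-\u{7E}]/

# Hypothetical helper: swap each matching character for a decimal
# numeric character reference built from its codepoint.
def escape_non_ascii(text)
  text.gsub(NON_PRINTABLE_ASCII) { |char| format("&#%d;", char.ord) }
end

puts escape_non_ascii("caf\u00E9")   # "é" is U+00E9, i.e. decimal 233
```

Because the range is expressed in codepoints, a multi-byte character such as "é" is matched once as a whole character rather than once per byte.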
As a bonus, I also tested it against JRuby and got it working there. The performance is, alas, noticeably worse on JRuby than on either 1.8 or 1.9. I suspect that’s due to the additional layers of indirection in the regular expression engine and string handling, but I’m not sure. Still, working is better than not working, so I count it as progress.
I’m very interested in hearing any feedback, good or bad, and especially if I’ve accidentally introduced any bugs in spite of the test suite.