I know Unicode-fu
By chance, whilst researching content management packages today, I found my name mentioned in a gratifying context:
Here’s what happened: I found Michael Schubert’s interesting post on Levenshtein Distance in Ruby. (More about the algorithm here.) And he pointed to another implementation over at the Ruby Application Archive: levenshtein.rb.
Now it’s this second one that sorta surprised me, because it seems to support Unicode. How do I know this? Because. The first thing I do when I try anything is feed it some Unicode. What can I say? It’s a character flaw.
Now, the Levenshtein algorithm most definitely counts characters. That’s what edit distance is all about, after all. But we’ve been through this all before, remember? Counting utf-8 stuff in Ruby doesn’t work so hot, right?
Er, that’s what I thought. But behold, from the docs for that script:
distance(str1, str2): Calculate the Levenshtein distance between two strings str1 and str2. str1 and str2 should be ASCII or UTF-8.
¿Como say what?
Behold, black magic (Paul Battley, whoever you are, you have Unicode fu!):
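For anyone wondering how a Levenshtein routine can count characters rather than bytes, here is a rough sketch of the general idea. It is an illustration only, not the code from levenshtein.rb: the method name `codepoint_distance` is made up for this example, and it assumes both inputs are valid UTF-8. The trick is to unpack each string into an array of integer code points with `unpack('U*')` and then run the usual dynamic programming over those arrays.

```ruby
# Hypothetical helper for illustration; not taken from levenshtein.rb.
# Assumes str1 and str2 are valid UTF-8 (plain ASCII is a subset of that).
def codepoint_distance(str1, str2)
  # 'U*' decodes UTF-8 into integer code points, so each array element
  # stands for one character rather than one byte.
  s = str1.unpack('U*')
  t = str2.unpack('U*')
  return t.length if s.empty?
  return s.length if t.empty?

  # previous[j] is the distance between the characters of s handled so far
  # and the first j characters of t. It starts as the cost of building a
  # prefix of t from nothing.
  previous = (0..t.length).to_a

  s.each_with_index do |s_char, i|
    # Turning the first i + 1 characters of s into "" takes i + 1 deletions.
    current = [i + 1]
    t.each_with_index do |t_char, j|
      cost = s_char == t_char ? 0 : 1
      current << [
        previous[j + 1] + 1, # deletion
        current[j] + 1,      # insertion
        previous[j] + cost   # substitution (free when the characters match)
      ].min
    end
    previous = current
  end

  previous.last
end
```

With this sketch, `codepoint_distance("café", "cafe")` comes out as 1, because the accented character is compared as a single code point even though it occupies two bytes in UTF-8.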
It’s given me an Idea, however. I do actually know quite a lot about text encodings and the realities of multilingual processing. I’m interested in it, and I have plenty of experience. I could probably make a job out of consulting in that field, because it seems to be something that a lot of people don’t fully grasp. And I’d like to do that, in fact. The real difficulty lies in marketing my services, and finding the people who need that expertise and are willing to pay for it. It’s something to think about, anyway.
The graphical side of my site redesign is coming along nicely. I’d say it’s 99% done, but I haven’t been through the pain of testing it in Internet Explorer, my less-than-enthusiastic views on which I have previously exposited. It looks pretty damn good, though I say so myself. I’ve kept the same colours but expanded the palette slightly thanks to one of The Return of Design’s colour schemes. The search for a decent CMS, however, is not so easy. There is plenty of choice, but too many of the candidates are poorly written; I want something that is going to be easy to hack for my own requirements, and spaghetti code isn’t going to help me in that respect.