News About News

I spent a bit of time today improving my news aggregator.

The two main features are:

Caching
Fixing broken text

Caching provides several advantages. It is possible to schedule the program to run more frequently in order to pick up unreliable channels, without hammering servers (and getting IP blocked). It also means that formatting tweaks can be tested quickly and easily.
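
To make the caching idea concrete, here is a minimal sketch rather than the aggregator's actual code. It assumes one cache file per feed and a maximum age in seconds. The names fetch_cached, $cache_file and $max_age, and the $fetch callback that does the real download, are all invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the feed text for $url, going to the network
# only when the cached copy in $cache_file is missing or older than
# $max_age seconds. $fetch is a code reference that does the real download.
sub fetch_cached
{
  my ($url, $cache_file, $max_age, $fetch) = @_;

  # Element 9 of stat() is the modification time; undef if no file yet.
  my $mtime = (stat ($cache_file))[9];
  if (defined $mtime && (time () - $mtime) < $max_age)
  {
    # Cache hit: read the stored copy and leave the server alone.
    open (my $in, '<:raw', $cache_file) or die "Cannot read $cache_file: $!";
    local $/; # slurp mode
    my $text = <$in>;
    close ($in);
    return $text;
  }

  # Cache miss or stale copy: download once and store the result.
  my $text = $fetch->($url);
  open (my $out, '>:raw', $cache_file) or die "Cannot write $cache_file: $!";
  print $out $text;
  close ($out);
  return $text;
}
```

With something like this in place, running the program every few minutes costs the server nothing until the cached copy expires, and a formatting tweak can be tested against the stored copy straight away.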

As far as broken text is concerned, things are a little complicated. RSS should be in UTF-8 format. It is just about acceptable to use a different encoding, provided that this is specified in the header. However, many feeds, despite announcing themselves as UTF-8, are nothing of the sort. In English feeds, this tends to mean garbled quotation marks and currency symbols. In French, all the accents turn into blanks or random kanji. Many people, it seems, just assume that continuing to use a character set that has only a fifty percent correspondence is close enough.
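
To see why the mislabelling matters: in ISO-8859-1, "é" is the single byte 0xE9, whereas UTF-8 encodes it as the two bytes 0xC3 0xA9. A lone 0xE9 is not valid UTF-8, so a strict decoder rejects it. The sketch below uses Perl's core Encode module to run that check. The helper name is_valid_utf8 is made up for this example, and this is only one possible way of detecting the problem.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, else 0.
# FB_CROAK makes decode() die on the first malformed sequence;
# LEAVE_SRC stops it from modifying the data it is given.
sub is_valid_utf8
{
  my ($bytes) = @_;
  my $ok = eval
  {
    decode ('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
  };
  return $ok ? 1 : 0;
}

my $latin1 = "caf\xE9";     # "café" as ISO-8859-1: one byte for the é
my $utf8   = "caf\xC3\xA9"; # "café" as UTF-8: two bytes for the é

for my $sample ($latin1, $utf8)
{
  my $verdict = is_valid_utf8 ($sample) ? "valid" : "invalid";
  printf ("%s is %s UTF-8\n", unpack ('H*', $sample), $verdict);
}
```

The first sample is reported as invalid and the second as valid, which is exactly the situation a feed is in when it claims UTF-8 but ships ISO-8859-1 bytes.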

Fortunately, it’s relatively straightforward to find these illegal characters and translate them into legal ones, as this chunk of code I came up with shows:

# Kludge for fixing feeds made by ignorami whose UTF-8 is actually ISO-8859-1.
# Since the extended characters are not legal in context (except in very
# unlikely circumstances), it’s possible to find them and translate them
# back again.
# The mapping of ISO-8859-1 to Unicode is straightforward (it’s an algorithm,
# not a lookup table).  It’s Eurocentric, but it seems to be only users of
# Western languages who assume that their characters are just going to show
# up OK.
#
# To find an illegal character c(n), there are two possibilities:
# c(n) is 11xx xxxx but next character c(n+1) is not 10xx xxxx
# c(n) is 10xx xxxx but previous character c(n-1) is not 1xxx xxxx
# Since Windows “Smart Quotes” (0x91 to 0x94) pop up a lot, but aren’t in the
# proper ISO coding, I’ve added a fix for them as well.

# The first and last bytes are never examined, so $c0 and $c2 always exist.
for (my $i = 1; $i < (length ($raw) - 1); $i++)
{
  my $c1 = ord (substr ($raw, $i, 1));
  if ($c1 & 0x80) # otherwise, it's plain ASCII, and definitely legal
  {
    my $c0 = ord (substr ($raw, $i-1, 1));
    my $c2 = ord (substr ($raw, $i+1, 1));
    if ((($c1 >> 6) == 0x03 && ($c2 >> 6) != 0x02)
    || (($c1 >> 6) == 0x02 && ($c0 >> 7) != 0x01))
    {
      my $char;
      if ($c1 >= 0x91 && $c1 <= 0x94) # quotes
      { $char = sprintf ("%c%c%c", 0xE2, 0x80, $c1 + (($c1 <= 0x92) ? 7 : 9)); }
      else # regular ISO->UTF conversion
      { $char = sprintf ("%c%c", ($c1 >> 6) | 0xC0, ($c1 & 0x3F) | 0x80); }
      substr ($raw, $i, 1, $char);
      $i++; # skip the first continuation byte just inserted
    }
  }
}
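
To see what the two sprintf lines are doing, it helps to run the arithmetic on a single byte. The following is only an illustrative sketch: latin1_byte_to_utf8 is a made-up helper that applies the two rules described in the comments (smart quotes 0x91 to 0x94, then the general ISO-8859-1 case) to one byte value and returns the resulting bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one stray byte (0x80 to 0xFF) into the UTF-8
# byte sequence for the character it was probably meant to be.
sub latin1_byte_to_utf8
{
  my ($c) = @_;

  if ($c >= 0x91 && $c <= 0x94)
  {
    # Windows smart quotes map to U+2018, U+2019, U+201C and U+201D,
    # whose UTF-8 forms are E2 80 98, E2 80 99, E2 80 9C and E2 80 9D.
    return pack ('C3', 0xE2, 0x80, $c + (($c <= 0x92) ? 7 : 9));
  }

  # ISO-8859-1 code points equal Unicode code points, so the top two bits
  # go into the lead byte (110000xx) and the low six bits into the
  # continuation byte (10xxxxxx).
  return pack ('C2', ($c >> 6) | 0xC0, ($c & 0x3F) | 0x80);
}

# 0xE9 is "é" in ISO-8859-1; 0x93 is a Windows opening double quote.
for my $byte (0xE9, 0x93)
{
  my $hex = uc (unpack ('H*', latin1_byte_to_utf8 ($byte)));
  printf ("0x%02X -> %s\n", $byte, $hex);
}
```

This prints "0xE9 -> C3A9" and "0x93 -> E2809C": the accented letter becomes a two-byte sequence, and the smart quote becomes the three-byte form of U+201C.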

I’m very pleased with it. It fixes the French and English channels without breaking the Japanese ones. I couldn’t even put Le Monde in before, because it was completely unreadable, but I have now added it.

I think that the news engine is almost ready for public release. I’ve been using it daily to provide my breakfast and lunchtime reading, and I haven’t found any problems. I’ve cleaned up some rough edges, like the handling of illegal characters. It’s simple and well documented, and easy to customise. I’m just wondering whether I should find a better name. Its working title is “Arse” from RSS. Try saying “RSS” out loud. It has a contrived acrostic, too: “An RSS Summary Engine.” It’s a bit puerile, and I’m not sure that I want my name to be forever attached to a program called “Arse”! Suggestions gratefully accepted...