[Web4lib] RSS and diacritics

Tue Nov 27 15:34:37 EST 2007

---- Original message ----
>Date: Tue, 27 Nov 2007 14:56:56 -0500
>From: Bob Duncan <duncanr at lafayette.edu>  
>Subject: [Web4lib] RSS and diacritics  
>To: web4lib at webjunction.org
>
>
>Greetings,
>
>I'm getting ready to offer RSS feeds for our library's recent 
>acquisitions lists and have run into a little snag:  characters with 
>diacritics.  I understand why I can't use HTML character entity 
>references and expect all feed readers to play nicely, so I tried 
>encoding the ampersand in the HTML entity reference (a suggested fix 
>that I can no longer document).  While this works great for some feed 
>readers, other readers and the two major browsers display the raw 
>code instead of the character with diacritical mark.
>
>Other than displaying plain letters without diacritics, is there a 
>way to code feeds so that all (or at least most) feed readers will 
>display the character with the mark?  (I'd like to be able to do this in 
>item titles and descriptions.)
>
>Thanks,
>

I guess I'm a little confused.  This could be any of several problems, and there's a lot more we need to know.  Where is the data with the diacritics coming from?  What encoding is it in?  Are you sure the data isn't being converted or corrupted when you query the source?

RSS feeds are XML.  If you're pulling Unicode data and putting it directly into the RSS feed, and the feed's declared encoding matches the bytes you actually write, you shouldn't have an issue.  The diacritics will be there.
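As a rough illustration of "the feed's encoding matches", here is a minimal Python sketch that writes a feed as UTF-8 and declares UTF-8 in the XML declaration.  The function name build_feed, the sample titles, and the example.org link are all made up for the example; I don't know what your system actually uses to generate the feed.

```python
import xml.etree.ElementTree as ET

def build_feed(title, link, items):
    """Return RSS 2.0 as UTF-8 bytes with a matching XML declaration.

    Hypothetical helper for illustration; `items` is a list of
    (item_title, description) pairs that are already Python str (Unicode).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for item_title, description in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "description").text = description
    # The declared encoding and the actual byte encoding are both UTF-8.
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)

feed_bytes = build_feed(
    "Recent Acquisitions",
    "http://example.org/acquisitions",  # placeholder URL
    [("Les Misérables", "Édition française")],
)
```

Plain XML only predefines five named entities (amp, lt, gt, quot, apos), which is why something like &eacute; is not safe in a feed.  If the feed has to stay pure ASCII, passing encoding="us-ascii" to the same call should make ElementTree emit numeric character references such as &#233; instead, which any XML parser is required to understand.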

That being said, Unicode isn't very well supported yet.  A lot of software and fonts don't have very complete character sets.  Arial Unicode has the most complete set that I know of so far.  People using a browser will need it set to use a Unicode font to see Unicode characters correctly.  On top of that, a lot of software mishandles combining diacritics (IE 6 is one example, if I recall correctly) and will never display them correctly.
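One possible workaround for software that chokes on combining diacritics is to normalize the text to precomposed form (NFC) before it goes into the feed.  This is only a sketch, and it only helps where a precomposed character exists; the function name to_precomposed is made up for the example.

```python
import unicodedata

def to_precomposed(text):
    """Fold base letter + combining mark into one precomposed character where possible."""
    return unicodedata.normalize("NFC", text)

decomposed = "Cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = to_precomposed(decomposed)

print(len(decomposed), len(composed))  # 5 4
print(composed == "Caf\u00e9")         # True
```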

Other issues, like bidirectional text, are still not clearly settled.  For example, if you have Hebrew and English in one document, it's not clear which layer of the software is responsible for doing the work necessary so each can be read in the right direction.

Unicode issues can run through several layers of software, including the server-side software commonly used for generating things like RSS feeds.  Often Unicode support is feasible, but it has to be set up deliberately, and frequently it isn't.

Unicode issues can be tricky, but you should be able to trace the data through the system and ensure that it's Unicode at every step.
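A simple way to do that tracing, assuming the pipeline is supposed to be UTF-8 throughout, is to try decoding the raw bytes at each stage and see where it first fails.  The helper name check_utf8 and the stage labels here are hypothetical; substitute whatever steps your system really has.

```python
def check_utf8(stage, data):
    """Return the decoded text, or raise ValueError naming the stage if bytes aren't UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"{stage}: not valid UTF-8 at byte {exc.start}") from exc

good = "Misérables".encode("utf-8")    # what a UTF-8 stage should hand over
bad = "Misérables".encode("latin-1")   # same text, but in a legacy encoding

print(check_utf8("catalog export", good))
try:
    check_utf8("feed generator input", bad)
except ValueError as err:
    print(err)
```

Bytes that fail this check at some stage are usually in a legacy encoding such as Latin-1 or MARC-8 and need an explicit conversion at that point, rather than a fix further downstream.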

Of course, if the source data isn't even in Unicode, that's another issue.

Jon Gorman