[Web4lib] RSS and diacritics

Tue Nov 27 18:54:48 EST 2007

Which version of RSS are you using, and does its schema/DTD define the 
entities you want to use?

Re NCRs, have a look at http://www.w3.org/International/questions/qa-escapes

Bob Duncan wrote:
> At 03:56 PM 11/27/2007, Jonathan Gorman wrote:
>> Apologies. In rereading I realized I misinterpreted what you were 
>> saying.  I thought you had two distinct problems: using HTML character 
>> entities, and issues with diacritics.
> 
> Phew!  I thought I was going to have to attempt a reply to your first 
> response. ;o)
> 
>> The answer as far as the entities?  RSS can be a mess ;).  RSS feeds 
>> are XML.  Sadly, a widespread practice has occurred of using "escaped 
>> html" in fields of the RSS feeds.  There's no way to ensure that these 
>> escaping nightmares will be parsed correctly.
>>

Named entities need to be defined. XML itself predefines only five 
(&amp;, &lt;, &gt;, &apos; and &quot;). Most of the named entities in 
HTML don't exist in XML unless the schema or DTD in question defines 
them.
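
To illustrate the point, here is a minimal sketch using Python's standard 
xml.etree.ElementTree parser (my choice for the example, not something 
from the original question). It assumes a bare XML fragment with no DTD, 
and the sample element names and text are made up.

```python
import xml.etree.ElementTree as ET

# The five entities XML predefines parse without any DTD.
predefined = ET.fromstring("<t>&amp; &lt; &gt; &apos; &quot;</t>").text

# An HTML-only named entity is undefined in plain XML, so the parser rejects it.
try:
    ET.fromstring("<title>&Eacute;cole</title>")
    html_entity_ok = True
except ET.ParseError:
    html_entity_ok = False

# A numeric character reference needs no declaration at all.
ncr_text = ET.fromstring("<title>&#xC9;cole</title>").text

print(repr(predefined), html_entity_ok, ncr_text)
```

Other XML parsers should behave the same way for the undefined entity, 
since this is a well-formedness error rather than a parser quirk.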

For XML documents it's best to use an appropriate encoding that supports 
all your character requirements rather than using entities or NCRs.

>> HTML defines some character entities, but RSS doesn't have all of 
>> them.  You can attempt to add these characters to the RSS feed via 
>> including them in a Doctype declaration at the beginning of the feed.  
>> This wikipedia page looks like it has some examples of that: 
>> http://en.wikipedia.org/wiki/XML.

yep
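
A rough sketch of what such a declaration can look like, again using 
Python's xml.etree.ElementTree purely for illustration. The feed content 
is invented, and declaring only Eacute is an assumption to keep the 
example short. Whether a particular feed reader honours an internal DTD 
subset is a separate question and would need testing.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the internal DTD subset declares the one named
# entity the document uses, mapping it to a numeric character reference.
feed = """<?xml version="1.0"?>
<!DOCTYPE rss [
  <!ENTITY Eacute "&#201;">
]>
<rss version="2.0"><channel><title>&Eacute;cole</title></channel></rss>"""

# The parser expands &Eacute; using the declaration above.
title = ET.fromstring(feed).find("channel/title").text
print(title)
```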

>> The best solution?  Not really sure.  I'd lean towards not using 
>> "escaped html" in my RSS feed.  Instead use just rss and the character 
>> references, which should display cleanly assuming that the rss feeder 
>> isn't junk.

Best solution: choose an appropriate encoding for your data and declare 
that encoding.
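
As a sketch of that approach, assuming UTF-8 as the encoding and Python's 
xml.etree.ElementTree as the tool (both are my assumptions for the 
example; the xml_declaration argument to tostring needs Python 3.8 or 
later). The channel title is invented sample data.

```python
import xml.etree.ElementTree as ET

# Build a tiny feed whose text contains the literal characters,
# with no entities or NCRs.
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "\u00c9cole fran\u00e7aise"

# Serialise as UTF-8 bytes with an explicit XML declaration naming the encoding.
data = ET.tostring(rss, encoding="utf-8", xml_declaration=True)

# Parsing the bytes back recovers the original characters.
round_trip = ET.fromstring(data).find("channel/title").text
print(data.decode("utf-8"))
```

If the feed is served over HTTP, the charset the server sends should 
agree with the encoding declared in the document.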

>> (And by character reference, I mean use &#x..; where .. is the 
>> appropriate code point).
> 
> 
> One other question:  which numeric reference is preferable?  For 
> example, both &#xC9; and &#201; (xC9 and 201) produce a Latin capital E 
> acute.  Are there good reasons to use one over the other?  (And is 
> either more likely than the other to be correctly rendered by browsers 
> in non-RSS situations?)

Decimal is more likely to work with older browsers; either should work 
with modern browsers. Hexadecimal is easier to work with when editing, 
since Unicode code charts list code points in hex.
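
To show that the two forms name the same character, here is a small 
Python sketch. The helper name numeric_refs is hypothetical, and 
html.unescape is used only as a convenient stand-in for what a browser 
does when it decodes a reference.

```python
import html

def numeric_refs(ch):
    """Return the (hexadecimal, decimal) character references for one character."""
    code_point = ord(ch)
    return f"&#x{code_point:X};", f"&#{code_point};"

# U+00C9 is LATIN CAPITAL LETTER E WITH ACUTE.
hex_ref, dec_ref = numeric_refs("\u00c9")

# Both references decode to the same character.
decoded_hex = html.unescape(hex_ref)
decoded_dec = html.unescape(dec_ref)
print(hex_ref, dec_ref, decoded_hex == decoded_dec)
```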

-- 
Andrew Cunningham
Research and Development Coordinator (Vicnet)
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Email: andrewc@vicnet.net.au
Alt. email: lang.support@gmail.com

Ph: +613-8664-7430                    Fax:+613-9639-2175
Mob: 0421-450-816

http://www.slv.vic.gov.au/            http://www.vicnet.net.au/
http://www.openroad.net.au/           http://www.mylanguage.gov.au/
http://home.vicnet.net.au/~andrewc/