[Web4lib] RSS and diacritics
Andrew Cunningham
andrewc at vicnet.net.au
Tue Nov 27 18:54:48 EST 2007
Which version of RSS are you using, and does its schema/DTD define the
entities you want to use?
Re NCRs (numeric character references), have a look at
http://www.w3.org/International/questions/qa-escapes
Bob Duncan wrote:
> At 03:56 PM 11/27/2007, Jonathan Gorman wrote:
>> Apologies. In rereading, I realized I misinterpreted what you were
>> saying. I thought you had two distinct problems (using html character
>> entities) and issues with diacritics.
>
> Phew! I thought I was going to have to attempt a reply to your first
> response. ;o)
>
>> The answer as far as the entities? RSS can be a mess ;). RSS feeds
>> are XML. Sadly, a widespread practice has occurred of using "escaped
>> html" in fields of the RSS feeds. There's no way to ensure that these
>> escaping nightmares will be parsed correctly.
>>
Named entities need to be defined. By default XML supports only a small
handful (&amp;, &lt;, &gt;, &apos; and &quot;). Most of the named entities
in HTML don't exist in XML, unless the schema or DTD in question defines them.
For XML documents it's best to use an appropriate encoding that supports
all your character requirements, rather than using entities or NCRs.
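To illustrate the point about named entities, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample documents and the helper name `parses` are made up for illustration and are not taken from anyone's actual feed.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return the root element's text, or None if the XML is not well-formed."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None

# &eacute; is an HTML entity; nothing declares it here, so the parser rejects it.
undefined = parses("<title>Caf&eacute;</title>")

# &amp; is one of the five entities XML predefines (amp, lt, gt, apos, quot).
predefined = parses("<title>Fish &amp; Chips</title>")

# A numeric character reference needs no declaration at all.
ncr = parses("<title>Caf&#xE9;</title>")

print(undefined, predefined, ncr)
```

Other XML parsers may word the error differently, but a conforming one should refuse the first document in the same way.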
>> HTML defines some character entities, but RSS doesn't have all of
>> them. You can attempt to add these characters to the RSS feed via
>> including them in a Doctype declaration at the beginning of the feed.
>> This wikipedia page looks like it has some examples of that:
>> http://en.wikipedia.org/wiki/XML.
yep
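A rough sketch of what declaring an entity in the Doctype looks like, checked with Python's xml.etree.ElementTree. The feed below is a hypothetical fragment, and whether a particular feed reader honours an internal DTD subset is something to test rather than assume.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: the internal DTD subset declares eacute,
# mapping it to the numeric reference for U+00E9.
feed = """<?xml version="1.0"?>
<!DOCTYPE rss [
  <!ENTITY eacute "&#233;">
]>
<rss version="2.0">
  <channel>
    <title>Caf&eacute; news</title>
  </channel>
</rss>"""

root = ET.fromstring(feed)
title = root.find("channel/title").text
print(title)
```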
>> The best solution? Not really sure. I'd lean towards not using
>> "escaped html" in my RSS feed. Instead use just rss and the character
>> references, which should display cleanly assuming that the rss feeder
>> isn't junk.
Best solution: choose an appropriate encoding for your data and declare
that encoding.
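As a sketch of what "declare that encoding" can look like in practice, assuming UTF-8 as the chosen encoding and Python's xml.etree.ElementTree as the tool that writes the feed (the channel title is invented for the example):

```python
import xml.etree.ElementTree as ET

# Hypothetical one-title channel containing a literal accented character.
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Caf\u00e9 news"  # "Café news"

# Serialize as UTF-8 bytes and include the XML declaration naming the encoding.
# (The xml_declaration argument is available in Python 3.8 and later.)
data = ET.tostring(rss, encoding="utf-8", xml_declaration=True)

first_line = data.split(b"\n", 1)[0]
round_trip = ET.fromstring(data).find("channel/title").text
print(first_line, round_trip)
```

The character goes into the file as ordinary UTF-8 bytes, with no entity or NCR involved, and a parser reading the declaration gets the same text back.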
>> (And by character reference, I mean use &#x..; where .. is the
>> appropriate code point).
>
>
> One other question: which numeric reference is preferable? For
> example, both &#xC9; and &#201; (xC9 and 201) produce a Latin capital E
> acute. Are there good reasons to use one over the other? (And is
> either more likely than the other to be correctly rendered by browsers
> in non-RSS situations?)
Decimal is more likely to work with older browsers; either should work
with modern browsers. Hexadecimal is easier to work with when editing,
since it matches the U+ code point notation used in the Unicode charts.
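A small Python sketch showing that the two forms are the same code point written in different bases. The helper name `ncr_forms` is made up for this example; html.unescape is from the standard library.

```python
import html

def ncr_forms(char):
    """Return the (decimal, hexadecimal) numeric character references for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"

# U+00C9 is LATIN CAPITAL LETTER E WITH ACUTE.
decimal, hexadecimal = ncr_forms("\u00c9")
print(decimal, hexadecimal)

# Both references decode to the same character.
same = html.unescape(decimal) == html.unescape(hexadecimal) == "\u00c9"
print(same)
```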
--
Andrew Cunningham
Research and Development Coordinator (Vicnet)
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia
Email: andrewc at vicnet.net.au
Alt. email: lang.support at gmail.com
Ph: +613-8664-7430 Fax:+613-9639-2175
Mob: 0421-450-816
http://www.slv.vic.gov.au/ http://www.vicnet.net.au/
http://www.openroad.net.au/ http://www.mylanguage.gov.au/
http://home.vicnet.net.au/~andrewc/