[WEB4LIB] Re: non-SGML characters
Eric Hellman
eric at openly.com
Thu Jan 31 23:51:51 EST 2002
If you use numeric entities in XML, it won't matter what encoding you set.
In other words, &#8226; means BULLET (BLACK SMALL CIRCLE)
whether your encoding is UTF-8, Shift-JIS, EUC-KR or mac-symbol.
In XML, the encoding tells the parser how to read bytes. However, the
character set is ALWAYS Unicode, excluding the control characters.
The Unicode character #149 is in a control character zone and is
not a legal XML character.
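
A minimal sketch of the point about numeric entities, assuming Python's standard xml.etree.ElementTree parser (not something used in the original thread). The helper name bullet_text and the list of encodings are illustrative choices. The same reference, &#8226;, is parsed under several declared encodings and should come back as the same Unicode character each time.

```python
import xml.etree.ElementTree as ET


def bullet_text(encoding: str) -> str:
    """Parse a tiny document declared in `encoding` and return its text.

    The document body is pure ASCII: the bullet appears only as the
    numeric character reference &#8226;, never as a raw byte.
    """
    doc = f'<?xml version="1.0" encoding="{encoding}"?><p>&#8226;</p>'
    # Serialize with the same encoding that the XML declaration names.
    return ET.fromstring(doc.encode(encoding)).text


# Hypothetical selection of encodings; any encoding the parser supports
# should behave the same way.
results = {enc: bullet_text(enc) for enc in ("utf-8", "iso-8859-1", "us-ascii", "utf-16")}

for enc, text in results.items():
    # Every entry should report code point U+2022.
    print(f"{enc}: U+{ord(text):04X}")
```

The declared encoding changes how the bytes of the file are decoded, but the number inside the reference is always read as a Unicode code point.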
Eric
At 1:00 PM -0800 1/31/02, Thomas Dowling wrote:
>>I think you can solve the academic validation problem with
>>"charset=Windows-1252". However, that won't solve the fact that there are
>>browsers that will either not display any character there, or display some
>>other character. I don't know of any browsers that actually change their
>>character handling based on the charset in the content type. There may be
>>*indexers* - but they'd want an actual HTTP header, not a meta tag (if I'm
>>wrong, someone stop me before I make a fool of myself).
>
>
>Well, that ship has sailed. Obviously, browsers respond to charsets in
>order to display pages in non-Roman scripts. Also, while changing the
>charset might make the actual character #149 valid, the numeric character
>entity "&#149;" still represents Unicode and is still invalid (you see,
>there's SGML's and XML's "document character set" which isn't necessarily
>your *document's* character set...suddenly my brain hurts).
>
>So stick with UTF-8 and valid entities &#8226; or &#x2022;.
>
>
>Thomas Dowling
>OhioLINK - Ohio Library and Information Network
>tdowling at ohiolink.edu
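
The distinction drawn in the quoted message, between the byte 149 in a Windows-1252 file and the number 149 in a numeric character reference, can be sketched with Python's built-in codecs and the unicodedata module. This is only an illustration; the variable names are made up for the example.

```python
import unicodedata

# The raw byte 0x95 (decimal 149) as it would sit in a Windows-1252 file.
raw_byte = bytes([149])

# Decoded under the Windows-1252 charset, that byte is the bullet character.
decoded = raw_byte.decode("cp1252")

# A numeric character reference ignores the charset: 149 is taken as a
# code point in the document character set (Unicode), i.e. U+0095.
referenced = chr(149)

print(f"byte 149 in cp1252 -> U+{ord(decoded):04X} {unicodedata.name(decoded)}")
print(f"reference 149      -> U+{ord(referenced):04X} category {unicodedata.category(referenced)}")

# The two forms recommended in the quoted message name the same code point.
assert chr(8226) == chr(0x2022) == decoded
```

The Unicode general category "Cc" marks a control character, which is why the reference with the number 149 does not produce a bullet even though the Windows-1252 byte with the same value does.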
--
Eric Hellman, President Openly Informatics, Inc.
eric at openly.com 2 Broad St., 2nd Floor
tel 1-973-509-7800 fax 1-734-468-6216 Bloomfield, NJ 07003
http://www.openly.com/1cate/ 1 Click Access To Everything
http://my.linkbaton.com/ Links that Learn