[WEB4LIB] non-SGML characters
Thomas Dowling
tdowling at ohiolink.edu
Thu Jan 31 14:56:51 EST 2002
At 02:02 PM 1/31/2002, bob at esrl.lib.md.us wrote:
>Hello.
>
>I've been using the &#149; character to separate our address information
>at the bottom of our pages. But now, as I'm moving to XHTML 1.0, I'm
>finding that when I validate these characters are returned with the error
>"reference to non-SGML character".
>
>Here's my line to define content-type:
>
><meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
>
>Is there a different charset I should be using that would allow these
>characters to validate properly?
I think you can solve the academic validation problem with
"charset=Windows-1252". However, that won't solve the fact that there are
browsers that will either not display any character there, or display some
other character. I don't know of any browsers that actually change their
character handling based on the charset in the content type. There may be
*indexers* - but they'd want an actual HTTP header, not a meta tag (if I'm
wrong, someone stop me before I make a fool of myself).
You might have more success rewriting them as legal characters in
UTF-8. The bullet character (#149) = &bull; = &#8226; (&#8226; will have
wider support.)
>I could just take them out. They aren't critical. But I'd like to know
>why they don't work with XHTML.
Characters 128 through 159 have never been assigned printable characters
in the ISO Latin-1 character sets or in Unicode (both reserve that range
for control codes), and Unicode is what the UTF-8 you're sending to the
validator encodes.
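A quick way to check this for yourself, as a small sketch assuming Python 3 with its standard unicodedata module and built-in cp1252 codec (the variable names are only for illustration):

```python
import unicodedata

# Code points 128-159 fall in Unicode's C1 control range, so none of them
# is a printable character. Collect the general category of each one.
c1_categories = {unicodedata.category(chr(cp)) for cp in range(128, 160)}
print(c1_categories)  # {'Cc'}

# The same number read as a Windows-1252 byte is the bullet, whose real
# Unicode position is U+2022, i.e. decimal 8226.
bullet = bytes([149]).decode("cp1252")
print(ord(bullet))  # 8226
```

So a reference to #149 only looks like a bullet on systems that quietly treat it as Windows-1252.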
Other common characters whose Windows-1252 position is undefined in ISO
Latin or Unicode:
Ellipsis (#133) = &hellip; = &#8230;
En Dash (#150) = &ndash; = &#8211;
Em Dash (#151) = &mdash; = &#8212;
Curly single quotes (#145 and #146) = &lsquo; and &rsquo; = &#8216; and &#8217;
Curly double quotes (#147 and #148) = &ldquo; and &rdquo; = &#8220; and &#8221;
[Unregistered] Trademark (#153) = &trade; = &#8482;
Euro (#128) = &euro; = &#8364;
NS4 will only understand the numeric entities. ☹
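If you have a lot of pages to clean up, the rewriting can be scripted. The following is only a rough sketch, assuming Python 3 and its built-in cp1252 codec. The function name fix_cp1252_refs is made up for this example, and it handles only decimal references written without leading zeros:

```python
import re

# Matches decimal numeric references in the range 128-159, e.g. "&#149;".
_C1_REF = re.compile(r"&#(1(?:2[89]|[345][0-9]));")

def fix_cp1252_refs(html: str) -> str:
    """Rewrite &#128;..&#159; as the Unicode reference for the
    character that Windows-1252 puts at that position."""
    def repl(match):
        code = int(match.group(1))
        try:
            char = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few positions (129, 141, 143, 144, 157) have no
            # Windows-1252 character; leave those references alone.
            return match.group(0)
        return "&#%d;" % ord(char)
    return _C1_REF.sub(repl, html)

print(fix_cp1252_refs("Main St. &#149; Salisbury &#151; MD"))
# Main St. &#8226; Salisbury &#8212; MD
```

It emits numeric references rather than named ones, since those are the form with the widest browser support.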
Thomas Dowling
OhioLINK - Ohio Library and Information Network
tdowling at ohiolink.edu