[Web4lib] RSS and diacritics

Jonathan Gorman jtgorman at uiuc.edu
Tue Nov 27 22:52:52 EST 2007


>> There's a lot of software and fonts that don't have very complete 
>> character sets.  Arial Unicode so far has the most complete that I know 
>> of.  People using a browser will have to have it set to use a unicode 
>> font to see unicode characters correctly.  On top of that, there's a lot 
>> of software that mishandles combining diacritics (IE 6 is one example, 
>> if I recall correctly) and will never display them correctly.
>> 
>
>There are a few common misconceptions here.
>
>all modern web browsers are Unicode based at the core, older 8-bit 
>legacy encodings are supported by transcoding to Unicode on the fly
>
>This has been the case since Netscape 4 and IE 3/4
>

Well, yes, but if the font they're using doesn't have anything beyond the basic ASCII range, the setup isn't what I would call Unicode compatible ;).  In my defense, I wasn't saying the browsers themselves have issues, just that a lot of other software does.

IE 6, in its default configuration, did for a while have several bugs relating to Unicode.  Heck, there are still a lot of programming languages out there with horrible Unicode support.  I was flabbergasted a few years ago to see how poor the support was in Ruby, for crying out loud.

>All core operating system fonts (Windows and MacOS are Unicode based) 
>even core fonts on Windows 98.
>

Well, I'm not an expert on these things.  My main goal was to advise someone who was having trouble getting characters to show up on a webpage to set, as their default, a font with the widest implemented character set.  I chose the font that came to mind, which is probably out of date ;).  I didn't think the Windows 98 core fonts were very complete once you got outside the ASCII range, but I don't pay much attention to these things.  Neither do the people who set all their default fonts to... *shudders* Comic Sans.

>There are no pan unicode fonts. There are too many characters in unicode 
>to be able to have a single font support them. Fonts have physical limits.
>

I guess I'm confused here.  How does a font have a physical limit?  While it would certainly be a daunting task, I would think it's possible for someone to come up with a font covering all the Unicode characters that are currently spec'd out.  (True, there are a lot of unassigned ones, but there's no point debating that.)

>Arial Unicode MS only supports a very old version of Unicode, and that 
>incompletely. It is useful for characters with diacritics when those 
>characters are precomposed characters. It is not suitable for combining 
>diacritics. It doesn't have the required mark and mkmk OpenType features 
>for the Latin script.
>
>Combining diacritic support on the Windows platform requires:
>
>1) an appropriate font, and
>2) an appropriate font rendering system
>
>For Windows this means:
>
>a) using Windows Vista, or
>b) using Windows XP (Service Pack 2) and installing an appropriate font. 
>There are a small number of fonts available and enabling complex script 
>support.


I guess I'm confused.  So all core fonts since Windows 98 are Unicode fonts, just as long as you don't expect them to do Unicode-ish things like combining diacritics?  And then you need a new OS?  I don't quite get what you're saying.

>
>IE6 will display combining diacritics correctly on Windows XP SP2 (with 
>complex script support enabled) and if you are using an appropriate 
>font, e.g. Doulos SIL, Charis SIL, the Gentium Book beta, 
>African/Aboriginal Sans , African/Aboriginal Serif, Code 2000, and 
>possibly the latest DejaVu fonts, etc..
>

Thanks ;).  I'm planning on poking at some of these fonts.  I've been looking for a replacement for Arial Unicode for a while now.  Sadly, a great many of our patrons and other folks aren't going to be installing extra fonts, so we're stuck trying to choose fonts they might already have from installing something like Word.

>
>> Other issues like bi-directionality are ambiguous and not clear even now.  For example, if you have Korean and English in one document, it's not clear what layer of the software is required to do the work necessary so each can be read in the right direction.
>> 
>
>Korean doesn't require bidi support. I think you are thinking of 
>vertical text layout here, not bidi support.
>

Ya caught me: I was thinking about two different issues and gave a vertical-layout example when I meant a bidi one.  (Although, if I remember correctly, Korean can be written top to bottom, right to left.  Not sure what that problem is called.  Over Under Sideways Down (OUSD?).)  In any case, directionality can be a pain.

>Also in XML a schema or DTD should define mechanisms for handling bidi 
>support or should reference ITS namespace. The RSS schemas/DTD do not. 
>Lack of bidi support in RSS has been a long standing issue.
>

Well, I'd argue it's not clear to me that it's necessarily the schema that should be defining the mechanisms.  After all, that's a lower-level interaction I'd like to see remain constant despite changes in the schema, or in the absence of one.  My editor should know how to handle it regardless of the XML vocabulary, or whether I'm editing XML at all.  But I'm not really an expert in these things, just trying to give what practical advice I can.  Right now, in practice, this seems to be an issue whether there is a schema or not.
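(As a purely illustrative aside, and not anything the RSS specs endorse: one schema-independent trick is to let the text itself carry its directionality using Unicode's directional formatting characters.  A minimal sketch in Python, with a made-up feed title and a bit of Hebrew, would look something like this:

    # -*- coding: utf-8 -*-
    # Sketch only: embed Unicode directional formatting characters so a
    # right-to-left run keeps its reading order even when the surrounding
    # markup (RSS, in this case) offers no dir attribute.
    RLE = u"\u202B"   # RIGHT-TO-LEFT EMBEDDING
    PDF = u"\u202C"   # POP DIRECTIONAL FORMATTING

    def embed_rtl(text):
        """Wrap a right-to-left run so it survives an LTR context."""
        return RLE + text + PDF

    # Hypothetical feed title mixing English with a bit of Hebrew:
    title = u"New arrivals: " + embed_rtl(u"\u05E1\u05E4\u05E8\u05D9\u05DD")

The reading order then survives whichever layer, schema, or editor touches the string, at the cost of sprinkling invisible characters through your data.)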


Thanks for some of the tips, but I'm not really sure that what I had were misconceptions.  I may have made some generalizations, but that's because these issues really are too complex to address fully in this forum.  (I also admit to being somewhat sloppier in my phrasing when trying to help someone, as opposed to just musing on a concept.)

In practice, when advising people on various Unicode issues, I've found myself giving the following advice:

1) Be aware that every layer of software can have its own issues or configuration needs with Unicode.  Make sure you're passing along the encoding you intend, and that some over-zealous piece of software isn't attempting to map what it thinks is MARC-8 to some ancient Swahili character set.  (There's a small sketch of this after the list.)

2) Make sure you can actually view the file you're looking at with the font you have.  A depressing number of people have said "there's something wrong with this file" when the reality was "my font can't display this character, so it's showing a cute little box".

3) Try to avoid combining diacritics where you can; normalizing to precomposed forms helps (see the sketch after this list).

4) Software lags several years behind changes to the Unicode standard, probably because many people are still trying to understand the old versions ;).  See rule 3.

5) There are a lot of issues that just don't seem clear.  Where should bidi issues be addressed?  Is fancy bred in the heart or in the head?
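For points 1 and 3, here's a minimal sketch of what I mean, assuming Python and its standard unicodedata module (the file name and the choice of UTF-8 are just made-up examples):

    # -*- coding: utf-8 -*-
    import unicodedata

    # Point 1: decode the bytes yourself with the encoding you intend,
    # rather than letting some over-zealous layer guess.
    raw = open("records.txt", "rb").read()
    text = raw.decode("utf-8")          # fails loudly if it isn't UTF-8

    # Point 3: normalize to NFC, which folds base letter + combining mark
    # into a single precomposed character whenever Unicode defines one,
    # e.g. u"e" + u"\u0301" (combining acute) becomes u"\u00e9".
    text = unicodedata.normalize("NFC", text)

Normalizing to NFC won't get rid of every combining mark, since plenty of combinations have no precomposed form, but it avoids them wherever Unicode provides one.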

And on that note, I've talked too long.  

Jon

