[Web4lib] RSS and diacritics

Andrew Cunningham andrewc at vicnet.net.au
Wed Nov 28 02:19:17 EST 2007



Jonathan Gorman wrote:

> Well, yes, but if the font they're using doesn't have anything beyond the basic ascii mapping, they're not what I would call unicode compatible ;).  In my defense  I wasn't saying browsers have issues, but there's a lot of software that does.    

Core Windows fonts were designed to support WGL4, which was seen as a 
necessary subset of Unicode to meet the needs of European languages. 
Documentation should be available on the Microsoft Typography site.

The reality is that fonts are developed for specific sets of languages 
and scripts. From the standpoint of typographic design it isn't desirable 
to mix scripts: it is a fine art to design glyphs for one script that 
harmonise with another script without distorting either.

I'd draw a distinction between:

1) software that is not internationalised, or that the developers made a 
mess of;
2) software based on the Windows 95 internationalization model (i.e. 
remapping Unicode to Windows code pages), although Microsoft itself moved 
away from this model with its web browsing technology. Only languages 
directly supported by code pages are supported by this model;
3) software based on the Windows 2000 internationalization model, which is 
Unicode at the core and maps Unicode to code pages as needed.

As you indicate, there is a lot of badly written code out there. But the 
approaches to handling Unicode have been around for many years. We've 
gone through various iterations of operating systems and applications 
that are Unicode based. And there are only a limited number of languages 
or scripts that are problematic or difficult now.

Personally, I'm waiting for Mon, S'gaw Karen, Cham and Viet-Tai support, 
and I am currently testing a Mon and S'gaw Karen Unicode 5.1 beta solution.

Most languages are straightforward and easy these days, especially on the web.

> IE 6 for a while in a default configuration did have several bugs relating to unicode.  Heck, there's till a lot of programming languages out there that have horrible unicode support.  I was flabbergasted a few years ago to see how poor the support was in Ruby, for crying out loud.

PHP 4 and 5 are even worse.

And I will not discuss the warped Perl character model.

> 
> Well, I'm not an expert on these things.  My main goal was to advise someone having trouble with seeing characters appear on the webpage to use by default a font that had the widest implemented character set.  I chose the font that came to mind, which is probably out of date ;).  I didn't think that Windows 98 core fonts, once you got out of the ascii range, were very compatible, but I don't pay much attention to these things.  Neither do the people who set all their default fonts too...*shudders* comic sans.
> 

The approach we take is the opposite: we follow W3C internationalisation 
best practice, tagging the language and any language changes (also 
required for WCAG 1.0 compliance). We also use language-specific styling, 
so that different languages use the most appropriate fonts for that 
language/script.

You are also forgetting the font linking technologies built into the web 
browser and the font rendering system. The only time there are problems 
is when the page or site style sheet specifies the wrong fonts, or the 
fonts don't support the necessary languages.

One of the reasons we use Firefox when using Hotmail, Yahoo or Gmail is 
that, with Stylish, we can override the fonts used on those sites and 
selectively use more appropriate fonts to display certain languages, 
especially African languages.

Also, font coverage on each version of Windows is different, because each 
version of Windows supports a different range of languages. Vista has 
Khmer and Lao fonts by default, but you will not find any shipped on 
older versions of Windows.

It comes down to the web developers and programmers doing a good web 
internationalisation job.

>> There are no pan unicode fonts. There are too many characters in unicode 
>> to be able to have a single font support them. Fonts have physical limits.
>>
> 
> I guess I'm confused here.  How does a font have a physical limit?  While certainly a daunting task, I would think it's certainly possible someone could come up with a font that has all the unicode characters that are currently speced out.  (True, there's a lot of unassigned ones, but what's the point of debating about that). 

There is a limit to the number of glyphs that can be contained in a 
TrueType font: 65,536. So to support all existing CJK ideographs, you'd 
need at least two fonts.

Even a script like Devanagari, which has a limited number of characters, 
requires thousands of additional glyphs to support the necessary 
ligatures and conjuncts.

An Urdu Nastaliq OpenType font could max out the available glyphs and 
processing instructions for just one language.

The current trend on Windows is to make script-specific fonts, which may 
not even include the basic Latin characters, and different UI fonts for 
different scripts.

>> Arial Unicode MS only supports a very old version of Unicode, and that 
>> incompletely. It is useful for characters with diacritics when those 
>> characters are precomposed characters. It is not suitable for combining 
>> diacritics. It doesn't have the required mark and mkmk OpenType features 
>> for the Latin script.
>>
>> Combining diacritic support on the Windows platform requires:
>>
>> 1) an appropriate font, and
>> 2) an appropriate font rendering system
>>
>> For Windows this means:
>>
>> a) using Windows Vista, or
>> b) using Windows XP (Service Pack 2) and installing an appropriate font. 
>> There are a small number of fonts available and enabling complex script 
>> support.
> 
> 
> I guess I'm confused.  So all core fonts since Windows 98 are unicode fonts, just as long as you don't expect them to do unicodish things like combining diacritics?  Then you need a new OS?  I don't quite get what you're saying.  
> 

Combining diacritics need not just a font, but also a font rendering 
system that knows how to use the information in the font. On Windows 
this is Uniscribe. Different versions of Windows ship different versions 
of Uniscribe, and over time Microsoft adds more support. The first 
versions of Uniscribe to ship with combining diacritic support for the 
Latin and Cyrillic scripts were the versions in Office 2003 (its local 
copy, not the system copy) and Windows XP Service Pack 2. But no suitable 
fonts were shipped; you had to use third-party fonts, and for many 
languages this is still necessary.

You also have to enable complex script support on Windows XP, since it 
doesn't use Uniscribe by default unless you've enabled RTL and complex 
script support.

I.e. the Latin script needs to be treated as a complex script rather 
than as a non-complex script.

Vista was the first version of Windows to ship with appropriate fonts: 
currently the new versions of the old core fonts, plus the new UI font.

The only combining diacritics that work on older versions of Windows are 
those that belong to the repertoire Microsoft uses for Vietnamese 
support, and they will only work with fonts that have Windows-1258 
support built in. But these use GSUB rather than GPOS tables in the 
fonts, if memory serves me correctly.
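To illustrate what the renderer has to cope with, here is a small sketch using Python's standard unicodedata module: a precomposed character is a single code point, while the canonically equivalent base-plus-combining-mark sequence is two, and the mark only appears in the right place if the font and rendering system (e.g. Uniscribe using GPOS mark features) can position it over the base.

```python
import unicodedata

precomposed = "\u00E9"   # single code point: LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # base letter "e" plus COMBINING ACUTE ACCENT

# Same abstract character, different code point sequences:
print(len(precomposed), len(combining))   # 1 2

# U+0301 has a non-zero canonical combining class, marking it as a
# combining mark the renderer must position relative to the base character.
print(unicodedata.combining("\u0301"))    # 230
```

A font without mark positioning data will typically draw the accent detached from, or overlapping, the base letter.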



>> IE6 will display combining diacrtics correctly on Windows XP SP2 (with 
>> complex script support enabled) and if you are using an appropriate 
>> font, e.g. Doulos SIL, Charis SIL, the Gentium Book beta, 
>> African/Aboriginal Sans , African/Aboriginal Serif, Code 2000, and 
>> possibly the latest DejaVu fonts, etc..
>>
> 
> Thanks ;).  I'm planning on poking at some of these fonts.  I've been looking for a replacement for Arial Unicode for a while now.  Sadly, a great deal of our patrons and our other folks aren't going to be installing other fonts, so we're stuck with trying to choose fonts they might have from installing something like Word.  

The best choice is to use fonts that ship with international English 
versions of Windows, and to avoid the one-font-fits-all approach.


> In practice when advising some people on various unicode issues I've found myself giving the following advice:
> 
> 1) Be aware all layers of software can be prone to having different issues or configuration needs with unicode.  Make sure you're passing along the encoding you intend and some over-zealous piece of software isn't attempting to map what it thinks is MARC-8 to some ancient Swahili character set.

I'd add that you also need to be very specific about which parts of 
Unicode you need. I find that most vendors claim to support Unicode, and 
they do, but usually only a very small subset. Unicode doesn't require 
supporting everything, so it's important to know what you need and 
specify it.
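One hypothetical way to turn that into a concrete specification is to audit the code points that actually occur in your data and hand the vendor the resulting list. A sketch using only Python's stdlib (the sample string is invented):

```python
import unicodedata

# Pretend this is text pulled from your catalogue records.
sample = "Vi\u1EC7t"

# Collect every non-ASCII code point the data actually uses.
needed = sorted({ord(ch) for ch in sample if ord(ch) > 0x7F})
for cp in needed:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), 'UNKNOWN')}")
# U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
```

Run over a real data set, the output is exactly the repertoire you can put in a requirements document.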

> 2) Make sure you can actually view the file you're looking at with the font you have.  Depressing number of people have said "There's something wrong with this file" when reality was "My font can't display this character, so it's showing this cute little box".

yep

> 3) Try to avoid combining diacritics. 
> 

There is nothing wrong with combining diacritics, but for most of the 
languages libraries need they aren't necessary, although there are lots 
of languages where there is no choice: the language requires combining 
diacritics.

The main problem is that the W3C has always recommended that Unicode web 
pages use Unicode Normalization Form C (NFC), but vendors don't bother 
normalising data before display. If they used NFC, you wouldn't need to 
worry about most combining diacritics at the web end.
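As a minimal sketch of what that normalisation step looks like (Python's stdlib unicodedata; the sample string is invented):

```python
import unicodedata

# Data arriving as base letter + combining mark, as it often does from
# cataloguing tools.
raw = "Mu\u0308ller"            # "u" followed by COMBINING DIAERESIS

# NFC folds combining sequences into precomposed characters where they
# exist, so most text no longer needs combining-mark rendering support.
nfc = unicodedata.normalize("NFC", raw)
print(nfc == "M\u00FCller")     # True: the umlaut is now one code point, U+00FC
```

After NFC, only marks with no precomposed form are left for the font and renderer to handle.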

It is still a problem at the cataloguing stage, but if cataloguing tools 
are based on the Windows 2000/XP internationalization model, the clients 
will work well.

> 4) Software lags several years behind changes to the unicode standards, probably because many people are still trying to understand the old ones ;).  See rule 3.

Maybe that's why I'm leaning more towards Linux/GNOME with a 
Graphite-enabled version of Pango. It does my heart good to see Myanmar 
displayed the way it should be.

> 5) There's a lot of issues that don't seem clear.  Where should bi-di issues be addressed?  Is fancy bred in the heart or in the head?

No, that would take way too much time to dissect.

-- 
Andrew Cunningham
Research and Development Coordinator (Vicnet)
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Email: andrewc at vicnet.net.au
Alt. email: lang.support at gmail.com

Ph: +613-8664-7430                    Fax:+613-9639-2175
Mob: 0421-450-816

http://www.slv.vic.gov.au/            http://www.vicnet.net.au/
http://www.openroad.net.au/           http://www.mylanguage.gov.au/
http://home.vicnet.net.au/~andrewc/


More information about the Web4lib mailing list