[Web4lib] white space in web pages

jqj at darkwing.uoregon.edu
Mon Jul 24 15:57:33 EDT 2006


Keith D. Engwall writes:

>For those who don't know (everyone probably does, but just in 
>case), ASCII vs. Binary in transfers only matters when you are 
>transferring between Windows and UNIX.

Beware also that "ascii" != "text".  Your text file may be ISO-8859-1,
UTF-8, or UTF-16.  You should only count on an ascii transfer preserving
characters in the ascii subset, so it could corrupt such files even
though they look like ascii.  As a practical matter, ascii transfers
typically do preserve nulls and all 8 bits of each byte, but that's not
what the spec promises.

Bottom line is, as Keith notes, "It is never ok to transfer binary files [or
text files in non-ascii character sets] in ASCII."
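
To make the risk concrete, here is a rough Python sketch of two things an
ascii-mode transfer might do to the bytes.  Both helper functions (to_crlf
and strip_high_bit) are my own illustrative assumptions, not the behaviour
of any particular FTP client or server: one rewrites line endings byte by
byte, the other models a strictly 7-bit channel.

```python
def to_crlf(data: bytes) -> bytes:
    """Hypothetical ascii-mode conversion: rewrite every LF byte as CR LF."""
    # Normalise existing CRLF first so it is not doubled into CR CR LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def strip_high_bit(data: bytes) -> bytes:
    """Hypothetical 7-bit channel: clear the top bit of every byte."""
    return bytes(b & 0x7F for b in data)


text = "caf\u00e9\nna\u00efve\n"

# UTF-8 never uses the byte 0x0A inside a multi-byte character, so the
# line-ending rewrite leaves the characters themselves intact.
utf8_result = to_crlf(text.encode("utf-8")).decode("utf-8")

# In UTF-16 (little-endian) LF is the byte pair 0A 00.  Inserting a lone 0D
# byte shifts every later byte by one, so the two-byte units stop lining up.
utf16_result = to_crlf(text.encode("utf-16-le")).decode(
    "utf-16-le", errors="replace"
)

# ISO-8859-1 stores e-acute as the single byte E9, which has the top bit set.
latin1_result = strip_high_bit("caf\u00e9".encode("iso-8859-1")).decode("ascii")

# ascii() escapes non-ASCII characters, so the output is safe on any console.
print(ascii(utf8_result))
print(ascii(utf16_result))
print(ascii(latin1_result))
```

In this sketch the UTF-8 text survives the line-ending rewrite, the UTF-16
text comes out scrambled after the first line break, and the ISO-8859-1
text silently turns into different ascii characters.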

Robust code for processing text files on systems that use Unix/Linux,
classic MacOS, or Windows file storage conventions (stream of bytes, with
lines terminated by LF, CR, or CRLF respectively) should always be written
to handle any of these end of line conventions, including varying
end of line conventions within a single file.  It should also handle a
variety of encodings, including UTF-16.
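
As a minimal sketch of what that could look like in Python: the function
names below (decode_text, split_lines, read_lines) are hypothetical, and
the encoding guess is a simplifying assumption of mine (trust a byte order
mark if present, otherwise try UTF-8, otherwise fall back to ISO-8859-1).
Real code may need a smarter detection strategy.

```python
import codecs
import re

# UTF-32 marks are checked before UTF-16 because the UTF-32 LE mark
# (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# CRLF must come first so it is treated as one terminator, not two.
_EOL = re.compile(r"\r\n|\r|\n")


def decode_text(data: bytes) -> str:
    """Decode bytes using a BOM if present, else UTF-8, else ISO-8859-1."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # These codecs read the BOM, pick the byte order, and strip it.
            return data.decode(encoding)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: ISO-8859-1 as a last resort; it accepts any byte value.
        return data.decode("iso-8859-1")


def split_lines(text: str) -> list:
    """Split on LF, CR, or CRLF, even when they are mixed in one file."""
    lines = _EOL.split(text)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing terminator does not start a new line
    return lines


def read_lines(data: bytes) -> list:
    """Decode first, then split, so UTF-16 line breaks are seen correctly."""
    return split_lines(decode_text(data))


mixed = "one\r\ntwo\rthree\nfour\n"
print(read_lines(mixed.encode("utf-16")))
print(read_lines(mixed.encode("utf-8")))
```

The order matters: decoding before splitting means the line terminators
are found as characters rather than as raw bytes, which is what keeps
UTF-16 input from being cut in the middle of a two-byte unit.  Note that a
UTF-16 file whose first character after the BOM is a null would be
misread as UTF-32 by this simple check.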

By the way, back in the old days some operating systems used other
approaches for encoding text files, e.g. each line a character count
followed by a sequence of characters, with no trailing CR or LF at all, or
even each line a null-terminated string (shades of C).  Or even (shudder)
ebcdic.

JQ Johnson, Director                 Office: 115F Knight Library 
Center for Educational Technologies  mailto:jqj at uoregon.edu
1299 University of Oregon            phone: 1-541-346-1746; -3485 fax
Eugene, OR 97403-1299                http://darkwing.uoregon.edu/~jqj/


