[WEB4LIB] Cleanup Word docs converted to HTML

Rich Kulawiec rsk at magpage.com
Thu Jun 20 17:27:29 EDT 2002


On Thu, Jun 20, 2002 at 06:50:06AM -0700, Dan Lester wrote:
> I've searched the MS site and haven't found a new version.  If anyone
> could offer a pointer to a new version of the MS cleanup tool, or to
> some other tool that will do a similar job, I'd appreciate it.

Several people have already mentioned "tidy", so I'll omit that and
move on to my other favorite tool for dealing with badly broken
HTML (such as that generated by FrontPage): the demoroniser.

The demoroniser's man page synopsis is "correct moronic and gratuitously
incompatible HTML generated by Microsoft applications"; for full details:

	http://www.fourmilab.ch/webtools/demoroniser/

Also worth reading is this page:

	http://www.perl.com/language/misc/ms-ascii.html

which contains links to still more resources on the subject.
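
One part of the problem these tools address, as I understand it, is
the Windows-1252 punctuation (curly quotes, dashes, ellipses) that
Microsoft applications emit.  A rough Python sketch of that single step
follows.  The mapping table and the function name plain_punctuation are
my own illustrative choices, not taken from the demoroniser, which does
considerably more than this.

```python
# Illustrative mapping only (an assumption): NOT the demoroniser's own table.
CP1252_TO_ASCII = {
    "\u2018": "'",    # left single quote  (byte 0x91)
    "\u2019": "'",    # right single quote (byte 0x92)
    "\u201c": '"',    # left double quote  (byte 0x93)
    "\u201d": '"',    # right double quote (byte 0x94)
    "\u2013": "-",    # en dash            (byte 0x96)
    "\u2014": "--",   # em dash            (byte 0x97)
    "\u2026": "...",  # ellipsis           (byte 0x85)
    "\u2022": "*",    # bullet             (byte 0x95)
}


def plain_punctuation(raw: bytes) -> str:
    """Decode bytes as Windows-1252 and swap "smart" punctuation for ASCII."""
    # Bytes that Windows-1252 leaves undefined become U+FFFD instead of raising.
    text = raw.decode("cp1252", errors="replace")
    return text.translate(str.maketrans(CP1252_TO_ASCII))


if __name__ == "__main__":
    print(plain_punctuation(b"\x93Hi\x94 \x96 it\x92s\x85"))
```

Anything outside the table passes through unchanged, so the table can
be extended as new offenders turn up.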

Sometimes, though, the HTML is so far beyond repair that the best
solution I can apply is to render it in a browser, save it as text,
and then mark it up from scratch (by hand, using vi).  This is somewhat
tedious but often worth it: I recently reduced a ~200K page to ~20K and
eliminated several hundred errors (as reported by tidy) in the process.
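
The "save it as text" step can also be approximated without a browser.
Here is a minimal sketch using Python's standard html.parser module.
The names TextOnly and html_to_text are hypothetical, and this is only
a stand-in for what a browser does: it keeps the character data, drops
every tag, and collapses all whitespace, so paragraph breaks have to be
restored by hand along with the rest of the markup.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data, ignoring tags and script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    sample = '<p class=MsoNormal><span style="x">Hello</span> <b>world</b></p>'
    text = html_to_text(sample)
    # Compare sizes to see how much of the page was markup.
    print(len(sample), len(text), text)
```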

---Rsk
Rich Kulawiec
rsk at magpage.com



