[WEB4LIB] Cleanup Word docs converted to HTML

Thomas Dowling tdowling at ohiolink.edu
Thu Jun 20 10:17:24 EDT 2002

At 09:52 AM 6/20/2002, Dan Lester wrote:
>Many of us are familiar with the vast amounts of style and other
>information that is buried in HTML pages created by "Save as HTML"
>from Word 2000 and related products.
>In the past we've used an MS download called something like
>msofhtml.exe that does a pretty decent job of cleaning out the
>formatting and styles from Word 2000.  However, it appears that
>program won't work under WinXP or with OfficeXP software.  It seems to
>require that Office 2000 be installed on the machine.
>I've searched the MS site and haven't found a new version.  If anyone
>could offer a pointer to a new version of the MS cleanup tool, or to
>some other tool that will do a similar job, I'd appreciate it.
>Meanwhile, I continue to encourage our authors to create pages in
>FrontPage and not in Word, and to clean up the converted pages.

HTML Tidy does a decent job of scraping the cruft off Microsorta HTML - 
<http://tidy.sourceforge.net/>.  Tidy is also built into a couple of HTML 
editors, HTML-Kit being the best known (at least by me).  Note especially 
the "clean" option:

Type: Boolean
Default: no
Example: y/n, yes/no, t/f, true/false, 1/0

This option specifies if Tidy should strip out surplus presentational tags 
and attributes replacing them by style rules and structural markup as 
appropriate. It works well on the HTML saved by Microsoft Office products.
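
For anyone who wants to script this, here is a minimal sketch of turning on 
the "clean" option from Python. It assumes a command-line tidy is installed 
and on the PATH, and that it accepts config options in the "--option value" 
form. The function names and file names are made up for illustration and are 
not part of Tidy itself.

```python
import shutil
import subprocess
from pathlib import Path


def build_tidy_command(source):
    """Return a tidy command line with the "clean" option turned on.

    Hypothetical helper: the name is ours, not Tidy's.
    """
    # "--clean yes" sets the Boolean option described above. Tidy reads the
    # file named last and writes the cleaned markup to standard output.
    return ["tidy", "--clean", "yes", str(source)]


def clean_word_html(source, target):
    """Run tidy on source and save the cleaned markup to target.

    Returns False if tidy is not installed or reports errors.
    """
    # Assumption: the executable is called "tidy" and is on the PATH.
    if shutil.which("tidy") is None:
        return False
    result = subprocess.run(build_tidy_command(source), capture_output=True)
    # Tidy exits with 0 (no problems), 1 (warnings) or 2 (errors).
    if result.returncode > 1:
        return False
    Path(target).write_bytes(result.stdout)
    return True
```

Something like clean_word_html("report.html", "report-clean.html") would then 
leave the Word original untouched and write the cleaned copy beside it. 
Capturing standard output rather than editing in place is just a cautious 
choice, so a bad run never overwrites the source file.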

Thomas Dowling
OhioLINK - Ohio Library and Information Network
tdowling at ohiolink.edu
