[Web4lib] word/html converter

Hankinson, Andrew HankiA at parl.gc.ca
Mon Jul 10 15:35:20 EDT 2006


I have a three-step process for doing this:

1.) Save your Word file as HTML (Filtered) from Word.  It will get rid
of the bulk of the junk.

2.) Get HTMLTrim: http://dev.int64.org/htmltrim.html
Here are the settings I use, found under 'options' (your requirements
may vary):
Markup 1:
 - CHECK Output as XHTML
 - CHECK enclose text in body with paragraphs
 - SELECT Document type: Strict
 
Markup 2:
 - CHECK Output " as &quot;
 - CHECK Use numeric entities
 - CHECK Source document is from Microsoft Word
 - CHECK Output unadorned & characters as &amp;
 - CHECK Output non-breaking spaces as entities

Encoding:
 - (OPTIONAL, if you just want to output the text and not a complete
HTML file) CHECK Output Body Only

Layout:
 - No changes

Cleanup:
 - CHECK Replace I with em... 
 - CHECK Replace presentational tags and attributes
 - CHECK Remove Proprietary Tags
 - CHECK Remove Proprietary Attributes
 - CHECK Discard font and center tags

Hit OK, add your file(s) and then click "Tidy."  It will overwrite your
original file with the cleaned-up version, but will save the unclean
version as a backup file.  This will get rid of even more junk.
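
To make the Markup 2 settings a bit more concrete, here is a rough
Python sketch of what those entity options amount to when applied to
plain text. This is not HTMLTrim's own code: the function name
encode_text and the regular expression are assumptions made for
illustration, and the function is meant for text content only, not for
markup (it would also escape the quotes around attribute values).

```python
import re

# An "unadorned" ampersand: one that does not already start an entity
# such as &amp; or &#160; or &#xA0;
BARE_AMP = re.compile(r"&(?!(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)")


def encode_text(text: str) -> str:
    """Escape bare ampersands and quotes, then write every non-ASCII
    character (including non-breaking spaces) as a numeric entity."""
    text = BARE_AMP.sub("&amp;", text)
    text = text.replace('"', "&quot;")
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


if __name__ == "__main__":
    print(encode_text('AT&T said "hi"\u00a0caf\u00e9'))
```

Existing entities are left alone, so running it twice on the same text
does not double-escape anything.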

3.) Use Dreamweaver or a text editor to find and replace all the empty
<p></p> tags, as well as any other tags the first two steps might miss.
I usually take out all the table attributes (border, cellspacing, etc.),
the list styles (type="disc"), and so on.
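
If you have more than a handful of files, step 3 can also be scripted.
What follows is a minimal Python sketch under a few assumptions, not a
replacement for checking the result by eye: the function name clean and
the exact attribute list are hypothetical (cellpadding is my addition;
border, cellspacing and type come from the step above), and regular
expressions like these are only reasonable on the fairly regular markup
that comes out of the first two steps.

```python
import re

# Paragraphs containing nothing but whitespace or non-breaking spaces.
EMPTY_P = re.compile(r"<p\b[^>]*>(?:\s|&nbsp;|&#160;)*</p>\s*", re.IGNORECASE)

# Opening table and list tags whose attributes should be trimmed.
TAG = re.compile(r"<(table|tr|td|th|ul|ol|li)\b([^>]*)>", re.IGNORECASE)

# Presentational attributes to drop (assumed list, adjust to taste).
ATTR = re.compile(
    r'''\s+(?:border|cellspacing|cellpadding|type)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)''',
    re.IGNORECASE,
)


def _strip_attrs(match: re.Match) -> str:
    """Rebuild one opening tag without the unwanted attributes."""
    return f"<{match.group(1)}{ATTR.sub('', match.group(2))}>"


def clean(html: str) -> str:
    """Remove empty paragraphs, then trim table and list attributes."""
    html = EMPTY_P.sub("", html)
    return TAG.sub(_strip_attrs, html)


if __name__ == "__main__":
    sample = '<table border="1" cellspacing="0"><tr><td>A</td></tr></table><p></p>'
    print(clean(sample))
```

Note that this also strips type from numbered lists (for example
<ol type="a">), so drop "type" from the pattern if you rely on that.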

This takes you a long way towards getting nice, clean HTML with properly
encoded characters from Word without having to go through it line by
line.

Cheers,
Andrew

-----Original Message-----
From: web4lib-bounces at webjunction.org
[mailto:web4lib-bounces at webjunction.org] On Behalf Of Sarah Smith
Sent: July 10, 2006 3:18 PM
To: web4lib at webjunction.org
Subject: [Web4lib] word/html converter

We have a word doc with text boxes galore that make up our newsletter.
We are able to save the Word doc as html up to about 2 pages worth. Past
two pages, things start to get jumbled up. I've looked into MS Publisher
and Adobe Acrobat and neither of those seems to be a viable option. Deal
is we need a print version and an html version. The person creating it
knows Word but not html, so doing an html version first is not an option
at this point. We are willing to pay money for a miraculous converter
and I have been assured "other libraries do it." Anyone have any tips?
TIA,

Sarah Smith
Systems Supervisor, ssmith at saclibrary.org
Sacramento Public Library <http://www.saclibrary.org>
828 I Street, Sacramento, CA 95814
Phone (916) 264-2892; Fax (916) 264-2959
