[Web4lib] word/html converter
    Hankinson, Andrew 
    HankiA at parl.gc.ca
       
    Mon Jul 10 15:46:04 EDT 2006
    
    
  
PS: I should also mention that this process works MUCH better if your
Word file is properly styled to begin with.  This means using Word's
Heading styles for headings (instead of simply changing the font size),
marking up code / preformatted text, using actual lists instead of
indented bulleted paragraphs, that sort of thing.  Word preserves these
as tags in the exported HTML.
If your Word file is not properly styled, spend some time fixing that
before exporting it - it will save you time in the long run.
-----Original Message-----
From: web4lib-bounces at webjunction.org
[mailto:web4lib-bounces at webjunction.org] On Behalf Of Hankinson, Andrew
Sent: July 10, 2006 3:35 PM
To: web4lib at webjunction.org
Subject: RE: [Web4lib] word/html converter
I have a three-step process for doing this:
1.) Save your Word file as HTML (Filtered) from Word.  It will get rid
of the bulk of the junk.
2.) Get HTMLTrim: http://dev.int64.org/htmltrim.html
Here's the settings I use (your requirements may vary...) (found under
'options'):
 Markup 1: 
 - CHECK Output as XHTML
 - CHECK enclose text in body with paragraphs
 - SELECT Document type: Strict
 
Markup 2:
 - CHECK Output " as &quot;
 - CHECK Use numeric entities
 - CHECK Source document is from Microsoft Word
 - CHECK Output unadorned & characters as &amp;
 - CHECK Output non-breaking spaces as entities
Encoding:
 - (OPTIONAL-if you just want to output the text, and not a complete
HTML file) CHECK Output Body Only
Layout:
 - No changes
Cleanup:
 - CHECK Replace I with em... 
 - CHECK Replace presentational tags and attributes
 - CHECK Remove Proprietary Tags
 - CHECK Remove Proprietary Attributes
 - CHECK Discard font and center tags
Hit OK, add your file(s) and then click "Tidy."  It will write the
cleaned-up version over your original file, but will save the unclean
version as a backup file.  This will get rid of even more junk.
3.) Use Dreamweaver or a text editor to find and replace all the empty
<p></p> tags as well as any other tags the first two steps might miss.
I usually take out all the table attributes (border, cellspacing, etc.)
and the list type attributes (type="disc").
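If you would rather script step 3 than do it by hand, here is a rough Python sketch of the same find-and-replace. It is not part of the original process: the function name clean_html and the list of table attributes are my own assumptions, and treating paragraphs that hold only whitespace or a non-breaking space as "empty" is an extra guess about what Word leaves behind. Regular expressions are a blunt tool for HTML, so check the result by eye.

```python
import re

# One attribute value: double-quoted, single-quoted, or unquoted.
_VALUE = r'(?:"[^"]*"|\'[^\']*\'|[^\s>]+)'

# Paragraphs with nothing in them (assumption: whitespace or a
# non-breaking space entity also counts as "nothing").
EMPTY_P = re.compile(r"<p\b[^>]*>(?:\s|&nbsp;|&#160;)*</p>\s*", re.IGNORECASE)

# Opening tags of table and list elements.
TABLE_TAG = re.compile(r"<(?:table|tr|td|th)\b[^>]*>", re.IGNORECASE)
LIST_TAG = re.compile(r"<(?:ul|ol|li)\b[^>]*>", re.IGNORECASE)

# Presentational attributes to strip (assumed list; extend as needed).
TABLE_ATTRS = re.compile(
    r"\s+(?:border|cellspacing|cellpadding|width|height|align|valign|bgcolor)"
    r"\s*=\s*" + _VALUE,
    re.IGNORECASE,
)
TYPE_ATTR = re.compile(r"\s+type\s*=\s*" + _VALUE, re.IGNORECASE)


def clean_html(html):
    """Drop empty paragraphs and presentational table/list attributes."""
    html = EMPTY_P.sub("", html)
    # Only touch attributes inside table-related opening tags.
    html = TABLE_TAG.sub(lambda m: TABLE_ATTRS.sub("", m.group(0)), html)
    # Only touch type="..." inside list-related opening tags.
    html = LIST_TAG.sub(lambda m: TYPE_ATTR.sub("", m.group(0)), html)
    return html
```

To use it, read the file that came out of step 2, pass the text through clean_html, and write it back out (keep a backup first).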
This takes you a long way towards getting nice, clean HTML with properly
encoded characters from Word without having to go through it
line-by-line.
Cheers,
Andrew
-----Original Message-----
From: web4lib-bounces at webjunction.org
[mailto:web4lib-bounces at webjunction.org] On Behalf Of Sarah Smith
Sent: July 10, 2006 3:18 PM
To: web4lib at webjunction.org
Subject: [Web4lib] word/html converter
We have a word doc with text boxes galore that make up our newsletter.
We are able to save the Word doc as html up to about 2 pages worth. Past
two pages, things start to get jumbled up. I've looked into MS Publisher
and Adobe Acrobat and neither of those seems to be a viable option. Deal
is we need a print version and an html version. The person creating it
knows Word but not html, so doing an html version first is not an option
at this point. We are willing to pay money for a miraculous converter
and I have been assured "other libraries do it." Anyone have any tips?
TIA,
 
Sarah Smith
Systems Supervisor, ssmith at saclibrary.org  
Sacramento Public Library <http://www.saclibrary.org> 
828 I Street, Sacramento, CA 95814
Phone (916) 264-2892; Fax (916) 264-2959 
 
_______________________________________________
Web4lib mailing list
Web4lib at webjunction.org
http://lists.webjunction.org/web4lib/