[Web4lib] Copying text from Word to Emacs/vi

R. Wood rw at ncf.ca
Thu Feb 26 21:30:11 EST 2009


Allegedly, on Thu, Feb 26, 2009 at 04:14:07PM -0500, Bob Long stated:
> A lot of my work with our website involves copying text that staff
> have sent to me in a Word document and either adding it to an existing
> page or using it to create a new page.
> 
> I'm a little old fashioned and like to work directly on the server,
> typically using Emacs as my editor. But I've noticed this problem also
> occurs in vi.

Hi,

(It would be interesting to see if it also occurred when using the 'vim'
editor, but this may be but a distraction...)

> The problem is that when I copy the text from the Word document into
> Emacs, certain characters come out as periods rather than the original
> character. The biggest culprit seems to be apostrophes, but others
> include quotation marks, em dashes, and ellipses, to name a few.

Off the cuff I would say that this is either a 'UTF' issue, or a
'non-ascii Word characters' issue.  If the former, then ignore the
advice in this email ;-)  If, however, it is just a problem with the
dozen or so weird characters that Word typically inserts into its
documents, then one strategy would be to find a method of 'sanitizing'
the characters in the Word document to more generic ASCII-like
characters, before pasting to emacs/vi in the *nix environment (probably
a good idea when outputting HTML in any case). 
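
To check which case applies, one rough test is to list the lines of the
saved plaintext file that contain anything outside printable ASCII.  This
is only a sketch: the file name 'check.txt' and its sample contents are
made up for illustration, and it assumes a grep that supports POSIX
character classes.

```shell
#!/bin/sh
# Hypothetical sample: one plain ASCII line, one line with Word-style quotes.
printf '%s\n' 'plain line' 'It’s “quoted”' > check.txt

# LC_ALL=C makes grep work byte by byte, so any byte that is neither
# printable ASCII nor whitespace flags the line.  -n prints line numbers.
LC_ALL=C grep -n '[^[:print:][:space:]]' check.txt
```

If only a handful of lines show up, and they all involve quotes, dashes
or ellipses, the substitution approach is probably worth trying.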

One method is to use 'sed' (which runs on windows as well as *nix) to
run substitutions on specified characters for all text in a file.  A sed
file looks something like this for example:
===== 8< =======================================================
s/`/'/g
s/’/'/g
s/“/"/g
s/”/"/g
===== >8 =======================================================
The 's' means 'substitute' the first character with the second, and the
'g' means apply the substitution to every occurrence on a line rather
than only the first.
You might save the above (along with any other substitutions you want to
make) in a file named 'word2txt.sed'.  Then you can either have your
favourite text editor run the sed file against the plaintext you have
saved in a file, or just do the operation on the DOS command line (once
you've installed sed, of course) with something like:
  sed -f word2txt.sed plaintextfile.txt > plaintextfile.txt.NEW
This tells sed to apply the substitutions in the sed file to the
plaintext file and save it to a new, hopefully now sanitized, file.
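
As a rough end-to-end sketch, the same idea can be extended to the em
dashes and ellipses mentioned in the question.  The extra rules, the
sample text and the file name 'sample.txt' are my own assumptions, and
the whole thing assumes the plaintext was saved as UTF-8; it is written
for a *nix shell rather than the DOS prompt.

```shell
#!/bin/sh
# Write the substitution rules to a sed file.  The quoted 'EOF' keeps the
# shell from interpreting anything inside the here-document.
cat > word2txt.sed <<'EOF'
s/‘/'/g
s/’/'/g
s/“/"/g
s/”/"/g
s/—/--/g
s/–/-/g
s/…/.../g
EOF

# Hypothetical sample standing in for text saved from Word.
printf '%s\n' '“It’s done”—finally…' > sample.txt

# Apply the rules and write the result to a new file, as described above.
sed -f word2txt.sed sample.txt > sample.txt.NEW
cat sample.txt.NEW
```

Writing to a separate .NEW file leaves the original untouched, so a bad
rule can be fixed and the command simply run again.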

So a sample workflow might look like:
- 'Save As' the Word document to a plaintext file.
- Apply the sed substitutions.
- Open the new file in your favourite text editor.
- Copy and paste from the local file to emacs/vi on the *nix server,
  etc.

> I've tried what I would consider to be the obvious workarounds for
> this; saving the Word document as a plain-text document and copying
> from there, saving the Word document as a html file and copying from
> there.  Nothing seems to work.
> 
> I suppose I could do my editing in Notepad (everything copies fine
> from Word to Notepad) and ftp it all back and forth. 

Going from emacs to notepad seems less than ideal ;-)

> But I don't work that way. I like to copy the text to Emacs, add a few
> tags, done!
> 
> Is there a way to copy text from a Word document into Emacs without
> having to go through and clean up all of those infernal periods?
> 
> -- 
> Robert Long, Library Systems Administrator
> Talbot County Free Library
> 410 822 1626 (v)
> 410 820 8217 (f)

Good luck and HTH,
Raymond
-- 
"Be Nice, or Leave - By Order of the Management"
(Sign above door, Black Sheep Inn, Wakefield)
GPG Fingerprint: 2E4D 8605 DD48 E80F F893  1C02 B65D 86D9 3B3C 0E03
Encrypted E-mail Preferred