Saturday November 20, 2004

Word DOC to PDF

Because my thesis has been approaching completion, I've been thinking about putting it up on the web.  The standard format for academic articles seems to be Adobe's PDF (Portable Document Format), but my thesis, along with most of the other papers I've written, is in Microsoft Word's DOC format.  I figured there'd be a simple, free way to convert from DOC to PDF, but that turns out not to be the case.  However, after a bit of research, I figured out a way to do clean conversions from DOC to PDF using only free software.  Below, I describe the steps necessary to get this working on a Windows XP machine.

A Google search on "DOC to PDF" produces a bunch of links to commercial software for doing the conversion, along with OpenOffice and Adobe's own Acrobat, both of which can import DOC files and save directly to PDF.  OpenOffice is free, which is nice, but unfortunately it changes the formatting in an imported Word document—nothing too drastic, mind you (although I wish it wouldn't tweak the indentation of numbered lists), but I was looking for a solution that doesn't require me to carefully reexamine the document's formatting before producing a PDF.  Acrobat would presumably have a similar issue, but I haven't evaluated it because it costs several hundred dollars.

Adobe does have a web-based service for converting various document formats to PDF, but it costs money too, although the first five conversions are free.  (I think this might make them conversion pushers.)  Adobe apparently used to supply a piece of software called PDF Writer that acted as a printer driver, converting whatever you printed into a PDF file.  That sounds like what I want (no reformatting required!), but apparently they don't ship it for free any more.

Fortunately, there's a way to get this same print-to-PDF functionality using free software, in two stages: first, convert the Word document into PostScript, a commonly used printer interface language, and then convert that into PDF.  We'll accomplish the first step by installing a PostScript printer driver and using the Windows "Print to File" feature to create a PostScript file, and then we'll convert that to PDF using the free Ghostscript implementation of PostScript.

[Update: Before jumping into this procedure, you should scroll down and read about PDFCreator, a simpler alternative.]

Step 1: PostScript Printer Driver

  1. In the Windows Control Panel, open up Printers and Faxes
  2. Add Printer
  3. Local Printer, no Plug-and-Play Detect
  4. Choose anything for the port—it doesn't matter, because we're never going to actually send the output to a printer.  (I chose LPT1.)
  5. Select the Apple LaserWriter II NT v47.0 driver.  The LaserWriter has been around forever (version 47!), so hopefully the PostScript output is very stable.  It also seems to be part of the default Windows XP installation, because it didn't ask me to insert my Windows CD—always a plus.

Step 2: Ghostscript

Ghostscript is a free piece of software (from GNU) that interprets PostScript files.  In other words, it acts like a software printer that renders the output to the screen rather than on paper.  We're going to install both Ghostscript (the interpreter) and GSView (the file viewer).  (Note: I don't know why PostScript has InterCaps but Ghostscript doesn't, but that seems to be the way it is.)

  1. Go to this site
  2. Download and install Ghostscript
  3. Download and install GSView

I installed versions 8.14 and 4.6, respectively, but get the most recent versions.

Step 3: Convert DOC to PDF!

  1. Open your document in Word
  2. File, Print
  3. Select the LaserWriter
  4. Check the Print to File checkbox
  5. Hit OK, and give it a file name to save to (like test.prn)
  6. Outside of Word, go rename that .prn file to test.ps
  7. Open it in GSView
  8. File, Convert
  9. Select pdfwrite (the default converter), and leave it at 600 dpi (there are tons of other settings available if you click Properties, but the defaults will work fine)
  10. Hit OK
  11. Save to test.pdf
  12. Double-click on test.pdf, and revel in the glorious portability of your document (assuming you have Adobe Reader installed)
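
For anyone who would rather skip the GSView clicks in steps 6 through 11, the same conversion can probably be driven straight from Ghostscript's command line. The following is only a minimal sketch, not a tested recipe: it assumes the console version of Ghostscript for Windows (gswin32c) is on your PATH, and the function names are made up for illustration. The switches themselves (-dBATCH, -dNOPAUSE, -sDEVICE, -r, -sOutputFile) are standard Ghostscript options.

```python
import subprocess
from pathlib import Path


def build_gs_command(ps_path, pdf_path, gs_exe="gswin32c", dpi=600):
    """Return the Ghostscript argument list for a PostScript-to-PDF conversion.

    gs_exe is an assumption: the console Ghostscript executable on Windows.
    """
    return [
        gs_exe,
        "-dBATCH",             # exit when the input has been processed
        "-dNOPAUSE",           # don't wait for a keypress between pages
        "-sDEVICE=pdfwrite",   # the same converter GSView offers by default
        f"-r{dpi}",            # resolution, matching the 600 dpi used above
        f"-sOutputFile={pdf_path}",
        str(ps_path),
    ]


def convert(prn_path, gs_exe="gswin32c"):
    """Rename a print-to-file .prn to .ps, then convert it to a PDF beside it."""
    prn = Path(prn_path)
    ps = prn.with_suffix(".ps")
    if prn != ps:
        prn.replace(ps)        # step 6: rename test.prn to test.ps
    pdf = ps.with_suffix(".pdf")
    subprocess.run(build_gs_command(ps, pdf, gs_exe), check=True)
    return pdf
```

Calling convert("test.prn") should leave test.ps and test.pdf in the same folder, provided Ghostscript is installed and reachable under that name.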

Except...

We're 90% of the way there, but there's a problem with the default settings that linguists are likely to run into.  To see it, make a test document with text in an odd font (like Doulos SIL) in 12 point and 24 point, or just copy this into Word:

Doulos SIL: 12pt
Doulos SIL: 24pt

Now convert that document to PDF using the procedure above.  Open the PDF and zoom in on the text.  With the default settings, the 24pt text is nice and smooth, but the 12pt text is pixelated and ugly.  It turns out that PostScript implementations have a set of built-in fonts that always look smooth, but Doulos SIL isn't one of them.  With non-built-in fonts, information about how to render text in the font has to be embedded into the PostScript file, and by default the Windows LaserWriter driver only does that for large fonts.  With smaller fonts, it bitmaps them at some resolution (possibly screen resolution), and that doesn't look very nice.  We need to tell it to embed rendering information for fonts of all sizes.

Step 4: Always Download Fonts

  1. Back to Control Panel, Printers and Faxes
  2. Right-click on the LaserWriter printer, and select Properties
  3. Printing Preferences...
  4. Advanced...
  5. Change "TrueType Font" from "Substitute with Device Font" to "Download as Softfont"
  6. Expand PostScript Options
  7. Change "TrueType Font Download Option" from "Automatic" to "Outline"
  8. OK, OK

That should do the trick.  Convert the font test document again, and zoom in—no more jaggy fonts.  Sweet!
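
If zooming in by eye feels too informal, one rough way to peek at what the driver put in the PostScript file is to count its font type declarations. In PostScript, every font dictionary carries a FontType entry: Type 3 is the user-defined kind that bitmapped glyphs are usually carried in, while Type 1 and Type 42 (TrueType) are outline fonts. The sketch below is a heuristic only. It assumes the driver writes plain "/FontType N" entries into the file, which has not been verified for the LaserWriter driver, and the function names are invented for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Matches literal font type declarations such as "/FontType 3 def".
FONT_TYPE_RE = re.compile(r"/FontType\s+(\d+)")


def count_font_types(ps_text):
    """Count how often each FontType number is declared in PostScript text."""
    return Counter(int(number) for number in FONT_TYPE_RE.findall(ps_text))


def count_font_types_in_file(ps_path):
    """Read a .ps file leniently (it may contain binary data) and count font types."""
    text = Path(ps_path).read_text(encoding="latin-1")
    return count_font_types(text)
```

A file that still shows Type 3 entries after Step 4 might be worth a closer look, though an empty result proves nothing if the driver defines its fonts some other way.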

Converting a document requires repeating all the actions in Step 3 above, which is kind of cumbersome.  It might be macroable, but I haven't tried that yet, since I don't convert documents all that often.  I haven't noticed any problems with the resulting PDF files, but it's possible that I just haven't fed it anything tricky yet.  If you have trouble with something, or have any suggestions about how to improve this process, please post a comment.  In particular, I'm curious to know if tweaking any of the settings in GSView's PDF converter improves the output.

[Now playing: "Rusty Cage" by Johnny Cash]

[Update:  I forgot to link to Will Duquette's The Perils of PDF series of posts over at A View from the Foothills, which describe his various attempts at trying to produce PDF files from a different sort of document, and under a different set of constraints.]

[Update: In a comment, greg suggests PDFCreator, which rolls the PostScript and PDF conversions into one printer driver.  (Quickie installation instructions for the current version (0.8): download the AFPL Ghostscript version of PDFCreator, install it, then download Patch02 and install it.)  PDFCreator is pretty slick, and saves you from having to go through Step 3 above every time you convert a document.  You also don't have to go through Step 4 to prevent it from bitmapping non-built-in fonts, which is nice.

There are two small issues, though.  First, when I use the PostScript+GSView procedure to create a PDF of my 92-page thesis, the resulting file is 288,078 bytes in size.  When I convert it using PDFCreator with the same settings (both at 600 dpi, in particular), the resulting PDF file is 329,838 bytes in size, about 15% larger.  I'm not sure why this is the case, since they're both using AFPL Ghostscript to do the conversion.  That brings me to the second (more minor) issue: I want to have GSView on my computer so that I can view and print PostScript files, and that requires Ghostscript.  PDFCreator also installs a tweaked version of Ghostscript, and so now I have two copies of AFPL Ghostscript installed in different places.  This isn't that big a deal, since I have plenty of disk space, but it's too bad they don't share a single copy.  There's probably a way to perform surgery on the config files to get them to share, but I don't feel motivated to figure it out.]
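
For the record, here is the arithmetic behind that size comparison as a small sketch. The function name is just for illustration; the byte counts are the two thesis file sizes quoted above.

```python
def percent_larger(baseline_bytes, other_bytes):
    """Return how much larger other_bytes is than baseline_bytes, in percent."""
    return (other_bytes - baseline_bytes) / baseline_bytes * 100.0


# Thesis: PostScript+GSView output versus PDFCreator output.
thesis_difference = percent_larger(288_078, 329_838)
print(f"PDFCreator output is {thesis_difference:.1f}% larger")
```

A negative result would mean the second file is the smaller one.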

I am The Tensor, and I approve this post.
02:20 AM in Web/Tech

Comments

Have you tried Pdf995?

http://www.pdf995.com/

It's free but for enduring a popup after each conversion. Works pretty well.

We used this at my campus job (tech support for profs *shudder*), since we had limited licenses for Acrobat.

Posted by: Angelo at Nov 20, 2004 7:00:48 AM

Check out PDFCreator at http://sourceforge.net/projects/pdfcreator/. It's an open source program that installs itself as a printer much like Adobe's commercial offering.

Posted by: greg at Nov 20, 2004 8:05:02 AM

You may wish to use this plug-in solution for Word:
http://www.cib.de/english/products/pdf/cibpdfplugin_freeware.htm.
Looks less complicated than the solution you describe.. ;)

Posted by: Dana at Nov 21, 2004 6:48:16 AM

You may want to reduce your dpi to 300. Anything above 300 is overkill. 72dpi is for web and anything that needs to be printed should be 300. If you reduce your dpi your file size should be smaller.

Posted by: Blinger at Nov 21, 2004 7:09:40 PM

Using the LaserWriter+GSView method, I get:

  Thesis at 600 dpi: 288,078 bytes
  Thesis at 300 dpi: 288,578 bytes (slightly *larger*)

I'm a little bit surprised it went up, but I'm not surprised there wasn't a significant difference. Since there are no graphics in my thesis (just one table), there shouldn't be anything that's rendered into a bitmap in the PDF version, so the resolution should be irrelevant.

Oh, and another datapoint on the two methods, converting a one-page multiple-font-size Doulos SIL test document:

  LaserWriter+GSView: 47,762 bytes
  PDFCreator: 7,530 bytes

So for that document, using PDFCreator is a big win. Hmm. I'd like to know what's causing all this variation in the output file sizes—I obviously would like the smallest size possible without sacrificing quality.

Posted by: The Tensor at Nov 21, 2004 7:37:17 PM

Well, I want a PDF converter from doc for Linux.. Thank You.

Posted by: syerman at Jan 16, 2005 7:00:27 AM

This tutorial is really good, thanks.

Posted by: kazaj at Aug 1, 2005 3:32:16 PM

There's a program called (something like) Print to PDF (or print2pdf) that allows you to "print" a doc file (like a printer) to a PDF file. Useful. . .look into it . . your way seems a bit complicated.

Posted by: Aaron Morse at Aug 24, 2005 6:29:00 AM

"If you reduce your dpi your file size should be smaller."

Erm, not quite. PDF is, in general, a vector format. This would only make any difference if you were using bitmapped images in the document. Anyway, PrimoPDF is another (completely free) piece of software that acts like a printer driver for saving any document as a PDF.

Also, if you're a bit more adventurous, forget Word completely--LaTeX is great for things like theses. It gives you a level of control that Word never could, though of course the tradeoff is that it's a lot harder to learn.

Posted by: Chris Ball at Dec 10, 2005 5:27:27 PM

Let me just add a word of warning about LaTeX. If you want to have detailed control over the typesetting of your document, or you need to typeset something complex (like math notation, or HPSG's typed feature structures), LaTeX is a powerful tool. If you just want to write a straightforward document with a few font and size changes, avoid LaTeX like the plague. Its interface is, well, no interface at all; it's like coding up a computer program that will emit your document. Figuring out what package is best for a particular need can be challenging, and even though it's a very mature system, it's not free of bugs. I spent a frustrating couple of hours earlier this year trying to work around a line-breaking error in a paper I submitted.

Dealing with IPA is a huge pain in the ass, too, compared with modern Unicode-based word processors.

Posted by: The Tensor at Dec 10, 2005 6:43:25 PM

Following on from the previous comment, I would put in a vote for LaTeX as well. It is a little tricky if you are only used to word-processor interfaces like Word or OO.org, but I would imagine that for a linguist (and a scifi/geeky one at that) interspersing the text with formatting commands would come naturally.
The degree of control one has over the output is very precise, and it is particularly suited to works of a technical nature, such as a thesis or other academic work.

Posted by: mihaly at Dec 22, 2005 5:50:05 AM

Thank you very much! Your tutorial was REALLY helpful :)

Cheers,

Salvador

Posted by: Salvador Venegas at Oct 9, 2006 11:52:42 AM

Very useful, thank you. Especially to Chris Ball for the PrimoPDF link!

Posted by: Martin at Dec 26, 2006 3:16:01 AM

You can also try FreePdfXP. It's great and absolutely free.
I have used it for a long time and it always makes great PDFs.

From time to time I have a problem and it doesn't print anymore:
just be sure to have the setup executable at hand (or check for a new version!) and install it again. The next time you use it successfully, it will ask you about any past documents you couldn't print.
http://www.shbox.de/fpxp.htm

Posted by: Dav at Jan 22, 2007 10:50:27 AM

You can install PDFCreator without Ghostscript.

Posted by: wildeny at Mar 7, 2007 4:55:13 AM

Thanks for sharing this article! I am fond of pictures, and I always convert pictures to the formats that I want. So I know some software that can help you convert your pictures to the formats you want, such as Easy Icon Maker, which helps you edit transparent or opaque icons and extract an icon from an EXE or DLL file.
You can try it at http://www.qweas.com/download/graphics/icon_tools/easy_icon_maker.htm
Enjoy yourself!

Posted by: lily at Apr 5, 2007 7:45:06 PM

CZ-Doc2Pdf can also convert Word to PDF, DOC to PDF.

Posted by: freda at May 13, 2007 6:42:10 PM

Try using the application from Drawloop. Their site, which is free if you subscribe, allows you to upload many types of files, including URLs, and it automatically converts them into PDF format. You can upload multiple documents at a time and move them into a specific order, as well as combine them into a single PDF. Drawloop has an add-on extension for Firefox called LOOP. That's free as well.
Drawloop.com
https://addons.mozilla.org/en-US/firefox/addon/4738

Posted by: Q at Jun 7, 2007 12:59:55 PM

LOOP Plugin for firefox takes the cake! Thanks freda.

Posted by: ejb at Jun 26, 2007 2:46:01 PM

Thanks for that nice tutorial. Now I will be able to write my own PDF files without a big "demoversion" watermark on them. Thanks!

Posted by: iceweasle at Jul 15, 2007 10:17:46 AM

Anyone know how to change a PDF to a JPEG?

Posted by: Meg at Aug 9, 2007 11:38:28 AM

I indexed this on PDF to Doc!

Anyone got a freeware or open source program that allows you to convert from PDF format to Word document format (including layout and graphics)?

I have seen so many BAD examples.

Posted by: Stephen Adams at Aug 19, 2007 7:31:11 AM

Very useful guide. Thank you. It worked for me too.

Here is one more hint: If you need colors in your PDF files, try using Canon PS-IPU Color Laser Copier v52.3 instead of Apple LaserWriter II NT v47.0 driver.

Posted by: Daniel at Oct 26, 2007 11:12:15 AM

OMG thanks mate. Saves me a lot of money. :P

Posted by: CJ at Dec 12, 2007 10:16:59 PM

Thanks for that tutorial. It's the best I found after some searching, and with this, anybody can do it.
The good thing about doing it yourself this way is that it's all freeware, whereas most other PDF converters add annoying watermarks or even worse.
Ghostscript (and GSView) may be more complicated but I think it's worth it.

PS: You don't need to be admin. A big plus.

Posted by: Markus at Feb 16, 2009 12:23:46 AM

http://www.nemopdf.com: Easiest PDF converter solutions: Word to PDF, PDF to Word, printable files to PDF...

Posted by: luiana at Oct 20, 2009 12:29:33 AM

Tweak PDF To Word 3.0 is a little program to convert PDF to editable Word for easier editing. The converter offers an exact conversion, with the output Word file retaining all the original features of the PDF intact, including layout, image positioning, text font, graphics, hyperlinks, etc.
Encrypted PDF files can be converted to Word, too.
http://www.tweakpdf.com

Posted by: Mass Tweak at Nov 24, 2009 11:29:41 PM