Converting HTML pages into PDF (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Tue, 17 Jan 2012

Converting HTML pages into PDF

I've long wanted a way of converting my HTML presentation slides to PDF. Mostly because conference organizers tend to freak out at slides in any format other than Open/Libre Office, Powerpoint or PDF. Too many times, I've submitted a tarball of my slides and discovered they weren't even listed on the conference site. (I ask you, in what way is a tarball more difficult to deal with than an .odp file?) Slide-sharing websites also have a limited set of formats they'll accept.

A year or so ago, I added a screenshot capability to my webkit-based presentation program, Preso, but I really needed PDF, not images.

Now, creating PDF from HTML shouldn't be that hard. Every browser has a print function that can print to a PDF file. So why is it so hard to create PDF from HTML in any kind of scriptable way?

After much searching and experimenting, I finally found a Python code snippet that worked: XHTML to PDF using PyQt4 Webkit from Alex Dong. It uses Python-Qt, not GTK, so I can't integrate it into my Preso app, but that's okay -- a separate tool is just as good.

(I struggled to write an equivalent in PyGTK, but gave up due to the complete lack of documentation of Python-Webkit-GTK, and not much more for gtk.PrintOperation(). QWebView's documentation may not be as complete as I'd like, but at least there is some.)

Printing from QWebView to QPrinter

Here are the important things I learned about QWebView from fiddling around with Alex's code to adapt it to what I needed, which is printing a list of pages to sequentially numbered files:

Things I learned about QPrinter():

Anyway, it's a little hacky with that empirical zoom factor ... but it works! The program is here: qhtmlprint: convert HTML to PDF using Qt Webkit.

And it does produce reasonable PDF, with the text properly vectorized, not just raster screenshots of each page.

Printing the slides in the right order

Terrific -- now I can feed a list of slides to qhtmlprint and get a bunch of PDF files back. How do I print the right slides?

My slides are listed in order in an array inside a Javascript file, one per line. If I grep .html navigate.js, I get a list like this:

    "arduino.html",
    "img.html?pix/arduinos/arduino-clones.jpg",
    "getting_started.html",
    "img.html?pix/projects/led.jpg",
    //"blink.html",
    "arduino-ide.html",

To pass that to qhtmlprint, I only need to remove the commented-out lines (the ones with //) and strip off the quotes and commas. I can do that all in one command with a grep and sed pipeline:

    qhtmlprint `fgrep .html navigate.js | grep -v // | sed -e 's/",/"/' -e 's/"//g'`
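
For anyone who would rather not depend on shell quoting, roughly the same filtering can be sketched in Python. This is only an illustration and not part of qhtmlprint: the function name slide_urls is made up, and it assumes navigate.js keeps one quoted slide per line, as in the listing above.

```python
def slide_urls(js_lines):
    """Return slide names from navigate.js-style lines, in order.

    Roughly mirrors the shell pipeline: keep lines that mention .html,
    drop any line containing //, then strip the quotes and trailing comma.
    """
    urls = []
    for line in js_lines:
        if ".html" not in line or "//" in line:
            continue
        urls.append(line.strip().rstrip(",").replace('"', ""))
    return urls


# Sample lines in the same shape as the grep output shown earlier.
sample = [
    '    "arduino.html",',
    '    "img.html?pix/arduinos/arduino-clones.jpg",',
    '    //"blink.html",',
    '    "arduino-ide.html",',
]
print(slide_urls(sample))
```

Like the grep -v // step, this throws away any line containing // anywhere, so a slide with a trailing comment on the same line would be skipped too.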

And voilà! I have a bunch of fileNNN.pdf files.

Creating a multi-page slide deck

Okay, great! Now how do I stick those files all together into one slide deck I can submit to conference organizers?

That part's easy -- Ghostscript can do it.

    gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=slidedeck.pdf -dBATCH file*.pdf
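
One thing to watch: the shell expands file*.pdf in lexical order, so the pages only come out in slide order if the numbers are zero-padded, which the fileNNN naming suggests they are. If they weren't, a small Python wrapper could sort numerically before building the same gs command. This is a sketch under that assumption; the helper names numeric_key and gs_merge_command are invented for illustration, and the gs flags are just the ones from the command above.

```python
import re


def numeric_key(filename):
    """Sort key: the first run of digits in the name, as an integer."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else -1


def gs_merge_command(pdf_files, output="slidedeck.pdf"):
    """Build the Ghostscript argument list, with inputs in numeric order."""
    ordered = sorted(pdf_files, key=numeric_key)
    return ["gs", "-dNOPAUSE", "-sDEVICE=pdfwrite",
            "-sOUTPUTFILE=" + output, "-dBATCH"] + ordered


# Unpadded names that a plain lexical sort would put in the wrong order.
cmd = gs_merge_command(["file10.pdf", "file2.pdf", "file1.pdf"])
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.check_call(cmd)
```

The key looks at the first digits anywhere in the string, so it assumes bare filenames rather than paths with numbers in the directory names.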

And now slidedeck.pdf contains my whole presentation, ready to go.

[ 12:16 Jan 17, 2012 ]
