No PDFs! The Sunlight Foundation point out that PDFs are a terrible way of implementing “more transparent government” due to their general lack of structure. At the Guardian (and I’m sure at other newspapers) we waste an absurd amount of time manually extracting data from PDF files and turning it into something more useful. Even CSV is significantly more useful for many types of information.
For analysis, PDFs are about the worst format possible - hardly better than a scanned image.
In fact, at various organisations I've worked with, the only really viable approach to extracting text (let alone structured content) from a high proportion of PDFs is to treat them as scanned images and apply OCR software to them.
Richard Boulton - 1st November 2009 13:10 - #
Right, it's a pity that the ability of PDF to accommodate structured data (e.g. using tagged PDF) is not widely used; that could help a lot.
Besides this, it seems that PDF (especially PDF/A) is well on the road to becoming a de facto standard for documents that need long-term conservation (e.g. business docs, contracts, invoices, ...)
Luca Mearelli - 1st November 2009 14:15 - #
It's not really right to say that PDFs lack structure, although there is a wide range of PDFs, some of which are little better than images.
I teach people (often Government people) how to make accessible PDFs, a large part of which is ensuring the structure is there and correct.
If you're using Word 2003+ and Acrobat Pro 7+, it will create a structured PDF by default. That is, provided you structure the Word document (i.e. use styles).
The biggest issue that prevents Government documents being accessible/structured is really authoring practices in Word (and other programs).
That said, I'd assume it is several orders of magnitude harder to get information out of a binary format like PDF, even if it has structure?
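As a rough illustration of the tagged/untagged distinction above, here is a minimal Python sketch that guesses whether a PDF declares itself as tagged by searching its raw bytes for the `/StructTreeRoot` and `/MarkInfo` catalog entries. The function name `looks_tagged` is hypothetical, and this is only a heuristic: it will miss files whose catalog sits in a compressed object stream or whose `/MarkInfo` is an indirect reference, and it says nothing about whether the tags are any good.

```python
import re

# Heuristic only: a tagged PDF's catalog normally carries a /StructTreeRoot
# entry and a /MarkInfo dictionary containing "/Marked true".
_MARKED = re.compile(rb"/MarkInfo\s*<<[^>]*/Marked\s+true", re.DOTALL)


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Guess whether raw PDF bytes declare a structure tree (tagged PDF)."""
    has_struct_root = b"/StructTreeRoot" in pdf_bytes
    is_marked = _MARKED.search(pdf_bytes) is not None
    # Require both markers; either one alone is treated as "not tagged".
    return has_struct_root and is_marked
```

To try it, read a file in binary mode and pass the bytes in, e.g. `looks_tagged(open("report.pdf", "rb").read())`, where `report.pdf` is a placeholder name. A proper PDF library would be needed for anything beyond this quick check.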
Open government isn't about being open and giving information to the people. It's about NOT giving information to the people and LOOKING like you're doing just that.
No government, not even the holy Obama's, is interested in a lot of inspection and criticism.
Dave K - 1st November 2009 16:01 - #
Yet PDFs are a great way of distributing documents that can be read by the majority of people regardless of their chosen platform.
If PDFs are bad for extraction then that isn't really a fault of the PDF document per se, it's more a fault of the content producer for not making a companion document available in a machine-readable format.
Jonathan Hollin - 1st November 2009 16:42 - #
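To make the "machine-readable companion" point concrete, here is a small Python sketch of how little work a CSV companion file needs before it can be analysed. The column names, the figures and the function name `total_by_department` are all made up for illustration; they are not taken from any real dataset.

```python
import csv
import io

# Made-up sample data, for illustration only.
SAMPLE_CSV = """department,year,spend_gbp
Transport,2008,1200
Health,2008,3400
Transport,2009,1500
"""


def total_by_department(csv_text: str) -> dict:
    """Sum the spend_gbp column per department."""
    totals = {}
    # DictReader uses the first row as column names.
    for row in csv.DictReader(io.StringIO(csv_text)):
        dept = row["department"]
        totals[dept] = totals.get(dept, 0) + int(row["spend_gbp"])
    return totals
```

Calling `total_by_department(SAMPLE_CSV)` needs nothing beyond the standard library, whereas getting the same table out of a PDF typically means manual extraction or OCR first.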
PDFs are not a good platform-independent format, because they can't be reformatted. How do you read a PDF on a cellphone? How do you read a PDF at font size 25? PDFs are good for one thing only: as an intermediate step towards a print-out. Any sort of information distribution that isn't meant for print should use another format, like OpenDocument, HTML or even RTF.
Joeri Sebrechts - 2nd November 2009 09:56 - #
PDFs hinder usability on screen. In pretty much every browser, including Google Chrome, the keyboard access is broken by the PDF reader. For example, you cannot hit the Backspace key to go back, you can't find content using CTRL+F and so on. It is sad that something that was meant for printing is used for sharing.
Rag Kidiyoor - 2nd November 2009 13:31 - #
OpenOffice creates tagged PDFs by default
Joeri - cellphones should be able to re-flow the document, which will work if it's been put together well. (Which is to say, not the government ones.)
In Acrobat Reader, go to View > Zoom > Reflow.
That *should* be how cell phones deal with it.
Rag - This is something you can turn off in the options, so that it opens in a reader rather than the browser. (Not that many people know about that.)