Simon Willison’s Weblog

17 items tagged “pdf”

2024

Portable EPUBs. Will Crichton digs into the reasons people still prefer PDF over HTML as a format for sharing digital documents, concluding that the key issues are that HTML documents are not fully self-contained and may not be rendered consistently.

He proposes “Portable EPUBs” as the solution, defining a subset of the existing EPUB standard with some additional restrictions around avoiding loading extra assets over a network, sticking to a smaller (as-yet undefined) subset of HTML and encouraging interactive components to be built using self-contained Web Components.

Will also built his own lightweight EPUB reading system, called Bene—which is used to render this Portable EPUBs article. It provides a “download” link in the top right which produces the .epub file itself.

There’s a lot to like here. I’m constantly infuriated at the number of documents out there that are PDFs but really should be web pages (academic papers are a particularly bad example here), so I’m very excited by any initiatives that might help push things in the other direction. # 25th January 2024, 8:32 pm

2023

textra (via) Tiny (432KB) macOS binary CLI tool by Dylan Freedman which produces high quality text extraction from PDFs, images and even audio files using the VisionKit APIs in macOS 13 and higher. It handles handwriting too! # 23rd March 2023, 9:08 pm

2019

Automate the Boring Stuff with Python: Working with PDF and Word Documents. I stumbled across this while trying to extract some data from a PDF file (the kind of file with actual text in it as opposed to dodgy scanned images) and it worked perfectly: PyPDF2.PdfFileReader(open("file.pdf", "rb")).getPage(0).extractText() # 6th November 2019, 4:17 pm

2017

arxiv-vanity (via) Beautiful new project from Ben Firshman and Andreas Jansson: “Arxiv Vanity renders academic papers from Arxiv as responsive web pages so you don’t have to squint at a PDF”. It works by pulling the raw LaTeX source code from Arxiv and rendering it to HTML using a heavily customized Pandoc workflow. The real fun is in the architecture: it’s a Django app running on Heroku which fires up on-demand Hyper.sh Docker containers for each individual rendering job. # 25th October 2017, 8:06 pm

2010

pdf.js. A JavaScript library for creating simple PDF files. Works (flakily) in your browser using a data:URI hack, but is also compatible with server-side JavaScript implementations such as Node.js. # 17th June 2010, 7:39 pm

2009

node.js at JSConf.eu (PDF). node.js creator Ryan Dahl’s presentation at this year’s JSConf.eu. The principal philosophy is that I/O in web applications should be asynchronous, for everything. No blocking for database calls, no blocking for filesystem access. JavaScript is a mainstream programming language with a culture of callback APIs (thanks to the DOM) and is hence ideally suited to building asynchronous frameworks. # 17th November 2009, 6:07 pm
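To make the "no blocking" idea concrete, here is a minimal sketch in Python's asyncio rather than node.js itself. The "database" and "filesystem" calls are simulated with sleeps, and every name in it is hypothetical. The point is only that a slow operation started first does not hold up a faster one started after it.

```python
import asyncio


async def fake_io(name: str, seconds: float, finished: list[str]) -> None:
    # Stand-in for a real non-blocking call (assumption: simulated with sleep).
    # While this coroutine waits, the event loop is free to run other work.
    await asyncio.sleep(seconds)
    finished.append(name)


async def main() -> list[str]:
    finished: list[str] = []
    # Start both operations together; neither blocks the other.
    await asyncio.gather(
        fake_io("database", 0.05, finished),
        fake_io("filesystem", 0.01, finished),
    )
    return finished


# The faster "filesystem" call completes first even though it was started second.
completion_order = asyncio.run(main())
```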

Adobe is Bad for Open Government. The problem isn’t just that PDFs are a bad way of sharing data, it’s that Adobe have been actively lobbying the US government to use their PDF and Flash formats for open government initiatives. # 1st November 2009, 12:51 pm

Prawn (via) Really nice PDF generation library for Ruby, used to generate Dopplr’s beautiful end of year reports. # 16th January 2009, 4:04 pm

Dopplr presents the Personal Annual Report 2008: freshly generated for you, and Barack Obama... So classy it hurts. I’d love to know what library they used to generate the PDF. # 16th January 2009, 12:17 pm

2008

Robust Defenses for Cross-Site Request Forgery [PDF]. Fascinating report which introduces the “login CSRF” attack, where an attacker uses CSRF to log a user in to a site (e.g. PayPal) using the attacker’s credentials, then waits for them to submit sensitive information or bind the account to their credit card. The paper also includes an in-depth study of potential protection measures, including research that shows that 3-11% of HTTP requests to a popular ad network have had their referer header stripped. Around 0.05%-0.10% of requests have custom HTTP headers such as X-Requested-By stripped. # 24th September 2008, 9:40 am
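As a rough illustration of the custom header style of defense mentioned above, here is a minimal Python sketch. The header name X-Requested-By comes from the entry; the function name, the list of safe methods and the strict reject policy are my own assumptions and are not taken from the paper. A strict check like this would also reject the small share of legitimate requests that arrive with the header stripped.

```python
# Methods assumed not to change server state (an assumption of this sketch).
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}
CUSTOM_HEADER = "x-requested-by"


def passes_custom_header_check(method: str, headers: dict[str, str]) -> bool:
    """Return True if the request may proceed under this hypothetical policy."""
    if method.upper() in SAFE_METHODS:
        return True
    # Header names are case-insensitive, so compare in lower case.
    present = {name.lower() for name in headers}
    return CUSTOM_HEADER in present
```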

PDFMiner. Useful looking PDF parsing library in Python—can produce an XML representation of the text and style information in a PDF document. # 3rd August 2008, 3:29 pm

Scaling your website with the Perlbal web server (PDF) (via) Perlbal documentation is pretty thin on the ground; this is a really useful introduction from Frank Wiles. # 17th June 2008, 10:39 pm

OSM Super-Strength Export. Awesome new feature on OpenStreetMap: you can browse to anywhere on the map, then hit “export” and download a rendered bitmap or vector (PDF and SVG) image of the currently displayed map—and because it’s OSM there’s no watermark and a very liberal usage license. # 22nd April 2008, 9:56 am

2007

Restructured Text to Anything. Slick set of online tools for converting Restructured Text (one of the more mature wiki-style markup languages) to HTML or PDF. Includes a nice looking API. Powered by Django. # 13th September 2007, 3:54 pm

PDF Shrink. $35 OS X app that crunches down the size of PDF files—useful if you often embed photos in your presentations. # 28th May 2007, 2:21 pm

The Adobe PDF XSS Vulnerability. If you host a PDF file anywhere on your site, you’re vulnerable to an XSS attack due to a bug in Acrobat Reader versions below 8. The fix is to serve PDFs as application/octet-stream to avoid them being displayed inline. # 11th January 2007, 4:23 pm
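The fix described above could be sketched with Python's standard library like this. The function and class names are hypothetical, and this is only one way to get the content type changed, not the method from the linked article.

```python
import mimetypes
from http.server import SimpleHTTPRequestHandler


def content_type_for(path: str) -> str:
    """Pick a Content-Type, never labelling a file as application/pdf."""
    if path.lower().endswith(".pdf"):
        # Generic binary type, so the browser does not open the PDF inline.
        return "application/octet-stream"
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"


class PdfAsDownloadHandler(SimpleHTTPRequestHandler):
    """Static file handler that serves PDFs as application/octet-stream."""

    def guess_type(self, path):
        return content_type_for(str(path))
```

To try it, pass PdfAsDownloadHandler to http.server.HTTPServer and call serve_forever() from a directory containing a PDF.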

2002

PHP generated PDFs

R&OS PDF PHP classes (via tidak ada). This is the most useful PHP library I’ve seen in a long time. It allows dynamic generation of PDF files without needing any additional modules installed on the server (although GD is required if you want to add images to your PDFs). It is extremely easy to use and has an impressive set of features, including PDF drawing tools, built-in page number support and excellent documentation. On the topic of PDFs, Yes You Can advocates their use for presentations and touches on a method of generating them using Python.
