Simon Willison’s Weblog

Subscribe

Sunday, 3rd November 2024

Docling. MIT licensed document extraction Python library from the Deep Search team at IBM, who released Docling v2 on October 16th.

Here's the Docling Technical Report paper from August, which provides details of two custom models: a layout analysis model for figuring out the structure of the document (sections, figures, text, tables etc) and a TableFormer model specifically for extracting structured data from tables.

Those models are available on Hugging Face.

Here's how to try out the Docling CLI interface using uvx (avoiding the need to install it first - though since it downloads models it will take a while to run the first time):

uvx docling mydoc.pdf --to json --to md

This will output a mydoc.json file with complex layout information and a mydoc.md Markdown file which includes Markdown tables where appropriate.

The Python API is a lot more comprehensive. It can even extract tables as Pandas DataFrames:

from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)

I ran that inside uv run --with docling python. It took a little while to run, but it demonstrated that the library works.

# 4:57 am / ibm, ocr, pdf, python, ai, hugging-face, uv

California Clock Change. The clocks go back in California tonight and I finally built my dream application for helping me remember if I get an hour extra of sleep or not, using a Claude Artifact. Here's the transcript.

California Clock Change. For Pacific Time (PST/PDT) only. When you go to bed on Saturday, November 2, 2024That's tonight!, you will get an extra hour of sleep! The clocks fall back from 2:00 AM to 1:00 AM on Sunday, November 3, 2024.

This is one of my favorite examples yet of the kind of tiny low stakes utilities I'm building with Claude Artifacts because the friction involved in churning out a working application has dropped almost to zero.

(I added another feature: it now includes a note of what time my Dog thinks it is if the clocks have recently changed.)

# 5:11 am / projects, timezones, ai, llms, ai-assisted-programming, claude-artifacts

Building technology in startups is all about having the right level of tech debt. If you have none, you’re probably going too slow and not prioritizing product-market fit and the important business stuff. If you get too much, everything grinds to a halt. Plus, tech debt is a “know it when you see it” kind of thing, and I know that my definition of “a bunch of tech debt” is, to other people, “very little tech debt.”

Tom MacWright

# 4:36 pm / tom-macwright, technical-debt

2024 » November

MTWTFSS
    123
45678910
11121314151617
18192021222324
252627282930