Simon Willison’s Weblog


Sunday, 3rd November 2024

Docling. MIT licensed document extraction Python library from the Deep Search team at IBM, who released Docling v2 on October 16th.

Here's the Docling Technical Report paper from August, which provides details of two custom models: a layout analysis model for figuring out the structure of the document (sections, figures, text, tables etc) and a TableFormer model specifically for extracting structured data from tables.

Those models are available on Hugging Face.

Here's how to try out the Docling CLI using uvx (avoiding the need to install it first, though since it downloads models it will take a while to run the first time):

uvx docling mydoc.pdf --to json --to md

This will output a mydoc.json file with complex layout information and a mydoc.md Markdown file which includes Markdown tables where appropriate.

The Python API is a lot more comprehensive. It can even extract tables as Pandas DataFrames:

from docling.document_converter import DocumentConverter

# Convert the PDF into a structured Docling document
converter = DocumentConverter()
result = converter.convert("document.pdf")

# Print each detected table as a Pandas DataFrame
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)

I ran that inside uv run --with docling python. It took a little while to run, but it demonstrated that the library works.

# 4:57 am / cli, ibm, ocr, pdf, python, ai, hugging-face, uv

California Clock Change. The clocks go back in California tonight and I finally built my dream application for helping me remember if I get an hour extra of sleep or not, using a Claude Artifact. Here's the transcript.

California Clock Change. For Pacific Time (PST/PDT) only. When you go to bed on Saturday, November 2, 2024 (that's tonight!), you will get an extra hour of sleep! The clocks fall back from 2:00 AM to 1:00 AM on Sunday, November 3, 2024.

This is one of my favorite examples yet of the kind of tiny low-stakes utilities I'm building with Claude Artifacts because the friction involved in churning out a working application has dropped almost to zero.

(I added another feature: it now includes a note of what time my dog thinks it is if the clocks have recently changed.)

# 5:11 am / projects, timezones, ai, llms, ai-assisted-programming, claude-artifacts, prompt-to-app

Tool California Clock Change - PST/PDT Only — Track upcoming and recent Daylight Saving Time changes for California's Pacific Time zone (PST/PDT). The page automatically detects your timezone and displays when clocks will spring forward or fall back, along with the current DST status and helpful reminders about how the time change affects daily routines. Users outside California can still view the information by enabling a pretend mode.
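
The core question a tool like this answers (when is the next clock change, and do I gain or lose an hour?) can be sketched in a few lines of Python. This is not the code behind the tool itself, which runs in the browser. It is a minimal illustration using the standard library's zoneinfo module, and it assumes the system has time zone data available. The function name next_transition and its horizon_days parameter are made up for this example. It steps forward an hour at a time, which relies on the assumption that Pacific Time transitions land on whole UTC hours.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def next_transition(after, tz=PACIFIC, horizon_days=400):
    """Find the first UTC-offset change in `tz` after the aware datetime `after`.

    Returns (utc_instant, sleep_gained), or None if the offset does not change
    within the horizon. sleep_gained is +1 hour when the clocks fall back and
    -1 hour when they spring forward.
    """
    # Start from the top of the current UTC hour and walk forward hourly.
    start = after.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    previous = start.astimezone(tz).utcoffset()
    for hours in range(1, horizon_days * 24 + 1):
        instant = start + timedelta(hours=hours)
        current = instant.astimezone(tz).utcoffset()
        if current != previous:
            # Offset moving from -7h to -8h means an extra hour: -7h - (-8h) = +1h.
            return instant, previous - current
    return None


# Hypothetical usage: asking on the Saturday before the November 2024 change.
when, gained = next_transition(datetime(2024, 11, 2, 12, tzinfo=timezone.utc))
print(when.astimezone(PACIFIC), gained)
```

Comparing UTC offsets rather than hard-coding "first Sunday in November" leaves the rules to the time zone database, so the sketch would keep working if the transition dates were ever changed.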

Building technology in startups is all about having the right level of tech debt. If you have none, you’re probably going too slow and not prioritizing product-market fit and the important business stuff. If you get too much, everything grinds to a halt. Plus, tech debt is a “know it when you see it” kind of thing, and I know that my definition of “a bunch of tech debt” is, to other people, “very little tech debt.”

Tom MacWright

# 4:36 pm / tom-macwright, technical-debt

Tool animated-rainbow-border — Display an animated rainbow gradient border effect around a centered box with interactive controls. The page features a dark theme with a glowing, color-shifting border that can be toggled on and off using the provided button. The animation combines gradient shifting and pulsing effects to create a dynamic, eye-catching visual presentation.