Simon Willison's Weblog: ai2

olmOCR

2025-02-26T02:04:03+00:00

New from Ai2 - olmOCR is "an open-source tool designed for high-throughput conversion of PDFs and other documents into plain text while preserving natural reading order".

At its core is allenai/olmOCR-7B-0225-preview, a Qwen2-VL-7B-Instruct variant trained on ~250,000 pages of diverse PDF content (both scanned and text-based) that were labelled using GPT-4o and made available as the olmOCR-mix-0225 dataset.

The olmocr Python library can run the model on any "recent NVIDIA GPU". I haven't managed to run it on my own Mac yet - there are GGUFs out there but it's not clear to me how to run vision prompts through them - but Ai2 offer an online demo which can handle up to ten pages for free.

Given the right hardware this looks like a very inexpensive way to run large scale document conversion projects:

We carefully optimized our inference pipeline for large-scale batch processing using SGLang, enabling olmOCR to convert one million PDF pages for just $190 - about 1/32nd the cost of using GPT-4o APIs.

The most interesting idea from the technical report (PDF) is something they call "document anchoring":

Document anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted from the PDF binary file. [...]

Document anchoring processes PDF document pages via the PyPDF library to extract a representation of the page’s structure from the underlying PDF. All of the text blocks and images in the page are extracted, including position information. Starting with the most relevant text blocks and images, these are sampled and added to the prompt of the VLM, up to a defined maximum character limit. This extra information is then available to the model when processing the document.

The one limitation of olmOCR at the moment is that it doesn't appear to do anything with diagrams, figures or illustrations. Vision models are actually very good at interpreting these now, so my ideal OCR solution would include detailed automated descriptions of this kind of content in the resulting text.

Via Luca Soldaini

Tags: vision-llms, ai, qwen, llms, fine-tuning, pdf, generative-ai, ocr, ai2

What do people really ask chatbots? It’s a lot of sex and homework

2024-08-04T18:59:46+00:00

What do people really ask chatbots? It’s a lot of sex and homework

Jeremy B. Merrill and Rachel Lerman at the Washington Post analyzed WildChat, a dataset of 1 million ChatGPT-style interactions collected and released by the Allen Institute for AI.

From a random sample of 458 queries they categorized the conversations as 21% creative writing and roleplay, 18% homework help, 17% "search and other inquiries", 15% work/business and 7% coding.

I talked to them a little for this story:

“I don’t think I’ve ever seen a piece of technology that has this many use cases,” said Simon Willison, a programmer and independent researcher.

Tags: washington-post, generative-ai, chatgpt, ai, llms, ai2

Open Language Models (OLMos) and the LLM landscape

2024-02-02T04:11:40+00:00

Open Language Models (OLMos) and the LLM landscape

OLMo is a newly released LLM from the Allen Institute for AI (AI2) currently available in 7b and 1b parameters (OLMo-65b is on the way) and trained on a fully openly published dataset called Dolma.

The model and code are Apache 2, while the data is under the “AI2 ImpACT license”.

From the benchmark scores shared here by Nathan Lambert it looks like this may be the highest performing model currently available that was built using a fully documented training set.

What’s in Dolma? It’s mainly Common Crawl, Wikipedia, Project Gutenberg and the Stack.

Via @natolambert

Tags: llms, ai, generative-ai, training-data, ai2