Items tagged ocr in 2022

4 results

Building a searchable archive for the San Francisco Microscopical Society

The San Francisco Microscopical Society was founded in 1870 by a group of scientists dedicated to advancing the field of microscopy.

[... 1845 words]

5:24 pm / 25th August 2022 / ocr, projects, datasette, weeknotes

Digitizing 55,000 pages of civic meetings. Philip James has been building public, searchable archives of city council meetings for various cities (Oakland and Alameda so far), using my s3-ocr script to run Textract OCR against the PDFs of the minutes and deploying the results to Fly using Datasette. This is a really cool project, and very much the kind of thing I’ve been hoping to support with the tools I’ve been building. 22nd August 2022, 4:26 pm

Litestream backups for Datasette Cloud (and weeknotes)

My main focus this week has been adding robust backups to the forthcoming Datasette Cloud.

[... 1604 words]

5:19 pm / 11th August 2022 / ocr, s3, datasette, weeknotes, datasettecloud, fly, litestream, gpt3, dalle

s3-ocr: Extract text from PDF files stored in an S3 bucket

I’ve released s3-ocr, a new tool that runs Amazon’s Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.

[... 1493 words]

9:40 pm / 30th June 2022 / aws, ocr, projects, s3, weeknotes, s3credentials
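
To make the last step of the s3-ocr description concrete, here is a minimal sketch of storing extracted page text in SQLite with full-text search, using Python's standard sqlite3 module and the FTS5 extension. The table name, column names, and sample rows are assumptions for illustration only; this is not s3-ocr's actual schema or code, and the OCR step itself is left out.

```python
import sqlite3


def build_index(db, pages):
    """Create a hypothetical FTS5 table and load (path, page, text) rows into it."""
    # path and page are stored but not indexed; only the text is searchable.
    db.execute(
        "CREATE VIRTUAL TABLE pages USING fts5(path UNINDEXED, page UNINDEXED, text)"
    )
    db.executemany("INSERT INTO pages (path, page, text) VALUES (?, ?, ?)", pages)
    db.commit()


def search(db, query):
    """Return (path, page) for rows whose text matches the query, best match first."""
    rows = db.execute(
        "SELECT path, page FROM pages WHERE pages MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    # Made-up sample text standing in for OCR output.
    build_index(
        db,
        [
            ("minutes/2021-01.pdf", 1, "The council approved the budget"),
            ("minutes/2021-02.pdf", 1, "Discussion of park maintenance"),
        ],
    )
    for path, page in search(db, "budget"):
        print(path, page)
```

A database built this way could then be opened with Datasette to browse and search it, assuming the local SQLite build includes FTS5.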

Simon Willison’s Weblog