Simon Willison’s Weblog

Subscribe

6 items tagged “beautifulsoup”

2023

Migrating out of PostHaven. Amjith Ramanujam decided to migrate his blog content from PostHaven to a Markdown static site. He used shot-scraper (shelled out to from a Python script) to scrape his existing content using a snippet of JavaScript, wrote the content to a SQLite database using sqlite-utils, then used markdownify (new to me, a neat Python package for converting HTML to Markdown via BeautifulSoup) to write the content to disk as Markdown. # 24th May 2023, 7:38 pm

2018

Fast Autocomplete Search for Your Website

Every website deserves a great search engine—but building a search engine can be a lot of work, and hosting it can quickly get expensive.

[... 4159 words]

Fast Autocomplete Search for Your Website (via) I wrote a tutorial for the 24 ways advent calendar on building fast autocomplete search for a website on top of Datasette and SQLite. I built the demo against 24 ways itself—I used wget to recursively fetch all 330 articles as HTML, then wrote code in a Jupyter notebook to extract the raw data from them (with BeautifulSoup) and load them into SQLite using my sqlite-utils Python library. I deployed the resulting database using Datasette, then wrote some vanilla JavaScript to implement autocomplete using fast SQL queries against the Datasette JSON API. # 19th December 2018, 12:26 am

2017

Recovering missing content from the Internet Archive

When I restored my blog last weekend I used the most recent SQL backup of my blog’s database from back in 2010. I thought it had all of my content from before I started my 7 year hiatus, but in watching the 404 logs I started seeing the occasional hit to something that really should have been there but wasn’t. Turns out the SQL backup I was working from was missing some content.

[... 636 words]

2009

How to Make a US County Thematic Map Using Free Tools. This is the trick I’ve been using to generate choropleths at the Guardian for the past year: figure out the preferred colours for a set of data in a Python script and then rewrite an SVG file to colour in the areas. I use ElementTree rather than BeautifulSoup but the technique is exactly the same. The best thing about SVG is that our graphics department can export them directly out of Illustrator, with named layers and paths automatically becoming SVG ID attributes. Bonus tip: sometimes you don’t have to rewrite the SVG XML at all, instead you can generate CSS to colour areas by ID selector and inject it in to the top of the file. # 12th November 2009, 10:49 am

2007

soupselect. My simple extension to BeautifulSoup that allows you to grab elements using CSS selectors; should be useful for parsing microformats. # 28th February 2007, 1:47 pm