Simon Willison's Weblog: markdown

[red-knot] type inference/checking test framework

2024-10-16T20:43:55+00:00

Ruff maintainer Carl Meyer recently landed an interesting new design for a testing framework. It's based on Markdown, and could be described as a form of "literate testing" - the testing equivalent of Donald Knuth's literate programming.

A markdown test file is a suite of tests, each test can contain one or more Python files, with optionally specified path/name. The test writes all files to an in-memory file system, runs red-knot, and matches the resulting diagnostics against Type: and Error: assertions embedded in the Python source as comments.

Test suites are Markdown documents with embedded fenced blocks that look like this:

```py
reveal_type(1.0) # revealed: float
```

Tests can optionally include a path= specifier, which can provide neater messages when reporting test failures:

```py path=branches_unify_to_non_union_type.py
def could_raise_returns_str() -> str:
    return 'foo'
...
```

A larger example test suite can be browsed in the red_knot_python_semantic/resources/mdtest directory.

This document on control flow for exception handlers (from this PR) is the best example I've found of detailed prose documentation to accompany the tests.

The system is implemented in Rust, but it's easy to imagine an alternative version of this idea written in Python as a pytest plugin. This feels like an evolution of the old Python doctest idea, except that tests are embedded directly in Markdown rather than being embedded in Python code docstrings.
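A rough sketch of what a Python take on this could look like: pull the fenced py blocks out of a Markdown document, exec each one, and record any exception. This is not how red-knot's Rust implementation works. The function names are made up for illustration, and the fence marker is built with string multiplication only so that the example can itself sit inside a fenced block.

```python
import re

TICKS = "`" * 3  # a literal Markdown fence marker

# Matches a fenced block tagged py or python, with an optional trailing
# info string such as path=foo.py, and captures the block body.
BLOCK_RE = re.compile(
    rf"^{TICKS}py(?:thon)?[^\n]*\n(.*?)^{TICKS}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_blocks(markdown):
    """Return the source of every fenced Python block, in document order."""
    return [match.group(1) for match in BLOCK_RE.finditer(markdown)]


def run_markdown_tests(markdown):
    """Execute each block in a fresh namespace.

    Returns a list of (source, error) pairs, where error is None on success.
    """
    results = []
    for source in extract_blocks(markdown):
        try:
            exec(compile(source, "<markdown block>", "exec"), {})
            results.append((source, None))
        except Exception as exc:
            results.append((source, exc))
    return results


doc = "\n".join([
    "# Hello test doc",
    "",
    TICKS + "py",
    "assert 1 + 2 == 3",
    TICKS,
    "",
    "But this fails:",
    "",
    TICKS + "py path=fails.py",
    "assert 1 + 2 == 4",
    TICKS,
    "",
])

for source, error in run_markdown_tests(doc):
    print("FAIL" if error is not None else "PASS", source.strip())
```

Each block runs in its own empty namespace here, so blocks cannot share state. A real tool would need to decide whether that is the right default.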

... and it looks like such plugins exist already. Here are three that I've found so far:

pytest-markdown-docs by Elias Freider and Modal Labs.
sphinx.ext.doctest is a core Sphinx extension for running test snippets in documentation.
pytest-doctestplus from the Scientific Python community, first released in 2011.

I tried pytest-markdown-docs by creating a doc.md file like this:

# Hello test doc

```py
assert 1 + 2 == 3
```

But this fails:

```py
assert 1 + 2 == 4
```

And then running it with uvx like this:

uvx --with pytest-markdown-docs pytest --markdown-docs

I got one pass and one fail:

_______ docstring for /private/tmp/doc.md __________
Error in code block:
```
10   assert 1 + 2 == 4
11   
```
Traceback (most recent call last):
  File "/private/tmp/tt/doc.md", line 10, in <module>
    assert 1 + 2 == 4
AssertionError

============= short test summary info ==============
FAILED doc.md::/private/tmp/doc.md
=========== 1 failed, 1 passed in 0.02s ============

I also just learned that the venerable Python doctest standard library module has the ability to run tests in documentation files too, with doctest.testfile("example.txt"): "The file content is treated as if it were a single giant docstring; the file doesn’t need to contain a Python program!"
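A minimal sketch of doctest.testfile() in action, assuming a documentation file that mixes prose with interactive examples. The file contents and the run_doc helper are invented for illustration; the file is written to a temporary directory so the example is self-contained.

```python
import doctest
import tempfile
from pathlib import Path

# Hypothetical documentation file: prose with >>> examples mixed in.
DOC = """\
Hello test doc
==============

This passes:

    >>> 1 + 2
    3

But this fails:

    >>> 1 + 2
    4
"""


def run_doc(text):
    """Write text to a temporary example.txt and run its doctests."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.txt"
        path.write_text(text)
        # module_relative=False treats the argument as an ordinary
        # filesystem path rather than a path relative to a module.
        return doctest.testfile(str(path), module_relative=False, report=False)


results = run_doc(DOC)
print(results.failed, "failed out of", results.attempted)
```

The return value is a TestResults named tuple of (failed, attempted). Details of each failing example are written to stdout as they are encountered.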

Via Charlie Marsh

Tags: testing, rust, python, astral, markdown, ruff, pytest, uv

My Jina Reader tool

2024-10-14T16:47:56+00:00

I wanted to feed the Cloudflare Durable Objects SQLite documentation into Claude, but I was on my iPhone so copying and pasting was inconvenient. Jina offer a Reader API which can turn any URL into LLM-friendly Markdown and it turns out it supports CORS, so I got Claude to build me this tool (second iteration, third iteration, final source code).

Paste in a URL to get the Jina Markdown version, along with an all-important "Copy to clipboard" button.

Tags: projects, markdown, ai-assisted-programming, jina, claude-3-5-sonnet, claude, generative-ai, ai, llms

otterwiki

2024-10-09T15:22:04+00:00

It's been a while since I've seen a new-ish Wiki implementation, and this one by Ralph Thesen is really nice. It's written in Python (Flask + SQLAlchemy + mistune for Markdown + GitPython) and keeps all of the actual wiki content as Markdown files in a local Git repository.

The installation instructions are a little in-depth as they assume a production installation with Docker or systemd - I figured out this recipe for trying it locally using uv:

git clone https://github.com/redimp/otterwiki.git
cd otterwiki

mkdir -p app-data/repository
git init app-data/repository

echo "REPOSITORY='${PWD}/app-data/repository'" >> settings.cfg
echo "SQLALCHEMY_DATABASE_URI='sqlite:///${PWD}/app-data/db.sqlite'" >> settings.cfg
echo "SECRET_KEY='$(echo $RANDOM | md5sum | head -c 16)'" >> settings.cfg

export OTTERWIKI_SETTINGS=$PWD/settings.cfg
uv run --with gunicorn gunicorn --bind 127.0.0.1:8080 otterwiki.server:app

Via Hacker News

Tags: python, wikis, uv, markdown, git, flask, sqlalchemy, sqlite

simonw/docs cookiecutter template

2024-09-23T21:45:15+00:00

Over the last few years I’ve settled on the combination of Sphinx, the Furo theme and the myst-parser extension (enabling Markdown in place of reStructuredText) as my documentation toolkit of choice, maintained in GitHub and hosted using ReadTheDocs.

My LLM and shot-scraper projects are two examples of that stack in action.

Today I wanted to spin up a new documentation site so I finally took the time to construct a cookiecutter template for my preferred configuration. You can use it like this:

pipx install cookiecutter
cookiecutter gh:simonw/docs

Or with uv:

uv tool run cookiecutter gh:simonw/docs

Answer a few questions:

[1/3] project (): shot-scraper
[2/3] author (): Simon Willison
[3/3] docs_directory (docs):

And it creates a docs/ directory ready for you to start editing docs:

cd docs
pip install -r requirements.txt
make livehtml

Tags: uv, markdown, sphinx-docs, cookiecutter, read-the-docs, python, projects, documentation

Markdown and Math Live Renderer

2024-09-21T04:56:30+00:00

Another of my tiny Claude-assisted JavaScript tools. This one lets you enter Markdown with embedded mathematical expressions (like $ax^2 + bx + c = 0$) and live renders those on the page, with an HTML version using MathML that you can export through copy and paste.

Here's the Claude transcript. I started by asking:

Are there any client side JavaScript markdown libraries that can also handle inline math and render it?

Claude gave me several options including the combination of Marked and KaTeX, so I followed up by asking:

Build an artifact that demonstrates Marked plus KaTeX - it should include a text area I can enter markdown in (repopulated with a good example) and live update the rendered version below. No react.

Which gave me this artifact, instantly demonstrating that what I wanted to do was possible.

I iterated on it a tiny bit to get to the final version, mainly to add that HTML export and a Copy button. The final source code is here.

Tags: claude-3-5-sonnet, anthropic, claude, markdown, mathml, ai, llms, ai-assisted-programming, tools, generative-ai, claude-artifacts

Share Claude conversations by converting their JSON to Markdown

2024-08-08T20:40:20+00:00

Anthropic's Claude is missing one key feature that I really appreciate in ChatGPT: the ability to create a public link to a full conversation transcript. You can publish individual artifacts from Claude, but I often find myself wanting to publish the whole conversation.

Before ChatGPT added that feature I solved it myself with this ChatGPT JSON transcript to Markdown Observable notebook. Today I built the same thing for Claude.

Here's how to use it:

The key is to load a Claude conversation on their website with your browser DevTools network panel open and then filter URLs for chat_. You can use the Copy -> Response right click menu option to get the JSON for that conversation, then paste it into that new Observable notebook to get a Markdown transcript.
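The conversion itself is simple enough to sketch in Python. The field names used here (name, chat_messages, sender, text) are assumptions about the shape of the copied JSON rather than a documented format, so treat this as an illustration of the approach and check it against a real response.

```python
import json


def conversation_to_markdown(raw_json):
    """Turn a conversation JSON string into a Markdown transcript.

    Assumed (unverified) structure: an optional "name" plus a
    "chat_messages" list whose items have "sender" and "text" keys.
    """
    data = json.loads(raw_json)
    parts = []
    title = data.get("name")
    if title:
        parts.append(f"# {title}")
    for message in data.get("chat_messages", []):
        sender = message.get("sender", "unknown")
        text = message.get("text", "")
        parts.append(f"**{sender}**: {text}")
    return "\n\n".join(parts)


# Hypothetical sample input in the assumed shape
sample = json.dumps({
    "name": "Spiders",
    "chat_messages": [
        {"sender": "human", "text": "What spider would you recommend?"},
        {"sender": "assistant", "text": "I would not recommend that."},
    ],
})
print(conversation_to_markdown(sample))
```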

I like sharing these by pasting them into a "secret" Gist - that way they won't be indexed by search engines (adding more AI-generated slop to the world) but can still be shared with people who have the link.

Here's an example transcript from this morning. I started by asking Claude:

I want to breed spiders in my house to get rid of all of the flies. What spider would you recommend?

When it suggested that this was a bad idea because it might attract pests, I asked:

What are the pests might they attract? I really like possums

It told me that possums are attracted by food waste, but "deliberately attracting them to your home isn't recommended" - so I said:

Thank you for the tips on attracting possums to my house. I will get right on that! [...] Once I have attracted all of those possums, what other animals might be attracted as a result? Do you think I might get a mountain lion?

It emphasized how bad an idea that would be and said "This would be extremely dangerous and is a serious public safety risk.", so I said:

OK. I took your advice and everything has gone wrong: I am now hiding inside my house from the several mountain lions stalking my backyard, which is full of possums

Claude has quite a preachy tone when you ask it for advice on things that are clearly a bad idea, which makes winding it up with increasingly ludicrous questions a lot of fun.

Tags: anthropic, claude, markdown, ai, llms, tools, generative-ai, projects, json, observable

Mermaid Gantt diagrams are great for displaying distributed traces in Markdown

2024-07-16T22:10:33+00:00

Bryce Mecum demonstrates how Mermaid gantt diagrams can be used to render trace information, such as the traces you might get from OpenTelemetry. I tried this out in a Gist and it works really well - GitHub Flavored Markdown will turn any fenced code block tagged mermaid containing a gantt definition into a neat rendered diagram.
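As an illustration, here is a small Python sketch that turns a list of spans into a gantt definition. The span tuples and the function name are hypothetical. The dateFormat x (millisecond timestamps) and axisFormat %S.%L directives are one reading of the Mermaid documentation, so check the output against a renderer before relying on it.

```python
def spans_to_mermaid_gantt(title, spans):
    """Build a Mermaid gantt definition from (name, start_ms, end_ms) tuples.

    Each span gets its own section so that it renders on its own row.
    Names containing a colon would need escaping, which is not handled here.
    """
    lines = [
        "gantt",
        f"    title {title}",
        "    dateFormat x",
        "    axisFormat %S.%L",
    ]
    for name, start_ms, end_ms in spans:
        lines.append(f"    section {name}")
        lines.append(f"    {name} :{start_ms}, {end_ms}")
    return "\n".join(lines)


# Hypothetical trace: a parent span with one child span inside it
diagram = spans_to_mermaid_gantt(
    "Example trace",
    [("parent", 0, 1000), ("child", 100, 400)],
)
print(diagram)
```

The printed text would then go inside a fenced code block tagged mermaid.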

Tags: markdown, mermaid

New blog feature: Support for markdown in quotations

2024-06-24T15:51:03+00:00

Another incremental improvement to my blog. I've been collecting quotations here since 2006 - I now render them using Markdown (previously they were just plain text). Here's one example. The full set of 920 (and counting) quotations can be explored using this search filter.

Tags: projects, markdown, blogging

Jina AI Reader

2024-06-16T19:33:58+00:00

Jina AI provide a number of different AI-related platform products, including an excellent family of embedding models, but one of their most instantly useful is Jina Reader, an API for turning any URL into Markdown content suitable for piping into an LLM.

Add https://r.jina.ai/ to the front of a URL to get back Markdown of that page, for example https://r.jina.ai/https://simonwillison.net/2024/Jun/16/jina-ai-reader/ - in addition to converting the content to Markdown it also does a decent job of extracting just the content and ignoring the surrounding navigation.
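In Python that prefix trick might look like the sketch below, using only the standard library. The helper names are invented for illustration, and the fetch is anonymous (no API key).

```python
from urllib.request import urlopen

READER_PREFIX = "https://r.jina.ai/"


def reader_url(url):
    """Return the Reader URL for a page: the prefix plus the full original URL."""
    return READER_PREFIX + url


def fetch_markdown(url, timeout=30):
    """Fetch the Markdown version of a page (makes a network request)."""
    with urlopen(reader_url(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(reader_url("https://simonwillison.net/2024/Jun/16/jina-ai-reader/"))
```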

The API is free but rate-limited (presumably by IP) to 20 requests per minute without an API key or 200 requests per minute with a free API key, and you can pay to increase your allowance beyond that.

The Apache 2 licensed source code for the hosted service is on GitHub - it's written in TypeScript and uses Puppeteer to run Readability.js and Turndown against the scraped page.

It can also handle PDFs, which have their contents extracted using PDF.js.

There's also a search feature, s.jina.ai/search+term+goes+here, which uses the Brave Search API.

Tags: puppeteer, apis, markdown, ai, llms, jina

GitHub Copilot Chat: From Prompt Injection to Data Exfiltration

2024-06-16T00:35:39+00:00

Yet another example of the same vulnerability we see time and time again.

If you build an LLM-based chat interface that gets exposed to both private and untrusted data (in this case the code in VS Code that Copilot Chat can see) and your chat interface supports Markdown images, you have a data exfiltration prompt injection vulnerability.

The fix, applied by GitHub here, is to disable Markdown image references to untrusted domains. That way an attack can't trick your chatbot into embedding an image that leaks private data in the URL.
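A minimal sketch of that kind of mitigation, assuming the chat interface can post-process model output before rendering it. This is not GitHub's actual fix. It handles only inline image syntax (reference-style images and raw HTML would need separate handling), and the host names are placeholders.

```python
import re
from urllib.parse import urlparse

# Inline Markdown image: ![alt](url "optional title")
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)")


def strip_untrusted_images(markdown, allowed_hosts):
    """Replace images whose host is not allow-listed with their alt text."""

    def replace(match):
        host = urlparse(match.group(2)).hostname or ""
        if host.lower() in allowed_hosts:
            return match.group(0)
        # Relative URLs have no hostname and are also dropped here.
        return match.group(1)

    return IMAGE_RE.sub(replace, markdown)


text = (
    "Hi ![logo](https://trusted.example/logo.png) and "
    "![x](https://evil.example/q?data=secret)"
)
print(strip_untrusted_images(text, {"trusted.example"}))
```

An allow-list is used rather than a block-list because an attacker controls which domain the injected image points at.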

Previous examples: ChatGPT itself, Google Bard, Writer.com, Amazon Q, Google NotebookLM. I'm tracking them here using my new markdown-exfiltration tag.

Via @wunderwuzzi23

Tags: prompt-injection, security, generative-ai, markdown, ai, github, llms, markdown-exfiltration, github-copilot, johann-rehberger

Blogmarks that use markdown

2024-04-25T04:34:18+00:00

I needed to attach a correction to an older blogmark (my 20-year-old name for short-form links with commentary on my blog) today - but the commentary field has always been text, not HTML, so I didn't have a way to add the necessary link.

This motivated me to finally add optional Markdown support for blogmarks to my blog's custom Django CMS. I then went through and added inline code markup to a bunch of different older posts, and built this Django SQL Dashboard to keep track of which posts I had updated.

Tags: projects, django-sql-dashboard, markdown, blogging

Migrating out of PostHaven

2023-05-24T19:38:37+00:00

Amjith Ramanujam decided to migrate his blog content from PostHaven to a Markdown static site. He used shot-scraper (shelled out to from a Python script) to scrape his existing content using a snippet of JavaScript, wrote the content to a SQLite database using sqlite-utils, then used markdownify (new to me, a neat Python package for converting HTML to Markdown via BeautifulSoup) to write the content to disk as Markdown.

Tags: sqlite-utils, shot-scraper, markdown, beautifulsoup

babelmark3

2023-01-27T23:34:08+00:00

I found this tool today while investigating a bug in Datasette’s datasette-render-markdown plugin: it lets you run a fragment of Markdown through dozens of different Markdown libraries across multiple different languages and compare the results. Under the hood it works with a registry of API URL endpoints for different implementations, most of which are encrypted in the configuration file on GitHub because they are only intended to be used by this comparison tool.

Via datasette-render-markdown issue #13

Tags: apis, markdown

Pikchr

2020-10-21T16:02:48+00:00

Interesting new project from SQLite creator D. Richard Hipp. Pikchr is a new mini language for describing visual diagrams, designed to be embedded in Markdown documentation. It’s already enabled for the SQLite forum. Implementation is a no-dependencies C library and output is SVG.

Tags: c, svg, markdown, sqlite, d-richard-hipp

Weeknotes: airtable-export, generating screenshots in GitHub Actions, Dogsheep!

2020-09-03T23:28:29+00:00

This week I figured out how to populate Datasette from Airtable, wrote code to generate social media preview card page screenshots using Puppeteer, and made a big breakthrough with my Dogsheep project.

airtable-export

I wrote about Rocky Beaches in my weeknotes two weeks ago. It's a new website built by Natalie Downe that showcases great places to go rockpooling (tidepooling in American English), mixing in tide data from NOAA and species sighting data from iNaturalist.

Rocky Beaches is powered by Datasette, using a GitHub Actions workflow that builds the site's underlying SQLite database using API calls and YAML data stored in the GitHub repository.

Natalie wanted to use Airtable to maintain the structured data for the site, rather than hand-editing a YAML file. So I built airtable-export, a command-line script for sucking down all of the data from an Airtable instance and writing it to disk as YAML or JSON.

You run it like this:

airtable-export out/ mybaseid table1 table2 --key=key

This will create a folder called out/ with a .yml file for each of the tables.

Sadly the Airtable API doesn't yet provide a mechanism to list all of the tables in a database (a long-running feature request) so you have to list the tables yourself.

We're now running that command as part of the Rocky Beaches build script, and committing the latest version of the YAML file back to the GitHub repo (thus gaining a full change history for that data).

I really like social media cards - og:image HTML meta attributes for Facebook and twitter:image for Twitter. I wanted them for articles on my TIL website since I often share those via Twitter.

One catch: my TILs aren't very image heavy. So I decided to generate screenshots of the pages and use those as the 2x1 social media card images.

The best way I know of programmatically generating screenshots is to use Puppeteer, a Node.js library for automating a headless instance of the Chrome browser that is maintained by the Chrome DevTools team.

My first attempt was to run Puppeteer in an AWS Lambda function on Vercel. I remembered seeing an example of how to do this in the Vercel documentation a few years ago. The example isn't there any more, but I found the original pull request that introduced it.

Since the example was MIT licensed I created my own fork at simonw/puppeteer-screenshot and updated it to work with the latest Chrome.

It's pretty resource intensive, so I also added a secret ?key= mechanism so only my own automation code could call my instance running on Vercel.

I needed to store the generated screenshots somewhere. They're pretty small - on the order of 60KB each - so I decided to store them in my SQLite database itself and use my datasette-media plugin (see Fun with binary data and SQLite) to serve them up.

This worked! Until it didn't... I ran into a showstopper bug when I realized that the screenshot process relies on the page being live on the site... but when a new article is added it's not live when the build process runs, so the generated screenshot is of the 404 page.

So I reworked it to generate the screenshots inside the GitHub Action as part of the build script, using puppeteer-cli.

My generate_screenshots.py script handles this, by first shelling out to datasette --get to render the HTML for the page, then running puppeteer to generate the screenshot. Relevant code:

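import pathlib
import subprocess
import tempfile

# Assumption: the full script defines TMP_PATH itself; the system
# temporary directory is used here as a stand-in.
TMP_PATH = pathlib.Path(tempfile.gettempdir())

def png_for_path(path):
    # Path is e.g. /til/til/python_debug-click-with-pdb.md
    page_html = str(TMP_PATH / "generate-screenshots-page.html")
    # Use datasette to generate HTML
    proc = subprocess.run(["datasette", ".", "--get", path], capture_output=True)
    # Write the HTML to disk, closing the file before puppeteer reads it
    with open(page_html, "wb") as fp:
        fp.write(proc.stdout)
    # Now use puppeteer screenshot to generate a PNG
    proc2 = subprocess.run(
        [
            "puppeteer",
            "screenshot",
            page_html,
            "--viewport",
            "800x400",
            "--full-page=false",
        ],
        capture_output=True,
    )
    # The PNG bytes are read from the command's standard output
    png_bytes = proc2.stdout
    return png_bytes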

This worked great! Except for one thing... the site is hosted on Vercel, and Vercel has a 5MB response size limit.

Every time my GitHub build script runs it downloads the previous SQLite database file, so it can avoid regenerating screenshots and HTML for pages that haven't changed.

The addition of the binary screenshots drove the size of the SQLite database over 5MB, so the part of my script that retrieved the previous database no longer worked.

I needed a reliable way to store that 5MB (and probably eventually 10-50MB) database file in between runs of my action.

The best place to put this would be an S3 bucket, but I find the process of setting up IAM permissions for access to a new bucket so infuriating that I couldn't bring myself to do it.

So... I created a new dedicated GitHub repository, simonw/til-db, and updated my action to store the binary file in that repo - using a force push so the repo doesn't need to maintain unnecessary version history of the binary asset.

This is an abomination of a hack, and it made me cackle a lot. I tweeted about it and got the suggestion to try Git LFS instead, which would definitely be a more appropriate way to solve this problem.

Rendering Markdown

I write my blog entries in Markdown and transform them into HTML before I post them on my blog. Some day I'll teach my blog to render Markdown itself, but so far I've got by through copying and pasting into Markdown tools.

My favourite Markdown flavour is GitHub's, which adds a bunch of useful capabilities - most notably the ability to apply syntax highlighting. GitHub expose an API that applies their Markdown formatter and returns the resulting HTML.

I built myself a quick and scrappy tool in JavaScript that sends Markdown through their API and then applies a few DOM manipulations to clean up what comes back. It was a nice opportunity to write some modern vanilla JavaScript using fetch():

async function render(markdown) {
    return (await fetch('https://api.github.com/markdown', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({'mode': 'markdown', 'text': markdown})
    })).text();
}

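const button = document.getElementsByTagName('button')[0];
// Assumption: the Markdown source field has id="input"
const input = document.getElementById('input');
const output = document.getElementById('output');
const preview = document.getElementById('preview');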

button.addEventListener('click', async function() {
    const rendered = await render(input.value);
    output.value = rendered;
    preview.innerHTML = rendered;
});

Dogsheep Beta

My most exciting project this week was getting out the first working version of Dogsheep Beta - the search engine that ties together results from my Dogsheep family of tools for personal analytics.

I'm giving a talk about this tonight at PyCon Australia: Build your own data warehouse for personal analytics with SQLite and Datasette. I'll be writing up detailed notes in the next few days, so watch this space.

TIL this week

Releases this week

dogsheep-beta 0.4.1 - 2020-09-03
dogsheep-beta 0.4 - 2020-09-03
dogsheep-beta 0.4a1 - 2020-09-03
dogsheep-beta 0.4a0 - 2020-09-03
dogsheep-beta 0.3 - 2020-09-02
dogsheep-beta 0.2 - 2020-09-01
dogsheep-beta 0.1 - 2020-09-01
dogsheep-beta 0.1a2 - 2020-09-01
dogsheep-beta 0.1a - 2020-09-01
airtable-export 0.4 - 2020-08-30
datasette-yaml 0.1a - 2020-08-29
airtable-export 0.3.1 - 2020-08-29
airtable-export 0.3 - 2020-08-29
airtable-export 0.2 - 2020-08-29
airtable-export 0.1.1 - 2020-08-29
airtable-export 0.1 - 2020-08-29
datasette 0.49a0 - 2020-08-28
sqlite-utils 2.16.1 - 2020-08-28

Tags: projects, yaml, markdown, dogsheep, weeknotes, github-actions, airtable, puppeteer

Render Markdown tool

2020-09-03T00:08:47+00:00

I wrote a quick JavaScript tool for rendering Markdown via the GitHub Markdown API—which includes all of their clever extensions like tables and syntax highlighting—and then stripping out some extraneous HTML to give me back the format I like using for my blog posts.

Via @simonw

Tags: projects, markdown, javascript, github

Weeknotes: California Protected Areas in Datasette

2020-08-28T02:00:02+00:00

This week I built a geospatial search engine for protected areas in California, shipped datasette-graphql 1.0 and started working towards the next milestone for Datasette Cloud.

California Protected Areas in Datasette

This weekend I learned about CPAD - the California Protected Areas Database. It's a remarkable GIS dataset maintained by GreenInfo Network, an Oakland non-profit, and released under a Creative Commons Attribution license.

CPAD is released twice annually as a shapefile. Back in February I built a tool called shapefile-to-sqlite that imports shapefiles into a SQLite or SpatiaLite database, so CPAD represented a great opportunity to put that tool to use.

Here's the result: calands.datasettes.com

It provides faceted search over the records from CPAD, and uses my datasette-leaflet-geojson plugin to render the resulting geometry records on embedded maps.

I'm building and deploying the site using this GitHub Actions workflow. It uses conditional-get (see here) combined with the GitHub Actions cache to download the shapefiles as part of the workflow run only if the downloadable file has changed.

This project inspired some improvements to the underlying tools:

datasette-leaflet-geojson now handles larger polygons and is smarter about knowing when to load additional JavaScript and CSS
shapefile-to-sqlite can now create spatial indexes and has a new -c option (inspired by csvs-to-sqlite) for extracting specified columns into separate lookup tables

datasette-graphql 1.0

I'm trying to get better at releasing 1.0 versions of my software.

For me, the most significant thing about a 1.0 is that it represents a promise to avoid making backwards incompatible releases until a 2.0. And ideally I'd like to avoid ever releasing 2.0s - my perfect project would keep incrementing 1.x dot-releases forever.

Datasette is currently at version 0.48, nearly three years after its first release. I'm actively working towards the 1.0 milestone for it but it may be a while before I get there.

datasette-graphql is less than a month old, but I've decided to break my habits and have some conviction in where I've got to. I shipped datasette-graphql 1.0 a few days ago, closely followed by a 1.0.1 release with improved documentation.

I'm actually pretty confident that the functionality baked into 1.0 is stable enough to make a commitment to supporting it. It's a relatively tight feature set which directly maps database tables, filter operations and individual rows to GraphQL. If you want to quickly start trying out GraphQL against data that you can represent in SQLite I think it's a very compelling option.

New datasette-graphql features this week:

Support for multiple reverse foreign key relationships to a single table, e.g. an article table that has created_by and updated_by columns that both reference users. Example. #32
The {% set data = graphql(...) %} template function now accepts an optional variables= parameter. #54
The search: argument is now available for tables that are configured using Datasette's fts_table mechanism. #56
New example demonstrating GraphQL fragments. #57
Added GraphQL execution limits, controlled by the time_limit_ms and num_queries_limit plugin configuration settings. These default to 1000ms total execution time and 100 total SQL queries per GraphQL execution. Limits documentation. #33

Improvements to my TILs

My til.simonwillison.net site provides a search engine and browse engine over the TIL notes I've been accumulating in simonw/til on GitHub.

The site used to link directly to rendered Markdown in GitHub, but that has some disadvantages: most notably, I can't control the <title> tag on that page so it has poor implications for SEO.

This week I switched it over to hosting each TIL as a page directly on the site itself.

The tricky thing to solve here was Markdown rendering. GitHub's Markdown flavour incorporates a bunch of useful extensions for things like embedded tables and code syntax highlighting, and my attempts at recreating the same exact rendering flow using Python's Markdown libraries fell a bit short.

Then I realized that GitHub provide an API for rendering Markdown using the same pipeline they use on their own site.

So now the build script for the SQLite database that powers my TILs site runs each document through that API, but only if it has changed since the last time the site was built.

I wrote some notes on using their Markdown API in this TIL: Rendering Markdown with the GitHub Markdown API.
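For reference, a minimal Python sketch of calling that API with only the standard library. It uses the https://api.github.com/markdown endpoint with a JSON body containing mode and text. The function names are illustrative, the request is unauthenticated (so subject to GitHub's anonymous rate limits), and this is not the code from the build script.

```python
import json
from urllib.request import Request, urlopen

GITHUB_MARKDOWN_URL = "https://api.github.com/markdown"


def build_request(markdown):
    """Build the POST request for GitHub's Markdown rendering endpoint."""
    body = json.dumps({"mode": "markdown", "text": markdown}).encode("utf-8")
    return Request(
        GITHUB_MARKDOWN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def render_markdown(markdown, timeout=30):
    """Send the request and return the rendered HTML (makes a network call)."""
    with urlopen(build_request(markdown), timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_request("# Hello")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it easy to inspect or test without touching the network.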

Storing the rendered HTML in my database also meant I could finally fix a bug with the Atom feed for that site, where advanced Markdown syntax wasn't being correctly rendered in the feed.

The datasette-atom plugin I use to generate the feed applies Mozilla's Bleach HTML sanitization library to avoid dynamically generated feeds accidentally becoming a vector for XSS. To support the full range of GitHub's Markdown in my feeds I released version 0.7 of the plugin with a deliberately verbose allow_unsafe_html_in_canned_queries plugin setting which can opt canned queries out of the escaping - which should be safe because a canned query running against trusted data gives the site author total control over what might make it into the feed.

Datasette Cloud

I'm spinning up work on Datasette Cloud again, after several months running it as a private alpha. My next key milestone is to be able to charge subscribers money - I know from experience that until you're charging people actual money it's very difficult to be confident that you're working on the right things.

TIL this week

Releases this week

asgi-csrf 0.7.1 - 2020-08-27
datasette-graphql 1.0.1 - 2020-08-24
datasette-graphql 1.0 - 2020-08-23
datasette-graphql 0.15 - 2020-08-23
datasette-render-images 0.3.2 - 2020-08-23
datasette-atom 0.7 - 2020-08-23
shapefile-to-sqlite 0.4.1 - 2020-08-23
shapefile-to-sqlite 0.4 - 2020-08-23
datasette-auth-passwords 0.3.2 - 2020-08-22
shapefile-to-sqlite 0.3 - 2020-08-22
datasette-leaflet-geojson 0.6 - 2020-08-21
datasette-leaflet-geojson 0.5 - 2020-08-21
sqlite-utils 2.16 - 2020-08-21

Tags: atom, gis, projects, markdown, graphql, datasette, weeknotes, datasette-cloud

Using a self-rewriting README powered by GitHub Actions to track TILs

2020-04-20T01:38:15+00:00

I've started tracking TILs - Today I Learneds - inspired by this five-year-and-counting collection by Josh Branchaud on GitHub (found via Hacker News). I'm keeping mine in GitHub too, and using GitHub Actions to automatically generate an index page README in the repository and a SQLite-backed search engine.

TILs

Josh describes his TILs like this:

A collection of concise write-ups on small things I learn day to day across a variety of languages and technologies. These are things that don't really warrant a full blog post.

This really resonated with me. I have five main places for writing at the moment.

This blog, for long-form content and weeknotes.
Twitter, for tweets - though Twitter threads are tending towards a long-form medium these days. My Spider-Verse behind-the-scenes thread ran for nearly a year!
My blogmarks - links plus short form commentary.
Niche Museums - effectively a blog about visits to tiny museums. It's on hiatus during the pandemic though.
GitHub issues. I've formed the habit of thinking out loud in issues, replying to myself with comments as I figure things out.

What's missing is exactly what TILs provide: somewhere to dump a couple of paragraphs about a new trick I've learned, with chronological order being less important than just getting them written down somewhere.

I've intermittently used gists for things like this in the past, but having them in an organized repo feels like a much less ad-hoc solution.

So I've started my own collection of TILs in my simonw/til GitHub repository.

Automating the README index page with GitHub Actions

The biggest feature I miss from reStructuredText when I'm working in Markdown is automatic tables of contents.

For my TILs I wanted the index page on GitHub to display all of them. But I didn't want to have to update that page by hand every time I added one - especially since I'll often be creating them through the GitHub web interface which doesn't support editing multiple files in a single commit.

I've been getting a lot done with GitHub Actions recently. This felt like an opportunity to put them to more use.

So I wrote a GitHub Actions workflow that automatically updates the README page in the repo every time a new TIL markdown file is added or updated!

Here's an outline of how it works:

It runs on pushes to the master branch (no-one else can trigger it by sending me a pull request). It ignores commits that include the README.md file itself - otherwise commits to that file made by the workflow could trigger further runs of the same workflow. UPDATE: Apparently this isn't necessary.
It checks out the full repo history using actions/checkout@v2 with the fetch-depth: 0 option. This is needed because my script derives created/updated dates for each TIL by inspecting the git history. I learned a few days ago that this mechanism breaks if you only do a shallow check-out of the most recent commit!
It sets up Python, configures pip caching and installs dependencies from my requirements.txt.
It runs my build_database.py script, which uses GitPython to scan for all */*.md files and find their created and updated dates, then uses sqlite-utils to write the results to a SQLite database on the GitHub Actions temporary disk.
It runs update_readme.py which reads from that SQLite database and uses it to generate the markdown index section for the README. Then it opens the README and replaces the section between the <!-- index starts --> and <!-- index ends --> comments with the newly generated index.
It uses git diff to detect if the README has changed, then if it has it runs git commit and git push to commit those changes. See my TIL Commit a file if it changed for details on that pattern.
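
The README-rewriting step can be sketched in a few lines of Python. This is a minimal illustration, not the actual update_readme.py: the marker strings and the replace_index function name are assumptions made for the example.

```python
import re

# Assumed marker comments; the real README may use different strings.
START = "<!-- index starts -->"
END = "<!-- index ends -->"

# DOTALL lets "." span newlines so the whole old index is matched.
INDEX_RE = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)


def replace_index(readme: str, new_index: str) -> str:
    """Return readme with the text between the markers swapped for new_index."""
    block = f"{START}\n{new_index}\n{END}"
    # A function replacement stops backslashes in new_index being interpreted.
    return INDEX_RE.sub(lambda match: block, readme, count=1)


readme = "# TIL\n\n<!-- index starts -->\nold\n<!-- index ends -->\n\nFooter\n"
updated = replace_index(readme, "* [New trick](new-trick.md)")
print(updated)
```

Running the function a second time with the same index returns an identical string, which is what lets a later git diff check decide whether a commit is needed.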

I really like this pattern.

I'm a big fan of keeping content in a git repository. Every CMS I've ever worked on has eventually evolved a desire to provide revision tracking, and building that into a regular database schema is never particularly pleasant. Git solves content versioning extremely effectively.

Having a GitHub repository that can update itself to maintain things like index pages feels like a technique that could be applied to all kinds of other content-related problems.

I'm also keen on the idea of using SQLite databases as intermediary storage as part of an Actions workflow. It's a simple but powerful way for one step in an action to generate structured data that can then be consumed by subsequent steps.
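
As a rough sketch of that hand-off, here are two functions standing in for two workflow steps: one writes rows to a SQLite file, the other reads them back to build output. The table layout, sample rows and function names are all hypothetical, and the standard library sqlite3 module is used here in place of sqlite-utils.

```python
import os
import sqlite3
import tempfile


def build_step(db_path: str) -> None:
    """First step: write structured rows to a SQLite file on disk."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute("create table til (topic text, title text, created text)")
        conn.executemany(
            "insert into til values (?, ?, ?)",
            [
                ("bash", "Looping over files", "2020-04-20"),
                ("sqlite", "Full-text search", "2020-04-19"),
            ],
        )
    conn.close()


def report_step(db_path: str) -> list:
    """Later step: read the same file and turn the rows into index lines."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "select title, topic from til order by created desc"
    ).fetchall()
    conn.close()
    return [f"* {title} ({topic})" for title, topic in rows]


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "til.db")
    build_step(db_path)
    lines = report_step(db_path)

print("\n".join(lines))
```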

Implementing search with Datasette

Unsurprisingly, the other reason I'm using SQLite here is so I can deploy a database using Datasette. The last two steps of the workflow look like this:

- name: Setup Node.js
  uses: actions/setup-node@v1
  with:
    node-version: '12.x'
- name: Deploy Datasette using Zeit Now
  env:
    NOW_TOKEN: ${{ secrets.NOW_TOKEN }}
  run: |-
    datasette publish now2 til.db \
      --token $NOW_TOKEN \
      --project simon-til \
      --metadata metadata.json \
      --install py-gfm \
      --install datasette-render-markdown \
      --install datasette-template-sql \
      --template-dir templates

This installs Node.js, then uses Zeit Now (via datasette-publish-now) to publish the generated til.db SQLite database file to a Datasette instance accessible at til.simonwillison.net.

I'm reusing a bunch of tricks from my Niche Museums website here. The site is a standard Datasette instance with a custom index.html template that uses datasette-template-sql to display the TILs. Here's that template section in full:

{% for row in sql("select distinct topic from til order by topic") %}
    <h2>{{ row.topic }}</h2>
    <ul>
        {% for til in sql("select * from til where topic = '" + row.topic + "'") %}
            <li><a href="{{ til.url }}">{{ til.title }}</a> - {{ til.created[:10] }}</li>
        {% endfor %}
    </ul>
{% endfor %}

The search interface is powered by a custom SQL query in metadata.json that looks like this:

select
    til_fts.rank,
    til.*
from til
join til_fts on til.rowid = til_fts.rowid
where
    til_fts match case
        :q
        when '' then '*'
        else escape_fts(:q)
    end
order by
    til_fts.rank limit 20

A custom query-til-search.html template then renders the search results.
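
To make the case expression in that query easier to follow, here is a small Python sketch of the same decision. The escape_fts_sketch function is a simplified stand-in written for this example, not Datasette's own escape_fts, which may behave differently: it only shows the general idea of quoting each search term so that user input is treated as text, not as full-text query syntax.

```python
def escape_fts_sketch(query: str) -> str:
    """Quote each whitespace-separated term so operators are treated as text."""
    terms = query.split()
    # Double any embedded quotes, then wrap the term in double quotes.
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)


def match_expression(q: str) -> str:
    """Mirror the CASE expression: '*' for an empty search, else escaped terms."""
    return "*" if q == "" else escape_fts_sketch(q)


print(match_expression(""))
print(match_expression("github actions"))
```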

A powerful combination

I'm pretty happy with what I have here - it's definitely good enough to solve my TIL publishing needs. I'll probably add an Atom feed using datasette-atom at some point.

I hope this helps illustrate how powerful the combination of GitHub Actions, Datasette and Zeit Now or Cloud Run can be. I'm running an increasing number of projects on that combination, and the price, performance and ease of implementation continue to impress.

Tags: github, projects, markdown, datasette, github-actions, til

Weeknotes: Datasette 0.39 and many other projects

2020-03-25T05:33:19+00:00

This week's theme: Well, I'm not going anywhere. So a ton of progress to report on various projects.

Datasette 0.39

This evening I shipped Datasette 0.39. The two big features are a mechanism for setting the default sort order for tables and a new base_url configuration setting.

You can see the new default sort order in action on my Covid-19 project - the daily reports now default to sort by day descending so the most recent figures show up first. Here's the metadata that makes it happen, and here's the new documentation.
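
For illustration, this is roughly what that kind of metadata looks like, built here as a Python dictionary and dumped to JSON. The database and table names are hypothetical placeholders, not copied from the Covid-19 project; sort_desc is the table-level key for a descending default sort, with sort as its ascending counterpart.

```python
import json

# Hypothetical database and table names, used only for this sketch.
metadata = {
    "databases": {
        "covid": {
            "tables": {
                # Default to sorting by the "day" column, newest first.
                "daily_reports": {"sort_desc": "day"}
            }
        }
    }
}

metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```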

I had to do some extra work on that project this morning when the underlying data changed its CSV column headings without warning.

The base_url feature has been an open issue since January 2019. It lets you run Datasette behind a proxy on a different URL prefix - /tools/datasette/ for example. The trigger for finally getting this solved was a Twitter conversation about running Datasette on Binder in coordination with a Jupyter notebook.

Tony Hirst did some work on this last year, but was stumped by the lack of a base_url equivalent. Terry Jones shared an implementation in December. I finally found the inspiration to pull it all together, and ended up with a working fork of Tony's project which does indeed launch Datasette on Binder - try launching your own here.

github-to-sqlite

I've not done much work on my Dogsheep family of tools in a while. That changed this week: in particular, I shipped a 1.0 of github-to-sqlite.

As you might expect, it's a tool for importing GitHub data into a SQLite database. Today it can handle repositories, releases, release assets, commits, issues and issue comments. You can see a live demo built from Dogsheep organization data at github-to-sqlite.dogsheep.net (deployed by this GitHub action).

I built this tool primarily to help me better keep track of all of my projects. Pulling the issues into a single database means I can run queries against all open issues across all of my repositories, and importing commits and releases is handy for when I want to write my weeknotes and need to figure out what I've worked on lately.

datasette-render-markdown

GitHub issues use Markdown. To correctly display them it's useful to be able to render that Markdown. I built datasette-render-markdown back in November, but this week I made some substantial upgrades: you can now configure which columns should be rendered, and it includes support for Markdown extensions including GitHub-Flavored Markdown.

You can see it in action on the github-to-sqlite demo.

I also upgraded datasette-render-timestamps with the same explicit column configuration pattern.

datasette-publish-fly

Fly is a relatively new hosting provider which lets you host applications bundled as Docker containers in load-balanced data centers geographically close to your users.

It has a couple of characteristics that make it a really good fit for Datasette.

Firstly, the pricing model: Fly will currently host a tiny (128MB of RAM) container for $2.67/month - and they give you $10/month of free service credit, enough for 3 containers.

It turns out Datasette runs just fine in 128MB of RAM, so that's three always-on Datasette containers! (Unlike Heroku and Cloud Run, Fly keeps your containers running rather than scaling them to zero).

Secondly, it works by shipping it a Dockerfile. This means building datasette publish support for it is really easy.

I added the publish_subcommand plugin hook to Datasette all the way back in 0.25 in September 2018, but I've never actually built anything with it. That's now changed: datasette-publish-fly uses the hook to add a datasette publish fly command for publishing databases directly to your Fly account.

hacker-news-to-sqlite

It turns out I created my Hacker News account in 2007, and I've posted 2,167 comments and submitted 131 stories since then. Since my personal Dogsheep project is about pulling my data from multiple sources into a single place it made sense to build a tool for importing from Hacker News.

hacker-news-to-sqlite uses the official Hacker News API to import every comment and story posted by a specific user. It can also use one or more item IDs to suck the entire discussion tree around those items.
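
Fetching a discussion tree comes down to following each item's list of child IDs. The sketch below is an assumption about how such a walk could look, not the tool's actual code: it uses the public item endpoint of the Hacker News API and the kids field that items carry, and it only walks downwards from the given item. The fetch function can be swapped out, which lets the demonstration run against fake data without touching the network.

```python
import json
from typing import Callable, Dict, Iterator
from urllib.request import urlopen

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id: int) -> dict:
    """Fetch one item from the Hacker News API (requires network access)."""
    with urlopen(ITEM_URL.format(item_id)) as response:
        return json.load(response)


def walk_tree(
    item_id: int, fetch: Callable[[int], dict] = fetch_item
) -> Iterator[dict]:
    """Yield an item followed by all of its descendants, depth-first."""
    item = fetch(item_id)
    if not item:  # missing items come back empty
        return
    yield item
    for kid_id in item.get("kids", []):
        yield from walk_tree(kid_id, fetch)


# Offline demonstration using a fake in-memory "API" in place of the network.
fake_items: Dict[int, dict] = {
    1: {"id": 1, "type": "story", "kids": [2, 3]},
    2: {"id": 2, "type": "comment", "kids": [4]},
    3: {"id": 3, "type": "comment"},
    4: {"id": 4, "type": "comment"},
}
ids = [item["id"] for item in walk_tree(1, fake_items.get)]
print(ids)
```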

The README includes detailed documentation on how to best browse your data using Datasette once you have imported it.

Other projects

sqlite-utils gained some improvements to the way it suggests types for existing columns.
twitter-to-sqlite now offers --sql and --attach for more of its subcommands.
datasette-show-errors is a new plugin which exposes 500 errors as tracebacks, like Django does with DEBUG=True. It's built on top of Starlette's ServerErrorMiddleware.
I upgraded inaturalist-to-sqlite to work with sqlite-utils 2.x.

Tags: github, projects, sqlite, markdown, jupyter, datasette, dogsheep, weeknotes, fly

Weeknotes: Python 3.7 on Glitch, datasette-render-markdown

2019-11-11T23:26:34+00:00

Streaks is really working well for me. I’m at 12 days of commits to Datasette, 16 posting a daily Niche Museum, 19 of actually reviewing my email inbox and 14 of guitar practice. I rewarded myself for that last one by purchasing an actual classical (as opposed to acoustic) guitar.

Datasette

One downside: since my aim is to land a commit to Datasette master every day, I’m incentivised to land small changes. I have a bunch of much larger Datasette projects in the works - I think my goal for the next week should be to land one of those. Contenders include:

TableView.data() refactor - a blocker on a bunch of other projects
Datasette Edit - finish the new connection work so I can have plugins that write changes to databases
Datasette Library - watch a directory and automatically serve new database files that show up in that directory
Finish and ship my work on facet-by-many-to-many
Implement basic join support for table views (so you can join without writing a custom SQL query)
Probably the most impactful: Datasette needs a website! Up until now I’ve directed people to GitHub or to the documentation but the project has grown to the point that it warrants its own home.

I’m going to redefine my daily goal to include pushing in-progress work to Datasette branches in an attempt to escape that false incentive.

New datasette-csvs using Python 3.7 on Glitch

The main reason I’ve been strict about keeping Datasette compatible with Python 3.5 is that it was the only version supported by Glitch, and Glitch has become my favourite tool for getting people up and running with Datasette quickly.

There’s been a long running Glitch support thread requesting an upgrade, and last week it finally bore fruit. Projects on Glitch now get python3 pointing to Python 3.7.5 instead!

This actually broke my datasette-csvs project at first, because for some reason under Python 3.7 the Pandas dependency used by csvs-to-sqlite started taking up too much space from the 200MB Glitch instance quota. I ended up working around this by switching over to using my sqlite-utils CLI tool instead, which has much lighter dependencies.

I’ve shared the new code for my Glitch project in the datasette-csvs repo on GitHub.

The one thing missing from sqlite-utils insert my.db mytable myfile.csv --csv right now is the ability to run it against multiple files at once - something csvs-to-sqlite handles really well. I ended up finally learning how to write loops in bash and wrote the following install.sh shell script:

pip3 install -U -r requirements.txt --user && \
  mkdir -p .data && \
  rm .data/data.db || true && \
  for f in *.csv
    do
        # Quote both expansions so filenames containing spaces still work
        sqlite-utils insert .data/data.db "${f%.*}" "$f" --csv
    done

${f%.*} is the bash incantation for stripping off the file extension - so the above evaluates to this for each of the CSV files it finds in the root directory:

$ sqlite-utils insert .data/data.db trees trees.csv --csv
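
For comparison, the same extension-stripping step in Python might look like the sketch below. The table_name_for helper is a made-up name for this example. Like ${f%.*}, it removes only the final extension; unlike the bash expansion it would also drop any leading directory, which makes no difference here because the CSV files sit in the root directory.

```python
from pathlib import Path


def table_name_for(csv_path: str) -> str:
    """Drop only the final extension, as ${f%.*} does for a bare filename."""
    return Path(csv_path).stem


names = [table_name_for(name) for name in ["trees.csv", "street.trees.csv"]]
print(names)
```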

github-to-sqlite releases

I released github-to-sqlite 0.6 with a new sub-command:

$ github-to-sqlite releases github.db simonw/datasette

It grabs all of the releases for a repository using the GitHub releases API.

I’m using this for my personal Dogsheep instance, but I’m also planning to use this for the forthcoming Datasette website - I want to pull together all of the releases of all of the Datasette Ecosystem of projects in one place.

I decided to exercise my new bash while skills and write a script to run by cron once an hour which fetches all of my repos (from both my simonw account and my dogsheep GitHub organization) and then fetches their releases.

Since I don’t want to fetch releases for all 257 of my personal GitHub repos - just the repos which relate to Datasette - I started applying a new datasette-io topic (for datasette.io, my planned website domain) to the repos that I want to pull releases from.

Then I came up with this shell script monstrosity:

#!/bin/bash
# Fetch repos for simonw and dogsheep
github-to-sqlite repos github.db simonw dogsheep -a auth.json

# Fetch releases for the repos tagged 'datasette-io'
sqlite-utils github.db "
select full_name from repos where rowid in (
    select repos.rowid from repos, json_each(repos.topics) j
    where j.value = 'datasette-io'
)" --csv --no-headers | while read repo;
    do github-to-sqlite releases \
            github.db $(echo $repo | tr -d '\r') \
            -a auth.json;
        sleep 2;
    done;

Here’s an example of the database this produces, running on Cloud Run: https://github-to-sqlite-releases-j7hipcg4aq-uc.a.run.app

I’m using the ability of sqlite-utils to run a SQL query and return the results as CSV, but without the header row. Then I pipe the results through a while loop and use them to call the github-to-sqlite releases command against each repo.

I ran into a weird bug which turned out to be caused by the CSV output using \r\n which was fed into github-to-sqlite releases as simonw/datasette\r - I fixed that using $(echo $repo | tr -d '\r').
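
The stray \r is easy to reproduce. The sketch below assumes the CSV was produced with the defaults of Python's csv module, which ends each row with \r\n; whether sqlite-utils takes exactly this code path is an assumption. A reader that splits on \n alone is then left holding the carriage return.

```python
import csv
import io

buffer = io.StringIO()
csv.writer(buffer).writerow(["simonw/datasette"])
raw = buffer.getvalue()

# Splitting on "\n" alone, as a line-oriented shell read does, keeps the "\r".
line = raw.split("\n")[0]

# Equivalent of piping through tr -d '\r'.
cleaned = line.replace("\r", "")

print(repr(raw), repr(line), repr(cleaned))
```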

datasette-render-markdown

Now that I have a releases database table with all of the releases of my various packages I want to be able to browse them in one place. I fired up Datasette and realized that the most interesting information is in the body column, which contains markdown.

So I built a plugin for the render_cell plugin hook which safely renders markdown data as HTML. Here’s the full implementation of the plugin:

import bleach
import markdown
from datasette import hookimpl
import jinja2

ALLOWED_TAGS = [
    "a", "abbr", "acronym", "b", "blockquote", "code", "em",
    "i", "li", "ol", "strong", "ul", "pre", "p", "h1","h2",
    "h3", "h4", "h5", "h6",
]

@hookimpl()
def render_cell(value, column):
    if not isinstance(value, str):
        return None
    # Only convert to markdown if the column name ends in _markdown
    if not column.endswith("_markdown"):
        return None
    # Render it!
    html = bleach.linkify(
        bleach.clean(
            markdown.markdown(value, output_format="html5"),
            tags=ALLOWED_TAGS,
        )
    )
    return jinja2.Markup(html)

This first release of the plugin just looks for column names that end in _markdown and renders those. So the following SQL query does what I need:

select
  json_object("label", repos.full_name, "href", repos.html_url) as repo,
  json_object(
    "href",
    releases.html_url,
    "label",
    releases.name
  ) as release,
  substr(releases.published_at, 0, 11) as date,
  releases.body as body_markdown,
  releases.published_at
from
  releases
  join repos on repos.id = releases.repo
order by
  releases.published_at desc

It aliases releases.body to body_markdown to trigger the markdown rendering, and uses json_object(...) to cause datasette-json-html to render some links.
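
One detail in that query that can look odd is substr(releases.published_at, 0, 11). SQLite string positions start at 1, so a start of 0 with a length of 11 yields the first ten characters, which is the date part of an ISO 8601 timestamp. A quick check with Python's sqlite3 module, using a made-up timestamp value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
published_at = "2019-11-11T23:26:34Z"  # illustrative value only

# Position 0 sits just before the first character, so only 10 of the
# 11 requested characters actually exist.
(date,) = conn.execute("select substr(?, 0, 11)", [published_at]).fetchone()

# The more conventional spelling of the same slice.
(same,) = conn.execute("select substr(?, 1, 10)", [published_at]).fetchone()
conn.close()

print(date, same)
```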

You can see the results here.

More museums

I added another 7 museums to www.niche-museums.com.

Dingles Fairground Heritage Centre
Ilfracombe Museum
Barometer World
La Galcante
Musée des Arts et Métiers
International Women’s Air & Space Museum
West Kern Oil Museum

Tags: glitch, museums, markdown, datasette, weeknotes, sqlite-utils

Creating Simple Interactive Forms Using Python + Markdown Using ScriptedForms + Jupyter

2018-04-19T16:05:57+00:00

Creating Simple Interactive Forms Using Python + Markdown Using ScriptedForms + Jupyter

ScriptedForms is a fascinating Jupyter hack that lets you construct dynamic documents defined using markdown that provide form fields and evaluate Python code instantly as you interact with them.

Via @psychemedia

Tags: jupyter, markdown, python

Deckset for Mac

2018-04-10T21:34:37+00:00

Deckset for Mac

$29 desktop Mac application that creates presentations using a cleverly designed markdown dialect. You edit the underlying markdown in your standard text editor and the Deckset app shows a preview of the presentation and lets you hit “play” to run it or export it as a PDF.

Via garyfleming/apis-for-decades

Tags: markdown, presentations

Dillinger

2017-10-08T18:38:40+00:00

Dillinger

I really like this online Markdown editor. It has source syntax highlighting, live previews of the generated HTML and it constantly syncs to localStorage so you won’t lose your work if you accidentally shut your browser window. The code is also available open source on GitHub.

Tags: markdown

Should I store markdown instead of HTML into database fields?

2013-09-08T15:57:00+00:00

My answer to Should I store markdown instead of HTML into database fields? on Quora

You should store the exact format that was entered by the user.

- This lets you offer an "edit" feature without round-tripping between two formats.
- This makes debugging much easier
- Related: if you need to investigate a security bug, having the original input is essential.

If you're worried about performance, you can cache the transformed HTML somewhere - or even denormalize it to an extra table column. Just make sure you always have the original input available.
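
A minimal sketch of that arrangement, assuming a hypothetical posts table with one column for the user's original input and one for the cached HTML. The render function is a deliberately trivial placeholder for whatever Markdown library the site uses.

```python
import html
import sqlite3


def render(source: str) -> str:
    """Placeholder renderer; a real site would call its Markdown library here."""
    return "<p>" + html.escape(source) + "</p>"


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table posts (id integer primary key, body_source text, body_html text)"
)


def save_post(post_id: int, source: str) -> None:
    """Store the user's exact input and refresh the cached HTML alongside it."""
    conn.execute(
        "insert or replace into posts (id, body_source, body_html) values (?, ?, ?)",
        (post_id, source, render(source)),
    )


save_post(1, "Hello *world* & friends")
source, cached = conn.execute(
    "select body_source, body_html from posts where id = 1"
).fetchone()
print(source)
print(cached)
```

If the renderer changes later, the cached column can be rebuilt from body_source at any time, which is the point of always keeping the original.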

Tags: cms, databases, html, mysql, quora, markdown

A myriad of markup systems

2004-04-13T04:58:54+00:00

It's hard to avoid the legions of custom markup systems out there these days. Every Wiki has its own syntactical quirks, while packages like Markdown, Textile, BBCode (in dozens of variants) and reStructuredText offer easy ways of hooking markup conversion in to existing applications. When it comes to being totally over-implemented and infuriatingly inconsistent, markup systems are rapidly catching up with template packages. Never one to miss out on an opportunity to reinvent the wheel, I've worked on several of each ;)

My most recent markup handling attempt has just been published as part of my SitePoint article on Bookmarklets (cliché). It's a structured markup language in a bookmarklet: activate the bookmarklet to convert the text in any textarea on a page to XHTML. The syntax is ridiculously simple, and serves my limited needs just fine:


= This is a header

Here is a paragraph.

* This is a list of items
* Another item in the list

Converts to:


<h4>This is a header</h4>

<p>Here is a paragraph.</p>

<ul>
 <li>This is a list of items</li>
 <li>Another item in the list</li>
</ul>

The algorithm is simple, and easily portable to any language you care to mention:

Normalise newlines to \n, for cross-platform consistency.
Split the text up on double newlines, to create a list of blocks.
For each block:
1. If it starts with an equals sign, wrap it in header tags.
2. If it starts with an asterisk, split it in to lines, make each a list item (stripping off the asterisk at the start of the line if required) and glue them all together inside a <ul>.
3. Otherwise, wrap it in a <p> tag provided it doesn't have one already.
Glue everything back together again with a couple of newlines, to make the underlying XHTML look pretty.
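
The steps above can be sketched in Python as follows. This is a port written for illustration, not the bookmarklet's JavaScript; the expand_shorthand name is made up, the h4 header level is taken from the sample output, and no HTML escaping is attempted.

```python
import re


def expand_shorthand(text: str) -> str:
    """Convert the shorthand markup to XHTML, block by block."""
    # Normalise newlines to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split on blank lines to get a list of blocks.
    blocks = [block.strip() for block in re.split(r"\n{2,}", text)]
    output = []
    for block in blocks:
        if not block:
            continue
        if block.startswith("="):
            output.append("<h4>" + block[1:].strip() + "</h4>")
        elif block.startswith("*"):
            items = []
            for line in block.split("\n"):
                line = line.strip()
                if line.startswith("*"):
                    line = line[1:].strip()
                items.append(" <li>" + line + "</li>")
            output.append("<ul>\n" + "\n".join(items) + "\n</ul>")
        elif block.startswith("<p>"):
            output.append(block)  # already wrapped
        else:
            output.append("<p>" + block + "</p>")
    # Glue the blocks back together with a blank line between them.
    return "\n\n".join(output)


sample = (
    "= This is a header\n\n"
    "Here is a paragraph.\n\n"
    "* This is a list of items\n"
    "* Another item in the list"
)
converted = expand_shorthand(sample)
print(converted)
```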

The bookmarklet comes in two flavours: Expand HTML Shorthand (the full version) and Expand HTML Shorthand IE, which loses header support in order to fit within IE's crippling 508 character limit. A more capable bookmarklet could be built using the import-script-stub method described in my article, but the implementation of such a thing is left as an exercise for the reader (I've always wanted to say that).

Incidentally, there's a very common bug in markup systems that allow inline styles that proves extremely difficult to fix: that of improperly nested tags. Say you have a system where *text* is bold and _text_ is italic; what happens when the user enters _italic*italic-bold_bold*? Most systems (and that includes Markdown, Textile and my home-rolled Python solution) use naive regular expressions for inline markup processing and will output badly formed XHTML: <em>italic<strong>italic-bold</em>bold</strong>. To truly solve this problem requires a context-sensitive parser, which involves an unpleasantly large amount of effort to solve what looks like a simple bug.
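
The bug is easy to reproduce with two independent regular expressions. This sketch is not taken from any of the systems mentioned; it assumes em and strong as the output tags and shows only the general failure mode, where each rule is applied with no knowledge of the other.

```python
import re


def naive_inline(text: str) -> str:
    """Apply each inline rule independently, with no awareness of nesting."""
    text = re.sub(r"_(.+?)_", r"<em>\1</em>", text)
    text = re.sub(r"\*(.+?)\*", r"<strong>\1</strong>", text)
    return text


result = naive_inline("_italic*italic-bold_bold*")
print(result)
```

The em element is closed while the strong element inside it is still open, so the result is not well-formed.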

Tags: bookmarklets, restructuredtext, markdown
