348 items tagged “datasette”
2023
Vector Search. Amjith Ramanujam provides a very thorough tutorial on implementing vector similarity search using SentenceTransformers (all-MiniLM-L6-v2) embeddings run using sqlite-utils, then served via datasette-sqlite-vss and deployed using Fly. # 2nd June 2023, 5:02 am
Exploration de données avec Datasette. One of the great delights of open source development is seeing people run workshops on your project, even more so when they’re in a language other than English! Romain Clement presented this French workshop for the Python Grenoble meetup on 25th May 2023, using GitHub Codespaces as the environment. It’s pretty comprehensive, including a 300,000+ row example table which illustrates Datasette plugins such as datasette-cluster-map and datasette-leaflet-geojson. # 27th May 2023, 12:36 am
MMS Language Coverage in Datasette Lite. I converted the HTML table of 4,021 languages supported by Meta’s new Massively Multilingual Speech models to newline-delimited JSON and loaded it into Datasette Lite. Faceting by Language Family is particularly interesting—the top five families represented are Niger-Congo with 1,019, Austronesian with 609, Sino-Tibetan with 288, Indo-European with 278 and Afro-Asiatic with 222. # 22nd May 2023, 8:01 pm
Data analysis with SQLite and Python for PyCon 2023
I’m at PyCon 2023 in Salt Lake City this week.
[... 334 words]What’s in the RedPajama-Data-1T LLM training set
RedPajama is “a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens”. It’s a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute.
[... 1077 words]sqlite-history: tracking changes to SQLite tables using triggers (also weeknotes)
In between blogging about ChatGPT rhetoric, micro-benchmarking with ChatGPT Code Interpreter and Why prompt injection is an even bigger problem now I managed to ship the beginnings of a new project: sqlite-history.
[... 1680 words]GitHub Accelerator: our first cohort. I’m participating in the first cohort of GitHub’s new open source accelerator program, with Datasette (and related projects). It’s a 10 week program with 20 projects working together “with an end goal of building durable streams of funding for their work”. # 13th April 2023, 5:28 pm
Weeknotes: A new llm CLI tool, plus automating my weeknotes and newsletter
I started publishing weeknotes in 2019 partly as a way to hold myself accountable but mainly as a way to encourage myself to write more.
[... 830 words]Semi-automating a Substack newsletter with an Observable notebook
I recently started sending out a weekly-ish email newsletter consisting of content from my blog. I’ve mostly automated that, using an Observable Notebook to generate the HTML. Here’s how that system works.
[... 2520 words]I built a ChatGPT plugin to answer questions about data hosted in Datasette
Yesterday OpenAI announced support for ChatGPT plugins. It’s now possible to teach ChatGPT how to make calls out to external APIs and use the responses to help generate further answers in the current conversation.
[... 1801 words]Weeknotes: AI won’t slow down, a new newsletter and a huge Datasette refactor
I’m a few weeks behind on my weeknotes, but it’s not through lack of attention to my blog. AI just keeps getting weirder and more interesting.
[... 1255 words]Datasette: Gather feedback on new ?_extra= design. I just landed the single biggest backwards-incompatible change to Datasette ever, in preparation for the 1.0 release. It’s a change to the default JSON format from the Datasette API—the new format is much slimmer, and can be expanded using a new ?_extra= query string parameter. I’m desperately keen on getting feedback on this change! This issues has more details and a call for feedback. # 22nd March 2023, 11:14 pm
Using Datasette in GitHub Codespaces. A new Datasette tutorial showing how it can be run inside GitHub Codespaces—GitHub’s browser-based development environments—in order to explore and analyze data. I’ve been using Codespaces to run tutorials recently and it’s absolutely fantastic, because it puts every tutorial attendee on a level playing field with respect to their development environments. # 24th February 2023, 12:40 am
Analytics: Hacker News v.s. a tweet from Elon Musk
My post Bing: “I will not harm you unless you harm me first” really took off.
[... 817 words]Introducing sqlite-vss: A SQLite Extension for Vector Search (via) This latest SQLite extension from Alex Garcia is possibly his best yet: it adds FAISS-powered vector similarity search directly to SQLite, enabling fast KNN similarity lookups against a virtual table that feels a lot like SQLite’s own built-in full text search feature. This write-up includes interactive demos using Datasette called from an Observable notebook, running similarity searches against an index of 200,000 news headlines and summaries in less than 50ms. # 10th February 2023, 10:53 pm
Weeknotes: A bunch of things I learned this week, plus datasette-explain
The Datasette table view refactor, JSON redesign and ?_extra=
continues this week, mainly in this ongoing pull request and this tracking issue.
Making SQLite extensions pip install-able (via) Alex Garcia figured out how to bundle a compiled SQLite extension in a Python wheel (building different wheels for different platforms) and publish them to PyPI. This is a huge leap forward in terms of the usability of SQLite extensions, which have previously been pretty difficult to actually install and run. Alex also created Datasette plugins that depend on his packages, so you can now “datasette install datasette-sqlite-regex” (or datasette-sqlite-ulid, datasette-sqlite-fastrand, datasette-sqlite-jsonschema) to gain access to his custom SQLite extensions in your Datasette instance. It even works with “datasette publish --install” to deploy to Vercel, Fly.io and Cloud Run. # 6th February 2023, 7:44 pm
datasette-scraper, Big Local News and other weeknotes
In addition to exploring the new MusicCaps training and evaluation data I’ve been working on the big Datasette JSON refactor, and getting excited about a Datasette project that I didn’t work on at all.
[... 1744 words]datasette-scraper walkthrough on YouTube (via) datasette-scraper is Colin Dellow’s new plugin that turns Datasette into a powerful web scraping tool, with a web UI based on plugin-driven customizations to the Datasette interface. It’s really impressive, and this ten minute demo shows quite how much it is capable of: it can crawl sitemaps and fetch pages, caching them (using zstandard with optional custom dictionaries for extra compression) to speed up subsequent crawls... and you can add your own plugins to extract structured data from crawled pages and save it to a separate SQLite table! # 29th January 2023, 5:23 am
Examples of sites built using Datasette (via) I gave the examples page on the Datasette website a significant upgrade today: it now includes screenshots (taken using shot-scraper) of six projects chosen to illustrate the variety of problems Datasette can be used to tackle. # 29th January 2023, 3:40 am
We’ve built many tools for publishing to the web—but I want to make the claim that we have underdeveloped the tools and platforms for publishing collections, indexes and small databases. It’s too hard to build these kinds of experiences, too hard to maintain them and a lack of collaborative tools.
— Tom Critchlow # 28th January 2023, 4:43 pm
Exploring MusicCaps, the evaluation data released to accompany Google’s MusicLM text-to-music model
Google Research just released MusicLM: Generating Music From Text. It’s a new generative AI model that takes a descriptive prompt and produces a “high-fidelity” music track. Here’s the paper (and a more readable version using arXiv Vanity).
[... 1323 words]datasette-granian (via) Granian is a new Python web server—similar to Gunicorn—written in Rust. I built a small plugin that adds a “datasette granian” command starting a Granian server that serves Datasette’s ASGI application, using the same pattern as my existing datasette-gunicorn plugin. # 20th January 2023, 2:12 am
Datasette is my data hammer (via) Jeremia Kimelman—a data journalist at CalMatters in Sacramento—enthuses about how he uses Datasette as his default hammer for all kinds of data projects—in particular how much he appreciates Datasette’s focus on URLs. So nice to see this! # 17th January 2023, 5:23 pm
Weeknotes: AI hacking and a SpatiaLite tutorial
Short weeknotes this time because the key things I worked on have already been covered here:
[... 477 words]How to implement Q&A against your documentation with GPT3, embeddings and Datasette
If you’ve spent any time with GPT-3 or ChatGPT, you’ve likely thought about how useful it would be if you could point them at a specific, current collection of text or documentation and have it use that as part of its input for answering questions.
[... 3490 words]Datasette 0.64, with a warning about SpatiaLite
I release Datasette 0.64 this morning. This release is mainly a response to the realization that it’s not safe to run Datasette with the SpatiaLite extension loaded if that Datasette instance is configured to enable arbitrary SQL queries from untrusted users.
[... 675 words]2022
Weeknotes: Datasette 0.63.3, datasette-ripgrep
We’re back in the UK to see family over Christmas (our first trip back since 2019). Here are a few notes from the past couple of weeks.
[... 801 words]Datasette 1.0a2: Upserts and finely grained permissions
I’ve released the third alpha of Datasette 1.0. The 1.0a2 release introduces upsert support to the new JSON API and makes some major improvements to the Datasette permissions system.
[... 2844 words]Over-engineering Secret Santa with Python cryptography and Datasette
We’re doing a family Secret Santa this year, and we needed a way to randomly assign people to each other without anyone knowing who was assigned to who.
[... 2045 words]