Simon Willison’s Weblog

Subscribe

Blogmarks tagged git

Filters: Type: blogmark × git × Sorted by date

simonw/git-scraper-template. I built this new GitHub template repository in preparation for a workshop I'm giving at NICAR (the data journalism conference) next week on Cutting-edge web scraping techniques.

One of the topics I'll be covering is Git scraping - creating a GitHub repository that uses scheduled GitHub Actions workflows to grab copies of websites and data feeds and store their changes over time using Git.

This template repository is designed to be the fastest possible way to get started with a new Git scraper: simple create a new repository from the template and paste the URL you want to scrape into the description field and the repository will be initialized with a custom script that scrapes and stores that URL.

It's modeled after my earlier shot-scraper-template tool which I described in detail in Instantly create a GitHub repository to take screenshots of a web page.

The new git-scraper-template repo took some help from Claude to figure out. It uses a custom script to download the provided URL and derive a filename to use based on the URL and the content type, detected using file --mime-type -b "$file_path" against the downloaded file.

It also detects if the downloaded content is JSON and, if it is, pretty-prints it using jq - I find this is a quick way to generate much more useful diffs when the content changes.

# 26th February 2025, 5:34 am / data-journalism, git, github, projects, scraping, github-actions, git-scraping, nicar

otterwiki (via) It's been a while since I've seen a new-ish Wiki implementation, and this one by Ralph Thesen is really nice. It's written in Python (Flask + SQLAlchemy + mistune for Markdown + GitPython) and keeps all of the actual wiki content as Markdown files in a local Git repository.

The installation instructions are a little in-depth as they assume a production installation with Docker or systemd - I figured out this recipe for trying it locally using uv:

git clone https://github.com/redimp/otterwiki.git
cd otterwiki

mkdir -p app-data/repository
git init app-data/repository

echo "REPOSITORY='${PWD}/app-data/repository'" >> settings.cfg
echo "SQLALCHEMY_DATABASE_URI='sqlite:///${PWD}/app-data/db.sqlite'" >> settings.cfg
echo "SECRET_KEY='$(echo $RANDOM | md5sum | head -c 16)'" >> settings.cfg

export OTTERWIKI_SETTINGS=$PWD/settings.cfg
uv run --with gunicorn gunicorn --bind 127.0.0.1:8080 otterwiki.server:app

# 9th October 2024, 3:22 pm / flask, git, python, sqlalchemy, sqlite, markdown, wikis, uv

Why GitHub Actually Won (via) GitHub co-founder Scott Chacon shares some thoughts on how GitHub won the open source code hosting market. Shortened to two words: timing, and taste.

There are some interesting numbers in here. I hadn't realized that when GitHub launched in 2008 the term "open source" had only been coined ten years earlier, in 1998. This paper by Dirk Riehle estimates there were 18,000 open source projects in 2008 - Scott points out that today there are over 280 million public repositories on GitHub alone.

Scott's conclusion:

We were there when a new paradigm was being born and we approached the problem of helping people embrace that new paradigm with a developer experience centric approach that nobody else had the capacity for or interest in.

# 9th September 2024, 5:16 pm / git, github, open-source

AI-powered Git Commit Function (via) Andrej Karpathy built a shell alias, gcm, which passes your staged Git changes to an LLM via my LLM tool, generates a short commit message and then asks you if you want to "(a)ccept, (e)dit, (r)egenerate, or (c)ancel?".

Here's the incantation he's using to generate that commit message:

git diff --cached | llm "
Below is a diff of all staged changes, coming from the command:
\`\`\`
git diff --cached
\`\`\`
Please generate a concise, one-line commit message for these changes."

This pipes the data into LLM (using the default model, currently gpt-4o-mini unless you set it to something else) and then appends the prompt telling it what to do with that input.

# 26th August 2024, 1:06 am / git, ai, andrej-karpathy, prompt-engineering, generative-ai, llms, ai-assisted-programming, llm

EpicEnv (via) Dan Goodman's tool for managing shared secrets via a Git repository. This uses a really neat trick: you can run epicenv invite githubuser and the tool will retrieve that user's public key from github.com/{username}.keys (here's mine) and use that to encrypt the secrets such that the user can decrypt them with their private key.

# 3rd August 2024, 12:31 am / encryption, git

1991-WWW-NeXT-Implementation on GitHub. I fell down a bit of a rabbit hole today trying to answer that question about when World Wide Web Day was first celebrated. I found my way to www.w3.org/History/1991-WWW-NeXT/Implementation/ - an Apache directory listing of the source code for Tim Berners-Lee's original WorldWideWeb application for NeXT!

The code wasn't particularly easy to browse: clicking a .m file would trigger a download rather than showing the code in the browser, and there were no niceties like syntax highlighting.

So I decided to mirror that code to a new repository on GitHub. I grabbed the code using wget -r and was delighted to find that the last modified dates (from the early 1990s) were preserved ... which made me want to preserve them in the GitHub repo too.

I used Claude to write a Python script to back-date those commits, and wrote up what I learned in this new TIL: Back-dating Git commits based on file modification dates.

End result: I now have a repo with Tim's original code, plus commit dates that reflect when that code was last modified.

Three commits credited to Tim Berners-Lee, in 1995, 1994 and 1993

# 1st August 2024, 9:15 pm / git, github, history, tim-berners-lee, w3c

AWS CodeCommit quietly deprecated (via) CodeCommit is AWS's Git hosting service. In a reply from an AWS employee to this forum thread:

Beginning on 06 June 2024, AWS CodeCommit ceased onboarding new customers. Going forward, only customers who have an existing repository in AWS CodeCommit will be able to create additional repositories.

[...] If you would like to use AWS CodeCommit in a new AWS account that is part of your AWS Organization, please let us know so that we can evaluate the request for allowlisting the new account. If you would like to use an alternative to AWS CodeCommit given this news, we recommend using GitLab, GitHub, or another third party source provider of your choice.

What's weird about this is that, as far as I can tell, this is the first official public acknowledgement from AWS that CodeCommit is no longer accepting customers. The CodeCommit landing page continues to promote the product, though it does link to the How to migrate your AWS CodeCommit repository to another Git provider blog post from July 25th, which gives no direct indication that CodeCommit is being quietly sunset.

I wonder how long they'll continue to support their existing customers?

Amazon QLDB too

It looks like AWS may be having a bit of a clear-out. Amazon QLDB - Quantum Ledger Database (a blockchain-adjacent immutable ledger, launched in 2019) - quietly put out a deprecation announcement in their release history on July 18th (again, no official announcement elsewhere):

End of support notice: Existing customers will be able to use Amazon QLDB until end of support on 07/31/2025. For more details, see Migrate an Amazon QLDB Ledger to Amazon Aurora PostgreSQL.

This one is more surprising, because migrating to a different Git host is massively less work than entirely re-writing a system to use a fundamentally different database.

It turns out there's an infrequently updated community GitHub repo called SummitRoute/aws_breaking_changes which tracks these kinds of changes. Other services listed there include CodeStar, Cloud9, CloudSearch, OpsWorks, Workdocs and Snowmobile, and they cleverly (ab)use the GitHub releases mechanism to provide an Atom feed.

# 30th July 2024, 5:51 am / aws, git, blockchain

uv pip install --exclude-newer example (via) A neat new feature of the uv pip install command is the --exclude-newer option, which can be used to avoid installing any package versions released after the specified date.

Here's a clever example of that in use from the typing_extensions packages CI tests that run against some downstream packages:

uv pip install --system -r test-requirements.txt --exclude-newer $(git show -s --date=format:'%Y-%m-%dT%H:%M:%SZ' --format=%cd HEAD)

They use git show to get the date of the most recent commit (%cd means commit date) formatted as an ISO timestamp, then pass that to --exclude-newer.

# 10th May 2024, 4:35 pm / git, pip, python, uv, astral

Use an llm to automagically generate meaningful git commit messages. Neat, thoroughly documented recipe by Harper Reed using my LLM CLI tool as part of a scheme for if you’re feeling too lazy to write a commit message—it uses a prepare-commit-msg Git hook which runs any time you commit without a message and pipes your changes to a model along with a custom system prompt.

# 11th April 2024, 4:06 am / cli, git, ai, generative-ai, llm

How I use git worktrees (via) TIL about worktrees, a Git feature that lets you have multiple repository branches checked out to separate directories at the same time.

The default UI for them is a little unergonomic (classic Git) but Bill Mill here shares a neat utility script for managing them in a more convenient way.

One particularly neat trick: Bill’s “worktree” Bash script checks for a node_modules folder and, if one exists, duplicates it to the new directory using copy-on-write, saving you from having to run yet another lengthy “npm install”.

# 6th March 2024, 3:21 pm / git

Figure out who’s leaving the company: dump, diff, repeat (via) Rachel Kroll describes a neat hack for companies with an internal LDAP server or similar machine-readable employee directory: run a cron somewhere internal that grabs the latest version and diffs it against the previous to figure out who has joined or left the company.

I suggest using Git for this—a form of Git scraping—as then you get a detailed commit log of changes over time effectively for free.

I really enjoyed Rachel’s closing thought: “Incidentally, if someone gets mad about you running this sort of thing, you probably don’t want to work there anyway. On the other hand, if you’re able to build such tools without IT or similar getting ”threatened“ by it, then you might be somewhere that actually enjoys creating interesting and useful stuff. Treasure such places. They don’t tend to last.”

# 9th February 2024, 5:44 am / git, git-scraping

Inside .git. This single diagram filled in all sorts of gaps in my mental model of how git actually works under the hood.

# 25th January 2024, 2:59 pm / git, julia-evans

See the History of a Method with git log -L (via) Neat Git trick from Caleb Hearth that I hadn’t seen before, and it works for Python out of the box:

git log -L :path_with_format:__init__.py

That command displays a log (with diffs) of just the portion of commits that changed the path_with_format function in the __init__.py file.

# 5th November 2023, 8:16 pm / git, python

Tracking SQLite Database Changes in Git (via) A neat trick from Garrit Franke that I hadn’t seen before: you can teach “git diff” how to display human readable versions of the differences between binary files with a specific extension using the following:

git config diff.sqlite3.binary true
git config diff.sqlite3.textconv “echo .dump | sqlite3”

That way you can store binary files in your repo but still get back SQL diffs to compare them.

I still worry about the efficiency of storing binary files in Git, since I expect multiple versions of a text text file to compress together better.

# 1st November 2023, 6:53 pm / git, sqlite

A tiny CI system (via) Christian Ştefănescu shares a recipe for building a tiny self-hosted CI system using Git and Redis. A post-receive hook runs when a commit is pushed to the repo and uses redis-cli to push jobs to a list. Then a separate bash script runs a loop with a blocking “redis-cli blpop jobs” operation which waits for new jobs and then executes the CI job as a shell script.

# 26th April 2022, 3:39 pm / bash, continuous-integration, git, redis

Commits are snapshots, not diffs (via) Useful, clearly explained revision of some Git fundamentals.

# 17th December 2020, 10:01 pm / git, github

nyt-2020-election-scraper. Brilliant application of git scraping by Alex Gaynor and a growing team of contributors. Takes a JSON snapshot of the NYT’s latest election poll figures every five minutes, then runs a Python script to iterate through the history and build an HTML page showing the trends, including what percentage of the remaining votes each candidate needs to win each state. This is the perfect case study in why it can be useful to take a “snapshot if the world right now” data source and turn it into a git revision history over time.

# 6th November 2020, 2:24 pm / alex-gaynor, data-journalism, elections, git, new-york-times, git-scraping

Repository driven development (via) I’m already a big fan of keeping documentation and code in the same repo so you can update them both from within the same code review, but this takes it even further: in repository driven development every aspect of the code and configuration needed to define, document, test and ship a service live in the service repository—all the way down to the configurations for reporting dashboards. This sounds like heaven.

# 24th July 2019, 8:41 am / git

isomorphic-git (via) A pure-JavaScript implementation of the git protocol and underlying tools which works both server-side (Node.js) AND in the client, using an emulation of the fs API. Given the right CORS headers it can clone a GitHub repository over HTTPS right into your browser. Impressive.

# 16th May 2018, 8:54 pm / git, javascript, cors

Telling stories through your commits. Joel Chippendale’s excellent guide to writing a useful commit history. I spend a lot of time on my commit messages, because when I’m trying to understand code later on they are the only form of documentation that is guaranteed to remain up-to-date against the code at that exact point of time. These tips are clear, concise, teadabale and include some great examples.

# 13th January 2018, 7:44 pm / git, sourcecontrol

Exploding Git Repositories. Kate Murphy describes how git is vulnerable to a similar attack to the XML “billion laughs” recursive entity expansion attack—you can create a tiny git repository that acts as a “git bomb”, expanding 12 root objects to over a billion files using recursive blob references.

# 12th October 2017, 7:43 pm / git, security

GitHub: Announcing SVN Support. The best kind of April Fool’s joke: one that works. It’s read-only, but that’s good enough to support referencing GitHub repositories from SVN externals.

# 1st April 2010, 11:33 am / aprilfools, git, github, subversion

A successful Git branching model (via) This looks eminently sensible. The master branch is used for production-ready code, and is only updated by merging from either release branches or emergency hotfix branches. A develop branch is used for integration (from feature branches), and is branched to create release branches when a release is nearly ready. It’s all comprehensively documented and comes with some well-designed diagrams.

# 20th January 2010, 7:30 pm / branching, development, git, process

Introducing the YUI 3 Gallery. Write a plugin for YUI3, BSD license it and sign a CLA and Yahoo! will push your module out to their CDN and make it loadable using the YUI().use() statement. They’re coordinating the submissions using GitHub.

# 4th November 2009, 11:14 pm / bsd, cla, git, github, javascript, open-source, yahoo, yui, yui3

How We Made GitHub Fast. Detailed overview of the new GitHub architecture. It’s a lot more complicated than I would have expected—lots of moving parts are involved in ensuring they can scale horizontally when they need to. Interesting components include nginx, Unicorn, Rails, DRBD, HAProxy, Redis, Erlang, memcached, SSH, git and a bunch of interesting new open source projects produced by the GitHub team such as BERT/Ernie and ProxyMachine.

# 21st October 2009, 9:14 pm / drbd, erlang, ernie, git, github, haproxy, memcached, nginx, proxymachine, rails, redis, replication, ruby, scaling, ssh, unicorn

homebrew. Exciting alternative to MacPorts for compiling software on OS X—homebrew avoids sudo and defines packages as simple Ruby scripts, shared and distributed using Git.

# 21st September 2009, 6:51 pm / git, homebrew, macports, osx, ruby

Fabric, Django, Git, Apache, mod_wsgi, virtualenv and pip deployment. I’m slowly working my way through this stack at the moment—next stop, fabric.

# 28th July 2009, 11:56 am / apache, deployment, django, fabric, gareth-rushgrove, git, modwsgi, pip, python, virtualenv, wsgi

djangopeople.net on GitHub. I’ve released the source code for Django People, the geographical community site developed last year by myself and Natalie Downe (it hasn’t otherwise been touched since April last year, so it needs porting to Django 1.1). If you want a new feature on the site, implement it and I’ll see about merging it in.

# 4th May 2009, 6:12 pm / django, django-people, git, github, natalie-downe, open-source, projects, python

Dulwich. A pure Python implementation of the Git file format and protocols. Reinforces my impression that a key to Git’s success is stable, well designed and documented on-disk formats.

# 16th February 2009, 10:27 pm / dulwich, git, python

Tailor. “Tailor is a tool to migrate or replicate changesets between ArX, Bazaar, Bazaar-NG, CVS, Codeville, Darcs, Git, Mercurial, Monotone, Subversion and Tla repositories.”—written in Python.

# 24th June 2008, 9:59 am / bazaar, codeville, cvs, darcs, dvcs, git, mercurial, monotone, python, subversion, tailor, tla, version-control