Simon Willison’s Weblog

Subscribe

53 items tagged “testing”

2024

inline-snapshot. I’m a big fan of snapshot testing, where expected values are captured the first time a test suite runs and then asserted against in future runs. It’s a very productive way to build a robust test suite.

inline-snapshot by Frank Hoffmann is a particularly neat implementation of the pattern. It defines a snapshot() function which you can use in your tests:

assert 1548 * 18489 == snapshot()

When you run that test using “pytest --inline-snapshot=create” the snapshot() function will be replaced in your code (using AST manipulation) with itself wrapping the repr() of the expected result:

assert 1548 * 18489 == snapshot(28620972)

If you modify the code and need to update the tests you can run “pytest --inline-snapshot=fix” to regenerate the recorded snapshot values. # 16th April 2024, 4:04 pm

Your AI Product Needs Evals (via) Hamel Husain: “I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.”

I’ve been frustrated about this for a while: I know I need to move beyond “vibe checks” for the systems I have started to build on top of LLMs, but I was lacking a thorough guide about how to build automated (and manual) evals in a productive way.

Hamel has provided exactly the tutorial I was needing for this, with a really thorough example case-study.

Using GPT-4 to create test cases is an interesting approach: “Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.”

Also important: “... unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision.”

Hamel’s guide then covers the importance of traces for evaluating real-world performance of your deployed application, plus the pros and cons of leaning on automated evaluation using LLMs themselves.

Plus some wisdom from a footnote: “A reasonable heuristic is to keep reading logs until you feel like you aren’t learning anything new.” # 31st March 2024, 9:53 pm

time-machine example test for a segfault in Python (via) Here’s a really neat testing trick by Adam Johnson. Someone reported a segfault bug in his time-machine library. How you you write a unit test that exercises a segfault without crashing the entire test suite?

Adam’s solution is a test that does this:

subprocess.run([sys.executable, “-c”, code_that_crashes_python], check=True)

sys.executable is the path to the current Python executable—ensuring the code will run in the same virtual environment as the test suite itself. The -c option can be used to have it run a (multi-line) string of Python code, and check=True causes the subprocess.run() function to raise an error if the subprocess fails to execute cleanly and returns an error code.

I’m absolutely going to be borrowing this pattern next time I need to add tests to cover a crashing bug in one of my projects. # 23rd March 2024, 7:44 pm

Testcontainers (via) Not sure how I missed this: Testcontainers is a family of testing libraries (for Python, Go, JavaScript, Ruby, Rust and a bunch more) that make it trivial to spin up a service such as PostgreSQL or Redis in a container for the duration of your tests and then spin it back down again.

The Python example code is delightful:

redis = DockerContainer(“redis:5.0.3-alpine”).with_exposed_ports(6379)
redis.start()
wait_for_logs(redis, “Ready to accept connections”)

I much prefer integration-style tests over unit tests, and I like to make sure any of my projects that depend on PostgreSQL or similar can run their tests against a real running instance. I’ve invested heavily in spinning up Varnish or Elasticsearch ephemeral instances in the past—Testcontainers look like they could save me a lot of time.

The open source project started in 2015, span off a company called AtomicJar in 2021 and was acquired by Docker in December 2023. # 28th February 2024, 2:41 am

Before we even started writing the database, we first wrote a fully-deterministic event-based network simulation that our database could plug into. This system let us simulate an entire cluster of interacting database processes, all within a single-threaded, single-process application, and all driven by the same random number generator. We could run this virtual cluster, inject network faults, kill machines, simulate whatever crazy behavior we wanted, and see how it reacted. Best of all, if one particular simulation run found a bug in our application logic, we could run it over and over again with the same random seed, and the exact same series of events would happen in the exact same order. That meant that even for the weirdest and rarest bugs, we got infinity “tries” at figuring it out, and could add logging, or do whatever else we needed to do to track it down.

[...] At FoundationDB, once we hit the point of having ~zero bugs and confidence that any new ones would be found immediately, we entered into this blessed condition and we flew.

[...] We had built this sophisticated testing system to make our database more solid, but to our shock that wasn’t the biggest effect it had. The biggest effect was that it gave our tiny engineering team the productivity of a team 50x its size.

Will Wilson, on FoundationDB # 13th February 2024, 5:20 pm

2023

promptfoo: How to benchmark Llama2 Uncensored vs. GPT-3.5 on your own inputs. promptfoo is a CLI and library for “evaluating LLM output quality”. This tutorial in their documentation about using it to compare Llama 2 to gpt-3.5-turbo is a good illustration of how it works: it uses YAML files to configure the prompts, and more YAML to define assertions such as “not-icontains: AI language model”. # 10th September 2023, 4:19 pm

pytest-icdiff (via) This is neat: “pip install pytest-icdiff” provides an instant usability upgrade to the output of failed tests in pytest, especially if the assertions involve comparing larger strings or nested JSON objects. # 3rd June 2023, 4:59 pm

pyfakefs usage (via) New to me pytest fixture library that provides a really easy way to mock Python’s filesystem functions—open(), os.path.listdir() and so on—so a test can run against a fake set of files. This looks incredibly useful. # 1st February 2023, 10:37 pm

How to simulate a broken database connection for testing in Django (via) Neil Kakkar explores the options using unittest.patch() and then settles on a neater pattern using “with connection.execute_wrapper(QueryTimeoutWrapper()):” to simulate the exact exception he needs to test against. # 15th January 2023, 8:31 pm

2022

Coping strategies for the serial project hoarder

I gave a talk at DjangoCon US 2022 in San Diego last month about productivity on personal projects, titled “Massively increase your productivity on personal projects with comprehensive documentation and automated tests”.

[... 3863 words]

mitsuhiko/insta (via) I asked for recommendations on Twitter for testing libraries in other languages that would give me the same level of delight that I get from pytest. Two people pointed me to insta by Armin Ronacher, a Rust testing framework for “snapshot testing” which automatically records reference values to your repository, so future tests can spot if they change. # 31st October 2022, 1:06 am

viewport-preview (via) I built a tiny tool which lets you preview a URL in a bunch of different common browser viewport widths, using iframes. # 26th July 2022, 12 am

I discovered a while ago that all those errors and bugs that only appear when you demo something to an audience also magically appear when you record yourself demoing it to nobody. Maybe narrating a feature to a pretend audience takes the blinders off enough that you notice little mistakes you wouldn’t have otherwise.

karaterobot # 24th July 2022, 8:59 pm

2021

I’m pretty convinced that the biggest single contributor to improved software in my lifetime wasn’t object-orientation or higher-level languages or functional programming or strong typing or MVC or anything else: It was the rise of testing culture.

Tim Bray # 1st June 2021, 2:35 pm

When you have to mock a collaborator, avoid using the Mock object directly. Either use mock.create_autospec() or mock.patch(autospec=True) if at all possible. Autospeccing from the real collaborator means that if the collaborator’s interface changes, your tests will fail. Manually speccing or not speccing at all means that changes in the collaborator’s interface will not break your tests that use the collaborator: you could have 100% test coverage and your library would fall over when used!

Thea Flowers # 17th March 2021, 4:44 pm

Blazing fast CI with pytest-split and GitHub Actions (via) pytest-split is a neat looking variant on the pattern of splitting up a test suite to run different parts of it in parallel on different machines. It involves maintaining a periodically updated JSON file in the repo recording the average runtime of different tests, to enable them to be more fairly divided among test runners. Includes a recipe for running as a matrix in GitHub Actions. # 22nd February 2021, 7:06 pm

Litestream runs continuously on a test server with generated load and streams backups to S3. It uses physical replication so it’ll actually restore the data from S3 periodically and compare the checksum byte-for-byte with the current database.

Ben Johnson # 11th February 2021, 8:50 pm

2020

How to cheat at unit tests with pytest and Black

I’ve been making a lot of progress on Datasette Cloud this week. As an application that provides private hosted Datasette instances (initially targeted at data journalists and newsrooms) the majority of the code I’ve written deals with permissions: allowing people to form teams, invite team members, promote and demote team administrators and suchlike.

[... 933 words]

2019

parameterized. I love the @parametrize decorator in pytest, which lets you run the same test multiple times against multiple parameters. The only catch is that the decorator in pytest doesn’t work for old-style unittest TestCase tests, which means you can’t easily add it to test suites that were built using the older model. I just found out about parameterized which works with unittest tests whether or not you are running them using the pytest test runner. # 19th February 2019, 9:05 pm

2018

Documentation unit tests

Or: Test-driven documentation.

[... 1521 words]

Hynek Schlawack: Testing & Packaging (via) “How to ensure that your tests run code that you think they are running, and how to measure your coverage over multiple tox runs (in parallel!)”—Hynek makes a convincing argument for putting your packaged Python code in a src/ directory for ease of testing and coverage. # 22nd May 2018, 10:12 pm

2017

I’ve heard managers and teams mandating 100% code coverage for applications. That’s a really bad idea. The problem is that you get diminishing returns on our tests as the coverage increases much beyond 70% (I made that number up… no science there). Why is that? Well, when you strive for 100% all the time, you find yourself spending time testing things that really don’t need to be tested. Things that really have no logic in them at all (so any bugs could be caught by ESLint and Flow). Maintaining tests like this actually really slow you and your team down.

Kent C. Dodds # 27th October 2017, 6:20 am

Cypress (via) Promising looking new open source testing framework for full-blown web integration testing—a modern alternative to Selenium. I spent five minutes playing with the demo and was really impressed by it—especially their “time travel” feature which lets you hover over a passed test and see the state of the browser when each of those assertions was executed. # 11th October 2017, 4:14 pm

2011

One interesting quirk of Pinboard is a complete absence of unit tests. I used to be a die-hard believer in testing, but in Pinboard tried a different approach, as an experiment. Instead of writng tests I try to be extremely careful in coding, and keep the code size small so I continue to understand it. I’ve found my defect rate to be pretty comparable to earlier projects that included extensive test suites and fixtures, but I am much more productive on Pinboard.

Maciej Ceglowski # 11th February 2011, 2:57 am

2010

Unit Testing Achievements. A plugin for Python’s nose test runner that adds achievements—“Night Shift: Make a failing suite pass between 12am and 5am.” # 28th February 2010, 3:56 pm

twitter-text-conformance (via) This is a neat idea: Twitter have released open source libraries for parsing standard tweet syntax in Ruby and Java, but they’ve also released a set of YAML unit tests aimed at anyone who wants to implement the same parsing logic in other languages. # 6th February 2010, 3:39 pm

rlisagor’s freshen. A Python clone of Ruby’s innovative Cucumber testing framework. Tests are defined as a set of plain-text scenarios, which are then executed by being matched against test functions decorated with regular expressions. Has anyone used this or Cucumber? I’m intrigued but unconvinced—are the plain text scenarios really a useful way of defining tests? # 5th January 2010, 7:30 pm

2009

There is no WebKit on Mobile. PPK ran 27 tests against 19 different WebKit-on-mobile implementations and found enormous disparities between the levels of support in currently available mobile phones. # 7th October 2009, 12:23 pm

shunit2 (via) xUnit style testing for shell scripts. # 27th September 2009, 7:34 pm

Fabric factory. Promising looking continuous integration server written in Django, which uses Fabric scripts to define actions. # 21st September 2009, 6:35 pm