Discovering Berkeley DB
I’m working on a project at the moment which involves exporting a whole bunch of data out of an existing system. The system is written in Perl and uses Berkeley DB files for most of its storage.
I’d never done anything with Berkeley DB before, but luckily Python has a module which seems to do all of the hard work for me:
>>> db = bsddb.btopen('xpand.db')
>>> db.keys()[0:10]
[':archives:index.html', ':art:test.html', ...
>>> db[':art:test.html']
'template;front.tp\x01\x01'
>>>
The Berkeley DB libraries are maintained by Sleepycat Software. Unfortunately, their site is completely saturated with marketing jargon. Our customers rely on Berkeley DB for fast, scalable, reliable and cost-effective data management for their mission-critical applications
. Great—now what does it do exactly?
Some digging around turned up the real information: the Berkeley DB Tutorial and Reference Guide, which contains pretty much everything you could possible want to know about the technology. It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures—anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed in to a string of bytes, so what you do with it is completely up to you. The only operations available are “store this value under this key”, “check if this key exists” and “retrieve the value for this key” so conceptually it’s pretty simple—the complicated stuff all happens under the hood.
It seems like a great alternative to a full on relational database for simple applications, although I’m slightly confused by the license which allows free use for open source products but requires a license for commercial applications. Does that mean that if I use the bsddb Python module in a commercial app I need to get a license from Sleepycat?
More recent articles
- Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking - 4th June 2023
- It's infuriatingly hard to understand how closed models train on their input - 4th June 2023
- ChatGPT should include inline tips - 30th May 2023
- Lawyer cites fake cases invented by ChatGPT, judge is not amused - 27th May 2023
- llm, ttok and strip-tags - CLI tools for working with ChatGPT and other LLMs - 18th May 2023
- Delimiters won't save you from prompt injection - 11th May 2023
- Weeknotes: sqlite-utils 3.31, download-esm, Python in a sandbox - 10th May 2023
- Leaked Google document: "We Have No Moat, And Neither Does OpenAI" - 4th May 2023
- Midjourney 5.1 - 4th May 2023
- Prompt injection explained, with video, slides, and a transcript - 2nd May 2023