Simon Willison’s Weblog

Discovering Berkeley DB

I’m working on a project at the moment which involves exporting a whole bunch of data out of an existing system. The system is written in Perl and uses Berkeley DB files for most of its storage.

I’d never done anything with Berkeley DB before, but luckily Python has a module which seems to do all of the hard work for me:

>>> import bsddb
>>> db = bsddb.btopen('xpand.db')
>>> db.keys()[0:10]
[':archives:index.html', ':art:test.html', ... 
>>> db[':art:test.html']
'template;front.tp\x01\x01'

The Berkeley DB libraries are maintained by Sleepycat Software. Unfortunately, their site is completely saturated with marketing jargon: “Our customers rely on Berkeley DB for fast, scalable, reliable and cost-effective data management for their mission-critical applications.” Great, but what does it actually do?

Some digging around turned up the real information: the Berkeley DB Tutorial and Reference Guide, which contains pretty much everything you could possibly want to know about the technology. It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary-style data structures: anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed into a string of bytes, so what you do with it is completely up to you. The only operations available are “store this value under this key”, “check if this key exists” and “retrieve the value for this key”, so conceptually it’s pretty simple; the complicated stuff all happens under the hood.
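Those three operations map directly onto Python's dictionary syntax. The following is a minimal sketch under stated assumptions, not the code used for the export: the bsddb module only ships with Python 2, so the standard-library dbm module (the Python 3 successor to anydbm) stands in for it here, and the file name, the roundtrip helper and the missing key are hypothetical.

```python
import dbm
import os
import tempfile


def roundtrip(path):
    """Store one value, test for two keys, then read the value back."""
    # "c" opens the database for reading and writing, creating it if needed.
    with dbm.open(path, "c") as db:
        # Store this value under this key (both are plain byte strings).
        db[b":art:test.html"] = b"template;front.tp"
        # Check if a key exists.
        exists = b":art:test.html" in db
        missing = b":no:such:key" in db
        # Retrieve the value for a key.
        value = db[b":art:test.html"]
    return exists, missing, value


# Use a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    result = roundtrip(os.path.join(tmp, "example"))

print(result)
```

Which on-disk format dbm picks depends on the libraries available on the machine, so the file this creates is not necessarily a Berkeley DB file.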

It seems like a great alternative to a full-on relational database for simple applications, although I’m slightly confused by the license, which allows free use for open source products but requires a license for commercial applications. Does that mean that if I use the bsddb Python module in a commercial app I need to get a license from Sleepycat?

This is Discovering Berkeley DB by Simon Willison, posted on 26th November 2003.


12 comments

  1. Be aware of one trap with Berkeley DB: it has a notorious history of changing the on-disk storage format between small version upgrades. To the point that data written with one version has not always been readable with a new point-release upgrade. They may have stabilised lately, in view of the public feedback, but this has been a problem for a few years.

    That being said, this package is a great one for many data storage needs (and I cannot help you with the license question, having never found myself in that position and so having not thought about it).

    Malcolm Tredinnick - 26th November 2003 03:13 - #

  2. Be sure to also check out the anydbm module (also in the standard Python library), which transparently provides the same interface for any of the DBM-style databases.

    anders - 26th November 2003 03:49 - #

  3. Just a note to check out the Berkeley DB XML development weblog. It contains a lot of FAQs and suchlike that you may find useful, if you want to go for an XML-based approach.

    Ben Meadowcroft - 26th November 2003 07:36 - #

  4. Just a note that you're right: DBM-style databases, Berkeley DB3 in particular, are very useful for many cases where you don't need a full-on RDBMS, but just need persistent dictionaries. FAST used them for storing most of the search engine index results...

    Sterling Hughes - 26th November 2003 09:46 - #

  5. You're right about the license. It's straight GPL, therefore viral. *But* if you pay them money, they'll let you use it under a normal commercial license so you don't have to release your own code under the GPL.

    Mark - 26th November 2003 11:44 - #

  6. In the past, the ruling has been that the only thing that needs to be open-source is the Python BerkeleyDB extension. A commercial application built on top of the Python extension doesn't need to be open sourced.

    A.M. Kuchling - 26th November 2003 12:20 - #

  7. My experience would confirm Andrew's statement. I contacted SleepyCat some time ago re: their license and in their response (wish I could find that e-mail!) they indicated that they would not require a commercial license for commercial software that was "completely written in an open-source language such as Perl or Python" (a rough paraphrase from memory).

    Of course, one could e-mail them independently and get confirmation straight from the cat's mouth. ;-)

    Graham Fawcett - 28th November 2003 04:54 - #

  8. I asked for their license terms a few days ago. Here's what they told me:

    As you may know, Berkeley DB is an Open Source product. You may download it and use it at no charge, provided that you meet at least one of these two requirements:

    + You do not redistribute it off of a single physical site. If you use Berkeley DB on a Web server that is accessed from many locations, but you do not redistribute the actual code that uses Berkeley DB, you do not need to pay us.
    + If you redistribute your application, you make your source code freely available.

    If, instead, you want to redistribute in binary form the application that uses Berkeley DB, you need to purchase a license from Sleepycat Software.

    The pricing for Berkeley DB depends on the services that you need. We offer five different Berkeley DB products. All five are available for download in source code form on Sleepycat's web site. The five different products are:

    + Berkeley DB Data Store: Intended for single-user, or multi-user read-only, applications that don't need transactions or disaster recovery.
    + Berkeley DB Concurrent Data Store: This product allows multiple users to use the database at the same time, with any mix of readers and writers. It does not support transactions or disaster recovery.
    + Berkeley DB Transactional Data Store: This is our enterprise-class embedded database manager. It supports arbitrarily many concurrent users, with any mix of readers and writers. It also supports transactions and recovery from application, system, or hardware crashes.
    + Berkeley DB High Availability: HA provides single-master replication with fail-over, and is built on top of the Berkeley DB TDS product. Using HA, you can run multiple instances of your application on different servers on a network. All updates must go to the master server, which distributes changes to all replica servers. Each of the replicas can support read queries on the database. If the master fails for any reason, one of the replicas is able to take its place, and the application can continue to run.
    + Berkeley DB XML: Built on top of TDS or HA, DB XML is a high-performance, extremely reliable embedded database engine that stores and manages XML data.

    If you need more technical details about any of the five products, you can look at our on-line documentation, where you will find our reference manual containing a wealth of useful information. Go to http://www.sleepycat.com/ and click on the "Documentation" link, then the link for the reference manual.

    We offer a number of pricing models for each of the products. Our most popular option requires an up-front, one-time buyout payment, and thereafter permits unlimited royalty-free distribution of Berkeley DB. If the buyout pricing is not appropriate for your expected distribution, or is prohibitive as a one-time fee, we can talk about other possible arrangements.

    Our buyout price list, including information about our annual maintenance packages, is on line at http://www.sleepycat.com/pricing.shtml

    Our standard licensing agreement for binary redistribution is at http://www.sleepycat.com/license/license.pdf

    During your evaluation of Berkeley DB we offer free technical support. During that time, we'll be glad to help you debug and fix problems, including performance issues, should you have any. If you'd like to sign up for free evaluation support, please let me know and I'd be happy to set that up for you.

    Thank you for your interest in Berkeley DB. I look forward to learning about the application you're building, and how we can help you with your data-management needs. Please let me know how I can be of further assistance.

    jens - 4th December 2003 17:34 - #

  9. I'm glad you found SleepyCat's Berkeley DB Tutorial and Reference Guide both useful and legible ...

    "The DB_ENV->open method ... provides a structure for creating a consistent environment for processes using one or more of the features of Berkeley DB."

    and that's just the first paragraph on the first method!

    In my estimation, this 'manual' tome stands at the pinnacle of abstract expressionism in geek literature, opaque as it is obtuse, and brimming with impenetrable jargon. It's easy to see how the tech crew and the marketing crew could be best friends ;)

    That said, and thank you for the opportunity to say it (I feel better now) there is a decent (Java-centric) getting-started tutorial at http://today.java.net/pub/a/today/2004/08/24/sleepy.html, but all I really want to know, in plain English even a fool like me can understand, is how do I implement multiprocess lock protection on an ordered Btree that uses cursors across a range of keys ...

    mrG - 7th October 2005 20:03 - #

  10. My only complaint about BerkeleyDB is that it's a wee bit flaky when not used with transactions. Databases can become corrupted, processes can deadlock, etc. I've found problems even when using CDS mode. Only problem with wrapping everything in transactions is the performance hit. So, here's what we came up with as a compromise (we use the excellent BerkeleyDB.pm module from our Perl code): lock the entire database on write with a semaphore. The overhead is negligible in terms of speed, but it's done a remarkable job of keeping our indexes very clean!

    zyxtberk - 8th February 2006 04:24 - #

  11. The thing is, how much do they charge if I want to bundle my application with the database? I don't like it when software companies behave like a used-car shop. Tell me the price!

    dbusertobe - 21st May 2006 03:21 - #

  12. Perl/Python Licensing - I found this under Sleepycat licensing.

    Do I have to pay for a Berkeley DB license to use it in my Perl or Python scripts?

    No, you may use the Berkeley DB open source license at no cost. The Berkeley DB open source license requires that software that uses Berkeley DB be freely redistributable. In the case of Perl or Python, that software is Perl or Python, and not your scripts. Any scripts you write are your property, including scripts that make use of Berkeley DB. None of the Perl, Python or Berkeley DB licenses place any restrictions on what you may do with them.

    John Resler - 8th June 2006 11:39 - #
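
The compromise described in comment 10, serializing every write behind a single lock instead of wrapping it in a transaction, could be sketched roughly as follows. This is an illustration under stated assumptions, not the commenter's Perl code: it uses an advisory file lock from fcntl.flock (Unix only) in place of their semaphore and the standard-library dbm module in place of Berkeley DB, and write_lock, locked_store and the file names are hypothetical.

```python
import dbm
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def write_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the enclosed block."""
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock on this file.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def locked_store(db_path, key, value):
    """Write one key while holding the database-wide write lock."""
    with write_lock(db_path + ".lock"):
        with dbm.open(db_path, "c") as db:
            db[key] = value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index")
    locked_store(path, b"k", b"v")
    # Reads go straight to the database, matching the comment's lock-on-write approach.
    with dbm.open(path, "r") as db:
        stored = db[b"k"]

print(stored)
```

The lock is advisory, so it only protects the database if every writer goes through locked_store.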


Previously hosted at http://simon.incutio.com/archive/2003/11/26/discoveringBerkeleyDB
