15th September 2003
I adore the Python interactive interpreter. I use it for development (it’s amazing how many bugs you can skip by testing your code line by line in the interactive environment), I use it for calculations, but recently I’ve also found myself using it just as a general tool for answering questions.
Here’s a classic example. This blog entry describes a campaign to reimburse the 12-year-old girl recently fined $2000 by the RIAA for file sharing. The full amount has been raised, and a list of donors is available along with how much each donated. Being the inquisitive type I am, I wanted to know how much money was raised in total. First, I copied and pasted the list into a Python string in IDLE:
>>> s = """$20 - Emmett Plant, USA
$20 - Peter Mills, UK
$20 - "Billy Blackbeard," USA
...
$10 - Will Morton"""
>>>
All of the monetary values consist of 2 digits, so next I compiled and tested a regular expression to search for them:
>>> import re
>>> num = re.compile(r'\d\d')
>>> num.findall(s)
['20', '20', '20', '20', '20', '20', '20', ...
Now I can run the sum() function to add them all up:
>>> sum(num.findall(s))
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in -toplevel-
    sum(num.findall(s))
TypeError: unsupported operand type(s) for +: 'int' and 'str'
sum operates on integers but the list is full of strings. We can use map to apply the int function to every item in the list first:
>>> sum(map(int, num.findall(s)))
2005
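For anyone following along in a current Python 3 session, the steps above can be sketched roughly as follows. The string holds only the four donor lines quoted above, so the total is not the real one, and the `\$(\d+)` pattern is my own assumption about a slightly stricter alternative to `\d\d`: it only picks up digits that directly follow a dollar sign.

```python
import re

# Only the four donor lines quoted in the post; the real list is much longer.
s = """$20 - Emmett Plant, USA
$20 - Peter Mills, UK
$20 - "Billy Blackbeard," USA
$10 - Will Morton"""

# The post's pattern: any two consecutive digits.
two_digits = re.compile(r'\d\d')

# Assumed stricter alternative: digits that directly follow a dollar sign.
dollars = re.compile(r'\$(\d+)')

amounts = two_digits.findall(s)        # findall returns strings, not numbers
total = sum(int(a) for a in amounts)   # convert each match before summing

strict_total = sum(int(a) for a in dollars.findall(s))

print(total, strict_total)
```

Both patterns agree on this sample. They would differ if a name or address happened to contain two digits in a row, or if someone donated a three-digit amount.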
And there’s the answer. I think this quite neatly demonstrates the power and flexibility of the interactive prompt. For one thing, it shows that errors really don’t matter, as you can simply try again the next time round. It also shows that most of the time you don’t even need to assign additional variables: Python is fast enough that you can just build up more and more complicated expressions. When you’re just trying to find a one-off answer to a problem, code readability doesn’t really come into the equation.
A more interesting problem that came up today was working out the percentage of Netscape 4 visits to the Python.org site in the last month, as part of a mailing list discussion on whether or not the site should embrace a pure CSS layout. The raw data is a huge, ugly file listing 12,000-odd user agent strings along with the number of hits from each. My first step was to copy out the data part of the file and save it as a text file. I also compiled a new regular expression to find all lines that start with a number, which I could then use to make sure the loaded data was in the right format.
>>> num = re.compile(r'^(\d+)')
>>> lines = open('python-browser-stats.txt').readlines()
>>> lines = [line for line in lines if num.match(line)]
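Here is a small sketch of what that filtering step does, written for Python 3. The `raw_lines` list is a hypothetical stand-in for the result of `readlines()`; only the Mozilla/4.01 line comes from the real data, and the header and bot lines are invented to show what gets dropped.

```python
import re

num = re.compile(r'^(\d+)')

# Hypothetical stand-in for open('python-browser-stats.txt').readlines().
raw_lines = [
    'Hits  Percent  User agent\n',                  # invented header: no leading digit
    '3536 0.05% Mozilla/4.01 [en](Win95;I)\n',      # quoted in the post
    '\n',                                           # blank line
    '120 0.00% ExampleBot/1.0\n',                   # invented entry
]

# Keep only the lines that begin with a run of digits (the hit count).
lines = [line for line in raw_lines if num.match(line)]

print(len(lines))
```

Since `match()` only ever matches at the start of the string, the `^` in the pattern is not strictly needed, but it does make the intent obvious.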
Finding the lines that contained a user agent string for Netscape 4 took a bit of effort, mainly because of the utterly insane way user agent strings have evolved over the years. I eventually settled on the rule that anything with Mozilla/4.x in it without the word 'compatible' was probably a Netscape 4 variant. I excluded anything with 'Gecko' in it as well, but with hindsight this was unnecessary as the user agent strings of Gecko browsers all start with Mozilla/5.x.
>>> netscape = [line for line in lines if 'Mozilla/4' in line and 'compatible' not in line and 'Gecko' not in line]
Are you getting the impression that I love list comprehensions yet?
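The rule is easier to sanity check when it is pulled out into a named function. This is only a sketch: `looks_like_netscape4` is a name I have made up for illustration, and the sample lines other than the Mozilla/4.01 one are invented to resemble typical Internet Explorer and Gecko user agent strings.

```python
def looks_like_netscape4(line):
    """Heuristic from the post: Mozilla/4.x, not 'compatible', not Gecko."""
    return ('Mozilla/4' in line
            and 'compatible' not in line
            and 'Gecko' not in line)

# Only the first line is taken from the real data; the others are invented.
lines = [
    '3536 0.05% Mozilla/4.01 [en](Win95;I)\n',
    '9000 0.12% Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\n',
    '800 0.01% Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624\n',
]

netscape = [line for line in lines if looks_like_netscape4(line)]

print(netscape)
```

Only the first line survives: the second is ruled out by 'compatible', and the third never contains 'Mozilla/4' in the first place, which is why the Gecko test is redundant.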
When working in the interactive prompt it’s a good idea to periodically check that the data you are dealing with looks how you expect it to look. I’ve stripped down the explanation of what I did quite a bit; in fact there was a lot more checking of variables and lists to make sure nothing had gone wrong. At this point, here’s what an item in my netscape list looked like:
>>> netscape[0]
'3536 0.05% Mozilla/4.01 [en](Win95;I)\n'
OK, I now had two lists, one featuring all of the lines in the input set and another featuring just those lines that referred to a Netscape 4 browser. The final trick was to add up the total numbers for each of those sets. Remember, the total is the sum of all of the numbers at the start of each line. First, I built up new lists of just those numbers (as integers) using the regular expression defined previously:
>>> nscounts = [int(num.match(line).groups()[0]) for line in netscape]
>>> allcounts = [int(num.match(line).groups()[0]) for line in lines]
We now have two lists of numbers. The total for each list can be found with the sum function, but we want the overall percentage of Netscape 4 user agents:
>>> print float(sum(nscounts)) / sum(allcounts) * 100
1.17457446601
The float call is in there because Python disregards the remainder in straight integer division; casting one of the arguments to a float makes Python use floating point division instead. As you can see, only approximately 1.17% of visits to Python.org in August were made using Netscape 4*. The case for CSS seems assured.
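The session above was run on the Python of the time. In Python 3, print is a function and `/` between two integers already performs floating point division, so the float call is no longer needed; `//` gives the old truncating behaviour. Here is a rough Python 3 sketch of the counting and percentage steps. The lines (apart from the Mozilla/4.01 one) and the helper name `leading_count` are invented for illustration, so the resulting figure means nothing.

```python
import re

num = re.compile(r'^(\d+)')

# Invented lines, apart from the first, which is quoted in the post.
lines = [
    '3536 0.05% Mozilla/4.01 [en](Win95;I)\n',
    '464 0.01% Mozilla/4.7 [en] (X11; I; Linux)\n',
    '96000 1.30% Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\n',
]
netscape = lines[:2]  # pretend these two passed the Netscape 4 filter


def leading_count(line):
    # group(1) is the first captured group: the digits at the start of the line
    return int(num.match(line).group(1))


nscounts = [leading_count(line) for line in netscape]
allcounts = [leading_count(line) for line in lines]

percentage = sum(nscounts) / sum(allcounts) * 100  # true division in Python 3
floored = sum(nscounts) // sum(allcounts)          # floor division, like the old int / int

print(round(percentage, 2), floored)
```

The floored value shows why the cast mattered back then: without it, a small count divided by a large one comes out as zero before the multiplication by 100 ever happens.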
This has turned into a longer entry than I had intended, but I hope it demonstrates the power and versatility of Python’s interactive mode.
* Please note that this figure is not entirely accurate, as it may also include web spiders that pretend to be Netscape 4, Opera users and a few other false positives as well. As an estimate though it’s probably pretty good.