
Simon Willison’s Weblog

Interactive Python

I adore the Python interactive interpreter. I use it for development (it’s amazing how many bugs you can skip by testing your code line by line in the interactive environment), I use it for calculations, but recently I’ve also found myself using it just as a general tool for answering questions.

Here’s a classic example. This blog entry describes a campaign to reimburse the 12-year-old girl recently fined $2000 by the RIAA for file sharing. The full amount has been raised, and a list of donors is available along with how much each donated. Being the inquisitive type I am, I wanted to know how much money was raised in total. First, I copied and pasted the list into a Python string in IDLE:

>>> s = """$20 - Emmett Plant, USA
$20 - Peter Mills, UK
$20 - "Billy Blackbeard," USA
...
$10 - Will Morton"""
>>>

All of the monetary values consist of 2 digits, so next I compiled and tested a regular expression to search for them:

>>> import re
>>> num = re.compile(r'\d\d')
>>> num.findall(s)
['20', '20', '20', '20', '20', '20', '20', ...

Now I can run the sum() function to add them all up:

>>> sum(num.findall(s))

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in -toplevel-
    sum(num.findall(s))
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Oops! sum starts from the integer 0 and adds each item to it, but the list is full of strings. We can use map to apply the int function to every item in the list first:

>>> sum(map(int, num.findall(s)))
2005

And there’s the answer. I think this quite neatly demonstrates the power and flexibility of the interactive prompt. For one thing, it shows that errors really don’t matter, as you can simply try again the next time round. It also shows that most of the time you don’t even need to assign additional variables: Python is fast enough that you can just build up more and more complicated expressions. When you’re just trying to find a one-off answer to a problem, code readability doesn’t really come into the equation.
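The two-digit pattern works here only because every donation in the list happens to have two digits. A slightly more defensive sketch, assuming Python 3 and assuming each donation line begins with a dollar sign followed by a whole number, anchors the pattern to the start of each line instead. The sample string holds only the four lines quoted above (the elided donors stay elided), so its total is not the real one.

```python
import re

# Only the lines quoted in the entry; "..." stands for the elided donors.
s = """$20 - Emmett Plant, USA
$20 - Peter Mills, UK
$20 - "Billy Blackbeard," USA
...
$10 - Will Morton"""

# Assumption: every donation line starts with "$" followed by a whole number.
# re.MULTILINE makes ^ match at the start of each line, not just the string.
amount = re.compile(r'^\$(\d+)', re.MULTILINE)

# findall returns the captured group (the digits) as strings, so convert them.
amounts = [int(a) for a in amount.findall(s)]
total = sum(amounts)
print(amounts, total)
```

Because the pattern is anchored, digits that appear later in a line (in a name, say) are ignored, and an amount with more or fewer than two digits is still read whole.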

A more interesting problem that came up today was working out the percentage of Netscape 4 visits to the Python.org site in the last month, as part of a mailing list discussion on whether or not the site should embrace a pure CSS layout. The raw data is a huge, ugly file listing 12,000-odd user agent strings along with the number of hits from each. My first step was to copy out the data part of the file and save it as a text file. I also compiled a new regular expression to match lines that start with a number, which I could then use to keep only the lines that were in the right format.

>>> num = re.compile(r'^(\d+)')
>>> lines = open('python-browser-stats.txt').readlines() 
>>> lines = [line for line in lines if num.match(line)] 

Finding the lines that contained a user agent string for Netscape 4 took a bit of effort, mainly because of the utterly insane way user agent strings have evolved over the years. I eventually settled on the rule that anything with Mozilla/4.x in it without the word 'compatible' was probably a Netscape 4 variant. I excluded anything with 'Gecko' in it as well, but with hindsight this was unnecessary as Gecko browsers all start with Mozilla/5.x.

>>> netscape = [line for line in lines if
    'Mozilla/4' in line and
    'compatible' not in line and
    'Gecko' not in line]

Are you getting the impression that I love list comprehensions yet?
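The rule above can also be written as a small predicate, which makes it easy to try against individual lines before filtering the whole file. This is a minimal sketch: the function name is_netscape4 is made up for illustration, and only the first sample line comes from this entry. The other two are typical-looking user agent strings with invented counts.

```python
def is_netscape4(line):
    """Heuristic from the entry: Mozilla/4.x without 'compatible' or 'Gecko'."""
    return ('Mozilla/4' in line
            and 'compatible' not in line
            and 'Gecko' not in line)

# Lines in the same "hits  percent  user-agent" shape as the stats file.
# Only the first is taken from the entry; the rest are illustrative.
samples = [
    '3536       0.05%  Mozilla/4.01 [en](Win95;I)\n',
    '100        0.00%  Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\n',
    '100        0.00%  Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624\n',
]

flags = [is_netscape4(line) for line in samples]
print(flags)
```

The Gecko line is already rejected by the 'Mozilla/4' test, which is why the extra 'Gecko' check turns out to be redundant.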

When working in the interactive prompt it’s a good idea to periodically check that the data you are dealing with looks how you expect it to look. I’ve stripped down the explanation of what I did quite a bit—in fact there was a lot more checking of variables and lists to make sure nothing had gone wrong. At this point, here’s what an item in my netscape array looked like:

>>> netscape[0]
'3536       0.05%  Mozilla/4.01 [en](Win95;I)\n'

OK, I now had two arrays, one featuring all of the lines in the input set and another featuring just those lines that referred to a Netscape 4 browser. The final trick is to add up the total numbers for each of those sets. Remember, the total is the sum of all of the numbers at the start of each line. First, I built up new arrays of just those numbers (as integers) using the regular expression defined previously:

>>> nscounts = [int(num.match(line).groups()[0]) for line in netscape] 
>>> allcounts = [int(num.match(line).groups()[0]) for line in lines]

We now have two arrays of numbers. The total for each array can be found with the sum function, but we want the overall percentage of Netscape 4 user agents:

>>> print float(sum(nscounts)) / sum(allcounts) * 100
1.17457446601

The float call is in there because Python disregards the remainder in straight integer division; by converting one of the arguments to a float, floating point division is used instead. As you can see, only approximately 1.17% of visits to Python.org in August were made using Netscape 4*. The case for CSS seems assured.
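The session above is Python 2. In Python 3, the / operator always performs true division and // is the explicit floor division, so the float call is no longer needed. A minimal end-to-end sketch under that assumption follows. The sample lines and counts are invented for illustration and are not the Python.org data, and the helper name count is hypothetical.

```python
import re

num = re.compile(r'^(\d+)')

# Invented sample data; only the first line's format is taken from the entry.
raw = [
    '3536       0.05%  Mozilla/4.01 [en](Win95;I)\n',
    '6464       0.09%  Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\n',
    'User agent summary\n',  # no leading count, so the filter drops it
]

# Keep only lines that start with a hit count, then pick the Netscape 4 ones.
lines = [line for line in raw if num.match(line)]
netscape = [line for line in lines
            if 'Mozilla/4' in line and 'compatible' not in line]

def count(line):
    # group(1) is the run of digits captured at the start of the line.
    return int(num.match(line).group(1))

ns_total = sum(count(line) for line in netscape)
all_total = sum(count(line) for line in lines)

percent = 100 * ns_total / all_total   # true division in Python 3
floor_share = ns_total // all_total    # floor division discards the remainder
print(f'{percent:.2f}%')
```

Multiplying by 100 before dividing keeps the arithmetic in integers until the final step, which is a small design choice rather than a requirement.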

This has turned into a longer entry than I had intended, but I hope it demonstrates the power and versatility of Python’s interactive mode.

* Please note that this figure is not entirely accurate, as it may also include web spiders that pretend to be Netscape 4, Opera users and a few other false positives as well. As an estimate though it’s probably pretty good.

This is Interactive Python by Simon Willison, posted on 15th September 2003.


13 comments

  1. If you like the interactive python command line, you'll love this: have a look at IPython at scipy. Also, you'd love WingIDE even more, for a couple of reasons above all else.

    • Interactive Debug Probe - run a program and then use the interactive python command shell with all the scope of where you were when you halted the program.
    • Exception Rollback - Let the program hit an exception and Wing rolls back everything to the step before, and then you can use the interactive python command shell to check everything

    It looks a bit clunky but there's stuff being done about that at the moment. Best of all, it's a free license for open source projects (check website for details).

    Tim Parkin - 15th September 2003 22:06 - #

  2. The standard Python interpreter has the interactive debug probe - just run your script with python -i myscript.py. I use IDLE though so I don't often get to use that. I'll give Wing IDE a shot - thanks for the recommendation.

    On Unix, a fantastically useful addition to Python is this tab completion module. It requires readline though so I haven't been able to get it to work on Windows (although it would probably work with cygwin).

    Simon Willison - 15th September 2003 22:53 - #

  3. It is nice to have a better-than-basic interactive prompt to play around with, and I agree, list comprehensions are very nice... (One line Pig-Latin converter anyone?) :)

    pig = lambda s: ' '.join([((w[0] in 'aeiouAEIOU') and w[0] or '') + w[1:] + ((not w[0] in 'aeiouAEIOU') and w[0] or 'w') + 'ay' for w in s.split()])

    (and while I wont post multiple comments, nice link on the Python-pascals-triangle entry!)

    sfb - 15th September 2003 23:08 - #

  4. That's the list comprehension from hell. My eyes are watering.

    Simon Willison - 15th September 2003 23:26 - #

  5. Wow, sfb. So much for python being readable, huh? You mind if I quote that somewhere?

    Dast - 16th September 2003 02:33 - #

  6. "That's the list comprehension from hell. My eyes are watering." Oh it's not that bad! It could be a nice, readable function, but that would be far less interesting...

    "You mind if I quote that somewhere?" No, not even if it's in a 'worst Python ever' contest.

    sfb - 16th September 2003 11:20 - #

  7. What? No mention of PyCrust? I'm crushed! ;-)

    Patrick K. O'Brien - 21st September 2003 17:34 - #

  8. "Are you getting the impression that I love list comprehensions yet?"

    I rather got the feeling that you much prefer regular expressions to just about anything else. ;-) But I also do some of this kind of text processing sometimes, although I tend to inspect the input more interactively by splitting it up:

    >>> s.split("\n")
    ['$20 - Emmett Plant, USA', '$20 - Peter Mills, UK', '$20 - "Billy Blackbeard," USA', '...', '$10 - Will Morton']

    That checked to see if the lines look OK. Then, we remember the result and check the line format:

    >>> l=_
    >>> l[0].split()
    ['$20', '-', 'Emmett', 'Plant,', 'USA']

    Get the first column of each line:

    >>> [line.split()[0] for line in l]
    ['$20', '$20', '$20', '...', '$10']

    And so on... Python is great for this kind of thing, and if you're going to repeat this activity later, you can do some copying and pasting to get your script.

    Paul Boddie - 22nd September 2003 17:43 - #

  9. Actually I have a confession to make - my first attempt at adding up those monetary values used the split method without a regular expression in sight. It was only afterwards, when I was playing around with the data a bit more, that I realised several lines of splitting code could be replaced with a single findall() on a regular expression.

    The other thing that's been influencing my use of regular expressions recently is David Mertz's book "Text Processing in Python", which has excellent coverage of them and has inspired me to learn more about them.

    Simon Willison - 23rd September 2003 09:10 - #

  10. python for gui development and python for web site development.

    michael a. beaver - 23rd September 2003 16:07 - #

  11. Could you plz send all the notes about python to my e-mail {enkwinika@webmail.co.za} thank you eugine

    Eugine - 20th February 2005 15:06 - #

  12. how can i convert l=['5','2','6'] from string to integer l=[5,2,6] emaeyak

    emaeyak - 19th April 2006 04:17 - #

  13. how can i convert l=['5','2','6'] from string to integer l=[5,2,6] in python emaeyak

    emaeyak - 19th April 2006 04:19 - #

Comments are closed.

Previously hosted at http://simon.incutio.com/archive/2003/09/15/interactivePython
