Command line blacklisting
Just over a year ago, I started blacklisting domain names from links featured in comment spam. My idea then was that these blacklists could become a shared resource: people would publish their own blacklist and subscribe to those of people they trust, thus making it much harder for spammers to operate. While the sheer volume of spam domains meant that the technique was much less useful than I originally anticipated, I’ve continued to maintain my blacklist ever since as a preventative measure against repeat spammers.
I have a confession to make: all of my blog administration (with the exception of adding entries and blogmarks) is performed using phpMyAdmin. The trouble with writing your own software is that it’s very easy to skimp on the backend tools, since you’re the only person who will ever see them. Incidentally, this is the main reason I plan to switch to WordPress just as soon as I find the inspiration to write the necessary import scripts. Comments are deleted in phpMyAdmin, and domains are blacklisted by manually editing the blacklist.txt file via FTP.
This has been really bugging me, especially since I have so little other use for FTP that my only installed client is an unregistered version of Transmit (closes after ten minutes, won’t save passwords along with account details). I’ve been muddling along with that for longer than I care to admit, but today I decided to take 10 minutes out to solve the problem once and for all. I could have put together a web interface for adding new domains but I wasn’t really in the mood, so I decided to put time spent reading The Art of Unix Programming to good use and knock out a simple command line application.
The result (minus my login details) can be found here. Sample usage: ./blacklist.py www.domain.org www.domain2.com. It follows the Unix ideal of being the simplest-thing-that-could-possibly-work, and ended up taking longer to write than I expected thanks mainly to the craziness of Python’s ftplib. I’ve seen complaints about this before, and it thoroughly deserves its bad reputation.
Here’s one example:
retrlines is the method used to retrieve ASCII text from the server. Bizarrely, it doesn’t actually return the text received; instead, it expects you to provide it with a callback function that will be fed each line in turn, minus the newline. Sounds like a job for StringIO, but StringIO objects don’t have a writeline method (required to add the newline back on). I ended up writing my own extension of the StringIO class, StringIO2, and adding a writeline method just to preserve the newlines returned from the server!
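For illustration, here is a rough sketch of that workaround in modern Python 3. It is not the original script: the LineBuffer class and fetch_text function are hypothetical names, and it assumes io.StringIO in place of the old StringIO module. The only ftplib call it relies on is retrlines(cmd, callback).

```python
import io


class LineBuffer(io.StringIO):
    """StringIO with a writeline method that restores the stripped newline."""

    def writeline(self, line):
        # retrlines hands over each line without its line ending,
        # so add one back before storing it.
        self.write(line + "\n")


def fetch_text(ftp, filename):
    """Return the contents of a remote text file as a single string.

    `ftp` is assumed to be a connected, logged-in ftplib.FTP instance
    (or anything else that offers a compatible retrlines method).
    """
    buffer = LineBuffer()
    # The callback is invoked once per line received from the server.
    ftp.retrlines("RETR " + filename, buffer.writeline)
    return buffer.getvalue()
```

Collecting the lines with a plain list's append method and joining them with newlines afterwards would probably do the same job without a subclass.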
Strange APIs aside, I’m pretty pleased with the final result. It follows a bunch of Unix design patterns (and skips others such as those related to configuration, but I’m not overly bothered about those) including the following:
- A usage note is displayed if no arguments are provided.
- Multiple domains can be blacklisted at once, by providing them as multiple command line arguments.
- Domains that are already in the blacklist are skipped, and a message is written to standard error.
- If the script succeeds, it doesn’t say anything at all.
It also uses the common Python idiom of wrapping the principal logic in a function and then calling that from a block that runs only if the file is executed directly (the __name__ == '__main__' check) so that other Python code can import the module and reuse its functionality if required.
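A minimal sketch of how those behaviours and the __name__ check might fit together, written for Python 3. The function names, the message wording and the empty placeholder list are all assumptions made for illustration; the FTP download and upload steps of the real script are left out.

```python
import sys


def add_domains(existing, new_domains, err=sys.stderr):
    """Return a copy of `existing` plus any domains not already listed."""
    updated = list(existing)
    for domain in new_domains:
        if domain in updated:
            # Duplicates are skipped, with a note on standard error.
            print("Skipping %s: already blacklisted" % domain, file=err)
        else:
            updated.append(domain)
    return updated


def main(argv):
    if len(argv) < 2:
        # No arguments: show a usage note and do nothing else.
        print("Usage: %s domain [domain ...]" % argv[0], file=sys.stderr)
        return
    existing = []  # placeholder: the real script fetches blacklist.txt here
    add_domains(existing, argv[1:])
    # Placeholder: the updated list would be uploaded here.
    # On success nothing is printed at all.


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when imported.
    main(sys.argv)
```

Because the work lives in add_domains and main, another module could import the file and call add_domains without triggering the command line handling.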
There’s plenty of room for improvement: being able to pipe a list of domains in via standard input would be nice, and hard coding the (unencrypted) username and password is sloppy (as is expecting the blacklist.txt file to live in the FTP home directory). Even better, with SSH access the whole thing could be replaced with an infinitely more secure one-liner:
echo www.domain-to-ban.org | ssh username@server "cat - >> blacklist.txt". I’m happy though: an irritating task has become much less irritating and I have some example code to fall back on next time I need to get mucky with