Posts tagged python, xml
Filters: python × xml × Sorted by date
My First Open Source AI Generated Library (via) Armin Ronacher had Claude and Claude Code do almost all of the work in building, testing, packaging and publishing a new Python library based on his design:
- It wrote ~1100 lines of code for the parser
- It wrote ~1000 lines of tests
- It configured the entire Python package, CI, PyPI publishing
- Generated a README, drafted a changelog, designed a logo, made it theme-aware
- Did multiple refactorings to make me happier
The project? sloppy-xml-py, a lax XML parser (and violation of everything the XML Working Group hold sacred) which ironically is necessary because LLMs themselves frequently output "XML" that includes validation errors.
Claude's SVG logo design is actually pretty decent, turns out it can draw more than just bad pelicans!

I think experiments like this are a really valuable way to explore the capabilities of these models. Armin's conclusion:
This was an experiment to see how far I could get with minimal manual effort, and to unstick myself from an annoying blocker. The result is good enough for my immediate use case and I also felt good enough to publish it to PyPI in case someone else has the same problem.
Treat it as a curious side project which says more about what's possible today than what's necessarily advisable.
I'd like to present a slightly different conclusion here. The most interesting thing about this project is that the code is good.
My criteria for good code these days is the following:
- Solves a defined problem, well enough that I'm not tempted to solve it in a different way
- Uses minimal dependencies
- Clear and easy to understand
- Well tested, with tests prove that the code does what it's meant to do
- Comprehensive documentation
- Packaged and published in a way that makes it convenient for me to use
- Designed to be easy to maintain and make changes in the future
sloppy-xml-py
fits all of those criteria. It's useful, well defined, the code is readable with just about the right level of comments, everything is tested, the documentation explains everything I need to know, and it's been shipped to PyPI.
I'd be proud to have written this myself.
This example is not an argument for replacing programmers with LLMs. The code is good because Armin is an expert programmer who stayed in full control throughout the process. As I wrote the other day, a skilled individual with both deep domain understanding and deep understanding of the capabilities of the agent.
Anthropic’s Prompt Engineering Interactive Tutorial (via) Anthropic continue their trend of offering the best documentation of any of the leading LLM vendors. This tutorial is delivered as a set of Jupyter notebooks - I used it as an excuse to try uvx like this:
git clone https://github.com/anthropics/courses
uvx --from jupyter-core jupyter notebook courses
This installed a working Jupyter system, started the server and launched my browser within a few seconds.
The first few chapters are pretty basic, demonstrating simple prompts run through the Anthropic API. I used %pip install anthropic
instead of !pip install anthropic
to make sure the package was installed in the correct virtual environment, then filed an issue and a PR.
One new-to-me trick: in the first chapter the tutorial suggests running this:
API_KEY = "your_api_key_here" %store API_KEY
This stashes your Anthropic API key in the IPython store. In subsequent notebooks you can restore the API_KEY
variable like this:
%store -r API_KEY
I poked around and on macOS those variables are stored in files of the same name in ~/.ipython/profile_default/db/autorestore
.
Chapter 4: Separating Data and Instructions included some interesting notes on Claude's support for content wrapped in XML-tag-style delimiters:
Note: While Claude can recognize and work with a wide range of separators and delimeters, we recommend that you use specifically XML tags as separators for Claude, as Claude was trained specifically to recognize XML tags as a prompt organizing mechanism. Outside of function calling, there are no special sauce XML tags that Claude has been trained on that you should use to maximally boost your performance. We have purposefully made Claude very malleable and customizable this way.
Plus this note on the importance of avoiding typos, with a nod back to the problem of sandbagging where models match their intelligence and tone to that of their prompts:
This is an important lesson about prompting: small details matter! It's always worth it to scrub your prompts for typos and grammatical errors. Claude is sensitive to patterns (in its early years, before finetuning, it was a raw text-prediction tool), and it's more likely to make mistakes when you make mistakes, smarter when you sound smart, sillier when you sound silly, and so on.
Chapter 5: Formatting Output and Speaking for Claude includes notes on one of Claude's most interesting features: prefill, where you can tell it how to start its response:
client.messages.create( model="claude-3-haiku-20240307", max_tokens=100, messages=[ {"role": "user", "content": "JSON facts about cats"}, {"role": "assistant", "content": "{"} ] )
Things start to get really interesting in Chapter 6: Precognition (Thinking Step by Step), which suggests using XML tags to help the model consider different arguments prior to generating a final answer:
Is this review sentiment positive or negative? First, write the best arguments for each side in <positive-argument> and <negative-argument> XML tags, then answer.
The tags make it easy to strip out the "thinking out loud" portions of the response.
It also warns about Claude's sensitivity to ordering. If you give Claude two options (e.g. for sentiment analysis):
In most situations (but not all, confusingly enough), Claude is more likely to choose the second of two options, possibly because in its training data from the web, second options were more likely to be correct.
This effect can be reduced using the thinking out loud / brainstorming prompting techniques.
A related tip is proposed in Chapter 8: Avoiding Hallucinations:
How do we fix this? Well, a great way to reduce hallucinations on long documents is to make Claude gather evidence first.
In this case, we tell Claude to first extract relevant quotes, then base its answer on those quotes. Telling Claude to do so here makes it correctly notice that the quote does not answer the question.
I really like the example prompt they provide here, for answering complex questions against a long document:
<question>What was Matterport's subscriber base on the precise date of May 31, 2020?</question>
Please read the below document. Then, in <scratchpad> tags, pull the most relevant quote from the document and consider whether it answers the user's question or whether it lacks sufficient detail. Then write a brief numerical answer in <answer> tags.
Using memory-profiler to debug excessive memory usage in healthkit-to-sqlite. This morning I figured out how to use the memory-profiler module (and mprof command line tool) to debug memory usage of Python processes. I added the details, including screenshots, to this GitHub issue. It helped me knock down RAM usage for my healthkit-to-sqlite from 2.5GB to just 80MB by making smarter usage of the ElementTree pull parser.
minixsv (via) As far as I can tell, this is the only library that can validate XML using pure Python (no C extension required). I’d be extremely happy if someone would write a pure Python library (or one that only depends on ElementTree, which is included in the standard library) for validating XML against a Relax NG Compact syntax schema. Even DTD validation would be better than nothing!
xmlwitch. An XML building library for Python that doesn’t suck (I love ElementTree for parsing XML, but I’ve never really liked it for generation). Makes smart use of the with statement.
A few notes on the Guardian Open Platform
This morning we launched the Guardian Open Platform at a well attended event in our new offices in Kings Place. This is one of the main projects I’ve been helping out with since joining the Guardian last year, and it’s fantastic to finally have it out in the open.
[... 839 words]How to install lxml python module on mac os 10.5 (leopard). Instructions that work! Finally, I can find out what all the fuss is about.
pyquery. “A jQuery-like library for Python”—implemented on top of lxml, providing jQuery style methods for manipulating an HTML or XML document.
My Universal Feed Parser was conceived as a weapon against what I considered the gravest error of XML: draconian error handling. Recently, someone asked me to implement a switch that makes it not fall back on lax parsing in the case of an XML wellformedness error. I said no, not because it would be difficult to implement, but because that defeats its entire reason for being.
PDFMiner. Useful looking PDF parsing library in Python—can produce an XML representation of the text and style information in a PDF document.
Protocol Buffers: Google’s Data Interchange Format. Open sourced today. Highly efficient binary protocol for storing and transmitting structured data between C++, Java and Python. Uses a .proto file describing the data structure which is compiled to classes in those languages for serializing and deserializing. 3-10 times smaller and 20-100 times faster than XML.
Atom Models. Building Python classes that act as utility wrappers around data stored in an lxml DOM object.
Apache Solr 1.1. Solr is the search Web Service built on top of Lucene. The latest release introduces JSON, Python and Ruby response formats in addition to XML.
PHP 5 Release Candidate 1
I haven’t blogged much about PHP in a while because I’ve been up to my nose in mod_python and loving every minute of it. This news is just too important to miss: PHP 5 Release Candidate 1 has been released, bringing the first production-ready release tantilisingly close. While I doubt PHP 5 will tempt me back it’s definitely an exciting upgrade—my biggest complaint with PHP 4 is the brain-dead object model which defaults to copying whole objects rather than passing references, and this is one of the many things addressed by PHP 5. The new libxml2 powered XML features sound really powerful, and SQLite as an on-board database should be ideal for knocking out small stand-alone applications without needing to set up a mySQL database for them.
[... 173 words]XML.com: Lightweight XML Search Servers [Jan. 21, 2004] (via) More fun with Python and libxml2
Ned Batchelder: handyxml. Yet another XML object wrapper for Python, this time with full DOM method support included
Using XPath to mine XHTML
This morning, I finally decided to install libxml2 and see what all the fuss was about, in particular with respect to XPath. What followed is best described as an enlightening experience.
[... 576 words]xmlhack news wire
xmlhack’s “Editor’s Newswire” is interesting. It is a small column (explained here) located on the right hand side of the site that displays the latest XML news snippets “in real time”. The interesting part is how the section is updated—an IRC bot (The Daily Chump Bot, written in Python) monitors a channel for specific commands from authorised users, and produces an XML file of new snippets. Site updates through IRC (or instant messenging services such as MSN or Jabber) is a concept which we could see a lot more of, especially in this age of web services.