Simon Willison’s Weblog


9 items tagged “nlp”


Matthew Honnibal from spaCy on why LLMs have not solved NLP. A common trope these days is that the entire field of NLP has been effectively solved by Large Language Models. Here’s a lengthy comment from Matthew Honnibal, creator of the highly regarded spaCy Python NLP library, explaining in detail why that argument doesn’t hold up. # 9th September 2023, 9:30 pm

Closed AI Models Make Bad Baselines (via) The NLP academic research community are facing a tough challenge: the state-of-the-art in large language models, GPT-4, is entirely closed, which means papers that compare it to other models lack replicability and credibility. “We make the case that as far as research and scientific publications are concerned, the ‘closed’ models (as defined below) cannot be meaningfully studied, and they should not become a ‘universal baseline’, the way BERT was for some time widely considered to be.”

Anna Rogers proposes a new rule for this kind of research: “That which is not open and reasonably reproducible cannot be considered a requisite baseline.” # 3rd April 2023, 7:57 pm

As an NLP researcher I’m kind of worried about this field after 10-20 years. Feels like these oversized LLMs are going to eat up this field and I’m sitting in my chair thinking, “What’s the point of my research when GPT-4 can do it better?”

Jeonghwan Kim # 16th March 2023, 5:39 am


Statistical NLP on OpenStreetMap. libpostal is ferociously clever: it’s a library for parsing and understanding worldwide addresses, built on top of a machine learning model trained on millions of addresses from OpenStreetMap. Al Barrentine describes how it works in this fascinating and detailed essay. # 8th January 2018, 7:33 pm


spaCy. “Industrial-strength Natural Language Processing in Python”. An exciting alternative to nltk: spaCy is mostly written in Cython, makes bold performance claims and ships with a range of pre-built statistical models covering multiple languages. The API design is clean and intuitive, and spaCy even includes an SVG visualizer that works with Jupyter. # 8th November 2017, 4:43 pm

Oxford Deep NLP 2017 course (via) Slides, course description and links to lecture videos for the 2017 Deep Natural Language Processing course at the University of Oxford presented by a team from Google DeepMind. # 31st October 2017, 8:39 pm


Which investors would consider a natural language processing startup in London?

I don’t know the answer, but I know how you can find it: track down as many London-based AI/machine learning/NLP startups as you can and look at who their investors are.

[... 48 words]


topia.termextract. Impressive Python term extraction library (similar to the various term extraction web APIs, but you can run it on your own hardware), incorporating a part-of-speech tagging algorithm. # 10th August 2009, 9:26 pm

JS-Placemaker—geolocate texts in JavaScript. Chris Heilmann exposed Placemaker to JavaScript (JSONP) using a YQL execute table. Try his examples—I’m impressed that “My name is Jack London, I live in Ontario” returns just Ontario, demonstrating that Placemaker’s NLP is pretty well tuned. # 23rd May 2009, 12:36 am