Simon Willison’s Weblog

Sunday, 27th June 2021

Group thousands of similar spreadsheet text cells in seconds (via) Luke Whyte explains how to efficiently group similar text values in a table column (Walmart and Wal-mart, for example) using a clever combination of TF-IDF, sparse matrices and cosine similarity. It includes the clearest explanation of cosine similarity for text I’ve seen, and Luke wrote a Python library, textpack, that implements the described pattern.
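To make the idea concrete, here is a minimal standard-library sketch of the general approach: character trigrams weighted by TF-IDF, stored as sparse dicts, then compared by cosine similarity. This is not textpack's API or Luke's code. The function names, the trigram size, the smoothed IDF formula and the 0.5 threshold are all my own assumptions, and the naive pairwise loop illustrates the technique without any of the speed that a real sparse-matrix implementation would give you.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Lowercase, drop non-alphanumerics, and split into character n-grams."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    if len(cleaned) < n:
        return [cleaned] if cleaned else []
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def tfidf_vectors(values, n=3):
    """Return one unit-length sparse TF-IDF vector (a dict) per input value."""
    counts = [Counter(ngrams(v, n)) for v in values]
    # Document frequency: in how many values does each n-gram appear?
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(values)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption): rare n-grams get more weight.
        vec = {
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(b) < len(a):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def group_similar(values, threshold=0.5, n=3):
    """Map each value to the first earlier value it is similar enough to."""
    vectors = tfidf_vectors(values, n)
    reps = []  # indices of values chosen as group representatives
    mapping = {}
    for i, value in enumerate(values):
        match = next(
            (j for j in reps if cosine(vectors[i], vectors[j]) >= threshold),
            None,
        )
        if match is None:
            reps.append(i)
            mapping[value] = value
        else:
            mapping[value] = values[match]
    return mapping


if __name__ == "__main__":
    print(group_similar(["Walmart", "Wal-mart", "Target"]))
```

In this toy case Walmart and Wal-mart end up with identical trigrams once punctuation is stripped, so they share a group while Target stays on its own. The TF-IDF weighting matters more on messier data, where common fragments should count for less than distinctive ones.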

# 4:24 pm / data-science, python
