Exploring Hacker News by mapping and analyzing 40 million posts and comments for fun (via) A real tour de force of data engineering. Wilson Lin fetched 40 million posts and comments from the Hacker News API (using Node.js with a custom multi-process worker pool) and then ran them all through the BGE-M3 embedding model using RunPod, which let him fire up ~150 GPU instances to get the whole run done in a few hours, using a custom queue he built with RocksDB and Rust to save on Amazon SQS costs.
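To give a feel for the fetching step, here is a minimal sketch of pulling a range of items from the public Hacker News API with a cap on in-flight requests. It is not Wilson's code: his version used a multi-process worker pool, while this swaps in a simpler single-process pool of async workers. The names `fetchRange`, `defaultFetchItem` and the `concurrency` parameter are made up for illustration, and it assumes a Node.js version with a global `fetch`.

```javascript
// Default fetcher: reads one item from the public HN Firebase API.
// Swap in another function to test without network access.
const defaultFetchItem = async (id) => {
  const res = await fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`);
  if (!res.ok) throw new Error(`item ${id}: HTTP ${res.status}`);
  return res.json();
};

// Hypothetical helper: fetch every id in [firstId, lastId] with at most
// `concurrency` requests in flight at any moment.
async function fetchRange(firstId, lastId, concurrency = 8, fetchItem = defaultFetchItem) {
  const results = new Map();
  let next = firstId;

  // Each worker repeatedly claims the next unclaimed id until none remain.
  // JavaScript runs this on a single thread, so `next++` cannot race.
  const worker = async () => {
    while (next <= lastId) {
      const id = next++;
      results.set(id, await fetchItem(id));
    }
  };

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return results;
}
```

Calling `await fetchRange(1, 100)` would resolve to a `Map` from item ID to the parsed JSON for that item. A run over 40 million items would also need retries, checkpointing and multiple processes, which this sketch leaves out.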
Then he crawled 4 million linked pages, embedded that content using the faster and cheaper jina-embeddings-v2-small-en model, ran UMAP dimensionality reduction to render a 2D map and did a whole lot of follow-on work to identify topic areas and make the map look good.
That's not even half the project - Wilson built several interactive features on top of the resulting data, and experimented with custom rendering techniques on top of canvas to get everything to render quickly.
There's so much in here, and both the code and data (multiple GBs of arrow files) are available if you want to dig in and try some of this out for yourself.
In the Hacker News comments Wilson shares that the total cost of the project was a couple of hundred dollars.
One tiny detail I particularly enjoyed - unrelated to the embeddings - was this trick for testing which edge location is closest to a user using JavaScript:
const edge = await Promise.race(
  EDGES.map(async (edge) => {
    // Run a few times to avoid potential cold start biases.
    for (let i = 0; i < 3; i++) {
      await fetch(`https://${edge}.edge-hndr.wilsonl.in/healthz`);
    }
    return edge;
  }),
);
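The same pattern can be tried without touching the network. This is a small sketch, assuming a hypothetical `closestEdge` helper that takes the probe as a parameter; the edge names and latencies below are invented numbers, not measurements.

```javascript
// Hypothetical helper: resolves with whichever edge finishes all of its
// probes first. `probe` is any async function that takes an edge name.
async function closestEdge(edges, probe, rounds = 3) {
  return Promise.race(
    edges.map(async (edge) => {
      // Several sequential probes per edge, as in the original snippet.
      for (let i = 0; i < rounds; i++) {
        await probe(edge);
      }
      return edge;
    }),
  );
}

// Simulated per-request latencies in milliseconds (made-up values).
const fakeLatency = { "us-east": 30, "eu-west": 5, "ap-south": 60 };
const fakeProbe = (edge) =>
  new Promise((resolve) => setTimeout(resolve, fakeLatency[edge]));

closestEdge(Object.keys(fakeLatency), fakeProbe).then((edge) => {
  console.log(`closest edge: ${edge}`);
});
```

One thing to keep in mind with this approach: `Promise.race` settles with the first promise to settle, including a rejection, so an edge whose request fails quickly would reject the whole race. `Promise.any` is the variant that waits for the first fulfilled promise instead.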