Simon Willison’s Weblog

Subscribe

21 items tagged “hacker-news”

2024

New improved commit messages for scrape-hacker-news-by-domain. My simonw/scrape-hacker-news-by-domain repo has a very specific purpose. Once an hour it scrapes the Hacker News /from?site=simonwillison.net page (and the equivalent for datasette.io) using my shot-scraper tool and stashes the parsed links, scores and comment counts in JSON files in that repo.

It does this mainly so I can subscribe to GitHub's Atom feed of the commit log - visit simonw/scrape-hacker-news-by-domain/commits/main and add .atom to the URL to get that.

NetNewsWire will inform me within about an hour if any of my content has made it to Hacker News, and the repo will track the score and comment count for me over time. I wrote more about how this works in Scraping web pages from the command line with shot-scraper back in March 2022.

Prior to the latest improvement, the commit messages themselves were pretty uninformative. The message had the date, and to actually see which Hacker News post it was referring to, I had to click through to the commit and look at the diff.

I built my csv-diff tool a while back to help address this problem: it can produce a slightly more human-readable version of a diff between two CSV or JSON files, ideally suited for including in a commit message attached to a git scraping repo like this one.

I got that working, but there was still room for improvement. I recently learned that any Hacker News thread has an undocumented URL at /latest?id=x which displays the most recently added comments at the top.

I wanted that in my commit messages, so I could quickly click a link to see the most recent comments on a thread.

So... I added one more feature to csv-diff: a new --extra option lets you specify a Python format string to be used to add extra fields to the displayed difference.

My GitHub Actions workflow now runs this command:

csv-diff simonwillison-net.json simonwillison-net-new.json \
  --key id --format json \
  --extra latest 'https://news.ycombinator.com/latest?id={id}' \
  >> /tmp/commit.txt

This generates the diff between the two versions, using the id property in the JSON to tie records together. It adds a latest field linking to that URL.

The commits now look like this:

Fri Sep 6 05:22:32 UTC 2024. 1 row changed. id: 41459472 points: "25" => "27" numComments: "7" => "8" extras: latest: https://news.ycombinator.com/latest?id=41459472

# 6th September 2024, 5:40 am / shot-scraper, github-actions, projects, hacker-news, git-scraping, json

We had to exclude [dead] and eventually even just [flagged] posts from the public API because many third-party clients and sites were displaying them as if they were regular posts. […]

IMO this issue is existential for HN. We've spent years and so much energy trying to find a balance between openness and human decency, a task which oscillates between barely-possible and simply-doomed, so the idea that anybody anywhere sees anything labeled "Hacker News" that pours all the toxic waste back into the ecosystem is physically painful to me.

dang

# 12th August 2024, 10:04 pm / hacker-news, moderation

Hacker News homepage with links to comments ordered by most recent first (via) Conversations on Hacker News are displayed as a tree, which can make it difficult to spot new comments added since the last time you viewed the thread.

There's a workaround for this using the Hacker News Algolia Search interface: search for story:STORYID, select "comments" and the result will be a list of comments sorted by most recent first.

I got fed up of doing this manually so I built a quick tool in an Observable Notebook that documents the hack, provides a UI for pasting in a Hacker News URL to get back that search interface link and also shows the most recent items on the homepage with links to their most recently added comments.

See also my How to read Hacker News threads with most recent comments first TIL from last year.

# 15th July 2024, 5:48 pm / projects, observable, hacker-news

Exploring Hacker News by mapping and analyzing 40 million posts and comments for fun (via) A real tour de force of data engineering. Wilson Lin fetched 40 million posts and comments from the Hacker News API (using Node.js with a custom multi-process worker pool) and then ran them all through the BGE-M3 embedding model using RunPod, which let him fire up ~150 GPU instances to get the whole run done in a few hours, using a custom RocksDB and Rust queue he built to save on Amazon SQS costs.

Then he crawled 4 million linked pages, embedded that content using the faster and cheaper jina-embeddings-v2-small-en model, ran UMAP dimensionality reduction to render a 2D map and did a whole lot of follow-on work to identify topic areas and make the map look good.

That's not even half the project - Wilson built several interactive features on top of the resulting data, and experimented with custom rendering techniques on top of canvas to get everything to render quickly.

There's so much in here, and both the code and data (multiple GBs of arrow files) are available if you want to dig in and try some of this out for yourself.

In the Hacker News comments Wilson shares that the total cost of the project was a couple of hundred dollars.

One tiny detail I particularly enjoyed - unrelated to the embeddings - was this trick for testing which edge location is closest to a user using JavaScript:

const edge = await Promise.race(
  EDGES.map(async (edge) => {
    // Run a few times to avoid potential cold start biases.
    for (let i = 0; i < 3; i++) {
      await fetch(`https://${edge}.edge-hndr.wilsonl.in/healthz`);
    }
    return edge;
  }),
);

# 10th May 2024, 4:42 pm / hacker-news, embeddings

Everything Google’s Python team were responsible for. In a questionable strategic move, Google laid off the majority of their internal Python team a few days ago. Someone on Hacker News asked what the team had been responsible for, and team member zem relied with this fascinating comment providing detailed insight into how the team worked and indirectly how Python is used within Google.

# 27th April 2024, 6:52 pm / hacker-news, google, python

Permissions have three moving parts, who wants to do it, what do they want to do, and on what object. Any good permission system has to be able to efficiently answer any permutation of those variables. Given this person and this object, what can they do? Given this object and this action, who can do it? Given this person and this action, which objects can they act upon?

wkirby on Hacker News

# 16th April 2024, 7:49 pm / hacker-news, permissions

Spam, and its cousins like content marketing, could kill HN if it became orders of magnitude greater—but from my perspective, it isn't the hardest problem on HN. [...]

By far the harder problem, from my perspective, is low-quality comments, and I don't mean by bad actors—the community is pretty good about flagging and reporting those; I mean lame and/or mean comments by otherwise good users who don't intend to and don't realize they're doing that.

dang

# 19th February 2024, 3:57 pm / hacker-news, moderation, socialsoftware

2023

Analytics: Hacker News v.s. a tweet from Elon Musk

My post Bing: “I will not harm you unless you harm me first” really took off.

[... 817 words]

2022

Scraping web pages from the command line with shot-scraper

Visit Scraping web pages from the command line with shot-scraper

I’ve added a powerful new capability to my shot-scraper command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript to extract information and return that information back to the terminal as JSON.

[... 1,277 words]

2021

Launch HN Instructions (via) The instructions for YC companies that are posting their launch announcement on Hacker News are really interesting to read. “As founders, you’re used to talking to users, customers, and investors. HN readers are not any of those—what they are is peers, and using any of those styles with peers feels clueless and entitled. [...] To interest HN, write in a factual, personal, and modest way about what problem you solve, why it matters, how you solve it, and how you got there.”

# 19th July 2021, 1:05 am / marketing, y-combinator, hacker-news

2020

A List of Hacker News’s Undocumented Features and Behaviors (via) If you’re interested in community software design this is a neat insight into the many undocumented features of Hacker News, collated by Max Woolf.

# 6th June 2020, 5:36 pm / hacker-news, community, max-woolf

SQL is a better API language than GraphQL – Convince me otherwise (via) A dumb tweet I posted this morning blew up today and ended up on the Hacker News homepage.

# 16th April 2020, 10:44 pm / webapis, sql, hacker-news, graphql

hacker-news-to-sqlite (via) The latest in my Dogsheep series of tools: hacker-news-to-sqlite uses the Hacker News API to fetch your comments and submissions from Hacker News and save them to a SQLite database.

# 21st March 2020, 4:27 am / dogsheep, projects, hacker-news, sqlite

2018

Ask HN: What are the best MOOCs you’ve taken? Most useful Hacker News thread I’ve seen in a while: a torrent of great recommendations for online courses to learn everything from machine learning to astrophysics to songwriting.

# 3rd April 2018, 5:17 pm / hacker-news, education

2013

What is the Hacker News technology stack?

It’s written in Arc, a Lisp variant created by Paul Graham. I believe it uses the file system for storage rather than a dedicated database.

[... 70 words]

Does Paul Graham steal business models from teams not accepted into Y Combinator and feed them to accepted teams as pivot ideas?

No. If he did, word would quickly get around and strong teams would stop applying to YC.

[... 45 words]

2010

I think that “bad technology” can kill a startup, but slightly different variations of good technology don’t have much effect. Choose what you know/like best. And Ruby and Python are both in this latter category.

enko on Hacker News

# 2nd October 2010, 11:19 am / hacker-news, python, ruby, recovered

When all of human endeavor falls under the rubric of the “hack” the word ceases to mean anything. Hack your commute, take public transit! Hack your next dinner party with parlour games. Delightfully clever key hack keeps all your keys on the same ring. Hack Mexican food with a “burrito” sized tortilla! Hack your brain with REM sleep. Hack the sun with a straw hat. Hack hygiene with silver oxide “deodorant”. Hack girls with compliments. Hack your windowsill with a pot of wheatgrass, and hack the sky with the goddamn moon.

qwzybug on Hacker News

# 10th August 2010, 11:54 am / funny, hacker-news, hacks, recovered

2009

Any sufficiently advanced damage control is indistinguishable from ethics.

Eliezer

# 6th December 2009, 9:31 am / etherpad, ethics, google, hacker-news

Hacker News thread on Negative Cashback. Is it common practice for online stores with affiliate referral schemes to artificially inflate their prices if they’re going to have to pay out a referral bonus?

# 23rd November 2009, 9:44 pm / hacker-news, affiliates, referrals, cashback

What I’ve Learned from Hacker News. I’m always fascinated by online community war stories.

# 25th February 2009, 11:16 pm / paul-graham, community, hacker-news