Simon Willison’s Weblog

Quotations in Feb, 2024


For the last few years, Meta has had a team of attorneys dedicated to policing unauthorized forms of scraping and data collection on Meta platforms. The decision not to further pursue these claims seems as close to waving the white flag as you can get against these kinds of companies. But why? [...]

In short, I think Meta cares more about access to large volumes of data and AI than it does about outsiders scraping their public data now. My hunch is that they know that any success in anti-scraping cases can be thrown back at them in their own attempts to build AI training databases and LLMs. And they care more about the latter than the former.

Kieran McCarthy # 28th February 2024, 3:15 pm

When I first published the micrograd repo, it got some traction on GitHub but then somewhat stagnated and it didn’t seem that people cared much. [...] When I made the video that built it and walked through it, it suddenly almost 100X’d the overall interest and engagement with that exact same piece of code.

[...] you might be leaving somewhere 10-100X of the potential of that exact same piece of work on the table just because you haven’t made it sufficiently accessible.

Andrej Karpathy # 21st February 2024, 9:26 pm

In 2006, reddit was sold to Conde Nast. It was soon obvious to many that the sale had been premature, the site was unmanaged and under-resourced under the old-media giant who simply didn’t understand it and could never realize its full potential, so the founders and their allies in Y-Combinator (where reddit had been born) hatched an audacious plan to re-extract reddit from the clutches of the 100-year-old media conglomerate. [...]

Yishan Wong # 20th February 2024, 4:23 pm

Spam, and its cousins like content marketing, could kill HN if it became orders of magnitude greater—but from my perspective, it isn’t the hardest problem on HN. [...]

By far the harder problem, from my perspective, is low-quality comments, and I don’t mean by bad actors—the community is pretty good about flagging and reporting those; I mean lame and/or mean comments by otherwise good users who don’t intend to and don’t realize they’re doing that.

dang # 19th February 2024, 3:57 pm

Before we even started writing the database, we first wrote a fully-deterministic event-based network simulation that our database could plug into. This system let us simulate an entire cluster of interacting database processes, all within a single-threaded, single-process application, and all driven by the same random number generator. We could run this virtual cluster, inject network faults, kill machines, simulate whatever crazy behavior we wanted, and see how it reacted. Best of all, if one particular simulation run found a bug in our application logic, we could run it over and over again with the same random seed, and the exact same series of events would happen in the exact same order. That meant that even for the weirdest and rarest bugs, we got infinity “tries” at figuring it out, and could add logging, or do whatever else we needed to do to track it down.

[...] At FoundationDB, once we hit the point of having ~zero bugs and confidence that any new ones would be found immediately, we entered into this blessed condition and we flew.

[...] We had built this sophisticated testing system to make our database more solid, but to our shock that wasn’t the biggest effect it had. The biggest effect was that it gave our tiny engineering team the productivity of a team 50x its size.

Will Wilson, on FoundationDB # 13th February 2024, 5:20 pm
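
The deterministic simulation idea in the quote above can be illustrated with a small sketch. This is not FoundationDB's code: the node count, fault probabilities, message format and the `run_simulation` name are all hypothetical, chosen only to show how a single-threaded event loop driven by one seeded random number generator makes every run exactly repeatable.

```python
import heapq
import random


def run_simulation(seed, steps=200):
    """Run a toy three-node cluster simulation and return its event log.

    Illustrative only: nodes, fault rates and messages are made up.
    """
    rng = random.Random(seed)  # the single source of randomness
    queue = []                 # min-heap of (delivery_time, seq, node, msg)
    seq = 0                    # tie-breaker so heap ordering is total
    alive = {0: True, 1: True, 2: True}
    log = []

    def send(now, dst, msg):
        nonlocal seq
        if rng.random() < 0.1:
            # Injected network fault: the message is silently dropped.
            log.append((round(now, 6), "drop", dst, msg))
            return
        delay = rng.uniform(0.001, 0.05)  # simulated network latency
        heapq.heappush(queue, (now + delay, seq, dst, msg))
        seq += 1

    # Seed the cluster with a few initial messages.
    for i in range(5):
        send(0.0, rng.randrange(3), i)

    for _step in range(steps):
        if not queue:
            break
        clock, _seq, node, msg = heapq.heappop(queue)
        if rng.random() < 0.02:
            # Injected machine fault: kill the node, or restart it if dead.
            alive[node] = not alive[node]
            log.append((round(clock, 6), "toggle", node, alive[node]))
        if not alive[node]:
            log.append((round(clock, 6), "lost", node, msg))
            continue
        log.append((round(clock, 6), "deliver", node, msg))
        # The receiving node forwards a follow-up message to a random peer.
        send(clock, rng.randrange(3), msg + 1)

    return log


if __name__ == "__main__":
    first = run_simulation(seed=1234)
    second = run_simulation(seed=1234)
    # Same seed, same events in the same order: a failing run can be replayed.
    print(first == second, len(first))
```

Because time, latency and faults all come from the one seeded generator, rerunning with the seed of a failing run replays the same sequence of events, which is the property the quote describes as getting unlimited tries at a rare bug.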

“We believe that open source should be sustainable and open source maintainers should get paid!”

Maintainer: *introduces commercial features*
“Not like that”

Maintainer: *works for a large tech co*
“Not like that”

Maintainer: *takes investment*
“Not like that”

Jacob Kaplan-Moss # 12th February 2024, 5:18 am

One consideration is that such a deep ML system could well be developed outside of Google-- at Microsoft, Baidu, Yandex, Amazon, Apple, or even a startup. My impression is that the Translate team experienced this. Deep ML reset the translation game; past advantages were sort of wiped out. Fortunately, Google’s huge investment in deep ML largely paid off, and we excelled in this new game. Nevertheless, our new ML-based translator was still beaten on benchmarks by a small startup. The risk that Google could similarly be beaten in relevance by another company is highlighted by a startling conclusion from BERT: huge amounts of user feedback can be largely replaced by unsupervised learning from raw text. That could have heavy implications for Google.

Eric Lehman, internal Google email in 2018 # 11th February 2024, 10:59 pm

Reality is that LLMs are not AGI -- they’re a big curve fit to a very large dataset. They work via memorization and interpolation. But that interpolative curve can be tremendously useful, if you want to automate a known task that’s a match for its training data distribution.

Memorization works, as long as you don’t need to adapt to novelty. You don’t *need* intelligence to achieve usefulness across a set of known, fixed scenarios.

François Chollet # 10th February 2024, 6:39 am

If your only way of making a painting is to actually dab paint laboriously onto a canvas, then the result might be bad or good, but at least it’s the result of a whole lot of micro-decisions you made as an artist. You were exercising editorial judgment with every paint stroke. That is absent in the output of these programs.

Neal Stephenson # 7th February 2024, 5:04 pm

Sometimes, performance just doesn’t matter. If I make some codepath in Ruff 10x faster, but no one ever hits it, I’m sure it could get some likes on Twitter, but the impact on users would be meaningless.

And yet, it’s good to care about performance everywhere, even when it doesn’t matter. Caring about performance is cultural and contagious. Small wins add up. Small losses add up even more.

Charlie Marsh # 4th February 2024, 7:41 pm

Rye lets you get from no Python on a computer to a fully functioning Python project in under a minute with linting, formatting and everything in place.

[...] Because it was demonstrably designed to avoid interference with any pre-existing Python configurations, Rye allows for a smooth and gradual integration and the emotional barrier of picking it up even for people who use other tools was shown to be low.

Armin Ronacher # 4th February 2024, 3:12 pm

LLMs may offer immense value to society. But that does not warrant the violation of copyright law or its underpinning principles. We do not believe it is fair for tech firms to use rightsholder data for commercial purposes without permission or compensation, and to gain vast financial rewards in the process. There is compelling evidence that the UK benefits economically, politically and societally from upholding a globally respected copyright regime.

UK House of Lords report on Generative AI # 2nd February 2024, 3:54 am

For many people in many organizations, their measurable output is words—words in emails, in reports, in presentations. We use words as proxy for many things: the number of words is an indicator of effort, the quality of the words is an indicator of intelligence, the degree to which the words are error-free is an indicator of care.

[...] But now every employee with Copilot can produce work that checks all the boxes of a formal report without necessarily representing underlying effort.

Ethan Mollick # 2nd February 2024, 3:34 am