Simon Willison’s Weblog

41 items tagged “wikipedia”

2024

Wikidata is a Giant Crosswalk File. Drew Breunig shows how to take the 140GB Wikidata JSON export, use sed 's/,$//' to convert it to newline-delimited JSON, then use DuckDB to run queries and extract external identifiers, including a query that pulls out 500MB of latitude and longitude points.
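A minimal Python sketch of the newline-delimited JSON conversion described above. It assumes the export is one large JSON array with a single entity per line, each line ending in a comma, wrapped in `[` and `]` lines. The function name and the sample records are hypothetical, and unlike the plain sed command this version also skips the bracket lines.

```python
import json
from typing import Iterable, Iterator


def to_ndjson_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one JSON document per line from an array-style dump.

    Assumption: each entity sits on its own line and ends with a comma,
    and the array brackets are on lines of their own.
    """
    for line in lines:
        line = line.strip()
        # Skip the array wrapper and any blank lines.
        if line in ("[", "]", ""):
            continue
        # Equivalent of sed 's/,$//': drop a single trailing comma.
        if line.endswith(","):
            line = line[:-1]
        yield line


# Hypothetical stand-in for a few lines of the real export.
sample = [
    "[\n",
    '{"id": "Q1", "labels": {}},\n',
    '{"id": "Q2", "labels": {}}\n',
    "]\n",
]

entities = [json.loads(line) for line in to_ndjson_lines(sample)]
entity_ids = [entity["id"] for entity in entities]
```

Because the function works on any iterable of lines, it could be pointed at an open file handle so the full export never has to fit in memory.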

# 5th October 2024, 3:45 pm / wikipedia, drew-breunig, duckdb, json

Wikipedia Manual of Style: Linking (via) I started a conversation on Mastodon about the grammar of linking: how to decide where in a phrase an inline link should be placed.

Lots of great (and varied) replies there. The most comprehensive style guide I've seen so far is this one from Wikipedia, via Tom Morris.

# 22nd June 2024, 2:15 pm / wikipedia, links, writing

qrank (via) Interesting and very niche project by Colin Dellow.

Wikidata has pages for huge numbers of concepts, people, places and things.

One of the many pieces of data they publish is QRank—“ranking Wikidata entities by aggregating page views on Wikipedia, Wikispecies, Wikibooks, Wikiquote, and other Wikimedia projects”. Every item gets a score and these scores can be used to answer questions like “which island nations get the most interest across Wikipedia”—potentially useful for things like deciding which labels to display on a highly compressed map of the world.

QRank is published as a gzipped CSV file.

Colin’s hikeratlas/qrank GitHub repository runs weekly, fetches the latest qrank.csv.gz file and loads it into a SQLite database using SQLite’s “.import” mechanism. Then it publishes the resulting SQLite database as an asset attached to the “latest” GitHub release on that repo—currently a 307MB file.

The database itself has just a single table mapping the Wikidata ID (a primary key integer) to the latest QRank—another integer. You’d need your own set of data with Wikidata IDs to join against this to do anything useful.
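As a rough illustration of that join, here is a sketch using Python's sqlite3 module with an in-memory database. The table names, column names and scores are all made up for the example, and it assumes the integer key is the Wikidata ID with its leading "Q" removed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema standing in for the published database.
    CREATE TABLE qrank (id INTEGER PRIMARY KEY, qrank INTEGER);
    -- Your own data, keyed by Wikidata ID strings such as 'Q2'.
    CREATE TABLE islands (wikidata_id TEXT, name TEXT);
    """
)

# Made-up scores, not real QRank values.
conn.executemany(
    "INSERT INTO qrank VALUES (?, ?)",
    [(1, 500), (2, 9000), (3, 70)],
)
conn.executemany(
    "INSERT INTO islands VALUES (?, ?)",
    [("Q1", "Island A"), ("Q2", "Island B"), ("Q4", "Island D")],
)

# Strip the leading 'Q' so the text ID can match the integer key.
rows = conn.execute(
    """
    SELECT islands.name, qrank.qrank
    FROM islands
    JOIN qrank ON qrank.id = CAST(substr(islands.wikidata_id, 2) AS INTEGER)
    ORDER BY qrank.qrank DESC
    """
).fetchall()
```

Rows with no matching score (Q4 here) drop out of the inner join; a LEFT JOIN would keep them with a NULL rank.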

I’d never thought of using GitHub Releases for this kind of thing. I think it’s a really interesting pattern.

# 21st April 2024, 10:28 pm / wikipedia, github-actions, sqlite, colin-dellow

Become a Wikipedian in 30 minutes (via) A characteristically informative and thoughtful guide to getting started with Wikipedia editing by Molly White—video accompanied by a full transcript.

I found the explanation of Reliable Sources particularly helpful, including why Wikipedia prefers secondary to primary sources.

“The way we determine reliability is typically based on the reputation for editorial oversight, and for factchecking and corrections. For example, if you have a reference book that is published by a reputable publisher that has an editorial board and that has edited the book for accuracy, if you know of a newspaper that has, again, an editorial team that is reviewing articles and issuing corrections if there are any errors, those are probably reliable sources.”

# 8th March 2024, 9:47 am / wikipedia, molly-white

Wikimedia Commons Category:Bach Dancing & Dynamite Society. After creating a new Wikipedia page for the Bach Dancing & Dynamite Society in Half Moon Bay I ran a search across Wikipedia for other mentions of the venue... and found 41 artist pages that mentioned it in a photo caption.

On further exploration it turns out that Brian McMillen, the official photographer for the venue, has been uploading photographs to Wikimedia Commons since 2007 and adding them to different artist pages. Brian has been a jazz photographer based out of Half Moon Bay for 47 years and has an amazing portfolio of images. It’s thrilling to see him share them on Wikipedia in this way.

# 6th March 2024, 5:24 am / wikipedia

Wikipedia: Bach Dancing & Dynamite Society (via) I created my first Wikipedia page! The Bach Dancing & Dynamite Society is a really neat live music venue in Half Moon Bay which has been showcasing world-class jazz talent for over 50 years. I attended a concert there for the first time on Sunday and was surprised to see it didn’t have a page yet.

Creating a Wikipedia page is an interesting process. New pages on English Wikipedia created by infrequent editors stay in “draft” mode until they’ve been approved by a member of “WikiProject Articles for creation”. The standards are really high, especially around the quality of cited sources. I spent quite a while tracking down good citation references for the key facts I used in my first draft for the page.

# 5th March 2024, 4:21 pm / wikipedia, music, half-moon-bay

WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia. This paper describes a really interesting LLM system that runs Retrieval Augmented Generation against Wikipedia to help answer questions, but includes a second step where facts in the answer are fact-checked against Wikipedia again before returning an answer to the user. They claim “97.3% factual accuracy of its claims in simulated conversation” on a GPT-4 backed version, and also see good results when backed by LLaMA 7B.

The implementation is mainly through prompt engineering, and detailed examples of the prompts they used are included at the end of the paper.

# 9th January 2024, 9:30 pm / prompt-engineering, generative-ai, wikipedia, ai, llms, rag

2023

Wikimedia Commons: Photographs by Gage Skidmore (via) Gage Skidmore is a Wikipedia legend: this category holds 93,458 photographs taken by Gage and released under a Creative Commons license, including a vast number of celebrities taken at events like San Diego Comic-Con. CC licensed photos of celebrities are generally pretty hard to come by so if you see a photo of any celebrity on Wikipedia there’s a good chance it’s credited to Gage.

# 10th October 2023, 4:17 am / wikipedia, creativecommons, photography

Wikipedia search-by-vibes through millions of pages offline (via) Really cool demo by Lee Butterman, who built embeddings of 2 million Wikipedia pages and figured out how to serve them directly to the browser, where they are used to implement “vibes based” similarity search returning results in 250ms. Lots of interesting details about how he pulled this off, using Arrow as the file format and ONNX to run the model in the browser.
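The core idea behind this kind of embedding search can be sketched in a few lines of plain Python: score every stored vector against the query by cosine similarity and keep the top few. This is a toy illustration only. The three-dimensional vectors and page titles are invented, and it does not use Arrow or ONNX or reflect how the demo is actually implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, pages, k=3):
    """Return the titles of the k pages whose vectors best match the query."""
    scored = [(cosine_similarity(query, vec), title) for title, vec in pages.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:k]]


# Invented embeddings; real ones would have hundreds of dimensions.
pages = {
    "Page A": [1.0, 0.0, 0.0],
    "Page B": [0.9, 0.1, 0.0],
    "Page C": [0.0, 1.0, 0.0],
    "Page D": [0.0, 0.0, 1.0],
}

results = most_similar([1.0, 0.05, 0.0], pages, k=2)
```

A brute-force scan like this is linear in the number of pages, which is part of why compact storage and fast vector math matter at the scale of millions of pages.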

# 4th September 2023, 9:13 pm / embedding, search, wikipedia, webassembly

2018

Why it took a long time to build that tiny link preview on Wikipedia (via) Wikipedia now shows a little preview card on internal links with an image and summary paragraph of the linked page. As a Wikipedia user I absolutely love this feature, and as an engineer and product designer it’s fascinating to hear the challenges they overcame to ship it. Of particular interest: actually generating a useful summary of a page, while stripping out the cruft that often accumulates at the beginning of its text. It’s also an impressive scaling challenge: the API they use for this feature is now handling more than 500,000 requests per minute.

# 23rd April 2018, 9:07 pm / wikipedia, scaling

2012

Why doesn’t Wikipedia try something other than donations to make money?

Wikipedia is run by a non-profit, and the content is created by volunteers for free. Those volunteers created that content under the understanding that it would be for the benefit of the species. Alternative methods of making money would break that assumed contract with their volunteers, and would likely damage their ability to encourage free contributions in the future.

[... 76 words]

2010

What are the best APIs for creating location-based Wikipedia mashups?

GeoNames has a fantastic API for finding Wikipedia articles near a specific latitude/longitude pair:

[... 32 words]

2009

Authority, historically, gets bestowed on the gatekeepers of information, such as Britannica, universities, newspapers, etc. Everything that can be digitized will be digitized, and will then be available over the internet, which is disruptive, not only to business models, but to authority.

Joe Gregorio

# 19th November 2009, 6:53 pm / joe-gregorio, wikipedia, authority, newspapers, internet

Best of OpenStreetMap (via) I keep on telling people OpenStreetMap is this year’s Wikipedia—at its best, it beats commercially available maps. This “best of” site highlights the areas where OSM really shines (the yellow stars)—the German mapping community in particular have produced some outstanding cartography.

# 13th August 2009, 12:30 pm / openstreetmap, wikipedia, mapping, maps, cartography

Wikipedia over DNS. Added to my ~/bin/ directory as dns-wikipedia.sh: host -t txt $1.wp.dg.cx

# 2nd January 2009, 11:29 am / wikipedia, dns

2008

License Hacking. Wikipedia is making the switch to a CC license, by asking the Free Software Foundation to include that as an option in the latest version of the Free Documentation License which Wikipedia currently uses and which includes an auto-upgrade clause. Devious.

# 10th November 2008, 10:46 pm / licenses, open-source, wikipedia, freesoftwarefoundation, fsf, creativecommons, fdl

It’s a purple world. Stuart Langridge made a purplish map of the US election results, using JSON data from Google and an SVG map of the US from Wikipedia.

# 6th November 2008, 8:26 pm / stuart-langridge, uselection, svg, wikipedia

Data Scraping Wikipedia with Google Spreadsheets. I hadn’t played with =importHTML in Google spreadsheets, which lets you suck in data from an HTML table or list somewhere on the web. This tutorial takes it further, bringing Wikipedia, Yahoo! Pipes and KML into the mix.

# 16th October 2008, 2:37 pm / mashups, importhtml, google-docs, googlespreadsheet, wikipedia, yahoopipes, kml, scraping

Google’s Wikipedia and Panoramio layers are now available in the API. I really like their use of reverse domain style identifiers for the layer IDs: map.addOverlay(new GLayer("org.wikipedia"));

# 2nd October 2008, 11:59 am / google-maps, wikipedia, javascript, panoramio, glayer

GiantBomb.com. Launched today, powered by Django—a combination of (mostly ex-Gamespot) quality editorial content and a massive structured wiki of every computer game ever released. This is going to be a lot of fun—all of the crazy detailed content that Wikipedia tends to reject.

# 22nd July 2008, 7:09 am / django, giantbomb, games, wikipedia, wiki

Comet (programming) on Wikipedia on 4th June 2008 (via) The last useful version (which I had pointed many people to) before it was gutted down to just a couple of paragraphs by infuriating deletionists.

# 16th June 2008, 9:34 am / wikipedia, comet, deletionist

The fatal flaw of deletionism is the mindset of deciding what someone else should find interesting

Jeff Atwood

# 16th June 2008, 8:23 am / jeff-atwood, deletionism, wikipedia

Wikipedia:Canvassing (via) Apparently it’s considered bad form to tell people about debates occurring on Wikipedia (such as votes for deletion). Looks like a policy designed to discourage the participation of subject experts in favour of the participation of Wikipedia process gnomes.

# 16th June 2008, 8:23 am / wikipedia, canvassing

There are two [Wikipedias]: One is the public-facing reliable-enough-on-average encyclopedia that people read every day, which makes for nice fluff pieces in the media about "these new Web thingamajigs that the kids are building, aren't they neat?". The other is the insular behind-the-scenes bureaucracy, which reads like an improvised performance of the collected writings of Clay Shirky.

James Bennett

# 16th June 2008, 8:16 am / james-bennett, wikipedia, clay-shirky, snark

Google Maps now shows photos and Wikipedia articles. Click the “More...” button. My first thought was “how do they get so many photo markers on the map?”—Firebug shows that they’re generating tiles on the server containing multiple photo markers, then when you click on one an Ajax call checks which photo is in that particular spot.

# 14th May 2008, 7:10 pm / google-maps, javascript, ajax, wikipedia

MediaWiki API. Wikipedia’s best kept secret?

# 26th April 2008, 6:47 pm / mediawiki, wikipedia, api

wikinear.com, OAuth and Fire Eagle

I’m pleased to announce wikinear.com. It’s a simple site that does just one thing: show you a list of the five Wikipedia pages that are geographically closest to your current location. It’s designed (or not-designed) to be used mainly from mobile phones.

[... 1,190 words]
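The "five closest pages" idea can be sketched with a great-circle distance sort. This is an illustrative assumption about one way to do it, not a description of how wikinear.com is built; the function names and the placeholder pages along the equator are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_pages(lat, lon, pages, limit=5):
    """Return the `limit` (title, lat, lon) tuples closest to the given point."""
    return sorted(pages, key=lambda p: haversine_km(lat, lon, p[1], p[2]))[:limit]


# Placeholder data: eight made-up pages spaced along the equator.
pages = [(f"Page {i}", 0.0, i * 0.1) for i in range(8)]

closest = nearest_pages(0.0, 0.26, pages)
closest_titles = [title for title, _, _ in closest]
```

Sorting every candidate is fine for a small list; a real service covering all geotagged articles would want a spatial index or an external geo API instead.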

Everyone applauds when Google goes after Microsoft's Office monopoly [...] but when they start to go after web non-profits like Wikipedia, you see where the ineluctible logic leads. As Google's growth slows, as inevitably it will, it will need to consume more and more of the web ecosystem, trading against its former suppliers, rather than distributing attention to them.

Tim O'Reilly

# 1st January 2008, 11:29 am / tim-oreilly, google, microsoft, wikipedia, competition

2007

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo. (via) See also: Wikipedia’s “List of linguistic example sentences”.

# 28th October 2007, 6:12 pm / buffalo, linguistics, wikipedia