Simon Willison’s Weblog

Subscribe

Under the hood of Canada Spends with Brendan Samek

9th December 2025

I talked to Brendan Samek about Canada Spends, a project from Build Canada that makes Canadian government financial data accessible and explorable using a combination of Datasette, a neat custom frontend, Ruby ingestion scripts, sqlite-utils and pieces of LLM-powered PDF extraction.

Here’s the video on YouTube.

Sections within that video:

  • 02:57 Data sources and the PDF problem
  • 05:51 Crowdsourcing financial data across Canada
  • 07:27 Datasette demo: Search and facets
  • 12:33 Behind the scenes: Ingestion code
  • 17:24 Data quality horror stories
  • 20:46 Using Gemini to extract PDF data
  • 25:24 Why SQLite is perfect for data distribution

Build Canada and Canada Spends

Build Canada is a volunteer-driven non-profit that launched in February 2025—here’s some background information on the organization, which has a strong pro-entrepreneurship and pro-technology angle.

Canada Spends is their project to make Canadian government financial data more accessible and explorable. It includes a tax sources and sinks visualizer and a searchable database of government contracts, plus a collection of tools covering financial data from different levels of government.

Datasette for data exploration

The project maintains a Datasette instance at api.canadasbilding.com containing the data they have gathered and processed from multiple data sources—currently more than 2 million rows plus a combined search index across a denormalized copy of that data.

  Datasette UI for a canada-spends database.  aggregated-contracts-under-10k:  year, contract_goods_number_of, contracts_goods_original_value, contracts_goods_amendment_value, contract_service_number_of, contracts_service_original_value, contracts_service_amendment_value, contract_construction_number_of, contracts_construction_original_value, contracts_construction_amendment_value, acquisition_card_transactions_number_of, acquisition_card_transactions_total_value, owner_org, owner_org_title  487 rows cihr_grants  external_id, title, project_lead_name, co_researchers, institution, province, country, competition_year, award_amount, program, program_type, theme, research_subject, keywords, abstract, duration, source_url  53,420 rows contracts-over-10k:   reference_number, procurement_id, vendor_name, vendor_postal_code, buyer_name, contract_date, economic_object_code, description_en, description_fr, contract_period_start, delivery_date, contract_value, original_value, amendment_value, comments_en, comments_fr, additional_comments_en, additional_comments_fr, agreement_type_code, trade_agreement, land_claims, commodity_type, commodity_code, country_of_vendor, solicitation_procedure, limited_tendering_reason, trade_agreement_exceptions, indigenous_business, indigenous_business_excluding_psib, intellectual_property, potential_commercial_exploitation, former_public_servant, contracting_entity, standing_offer_number, instrument_type, ministers_office, number_of_bids, article_6_exceptions, award_criteria, socioeconomic_indicator, reporting_period, owner_org, owner_org_title  1,172,575 rows global_affairs_grants:   id, projectNumber, dateModified, title, description, status, start, end, countries, executingAgencyPartner, DACSectors, maximumContribution, ContributingOrganization, expectedResults, resultsAchieved, aidType, collaborationType, financeType, flowType, reportingOrganisation, programName, selectionMechanism, policyMarkers, regions, alternameImPositions, budgets, Locations, otherIdentifiers, participatingOrgs, programDataStructure, relatedActivities, transactions  2,378 rows nserc_grants:   title, award_summary, application_id, competition_year, fiscal_year, project_lead_name, institution, department, province, award_amount, installment, program, selection_committee, research_subject, area_of_application, co-researchers, partners, external_id, source_url  701,310 rows sshrc_grants:   id, title, program, fiscal_year, competition_year, applicant, organization, amount, discipline, area_of_research, co_applicant, keywords, source_url  213,085 rows transfers:   FSCL_YR, MINC, MINE, MINF, DepartmentNumber-Numéro-de-Ministère, DEPT_EN_DESC, DEPT_FR_DESC, RCPNT_CLS_EN_DESC, RCPNT_CLS_FR_DESC, RCPNT_NML_EN_DESC, RCPNT_NML_FR_DESC, CTY_EN_NM, CTY_FR_NM, PROVTER_EN, PROVTER_FR, CNTRY_EN_NM, CNTRY_FR_NM, TOT_CY_XPND_AMT, AGRG_PYMT_AMT  357,797 rows  Download SQLite DB: canada-spends.db 2.4 GB Powered by Datasette · Queries took 24.733ms

Processing PDFs

The highest quality government financial data comes from the audited financial statements that every Canadian government department is required to publish. As is so often the case with government data, these are usually published as PDFs.

Brendan has been using Gemini to help extract data from those PDFs. Since this is accounting data the numbers can be summed and cross-checked to help validate the LLM didn’t make any obvious mistakes.

Further reading

This is Under the hood of Canada Spends with Brendan Samek by Simon Willison, posted on 9th December 2025.

Part of series Project

  1. Project: VERDAD - tracking misinformation in radio broadcasts using Gemini 1.5 - Nov. 7, 2024, 6:41 p.m.
  2. Project: Civic Band - scraping and searching PDF meeting minutes from hundreds of municipalities - Nov. 16, 2024, 10:14 p.m.
  3. Six short video demos of LLM and Datasette projects - Jan. 22, 2025, 2:09 a.m.
  4. Under the hood of Canada Spends with Brendan Samek - Dec. 9, 2025, 11:52 p.m.

Previous: Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson

Monthly briefing

Sponsor me for $10/month and get a curated email digest of the month's most important LLM developments.

Pay me to send you less!

Sponsor & subscribe