Example dashboard

Various statistics from my blog.

Owned by simonw, visibility: Public

Entries

3060

SQL query
select 'Entries' as label, count(*) as big_number from blog_entry

Blogmarks

7012

SQL query
select 'Blogmarks' as label, count(*) as big_number from blog_blogmark

Quotations

846

SQL query
select 'Quotations' as label, count(*) as big_number from blog_quotation
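The three widgets above all follow the same convention: any query returning a label and a big_number column renders as a big-number panel. A minimal sketch of that pattern, using an in-memory SQLite database with hypothetical stand-in tables (the real dashboard runs against PostgreSQL), shows the three counts can also be combined into a single query with UNION ALL:

```python
import sqlite3

# Stand-in tables with made-up rows; the real dashboard queries the
# blog_entry, blog_blogmark and blog_quotation tables in PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table blog_entry (id integer primary key);
    create table blog_blogmark (id integer primary key);
    create table blog_quotation (id integer primary key);
    insert into blog_entry values (1), (2), (3);
    insert into blog_blogmark values (1), (2);
    insert into blog_quotation values (1);
""")

# One query, one big-number row per table
rows = conn.execute("""
    select 'Entries' as label, count(*) as big_number from blog_entry
    union all
    select 'Blogmarks', count(*) from blog_blogmark
    union all
    select 'Quotations', count(*) from blog_quotation
""").fetchall()
# rows == [('Entries', 3), ('Blogmarks', 2), ('Quotations', 1)]
```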

Chart of number of entries per month over time

SQL query
select '<h2>Chart of number of entries per month over time</h2>' as html
SQL query
select to_char(date_trunc('month', created), 'YYYY-MM') as bar_label,
count(*) as bar_quantity from blog_entry group by bar_label order by bar_label
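The bar chart convention works the same way: columns named bar_label and bar_quantity render as a bar chart. A small runnable sketch of the same month-bucketing, with SQLite's strftime standing in for PostgreSQL's to_char/date_trunc and a few made-up dates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table blog_entry (created text)")  # hypothetical mini-table
conn.executemany(
    "insert into blog_entry values (?)",
    [("2024-01-05",), ("2024-01-20",), ("2024-02-11",)],
)

# strftime('%Y-%m', ...) plays the role of
# to_char(date_trunc('month', created), 'YYYY-MM') in PostgreSQL
rows = conn.execute("""
    select strftime('%Y-%m', created) as bar_label,
           count(*) as bar_quantity
    from blog_entry
    group by bar_label
    order by bar_label
""").fetchall()
# rows == [('2024-01', 2), ('2024-02', 1)]
```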

Ten most recent blogmarks (of 7012 total)

SQL query
select '## Ten most recent blogmarks (of ' || count(*) || ' total)' as markdown from blog_blogmark
SQL query
select link_title, link_url, commentary, created from blog_blogmark order by created desc limit 10

10 rows

link_title: Merge pull request #1757 from simonw/heic-heif
link_url: https://github.com/gchq/CyberChef/commit/674c8c7c87eff167f03ee42c998c7fff18da4fa3
commentary: I got a PR into GCHQ's CyberChef this morning! I added support for detecting heic/heif files to the Forensics -> Detect File Type tool. The change was landed by the delightfully mysterious a3957273.
created: 2024-03-28 05:37:31+00:00

link_title: Wrap text at specified width
link_url: https://observablehq.com/@simonw/wrap-text-at-specified-width
commentary: New Observable notebook. I built this with the help of Claude 3 Opus - it's a text wrapping tool which lets you set the width and also lets you optionally add a four space indent. The four space indent is handy for posting on forums such as Hacker News that treat a four space indent as a code block.
created: 2024-03-28 03:36:01+00:00

link_title: llm-gemini 0.1a1
link_url: https://github.com/simonw/llm-gemini/releases/tag/0.1a1
commentary: I upgraded my llm-gemini plugin to add support for the new Google Gemini Pro 1.5 model, which is beginning to roll out in early access. The 1.5 model supports 1,048,576 input tokens and generates up to 8,192 output tokens - a big step up from Gemini 1.0 Pro, which handled 30,720 and 2,048 respectively. The big missing feature from my LLM tool at the moment is image input - a fantastic way to take advantage of that huge context window. I have a branch for this which I really need to get into a useful state.
created: 2024-03-28 03:32:15+00:00

link_title: “The king is dead”—Claude 3 surpasses GPT-4 on Chatbot Arena for the first time
link_url: https://arstechnica.com/information-technology/2024/03/the-king-is-dead-claude-3-surpasses-gpt-4-on-chatbot-arena-for-the-first-time/
commentary: I'm quoted in this piece by Benj Edwards for Ars Technica: "For the first time, the best available models—Opus for advanced tasks, Haiku for cost and efficiency—are from a vendor that isn't OpenAI. That's reassuring—we all benefit from a diversity of top vendors in this space. But GPT-4 is over a year old at this point, and it took that year for anyone else to catch up."
created: 2024-03-27 16:58:20+00:00

link_title: Annotated DBRX system prompt
link_url: https://huggingface.co/spaces/databricks/dbrx-instruct/blob/73f0fe25ed8eeb14ee2279b2ecff15dbd863d63d/app.py#L109-L134
commentary: DBRX is an exciting new openly licensed LLM released today by Databricks. They haven't (yet) disclosed what was in the training data for it. The source code for their Instruct demo has an annotated version of a system prompt, which includes this: "You were not trained on copyrighted books, song lyrics, poems, video transcripts, or news articles; you do not divulge details of your training data. You do not provide song lyrics, poems, or news articles and instead refer the user to find them online or in a store." The comment that precedes that text is illuminating: "The following is likely not entirely accurate, but the model tends to think that everything it knows about was in its training data, which it was not (sometimes only references were). So this produces more accurate accurate answers when the model is asked to introspect"
created: 2024-03-27 15:33:17+00:00

link_title: gchq.github.io/CyberChef
link_url: https://gchq.github.io/CyberChef/
commentary: CyberChef is "the Cyber Swiss Army Knife - a web app for encryption, encoding, compression and data analysis" - entirely client-side JavaScript with dozens of useful tools for working with different formats and encodings. It's maintained and released by GCHQ - the UK government's signals intelligence and security agency. I didn't know GCHQ had a presence on GitHub, and I find the URL to this tool absolutely delightful. They first released it back in 2016 and it has over 3,700 commits. The top maintainers also have suitably anonymous usernames - great work, n1474335, j433866, d98762625 and n1073645.
created: 2024-03-26 17:08:34+00:00

link_title: GGML GGUF File Format Vulnerabilities
link_url: https://www.databricks.com/blog/ggml-gguf-file-format-vulnerabilities
commentary: The GGML and GGUF formats are used by llama.cpp to package and distribute model weights. Neil Archibald: "The GGML library performs insufficient validation on the input file and, therefore, contains a selection of potentially exploitable memory corruption vulnerabilities during parsing." These vulnerabilities were shared with the library authors on 23rd January and patches landed on the 29th. If you have a llama.cpp or llama-cpp-python installation that's more than a month old you should upgrade ASAP.
created: 2024-03-26 06:47:17+00:00

link_title: Cohere int8 & binary Embeddings - Scale Your Vector Database to Large Datasets
link_url: https://txt.cohere.com/int8-binary-embeddings/
commentary: Jo Kristian Bergum told me "The accuracy retention [of binary embedding vectors] is sensitive to whether the model has been using this binarization as part of the loss function." Cohere provide an API for embeddings, and last week added support for returning binary vectors specifically tuned in this way. 250M embeddings (Cohere provide a downloadable dataset of 250M embedded documents from Wikipedia) at float32 (4 bytes per dimension) is 954 GB. Cohere claim that reducing to 1 bit per dimension knocks that down to 30 GB (954/32) while keeping "90-98% of the original search quality".
created: 2024-03-26 06:19:30+00:00

link_title: My binary vector search is better than your FP32 vectors
link_url: https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors
commentary: I'm still trying to get my head around this, but here's what I understand so far. Embedding vectors as calculated by models such as OpenAI text-embedding-3-small are arrays of floating point values, which look something like this: [0.0051681744, 0.017187592, -0.018685209, -0.01855924, -0.04725188...] - 1536 elements long. Different embedding models have different lengths, but they tend to be hundreds up to low thousands of numbers. If each float is 32 bits that's 4 bytes per float, which can add up to a lot of memory if you have millions of embedding vectors to compare. If you look at those numbers you'll note that they are all pretty small positive or negative numbers, close to 0. Binary vector search is a trick where you take that sequence of floating point numbers and turn it into a binary vector - just a list of 1s and 0s, where you store a 1 if the corresponding float was greater than 0 and a 0 otherwise. For the above example, this would start [1, 1, 0, 0, 0...]. Incredibly, it looks like the cosine distance between these 0 and 1 vectors captures much of the semantically relevant meaning present in the distance between the much more accurate vectors. This means you can use 1/32nd of the space and still get useful results! Ce Gao here suggests a further optimization: use the binary vectors for a fast brute-force lookup of the top 200 matches, then run a more expensive re-ranking against those filtered values using the full floating point vectors.
created: 2024-03-26 04:56:25+00:00

link_title: Semgrep: AutoFixes using LLMs
link_url: https://choly.ca/post/semgrep-autofix-llm/
commentary: semgrep is a really neat tool for semantic grep against source code - you can give it a pattern like "log.$A(...)" to match all forms of log.warning(...) / log.error(...) etc. Ilia Choly built semgrepx - xargs for semgrep - and here shows how it can be used along with my llm CLI tool to execute code replacements against matches by passing them through an LLM such as Claude 3 Opus.
created: 2024-03-26 00:51:37+00:00
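The storage figures quoted in the Cohere item above check out, assuming 1024-dimensional embeddings (the dimensionality is my assumption; the quoted text only gives the totals):

```python
# Back-of-envelope check of the quoted sizes (1024 dims is an assumption)
n_vectors = 250_000_000
dims = 1024
gib = 1024 ** 3

float32_gib = n_vectors * dims * 4 / gib   # 4 bytes per dimension
binary_gib = n_vectors * dims / 8 / gib    # 1 bit per dimension

# float32_gib ≈ 953.7 (the quoted "954 GB"), binary_gib ≈ 29.8
# (the quoted "30 GB") - exactly the 32x reduction claimed
```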
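The binarize-then-rerank trick described in the binary vector search item above can be sketched in a few lines of pure Python. This is a minimal illustration with random vectors standing in for real embeddings; the dimensions, corpus size, and function names are all mine, not from the linked post:

```python
import math
import random

random.seed(0)
DIM = 64      # tiny stand-in; real embeddings run hundreds to low thousands of dims
N_DOCS = 500  # hypothetical corpus size

docs = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_DOCS)]
query = [random.gauss(0, 1) for _ in range(DIM)]

def binarize(vec):
    # store a 1 if the float is greater than 0, a 0 otherwise
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    # number of positions where the two bit-vectors differ
    return sum(x != y for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

qbits = binarize(query)
doc_bits = [binarize(d) for d in docs]

# Stage 1: cheap brute-force filter on the 1-bit vectors (1/32nd the storage)
candidates = sorted(range(N_DOCS), key=lambda i: hamming(doc_bits[i], qbits))[:50]

# Stage 2: re-rank only the survivors using the full floating point vectors
top10 = sorted(candidates, key=lambda i: -cosine(docs[i], query))[:10]
```

In a real system stage 1 would use packed bits and a popcount rather than Python lists, but the two-stage shape is the same.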