Source URL: https://simonwillison.net/2024/Nov/25/ask-questions-of-sqlite/#atom-everything
Source: Simon Willison’s Weblog
Title: Ask questions of SQLite databases and CSV/JSON files in your terminal
Feedly Summary: I built a new plugin for my sqlite-utils CLI tool that lets you ask human-language questions directly of SQLite databases and CSV/JSON files on your computer.
It’s called sqlite-utils-ask. Here’s how you install it:
sqlite-utils install sqlite-utils-ask
It picks up API keys from an OPENAI_API_KEY environment variable, or you can install LLM and use llm keys set openai to store a key in a configuration file.
Then you can use it like this:
curl -O https://datasette.io/content.db
sqlite-utils ask content.db "how many sqlite-utils pypi downloads in 2024?"
This command will extract the SQL schema for the provided database file, send that through an LLM along with your question, get back a SQL query and attempt to run it to derive a result.
If all goes well it spits out an answer something like this:
SELECT SUM(downloads)
FROM stats
WHERE package = 'sqlite-utils' AND date >= '2024-01-01' AND date < '2025-01-01';
[
{
"SUM(downloads)": 4300221
}
]
If the SQL query fails to execute (due to a syntax error of some kind) it passes that error back to the model for corrections and retries up to three times before giving up.
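That loop is simple enough to sketch in Python. The following is a rough illustration of the extract-schema, prompt, execute, retry-on-error flow using the sqlite-utils and LLM Python APIs - an approximation of the approach, not the plugin's actual code:

# Rough sketch of the schema -> prompt -> SQL -> retry loop (illustrative only)
import re
import sqlite3

import llm
import sqlite_utils

# SYSTEM is the system prompt shown below under -v/--verbose
SYSTEM = (
    "You will be given a SQLite schema followed by a question. Generate a single "
    "SQL query to answer that question. Return that query in a ```sql ... ``` "
    "fenced code block."
)

def ask(db_path, question, model_id="gpt-4o-mini", attempts=3):
    db = sqlite_utils.Database(db_path)
    conversation = llm.get_model(model_id).conversation()
    prompt = db.schema + "\n\n" + question
    for _ in range(attempts):
        reply = conversation.prompt(prompt, system=SYSTEM).text()
        # Pull the SQL out of the ```sql ... ``` fenced block, if there is one
        match = re.search(r"```sql\n(.*?)```", reply, re.DOTALL)
        sql = (match.group(1) if match else reply).strip()
        try:
            return sql, list(db.query(sql))
        except sqlite3.OperationalError as ex:
            # Feed the error back to the model and ask it to correct the query
            prompt = f"That query raised an error: {ex}. Please try again."
    raise RuntimeError(f"No working query after {attempts} attempts")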
Add -v/--verbose to see the exact prompt it's using:
System prompt:
You will be given a SQLite schema followed by a question. Generate a single SQL
query to answer that question. Return that query in a ```sql ... ```
fenced code block.
Example: How many repos are there?
Answer:
```sql
select count(*) from repos
```
Prompt:
...
CREATE TABLE [stats] (
[package] TEXT,
[date] TEXT,
[downloads] INTEGER,
PRIMARY KEY ([package], [date])
);
...
how many sqlite-utils pypi downloads in 2024?
I've truncated the above to just the relevant table - it actually includes the full schema of every table in that database.
By default, the tool sends just that database schema and your question to the LLM. If you add the -e/--examples option it will also include five common values for each text column in that schema with an average length of less than 32 characters. This can sometimes produce a better result: sending the values "CA", "FL" and "TX" for a state column, for example, can tip the model off that it should use state abbreviations rather than full names in its queries.
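The exact selection logic isn't shown here, but the general idea can be sketched with the sqlite-utils Python API - treat the following as an illustration of the behaviour described above, not the plugin's actual implementation:

# Illustrative sketch: gather common values for short text columns
import sqlite_utils

def example_values(db_path, limit=5, max_avg_length=32):
    db = sqlite_utils.Database(db_path)
    examples = {}
    for table in db.tables:
        for column in table.columns:
            if column.type.upper() != "TEXT":
                continue
            avg_length = db.execute(
                "select avg(length([{}])) from [{}]".format(column.name, table.name)
            ).fetchone()[0]
            if avg_length is None or avg_length >= max_avg_length:
                continue
            rows = db.execute(
                "select [{0}], count(*) as n from [{1}] "
                "where [{0}] is not null "
                "group by [{0}] order by n desc limit {2}".format(
                    column.name, table.name, limit
                )
            ).fetchall()
            examples["{}.{}".format(table.name, column.name)] = [r[0] for r in rows]
    return examples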
Asking questions of CSV and JSON data
The core sqlite-utils CLI usually works against SQLite files directly, but three years ago I added the ability to run SQL queries against CSV and JSON files directly with the sqlite-utils memory command. This works by loading that data into an in-memory SQLite database before executing a SQL query.
I decided to reuse that mechanism to enable LLM prompts against CSV and JSON data directly as well.
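That in-memory mechanism is easy to picture with the sqlite-utils Python API. A minimal sketch, assuming a transactions.csv file and skipping the type detection the real memory command can apply:

# Load a CSV into an in-memory SQLite database, then query it
import csv
import sqlite_utils

db = sqlite_utils.Database(memory=True)  # nothing is written to disk
with open("transactions.csv", newline="") as fp:
    db["transactions"].insert_all(csv.DictReader(fp))

# The resulting schema can be sent to the model, and its generated SQL run right here
print(db.schema)
print(list(db.query("select count(*) from transactions")))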
The sqlite-utils ask-files command looks like this:
sqlite-utils ask-files transactions.csv "total sales by year"
This command accepts one or more files, and you can provide a mix of CSV, TSV and JSON. Each provided file will be imported into a different table, allowing the model to construct join queries where necessary.
Implementation notes
The core of the plugin is implemented as around 250 lines of Python, using the sqlite-utils register_commands() plugin hook to add the ask and ask-files commands.
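For anyone curious what that hook looks like, here is a stripped-down skeleton of a command registered through register_commands() - the real ask command has more options and all of the prompting logic, none of which is shown here:

# Skeleton of a sqlite-utils plugin command (illustrative, not the real source)
import click
import sqlite_utils
from sqlite_utils import hookimpl

@hookimpl
def register_commands(cli):
    @cli.command()
    @click.argument("path")
    @click.argument("question")
    @click.option("-m", "--model", default="gpt-4o-mini")
    def ask(path, question, model):
        "Ask a human-language question of a SQLite database"
        db = sqlite_utils.Database(path)
        click.echo(db.schema)  # build the prompt from this, call the model, run the SQL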
It adds LLM as a dependency, and takes advantage of LLM's Python API to abstract over the details of talking to the models. This means sqlite-utils-ask can use any of the models supported by LLM or its plugins - if you want to run your prompt through Claude 3.5 Sonnet you can do this:
sqlite-utils install llm-claude-3
sqlite-utils ask content.db "count rows in news table" -m claude-3.5-sonnet
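In Python terms the model is just an ID string passed to LLM, so the earlier ask() sketch would work unchanged with a different model - something along these lines, assuming the relevant plugin and API key are configured:

# LLM resolves the model ID to whichever plugin provides it
import llm

gpt = llm.get_model("gpt-4o-mini")
claude = llm.get_model("claude-3.5-sonnet")  # available once llm-claude-3 is installed
# Both objects expose the same .prompt() and .conversation() interface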
The plugin defaults to gpt-4o-mini, picked initially to take advantage of that model's automatic prompt caching: if you run multiple questions against the same schema you'll end up sending the same lengthy prompt prefix multiple times, and OpenAI's prompt caching should automatically kick in and provide a 50% discount on those input tokens.
Then I ran the actual numbers and found that gpt-4o-mini is cheap enough that, even without caching, a 4,000 token prompt (that's a pretty large SQL schema) should cost less than a tenth of a cent. So those caching savings aren't worth anything at all!
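(For a rough sense of scale: assuming gpt-4o-mini's input price of around $0.15 per million tokens at the time of writing, 4,000 input tokens works out to 4,000 × 0.15 / 1,000,000 = $0.0006, or 0.06 cents, so a 50% caching discount would save roughly 0.03 cents per prompt.)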
Tags: plugins, projects, sqlite, ai, sqlite-utils, generative-ai, llms, ai-assisted-programming, llm
AI Summary and Description: Yes
Summary: The text describes a new plugin for the sqlite-utils CLI tool, enabling users to query SQLite databases and CSV/JSON files using natural language via integration with large language models (LLMs). The novelty lies in its capability to automatically generate SQL queries based on user questions, showcasing an innovative application of LLMs in simplifying database interactions.
Detailed Description: The sqlite-utils-ask plugin introduces significant advancements in how users interact with database files:
* **Functionality**:
- Allows users to pose natural language questions concerning data stored in SQLite databases and CSV/JSON files.
- Automatically generates SQL queries to fetch responses, using LLMs to interpret natural language questions.
* **Installation**:
- Easily installable via the sqlite-utils command: `sqlite-utils install sqlite-utils-ask`.
- Can interface with OpenAI's API using an environment variable for API keys.
* **Core Mechanism**:
  - The plugin extracts the SQL schema of the provided database file and sends it to the LLM, along with the question, to generate a SQL query.
  - If a generated query fails to execute, the plugin passes the error message back to the model for a corrected query and retries up to three times before giving up.
* **Command Usage**:
- Example command: `sqlite-utils ask content.db "how many sqlite-utils pypi downloads in 2024?"`.
- Outputs results in a structured format, showing the SQL executed and the resulting data.
* **CSV and JSON Support**:
- Extends functionality to include CSV and JSON data types, enabling LLM prompts and query executions on this data as well.
* **Implementation**:
- Developed with around 250 lines of Python code, utilizing the sqlite-utils framework and LLM APIs.
- Supports multiple LLM models, offering flexibility for users depending on their needs.
* **Cost Considerations**:
  - Initially intended to maximize efficiency with OpenAI’s prompt caching, though running the actual numbers showed that even without caching, costs remained minimal for large prompts.
* **Example Commands**:
- `sqlite-utils ask-files transactions.csv "total sales by year"` to query mixed data files.
  - Commands also allow specifying a different LLM model via the -m option.
This tool represents an innovative shift in database querying, simplifying access to information stored in databases, reducing the technical barrier for non-technical users, and integrating seamlessly with modern AI capabilities. This development is particularly relevant for professionals involved in data management, AI integration, and database systems, emphasizing improved access and usability in the realm of information management.