Simon Willison’s Weblog: Hybrid full-text search and vector search with SQLite

Source URL: https://simonwillison.net/2024/Oct/4/hybrid-full-text-search-and-vector-search-with-sqlite/
Source: Simon Willison’s Weblog
Title: Hybrid full-text search and vector search with SQLite

Feedly Summary: Hybrid full-text search and vector search with SQLite
As part of Alex’s work on his sqlite-vec SQLite extension – adding fast vector lookups to SQLite – he’s been investigating hybrid search, where search results from both vector similarity and traditional full-text search are combined together.
The most promising approach looks to be Reciprocal Rank Fusion, which combines the top ranked items from both approaches. Here’s Alex’s SQL query:
— the sqlite-vec KNN vector search results
with vec_matches as (
select
article_id,
row_number() over (order by distance) as rank_number,
distance
from vec_articles
where
headline_embedding match lembed(:query)
and k = :k
),
— the FTS5 search results
fts_matches as (
select
rowid,
row_number() over (order by rank) as rank_number,
rank as score
from fts_articles
where headline match :query
limit :k
),
— combine FTS5 + vector search results with RRF
final as (
select
articles.id,
articles.headline,
vec_matches.rank_number as vec_rank,
fts_matches.rank_number as fts_rank,
— RRF algorithm
(
coalesce(1.0 / (:rrf_k + fts_matches.rank_number), 0.0) * :weight_fts +
coalesce(1.0 / (:rrf_k + vec_matches.rank_number), 0.0) * :weight_vec
) as combined_rank,
vec_matches.distance as vec_distance,
fts_matches.score as fts_score
from fts_matches
full outer join vec_matches on vec_matches.article_id = fts_matches.rowid
join articles on articles.rowid = coalesce(fts_matches.rowid, vec_matches.article_id)
order by combined_rank desc
)
select * from final;
I’ve been puzzled in the past over how to best do that because the distance scores from vector similarity and the relevance scores from FTS are meaningless in comparison to each other. RRF doesn’t even attempt to compare them – it uses them purely for row_number() ranking within each set and combines the results based on that.
Tags: embeddings, sql, vector-search, sqlite, search, alex-garcia, full-text-search, rag

AI Summary and Description: Yes

Summary: The text presents a novel approach to enhance search capabilities within SQLite databases by integrating hybrid search techniques that combine vector similarity search with traditional full-text search. The method, utilizing Reciprocal Rank Fusion (RRF), allows for a more efficient retrieval of relevant information, which may be beneficial for those working in AI and data retrieval fields.

Detailed Description: The content discusses an innovative effort by Alex to enhance SQLite search functionalities through his sqlite-vec SQLite extension. This extension adds fast vector lookups to SQLite, facilitating hybrid search capabilities that merge the results from both vector similarity search and traditional full-text search.

Key points include:

– **Hybrid Search Approach**: The technique integrates vector similarity and full-text search results, optimizing the search process within databases.
– **Reciprocal Rank Fusion (RRF)**: This algorithm is highlighted as the most promising method for combining search results. RRF allows for the seamless combination of ranking systems from both search methods without direct comparison of their scores.
– **Implementation**: A detailed SQL query illustrates how to implement this hybrid search. It includes:
– Two common table expressions (CTEs): `vec_matches` for vector search results and `fts_matches` for full-text search results.
– A final selection combining these results using the RRF algorithm, where search results are ranked based on a combined score that takes into account both vector distances and full-text relevance.

Further insights that may pertain to professional practices in cloud and information security include:

– **Improved Data Retrieval**: By enhancing search efficiency, organizations can improve the retrieval of sensitive or critical data from large datasets, which is important for compliance and information security.
– **Potential for Integration in Security Systems**: The methodologies discussed could be adapted for use in security systems requiring rapid, reliable access to logs and threat intelligence.
– **Use of Embeddings**: The use of embeddings in search might reflect an emerging trend where natural language processing and AI techniques are utilized in information retrieval, aligning with practices in AI security and infrastructure security.

Overall, the integration of vector search and traditional full-text search through innovative algorithms like RRF can significantly improve data handling capabilities in various applications, making it particularly relevant for professionals in AI, cloud computing, and information security sectors.