Cloud Blog: Introducing ScaNN in BigQuery vector search for large query batches

Source URL: https://cloud.google.com/blog/products/data-analytics/introducing-scann-in-bigquery-vector-search-for-large-query-batches/
Source: Cloud Blog
Title: Introducing ScaNN in BigQuery vector search for large query batches

Feedly Summary: We continue to add more capabilities to BigQuery to make it the AI-ready data platform for the Gemini era. Earlier this year, we introduced vector search, which enables vector similarity search on BigQuery data. Since then, we also added several functionalities such as stored columns and pre-filtering. Already, the scale, performance, and ease-of-use offered through BigQuery’s vector search and AI capabilities is empowering customers to build pipelines and applications, ranging from semantic search to LLM-based retrieval-augmented generation (RAG).
Today, we are announcing the preview of the TreeAH vector index, which brings core pieces from Google’s research and innovation in approximate nearest neighbor algorithms to BigQuery. This new index type uses the same underlying technology that powers some of Google’s most popular services and delivers significant latency and cost reductions in certain situations compared to the first index we implemented in BigQuery, the inverted file index (IVF). How significant? Read on to learn about architectural differences between the two, performance results, and when and how to use TreeAH rather than IVF. 
IVF vs. TreeAH indexes: a comparison
Using a vector index allows BigQuery to optimize the lookups and distance computations required to identify closely matching embeddings. Both IVF and TreeAH indexes allow BigQuery to perform approximate nearest neighbor (ANN) search instead of exact nearest neighbor search, which trades some accuracy for lower query latency and cost.
BigQuery’s first vector index, IVF, uses a scalable k-means clustering algorithm to partition the vector data into clusters. When you use the VECTOR_SEARCH function to search the vector data, it finds the clusters that are closest to the query vector and only ranks the vector data from those clusters; this reduces the number of distance calculations by a large factor.
The new TreeAH index is based on Google’s ScaNN algorithm, which is used in a multitude of Google services for similarity search. The main difference with the IVF index is the use of asymmetric hashing (the “AH" in the TreeAH), which uses product quantization to compress embeddings. Coupled with a CPU-optimized distance computation algorithm, vector search using TreeAH can be orders of magnitude faster and more cost-efficient than IVF. Index generation can also be 10x faster and cheaper and have a smaller memory footprint, as only the compressed embeddings are stored.
TreeAH performance
Our engineering team conducted benchmarks across various table configurations and query batch sizes to compare TreeAH with IVF. Here are the results:

Fig 1. Latency and cost for vector search queries

Fig 2. Vector index training latency and cost

Key results:

For small query batches, the IVF index performs as well as TreeAH, and sometimes better, due to the increased overhead introduced by the TreeAH index.

For large query batches, TreeAH significantly outperforms IVF due to its optimized distance-calculation algorithm.

TreeAH index training is also significantly faster and cheaper compared to IVF in most cases.

(*): The query using the IVF index benefited from block pruning optimization.
When to use TreeAH and current limitations
As shown in the benchmarks, the TreeAH index already greatly outperforms the IVF index when the query batch size is large. Index training is also faster and cheaper.
We’re still actively developing the TreeAH index and new features and performance improvements will be added over time. Currently, the following limitations apply:

The base table can have a maximum of 200 million rows.

Stored columns and pre-filtering are not supported for the TreeAH index.

Getting started with TreeAH
The TreeAH index type is in public preview and can be used today:

code_block
`\r\nON `<my_table>`(<embedding_column>)\r\nOPTIONS(distance_type=’COSINE’, index_type=’TREE_AH’)"), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e79fe196fd0>)])]>

For more details, please refer to the documentation.
We encourage you to try the TreeAH index and experience the performance and cost benefits it can bring to your vector search workloads. Please reach out to us (bq-vector-search@google.com) if you have any questions or need assistance in choosing the best index type for your workload.

AI Summary and Description: Yes

**Summary:** The text discusses the introduction of the TreeAH vector index to Google’s BigQuery, aimed at enhancing AI capabilities in data processing, specifically in vector similarity searches. This new index improves performance and cost-effectiveness compared to the previous inverted file index (IVF) and is relevant for businesses leveraging AI and cloud technologies in data applications.

**Detailed Description:**
The announcement details advancements in Google’s BigQuery, an AI-ready data platform that now includes the TreeAH vector index. Key insights from this text include the following points:

– **Introduction of TreeAH Vector Index:**
– A product of Google’s research in approximate nearest neighbor (ANN) algorithms.
– Designed to optimize vector similarity searches within BigQuery.

– **Comparison with Existing Index (IVF):**
– Both TreeAH and IVF allow for approximate nearest neighbor searches, which prioritize speed and cost-efficiency over perfect accuracy.
– IVF uses a scalable k-means clustering algorithm, while TreeAH is based on Google’s ScaNN algorithm, utilizing asymmetric hashing for faster performance.

– **Performance Enhancements:**
– Benchmarks indicate that TreeAH shows significant improvements over IVF for large query batches, particularly in terms of latency and cost.
– Index generation for TreeAH is noted to be 10x faster and cheaper, with a reduced memory footprint.

– **Limitations:**
– Current constraints include a maximum of 200 million rows per base table and no support for stored columns and pre-filtering in TreeAH.

– **Practical Implications:**
– This innovation enables a better foundation for applications such as semantic search and retrieval-augmented generation (RAG), beneficial for businesses utilizing LLMs (Large Language Models).
– Encourages organizations to consider adopting this index type for improved performance in handling vector search workloads.

The text ultimately signifies an important step for professionals in AI and cloud services, presenting opportunities for enhanced data processing that aligns with modern AI application needs. This progress highlights the ongoing evolution of cloud computing capabilities and the importance of keeping pace with emerging technologies in data management.