Source URL: https://cloud.google.com/blog/products/data-analytics/bigquery-vector-search-is-now-ga/
Source: Cloud Blog
Title: BigQuery vector search now GA, setting the stage for a new class of AI-powered analytics
Feedly Summary: Artificial intelligence (AI) has given organizations brand new ways to represent data so that it can be analyzed and understood at scale. AI models are trained to interpret the semantic meaning of words and other data by encoding them as vector embeddings. These embeddings represent the relative position in the vector space of the data that is encoded. Semantically similar embeddings are proximate in vector space, whereas semantically dissimilar embeddings are… not proximate. And together, codifying the semantic understanding of data, applying that understanding across all types of data, and giving organizations simple tools to analyze it all, unblocks brand new approaches for data analytics.
That is why we are excited today to announce the general availability (GA) of BigQuery vector search, enabling vector similarity search on BigQuery data. This functionality, also commonly referred to as approximate nearest-neighbor search, is the key to empowering numerous new data and AI use cases such as semantic search, similarity detection, and retrieval-augmented generation (RAG) with a large language model (LLM).
Originally announced in February, BigQuery vector search integrates generation, management, and search of embeddings within the data platform, to provide a serverless and integrated vector-analytics solution for use cases such as anomaly detection, multi-modal search, product recommendations, drug discovery, and more.
In addition, the inverted file (IVF) index for BigQuery vector search is now also generally available. This index uses a k-means algorithm to cluster the vector data and combines those clusters with an inverted row locator in a two-piece index, so that similar embedding representations of your data can be searched efficiently. IVF has gained several enhancements since its preview announcement:
Improved scalability: You can now index 10 billion embeddings, enabling applications with massive scale.
Managed index with guaranteed correctness: When the underlying data changes, vector indexes are automatically updated using the existing k-means model. Vector search always returns correct results based on the latest mutations of the data, even before the system has finished re-indexing the modified data.
Stored columns: You can now store frequently used columns in the index to avoid expensive joins when retrieving additional data in the search result. This optimization yields the most noticeable performance improvements in scenarios with high result-set cardinality, for example, when your query data contains a large batch of embeddings, or when you need a high top_k. For a table with 1 billion 96-dimensional embeddings, returning the 1,000 most similar candidates for an embedding is ~4x faster and uses ~200x fewer slots with a vector index that has stored columns than with one that does not.
Pre-filters: Combined with stored columns, vector search results can be pre-filtered by rewriting the base-table argument into a query with filters. Compared with post-filtering, where the WHERE clauses are applied after the VECTOR_SEARCH() function, pre-filtering improves query performance, enhances search quality, and minimizes the risk of missing results (see the sketch after this list).
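To make the pre-filter vs. post-filter distinction concrete, here is a minimal sketch, assuming a hypothetical `my_dataset.items` table with `id` and `category` columns and a vector index whose stored columns include `category`; the table, column, and parameter values are illustrative and not from the original post.

```sql
-- Pre-filtering: the filter lives in the base-table subquery, so it narrows the
-- search space before the nearest neighbors are selected (leverages stored columns).
SELECT base.id, distance
FROM VECTOR_SEARCH(
  (SELECT * FROM `my_dataset.items` WHERE category = 'books'),  -- pre-filter
  'embedding',
  (SELECT embedding FROM `my_dataset.query_embeddings`),
  top_k => 10);

-- Post-filtering: the WHERE clause runs after the top_k neighbors are returned,
-- so matching rows that fall outside that top_k are silently dropped.
SELECT base.id, distance
FROM VECTOR_SEARCH(
  TABLE `my_dataset.items`,
  'embedding',
  (SELECT embedding FROM `my_dataset.query_embeddings`),
  top_k => 10)
WHERE base.category = 'books';  -- post-filter
```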
We’ve seen customers like Palo Alto Networks use BigQuery vector search to find similar common queries and accelerate time to insight.
“We’re leveraging BigQuery’s vector search in our copilot to suggest relevant query examples, significantly enhancing the customer experience. We really like how easy it was to set up and enable vector search in BigQuery and are impressed with the performance and speed of the similarity searches.” – Sameer Merchant, Senior Distinguished Engineer, Palo Alto Networks
Additionally, prototyping with BigQuery vector search and pushing to production is simple, even for massive-scale workloads like the drug discovery that Vilya has been working on. Further, on-demand pricing and budget-assessment tools have made the transition to capacity-based billing models seamless as workloads scale.
“We were really impressed with BigQuery’s ability to scale to the size of biological data needed to search for novel macrocycles to precisely target disease biology. BigQuery’s vector search ease-of-use and scalability enables us to rapidly search through billions of such macrocycles.” – Patrick Salveson, Co-Founder & CTO, Vilya
Building with an example
New to BigQuery vector search? Here’s a real-world example to get you started.
Imagine you want to post a question to an internal Q&A forum, but before you do, you would like to find out whether any existing questions are semantically similar to yours. To demonstrate, we assume that we have generated embeddings for the questions and stored them in the embedding column of a <my_posts_questions> table.
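If you have not produced those embeddings yet, the following is a minimal sketch of that prerequisite step, assuming a raw questions table and a remote embedding model; the names <raw_forum_questions> and <LLM_embedding_model> are placeholders, not identifiers from the original post.

```sql
-- Hypothetical prerequisite: build <my_posts_questions> with one embedding per question.
-- ML.GENERATE_EMBEDDING expects the text to embed in a column named 'content' and
-- passes the other selected columns through to the output.
CREATE OR REPLACE TABLE `<my_posts_questions>` AS
SELECT
  title,
  content,
  tags,
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `<LLM_embedding_model>`,
  (SELECT title, content, tags FROM `<raw_forum_questions>`)
);
```

With the embeddings in place, create an IVF vector index on the embedding column, storing the columns you will want returned with the search results: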
```sql
CREATE OR REPLACE VECTOR INDEX `<index_name>`
ON `<my_posts_questions>`(embedding)
STORING (title, content, tags)
OPTIONS(distance_type='COSINE', index_type='IVF')
```
Although VECTOR_SEARCH() works even if you don’t have a vector index, creating one often results in better query performance. Once the index is ready, you can combine VECTOR_SEARCH() with ML.GENERATE_EMBEDDING to search for questions similar to “Android app using RSS crashing”, for example. To further refine the results, you can apply a pre-filter on the “tags” column to restrict the search space:
```sql
SELECT query.query, base.title, base.content, distance
FROM VECTOR_SEARCH(
  (SELECT * FROM <my_posts_questions> WHERE SEARCH(tags, 'android')),
  'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding, content AS query
    FROM ML.GENERATE_EMBEDDING(
      MODEL <LLM_embedding_model>,
      (SELECT 'Android app using RSS crashing' AS content)
    )
  ))
```
We also recently announced, in preview, a new index type based on Google-developed ScaNN that can improve search performance even further. As BigQuery vector search evolves, it becomes a key component of a multimodal retrieval-augmented generation (RAG) solution that uses state-of-the-art Gemini models on top of a full-fledged BigQuery knowledge base comprising structured, unstructured, and multimodal data.
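As a hedged sketch only: at the time of writing, the ScaNN-based index is exposed as the TREE_AH index type in preview, so switching the earlier example over would amount to changing the index_type option; names and options for preview features can change, so check the current BigQuery documentation before relying on this.

```sql
-- Hypothetical sketch: ScaNN-based index type (TREE_AH, preview at time of writing).
-- Verify the current index_type and option names in the BigQuery documentation.
CREATE OR REPLACE VECTOR INDEX `<scann_index_name>`
ON `<my_posts_questions>`(embedding)
OPTIONS(distance_type='COSINE', index_type='TREE_AH')
```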
Get started today
The combination of vector embeddings and machine learning stands to revolutionize what you can do with the data stored in your BigQuery enterprise data warehouses — starting with searching on those embeddings quickly and cost-effectively. To get started with BigQuery vector search, take a look at the following resources:
Getting started with vector search
Learn more about vector indexes, prefilters, and stored columns
Product Recommendations Example
Generate and search multi-modal embeddings documentation
Semantic search for record deduplication code example
Log & anomaly detection using BQ vector search code example
AI Summary and Description: Yes
Summary: The text introduces the general availability of BigQuery vector search, a feature enabling advanced vector similarity search within Google Cloud’s BigQuery platform. This innovation represents a significant advancement in data analytics, particularly in enhancing the performance of machine learning applications like semantic search, anomaly detection, and retrieval-augmented generation (RAG).
Detailed Description:
The article discusses the transformative impact of BigQuery vector search on data analysis and AI applications by leveraging vector embeddings. Here are the major points covered:
– **Introduction to AI and Vector Embeddings**:
– AI enables organizations to analyze data at scale by representing it as vector embeddings, which reflect the semantic meaning and relative positioning of the data.
– **General Availability of BigQuery Vector Search**:
– This feature allows approximate nearest-neighbor search on BigQuery data, facilitating various data and AI applications.
– **Integration of AI Capabilities**:
– It supports semantic search, similarity detection, and RAG with large language models (LLMs), promoting innovative approaches to analyzing data.
– **Key Functionalities**:
– **Inverted File Index (IVF)**:
– Utilizes a k-means algorithm for clustering, allowing efficient searches of similar embedding representations.
– Scalability to index up to 10 billion embeddings.
– Guarantees accuracy in search results even amidst data changes through automatic indexing updates.
– **Performance Optimizations**:
– Ability to store frequently used columns in the index prevents expensive join operations.
– Pre-filtering capabilities yield improved query performance and search quality.
– **Real-World Applications**:
– Organizations like Palo Alto Networks are leveraging these features for enhanced customer experiences and insights.
– Vilya has successfully applied this in drug discovery, showcasing the system’s scalability and ease of use.
– **Getting Started with BigQuery Vector Search**:
– The text elaborates on practical usage through examples, including creating a vector index and performing similarity searches.
Overall, the introduction of BigQuery vector search represents a noteworthy advancement in data analytics, enabling organizations to efficiently and accurately analyze vast amounts of unstructured data while enhancing machine learning capabilities across multiple use cases. Security and compliance professionals in AI and cloud environments can leverage these advancements for better data governance and analytical insights.