It may also be possible to improve query results by learning a suit-
able score directly over the vector space. This is the goal of metric
learning, and several techniques have been proposed [21, 60, 91].
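As a rough illustration of the idea (not any specific method from [21, 60, 91]), the sketch below learns a linear transform with a triplet loss so that distances computed after the transform better separate similar from dissimilar pairs; the training loop, loss, and array shapes are illustrative assumptions:

import numpy as np

def learn_linear_metric(anchors, positives, negatives,
                        margin=1.0, lr=0.01, epochs=50):
    # Learn a linear map W so that d(a, p) = ||W a - W p|| is small for similar
    # pairs and d(a, n) is large for dissimilar pairs (a Mahalanobis-style metric).
    dim = anchors.shape[1]
    W = np.eye(dim)                                  # start from the Euclidean metric
    for _ in range(epochs):
        for a, p, n in zip(anchors, positives, negatives):
            d_ap, d_an = W @ (a - p), W @ (a - n)
            if d_ap @ d_ap - d_an @ d_an + margin > 0:   # triplet is violated
                grad = 2.0 * W @ (np.outer(a - p, a - p) - np.outer(a - n, a - n))
                W -= lr * grad
    return W

# The learned score is then score(q, v) = -np.linalg.norm(W @ q - W @ v).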
(2) Score Selection. Score selection aims to select the most appropriate score for a particular application. While many scores are known, automatic score selection remains challenging. We mention one attempt to dynamically adjust the score based on the query [82]. Score selection is also related to query semantics, as certain query entities such as text strings may still be ambiguous and need to be resolved before a suitable score can be selected [75]. Finally, the curse of dimensionality limits the usefulness of certain distance-based scores, requiring other scores to compensate [22, 30, 61].
Query Types and Basic Operators. Data manipulation queries aim to alter the vector collection. As each vector is derived from an embedding model, it is possible to integrate the model within the VDBMS. A VDBMS also must handle vector search queries. Basic search queries include 𝑘-nearest neighbor (𝑘-NN) and approximate nearest neighbor (ANN) queries, and query variants include predicated, batched, and multi-vector queries. To answer these queries, a VDBMS compares the similarity between the query vector and a number of candidate vectors using similarity projection. The quality of a result set is measured using precision and recall.
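For a single query, recall measures the fraction of the true nearest neighbors that appear in the returned set, while precision measures the fraction of returned vectors that are true neighbors; a minimal sketch with hypothetical helper names:

def recall(returned_ids, true_knn_ids):
    # Fraction of the ground-truth nearest neighbors that were returned.
    returned, truth = set(returned_ids), set(true_knn_ids)
    return len(returned & truth) / len(truth)

def precision(returned_ids, true_knn_ids):
    # Fraction of returned vectors that are ground-truth nearest neighbors.
    returned, truth = set(returned_ids), set(true_knn_ids)
    return len(returned & truth) / len(returned)

# Example: an ANN index returns {3, 7, 9, 12} while the exact 4-NN set is
# {3, 7, 11, 12}; both recall and precision are 3/4 = 0.75.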
(1) Data Manipulation. The embedding model can live inside or outside the VDBMS. Under direct manipulation, users directly modify the feature vectors, and the user is responsible for the embedding model [7, 90]. Under indirect manipulation, the collection appears as a collection of entities, not vectors, and users manipulate the entities [5, 10]. The VDBMS is responsible for the embedding model.
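The distinction can be sketched with a toy collection API; the class and function names here are hypothetical and do not reflect the interface of any system in [5, 7, 10, 90]:

import numpy as np

def embed(text):
    # Stand-in embedding model; a real deployment would call a trained model.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(8)

class DirectCollection:
    # Direct manipulation: the user supplies feature vectors and owns the model.
    def __init__(self):
        self.vectors = {}
    def insert(self, key, vector):
        self.vectors[key] = vector

class IndirectCollection:
    # Indirect manipulation: the user supplies entities; the VDBMS owns the model.
    def __init__(self, model):
        self.model, self.vectors = model, {}
    def insert(self, key, entity):
        self.vectors[key] = self.model(entity["title"])

direct = DirectCollection()
direct.insert(42, embed("wireless headphones"))         # user runs the model
indirect = IndirectCollection(model=embed)
indirect.insert(42, {"title": "wireless headphones"})   # system runs the model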
(2) Basic Search Queries. In a (𝑐, 𝑘)-search query, the goal is to retrieve 𝑘 vectors that are most similar to the query vector, and where no retrieved vector has a similarity score that is a factor of 𝑐 worse than the best non-zero score. In a range query, a similarity threshold is given, and the goal is to retrieve all vectors with similarity scores within the threshold. Some particular cases of (𝑐, 𝑘)-search queries have been studied, notably 𝑐 = 0, 𝑘 > 1 corresponding to the 𝑘-NN query and 𝑐 > 0, 𝑘 > 1 corresponding to the ANN query. These queries have been individually studied for many decades, leading to a variety of techniques and theoretical results [24, 48, 70].
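A brute-force scan makes these definitions concrete. The sketch below uses Euclidean distance as the score (smaller is better) and reads "a factor of 𝑐 worse" as a multiplicative slack of (1 + 𝑐), checked rank by rank against the exact answer; both choices are assumptions made here for illustration, as other formulations exist:

import numpy as np

def range_query(query, vectors, threshold):
    # Range query: return the indexes of all vectors within the distance threshold.
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.flatnonzero(dists <= threshold)

def satisfies_ck(returned_dists, exact_knn_dists, c):
    # One reading of the (c, k) guarantee: the i-th best returned distance is at
    # most a factor (1 + c) worse than the i-th exact nearest-neighbor distance,
    # so c = 0 demands the exact k-NN answer.
    returned = np.sort(np.asarray(returned_dists, dtype=float))
    exact = np.sort(np.asarray(exact_knn_dists, dtype=float))
    return bool(np.all(returned <= (1.0 + c) * exact))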
(3) Query Variants. Most VDBMSs support predicated or “hybrid” queries, and some also support batched and multi-vector queries for applications such as e-commerce, facial recognition, and text retrieval [11, 79, 84]. In a hybrid query, vectors in the collection are associated with structured attributes describing the represented entity, and each vector in the search result set must also satisfy boolean predicates over the corresponding attributes [84]. For batched queries, a number of search queries are given at once, and the VDBMS must answer all the queries in the batch. Several techniques have been proposed to exploit commonalities between the queries in order to speed up processing the batch [50, 79]. Finally, in a multi-vector query, multiple feature vectors are used to represent either the query, each entity, or both, and these can be supported via aggregate scores [79]. Multi-vector queries support several additional sub-variants.
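As one possible aggregation strategy (a sketch, not the framework of [79]), each query vector can be scored against each entity vector and the pairwise scores combined, here by taking the maximum cosine similarity:

import numpy as np

def multi_vector_score(query_vectors, entity_vectors):
    # Score every (query vector, entity vector) pair by cosine similarity and
    # aggregate with max; other aggregates (sum, weighted sum) are possible.
    q = query_vectors / np.linalg.norm(query_vectors, axis=1, keepdims=True)
    e = entity_vectors / np.linalg.norm(entity_vectors, axis=1, keepdims=True)
    return float((q @ e.T).max())

def multi_vector_knn(query_vectors, entities, k):
    # Rank entities (each represented by several vectors) by aggregate score.
    scores = np.array([multi_vector_score(query_vectors, ev) for ev in entities])
    return np.argsort(-scores)[:k]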
(4) Basic Operators. Similarity projection can be used to answer
vector search queries by projecting each vector in the collection
onto its similarity score.
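In its simplest form, this is a scan that emits one score per vector, over which top-𝑘 or threshold selection answers the basic search queries; a minimal sketch using negated Euclidean distance as the score:

import numpy as np

def similarity_projection(query, vectors):
    # Project every vector in the collection onto its similarity score to the
    # query; here the score is negated Euclidean distance (higher is better).
    return -np.linalg.norm(vectors - query, axis=1)

def knn_via_projection(query, vectors, k):
    # k-NN is then a top-k selection over the projected score column; a range
    # query is a threshold filter over the same column.
    scores = similarity_projection(query, vectors)
    return np.argsort(-scores)[:k]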
Query Interfaces. Some VDBMSs aim to support only a small num-
ber of query types, and simple APIs are sufficient. Other VDBMSs
aim to support a wide range of query types and may rely on SQL
extensions.
2.2 Indexing
An index can be used to speed up search queries, but vectors cannot
be indexed like structured attributes as they lack a natural sort or-
der and categories that are used in typical attribute indexes such as
the B-tree. As a result, vector indexes rely on techniques including ran-
domization, learned partitioning, and navigable partitioning in order
to partition the collection so that it can be more easily explored.
To address the large size of vectors, disk-resident indexes have also
been proposed, in addition to techniques based on a compression
technique called quantization. A single index may combine several
of these techniques, but the performance of an index also depends
on its structure, which can be table-based, tree-based, or graph-based.
In this section, we describe several vector indexes that are used in
existing VDBMSs, starting from table-based indexes.
Table-Based Indexes. A table-based index partitions the vector
collection into buckets, and each bucket can be retrieved by looking
up a key like the rows in a hash table. In general, table-based indexes
are easy to maintain, but search performance may be worse than that of
other indexes at the same recall if buckets are very large. Large buckets
can improve recall but are harder to scan, while small buckets may
suffer from low recall but are easier to scan. Existing techniques,
including locality sensitive hashing (LSH) and learning to hash (L2H),
tend to rely on randomization and learned partitioning to improve
performance at high recall. Furthermore, quantization is a compres-
sion technique that relies mostly on learning compression codes in
order to reduce storage costs.
(1) Locality Sensitive Hashing (LSH). Locality sensitive hashing relies on random hash functions to bucket vectors. The basic idea is to hash each vector into each of 𝐿 tables, where each table’s hash function is a concatenation of 𝐾 hash functions that belong to a “hash family”. To answer a query, the query vector is hashed to each of the tables, and collisions are kept as candidates. The hash family, along with the tunable parameters 𝐿 and 𝐾, can be designed to give error guarantees for ANN. Many hash families are known with varying performance and accuracy characteristics, including random hyperplanes in E2LSH [35], binary projections in IndexLSH [1], and overlapping spheres in FALCONN [23, 25].
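A minimal sketch of the table structure, using the sign-of-random-hyperplane family (one well-known choice for angular similarity; parameter names K and L mirror the description above):

import numpy as np

class HyperplaneLSH:
    # L hash tables; the key for each table concatenates K sign-of-random-
    # hyperplane hashes drawn from the same hash family.
    def __init__(self, dim, K=8, L=10, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((L, K, dim))
        self.tables = [dict() for _ in range(L)]

    def _keys(self, vector):
        # K sign bits per table, packed into a tuple used as the bucket key.
        return [tuple((planes @ vector > 0).astype(int)) for planes in self.planes]

    def insert(self, item_id, vector):
        for table, key in zip(self.tables, self._keys(vector)):
            table.setdefault(key, []).append(item_id)

    def query(self, vector):
        # The union of collisions across the L tables forms the candidate set,
        # which is then re-ranked by exact distance.
        candidates = set()
        for table, key in zip(self.tables, self._keys(vector)):
            candidates.update(table.get(key, []))
        return candidates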
(2) Learning to Hash (L2H). Learning to hash aims to use machine learning techniques to directly learn a hash function that can bucket similar vectors together. One technique is 𝑘-means clustering in SPANN, where each vector is bucketed along with other members of the same cluster [32]. The SPANN index also introduces several techniques for disk-resident collections such as overlapping buckets to reduce I/O retrievals. Other techniques include spectral hashing [85] and techniques based on neural networks [71]. While these techniques may produce high quality partitionings, they are data dependent and cannot easily handle out-of-distribution updates. A survey of techniques can be found in [81].
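The clustering idea can be sketched as follows; this is a plain 𝑘-means bucketing loosely in the spirit of SPANN [32], without its disk-resident layout or overlapping buckets, and the class and parameter names are assumptions:

import numpy as np

class KMeansBuckets:
    # Bucket vectors by their nearest learned centroid; queries probe only the
    # buckets whose centroids are closest to the query vector.
    def __init__(self, vectors, n_buckets=16, iters=10, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = vectors[rng.choice(len(vectors), n_buckets, replace=False)].astype(float)
        for _ in range(iters):                         # plain Lloyd iterations
            assign = self._nearest(vectors)
            for b in range(n_buckets):
                members = vectors[assign == b]
                if len(members):
                    self.centroids[b] = members.mean(axis=0)
        assign = self._nearest(vectors)
        self.buckets = {b: np.flatnonzero(assign == b) for b in range(n_buckets)}

    def _nearest(self, vectors):
        dists = np.linalg.norm(vectors[:, None, :] - self.centroids[None, :, :], axis=2)
        return dists.argmin(axis=1)

    def query(self, query_vector, n_probe=2):
        # Scan only the n_probe buckets whose centroids are closest to the query.
        order = np.linalg.norm(self.centroids - query_vector, axis=1).argsort()[:n_probe]
        return np.concatenate([self.buckets[b] for b in order])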