Table of Contents
- About hnswlib
- Installation
- Usage examples
- 1. Creating an index, inserting elements, searching, and pickle serialization
- 2. Updating after serialization/deserialization
About hnswlib
Header-only C++/python library for fast approximate nearest neighbors
- GitHub: https://github.com/nmslib/hnswlib
hnswlib is a C++ library for fast approximate nearest neighbor (ANN) search over high-dimensional vectors, built around the Hierarchical Navigable Small World (HNSW) algorithm.
HNSW is a graph-based approximate nearest neighbor method designed to efficiently find the vectors in a high-dimensional space that are most similar to a given query vector.
Main features of hnswlib:
- High-dimensional vector support: hnswlib focuses on high-dimensional vectors and suits applications that handle large amounts of high-dimensional data, such as image search, text retrieval, and recommender systems.
- HNSW algorithm: the core of hnswlib. It exploits a graph structure with local connectivity to build an index that makes nearest neighbor search in high-dimensional spaces efficient and practical.
- Multi-threading: hnswlib supports parallel index construction and search, so multi-core processors can be used to improve throughput.
- Memory friendly: the library is optimized for memory usage and can manage and store large collections of high-dimensional vectors efficiently.
- Flexible parameters: the index structure and search parameters can be tuned to the needs of different application scenarios.
- Easy integration: hnswlib offers a concise, header-only C++ interface that is easy to embed in your own applications.
The main strengths of hnswlib are its fast nearest neighbor search and its good support for high-dimensional vectors, which make it a solid choice for large-scale high-dimensional data.
It is widely used in machine learning, data mining, natural language processing, and other fields.
hnswlib also ships Python bindings, so it is convenient to use from Python as well.
Related articles
- 「向量召回」相似检索算法——HNSW (in Chinese)
https://mp.weixin.qq.com/s/dfdNj9CZ3Kj2UwDr9PQcVg
Installation
pip install hnswlib
Installing from source:
apt-get install -y python-setuptools python-pip
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .
Usage examples
C++ examples: https://github.com/nmslib/hnswlib/blob/master/examples/cpp/EXAMPLES.md
1. Creating an index, inserting elements, searching, and pickle serialization
import hnswlib
import numpy as np
import pickle

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements=num_elements, ef_construction=200, M=16)

# Element insertion (can be called several times):
p.add_items(data, ids)

# Controlling the recall by setting ef:
p.set_ef(50)  # ef should always be > k

# Query dataset, k - number of the closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k=1)

# Index objects support pickling
# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load
p_copy = pickle.loads(pickle.dumps(p))  # creates a copy of index p using pickle round-trip

# Index parameters are exposed as class properties:
print(f"Parameters passed to constructor: space={p_copy.space}, dim={p_copy.dim}")
print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")
2. Updating after serialization/deserialization
import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction
p.init_index(max_elements=num_elements // 2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:
index_path = 'first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index(index_path)
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function

print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements=num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")
伊织 2024-03-06 (Wed)