Table of Contents
- About hnswlib
- Installation
- Usage examples
- 1. Creating an index, inserting elements, searching, and pickle serialization
- 2. Updating after serialization/deserialization
About hnswlib
Header-only C++/python library for fast approximate nearest neighbors
- GitHub: https://github.com/nmslib/hnswlib
hnswlib is a C++ library for fast approximate nearest neighbor (ANN) search over high-dimensional vectors, built around the Hierarchical Navigable Small World (HNSW) algorithm.
HNSW is a graph-based approximate nearest neighbor method designed to efficiently find the vectors in a high-dimensional space that are most similar to a given query vector.
Main features of hnswlib:
- High-dimensional vector support: hnswlib focuses on high-dimensional vectors and suits applications that handle large amounts of high-dimensional data, such as image search, text retrieval, and recommender systems.
- HNSW algorithm: the core of hnswlib. It exploits a graph structure with local connectivity to build an index that makes nearest neighbor search in high-dimensional spaces efficient and practical.
- Multi-threading: hnswlib supports parallel index construction and search, so multi-core processors can be used to improve throughput.
- Memory friendly: the library is optimized for memory usage and can manage and store large collections of high-dimensional vectors efficiently.
- Flexible parameters: the index structure and search parameters can be tuned to the needs of different application scenarios.
- Easy integration: hnswlib offers a concise, header-only C++ interface that is easy to embed in your own applications.
The main strengths of hnswlib are its fast nearest neighbor search and its good support for high-dimensional vectors, which make it a solid choice for large-scale high-dimensional data.
It is widely used in machine learning, data mining, natural language processing, and other fields.
hnswlib also ships Python bindings, so it is convenient to use from Python as well.
Related articles
- 「向量召回」相似检索算法——HNSW (in Chinese)
https://mp.weixin.qq.com/s/dfdNj9CZ3Kj2UwDr9PQcVg
Installation
pip install hnswlib
Installing from source:
apt-get install -y python-setuptools python-pip
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .
Usage examples
C++ examples: https://github.com/nmslib/hnswlib/blob/master/examples/cpp/EXAMPLES.md
1. Creating an index, inserting elements, searching, and pickle serialization
import hnswlib
import numpy as np
import pickle

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements=num_elements, ef_construction=200, M=16)

# Element insertion (can be called several times):
p.add_items(data, ids)

# Controlling the recall by setting ef:
p.set_ef(50)  # ef should always be > k

# Query dataset, k - number of the closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k=1)

# Index objects support pickling
# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load
p_copy = pickle.loads(pickle.dumps(p))  # creates a copy of index p using pickle round-trip

# Index parameters are exposed as class properties:
print(f"Parameters passed to constructor: space={p_copy.space}, dim={p_copy.dim}")
print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")
2. Updating after serialization/deserialization
import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction
p.init_index(max_elements=num_elements // 2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:
index_path = 'first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index(index_path)
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function

print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements=num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")
伊织 2024-03-06 (Wed)