第三百零一节 Lucene教程 - Lucene索引文件

Lucene教程 - Lucene索引文件

索引是识别文档并为搜索准备文档的过程。

下表列出了索引过程中常用的类。

描述
IndexWriter在索引过程中创建/更新索引。
Directory表示索引的存储位置。
Analyzer分析文档并从文本中获取标记/单词。
Document带有字段的虚拟文档。分析仪可以处理文档。
Field索引过程的最低单位。它表示键值对,其中键用于标识索引值。

例子

以下代码显示了如何使用Lucene索引文本文件。

/** Licensed to the Apache Software Foundation (ASF) under one or more* contributor license agreements.  See the NOTICE file distributed with* this work for additional information regarding copyright ownership.* The ASF licenses this file to You under the Apache License, Version 2.0* (the "License"); you may not use this file except in compliance with* the License.  You may obtain a copy of the License at**     http://www.apache.org/licenses/LICENSE-2.0** Unless required by applicable law or agreed to in writing, software* distributed under the License is distributed on an "AS IS" BASIS,* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.* See the License for the specific language governing permissions and* limitations under the License.*/import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Date;/** Index all text files under a directory.* <p>* This is a command-line application demonstrating simple Lucene indexing.* Run it with no command-line arguments for usage information.*/
public class Main {private Main() {}/** Index all text files under a directory. */public static void main(String[] args) {String usage = "java IndexFiles"+ " [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"+ "This indexes the documents in DOCS_PATH, creating a Lucene index"+ "in INDEX_PATH that can be searched with SearchFiles";String indexPath = "index";String docsPath = null;boolean create = true;for(int i=0;i<args.length;i++) {if ("-index".equals(args[i])) {indexPath = args[i+1];i++;} else if ("-docs".equals(args[i])) {docsPath = args[i+1];i++;} else if ("-update".equals(args[i])) {create = false;}}if (docsPath == null) {System.err.println("Usage: " + usage);System.exit(1);}final File docDir = new File(docsPath);if (!docDir.exists() || !docDir.canRead()) {System.out.println("Document directory "" +docDir.getAbsolutePath()+ "" does not exist or is not readable, please check the path");System.exit(1);}Date start = new Date();try {System.out.println("Indexing to directory "" + indexPath + ""...");Directory dir = FSDirectory.open(new File(indexPath));// :Post-Release-Update-Version.LUCENE_XY:Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_0);IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);if (create) {// Create a new index in the directory, removing any// previously indexed documents:iwc.setOpenMode(OpenMode.CREATE);} else {// Add new documents to an existing index:iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);}// Optional: for better indexing performance, if you// are indexing many documents, increase the RAM// buffer.  But if you do this, increase the max heap// size to the JVM (eg add -Xmx512m or -Xmx1g)://// iwc.setRAMBufferSizeMB(256.0);IndexWriter writer = new IndexWriter(dir, iwc);indexDocs(writer, docDir);// NOTE: if you want to maximize search performance,// you can optionally call forceMerge here.  This can be// a terribly costly operation, so generally it"s only// worth it when your index is relatively static (ie// you"re done adding documents to it)://// writer.forceMerge(1);writer.close();Date end = new Date();System.out.println(end.getTime() - start.getTime() + " total milliseconds");} catch (IOException e) {System.out.println(" caught a " + e.getClass() +"\n with message: " + e.getMessage());}}/*** Indexes the given file using the given writer, or if a directory is given,* recurses over files and directories found under the given directory.* * NOTE: This method indexes one document per input file.  This is slow.  For good* throughput, put multiple documents into your input file(s).  An example of this is* in the benchmark module, which can create "line doc" files, one document per line,* using the* <a href="../../../../../contrib-benchmark/org/apache/lucene/benchmark/byTask/tasks/WriteLineDocTask.html"* >WriteLineDocTask</a>.*  * @param writer Writer to the index where the given file/dir info will be stored* @param file The file to index, or the directory to recurse into to find files to index* @throws IOException If there is a low-level I/O error*/static void indexDocs(IndexWriter writer, File file)throws IOException {// do not try to index files that cannot be readif (file.canRead()) {if (file.isDirectory()) {String[] files = file.list();// an IO error could occurif (files != null) {for (int i = 0; i < files.length; i++) {indexDocs(writer, new File(file, files[i]));}}} else {FileInputStream fis;try {fis = new FileInputStream(file);} catch (FileNotFoundException fnfe) {// at least on windows, some temporary files raise this exception with an "access denied" message// checking if the file can be read doesn"t helpreturn;}try {// make a new, empty documentDocument doc = new Document();// Add the path of the file as a field named "path".  Use a// field that is indexed (i.e. searchable), but don"t tokenize // the field into separate words and don"t index term frequency// or positional information:Field pathField = new StringField("path", file.getPath(), Field.Store.YES);doc.add(pathField);// Add the last modified date of the file a field named "modified".// Use a LongField that is indexed (i.e. efficiently filterable with// NumericRangeFilter).  This indexes to milli-second resolution, which// is often too fine.  You could instead create a number based on// year/month/day/hour/minutes/seconds, down the resolution you require.// For example the long value 2011021714 would mean// February 17, 2011, 2-3 PM.doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));// Add the contents of the file to a field named "contents".  Specify a Reader,// so that the text of the file is tokenized and indexed, but not stored.// Note that FileReader expects the file to be in UTF-8 encoding.// If that"s not the case searching for special characters will fail.doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8))));if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {// New index, so we just add the document (no old document can be there):System.out.println("adding " + file);writer.addDocument(doc);} else {// Existing index (an old copy of this document may have been indexed) so // we use updateDocument instead to replace the old one matching the exact // path, if present:System.out.println("updating " + file);writer.updateDocument(new Term("path", file.getPath()), doc);}} finally {fis.close();}}}}
}

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/883987.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

人工智能算法之粒子群优化算法

人工智能算法之粒子群优化算法 粒子群优化算法&#xff08;PSO&#xff09;是一种群体智能优化算法&#xff0c;由Kennedy和Eberhart在1995年提出&#xff0c;灵感来源于鸟群、鱼群等生物群体行为。PSO通过群体中个体的交互及对周围环境的感知&#xff0c;快速找到最优解。PSO算…

【深入浅出】深入浅出transformer(附面试题)

本文的目的是为了帮助大家面试transformer&#xff0c;会结合我的面试经历以及看法去讲解transformer&#xff0c;并非完整的技术细致讲解&#xff0c;介意请移步。 结构 提到transformer网络模型&#xff0c;大家脑海中是否有这张图呢&#xff1f; 这是网络结构中经典的编解…

net 获取本地ip地址,net mvc + net core 两种

net mvc public static string GetIP(HttpRequestBase request){// 尝试获取 X-Forwarded-For 头string result request.Headers["X-Forwarded-For"]?.Split(,).FirstOrDefault()?.Trim();if (string.IsNullOrEmpty(result)){// 获取用户的 IP 地址result reques…

Handler、Looper、message进阶知识

Android Handler、Looper、Message的进阶知识 在Android开发中&#xff0c;Handler、Looper和Message机制是多线程通信的核心。为了深入理解并优化它们的使用&#xff0c;尤其是在高并发和UI性能优化中&#xff0c;可以利用一些高级特性。 1. Handler的高阶知识 Handler在基本…

开源一款基于 JAVA 的仓库管理系统,支持三方物流和厂内物流,包含 PDA 和 WEB 端的源码

大家好&#xff0c;我是一颗甜苞谷&#xff0c;今天分享一款基于 JAVA 的仓库管理系统,支持三方物流和厂内物流,包含 PDA 和 WEB 端的源码。 前言 在当前的物流仓储行业&#xff0c;企业面临着信息化升级的迫切需求&#xff0c;但往往受限于高昂的软件采购和维护成本。现有的…

Sigrity Power SI Resonance analysis模式如何进行谐振分析操作指导

Sigrity Power SI Resonance analysis模式如何进行谐振分析操作指导 Sigrity Power SI可以方便快捷的进行谐振分析,谐振分析的目的是为了分析电源地平面组成的腔体的谐振频率以及谐振幅度,让频率在谐振频率附近的信号避开谐振腔,以及添加相应的电容来降低谐振峰值. 仍然以这…

vue特性

Vue.js是一套构建用户界面的渐进式框架&#xff0c;其特性主要包括以下几点&#xff1a; MVVM模式 Vue.js采用了MVVM&#xff08;Model-View-ViewModel&#xff09;的设计模式。在这种模式下&#xff0c;Model代表数据模型&#xff0c;View代表用户界面&#xff0c;ViewModel…

【AI开源项目】FastGPT- 快速部署FastGPT以及使用知识库的两种方式!

文章目录 一、FastGPT大模型介绍1. 开发团队2. 发展史3. 基本概念 二、FastGPT与其他大模型的对比三、使用 Docker Compose 快速部署 FastGPT1、安装 Docker 和 Docker Compose&#xff08;1&#xff09;. 安装 Docker&#xff08;2&#xff09;. 安装 Docker Compose&#xff…

Kubernetes实战——DevOps集成SpringBoot项目

目录 一、安装Gitlab 1、安装并配置Gitlab 1.1 、下载安装包 1.2、安装 1.3、修改配置文件 1.4、更新配置并重启 2、配置 2.1、修改密码 2.2、禁用注册功能 2.3、取消头像 2.4、修改中文配置 2.5、配置 webhook 3、卸载 二、安装镜像私服Harbor 1、下载安装包 2、…

多项目管理复杂性对企业的影响

在现代企业中&#xff0c;多项目管理已成为提升竞争力的关键策略。然而&#xff0c;资源分配冲突、沟通协调难题、优先级排序复杂等因素使得多项目管理充满挑战。资源分配冲突尤其突出&#xff0c;因为在多个项目同时进行时&#xff0c;有限的资源需要在不同项目间进行合理分配…

利用EasyExcel实现简易Excel导出

目标 通过注解形式完成对一个方法返回值的通用导出功能 工程搭建 pom <?xml version"1.0" encoding"UTF-8"?> <project xmlns"http://maven.apache.org/POM/4.0.0" xmlns:xsi"http://www.w3.org/2001/XMLSchema-instance&qu…

Mac OS 搭建MySQL开发环境

Mac OS 搭建MySQL开发环境 文章目录 Mac OS 搭建MySQL开发环境一、安装Mysql&#xff1a;二、配置环境变量三、安装Navicat 本地环境&#xff1a; Mac OS Sequoia15.0.1&#xff08;M3 Max) 目标状态&#xff1a; 下载安装Mysql&#xff0c;配置相关环境。 一、安装Mysql&…

关于springboot跨域与拦截器的问题

今天写代码的时候遇到的一个问题&#xff0c;在添加自己设置的token拦截器之后&#xff0c;报错&#xff1a; “ERROR Network Error AxiosError: Network Error at XMLHttpRequest.handleError (webpack-internal:///./node_modules/axios/lib/adapters/xhr.js:112:14) at Axi…

Java 面向对象编程(OOP)(4/30)

目录 Java 面向对象编程&#xff08;OOP&#xff09; 1. 类与对象 1.1 类的定义 1.2 对象的创建与使用 2. 封装 2.1 访问修饰符 2.2 使用 Getter 和 Setter 方法 3. 继承 3.1 继承的基本用法 3.2 方法重写 4. 多态 4.1 编译时多态&#xff08;方法重载&#xff09;…

NVR设备ONVIF接入平台EasyCVR视频分析设备平台视频质量诊断技术与能力

视频诊断技术是一种智能化的视频故障分析与预警系统&#xff0c;NVR设备ONVIF接入平台EasyCVR通过对前端设备传回的码流进行解码以及图像质量评估&#xff0c;对视频图像中存在的质量问题进行智能分析、判断和预警。这项技术在安防监控领域尤为重要&#xff0c;因为它能够确保监…

Python Pycharm下载

pycharm-professional-2023.3.3 python-3.9.0-amd64.exe 链接&#xff1a;https://pan.baidu.com/s/1YYf835hlleeDksPMmX9y2g?pwd9x16 提取码&#xff1a;9x16 更多资料获取学习书籍下面搜一搜这里不迷路&#xff0c;回复关键字获取&#xff1a;python

比较24个结构的迭代次数

(A,B)---6*30*2---(0,1)(1,0) 让A是结构1&#xff0c;让B全是0。收敛误差为7e-4&#xff0c;收敛199次取迭代次数平均值&#xff0c;得到28080.98 做一个同样的网络(A,B)---6*30*2---(0,1)(1,0)&#xff0c;让A是结构1-24&#xff0c;B全是0&#xff0c;用结构1的收敛权重做初…

深度了解flink(七) JobManager(1) 组件启动流程分析

前言 JobManager是Flink的核心进程&#xff0c;主要负责Flink集群的启动和初始化&#xff0c;包含多个重要的组件(JboMaster&#xff0c;Dispatcher&#xff0c;WebEndpoint等)&#xff0c;本篇文章会基于源码分析JobManagr的启动流程&#xff0c;对其各个组件进行介绍&#x…

.NET内网实战:通过白名单文件反序列化漏洞绕过UAC

01阅读须知 此文所节选自小报童《.NET 内网实战攻防》专栏&#xff0c;主要内容有.NET在各个内网渗透阶段与Windows系统交互的方式和技巧&#xff0c;对内网和后渗透感兴趣的朋友们可以订阅该电子报刊&#xff0c;解锁更多的报刊内容。 02基本介绍 03原理分析 在渗透测试和红…

ELK之路第三步——日志收集筛选logstash和filebeat

logstash和filebeat&#xff08;偷懒版&#xff09; 前言logstash1.下载2.修改配置文件3.测试启动4.文件启动 filebeat1.下载2.配置3.启动 前言 上一篇&#xff0c;我们说到了可视化界面Kibana的安装&#xff0c;这一篇&#xff0c;会简单介绍logstash和filebeat的安装和配置。…