[Translated] How to make searching faster

Here are some things to try to speed up searching in your Lucene application. Please see ImproveIndexingSpeed for how to speed up indexing.

 

  • Be sure you really need to speed things up. Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your searching speed is indeed too slow, and that the slowness is indeed within Lucene.

     

  • Make sure you are using the latest version of Lucene.

     

  • Use a local filesystem. Remote filesystems are typically quite a bit slower for searching. If the index must be remote, try to mount the remote filesystem as a "readonly" mount. In some cases this can improve performance.

     

  • Get faster hardware, especially a faster IO system. Flash-based Solid State Drives work very well for Lucene searches. As seek times for SSDs are about 100 times faster than for traditional platter-based hard drives, the usual penalty for seeking is virtually eliminated. This means that SSD-equipped machines need less RAM for file caching and that searchers require less warm-up time before they respond quickly.

     

  • Tune the OS

    One tunable that stands out on Linux is swappiness (http://kerneltrap.org/node/3000), which controls how aggressively the OS will swap out RAM used by processes in favor of the IO cache. Most Linux distros default this to a highish number (meaning, aggressive), but this can easily cause horrible search latency, especially if you are searching a large index with a low query rate. (Translator's note: for example, with a 1 GB index that is searched only once a month, aggressive swapping means the index pages cached in RAM get evicted between queries, so every search pays the cost of reading them back from disk.) Experiment by turning swappiness down or off entirely (by setting it to 0). Windows also has a checkbox, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache; it likely does something similar.

     

  • Open the IndexReader with readOnly=true. This makes a big difference when multiple threads are sharing the same reader, as it removes certain sources of thread contention.
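
    For example, a minimal sketch of opening a read-only searcher (assuming the Lucene 2.9/3.x API, where IndexReader.open takes a readOnly flag; the index path is a placeholder):

      import java.io.File;

      import org.apache.lucene.index.IndexReader;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.FSDirectory;

      public class ReadOnlySearcherFactory {
          public static IndexSearcher open() throws Exception {
              Directory dir = FSDirectory.open(new File("/path/to/index")); // placeholder path
              // readOnly=true removes internal locking that writable readers need
              IndexReader reader = IndexReader.open(dir, true);
              return new IndexSearcher(reader);
          }
      }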

     

  • On non-Windows platforms, use NIOFSDirectory instead of FSDirectory.

    This also removes sources of contention when accessing the underlying files. Unfortunately, due to a longstanding bug on Windows in Sun's JRE (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734 -- feel particularly free to go vote for it), NIOFSDirectory gets poor performance on Windows.
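
    A hedged sketch of the same idea with NIOFSDirectory (assuming the Lucene 3.x constructor that takes a File; older releases used a factory method instead, and the path is a placeholder):

      import java.io.File;

      import org.apache.lucene.index.IndexReader;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.store.NIOFSDirectory;

      public class NioSearcherFactory {
          public static IndexSearcher open() throws Exception {
              // NIOFSDirectory uses positional reads, so concurrent threads do not
              // serialize on a single file pointer as they do with SimpleFSDirectory
              NIOFSDirectory dir = new NIOFSDirectory(new File("/path/to/index")); // placeholder path
              return new IndexSearcher(IndexReader.open(dir, true));
          }
      }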

     

  • Add RAM to your hardware and/or increase the heap size for the JVM. For a large index, searching can use a lot of RAM. If you don't have enough RAM, or your JVM is not running with a large enough heap size, the JVM can hit swapping and thrashing, at which point everything will run slowly.

     

  • Use one instance of IndexSearcher.

    Share a single IndexSearcher across queries and across threads in your application.
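
    A minimal sketch of sharing one searcher (the holder class and index path are illustrative, not part of Lucene; IndexSearcher itself is thread-safe for searching):

      import java.io.File;

      import org.apache.lucene.index.IndexReader;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.store.FSDirectory;

      public final class SharedSearcher {
          // One IndexSearcher instance, created once and reused by every query thread
          private static volatile IndexSearcher searcher;

          public static synchronized void init(String indexPath) throws Exception {
              if (searcher == null) {
                  IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexPath)), true);
                  searcher = new IndexSearcher(reader);
              }
          }

          public static IndexSearcher get() {
              return searcher;
          }
      }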

     

  • When measuring performance, disregard the first query.

    The first query to a searcher pays the price of initializing caches (especially when sorting by fields) and thus will skew your results (assuming you re-use the searcher for many queries). On the other hand, if you re-run the same query again and again, results won't be realistic either, because the operating system will use its cache to speed up IO operations. On Linux (kernel 2.6.16 and later) you can clear the disk cache using "sync ; echo 3 > /proc/sys/vm/drop_caches". See http://linux-mm.org/Drop_Caches for details.

     

  • Re-open the IndexSearcher only when necessary.

    You must re-open the IndexSearcher in order to make newly committed changes visible to searching. However, re-opening the searcher has a certain overhead (noticeable mostly with large indexes and with sorting turned on) and should thus be minimized. Consider using a so-called warming technique, which allows the searcher to warm up its caches before the first query hits.
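
    One possible shape of such a warming step (a sketch assuming Lucene 2.9/3.x, where IndexReader.reopen() exists; the "timestamp" sort field is hypothetical):

      import org.apache.lucene.index.IndexReader;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.MatchAllDocsQuery;
      import org.apache.lucene.search.Sort;
      import org.apache.lucene.search.SortField;

      public class SearcherWarmer {
          // Reopen only when the index has changed, warm the new searcher, then hand it back
          public static IndexSearcher maybeReopen(IndexSearcher current) throws Exception {
              IndexReader oldReader = current.getIndexReader();
              IndexReader newReader = oldReader.reopen(); // returns the same reader if nothing changed
              if (newReader == oldReader) {
                  return current;
              }
              IndexSearcher warmed = new IndexSearcher(newReader);
              // Warm-up query: fills the field cache used for sorting before real traffic arrives
              warmed.search(new MatchAllDocsQuery(), null, 10,
                      new Sort(new SortField("timestamp", SortField.LONG))); // hypothetical field
              return warmed; // the caller swaps it in and closes the old searcher/reader later
          }
      }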

     

  • Run optimize on your index before searching. An optimized index has only 1 segment to search, which can be much faster than the many segments that will normally be created, especially for a large index. If your application does not often update the index then it pays to build the index, optimize it, and use the optimized one for searching. If instead you are frequently updating the index and then refreshing searchers, then optimizing will likely be too costly and you should decrease mergeFactor instead.
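
    A sketch of an offline optimize pass (assuming the pre-4.0 IndexWriter.optimize() API; the analyzer choice and path are placeholders):

      import java.io.File;

      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.util.Version;

      public class OptimizePass {
          public static void optimize(String indexPath) throws Exception {
              IndexWriter writer = new IndexWriter(
                      FSDirectory.open(new File(indexPath)),
                      new StandardAnalyzer(Version.LUCENE_30),
                      IndexWriter.MaxFieldLength.UNLIMITED);
              writer.optimize(); // merges the index down to a single segment
              writer.close();
          }
      }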

     

  • Decrease mergeFactor. Smaller mergeFactors mean fewer segments and searching will be faster. However, this will slow down indexing speed, so you should test values to strike an appropriate balance for your application.
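
    For example (a sketch using the classic IndexWriter.setMergeFactor setter; newer releases configure this through the merge policy instead, and the path is a placeholder):

      import java.io.File;

      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.util.Version;

      public class LowMergeFactorWriter {
          public static IndexWriter open(String indexPath) throws Exception {
              IndexWriter writer = new IndexWriter(
                      FSDirectory.open(new File(indexPath)),
                      new StandardAnalyzer(Version.LUCENE_30),
                      IndexWriter.MaxFieldLength.UNLIMITED);
              writer.setMergeFactor(5); // default is 10; lower means fewer segments but slower indexing
              return writer;
          }
      }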

     

  • Limit usage of stored fields and term vectors. Retrieving these from the index is quite costly. Typically you should only retrieve these for the current "page" the user will see, not for all documents in the full result set. For each document retrieved, Lucene must seek to a different location in various files. Try sorting the documents you need to retrieve by docID order first.
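
    A sketch of fetching one page of stored documents in docID order (plain client-side code; the page of ScoreDocs comes from whatever search you already ran):

      import java.util.Arrays;
      import java.util.Comparator;

      import org.apache.lucene.document.Document;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.ScoreDoc;

      public class DocIdOrderFetch {
          // Visit the store files in roughly forward order instead of random score order
          public static Document[] loadPage(IndexSearcher searcher, ScoreDoc[] pageHits) throws Exception {
              ScoreDoc[] byDocId = pageHits.clone();
              Arrays.sort(byDocId, new Comparator<ScoreDoc>() {
                  public int compare(ScoreDoc a, ScoreDoc b) {
                      return a.doc - b.doc;
                  }
              });
              Document[] docs = new Document[byDocId.length];
              for (int i = 0; i < byDocId.length; i++) {
                  docs[i] = searcher.doc(byDocId[i].doc);
              }
              return docs; // note: returned in docID order, not score order
          }
      }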

     

  • Use FieldSelector to carefully pick which fields are loaded, and how they are loaded, when you retrieve a document.
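
    For instance (a sketch assuming the 2.x/3.x FieldSelector API; the "title" field name is hypothetical):

      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.FieldSelector;
      import org.apache.lucene.document.MapFieldSelector;
      import org.apache.lucene.search.IndexSearcher;

      public class TitleOnlyLoader {
          // Load just the "title" stored field; other stored fields are never read from disk
          public static String title(IndexSearcher searcher, int docId) throws Exception {
              FieldSelector onlyTitle = new MapFieldSelector(new String[] { "title" });
              Document doc = searcher.doc(docId, onlyTitle);
              return doc.get("title");
          }
      }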

     

  • Don't iterate over more hits than needed.

    Iterating over all hits is slow for two reasons. Firstly, the search() method that returns a Hits object re-executes the search internally when you need more than 100 hits. Solution: use the search method that takes a HitCollector instead. Secondly, the hits will probably be spread over the disk, so accessing them all requires much I/O activity. This cannot easily be avoided unless the index is small enough to be loaded into RAM. If you don't need the complete documents but only one (small) field, you could also use the FieldCache class to cache that one field and have fast access to it.
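
    A sketch of the HitCollector approach mentioned above (assuming the pre-3.0 API; from 2.9 on the Collector class plays the same role):

      import org.apache.lucene.search.HitCollector;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.Query;

      public class MatchCounter extends HitCollector {
          private int count;

          public void collect(int doc, float score) {
              count++; // every hit streams through here; no Document is ever loaded
          }

          public static int count(IndexSearcher searcher, Query query) throws Exception {
              MatchCounter counter = new MatchCounter();
              searcher.search(query, counter);
              return counter.count;
          }
      }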

     

  • When using fuzzy queries, use a minimum prefix length.

    Fuzzy queries perform CPU-intensive string comparisons - avoid comparing all unique terms with the user input by only examining terms starting with the first "N" characters. This prefix length is a property on both QueryParser and FuzzyQuery - the default is zero, so ALL terms are compared.
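
    For example (the field name and similarity value below are illustrative):

      import org.apache.lucene.index.Term;
      import org.apache.lucene.search.FuzzyQuery;

      public class FuzzyPrefixExample {
          public static FuzzyQuery build() {
              // prefixLength=3: only terms sharing the first 3 characters with "lucene"
              // are put through the expensive edit-distance comparison
              return new FuzzyQuery(new Term("contents", "lucene"), 0.5f, 3);
          }
      }

    QueryParser has a matching setFuzzyPrefixLength(int) setter for queries typed by users.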

     

  • Consider using filters. It can be much more efficient to restrict results to a part of the index using a cached bit set filter rather than using a query clause. This is especially true for restrictions that match a great number of documents of a large index. Filters are typically used to restrict the results to a category but could in many cases be used to replace any query clause. One difference between using a Query and a Filter is that the Query has an impact on the score while a Filter does not.
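
    A sketch of a cached category filter (the "category"/"books" field and value are made up for illustration):

      import org.apache.lucene.index.Term;
      import org.apache.lucene.search.CachingWrapperFilter;
      import org.apache.lucene.search.Filter;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.Query;
      import org.apache.lucene.search.QueryWrapperFilter;
      import org.apache.lucene.search.TermQuery;
      import org.apache.lucene.search.TopDocs;

      public class CategoryFilterSearch {
          // Build once and reuse: CachingWrapperFilter caches the bit set per index reader
          private static final Filter BOOKS_ONLY = new CachingWrapperFilter(
                  new QueryWrapperFilter(new TermQuery(new Term("category", "books"))));

          public static TopDocs search(IndexSearcher searcher, Query userQuery) throws Exception {
              return searcher.search(userQuery, BOOKS_ONLY, 10); // the filter does not affect scoring
          }
      }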

     

  • Find the bottleneck.

    Complex query analysis or heavy post-processing of results are examples of hidden bottlenecks for searches. Profiling with a tool such as VisualVM helps locate the problem.

 

Original article: http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Another translation (in Chinese): http://hi.baidu.com/expertsearch/blog/item/2195a237bfe83d360a55a9fd.html

