The solrconfig.xml file contains a large number of parameters that govern Solr's own behavior. A sample solrconfig.xml ships with Solr and can be found under the directory where you unpacked the distribution.
Open solrconfig.xml in a text editor and you will see content like the following:
<?xml version="1.0" encoding="UTF-8" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!--
For more details about configurations options that may appear in
this file, see http://wiki.apache.org/solr/SolrConfigXml.
-->
<config>
<!-- In all configuration below, a prefix of "solr." for class names
is an alias that causes solr to search appropriate packages,
including org.apache.solr.(search|update|request|core|analysis)
You may also specify a fully qualified Java classname if you
have your own custom plugins.
-->
<!-- Controls what version of Lucene various components of Solr
adhere to. Generally, you want to use the latest version to
get all bug fixes and improvements. It is highly recommended
that you fully re-index after changing this setting as it can
affect both how text is indexed and queried.
-->
<luceneMatchVersion>5.1.0</luceneMatchVersion>
<!-- Data Directory
Used to specify an alternate directory to hold all index data
other than the default ./data under the Solr home. If
replication is in use, this should match the replication
configuration.
-->
<!--
<dataDir>${solr.data.dir:}</dataDir>
-->
<dataDir>C:\solr_home\core1\data</dataDir>
<!-- The DirectoryFactory to use for indexes.
solr.StandardDirectoryFactory is filesystem
based and tries to pick the best implementation for the current
JVM and platform. solr.NRTCachingDirectoryFactory, the default,
wraps solr.StandardDirectoryFactory and caches small files in memory
for better NRT performance.
One can force a particular implementation via solr.MMapDirectoryFactory,
solr.NIOFSDirectoryFactory, or solr.SimpleFSDirectoryFactory.
solr.RAMDirectoryFactory is memory based, not
persistent, and doesn't work with replication.
-->
<directoryFactory name="DirectoryFactory"
class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
</directoryFactory>
<!-- The CodecFactory for defining the format of the inverted index.
The default implementation is SchemaCodecFactory, which is the official Lucene
index format, but hooks into the schema to provide per-field customization of
the postings lists and per-document values in the fieldType element
(postingsFormat/docValuesFormat). Note that most of the alternative implementations
are experimental, so if you choose to customize the index format, it's a good
idea to convert back to the official format e.g. via IndexWriter.addIndexes(IndexReader)
before upgrading to a newer version to avoid unnecessary reindexing.
-->
<codecFactory class="solr.SchemaCodecFactory"/>
<schemaFactory class="ClassicIndexSchemaFactory"/>
<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Index Config - These settings control low-level behavior of indexing
Most example settings here show the default value, but are commented
out, to more easily see where customizations have been made.
Note: This replaces <indexDefaults> and <mainIndex> from older versions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
<indexConfig>
<!-- LockFactory
This option specifies which Lucene LockFactory implementation
to use.
single = SingleInstanceLockFactory - suggested for a
read-only index or when there is no possibility of
another process trying to modify the index.
native = NativeFSLockFactory - uses OS native file locking.
Do not use when multiple solr webapps in the same
JVM are attempting to share a single index.
simple = SimpleFSLockFactory - uses a plain file for locking
Defaults: 'native' is default for Solr3.6 and later, otherwise
'simple' is the default
More details on the nuances of each LockFactory...
http://wiki.apache.org/lucene-java/AvailableLockFactories
-->
<lockType>${solr.lock.type:native}</lockType>
<!-- Lucene Infostream
To aid in advanced debugging, Lucene provides an "InfoStream"
of detailed information when indexing.
Setting the value to true will instruct the underlying Lucene
IndexWriter to write its info stream to solr's log. By default,
this is enabled here, and controlled through log4j.properties.
-->
<infoStream>true</infoStream>
</indexConfig>
<!-- JMX
This example enables JMX if and only if an existing MBeanServer
is found, use this if you want to configure JMX through JVM
parameters. Remove this to disable exposing Solr configuration
and statistics to JMX.
For more details see http://wiki.apache.org/solr/SolrJmx
-->
<jmx />
<!-- If you want to connect to a particular server, specify the
agentId
-->
<!-- <jmx agentId="myAgent" /> -->
<!-- If you want to start a new MBeanServer, specify the serviceUrl -->
<!-- <jmx serviceUrl="service:jmx:rmi:///jndi/rmi://localhost:9999/solr"/>
-->
<!-- The default high-performance update handler -->
<updateHandler class="solr.DirectUpdateHandler2">
<!-- Enables a transaction log, used for real-time get, durability, and
and solr cloud replica recovery. The log can grow as big as
uncommitted changes to the index, so use of a hard autoCommit
is recommended (see below).
"dir" - the target directory for transaction logs, defaults to the
solr data directory. -->
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
<!-- AutoCommit
Perform a hard commit automatically under certain conditions.
Instead of enabling autoCommit, consider using "commitWithin"
when adding documents.
http://wiki.apache.org/solr/UpdateXmlMessages
maxDocs - Maximum number of documents to add since the last
commit before automatically triggering a new commit.
maxTime - Maximum amount of time in ms that is allowed to pass
since a document was added before automatically
triggering a new commit.
openSearcher - if false, the commit causes recent index changes
to be flushed to stable storage, but does not cause a new
searcher to be opened to make those changes visible.
If the updateLog is enabled, then it's highly recommended to
have some sort of hard autoCommit to limit the log size.
-->
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<!-- softAutoCommit is like autoCommit except it causes a
'soft' commit which only ensures that changes are visible
but does not ensure that data is synced to disk. This is
faster and more near-realtime friendly than a hard commit.
-->
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
</updateHandler>
<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Query section - these settings control query time things like caches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
<query>
<!-- Max Boolean Clauses
Maximum number of clauses in each BooleanQuery, an exception
is thrown if exceeded.
** WARNING **
This option actually modifies a global Lucene property that
will affect all SolrCores. If multiple solrconfig.xml files
disagree on this property, the value at any given moment will
be based on the last SolrCore to be initialized.
-->
<maxBooleanClauses>1024</maxBooleanClauses>
<!-- Solr Internal Query Caches
There are two implementations of cache available for Solr,
LRUCache, based on a synchronized LinkedHashMap, and
FastLRUCache, based on a ConcurrentHashMap.
FastLRUCache has faster gets and slower puts in single
threaded operation and thus is generally faster than LRUCache
when the hit ratio of the cache is high (> 75%), and may be
faster under other scenarios on multi-cpu systems.
-->
<!-- Filter Cache
Cache used by SolrIndexSearcher for filters (DocSets),
unordered sets of *all* documents that match a query. When a
new searcher is opened, its caches may be prepopulated or
"autowarmed" using data from caches in the old searcher.
autowarmCount is the number of items to prepopulate. For
LRUCache, the autowarmed items will be the most recently
accessed items.
Parameters:
class - the SolrCache implementation LRUCache or
(LRUCache or FastLRUCache)
size - the maximum number of entries in the cache
initialSize - the initial capacity (number of entries) of
the cache. (see java.util.HashMap)
autowarmCount - the number of entries to prepopulate from
and old cache.
-->
<filterCache class="solr.FastLRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
<!-- Query Result Cache
Caches results of searches - ordered lists of document ids
(DocList) based on a query, a sort, and the range of documents requested.
-->
<queryResultCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
<!-- Document Cache
Caches Lucene Document objects (the stored fields for each
document). Since Lucene internal document ids are transient,
this cache will not be autowarmed.
-->
<documentCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
<!-- custom cache currently used by block join -->
<cache name="perSegFilter"
class="solr.search.LRUCache"
size="10"
initialSize="0"
autowarmCount="10"
regenerator="solr.NoOpRegenerator" />
<!-- Lazy Field Loading
If true, stored fields that are not requested will be loaded
lazily. This can result in a significant speed improvement
if the usual case is to not load all stored fields,
especially if the skipped fields are large compressed text
fields.
-->
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<!-- Result Window Size
An optimization for use with the queryResultCache. When a search
is requested, a superset of the requested number of document ids
are collected. For example, if a search for a particular query
requests matching documents 10 through 19, and queryWindowSize is 50,
then documents 0 through 49 will be collected and cached. Any further
requests in that range can be satisfied via the cache.
-->
<queryResultWindowSize>20</queryResultWindowSize>
<!-- Maximum number of documents to cache for any entry in the
queryResultCache.
-->
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
<!-- Use Cold Searcher
If a search request comes in and there is no current
registered searcher, then immediately register the still
warming searcher and use it. If "false" then all requests
will block until the first searcher is done warming.
-->
<useColdSearcher>false</useColdSearcher>
<!-- Max Warming Searchers
Maximum number of searchers that may be warming in the
background concurrently. An error is returned if this limit
is exceeded.
Recommend values of 1-2 for read-only slaves, higher for
masters w/o cache warming.
-->
<maxWarmingSearchers>2</maxWarmingSearchers>
</query>
<!-- Request Dispatcher
This section contains instructions for how the SolrDispatchFilter
should behave when processing requests for this SolrCore.
handleSelect is a legacy option that affects the behavior of requests
such as /select?qt=XXX
handleSelect="true" will cause the SolrDispatchFilter to process
the request and dispatch the query to a handler specified by the
"qt" param, assuming "/select" isn't already registered.
handleSelect="false" will cause the SolrDispatchFilter to
ignore "/select" requests, resulting in a 404 unless a handler
is explicitly registered with the name "/select"
handleSelect="true" is not recommended for new users, but is the default
for backwards compatibility
-->
<requestDispatcher handleSelect="false" >
<!-- Request Parsing
These settings indicate how Solr Requests may be parsed, and
what restrictions may be placed on the ContentStreams from
those requests
enableRemoteStreaming - enables use of the stream.file
and stream.url parameters for specifying remote streams.
multipartUploadLimitInKB - specifies the max size (in KiB) of
Multipart File Uploads that Solr will allow in a Request.
formdataUploadLimitInKB - specifies the max size (in KiB) of
form data (application/x-www-form-urlencoded) sent via
POST. You can use POST to pass request parameters not
fitting into the URL.
addHttpRequestToContext - if set to true, it will instruct
the requestParsers to include the original HttpServletRequest
object in the context map of the SolrQueryRequest under the
key "httpRequest". It will not be used by any of the existing
Solr components, but may be useful when developing custom
plugins.
*** WARNING ***
The settings below authorize Solr to fetch remote files, You
should make sure your system has some authentication before
using enableRemoteStreaming="true"
-->
<requestParsers enableRemoteStreaming="true"
multipartUploadLimitInKB="2048000"
formdataUploadLimitInKB="2048"
addHttpRequestToContext="false"/>
<!-- HTTP Caching
Set HTTP caching related parameters (for proxy caches and clients).
The options below instruct Solr not to output any HTTP Caching
related headers
-->
<httpCaching never304="true" />
</requestDispatcher>
<!-- Request Handlers
http://wiki.apache.org/solr/SolrRequestHandler
Incoming queries will be dispatched to a specific handler by name
based on the path specified in the request.
Legacy behavior: If the request path uses "/select" but no Request
Handler has that name, and if handleSelect="true" has been specified in
the requestDispatcher, then the Request Handler is dispatched based on
the qt parameter. Handlers without a leading '/' are accessed this way
like so: http://host/app/[core/]select?qt=name If no qt is
given, then the requestHandler that declares default="true" will be
used or the one named "standard".
If a Request Handler is declared with startup="lazy", then it will
not be initialized until the first request that uses it.
-->
<!-- SearchHandler
http://wiki.apache.org/solr/SearchHandler
For processing Search Queries, the primary Request Handler
provided with Solr is "SearchHandler" It delegates to a sequent
of SearchComponents (see below) and supports distributed
queries across multiple shards
-->
<!--
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">solr-data-config.xml</str>
</lst>
</requestHandler>
-->
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
</lst>
</requestHandler>
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
<str name="df">text</str>
</lst>
</requestHandler>
<!--
The export request handler is used to export full sorted result sets.
Do not change these defaults.
-->
<requestHandler name="/export" class="solr.SearchHandler">
<lst name="invariants">
<str name="rq">{!xport}</str>
<str name="wt">xsort</str>
<str name="distrib">false</str>
</lst>
<arr name="components">
<str>query</str>
</arr>
</requestHandler>
<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell">
<lst name="defaults">
<str name="df">text</str>
</lst>
</initParams>
<!-- Field Analysis Request Handler
RequestHandler that provides much the same functionality as
analysis.jsp. Provides the ability to specify multiple field
types and field names in the same request and outputs
index-time and query-time analysis for each of them.
Request parameters are:
analysis.fieldname - field name whose analyzers are to be used
analysis.fieldtype - field type whose analyzers are to be used
analysis.fieldvalue - text for index-time analysis
q (or analysis.q) - text for query time analysis
analysis.showmatch (true|false) - When set to true and when
query analysis is performed, the produced tokens of the
field value analysis will be marked as "matched" for every
token that is produces by the query analysis
-->
<requestHandler name="/analysis/field"
startup="lazy"
class="solr.FieldAnalysisRequestHandler" />
<!-- Document Analysis Handler
http://wiki.apache.org/solr/AnalysisRequestHandler
An analysis handler that provides a breakdown of the analysis
process of provided documents. This handler expects a (single)
content stream with the following format:
<docs>
<doc>
<field name="id">1</field>
<field name="name">The Name</field>
<field name="text">The Text Value</field>
</doc>
<doc>...</doc>
<doc>...</doc>
...
</docs>
Note: Each document must contain a field which serves as the
unique key. This key is used in the returned response to associate
an analysis breakdown to the analyzed document.
Like the FieldAnalysisRequestHandler, this handler also supports
query analysis by sending either an "analysis.query" or "q"
request parameter that holds the query text to be analyzed. It
also supports the "analysis.showmatch" parameter which when set to
true, all field tokens that match the query tokens will be marked
as a "match".
-->
<requestHandler name="/analysis/document"
class="solr.DocumentAnalysisRequestHandler"
startup="lazy" />
<!-- Echo the request contents back to the client -->
<requestHandler name="/debug/dump" class="solr.DumpRequestHandler" >
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="echoHandler">true</str>
</lst>
</requestHandler>
<!-- Search Components
Search components are registered to SolrCore and used by
instances of SearchHandler (which can access them by name)
By default, the following components are available:
<searchComponent name="query" class="solr.QueryComponent" />
<searchComponent name="facet" class="solr.FacetComponent" />
<searchComponent name="mlt" class="solr.MoreLikeThisComponent" />
<searchComponent name="highlight" class="solr.HighlightComponent" />
<searchComponent name="stats" class="solr.StatsComponent" />
<searchComponent name="debug" class="solr.DebugComponent" />
-->
<!-- Terms Component
http://wiki.apache.org/solr/TermsComponent
A component to return terms and document frequency of those
terms
-->
<searchComponent name="terms" class="solr.TermsComponent"/>
<!-- A request handler for demonstrating the terms component -->
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<bool name="terms">true</bool>
<bool name="distrib">false</bool>
</lst>
<arr name="components">
<str>terms</str>
</arr>
</requestHandler>
<!-- Legacy config for the admin interface -->
<admin>
<defaultQuery>*:*</defaultQuery>
</admin>
</config>
Below I explain the key parts.
lib
The <lib> directive tells Solr where to load the JAR files that Solr plugins depend on. The comments in solrconfig.xml contain example entries, for instance:
<lib dir="./lib" regex="lucene-\w+\.jar"/>
Here dir is a directory of JAR files, resolved relative to the root directory of the current core; regex is a regular expression used to filter file names, and only JARs whose names match it are loaded.
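As a minimal sketch (the directory layout here is illustrative, not taken from the file above), loading all JARs under a core-local lib directory plus only the DataImportHandler JARs from the Solr distribution might look like this:
<!-- load every jar under <core>/lib (a hypothetical local directory) -->
<lib dir="./lib" />
<!-- load only the DIH jars from the dist directory shipped with Solr -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />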
dataDir parameter
Specifies the directory that holds Solr's index data; the index files themselves live under data\index. By default dataDir is resolved relative to the current core's directory (if a core exists under solr_home); if there is no core, it is resolved relative to solr_home itself. In practice, dataDir is usually set in core.properties instead.
<dataDir>/var/data/solr</dataDir>
codecFactory
Sets the codec factory used to encode the Lucene inverted index. The default implementation is the official SchemaCodecFactory.
indexConfig Section
The <indexConfig> element in solrconfig.xml contains many (mostly commented-out) settings along with explanatory notes, for example:
<!-- maxFieldLength was removed in 4.0. To get similar behavior, include a
LimitTokenCountFilterFactory in your fieldType definition. E.g.
<filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"/>
This reminds us that the maxFieldLength setting was removed in 4.0; similar behavior can be obtained by adding a LimitTokenCountFilterFactory to the relevant fieldType. With maxTokenCount="10000", only the first 10000 tokens produced when analyzing a field value are indexed and the rest are discarded, much like the old maxFieldLength, which capped how many terms were indexed per field.
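For instance, a fieldType using this filter could be declared in schema.xml roughly like this (the name text_limited and the tokenizer choice are my own illustration, not from this file):
<fieldType name="text_limited" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- keep only the first 10000 tokens of each field value; the rest are dropped -->
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>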
<writeLockTimeout>1000</writeLockTimeout>
writeLockTimeout is the maximum time, in milliseconds, that an IndexWriter will wait to acquire the write lock; if the lock still cannot be obtained within that time, the write operation throws an exception.
<maxIndexingThreads>8</maxIndexingThreads>
The maximum number of threads used for indexing; by default up to 8 threads may index concurrently.
<useCompoundFile>false</useCompoundFile>
Controls the compound file format. Enabling it means fewer index files on disk and therefore fewer file descriptors in use, at some cost in performance. Lucene enables it by default, but Solr has disabled it by default since version 3.6.
<ramBufferSizeMB>100</ramBufferSizeMB>
The size, in MB, of the RAM buffer used while indexing; once buffered changes exceed this size (100 MB by default) they are flushed to disk.
<maxBufferedDocs>1000</maxBufferedDocs>
The maximum number of documents buffered in memory before being written to disk; exceeding this limit triggers a flush of the index.
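Putting the above together, a hand-tuned <indexConfig> block might look like the sketch below (the values are illustrative, not recommendations from this file):
<indexConfig>
  <writeLockTimeout>1000</writeLockTimeout>
  <maxIndexingThreads>8</maxIndexingThreads>
  <useCompoundFile>false</useCompoundFile>
  <!-- a flush is triggered by whichever threshold is reached first -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <lockType>${solr.lock.type:native}</lockType>
</indexConfig>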
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
<int name="maxMergeAtOnce">10</int>
<int name="segmentsPerTier">10</int>
</mergePolicy>
Configures Lucene's segment merge policy. It takes two parameters:
maxMergeAtOnce: the maximum number of segments merged at once.
segmentsPerTier: the number of segments allowed per tier; segment sizes grow geometrically from one tier to the next (the level size is multiplied by maxMergeAtOnce at each step), as the TieredMergePolicy source shows:
// Compute max allowed segs in the index
long levelSize = minSegmentBytes;   // segment size of the smallest tier
long bytesLeft = totIndexBytes;     // index bytes not yet assigned to a tier
double allowedSegCount = 0;
while(true) {
  final double segCountLevel = bytesLeft / (double) levelSize;
  if (segCountLevel < segsPerTier) {
    allowedSegCount += Math.ceil(segCountLevel);
    break;
  }
  // each tier may hold segsPerTier segments of the current level size
  allowedSegCount += segsPerTier;
  bytesLeft -= segsPerTier * levelSize;
  levelSize *= maxMergeAtOnce;      // the next tier's segments are maxMergeAtOnce times larger
}
int allowedSegCountInt = (int) allowedSegCount;
<mergeFactor>10</mergeFactor>
To understand what the mergeFactor means, here is the explanation given in Lucene in Action:
IndexWriter's mergeFactor lets you control how many Documents to store in memory
before writing them to the disk, as well as how often to merge multiple index
segments together. (Index segments are covered in appendix B.) With the default
value of 10, Lucene stores 10 Documents in memory before writing them to a single
segment on the disk. The mergeFactor value of 10 also means that once the
number of segments on the disk has reached the power of 10, Lucene merges
these segments into a single segment.
For instance, if you set mergeFactor to 10, a new segment is created on the disk
for every 10 Documents added to the index. When the tenth segment of size 10 is
added, all 10 are merged into a single segment of size 100. When 10 such segments
of size 100 have been added, they're merged into a single segment containing
1,000 Documents, and so on. Therefore, at any time, there are no more than 9
segments in the index, and the size of each merged segment is the power of 10.
There is a small exception to this rule that has to do with maxMergeDocs,
another IndexWriter instance variable: While merging segments, Lucene ensures that no segment with more than maxMergeDocs Documents is created. For instance,
suppose you set maxMergeDocs to 1,000. When you add the ten-thousandth Document,
instead of merging multiple segments into a single segment of size 10,000,
Lucene creates the tenth segment of size 1,000 and keeps adding new segments
of size 1,000 for every 1,000 Documents added.
In other words, IndexWriter's mergeFactor controls how many documents are buffered in memory before being written to disk as a segment, as well as how often segments are merged. The default is 10: Lucene buffers 10 documents before writing them out as one segment, and once the number of same-sized segments on disk reaches 10, those segments are merged into one. So with mergeFactor set to 10, every 10 added documents produce a new on-disk segment; when the tenth 10-document segment appears, the ten are merged into a single 100-document segment; when ten such 100-document segments exist, they are merged into a 1000-document segment; and so on. At any point in time there are therefore never more than 9 segments at each size level. There is one caveat: maxMergeDocs. While merging, Lucene guarantees that no segment containing more than maxMergeDocs documents is ever created, which keeps individual segments from growing without bound. If you set maxMergeDocs to 1000, then when the tenth 1000-document segment is created, no merge into a 10000-document segment takes place; Lucene simply keeps adding new 1000-document segments. Some other parameters also influence segment merging, for example:
mergeFactor: merging starts once the number of roughly equal-sized segments reaches this value.
minMergeSize: all segments smaller than this size are treated as roughly equal in size and are merged together.
maxMergeSize: a segment larger than this size no longer participates in merges.
maxMergeDocs: a segment containing more documents than this no longer participates in merges.
Segment merging happens in two steps:
1. First, the segments to merge are selected; this decision is made by the MergePolicy.
2. Then the actual merge is carried out by the MergeScheduler, which mainly does two things:
A. merges the stored fields, term vectors, and norms (normalization factors);
B. merges the inverted index data.
But I digress; back to the solrconfig.xml settings that affect index creation.
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
mergeScheduler, mentioned above, configures the class that carries out segment merges. The default is Lucene's own ConcurrentMergeScheduler.
<lockType>${solr.lock.type:native}</lockType>
Specifies which Lucene LockFactory implementation to use. The available options are:
single = SingleInstanceLockFactory - suggested for a
read-only index or when there is no possibility of
another process trying to modify the index.
native = NativeFSLockFactory - uses OS native file locking.
Do not use when multiple solr webapps in the same
JVM are attempting to share a single index.
simple = SimpleFSLockFactory - uses a plain file for locking
Defaults: 'native' is default for Solr3.6 and later, otherwise
'simple' is the default
single: SingleInstanceLockFactory, an in-JVM lock suitable for a read-only index or when no other process will ever modify the index.
native: Lucene's NativeFSLockFactory, which uses the operating system's native file locking.
simple: Lucene's SimpleFSLockFactory, which locks by creating a plain write.lock file on disk.
Default: from Solr 3.6 onward the default is native; in older versions it is simple. In other words, which lock implementation you get by default depends on the Solr version you are running.
<unlockOnStartup>false</unlockOnStartup>
If set to true, any index locks held by IndexWriter or by a commit will be released when Solr starts up. This circumvents Lucene's locking mechanism, so use it with caution. If lockType is set to single, this setting has no effect either way.
<deletionPolicy class="solr.SolrDeletionPolicy">
Configures the index deletion policy; the default is Solr's SolrDeletionPolicy. To define a custom deletion policy, implement Lucene's org.apache.lucene.index.IndexDeletionPolicy interface.
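As a hedged sketch, the policy can be tuned with parameters like the following (these names come from the commented example in the stock solrconfig.xml, not from the file above):
<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- how many commit points to keep around -->
  <str name="maxCommitsToKeep">1</str>
  <!-- how many optimized commit points to keep -->
  <str name="maxOptimizedCommitsToKeep">0</str>
  <!-- optionally expire commit points older than a given age -->
  <!-- <str name="maxCommitAge">30MINUTES</str> -->
</deletionPolicy>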
<jmx />
This enables JMX in Solr (when an existing MBeanServer is found). For details, see the official Solr wiki:
http://wiki.apache.org/solr/SolrJmx
<updateHandler class="solr.DirectUpdateHandler2">
Specifies the class that handles index update operations. DirectUpdateHandler2 is a high-performance update handler that supports soft commits.
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
<updateLog> specifies where the update handler above writes its transaction log. The default is Solr's data directory, i.e. the directory configured by dataDir.
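Because the transaction log grows until a hard commit happens, it is usually paired with autoCommit. A sketch of a common combination (the 15-second and 30-second values are illustrative, not prescribed by this file):
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>
<!-- hard commit: flush changes to stable storage and keep the transaction log from growing,
     without opening a new searcher -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<!-- soft commit: make recent changes visible to searchers without syncing them to disk -->
<autoSoftCommit>
  <maxTime>30000</maxTime>
</autoSoftCommit>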
The <query> element contains the query-time settings.
<maxBooleanClauses>1024</maxBooleanClauses>
The maximum number of clauses a BooleanQuery may contain. Note that this is effectively a global Lucene setting: if different cores' solrconfig.xml files disagree on the value, the one from the last core to be initialized wins.
<filterCache class="solr.FastLRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
Configures the cache used for filters (filter query results).
<queryResultCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
Configures the cache for query result sets, i.e. the ordered lists of document ids (roughly what Lucene returns as TopDocs) produced for a query.
<documentCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
Configures the cache for Documents' stored fields. Loading stored field values from disk is expensive, so this caches them; the stored fields here are those indexed with Store.YES.
<fieldValueCache class="solr.FastLRUCache"
size="512"
autowarmCount="128"
showItems="32" />
This cache holds field values so that they can be looked up quickly by internal document id (it is used mainly for faceting). It is created by default even if you do not configure it explicitly.
<cache name="myUserCache"
class="solr.LRUCache"
size="4096"
initialSize="1024"
autowarmCount="1024"
regenerator="com.mycompany.MyRegenerator"
/>
This is how you declare your own custom cache; a custom regenerator must implement Solr's CacheRegenerator interface.
<enableLazyFieldLoading>true</enableLazyFieldLoading>
Enables lazy loading of stored fields, i.e. stored fields that the query did not explicitly ask to return are only loaded when actually needed.
<useFilterForSortedQuery>true</useFilterForSortedQuery>
Controls whether a filter is used in place of the query when the requested sort does not involve the score.
<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<!--
<lst><str name="q">solr</str><str name="sort">price asc</str></lst>
<lst><str name="q">rocks</str><str name="sort">weight asc</str></lst>
-->
</arr>
</listener>
QuerySenderListener listens for the newSearcher event and runs each of the configured queries against the newly opened searcher to warm it; in the commented-out example above, each warming query specifies a q keyword and a sort.
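A sketch of what enabling such warming might look like (the query strings are illustrative; the firstSearcher variant, which the stock config also contains, plays the same role for the very first searcher after startup):
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- run these searches against every newly opened searcher to pre-fill its caches -->
    <lst><str name="q">solr</str><str name="sort">price asc</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- warm the very first searcher opened after core startup -->
    <lst><str name="q">static firstSearcher warming in solrconfig.xml</str></lst>
  </arr>
</listener>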
<requestDispatcher handleSelect="false" >
Setting handleSelect="false" tells the SolrDispatchFilter not to perform the legacy qt-based dispatch for /select requests (e.g. http://localhost:8080/solr/coreName/select?qt=xxx); such a request returns a 404 unless a handler is explicitly registered under the name "/select", as this file does.
The qt-style /select dispatch exists only for backwards compatibility with older versions and is no longer recommended.
<httpCaching never304="true" />
Tells the Solr server never to return an HTTP 304 and not to emit any HTTP caching headers. What does status 304 mean? It is the server telling the client that the requested resource has not changed, so the client's cached copy can be reused. With never304="true", the response is regenerated every time regardless of whether the underlying resource changed, bypassing HTTP caching entirely. This is standard HTTP-protocol behavior; look up the HTTP specification if it is unfamiliar.
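If you do want Solr to emit caching headers instead, the stock solrconfig.xml contains a commented alternative along these lines (lastModifiedFrom, etagSeed, and cacheControl come from that example; the max-age value is illustrative):
<httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
  <!-- tell proxies and clients they may cache responses for 30 seconds -->
  <cacheControl>max-age=30, public</cacheControl>
</httpCaching>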
<requestHandler name="/query" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
</lst>
</requestHandler>
This requestHandler maps the request URL /query to the handler class SearchHandler: when you access http://localhost:8080/solr/coreName/query?q=xxx, the request is handled by SearchHandler. The parameters under defaults influence how it behaves: echoParams controls whether the request parameters are echoed back in the response; wt (writer type) selects the format/MIME type of the response data, e.g. json or xml; indent controls whether the returned JSON or XML is indented, and without it the response comes back with no indentation or line breaks, which is hard to read.
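Besides defaults, a handler may also declare appends and invariants parameter lists; a hedged sketch of how the three differ (the fq value here is my own illustration):
<requestHandler name="/query" class="solr.SearchHandler">
  <!-- defaults: used only when the request does not supply the parameter -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
  </lst>
  <!-- appends: always added on top of whatever the client sends -->
  <lst name="appends">
    <str name="fq">inStock:true</str>
  </lst>
  <!-- invariants: always win; the client cannot override them -->
  <lst name="invariants">
    <str name="wt">json</str>
  </lst>
</requestHandler>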
I will skip the remaining requestHandler entries; they are all much the same, just a mapping from a request URL to a handler class, similar to how a URL maps to a Controller in Spring MVC.
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
Declares a search component, here the SpellCheckComponent for spell checking; the detailed spell-check configuration is left for a later post on SpellCheck.
<searchComponent name="terms" class="solr.TermsComponent"/>
Returns terms from the index along with the document frequency of each term.
<searchComponent class="solr.HighlightComponent" name="highlight">
Configures keyword highlighting. I will skip the detailed highlighting options for now; in this post we only need a rough idea of what each configuration item means, and we can dig into the usage later.
I will not go through the other searchComponent entries one by one, as there are too many. Read the English comments in the file, and ask if anything is genuinely unclear.
<queryResponseWriter name="json" class="solr.JSONResponseWriter">
<!-- For the purposes of the tutorial, JSON responses are written as
plain text so that they are easy to read in *any* browser.
If you expect a MIME type of "application/json" just remove this override.
-->
<str name="content-type">text/plain; charset=UTF-8</str>
</queryResponseWriter>
This configures a response writer, the class that converts Solr's response data into a particular format. JSONResponseWriter renders the HTTP response as JSON; content-type here sets the Content-Type header of the response, telling the client that the MIME type is text/plain with UTF-8 character encoding.
Other response writers ship with Solr as well, such as velocity and xslt. If you want a custom one, say based on FreeMarker, implement Solr's QueryResponseWriter interface (using the existing implementations as a template) and register it with a similar <queryResponseWriter> entry in solrconfig.xml.
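A sketch of such registrations (the velocity and xslt entries mirror the stock example configuration; the FreeMarker class name is purely hypothetical):
<!-- lazily load the Velocity-based writer shipped in the contrib directory -->
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" startup="lazy"/>
<!-- XSLT writer; xsltCacheLifetimeSeconds controls how long compiled stylesheets are cached -->
<queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
  <int name="xsltCacheLifetimeSeconds">5</int>
</queryResponseWriter>
<!-- hypothetical custom writer implementing org.apache.solr.response.QueryResponseWriter -->
<queryResponseWriter name="freemarker" class="com.example.FreemarkerResponseWriter" startup="lazy"/>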
One last thing worth explaining: solrconfig.xml is full of generic tags such as <arr>, <lst>, <str>, and <int>. Here is a unified summary, adapted from a figure in the book Solr in Action:
arr: short for array; represents an array of values, and name is the parameter name of that array.
lst: short for list, but note that it holds named key-value pairs.
bool: a boolean value; name is the variable name.
Likewise there are int, long, float, str, and so on.
str: short for string. The one subtlety is that <str> children of an <arr> carry no name attribute, while <str> children of an <lst> do.
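A small illustration, in the spirit of the handler declarations earlier in this file:
<lst name="defaults">
  <!-- inside an lst, every child is a named key-value pair -->
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <bool name="terms">true</bool>
</lst>
<arr name="components">
  <!-- inside an arr, children are unnamed values -->
  <str>query</str>
  <str>terms</str>
</arr>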
To wrap up, the configuration in solrconfig.xml falls into the following main groups:
1. The Lucene version to match (luceneMatchVersion), which determines the structure of the index you create; Lucene index formats are not fully compatible across versions, so pay attention to this.
2. Index-creation settings: the index directory and the IndexWriterConfig-related options, which determine indexing performance.
3. The paths from which external JARs referenced by solrconfig.xml are loaded.
4. JMX-related settings.
5. Cache settings: the filter cache, query result cache, document cache, custom caches, and so on.
6. The updateHandler configuration, i.e. everything related to index updates.
7. RequestHandler configuration, i.e. the classes that handle incoming client HTTP requests.
8. Search component configuration, such as highlighting and the spell checker.
9. ResponseWriter configuration, i.e. how response data is converted and in what format it is returned to the client.
10. Custom ValueSourceParser configuration, used to influence document weighting, scoring, and sorting.
That is it for solrconfig.xml. Understanding these options clears the way for the rest of our Solr study. Anything I skipped or deliberately glossed over is left for you to read up on yourselves; the file is well over a thousand lines, explaining every single one would take too long, and much of it is similar enough that you should be able to work it out.
Source: http://iamyida.iteye.com/blog/2211728