12.1. Data source configuration options
12.1.1. type
Data source type.
Mandatory, no default value.
Known types are mysql, pgsql, mssql, xmlpipe2, tsvpipe, and odbc.
All other per-source options depend on the source type selected by this option.
Names of the options used for SQL sources (ie. MySQL, PostgreSQL, MS SQL) start with "sql_";
names of the ones used for xmlpipe2 or tsvpipe start with "xmlpipe_" and "tsvpipe_" respectively.
All source types are conditional; they might or might
not be supported depending on your build settings, installed client libraries, etc.
The mssql type is currently only available on Windows.
The odbc type is available both natively on Windows and on Linux through the UnixODBC library.
Example:
type = mysql
12.1.2. sql_host
SQL server host to connect to.
Mandatory, no default value.
Applies to SQL source types (mysql, pgsql, mssql) only.
In the simplest case, when Sphinx resides on the same host as your MySQL
or PostgreSQL installation, you would simply specify "localhost". Note that
the MySQL client library chooses whether to connect over TCP/IP or over a UNIX
socket based on the host name. Specifically, "localhost" will force it
to use a UNIX socket (this is the default and generally recommended mode)
and "127.0.0.1" will force TCP/IP usage. Refer to the MySQL manual
for more details.
Example:
sql_host = localhost
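For instance, a minimal sketch of forcing a TCP/IP connection instead of a UNIX socket (the host and port here are illustrative):
sql_host = 127.0.0.1
sql_port = 3306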
12.1.3. sql_port
SQL server IP port to connect to.
Optional, default is 3306 for the mysql source type and 5432 for pgsql.
Applies to SQL source types (mysql, pgsql, mssql) only.
Note that whether this value is actually used depends on the sql_host setting.
Example:
sql_port = 3306
12.1.4. sql_user
SQL user to use when connecting to sql_host.
Mandatory, no default value.
Applies to SQL source types (mysql, pgsql, mssql) only.
Example:
sql_user = test
12.1.5. sql_pass
SQL user password to use when connecting to sql_host.
Mandatory, no default value.
Applies to SQL source types (mysql, pgsql, mssql) only.
Example:
sql_pass = mysecretpassword
12.1.6. sql_db
SQL database (in MySQL terms) to use after connecting, and to perform further queries within.
Mandatory, no default value.
Applies to SQL source types (mysql, pgsql, mssql) only.
Example:
sql_db = test
12.1.7. sql_sock
UNIX socket name to connect to for local SQL servers.
Optional, default value is empty (use client library default settings).
Applies to SQL source types (mysql, pgsql, mssql) only.
On Linux, it would typically be /var/lib/mysql/mysql.sock.
On FreeBSD, it would typically be /tmp/mysql.sock.
Note that whether this value is actually used depends on the sql_host setting.
Example:
sql_sock = /tmp/mysql.sock
12.1.8. mysql_connect_flags
MySQL client connection flags.
Optional, default value is 0 (do not set any flags).
Applies to mysql source type only.
This option must contain an integer value with the sum of the flags.
The value will be passed to mysql_real_connect() verbatim.
The flags are enumerated in mysql_com.h include file.
Flags that are especially interesting in regard to indexing, with their respective values, are as follows:
CLIENT_COMPRESS = 32; can use compression protocol
CLIENT_SSL = 2048; switch to SSL after handshake
CLIENT_SECURE_CONNECTION = 32768; new 4.1 authentication
For instance, you can specify 2080 (2048+32) to use both compression and SSL,
or 32768 to use new authentication only. Initially, this option was introduced
to be able to use compression when the indexer and mysqld
are on different hosts. Compression on 1 Gbps links is most likely to hurt
indexing time even though it reduces network traffic, both in theory
and in practice. However, enabling compression on 100 Mbps links
may improve indexing time significantly (improvements of up to 20-30% of
the total indexing time have been reported). Your mileage may vary.
Example:
mysql_connect_flags = 32 # enable compression
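As a worked example of the flag arithmetic above, enabling both SSL and compression means adding the two flag values together (2048 + 32 = 2080):
mysql_connect_flags = 2080 # CLIENT_SSL (2048) + CLIENT_COMPRESS (32)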
12.1.9. mysql_ssl_cert, mysql_ssl_key, mysql_ssl_ca
SSL certificate settings to use for connecting to MySQL server.
Optional, default values are empty strings (do not use SSL).
Applies to mysql source type only.
These directives let you set up a secure SSL connection between
indexer and MySQL. The details on creating the certificates and
setting up the MySQL server can be found in the MySQL documentation.
Example:
mysql_ssl_cert = /etc/ssl/client-cert.pem
mysql_ssl_key = /etc/ssl/client-key.pem
mysql_ssl_ca = /etc/ssl/cacert.pem
12.1.10. odbc_dsn
ODBC DSN to connect to.
Mandatory, no default value.
Applies to odbc source type only.
ODBC DSN (Data Source Name) specifies the credentials (host, user, password, etc.)
to use when connecting to the ODBC data source. The format depends on the specific
ODBC driver used.
Example:
odbc_dsn = Driver={Oracle ODBC Driver};Dbq=myDBName;Uid=myUsername;Pwd=myPassword
12.1.11. sql_query_pre
Pre-fetch query, or pre-query.
Multi-value, optional, default is empty list of queries.
Applies to SQL source types (mysql, pgsql, mssql) only.
Multi-value means that you can specify several pre-queries.
They are executed before the main fetch query,
and they will be executed exactly in order of appearance in the configuration file.
Pre-query results are ignored.
Pre-queries are useful in a lot of ways. They are used to set up encoding,
mark records that are going to be indexed, update internal counters,
set various per-connection SQL server options and variables, and so on.
Perhaps the most frequent pre-query usage is to specify the encoding
that the server will use for the rows it returns. Note that Sphinx accepts
only UTF-8 texts.
Two MySQL specific examples of setting the encoding are:
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET NAMES utf8
Also specific to MySQL sources, it is useful to disable query cache
(for indexer connection only) in pre-query, because indexing queries
are not going to be re-run frequently anyway, and there's no sense
in caching their results. That could be achieved with:
sql_query_pre = SET SESSION query_cache_type=OFF
Example:
sql_query_pre = SET NAMES utf8
sql_query_pre = SET SESSION query_cache_type=OFF
12.1.12. sql_query
Main document fetch query.
Mandatory, no default value.
Applies to SQL source types (mysql, pgsql, mssql) only.
There can be only one main query.
This is the query which is used to retrieve documents from SQL server.
You can specify up to 32 full-text fields (formally, up to SPH_MAX_FIELDS from sphinx.h), and an arbitrary amount of attributes.
All of the columns that are neither document ID (the first one) nor attributes will be full-text indexed.
Document ID MUST be the very first field,
and it MUST BE UNIQUE UNSIGNED POSITIVE (NON-ZERO, NON-NEGATIVE) INTEGER NUMBER.
It can be either 32-bit or 64-bit, depending on how you built Sphinx;
by default it builds with 32-bit IDs support, but the --enable-id64
option to configure allows building with 64-bit document and word IDs support.
Example:
sql_query = \
SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, \
title, content \
FROM documents
12.1.13. sql_joined_field
Joined/payload field fetch query.
Multi-value, optional, default is empty list of queries.
Applies to SQL source types (mysql, pgsql, mssql) only.
sql_joined_field lets you use two different features:
joined fields, and payloads (payload fields). Its syntax is as follows:
sql_joined_field = FIELD-NAME 'from' ( 'query' | 'payload-query' \
| 'ranged-query' ); QUERY [ ; RANGE-QUERY ]
where
FIELD-NAME is a joined/payload field name;
QUERY is an SQL query that must fetch values to index.
RANGE-QUERY is an optional SQL query that fetches a range
of values to index. (Added in version 2.0.1-beta.)
Joined fields let you avoid JOIN and/or GROUP_CONCAT statements in the main
document fetch query (sql_query). This can be useful when SQL-side JOIN is slow,
or needs to be offloaded on Sphinx side, or simply to emulate MySQL-specific
GROUP_CONCAT functionality in case your database server does not support it.
The query must return exactly 2 columns: document ID, and text to append
to a joined field. Document IDs can be duplicate, but they must be
in ascending order. All the text rows fetched for a given ID will be
concatenated together, and the concatenation result will be indexed
as the entire contents of a joined field. Rows will be concatenated
in the order returned from the query, and separating whitespace
will be inserted between them. For instance, if a joined field query
returns the following rows:
( 1, 'red' )
( 1, 'right' )
( 1, 'hand' )
( 2, 'mysql' )
( 2, 'sphinx' )
then the indexing results would be equivalent to that of adding
a new text field with a value of 'red right hand' to document 1 and
'mysql sphinx' to document 2.
Joined fields are only indexed differently. There are no other differences
between joined fields and regular text fields.
Starting with 2.0.1-beta, ranged queries can be used when
a single query is not efficient enough or does not work because of
database driver limitations. It works similarly to the ranged
queries in the main indexing loop, see Section 3.8, “Ranged queries”.
The range will be queried for and fetched upfront once,
then multiple queries with different $start and $end
substitutions will be run to fetch the actual data.
Payloads let you create a special field in which, instead of
keyword positions, so-called user payloads are stored. Payloads are
custom integer values attached to every keyword. They can then be used
at search time to affect the ranking.
The payload query must return exactly 3 columns: document ID; keyword;
and integer payload value. Document IDs can be duplicate, but they must be
in ascending order. Payloads must be unsigned integers within 24-bit range,
ie. from 0 to 16777215. For reference, payloads are currently internally
stored as in-field keyword positions, but that is not guaranteed
and might change in the future.
Currently, the only method to account for payloads is to use
SPH_RANK_PROXIMITY_BM25 ranker. On indexes with payload fields,
it will automatically switch to a variant that matches keywords
in those fields, computes a sum of matched payloads multiplied
by field weights, and adds that sum to the final rank.
Example:
sql_joined_field = \
tagstext from query; \
SELECT docid, CONCAT('tag',tagid) FROM tags ORDER BY docid ASC
sql_joined_field = bigint tag from ranged-query; \
SELECT id, tag FROM tags WHERE id>=$start AND id<=$end ORDER BY id ASC; \
SELECT MIN(id), MAX(id) FROM tags
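The grammar above also allows a payload-query variant; a minimal hedged sketch, with table and column names purely illustrative, returning the required (document ID, keyword, payload) triples in ascending document ID order:
sql_joined_field = clicks from payload-query; \
    SELECT doc_id, keyword, click_weight FROM keyword_clicks ORDER BY doc_id ASC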
12.1.14. sql_query_range
Range query setup.
Optional, default is empty.
Applies to SQL source types (mysql, pgsql, mssql) only.
Setting this option enables ranged document fetch queries (see Section 3.8, “Ranged queries”).
Ranged queries are useful to avoid notorious MyISAM table locks when indexing
lots of data. (They also help with other less notorious issues, such as reduced
performance caused by big result sets, or additional resources consumed by InnoDB
to serialize big read transactions.)
The query specified in this option must fetch min and max document IDs that will be
used as range boundaries. It must return exactly two integer fields, min ID first
and max ID second; the field names are ignored.
When ranged queries are enabled, sql_query will be required
to contain $start and $end macros
(because it obviously would be a mistake to index the whole table many times over).
Note that the intervals specified by $start..$end
will not overlap, so you should not remove document IDs that are
exactly equal to $start or $end from your query.
The example in Section 3.8, “Ranged queries” illustrates that; note how it
uses greater-or-equal and less-or-equal comparisons.
Example:
sql_query_range = SELECT MIN(id),MAX(id) FROM documents
12.1.15. sql_range_step
Range query step.
Optional, default is 1024.
Applies to SQL source types (mysql, pgsql, mssql) only.
Only used when ranged queries are enabled.
The full document IDs interval fetched by sql_query_range
will be walked in steps of this size. For example, if the min and max IDs
fetched are 12 and 3456 respectively, and the step is 1000, indexer will call
sql_query several times with the following substitutions:
$start=12, $end=1011
$start=1012, $end=2011
$start=2012, $end=3011
$start=3012, $end=3456
Example:
sql_range_step = 1000
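For context, a complete ranged-fetch setup ties sql_query_range, sql_range_step, and the $start/$end macros in sql_query together; a minimal sketch (table and column names illustrative):
sql_query_range = SELECT MIN(id), MAX(id) FROM documents
sql_range_step  = 1000
sql_query       = SELECT id, title, content FROM documents \
    WHERE id >= $start AND id <= $end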
12.1.16. sql_query_killlist
Kill-list query.
Optional, default is empty (no query).
Applies to SQL source types (mysql, pgsql, mssql) only.
Introduced in version 0.9.9-rc1.
This query is expected to return a number of 1-column rows, each containing
just the document ID. The returned document IDs are stored within an index.
Kill-list for a given index suppresses results from other
indexes, depending on index order in the query. The intended use is to help
implement deletions and updates on existing indexes without rebuilding
(or actually even touching) them, and especially to fight the phantom
results problem.
Let us dissect an example. Assume we have two indexes, 'main' and 'delta'.
Assume that documents 2, 3, and 5 were deleted since last reindex of 'main',
and documents 7 and 11 were updated (ie. their text contents were changed).
Assume that a keyword 'test' occurred in all these mentioned documents
when we were indexing 'main'; still occurs in document 7 as we index 'delta';
but does not occur in document 11 any more. We now reindex delta and then
search through both these indexes in proper (least to most recent) order:
$res = $cl->Query ( "test", "main delta" );
First, we need to properly handle deletions. The result set should not
contain documents 2, 3, or 5. Second, we also need to avoid phantom results.
Unless we do something about it, document 11 will
appear in search results! It will be found in 'main' (but not 'delta').
And it will make it to the final result set unless something stops it.
Kill-list, or K-list for short, is that something. Kill-list attached
to 'delta' will suppress the specified rows from all the preceding
indexes, in this case just 'main'. So to get the expected results,
we should put all the updated and deleted
document IDs into it.
Note that in the distributed index setup, K-lists are local
to every node in the cluster. They do not get transmitted
over the network when sending queries. (Because that might be too much
of an impact when the K-list is huge.) You will need to set up
separate per-server K-lists in that case.
Example:
sql_query_killlist = \
SELECT id FROM documents WHERE updated_ts>=@last_reindex UNION \
SELECT id FROM documents_deleted WHERE deleted_ts>=@last_reindex
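To connect this with the 'main'/'delta' scenario above, the kill-list typically lives in the delta source; a hedged sketch using Sphinx's source inheritance, with the @last_reindex variable and table names as in the example above:
source delta : main
{
    sql_query          = SELECT id, group_id, title, content FROM documents \
        WHERE updated_ts >= @last_reindex
    sql_query_killlist = SELECT id FROM documents WHERE updated_ts >= @last_reindex UNION \
        SELECT id FROM documents_deleted WHERE deleted_ts >= @last_reindex
}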
12.1.17. sql_attr_uint
Unsigned integer attribute declaration.
Multi-value (there might be multiple attributes declared), optional.
Applies to SQL source types (mysql, pgsql, mssql) only.
The column value should fit into 32-bit unsigned integer range.
Values outside this range will be accepted but wrapped around.
For instance, -1 will be wrapped around to 2^32-1 or 4,294,967,295.
You can specify bit count for integer attributes by appending
':BITCOUNT' to attribute name (see example below). Attributes with
less than default 32-bit size, or bitfields, perform slower.
But they require less RAM when using extern storage:
such bitfields are packed together in 32-bit chunks in .spa
attribute data file. Bit size settings are ignored if using
inline storage.
Example:
sql_attr_uint = group_id
sql_attr_uint = forum_id:9 # 9 bits for forum_id
12.1.18. sql_attr_bool
Boolean attribute declaration.
Multi-value (there might be multiple attributes declared), optional.
Applies to SQL source types (mysql, pgsql, mssql) only.
Equivalent to sql_attr_uint declaration with a bit count of 1.
Example:
sql_attr_bool = is_deleted # will be packed to 1 bit
12.1.19. sql_attr_bigint
64-bit signed integer attribute declaration.
Multi-value (there might be multiple attributes declared), optional.
Applies to SQL source types (mysql, pgsql, mssql) only.
Note that unlike sql_attr_uint, these values are signed.
Introduced in version 0.9.9-rc1.
Example:
sql_attr_bigint = my_bigint_id
12.1.20. sql_attr_timestamp
UNIX timestamp attribute declaration.
Multi-value (there might be multiple attributes declared), optional.
Applies to SQL source types (mysql, pgsql, mssql) only.
Timestamps can store date and time in the range of Jan 01, 1970
to Jan 19, 2038 with a precision of one second.
The expected column value should be a timestamp in UNIX format, ie. 32-bit unsigned
integer number of seconds elapsed since midnight, January 01, 1970, GMT.
Timestamps are internally stored and handled as integers everywhere.
But in addition to working with timestamps as integers, it's also legal
to use them along with different date-based functions, such as time segments
sorting mode, or day/week/month/year extraction for GROUP BY.
Note that DATE or DATETIME column types in MySQL can not be directly
used as timestamp attributes in Sphinx; you need to explicitly convert such
columns using the UNIX_TIMESTAMP() function (if the data is in range).
Note that timestamps can not represent dates before January 01, 1970,
and UNIX_TIMESTAMP() in MySQL will not return anything expected for such dates.
If you only need to work with dates, not times, consider the TO_DAYS()
function in MySQL instead.
Example:
# sql_query = ... UNIX_TIMESTAMP(added_datetime) AS added_ts ...
sql_attr_timestamp = added_ts
12.1.21. sql_attr_float
Floating point attribute declaration.
Multi-value (there might be multiple attributes declared), optional.
Applies to SQL source types (mysql, pgsql, mssql) only.
The values will be stored in single precision, 32-bit IEEE 754 format.
Represented range is approximately from 1e-38 to 1e+38. The amount
of decimal digits that can be stored precisely is approximately 7.
One important usage of the float attributes is storing latitude
and longitude values (in radians), for further usage in query-time
geosphere distance calculations.
Example:
sql_attr_float = lat_radians
sql_attr_float = long_radians
12.1.22. sql_attr_multi
Multi-valued attribute (MVA) declaration.
Multi-value (ie. there may be more than one such attribute declared), optional.
Applies to SQL source types (mysql, pgsql, mssql) only.
Plain attributes only allow attaching a single value per document.
However, there are cases (such as tags or categories) when it is
desired to attach multiple values of the same attribute and be able
to apply filtering or grouping to value lists.
The declaration format is as follows (backslashes are for clarity only;
everything can be declared in a single line as well):
sql_attr_multi = ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE \
[;QUERY] \
[;RANGE-QUERY]
where
ATTR-TYPE is 'uint', 'bigint' or 'timestamp'
SOURCE-TYPE is 'field', 'query', or 'ranged-query'
QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs
RANGE-QUERY is SQL query used to fetch min and max ID values, similar to 'sql_query_range'
Example:
sql_attr_multi = uint tag from query; SELECT id, tag FROM tags
sql_attr_multi = bigint tag from ranged-query; \
SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \
SELECT MIN(id), MAX(id) FROM tags
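The 'field' source type from the grammar above takes the values from a column returned by the main fetch query rather than from a separate query; a hedged sketch, assuming sql_query returns a column named tags holding the integer values:
sql_attr_multi = uint tags from field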
12.1.23. sql_attr_string
String attribute declaration.
Multi-value (ie. there may be more than one such attribute declared), optional.
Applies to SQL source types (mysql, pgsql, mssql) only.
Introduced in version 1.10-beta.
String attributes can store arbitrary strings attached to every document.
There's a fixed size limit of 4 MB per value. Also, searchd
will currently cache all the values in RAM, which is an additional implicit limit.
Starting from 2.0.1-beta string attributes can be used for sorting and
grouping (ORDER BY, GROUP BY, WITHIN GROUP ORDER BY). Note that attributes
declared using sql_attr_string will not be full-text indexed;
you can use the sql_field_string directive for that.
Example:
sql_attr_string = title # will be stored but will not be indexed
12.1.24. sql_attr_json
JSON attribute declaration.
Multi-value (ie. there may be more than one such attribute declared), optional.
Applies to SQL source types (mysql, pgsql, mssql) only.
Introduced in version 2.1.1-beta.
When indexing JSON attributes, Sphinx expects a text field
with JSON formatted data. As of 2.2.1-beta, JSON attributes support arbitrary
JSON data with no limitation on nesting levels or types, for example:
{
    "id": 1,
    "gid": 2,
    "title": "some title",
    "tags": [
        "tag1",
        "tag2",
        "tag3",
        {
            "one": "two",
            "three": [4, 5]
        }
    ]
}
These attributes allow Sphinx to work with documents without a fixed set of
attribute columns. When you filter on a key of a JSON attribute, documents
that don't include the key will simply be ignored.
You can read more on JSON attributes in
http://sphinxsearch.com/blog/2013/08/08/full-json-support-in-trunk/.
Example:
sql_attr_json = properties
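For reference, the JSON text itself comes from a column returned by the main fetch query; a minimal hedged pairing (column and table names illustrative):
sql_query     = SELECT id, title, properties FROM documents
sql_attr_json = properties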
12.1.25. sql_column_buffers
Per-column buffer sizes.
Optional, default is empty (deduce the sizes automatically).
Applies to odbc and mssql source types only.
Introduced in version 2.0.1-beta.
ODBC and MS SQL drivers sometimes can not return the maximum
actual column size to be expected. For instance, NVARCHAR(MAX) columns
always report their length as 2147483647 bytes to indexer
even though the actually used length is likely considerably less.
However, the receiving buffers still need to be allocated upfront,
and their sizes have to be determined.
When the driver does not report the column length at all, Sphinx
allocates default 1 KB buffers for each non-char column, and 1 MB
buffers for each char column. Driver-reported column length
also gets clamped by an upper limit of 8 MB, so in case the
driver reports (almost) a 2 GB column length, it will be clamped
and an 8 MB buffer will be allocated instead for that column.
These hard-coded limits can be overridden using the
sql_column_buffers directive, either in order
to save memory on actually shorter columns, or to overcome
the 8 MB limit on actually longer columns. The directive value
must be a comma-separated list of selected column names and sizes:
sql_column_buffers = <colname>=<size>[K|M] [, ...]
Example:
sql_query = SELECT id, mytitle, mycontent FROM documents
sql_column_buffers = mytitle=64K, mycontent=10M
12.1.26. sql_field_string
Combined string attribute and full-text field declaration.
Multi-value (ie. there may be more than one such attribute declared), optional.
Applies to SQL source types (mysql, pgsql, mssql) only.
Introduced in version 1.10-beta.
sql_attr_string only stores the column value but does not full-text index it.
In some cases it might be desired to both full-text index the column and store it
as an attribute. sql_field_string lets you do exactly that. Both the field
and the attribute will be named the same.
Example:
sql_field_string = title # will be both indexed and stored
12.1.27. sql_file_field
File based field declaration.
Applies to SQL source types (mysql, pgsql, mssql) only.
Introduced in version 1.10-beta.
This directive makes indexer interpret the field contents
as a file name, and load and index the referred file. Files larger than
max_file_field_buffer in size are skipped. Any errors during file loading
(IO errors, exceeded limits, etc) will be reported as indexing warnings
and will not terminate indexing early. No content will be indexed
for such files.
Example:
sql_file_field = my_file_path # load and index files referred to by my_file_path
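The path column itself is just another column returned by sql_query; for instance (names illustrative):
sql_query      = SELECT id, title, my_file_path FROM documents
sql_file_field = my_file_path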
12.1.28. sql_query_post
Post-fetch query.
Optional, default value is empty.
Applies to SQL source types (mysql, pgsql, mssql) only.
This query is executed immediately after sql_query completes successfully.
When the post-fetch query produces errors, they are reported as warnings,
but indexing is not terminated. Its result set is ignored. Note that indexing
is not yet completed at the point when this query gets executed, and further
indexing still may fail. Therefore, any permanent updates should not be done
from here. For instance, updates on a helper table that permanently change
the last successfully indexed ID should not be run from the post-fetch query;
they should be run from the post-index query instead.
Example:
sql_query_post = DROP TABLE my_tmp_table
12.1.29. sql_query_post_index
Post-index query.
Optional, default value is empty.
Applies to SQL source types (mysql, pgsql, mssql) only.
This query is executed when indexing is fully and successfully completed.
If this query produces errors, they are reported as warnings,
but indexing is not terminated. Its result set is ignored.
The $maxid macro can be used in its text; it will be
expanded to the maximum document ID which was actually fetched
from the database during indexing. If no documents were indexed,
$maxid will be expanded to 0.
Example:
sql_query_post_index = REPLACE INTO counters ( id, val ) \
VALUES ( 'max_indexed_id', $maxid )
12.1.30. sql_ranged_throttle
Ranged query throttling period, in milliseconds.
Optional, default is 0 (no throttling).
Applies to SQL source types (mysql, pgsql, mssql) only.
Throttling can be useful when indexer imposes too much load on the
database server. It causes the indexer to sleep for the given amount of
milliseconds once per each ranged query step. This sleep is unconditional,
and is performed before the fetch query.
Example:
sql_ranged_throttle = 1000 # sleep for 1 sec before each query step
12.1.31. xmlpipe_command
Shell command that invokes xmlpipe2 stream producer.
Mandatory.
Applies to xmlpipe2 source type only.
Specifies a command that will be executed and whose output
will be parsed for documents. Refer to Section 3.9, “xmlpipe2 data source” for specific format description.
Example:
xmlpipe_command = cat /home/sphinx/test.xml
12.1.32. xmlpipe_field
xmlpipe field declaration.
Multi-value, optional.
Applies to xmlpipe2 source type only. Refer to Section 3.9, “xmlpipe2 data source”.
Example:
xmlpipe_field = subject
xmlpipe_field = content
12.1.33. xmlpipe_field_string
xmlpipe field and string attribute declaration.
Multi-value, optional.
Applies to xmlpipe2 source type only. Refer to Section 3.9, “xmlpipe2 data source”.
Introduced in version 1.10-beta.
Makes the specified XML element indexed as both a full-text field and a string attribute.
Equivalent to <sphinx:field name="field" attr="string"/> declaration within the XML file.
Example:
xmlpipe_field_string = subject
12.1.34. xmlpipe_attr_uint
xmlpipe integer attribute declaration.
Multi-value, optional.
Applies to xmlpipe2 source type only.
Syntax fully matches that of sql_attr_uint.
Example:
xmlpipe_attr_uint = author_id
12.1.35. xmlpipe_attr_bigint
xmlpipe signed 64-bit integer attribute declaration.
Multi-value, optional.
Applies to xmlpipe2 source type only.
Syntax fully matches that of sql_attr_bigint.
Example:
xmlpipe_attr_bigint = my_bigint_id
12.1.36. xmlpipe_attr_bool
xmlpipe boolean attribute declaration.
Multi-value, optional.
Applies to xmlpipe2 source type only.
Syntax fully matches that of sql_attr_bool.
Example:
xmlpipe_attr_bool = is_deleted # will be packed to 1 bit
12.1.37. xmlpipe_attr_timestamp
xmlpipe UNIX timestamp attribute declaration.
Multi-value, optional.
Applies to xmlpipe2 source type only.
Syntax fully matches that of sql_attr_timestamp.
Example:
xmlpipe_attr_timestamp = published
12.1.38. xmlpipe_attr_float
xmlpipe floating point attribute declaration.
Multi-value, optional.
Applies to xmlpipe2 source type only.
Syntax fully matches that of sql_attr_float.
Example:
xmlpipe_attr_float = lat_radians
xmlpipe_attr_float = long_radians
12.1.39. xmlpipe_attr_multi
xmlpipe MVA attribute declaration.
Multi-value, optional.
Applies to xmlpipe2 source type only.
This setting declares an MVA attribute tag in xmlpipe2 stream.
The contents of the specified tag will be parsed and a list of integers
that will constitute the MVA will be extracted, similar to how
sql_attr_multi parses
SQL column contents when 'field' MVA source type is specified.
Example:
xmlpipe_attr_multi = taglist
12.1.40. xmlpipe_attr_multi_64
xmlpipe MVA attribute declaration. Declares the BIGINT (signed 64-bit integer) MVA attribute.
Multi-value, optional.
Applies to xmlpipe2 source type only.
This setting declares an MVA attribute tag in xmlpipe2 stream.
The contents of the specified tag will be parsed and a list of integers
that will constitute the MVA will be extracted, similar to how
sql_attr_multi parses
SQL column contents when 'field' MVA source type is specified.
Example:
xmlpipe_attr_multi_64 = taglist
12.1.41. xmlpipe_attr_string
xmlpipe string declaration.
Multi-value, optional.
Applies to xmlpipe2 source type only.
Introduced in version 1.10-beta.
This setting declares a string attribute tag in xmlpipe2 stream.
The contents of the specified tag will be parsed and stored as a string value.
Example:
xmlpipe_attr_string = subject
12.1.42. xmlpipe_attr_json
JSON attribute declaration.
Multi-value (ie. there may be more than one such attribute declared), optional.
Introduced in version 2.1.1-beta.
This directive is used to declare that the contents of a given
XML tag are to be treated as a JSON document and stored into a Sphinx
index for later use. Refer to Section 12.1.24, “sql_attr_json”
for more details on the JSON attributes.
Example:
xmlpipe_attr_json = properties
12.1.43. xmlpipe_fixup_utf8
Perform Sphinx-side UTF-8 validation and filtering to prevent XML parser from choking on non-UTF-8 documents.
Optional, default is 0.
Applies to xmlpipe2 source type only.
On certain occasions it might be hard or even impossible to guarantee
that the incoming XMLpipe2 document bodies are in perfectly valid and
conforming UTF-8 encoding. For instance, documents with national
single-byte encodings could sneak into the stream. The libexpat XML parser
is fragile, meaning that it will stop processing in such cases.
The UTF-8 fixup feature lets you avoid that. When fixup is enabled,
Sphinx will preprocess the incoming stream before passing it to the
XML parser and replace invalid UTF-8 sequences with spaces.
Example:
xmlpipe_fixup_utf8 = 1
12.1.44. mssql_winauth
MS SQL Windows authentication flag.
Boolean, optional, default value is 0 (false).
Applies to mssql source type only.
Introduced in version 0.9.9-rc1.
Whether to use the currently logged in Windows account credentials for
authentication when connecting to MS SQL Server. Note that when running
searchd as a service, the account user can differ
from the account you used to install the service.
Example:
mssql_winauth = 1
12.1.45. unpack_zlib
Columns to unpack using zlib (aka deflate, aka gunzip).
Multi-value, optional, default value is empty list of columns.
Applies to SQL source types (mysql, pgsql, mssql) only.
Introduced in version 0.9.9-rc1.
Columns specified using this directive will be unpacked by indexer
using the standard zlib algorithm (called deflate and also implemented by gunzip).
When indexing on a different box than the database, this lets you offload the database, and save on network traffic.
The feature is only available if zlib and zlib-devel were both available during build time.
Example:
unpack_zlib = col1
unpack_zlib = col2
12.1.46. unpack_mysqlcompress
Columns to unpack using MySQL UNCOMPRESS() algorithm.
Multi-value, optional, default value is empty list of columns.
Applies to SQL source types (mysql, pgsql, mssql) only.
Introduced in version 0.9.9-rc1.
Columns specified using this directive will be unpacked by indexer
using modified zlib algorithm used by MySQL COMPRESS() and UNCOMPRESS() functions.
When indexing on a different box than the database, this lets you offload the database, and save on network traffic.
The feature is only available if zlib and zlib-devel were both available during build time.
Example:
unpack_mysqlcompress = body_compressed
unpack_mysqlcompress = description_compressed
12.1.47. unpack_mysqlcompress_maxsize
Buffer size for UNCOMPRESS()ed data.
Optional, default value is 16M.
Introduced in version 0.9.9-rc1.
When using unpack_mysqlcompress,
due to implementation intricacies it is not possible to deduce the required buffer size
from the compressed data. So the buffer must be preallocated in advance, and unpacked
data can not go over the buffer size. This option lets you control the buffer size,
both to limit indexer
memory use, and to enable unpacking
of really long data fields if necessary.
Example:
unpack_mysqlcompress_maxsize = 1M
12.2. Index configuration options
12.2.1. type
Index type.
Known values are 'plain', 'distributed', 'rt' and 'template'.
Optional, default is 'plain' (plain local index).
Sphinx supports several different types of indexes.
Versions 0.9.x supported two index types: plain local indexes
that are stored and processed on the local machine; and distributed indexes,
that involve not only local searching but querying remote searchd
instances over the network as well (see Section 5.8, “Distributed searching”).
Version 1.10-beta also adds support
for so-called real-time indexes (or RT indexes for short) that
are also stored and processed locally, but additionally allow
for on-the-fly updates of the full-text index (see Chapter 4, Real-time indexes).
Note that attributes can be updated on-the-fly using
either plain local indexes or RT ones.
In 2.2.1-beta, template indexes were introduced. They are actually
pseudo-indexes because they do not store any data. That means they do not create
any files on your hard drive. But you can use them for keywords and snippets
generation, which may be useful in some cases.
Index type setting lets you choose the needed type.
By default, plain local index type will be assumed.
Example:
type = distributed
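For context, a distributed index combines local and remote indexes via the local and agent directives (described later in this chapter); a minimal hedged sketch, with host, port, and index names purely illustrative:
index dist1
{
    type  = distributed
    local = chunk1
    agent = box2:9312:chunk2
}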
12.2.2. source
Adds document source to local index.
Multi-value, mandatory.
Specifies document source to get documents from when the current
index is indexed. There must be at least one source. There may be multiple
sources, without any restrictions on the source types: ie. you can pull
part of the data from MySQL server, part from PostgreSQL, part from
the filesystem using xmlpipe2 wrapper.
However, there are some restrictions on the source data. First,
document IDs must be globally unique across all sources. If that
condition is not met, you might get unexpected search results.
Second, source schemas must be the same in order to be stored
within the same index.
No source ID is stored automatically. Therefore, in order to be able
to tell what source the matched document came from, you will need to
store some additional information yourself. Two typical approaches
include:
mangling document ID and encoding source ID in it:
source src1
{
sql_query = SELECT id*10+1, ... FROM table1
...
}
source src2
{
sql_query = SELECT id*10+2, ... FROM table2
...
}
storing source ID simply as an attribute:
source src1
{
sql_query = SELECT id, 1 AS source_id FROM table1
sql_attr_uint = source_id
...
}
source src2
{
sql_query = SELECT id, 2 AS source_id FROM table2
sql_attr_uint = source_id
...
}
Example:
source = srcpart1
source = srcpart2
source = srcpart3
12.2.3. path
Index files path and file name (without extension).
Mandatory.
Path specifies both directory and file name, but without extension.
indexer will append different extensions to this path when generating
final names for both permanent and temporary index files. Permanent data
files have several different extensions starting with '.sp'; temporary
files' extensions start with '.tmp'. It's safe to remove .tmp*
files if indexer fails to remove them automatically.
For reference, different index files store the following data:
.spa stores document attributes (used in extern docinfo storage mode only);
.spd stores matching document ID lists for each word ID;
.sph stores index header information;
.spi stores word lists (word IDs and pointers to .spd file);
.spk stores kill-lists;
.spm stores MVA data;
.spp stores hit (aka posting, aka word occurrence) lists for each word ID;
.sps stores string attribute data;
.spe stores skip-lists to speed up doc-list filtering.
Example:
path = /var/data/test1
12.2.4. docinfo
Document attribute values (docinfo) storage mode.
Optional, default is 'extern'.
Known values are 'none', 'extern' and 'inline'.
Docinfo storage mode defines how exactly docinfo will be
physically stored on disk and in RAM. "none" means that there will be
no docinfo at all (ie. no attributes). Normally you need not set
"none" explicitly because Sphinx will automatically select "none"
when there are no attributes configured. "inline" means that the
docinfo will be stored in the .spd file, along with the document ID lists.
"extern" means that the docinfo will be stored separately (externally)
from document ID lists, in a special .spa file.
Basically, externally stored docinfo must be kept in RAM when querying,
for performance reasons. So in some cases "inline" might be the only option.
However, such cases are infrequent, and docinfo defaults to "extern".
Refer to Section 3.3, “Attributes” for in-depth discussion
and RAM usage estimates.
Example:
docinfo = inline
12.2.5. mlock
Memory locking for cached data.
Optional, default is 0 (do not call mlock()).
For search performance, searchd preloads a copy of .spa and .spi
files in RAM, and keeps that copy in RAM at all times. But if there
are no searches on the index for some time, there are no accesses
to that cached copy, and the OS might decide to swap it out to disk.
First queries to such a "cooled down" index will cause swap-in,
and their latency will suffer.
Setting the mlock option to 1 makes Sphinx lock physical RAM used
for that cached data using the mlock(2) system call, and that prevents
swapping (see man 2 mlock for details). mlock(2) is a privileged call,
so it will require searchd to be either run from the root account,
or be granted enough privileges otherwise. If mlock() fails, a warning
is emitted, but the index continues working.
Example:
mlock = 1
12.2.6. morphology
A list of morphology preprocessors (stemmers or lemmatizers) to apply.
Optional, default is empty (do not apply any preprocessor).
Morphology preprocessors can be applied to the words being
indexed to replace different forms of the same word with the base,
normalized form. For instance, English stemmer will normalize
both "dogs" and "dog" to "dog", making search results for
both searches the same.
There are 3 different morphology preprocessors that Sphinx implements:
lemmatizers, stemmers, and phonetic algorithms.
Lemmatizer reduces a keyword form to a so-called lemma,
a proper normal form, or in other words, a valid natural language
root word. For example, "running" could be reduced to "run",
the infinitive verb form, and "octopi" would be reduced to "octopus",
the singular noun form. Note that sometimes a word form can have
multiple corresponding root words. For instance, by looking at
"dove" it is not possible to tell whether this is a past tense
of "dive" the verb as in "He dove into a pool.", or "dove" the noun
as in "White dove flew over the cuckoo's nest." In this case
lemmatizer can generate all the possible root forms.
Stemmer reduces a keyword form to a so-called stem
by removing and/or replacing certain well-known suffixes.
The resulting stem is however not guaranteed to be
a valid word itself. For instance, with the Porter English
stemmer "running" would still reduce to "run", which is fine,
but "business" would reduce to "busi", which is not a word,
and "octopi" would not reduce at all. Stemmers are essentially
(much) simpler but still pretty good replacements of full-blown
lemmatizers.
Phonetic algorithms replace the words with specially
crafted phonetic codes that are equal even when the original words
are different, but phonetically close.
The morphology processors that come with our own built-in Sphinx
implementations are:
English, Russian, and German lemmatizers;
English, Russian, Arabic, and Czech stemmers;
SoundEx and MetaPhone phonetic algorithms.
You can also link with libstemmer library for even more
stemmers (see details below). With libstemmer, Sphinx also supports
morphological processing for more than 15 other languages. Binary
packages should come prebuilt with libstemmer support, too.
Lemmatizer support was added in version 2.1.1-beta, starting with
a Russian lemmatizer. English and German lemmatizers were then added
in version 2.2.1-beta.
Lemmatizers require a dictionary that needs to be
additionally downloaded from the Sphinx website. That dictionary
needs to be installed in a directory specified by the lemmatizer_base
directive. Also, there is a lemmatizer_cache directive that lets you
speed up lemmatizing (and therefore indexing) by spending more RAM
for, basically, an uncompressed cache of a dictionary.
Chinese segmentation using the Rosette Linguistics Platform was added in 2.2.1-beta.
It is a much more precise but slower way (compared to n-grams) to segment Chinese documents.
charset_table must contain all Chinese characters except Chinese punctuation marks,
because incoming documents are first processed by the Sphinx tokenizer and then the result
is processed by RLP. Sphinx performs per-token language detection on the incoming documents.
If the token language is identified as Chinese, it will only be processed by the RLP,
even if multiple morphology processors are specified. Otherwise, it will be processed
by all the morphology processors specified in the "morphology" option. The Rosette
Linguistics Platform must be installed and configured, and Sphinx must be built with
the --with-rlp switch. See also the rlp_root, rlp_environment and rlp_context options.
A batched version of RLP segmentation is also available (rlp_chinese_batched). It provides the
same functionality as the basic rlp_chinese segmentation, but enables batching documents before
processing them by the RLP. Processing several documents at once can result in a substantial
indexing speedup if the documents are small (for example, less than 1k). See also the
rlp_max_batch_size and rlp_max_batch_docs options.
Additional stemmers provided by the Snowball project's libstemmer library
can be enabled at compile time using the --with-libstemmer configure option.
Built-in English and Russian stemmers should be faster than their
libstemmer counterparts, but can produce slightly different results,
because they are based on an older version.
Soundex implementation matches that of MySQL. Metaphone implementation
is based on Double Metaphone algorithm and indexes the primary code.
Built-in values that are available for use in morphology
option are as follows:
none - do not perform any morphology processing;
lemmatize_ru - apply Russian lemmatizer and pick a single root form (added in 2.1.1-beta);
lemmatize_en - apply English lemmatizer and pick a single root form (added in 2.2.1-beta);
lemmatize_de - apply German lemmatizer and pick a single root form (added in 2.2.1-beta);
lemmatize_ru_all - apply Russian lemmatizer and index all possible root forms (added in 2.1.1-beta);
lemmatize_en_all - apply English lemmatizer and index all possible root forms (added in 2.2.1-beta);
lemmatize_de_all - apply German lemmatizer and index all possible root forms (added in 2.2.1-beta);
stem_en - apply Porter's English stemmer;
stem_ru - apply Porter's Russian stemmer;
stem_enru - apply Porter's English and Russian stemmers;
stem_cz - apply Czech stemmer;
stem_ar - apply Arabic stemmer (added in 2.1.1-beta);
soundex - replace keywords with their SOUNDEX code;
metaphone - replace keywords with their METAPHONE code.
rlp_chinese - apply Chinese text segmentation using Rosette Linguistics Platform
rlp_chinese_batched - apply Chinese text segmentation using Rosette Linguistics Platform with document batching
Additional values provided by libstemmer are in 'libstemmer_XXX' format,
where XXX is libstemmer algorithm codename (refer to
libstemmer_c/libstemmer/modules.txt
for a complete list).
Several stemmers can be specified (comma-separated). They will be applied
to incoming words in the order they are listed, and the processing will stop
once one of the stemmers actually modifies the word.
Also when wordforms feature is enabled
the word will be looked up in word forms dictionary first, and if there is
a matching entry in the dictionary, stemmers will not be applied at all.
Or in other words, wordforms can be
used to implement stemming exceptions.
Example:
morphology = stem_en, libstemmer_sv
12.2.7. dict
The keywords dictionary type.
Known values are 'crc' and 'keywords'.
'crc' is DEPRECATED. Use 'keywords' instead.
Optional, default is 'keywords'.
Introduced in version 2.0.1-beta.
CRC dictionary mode (dict=crc) is the default dictionary type
in Sphinx, and the only one available until version 2.0.1-beta.
Keywords dictionary mode (dict=keywords) was added in 2.0.1-beta,
primarily to (greatly) reduce indexing impact and enable substring
searches on huge collections. They also eliminate the chance of
CRC32 collisions. In 2.0.1-beta, that mode was only supported
for disk indexes. Starting with 2.0.2-beta, RT indexes are
also supported.
CRC dictionaries never store the original keyword text in the index.
Instead, keywords are replaced with their control sum value (either CRC32 or
FNV64, depending on whether Sphinx was built with --enable-id64)
both when searching and indexing, and that value is used internally
in the index.
That approach has two drawbacks. First, in CRC32 case there is
a chance of control sum collision between several pairs of different
keywords, growing quadratically with the number of unique keywords
in the index. (FNV64 case is unaffected in practice, as a chance
of a single FNV64 collision in a dictionary of 1 billion entries
is approximately 1:16, or 6.25 percent. And most dictionaries
will be much more compact than a billion keywords, as a typical
spoken human language has in the region of 1 to 10 million word
forms.) Second, and more importantly, substring searches are not
directly possible with control sums. Sphinx alleviated that by
pre-indexing all the possible substrings as separate keywords
(see Section 12.2.18, “min_prefix_len”, Section 12.2.19, “min_infix_len”
directives). That actually has an added benefit of matching
substrings in the quickest way possible. But at the same time
pre-indexing all substrings grows the index size a lot (factors
of 3-10x and even more would not be unusual) and impacts the
indexing time respectively, rendering substring searches
on big indexes rather impractical.
Keywords dictionary, introduced in 2.0.1-beta, fixes both these
drawbacks. It stores the keywords in the index and performs
search-time wildcard expansion. For example, a search for a
'test*' prefix could internally expand to 'test|tests|testing'
query based on the dictionary contents. That expansion is fully
transparent to the application, except that the separate
per-keyword statistics for all the actually matched keywords
would now also be reported.
Version 2.1.1-beta introduced extended wildcard support; now special
symbols like '?' and '%' are supported along with substring (infix) search (e.g. "t?st*", "run%", "*abc*").
Note, however, that these wildcards work only with dict=keywords, and not elsewhere.
Indexing with a keywords dictionary should be 1.1x to 1.3x slower
compared to regular, non-substring indexing - but several times faster
compared to substring indexing (either prefix or infix). Index size
should only be slightly bigger than that of the regular non-substring
index, with a 1..10% total difference.
Regular keyword searching time must be very close or identical across
all three discussed index kinds (CRC non-substring, CRC substring,
keywords). Substring searching time can vary greatly depending
on how many actual keywords match the given substring (in other
words, into how many keywords does the search term expand).
The maximum number of keywords matched is restricted by the
expansion_limit directive.
Essentially, keywords and CRC dictionaries represent two different
trade-offs for substring searching. You can choose
to either sacrifice indexing time and index size in favor of
top-speed worst-case searches (CRC dictionary), or only slightly
impact indexing time but sacrifice worst-case searching time when
the prefix expands into very many keywords (keywords dictionary).
Example:
dict = keywords
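For example, to allow the search-time wildcard expansion discussed above on infixes, a keywords dictionary is typically paired with min_infix_len (see Section 12.2.19, “min_infix_len”); a hedged sketch:
dict          = keywords
min_infix_len = 3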
12.2.8. index_sp
Whether to detect and index sentence and paragraph boundaries.
Optional, default is 0 (do not detect and index).
Introduced in version 2.0.1-beta.
This directive enables sentence and paragraph boundary indexing.
It's required for the SENTENCE and PARAGRAPH operators to work.
Sentence boundary detection is based on plain text analysis, so you
only need to set index_sp = 1 to enable it. Paragraph
detection is however based on HTML markup, and happens in the
HTML stripper. So to index paragraph locations you also need to
enable the stripper by specifying html_strip = 1. Both types of
boundaries are detected based on a few built-in rules enumerated just below.
Sentence boundary detection rules are as follows.
Question and exclamation signs (? and !) are always a sentence boundary.
Trailing dot (.) is a sentence boundary, except:
When followed by a letter. That's considered a part of an abbreviation (as in "S.T.A.L.K.E.R" or "Goldman Sachs S.p.A.").
When followed by a comma. That's considered an abbreviation followed by a comma (as in "Telecom Italia S.p.A., founded in 1994").
When followed by a space and a small letter. That's considered an abbreviation within a sentence (as in "News Corp. announced in February").
When preceded by a space and a capital letter, and followed by a space. That's considered a middle initial (as in "John D. Doe").
Paragraph boundaries are inserted at every block-level HTML tag.
Namely, those are (as taken from HTML 4 standard) ADDRESS, BLOCKQUOTE,
CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P,
PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.
Both sentences and paragraphs increment the keyword position counter by 1.
Example:
index_sp = 1
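Since paragraph detection happens in the HTML stripper, a configuration that indexes both boundary types needs the stripper enabled as well; for instance:
index_sp   = 1
html_strip = 1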
12.2.9. index_zones
A list of in-field HTML/XML zones to index.
Optional, default is empty (do not index zones).
Introduced in version 2.0.1-beta.
Zones can be formally defined as follows. Everything between
an opening and a matching closing tag is called a span, and
the aggregate of all spans sharing the same tag name is called
a zone. For instance, everything between the occurrences of
<H1> and </H1> in the document field belongs to the H1 zone.
Zone indexing, enabled by the index_zones directive,
is an optional extension of the HTML stripper. So it will also
require that the stripper is enabled (with html_strip = 1).
The value of the index_zones directive should be a comma-separated
list of those tag names and wildcards (ending with a star) that
should be indexed as zones.
Zones can nest and overlap arbitrarily. The only requirement
is that every opening tag has a matching closing tag. You can also have
an arbitrary number of both zones (as in unique zone names,
such as H1) and spans (all the occurrences of those H1 tags)
in a document.
Once indexed, zones can then be used for matching with
the ZONE operator, see Section 5.3, “Extended query syntax”.
Example:
index_zones = h*, th, title
Versions earlier than 2.1.1-beta only provided this feature for plain
indexes; currently, RT indexes also provide it.
12.2.10. min_stemming_len
Minimum word length at which to enable stemming.
Optional, default is 1 (stem everything).
Introduced in version 0.9.9-rc1.
Stemmers are not perfect, and might sometimes produce undesired results.
For instance, running "gps" keyword through Porter stemmer for English
results in "gp", which is not really the intent. min_stemming_len
feature lets you suppress stemming based on the source word length,
ie. to avoid stemming too short words. Keywords that are shorter than
the given threshold will not be stemmed. Note that keywords that are
exactly as long as specified will be stemmed. So in order to avoid
stemming 3-character keywords, you should specify 4 for the value.
For more fine-grained control, refer to the wordforms feature.
Example:
min_stemming_len = 4
12.2.11. stopwords
Stopword files list (space separated).
Optional, default is empty.
Stopwords are the words that will not be indexed. Typically you'd
put most frequent words in the stopwords list because they do not add
much value to search results but consume a lot of resources to process.
You can specify several file names, separated by spaces. All the files
will be loaded. Stopwords file format is simple plain text. The encoding
must be UTF-8.
File data will be tokenized with respect to charset_table
settings, so you can use the same separators as in the indexed data.
The stemmers will normally be
applied when parsing stopwords file. That might however lead to undesired
results. Starting with 2.1.1-beta, you can turn that off with
stopwords_unstemmed.
Starting with version 2.1.1-beta small enough files are stored in the index
header, see Section 12.2.13, “embedded_limit” for details.
While stopwords are not indexed, they still do affect the keyword positions.
For instance, assume that "the" is a stopword, that document 1 contains the line
"in office", and that document 2 contains "in the office". Searching for "in office"
as for exact phrase will only return the first document, as expected, even though
"the" in the second one is stopped. That behavior can be tweaked through the
stopword_step directive.
Stopwords files can either be created manually, or semi-automatically.
indexer provides a mode that creates a frequency dictionary
of the index, sorted by the keyword frequency; see the --buildstops
and --buildfreqs switches in Section 7.1, “indexer command reference”.
Top keywords from that dictionary can usually be used as stopwords.
Example:
stopwords = /usr/local/sphinx/data/stopwords.txt
stopwords = stopwords-ru.txt stopwords-en.txt
12.2.12. wordforms
Word forms dictionary.
Optional, default is empty.
Word forms are applied after tokenizing the incoming text
by charset_table rules.
They essentially let you replace one word with another. Normally,
that would be used to bring different word forms to a single
normal form (eg. to normalize all the variants such as "walks",
"walked", "walking" to the normal form "walk"). It can also be used
to implement stemming exceptions, because stemming is not applied
to words found in the forms list.
Starting with version 2.1.1-beta small enough files are stored in the index
header, see Section 12.2.13, “embedded_limit” for details.
Dictionaries are used to normalize incoming words both during indexing
and searching. Therefore, to pick up changes in the wordforms file
it's required to rotate the index.
Word forms support in Sphinx is designed to support big dictionaries well.
They moderately affect indexing speed: for instance, a dictionary with 1 million
entries slows down indexing about 1.5 times. Searching speed is not affected at all.
Additional RAM impact is roughly equal to the dictionary file size,
and dictionaries are shared across indexes: ie. if the very same 50 MB wordforms
file is specified for 10 different indexes, additional searchd
RAM usage will be about 50 MB.
Dictionary file should be in a simple plain text format. Each line
should contain source and destination word forms, in UTF-8 encoding,
separated by "greater" sign. Rules from the
charset_table will be
applied when the file is loaded. So basically it's as case sensitive
as your other full-text indexed data, ie. typically case insensitive.
Here's the file contents sample:
walks > walk
walked > walk
walking > walk
There is a bundled spelldump utility that helps you create a dictionary
file in the format Sphinx can read, from source .dict and .aff dictionary
files in ispell or MySpell format (as bundled with OpenOffice).
Starting with version 0.9.9-rc1, you can map several source words
to a single destination word. Because the work happens on tokens,
not the source text, differences in whitespace and markup are ignored.
Starting with version 2.1.1-beta, you can use "=>" instead of ">". Comments
(starting with "#") are also allowed. Finally, if a line starts with a tilde ("~")
the wordform will be applied after morphology, instead of before.
core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
Starting with version 2.2.4, you can specify multiple destination tokens:
s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3
Example:
wordforms = /usr/local/sphinx/data/wordforms.txt
wordforms = /usr/local/sphinx/data/alternateforms.txt
wordforms = /usr/local/sphinx/private/dict*.txt
Starting with version 2.1.1-beta you can specify several files and not
only just one. Masks can be used as a pattern, and all matching files will
be processed in simple ascending order. (If multi-byte codepages are used,
and file names can include foreign characters, the resulting order may not
be exactly alphabetic.) If the same wordform definition is found in several
files, the latter one is used, and it overrides previous definitions.
12.2.13. embedded_limit
Embedded exceptions, wordforms, or stopwords file size limit.
Optional, default is 16K.
Added in version 2.1.1-beta.
Before 2.1.1-beta, the contents of exceptions, wordforms, or stopwords
files were always kept in the files. Only the file names were stored into
the index. Starting with 2.1.1-beta, indexer can either save the file name,
or embed the file contents directly into the index. Files sized under
embedded_limit
get stored into the index. For bigger files,
only the file names are stored. This also simplifies moving index files
to a different machine; you may get by just copying a single file.
With smaller files, such embedding reduces the number of the external
files on which the index depends, and helps maintenance. But at the same
time it makes no sense to embed a 100 MB wordforms dictionary into a tiny
delta index. So there needs to be a size threshold, and embedded_limit
is that threshold.
Example:
embedded_limit = 32K
Tokenizing exceptions file.
Optional, default is empty.
Exceptions let you map one or more tokens (including tokens with
characters that would normally be excluded) to a single keyword.
They are similar to wordforms
in that they also perform mapping, but have a number of important
differences.
Starting with version 2.1.1-beta small enough files are stored in the index
header, see Section 12.2.13, “embedded_limit” for details.
Short summary of the differences is as follows:
exceptions are case sensitive, wordforms are not;
exceptions can use special characters that are not in charset_table, wordforms fully obey charset_table;
exceptions can underperform on huge dictionaries, wordforms handle millions of entries well.
The expected file format is also plain text, with one line per exception,
and the line format is as follows:
map-from-tokens => map-to-token
Example file:
at & t => at&t
AT&T => AT&T
Standarten Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus
All tokens here are case sensitive: they will not be processed by
charset_table rules. Thus, with
the example exceptions file above, "at&t" text will be tokenized as two
keywords "at" and "t", because of lowercase letters. On the other hand,
"AT&T" will match exactly and produce single "AT&T" keyword.
Note that this map-to keyword is a) always interpreted
as a single word, and b) is both case and space
sensitive! In our sample, "ms windows" query will not
match the document with "MS Windows" text. The query will be interpreted
as a query for two keywords, "ms" and "windows". And what "MS Windows"
gets mapped to is a single keyword "ms windows",
with a space in the middle. On the other hand, "standartenfuhrer"
will retrieve documents with "Standarten Fuhrer" or "Standarten Fuehrer"
contents (capitalized exactly like this), or any capitalization variant
of the keyword itself, eg. "staNdarTenfUhreR". (It won't catch
"standarten fuhrer", however: this text does not match any of the
listed exceptions because of case sensitivity, and gets indexed
as two separate keywords.)
Whitespace in the map-from tokens list matters, but its amount does not.
Any amount of whitespace in the map-from list will match any other amount
of whitespace in the indexed document or query. For instance, "AT & T"
map-from token will match "AT & T" text,
whatever the amount of space in both map-from part and the indexed text.
Such text will therefore be indexed as a special "AT&T" keyword,
thanks to the very first entry from the sample.
Exceptions also let you capture special characters (that are exceptions
from general charset_table rules;
hence the name). Assume that you generally do not want to treat '+'
as a valid character, but still want to be able to search for some exceptions
from this rule such as 'C++'. The sample above will do just that, totally
independent of what characters are in the table and what are not.
Exceptions are applied to raw incoming document and query data
during indexing and searching respectively. Therefore, to pick up
changes in the file it's required to reindex and restart
searchd
.
Example:
exceptions = /usr/local/sphinx/data/exceptions.txt
Minimum indexed word length.
Optional, default is 1 (index everything).
Only those words that are not shorter than this minimum will be indexed.
For instance, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.
Example:
min_word_len = 4
Accepted characters table, with case folding rules.
Optional, default value is Latin and Cyrillic characters.
charset_table is the main workhorse of Sphinx tokenizing process,
ie. the process of extracting keywords from document text or query text.
It controls what characters are accepted as valid and what are not,
and how the accepted characters should be transformed (eg. should
the case be removed or not).
You can think of charset_table as of a big table that has a mapping
for each and every of 100K+ characters in Unicode. By default,
every character maps to 0, which means that it does not occur
within keywords and should be treated as a separator. Once
mentioned in the table, character is mapped to some other
character (most frequently, either to itself or to a lowercase
letter), and is treated as a valid keyword part.
The expected value format is a comma-separated list of mappings.
Two simplest mappings simply declare a character as valid, and map
a single character to another single character, respectively.
But specifying the whole table in such form would result
in bloated and barely manageable specifications. So there are
several syntax shortcuts that let you map ranges of characters
at once. The complete list is as follows:
- A->a
Single char mapping, declares source char 'A' as allowed
to occur within keywords and maps it to destination char 'a'
(but does not declare 'a' as allowed).
- A..Z->a..z
Range mapping, declares all chars in source range
as allowed and maps them to the destination range. Does not
declare destination range as allowed. Also checks ranges' lengths
(the lengths must be equal).
- a
Stray char mapping, declares a character as allowed
and maps it to itself. Equivalent to a->a single char mapping.
- a..z
Stray range mapping, declares all characters in range
as allowed and maps them to themselves. Equivalent to
a..z->a..z range mapping.
- A..Z/2
Checkerboard range map. Maps every pair of chars
to the second char. More formally, declares odd characters
in range as allowed and maps them to the even ones; also
declares even characters as allowed and maps them to themselves.
For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D,
..., Y->Z, Z->Z. This mapping shortcut is helpful for
a number of Unicode blocks where uppercase and lowercase
letters go in such interleaved order instead of contiguous
chunks.
Control characters with codes from 0 to 31 are always treated as separators.
Characters with codes 32 to 127, ie. 7-bit ASCII characters, can be used
in the mappings as is. To avoid configuration file encoding issues,
8-bit ASCII characters and Unicode characters must be specified in U+xxx form,
where 'xxx' is hexadecimal codepoint number. This form can also be used
for 7-bit ASCII characters to encode special ones: eg. use U+20 to
encode space, U+2E to encode dot, U+2C to encode comma.
Starting with 2.2.3-beta, the aliases "english" and "russian" can be used
in the mapping list in place of the respective character ranges.
Example:
# default are English and Russian letters
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451
# english charset defined with alias
charset_table = 0..9, english, _
Ignored characters list.
Optional, default is empty.
Useful in the cases when some characters, such as soft hyphenation mark (U+00AD),
should be not just treated as separators but rather fully ignored.
For example, if '-' is simply not in the charset_table,
"abc-def" text will be indexed as "abc" and "def" keywords.
On the contrary, if '-' is added to ignore_chars list, the same
text will be indexed as a single "abcdef" keyword.
The syntax is the same as for charset_table,
but it's only allowed to declare characters, and not allowed to map them. Also,
the ignored characters must not be present in charset_table.
Example:
ignore_chars = U+AD
Minimum word prefix length to index.
Optional, default is 0 (do not index prefixes).
Prefix indexing allows implementing wildcard searching with 'wordstart*' wildcards.
When minimum prefix length is set to a positive number, indexer will index
all the possible keyword prefixes (ie. word beginnings) in addition to the keywords
themselves. Too short prefixes (below the minimum allowed length) will not
be indexed.
For instance, indexing a keyword "example" with min_prefix_len=3
will result in indexing "exa", "exam", "examp", "exampl" prefixes along
with the word itself. Searches against such index for "exam" will match
documents that contain "example" word, even if they do not contain "exam"
by itself. However, indexing prefixes will make the index grow significantly
(because of many more indexed keywords), and will degrade both indexing
and searching times.
There's no automatic way to rank perfect word matches higher
in a prefix index, but there's a number of tricks to achieve that.
First, you can setup two indexes, one with prefix indexing and one
without it, search through both, and use SetIndexWeights()
call to combine weights. Second, you can rewrite your extended-mode queries:
$cl->Query ( "( keyword | keyword* ) other keywords" );
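As for the first trick (two indexes combined with SetIndexWeights()), a minimal
PHP API sketch could look as follows; the index names plain_index and star_index
and the weights are purely illustrative:
$cl = new SphinxClient ();
// boost matches coming from the exact-word index over the prefix-enabled one
$cl->SetIndexWeights ( array ( "plain_index" => 10, "star_index" => 1 ) );
$res = $cl->Query ( "keyword", "plain_index, star_index" );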
Example:
min_prefix_len = 3
Minimum infix length to index.
Optional, default is 0 (do not index infixes).
Infix indexing allows implementing wildcard searching with 'start*', '*end', and '*middle*' wildcards.
When minimum infix length is set to a positive number, indexer will index all the possible keyword infixes
(ie. substrings) in addition to the keywords themselves. Too short infixes
(below the minimum allowed length) will not be indexed. For instance,
indexing a keyword "test" with min_infix_len=2 will result in indexing
"te", "es", "st", "tes", "est" infixes along with the word itself.
Searches against such index for "es" will match documents that contain
"test" word, even if they do not contain "es" on itself. However,
indexing infixes will make the index grow significantly (because of
many more indexed keywords), and will degrade both indexing and
searching times.
There's no automatic way to rank perfect word matches higher
in an infix index, but the same tricks as with prefix indexes
can be applied.
Example:
min_infix_len = 3
12.2.20. max_substring_len
Maximum substring (either prefix or infix) length to index.
Optional, default is 0 (do not limit indexed substrings).
Applies to dict=crc only.
By default, substring (either prefix or infix) indexing in the
dict=crc mode will index all
the possible substrings as separate keywords. That might result
in an overly large index. So the max_substring_len
directive lets you limit the impact of substring indexing
by skipping too-long substrings (which, chances are, will never
get searched for anyway).
For example, a test index of 10,000 blog posts takes this
much disk space depending on the settings:
- 6.4 MB baseline (no substrings)
- 24.3 MB (3.8x) with min_prefix_len = 3
- 22.2 MB (3.5x) with min_prefix_len = 3, max_substring_len = 8
- 19.3 MB (3.0x) with min_prefix_len = 3, max_substring_len = 6
- 94.3 MB (14.7x) with min_infix_len = 3
- 84.6 MB (13.2x) with min_infix_len = 3, max_substring_len = 8
- 70.7 MB (11.0x) with min_infix_len = 3, max_substring_len = 6
So in this test limiting the max substring length saved us
10-15% on the index size.
There is no performance impact associated with substring length
when using dict=keywords mode, so this directive is not applicable
and intentionally forbidden in that case. If required, you can still
limit the length of a substring that you search for in the application
code.
Example:
max_substring_len = 12
The list of full-text fields to limit prefix indexing to.
Optional, default is empty (index all fields in prefix mode).
Because prefix indexing impacts both indexing and searching performance,
it might be desired to limit it to specific full-text fields only:
for instance, to provide prefix searching through URLs, but not through
page contents. prefix_fields specifies what fields will be prefix-indexed;
all other fields will be indexed in normal mode. The value format is a
comma-separated list of field names.
Example:
prefix_fields = url, domain
The list of full-text fields to limit infix indexing to.
Optional, default is empty (index all fields in infix mode).
Similar to prefix_fields,
but lets you limit infix-indexing to given fields.
Example:
infix_fields = url, domain
N-gram lengths for N-gram indexing.
Optional, default is 0 (disable n-gram indexing).
Known values are 0 and 1 (other lengths to be implemented).
N-grams provide basic CJK (Chinese, Japanese, Korean) support for
unsegmented texts. The issue with CJK searching is that there could be no
clear separators between the words. Ideally, the texts would be filtered
through a special program called segmenter that would insert separators
in proper locations. However, segmenters are slow and error prone,
and it's common to index contiguous groups of N characters, or n-grams,
instead.
When this feature is enabled, streams of CJK characters are indexed
as N-grams. For example, if incoming text is "ABCDEF" (where A to F represent
some CJK characters) and length is 1, it will be indexed as if
it was "A B C D E F". (With length equal to 2, it would produce "AB BC CD DE EF";
but only 1 is supported at the moment.) Only those characters that are
listed in ngram_chars table
will be split this way; other ones will not be affected.
Note that if search query is segmented, ie. there are separators between
individual words, then wrapping the words in quotes and using extended mode
will result in proper matches being found even if the text was not
segmented. For instance, assume that the original query is BC DEF.
After wrapping in quotes on the application side, it should look
like "BC" "DEF" (with quotes). This query
will be passed to Sphinx and internally split into 1-grams too,
resulting in "B C" "D E F" query, still with
quotes that are the phrase matching operator. And it will match
the text even though there were no separators in the text.
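For instance, a hedged SphinxQL equivalent of that quoted query (the index
name cjk_index is hypothetical):
SELECT id FROM cjk_index WHERE MATCH('"BC" "DEF"');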
Even if the search query is not segmented, Sphinx should still produce
good results, thanks to phrase based ranking: it will pull closer phrase
matches (which in case of N-gram CJK words can mean closer multi-character
word matches) to the top.
Example:
ngram_len = 1
N-gram characters list.
Optional, default is empty.
To be used in conjunction with ngram_len,
this list defines characters, sequences of which are subject to N-gram extraction.
Words comprised of other characters will not be affected by N-gram indexing
feature. The value format is identical to charset_table.
Example:
ngram_chars = U+3000..U+2FA1F
Phrase boundary characters list.
Optional, default is empty.
This list controls what characters will be treated as phrase boundaries,
in order to adjust word positions and enable phrase-level search
emulation through proximity search. The syntax is similar
to charset_table.
Mappings are not allowed and the boundary characters must not
overlap with anything else.
On phrase boundary, additional word position increment (specified by
phrase_boundary_step)
will be added to current word position. This enables phrase-level
searching through proximity queries: words in different phrases
will be guaranteed to be more than phrase_boundary_step distance
away from each other; so proximity search within that distance
will be equivalent to phrase-level search.
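For example, with phrase_boundary_step = 100, a proximity query with a window
smaller than 100 will only match keywords that occur within the same phrase;
a hedged SphinxQL sketch (the index name is hypothetical):
SELECT id FROM myindex WHERE MATCH('"shiny diamond"~50');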
Phrase boundary condition will be raised if and only if such character
is followed by a separator; this is to avoid abbreviations such as
S.T.A.L.K.E.R or URLs being treated as several phrases.
Example:
phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis
12.2.26. phrase_boundary_step
Phrase boundary word position increment.
Optional, default is 0.
On phrase boundary, current word position will be additionally incremented
by this number. See phrase_boundary for details.
Example:
phrase_boundary_step = 100
Whether to strip HTML markup from incoming full-text data.
Optional, default is 0.
Known values are 0 (disable stripping) and 1 (enable stripping).
Both HTML tags and entities are considered markup and get processed.
HTML tags are removed, their contents (i.e., everything between
<P> and </P>) are left intact by default. You can choose
to keep and index attributes of the tags (e.g., HREF attribute in
an A tag, or ALT in an IMG one). Several well-known inline tags are
completely removed, all other tags are treated as block level and
replaced with whitespace. For example, 'te<B>st</B>'
text will be indexed as a single keyword 'test', however,
'te<P>st</P>' will be indexed as two keywords
'te' and 'st'. Known inline tags are as follows: A, B, I, S, U, BASEFONT,
BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, TT.
HTML entities get decoded and replaced with corresponding UTF-8
characters. Stripper supports both numeric forms (such as &#239;)
and text forms (such as &oacute; or &nbsp;). All entities
as specified by HTML4 standard are supported.
Stripping should work with
properly formed HTML and XHTML, but, just as most browsers, may produce
unexpected results on malformed input (such as HTML with stray <'s
or unclosed >'s).
Only the tags themselves, and also HTML comments, are stripped.
To strip the contents of the tags too (eg. to strip embedded scripts),
see html_remove_elements option.
There are no restrictions on tag names; ie. everything
that looks like a valid tag start, or end, or a comment
will be stripped.
Example:
html_strip = 1
12.2.28. html_index_attrs
A list of markup attributes to index when stripping HTML.
Optional, default is empty (do not index markup attributes).
Specifies HTML markup attributes whose contents should be retained and indexed
even though other HTML markup is stripped. The format is per-tag enumeration of
indexable attributes, as shown in the example below.
Example:
html_index_attrs = img=alt,title; a=title;
12.2.29. html_remove_elements
A list of HTML elements for which to strip contents along with the elements themselves.
Optional, default is empty string (do not strip contents of any elements).
This feature lets you strip element contents, ie. everything that
is between the opening and the closing tags. It is useful to remove
embedded scripts, CSS, etc. Short tag form for empty elements
(ie. <br />) is properly supported; ie. the text that
follows such tag will not be removed.
The value is a comma-separated list of element (tag) names whose
contents should be removed. Tag names are case insensitive.
Example:
html_remove_elements = style, script
Local index declaration in the distributed index.
Multi-value, optional, default is empty.
This setting is used to declare local indexes that will be searched when
given distributed index is searched. Many local indexes can be declared per
each distributed index. Any local index can also be mentioned several times
in different distributed indexes.
Note that by default all local indexes will be searched sequentially,
utilizing only 1 CPU or core. To parallelize processing of the local parts
in the distributed index, you should use dist_threads
directive,
see Section 12.4.24, “dist_threads”.
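A minimal sketch along those lines (the index and chunk names are hypothetical,
and other mandatory searchd settings are omitted); dist_threads belongs to the
searchd section:
index dist1
{
    type = distributed
    local = chunk1
    local = chunk2
    local = chunk3
    local = chunk4
}
searchd
{
    dist_threads = 4
}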
Before dist_threads
, there also was a legacy solution
to configure searchd
to query itself instead of using
local indexes (refer to Section 12.2.31, “agent” for the details). However,
that creates redundant CPU and network load, and dist_threads
is now strongly suggested instead.
Example:
local = chunk1
local = chunk2
Remote agent declaration in the distributed index.
Multi-value, optional, default is empty.
agent
directive declares remote agents that are searched
every time when the enclosing distributed index is searched. The agents
are, essentially, pointers to networked indexes. Prior to version 2.1.1-beta,
the value format was:
agent = address:index-list
Starting with 2.1.1-beta, the value can additionally specify multiple
alternatives (agent mirrors) for either the address only, or the address
and index list:
agent = address1 [ | address2 [...] ]:index-list
agent = address1:index-list [ | address2:index-list [...] ]
In both cases the address specification must be one of the following:
address = hostname:port # eg. server2:9312
address = /absolute/unix/socket/path # eg. /var/run/sphinx2.sock
Where
hostname
is the remote host name,
port
is the remote TCP port number,
index-list
is a comma-separated list of index names,
and square braces [] designate an optional clause.
In other words, you can point every single agent to one or more remote
indexes, residing on one or more networked servers. There are absolutely
no restrictions on the pointers. To point out a couple important things,
the host can be localhost, and the remote index can be a distributed
index in turn, all that is legal. That enables a bunch of very different
usage modes:
sharding over multiple agent servers, and creating
an arbitrary cluster topology;
sharding over multiple agent servers, mirrored
for HA/LB (High Availability and Load Balancing) purposes
(starting with 2.1.1-beta);
sharding within localhost, to utilize multiple cores
(historical and not recommended in versions 1.x and above, use multiple
local indexes and dist_threads directive instead);
All agents are searched in parallel. An index list is passed verbatim
to the remote agent. How exactly that list is searched within the agent
(ie. sequentially or in parallel too) depends solely on the agent
configuration (ie. dist_threads directive). Master has no remote
control over that.
Example:
# config on box1
# sharding an index over 3 servers
agent = box2:9312:chunk2
agent = box3:9312:chunk3
# config on box2
# sharding an index over 3 servers
agent = box1:9312:chunk1
agent = box3:9312:chunk3
# config on box3
# sharding an index over 3 servers
agent = box1:9312:chunk1
agent = box2:9312:chunk2
Agent mirrors
New syntax added in 2.1.1-beta lets you define so-called agent mirrors
that can be used interchangeably when processing a search query. Master server
keeps track of mirror status (alive or dead) and response times, and does
automatic failover and load balancing based on that. For example, this line:
agent = box1:9312|box2:9312|box3:9312:chunk2
Declares that box1:9312, box2:9312, and box3:9312 all have an index
called chunk2, and can be used as interchangeable mirrors. If any one
of those servers goes down, the queries will be distributed between
the other two. When it gets back up, master will detect that and begin
routing queries to all three boxes again.
Another way to define the mirrors is to explicitly specify the index list
for every mirror:
agent = box1:9312:box1chunk2|box2:9312:box2chunk2
This works essentially the same as the previous example, but different
index names will be used when querying different servers: box1chunk2 when querying
box1:9312, and box2chunk2 when querying box2:9312.
By default, all queries are routed to the best of the mirrors. The best one
is picked based on the recent statistics, as controlled by the
ha_period_karma config directive.
Master stores a number of metrics (total query count, error count, response
time, etc) recently observed for every agent. It groups those by time spans,
and karma is that time span length. The best agent mirror is then determined
dynamically based on the last 2 such time spans. The specific algorithm
used to pick a mirror can be configured with the
ha_strategy directive.
The karma period is in seconds and defaults to 60 seconds. Master stores
up to 15 karma spans with per-agent statistics for instrumentation purposes
(see SHOW AGENT STATUS
statement). However, only the last 2 spans out of those are ever used for
HA/LB logic.
When there are no queries, master sends a regular ping command every
ha_ping_interval milliseconds
in order to collect some statistics and at least check whether the remote
host is still alive. ha_ping_interval defaults to 1000 msec. Setting it to 0
disables pings and statistics will only be accumulated based on actual queries.
Example:
# sharding index over 4 servers total
# in just 2 chunks but with 2 failover mirrors for each chunk
# box1, box2 carry chunk1 as local
# box3, box4 carry chunk2 as local
# config on box1, box2
agent = box3:9312|box4:9312:chunk2
# config on box3, box4
agent = box1:9312|box2:9312:chunk1
12.2.32. agent_persistent
Persistently connected remote agent declaration.
Multi-value, optional, default is empty.
Introduced in version 2.1.1-beta.
agent_persistent
directive syntax matches that of
the agent directive. The only difference
is that the master will not open a new connection to the agent for
every query and then close it. Rather, it will keep a connection open and
attempt to reuse it for the subsequent queries. The maximum number of such persistent connections per agent host
is limited by the persistent_connections_limit option of the searchd section.
Note that you have to set the latter to a value greater than 0 if you want to use persistent agent connections.
Otherwise, when persistent_connections_limit is not defined, zero persistent
connections are assumed, and 'agent_persistent' acts exactly like a plain 'agent'.
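A minimal sketch combining the two directives (host, port, and index names are
hypothetical; other mandatory settings are omitted):
searchd
{
    persistent_connections_limit = 30
}
index dist2
{
    type = distributed
    agent_persistent = remotebox:9312:index2
}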
Persistent master-agent connections reduce TCP port pressure, and
save on connection handshakes. As of the time of this writing, they are supported only
in workers=threads mode. In other modes, simple non-persistent connections
(i.e., one connection per operation) will be used, and a warning will show
up in the console.
Example:
agent_persistent = remotebox:9312:index2
Remote blackhole agent declaration in the distributed index.
Multi-value, optional, default is empty.
Introduced in version 0.9.9-rc1.
agent_blackhole
lets you fire-and-forget queries
to remote agents. That is useful for debugging (or just testing)
production clusters: you can setup a separate debugging/testing searchd
instance, and forward the requests to this instance from your production
master (aggregator) instance without interfering with production work.
Master searchd will attempt to connect and query blackhole agent
normally, but it will neither wait nor process any responses.
Also, all network errors on blackhole agents will be ignored.
The value format is completely identical to regular
agent directive.
Example:
agent_blackhole = testbox:9312:testindex1,testindex2
12.2.34. agent_connect_timeout
Remote agent connection timeout, in milliseconds.
Optional, default is 1000 (ie. 1 second).
When connecting to remote agents, searchd
will wait at most this much time for connect() call to complete
successfully. If the timeout is reached but connect() does not complete,
and retries are enabled,
retry will be initiated.
Example:
agent_connect_timeout = 300
12.2.35. agent_query_timeout
Remote agent query timeout, in milliseconds.
Optional, default is 3000 (ie. 3 seconds).
Added in version 2.1.1-beta.
After connection, searchd
will wait at most this
much time for remote queries to complete. This timeout is fully separate
from connection timeout; so the maximum possible delay caused by
a remote agent equals to the sum of agent_connect_timeout
and
agent_query_timeout
. Queries will not be retried
if this timeout is reached; a warning will be produced instead.
Example:
agent_query_timeout = 10000 # our query can be long, allow up to 10 sec
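For instance, setting both timeouts together caps the worst-case delay introduced
by a single remote agent at their sum (1.3 seconds in this sketch):
agent_connect_timeout = 300
agent_query_timeout = 1000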
Whether to pre-open all index files, or open them per each query.
Optional, default is 0 (do not preopen).
This option tells searchd
that it should pre-open
all index files on startup (or rotation) and keep them open while it runs.
Currently, the default mode is not to pre-open the files (this may
change in the future). Preopened indexes take a few (currently 2) file
descriptors per index. However, they save on per-query open()
calls;
and also they are invulnerable to subtle race conditions that may happen during
index rotation under high load. On the other hand, when serving many indexes
(100s to 1000s), it still might be desired to open them on a per-query basis
in order to save file descriptors.
This directive does not affect indexer
in any way,
it only affects searchd
.
Example:
preopen = 1
Whether to enable in-place index inversion.
Optional, default is 0 (use separate temporary files).
Introduced in version 0.9.9-rc1.
inplace_enable
greatly reduces indexing disk footprint,
at a cost of slightly slower indexing (it uses around 2x less disk,
but yields around 90-95% of the original performance).
Indexing involves two major phases. The first phase collects,
processes, and partially sorts documents by keyword, and writes
the intermediate result to temporary files (.tmp*). The second
phase fully sorts the documents, and creates the final index
files. Thus, rebuilding a production index on the fly involves
around 3x peak disk footprint: the 1st copy for the intermediate
temporary files, the 2nd copy for the newly constructed index, and the 3rd copy
for the old index that will be serving production queries in the meantime.
(Intermediate data is comparable in size to the final index.)
That might be too much disk footprint for big data collections,
and inplace_enable
lets you reduce it.
When enabled, it reuses the temporary files, outputs the
final data back to them, and renames them on completion.
However, this might require additional temporary data chunk
relocation, which is where the performance impact comes from.
This directive does not affect searchd
in any way,
it only affects indexer
.
Example:
inplace_enable = 1
In-place inversion fine-tuning option.
Controls preallocated hitlist gap size.
Optional, default is 0.
Introduced in version 0.9.9-rc1.
This directive does not affect searchd
in any way,
it only affects indexer
.
Example:
inplace_hit_gap = 1M
12.2.39. inplace_docinfo_gap
In-place inversion fine-tuning option.
Controls preallocated docinfo gap size.
Optional, default is 0.
Introduced in version 0.9.9-rc1.
This directive does not affect searchd
in any way,
it only affects indexer
.
Example:
inplace_docinfo_gap = 1M
12.2.40. inplace_reloc_factor
In-place inversion fine-tuning option.
Controls relocation buffer size within indexing memory arena.
Optional, default is 0.1.
Introduced in version 0.9.9-rc1.
This directive does not affect searchd
in any way,
it only affects indexer
.
Example:
inplace_reloc_factor = 0.1
12.2.41. inplace_write_factor
In-place inversion fine-tuning option.
Controls in-place write buffer size within indexing memory arena.
Optional, default is 0.1.
Introduced in version 0.9.9-rc1.
This directive does not affect searchd
in any way,
it only affects indexer
.
Example:
inplace_write_factor = 0.1
12.2.42. index_exact_words
Whether to index the original keywords along with the stemmed/remapped versions.
Optional, default is 0 (do not index).
Introduced in version 0.9.9-rc1.
When enabled, index_exact_words
forces indexer
to put the raw keywords in the index along with the stemmed versions. That, in turn,
enables exact form operator in the query language to work.
This impacts the index size and the indexing time. However, searching performance
is not impacted at all.
Example:
index_exact_words = 1
Position increment on overshort (less than min_word_len) keywords.
Optional, allowed values are 0 and 1, default is 1.
Introduced in version 0.9.9-rc1.
This directive does not affect searchd
in any way,
it only affects indexer
.
Example:
overshort_step = 1
Position increment on stopwords.
Optional, allowed values are 0 and 1, default is 1.
Introduced in version 0.9.9-rc1.
This directive does not affect searchd
in any way,
it only affects indexer
.
Example:
stopword_step = 1
Hitless words list.
Optional, allowed values are 'all', or a list file name.
Introduced in version 1.10-beta.
By default, Sphinx full-text index stores not only a list of matching
documents for every given keyword, but also a list of its in-document positions
(aka hitlist). Hitlists enable phrase, proximity, strict order and other
advanced types of searching, as well as phrase proximity ranking. However,
hitlists for specific frequent keywords (that can not be stopped for
some reason despite being frequent) can get huge and thus slow to process
while querying. Also, in some cases we might only care about boolean
keyword matching, and never need position-based searching operators
(such as phrase matching) nor phrase ranking.
hitless_words
lets you create indexes that either
do not have positional information (hitlists) at all, or skip it for
specific keywords.
Hitless index will generally use less space than the respective
regular index (about 1.5x can be expected). Both indexing and searching
should be faster, at a cost of missing positional query and ranking support.
When searching, positional queries (eg. phrase queries) will be automatically
converted to respective non-positional (document-level) or combined queries.
For instance, if keywords "hello" and "world" are hitless, "hello world"
phrase query will be converted to (hello & world) bag-of-words query,
matching all documents that mention either of the keywords but not necessarily
the exact phrase. And if, in addition, keywords "simon" and "says" are not
hitless, "simon says hello world" will be converted to ("simon says" &
hello & world) query, matching all documents that contain "hello" and
"world" anywhere in the document, and also "simon says" as an exact phrase.
Example:
hitless_words = all
Expand keywords with exact forms and/or stars when possible.
Optional, default is 0 (do not expand keywords).
Introduced in version 1.10-beta.
Queries against indexes with expand_keywords
feature
enabled are internally expanded as follows. If the index was built with
prefix or infix indexing enabled, every keyword gets internally replaced
with a disjunction of keyword itself and a respective prefix or infix
(keyword with stars). If the index was built with both stemming and
index_exact_words enabled,
exact form is also added. Here's an example that shows how internal
expansion works when all of the above (infixes, stemming, and exact
words) are combined:
running -> ( running | *running* | =running )
Expanded queries take naturally longer to complete, but can possibly
improve the search quality, as the documents with exact form matches
should be ranked generally higher than documents with stemmed or infix matches.
Note that the existing query syntax does not allow emulating this
kind of expansion, because internal expansion works on keyword level and
expands keywords within phrase or quorum operators too (which is not
possible through the query syntax).
This directive does not affect indexer
in any way,
it only affects searchd
.
Example:
expand_keywords = 1
Blended characters list.
Optional, default is empty.
Introduced in version 1.10-beta.
Blended characters are indexed both as separators and valid characters.
For instance, assume that & is configured as blended and AT&T
occurs in an indexed document. Three different keywords will get indexed,
namely "at&t", treating blended characters as valid, plus "at" and "t",
treating them as separators.
Positions for tokens obtained by replacing blended characters with whitespace
are assigned as usual, so regular keywords will be indexed just as if there was
no blend_chars
specified at all. An additional token that
mixes blended and non-blended characters will be put at the starting position.
For instance, if "AT&T company" occurs at the very beginning
of a text field, "at" will be given position 1, "t" position 2,
"company" position 3, and "AT&T" will also be given position 1 ("blending"
with the opening regular keyword). Thus, querying for either AT&T or just
AT will match that document, and querying for "AT T" as a phrase will also match it.
Last but not least, a phrase query for "AT&T company" will also
match it, despite the position gap between "AT&T" (position 1) and
"company" (position 3) introduced by the intervening "t" token.
Blended characters can overlap with special characters used in query
syntax (think of T-Mobile or @twitter). Where possible, query parser will
automatically handle blended character as blended. For instance, "hello @twitter"
within quotes (a phrase operator) would handle @-sign as blended, because
@-syntax for field operator is not allowed within phrases. Otherwise,
the character would be handled as an operator. So you might want to
escape the keywords.
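When querying through the PHP API, the EscapeString() helper can be used for
that; a minimal sketch (the index name is hypothetical):
$cl = new SphinxClient ();
// escapes query-syntax characters, eg. "@twitter" becomes "\@twitter"
$keyword = $cl->EscapeString ( "@twitter" );
$res = $cl->Query ( $keyword, "myindex" );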
Starting with version 2.0.1-beta, blended characters can be remapped,
so that multiple different blended characters could be normalized into
just one base form. This is useful when indexing multiple alternative
Unicode codepoints with equivalent glyphs.
Example:
blend_chars = +, &, U+23
blend_chars = +, &->+ # 2.0.1 and above
Blended tokens indexing mode.
Optional, default is trim_none
.
Introduced in version 2.0.1-beta.
By default, tokens that mix blended and non-blended characters
get indexed in their entirety. For instance, when both the at-sign and
the exclamation mark are in blend_chars
, "@dude!" will
result in two tokens being indexed: "@dude!" (with all the blended characters)
and "dude" (without any). Therefore a "@dude" query will not
match it.
blend_mode
directive adds flexibility to this indexing
behavior. It takes a comma-separated list of options.
blend_mode = option [, option [, ...]]
option = trim_none | trim_head | trim_tail | trim_both | skip_pure
Options specify token indexing variants. If multiple options are
specified, multiple variants of the same token will be indexed.
Regular keywords (resulting from that token by replacing blended
characters with whitespace) are always indexed.
- trim_none
Index the entire token.
- trim_head
Trim heading blended characters, and index the resulting token.
- trim_tail
Trim trailing blended characters, and index the resulting token.
- trim_both
Trim both heading and trailing blended characters, and index the resulting token.
- skip_pure
Do not index the token if it's purely blended, that is, consists of blended characters only.
Returning to the "@dude!" example above, setting blend_mode = trim_head,
trim_tail
will result in two tokens being indexed, "@dude" and "dude!".
In this particular example, trim_both
would have no effect,
because trimming both blended characters results in "dude" which is already
indexed as a regular keyword. Indexing "@U.S.A." with trim_both
(and assuming that dot is blended too) would result in "U.S.A" being indexed.
Last but not least, skip_pure
enables you to fully ignore
sequences of blended characters only. For example, "one @@@ two" would be
indexed exactly as "one two", and match that as a phrase. That is not the case
by default because a fully blended token gets indexed and offsets the second
keyword position.
Default behavior is to index the entire token, equivalent to
blend_mode = trim_none
.
Example:
blend_mode = trim_tail, skip_pure
RAM chunk size limit.
Optional, default is 128M.
Introduced in version 1.10-beta.
RT index keeps some data in memory (so-called RAM chunk) and
also maintains a number of on-disk indexes (so-called disk chunks).
This directive lets you control the RAM chunk size. Once there's
too much data to keep in RAM, RT index will flush it to disk,
activate a newly created disk chunk, and reset the RAM chunk.
The limit is pretty strict; RT index should never allocate more
memory than it's limited to. The memory is not preallocated either,
hence, specifying 512 MB limit and only inserting 3 MB of data
should result in allocating 3 MB, not 512 MB.
Example:
rt_mem_limit = 512M
Full-text field declaration.
Multi-value, mandatory
Introduced in version 1.10-beta.
Full-text fields to be indexed are declared using rt_field
directive. The names must be unique. The order is preserved, so field values
in INSERT statements without an explicit list of inserted columns will have to be
in the same order as configured.
Example:
rt_field = author
rt_field = title
rt_field = content
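Continuing this example, a hedged SphinxQL sketch of inserting a document
(the index name rt1 and the values are hypothetical; the implicit column order
can always be checked with DESCRIBE rt1):
INSERT INTO rt1 (id, author, title, content) VALUES (1, 'J. Doe', 'Hello', 'Post body text');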
Unsigned integer attribute declaration.
Multi-value (an arbitrary number of attributes is allowed), optional.
Declares an unsigned 32-bit attribute.
Introduced in version 1.10-beta.
Example:
rt_attr_uint = gid
Boolean attribute declaration.
Multi-value (there might be multiple attributes declared), optional.
Declares a 1-bit unsigned integer attribute.
Introduced in version 2.1.2-release.
Example:
rt_attr_bool = available
BIGINT attribute declaration.
Multi-value (an arbitrary number of attributes is allowed), optional.
Declares a signed 64-bit attribute.
Introduced in version 1.10-beta.
Example:
rt_attr_bigint = guid
Floating point attribute declaration.
Multi-value (an arbitrary number of attributes is allowed), optional.
Declares a single precision, 32-bit IEEE 754 format float attribute.
Introduced in version 1.10-beta.
Example:
rt_attr_float = gpa
Multi-valued attribute (MVA) declaration.
Declares the UNSIGNED INTEGER (unsigned 32-bit) MVA attribute.
Multi-value (ie. there may be more than one such attribute declared), optional.
Applies to RT indexes only.
Example:
rt_attr_multi = my_tags
12.2.56. rt_attr_multi_64
Multi-valued attribute (MVA) declaration.
Declares the BIGINT (signed 64-bit) MVA attribute.
Multi-value (ie. there may be more than one such attribute declared), optional.
Applies to RT indexes only.
Example:
rt_attr_multi_64 = my_wide_tags
12.2.57. rt_attr_timestamp
Timestamp attribute declaration.
Multi-value (an arbitrary number of attributes is allowed), optional.
Introduced in version 1.10-beta.
Example:
rt_attr_timestamp = date_added
String attribute declaration.
Multi-value (an arbitrary number of attributes is allowed), optional.
Introduced in version 1.10-beta.
Example:
rt_attr_string = author
JSON attribute declaration.
Multi-value (ie. there may be more than one such attribute declared), optional.
Introduced in version 2.1.1-beta.
Refer to Section 12.1.24, “sql_attr_json” for more details on the JSON attributes.
Example:
rt_attr_json = properties
Agent mirror selection strategy, for load balancing.
Optional, default is random.
Added in 2.1.1-beta.
The strategy used for mirror selection, or in other words, choosing
a specific agent mirror in a distributed
index. Essentially, this directive controls how exactly master does the
load balancing between the configured mirror agent nodes.
As of 2.1.1-beta, the following strategies are implemented:
Simple random balancing
ha_strategy = random
The default balancing mode. Simple linear random distribution among the mirrors.
That is, each mirror is assigned an equal selection probability. Kind of similar
to round-robin (RR), but unlike RR, does not impose a strict selection order.
Adaptive randomized balancing
The default simple random strategy does not take mirror status, error rate,
and, most importantly, actual response latencies into account. So to accommodate
heterogeneous clusters and/or temporary spikes in agent node load, we have
a group of balancing strategies that dynamically adjust the probabilities
based on the actual query latencies observed by the master.
The adaptive strategies based on latency-weighted probabilities
basically work as follows:
latency stats are accumulated, in blocks of ha_period_karma seconds;
once per karma period, latency-weighted probabilities get recomputed;
once per request (including ping requests), "dead or alive" flag is adjusted.
Currently (as of 2.1.1-beta), we begin with equal probabilities (or percentages,
for brevity), and on every step, scale them by the inverse of the latencies observed
during the last "karma" period, and then renormalize them. For example, if during
the first 60 seconds after the master startup 4 mirrors had latencies of
10, 5, 30, and 3 msec/query respectively, the first adjustment step
would go as follows:
initial percentages: 0.25, 0.25, 0.25, 0.25;
observed latencies: 10 ms, 5 ms, 30 ms, 3 ms;
inverse latencies: 0.1, 0.2, 0.0333, 0.333;
scaled percentages: 0.025, 0.05, 0.008333, 0.0833;
renormalized percentages: 0.15, 0.30, 0.05, 0.50.
Meaning that the 1st mirror would have a 15% chance of being chosen during
the next karma period, the 2nd one a 30% chance, the 3rd one (slowest at 30 ms)
only a 5% chance, and the 4th and the fastest one (at 3 ms) a 50% chance.
Then, after that period, the second adjustment step would update those chances
again, and so on.
The rationale here is, once the observed latencies stabilize,
the latency weighted probabilities stabilize as well. So all these
adjustment iterations are supposed to converge at a point where the average
latencies are (roughly) equal over all mirrors.
ha_strategy = nodeads
Latency-weighted probabilities, but dead mirrors are excluded from
the selection. "Dead" mirror is defined as a mirror that resulted
in multiple hard errors (eg. network failure, or no answer, etc) in a row.
ha_strategy = noerrors
Latency-weighted probabilities, but mirrors with worse errors/success ratio
are excluded from the selection.
Round-robin balancing
ha_strategy = roundrobin
Simple round-robin selection, that is, selecting the 1st mirror
in the list, then the 2nd one, then the 3rd one, etc, and then repeating
the process once the last mirror in the list is reached. Unlike with
the randomized strategies, RR imposes a strict querying order (1, 2, 3, ..,
N-1, N, 1, 2, 3, ... and so on) and guarantees that
no two subsequent queries will be sent to the same mirror.
12.2.61. bigram_freq_words
A list of keywords considered "frequent" when indexing bigrams.
Optional, default is empty.
Added in 2.1.1-beta.
Bigram indexing is a feature to accelerate phrase searches.
When indexing, it stores a document list for either all or some
of the adjacent word pairs into the index. Such a list can then be used
at searching time to significantly accelerate phrase or sub-phrase
matching.
Some of the bigram indexing modes (see Section 12.2.62, “bigram_index”)
require defining a list of frequent keywords. These are not to be
confused with stopwords! Stopwords are completely eliminated during both indexing
and searching. Frequent keywords are only used by bigrams to determine whether
to index a current word pair or not.
bigram_freq_words
lets you define a list of such keywords.
Example:
bigram_freq_words = the, a, you, i
Bigram indexing mode.
Optional, default is none.
Added in 2.1.1-beta.
Bigram indexing is a feature to accelerate phrase searches.
When indexing, it stores a document list for either all or some
of the adjacent word pairs into the index. Such a list can then be used
at searching time to significantly accelerate phrase or sub-phrase
matching.
bigram_index
controls the selection of specific word pairs.
The known modes are:
all
, index every single word pair.
(NB: probably totally not worth it even on a moderately sized index,
but added anyway for the sake of completeness.)
first_freq
, only index word pairs
where the first word is in a list of frequent words
(see Section 12.2.61, “bigram_freq_words”). For example, with
bigram_freq_words = the, in, i, a
, indexing
"alone in the dark" text will result in "in the" and "the dark" pairs
being stored as bigrams, because they begin with a frequent keyword
(either "in" or "the" respectively), but "alone in" would not
be indexed, because "in" is a second word in that pair.
both_freq
, only index word pairs where
both words are frequent. Continuing with the same example, in this mode
indexing "alone in the dark" would only store "in the" (the very worst
of them all from searching perspective) as a bigram, but none of the
other word pairs.
For most usecases, both_freq
would be the best mode, but
your mileage may vary.
Example:
bigram_index = both_freq
12.2.63. index_field_lengths
Enables computing and storing of field lengths (both per-document and
average per-index values) into the index.
Optional, default is 0 (do not compute and store).
Added in 2.1.1-beta.
When index_field_lengths
is set to 1, indexer
will 1) create a respective length attribute for every full-text field,
sharing the same name but with _len suffix; 2) compute a field length (counted in keywords) for
every document and store it into the respective attribute; 3) compute the per-index
averages. The lengths attributes will have a special TOKENCOUNT type, but their
values are in fact regular 32-bit integers, and their values are generally
accessible.
BM25A() and BM25F() functions in the expression ranker are based
on these lengths and require index_field_lengths
to be enabled.
Historically, Sphinx used a simplified, stripped-down variant of BM25 that,
unlike the complete function, did not account for document length.
(We later realized that it should have been called BM15 from the start.)
Starting with 2.1.1-beta, we added support for both a complete variant of BM25,
and its extension towards multiple fields, called BM25F. They require
per-document length and per-field lengths, respectively. Hence the additional
directive.
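With field lengths enabled, those functions become available in the expression
ranker; a hedged SphinxQL sketch (the index name, field name, and constants are
purely illustrative):
SELECT id, WEIGHT() FROM myindex WHERE MATCH('test it')
OPTION ranker=expr('sum(lcs*user_weight)*1000 + bm25f(1.2, 0.7, {title=2})');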
Example:
index_field_lengths = 1
Regular expressions (regexps) to filter the fields and queries with.
Optional, multi-value, default is an empty list of regexps.
Added in 2.1.1-beta.
In certain applications (like product search) there can be
many different ways to call a model, or a product, or a property,
and so on. For instance, 'iphone 3gs' and 'iphone 3 gs'
(or even 'iphone3 gs') are very likely to mean the same
product. Or, for a more tricky example, '13-inch', '13 inch',
'13"', and '13in' in a laptop screen size descriptions do mean
the same.
Regexps provide you with a mechanism to specify a number of rules
specific to your application to handle such cases. In the first
'iphone 3gs' example, you could possibly get away with a wordforms
file tailored to handle a handful of iPhone models. However even
in a comparatively simple second '13-inch' example there are just
way too many individual forms and you are better off specifying
rules that would normalize both '13-inch' and '13in' to something
identical.
Regular expressions listed in regexp_filter
are
applied in the order they are listed. That happens at the earliest
stage possible, before any other processing, even before tokenization.
That is, regexps are applied to the raw source fields when indexing,
and to the raw search query text when searching.
We use the RE2 engine
to implement regexps. So when building from the source, the library must be
installed in the system and Sphinx must be configured and built with the
--with-re2
switch. Binary packages should come with RE2
builtin.
Example:
# index '13-inch' as '13inch'
regexp_filter = \b(\d+)\" => \1inch
# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color
12.2.65. stopwords_unstemmed
Whether to apply stopwords before or after stemming.
Optional, default is 0 (apply stopword filter after stemming).
Added in 2.1.1-beta.
By default, stopwords are stemmed themselves, and applied to
tokens after stemming (or any other morphology
processing). In other words, by default, a token is stopped when
stem(token) == stem(stopword). That can lead to unexpected results
when a token gets (erroneously) stemmed to a stopped root. For example,
'Andes' gets stemmed to 'and' by our current stemmer implementation,
so when 'and' is a stopword, 'Andes' is also stopped.
stopwords_unstemmed directive fixes that issue. When it's enabled,
stopwords are applied before stemming (and therefore to the original
word forms), and the tokens are stopped when token == stopword.
Example:
stopwords_unstemmed = 1
The path to a file with global (cluster-wide) keyword IDFs.
Optional, default is empty (use local IDFs).
Added in 2.1.1-beta.
On a multi-index cluster, per-keyword frequencies are quite
likely to differ across different indexes. That means that when
the ranking function uses TF-IDF based values, such as BM25 family
of factors, the results might be ranked slightly differently
depending on which cluster node they reside on.
The easiest way to fix that issue is to create and utilize
a global frequency dictionary, or a global IDF file for short.
This directive lets you specify the location of that file.
It is suggested (but not required) to use a .idf extension.
When the IDF file is specified for a given index and
OPTION global_idf is set to 1, the engine will use the keyword
frequencies and collection documents count from the global_idf file,
rather than just the local index. That way, IDFs and the values
that depend on them will stay consistent across the cluster.
IDF files can be shared across multiple indexes. Only a single
copy of an IDF file will be loaded by searchd
,
even when many indexes refer to that file. Should the contents of
an IDF file change, the new contents can be loaded with a SIGHUP.
You can build an .idf file using indextool
utility, by dumping dictionaries using --dumpdict
switch
first, then converting those to .idf format using --buildidf
,
then merging all .idf files across the cluster using --mergeidf
.
Refer to Section 7.4, “indextool
command reference” for more information.
Example:
global_idf = /usr/local/sphinx/var/global.idf
RLP context configuration file. Mandatory if RLP is used.
Added in 2.2.1-beta.
Example:
rlp_context = /home/myuser/RLP/rlp-context.xml
Allows for fine-grained control over how attributes are loaded into memory
when using indexes with external storage. It is possible (since
version 2.2.1-beta) to keep attributes on disk. The daemon then
maps them to memory, and the OS loads small chunks of data on demand. This
allows use of docinfo = extern instead of docinfo = inline, but still
leaves plenty of free memory for cases when you have large collections
of pooled attributes (string/JSON/MVA) or when you're using many indexes
per daemon that should not consume memory. It is not possible to update
attributes left on disk when this option is enabled, and the constraint
of 4 GB of entries per pool is still in effect.
Note that this option also affects RT indexes. When it is enabled, all attribute updates
will be disabled, and all disk chunks of RT indexes will behave as described above. However,
inserting and deleting documents from RT indexes is still possible with ondisk_attrs enabled.
Possible values:
-
0 - disabled and default value, all attributes are loaded in memory
(the normal behaviour of docinfo = extern)
-
1 - all attributes stay on disk. Daemon loads no files (spa, spm, sps).
This is the most memory conserving mode; however, it is also the slowest,
as the whole document ID list and block index are not loaded.
-
pool - only pooled attributes stay on disk. Pooled attributes are string,
MVA, and JSON attributes (sps, spm files). Scalar attributes stored in
docinfo (spa file) load as usual.
This option does not affect indexing in any way, it only requires daemon
restart.
Example:
ondisk_attrs = pool #keep pooled attributes on disk
12.4. searchd
program configuration options
This setting lets you specify IP address and port, or Unix-domain
socket path, that searchd
will listen on.
Introduced in version 0.9.9-rc1.
The informal grammar for listen
setting is:
listen = ( address ":" port | port | path ) [ ":" protocol ]
I.e. you can specify either an IP address (or hostname) and port
number, or just a port number, or Unix socket path. If you specify
port number but not the address, searchd
will listen on
all network interfaces. Unix path is identified by a leading slash.
Starting with version 0.9.9-rc2, you can also specify a protocol
handler (listener) to be used for connections on this socket.
Supported protocol values are 'sphinx' (Sphinx 0.9.x API protocol)
and 'mysql41' (MySQL protocol used since 4.1 up to at least 5.1).
More details on MySQL protocol support can be found in
Section 5.10, “MySQL protocol support and SphinxQL” section.
Examples:
listen = localhost
listen = localhost:5000
listen = 192.168.0.1:5000
listen = /var/run/sphinx.s
listen = 9312
listen = localhost:9306:mysql41
There can be multiple listen directives, searchd
will
listen for client connections on all specified ports and sockets. If
no listen
directives are found then the server will listen
on all available interfaces using the default SphinxAPI port 9312.
Starting with 1.10-beta, it will also listen on default SphinxQL
port 9306. Both port numbers are assigned by IANA (see
http://www.iana.org/assignments/port-numbers
for details) and should therefore be available.
Unix-domain sockets are not supported on Windows.
Log file name.
Optional, default is 'searchd.log'.
All searchd
run time events will be logged in this file.
You can also use 'syslog' as the file name. In this case the events will be sent to the syslog daemon.
To use the syslog option, Sphinx must be configured with the '--with-syslog' option at build time.
Example:
log = /var/log/searchd.log
Query log file name.
Optional, default is empty (do not log queries).
All search queries will be logged in this file. The format is described in Section 5.9, “searchd
query log formats”.
In case of the 'plain' format, you can use 'syslog' as the path to the log file.
In this case all search queries will be sent to the syslog daemon with LOG_INFO priority,
prefixed with '[query]' instead of a timestamp.
To use the syslog option, Sphinx must be configured with the '--with-syslog' option at build time.
Example:
query_log = /var/log/query.log
Query log format.
Optional, allowed values are 'plain' and 'sphinxql', default is 'plain'.
Introduced in version 2.0.1-beta.
Starting with version 2.0.1-beta, two different log formats are supported.
The default one logs queries in a custom text format. The new one logs
valid SphinxQL statements. This directive lets you switch between the two
formats at search daemon startup. The log format can also be altered
on the fly, using SET GLOBAL query_log_format=sphinxql
syntax.
Refer to Section 5.9, “searchd
query log formats” for more discussion and format
details.
Example:
query_log_format = sphinxql
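As a sketch of the on-the-fly switch mentioned above, the following SphinxQL statements (issued over a mysql41 connection) change the active format and then revert it:
SET GLOBAL query_log_format = sphinxql;
SET GLOBAL query_log_format = plain;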
Network client request read timeout, in seconds.
Optional, default is 5 seconds.
searchd will forcibly close the client connections which fail to send a query within this timeout.
Example:
read_timeout = 1
Maximum time to wait between requests (in seconds) when using
persistent connections. Optional, default is five minutes.
Example:
client_timeout = 3600
Maximum amount of children to fork (or in other words, concurrent searches to run in parallel).
Optional, default is 0 (unlimited).
Useful to control server load. There will be no more than this many concurrent searches running at any time. When the limit is reached, additional incoming clients are dismissed with a temporary failure (SEARCHD_RETRY) status code and a message stating that the server is maxed out.
Example:
max_children = 10
searchd process ID file name.
Mandatory.
The PID file will be re-created (and locked) on startup. It will contain the head daemon process ID while the daemon is running, and it will be unlinked on daemon shutdown. It's mandatory because Sphinx uses it internally for a number of things: to check whether there already is a running instance of searchd; to stop searchd; and to notify it that it should rotate the indexes. It can also be used for various external automation scripts.
Example:
pid_file = /var/run/searchd.pid
Prevents searchd stalls while rotating indexes with huge amounts of data to precache.
Optional, default is 1 (enable seamless rotation).
Indexes may contain some data that needs to be precached in RAM. At the moment, .spa, .spi and .spm files are fully precached (they contain attribute data, MVA data, and keyword index, respectively.)
Without seamless rotate, rotating an index tries to use as little RAM
as possible and works as follows:
1. new queries are temporarily rejected (with a "retry" error code);
2. searchd waits for all currently running queries to finish;
3. old index is deallocated and its files are renamed;
4. new index files are renamed and required RAM is allocated;
5. new index attribute and dictionary data is preloaded to RAM;
6. searchd resumes serving queries from the new index.
However, if there's a lot of attribute or dictionary data, then the preloading step could take a noticeable amount of time - up to several minutes when preloading 1-5+ GB files.
With seamless rotate enabled, rotation works as follows:
1. new index RAM storage is allocated;
2. new index attribute and dictionary data is asynchronously preloaded to RAM;
3. on success, the old index is deallocated and both indexes' files are renamed;
4. on failure, the new index is deallocated;
5. at any given moment, queries are served either from the old or the new index copy.
Seamless rotate comes at the cost of higher peak memory usage during the rotation (because both old and new copies of .spa/.spi/.spm data need to be in RAM while preloading the new copy). Average usage stays the same.
Example:
seamless_rotate = 1
Whether to forcibly preopen all indexes on startup.
Optional, default is 1 (preopen everything).
Starting with 2.0.1-beta, the default value for this option is 1 (forcibly preopen all indexes). In prior versions, it used to be 0 (use per-index settings).
When set to 1, this directive overrides and enforces preopen on all indexes. They will be preopened regardless of the per-index preopen setting. When set to 0, per-index settings can take effect. (And they default to 0.)
Pre-opened indexes avoid races between search queries and rotations that can cause queries to fail occasionally. They also make searchd use more file handles. In most scenarios it's therefore preferred and recommended to preopen indexes.
Example:
preopen_indexes = 1
Whether to unlink .old index copies on successful rotation.
Optional, default is 1 (do unlink).
Example:
unlink_old = 0
12.4.12. attr_flush_period
When calling UpdateAttributes() to update document attributes in real-time, changes are first written to the in-memory copy of attributes (docinfo must be set to extern). Then, once searchd shuts down normally (via SIGTERM being sent), the changes are written to disk.
Introduced in version 0.9.9-rc1.
Starting with 0.9.9-rc1, it is possible to tell searchd to periodically write these changes back to disk, to avoid them being lost. The time between those intervals is set with attr_flush_period, in seconds.
It defaults to 0, which disables the periodic flushing, but flushing will still occur at normal shut-down.
Example:
attr_flush_period = 900 # persist updates to disk every 15 minutes
Maximum allowed network packet size.
Limits both query packets from clients, and response packets from remote agents in a distributed environment.
Only used for internal sanity checks, does not directly affect RAM use or performance.
Optional, default is 8M.
Introduced in version 0.9.9-rc1.
Example:
max_packet_size = 32M
12.4.14. mva_updates_pool
Shared pool size for in-memory MVA updates storage.
Optional, default size is 1M.
Introduced in version 0.9.9-rc1.
This setting controls the size of the shared storage pool for updated MVA values. Specifying 0 for the size disables MVA updates altogether. Once the pool size limit is hit, MVA update attempts will result in an error. However, updates on regular (scalar) attributes will still work. Due to internal technical difficulties, it is currently not possible to store (flush) any updates on indexes where MVAs were updated; though this might be implemented in the future. In the meantime, MVA updates are intended to be used as a measure to quickly catch up with the latest changes in the database until the next index rebuild; not as a persistent storage mechanism.
Example:
mva_updates_pool = 16M
Maximum allowed per-query filter count.
Only used for internal sanity checks, does not directly affect RAM use or performance.
Optional, default is 256.
Introduced in version 0.9.9-rc1.
Example:
max_filters = 1024
12.4.16. max_filter_values
Maximum allowed per-filter values count.
Only used for internal sanity checks, does not directly affect RAM use or performance.
Optional, default is 4096.
Introduced in version 0.9.9-rc1.
Example:
max_filter_values = 16384
TCP listen backlog.
Optional, default is 5.
Windows builds currently (as of 0.9.9) can only process the requests one by one. Concurrent requests will be enqueued by the TCP stack at the OS level, and requests that can not be enqueued will immediately fail with a "connection refused" message. The listen_backlog directive controls the length of the connection queue. Non-Windows builds should work fine with the default value.
Example:
listen_backlog = 20
Per-keyword read buffer size.
Optional, default is 256K.
For every keyword occurrence in every search query, there are
two associated read buffers (one for document list and one for
hit list). This setting lets you control their sizes, increasing
per-query RAM use, but possibly decreasing IO time.
Example:
read_buffer = 1M
Unhinted read size.
Optional, default is 32K.
When querying, some reads know in advance exactly how much data is there to be read, but some currently do not. Most prominently, hit list size is not currently known in advance. This setting lets you control how much data to read in such cases. It will impact hit list IO time, reducing it for lists larger than the unhinted read size, but raising it for smaller lists. It will not affect RAM use because the read buffer will already be allocated. So it should not be greater than read_buffer.
Example:
read_unhinted = 32K
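To illustrate the constraint above (the particular sizes are arbitrary), a configuration that raises both buffers while keeping read_unhinted within read_buffer could look like:
read_buffer = 1M
read_unhinted = 256K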
12.4.20. max_batch_queries
Limits the number of queries per batch.
Optional, default is 32.
Makes searchd perform a sanity check of the number of queries submitted in a single batch when using multi-queries.
Set it to 0 to skip the check.
Example:
max_batch_queries = 256
12.4.21. subtree_docs_cache
Max common subtree document cache size, per-query.
Optional, default is 0 (disabled).
Limits RAM usage of a common subtree optimizer (see Section 5.11, “Multi-queries”).
At most this much RAM will be spent to cache document entries per each query.
Setting the limit to 0 disables the optimizer.
Example:
subtree_docs_cache = 8M
12.4.22. subtree_hits_cache
Max common subtree hit cache size, per-query.
Optional, default is 0 (disabled).
Limits RAM usage of a common subtree optimizer (see Section 5.11, “Multi-queries”).
At most this much RAM will be spent to cache keyword occurrences (hits) per each query.
Setting the limit to 0 disables the optimizer.
Example:
subtree_hits_cache = 16M
Multi-processing mode (MPM).
Optional; allowed values are none, fork, prefork, and threads.
Default is threads.
Introduced in version 1.10-beta.
Lets you choose how searchd processes multiple concurrent requests. The possible values are:
- none
All requests will be handled serially, one-by-one. Prior to 1.x, this was the only mode available on Windows.
- fork
A new child process will be forked to handle every incoming request.
- prefork
On startup, searchd will pre-fork a number of worker processes, and pass the incoming requests to one of those children.
- threads
A new thread will be created to handle every incoming request. This is the only mode compatible with the RT indexing backend. This is the default value.
Historically, searchd used a fork-based model, which generally performs OK but spends a noticeable amount of CPU in the fork() system call when there's a high rate of (tiny) requests per second. Prefork mode was implemented to alleviate that; with prefork, worker processes are basically only created on startup and re-created on index rotation, somewhat reducing fork() call pressure.
Threads mode was implemented along with RT backend and is required
to use RT indexes. (Regular disk-based indexes work in all the
available modes.)
Example:
workers = threads
Max local worker threads to use for parallelizable requests (searching a distributed index; building a batch of snippets).
Optional, default is 0, which means to disable in-request parallelism.
Introduced in version 1.10-beta.
A distributed index can include several local indexes. dist_threads lets you easily utilize multiple CPUs/cores for that (the previously existing alternative was to specify the indexes as remote agents, pointing searchd to itself and paying some network overheads).
When set to a value N greater than 1, this directive will create up to N threads for every query, and schedule the specific searches within these threads. For example, if there are 7 local indexes to search and dist_threads is set to 2, then 2 parallel threads would be created: one that sequentially searches 4 indexes, and another one that searches the other 3 indexes.
In case of a CPU bound workload, setting dist_threads to 1x the number of cores is advised (creating more threads than cores will not improve query time). In case of a mixed CPU/disk bound workload it might sometimes make sense to use more (so that all cores can be utilized even when there are threads that wait for I/O completion).
Note that dist_threads does not require the threads MPM. You can perfectly well use it with the fork or prefork MPMs too.
Starting with version 2.0.1-beta, building a batch of snippets with the load_files flag enabled can also be parallelized. Up to dist_threads threads are created to process those files. That speeds up snippet extraction when the total amount of document data to process is significant (hundreds of megabytes).
Example:
index dist_test
{
type = distributed
local = chunk1
local = chunk2
local = chunk3
local = chunk4
}
# ...
dist_threads = 4
Binary log (aka transaction log) files path.
Optional, default is build-time configured data directory.
Introduced in version 1.10-beta.
Binary logs are used for crash recovery of RT index data, and also of attribute updates to plain disk indexes that would otherwise only be stored in RAM until flush. When logging is enabled, every transaction COMMIT-ted into an RT index gets written into a log file. Logs are then automatically replayed on startup after an unclean shutdown, recovering the logged changes.
The binlog_path directive specifies the binary log files location. It should contain just the path; searchd will create and unlink multiple binlog.* files in that path as necessary (binlog data, metadata, lock files, etc).
An empty value disables binary logging. That improves performance, but puts RT index data at risk.
WARNING! It is strongly recommended to always explicitly define the 'binlog_path' option in your config. Otherwise, the default path, which in most cases is the same as the working folder, may point to a folder with no write access (for example, /usr/local/var/data). In this case, searchd will not start at all.
Example:
binlog_path = # disable logging
binlog_path = /var/data # /var/data/binlog.001 etc will be created
Binary log transaction flush/sync mode.
Optional, default is 2 (flush every transaction, sync every second).
Introduced in version 1.10-beta.
This directive controls how frequently the binary log will be flushed to the OS and synced to disk. Three modes are supported:
- 0: flush and sync every second. Best performance, but up to 1 second worth of committed transactions can be lost on either a daemon crash or an OS/hardware crash.
- 1: flush and sync every transaction. Worst performance, but every committed transaction's data is guaranteed to be saved.
- 2: flush every transaction, sync every second. Good performance, and every committed transaction is guaranteed to be saved in case of a daemon crash. However, in case of an OS/hardware crash up to 1 second worth of committed transactions can be lost.
For those familiar with MySQL and InnoDB, this directive is entirely similar to innodb_flush_log_at_trx_commit. In most cases, the default hybrid mode 2 provides a nice balance of speed and safety, with full RT index data protection against daemon crashes, and some protection against hardware ones.
Example:
binlog_flush = 1 # ultimate safety, low speed
12.4.27. binlog_max_log_size
Maximum binary log file size.
Optional, default is 0 (do not reopen binlog file based on size).
Introduced in version 1.10-beta.
A new binlog file will be forcibly opened once the current binlog file
reaches this limit. This achieves a finer granularity of logs and can yield
more efficient binlog disk usage under certain borderline workloads.
Example:
binlog_max_log_size = 16M
12.4.28. snippets_file_prefix
A prefix to prepend to the local file names when generating snippets.
Optional, default is empty.
Introduced in version 2.1.1-beta.
This prefix can be used in distributed snippets generation along with the load_files or load_files_scattered options.
Note that this is a prefix, and not a path! Meaning that if a prefix is set to "server1" and the request refers to "file23", searchd will attempt to open "server1file23" (all of that without quotes). So if you need it to be a path, you have to include the trailing slash.
Note also that this is a local option; it does not affect the agents in any way. So you can safely set a prefix on a master server. The requests routed to the agents will not be affected by the master's setting. They will however be affected by the agent's own settings.
This might be useful, for instance, when the document storage locations
(be those local storage or NAS mountpoints) are inconsistent across the servers.
Example:
snippets_file_prefix = /mnt/common/server1/
12.4.29. collation_server
Default server collation.
Optional, default is libc_ci.
Introduced in version 2.0.1-beta.
Specifies the default collation used for incoming requests. The collation can be overridden on a per-query basis. Refer to Section 5.12, “Collations” for the list of available collations and other details.
Example:
collation_server = utf8_ci
12.4.30. collation_libc_locale
Server libc locale.
Optional, default is C.
Introduced in version 2.0.1-beta.
Specifies the libc locale, affecting the libc-based collations. Refer to Section 5.12, “Collations” for details.
Example:
collation_libc_locale = fr_FR
12.4.31. mysql_version_string
A server version string to return via MySQL protocol.
Optional, default is empty (return Sphinx version).
Introduced in version 2.0.1-beta.
Several picky MySQL client libraries depend on a particular version number format used by MySQL, and moreover, sometimes choose a different execution path based on the reported version number (rather than the indicated capabilities flags). For instance, Python MySQLdb 1.2.2 throws an exception when the version number is not in X.Y.ZZ format; MySQL .NET connector 6.3.x fails internally on version numbers 1.x along with a certain combination of flags, etc. To work around that, you can use the mysql_version_string directive and have searchd report a different version to clients connecting over the MySQL protocol. (By default, it reports its own version.)
Example:
mysql_version_string = 5.0.37
RT indexes RAM chunk flush check period, in seconds.
Optional, default is 10 hours.
Introduced in version 2.0.1-beta.
Actively updated RT indexes that nevertheless fit fully in their RAM chunks can result in ever-growing binlogs, impacting disk use and crash recovery time. With this directive the search daemon performs periodic flush checks, and eligible RAM chunks can get saved, enabling subsequent binlog cleanup. See Section 4.4, “Binary logging” for more details.
Example:
rt_flush_period = 3600 # 1 hour
Per-thread stack size.
Optional, default is 1M.
Introduced in version 2.0.1-beta.
In the workers = threads mode, every request is processed with a separate thread that needs its own stack space. By default, 1M per thread is allocated for the stack. However, extremely complex search requests might eventually exhaust the default stack and require more. For instance, a query that matches thousands of keywords (either directly or through term expansion) can eventually run out of stack. Previously, that resulted in crashes. Starting with 2.0.1-beta, searchd attempts to estimate the expected stack use, and blocks the potentially dangerous queries. To process such queries, you can either increase the thread stack size using the thread_stack directive, or switch to a different workers setting if that is possible.
A query with N levels of nesting is estimated to require approximately 30+0.16*N KB of stack, meaning that the default 64K is enough for queries with up to 250 levels, 150K for up to 700 levels, etc. If the stack size limit is not met, searchd fails the query and reports the required stack size in the error message.
Example:
thread_stack = 256K
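As a rough illustration of the estimate above, a hypothetical query nested 1000 levels deep would need approximately 30 + 0.16*1000 = 190 KB of stack, so a setting such as thread_stack = 256K would leave comfortable headroom for it.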
The maximum number of expanded keywords for a single wildcard.
Optional, default is 0 (no limit).
Introduced in version 2.0.1-beta.
When doing substring searches against indexes built with dict = keywords enabled, a single wildcard may potentially result in thousands and even millions of matched keywords (think of matching 'a*' against the entire Oxford dictionary). This directive lets you limit the impact of such expansions. Setting expansion_limit = N restricts expansions to no more than N of the most frequent matching keywords (per each wildcard in the query).
Example:
expansion_limit = 16
Threaded server watchdog.
Optional, default is 1 (watchdog enabled).
Introduced in version 2.0.1-beta.
A crashed query in the threads multi-processing mode (workers = threads) can take down the entire server. With the watchdog feature enabled, searchd additionally keeps a separate lightweight process that monitors the main server process, and automatically restarts the latter in case of abnormal termination. Watchdog is enabled by default.
Example:
watchdog = 0 # disable watchdog
12.4.36. prefork_rotation_throttle
Delay between restarting preforked children on index rotation, in milliseconds.
Optional, default is 0 (no delay).
Introduced in version 2.0.2-beta.
When running in workers = prefork mode, every index rotation needs to restart all children to propagate the newly loaded index data changes. Restarting all of them at once might put excessive strain on CPU and/or network connections. (For instance, when the application keeps a bunch of open persistent connections to different children, and all those children restart.) Those bursts can be throttled down with the prefork_rotation_throttle directive. Note that the children will be restarted sequentially, and thus "old" results might persist for a few more seconds. For instance, if prefork_rotation_throttle is set to 50 (milliseconds), and there are 30 children, then the last one would only actually be restarted 1.5 seconds (50*30=1500 milliseconds) after the "rotation finished" message in the searchd event log.
Example:
prefork_rotation_throttle = 50 # throttle children restarts by 50 msec each
Path to a file where current SphinxQL state will be serialized.
Available since version 2.1.1-beta.
On daemon startup, this file gets replayed. On eligible state changes (e.g. SET GLOBAL), this file gets rewritten automatically. This can prevent a hard-to-diagnose problem: if you load UDF functions but Sphinx crashes, then when it gets (automatically) restarted, your UDFs and global variables would no longer be available; using persistent state helps ensure a graceful recovery with no such surprises.
Example:
sphinxql_state = uservars.sql
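As an illustration only (the exact contents are hypothetical and maintained by searchd itself), the state file is essentially a list of SphinxQL statements that get replayed on startup, for example:
SET GLOBAL @banned_ids = (101, 102, 103);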
12.4.38. ha_ping_interval
Interval between agent mirror pings, in milliseconds.
Optional, default is 1000.
Added in 2.1.1-beta.
For a distributed index with agent mirrors in it (see more in ???), the master sends all mirrors a ping command during the idle periods. This is to track the current agent status (alive or dead, network roundtrip, etc). The interval between such pings is defined by this directive.
To disable pings, set ha_ping_interval to 0.
Example:
ha_ping_interval = 0
Agent mirror statistics window size, in seconds.
Optional, default is 60.
Added in 2.1.1-beta.
For a distributed index with agent mirrors in it (see more in ???), the master tracks several different per-mirror counters. These counters are then used for failover and balancing. (The master picks the best mirror to use based on the counters.) Counters are accumulated in blocks of ha_period_karma seconds.
After beginning a new block, the master may still use the accumulated values from the previous one, until the new one is half full. Thus, any previous history stops affecting the mirror choice after 1.5 times ha_period_karma seconds at most.
Even though at most 2 blocks are used for mirror selection, up to 15 of the last blocks are actually stored, for instrumentation purposes. They can be inspected using the SHOW AGENT STATUS statement.
Example:
ha_period_karma = 120
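To illustrate the windowing described above with the example value: with ha_period_karma = 120, each counter block covers 120 seconds, so any given sample stops influencing mirror selection after at most 1.5 * 120 = 180 seconds, while up to 15 blocks (30 minutes of history) remain available via SHOW AGENT STATUS.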
12.4.40. persistent_connections_limit
The maximum number of simultaneous persistent connections to remote persistent agents. Each time we connect to an agent defined under 'agent_persistent', we try to reuse an existing connection (if any), or connect and save the connection for future use. However, we can't hold an unlimited number of such persistent connections, since each one holds a worker on the agent side (and eventually we'll receive the 'maxed out' error when all of them are busy). This directive limits the number. It affects the number of connections to each agent host, across all distributed indexes.
It is reasonable to set the value equal to or less than the max_children option of the agents.
Example:
persistent_connections_limit = 29 # assume that each host of agents has max_children = 30 (or 29).
A maximum number of I/O operations (per second) that the RT chunks merge thread is allowed to start.
Optional, default is 0 (no limit). Added in 2.1.1-beta.
This directive lets you throttle down the I/O impact arising from the OPTIMIZE statements. It is guaranteed that all RT optimization activity will not generate more disk iops (I/Os per second) than the configured limit. Modern SATA drives can perform up to around 100 I/O operations per second, and limiting rt_merge_iops can reduce search performance degradation caused by merging.
Example:
rt_merge_iops = 40
12.4.42. rt_merge_maxiosize
A maximum size of an I/O operation that the RT chunks merge thread is allowed to start.
Optional, default is 0 (no limit).
Added in 2.1.1-beta.
This directive lets you throttle down the I/O impact arising from the OPTIMIZE statements. I/Os bigger than this limit will be broken down into 2 or more I/Os, which will then be accounted as separate I/Os with regards to the rt_merge_iops limit. Thus, it is guaranteed that all the optimization activity will not generate more than (rt_merge_iops * rt_merge_maxiosize) bytes of disk I/O per second.
Example:
rt_merge_maxiosize = 1M
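Putting the two throttling directives together: with the example values of rt_merge_iops = 40 and rt_merge_maxiosize = 1M, OPTIMIZE merge activity is capped at roughly 40 * 1M = 40 MB of disk I/O per second.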
12.4.43. predicted_time_costs
Costs for the query time prediction model, in nanoseconds.
Optional, default is "doc=64, hit=48, skip=2048, match=64" (without the quotes).
Added in 2.1.1-beta.
Terminating queries before completion based on their execution time (via either the SetMaxQueryTime() API call, or the SELECT ... OPTION max_query_time SphinxQL statement) is a nice safety net, but it comes with an inborn drawback: non-deterministic (unstable) results. That is, if you repeat the very same (complex) search query with a time limit several times, the time limit will get hit at different stages, and you will get different result sets.
Starting with 2.1.1-beta, there is a new option, SELECT ... OPTION max_predicted_time, that lets you limit the query time and get stable, repeatable results. Instead of regularly checking the actual current time while evaluating the query, which is non-deterministic, it predicts the current running time using a simple linear model:
predicted_time =
doc_cost * processed_documents +
hit_cost * processed_hits +
skip_cost * skiplist_jumps +
match_cost * found_matches
The query is then terminated early when the predicted_time reaches a given limit.
Of course, this is not a hard limit on the actual time spent (it is, however,
a hard limit on the amount of processing work done), and
a simple linear model is in no way an ideally precise one. So the wall clock time
may be either below or over the target limit. However,
the error margins are quite acceptable: for instance, in our experiments with
a 100 msec target limit the majority of the test queries fell into a 95 to 105 msec
range, and all of the queries were in a 80 to 120 msec range.
Also, as a nice side effect, using the modeled query time instead of measuring actual run time results in somewhat fewer gettimeofday() calls, too.
No two server makes and models are identical, so the predicted_time_costs directive lets you configure the costs for the model above. For convenience, they are integers, counted in nanoseconds. (The limit in max_predicted_time is counted in milliseconds, and having to specify cost values as 0.000128 ms instead of 128 ns is somewhat more error prone.) It is not necessary to specify all 4 costs at once, as any missing ones will take the default values. However, we strongly suggest specifying all of them, for readability.
Example:
predicted_time_costs = doc=128, hit=96, skip=4096, match=128
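As a worked illustration of the model (using the default costs and made-up counters): a query that processes 1,000,000 documents and 2,000,000 hits, performs 10,000 skiplist jumps, and collects 100,000 matches would be predicted to take 64*1,000,000 + 48*2,000,000 + 2048*10,000 + 64*100,000 = 186,880,000 ns, i.e. roughly 187 msec, and would therefore be terminated early under OPTION max_predicted_time=100.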
12.4.44. shutdown_timeout
searchd --stopwait wait time, in seconds.
Optional, default is 3 seconds.
Added in 2.2.1-beta.
When you run searchd --stopwait, the daemon needs to perform some activities before stopping, such as finishing queries, flushing the RT RAM chunk, flushing attributes, and updating the binlog. This requires some time. searchd --stopwait will wait up to shutdown_timeout seconds for the daemon to finish its jobs. The suitable time depends on your index size and load.
Example:
shutdown_timeout = 5 # wait for up to 5 seconds
12.4.45. ondisk_attrs_default
Instance-wide default for the ondisk_attrs directive. Optional, default is 0 (all attributes are loaded in memory). This directive lets you specify the default value of ondisk_attrs for all indexes served by this copy of searchd. Per-index directives take precedence, and will override this instance-wide default value, allowing for fine-grained control.
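For instance, to keep pooled attributes on disk by default for every index (a sketch reusing the 'pool' value documented for ondisk_attrs above):
ondisk_attrs_default = pool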
12.4.46. query_log_min_msec
Minimum query execution time (in milliseconds) for the query to be written to the query log.
Optional, default is 0 (all queries are written to the query log). This directive specifies that only queries with execution times exceeding the specified limit will be logged.
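For instance, to log only the queries that run longer than one second (an illustrative threshold):
query_log_min_msec = 1000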
12.4.47. agent_connect_timeout
Instance-wide default for the agent_connect_timeout parameter, which is otherwise defined per distributed (network) index.
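A minimal sketch, assuming the same millisecond units as the per-index agent_connect_timeout directive:
agent_connect_timeout = 1000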
12.4.48. agent_query_timeout
Instance-wide default for the agent_query_timeout parameter, which is otherwise defined in distributed (network) indexes and may also be overridden per-query using the OPTION clause.
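A minimal sketch, assuming the same millisecond units as the per-index agent_query_timeout directive, together with the per-query override mentioned above (the option name is assumed to match the directive, and dist_index is a hypothetical distributed index):
agent_query_timeout = 3000
SELECT * FROM dist_index WHERE MATCH('test') OPTION agent_query_timeout=10000;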
12.4.49. agent_retry_count
Integer. Specifies how many times Sphinx will try to connect to and query remote agents in a distributed index before reporting a fatal query error. Default is 0 (i.e. no retries). This value may also be specified on a per-query basis using the 'OPTION retry_count=XXX' clause. If the per-query option is present, it will override the one specified in the config.
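A minimal sketch of both forms (dist_index is a hypothetical distributed index):
agent_retry_count = 3
SELECT * FROM dist_index WHERE MATCH('test') OPTION retry_count=3;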
12.4.50. agent_retry_delay
Integer, in milliseconds. Specifies the delay Sphinx waits before retrying to query a remote agent in case it fails. The value only makes sense if a non-zero agent_retry_count or a non-zero per-query OPTION retry_count is specified. Default is 500. This value may also be specified on a per-query basis using the 'OPTION retry_delay=XXX' clause. If the per-query option is present, it will override the one specified in the config.
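A minimal sketch of both forms (dist_index is again a hypothetical distributed index):
agent_retry_delay = 250
SELECT * FROM dist_index WHERE MATCH('test') OPTION retry_delay=250;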