INDEXER.CONF(5)
mnoGoSearch reference manual
indexer.conf - configuration file for indexer
This is the configuration file for indexer(1). The configuration file consists
of commands and their arguments. All commands are case-insensitive. You can
use # to comment out lines.
These commands should be used only once and take global effect for the whole
configuration file.
- DBType type
- Database type. Currently supported values are mysql, pgsql, msql,
solid, mssql, oracle, ibase and sqlite. The value does not matter when
native library support is used, but ODBC users must specify one of the
supported values. If your database type is not supported, use unknown
instead.
- DBHost host
- SQL host name (Not required for ODBC)
Default: localhost
- DBName mnogosearch
- SQL database name or ODBC DSN
Default: mnogosearch
- DBUser foo
- Database username to connect to database
Default: no user
- DBPass bar
- Database password to connect to database
Default: no password
- DBMode single/multi/crc/crc-multi
- SQL database word storage mode. Does not apply to the built-in database.
When single is specified, all words are stored in the same table.
multi means that words are stored in different tables depending on
word length. multi mode is usually faster, but it requires more
tables in the database. In crc mode, mnoGoSearch stores 32-bit
integer word IDs calculated by the CRC32 algorithm instead of the words
themselves. crc mode requires less disk space and is faster than single
and multi modes. crc-multi mode shares its storage structure
with crc mode, but stores words in different tables depending on
word length, like multi mode. The default DBMode value is single.
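As a sketch of how the database commands above fit together, the fragment below selects a MySQL database and the crc-multi storage mode. The host, database name, user and password are placeholder values, not settings to rely on.

```
# Database connection (placeholder credentials)
DBType mysql
DBHost localhost
DBName mnogosearch
DBUser foo
DBPass bar
# Store CRC32 word IDs, split into tables by word length
DBMode crc-multi
```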
- LocalCharset charset
- Defines charset for local file system. It is required if you are using 8
bit characters and is not applicable for 7 bit characters. This command is
to be used once and takes global effect for the whole configuration file.
Example:
LocalCharset windows-1250
- CrossWords yes|no
- Enables building of the CrossWords index. Crosswords are the words used in
links that point to the present page. The default value is no.
- StopWordFile filename
- This command indicates which file contains the list of stopwords to load. You
may specify either an absolute file name or a file name with a path relative to
the mnoGoSearch /etc directory. You may use several StopWordFile
commands.
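For illustration, the two lines below load one list by relative path and one by absolute path. Both file names are hypothetical; substitute the stopword files actually present in your installation.

```
# Relative to the mnoGoSearch /etc directory (hypothetical file name)
StopWordFile stopwords/en.txt
# Absolute path (hypothetical file name)
StopWordFile /usr/local/share/stopwords/de.txt
```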
- MinWordLength characters
- MaxWordLength characters
- With these commands you can change the default length range of words
stored in the database. By default mnoGoSearch stores words that are
longer than 1 and shorter than 32 characters. Example:
MaxWordLength 35
- MaxDocSize bytes
- Specify the maximum size, in bytes, of a document that can be indexed. The
default value is 1048576 (1 MB). This command takes global effect
for the whole config file.
- HTTPHeader header
- You may add custom HTTP headers to the indexer's HTTP requests. Do not use the
"If-modified-since" and "Accept-Charset" headers,
since they are composed by indexer itself. "User-Agent:
mnoGoSearch/version" is sent too, although you may override it. The
command has global effect for the whole configuration file.
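A possible use of HTTPHeader is sketched below. The agent name and the address are made up, and enclosing the header text in double quotes is an assumption; check the sample configuration shipped with your version for the exact form.

```
# Override the default User-Agent header (hypothetical agent name)
HTTPHeader "User-Agent: My_Own_Agent"
# Add a contact address header (hypothetical address)
HTTPHeader "From: webmaster@example.com"
```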
- ServerTable table_name
- This command works only with an SQL database and is not applicable to
built-in database mode. It loads servers with all their parameters from the
table table_name. For an example of the structure of such a table, please
refer to the file create/mysql/server.txt. You may use several arguments
with this command: ServerTable my_servers1 my_servers2 my_servers3
or just a single argument: ServerTable server
- DeleteNoServer yes|no
- Use this command to specify whether to delete URLs that have no
corresponding Server commands. Default value is yes.
- VarDir /path/to/my/var/dir
- Specify a custom path to the directory where indexer stores data when used
with the built-in database and in cache mode. By default the /var directory of
the mnoGoSearch installation is used.
- Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]
- Use this command to allow URLs that match (or, with NoMatch, do not match)
the given argument. The first three optional parameters describe the type of
comparison. Default values are Match, NoCase, String. Use NoCase or
Case to choose case-insensitive or case-sensitive comparison.
Use Regex to choose regular expression comparison. Use
String to choose string comparison with wildcards. Wildcards are
* for any number of characters, and ? for one character.
Note that * and ? have special meaning in the String
match type. Please use Regex to describe documents with ?
and * signs in the URL. String match is much faster than
Regex, so use String where possible. You may use
several arguments for one Allow command and use this command any
number of times. It takes global effect for the config file. Note that
mnoGoSearch automatically adds one Allow regex .* command after
reading the config file. That command means that everything that is not
disallowed is allowed.
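The comparison options described above can be combined as in the following sketch. The URL patterns are made up for illustration.

```
# Default comparison (Match, NoCase, String): * and ? are wildcards
Allow *.html *.htm
# Case-sensitive regular expression, needed here because the URL contains ?
Allow Case Regex \.php\?id=[0-9]+$
# Allow every URL that does not match the expression
Allow NoMatch Regex \.(tmp|bak)$
```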
- Disallow [Match|NoMatch] [Case|NoCase] [String|Regex] [<arg> ...]
- Use this to disallow indexing documents with URLs that match given
argument. The meaning of the first three optional parameters is exactly
the same as with the Allow command. You can use several arguments
for one Disallow command. Takes global effect for config file.
- Example:
- #Exclude cgi-bin and non-parsed-headers
Disallow /cgi-bin/ \.cgi /nph
#Exclude some known extensions
Disallow \.b$ \.sh$ \.md5$
Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$
Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
Disallow \.rtf$ \.pdf$ \.cdf$ \.ps$
Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
Disallow \.rpm$
#Exclude Apache directory list in different sort order
Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
#Exclude ./. and ./.. from Apache and Squid directory list
Disallow /[.]{1,2} /\%2e /\%2f
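Since Disallow accepts the same optional parameters as Allow, the comparison type can also be spelled out explicitly. The fragment below is a sketch under that assumption; the parameter name sid and the example.com URL are hypothetical.

```
# Case-sensitive regular expression: skip URLs carrying a session id
Disallow Case Regex [?&]sid=[0-9a-f]+
# String match with wildcards: skip everything under any /tmp/ directory
Disallow String */tmp/*
# Disallow every URL that does NOT match the expression,
# which restricts indexing to a single directory
Disallow NoMatch Regex ^http://www\.example\.com/docs/
```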
- CheckOnly regexp [regexp [...] ]
- Indexer will use the HEAD instead of the GET HTTP method for URLs that match
regexp. This means that the file will only be checked and will not be
downloaded. Useful for zip, exe, arj and similar files. You can use several
arguments for one CheckOnly command. You can use this command any number of
times, but not more than MAXFILTER in indexer.h. Takes global effect for the
config file.
- Examples:
- #Use HEAD method for some known non-text extensions:
CheckOnly \.b$ \.sh$ \.md5$
CheckOnly \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
CheckOnly \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
CheckOnly \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
CheckOnly \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
CheckOnly \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$
CheckOnly \.vrml$ \.wrl$
CheckOnly \.exe$ \.cab$ \.dll$ \.bin$ \.class$
CheckOnly \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
CheckOnly \.rtf$ \.pdf$ \.cdf$ \.ps$
CheckOnly \.ai$ \.eps$ \.ppt$ \.hqx$
CheckOnly \.cpt$ \.bms$ \.oda$ \.tcl$
CheckOnly \.rpm$
- HrefOnly regexp [regexp [...] ]
- Indexer scans HTML documents that match regexp as it would scan any other
URLs, except that it will not index the contents. It will add any URLs it
finds in the HTML document to the database. Useful when indexing mailing list
archives with big index pages which contain mostly URLs. You can use
several arguments for one HrefOnly command. You can use this command any
number of times, but not more than MAXFILTER in indexer.h. Takes global effect
for the config file.
- Examples:
- #Scan these files for href tags only, but do not index their contents.
HrefOnly mail.*\.html$ thr.*\.html$
- UseRemoteContentType yes|no
- This command specifies whether the indexer should get the content type from
HTTP server headers (yes), or from its AddType settings (no). If set to
no, and the indexer could not determine content-type with its
AddType settings,
- SyslogFacility facility
- Useful only if indexer is compiled with syslog support and if you
do not like the default. The argument is the same as used in the syslog.conf
file (for example: local7, daemon). For the list of possible
facilities see syslog.conf(5). Takes global effect and should be used only
once. Default: depends on compilation.
- LogdAddr host[:port]
- Use cachelogd at the given host and port, if specified. Required for
cache mode only. Default values are localhost and port
7000.
- FollowOutside yes|no
- Allow/disallow indexer to walk outside current server. Should be used
carefully (see MaxHops command).
Default: no
- Period seconds
- Reindex period in seconds, 604800 = 1 week. May be used before every
Server command and takes effect till the end of config file or till
next Period command.
- Tag number
- Use this parameter for your own purposes. For example for grouping some
servers into one group, etc. May be used multiple times before every
Server command and takes effect till the end of config file or till
next Tag command.
- MaxHops number
- Maximum distance in "mouse clicks" from the start URL given in the
Server command. May be used multiple times before every
Server command and takes effect till the end of config file or till
next MaxHops command.
Default: 256
- MaxNetErrors number
- Maximum number of network errors for each server. If there are too many
network errors on some server (server is down, host unreachable, etc.),
indexer will try to make no more than number attempts to
connect to this server. May be used multiple times before Server
command and takes effect till the end of config file or till next
MaxNetErrors command.
Default: 16
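The scoping rule shared by Period, Tag, MaxHops and MaxNetErrors (a value applies to every following Server command until it is set again) can be sketched as follows. All host names are hypothetical.

```
# Reindex these servers weekly, at most 3 clicks from the start URL
Period 604800
MaxHops 3
Tag 1
Server http://www.example.com/
Server http://docs.example.com/

# From here on: daily reindexing, a different tag and fewer retries.
# MaxHops 3 still applies because it has not been set again.
Period 86400
Tag 2
MaxNetErrors 4
Server http://news.example.com/
```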
- TitleWeight number
- Weight of the words in the <title>...</title> tag. Can be set
multiple times before Server command and takes effect till the end
of config file or till next TitleWeight command.
Default: 2
- BodyWeight number
- Weight of the words in the <body>...</body> of the html
documents and in the contents of the text/plain documents. Can be set
multiple times before Server command and takes effect till the end
of config file or till next BodyWeight command.
Default: 1
- DescWeight number
- Weight of the words in the <META NAME="Description"
Content="..."> tag. Can be set multiple times before Server
command and takes effect till the end of config file or till next
DescWeight command.
Default: 2
- KeywordWeight number
- Weight of the words in the <META NAME="Keywords"
Content="..."> tag. Can be set multiple times before Server
command and takes effect till the end of config file or till next
KeywordWeight command.
Default: 2
- UrlWeight number
- Weight of the words in the URL of the documents. Can be set multiple times
before Server command and takes effect till the end of config file
or till next UrlWeight command.
Default: 0
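A sketch of how the weight commands above can differ between servers. It uses only values that appear among the documented defaults (0, 1 and 2), since the permitted range is not stated here; the host names are hypothetical.

```
# Count words found in URLs and give meta keywords zero weight
UrlWeight 1
KeywordWeight 0
Server http://www.example.com/
# Restore the default keyword weight for the servers listed below;
# UrlWeight 1 stays in effect
KeywordWeight 2
Server http://forum.example.com/
```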
- DeleteBad yes|no
- Controls whether indexer deletes bad (not found, forbidden, etc.) URLs from
the database. Setting it to no is useful if you want to check the integrity of
your server(s): the "bad" URLs will then remain in the
database. Can be set multiple times before Server command and takes
effect till the end of config file or till next DeleteBad command.
Default: yes
- Robots yes|no
- Allows/disallows using robots.txt and <META NAME="robots">
exclusions. Useful if you want to check the integrity of your server(s). Can
be set multiple times before Server command and takes effect till
the end of config file or till next Robots command.
Default: yes.
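For an integrity check of the kind mentioned above, DeleteBad and Robots might be combined as in this sketch. The host names are hypothetical.

```
# Integrity check: keep bad URLs in the database, ignore robots exclusions
DeleteBad no
Robots no
Server http://intranet.example.com/
# Back to the defaults for the servers listed below
DeleteBad yes
Robots yes
Server http://www.example.com/
```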
- Section <string> <number>
- Here <string> is a section name and <number> is a section ID
between 0 and 255. Use 0 if you don't want to index some of these
sections. It is better to use different section IDs for different
document parts. In that case you will be able to give a
different weight to each part, or even exclude some sections, at search
time.
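The following fragment sketches the syntax only. The section names body, title and meta.keywords are assumptions made for illustration, since the list of valid section names is not given here.

```
# Distinct IDs allow separate weighting at search time (assumed names)
Section body 1
Section title 2
# ID 0: do not index this section
Section meta.keywords 0
```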
- Index yes|no
- Controls whether indexer stores words into the database. Setting it to no
is useful if you want to check the integrity of your server(s). Can be set
multiple times before "Server" command and takes effect till the end of
config file or till next Index command.
Note: Instead of Index no you can use the
alternate form NoIndex
Default: yes
- Follow yes|no
- Allow/disallow indexer to store <a href="..."> links into the
database. Can be set multiple times before Server command and takes
effect till the end of config file or till next Follow command.
Note: Instead of Follow no you can use the
alternate form NoFollow
Default: yes
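Index and Follow can be switched independently, as in the sketch below. The URLs are hypothetical, and the comments describe the intended effect as read from the descriptions above.

```
# Do not store words from this page; links are still stored (Follow is yes)
Index no
Server http://www.example.com/sitemap.html
# Store words again, but do not store links found on these pages
Index yes
Follow no
Server http://www.example.com/archive/
```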
- MaxDocSize size
- Limits the maximum document size. size is given in bytes. If a document
is larger than size, indexer will parse only the first
size bytes of the document.
Default: 1048576 (which is 1 megabyte)
- Mime <from_mime> <to_mime>[;charset] ["command line [$1]"]
- This is used to add support for parsing documents with mime
types other than text/plain and text/html. It can be done
via an external parser (which should provide output in plain text or html)
or just by substituting the mime type so that indexer can understand it
directly.
<from_mime> and <to_mime> are
standard mime types. <to_mime> should be either
text/plain or text/html, because these are the only types
that indexer understands.
The external parser is assumed to write its results to stdout (if it does
not, you have to write a little script that cats the results to stdout).
The optional charset parameter is used to change the charset if
needed.
The command line parameter is optional. If there is no command
line, the command is used only to change the mime type. The command line may
also contain a $1 parameter, which stands for a temporary file name. Some
parsers cannot operate on stdin, so indexer creates a temporary file for the
parser and passes its name in place of $1.
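A sketch of both uses of the Mime command described above. The external programs catdoc and deroff are assumptions (they must be installed separately and are not part of mnoGoSearch), and text/x-log is a made-up mime type used only for illustration.

```
# Convert MS Word files through an external parser; $1 is replaced by
# the name of the temporary file created by indexer
Mime application/msword text/plain "catdoc $1"
# Parser that reads the document from stdin and writes to stdout
Mime application/x-troff-man text/plain "deroff"
# No command line: just treat this type as plain text
Mime text/x-log text/plain
```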
- CharSet charset
- Useful for 8 bit character sets. WWW-servers send data in different
character sets. charset is default character set of server in next
Server command(s). May be used before every Server command
and takes effect till the end of config file or till next CharSet
command.
Currently indexer supports the Cyrillic koi8-r, cp1251, cp866,
iso8859-5 and x-mac-cyrillic, Arabic cp1256, Western iso-8859-1, and Central
Europe iso-8859-2 and cp1250 character sets.
This parameter is default character set for "bad"
servers that do not send information about charset in header: just
"Content-type: text/html" instead of for example
"Content-type: text/html; charset=koi8-r" and do not send
charset information in META tags.
- Examples:
-
CharSet koi8-r
CharSet windows-1250
CharSet ISO-8859-1
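Because CharSet applies to the Server commands that follow it, servers with different default character sets can be grouped as in this sketch. The host names are hypothetical.

```
# Servers below are assumed to use koi8-r when they report no charset
CharSet koi8-r
Server http://www.example.ru/
# Servers below are assumed to use ISO-8859-1
CharSet ISO-8859-1
Server http://www.example.com/
```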
- ForceIISCharset1251 yes/no
- This option is useful for users dealing with Cyrillic content and broken
(or misconfigured?) Microsoft IIS web servers, which tend to report the
charset incorrectly. This is a really dirty hack, but if this option is
turned on it is assumed that all servers that are reported as 'Microsoft'
or 'IIS' have content in the Windows-1251 codepage. This command should be
used only once in the configuration file and takes global effect.
Default: no
- AuthBasic login:passwd
- Use basic http authorization. Can be set before every Server
command and takes effect only for next Server command.
- Examples:
-
AuthBasic somebody:something
If you have password protected directory(ies), but whole
server is open, use:
AuthBasic login1:passwd1
Server http://my.server.com/my/secure/directory1/
AuthBasic login2:passwd2
Server http://my.server.com/my/secure/directory2/
Server http://my.server.com/
- ProxyAuthBasic login:passwd
- Use HTTP proxy basic authorization. Can be used before every Server
command and takes effect only for the next Server command. It
should also be placed before the Proxy command.
- Example:
- ProxyAuthBasic somebody:smth
- Proxy your.proxy.host[:port]
- Connect via a proxy rather than directly. You can index ftp servers only when
using a proxy. If port is not specified, it is set to the default value
of 3128 (Squid). If the proxy host is not specified, a direct connection will
be performed. Can be set before every Server command and takes effect
till the end of config file or till next Proxy command.
- Examples:
- Proxy atoll.anywhere.com
- proxy on atoll.anywhere.com, port 3128
Proxy lota.anywhere.com:8090
- proxy on lota.anywhere.com, port 8090
Proxy
- turn off proxy usage (direct connection)
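The ordering rules for the proxy commands (ProxyAuthBasic before Proxy, both before Server) can be sketched as follows. The proxy host, credentials and server URLs are placeholders.

```
# Credentials for the proxy, placed before the Proxy command
ProxyAuthBasic somebody:smth
Proxy proxy.example.com:3128
Server http://www.example.com/
# ProxyAuthBasic applied only to the Server above.
# An empty Proxy command switches back to direct connections.
Proxy
Server http://localhost/
```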
- Server URL
- This is the main configuration command. Use it to add the start URL of a
server to be indexed. You may use many Server commands in the same
indexer.conf file.
- Examples:
-
Server http://localhost/
Server http://www.yoursite.com/
Server http://www.yoursite.com/~yourname/
Server ftp://ftp.yourdomain.com/pub/
- This is a minimal sample indexer config file
-
DBHost localhost
DBName udmsearch
DBUser foo
DBPass bar
Server http://localhost/
Disallow /cgi-bin/ \.cgi /nph
Disallow \.b$ \.sh$ \.md5$
Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$
Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
Disallow \.rtf$ \.pdf$ \.cdf$ \.ps$
Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
Disallow \.rpm$
Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
Disallow /[.]{1,2} /\%2e /\%2f
indexer(1), syslog.conf(5)