estwaver - command line interface of web crawler
estwaver init [-apn|-acc] [-xs|-xl|-xh] [-sv|-si|-sa] rootdir
estwaver crawl [-restart|-revisit|-revcont] rootdir
estwaver unittest rootdir
estwaver fetch [-proxy host port] [-tout num] [-il lang] url
estwaver is an aggregation of sub commands. The name of a sub command is
specified by the first argument. Other arguments are parsed according to each
sub command. The argument rootdir specifies the crawler root directory,
which contains the configuration file and other files.
- estwaver init [-apn|-acc] [-xs|-xl|-xh] [-sv|-si|-sa] rootdir
- Create the crawler root directory.
If -apn is specified, N-gram analysis is performed against European
text also.
If -acc is specified, character category analysis is performed
instead of N-gram analysis.
If -xs is specified, the index is tuned to register fewer than 50000
documents.
If -xl is specified, the index is tuned to register more than 300000
documents.
If -xh is specified, the index is tuned to register more than 1000000
documents.
If -sv is specified, scores are stored as void.
If -si is specified, scores are stored as 32-bit integer.
If -sa is specified, scores are stored as-is and marked not to be
tuned at search time.
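As an illustration of how the option groups combine, the following Python sketch assembles an argument list for estwaver init. The helper name build_init_args and the directory /tmp/crawler are hypothetical. The sketch only builds the list, and it assumes that each bracketed group in the synopsis accepts at most one flag.

```python
# Option groups taken from the synopsis: one flag at most from each group.
ANALYZERS = ("apn", "acc")
SIZES = ("xs", "xl", "xh")
SCORES = ("sv", "si", "sa")


def build_init_args(rootdir, analyzer=None, size=None, score=None):
    """Return an argv list for `estwaver init` (hypothetical helper)."""
    args = ["estwaver", "init"]
    for value, allowed in ((analyzer, ANALYZERS), (size, SIZES), (score, SCORES)):
        if value is None:
            continue  # group omitted, so the default applies
        if value not in allowed:
            raise ValueError(f"unknown option: -{value}")
        args.append(f"-{value}")
    args.append(rootdir)
    return args


# /tmp/crawler is an assumed path used only for illustration.
print(build_init_args("/tmp/crawler", analyzer="acc", size="xl", score="si"))
# prints ['estwaver', 'init', '-acc', '-xl', '-si', '/tmp/crawler']
```

The resulting list can be handed to subprocess.run on a machine where estwaver is installed.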
- estwaver crawl [-restart|-revisit|-revcont] rootdir
- Start crawling.
If -restart is specified, crawling is restarted from the seed
documents.
If -revisit is specified, collected documents are revisited.
If -revcont is specified, collected documents are revisited and then
crawling is continued.
- estwaver unittest rootdir
- Perform unit tests.
- estwaver fetch [-proxy host port] [-tout num] [-il lang] url
- Fetch a document.
url specifies the URL of a document.
-proxy specifies the host name and the port number of the proxy
server.
-tout specifies timeout in seconds.
-il specifies the preferred language. By default, it is English.
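The fetch options can be assembled in the same spirit. This is a minimal Python sketch, not part of estwaver itself: the helper name build_fetch_args, the proxy host proxy.example.com, and the URL are assumptions for illustration. The format of the lang value is not given above, so the sketch passes it through unchanged.

```python
def build_fetch_args(url, proxy=None, timeout=None, lang=None):
    """Return an argv list for `estwaver fetch` (hypothetical helper).

    proxy is an optional (host, port) pair, timeout is in seconds,
    and lang is passed to -il as an opaque string.
    """
    args = ["estwaver", "fetch"]
    if proxy is not None:
        host, port = proxy
        args += ["-proxy", host, str(port)]
    if timeout is not None:
        args += ["-tout", str(timeout)]
    if lang is not None:
        args += ["-il", lang]
    args.append(url)  # the URL comes last, as in the synopsis
    return args


# Assumed proxy host and URL, used only for illustration.
print(build_fetch_args("http://example.com/",
                       proxy=("proxy.example.com", 8080), timeout=30))
```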
All sub commands return 0 if the operation succeeds, and 1 otherwise. A
running crawler closes the database and exits when it catches
signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), or 15 (SIGTERM).
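The signal behavior makes it possible to stop a crawl from a supervising script. The Python sketch below starts a command, sends SIGTERM once a time budget expires, and waits for the process to exit so that it has a chance to close its database. The function name run_with_budget and the budget value are assumptions. The text above does not say what exit status a crawler reports after catching a signal, so the sketch returns the status without interpreting it.

```python
import signal
import subprocess


def run_with_budget(argv, budget_seconds):
    """Run argv; after budget_seconds, send SIGTERM and wait for exit.

    Returns the exit status reported by the process.
    """
    proc = subprocess.Popen(argv)
    try:
        proc.wait(timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        # Ask for an orderly shutdown instead of killing the process outright.
        proc.send_signal(signal.SIGTERM)
        proc.wait()
    return proc.returncode
```

A call such as run_with_budget(["estwaver", "crawl", "/tmp/crawler"], 3600) would crawl for at most about an hour, assuming estwaver is installed and /tmp/crawler is an initialized root directory.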
When crawling finishes, the crawler root directory contains a directory
named _index. It is an index that can be used by estcmd and other
tools.
estconfig(1), estcmd(1), estmaster(1), estcall(1),
estraier(3), estnode(3)