ucto(1) FreeBSD General Commands Manual ucto(1)

NAME

ucto - Unicode Tokenizer

SYNOPSIS

ucto [options] [input-file] [output-file]

DESCRIPTION

ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.
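
The synopsis can be illustrated with a small shell wrapper. This is only a sketch under stated assumptions: the function name tokenize_plain, the file names, and the language code 'eng' are hypothetical, and -L and -n are the options documented in the option list below. When ucto or the input file is not available, the wrapper prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of ucto: tokenize a plain-text file and
# write one sentence per output line.
# Usage: tokenize_plain LANG INPUT OUTPUT
tokenize_plain() {
    if command -v ucto >/dev/null 2>&1 && [ -f "$2" ]; then
        # -L selects the configuration by language code, -n emits one
        # sentence per line (both taken from the option list).
        ucto -L "$1" -n "$2" "$3"
    else
        # Dry run: ucto or the input file is missing, so only show the call.
        echo "ucto -L $1 -n $2 $3"
    fi
}

# Placeholder file names; 'eng' assumes a tokconfig-eng file is installed.
tokenize_plain eng input.txt output.txt
```

The input file comes before the output file, matching the order given in the synopsis.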

OPTIONS

-c configfile

Read settings from a file

-d value

Set debug mode to 'value'

-e value

Set the input encoding (default: UTF8)

-N value

Set the UTF8 output normalization (default: NFC)

--filter=[YES|NO]

Enable or disable filtering of special characters (default: YES). These special characters can be specified in the [FILTER] block of the configuration file.

-f

Obsolete. Use --filter=NO instead.

-L language

Automatically select a configuration file by language code. The language code is generally a three-letter ISO 639-3 code. For example, 'fra' will select the file tokconfig-fra from the installation directory.

--detectlanguages=<lang1,lang2,..langn>

Try to detect all the specified languages. The default language will be 'lang1'. (Only useful for FoLiA output.)

-l

Convert to all lowercase

-u

Convert to all uppercase

-n

Emit one sentence per line on output

-m

Assume one sentence per line on input

--normalize=class1,class2,..,classn

Map all occurrences of tokens of class1,...,classn to their generic class names. For example, --normalize=DATE will map all dates to the word {{DATE}}. This is useful for normalizing tokens such as URLs, dates, and e-mail addresses.

--add-tokens="file"

Add additional tokens to the [TOKENS] block of the default language. The file should contain one TOKEN per line.

--passthru

Don't tokenize, but perform input decoding and simple token role detection

--filterpunct

Remove most of the punctuation from the output (but not from abbreviations or embedded punctuation, as in John's)

-P

Disable Paragraph Detection

-Q

Enable Quote Detection. (this is experimental and may lead to unexpected results)

-s <string>

Set the end-of-sentence marker (default: <utt>)

-V

Show version information

-v

Set verbose mode

-F

Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: -nPQvs) For files with an '.xml' extension, -F is the default.

--inputclass="cls"

When tokenizing a FoLiA XML document, search for text nodes of class 'cls'. The default is "current".

--outputclass="cls"

When tokenizing a FoLiA XML document, output the tokenized text in text nodes with 'cls'. The default is "current". It is recommended to have different classes for input and output.

--textclass="cls" (obsolete)

Use 'cls' for both input and output of text from FoLiA. Equivalent to specifying both --inputclass='cls' and --outputclass='cls'.

This option is obsolete and NOT recommended. Please use the separate --inputclass and --outputclass options.

-X

Output FoLiA XML. (this disables usage of most other options: -nPQvs)

--id <DocId>

Use the specified Document ID for the FoLiA XML

-x <DocId> (obsolete)

Output FoLiA XML, using the specified Document ID. (This disables usage of most other options: -nPQvs.)

Obsolete. Use -X and --id instead.
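
Some of the options above can be combined as in the following sketch. It is illustrative, not authoritative: the run_ucto helper and the DRY_RUN variable are hypothetical, the file names are placeholders, the language codes 'nld' and 'eng' assume that matching tokconfig files are installed, and the text class 'original' is an assumed name for an input class. By default the helper only prints each command; setting DRY_RUN=0 executes it.

```shell
#!/bin/sh
# Hypothetical helper, not part of ucto: print the ucto command line
# (default), or execute it when DRY_RUN=0 is set in the environment.
run_ucto() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "ucto $*"
    else
        ucto "$@"
    fi
}

# Plain text in, FoLiA XML out (-X). Options such as -n are not combined
# with -X, since the option list says FoLiA output disables them.
run_ucto -L nld -X input.txt output.folia.xml

# FoLiA in, FoLiA out (-F), with different text classes for input and
# output as the option list recommends. 'original' is an assumed class
# name; 'current' is the documented default.
run_ucto -L nld -F --inputclass=original --outputclass=current input.folia.xml output.folia.xml

# Plain text, one sentence per line, with all dates mapped to {{DATE}}.
run_ucto -L eng -n --normalize=DATE input.txt output.txt
```

Keeping the dry run as the default makes it possible to inspect each command line before any output file is written.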

BUGS

likely

AUTHORS

Maarten van Gompel proycon@anaproy.nl

Ko van der Sloot Timbl@uvt.nl

2018 nov 13
