GNU/Linux Debian 7.3.0 (Wheezy)

swath(1)


SWATH

NAME
SYNOPSIS
DESCRIPTION
OPTIONS
ENVIRONMENT VARIABLES
EXAMPLES
AUTHOR

NAME

swath - General-purpose Thai word segmentation utility

SYNOPSIS

swath [options] < infile > outfile

DESCRIPTION

Thai script has no word delimiter. Applications need a Thai word list to recognize word boundaries before they can do useful things with Thai text, such as line wrapping.

Swath provides a word analysis filter that inserts word delimiters into a text stream. It reads text from standard input, analyzes it for word boundaries by consulting a Thai word list, and writes the same text to standard output with the predefined word delimiters inserted.

Currently, it can read plain text, HTML, RTF, LaTeX and Lambda (the Unicode version of LaTeX with the Omega typesetter kernel) documents, and it inserts the commonly used word delimiter for each format (a pipe '|' for plain text). The user can always override this with a preferred delimiter.

OPTIONS

-b [delimiter]

Define the string to be used as the word delimiter in the output text.

-d [dict-path]

Specify an alternative dictionary location. dict-path must be either a directory containing the swath dictionary file 'swathdic.tri' or a path to the dictionary file itself. The dictionary file must be a trie file prepared with the trietool-0.2(1) utility from the libdatrie package.

If this option is given, swath bypasses the normal dictionary search and exits if the dictionary cannot be opened. Otherwise, it tries to open the dictionary from the location specified in the SWATHDICT environment variable, if set, then from the current working directory, and finally from the usual installed location.
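The search order above can be sketched as a small shell function. This is only an illustration of the documented behavior, not swath's own code: the names resolve_dict and try_dict are hypothetical, and the default install directory /usr/share/swath is an assumption that may differ on your system.

```shell
# Print the dictionary file found at $1, which may be a directory
# containing swathdic.tri or a path to a dictionary file itself.
try_dict() {
    if [ -d "$1" ] && [ -f "$1/swathdic.tri" ]; then
        printf '%s\n' "$1/swathdic.tri"
    elif [ -f "$1" ]; then
        printf '%s\n' "$1"
    else
        return 1
    fi
}

# resolve_dict [DICT_PATH [INSTALL_DIR]]
# DICT_PATH stands for the value given to -d (empty if -d was not used).
# INSTALL_DIR is the installed location (assumed default, see above).
resolve_dict() {
    opt_d=${1:-}
    install_dir=${2:-/usr/share/swath}   # assumption, not taken from this page

    if [ -n "$opt_d" ]; then
        # -d bypasses the normal search; a miss is a hard failure.
        try_dict "$opt_d"
        return $?
    fi

    # Normal search: SWATHDICT (if set), current directory, install directory.
    for loc in "${SWATHDICT:-}" . "$install_dir"; do
        [ -n "$loc" ] || continue
        try_dict "$loc" && return 0
    done
    return 1
}
```

Calling `resolve_dict "" ` with SWATHDICT set to a directory that holds swathdic.tri prints that file's path; calling it with a non-empty first argument never falls back to the other locations.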

-f [format]

Specify the format of the input. Possible formats are: html, rtf, latex, lambda.

-m [scheme]

Choose the word matching scheme used when analyzing word boundaries. Possible schemes are 'long' (longest, or greedy, matching) and 'max' (maximal matching, which prefers the segmentation with the fewest words). The default is maximal matching.
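To illustrate how the two schemes can differ, here is a toy comparison on ASCII input, written as a shell function around awk. It is a simplified sketch and not swath's implementation: the word list in WORDS, the function name segment, and the fallback rules for text that cannot be matched are all assumptions made for this example.

```shell
# Toy word list (space separated); stands in for a real dictionary.
WORDS="ab abc cde d e"

# segment SCHEME TEXT
# SCHEME is "long" (greedy longest match) or "max" (fewest words).
segment() {
    awk -v scheme="$1" -v text="$2" -v words="$WORDS" '
    BEGIN {
        n = split(words, w, " ")
        for (k = 1; k <= n; k++) dict[w[k]] = 1
        len = length(text)
        out = ""

        if (scheme == "long") {
            # Greedy: at each position take the longest dictionary word,
            # falling back to a single character if nothing matches.
            pos = 1
            while (pos <= len) {
                for (l = len - pos + 1; l > 1; l--)
                    if (substr(text, pos, l) in dict) break
                out = out (out == "" ? "" : "|") substr(text, pos, l)
                pos += l
            }
        } else {
            # Maximal matching: best[i] is the fewest dictionary words
            # covering the first i characters, or -1 if none do.
            best[0] = 0
            for (i = 1; i <= len; i++) {
                best[i] = -1
                for (j = 0; j < i; j++) {
                    if (best[j] >= 0 && (substr(text, j + 1, i - j) in dict) &&
                        (best[i] < 0 || best[j] + 1 < best[i])) {
                        best[i] = best[j] + 1
                        cut[i] = j
                    }
                }
            }
            if (best[len] < 0) {
                out = text          # no full segmentation: leave text as is
            } else {
                # Walk the recorded cut points backwards to rebuild the words.
                for (i = len; i > 0; i = cut[i]) {
                    word = substr(text, cut[i] + 1, i - cut[i])
                    out = word (out == "" ? "" : "|") out
                }
            }
        }
        print out
    }'
}
```

With the sample word list, `segment long abcde` prints `abc|d|e`, because the greedy scan commits to `abc` first, while `segment max abcde` prints `ab|cde`, which uses fewer words.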

-u input-enc,output-enc

Specify the encodings of the input and output. input-enc and output-enc must each be either 'u' (UTF-8) or 't' (TIS-620). Swath converts the character encoding as necessary. If this option is omitted, TIS-620 is assumed for both input and output.
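As a small sketch of combining -u with -b, the following wrapper segments UTF-8 plain text from standard input and writes UTF-8 output with a caller-chosen delimiter. The function name segment_utf8 is hypothetical, and the '|' fallback simply mirrors the plain-text default mentioned in the DESCRIPTION.

```shell
# segment_utf8 [DELIMITER]
# Reads UTF-8 plain text on stdin, writes segmented UTF-8 text on stdout.
# DELIMITER defaults to '|' when omitted.
segment_utf8() {
    swath -u u,u -b "${1:-|}"
}
```

For example, `segment_utf8 ' ' < thai.txt > thai-spaced.txt` would separate words with spaces, where thai.txt is a placeholder for your own UTF-8 file.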

-v, --verbose

Turn on verbose mode.

-help, --help

Show help.

ENVIRONMENT VARIABLES

SWATHDICT

If set, swath searches for the dictionary in this location before the usual places (the current working directory, then the usual installed directory). This value is overridden by the -d option.

EXAMPLES

For LaTeX (to be used with the thailatex package):

$ swath -f latex < thaifile.tex > thaifile.ttex
$ latex thaifile.ttex

For HTML (to serve web pages to browsers that cannot wrap Thai lines properly but support the <wbr> tag):

$ swath -f html < myweb.html > myweb-wbr.html

To preprocess a UTF-8 encoded Thai LaTeX file for thailatex, which always works with TIS-620:

$ swath -f latex -u u,t < thaifile.tex > thaifile.ttex
$ latex thaifile.ttex

This is equivalent to filtering with iconv(1):

$ iconv -f UTF-8 -t TIS-620 thaifile.tex | swath -f latex > thaifile.ttex
$ latex thaifile.ttex

To use the longest matching scheme with a LaTeX document:

$ swath -f latex -m long < thaifile.tex > thaifile.ttex
$ latex thaifile.ttex

To use an alternative dictionary from libthai:

$ swath -f latex -d /usr/share/libthai/thbrk.tri < thaifile.tex > thaifile.ttex
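When several LaTeX files need the same treatment, the commands above can be wrapped in a loop. The following is a minimal sketch under the same .tex to .ttex naming used in these examples; the function name swath_tex_dir is hypothetical, and any extra arguments are passed to swath unchanged.

```shell
# swath_tex_dir DIR [SWATH_OPTIONS...]
# Runs swath in LaTeX mode on every DIR/*.tex file, writing DIR/*.ttex.
swath_tex_dir() {
    dir=$1
    shift
    for src in "$dir"/*.tex; do
        # Skip the unexpanded pattern when DIR has no .tex files.
        [ -e "$src" ] || continue
        swath -f latex "$@" < "$src" > "${src%.tex}.ttex" || return 1
    done
}
```

For instance, `swath_tex_dir . -u u,t` would convert every UTF-8 .tex file in the current directory to a segmented TIS-620 .ttex file.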

AUTHOR

This manual page was written by Theppitak Karoonboonyanan <thep@linux.thai.net>.


