
WHAT'S THIS?
-----------

mguesser is a standalone part of libudmsearch (the core of the mnoGoSearch
search engine, http://mnogosearch.org) that guesses the charset and
language of a text.


Guessing uses the "N-Gram-Based Text Categorization" technique, the same
technique implemented by TextCat, a language guesser written in Perl
(http://www.let.rug.nl/~vannoord/TextCat/). mguesser is significantly
faster than TextCat, especially on large texts.


This package consists of N-gram based algorithms written in C, as well
as a number of maps for texts in various languages and charsets.
See the "maps" directory of this package for the currently
supported languages and charsets.



INSTALLATION
------------

Just type "make".  

By default mguesser looks for language maps in the "maps" subdirectory of
the current directory. To change the default location of the language
maps, edit the -DLMDIR value in the Makefile.



USAGE
-----


mguesser reads plain text from stdin. Note that "almost text" formats
such as HTML give poor results. A later release may add a command line
switch to tell mguesser that the input is HTML.
mguesser works well on texts of 500 bytes or more. Shorter texts
are guessed less reliably.

To guess the language and charset of a text file, use:

  mguesser < text_file

It displays how closely your file matches each language map, best match
first. Each score is a value between 0 and 1.

You can also limit the output to a given number of best results with the
-n command line switch. For example, this displays the 3 best results:

  mguesser -n3 < text_file


To create a new language map, use:

  mguesser -p -c charset -l language < text_file

When run with the -p command line switch, mguesser writes a language
map built from text_file to stdout. Note that text_file should be
big enough: texts of about 500 KB usually give high quality maps.


You may also include these files in your own applications.
See the main() function in guesser.c for the order in which the
guesser functions are called.



TODO
----

 * Make it possible to guess formats other than plain text: HTML, XML
 * Implement command line switches to choose the output format


Alexander Barkov <bar@izhcom.ru>
