RULEXDB_OPEN(3) | Library Functions Manual | RULEXDB_OPEN(3) |
rulexdb_open - open or create a rulex database
#include <rulexdb.h>

RULEXDB *rulexdb_open(const char *path, int mode);
The rulexdb_open() function opens the rulex database in the file whose name is the string pointed to by path and allocates and initializes all necessary internal data structures associated with it.
The mode argument specifies the database access mode. It takes one of the following values:
The rulex database consists of two dictionaries and four sets of rules. The Explicit dictionary contains words that are described individually and imply nothing about other forms. This dictionary is looked up first if the search includes this stage. The Implicit dictionary contains words in a basic form and is used to construct the pronunciation string for various forms of these words. The basic form of a word is guessed according to the rules from the Classifiers and Prefix detectors rulesets. This is the second stage of the search process. If these stages yield no result or are not performed, the rules from the General ruleset are used to guess the stress position in the word. If none of these rules can be applied, no guess is made and the search fails.
Externally, all data are represented as text. Russian letters are encoded in the koi8-r character set, and only lowercase is allowed.
Each dictionary record consists of two fields. The first field contains a Russian word that serves as the search key. Only lowercase Russian letters are allowed here. The second field provides the pronunciation string for this word. The pronunciation string is the word itself, written the way it should be pronounced. Along with lowercase Russian letters, three additional symbols are allowed in the pronunciation string. The "+" sign marks the stressed letter and is placed immediately after it. The "=" sign is used in the same way in some cases to mark the so-called weak stress. The "-" sign can serve as a separator in some compound words. All other symbols are illegal.
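The alphabet described above can be checked mechanically. The following is a minimal sketch, not part of the rulexdb API: the names is_koi8r_lower() and pronunciation_is_legal() are hypothetical, and treating byte 0xA3 (lowercase "yo") as an accepted letter is an assumption. The byte range 0xC0..0xDF is where koi8-r places the lowercase Russian letters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if byte c is a lowercase Russian letter in koi8-r.
 * Lowercase letters occupy 0xC0..0xDF; 0xA3 is lowercase "yo"
 * (whether the library accepts "yo" is an assumption). */
static bool is_koi8r_lower(unsigned char c)
{
    return (c >= 0xC0 && c <= 0xDF) || c == 0xA3;
}

/* Hypothetical helper, not part of rulexdb: checks that s uses only
 * the symbols allowed in a pronunciation string and that every
 * stress mark ('+' or '=') directly follows a letter. */
static bool pronunciation_is_legal(const char *s)
{
    assert(s != NULL);
    bool prev_is_letter = false;

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;

        if (is_koi8r_lower(c)) {
            prev_is_letter = true;
        } else if (c == '+' || c == '=') {
            if (!prev_is_letter)
                return false;      /* stress mark with no letter before it */
            prev_is_letter = false;
        } else if (c == '-') {
            prev_is_letter = false; /* separator in a compound word */
        } else {
            return false;          /* any other symbol is illegal */
        }
    }
    return true;
}
```

Under these assumptions a koi8-r string such as "\xCD\xCF\xCC\xCF\xCB\xCF+" passes the check, while a string that starts with "+" or contains Latin or uppercase letters is rejected.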
There are four rulesets in the database: General rules, Classifiers, Prefix detectors and Correctors. Externally all these rules are represented by records consisting of one or two fields. The first field always contains a regular expression which is matched against the word to make a decision whether this rule can be applied.
The only task of the General rules is to guess the stress in words when dictionary lookup fails. The rules are tried sequentially until one matches or the list is exhausted. If a match succeeds, the "+" sign is inserted into the word right after the first subexpression match to mark the stress position. These rules do not contain a second field.
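The behaviour of the General rules can be illustrated with POSIX regular expressions. This is only a sketch under stated assumptions: apply_stress_rule() and guess_stress() are hypothetical names, the library's actual regular expression dialect is not specified here (POSIX extended syntax is assumed), and the examples use Latin transliteration instead of koi8-r text for readability.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of rulexdb. Tries one rule: if
 * pattern matches word, copies word into out with '+' inserted right
 * after the first subexpression match. Returns false if the rule
 * does not apply, cannot be compiled, or out is too small. */
static bool apply_stress_rule(const char *pattern, const char *word,
                              char *out, size_t outsize)
{
    assert(pattern != NULL && word != NULL && out != NULL);
    regex_t re;
    regmatch_t m[2];
    bool ok = false;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return false;
    if (regexec(&re, word, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = strlen(word);
        size_t pos = (size_t)m[1].rm_eo;  /* end of first subexpression */

        if (len + 2 <= outsize) {         /* word + '+' + terminating NUL */
            memcpy(out, word, pos);
            out[pos] = '+';
            memcpy(out + pos + 1, word + pos, len - pos + 1);
            ok = true;
        }
    }
    regfree(&re);
    return ok;
}

/* Tries the rules in order and stops at the first one that applies. */
static bool guess_stress(const char *const *rules, size_t nrules,
                         const char *word, char *out, size_t outsize)
{
    for (size_t i = 0; i < nrules; i++)
        if (apply_stress_rule(rules[i], word, out, outsize))
            return true;
    return false;
}
```

With an illustrative rule such as "^[^aeiou]*([aeiou])", the transliterated word "dom" would come out as "do+m". A word that no rule matches leaves the guess unsuccessful, which corresponds to the search failing.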
In the Classifiers ruleset, the rules are checked one by one until a match occurs. Then the part of the word from its beginning through the end of the first subexpression match is extracted, and if a second field is present, it is appended to the extracted part as a suffix. The resulting string is treated as the basic form of the word and is looked up in the Implicit dictionary. If nothing is found, the process continues until the ruleset is exhausted.
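The construction of a basic form from a single classifier rule might look as follows. This is a sketch, not the library's implementation: classify_basic_form() is a hypothetical name, POSIX extended regular expressions are assumed, and the sample words are Latin transliterations rather than koi8-r text. The Implicit dictionary lookup that would follow is left out.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of rulexdb. If pattern matches word,
 * writes to out the part of word from its beginning through the end
 * of the first subexpression match, followed by suffix. Pass NULL as
 * suffix for a rule that has no second field. */
static bool classify_basic_form(const char *pattern, const char *suffix,
                                const char *word, char *out, size_t outsize)
{
    assert(pattern != NULL && word != NULL && out != NULL);
    regex_t re;
    regmatch_t m[2];
    bool ok = false;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return false;
    if (regexec(&re, word, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t stem = (size_t)m[1].rm_eo;          /* extracted part length */
        size_t slen = suffix ? strlen(suffix) : 0;

        if (stem + slen + 1 <= outsize) {
            memcpy(out, word, stem);
            if (slen > 0)
                memcpy(out + stem, suffix, slen);
            out[stem + slen] = '\0';
            ok = true;
        }
    }
    regfree(&re);
    return ok;
}
```

For instance, an illustrative rule with the pattern "^(.+)ami$" and the suffix "a" would turn "knigami" into the candidate basic form "kniga". A caller would then look that candidate up and move on to the next rule if the lookup fails.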
When nothing is found in the database for a word in its original form, the Prefix detectors rules are applied to it sequentially until a match occurs. The matched prefix is stripped and replaced by the replacement string, if any. The resulting word is then looked up in the Implicit dictionary. On success, the original prefix is restored in the pronunciation string.
The rules from the Correctors ruleset are applied to pronunciation strings rather than to the original words. The second field in these rules specifies a replacement string in which digits serve as subexpression numbers.
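One possible reading of such a replacement string is sketched below. Several points are assumptions rather than documented behaviour: that a bare digit 1..9 expands to the text of the corresponding subexpression, that every other character is copied literally, and that only the matched region of the pronunciation string is replaced. The names append() and apply_corrector() are hypothetical, and the examples are transliterated.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SUB 10  /* whole match plus subexpressions 1..9 */

/* Appends len bytes of src to out at offset *n, keeping room for NUL. */
static bool append(char *out, size_t outsize, size_t *n,
                   const char *src, size_t len)
{
    if (*n + len + 1 > outsize)
        return false;
    memcpy(out + *n, src, len);
    *n += len;
    out[*n] = '\0';
    return true;
}

/* Hypothetical helper, not part of rulexdb. If pattern matches s,
 * writes to out a copy of s in which the matched region is replaced
 * by tmpl, where each digit 1..9 expands to that subexpression. */
static bool apply_corrector(const char *pattern, const char *tmpl,
                            const char *s, char *out, size_t outsize)
{
    assert(pattern != NULL && tmpl != NULL && s != NULL && out != NULL);
    regex_t re;
    regmatch_t m[MAX_SUB];
    size_t n = 0;
    bool ok = false;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return false;
    if (regexec(&re, s, MAX_SUB, m, 0) == 0) {
        /* Text before the match is kept unchanged. */
        ok = append(out, outsize, &n, s, (size_t)m[0].rm_so);
        for (const char *t = tmpl; ok && *t != '\0'; t++) {
            if (*t >= '1' && *t <= '9') {
                const regmatch_t *g = &m[*t - '0'];
                if (g->rm_so != -1)  /* unmatched group expands to nothing */
                    ok = append(out, outsize, &n, s + g->rm_so,
                                (size_t)(g->rm_eo - g->rm_so));
            } else {
                ok = append(out, outsize, &n, t, 1);
            }
        }
        /* Text after the match is kept unchanged. */
        if (ok)
            ok = append(out, outsize, &n, s + m[0].rm_eo,
                        strlen(s + m[0].rm_eo));
    }
    regfree(&re);
    return ok;
}
```

Under these assumptions, the illustrative rule "^(.*)ogo$" with the replacement "1ovo" rewrites the pronunciation string "no+vogo" as "no+vovo". Note that "+" is a metacharacter in the pattern, so a rule that needs to match a stress mark literally has to escape it.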
Upon successful completion, rulexdb_open() returns a RULEXDB pointer that should be used by the other database access functions to reference the database. Otherwise, NULL is returned.
rulexdb_classify(3), rulexdb_close(3), rulexdb_dataset_name(3), rulexdb_discard_dictionary(3), rulexdb_discard_ruleset(3), rulexdb_fetch_rule(3), rulexdb_lexbase(3), rulexdb_load_ruleset(3), rulexdb_remove_item(3), rulexdb_remove_rule(3), rulexdb_remove_this_item(3), rulexdb_retrieve_item(3), rulexdb_search(3), rulexdb_seq(3), rulexdb_subscribe_item(3), rulexdb_subscribe_rule(3)
Igor B. Poretsky <poretsky@mlbox.ru>.
February 19, 2012 |