<s> <w> w1-1 w1-2 w1-3 <w> w2-1 <w> w3-1 w3-2 <w> </s>
where wA-B is the Bth part of the Ath word.
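For example, a three-word sentence whose words have been split into the sub-word units "lan guage", "model" and "tool kit" (the segmentation here is purely illustrative) would be written as:
<s> <w> lan guage <w> model <w> tool kit <w> </s>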
counts2kn [OPTIONS] text_in lm_out
text_in should contain the training set (except for the held-out part for discount optimization) and the results are written to lm_out.
Mandatory options:
-n | --norder | The desired n-gram model order. |
Other options:
-h | --help | Print help |
-o | --opti | The file containing the held-out part of the training set. This will be used for training the discounts. A suitable size for the held-out set is around 100 000 words/tokens. If not set, leave-one-out discount estimates are used. |
-a | --arpa | Output arpa instead of binary. Recommended for compatibility with other tools. This is the only output compatible with SRILM toolkit. |
-x | --narpa | Output nonstandard interpolated arpa instead of binary. Saves a little memory during model creation but the resulting models should be converted to standard back-off form. |
-p | --prunetreshold | Pruning threshold for removing the least useful n-grams from the model. 0.0 for no pruning, 1.0 for lots of pruning. Corresponds to epsilon in [1]. |
-f | --nfirst | Number of the most common words to be included in the language model vocabulary |
-d | --ndrop | Drop the words seen fewer than x times from the language model vocabulary. |
-s | --smallvocab | Assume that the vocabulary does not exceed 65000 words. Saves a lot of memory. |
-A | --absolute | Use absolute discounting instead of Kneser-Ney smoothing. |
-C | --clear_history | Clear language model history at the sentence boundaries. Recommended. |
-3 | --3nzer | Use modified Kneser-Ney smoothing, that is, 3 discount parameters per model order. Recommended. Increases memory consumption somewhat; omit if memory is tight. |
-O | --cutoffs | Use count cutoffs, --cutoffs "val1 val2 ... valN". Remove n-grams seen at most val times. A value is specified for each order of the model; if cutoffs are given only for the first few orders, the last cutoff value is used for all higher-order n-grams. All unigrams are included in any case, so if several cutoff values are given, val1 has no real effect. See the example after this option list. |
-L | --longint | Store the counts in variables of type "long int" instead of "int". This is necessary when the number of tokens in the training set exceeds the number that can be stored in a regular integer. Increases memory consumption somewhat. |
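For example, the following trains a 4-gram model with modified Kneser-Ney smoothing, history clearing at sentence boundaries and arpa output, estimating the discounts on a held-out set (the file names here are only illustrative):
counts2kn -a -3 -C -n 4 -o held_out.txt train.txt model.arpa.gz
To additionally apply count cutoffs, one could add for instance -O "0 1 2"; with these values bigrams seen only once are removed, and the last value (2) is applied to both 3-grams and 4-grams:
counts2kn -a -3 -C -n 4 -O "0 1 2" -o held_out.txt train.txt model.arpa.gz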
varigram_kn [OPTIONS] text_in lm_out
text_in should contain the training set (except for the held-out part for discount optimization) and the results are written to lm_out. A suitable size for the held-out set is around 100 000 words/tokens.
Mandatory options:
-D | --dscale | The threshold for accepting new n-grams into the model. 0.05 for generating a fairly small model, 0.001 for a large model. Corresponds to delta in [1]. |
Other options:
-h | --help | Print help |
-o | --opti | The file containing the held-out part of the training set. This will be used for training the discounts. A suitable size for the held-out set is around 100 000 words/tokens. If not specified, leave-one-out estimates are used for the discounts. |
-n | --norder | Maximum n-gram order that will be searched. |
-a | --arpa | Output arpa instead of binary. Recommended for compatibility with other tools. This is the only output compatible with SRILM toolkit. |
-x | --narpa | Output nonstandard interpolated arpa instead of binary. Saves a little memory during model creation but the resulting models should be converted to standard back-off form. |
-E | --dscale2 | Pruning threshold for removing the least useful n-grams from the model. 1.0 for lots of pruning, 0.0 for no pruning. Corresponds to epsilon in [1]. |
-f | --nfirst | Number of the most common words to be included in the language model vocabulary |
-d | --ndrop | Drop the words seen fewer than x times from the language model vocabulary. |
-s | --smallvocab | Assume that the vocabulary does not exceed 65000 words. Saves a lot of memory. |
-A | --absolute | Use absolute discounting instead of Kneser-Ney smoothing. |
-C | --clear_history | Clear language model history at the sentence boundaries. Recommended. |
-3 | --3nzer | Use modified Kneser-Ney smoothing, that is, 3 discount parameters per model order. Recommended. Increases memory consumption somewhat; omit if memory is tight. |
-S | --smallmem | Do not load the training data into memory; instead, read it from disk each time it is needed. Saves some memory but slows training down somewhat. |
-O | --cutoffs | Use count cutoffs, --cutoffs "val1 val2 ... valN". Remove n-grams seen at most val times. A value is specified for each order of the model; if cutoffs are given only for the first few orders, the last cutoff value is used for all higher-order n-grams. All unigrams are included in any case, so if several cutoff values are given, val1 has no real effect. |
-L | --longint | Store the counts in variables of type "long int" instead of "int". This is necessary when the number of tokens in the training set exceeds the number that can be stored in a regular integer. Increases memory consumption somewhat. |
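For example, the following grows a model of up to 8-grams with modified Kneser-Ney smoothing and a threshold that favors a large model, again with illustrative file names:
varigram_kn -a -3 -C -n 8 -D 0.001 -o held_out.txt train.txt large_model.arpa.gz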
perplexity [OPTIONS] text_in results_out
text_in should contain the test set and the results are printed to results_out.
Mandatory options:
-a | --arpa | The input language model is in either standard arpa format or interpolated arpa. |
-A | --bin | The input language model is in binary format. Either "-a" or "-A" must be specified. |
Other options:
-h | --help | Print help |
-C | --ccs | File containing the list of context cues that should be ignored during perplexity computation. |
-W | --wb | File containing word break symbols. The language model is assumed to be a sub-word n-gram model and word breaks are explicitly marked. |
-X | --mb | File containing morph break prefixes or postfixes. The language model is assumed to be a sub-word n-gram model and morphs that are not preceded (or followed) by a word break are marked with a prefix (or postfix) string. Prefix strings start with "^" (e.g. "^#" tells that a token starting with "#" is not preceded by a word break) and postfix strings end with "$" (e.g. "+$" tells that a token ending with "+" is not followed by a word break). The file should also include the sentence start and end tags (e.g. "^<s>" and "^</s>"), otherwise they are treated as words. |
-u | --unk | The given string is used as the unknown word symbol. For compatibility reasons only. |
-U | --unkwarn | Warn if unknown tokens are seen |
-i | --interpolate | Interpolate with the given arpa LM. See the example after this option list. |
-I | --inter_coeff | Interpolation coefficient. The model given with -i will be weighted by coeff, whereas the main model will be weighted by 1.0-coeff. |
-t | --init_hist | The number of symbols assumed to be known at the sentence start. Normally 1; for sub-word n-grams the initial word break should also be assumed known and this should be set to 2. Default 0 (fix this). |
-S | --probstream | The file where the individual probabilities given to each word are written. |
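For example, to evaluate a model while interpolating it with a second arpa model weighted by 0.3 (so that the main model gets weight 0.7; the file names are illustrative):
perplexity -a main_model.arpa -i other_model.arpa -I 0.3 -t 1 test_set.txt -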
To interpolate several arpa models into a single arpa model with fixed weights:
simpleinterpolate2arpa "lm1_in.arpa,weight1;lm2_in.arpa,weight2" arpa_out
To evaluate the just-created model:
perplexity -t 1 -a model.arpa.bz2 test_set.txt -
or
perplexity -S stream_out.txt.gz -t 1 -a model.arpa.bz2 "| cat test_set.txt | preprocess.pl" out.txt
Note that for evaluating a language model based on subword units, the parameter -t 2 should be used, since the first two tokens (sentence start and word break) are assumed to be known.
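A minimal sketch of such an evaluation, assuming a sub-word model and a file listing the word break symbols (both file names are illustrative):
perplexity -t 2 -W word_breaks.txt -a subword_model.arpa.gz subword_test.txt -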
To create a grown model, do:
varigram_kn -a -o held_out.txt -D 0.1 -E 0.25 -s -C train.txt grown.arpa.gz