Example 12: Candidate Corrections for Mispelled Words
FreeLing includes Mans Hulden's libFOMA Finite-State Technology library, which can be used directly or through FreeLing compounds
and alternatives
modules.
The compounds
module builds a FSM that recognizes the language L(_L)+
, being L
the language of a given dictionary. Thus, this module can find very efficiently whether a given word is a compound made of two or more words present in the given dictionary.
FOMA includes also a very fast approximate search algorithm that can return the word in the FSM language closer to the input string. This is used by the alternatives
module to return a list of candidate words for a string that was not found in the dictionary.
We are going to use Example 9 as a starting point.
First, we will modify configuration and invocation options for our analyzer
to have it perform only morphological analysis, setting MACO_ProbabilityAssignment = false
. This will deactivate the guesser and thus, unknown words will have no analysis, which will trigger the alternatives
modules to take action on these words. (Note that this is the default behaviour of alternatives
, but it can be changed via its configuration file and have it act on all words, only on words with certain PoS, etc...)
///////////////////////////////////////////////////
// Load an ad-hoc set of configuration options
freeling::analyzer::config_options fill_config(const wstring &lang, const wstring &ipath) {
freeling::analyzer::config_options cfg;
// Language of text to process
cfg.Lang = lang;
// path to language specific data
wstring lpath = ipath + L"/share/freeling/" + cfg.Lang + L"/";
// Tokenizer configuration file
cfg.TOK_TokenizerFile = lpath + L"tokenizer.dat";
// Splitter configuration file
cfg.SPLIT_SplitterFile = lpath + L"splitter.dat";
// Morphological analyzer options
cfg.MACO_Decimal = L".";
cfg.MACO_Thousand = L",";
cfg.MACO_LocutionsFile = lpath + L"locucions.dat";
cfg.MACO_QuantitiesFile = lpath + L"quantities.dat";
cfg.MACO_AffixFile = lpath + L"afixos.dat";
cfg.MACO_ProbabilityFile = lpath + L"probabilitats.dat";
cfg.MACO_DictionaryFile = lpath + L"dicc.src";
cfg.MACO_NPDataFile = lpath + L"np.dat";
cfg.MACO_PunctuationFile = lpath + L"../common/punct.dat";
cfg.MACO_ProbabilityThreshold = 0.001;
// other modules are not needed in this example
// Sense annotator and WSD config files
cfg.SENSE_ConfigFile = L"";
cfg.UKB_ConfigFile = L"";
// Tagger options
cfg.TAGGER_HMMFile = L"";
// Statistical dependency parser & SRL config file
cfg.DEP_TreelerFile = L"";
// NEC config file.
cfg.NEC_NECFile = L"";
// Chart parser config file.
cfg.PARSER_GrammarFile = L"";
// Rule based dependency parser config files.
cfg.DEP_TxalaFile = L"";
// Coreference resolution config file.
cfg.COREF_CorefFile = L"";
return cfg;
}
///////////////////////////////////////////////////
// Load an ad-hoc set of invoke options
freeling::analyzer::invoke_options fill_invoke() {
freeling::analyzer::invoke_options ivk;
// Level of analysis in input and output
ivk.InputLevel = freeling::TEXT;
ivk.OutputLevel = freeling::MORFO;
// activate/deactivate morphological analyzer modules
ivk.MACO_UserMap = false;
ivk.MACO_AffixAnalysis = true;
ivk.MACO_MultiwordsDetection = true;
ivk.MACO_NumbersDetection = true;
ivk.MACO_PunctuationDetection = true;
ivk.MACO_DatesDetection = true;
ivk.MACO_QuantitiesDetection = true;
ivk.MACO_DictionarySearch = true;
ivk.MACO_CompoundAnalysis = false;
ivk.MACO_NERecognition = true;
ivk.MACO_RetokContractions = false;
// deactivate guesser so words with no analysis do not get one
// and can be processed by the alternative proposers
ivk.MACO_ProbabilityAssignment = false;
// other modules are not used in this examples
ivk.NEC_NEClassification = false;
ivk.SENSE_WSD_which = freeling::NO_WSD;
ivk.TAGGER_which = freeling::NO_TAGGER;
ivk.DEP_which = freeling::NO_DEP;
return ivk;
}
Next, we will create two instances of the alternatives
module: One using the morphological dictionary and a basic String Edit Distance (SED) matrix to find the closest words, and another using a phonetic dictionary and a phonetic-based SED matrix.
// create alternative porposers
freeling::alternatives alts_ort(ipath+L"/share/freeling/"+lang+L"/alternatives-ort.dat");
// comment this out if there is no phonetic encoder for target language
freeling::alternatives alts_phon(ipath+L"/share/freeling/"+lang+L"/alternatives-phon.dat");
You can tune the behaviour of these modules by changing the options on their respective configuration files. See the user manual for details.
Once the alternative suggestion modules have been created, we only need to run the document through them, as we would do with any other module:
// propose alternative forms based on ortographical distance
alts_ort.analyze(ls);
// propose alternative forms based on phonetic distance.
// Comment this out if there is no phonetic encoder for your language
alts_phon.analyze(ls);
Then we can check and use the obtained alternatives (e.g. to pick the best one, or simply to dump them as in this example)
// for each sentence
for (list<freeling::sentence>::iterator s=ls.begin(); s!=ls.end(); s++) {
// for each word
for (freeling::sentence::iterator w=s->begin(); w!=s->end(); w++) {
// print form, and analysis proposed by the morphological analyzer (if any)
wcout<<L"FORM: "<<w->get_form()<<endl;
wcout<<L" ANALYSIS:";
for (freeling::word::iterator a=w->begin(); a!=w->end(); a++)
wcout<<L" ["<<a->get_lemma()<<L","<<a->get_tag()<<L"]";
wcout<<endl;
// print alternative forms proposed by the alternative suggestors
wcout<<L" ALTERNATIVE FORMS:";
for (list<pair<wstring,int> >::iterator a=w->alternatives_begin(); a!=w->alternatives_end(); a++)
wcout<<L" ["<<a->first<<L","<<a->second<<L"]";
wcout<<endl;
}
wcout<<endl;
}
Code
Find here the whole code:
Example
Assuming the input file contains the following sentences:
The beeg catt eates freshh feesh.
we would obtain the output:
FORM: the
ANALYSIS: [the,DT]
ALTERNATIVE FORMS:
FORM: beeg
ANALYSIS:
ALTERNATIVE FORMS: [big,100] [beg,350] [bag,600] [bead,800] [bug,850] [bid,900] [dig,900]
[beak,1000] [beef,1000] [been,1000] [beep,1000] [beer,1000] [bees,1000]
[beet,1000] [berg,1000] [bee,1000] [be,1000] [bing,1100] [mig,1100] [pig,1100]
[bed,1150] [deg,1150]
FORM: catt
ANALYSIS:
ALTERNATIVE FORMS: [cat,0] [kat,0] [kas,500] [kit,500] [cot,600] [cut,750] [cap,800] [batt,1000]
[cant,1000] [cart,1000] [cash,1000] [cast,1000] [cath,1000] [cats,1000]
[kats,1000] [khat,1000] [kiss,1000] [kite,1000] [matt,1000] [watt,1000]
[att,1000] [cad,1000] [can't,1000] [caste,1000] [catch,1000] [gat,1000]
[capped,1000] [cashed,1000] [cattle,1000]
FORM: eates
ANALYSIS:
ALTERNATIVE FORMS: [eases,500] [ekes,800] [bates,1000] [dates,1000] [eaten,1000] [eater,1000]
[eaves,1000] [fates,1000] [fetes,1000] [gates,1000] [hates,1000] [mates,1000]
[metes,1000] [pates,1000] [rates,1000] [sates,1000] [eaters,1000] [eats,1000]
[elates,1000] [assess,1100] [it's,1100] [itches,1100] [its,1100]
FORM: freshh
ANALYSIS:
ALTERNATIVE FORMS: [fresh,1000]
FORM: feesh
ANALYSIS:
ALTERNATIVE FORMS: [fish,100] [phis,600] [fes,850] [flesh,1000] [fresh,1000] [feat,1000] [fees,1000]
[feet,1000] [fete,1000] [fee,1000] [fey,1000] [cease,1100] [fishy,1100] [fished,1100]
[fit,1100] [peace,1200] [piece,1200] [pees,1200] [sash,1200]
FORM: .
ANALYSIS: [.,Fp]
ALTERNATIVE FORMS: