WSD algorithm based on a new method of vector-word contexts proximity calculation via epsilon-filtration

The problem of word sense disambiguation (WSD) is considered in the article. The input data are sets of synonyms (synsets) and sentences containing words from these synsets; the task is to select automatically the meaning of a target word in a sentence. 1285 sentences were tagged by experts: for every target word, the experts selected one of its dictionary meanings. To solve the WSD problem, an algorithm based on a new method of calculating the proximity of vector-word contexts is proposed. In order to achieve higher accuracy, a preliminary ε-filtering of words is performed, both in the sentence and in the set of synonyms. An extensive series of experiments was carried out: four algorithms were implemented, including the new one. The experiments have shown that in a number of cases the new algorithm yields better results. The developed software and the tagged corpus are available online under an open license. Wiktionary and Wikisource are used. A brief description of this work is available in slides (https://goo.gl/9ak6Gt), and a video lecture in Russian on this research is available online (https://youtu.be/-DLmRkepf58).


Introduction
The problem of word sense disambiguation (WSD) is a real challenge for computer scientists and linguists. Lexical ambiguity is widespread and is one of the obstacles in natural language processing.
In our previous work "Calculated attributes of synonym sets" [6], we proposed a geometric approach to the mathematical modeling of a synonym set (synset) using word vector representations. Several geometric characteristics of synset words were suggested (synset interior, synset word rank and centrality). They are used to select the most significant synset words, i.e. the words whose senses are the nearest to the sense of the synset.
This article continues the line of research on polysemy, synonymy, filtering and WSD, and formulates the mathematical foundations for solving these problems of computational linguistics.
Using the approach proposed in the paper [2], we present a WSD algorithm based on a new context distance (proximity) calculated via ε-filtration. The experiments show the advantages of the proposed distance over the traditional measure based on the similarity of the average vectors of contexts.

New ε-proximity between finite sets
It is quite evident that the choice of the context distance is one of the crucial factors influencing WSD algorithms. Here, in order to classify discrete structures, namely contexts, we propose a new approach to context proximity based on the Hausdorff metric and on the symmetric difference of sets: A △ B = (A ∪ B) ∖ (A ∩ B). Denote by W₁ = {w₁¹, ..., w₁ⁿ¹} and W₂ = {w₂¹, ..., w₂ⁿ²} the sets of vectors corresponding to the words of the contexts. Recall that generally in WSD procedures, the distance between words is measured by the similarity function, which is the cosine of the angle between the vectors representing the words: sim(w₁, w₂) = (w₁, w₂) / (‖w₁‖ · ‖w₂‖), where (w₁, w₂) is the scalar (inner) product of the vectors w₁, w₂, and ‖wᵢ‖ is the norm of a vector, i = 1, 2. In what follows, the distance dist(w₁, w₂) = 1 − sim(w₁, w₂) is used; thus, the less the distance, the more the similarity. Keeping in mind the latter remark, we introduce the following ε-proximity of the vector contexts W₁, W₂. Given ε ≥ 0, construct the set of "near" elements C(W₁, W₂, ε) = {w ∈ W₁ : dist(w, W₂) ≤ ε} ∪ {w ∈ W₂ : dist(w, W₁) ≤ ε}, where dist(w, W) is the minimum of dist(w, v) over v ∈ W. Supposing that dist plays the role of a metric, C(W₁, W₂, ε) is analogous to the expression B(W₁, ε) ∪ B(W₂, ε) (the union of ε-dilatations) in the definition of the Hausdorff distance.
Denote by |E| the power (cardinality) of a set E ⊂ W₁ ∪ W₂. Definition 1. The ε-proximity of the contexts W₁, W₂ is the function K(W₁, W₂, ε) = |C(W₁, W₂, ε)| / |W₁ ∪ W₂|. We also define the following function.
Definition 2. The K̃-proximity of the contexts W₁, W₂ is the function K̃(W₁, W₂, ε) = |C(W₁, W₂, ε)| / (1 + |D(W₁, W₂, ε)|), where D(W₁, W₂, ε) = (W₁ ∪ W₂) ∖ C(W₁, W₂, ε), describing the ratio of the "near" and "distant" elements of the sets.
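These definitions can be sketched in a few lines of Python, with dist = 1 − cosine similarity (the division by |W₁ ∪ W₂| in Definition 1 and the +1 guard in the denominator of Definition 2 are assumptions of this sketch):

```python
import numpy as np

def dist(u, v):
    """1 - cosine similarity: the less the distance, the more the similarity."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def near_distant(W1, W2, eps):
    """Count the "near" elements C (within eps of the opposite set)
    and the "distant" elements D (the rest of W1 and W2)."""
    n_near = sum(1 for w in W1 if min(dist(w, v) for v in W2) <= eps) \
           + sum(1 for v in W2 if min(dist(v, w) for w in W1) <= eps)
    return n_near, len(W1) + len(W2) - n_near

def k_proximity(W1, W2, eps):
    """Definition 1 (sketch): the share of "near" elements."""
    n_near, n_far = near_distant(W1, W2, eps)
    return n_near / (n_near + n_far)

def k_tilde(W1, W2, eps):
    """Definition 2 (sketch): the ratio of "near" to "distant" elements;
    the +1 guards against an empty "distant" set (our assumption)."""
    n_near, n_far = near_distant(W1, W2, eps)
    return n_near / (1 + n_far)
```

For two orthogonal unit vectors, dist = 1, so both elements are "distant" at ε = 0.5 and both are "near" at ε = 1.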
The ubiquitous distance between the contexts W₁, W₂ is based on the similarity of their average vectors ŵ₁, ŵ₂: Dist(W₁, W₂) = sim(ŵ₁, ŵ₂). But the example below (Fig. 3) shows that for two geometrically distant and not too similar structures sim(ŵ₁, ŵ₂) = 1, that is, the similarity takes the maximum value.

Example
Consider the sets A = {a₁, a₂, a₃} and B = {b₁, b₂, b₃} whose average vectors coincide. The equality of the average vectors does not mean the coincidence of A and B, which are rather different (Fig. 3). In what follows, we introduce a procedure of ε-filtration, the idea of which is borrowed from the paper [2].
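This pitfall is easy to reproduce numerically: the two toy sets below are disjoint, yet their average vectors coincide, so the average-vector similarity is maximal (the data are illustrative, not those of Fig. 3):

```python
import numpy as np

def avg_similarity(W1, W2):
    """The ubiquitous context distance: cosine similarity of average vectors."""
    m1, m2 = np.mean(W1, axis=0), np.mean(W2, axis=0)
    return float(np.dot(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2)))

# Two clearly different sets with the same mean vector (0.5, 0.5):
A = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
B = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]

print(avg_similarity(A, B))  # ≈ 1.0, although A and B share no common element
```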
The synset filtration is the formation of a so-called candidate set which consists of those synonyms whose similarity with the words of the sentence is higher than a similarity threshold ε.
The first average algorithm 1, described below, uses the average vectors of the words of the sentence and the average vectors of the candidate sets of synonyms in the synsets.
This algorithm contains the following lines. Line 1. Calculate the average vector ŝ of the words of the sentence S. Then, for every synset, calculate the similarity sim(ŝ, ŝynₖ(ε)) of the average vectors of the sentence and of the k-th filtered synset, and find the index k* maximizing this similarity. Result: the target word w* has the sense corresponding to the k*-th synset Syn_k*. Remark: in the case ε = 0, we denote this algorithm the Avg₀-algorithm. In this case, the traditional averaging of similarity is used.
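A minimal sketch of this averaging scheme, assuming a strict ">" threshold and a fall-back to the unfiltered synset when the candidate set is empty (both are our assumptions):

```python
import numpy as np

def sim(u, v):
    """Cosine similarity of two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_algorithm(sentence, synsets, eps=0.0):
    """Algorithm 1 (sketch): eps-filter each synset by the sentence words,
    then compare average vectors. Returns the index k* of the chosen sense."""
    s_avg = np.mean(sentence, axis=0)        # line 1: sentence average vector
    best_k, best = 0, -2.0                   # cosine similarity is >= -1
    for k, synset in enumerate(synsets):
        # candidate set: synonyms similar enough to some sentence word;
        # fall back to the whole synset if the filter removes everything
        cand = [v for v in synset
                if any(sim(v, w) > eps for w in sentence)] or synset
        s = sim(s_avg, np.mean(cand, axis=0))
        if s > best:
            best_k, best = k, s
    return best_k
```

With eps = 0 the filter is almost inactive, which mimics the Avg₀ case of plain averaging.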

Note. The Avg₀-algorithm was used in our experiments; it was implemented in Python. 1

Avg₀-algorithm example
A simple example and Figures 4-6 will help to understand how the Avg₀-algorithm works.
Take a dictionary word w₂ with several senses, i.e. several synonym sets (for example, Syn₁ and Syn₂), and a sentence S with this word (Fig. 4). The task is to select, via the Avg₀-algorithm, the meaning (synset) of w₂ used in the sentence; that is, the target word is w₂*.
Let us match the input data with the symbols used in the Avg₀-algorithm. The word "служить" (sluzhit', 'to serve') corresponds to the vector w₂, and it is the target word w₂*. There is a dictionary article about this word in the Wiktionary, see Fig. 4 (a parsed database of Wiktionary is used in our projects). 2 Two synonym sets of this Wiktionary entry are denoted by Syn₁ and Syn₂. The mean values of the vectors corresponding to the synonyms in these synsets are denoted by ŝyn₁ and ŝyn₂, and ŝ is the mean vector of all the vectors corresponding to the words of the sentence S containing the word "служить" (sluzhit'). Fig. 5 shows the vertices (vectors corresponding to words) of the sentence S; the vertex w₂ is excluded since it corresponds to the target word w₂*. Fig. 6 shows that the similarity between the mean vector of the sentence and the first synonym set is lower than the similarity with the second synset, that is sim(ŝ, ŝyn₁) < sim(ŝ, ŝyn₂). Thus, the second sense of the target word w₂* (the second synset Syn₂) will be selected in the sentence by the Avg₀-algorithm.

Average algorithm with sentence and synonyms ε-filtration (Avgε-algorithm)
This algorithm 2 is a modification of algorithm 1: the ε-filtration of the sentence is added to the synset filtration. Namely, we select those words of the sentence whose similarity with at least one synonym of the synset is higher than the similarity threshold ε. Then we average the vectors of the selected words, which form the set of candidates from the sentence. Let us explain algorithm 2 line by line.
Lines 2-5. Given ε > 0, let us construct the set Sₖ(ε) of the words of the sentence filtered by the synonyms of the k-th synset.

Denote by |Sₖ(ε)| the power of the set Sₖ(ε). Line 6. Calculate the average vector ŝₖ(ε) of the words of the filtered sentence. If |Sₖ(ε)| = 0, then let ŝₖ(ε) be equal to the zero vector.
Lines 7-8. Construct the filtered sets of synonyms Synₖ(ε), and denote by |Synₖ(ε)| the power of the k-th filtered synonym set. Line 9. For |Synₖ(ε)| > 0, calculate the average vector ŝynₖ(ε) of the k-th synset of candidates. If |Synₖ(ε)| = 0, then ŝynₖ(ε) equals the zero vector.
Line 10. Calculate the similarity sim(ŝₖ(ε), ŝynₖ(ε)) of the average vectors of the filtered sentence and of the k-th filtered synset.
If k* is not unique, then take another ε > 0 and repeat the procedure from line 2.
Result: the target word w* in the sentence has the sense corresponding to the k*-th synset Syn_k*. This algorithm was implemented in Python. 3

Algorithm 2: Average algorithm with sentence and synonyms ε-filtration (Avgε)
Data: w* – the vector of the target word w* with N senses (synsets); S – the sentence with the target word, w* ∈ S; {Synₖ} – the synsets of the target word, that is Synₖ ∋ w*, k = 1, ..., N.
Result: k* ∈ {1, ..., N} – the number of the sense of the word w* in the sentence S.
The body of Algorithm 2 follows the lines described above: construct the set of the words of the sentence filtered by the synonyms of the k-th synset; calculate the average vector of the sentence candidates; perform the ε-filtration of the synset by the sentence; calculate the similarity of the average vectors of the filtered sentence and of the k-th filtered synset; select the synset with the maximum similarity.

K̃ε-algorithm

The algorithm 3 (K̃ε-algorithm) is based on the function K̃(Synₖ, S, ε) (see the previous section "New ε-proximity between finite sets"), where W₁ = Synₖ, that is, the k-th synset, and W₂ = S, where S is a sentence. The algorithm includes the following steps.
Lines 2-4. Given ε > 0, let us construct the set Cₖ(ε) of the "near" words of the k-th synset and the sentence S. Line 5. Denote by Dₖ(ε) the set of the "distant" words. Line 6. Calculate K̃ₖ(ε) as the ratio of the "near" and "distant" elements of the sets. Lines 8-9. Suppose the maximum of K̃ₖ(ε) over k = 1, ..., N equals K̃_k*(ε). If k* is not unique, then take another ε > 0 and repeat the procedure from line 2. Result: the target word w* has the sense corresponding to the k*-th synset Syn_k*. An example of constructing the C and D sets is presented in Fig. 7 and in the Table. It uses the same source data as the Avg₀-algorithm example, see Fig. 5.
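Under the same assumptions as before (dist = 1 − cosine similarity, a +1 guard in the denominator of the ratio), these steps can be sketched as:

```python
import numpy as np

def dist(u, v):
    """1 - cosine similarity."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def k_tilde_algorithm(sentence, synsets, eps):
    """K̃-eps algorithm (sketch): pick the synset with the best
    "near"/"distant" ratio against the sentence."""
    scores = []
    for synset in synsets:
        # lines 2-4: "near" words of the synset and the sentence
        near = sum(1 for u in synset if any(dist(u, w) <= eps for w in sentence)) \
             + sum(1 for w in sentence if any(dist(w, u) <= eps for u in synset))
        # line 5: everything else is "distant"
        far = len(synset) + len(sentence) - near
        # line 6: ratio of "near" to "distant" (+1 is our guard)
        scores.append(near / (1 + far))
    return int(np.argmax(scores))          # lines 8-9: argmax over the synsets
```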
Remark. This algorithm is applicable to the K-function described in the section "New ε-proximity between finite sets" as well. This algorithm was implemented in Python. 4 More details for this example (Fig. 7) are presented in the Table, which shows the C and D sets for different ε and the corresponding values of the K̃-function.
Bold type of the word-vertices in the Table indicates new vertices: with each subsequent dilatation extension (each subsequent εᵢ), these new vertices are captured by the set of "near" vertices and excluded from the set of "distant" vertices. For example, in the transition from ε₁ to ε₂ the set D₂(ε) loses the vertex w₃, while the set C₂(ε) gains the same vertex w₃.
In Fig. 8, the function K̃₁(ε) shows the proximity of the sentence S and the synset Syn₁, and the function K̃₂(ε) shows the proximity of S and the synset Syn₂. It can be seen in Figure 8 that with decreasing ε, the value of K̃₂(ε) grows faster than K̃₁(ε). Therefore, the sentence is closer to the second synset Syn₂. The same result can be seen in the previous Fig. 7, which presents an example of the K̃ε-algorithm treating the word w₂ with two synsets Syn₁, Syn₂ and the sentence S, where w₂ ∈ S (see Fig. 4). The number of the algorithm iteration corresponds to the index of ε; let the series of εᵢ be ordered so that ε₀ = 1 > ε₁ > ε₂ > ... > ε₇ = −1. Note that |Cₖ(ε)| + |Dₖ(ε)| = |Synₖ ∪ (S ∖ {w₂*})|, that is, the total number of words in the synsets and in the sentence is constant.

Web of tools and resources
This section describes the resources used in our research, namely: Wikisource, Wiktionary, WCorpus and RusVectores.
The developed WCorpus 5 system includes texts extracted from Wikisource and provides the user with a text corpus analysis tool. This system is based on the Laravel framework (PHP programming language); a MySQL database is used. 6 Wikisource. The texts of Wikipedia have been used as a basis for several contemporary corpora [5], but we are not aware of any use of Wikisource texts in text processing. Wikisource is an open online digital library with texts in many languages. Wikisource sites contain 10 million texts 7 in more than 38 languages. 8 The Russian Wikisource (the database dump as of February 2017) was used in our research.
Texts parsing. The texts of Wikisource were parsed, analysed and stored in the WCorpus database. Let us describe this process in detail. The database dump containing all the texts of the Russian Wikisource was taken from the "Wikimedia Downloads" site. 9 These Wikisource database files were imported into the local MySQL database titled "Wikisource Database" in Fig. 9, where "WCorpus Parser" is the set of WCorpus PHP scripts which analyse and parse the texts in the following steps.
1. First, the title and the text of each article are extracted from the Wikisource database (560 thousand texts). One text corresponds to one page on the Wikisource site. It may be small (for example, several lines of a poem), medium (a chapter or a short story), or huge (e.g. the page with the novella "The Eternal Husband" written by Fyodor Dostoyevsky is 500 KB).
2. Then the texts are preprocessed:
• texts written in English and texts in Russian orthography before 1918 were excluded (about 12 thousand texts);
• service information (wiki markup, references, categories and so on) was removed from the texts;
• very short texts were excluded.
As a result, 377 thousand texts were extracted.
3. The texts were split into sentences (6 million sentences), and the sentences were split into words (1.5 million unique words).
4. Lastly, lemmas, wordforms, sentences and the relations between words and sentences were stored in the WCorpus database (Fig. 9).
In our previous work "Calculated attributes of synonym sets" [6] we also used the neural network models of the RusVectores project 11 , a word2vec-based tool built on Russian texts [9].

Context similarity algorithms evaluation
In order to evaluate the proposed WSD algorithms, several words were selected from a dictionary, then sentences with these words were extracted from the corpus and tagged by experts.

Nine words
Only polysemous words which have at least two meanings with different sets of synonyms are suitable for our evaluation of WSD algorithms.
The following criteria for the selection of synonyms and sets of synonyms from Russian Wiktionary were used: 1. Only single-word synonyms are extracted from Wiktionary. This is due to the fact that the RusVectores neural network model "ruscorpora_2017_1_600_2" used in our research does not support multiword expressions.
2. If a word has meanings with equal sets of synonyms, then these sets were skipped because it is not possible to discern different meanings of the word using only these synonyms without additional information.
A list of polysemous words was extracted from the parsed Russian Wiktionary 12 using PHP API piwidict 13 (Fig. 9).

Sentences of three Russian writers
The sentences which contain the previously defined 9 words were to be selected from the corpus and tagged by experts. But the whole Wikisource corpus was too large for this purpose, so a subcorpus of Wikisource texts was used in our research: the texts written by Fyodor Dostoevsky, Leo Tolstoy and Anton Chekhov.
Analysis of the created WCorpus database with the texts of the three writers shows that the subcorpus contains: 15
• 2635 texts;
• 333 thousand sentences;
• 215 thousand wordforms;
• 76 thousand lemmas.
The texts of this subcorpus contain 1285 sentences with the 9 selected words, wherein these 9 words have in total 42 synsets (senses). A graphical user interface (webform) of the WCorpus system (Fig. 10) was developed, where the experts selected one of the senses of the target word for each of the 1285 sentences.
This subcorpus database with tagged sentences and linked synsets is available online [7].

Text processing and calculations
These 1285 sentences were extracted from the corpus and split into tokens; then the wordforms were extracted, lowercased and lemmatized. Therefore, a sentence is treated as a bag of words. Sentences with only one word were skipped.
The phpMorphy lemmatizer takes a wordform and yields the possible lemmas with the corresponding parts of speech (POS). Information on the POS of a word is needed to work with the RusVectores prediction neural network model "ruscorpora_2017_1_600_2", because to get a vector it is necessary to query a word together with its POS, for example "serve_VERB". Only nouns, verbs, adjectives and adverbs remain in the bag of words of a sentence; other words were skipped.
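For illustration, such "lemma_POS" query keys can be assembled as follows (the helper name is ours; only the four content POS mentioned above are kept):

```python
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}   # the four POS kept in the bag

def rusvectores_keys(lemmas_with_pos):
    """Turn (lemma, POS) pairs into 'lemma_POS' query keys,
    keeping only nouns, verbs, adjectives and adverbs."""
    return [f"{lemma.lower()}_{pos}"
            for lemma, pos in lemmas_with_pos if pos in CONTENT_POS]

print(rusvectores_keys([("Служить", "VERB"), ("и", "CCONJ"), ("дом", "NOUN")]))
# ['служить_VERB', 'дом_NOUN']
```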
The computer program (Python scripts) which works with the WCorpus database and RusVectores was written and presented in the form of the wcorpus.py project at GitHub. 16 The source code in the file synset_selector.py 17 implements the three algorithms described in the article, namely:
• the Avg₀-algorithm, implemented in the function selectSynsetForSentenceByAverageSimilarity();
• the K̃ε-algorithm – the function selectSynsetForSentenceByAlienDegree();
• the Avgε-algorithm – the function selectSynsetForSentenceByAverageSimilarityModified().
These three algorithms calculated and selected one of the possible synsets for each of 1285 sentences.
Two of the algorithms (Avgε and K̃ε) have the input parameter ε; therefore, a cycle over ε from 0 to 1 with a step of 0.01 was added, which resulted in 100 iterations for each sentence.
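Such a sweep can be sketched in a few lines (count_correct is a hypothetical callback returning the number of correctly tagged sentences for a given ε):

```python
def eps_sweep(count_correct, step=0.01, n=100):
    """Run an eps-dependent WSD algorithm for eps = 0.00, 0.01, ..., 0.99.
    count_correct(eps) -> number of correctly tagged sentences at that eps."""
    return {round(i * step, 2): count_correct(round(i * step, 2))
            for i in range(n)}
```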
Then, answers generated by the algorithms were compared with the synsets selected by experts.
The number of sentences with the sense correctly tagged by the K̃ε-algorithm for the nine Russian words is presented in Fig. 11.
The legend of this figure lists the target words with two numbers in brackets (n, m), where n is the number of sentences with the word and m is the number of its senses.
The curves for the words "ЗАНЯТИЕ" ("ZANYATIYE", cyan solid line with star points) and "ОТСЮДА" ("OTSYUDA", green solid line with triangle points) are quite high for some ε, because (1) there are many sentences with these words (352 and 308) in our subcorpus, and (2) these words have few meanings (3 and 2, respectively). If a word has more meanings, then the algorithm yields poorer results. This is visible in the normalised data (Fig. 12), where the examples with good results are "ОТСЮДА" (OTSYUDA) and "ЛИХОЙ" (LIKHOY, pink dash-dot line with diamond points) with 2 meanings each; the example "БРОСАТЬ" (BROSAT', red bold dotted line) with 9 meanings has the worst result (the lowest dotted curve).

Comparison of three algorithms
Let us compare the three algorithms by summing the results over all the nine words. Fig. 13 contains the following curves:
• Avg₀-algorithm – long-dash blue line;
• K̃ε-algorithm – solid red line;
• Avgε-algorithm – dash yellow line.
The Avg₀-algorithm does not depend on ε. It showed mediocre results.
The K̃ε-algorithm yields better results than the Avgε-algorithm when ε > 0.15, and it showed the best results on the interval [0.15; 0.35]. Namely, more than 700 sentences (out of 1285 human-tagged sentences) were properly tagged by the K̃ε-algorithm on this interval (Fig. 13).

Comparison of four algorithms as applied to nine words
Let us compare the results of running the four algorithms for each word separately (Fig. 14):
• Avg₀-algorithm – long-dash blue line with triangle points;
• K̃ε-algorithm – solid red line with square points;
• Avgε-algorithm – dash yellow line with circle points;
• "most frequent meaning" algorithm – green dashed line with X marks.
The simple "most frequent meaning" algorithm was added for comparison. This algorithm does not depend on the variable ε; it always selects the meaning (synset) that is the most frequent in our corpus of texts. In Fig. 14 this algorithm corresponds to the green dashed line with X marks.
The results of the "most frequent meaning" algorithm and of the Avg₀-algorithm are similar (Fig. 14).
The K̃ε-algorithm is the absolute champion in this competition: for each word there exists an ε such that the K̃ε-algorithm outperforms the other algorithms (Fig. 14).
Let us explain the calculation of the curves in Fig. 14.
For the Avg₀-algorithm and the "most frequent meaning" algorithm, the meaning (synset) is calculated for each of the nine words on the set of 1285 sentences. Thus, 1285 · 2 calculations were performed.
Again, the K̃ε-algorithm and the Avgε-algorithm depend on the variable ε. But how can the results be shown without the ε axis? If at least one value of ε gives a positive result, then we suppose that the WSD problem for this sentence was correctly solved by the algorithm.
In Fig. 14, the value on the Y axis for the selected word (for the K̃ε-algorithm and the Avgε-algorithm) is equal to the number of sentences correctly determined in this sense (with different values of ε). Perhaps it would be more correct to fix the ε corresponding to the maximum number of correctly determined sentences; then the result would not be so optimistic.
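The two counting schemes ("at least one ε" versus a single fixed ε) can be contrasted in a few lines (the data layout is hypothetical):

```python
def optimistic_and_best_fixed(correct_by_eps):
    """correct_by_eps: {eps: set of ids of sentences solved at that eps}.
    Returns (a) the optimistic count, where a sentence counts if at least
    one eps solves it, and (b) the count at the single best fixed eps."""
    solved_somewhere = set().union(*correct_by_eps.values())
    best_fixed_eps = max(len(ids) for ids in correct_by_eps.values())
    return len(solved_somewhere), best_fixed_eps
```

The optimistic count is always at least as large as the best fixed-ε count, which is why fixing ε gives a less optimistic picture.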
To show the complexity of comparing and evaluating ε-algorithms (that is, algorithms that depend on ε), let us analyze the results of the K̃ε-algorithm shown in Fig. 15. The percentage (proportion) of the 1285 sentences correctly determined for the 9 words by the K̃ε-algorithm, where the variable ε changes from 0 to 1 in increments of 0.01, is presented in Fig. 15. Thus, 1285 · 100 calculations were performed. These proportions are distributed over the set of possible outcomes from 0% (no sentence is guessed) to 100% (all sentences are guessed) for each of the nine words.
Figure 15 does not show which ε values produce better or poorer results, although this can be seen in Figures 11-13. But the figure does show the range and the quality of the results obtained with the help of the K̃ε-algorithm. For example, the word "лихой" (likhoy), with 22 sentences and 100 different ε, has only 8 different outcomes of the K̃ε-algorithm, seven of which lie in the region above 50%; that is, more than eleven sentences are guessed at any ε. The word "бросать" (brosat') has the largest number of meanings in our data set: it has 9 synonym sets in our dictionary and 11 meanings in the Russian Wiktionary. 18 All possible results of the K̃ε-algorithm for this word are distributed in the range of 10-30%. The maximum share of guessed sentences is 30.61%; note that this value is achieved at ε = 0.39, which is clearly visible in Figure 12 (the thick dotted line).
All calculations, charts drawn from experimental data and results of the experiments are available online in Google Sheets [8].

Conclusions
The development of the corpus analysis system WCorpus 19 was started: 377 thousand texts were extracted from Russian Wikisource, processed and uploaded to this corpus.
Context-predictive models of the RusVectores project are used to calculate the distance between lemmas. Python scripts were developed to process the RusVectores data; see the wcorpus.py project on the GitHub website.
The WSD algorithm based on a new method of vector-word contexts proximity calculation is proposed and implemented. Experiments have shown that in a number of cases the new algorithm shows better results.
Future work includes matching the Russian lexical resources (Wiktionary, WCorpus) with Wikidata objects [11].