Word Embeddings and Evaluation Datasets

This site houses evaluation datasets for semantic and morphological relatedness in the Icelandic language, along with pre-trained word embeddings evaluated using those datasets.

The datasets, called MSL and IceBATS, are based on international standards, and have been fully localized and adapted to the Icelandic language.

The embeddings are based on three different algorithms - word2vec, fastText, and GloVe - and have been trained using both lemmatized and nonlemmatized data from The Icelandic Gigaword Corpus (IGC), a tagged corpus intended for linguistic research.

About MSL

Our MSL dataset may be downloaded here.

MSL, short for Multi-SimLex, is an evaluation protocol and associated dataset for lexical semantics. The original English-language MSL builds on several older, well-known datasets, most notably SimLex-999, and has already been released in a dozen languages.

A fully processed MSL dataset consists of 1,888 unordered word pairs, where each pair is tagged with grammatical categories and marked with a numerical score that indicates the words' semantic similarity. A small number of pairs in our set contain multiword phrases, used where no suitable single-word translation was available.

(It should be noted that MSL is, by design, not intended to measure semantic relatedness. For example, antonyms, although certainly related in a linguistic sense, are not similar in meaning. The words "black" and "white" may thus be strongly related, both being colors, but are notionally dissimilar and would likely receive a low score as an MSL pair.)

This similarity score is derived from a team of annotators who, working separately, give each pair in the set a grade between 0 and 6 (inclusive) according to how semantically similar its two words are, with 0 being the lowest possible level of similarity and 6 the highest. The raw sets of annotator scores are then evaluated and filtered through repeated calculation of average pairwise inter-annotator agreement (APIAA) and average mean inter-annotator agreement (AMIAA), with individual annotators' full sets of scores being removed until a maximum agreement level, or a minimum number of annotators, is reached.
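The two agreement measures can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the Icelandic set: it assumes, following the published Multi-SimLex protocol, that agreement is measured with Spearman's rank correlation, that APIAA averages the correlation over all annotator pairs, and that AMIAA correlates each annotator with the mean of the remaining annotators. The function names and the toy scores are invented for the example, and the iterative removal of annotators is not shown.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def apiaa(annotations):
    """Average Spearman correlation over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


def amiaa(annotations):
    """Average correlation of each annotator with the mean of the others."""
    total = 0.0
    for i, own in enumerate(annotations):
        others = [s for j, s in enumerate(annotations) if j != i]
        mean_of_others = [sum(column) / len(column) for column in zip(*others)]
        total += spearman(own, mean_of_others)
    return total / len(annotations)


# Invented scores: one list per annotator, one grade (0-6) per word pair.
toy_annotations = [
    [0, 2, 4, 6, 3],
    [1, 2, 5, 6, 3],
    [0, 3, 4, 5, 6],
]
print(round(apiaa(toy_annotations), 3), round(amiaa(toy_annotations), 3))
```

Both functions return values between -1 and 1, which is the scale on which the agreement figures reported for MSL sets are expressed.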

Scores of 0.600 and higher indicate "strong agreement" for both APIAA and AMIAA. The average overall APIAA for previously published MSL sets is 0.631, while the Icelandic MSL set has a score of 0.690. Likewise, the average overall AMIAA is 0.740, while the Icelandic set scores 0.799.

As noted at the beginning of this section, the Icelandic MSL dataset used to evaluate our word embeddings may be downloaded here. Each line represents a single pair and contains four tab-separated entries: the first word in the pair; the second word in the pair; the average annotator score from the final APIAA-filtered set of scores; and the same score normalized to a 0-to-1 range. Multiword phrases have their constituent words separated by a single space.
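Reading a file in this four-column layout could look roughly as follows. This is a sketch under assumptions: the names `MslPair` and `read_msl` are hypothetical, the sample rows and their scores are invented for illustration, and the numbers are assumed to use a dot as the decimal separator.

```python
import io
from typing import NamedTuple


class MslPair(NamedTuple):
    word1: str
    word2: str
    score: float       # average annotator score on the 0-6 scale
    normalized: float  # the same score on a 0-to-1 scale


def read_msl(handle):
    """Parse tab-separated MSL lines from an open text file or file-like object."""
    pairs = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        word1, word2, score, normalized = line.split("\t")
        pairs.append(MslPair(word1, word2, float(score), float(normalized)))
    return pairs


# Invented sample rows; the second pair contains a multiword phrase.
sample = io.StringIO(
    "hundur\tgæludýr\t3.0\t0.5\n"
    "hundur\tlítill hundur\t4.5\t0.75\n"
)
for pair in read_msl(sample):
    print(pair.word1, "|", pair.word2, "|", pair.score, "|", pair.normalized)
```

For a real file, `read_msl(open(path, encoding="utf-8"))` would be the equivalent call. Because only tabs separate the fields, multiword phrases with internal spaces are kept intact.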

A separate list, containing the original English-language pairs and their grammatical categories alongside their Icelandic counterparts, may be downloaded here. This list is intended only for reference; it contains no annotations and was not used to evaluate our embeddings.

About IceBATS

Our IceBATS dataset may be downloaded here.

IceBATS is an Icelandic adaptation of the Bigger Analogy Test Set (BATS), which is intended to evaluate word embeddings on word analogy tasks. This extensive set tests how well a language model's embeddings capture various linguistic relations, using the vector offset method. In its simplest form, a word analogy consists of two word pairs, (a:b) and (c:d), where the relationship between a and b is considered analogous to the relationship between c and d. A famous example is (man:woman) and (king:queen). If the word embeddings have been suitably trained, the offset between the word vectors b and a should be roughly equal to that between d and c. In other words, d = c - a + b should hold approximately or, in our example, queen = king - man + woman: the linguistic relation is captured as an offset in the vector space.
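The vector offset method can be illustrated with a small, self-contained sketch. The two-dimensional vectors below are invented so that the arithmetic is easy to follow; real embeddings have hundreds of dimensions. The function names are hypothetical, and excluding the three question words from the candidates is a common convention that we assume here rather than a description of the evaluation code on this site.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def solve_analogy(vectors, a, b, c):
    """Return the word whose vector is closest to c - a + b."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The question words themselves are not allowed as answers.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Invented toy vectors.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}

print(solve_analogy(toy_vectors, "man", "woman", "king"))  # queen
```

Here king - man + woman gives the vector (3, 1), which coincides with the toy vector for queen, so queen is returned ahead of the unrelated word.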

The test set contains 98,000 analogy questions that cover inflectional and derivational morphology as well as lexicographic and encyclopedic semantics. Each of these four categories is divided into 10 subcategories, and each subcategory has 50 unique word pairs. The morphological categories are sampled to reduce homonymy, so that words that can belong to more than one part of speech are avoided (e.g. run : runs, which could be either nouns or verbs). Additionally, the semantic categories include multiple correct answers where applicable, something that becomes especially important when testing relations such as hyponyms and hypernyms.
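The figure of 98,000 questions follows from the structure just described if, as we assume here, every word pair in a subcategory is combined with every other pair in the same subcategory, and the order of the two pairs matters. The sketch below uses placeholder pairs rather than real IceBATS data.

```python
from itertools import permutations

CATEGORIES = 4
SUBCATEGORIES_PER_CATEGORY = 10
PAIRS_PER_SUBCATEGORY = 50


def questions_in_subcategory(pairs):
    """Build every ordered combination of two distinct word pairs (a:b, c:d)."""
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]


# Placeholder pairs standing in for one subcategory.
placeholder_pairs = [(f"a{i}", f"b{i}") for i in range(PAIRS_PER_SUBCATEGORY)]

per_subcategory = len(questions_in_subcategory(placeholder_pairs))  # 50 * 49
total = CATEGORIES * SUBCATEGORIES_PER_CATEGORY * per_subcategory

print(per_subcategory, total)  # 2450 98000
```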

IceBATS follows the original set structurally, making minor changes where applicable. Most changes stem from morphological and syntactic characteristics of Icelandic that differ from those of English.

Inflectional morphology
  Nouns
    I01  singular, nom-gen (maður - manns)
    I02  singular, articles (veður - veðrið)
    I03  plural, articles (dætur - dæturnar)
    I04  nominative, sing-plur (félag - félög)
  Adjectives
    I05  comparative masc (veikur - veikari)
    I06  comparative fem (grunn - grynnri)
    I07  comparative neutral (nýtt - nýrra)
  Verbs
    I08  inf - indic sing past (fara - fór)
    I09  inf - past
    I10  past participle - indic sing past (farið - fór)

Derivational morphology
  Affix added
    D01  ó adjectives (skemmtilegur - óskemmtilegur)
    D02  aðal nouns (leikkona - aðalleikkona)
    D03  a verbs (hopp - hoppa)
  Inflectional ending cut, suffix added
    D04  andi nouns (eiga - eigandi)
    D05  ing nouns (dreifa - dreifing)
    D06  legur adjectives (nauðsyn - nauðsynlegur)
  Includes sound shift in stem
    D07  na verbs (unninn - vinna)
    D08  ari nouns (dæma - dómari)
    D09  ingur nouns (andstaða - andstæðingur)
    D10  ja verbs (varinn - verja)

Lexicographic semantics
  Hypernyms
    L01  animals (hundur - gæludýr/spendýr)
    L02  misc (tölva - tæki/tækni)
  Hyponyms
    L03  misc (poki - bakpoki/plastpoki)
  Meronyms
    L04  substance (andrúmsloft - súrefni/vetni)
    L05  member (hreindýr - hjörð/hópur)
    L06  part (horn - hreindýr/hrútur)
  Synonyms
    L07  intensity (reiður - gramur/heiftugur)
    L08  exact (drengur - piltur/sveinn)
  Antonyms
    L09  gradable (ódýr - dýr/ómetanlegur)
    L10  binary (hvítur - svartur)

Encyclopedic semantics
  Geography
    E01  country - capital (Ísland - Reykjavík)
    E02  country - language (Kanada - enska/franska)
    E03  Icelandic town - townee (Reykjavík - Reykvíkingur)
    E04  country - countryman (Spánn - Spánverji)
  People
    E05  nationalities (Beethoven - þýskur/austurrískur)
    E06  occupation (Laxness - rithöfundur/skáld)
  Animals
    E07  the young (hestur - folald/fyl)
    E08  sounds (kýr - baula)
  Other
    E09  thing - color (blóð - rauður/djúprauður)
    E10  male - female (leikari - leikkona)

The Embeddings

We trained embeddings based on three models: word2vec, fastText, and GloVe.

Embeddings for word2vec may be downloaded here.

Embeddings for fastText may be downloaded here.

Embeddings for GloVe may be downloaded here.

The Code

Code for training and evaluation using word2vec may be downloaded here, for fastText here, and for GloVe here.

All programs were written in Python and run on Ubuntu using Python 3.6. We strongly recommend that the user read the inline comments in the code before running it. Certain options - for example, loading pre-trained embeddings rather than training them from scratch - may be valuable to some users, but will only be activated if the corresponding lines of code are uncommented.

Each program performs two main tasks: training embeddings for the model in question, and then evaluating those embeddings using MSL and IceBATS. To avoid memory-related issues, the programs generally try to load only the word vectors themselves into memory, rather than the entire trained model.

Training allows the user to define a number of hyperparameters, and to choose between training a model from scratch or loading the vectors from previously trained embeddings.

Evaluation employs MSL and IceBATS sequentially inside a single function. The user may safely comment out one of the two, and only run the other, as needed.

In addition, MSL evaluations employ and chain together several postprocessing methods. Each such sequence is clearly delineated inside the code, and may be commented out if not needed.

How to use

As noted earlier, the embeddings creation process requires specially prepared data from the IGC. The raw data may be downloaded from the CLARIN-IS site, and the code to prepare it for use is available here.

For fastText and word2vec, the entire training and evaluation process takes place inside a single Python program. For GloVe, on the other hand, the user needs to run a Bash script, which is also where the GloVe hyperparameters are set and which in turn executes a Python program.

We make use of the Gensim module for Python, which should be installed as version 4.0.0b or higher; versions 3.x and lower are not supported. All other dependencies are listed in the import statements of each Python program.
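A quick way to guard against an outdated installation is to check the major version number before running anything else. The helper names below are hypothetical and are not part of the published code. The check is deliberately coarse: it only looks at the major version, on the assumption that every 4.x release, including the 4.0.0 beta, is acceptable.

```python
import re


def major_version(version):
    """Extract the leading integer from a version string such as '4.0.0b0'."""
    match = re.match(r"(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1))


def gensim_is_supported(version):
    """Return True for Gensim 4.x and later, False for 3.x and earlier."""
    return major_version(version) >= 4


print(gensim_is_supported("4.0.0b0"), gensim_is_supported("3.8.3"))  # True False
```

In practice the string to test would come from the installed package, for example `gensim_is_supported(gensim.__version__)` after `import gensim`.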

All ancillary input files should be saved to the same directory as the Python code. All output files will automatically be written to that directory as well.

The default hyperparameter values are set to those we used to train the embeddings. While we encourage any user to try their hand at using different values, it should be noted that some hyperparameters may affect, or be affected by, computer hardware, and changing them could lead to memory errors or system stability issues.

People

The following people have worked on the project:

Hjalti Daníelsson, project management, software development, data collection, and translation
Steinunn Rut Friðriksdóttir, project management, software development, data collection, and translation
Steinþór Steingrímsson, project management
Einar Freyr Sigurðsson, data collection
Hildur Hafsteinsdóttir, data collection
Þórdís Dröfn Andrésdóttir, data collection and translation
Þórður Arnar Árnason, data collection and translation
Gunnar Thor Örnólfsson, data collection

In addition, we would like to express our sincere gratitude to all the volunteers who participated in the annotation of Multi-SimLex data.

Licensing and Citing

All material on this site - including the embeddings, evaluation datasets, and relevant code - is published under CC BY 4.0, as is the IGC.

If you use the embeddings, datasets, or code in your published research, please cite the CLARIN repository and this website:

[CITATION FOR CLARIN REPOSITORY]
[CITATION FOR THIS WEBSITE?]

In addition, please consider citing the following papers, when applicable:

[CITATION FOR ICEBATS PAPER]
[CITATION FOR MSL PAPER]

Cooperation and Financing

Work on the embeddings and evaluation datasets was carried out at The Árni Magnússon Institute for Icelandic Studies. It was supported by the Language Technology Programme for Icelandic 2019-2023, funded by the Icelandic government.