Models

As of now, you can choose from the following models:

* British National Corpus (bnc)

We trained this model on the British National Corpus. The corpus size is 98 million tokens. The model knows 163 473 different English words (lemmas).

Linguistic preprocessing: split into sentences, tokenized, lemmatized, PoS-tagged, phrases (two-word entities) extracted, stop words removed.

Context window length: 10 words to the left and 10 words to the right.

Model performance:

  • SimLex999: 0.39
  • Google Analogy: 0.56

Download the model (176 Mbytes)

* English Wikipedia (enwiki)

We trained this model on all the texts from the English Wikipedia dump (February 2017). The corpus size is about 2.3 billion tokens. The model knows 296 630 different English words (lemmas).

Linguistic preprocessing: split into sentences, tokenized, lemmatized, PoS-tagged, multi-word named entities extracted, stop words removed.

Context window length: 5 words to the left and 5 words to the right.

Model performance:

  • SimLex999: 0.40
  • Google Analogy: 0.81

Download the model (318 Mbytes)

* English Gigaword (gigaword)

We trained this model on all the texts from the English Gigaword Fifth Edition corpus. The corpus size is about 4.8 billion tokens. The model knows 314 815 different English words (lemmas).

Linguistic preprocessing: split into sentences, tokenized, lemmatized, PoS-tagged, phrases (two-word entities) extracted, stop words removed.

Context window length: 2 words to the left and 2 words to the right.

Model performance:

  • SimLex999: 0.44
  • Google Analogy: 0.67

Download the model (338 Mbytes)

* Google News corpus (googlenews)

This model was initially trained by Google researchers; the original version is available here. The corpus size is about 100 billion tokens. The model knows about 2.9 million English words and phrases (no lemmatization!).

Linguistic preprocessing: tokenized, phrases (multi-word entities) extracted. For details, see the paper Distributed Representations of Words and Phrases and their Compositionality.

We additionally performed PoS tagging for all the words. Warning! This is a lossy transformation: the original model has a single vector for 'dance' that covers all possible PoS readings of the word; after tagging, the same vector is labeled 'dance_NOUN'.

Context window length: unknown.

Model performance:

  • SimLex999: 0.43
  • Google Analogy: 0.75

Download the PoS-tagged model (2.7 Gbytes)

* Norsk aviskorpus / NoWAC (norge)

We trained this model on the texts from the Norwegian Newspaper Corpus and the Norwegian Web Corpus (NoWAC). The corpus size is about 1.9 billion tokens. The model knows 306 943 different Norwegian words (lemmas).

Linguistic preprocessing: split into sentences, tokenized, lemmatized, PoS-tagged, phrases (two-word entities) extracted, stop words removed.

Context window length: 5 words to the left and 5 words to the right.

Download the model (329 Mbytes)

Lemmatization and PoS tagging of the training corpora were performed using Stanford CoreNLP. The PoS tags were then converted to the Universal PoS Tags format (for instance, "parliament_NOUN"). Lemma case was preserved.
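
For illustration, this is roughly how a downloaded model can be loaded and queried with gensim. This is only a sketch: the file name "model.bin" and the binary word2vec format are assumptions about what the downloaded archive contains, and queries have to use the lemma_PoS format described above.

    # A minimal sketch, assuming the downloaded archive contains a binary
    # word2vec-format file called "model.bin" (the name is hypothetical).
    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    # Queries must use lemmas with Universal PoS tags, e.g. "parliament_NOUN".
    for neighbour, similarity in model.most_similar("parliament_NOUN", topn=5):
        print(neighbour, round(similarity, 3))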

Additionally, some strong two-word collocations (bigrams) were joined into a single token with the special separator "::"; for instance, Coca_PROPN Cola_PROPN became Coca::Cola_PROPN. In the more recent models (for example, the Wikipedia one), we did this only for multi-word sequences that Stanford CoreNLP annotated as named entities of the types 'Organization', 'Person', or 'Location'.
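
Such a joined entity is therefore a single entry in the model vocabulary and can be queried like any other token. A small sketch, again assuming a hypothetical "model.bin" file (the exact spelling of the token may differ between models):

    # Sketch: a joined named entity is one vocabulary key, not two tokens.
    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    token = "Coca::Cola_PROPN"
    if token in model:          # the whole bigram is a single key
        print(model.most_similar(token, topn=3))
    else:
        print(token, "is not in this model's vocabulary")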

The models themselves were trained with the Continuous Skip-Gram algorithm introduced by Tomas Mikolov (see Publications). Vector dimensionality for all models was set to 300.
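
For reference, a comparable training setup in gensim's word2vec implementation would look roughly like the sketch below. Only the algorithm (Skip-Gram) and the dimensionality (300) come from this page; the corpus file name and the remaining hyperparameters are illustrative assumptions, not the exact training configuration.

    # Rough sketch of a comparable setup; "corpus.txt" (one preprocessed,
    # PoS-tagged sentence per line) and everything except sg=1 and
    # vector_size=300 are assumptions.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(
        LineSentence("corpus.txt"),
        sg=1,              # Continuous Skip-Gram
        vector_size=300,   # dimensionality used for all models on this page
        window=5,          # context window; the models above use 2, 5 or 10
        workers=4,
    )
    model.wv.save_word2vec_format("model.bin", binary=True)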

The models were evaluated against the SimLex999 test set (Spearman correlation) and the semantic part of the Google Analogy test set (accuracy).
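
If you want to reproduce this kind of evaluation for a downloaded model, gensim provides ready-made routines. In the sketch below the test-set file names are assumptions, and for the lemmatized, PoS-tagged models the test sets also have to be converted to the same lemma_PoS format.

    # Sketch of the evaluation; "simlex999.tsv" (tab-separated word pairs with
    # gold similarity scores) and "google-analogy-semantic.txt" (the semantic
    # sections of the Google Analogy set) are assumed local files.
    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    # Spearman correlation between model similarities and human judgements.
    pearson, spearman, oov_ratio = model.evaluate_word_pairs("simlex999.tsv")
    print("SimLex999 Spearman:", spearman[0])

    # Accuracy on the (semantic part of the) Google Analogy test set.
    accuracy, sections = model.evaluate_word_analogies("google-analogy-semantic.txt")
    print("Analogy accuracy:", accuracy)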