Models

As of now, you can choose from the following models:

* British National Corpus (bnc)

We trained this model on the British National Corpus. The corpus size is 98 million tokens. The model knows 163 473 different English words (lemmas).

Linguistic preprocessing: split into sentences, tokenized, lemmatized, PoS-tagged, phrases (two-word entities) extracted, stop words removed.

Context window length: 10 words to the left and 10 words to the right.

Model performance:

  • SimLex999: 0.39
  • Google Analogy: 0.56

Download the model (176 Mbytes)

* English Wikipedia (enwiki)

We trained this model on all the texts from the English Wikipedia dump (February 2017). The corpus size is about 2.3 billion tokens. The model knows 296 630 different English words (lemmas).

Linguistic preprocessing: split into sentences, tokenized, lemmatized, PoS-tagged, multi-word named entities extracted, stop words removed.

Context window length: 5 words to the left and 5 words to the right.

Model performance:

  • SimLex999: 0.40
  • Google Analogy: 0.81

Download the model (318 Mbytes)

* English Gigaword (gigaword)

We trained this model on all the texts from the English Gigaword Fifth Edition corpus. The corpus size is about 4.8 billion tokens. The model knows 314 815 different English words (lemmas).

Linguistic preprocessing: split into sentences, tokenized, lemmatized, PoS-tagged, phrases (two-word entities) extracted, stop words removed.

Context window length: 2 words to the left and 2 words to the right.

Model performance:

  • SimLex999: 0.44
  • Google Analogy: 0.67

Download the model (338 Mbytes)

* Google News corpus (googlenews)

This model was initially trained by Google researchers; the original version is available here. The corpus size is about 100 billion tokens. The model knows about 2.9 million English words and phrases (no lemmatization!).

Linguistic preprocessing: tokenized, phrases (multi-word entities) extracted. For details, see the paper Distributed Representations of Words and Phrases and their Compositionality.

We additionally performed PoS tagging for all the words. Warning! This is a lossy transformation: the original model has a single vector for 'dance' that covers all possible PoS readings of the word; after tagging, the same vector is labeled 'dance_NOUN'.

Context window length: unknown.

Model performance:

  • SimLex999: 0.43
  • Google Analogy: 0.75

Download the PoS-tagged model (2.7 Gbytes)

* Norsk aviskorpus / NoWAC (norge)

We trained this model on the texts from the Norwegian Newspaper Corpus and the Norwegian Web Corpus (NoWAC). The corpus size is about 1.9 billion tokens. The model knows 306 943 different Norwegian words (lemmas).

Linguistic preprocessing: split into sentences, tokenized, lemmatized, PoS-tagged, phrases (two-word entities) extracted, stop words removed.

Context window length: 5 words to the left and 5 words to the right.

Download the model (329 Mbytes)

Lemmatization and PoS tagging of the training corpora were performed using Stanford CoreNLP. The PoS tags were then converted to the Universal PoS Tags format (for instance, "parliament_NOUN"). Lemma case was preserved.
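
For illustration, this is roughly how a downloaded model can be loaded and queried with gensim. This is only a sketch: the file name "model.bin" and the binary word2vec format are assumptions about what the downloaded archive contains, and queries have to use the lemma_PoS format described above.

    # A minimal sketch, assuming the downloaded archive contains a binary
    # word2vec-format file called "model.bin" (the name is hypothetical).
    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    # Queries must use lemmas with Universal PoS tags, e.g. "parliament_NOUN".
    for neighbour, similarity in model.most_similar("parliament_NOUN", topn=5):
        print(neighbour, round(similarity, 3))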

Additionally, some strong two-word collocations (bigrams) were joined into a single token with the special separator "::"; for instance, Coca_PROPN Cola_PROPN became Coca::Cola_PROPN. In the more recent models (for example, the Wikipedia one), we did this only for multi-word sequences that Stanford CoreNLP annotated as named entities of the types 'Organization', 'Person', or 'Location'.
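
Such a joined entity is therefore a single entry in the model vocabulary and can be queried like any other token. A small sketch, again assuming a hypothetical "model.bin" file (the exact spelling of the token may differ between models):

    # Sketch: a joined named entity is one vocabulary key, not two tokens.
    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    token = "Coca::Cola_PROPN"
    if token in model:          # the whole bigram is a single key
        print(model.most_similar(token, topn=3))
    else:
        print(token, "is not in this model's vocabulary")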

The models themselves were trained with the Continuous Skip-Gram algorithm introduced by Tomas Mikolov (see Publications). Vector dimensionality for all models was set to 300.
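
For reference, a comparable training setup in gensim's word2vec implementation would look roughly like the sketch below. Only the algorithm (Skip-Gram) and the dimensionality (300) come from this page; the corpus file name and the remaining hyperparameters are illustrative assumptions, not the exact training configuration.

    # Rough sketch of a comparable setup; "corpus.txt" (one preprocessed,
    # PoS-tagged sentence per line) and everything except sg=1 and
    # vector_size=300 are assumptions.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(
        LineSentence("corpus.txt"),
        sg=1,              # Continuous Skip-Gram
        vector_size=300,   # dimensionality used for all models on this page
        window=5,          # context window; the models above use 2, 5 or 10
        workers=4,
    )
    model.wv.save_word2vec_format("model.bin", binary=True)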

The models were evaluated against the SimLex999 test set (Spearman correlation) and the semantic part of the Google Analogy test set (accuracy).
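
If you want to reproduce this kind of evaluation for a downloaded model, gensim provides ready-made routines. In the sketch below the test-set file names are assumptions, and for the lemmatized, PoS-tagged models the test sets also have to be converted to the same lemma_PoS format.

    # Sketch of the evaluation; "simlex999.tsv" (tab-separated word pairs with
    # gold similarity scores) and "google-analogy-semantic.txt" (the semantic
    # sections of the Google Analogy set) are assumed local files.
    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    # Spearman correlation between model similarities and human judgements.
    pearson, spearman, oov_ratio = model.evaluate_word_pairs("simlex999.tsv")
    print("SimLex999 Spearman:", spearman[0])

    # Accuracy on the (semantic part of the) Google Analogy test set.
    accuracy, sections = model.evaluate_word_analogies("google-analogy-semantic.txt")
    print("Analogy accuracy:", accuracy)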