About

This service computes semantic relations between words in English and Norwegian. The semantic vectors reflect word meaning based on co-occurrence distributions in the training corpora (huge amounts of raw linguistic data).

In distributional semantics, words are usually represented as vectors in a multi-dimensional space of their contexts. Semantic similarity between two words is then straightforwardly calculated as the cosine similarity between their corresponding vectors; it takes values between -1 and 1. A value of 0 means the words do not share similar contexts, and thus their meanings are unrelated to each other. A value of 1 means that the words' contexts are identical, and thus their meanings are very similar.
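
As a minimal illustration, here is a Python sketch of the cosine similarity computation with toy three-dimensional vectors (real models use hundreds of dimensions; the numbers below are invented, not taken from any of our models):

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between two word vectors: ranges from -1 to 1.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Toy vectors, not taken from any real model.
    cat = np.array([0.8, 0.1, 0.3])
    dog = np.array([0.7, 0.2, 0.4])
    print(cosine_similarity(cat, dog))  # close to 1: the contexts are similar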

Recently, distributional semantics has received substantially growing attention. The main reason for this is a very promising approach of employing artificial neural networks to learn high-quality dense vectors (embeddings), using so-called predictive models. The most well-known tool in this field is currently word2vec, which allows very fast training compared to previous approaches.
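
To give an idea of how such predictive models are trained, here is a sketch using the gensim library on a toy corpus (the corpus and the hyperparameters are purely illustrative and do not reflect the settings used for our released models):

    from gensim.models import Word2Vec

    # In practice this is an iterable over tokenised sentences from a large corpus.
    sentences = [
        ["the", "cat", "sits", "on", "the", "mat"],
        ["the", "dog", "lies", "on", "the", "rug"],
    ]

    # Illustrative hyperparameters only.
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    print(model.wv.most_similar("cat", topn=5))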

Unfortunately, training and querying neural embedding models on large corpora can be computationally expensive. Thus, we provide ready-made models trained on several corpora, together with a convenient web interface to query them. You can also download the models to process them on your own. Moreover, our web service features a number of (hopefully) useful visualizations of semantic relations between words. In general, the aim of WebVectors is to lower the entry threshold for those who want to work in this new and exciting field.
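
If you download one of the models, it can be loaded and inspected offline, for example with gensim. The sketch below assumes a file name, format and token spelling that depend on the particular model you download:

    from gensim.models import KeyedVectors

    # Adjust the file name and the binary flag to the model file you actually downloaded.
    wv = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    print(len(wv))              # vocabulary size
    print("house_NOUN" in wv)   # token format (e.g. lemma_POS) depends on the model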

What does it do?

WebVectors is basically a tool to explore relations between words in distributional models. You can think of it as a kind of 'semantic calculator'. A user can choose one or several models to work with: currently we provide five models trained on different corpora.

Many more word embedding models can be found in our NLPL repository.

After choosing a model, it is possible to:

  1. calculate semantic similarity between pairs of words;
  2. find words semantically closest to the query word (optionally with part-of-speech filters);
  3. perform analogical inference: find a word X which is related to the word Y in the same way as the word A is related to the word B;
  4. draw 2D and 3D semantic maps of relations between input words (it is useful to explore clusters and oppositions);
  5. get the raw vectors (arrays of real values) and their visualizations for words in the chosen model: just click on any word anywhere, or use a direct URI to the word of interest, as described below.
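
For those who prefer to work offline with a downloaded model, the first three operations above can be reproduced with gensim. A minimal sketch follows; the model file name and the example tokens are assumptions:

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("model.bin", binary=True)  # assumed file name

    # 1. Semantic similarity between a pair of words
    print(wv.similarity("car_NOUN", "bicycle_NOUN"))

    # 2. Words semantically closest to the query word
    print(wv.most_similar("car_NOUN", topn=10))

    # 3. Analogical inference: find X related to "woman" as "king" is related to "man"
    print(wv.most_similar(positive=["king_NOUN", "woman_NOUN"], negative=["man_NOUN"], topn=1))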

In the spirit of the Semantic Web, each word in each model has its own unique URI that explicitly states its lemma, model and part of speech (example). The web pages at these URIs contain lists of the nearest semantic associates of the corresponding word, belonging to the same PoS as the word itself. Other information about the word is also shown.

We also provide a simple API to get the list of semantic associates for a given word in a given model. There are two possible output formats: JSON and CSV. Perform GET requests to URLs following the pattern http://vectors.nlpl.eu/explore/embeddings/MODEL/WORD/api/FORMAT, where MODEL is the identifier of the chosen model, WORD is the query word, and FORMAT is "csv" or "json", depending on the output format you need. The service returns a JSON file or a tab-separated text file with the first 10 associates.
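
A minimal sketch of such a request in Python (MODEL and WORD are placeholders to be replaced with a real model identifier and query word; the code simply prints whatever the service returns):

    import requests

    url = "http://vectors.nlpl.eu/explore/embeddings/MODEL/WORD/api/json"
    response = requests.get(url)
    response.raise_for_status()
    print(response.json())  # the first 10 associates, as returned by the service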

Additionally, you can get semantic similarities for word pairs in any of the provided models via queries of the following format: http://vectors.nlpl.eu/explore/embeddings/MODEL/WORD1__WORD2/api/similarity/ (note the two underscore characters).
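
The pair similarity endpoint can be queried in the same way (again with placeholder values):

    import requests

    # Note the double underscore between the two query words.
    url = "http://vectors.nlpl.eu/explore/embeddings/MODEL/WORD1__WORD2/api/similarity/"
    response = requests.get(url)
    response.raise_for_status()
    print(response.text)  # the similarity value for the pair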

Naturally, one can compare results from different models on one screen.

Citing us

If you use our service in your research, please cite this paper:

Fares, Murhaf; Kutuzov, Andrei; Oepen, Stephan & Velldal, Erik (2017). Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In Jörg Tiedemann (ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa), 22-24 May 2017. Linköping University Electronic Press. ISBN 978-91-7685-601-7.

Links

This service runs on WebVectors, a free and open-source toolkit for serving distributional semantic models over the web.

You can also check RusVectōrēs, our sister service for Russian.

Publications

If you are interested in word embedding models, you should really check these publications:

  1. Turney, P. D., P. Pantel (2010). “From frequency to meaning: Vector space models of semantics”. Journal of artificial intelligence research, 37(1), 141-188.
  2. Mikolov, T., et al. (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  3. Mikolov, Tomas, et al. (2013) “Exploiting similarities among languages for machine translation.” arXiv preprint arXiv:1309.4168.
  4. Baroni, Marco, et al. (2014) “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors.” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Vol. 1.
  5. Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.
  6. Levy, O., Goldberg Y., and Dagan, I. (2015) Improving Distributional Similarity with Lessons Learned from Word Embeddings. TACL-2015
  7. Kutuzov, A., Velldal, E., and Øvrelid, L. (2016) Redefining part-of-speech classes with distributional semantic models. CoNLL-2016
  8. Sahlgren, M., and Lenci, A. (2016) The Effects of Data Size and Frequency Range on Distributional Semantic Models. Proceedings of EMNLP-2016