About

RegisterExplorer is a web service for exploring how the meaning of a word varies across the language registers in which it is used. You can also compare a word's register-specific meaning with its meaning in English in general.

It accompanies the paper 'Exploration of register-dependent lexical semantics using word embeddings' by Andrey Kutuzov, Elizaveta Kuzmenko and Anna Marakasova.

Registers are text types, styles or 'sublanguages' that depend on the communicative situation. We feature models of the following English language registers:

  • spoken language,
  • academic language,
  • news language,
  • fiction language,
  • popular non-fiction language.

You can search for any English word, optionally accompanied by a part-of-speech tag (for example, 'boot_SUBST').

Under the hood, this service uses distributional language models. They were trained on register-specific slices of the British National Corpus (BNC), a large and well-established collection of English texts (note: all the texts date from before 1994, so don't try to look for 21st century memes!). Each model corresponds to one register (or the whole BNC), so by comparing word embeddings in different models we compare different registers. You can read about the BNC registers in more detail in the paper by David Lee (see References below).

Distributional models use word co-occurrence data extracted from large corpora to represent the semantics of each word as a dense vector called a word embedding. Words with similar meanings have similar vectors. This makes it possible to process natural language computationally while taking semantics into account. We use the well-known Continuous Bag of Words algorithm developed by Tomas Mikolov et al. to train our models.
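The idea that similar words have similar vectors is commonly measured with cosine similarity. The snippet below is a minimal sketch of that measure only; the three-dimensional vectors are invented for illustration and are not taken from the RegisterExplorer models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumption: real models use far more dimensions).
toy_vectors = {
    "star": [0.9, 0.1, 0.3],
    "planet": [0.8, 0.2, 0.4],
    "boot": [0.1, 0.9, 0.2],
}

sim_star_planet = cosine_similarity(toy_vectors["star"], toy_vectors["planet"])
sim_star_boot = cosine_similarity(toy_vectors["star"], toy_vectors["boot"])
```

With these made-up numbers, 'star' comes out closer to 'planet' than to 'boot', which is the kind of relationship a trained model is expected to capture.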

The resulting models demonstrate the semantic specificity of different registers. It can be observed by comparing the lists of nearest semantic associates for a given word in different models. You can look up the nearest associates of any word and grasp what this word means in a particular register. For example, the word 'star' means very different things in a typical newspaper article and in an academic paper.
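As a rough illustration of comparing associate lists across models, the sketch below ranks the neighbours of one word in two toy 'models'. Everything in it is hypothetical: the function name `nearest_associates`, the two-dimensional vectors and the vocabulary are assumptions made up for the example, not the service's data or code.

```python
import math


def nearest_associates(word, vectors, topn=2):
    """Return the topn words closest to `word` by cosine similarity."""

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:topn]]


# Invented toy models: only the vector for 'star' differs between them.
news_model = {
    "star": [0.9, 0.1],
    "celebrity": [0.8, 0.2],
    "actor": [0.7, 0.3],
    "galaxy": [0.15, 0.9],
    "telescope": [0.2, 0.8],
}
academic_model = dict(news_model, star=[0.1, 0.9])

news_associates = nearest_associates("star", news_model)
academic_associates = nearest_associates("star", academic_model)
```

In this toy setup the same query word gets entirely different neighbours in the two models, which mirrors the effect described above.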

Additionally, differences between the associates in different models can be expressed quantitatively. RegisterExplorer calculates the normalized Kendall's Tau distance between each register-specific model and the model trained on the whole BNC, so that you know in which register the meaning of your query word is most 'exotic'.
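The normalized Kendall's Tau distance between two rankings is the share of item pairs that appear in opposite order. The sketch below assumes that both rankings contain exactly the same items; how the service treats associates that occur in only one list is not described here, so that case is rejected rather than guessed. The function name and the example lists are hypothetical.

```python
from itertools import combinations


def normalized_kendall_tau_distance(ranking_a, ranking_b):
    """Fraction of item pairs ordered differently in the two rankings.

    0.0 means identical order, 1.0 means exactly reversed order.
    """
    if (
        len(set(ranking_a)) != len(ranking_a)
        or len(set(ranking_b)) != len(ranking_b)
        or set(ranking_a) != set(ranking_b)
    ):
        raise ValueError("rankings must contain the same unique items")
    n = len(ranking_a)
    if n < 2:
        return 0.0
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    # A pair is discordant when the two rankings disagree on its order.
    discordant = sum(
        1
        for x, y in combinations(ranking_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / (n * (n - 1) / 2)


# Invented associate lists for one query word.
whole_bnc = ["sun", "planet", "galaxy", "celebrity"]
news = ["celebrity", "sun", "planet", "galaxy"]
distance = normalized_kendall_tau_distance(whole_bnc, news)
```

Here three of the six possible pairs change order (all of them involve 'celebrity'), so the distance is 0.5.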

In the Text analysis tab you can paste your text and find out how likely it is to have been produced within one of the aforementioned registers. For this, we employ an approach called 'deep inverse regression', developed by Matt Taddy. It analyses word sequences and outputs the log-likelihood of encountering such sequences under each of the given models. Thus, one can estimate how strongly each register is 'expressed' in the query text.
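One way to read such per-register log-likelihoods is to invert them with Bayes' rule into probabilities that sum to one. The sketch below does this under the assumption of a uniform prior over registers; whether the service applies this normalization is not stated here, and the scores used are invented.

```python
import math


def register_posteriors(log_likelihoods):
    """Convert per-register log-likelihoods into posterior probabilities.

    Assumes a uniform prior over registers (Bayes' rule then reduces to
    normalizing the exponentiated log-likelihoods).
    """
    highest = max(log_likelihoods.values())
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    weights = {
        name: math.exp(value - highest)
        for name, value in log_likelihoods.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical log-likelihoods of one text under three register models.
example_scores = {"academic": -120.0, "news": -123.0, "fiction": -130.0}
posteriors = register_posteriors(example_scores)
best_register = max(posteriors, key=posteriors.get)
```

Because the comparison is relative, only the differences between the log-likelihoods matter, not their absolute size.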

Models

Download our models for your own use. Models come in Gensim format, tar/gzipped.

* Full British National Corpus

Download the model (207 Mbytes)

* Academic subcorpus

Download the model (108 Mbytes)

* Fiction subcorpus

Download the model (90 Mbytes)

* News subcorpus

Download the model (109 Mbytes)

* Popular non-fiction subcorpus

Download the model (143 Mbytes)

* Spoken subcorpus

Download the model (102 Mbytes)

References

  1. Kutuzov, Andrey, Elizaveta Kuzmenko, and Anna Marakasova. "Exploration of register-dependent lexical semantics using word embeddings", in Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH). The COLING 2016 Organizing Committee, 2016. ISBN 978-4-87974-708-2.
  2. Lee, David. "Genres, Registers, Text Types, Domain, and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle", in Language Learning & Technology 5.3 (2001): 37-72.
  3. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality", in Advances in neural information processing systems. 2013.
  4. Taddy, Matt. "Document Classification by Inversion of Distributed Language Representations", in Proceedings of the 2015 Conference of the Association for Computational Linguistics.

Team

RegisterExplorer is maintained by Andrey Kutuzov (University of Oslo, Norway), Anna Marakasova and Elizaveta Kuzmenko (National Research University Higher School of Economics, Russia).

The source code for the service is available here, and you are welcome to deploy it with any other set of models trained on text corpora of your choice.


 University of Oslo, National Research University Higher School of Economics, Nordic Language Processing Laboratory (NLPL)