ZipfExplorer

This tool lets you compare the frequencies of shared word types in different texts or corpora.

Select texts or corpora to explore with the drop-down menus, or use the buttons to upload your own files (in .txt format). The plot x-axes show word frequency ranks, while the y-axes show the relative frequency per 10k words. The circles represent individual word types. Hovering over a word shows its rank, frequency, relative frequency (per 10k words), a log-likelihood measure (Dunning's G) compared to the other text, and the log-likelihood p-value. Use the plot tools (above the second plot) to drag the plots around, select specific words, zoom in and out, and reset the plots.
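The hover values described above can be approximated with a few lines of Python. This is a minimal sketch, not the tool's actual code: the function names are hypothetical, the log-likelihood uses the common two-term form of Dunning's G (the tool may use a full contingency-table version), and the p-value assumes a chi-square distribution with one degree of freedom.

```python
import math
from collections import Counter


def rel_freq_per_10k(count, total):
    """Relative frequency of a word, scaled to occurrences per 10,000 tokens."""
    return 10_000 * count / total


def rank_table(tokens):
    """Return (rank, word, count, rel_freq) tuples, most frequent word first.

    Tied counts receive consecutive ranks here; the tool may handle ties differently.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return [
        (rank, word, count, rel_freq_per_10k(count, total))
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def dunning_g(a, b, n1, n2):
    """Two-term log-likelihood (G) for a word seen a times in n1 tokens and b times in n2 tokens."""
    expected_1 = n1 * (a + b) / (n1 + n2)
    expected_2 = n2 * (a + b) / (n1 + n2)
    g = 0.0
    if a > 0:
        g += a * math.log(a / expected_1)
    if b > 0:
        g += b * math.log(b / expected_2)
    return 2 * g


def p_value_1df(g):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom (assumed here)."""
    return math.erfc(math.sqrt(max(g, 0.0) / 2))
```

A word that is equally frequent in two texts of the same size gives G = 0 and a p-value of 1; larger G values indicate a stronger difference between the texts.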

Selecting words on the plot or on the sortable tables below highlights them. The tables show word rank, frequency, difference in relative frequency per 10k words compared to the other text, and log-likelihood.

Words with positive rel_diff values (the difference in relative frequency per 10k words) are relatively more frequent in the first text; those with negative values are relatively more frequent in the second text.
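As an illustration of how the sign of rel_diff arises, the sketch below computes the difference in relative frequency per 10k words for the word types shared by two token lists. The function name `rel_diff_table` and the toy data are hypothetical and are not taken from the tool.

```python
from collections import Counter


def rel_diff_table(tokens_a, tokens_b):
    """Map each shared word type to (rel. freq. in A) - (rel. freq. in B), per 10k tokens."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    shared = counts_a.keys() & counts_b.keys()  # only types present in both texts
    return {
        word: 10_000 * counts_a[word] / n_a - 10_000 * counts_b[word] / n_b
        for word in shared
    }


# Toy example: "whale" dominates the first text, "sea" the second.
first = ["whale"] * 3 + ["sea"]
second = ["whale"] + ["sea"] * 3
diffs = rel_diff_table(first, second)
```

In this toy example `diffs["whale"]` is positive and `diffs["sea"]` is negative, matching the reading of the sign given above.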

With the “Remove most frequent words” drop-down menu, up to 200 of the most frequent words in English can be removed from the plots and tables. This can help to highlight content differences between the texts. The list of frequent words is drawn from a corpus of English-language Project Gutenberg texts made available by Sketch Engine.
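The filtering step can be pictured as follows. This is only a sketch: `remove_frequent` is a hypothetical helper, and the short `frequent_words` list is a made-up stand-in for the Sketch Engine list, which is not reproduced here.

```python
def remove_frequent(counts, frequent_words, k):
    """Drop the first k entries of frequent_words (ordered most frequent first) from counts."""
    blocked = set(frequent_words[:k])
    return {word: count for word, count in counts.items() if word not in blocked}


# Placeholder list for illustration only; the tool's list has up to 200 entries.
frequent_words = ["the", "and", "of", "to", "a"]
counts = {"the": 120, "whale": 14, "of": 60, "sea": 9}
filtered = remove_frequent(counts, frequent_words, k=3)
```

Here `filtered` keeps only the content words "whale" and "sea", since "the" and "of" are among the first three entries of the placeholder list.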

Several measures of lexical diversity are provided: the type-token ratio (TTR); the Gini coefficient, which ranges from 0 (all words have the same frequency) to a theoretical maximum of 1 (all words have frequency zero except one word, as the number of word types $n \rightarrow \infty$); the alpha parameter of the fitted power law function; and the Shannon entropy 𝑯.
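The sketch below shows one way these measures could be computed from a mapping of word types to counts. Several details are assumptions rather than the tool's documented behaviour: the entropy is given in bits (base 2), and alpha is estimated with a simple least-squares fit of log frequency on log rank, which may differ from the fitting method the tool uses. All function names are hypothetical.

```python
import math
from collections import Counter


def type_token_ratio(counts):
    """Number of distinct word types divided by the total number of tokens."""
    return len(counts) / sum(counts.values())


def gini(counts):
    """Gini coefficient of the word frequencies (0 = all frequencies equal)."""
    values = sorted(counts.values())  # ascending order
    n = len(values)
    total = sum(values)
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def shannon_entropy(counts):
    """Shannon entropy of the word distribution, in bits (base 2 is an assumption)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def zipf_alpha(counts):
    """Estimate alpha as the negated least-squares slope of log(frequency) on log(rank).

    Assumed fitting method; needs at least two word types.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


counts = Counter("a b a c".split())  # 3 types, 4 tokens
ttr = type_token_ratio(counts)
```

With a single word holding all tokens among n types, `gini` returns (n - 1)/n, which approaches the theoretical maximum of 1 as n grows.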

The data consist of several literary texts and a corpus of inaugural addresses of U.S. presidents from 1789–2017 (from NLTK); a number of texts scraped from Project Gutenberg; the Brown Corpus and its subsections; and the Freiburg-Brown Corpus of American English, available via Clarin-NO's Corpuscle tool.

The Heroku server shuts down if there has been no user input for a while. If this happens, refresh the page to restart the app.



Tool created by Steven Coats

If you wish, please cite with: Coats, Steven. (2020). Comparing word frequencies and lexical diversity with the ZipfExplorer tool. In Sanita Reinsone, Inguna Skadiņa, Anda Baklāne and Jānis Daugavietis (eds.), Proceedings of the 5th Digital Humanities in the Nordic Countries Conference, Riga, Latvia, October 21–23, 2020, 219–225. Aachen, Germany: CEUR.