
Text Mining with Voyant Tools

Keypoints:

Voyant Tools is an open-source, web-based application for automated computational text analysis. It can analyze online texts that you link to as well as texts you upload, and it accepts a number of file formats, such as docx, pdf, and txt. Voyant is popular among scholars in the digital humanities and has a large, international user base. It does not perform linguistic analysis such as part-of-speech tagging or named entity recognition, but it is highly user-friendly, provides clear visualizations, and offers rich functionality. See the list of tools here.

Acknowledging the limitations of its built-in functionality, Voyant invites users to develop their own tools using Voyant’s functionality and code, and endorses the use of other tools, in particular Python with Jupyter Notebook.

Access Voyant Tools at voyant-tools.org.

Get started:

Topic Modeling

Topic modeling, sometimes referred to as “theme modeling”, is a method that enables analysis of words’ co-occurrence patterns in texts. Statistical calculations are performed by an algorithm, the output of which allows for grouping, or clustering, of words under the concept of a topic. Despite these clusters being nothing more than words grouped by statistical analysis, a researcher may glean interesting information about the thematic structure of texts through this method.
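To make the idea of co-occurrence concrete, here is a minimal Python sketch that counts how often pairs of words appear in the same document. The toy corpus and the function name cooccurrence_counts are made up for illustration; a real topic model goes much further than raw pair counts, but these counts are the kind of signal it builds on.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each document is already tokenized and lowercased.
documents = [
    ["whale", "sea", "ship"],
    ["ship", "sea", "captain"],
    ["garden", "flower", "rain"],
    ["flower", "garden", "sea"],
]

def cooccurrence_counts(docs):
    """Count in how many documents each unordered word pair appears together."""
    counts = Counter()
    for doc in docs:
        # set() counts a pair once per document; sorted() gives a stable pair order.
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return counts

pair_counts = cooccurrence_counts(documents)

# Show the most frequent pairs.
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

In this toy corpus, "sea" and "ship" share two documents, as do "flower" and "garden", while most other pairs share only one. Pairs that keep turning up together are candidates for belonging to the same topic.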

There are several algorithms used in topic modeling. Two examples are Latent Dirichlet Allocation (LDA), the most common, and the newer BERTopic. The two work in different ways, but both produce the same kind of output: topics represented as groups of characteristic words.

Voyant Tools features an implementation of LDA that requires relatively little previous experience with topic modeling. You can read more about Voyant’s Topics tool and the algorithm behind it under the help menu. The Topics tool uses an implementation of LDA called jsLDA. Each word in each document is first randomly assigned to one of a specified number of topics (you can set the number of topics). The algorithm then runs a number of iterations (50) and tries to refine the model of which terms best fit which topics, based on their co-occurrence in the documents.
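The two steps described above, random assignment followed by iterative refinement, can be sketched in plain Python. This is not Voyant's or jsLDA's code. It is a toy illustration that assumes the refinement is done with collapsed Gibbs sampling, a common way to fit LDA. The function name toy_lda, the toy documents, and the values of alpha and beta are assumptions chosen for the example.

```python
import random

def toy_lda(docs, num_topics=2, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling on tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]      # tokens per topic in each document
    topic_word = [dict() for _ in range(num_topics)]  # count of each word in each topic
    topic_total = [0] * num_topics                    # total tokens in each topic

    # Step 1: assign every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            t = rng.randrange(num_topics)
            doc_assign.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word] = topic_word[t].get(word, 0) + 1
            topic_total[t] += 1
        assignments.append(doc_assign)

    # Step 2: repeatedly revisit each word occurrence and resample its topic.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # A topic is more likely if it is already common in this document
                # and if this word is already common in that topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(word, 0) + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] = topic_word[new].get(word, 0) + 1
                topic_total[new] += 1

    return topic_word, doc_topic

# Hypothetical toy documents, already tokenized.
toy_docs = [
    ["whale", "sea", "ship", "sea"],
    ["ship", "captain", "sea", "whale"],
    ["garden", "flower", "rain", "flower"],
    ["rain", "garden", "flower", "garden"],
]

topic_word, doc_topic = toy_lda(toy_docs, num_topics=2, iterations=50)
for t, words in enumerate(topic_word):
    top = sorted(words, key=words.get, reverse=True)[:3]
    print(f"Topic {t}: {top}")
```

Because the starting assignment is random, different seeds can give different topics, and a corpus this small will not always separate cleanly. The same holds for real tools: rerunning a topic model can change the result.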

Topic modeling can also be done in Python. The LDA algorithm is implemented in the open-source Gensim Python package. The DH-Lab at the National Library of Norway has written example code in Python for text mining their digitized collection with Gensim's LDA; you can find it in their open collection of Jupyter Notebooks. They have also made a web-based app for topic modeling based on the co-occurrence of words.
