Text Mining

Introduction to Text Mining with DH-Lab Jupyter Notebooks with example Python code

Keypoints:

With the API from the DH-Lab you can text mine just about all of the National Library of Norway’s digitized collection.
The DH-Lab has prepared Python code in Jupyter Notebooks you can copy and adapt precisely as you want.
There is example code to search for concordances, collocations, N-gram and word frequencies, Named Entity Recognition (NER) & Part-of-Speech (POS) tagging, Topic Modelling, and Sentiment Analysis.

The DH-Lab at the National Library of Norway has written example code to text mine the National Library’s huge digitized collection, as well as web-based apps for a simpler introduction to text mining the National Library’s collection. The code is written in Python and shared in Jupyter Notebook, and the apps are made using Streamlit, a free and open-source app framework in Python.

Access DH-Lab with Python code in Jupyter Notebooks at DH-Lab Digital tekstanalyse.

To run code to text mine the National Library’s digitized collections:

You must have Anaconda downloaded and installed (as for this course).
Download an example notebook from DH-Lab. Begin with the first notebook before selecting one based on the type of text mining you want to do.
Open Anaconda, launch Jupyter Notebook from Anaconda, and open the downloaded notebook.
Begin at the top of the downloaded notebook and follow the instructions.
Run all cells in the notebook.

To run code from this course or elsewhere on a corpus in DH-Lab, you need to install dhlab with pip. You must also replace the file path to the corpus on your local machine with the URNs of the corpus you have defined in dhlab.

# !pip install dhlab

Sentiment Analysis

Also known as “opinion mining”, sentiment analysis describes automated methods to identify affective states in data sets. This is done by systematically selecting expressions of subjective opinion and emotional evaluation in the material. Sentiment analysis is popular in marketing and advertising, and is used to examine the tone of political communication, public debate, and social media, as well as in studies of plot and genre in literary corpora.

Digital sentiment analysis uses word lists and data sets in which words and expressions are given a score based on their perceived emotional meaning. NLTK comes with a sentiment package, and DataCamp has made an NLTK Sentiment Analysis Tutorial for Beginners.
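The lexicon-based scoring described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by NLTK or DH-Lab: the `TOY_LEXICON` word list, its scores, and the `score_text` function are all made up for this example.

```python
import re

# Hypothetical toy lexicon: word -> sentiment score (invented for illustration).
TOY_LEXICON = {
    "good": 1.0,
    "great": 2.0,
    "happy": 1.5,
    "bad": -1.0,
    "terrible": -2.0,
    "sad": -1.5,
}


def score_text(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all words in the text.

    Words that are not in the lexicon contribute 0.
    """
    tokens = re.findall(r"\w+", text.lower())
    return sum(lexicon.get(token, 0.0) for token in tokens)


print(score_text("A great day with good friends"))
print(score_text("The weather was terrible and I felt sad"))
```

A positive total suggests a mostly positive text and a negative total a mostly negative one. A simple sum like this ignores negation ("not good") and context, which is one reason real sentiment tools are more elaborate.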

Scientists at the University of Oslo participating in the project Sentiment Analysis for Norwegian Text (SANT) have developed data sets for sentiment analysis in Norwegian, among them NorSentLex, a Norwegian sentiment lexicon that indicates the prior positive or negative polarity of words. DH-Lab has written example code in Python that uses NorSentLex to run a sentiment analysis of a newspaper corpus in the National Library’s digitized collection. See the example code shared in a Jupyter Notebook.
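A polarity lexicon of this kind can be applied to a corpus by counting how many positive and negative words each document contains. The sketch below is not the DH-Lab notebook code and does not load NorSentLex: the `POSITIVE` and `NEGATIVE` sets are tiny hand-picked stand-ins, and `polarity_counts` and the example documents are hypothetical names and data for illustration only.

```python
import re
from collections import Counter

# Tiny made-up stand-ins for a lexicon's positive and negative word lists.
POSITIVE = {"god", "glad", "fin"}
NEGATIVE = {"dårlig", "trist", "stygg"}


def polarity_counts(text, positive=POSITIVE, negative=NEGATIVE):
    """Count positive and negative lexicon words in a text."""
    tokens = re.findall(r"\w+", text.lower())
    freqs = Counter(tokens)
    pos = sum(n for word, n in freqs.items() if word in positive)
    neg = sum(n for word, n in freqs.items() if word in negative)
    return {"positive": pos, "negative": neg, "net": pos - neg}


# Hypothetical mini-corpus: document name -> text.
docs = {
    "article_1": "En god og fin dag",
    "article_2": "Trist og dårlig vær, men en god kamp",
}

for name, text in docs.items():
    print(name, polarity_counts(text))
```

With a real lexicon, the two sets would be read from the lexicon's word lists instead of being typed in by hand, and the documents would come from the corpus you have defined.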
