Text Mining


Welcome to this Introduction to Text Mining!

What is text mining?

“Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources… The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts.” - from What is Text Mining? by Marti Hearst

Why text mining? Automated search strategies can provide an overview of patterns and tendencies in large volumes of text. This may yield insights that would be difficult, time-consuming, or both to obtain through conventional qualitative methods.

Types of text mining and digital analysis of text

Tools and software bundles for text mining

Text mining requires large volumes of machine-readable text. In this course, we will pay particular attention to platforms such as NLTK, a leading platform for building Python programs to work with human language data, and DH-Lab, which facilitates text mining of the entire digitized collection of the National Library of Norway.
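As a first taste of NLTK, the short sketch below tokenises a sentence into words. The sample sentence is invented for illustration, and the exact resource name passed to nltk.download can vary between NLTK versions.

```python
# A first taste of NLTK: split a sentence into word tokens.
import nltk
nltk.download("punkt")  # one-time download of the tokeniser models
from nltk.tokenize import word_tokenize

sentence = "Text mining extracts patterns from natural language text."
tokens = word_tokenize(sentence)
print(tokens)
# ['Text', 'mining', 'extracts', 'patterns', 'from', 'natural', 'language', 'text', '.']
```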

More and more archives offer their own programming interfaces or application tools for text mining their collections, with or without a subscription. Some familiarity with programming is often necessary, typically in Python. Most of this course is therefore devoted to building fundamental competence in the Python programming language.

Learning a programming language is like learning any new language: it requires copying, repeating, using a dictionary, practicing, and learning enough of the ‘grammar’ of the language to detect typos and mistakes.

A key to mastering a programming language is learning how to look things up. In this course you will therefore practice looking things up and become familiar with some of the most relevant resources (textbooks, published tutorials, online resources, and forums).

With competence in Python comes a deeper understanding of web applications for text mining, including those from DH-Lab and Voyant, which this course will also introduce you to.

Learning goals:

Book Chapters:

Articles:

Overview of workshop episodes:

Program:

Part I: Getting started
9-9:15: Introduction: What is text mining? | Text mining types | Text mining tools
9:15-10: Getting started with Python
Introduction to Jupyter, Python, and NLTK.
How Python programs work (the mechanics behind them, shown with strings and loops).
How Python code can be used for text mining (see the sketch below).
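To give a flavour of what “strings and loops” means in practice, here is a minimal sketch (the text is a toy example) that counts word occurrences in a string:

```python
# Strings and loops: count how often each word occurs in a string.
text = "to be or not to be"

counts = {}
for word in text.split():               # split the string on whitespace
    counts[word] = counts.get(word, 0) + 1

print(counts)                           # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```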

10-10:15: Break

Part II: Text Mining the Medical History of British India
10:15-11:
Preprocessing the dataset
Concordances (see the sketch below)
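A minimal sketch of these two steps with NLTK follows. The filename is hypothetical, so substitute the course dataset, and the tokeniser models must be downloaded first (as in the introductory sketch above).

```python
# Preprocess a raw text file and inspect a concordance with NLTK.
import nltk
from nltk.tokenize import word_tokenize

# Hypothetical filename; replace with the actual dataset file.
raw = open("medical_history_of_british_india.txt", encoding="utf-8").read()
tokens = word_tokenize(raw.lower())   # lowercase, then split into tokens
text = nltk.Text(tokens)
text.concordance("plague", lines=5)   # five keyword-in-context lines
```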

11-11:15: Break

11:15-12:
Regular expression search
Counting tokens
Frequency distributions and word clouds - plots (see the sketch below)
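The sketch below illustrates all three steps on a toy token list; NLTK's FreqDist.plot requires matplotlib.

```python
# Regular-expression search, token counts, and a frequency distribution.
import re
import nltk

tokens = ["the", "plague", "spread", "and", "the", "cholera", "spread"]  # toy data

# Regular expressions: keep tokens matching a pattern.
print([t for t in tokens if re.search(r"ea", t)])   # ['spread', 'spread']

# Counting tokens: total number of tokens vs. distinct types.
print(len(tokens), len(set(tokens)))                # 7 5

# Frequency distribution, printed and plotted (plot needs matplotlib).
fdist = nltk.FreqDist(tokens)
print(fdist.most_common(3))   # e.g. [('the', 2), ('spread', 2), ('plague', 1)]
fdist.plot(3)
```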

12-13: Lunch

Part III: Text Mining Inaugural Addresses
13-14:
Lexical dispersion (inaugural addresses) - plots
Plotting frequency over time - plots
Collocations (with specified association measures)
Part-of-Speech tagging - plots (see the sketch below)
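A sketch of these four steps on NLTK's built-in inaugural corpus follows; resource names for nltk.download can vary between NLTK versions, and the plots require matplotlib.

```python
# Dispersion, frequency over time, collocations, and POS tagging
# on NLTK's inaugural address corpus.
import nltk
nltk.download("inaugural")
nltk.download("averaged_perceptron_tagger")
from nltk.corpus import inaugural
from nltk.collocations import BigramCollocationFinder

# Lexical dispersion: where do these words occur across the corpus?
text = nltk.Text(inaugural.words())
text.dispersion_plot(["citizens", "democracy", "freedom", "America"])

# Frequency over time: count matching word stems per address year.
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in ["america", "citizen"]
    if word.lower().startswith(target))
cfd.plot()

# Collocations with a specified association measure (here: PMI).
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(inaugural.words())
finder.apply_freq_filter(5)            # ignore rare bigrams
print(finder.nbest(bigram_measures.pmi, 10))

# Part-of-speech tagging on a short sample.
print(nltk.pos_tag(inaugural.words("2009-Obama.txt")[:15]))
```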

14-14:15: Break

Part IV: Text Mining with Voyant and DH-Lab
14:15-14:45: Tour and demonstration of Voyant, DH-Lab notebooks, and DH-Lab apps

14:45-15:35: Individual work in Python, Voyant, or DH-Lab
15:35-15:50: Q&A, group discussion of tools, and discussion of the assignment and assignment schedule
15:50-16: Course evaluation

Suggestions for individual practice:

Python: text mine other corpora made available by NLTK, e.g.: Project Gutenberg, Chat Corpus, Personals Corpus, Wall Street Journal
CODE: Concordances, Words in Context (similar, common contexts), Collocations, Plots (Dispersion & Frequency Distribution) - see the sketch below
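A possible starting point, using Moby Dick from NLTK's Gutenberg selection (download the corpus first with nltk.download("gutenberg")):

```python
# Practice sketch: the listed methods applied to a Gutenberg text.
import nltk
from nltk.corpus import gutenberg

moby = nltk.Text(gutenberg.words("melville-moby_dick.txt"))
moby.concordance("monstrous", lines=5)       # keyword in context
moby.similar("monstrous")                    # words used in similar contexts
moby.common_contexts(["monstrous", "very"])  # contexts shared by two words
moby.collocations()                          # frequent word pairs
moby.dispersion_plot(["whale", "Ahab"])      # needs matplotlib
```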

Voyant: text mine Inaugural Address Corpus (or another corpus from NLTK), comparing functionality provided by Voyant with what you did in Python:
TOOLS: Cirrus (word cloud based on frequencies), Trends (word dispersions), KWIC (concordances), Topics (topic modelling with term clusters), with many more tools available, including graphs and plots for visualizing word frequencies, distributions, proximities, collocates, and correlations

DH-Lab: build a corpus from the National Library's collection and text mine it with the provided Python code in Jupyter and/or the applications (see the sketch below)
TOOLS: Corpus builder, Concordances, Collocations, Frequency Distributions, Topic Modelling, Named Entity Recognition (NER), and Part-of-Speech Tagging (POS)
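As a starting point, here is a sketch of corpus building with DH-Lab's dhlab Python package (pip install dhlab). The class and parameter names below follow the package documentation at the time of writing and should be treated as assumptions; verify them against the current DH-Lab documentation.

```python
# Sketch of DH-Lab corpus building; verify names against dhlab's docs.
import dhlab as dh

# Build a small corpus of digitised books containing a search term.
corpus = dh.Corpus(doctype="digibok", fulltext="kolera",
                   from_year=1850, to_year=1900, limit=10)

# Keyword-in-context lines for the same term within that corpus.
conc = dh.Concordance(corpus=corpus, query="kolera")
print(conc.frame.head())   # results are exposed as a pandas DataFrame
```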