Text Mining

Introduction to Text Mining with DH-Lab Apps

Keypoints:

To use an app to text mine the National Library’s digitized collections:

Named Entity Recognition (NER)

Named Entity Recognition (NER) identifies the words or phrases in a text that are named entities (NEs) and assigns each one a type. Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, and dates. Entity recognition is often performed with chunkers, and part-of-speech (POS) tags are often an important feature when searching for chunks. NLTK provides a classifier that has already been trained to recognize and tag named entities; you can read more about it in the NLTK textbook.
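To make the link between POS tags and chunks concrete, here is a minimal sketch in plain Python. It is a toy illustration, not NLTK's trained classifier: the function name `chunk_proper_nouns` is hypothetical, the example sentence is tagged by hand with Penn Treebank tags, and the only rule is that consecutive proper nouns (tags starting with `NNP`) form one candidate entity.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag)


def chunk_proper_nouns(tagged: List[TaggedToken]) -> List[str]:
    """Group runs of consecutive proper-noun tokens into candidate entities.

    Hypothetical helper for illustration only; a real NER system also
    uses context and assigns a type (person, organization, ...) to each chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag.startswith("NNP"):
            # Still inside a run of proper nouns: extend the current chunk.
            current.append(word)
        elif current:
            # The run has ended: store the finished chunk and start over.
            chunks.append(" ".join(current))
            current = []
    if current:
        # Flush a chunk that runs to the end of the sentence.
        chunks.append(" ".join(current))
    return chunks


# Hand-tagged sentence, so no tagger or trained model is needed.
sentence = [
    ("Henrik", "NNP"), ("Ibsen", "NNP"), ("was", "VBD"), ("born", "VBN"),
    ("in", "IN"), ("Skien", "NNP"), (".", "."),
]
candidates = chunk_proper_nouns(sentence)
print(candidates)
```

The sketch only finds candidate chunks. Deciding whether a chunk is a person, a place, or not an entity at all is the part that a trained model handles.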

DH-Lab's NER app uses spaCy rather than NLTK. DH-Lab also provides example Python code, as Jupyter notebooks, showing how to run NER and POS analysis with spaCy models on the National Library’s digitized collections. spaCy is an open-source library for advanced natural language processing, written in Python and Cython. Unlike NLTK, which is widely used for teaching and research, spaCy is designed for production use and is popular among app developers.
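As a rough sketch of what NER and POS analysis with spaCy can look like, the helpers below read entities and POS tags from a processed spaCy document. This is not DH-Lab's own notebook code. The helper names (`load_pipeline`, `entities`, `pos_counts`) are hypothetical, and the model name `nb_core_news_sm` (spaCy's small Norwegian Bokmål pipeline) is an assumption about the language of the texts. The sketch assumes spaCy and that model are installed separately.

```python
from collections import Counter
from typing import List, Tuple


def load_pipeline(model_name: str = "nb_core_news_sm"):
    """Load a spaCy pipeline.

    The default model name is an assumption for Norwegian text, not a
    documented DH-Lab setting; swap in whichever model fits your texts.
    """
    import spacy  # imported here so the helpers below do not require spaCy

    return spacy.load(model_name)


def entities(doc) -> List[Tuple[str, str]]:
    """Return (text, label) pairs for each entity span in a processed doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def pos_counts(doc) -> Counter:
    """Count coarse-grained POS tags over all tokens in a processed doc."""
    return Counter(token.pos_ for token in doc)


# Typical usage, once spaCy and the model are installed:
#   nlp = load_pipeline()
#   doc = nlp("Henrik Ibsen ble født i Skien.")
#   print(entities(doc))
#   print(pos_counts(doc))
```

The entity labels you get back depend on the model's training data, so check the model's documentation before interpreting them.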
