Workshop: HathiTrust Research Center’s Extracted Features 2.0 Dataset

Workshop ID:


Workshop Title:

Workshop: HathiTrust Research Center’s Extracted Features 2.0 Dataset


Ryan Dubnicek, Jennifer Christie

Time (in JST and UTC):

July 26 1:00-4:30 (JST)

July 25 16:00-19:30 (UTC)

Maximum number of participants:



This workshop will introduce participants to the HathiTrust Research Center’s (HTRC) Extracted Features (EF) Dataset and demonstrate the new data fields and functionality introduced in the latest version, 2.0. Generated from the more than 17 million volumes in the HathiTrust Digital Library, over 60% of which are still in copyright, the EF 2.0 Dataset supports text and data mining of this corpus while still being distributed as open, restriction-free data. This tutorial will cover the EF 2.0 Dataset, the key concepts behind its creation, and hands-on research use cases for the Dataset using IPython notebooks.

Aim of the workshop/tutorial:

The general objectives of this workshop are to introduce HathiTrust, the motivation for the Extracted Features Dataset, and its development and release, and to familiarize participants with the data format, its potential applications, and the latest additions in version 2.0. Our goal is for attendees to leave this workshop with a general understanding of the utility of derived datasets and to be comfortable beginning exploratory data analysis with the EF Dataset. Beyond these HTRC-centric learning objectives, the hands-on activities will also introduce common cultural analytics tasks in Python and the software libraries used for them, including Pandas, NLTK and Gensim.


Section 1: Intro to HathiTrust Digital Library and HTRC

  • What is HathiTrust?

  • What is/isn’t the HathiTrust Digital Library?

  • What is the HathiTrust Research Center?

Section 2: Context and motivation for the HTRC EF Dataset

  • Non-consumptive research

  • What is in the data?

  • Data models and analysis techniques

Section 3: Ethical considerations of text datasets

  • Bias in libraries, datasets, data, and algorithms

Section 4: Getting and Exploring EF data

  • Hands-on with EF data, Python notebooks, and the HTRC FeatureReader library
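
As a rough illustration of the kind of exploratory task the hands-on section points toward, the sketch below sums per-page token counts into volume-level counts. It deliberately does not use the HTRC FeatureReader library: the record is a hypothetical, hand-made example, and its field names (htid, features, pages, body, tokenPosCount) are assumptions loosely modeled on a page-level extracted-features layout, not a statement of the actual EF 2.0 schema. The helper name volume_token_counts is likewise invented for this example.

```python
from collections import Counter

# Hypothetical, hand-made record. The field names are assumptions that
# loosely imitate page-level extracted features: each page carries
# token counts broken down by part-of-speech tag, not the page text.
volume = {
    "htid": "example.000001",  # made-up identifier
    "features": {
        "pages": [
            {
                "seq": "00000001",
                "body": {"tokenPosCount": {"the": {"DT": 5}, "whale": {"NN": 2}}},
            },
            {
                "seq": "00000002",
                "body": {
                    "tokenPosCount": {
                        "the": {"DT": 4},
                        "whale": {"NN": 1},
                        "sea": {"NN": 3},
                    }
                },
            },
        ]
    },
}


def volume_token_counts(vol):
    """Sum page-level token counts into one case-folded Counter."""
    totals = Counter()
    for page in vol["features"]["pages"]:
        body = page.get("body") or {}  # tolerate pages without a body section
        for token, pos_counts in body.get("tokenPosCount", {}).items():
            # Collapse the part-of-speech breakdown into a single count.
            totals[token.lower()] += sum(pos_counts.values())
    return totals


counts = volume_token_counts(volume)
print(counts.most_common(3))
```

Because only counts are stored, this kind of analysis never needs the full text of a volume. In a notebook, the resulting counts could be loaded into a Pandas DataFrame for further exploration.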