Text and data mining for East Asian sources in classical Chinese

Workshop ID:


Workshop Title:

Text and data mining for East Asian sources in classical Chinese


Donald Sturgeon

Time (in JST and UTC):

July 25 17:00-20:30 (JST)

July 25 8:00-11:30 (UTC)

Maximum number of participants:



Substantial volumes of primary sources important to the historical written record of China and other East Asian civilizations have been made available through online databases as transcriptions, image sequences, or both. A small but growing number of texts have also been semantically annotated, with named entities linked to suitable knowledge bases. Using the Chinese Text Project (https://ctext.org), this workshop introduces ways of working with digitized and annotated historical texts, and demonstrates how to improve the digitization of such texts in a crowdsourced environment that supports manual correction of OCR output, semantic annotation of named entities, and construction and use of an open knowledge graph.

Aim of the workshop/tutorial:

The goal of this tutorial is to give participants direct, hands-on experience of efficiently locating digitized materials in this digital library, using these materials for open-ended text and data mining, and contributing to the ongoing digitization and annotation of these and related materials.


This session will introduce participants to:
1) Basic navigation of a large and moderately complex digital library.
2) Text mining using openly available browser-based tools that use interactive visualizations to allow user-driven exploration of the contents of both this digital library and arbitrary user-supplied materials in any language.
3) Crowdsourced editing to correct errors in textual transcriptions, and the principles of versioned textual repositories.
4) Semantic annotation and knowledge base construction.
5) Basic knowledge graph querying and data mining.
6) Querying the knowledge graph with RDF and SPARQL.
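
As a rough illustration of the kind of text mining named in the topics above, the following Python sketch counts character bigrams in a short classical Chinese passage. It is not one of the browser-based tools used in the workshop and does not call any Chinese Text Project API. The function name `char_ngrams`, the sample passage, and the choice to treat only the basic CJK Unified Ideographs block as text are all assumptions made for illustration. Character n-grams are a common starting point for classical Chinese because they need no word segmentation.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping n-grams of CJK characters.

    Assumption: only the basic CJK Unified Ideographs block
    (U+4E00 to U+9FFF) counts as text; punctuation and whitespace
    end a run, so no n-gram spans a punctuation mark.
    """
    counts = Counter()
    run = []
    # The trailing newline is a sentinel that flushes the final run.
    for ch in text + "\n":
        if "\u4e00" <= ch <= "\u9fff":
            run.append(ch)
        else:
            for i in range(len(run) - n + 1):
                counts["".join(run[i:i + n])] += 1
            run = []
    return counts


# Hypothetical sample input: the opening lines of the Analects.
sample = "學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？人不知而不慍，不亦君子乎？"
bigrams = char_ngrams(sample, 2)
print(bigrams["不亦"])  # 3
```

The same counts computed over two different texts could then be compared to find shared phrasing, which is one simple form of the text reuse analysis that interactive tools make explorable.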