Large-Scale Text Databases of Chinese Classics

DH 2022東京記念レクチャーシリーズ

DH 2022 Tokyo Commemorative Lecture Series

2022-06-09 (Thu) 18:00-19:30, online [ended]

Lecturer: Donald Sturgeon

英国・ダラム大学コンピュータ科学部助教。彼の主な研究上の関心はデジタル・ヒューマニティーズとデジタル図書館であり、とくに前近代中国の言語・文学・歴史の研究にデジタル技術を応用することに関心がある。彼は、前近代中国文献のデジタル図書館、中国哲学書電子化計画（Chinese Text Project）の創始者である。

Donald Sturgeon is Assistant Professor at the Department of Computer Science, Durham University, UK. His main research interests are in digital humanities and digital libraries, and in particular the application of digital methods to the study of the language, literature, and history of premodern China. He is also the creator of the Chinese Text Project (https://ctext.org), a digital library of premodern Chinese written works.

中国哲学書電子化計画による前近代漢籍への翻刻と注釈

"Transcribing and annotating premodern Chinese texts with the Chinese Text Project"

前近代中国には、膨大な量の文献資料が版本や手稿として存在する。現在までに、何億ページもの資料がデジタル化され、オンラインで閲覧できるようになったが、これは個人では一生かかっても読み切れないほどの量である。

これらの資料を処理するための純粋に機械的な計算機によるアプローチは、その単純性、予測可能性、拡張性ゆえにかなりの価値があるが、これらのアプローチは、処理が容易なデジタルオブジェクトの形式的特徴を扱おうとすることが多く、文献研究者が関心を持つ意味論的に意味のある単位には直接対応しないことが多い。結果として、写本の画像からテキストを正確に転写したり、転写されたテキスト内の特定の人物、場所、日付への参照をすべて特定するといった一見すると単純なタスクは、通常、高度な人間の入力に依存するか、あるいは、自明ではないエラー率をもたらす自動処理によっておおまかに実施するしかない。深層学習のような技術は、このようなタスクを人間のパフォーマンスにより近づけるために役立つが、高品質の結果を達成することができるモデルを訓練するためには、通常は、（費用のかかる）人手で作成した大量の用例が必要である。

本講演では、中国語テキストの翻刻、注釈、知識ベース作成のためのクラウドソーシングプラットフォーム（https://ctext.org）の進捗について説明する。このプラットフォームでは、完全自動化および半自動化の手法と、多数のボランティアユーザーによる手動注釈と修正が組み合わされている。このプロジェクトは、比較的小規模な個人の貢献を大量に集約することで、デジタル化された資料の品質と有用性を継続的に直接的に向上させることを目的としている。テキスト中の固有表現を参照するための注釈、およびこれらの固有表現に関する構造化されたデータの作成は、人間の読者に直ちに有用な文脈情報を提供し、意味検索と手動注釈作業自体の支援向上を直接的に促進するものである。長期的には、この（ユーザーとの対話を通じて継続的に拡張する）データセットは、より意味的にニュアンスのあるテキストとデータマイニングの可能性を提供し、注釈や知識抽出などのタスクでより正確に人間を近似するモデルの訓練と評価のための基礎を提供するものでもある。

Vast amounts of premodern Chinese written materials exist as block printed or handwritten texts. To date, hundreds of millions of pages of such material have been digitized and made available online – far more than any individual person could ever hope to read in a lifetime.

While purely mechanical computational approaches to processing these materials have considerable value due to their simplicity, predictability, and scalability, these approaches tend to deal primarily with formal features of digital objects which are easy to process, but typically do not correspond directly to semantically meaningful units of interest to scholars of the works. As a result, seemingly simple tasks like accurately transcribing text from an image of a manuscript, or identifying all references to a given person, place, or date within a transcribed text typically either rely upon a high level of human input, or can only be performed approximately through automated processes that introduce a non-trivial rate of error. Techniques such as deep learning can be used to more closely approximate human performance in these tasks, but these typically also require large numbers of examples created manually – and expensively – by humans in order to train models able to achieve high quality results.

This talk describes progress made towards a crowdsourced platform for transcription, annotation, and knowledge base creation for Chinese texts (https://ctext.org), in which fully-automated and semi-automated methods are combined with manual annotation and correction performed by large numbers of volunteer users. By aggregating large numbers of relatively small individual contributions, this project aims to directly improve the quality and utility of digitized materials as an ongoing process. Annotation of references to entities in texts, as well as the creation of structured data about these entities, provides immediately useful contextual information to human readers, and directly facilitates both semantic search and improved assistance with the manual annotation task itself. Longer term, this same dataset, continually expanding through user interaction, also offers the potential for more semantically nuanced text and data mining, and provides a basis for training and evaluating models that can more accurately approximate humans in tasks such as annotation and knowledge extraction.

Commentator:

Kiyonori Nagasaki (永崎研宣; International Institute for Digital Humanities)

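Senior Fellow at the International Institute for Digital Humanities. Center researcher (Program Officer) for the Japan Society for the Promotion of Science Program for Constructing Data Infrastructure for the Humanities and Social Sciences. Visiting Professor at the National Institute of Japanese Literature. He completed the coursework of the doctoral program in Philosophy at the Graduate School of the University of Tsukuba and withdrew without submitting a dissertation; he holds a Ph.D. in Cultural Interaction Studies from Kansai University. After serving as a COE researcher at the Research Institute for Languages and Cultures of Asia and Africa, Tokyo University of Foreign Studies, and as Associate Professor in the Faculty of Intercultural Studies, Yamaguchi Prefectural University, among other posts, he took part in founding the International Institute for Digital Humanities. He has provided research support for the digitization and application of cultural materials at universities and research institutions across Japan. His academic service has included terms as an editorial board member of the journal of the Information Processing Society of Japan, standing committee member in charge of information for the Japanese Association of Indian and Buddhist Studies, Chair of the Japanese Association for Digital Humanities, and board member of the TEI Consortium. His books include 『文科系のための情報発信リテラシー』 (Tokyo Denki University Press, 2004) and 『日本の文化をデジタル世界に伝える』 (Jusonbo, 2019).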