Using TXM to analyze speech transcriptions corpora

Workshop ID:


Workshop Title:

Using TXM to analyze speech transcriptions corpora


Serge Heiden

Time (in JST and UTC):

July 25 17:00-20:30 (JST)

July 25 8:00-11:30 (UTC)

expected no. of participants:



The TXM platform combines powerful and original techniques for the analysis of structured and annotated textual corpora using modular and open-source components (Heiden 2010). It was initiated by the ANR Textometry project, which launched a new generation of textometric research, in synergy with current corpus and statistical technologies (Unicode, XML, TEI, NLP tools, CQP and R).

It helps users build and analyze any type of digital textual corpus, which may be tagged and structured in XML. It is distributed as desktop software for Windows, Mac and Linux called “TXM” (based on Eclipse RCP technology) and as web portal software (based on GWT technology) for online access to corpora.

It is commonly used by research projects in various disciplines of the humanities and social sciences, such as history, literature, geography, linguistics, sociology and political science. Textometric scientific publications are presented at the international conference Days of Textual Data Statistical Analysis (JADT, Journées d’Analyse statistique des Données Textuelles).

Aim of the workshop/tutorial:

The objective of this tutorial is to introduce participants to the new tools developed in TXM for analyzing corpora of speech recording transcriptions, that is, texts synchronized with an audio or video source recording at the speech turn, utterance or word level. Participants will first learn how to use any word processing software (such as MS Word or LibreOffice Writer) to transcribe recordings using a simple syntax that encodes speech turns and their speakers, time bullets, comments and section boundaries. They will then learn how to convert those transcriptions to the Transcriber XML format for import into TXM, and how to analyze them with the available content analysis tools: word frequency lists, collocation analysis, KWIC concordances, and transcription reading synchronized with playback of the audio or video recordings.

During the tutorial, each participant will install TXM and the TreeTagger lemmatizer on their own Windows, Mac or Linux laptop and will leave the tutorial with a ready-to-use environment.


This is a half-day tutorial:

Part 1: Introduction [30 min]

  • meetup introduction

  • introduction to TXM and to corpora of recording transcriptions

Part 2: Interactive hands-on session [2 h 30 min]

Part 2.1: Corpus preparation

  • transcribing recordings using a simple syntax in LibreOffice Writer or MS Word (.odt or .docx files)

      • speech turns and their speaker codes, time bullets, comments and section boundaries

  • converting word-processor transcription documents to the Transcriber XML format (.trs files) with TXM utilities

  • importing Transcriber transcriptions (.trs) into TXM

      • preparing the media files of the recordings (.mp4 or .mp3)

      • automatic lemmatization with TreeTagger
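
As a rough illustration of the conversion target in Part 2.1, the following Python sketch builds a simplified Transcriber-style XML document from a list of speech turns. It is not the TXM conversion utility. The function name `build_trs`, the sample speakers and the media file name are hypothetical, and the sketch assumes that a minimal .trs file can be approximated with a small subset of Transcriber elements and attributes (Trans, Speakers, Speaker, Episode, Section, Turn, Sync).

```python
import xml.etree.ElementTree as ET


def build_trs(turns, speakers, audio_filename):
    """Build a simplified Transcriber-style XML tree (illustrative subset only).

    turns:    list of (speaker_id, start_seconds, end_seconds, text)
    speakers: dict mapping speaker_id -> display name
    """
    trans = ET.Element("Trans", {"audio_filename": audio_filename})

    # Declare every speaker once so that turns can refer to them by id.
    speaker_list = ET.SubElement(trans, "Speakers")
    for speaker_id, name in speakers.items():
        ET.SubElement(speaker_list, "Speaker", {"id": speaker_id, "name": name})

    # Assumption: a single section spanning the whole recording.
    episode = ET.SubElement(trans, "Episode")
    section = ET.SubElement(episode, "Section", {
        "type": "report",
        "startTime": str(turns[0][1]),
        "endTime": str(turns[-1][2]),
    })

    for speaker_id, start, end, text in turns:
        turn = ET.SubElement(section, "Turn", {
            "speaker": speaker_id,
            "startTime": str(start),
            "endTime": str(end),
        })
        # The time mark precedes the transcribed text it synchronizes.
        sync = ET.SubElement(turn, "Sync", {"time": str(start)})
        sync.tail = text
    return trans


# Hypothetical sample data, for illustration only.
speakers = {"spk1": "Interviewer", "spk2": "Respondent"}
turns = [
    ("spk1", 0.0, 2.5, "could you describe your work"),
    ("spk2", 2.5, 6.0, "well I mostly transcribe interviews"),
]
tree = build_trs(turns, speakers, "interview01")
xml_text = ET.tostring(tree, encoding="unicode")
```

In the tutorial this step is carried out by the TXM utilities. The sketch only shows how speakers, speech turns and time marks end up in the XML structure that TXM imports.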

Part 2.2: Corpus analysis

  • analysis of transcription corpora with TXM

      • word frequency lists

      • KWIC concordances

      • reading the transcription edition

      • links to playback of the audio or video recordings (from the edition or from concordances)
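
To make two of the analysis tools listed in Part 2.2 more concrete, the following Python sketch computes a word frequency list and a keyword-in-context (KWIC) concordance over a short transcription. It only illustrates the underlying ideas and is not how TXM implements them. The function names, the simple regular-expression tokenizer and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs sorted by decreasing frequency."""
    return Counter(tokens).most_common()


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


# Hypothetical transcription fragment, for illustration only.
sample = "well I think the corpus is small but the corpus is clean"
tokens = tokenize(sample)
freq = frequency_list(tokens)
lines = kwic(tokens, "corpus", width=2)

for left, key, right in lines:
    # Right-align the left context so that the keyword forms a column.
    print(f"{left:>12} | {key} | {right}")
```

In a synchronized corpus, each concordance line would also carry the time mark of its speech turn, which is what allows jumping from a hit to the matching point in the recording.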


Heiden, Serge (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In 24th Pacific Asia Conference on Language, Information and Computation (10 p.). Sendai, Japan.