The Corpus of Teaching Assistant Classroom Speech (CoTACS) is a prosodically annotated corpus developed to provide (1) researchers with data for studying pronunciation and discourse features of TA speech, and (2) teachers and learners with materials for pronunciation instruction.
The data include classroom speech by 10 Teaching Assistants (TAs) who were native speakers of American English (ATAs) and 20 international TAs (ITAs) from different first language and disciplinary backgrounds. Each TA was recorded with a non-intrusive wearable recorder while teaching a lecture, lab, problem-solving, or studio session at Iowa State University between Spring 2018 and Fall 2019. For each speaker, the corpus includes an audio file (mean length = 54 minutes) and a TextGrid file that can be opened in Praat (Boersma & Weenink, 2009) to show the aligned transcription and annotations. Orthographic transcriptions are aligned with the audio at the level of sounds, words, and tone units. Prosodic annotations include tone units, prominence, and pauses, following Brazil’s (1976) discourse intonation framework.
A CoTACS TA speech segment opened in Praat shows, from top to bottom, the waveform, pitch movement, aligned words, aligned sounds, syllables, orthographic transcription, tone unit and prominence markup, the student speech tier (if any), and a comments tier for any additional notes. Tone unit boundaries are marked with double slashes (//), prominence is marked using CAPITAL LETTERS, and new or contrastive information (predicted independently of prosody) is marked in bold. The annotations are based on auditory analysis of phonetic cues such as pauses, vowel lengthening, and shifts in pitch direction (Pickering, 1999).
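The markup conventions described above lend themselves to simple text processing. The following is a minimal Python sketch, not part of the CoTACS tooling, that splits a markup string on double slashes and flags tokens containing a run of capital letters as prominent. The function name `parse_tone_units`, the two-capital threshold (chosen so that the pronoun "I" is not flagged), and the sample string are all assumptions made for illustration; the sample is invented and is not taken from the corpus.

```python
import re

def parse_tone_units(markup):
    """Split a markup string into tone units and flag prominent tokens.

    Assumptions (not confirmed by the corpus documentation):
    - tone unit boundaries are written as "//"
    - a token is prominent if it contains two or more consecutive capitals
    """
    units = []
    for chunk in markup.split("//"):
        tokens = chunk.split()
        if not tokens:
            continue  # skip empty pieces before the first or after the last boundary
        prominent = [t for t in tokens if re.search(r"[A-Z]{2,}", t)]
        units.append({"tokens": tokens, "prominent": prominent})
    return units

# Invented example string, for illustration only.
example = "// so the FIRST step // is to MEAsure the VOLtage //"
units = parse_tone_units(example)
for unit in units:
    print(len(unit["tokens"]), "tokens; prominent:", unit["prominent"])
```

A threshold of two capitals will miss a prominent syllable written as a single capital letter, so the pattern would need adjusting to match the transcription conventions actually used in the TextGrid files.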
CoTACS has been developed following Egbert et al.'s (2022) proposed steps for corpus design to answer a general research question: “How do patterns of prosodic features (e.g., prominence, tone units, and pauses) compare between ATA and ITA speech?” The target domain was classroom speech by TAs from different L1 backgrounds teaching courses in a variety of disciplines. To represent this domain, the following steps were taken: (1) describing the domain, (2) operationalizing the domain, and (3) planning the sample.
The table below summarizes CoTACS's composition, including speech data across a range of disciplines, courses, levels, and class types. A total of 69,750 words have been transcribed in the corpus so far, an average of 2,325 words per speaker. The word count for the whole corpus, extrapolated from each speaker’s speech rate, is 171,442 words, an estimated average of 5,715 words per file. Audio files range from 26 to 130 minutes in length, reflecting differences in class length across disciplines and class types. On average, an audio file is an approximately 54-minute recording of TA classroom speech, with occasional student responses and questions. Identifiable information, such as full names of TAs or students mentioned during class, has been removed.
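The per-speaker averages above follow directly from the totals and the 30 speakers (10 ATAs and 20 ITAs). The short Python sketch below checks that arithmetic and illustrates one plausible way to extrapolate a word count from speech rate. The exact extrapolation procedure used for CoTACS is not described here, so the function `extrapolate_word_count` and the numbers passed to it are hypothetical.

```python
SPEAKERS = 30                 # 10 ATAs + 20 ITAs
TRANSCRIBED_WORDS = 69_750    # words transcribed so far
EXTRAPOLATED_WORDS = 171_442  # estimated words in the full corpus

mean_transcribed = TRANSCRIBED_WORDS / SPEAKERS
mean_extrapolated = EXTRAPOLATED_WORDS / SPEAKERS

print(mean_transcribed)          # 2325.0
print(round(mean_extrapolated))  # 5715

def extrapolate_word_count(transcribed_words, transcribed_minutes, total_minutes):
    """Estimate the words in a full recording from the transcribed portion.

    Assumed method: speech rate (words per minute) in the transcribed
    portion multiplied by the full recording length.
    """
    rate = transcribed_words / transcribed_minutes
    return round(rate * total_minutes)

# Hypothetical speaker: 2,000 words transcribed from 20 minutes of a 50-minute class.
print(extrapolate_word_count(2000, 20, 50))  # 5000
```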