Validating a computer system
Finally, notice that even though TIMIT is a speech corpus, its transcriptions and associated data are just text, and can be processed using programs just like any other text corpus.
Therefore, many of the computational methods described in this book are applicable.
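As a minimal sketch of treating transcription data as ordinary text, the following parses word-level time alignments and computes simple statistics with the standard library alone. The `start end word` line layout is an assumption about how such alignments are stored, and the sample offsets, the words, and the helper name `parse_word_times` are invented for illustration; none of them are taken from the corpus itself.

```python
from collections import Counter

# Illustrative word-level transcription in an assumed "start end word" layout.
# The offsets and words below are made up for this sketch.
SAMPLE_WRD = """\
0 2400 the
2400 7100 cat
7100 9800 saw
9800 12000 the
12000 17500 dog
"""


def parse_word_times(text):
    """Return (word, start, end) tuples from 'start end word' lines."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        start, end, word = line.split()
        entries.append((word, int(start), int(end)))
    return entries


word_times = parse_word_times(SAMPLE_WRD)

# Ordinary text-corpus processing: count how often each word occurs.
word_counts = Counter(word for word, _, _ in word_times)

# Collect the duration (end offset minus start offset) of every token, per word.
durations = {}
for word, start, end in word_times:
    durations.setdefault(word, []).append(end - start)
```

Once the alignments are plain tuples, the same counting and grouping techniques used for any other text corpus apply without change.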
The inclusion of speaker demographics brings in many more independent variables that may help to account for variation in the data, and that facilitate later uses of the corpus for purposes not envisaged when the corpus was created, such as sociolinguistics.
A third property is that there is a sharp division between the original linguistic event captured as an audio recording, and the annotations of that event.
Structured collections of annotated linguistic data are essential in most areas of NLP; however, we still face many obstacles in using them.
TIMIT was designed to provide data for the acquisition of acoustic-phonetic knowledge and to support the development and evaluation of automatic speech recognition systems.
This last observation is less surprising when we consider that text and record structures are the primary domains for the two subfields of computer science that focus on data management, namely text retrieval and databases.
A notable feature of linguistic data management is that it usually brings both data types together, and that it can draw on results and techniques from both fields.
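To make the combination of the two data types concrete, here is a small sketch that joins a record structure (a table of speaker demographics) with text (transcriptions) and groups tokens by a demographic field. The `Speaker` class, its field names, the speaker identifiers, and the sentences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    """A hypothetical demographic record for one speaker."""
    speaker_id: str
    dialect_region: str
    sex: str


# Record-structured data: one entry per speaker (invented values).
speakers = {
    "spk1": Speaker("spk1", "dr1", "f"),
    "spk2": Speaker("spk2", "dr1", "m"),
    "spk3": Speaker("spk3", "dr2", "f"),
}

# Text data: (speaker_id, transcription) pairs (invented values).
utterances = [
    ("spk1", "the cat saw the dog"),
    ("spk2", "a dog barked"),
    ("spk3", "the bird sang"),
]


def tokens_by_region(speakers, utterances):
    """Group whitespace-separated tokens by the speaker's dialect region."""
    grouped = defaultdict(list)
    for speaker_id, text in utterances:
        region = speakers[speaker_id].dialect_region  # database-style lookup
        grouped[region].extend(text.split())          # text-style processing
    return dict(grouped)


region_tokens = tokens_by_region(speakers, utterances)
```

The lookup by `speaker_id` is a database-style operation, while the tokenization is a text-processing one; the point of the sketch is only that both are needed side by side.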
Five of the sentences read by each speaker are also read by six other speakers (for comparability).
The remaining three sentences read by each speaker were unique to that speaker (for coverage).
You can access its documentation in the usual way.
This gives us a sense of what a speech processing system would have to do in producing or recognizing speech in this particular dialect (New England).