Introducing the HiPSTAS Audio Toolkit Workflow: Audio Labeling

By Tanya Clement and Steve McLaughlin

Audio preservation and access present a significant resource management issue for libraries and archives. Digitizing sound is a task that can be partly automated, but describing a recording so its contents are discoverable requires a much more labor-intensive workflow. Imagine you have found an unlabeled cassette. You put it into your boom box, you press play, you turn on your auto-magic digitizer (i.e., your phone’s recorder) because you’d rather have an MP3, and then you get up and leave the room. The machine still plays the music and your phone still records it, but the sound is too indistinct for Siri or Shazam to recognize it. You end up with 90 minutes of audio and no idea what’s in it. Is it your grade school piano recital, a bootleg of that U2 concert you went to in 1988, or, quite possibly, the blank hum of white noise? You’d have to listen to every minute of it to find out. Similarly, librarians and archivists lack tools that can automatically generate descriptive metadata for unheard audio files, especially for recordings that include background noise, non-speech sounds, or poorly documented languages, all of which lie beyond the reach of sound identification and automatic transcription software.

So why not use machine learning tools?

This is the first blog post in a series that will present some DIY techniques for using freely available machine learning algorithms to help label these “unheard” recordings. The HiPSTAS Audio Tagging Toolkit, a Python package designed by Steve McLaughlin, is the result of many years’ work, initially funded by the NEH (in 2012 and 2013) and currently supported by a multi-year grant funded by the IMLS in collaboration with the Pop Up Archive, the WGBH Educational Foundation, and the American Archive of Public Broadcasting. Along with the interactive Audio Labeler application, the Audio ML Lab virtual environment, and the Speaker Identification for Archives (SIDA) pipeline, we present a set of resources that make it possible for anyone with a laptop to start automatically generating metadata for mixed-sound audio collections.

The Audio Tagging Toolkit builds on several open source audio processing tools, including FFmpeg, Librosa, and aubio, to support a workflow for training and applying audio machine learning classifiers. The toolkit is designed to be accessible to programming novices, offering several readable, modifiable modules that expedite common tasks in an audio annotation workflow.

Upcoming posts will share information about the workflow and examples for speaker identification as part of the HiPSTAS project with WGBH and the Pop Up Archive, but this workflow can be adapted for other sounds as well. The workflow includes the following steps, which are explained more thoroughly below and in future blog posts:

  1. Select a set of audio files (MP3 or WAV) as a training corpus, typically several dozen or several hundred episodes of a radio or TV show. Your primary speaker(s) of interest should appear fairly often in these recordings.
  2. Launch the Audio Labeler application and apply labels to a series of randomly selected 1-second clips. 500–1000 labels is a good target for a training set for identifying a speaker of interest.
  3. The next three steps take place in the SIDA preprocessing template:
    1. First, extract individual WAV files for each 1-second clip you just labeled.
    2. Next, extract vowel segments from these WAV clips using Audio Tagging Toolkit, creating a large collection of very short audio files. (On systems with limited processor speed and/or memory, this can help save time in the long run. If a batch process is interrupted, for instance, you can start again where you left off.)
    3. Finally, extract training features (MFCCs + deltas + delta deltas) for each vowel clip and write each clip’s features to a separate CSV file. (A minimal feature-extraction sketch follows this list.)
  4. Switch to the SIDA train and classify template and load your saved features from CSV for each class you plan to use for training. (Reading data from CSVs is much faster than extracting features directly from audio, making trial-and-error experimentation easier.)
  5. Download pre-extracted features from the AAPB Universal Background Model dataset and add them to your non-speaker-of-interest training data. With several thousand speakers chosen at random from the AAPB’s collection of public radio and TV broadcasts, the AAPB-UBM will help you make your speaker classifier more robust.
  6. Train a classifier model using scikit-learn and save the model to disk as a Python pickle (“.pkl”) file. A simple multi-layer perceptron classifier (i.e., a shallow neural network) is a good starting point: scikit-learn’s MLPClassifier is versatile, fast, and easy to train. (A training sketch follows this list.)
  7. Load your saved model from the pickle file. (If you want to use the same classifier in the future, you can start from this step.)
  8. Run the classifier on a new audio file by breaking it into equal-sized windows (somewhere in the range of 1- to 5-second resolution), then averaging the model’s output across each window. For better accuracy, detect vowel segments in each window first, then discard classification values for the non-vowel portions of the audio.
  9. Write classifier output for each unseen audio file to a CSV. Depending on your needs, you may want to apply a rolling average to these values first and/or choose a cutoff threshold to convert decimal values to binary classes. Audio Tagging Toolkit includes several handy functions for these cleanup tasks.
  10. To view your results, open an audio file in Sonic Visualiser, then load its CSV-formatted classification values as a region layer, which will display your data as colored bars overlaid on the audio’s waveform. If you wish, you can correct and adjust your machine labels by hand or simply use them as a guide for a new annotation layer.
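
As a concrete illustration of step 3c, the sketch below extracts MFCCs plus deltas and delta-deltas from a directory of 1-second WAV clips and writes one CSV per clip, using Librosa alone. It is a minimal stand-in rather than the Audio Tagging Toolkit’s or SIDA’s actual code: the directory paths and the number of coefficients are assumptions, and it skips the vowel-segmentation step described in 3b.

```python
# Minimal sketch, not the Audio Tagging Toolkit API: paths and parameters are assumptions.
import csv
import glob
import os

import librosa
import numpy as np


def clip_features(wav_path, n_mfcc=13):
    """Return MFCCs plus deltas and delta-deltas for one short clip, one row per frame."""
    y, sr = librosa.load(wav_path, sr=None)              # keep the clip's native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T            # shape: (frames, 3 * n_mfcc)


def write_features(wav_dir, csv_dir):
    """Write one CSV of frame-level features for each WAV clip in a directory."""
    os.makedirs(csv_dir, exist_ok=True)
    for wav_path in sorted(glob.glob(os.path.join(wav_dir, "*.wav"))):
        rows = clip_features(wav_path)
        out_path = os.path.join(csv_dir,
                                os.path.basename(wav_path).replace(".wav", ".csv"))
        with open(out_path, "w", newline="") as f:
            csv.writer(f).writerows(rows)


# Example (hypothetical paths):
# write_features("labeled_clips/terry_gross", "features/terry_gross")
```

In the full pipeline these features would be computed from the vowel segments extracted in step 3b rather than from the whole clip.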
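
Steps 4 through 9 mostly reduce to standard scikit-learn and pandas calls. The sketch below loads the per-clip feature CSVs produced above, trains a shallow MLP, pickles it, and then smooths and thresholds the resulting probabilities. The directory names, network size, rolling-average window, and cutoff are illustrative assumptions, not the SIDA templates’ actual defaults.

```python
# Minimal sketch under assumed file layouts; not the SIDA templates' actual code.
import glob
import os
import pickle

import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier


def load_class(csv_dir, label):
    """Stack every frame from every feature CSV in a directory and tag it with a class label."""
    frames = [pd.read_csv(path, header=None).values
              for path in glob.glob(os.path.join(csv_dir, "*.csv"))]
    X = np.vstack(frames)
    y = np.full(len(X), label)
    return X, y


# Hypothetical class directories: 0 = speaker of interest, 1 = everything else
X0, y0 = load_class("features/terry_gross", 0)
X1, y1 = load_class("features/background", 1)
X, y = np.vstack([X0, X1]), np.concatenate([y0, y1])

# Step 6: train a shallow neural network and save it as a pickle
model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
model.fit(X, y)
with open("speaker_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Step 9: smooth frame-level probabilities with a rolling average, then threshold.
# (Here we smooth the training features for illustration; in practice you would
# run this on features extracted from unseen audio, as in step 8.)
probs = pd.Series(model.predict_proba(X)[:, 0])    # P(speaker of interest) per frame
smoothed = probs.rolling(window=10, center=True, min_periods=1).mean()
binary = (smoothed > 0.5).astype(int)
```

To reuse the classifier later (step 7), load the pickle with pickle.load and skip straight to classification.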

A more thorough description of the first step in the workflow, audio labeling, is the focus of the remainder of this post.

Labeling Audio for Machine Learning

The first steps in our audio machine learning workflow include assembling a set of labeled training data, i.e., examples of the categories we want our machine learning algorithm to recognize. For instance, we might train a binary classifier with labels for “music” vs. “speech,” or “Speaker A” vs. “Not Speaker A.” This technique, in which we begin with labels applied by humans, is called supervised learning, and the data we use to create a new classifier model is known as the training set. For examples, see previous machine learning applications in the HiPSTAS project, which have included finding applause in poetry performances and identifying instances of changing genres (speech, song, and instrumental) in field recordings. 

The goal of the project that we will discuss here is to train a model that identifies a single speaker’s voice. One approach would be to collect a handful of recordings that contain the speaker, then mark every point where that speaker is heard. Because listening is time-consuming, an individual labeler may only be able to get through five or ten hours of material for a single speaker. Unfortunately, models trained this way often perform poorly: in most cases they end up overfit to the training set, meaning they are too narrowly tailored to the five or ten hours of audio they are drawn from and cannot be used to identify the same voice in other recording settings or after the speaker ages and his or her voice changes. To create more robust classifiers, we need a broader, more varied set of examples.

The workflow we introduce here facilitates labeling audio clips for a training set at random from a corpus of several dozen or several hundred recordings. As we developed our current workflow, we focused on creating models that would identify two speakers in particular: Terry Gross, host of NPR’s Fresh Air, and Marco Werman, host of the news show The World, co-produced by WGBH and the BBC World Service. We chose Fresh Air because it posed few technical challenges: nearly every guest is recorded in a professional studio, so the audio quality is consistent and free of background noise, and NPR.org hosts thousands of downloadable Fresh Air episodes spanning the past three decades (although before the mid-2000s, only selected programs are available). The World was appealing precisely because it seemed challenging: it includes speakers from many countries, with many accents, recorded under wildly heterogeneous conditions. A given program may contain several phone interviews, stories filed by far-flung correspondents using a range of recording gear, and the varied background sounds of noisy street environments. We used 20 episodes of Fresh Air and 100 from The World, all chosen at random to reflect changes in the hosts’ voices over time.

To expedite the random labeling process, McLaughlin created the Audio Labeler application, which provides a browser-based interface for labeling one-second clips chosen at random from a set of audio files supplied by the user, displaying a waveform as a visual aid and playing the following four seconds of audio to help settle ambiguities. We chose one second as the default label duration because speech segments shorter than a second can be difficult to identify on the first listen, while longer clips are more likely to include multiple speakers and multiple sounds, making them useless for training classifiers to find unique voices or sounds.
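
Under the hood, selecting random one-second clips requires little more than FFmpeg. The snippet below is a rough approximation of that selection step, not the Audio Labeler’s actual code: it asks ffprobe for a file’s duration, picks a random offset, and writes a one-second mono WAV; the output naming scheme is an assumption.

```python
# Minimal sketch; assumes ffmpeg and ffprobe are installed and on the PATH.
import os
import random
import subprocess


def audio_duration(path):
    """Return a file's duration in seconds, as reported by ffprobe."""
    out = subprocess.check_output([
        "ffprobe", "-v", "error", "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1", path])
    return float(out)


def random_clip(audio_paths, out_dir, duration=1.0):
    """Extract a 1-second mono WAV from a random offset in a randomly chosen file."""
    path = random.choice(audio_paths)
    start = random.uniform(0, audio_duration(path) - duration)
    base = os.path.splitext(os.path.basename(path))[0]
    out_path = os.path.join(out_dir, "%s_%010.2f.wav" % (base, start))
    subprocess.check_call([
        "ffmpeg", "-y", "-ss", "%.2f" % start, "-i", path,
        "-t", str(duration), "-ac", "1", out_path])
    return out_path
```

A labeling interface then only needs to present each extracted clip with its waveform and a few seconds of following context, and record the label the user chooses.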

As a rough rule of thumb, we find we need at least 400–500 one-second labels for a given speaker of interest, which required labeling 4,000 clips each from Fresh Air and The World for our example. In order to employ a binary classifier (Is it this or that?) or a multi-class classifier (Is it this or that or this other thing?) to identify sounds of interest, it is also necessary to have counterexamples to the voice you are seeking: if the machine is deciding between this and that, it needs examples of that. In labeling clips to train a classifier to identify Terry Gross across episodes of Fresh Air, we labeled clips of Terry Gross from different episodes but also clips of “Music,” “Background speaker,” and “Silence”: https://github.com/hipstas/podcast-speaker-labels/blob/master/Fresh_Air/Terry_Gross_labels_randomized.csv. In labeling clips to train a classifier to identify Marco Werman across episodes of The World, our labels included snippets of Marco Werman but also of other speakers, whom this time we marked “Male,” “Female,” “Carol Hills,” or “Multiple Speakers” (instead of simply “Background Speaker”), along with other sounds such as “Music” and “Silence”: https://github.com/hipstas/aapb-speaker-labels/blob/master/speaker_labels_randomized/The_World_WGBH_labels_100_episodes.csv. Once we have trained our classifiers, our model will examine a previously unseen segment of audio and assign it to one of these labels or “bins.” A decision tree shows this part of the process.

Finally, state-of-the-art speaker identification systems include thousands of individual speakers in their training sets, with those data known collectively as a “universal background model” (UBM). For example, using only the labels we generated above, we could end up with relatively few speakers in our “Background speaker” label set. A typical episode of Fresh Air, for instance, might include four or five speakers in addition to Terry Gross. With 20 episodes in our training corpus, that comes to no more than 100 speakers for our model to consider: a good start, but far from representing the vast range of “Not Terry Gross” speakers who have appeared on other Fresh Air episodes over the years and who would likely appear in the “unheard” portions of Fresh Air where we would like to identify the presence of Terry Gross when we apply the model. To help address the problem of too few examples of “other” voices in training sets for identifying speakers in the American Archive of Public Broadcasting, we selected a 4000-hour subset of the AAPB’s media collection (6547 audio and video files in all), then extracted two ten-second clips at random from each file. McLaughlin, along with Ryan Blake, a Master’s student at UT Austin’s School of Information, then used the Audio Labeler application to apply 1-second labels within this cross-section of the AAPB. As a result, the AAPB-UBM currently contains 3700 usable 1-second speech clips, each labeled by apparent gender. In addition to the raw WAV audio, we have also posted a set of extracted features (MFCCs + deltas + delta deltas) from vowel segments in these clips, which can be imported directly into the Speaker Identification for Archives (SIDA) speaker identification pipeline. We are sharing this AAPB-UBM dataset so that it can be combined with users’ own background speaker labels to create more robust classifiers than would otherwise be possible with limited time and training data.
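
Folding the AAPB-UBM into your own training data then amounts to treating its feature rows as additional members of the background class. The sketch below assumes both your own background-speaker features and the downloaded UBM features live in directories of per-clip CSVs; the paths are hypothetical, not the published dataset’s actual layout.

```python
# Minimal sketch; the directory layout is an assumption.
import glob
import os

import numpy as np
import pandas as pd


def stack_feature_csvs(csv_dir):
    """Concatenate the frame-level feature rows from every CSV in a directory."""
    return np.vstack([pd.read_csv(path, header=None).values
                      for path in glob.glob(os.path.join(csv_dir, "*.csv"))])


own_background = stack_feature_csvs("features/background")   # your own background labels
ubm_background = stack_feature_csvs("features/aapb_ubm")      # downloaded AAPB-UBM features

# Both sets count as "not the speaker of interest" when training the classifier.
background = np.vstack([own_background, ubm_background])
```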

In our next post we will introduce the Speaker Identification for Archives (SIDA) pipeline, including a Jupyter notebook that walks through the rest of our speaker identification workflow: feature extraction, model training, and classification.

Machine-labeled segments of Marco Werman’s speech on The World.


Using ARLO in the History of Modern Latin America Archives

Hannah Alpert-Abrams, PhD candidate in Comparative Literature at UT, discusses using ARLO in the History of Modern Latin America through Digital Archives classroom: http://www.pterodactilo.com/blog/experimental-technology-and-digital-pedagogy/ #hipstas


HiPSTAS NEH Institute Final White Paper

The HiPSTAS NEH Institute Final White Paper is here.


MLA 2016: Close and Distant Listening to Poetry with HiPSTAS and PennSound

HiPSTAS is at MLA 2016 in Austin!

Thursday, 7 January

136. Close and Distant Listening to Poetry with HiPSTAS and PennSound

5:15–6:30 p.m.

Program arranged by the Forum TM Libraries and Research

Presiding: Tanya E. Clement, Univ. of Texas, Austin

There are hundreds of thousands of hours of important spoken text audio files, dating back to the nineteenth century and up to the present day. These artifacts, many of which comprise poetry readings by significant literary figures, are only marginally accessible for listening and almost completely inaccessible for new forms of analysis and instruction in the digital age. Further, in August 2010, the Council on Library and Information Resources and the Library of Congress issued a report titled The State of Recorded Sound Preservation in the United States: A National Legacy at Risk in the Digital Age, which suggests that if scholars and students do not use sound archives, our cultural heritage institutions will be less inclined to preserve them. Librarians and archivists need to know what scholars and students want to do with sound artifacts in order to make these collections more accessible, but humanities scholars, arguably, also need to know what kinds of analysis are possible in an age of large, freely available digital collections and advanced computational analysis.

To be sure, computer performance, in terms of speed and storage capacity, has increased to the point where it is now possible to analyze large audio collections with high performance systems, but scholars’ abilities to do new kinds of research (what Jerome McGann calls “imagining what you don’t know”) and to share and teach these methodologies with colleagues and students are almost entirely inhibited by present modes of access. This panel addresses these issues through an introduction to the HiPSTAS (High Performance Sound Technologies for Access and Scholarship) Project. Funded by the National Endowment for the Humanities, HiPSTAS is a collaboration between the iSchool at the University of Texas, Austin and the Illinois Informatics Institute (I3) at the University of Illinois at Urbana-Champaign, as well as scholars, librarians, and archivists, to develop new technologies for facilitating access to and analysis of spoken word recordings.

Specifically, this panel will address what it means to “close-listen” (Bernstein 2011) and “distant-listen” (Clement 2012) to digital recordings of poetry performances.

Charles Bernstein, co-director of PennSound (the largest internet archive of poetry readings, both in terms of content and audience), closely identifies literary scholarly inquiry into sound or “close listening” with increased access, claiming that with such access, “the sound file would become . . . a text for study, much like the visual document. The acoustic experience of listening to the poem would begin to compete with the visual experience of reading the poem” (Bernstein 114). This fifteen-minute introduction will be the first MLA presentation on how PennSound’s modes of access, though freely available as downloads, are shaped not only by editorial criteria and approaches to copyright, but also by modes of funding, technical features, how the site is used by listeners both in the U.S. and globally, and the relationship the site has to institutional affiliations such as the University of Pennsylvania Libraries and the Electronic Poetry Center.

This panel will also comprise the first presentations from the HiPSTAS project. The HiPSTAS team is developing ARLO (Adaptive Recognition with Layered Optimization) as a tool for “distant listening” or “investigat[ing] significant patterns within the context of a system that can translate ‘noise’ (or seemingly unintelligible information) into patterns” (Clement 2012) for interpretation. As the remaining panelists will show in three fifteen-minute presentations, some of these patterns of interest include audience sounds, material sounds that resound from recording technologies, and performance sounds that help us discern versions from remixes of poems.

Steve McLaughlin will consider audience feedback as a distinctive feature of public poetry performance that is widely overlooked. Applause, a convention so common as to be nearly invisible, indexes the presence of an audience while conveying a general sense of its size, disposition, and perhaps the success of a given reading. Fortunately, the sonic properties of applause make it well-suited for identification through machine learning. Using measurements produced by the ARLO audio analysis tool, this presentation will tease out applause patterns in poetry recordings from the PennSound archive, with reference to region, venue, time period, and other factors.

The provenance of recordings, which can provide important clues to social, economic, and production histories, is another feature that is often lost in transcriptions. The question remains whether material provenance can be recovered from vestigial artifacts encoded in recordings as “para-sound watermarks.” In his talk, Chris Mustazza will consider whether audio analysis tools can help uncover material signatures in early poetry recordings, including some by Vachel Lindsay, Gertrude Stein, and James Weldon Johnson, originally made on aluminum records, and attempt to locate other recordings of common provenance in the PennSound archive. Additional topics will include the ontological implications for audio transcodings and connecting materiality to the conditions of (re)publication.

Kenneth Sherwood explores the opportunities for interpretation made available by the fact that audio poetry archives provide scholars unprecedented access to multiple recordings of a given poem. Close listening and ethnopoetic transcription provide a methodology for the identification and description of significant paralinguistic variations but are inadequate to the scale of archives like PennSound. Using ARLO as a visualization tool, it becomes feasible to work at the scale of the archive and to address questions of broader scope, such as: Do readings tend to increase or decrease in pace over time? Do they become more or less dynamic? Do the answers to the above questions conform to or challenge dominant notions of poetic school, style, audience, setting, region, etc.? To the extent that we find it interesting to pursue such questions, computational analysis and visualization tools may help us frame the answers.

This panel will demonstrate that infrastructures (both social and technological) that facilitate access to sound recordings have a direct impact on how we understand and teach sound cultures.

Bernstein, Charles. Attack of the Difficult Poems: Essays and Inventions. University of Chicago Press, 2011. Print.

Clement, Tanya E. “Distant Listening: On Data Visualisations and Noise in the Digital Humanities.” Text Tools for the Arts. Digital Studies / Le champ numérique. 3.2 (2012). Web. 4 April 2015.


John A. Lomax and Folklore Data

This post adds technical detail to a longer post I have published on the Sounding Out blog, in which I mention that we analyzed the recordings in the UT Folklore Center Archives at the Dolph Briscoe Center for American History, The University of Texas at Austin, which comprises 57 feet of tapes (reels and audiocassettes) and covers 219 hours of field recordings (483 audio files) collected by John and Alan Lomax, Américo Paredes, and Owen Wilson, among others. We wanted to find different sonic patterns, including the presence of instrumental music versus singing versus speech. The results of our analysis are noteworthy. For example, in the visualization shown in this brief movie, we see a subtle yet striking difference between the Lomax recordings (created 1926-1941), which are the oldest in the collection, and the others, created up until 1968. The Lomax recordings (primarily created by John Lomax) consistently contain the least amount of speech in comparison to the other files.

UT Folklore Collection, Visualizing the predicted presence of Instruments, Speech, and Song using ARLO from Tanya Clement on Vimeo.

How was this data produced? We used the ARLO software. We tagged 4,000 randomly selected two-second windows; ARLO divided these windows into 1/32-second windows.

[Image: machine tagging in ARLO]

We ended up with 93,966 instrument tags, 48,718 spoken tags, and 81,890 sung tags. Counting all of the tagged spectra (including those that were not instrumental, spoken, or sung), we had 25,053,489 spectra in total across all 4,000 windows.

The results in the movie are shown for each file, grouped by date across the x-axis. The dates are shown at the top of the screen. The y-axis shows the number of seconds that each class (green = instrumental; red = spoken; purple = sung) was the top prediction for each file. The blue bar shows the total number of seconds in each file. The movie scrolls through these results across the collection in date order.
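
For readers who want to build a similar view outside ARLO, the sketch below draws a comparable chart with matplotlib from a hypothetical per-file CSV of predicted seconds per class; the filename and column names are assumptions, not ARLO’s export format.

```python
# Minimal sketch; the CSV layout is a hypothetical stand-in for ARLO's output.
import matplotlib.pyplot as plt
import pandas as pd

# Assumed columns: filename, year, instrumental_secs, spoken_secs, sung_secs, total_secs
df = pd.read_csv("folklore_predictions.csv").sort_values("year")
x = range(len(df))

fig, ax = plt.subplots(figsize=(12, 4))
ax.bar(x, df["total_secs"], color="blue", alpha=0.3, label="Total seconds")
ax.plot(x, df["instrumental_secs"], color="green", label="Instrumental")
ax.plot(x, df["spoken_secs"], color="red", label="Spoken")
ax.plot(x, df["sung_secs"], color="purple", label="Sung")
ax.set_xlabel("Files, ordered by recording date")
ax.set_ylabel("Seconds predicted per class")
ax.legend()
plt.tight_layout()
plt.show()
```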

Of course, there are a number of ways you can read these results, which I’ve outlined in the longer post on the Sounding Out Blog.


Hearing the Audience

HiPSTAS Participant Eric Rettberg has written a new piece at Jacket2 titled Hearing the Audience.


Marit MacArthur receives ACLS digital innovation fellowship

HiPSTAS participant Marit MacArthur has received an ACLS digital innovation fellowship to develop the ARLO interface for humanists interested in pitch tracking.


Distanced sounding: ARLO as a tool for the analysis and visualization of versioning phenomena within poetry audio

HiPSTAS Participant Kenneth Sherwood has written a new piece at Jacket2 titled Distanced sounding: ARLO as a tool for the analysis and visualization of versioning phenomena within poetry audio.


The Noise is the Content

HiPSTAS Participant Chris Mustazza has written a great piece at Jacket2 titled The noise is the content: Toward computationally determining the provenance of poetry recordings using ARLO.


HiPSTAS wins a second grant from NEH for HRDR

Even digitized, unprocessed sound collections, which hold important cultural artifacts such as poetry readings, storytelling, speeches, oral histories, and other performances of the spoken word, remain largely inaccessible.

In order to increase access to recordings of significance to the humanities, Tanya Clement at the University of Texas School of Information, in collaboration with David Tcheng and Loretta Auvil at the Illinois Informatics Institute at the University of Illinois at Urbana-Champaign, has received $250,000 in funding from the National Endowment for the Humanities Preservation and Access Office for the HiPSTAS Research and Development with Repositories (HRDR) project. Support for the HRDR project will further the work of HiPSTAS, which is currently funded by an NEH Institute for Advanced Topics in the Digital Humanities grant to develop and evaluate a computational system that helps librarians and archivists discover and catalog sound collections. The HRDR project will produce three primary products: (1) a release of ARLO (Adaptive Recognition with Layered Optimization) that leverages machine learning and visualizations to augment the creation of descriptive metadata for use with a variety of repositories (such as a MySQL database, Fedora, or CONTENTdm); (2) a Drupal ARLO module for Mukurtu, an open source content management system specifically designed for use by indigenous communities worldwide; and (3) a white paper that details best practices for automatically generating descriptive metadata for spoken word digital audio collections in the humanities.
