Text, Audio, and Meaning: Lessons from TextAV
An event that explored how captions + transcripts can improve audio + video production
“Voice is a sound with a meaning, and is not merely the result of any impact of the breath as in coughing…” —Aristotle, “On the Soul,” trans. J.A. Smith
A couple of weeks ago, I attended the TextAV workshop, supported by OpenNews and hosted by NYU’s ITP program. The purpose of the event was to explore new ways to marry text with online audio and video. (Text + AV = the TextAV workshop.) Developers, archivists, and producers gathered to share and brainstorm new methods for automating transcription and captioning processes to make audio and video media more interactive and accessible, as well as a lot easier to produce. As someone who creates all the transcripts of interviews that I use for our METRO podcast by hand—I use a transcription tool but it still takes me about three times the total length of the audio to create the transcript—I was definitely interested to see the latest innovations out there. Here are some of my main takeaways from the workshop, both concrete and cerebral.
My Starting Points
I am coming from the perspective of someone who produces audio stories and helps libraries and archives in New York work with audio. As we try to coordinate oral history projects across METRO’s member institutions, one of the most frequent complaints is that creating transcripts for oral histories is labor intensive and time-consuming. Transcription is often cited as a major barrier to entry for many libraries and archives thinking about collecting, organizing, preserving, and providing access to audio files.
Transcripts are essential for any oral history or audio production process. Without a text document with time stamps, there is no practical way to cut, wrangle, and edit long audio files; editing becomes a process of blind hunting and pecking as you try to find the clips you want within a mass of indistinguishable sound waves. Even if an oral history project is only collecting unedited audio interviews, transcripts make those audio files discoverable through keyword search and accessible to people using screen readers.
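To make the role of time stamps concrete, here is a minimal sketch of what a time-stamped transcript might look like as data, written in TypeScript. The shape, field names, and sample lines are my own illustration, not a format presented at the workshop.

// A hypothetical shape for a time-stamped transcript; the field names are illustrative.
interface TranscriptSegment {
  speaker: string;
  start: number; // seconds into the recording
  end: number;   // seconds into the recording
  text: string;
}

const transcript: TranscriptSegment[] = [
  { speaker: "Interviewer", start: 0.0, end: 4.5, text: "How did the collection get started?" },
  { speaker: "Narrator", start: 4.5, end: 12.3, text: "It began with a box of cassette tapes in the basement." },
];

// Given a segment, an editing tool can jump straight to its audio region
// instead of hunting through the waveform by ear.
function regionFor(segment: TranscriptSegment): [number, number] {
  return [segment.start, segment.end];
}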
Coding Fluidity
Blaine Cook, a founding engineer at Twitter, put forward a statement during his presentation that encapsulated a theme that I saw running through the workshop: “Content is like water.”
Blaine’s point, a reformulation of Bruce Lee’s description of martial artists, was that content is not hierarchical. The text, audio, and video content we try to break down and categorize is not actually made of concrete, categorizable components, which thwarts the promise of a semantic wonderland. How do you capture the meaning of a poem, or the process of its creation? He showed an example of a typewritten poem covered with hand-written annotations, scribbles, and lines criss-crossing it in patterns that make sense to the human eye. Capturing those annotations would be difficult to program, and the code would not be reproducible.
Blaine’s suggestion is to treat content like water. Water is hard to parse and manipulate, but you can capture it in containers. And then you can move those buckets around. So the idea is to create containers that acknowledge content’s fluidity rather than try to break it down into complex hierarchical structures. Here’s a simplified version of how he would do that in JSON:
{
  "contents": "Main text document goes here etc etc etc",
  "annotations": [
    { "position": 4, "text": "Change this" },
    { "position": 12, "text": "I like this" }
  ]
}
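In that structure the main text stays untouched and the annotations simply point into it. Here is a rough sketch in TypeScript of how a tool might lay them back over the words, assuming that position is a word offset into the contents (my reading of the example, not Blaine’s specification):

interface Annotation {
  position: number; // assumed here to be a word offset into the contents
  text: string;
}

interface AnnotatedDoc {
  contents: string;
  annotations: Annotation[];
}

// Print each annotation next to the word it points at.
// The contents string is never modified, only referenced.
function showAnnotations(doc: AnnotatedDoc): void {
  const words = doc.contents.split(/\s+/);
  for (const note of doc.annotations) {
    const word = words[note.position] ?? "(out of range)";
    console.log(`${word}: ${note.text}`);
  }
}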
This example hit on something that I have been pondering for a while: how the regime of computation alternately furthers and obscures human communication. Human beings have an instinct for telling stories that was apparently essential for our evolution as a species and brings us psychological comfort. Despite the amazing advances being made in automated speech-to-text software, it is difficult to automate AV production in a way that doesn’t strip stories of the information patterns that make sense to humans.
During the workshop Mark Boas gave a presentation about an open-source tool called Hyperaudio that he has been developing as part of a team with support from the Knight Foundation and the Mozilla Foundation. His demo convinced me that, in the near future, automated transcripts will transform how we interact with audio and video media. Just check out the projects Hyperaudio has done with the Studs Terkel Archive and Radiolab. Text-based editing could revolutionize audio production, making it more efficient and more accessible to newbie producers.
Yet, as things currently stand, there are roadblocks. The text output generated by software is missing some crucial pieces. The transcripts don’t have punctuation. They don’t have capitalization. There are weird and wonky inaccuracies. They still require significant human labor to clean up before they can be published. That’s why so many radio production studios pay people at a rate of $1/minute to transcribe their raw tape. Humans are more accurate, and the final product is much more useful to radio producers who need to quickly get a sense of what content is encoded in their audio files.
Because content is like water. Human communication follows an amorphous, complex logic that is difficult to encode and decode.
Voices Communicate Meaning
Steve Reich summarizes it beautifully in this quote that I’ve borrowed from Brian Foo’s presentation:
In our Western languages, speech melody hovers over all our conversations, giving them fine emotional meaning—“It’s not what she said, it’s how she said it.” We are, with speech melody, in an area of human behavior where music, meaning, and feelings are completely fused.
The meaningful musicality of the spoken word is something that intrigued Brian Foo when he worked on an automated transcription tool for the New York Public Library’s Community Oral Histories project. As he explains in a blog post:
I became preoccupied with the question of what was lost from the translation of spoken word to text: the subtle pauses, speech cadences, evolving dynamics, the speeding up and slowing down, the stumbles, the stammers, the um’s, and ah’s.
This prompted Brian to write a computer program that translates the spoken word into sheet music. He worked with Maya Angelou’s spoken performance of her poem Still I Rise, and the musical score that his program generated captures a poetic essence that is missing in a written transcript. (Which makes me think about how sheet music is another form of code that misses things—microtonality comes to mind.)
Encoding Broadcasting for Interactivity
Chris Baume, a Senior Research Engineer at the BBC, gave a really interesting presentation about a concept called “Object-based Broadcasting” (OBB). It calls to mind, of course, the modularity of object-oriented programming, and it applies the same principle to broadcasting. So rather than rendering broadcasts into linear media, OBB would send out media objects and the necessary metadata to put them together into a program, and the receiving devices would mix the media individually.
This would allow for personal, interactive, immersive experiences. For example, devices could reduce background noise or alter the length of a program dynamically. The BBC has implemented this concept in experimental responsive video projects, like Squeezebox and VideoContext, but it is complicated, and challenges remain. These are clear use cases for categorizing and encoding media with instructions so that we can manipulate them.
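As a thought experiment, here is a hedged sketch in TypeScript of what the receiving end of an object-based program could look like in a browser. The manifest shape and field names are my own invention, not the BBC’s format, and the mixing uses the standard Web Audio API rather than their research tooling.

// A hypothetical manifest: each media object travels separately,
// with metadata telling the device how to assemble the program.
interface MediaObject {
  id: string;
  role: "dialogue" | "ambience" | "music";
  src: string;         // URL of the individual audio asset
  defaultGain: number; // level suggested by the producer
}

interface Manifest {
  objects: MediaObject[];
}

// Mix the objects on the listener's device. Dialogue keeps its suggested level,
// while everything else is scaled by a user-controlled backgroundLevel, which is
// one way a device could "reduce background noise" for an individual listener.
function playProgram(manifest: Manifest, backgroundLevel: number): void {
  const ctx = new AudioContext();
  for (const obj of manifest.objects) {
    const element = new Audio(obj.src);
    const source = ctx.createMediaElementSource(element);
    const gain = ctx.createGain();
    gain.gain.value =
      obj.role === "dialogue" ? obj.defaultGain : obj.defaultGain * backgroundLevel;
    source.connect(gain).connect(ctx.destination);
    void element.play();
  }
}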
In a later presentation, Chris made the comment: “Speech to text is a lossy process.” Transcoding into text strips the content of some of its meaning. Which gets to a fundamental question that has been nagging me for a while: does abstraction get us closer to or further from the essence of a thing?
Questions about Media and Meaning
All communication involves abstraction. Meaning is compressed, created, and lost by the media (a.k.a. language, images, speech) we use to communicate. The further that meaning is abstracted from the thing or phenomenon being communicated, the more universally it can be applied and the more easily it can be manipulated. The ultimate example is the mathematical equations that describe fundamental laws of physics. These equations, which were either discovered or invented by humans depending on how you look at it, have tremendous power to manipulate things (the atomic bomb comes to mind). But it was only by abstracting phenomena into communicable formats that this kind of manipulation became possible.
The power of abstraction has revolutionized the ways we communicate, especially in the age of computation. So is abstracting meaning, like encoding voices into text, a process of loss or refinement? Or both? In his book Gramophone, Film, Typewriter, media theorist Friedrich Kittler points out that recording devices are non-discriminating, and it is through encoding devices, like text generation, that we can glean meaning from the noise:
“Thanks to the phonograph, science is for the first time in possession of a machine that records noises regardless of so-called meaning. Written protocols were always unintentional selections of meaning” (Page 85).
Philosophical musings aside, the TextAV workshop brought to light current and forthcoming tools that will help text, audio, and video live side-by-side in ways that aren’t mutually exclusive, providing more access points for people to experience media in whatever way means the most to them. And I really look forward to giving text-based audio editing a spin!
The Goodies
Here’s a rundown of some of the TextAV projects presented at the workshop:
- Autoedit, automated transcript generation with transcript-based editing
- Automated transcripts for Audiogram
- BBC Dialogger (specifically this component, which extends CKEditor)
- BBC transcript-editor
- BBC Kaldi, an open-source framework trained on audio from the BBC’s extensive archive
- Cadet, from WGBH
- Extract-CC-Bytestream, from Indiana University
- Guardian Media Atom Maker (blog post)
- Inline audio component
- Microservice approach for extracting Line 21 data from archival video, developed at CUNY
- NYPL Transcript Editor
- Opened Captions, a real-time caption API
- oTranscribe
- Popcorn.js at the Internet Archive
- Squeezebox, for rapidly adjusting duration
- Transcript model for representing speaker, segment, word
- VideoContext, an experimental HTML5 & WebGL video composition and rendering API
- Web annotation data model (JSON-LD) file format
If you would like to explore further, please check out the livestreams, group notes from Day 1 and Day 2, and presentations, which can all be found on GitHub.
Credits
Molly Schwartz
Molly Schwartz is the Studio Manager at the Metropolitan New York Library Council (METRO). She is also the host and producer of METRO’s podcast, Library Bytegeist. She holds a master’s degree in Library Science from the University of Maryland at College Park and a BA/MA in History from the Johns Hopkins University.