Where poetry meets tech: Building a visual storytelling experience with the Google Cloud Speech-to-Text API
Staff Developer Relations Engineer
This post is a Q&A with Developer Advocate Sara Robinson and Maxwell Neely-Cohen, a Brooklyn-based writer who used the Cloud Speech-to-Text API to take words spoken at a poetry reading and display them on a screen to enhance the audience’s experience.
Could you tell me about the oral storytelling problem you’re trying to solve?
When you become a writer, you end up spending a lot of time at readings, where authors read their work aloud live to an audience. And I always wondered if it might be possible to create dynamic, reactive visuals for a live reading the same way you might for a live musical performance. I didn’t have a specific design or aesthetic or even goal in mind, nor did I think the experience would necessarily be the greatest thing ever; I just wanted to see if it was possible. How might it work, and what might it look like? That was my question. While the result was systemically very simple, sending speech-to-text results through a dictionary that had been sorted by what color people thought words were, it ended up being the perfect test case for this sort of idea and a ton of fun to play with.
What is your background?
I’m a novelist in the very old school dead tree literary fiction sense, but a lot of what I write about involves the social and cultural impact of technology. My first novel had a lot of internet and video game culture embedded in it, so I’ve always been following and thinking about software development even without that ever being my life or career. I did a whole bunch of professional music production and performance when I was a teenager and in college. This experience gave me at least a little bit of a technical relationship to using hardware and software creatively and it was the main reason I felt confident enough to undertake this project. Lately I’ve been doing these little projects to try to get the literary world and the tech world in greater conversation and collaboration. I think those two cultures have a lot to offer each other.
How did you come across the Google Cloud Speech-to-Text API, and what makes it a good fit for adding visuals to poetry?
We ended up searching for every possible speech-to-text API or software that exists, and tried to find the one that reacted fastest with the least possible amount of lag. We had been messing around with a few others, and decided to give Cloud Speech-to-Text a try, and it just worked beautifully. Because the API can so quickly return an interim result, a guess, in addition to a final updated guess, it was really ideal for this project. We had been kind of floundering for a day, and then it was like BAM as soon as the API got involved.
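The interim-then-final pattern Maxwell describes is the key to the low latency: the streaming API emits fast, revisable guesses as you speak, then commits a stable final transcript. A minimal sketch of that handling logic is below; `Result` stands in for the response objects the real `google-cloud-speech` streaming client yields (which requires a microphone stream and Google Cloud credentials, omitted here):

```python
from dataclasses import dataclass

# Stand-in for a streaming recognition result: a transcript guess plus a
# flag saying whether it is interim (fast, may change) or final (stable).
@dataclass
class Result:
    transcript: str
    is_final: bool

def handle_responses(results, on_interim, on_final):
    """Dispatch interim guesses immediately for snappy visuals,
    and final results once the API settles on a transcript."""
    for result in results:
        if result.is_final:
            on_final(result.transcript)
        else:
            on_interim(result.transcript)

# Simulated stream: the API refines its guess, then commits a final result.
stream = [
    Result("the rose", False),
    Result("the rows of", False),
    Result("the rows of houses", True),
]

interims, finals = [], []
handle_responses(stream, interims.append, finals.append)
print(interims)  # ['the rose', 'the rows of']
print(finals)    # ['the rows of houses']
```

Driving the visuals from the interim callback is what makes the screen feel like it reacts the instant a word is spoken, with the final callback available for corrections.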
What’s the format of these poetry events? Could you tell me more about CultureHub?
The first weeklong residency, last June, was four days of development with an absolute genius NYU ITP student named Oren Shoham, and then three days of having writers come in and test it. I just emailed a whole bunch of friends, basically, a group that luckily for me includes a lot of award-winning authors, and they were kind enough to show up and launch themselves into it. We really had no idea what would work and what wouldn’t, so it was just a very experimental process.
The second week, this November, we got the API running into Unity, and then had a group of young game developers prototype different visual designs for the system. They spent four days cranking out little ideas, and then we had a public event, a reading with poets Meghann Plunkett, Rhiannon McGavin, Angel Nafis, and playwright Jeremy O. Harris, just to see what it would be like in the context of an event. Both times I tried to create collaborative environments, so it wasn’t just me trying to do it all myself. With experimental creative forms, I think having as many viewpoints in the room as possible is important.
CultureHub is a collaboration between the famed La MaMa Experimental Theatre Club and Seoul Institute of the Arts. It’s a global art community that supports and curates all sorts of work using emerging technologies. They are particularly notable for projects that have used telepresence in all sorts of creative ways. It’s a really great place to try out an idea like this, something there previously wasn’t a great analog for.
How did you solve this with Cloud Speech-to-Text? Any code snippets you can share?
For the initial version, we used a Python script to interact with the API; the biggest change was adapting and optimizing it to run pseudo-continuously. We then fed the results into the NRC Word-Emotion Association Lexicon, a database assembled by computer scientists Saif Mohammad and Peter Turney, and sent both the color results and the text itself into a Max/MSP patch, which generated the visuals.
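The lexicon step boils down to a dictionary lookup: each recognized word either maps to a color or is skipped. A minimal sketch of that idea follows; the three dictionary entries are invented for illustration (the real NRC data ships as a tab-separated file you would parse into a dict like this):

```python
# Hypothetical slice of a word-to-color lexicon; the real NRC lexicon is a
# large tab-separated file, loaded into a mapping of word -> color.
WORD_COLORS = {
    "ocean": "blue",
    "fire": "red",
    "grass": "green",
}

def colors_for(transcript):
    """Map each recognized word to a color, skipping words not in the lexicon."""
    words = transcript.lower().split()
    return [WORD_COLORS[w] for w in words if w in WORD_COLORS]

print(colors_for("The fire over the ocean"))  # ['red', 'blue']
```

Those colors, along with the raw text, would then be handed off to whatever renders the visuals, Max/MSP in the first version of the project.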
The second version used Node instead of the Python script, and Unity instead of Max/MSP. You can find it on GitHub.
Do you have advice for other new developers looking to get started with Cloud Speech-to-Text or ML in general?
I would say even if you have no experience coding, if you have an idea, just go after it. Building collaborative environments where non-technical creatives can collaborate with developers is innovative and fun in itself. I would also say there can be tremendous value in ideas that have no commercial angle or prospect. No part of what I wanted to do was a potential business idea or anything like that; it’s just a pure art project, done because why not.