Tags: voice

Voice User Interface Design by Cheryl Platz

Cheryl Platz is speaking at An Event Apart Chicago. Her inaugural An Event Apart presentation is all about voice interfaces, and I’m going to attempt to liveblog it…

Why make a voice interface?

Successful voice interfaces aren’t necessarily solving new problems. They’re used to solve problems that other devices have already solved. Think about kitchen timers. There are lots of ways to set a timer. Your oven might have one. Your phone has one. Why use a $200 device to solve this mundane problem? Same goes for listening to music, news, and weather.

People are using voice interfaces for solving ordinary problems. Why? Context matters. If you’re carrying a toddler, then setting a kitchen timer can be tricky so a voice-activated timer is quite appealing. But why is voice happening now?

Humans have been developing the art of conversation for thousands of years. It’s one of the first skills we learn. It’s deeply instinctual. Most humans use speech instinctively every day. You can’t necessarily say that about using a keyboard or a mouse.

Voice-based user interfaces are not new. Not just the idea—which we’ve seen in Star Trek—but the actual implementation. Bell Labs had Audrey back in 1952. It recognised ten words—the digits zero through nine. Why did it take so long to get to Alexa?

In the late 70s, DARPA issued a challenge to create a voice-activated system. Carnegie Mellon came up with Harpy (with a thousand-word grammar). But none of the solutions could respond in real time. In conversation, we expect a break of no more than 200 or 300 milliseconds.

In the 1980s, computing power couldn’t keep up with voice technology, so progress kind of stopped. Time passed. Things finally started to catch up in the 90s with products like Dragon NaturallySpeaking. But that was still about vocabulary, not grammar. By the 2000s, small grammars were starting to show up: starting an Xbox or pausing Netflix. In 2008, Google Voice Search arrived on the iPhone and natural language interaction began to arrive.

What makes natural language interactions so special? They require minimal training because they use the conversational muscles we’ve been working for a lifetime. They unlock the ability to have more forgiving, less robotic conversations with devices. There might be ten different ways to set a timer.

Natural language interactions can also free us from “screen magnetism”—that tendency to stay on a device even when our original task is complete. Voice also enables fast and forgiving searches of huge catalogues without time spent typing or browsing. You can pick a needle straight out of a haystack.

Natural language interactions are excellent for older customers. These interfaces don’t intimidate people without dexterity, vision, or digital experience. Voice input often leads to more inclusive experiences. Many customers with visual or physical disabilities can’t use traditional graphical interfaces. Voice experiences throw open the door of opportunity for some people. However, voice experiences can exclude people with speech difficulties.

Making the case for voice interfaces

There’s a misconception that you need to work at Amazon, Google, or Apple to work on a voice interface, or at least that you need to have a big product team. But Cheryl was able to make her first Alexa “skill” in a week. If you’re a web developer, you’re good to go. Your voice “interaction model” is just JSON.

How do you get your product team on board? Find the customers (and situations) you might have excluded with traditional input. Tell the stories of people whose hands are full, or who are vision impaired. You can also point to the adoption rate numbers for smart speakers.

You’ll need to show your scenario in context. Otherwise people will ask, “why can’t we just build an app for this?” Conduct research to demonstrate the appeal of a voice interface. Storyboarding is very useful for visualising the context of use and highlighting existing pain points.

Getting started with voice interfaces

You’ve got to understand how the technology works in order to adapt to how it fails. Here are a few basic concepts.

Utterance. A word, phrase, or sentence spoken by a customer. This is the true form of what the customer provides.

Intent. This is the meaning behind a customer’s request. This is an important distinction because one intent could have thousands of different utterances.

Prompt. The text of a system response that will be provided to a customer. The audio version of a prompt, if needed, is generated separately using text to speech.

Grammar. A finite set of expected utterances. It’s a list. Usually, each entry in a grammar is paired with an intent. Many interfaces start out as being simple grammars before moving on to a machine-learning model later once the concept has been proven.
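To make those four terms a bit more concrete, here is a minimal sketch in Python. It is not taken from the talk: the intent names, the prompt text, and the helper functions are all made up for illustration, and a real grammar would be much longer.

```python
# Hypothetical grammar: a finite list of expected utterances,
# each paired with an intent. Several utterances share one intent.
GRAMMAR = {
    "set a timer": "SetTimer",
    "start a timer": "SetTimer",
    "cancel the timer": "CancelTimer",
    "stop the timer": "CancelTimer",
}

# Hypothetical prompts: the text of the system response for each intent.
PROMPTS = {
    "SetTimer": "For how long?",
    "CancelTimer": "Timer cancelled.",
}

FALLBACK_PROMPT = "Sorry, I didn't catch that."


def normalise(utterance: str) -> str:
    """Lower-case, trim, and drop trailing punctuation so trivial variations still match."""
    return utterance.lower().strip().rstrip(".!?")


def intent_for(utterance: str):
    """Return the intent paired with this utterance, or None if it is not in the grammar."""
    return GRAMMAR.get(normalise(utterance))


def prompt_for(utterance: str) -> str:
    """Return the prompt text for the utterance, falling back when nothing matches."""
    return PROMPTS.get(intent_for(utterance), FALLBACK_PROMPT)
```

Anything that isn't on the list falls straight through to the fallback prompt, which is presumably why a simple grammar tends to get swapped for a machine-learning model once the concept has been proven.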

Here’s the general idea with “artificial intelligence”…

There’s a human with a core intent to do something in the real world, like knowing when the cookies in the oven are done. This is translated into an intent like, “set a 15 minute timer.” That’s the utterance that’s translated into a string. But it hasn’t yet been parsed as language. That string is passed into a natural language understanding system. What comes out is a data structure that represents the customer’s goal e.g. intent=timer; duration=15 minutes. That’s sent to the business logic where a timer is actually set. For a good voice interface, you also want to send back a response e.g. “setting timer for 15 minutes starting now.”
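The flow described above can be sketched roughly like this. It is a toy: a single regular expression stands in for the natural language understanding system, and the names (ParsedRequest, understand, handle) are my own assumptions rather than anything from a real voice platform.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParsedRequest:
    """The data structure that represents the customer's goal."""
    intent: str
    duration_minutes: int


# Assumed pattern: only understands phrases like "set a 15 minute timer".
TIMER_PATTERN = re.compile(r"set (?:a|an) (\d+)[ -]minute timer")


def understand(utterance_text: str) -> Optional[ParsedRequest]:
    """Stand-in for natural language understanding: string in, structure out."""
    match = TIMER_PATTERN.search(utterance_text.lower())
    if match is None:
        return None
    return ParsedRequest(intent="timer", duration_minutes=int(match.group(1)))


def handle(request: Optional[ParsedRequest], timers: List[int]) -> str:
    """Stand-in for business logic: record the timer and return the response text."""
    if request is None:
        return "Sorry, I didn't understand that."
    timers.append(request.duration_minutes)
    return f"Setting timer for {request.duration_minutes} minutes starting now."
```

Calling understand("Set a 15 minute timer") gives intent=timer and a duration of 15, and passing that to handle gives back the response text to be spoken.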

That seems simple enough, right? What’s so hard about designing for voice?

Natural language interfaces are a form of artificial intelligence so they’re not deterministic. There’s a lot of ruling out false positives. Unlike graphical interfaces, voice interfaces are driven by probability.

How do you turn a sound wave into an understandable instruction? It’s a lot like teaching a child. You feed a lot of data into a statistical model. That’s how machine learning works. It’s a probability game. That’s where it gets interesting for design—given a bunch of possible options, we need to use context to zero in on the most correct choice. This is where confidence ratings come in: the system will return the probability that a response is correct. Effectively, the system is telling you how sure or not it is about possible results. If the customer makes a request in an unusual or unexpected way, our system is likely to guess incorrectly. That’s because the system is being given something new.
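Here is one way the confidence idea might look in code. The thresholds and the three outcomes (accept, confirm, reprompt) are assumptions of mine, not numbers from the talk; a real system would tune them against real customer data.

```python
from typing import List, Optional, Tuple

# Assumed thresholds, chosen only for illustration.
ACCEPT_THRESHOLD = 0.80
CONFIRM_THRESHOLD = 0.50


def decide(candidates: List[Tuple[str, float]]) -> Tuple[str, Optional[str]]:
    """Pick an action from a list of (intent, confidence) guesses.

    Returns ("accept", intent) when the best guess is confident enough,
    ("confirm", intent) when it is worth checking with the customer,
    and ("reprompt", None) when the system should ask again.
    """
    if not candidates:
        return ("reprompt", None)
    intent, confidence = max(candidates, key=lambda candidate: candidate[1])
    if confidence >= ACCEPT_THRESHOLD:
        return ("accept", intent)
    if confidence >= CONFIRM_THRESHOLD:
        return ("confirm", intent)
    return ("reprompt", None)
```

The middle band is where the design work comes in: the system isn't sure, so it has to find a graceful way of checking.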

Designing a conversation is relatively straightforward. But 80% of your voice design time will be spent designing for what happens when things go wrong. In voice recognition, edge cases are front and centre.

Here’s another challenge. Interaction with most voice interfaces is part conversation, part performance. Most interactions are not private.

Humans don’t distinguish digital speech from human speech. That means these devices are intrinsically social. Our brains are wired to try to extract social information, even from digital speech. See, for example, why it’s such a big question as to what gender a voice interface has.

Delivering a voice interface

Storyboards help depict the context of use. Sample dialogues are your new wireframes. These are little scripts that not only cover the happy path, but also your edge cases. Then you reverse engineer from there.

Flow diagrams communicate customer states, but don’t use the actual text in them.

Prompt lists are your final deliverable.

Functional prototypes are really important for voice interfaces. You’ll learn the real way that customers will ask for things.

If you build a working prototype, you’ll be building two things: a natural language interaction model (often a JSON file) and custom business logic (in a programming language).

Eventually voice design will become a core competency, much like mobile, which was once separate.

Ask yourself what tasks your customers complete on your site that feel clunky. Remember that voice design is almost never about new scenarios. Start your journey into voice interfaces by tackling old problems in new, more inclusive ways.

May the voice be with you!

Designing for Trust in an Uncertain World by Margot Bloomstein

The second talk of the first day of An Event Apart Seattle is from Margot Bloomstein. She’ll be speaking about Designing for Trust in an Uncertain World. The talk description reads:

Mass media and our most cynical memes say we live in a post-fact era. So who can we trust—and how do our users invest their trust? Expert opinions are a thing of the past; we favor user reviews from “people like us” whether we’re planning a meal or prioritizing a newsfeed. But as our filter bubbles burst, consumers and citizens alike turn inward for the truth. By designing for empowerment, the smartest organizations meet them there.

We must empower our audiences to earn their trust—not the other way around—and our tactical choices in content and design can fuel empowerment. Margot will walk you through examples from retail, publishing, government, and other industries to detail what you can do to meet unprecedented problems in information consumption. Learn how voice, volume, and vulnerability can inform your design and content strategy to earn the trust of your users. We’ll ask the tough questions: How do brands develop rapport when audiences let emotion cloud logic? Can you design around cultural predisposition to improve public safety? And how do voice and vulnerability go beyond buzzwords and into broader corporate strategy? Learn how these questions can drive design choices in organizations of any size and industry—and discover how your choices can empower users and rebuild our very sense of trust itself.

I’m sitting in the audience, trying to write down the gist of what she’s saying…

She begins by thanking us for joining her to confront some big problems. About ten years ago, A List Apart was the first publication to publish a piece of hers. It had excellent editors—Carolyn, Erin, and so on. The web was a lot smaller ten years ago. Our problems are bigger now. Our responsibilities are bigger now. But our opportunities are bigger now too.

Margot takes us back to 1961. The Twilight Zone aired an episode called The Mirror. We’re in South America where a stealthy band are working to take over the government. The rebels confront the leader. He shares a secret with them. He shows them a mirror that reveals his enemies. The revolution is successful. The rebels assume power. The rebel leader starts to use the same oppressive techniques as his predecessor. One day he sees in his magic mirror the same group of friends that he worked with to assume power. Now they’re working to depose him, according to the mirror. He rounds them up and has them killed. One day he sees himself in the mirror. He smashes the mirror with his gun. He is incredibly angry. A priest walking past the door hears a commotion. The priest hears a gunshot. Entering the room, he sees the rebel leader dead on the ground with the gun in his hand.

We look to see ourselves. We look to see the truth. We hope the images coincide.

When our users see themselves, and then see the world around them, the images don’t coincide.

Internal truths trump external facts.

We used to place trust in brands. Now we’ve knocked them off the pedestal, or they’ve knocked themselves off the pedestal. They’ve been shady. Creeping inconsistencies. Departments of government are exhorting people not to trust external sources. It’s gaslighting. The blowback of gaslighting is broad. It affects us. An insidious scepticism: of journalism, of politics, of brands. This is our problem now.

To regain the trust of our audiences, we must empower them.

Why now? Maybe some of this does fall on our recent history. We punish politicians for flip-flopping and yet now Rudy Giuliani and Donald Trump simply deny reality, completely contradicting their previous positions. The flip-flopping doesn’t matter. If you were a Trump supporter before, you continued to support him. No amount of information would cause you to change your mind.

Inconsistency erodes our ability to evaluate and trust. In some media circles, coached scepticism, false equivalency, and rampant air quotes all work to erode consensus. It offers us a cosy echo chamber. It’s comforting. It’s the journalism of affirmation. But our ability to evaluate information for ourselves suffers. Again, that’s gaslighting.

You can find media that bolsters your existing opinions. It’s a strange space that focuses more on hiding information, while claiming to be unbiased. It works to separate the listener, viewer, and reader from their own lived experiences. If you work in public services, this affects you.

Do we get comfortable in our faith, or confidently test our beliefs through education?

Marketing relies on us re-evaluating our choices. Now we’ve turned away from the old arbiters of experts. We’ve moved from expertise to homophily—only listening to people like us. But people have recently become aware of their own filter bubbles. So people turn inward to narcissism. If you can’t trust anyone, you can only turn inward. But that’s when we see the effects of a poor information diet. We don’t know what objective journalism looks like any more. Our analytic skills are suffering as a result. Our ability to trust external sources of expertise suffers.

Inconsistency undermines trust—externally and internally. People turn inward and wonder if they can even trust their own perceptions any more. You might raise an eyebrow when a politician plays fast and loose with the truth, or a brand does something shady.

We look for consistency with our own perceptions. Does this fit with what I know? Does this make me feel good? Does this brand make me feel good about myself? It’s tied to identity. There’s a cycle of deliberation and validation. We’re validating against our own worldview. Referencing Jeffrey’s talk, Margot says that giving people time to slow down helps them evaluate and validate. But there’s a self-perpetuating cycle of belief and validation. Jamelle Bouie from Slate says:

We adopt facts based on our identities.

How we form our beliefs affects our reality more than what we already believe. Cultural predisposition is what gives us our confirmation bias.

Say you’re skeptical of big pharma. You put the needs of your family above the advice of medical experts. You deny the efficacy of vaccination. The way to reach these people is not to meet them with anger and judgement. Instead, by working in the areas they already feel comfortable in (alternative medicine, say) we can reach them much more effectively. We need to meet a reluctant audience on their own terms. That empowers them. Empowerment reflects and rebuilds trust. If people are looking inward for information, we can meet them there.

Voice

The language a brand uses to express itself. You don’t want to alienate your audience. You need to bring your audience along with you. When a brand changes over time, it runs the risk of alienating its audience. But by using a consistent voice, and speaking with transparency, it empowers the audience.

A good example of this is Mailchimp. When Mailchimp first moved into the e-commerce space, they approached it from a point of humility. They wrote on the blog in a very personal vulnerable way, using plain language. The language didn’t ask more acclimation from their audience.

ClinicalTrials.gov does not have a cute monkey. Their legal disclaimer used to have reams of text. They took a step back to figure what they needed to provide in order to make the audience comfortable. They empowered their audience by writing clearly, avoiding the passive voice.

Volume

What is enough detail to allow a user to feel good about their choices? We used to think it was all about reducing information. For a lot of brands, that’s true. But America’s Test Kitchen is known for producing a lot of content. They’re known for it because their content focuses on empowering people. You’re getting enough content to do well. They try to engage people regardless of level of expertise. That’s the ultimate level of empathy—meeting people wherever they are. Success breeds confidence. That’s the ethos that underpins all their strategy.

Crutchfield Electronics also considers what the right amount of content is to allow people to succeed. By making sure that people feel good and confident about the content they’re receiving, Crutchfield Electronics are also making sure that people feel good and confident in their choices.

Gov.uk had to contend with where people were seeking information. The old version used to have information spread across multiple websites. People then looked elsewhere. Government Digital Service realised they were saying too much. They reduced the amount of content. Let government do what only government can do.

So how do you know when you have “enough” content? Whether you’re America’s Test Kitchen or Gov.uk, you have enough content when people feel empowered to move forward. Sometimes people need more content to think more. Sometimes people need less.

Vulnerability

How do we open up and support people in empowering themselves? Vulnerability can also mean letting people know how we’re doing, and how we’re going to change over time. That’s how we build a conversation with our audience.

Sometimes vulnerability can mean prototyping in public. Buzzfeed rolled out a newsletter by exposing their A/B testing in public. This wasn’t user-testing on the sidelines; it was front and centre. It was good material for their own blog.

When we ask people “what do you think?” we allow people to become evangelists of our products by making them an active part of the process. Mailchimp did this when they dogfooded their new e-commerce product. They used their own product and talked openly about it. There was a conversation between the company and the audience.

Cook’s Illustrated will frequently revisit their old recommendations and acknowledge that things have changed. It’s admitting to a kind of fallibility, but that’s not a form of weakness; it’s a form of strength.

If you use some of the recommendations on their site, Volkswagen ask “what are you looking for in a car?” rather than “what are you looking for in Volkswagen?” They’re building the confidence of their audience. That builds trust.

Buzzfeed also hosts opposing viewpoints. They have asides on articles called “Outside Your Bubble”. They bring in other voices so their audiences can have a more informed opinion.

A consistent and accessible voice, appropriate volume for the context, and humanising vulnerability together empower users.

Margot says all that in the face of the question: do we live in a post-fact era? To which she says: when was the fact era?

Cynicism is a form of cowardice. It’s not a fruitful position. It doesn’t move us forward as designers, and it certainly doesn’t move us forward as a society. Cynics look at the world and say “it’s worse.” Designers look at the world and say “it could be better.”

Design won’t save the world—but it may make it more worth saving. Are we uniquely positioned to fix this problem? No. But that doesn’t free us from working hard to do our part.

Margot thinks we can design our way out of cynicism. And we need to. For ourselves, for our clients, and for our very society.

Google Duplicitous

I can’t recall the last time I was so creeped out by a technology as I am by Google Duplex—the AI that can make reservations over the phone by pretending to be a human.

I’m not sure what’s disturbing me more: the technology itself, or the excited reaction of tech bros who can’t wait to try it.

Thing is …when these people talk about being excited to try it, I’m pretty sure they are only thinking of trying it as a caller, not a callee. They aren’t imagining that they could possibly be one of the people on the other end of one of those calls.

The visionaries of technology (Douglas Engelbart, J.C.R. Licklider) have always recognised the potential for computers to augment humanity, to be bicycles for the mind. I think they would be horrified to see the increasing trend of using humans to augment computers.

Other days, other voices

I think that Mandy’s talk at this year’s dConstruct might be one of the best talks I’ve ever heard at any conference ever. If you haven’t listened to it yet, you really should.

There are no videos from this year’s dConstruct—you kind of had to be there—but Mandy’s talk works astoundingly well as a purely audio experience. In fact, it’s remarkable how powerful many of this year’s talks are as audio pieces. From Warren’s thoughtful opening words to Cory’s fiery closing salvo, these are talks packed so full of ideas that revisiting them really pays off.

That holds true for previous years as well—James Burke’s talk from two years ago really is a must-listen—but there’s something about this year’s presentations that really comes through in the audio recordings.

Then again, I’m something of a sucker for the spoken word. There’s something about having to use the input from one sensory channel—my ears—to create moving images in my mind, that often results in a more powerful experience than audio and video together.

We often talk about the internet as a revolutionary new medium, and it is. But it is revolutionary in the way that it collapses geographic and temporal distance; we can have instant access to almost any information from almost anywhere in the world. That’s great, but it doesn’t introduce anything fundamentally new to our perception of the world. Instead, the internet accelerates what was already possible.

Even that acceleration is itself part of a longer technological evolution that began with the telegraph, something that Brian drove home in his talk when he referred to Tom Standage’s excellent book, The Victorian Internet. It’s probably true to say that the telegraph was a more revolutionary technology than the internet.

To find the last technology that may have fundamentally altered how we perceive the world and our place in it, I propose the humble gramophone.

On the face of it, the ability to play back recorded audio doesn’t sound like a particularly startling or world-changing shift in perspective. But as Sarah pointed out in her talk at last year’s dConstruct, the gramophone allowed people to hear, for the first time, the voices of people who aren’t here …including the voices of the dead.

Today we listen to the voices of the dead all the time. We listen to songs being sung by singers long gone. But can you imagine what it must have been like the first time that human beings heard the voices of people who were no longer alive?

There’s something about the power of the human voice—divorced from the moving image—that still gets to me. It’s like slow glass for the soul.

In the final year of her life, Chloe started publishing audio versions of some of her blog posts. I find myself returning to them again and again. I can look at pictures of Chloe, I can re-read her writing, I can even watch video …but there’s something so powerful about just hearing her voice.

I miss her so much.

Voice of the bot-hive

Creating telephone answering systems can be fun as I discovered at History Hack Day when I put together the Huffduffer hotline using the Tropo API. There’s something thrilling about using the human voice as an interface on your loosely joined small pieces. Navigating by literally talking to a machine feels simultaneously retro and sci-fi.

I think there’s a lot of potential for some fun services in this area. What a shame then that the technology has mostly been used for dreary customer service narratives:

Horrific glimpse of a broken future. I sniffed while a voice activated phone menu was being read out and it started from the beginning again.

There’s been a lot of talk lately about injecting personality into web design, often through the tone of voice in the copy. When personality is conveyed in the spoken as well as the written word, the effect is even more striking.

Have a listen for yourself by calling:

That’s the number for Customer Service Romance:

What happens when Customer Service bots start getting too smart? What if they start needing help too? How would they use the tools at their disposal to reach out to those they care about? What if they start caring about us a little too much?

It’s using the Voxeo service, which looks similar to Tropo.

The end result is amusing …but also slightly disconcerting. You may find yourself chuckling, but your laughter will be tinged with nervousness.

Customer Service Romance on Huffduffer

On the face of it, it’s an amusing little art project. But it might also be a glimpse of an impending bot-driven algorithmpocalypse.

The Huffduffer Hotline

After seeing (and hearing) what Brian was doing at History Hack Day, I decided I’d have to have a play with Tropo. Like Twilio, it’s a service that allows you to build voice-activated apps that you call up and talk to.

The API is pretty straightforward and it seems like there’s quite a lot that you can do as a developer before upgrading to a paid account. They’ll also host your code for you, and you have a choice of scripting languages.

At the most basic level, you can send text-to-voice messages:

say("Hello world")

But you can also give it audio files to play:

say("http://example.com/helloworld.mp3")

Huffduffer has the locations of thousands of audio files, so I thought a voice interface onto Huffduffer’s collection would be fun.

Call +1 202 600 8751 in the US, +44 2035 142722 in the UK, or use Skype. When the nice digital man on the other end picks up the phone and asks you what you want to hear, you can respond with “what’s new”, “what’s popular”, or say a tag like music, science, history, politics, technology, etc.

The script then fetches the latest files with that tag and will go through them with you one by one, asking “Would you like to hear… ?” followed by the title. If you don’t like the sound of it, just say no. When you find something you do want to hear, say yes. It will then start playing and you will be listening to a podcast down a telephone line.
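The yes/no loop described above boils down to something like this sketch in Python. It is not the code from the Github repository: choose_episode and its ask parameter are hypothetical names, and both the telephony side and the fetching of files from Huffduffer are left out so that the logic stands on its own.

```python
from typing import Callable, Iterable, Optional, Tuple


def choose_episode(
    episodes: Iterable[Tuple[str, str]],
    ask: Callable[[str], str],
) -> Optional[str]:
    """Offer each (title, audio_url) pair in turn until the caller says yes.

    `ask` is a hypothetical stand-in for the voice service: it speaks a
    prompt and returns what the caller replied. Returns the chosen audio
    URL, or None if the caller turned everything down.
    """
    for title, audio_url in episodes:
        reply = ask(f"Would you like to hear... {title}?")
        if reply.strip().lower() == "yes":
            return audio_url
    return None
```

Whatever URL comes back is what would then get handed to the voice service to play down the line.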

Audioboo / searching huffduffer.com audio by phone on Huffduffer

I call it the Huffduffer Hotline. The code is on Github. If you fancy playing around with the Tropo API and want to use Huffduffer’s links to audio files, go ahead. You should find everything you need through the Huffduffer API.

If people find the Huffduffer Hotline useful or just plain fun, I’ll upgrade from the developer account to get better performance. Let me know your thoughts on Get Satisfaction.

Nashville

I’ve finished my little bout of timezone parkour to Nashville and San Francisco. I attended a conference in each place and enjoyed both in very different ways.

Voices That Matter had an eclectic line-up of speakers. Whereas other conferences are organized around a theme or a set of technologies, the only commonality at this conference, organized by New Riders, is that the speakers have all published books through New Riders. While this means that the conference doesn’t have a specific focus, it does offer a nice varied range of subjects. Talks ranged from the specifics of using CSS for colour, typography and layout right through to discussions of user-testing and social networking.

I enjoyed getting the nitty-gritty details of CSS fonts from Jason Cranford Teague. He and Richard are clearly kindred spirits. The revelation of the conference for me was hearing a great hands-on presentation from Zoe Mickley Gillenwater on liquid and elastic layouts. Okay, so I might be a bit biased but I think it’s great that this subject is getting coverage and Zoe is just the person to do it. She’s currently writing a book for New Riders on this neglected area of web design. It should be out by December. Pre-order it now.

For my part, I gave a half-day workshop on Bulletproof Ajax, which seemed to go well, and I reprised a talk I had given once before called Microformats: what are they and why do I care?

I missed a few talks because I was whisked away to be interviewed for a future video podcast. Under the very professional-looking lights and cameras, I participated in a one-on-one chat and also a thoroughly enjoyable discussion with Christopher Schmitt and Steve Krug. I missed more talks because I wanted to get outside the hotel and explore Nashville a bit. The highlight of that exploration was getting a guided tour (thanks to Ari) around the historic Hatch Show Print where they have been making letterpress posters for musicians for over a century; a great place to soak up some design inspiration.

My ulterior motive for escaping from the conference hotel was to seek out a mandolin for myself. I went to the Gibson outlet store at the Opry Mills shopping mall on the outskirts of town but even the cheapest mandolin there was still beyond my price range. They sure were a pleasure to play, though. Fortunately for me, I stumbled across a flea market in the same mall where I happened upon a cheap second-hand Epiphone. It’s not brilliant but it’s suitable for my purposes; a decent little instrument that I can take travelling with me. I’ve got a suitable travel bag to go with it. It has the shape of a tennis racket case but all the pockets of a laptop bag. I may even try to pass myself off as some kind of freakish sporty geek hybrid.

All in all, I think I managed to get a good look around Nashville and get plenty out of the conference too. I was only there for a few days before it was time for me to head on to San Francisco for Supernova 2008. That was a different kettle of thought-leading fish.

Voices that natter

The Voices That Matter conference just wrapped up here in San Francisco. My talk was the last one of the day apart from a lightning round of two-minute takeaway points from a phalanx of speakers, moderated by myself.

My presentation was entitled Microformats: what are they and why do I care? You can download a PDF of the slides. The presentation is licensed under a Creative Commons attribution license so do with it as you please.

The talk went okay—I have the horrible feeling that there were quite a few “um”s and “ah”s peppered throughout. I made sure to leave plenty of time for questions and, as usual, the questions turned out to be the best part. Tantek took notes of the Q&A and I’ve published them on the wiki page for the event (if you were at the presentation be sure to add yourself to the list of attendees).

When he wasn’t taking notes, Tantek was diligently folding cheat sheets for the attendees. They were popular. If you weren’t lucky enough to get a pre-folded one, you can always print out and fold your own pocket cheat sheet courtesy of Erin.

And now, with my speaking duties fulfilled, I’ve got a day to spend in San Francisco before I head home. I intend to make the most of it. If you’d like to join me in soaking up the last of the California sunshine, come along to the picnic tables in South Park at noon tomorrow (Friday) for a geek picnic. Be there or be even more square.