Just how much progress have computer systems made in understanding the human voice? I don’t mean recognizing specific commands, like the hands-free Bluetooth in your car. I’m looking for a tool an author can use to write a book without typing. I want to know how well a computer can understand people when they just talk to it. And the answer is, “not very well.”
State of the Art
When I started my research, I was surprised to learn that there are few players in the field. The only commercial product is Nuance’s Dragon Dictate for Mac (or NaturallySpeaking for Windows), which costs just over $100. The Mac also has Dictation, a speech-to-text feature built into the operating system since Mountain Lion in 2012. Windows has an embedded program called Speech Recognition. There are also several free online voice-to-text services, which RJ Crayton dealt with on Indies Unlimited last September.
I had a pragmatic (read: selfish) reason for this research. I am working on a project recording and publishing stories told by seniors. I was hoping technology would save me hours of transcribing the hundred or so stories I will collect. People who know anything about this subject will start chuckling about now, because I was dreaming.
As a test piece, I recorded my brother talking to me about an event from our childhood to see how it would transcribe. Then I read the passage aloud myself, using a copy of Dragon Dictate that I had already initialized to understand my voice. Then I read the same passage into my Mac with Dictation running. Last, I ran the original recording, played on my Zoom H5 digital recorder, through both programs.
The original passage (punctuation added later):
Yeah, Gord, this story I’m going to tell was actually published in the Province newspaper. They used to have a little program for kids to write in from different parts of the province telling about what they were doing. This is around the time when I was 6 and you were 5 and it was in the spring of the year and we were living at home down by Highway 16, not in sawmill camp or tie hacking camp with our parents.
Trial 1: created by me, using Dragon Dictate after initializing the program to work for my voice, a process that involved reading aloud several short paragraphs given to me by the setup process.
Yeah, Gord, this story and going to tell was actually published in the Providence newspaper. They used to have a little program for kids to write in from different parts of the province telling about what they were doing. This is around the time when I was six and your five and it was in the spring of the year and we were living at home down by Highway 16 not in the sawmill account or tie hacking With her parents.
I type at a decent speed (most people talk at about 140 wpm; I type around 60), but I suspect that using this transcription and fixing the errors would take me about the same amount of time as keyboarding it.
Trial 2: My voice, using the Dictation Speech to Text software in the latest Mac OS, El Capitan.
This story I’m going to tell was actually published in the province newspaper. They used to have a little program for kids to write in from different parts of the province telling about what they were doing. This is a around the time when I was six and you were five and it was in the spring of the year and we were living at home Down by highway 16 not in sawmill Camp or Thai hacking camp with our parents.
Not too bad, although it couldn’t deal with “Yeah, Gord,” at all. More on “Thai hacking” later. Still not good enough to save me any time, compared to entering it via keyboard. This program came free on my computer, so it was definitely cost effective.
Trial 3: my brother’s voice with the Mac program, sent directly from the digital recorder to the computer.
This story to tell was Archie published in the province newspaper list of a little program for kids to right in front of the parts the province holder doing. This is around the time when I was six and you were five and I was the spring of year and we’re living at home Dumbo I was 16 not in the sawmill camper kayaking camp and with her parents.
Recognizable as the same text (barely), but for my purposes, completely useless.
Trial 4: from the same digital recording of my brother, using Dragon, not initialized for his speech.
So this store so that is published promised to stand for some of you will strengthen the “proximal other small talk, however is the smallest way of the surrounds are 65 that is here and the whole town is.
Even worse. One can only hope that the program would learn over time, but that becomes less cost effective the more you have to mess around with it. I don’t plan to run 30 or 40 people through the initialization process.
I didn’t test Speech Recognition for Windows, but it was designed by Nuance, and from the descriptions in other reviews, it sounds like a Lite version of NaturallySpeaking.
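For anyone who wants to put a number on trials like these rather than eyeballing them, one common yardstick is the word error rate: the fewest word substitutions, insertions, and deletions needed to turn the transcript back into the original, divided by the number of words in the original. The sketch below is my own illustration, not something either program reports. The function names are made up for the example, and the test phrases are the closing words of the passage as they came out in Trials 2 and 3.

```python
import re

def words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))

def word_error_rate(reference, transcript):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = words(reference), words(transcript)
    if not ref:
        return 0.0
    # At the start of each pass, prev[j] holds the edits needed to turn
    # the first i-1 reference words into the first j transcript words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Closing phrase of the test passage, as spoken and as transcribed.
original = "not in sawmill camp or tie hacking camp with our parents"
mac_my_voice = "not in sawmill Camp or Thai hacking camp with our parents"
mac_brother = "not in the sawmill camper kayaking camp and with her parents"

for label, text in [("Trial 2", mac_my_voice), ("Trial 3", mac_brother)]:
    print(f"{label}: {word_error_rate(original, text):.0%} of words wrong")
```

Two cautions. This simple version counts “six” against “6” as an error unless you normalize numbers first, and a low error rate still says nothing about how long the remaining mistakes take to find and fix, which is the real question for my project.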
Improvement Over Time
I wondered if I could train the computer to stop writing “Thai hacking” or “kayaking” in place of the unusual expression “tie hacking” by using the expression a few times in different contexts. This is how it went…
Hacking ties Thai hacking hack time Half time hacking ties hat ties hat ties hack ties hack ties hacking tires Thai hacking hat Pack hack hack hat Pack hat hi tie tack packing ties it’s Okay Josh
(That last bit was to my dog, who was worried because his master was shouting at the computer. Obviously the learning experience has to go both ways.)
Be warned: when you start trying to input sound from devices other than your computer’s built-in microphone, you run into technical problems. Several reviewers mentioned the advantage of having a good mic in getting accurate responses. The built-in Mac microphone seems to be pretty good. I wanted to input pre-recorded stories on my Zoom H5 digital recorder, and that required a different input adapter for a newer iMac. My old MacBook Pro had no trouble once I figured out how to set up the Sound in System Preferences. The direct line from the digital recorder helps the transfer of good-quality sound, but of course the quality of the original recording (type of recorder, type of microphone, ambient noise, etc., etc.) makes a difference. In my experience, you just have to keep trying different equipment until the product is good enough for your purposes. Or not, as the case may be.
The Learning Curve
As I worked with each program, I learned to speak more slowly and clearly, and was able to improve my (or the program’s) accuracy. Likewise, I assume the program will learn my voice and vocabulary better over time. Dragon has a “correct” function that allows you to go back and say “correct ‘Thai’,” which presumably also goes into the memory bank. I didn’t experiment with that.
Basically, Dragon’s program does more, but it is more complex to set up and learn. Mac’s is more intuitive and easier to use. Most reviewers say that both these programs improve as they get used to your voice and your usual vocabulary. I noted that under most conditions the computer could handle simple vocabulary like, “This is around the time when I was six and you were five.”
I suspect that over time both the Dragon program and the Mac program could be taught to do a decent job on the voice of an individual. I think it’s a matter of your individual preferences. My other brother, who is a great storyteller but doesn’t like writing, swears by the Dragon program, because it allows him to just talk. I learned to type in Grade 10 because my handwriting was illegible, so I’m pretty well conditioned to letting my fingers do the walking. I don’t see myself switching.
In general, I think the software designers have a long way to go before a person can jump into a driverless taxi and say, with a strong foreign accent, “I wanna go to the nearest good sushi joint.”
And I’m going to be manually transcribing a lot of stories in the next few months.
25 thoughts on “Voice-to-Text Dictation Software”
Oh, Gordon, this was uproariously funny! I hope you meant it to be because I almost, ahem, myself from laughing. The sad part is that I remember people saying [real] voice recognition software was just around the corner. This was back in the late 80’s. We ain’t there yet. 🙂
In Asimov’s Foundation trilogy they had perfect voice-to-text devices. Mind you, that was set more than 12,000 years in the future, so don’t hold your breath.
Good article, and I see a story in this. Skynet wipes us out because it had poor voice recognition software. 😉
Ha, love it! It’s billing time in the Skynet corporation.
Me: “Skynet, please bill all new ones.”
Skynet: “Thank you. You have request that I kill all humans!”
Skynet: “Thank you for confirmation ‘Gooooooo’. Initiating request.”
Run with it, folks 🙂
Really interesting, Gordon. I think I’d like to try voice to text tech again, but this hasn’t given me the most hope. 🙁 I’d really really love transcription with punctuation, which I never found with the free options. Going in and cleaning up a couple of paragraphs for punctuation isn’t bad. Cleaning up an entire novel-length work –even if you’re doing it 4 or 5 pages at a time– seems like such a headache, you might be better off typing it.
Anyway, thanks for sharing your amusing journey, Thai hacking and all.
This came just at the right moment as I was starting to think about a new manuscript and wondering if I should try dictating it rather than typing – my fingers are packing up and I make so many errors it takes hours correcting them. Now you’ve made me think again. Perhaps I’ll go back to my quill, which still works, and scan my handwriting. I make fewer errors writing by hand.
As for the amusement factor, clearly this is the method for writing humorous text. Your post made me positively hoot with laughter! 🙂
Or maybe I’ll try audio books. 🙂
Ha-ha – wonderful post, Gordon! And kudos for sacrificing hours of your time to bring us up-to-date on Thai hacking. Seriously, thanks for the very useful info on this whole subject. As AJ Flory said, this was supposed to be around-the-corner tech years ago. I’ve noticed doctors are using some kind of voice-to-text software now. I assume it’s okay since it’s good enough for their medical records. They do have to verbally punctuate as they go, though. I doubt that would work well for creative writing.
Doctors using these programs on our medical files? Well, doesn’t that give us great confidence!
The mind boggles! What interesting conditions might one suddenly get diagnosed by a slip of the software? Still, it could make going to see the doctor a more exciting adventure; one could run a book and take odds on what condition the dictation will select this time. 🙂
I have to intrude to correct a misconception. Candace, Ian and RJ seem to have the impression that I came out completely against using this sort of program. I only said that the programs don’t work for multiple voices, and that I didn’t see myself switching. I think that for someone who invests the time in learning to do voice punctuation and teaching the machine to understand, it could be very useful. The only way to find out is to try it for an extended period of time.
I’d really love to hear from someone who uses one of these programs and likes it
PS. I didn’t think it was that funny. (insert curmudgeon face)
No, I wasn’t thinking that, Gordon, just that technology clearly has some way to go to meet your exacting standards. I don’t fault you if that’s what you think; I think the same and have reached a point in my life when I expect technology to work flawlessly first time or I can’t be bothered with it. If I or the technology have to spend extended periods learning to get on together, I am likely to abandon it for something I already know how to use and which works right every time – my quill. It’s what I learned to write with and I can see the results instantly as I put it to the paper. Any correcting is easily done with a stroke of the nib and it only requires a fair copy once the draft is done. That’s part of my editing process anyway, so no added burden. All the faffing about with voice recognition you describe is unnecessary and my quill produces perfectly any foreign words I choose to use without scrambling them. In fact it writes fluently in five languages; probably more if I tried them.
No need for any curmudgeonly face. 🙂
The Pen is still mightier than Word, is it?
Thank you for sharing your experience with voice recognition tech. It sounds similar to the experience I had. I experimented with it at work at different times with Microsoft operating systems. I agree it still has a long way to go. For Microsoft you need to download their free voice-to-text upgrade app, otherwise it is not worth it. Also, with it you can type or speak what you are writing, which keeps you from going insane from trying to remember to speak punctuation. From what I could tell, I could not see any difference between the Microsoft and Dragon. At this time I no longer use it, but with my carpal tunnel getting worse I am ramped to use it for rough story notes that I want to follow up on. And yes you were funny LoL
As I understand it, this area is one of the few that we proper computer users — sorry, I meant to say MAC users — get the rough end of the stick.
The Dragon software for WINDOWS is the proper one, but the software for the MAC is the old technology — as in the same one that comes with a Microsoft PC.
So, never fear, those doctors are now more accurate with your notes than they were with their terrible handwriting.
It’s a strange world, is it not, where the solution to this problem is to use an inferior computer? (Sorry, PC users, you just don’t know any better. Except this time — this time, you win. Enjoy your moment…)
I was given to understand that the Dragon for Mac was the new software. Can’t remember where that came from. Another reviewer, I think.
Thanks for your really interesting article.
Actually, I use dragon dictation all the time and love it. It took a while for it to recognize my voice and British/S. African accent. At first the mistakes were hilarious, and even obscene at times, but it improved tremendously as time wore on and I am now comfortably using it very often. I successfully punctuate as I speak—quite easy to get used to doing that.
I’m an extremely speedy typist (I’m a pianist), which results in missing letters in my text, so I plan to practice using dictation until I’m happy to write another book using the program as much as possible.
I’ve found dictation on my iPhone to be more accurate than on my MacBook Air. I have the new iPad Pro but haven’t tried it out yet with my voice recognition.
I hope this could be helpful—anything good that works to cut writing time is a boon.
By the way, I might mention that I didn’t use it to write this comment!
Thanks for a more optimistic point of view. I was hoping hi-tech was doing better 🙂
Hi, Gordon —
I’ve been trying to use Nuance software for years. Years. I first used it unsuccessfully on a PC, then a Mac. I’ve lost money on their Bluetooth mics, replaced three, and they’ve sent me five physical headsets as replacements. The headsets worked, unlike the Bluetooth mics, but the problem was in the software.
I could never figure out how to make their product learn from their mistakes and mine. And I’m a determined soul.
The best I can say about their voice recognition software over these last 10 or 13 years is that they’ve accurately transcribed my phone number. Nuance has hounded me by phone and mail for more than a decade.
While VR holds such promise, the 2,000 words I was able to crack out speech-to-text using VR were way beyond my typical 250-400 typed words per day. But. But, the thousands of words I dictated proved to be a non-starter. After wondering what I said to get such weird copy, I grew bored ramping their nonsense up to my normal high level of typos.
I’d describe Nuance, from long experience, as a no-trick pony, one that makes promise after promise with only the promise that they’ll be calling back soon with another new and fabulous promise, and a $100 discount into the bargain.
Takeaway? Buyer Beware.
Thanks, Jeff. I had a better experience, but it just goes to show that the ability to understand us weird and wonderful humans is still a very individualistic, hit-or-miss situation, and not one you’d want to spend a lot of money on. I was very glad I had a two-week free trial.
I’m not surprised. As someone who is deaf and an independent consultant giving training to businesses about how to make their podcasts, videos, live events accessible via quality captions and transcripts, I would say that machines cannot replace professionally trained human captioners. There are a lot of factors in producing good quality transcripts – especially live transcripts – and machines cannot meet them.
That’s an interesting point most of us wouldn’t think of. Thanks for pointing it out. 🙂
My bottom line is always whether it takes more time to do it with technology or with human effort. That presupposes that the results will be equal. In the situation Sveta mentions, I’m thinking it will be quite a while before machines can do better than humans. Look at language translation programs if you want a laugh.