Comparison of AI Voice Generators

November 15 2024: Updated with Play.ai

Introduction

Having voice-over in e-learning or multimedia training content is always a value add, and I think most of us know that intuitively. Adult learning theory tells us that adults want choice in how they consume training information, and we know that some people learn better aurally than visually. Additionally, including voice-over supports accessibility and is arguably required from that point of view alone. It also supports diversity among learners since, hypothetically, voice-over can be offered in multiple languages.

The science backs this up. Dual Coding Theory suggests that cognition involves both verbal and non-verbal systems working together to enhance memory and comprehension. Meanwhile, Cognitive Load Theory emphasizes minimizing the extraneous load caused by too much visual information, and voice-over provides a way to balance that out.

So why don’t we always do it?

We know we should include voice-over in our content, but I know I’m guilty of keeping it on the back burner or skipping it altogether; I suspect the same applies to many of you. Why?

Illustration of a seesaw with a person labeled "LoE" on one side and a person labeled "RoI" on the other. — While trying to generate this I learned that AI has no idea what a seesaw is…

The problem is that producing voice-over requires a lot of time and potentially significant expense; the scales of Level of Effort (LoE) vs. Return on Investment (RoI) have always tipped heavily toward effort. Recording audio takes time, requires decent equipment, and usually involves extensive editing. Producing voice-over for an hour-long e-learning session can take days, and that assumes you can do it all yourself. If you need to outsource it, by hiring voice actors or having someone else do the editing, that adds even more time and cost. So it gets skipped.

Enter AI

Today’s generative AI changes all that. In the last few years, text-to-speech technology has evolved at an incredible rate. We’ve moved from the robotic voice of Stephen Hawking to voices that sound scarily realistic, complete with emotional nuances, a range of accents, and diverse gender representations. They’re easy to create, and they’re cheap. Now, voice-over for that hour-long e-learning session can be produced in half that time. It also makes maintenance easy; whereas before, updating just a few words of audio could be a huge burden, it’s now as simple as editing the text and regenerating the MP3.

Additionally, AI voice generation opens the door to more personalization. Not only can you easily and affordably include a voice-over in the first place, you can include several with very little increase in time or cost. Now your learners can choose a preferred tone, gender, or accent to even further improve the attention they give your content!

Opposing Considerations

There are arguments to be made against doing this whole AI voice thing at all, and they do deserve a mention. For one, while these AI voices have gotten really, really good, they are still not going to replace an actual person. Voice actors are going to be able to add character and inflection that no AI can touch. (Plus, as someone who likes doing voice work, please don’t replace us!)

There are also important ethical issues hanging over all this: impersonation, fraud, identity theft, and misinformation. AI voice generators can be, and have been, used to harm people and society. Now obviously, that’s not our goal here, but I do have to admit there’s a part of me that wishes Pandora’s box had not been opened. But for now I’ll set that aside and focus on using them for good. Most of these providers explicitly require you to disclose that you are using an AI voice, which I’d agree is a good practice for learners anyway. We’re not trying to trick them, just give them a good experience with less effort on our part.

Comparing Options

Two robotic figures with microphone heads face off in a boxing ring, surrounded by blue and orange energy. — This image was not worth the prompt battle I had to do to generate it…

I’ve used many AI tools for voice and other purposes since the generative AI explosion of the last few years. Below is a comparison of the best voice AI tools available today (IMHO). For each one I’ll give you my thoughts as well as 2 samples I created with that tool. For the sake of consistency I’m using the same short blurb for each sample and didn’t spend more than a minute tweaking settings. (I’ll try to keep this list updated as new products hit the market. So please check back often!)

Note that I’m showing you a comparison of first-party products (where the actual AI model is developed by the provider) rather than third-party products that leverage those first-party engines. If you go out and search for “AI voice generator”, you’ll get a million hits, but most of them are wrappers around just a few core first-party models. For example, Articulate recently launched new AI features including new voice generation, but it’s not their own new tech; they are leveraging ElevenLabs. ElevenLabs voice generation is great, therefore Storyline’s new voice generation is great (look for more info on that specifically in my upcoming cost comparison article). You’ll have to take into consideration whether or not these wrapper-type products are worth it, because they often cost more than the core product they’re based on and sometimes have much less flexibility. On the other hand, if they’re built into your tool, like Storyline, and someone else is paying for your Storyline subscription, then by all means, use it.

Other notes:

I’m focused on the core “turn my text into a voice-over” feature here. Some of these products have other features that I’ll point out but am mostly ignoring.
I am ignoring enterprise pricing – if you are in that situation I imagine you aren’t the one paying for it (plus companies like to hide that from us worker bees).

Quick Compare Grid

Provider	Quality	Options & Diversity	Cost*
ElevenLabs	Very High	Very High	High
OpenAI	Very High	Low	Very Low
WellSaid Labs	Very High	High	Medium
Speechify	Medium	High	Medium
Murf.AI	High	High	Medium
Piper	Medium	Medium	None
Play.ai	High	Medium	High

*Detailed cost breakdown coming in a follow-up article soon

ElevenLabs

You’ve almost certainly heard ElevenLabs voices without even realizing it. Their free tier, easy UI, and outstanding quality make them very popular, and (IMO) your best bet as a place to start playing with this technology if you’re new to it.

They have a very generous free tier (although you have to keep in mind it does not give you a license to use the output commercially!) and their lowest paid tiers are reasonable for small projects.
Their interface is extremely easy to use, and they have a lot of great features beyond just voice generation, such as voice cloning and using your own voice to instruct the model on inflection and emotion.
The quality of ElevenLabs voices is among the highest, and they have an enormous selection of over 1,000 voices in different accents, tones, and styles to choose from. They also support 29 different languages.
On the flip side, as you start to scale up, ElevenLabs gets very expensive and is probably not ideal if you’re paying out of your own pocket.

ElevenLabs – Jessica (American, Female)

ElevenLabs – Charlie (Australian, Male)

ElevenLabs.io

OpenAI

OpenAI is the company behind ChatGPT. They also have an AI text-to-speech model that’s targeted at developers. It doesn’t have any user interface that’s intended for our kind of usage; instead it is completely accessed via an API.

On the downside, API only means it’s less accessible than any of the others.
On the upside, it is by far the most affordable because the API pricing is based only on usage. There is no subscription or other cost involved.
They currently only have 6 voices available and no tools to control the emotion or inflection of those voices.
However, the quality of the voices is very high.

I plan to write a follow-up article going over how to leverage this tool in spite of its downsides because it’s actually my favorite and the one I reach for first!
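To give a rough idea of what “API only” means in practice, here is a minimal sketch of a single text-to-speech request using only Python’s standard library. The endpoint, the `tts-1` model name, and the field names reflect my understanding of OpenAI’s public TTS documentation at the time of writing, so treat them as assumptions and check the current docs. The helper names (`build_speech_request`, `save_voice_over`) are my own and purely illustrative.

```python
import json
import os
import urllib.request

# Assumed endpoint, taken from OpenAI's public TTS documentation;
# verify against the current docs before relying on it.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, voice="shimmer", model="tts-1", api_key=None):
    """Build (but do not send) the HTTP request for one voice-over clip."""
    # Fall back to the OPENAI_API_KEY environment variable if no key is passed.
    api_key = api_key or os.environ.get("OPENAI_API_KEY", "")
    payload = {"model": model, "voice": voice, "input": text}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_voice_over(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as audio_file:
        audio_file.write(audio)
    return out_path
```

With an API key set, something like `save_voice_over("Welcome to the course.", "intro.mp3", voice="echo")` should write an audio file. Updating the voice-over later is then just a matter of changing the text and running it again.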

OpenAI – Shimmer (American, Female)

OpenAI – Echo (American, Male)

OpenAI TTS Documentation

WellSaid Labs

I haven’t decided how I feel about WellSaid yet. At first, I was annoyed with them because their pricing is very obfuscated. Unlike any of the others, they don’t bill based on time, but per “download”. I had to dig through a lot of FAQs in order to make sense of their pricing and realistically compare it to other solutions.

That said, it turns out their pricing is very good if you need a high-enough volume.
Their quality is excellent.
They have a lot of voices to choose from and some really nice features for controlling the output of the voices; for example, specific pronunciation and word replacement, which would be great for things like jargon and acronyms that may pop up in scripts.
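For tools that lack a built-in word replacement feature, a similar effect can be roughly approximated by pre-processing the script before sending it to the voice generator. The sketch below is not WellSaid’s implementation or API; the function name and the example replacements are hypothetical, and it simply swaps whole words for a spelled-out spoken form.

```python
import re


def apply_pronunciations(script, replacements):
    """Replace whole-word jargon or acronyms with a spoken-friendly form."""
    for written, spoken in replacements.items():
        # \b limits matches to whole words, so "SME" does not match "SMEs".
        pattern = r"\b" + re.escape(written) + r"\b"
        # A function replacement avoids special handling of backslashes.
        script = re.sub(pattern, lambda match, s=spoken: s, script)
    return script


# Hypothetical example: spell out acronyms so they are read letter by letter.
spoken_script = apply_pronunciations(
    "Log in to the LMS before the SME review.",
    {"LMS": "L M S", "SME": "S M E"},
)
```

Replacements are applied one after another, so a spoken form that happens to contain another entry’s written form would be replaced again; for a short jargon list that is usually easy to avoid.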

WellSaid Labs – Jordon (American, Female)

WellSaid Labs – Tobin (American, Male)

WellSaidLabs.com

Speechify

Speechify offers two different products. They have a Reader product, which lets you listen to news and so forth with AI voices, including official partnerships with celebrities like Snoop Dogg, so that’s fun. But the product we’re interested in is their Studio.

Speechify is definitely the lowest-quality paid voice on the list, lacking much of the tone variance of the others; but it’s still very good compared to old-school non-AI voices. You probably won’t mistake them for a real person, or if you do, you’ll think they’re just not a very good voice actor.
Their pricing is very good, especially if you go with an annual plan, so I think that makes them still worth consideration.
Their voice selection is excellent, with hundreds of voices, accents, and plenty of languages to choose from.
Their UI also has some interesting additional features that allow you to create entire multimedia output with video, images, etc using just their tool, if you’re interested in that sort of thing.

Speechify – Cliff (American, Male)

Speechify – Erica (American, Female)

Speechify Studio

Murf.AI

I initially ignored Murf.AI (for no good reason) and it turns out that was a mistake. They are very much a contender and one of my top 3 personal tools now.

They have very good quality with a very nice voice selection, including many languages, accents, and styles.
They also have some of the best tools for controlling the nuances of the voice, including pronunciation tools and the ability to adjust emphasis.
Like Speechify, their UI also includes tools for general multimedia editing.
However, their pricing is relatively high if you do a subscription, and the biggest mark against them is that their licensing requires the higher-priced plan in order to use it in most of our L&D use cases.

When I first started writing all this up I missed the fact that Murf AI does have independent API billing. It’s $1 for 10,000 characters (~10min). This is more expensive than OpenAI, but Murf has a ton more available voices, and the API appears to have controls for some of the nuance features as well. This makes them a great best-of-both-worlds choice for me. I’ll include them alongside OpenAI in an upcoming how-to article.
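To make that rate concrete, here is a small back-of-the-envelope calculation. It uses only the figures above ($1 per 10,000 characters, and roughly 10 minutes of audio per 10,000 characters, which works out to about 1,000 characters per minute). Both numbers are approximations, and the function names are just for illustration.

```python
def estimate_api_cost(characters, dollars_per_10k=1.00):
    """Estimated cost in dollars at a flat per-10,000-character rate."""
    return characters / 10_000 * dollars_per_10k


def estimate_minutes(characters, chars_per_minute=1_000):
    """Rough audio length, assuming about 1,000 characters per minute."""
    return characters / chars_per_minute


# An hour of narration at ~1,000 characters per minute is ~60,000 characters.
hour_long_script = 60_000
cost = estimate_api_cost(hour_long_script)      # 6.0 dollars
minutes = estimate_minutes(hour_long_script)    # 60.0 minutes
```

So under these assumptions, voice-over for an hour-long session would come to roughly $6 in API usage, not counting any regenerations while you tweak the script.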

Murf.AI – Terrell (American, Male)

Murf.AI – Alicia (American, Female)

Murf.AI

Piper

Piper is very much an outlier in this list. It is by far the lowest quality with the least options. However, I include it for two very important reasons:

It costs nothing; it is free and open source.
It can be run entirely locally.

All of the other products on this list are cloud-based and require your information to go out to these companies’ servers; if you are in a corporate situation working with proprietary data or an NDA client, that may make this whole thing a non-starter. Piper would allow you to get the advantages of AI voice generation without those risks. It’ll take some work; if you’re in a corporate environment you’ll probably have to get IT involved. But if you’re a freelancer, you can hypothetically install Piper right on your local computer and go. The selection of voices for Piper is actually pretty good and very diverse because the community that’s creating these open-source voices is global.
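As a rough illustration of what “run entirely locally” can look like, here is a minimal sketch that wraps the Piper command line from Python. It assumes a `piper` executable is on your PATH and that you have already downloaded a voice model file; the `--model` and `--output_file` flags follow the usage shown in the Piper README as I understand it, so verify them against your installed version. The function names and the model path are placeholders of my own.

```python
import subprocess


def build_piper_command(model_path, output_path, piper_binary="piper"):
    """Build the command line for a local Piper run (flags assumed from the README)."""
    return [piper_binary, "--model", model_path, "--output_file", output_path]


def synthesize_locally(text, model_path, output_path, piper_binary="piper"):
    """Pipe the script text to Piper on stdin and write a WAV file locally."""
    command = build_piper_command(model_path, output_path, piper_binary)
    # check=True raises an error if Piper exits with a failure code.
    subprocess.run(command, input=text.encode("utf-8"), check=True)
    return output_path
```

A call such as `synthesize_locally("Welcome to the course.", "voices/my-voice.onnx", "intro.wav")` (with a hypothetical model path) would produce the audio without the script text ever leaving your machine.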

Piper – Alba (Irish, Female)

Piper – Joe (American, Male)

Piper (GitHub)
Piper (Examples)

Play.ai (New!)

Coincidentally, I came across this one the day after originally publishing this article. Overall, I’m extremely impressed and will be keeping an eye on it for sure, but I’m not sure it’s ready for primetime.

Outstanding quality, potentially the best. I’ve not heard another model add as much inflection; it pauses naturally, has a subtle breathing sound, and even laughs or uses crutch words like “um”. (Also, I absolutely lost it at their “Fritz” voice, sample below)
HOWEVER… it also seems to randomly introduce artifacts. It’s very strange; listen to the 3rd sample below for an example. I used my standard PB&J prompt, but it replaced the last sentence with what I’m guessing is part of its training data, or maybe a hidden prompt? Weird, and it makes me not trust it for full use just yet.
It also lacks good controls over that impressive inflection. So if it inflects in a way that doesn’t match your needs, you’re out of luck.
Pricing is similar to ElevenLabs: reasonable on the low end but gets expensive as you scale up.
It has a unique feature I haven’t seen elsewhere: a dual-speaker “conversational mode”. This lets you enter a script, assign 2 voices, and it will read both parts. You could manually edit something similar together from other tools, but in this mode the voices actually play off each other in a “radio/podcast banter” style. Listen to sample 3 below for an example (which also has the weird artifact).

Play.ai – Deedee

Play.ai – Fritz

Play.ai – Casper & Inara (Conversation Mode)

Play.ai Playground

Closing

I’ll wrap this up here for today. Expect a follow-up soon with a more comprehensive breakdown of the costs for each tool!

In summary:

Adding voice to e-learning is something we absolutely should do.
AI is here to make that realistic given the time crunches and cost limitations we work in today.
You’ve got a lot of options.

If you’ve made it this far I hope you’ve found this valuable! Drop me a message here or on LinkedIn to let me know what you think, or if you’ve used any other AI generation solutions I should look at!