Creating free audiobooks with local TTS models

2024-10-01 (updated 2024-10-26) Eiko Wagenknecht

As a fan of Alexander Wales’ rational fiction “Worth the Candle,” I’ve enjoyed the audiobook versions of the first 104 chapters (Book 1 - Through Adversity (affiliate link), Book 2 - Trust and Consequences (affiliate link), Book 3 - Building Strongholds (affiliate link)). However, the remaining 150 chapters are only available as text on Royal Road.

This post explores how to create free audiobooks for these missing chapters using local text-to-speech (TTS) models.

Why use local TTS models?
Exploring TTS models
Setting up XTTS2
Using XTTS2
Customizing the voice
Sample output

Why use local TTS models?

There are several ways of varying complexity and quality to create an audiobook from text. While web services like ElevenLabs or Neets offer high-quality TTS, they can be expensive for large projects. Local, open-source TTS models provide a cost-effective alternative, though they require more setup and technical knowledge.

The quality is not as high as with professional services, but it’s come a long way in the past few years. You can find a sample of the output at the end of this post.

Exploring TTS models

To figure out which models are good, I did what I always do: I asked the internet, checked out some Reddit threads, and talked the topic through with ChatGPT and Claude. Then I tried out the most promising-sounding models.

Since the models are evolving quickly, this is only a snapshot of the current landscape, and I may well have missed some good ones.

Here’s a list of the models I tried:

For additional comparisons, you can also check out the TTS-Arena benchmark.

After fairly extensive testing, I chose Coqui XTTS2 for its balance of quality, multi-language support, and ease of use. It offers many preset voices and can even mimic a specific voice given a short audio sample.

Setting up XTTS2

This is a guide on how to set up XTTS2 on Windows, but it should be similar on other operating systems.

To use XTTS2, you need to have some prerequisites installed:

Python 3.11 (as of writing, 3.11 is the latest supported version; 3.12 is not supported yet)
espeak-ng
ffmpeg (this needs to be on your PATH variable)

Then you can set up your Python environment:

Create a new virtual environment: python -m venv .venv (make sure to use 3.11 if you have multiple versions installed)
Activate it: .\.venv\Scripts\Activate.ps1
Install coqui-tts: pip install coqui-tts
If you have an NVIDIA GPU, add CUDA support for faster execution: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Install nltk: pip install nltk
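
Before moving on, it can help to confirm that everything from the steps above is actually visible from inside the virtual environment. The following is a minimal sketch, not part of my scripts: the function name check_prerequisites is made up for illustration, and it assumes that the coqui-tts package is importable as the TTS module. Note that espeak-ng can be installed without being on PATH, so a MISSING there is only a hint.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False."""
    results = {
        # The post states that 3.11 is the newest supported Python version.
        "python 3.11 or older": sys.version_info[:2] <= (3, 11),
        # shutil.which only looks at PATH, so this is a hint, not proof.
        "espeak-ng on PATH": shutil.which("espeak-ng") is not None,
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        # Assumption: coqui-tts installs a top-level module named "TTS".
        "coqui-tts": importlib.util.find_spec("TTS") is not None,
        "nltk": importlib.util.find_spec("nltk") is not None,
    }

    # CUDA is optional; only probe it if torch is installed at all.
    cuda_available = False
    if importlib.util.find_spec("torch") is not None:
        try:
            import torch

            cuda_available = bool(torch.cuda.is_available())
        except Exception:
            cuda_available = False
    results["cuda (optional)"] = cuda_available
    return results


for name, ok in check_prerequisites().items():
    print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

If "cuda (optional)" shows MISSING on a machine with an NVIDIA GPU, the CUDA build of torch from the step above is probably not the one installed in this environment.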

Using XTTS2

Download my ttsify.py script and save it in a new directory.

Put the text you want to convert to audio in a UTF-8 encoded text file called input.txt in the same directory as the script.

Then run the script with python ttsify.py input.txt. That will generate an MP3 file in the same directory.
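
To give an idea of what a script like this has to do, here is a rough sketch of the general approach: split the text into sentences, pack them into short chunks, synthesize each chunk, and join the results into one MP3 with ffmpeg. This is not the contents of ttsify.py. The function names, the chunk file names, the preset voice "Ana Florence", and the 250-character chunk size (a conservative guess based on the length warning XTTS prints for English text) are all assumptions made for illustration.

```python
import subprocess


def chunk_sentences(sentences, max_chars=250):
    """Greedily pack sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize(text_path, out_mp3, speaker="Ana Florence", language="en"):
    """Hypothetical end-to-end helper: text file in, MP3 out."""
    # Heavy imports stay local so chunk_sentences works without them.
    import nltk
    import torch
    from TTS.api import TTS

    # Newer nltk releases use "punkt_tab", older ones "punkt"; try both.
    nltk.download("punkt", quiet=True)
    nltk.download("punkt_tab", quiet=True)

    with open(text_path, encoding="utf-8") as f:
        chunks = chunk_sentences(nltk.sent_tokenize(f.read()))

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

    wav_paths = []
    for i, chunk in enumerate(chunks):
        wav_path = f"chunk_{i:05d}.wav"
        tts.tts_to_file(text=chunk, speaker=speaker, language=language,
                        file_path=wav_path)
        wav_paths.append(wav_path)

    # Join all chunks with ffmpeg's concat demuxer and encode as MP3.
    with open("chunks.txt", "w", encoding="utf-8") as f:
        for path in wav_paths:
            f.write(f"file '{path}'\n")
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "chunks.txt",
         "-codec:a", "libmp3lame", "-q:a", "2", out_mp3],
        check=True,
    )
```

With the environment from the previous section active, calling synthesize("input.txt", "output.mp3") would run the whole pipeline. Keeping the chunks short is a design choice: long inputs tend to produce truncated or unstable audio, and a failure in one chunk does not cost the whole chapter.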

Customizing the voice

To use your own voice:

Record a sample with any recording software.
Convert it to a 16-bit, 16 kHz, mono WAV file with ffmpeg: ffmpeg -i myvoicesample.mp3 -acodec pcm_s16le -ar 16000 -ac 1 myvoicesample.wav
Use this WAV file as the speaker_wav parameter in the script.
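
The steps above can also be scripted. The sketch below wraps the same ffmpeg call in Python and shows how a reference recording is passed to XTTS2 through the speaker_wav parameter of the Coqui API. The helper names build_convert_command, convert_sample, and speak_with_sample are hypothetical, and the -y flag (overwrite an existing output file) is an addition that is not in the command above.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, dst=None):
    """Build the ffmpeg call that turns a recording into 16 kHz mono WAV."""
    src = Path(src)
    dst = Path(dst) if dst else src.with_suffix(".wav")
    if dst == src:
        raise ValueError("input and output file must differ")
    return [
        "ffmpeg", "-y",          # -y: overwrite the output if it exists
        "-i", str(src),
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        str(dst),
    ]


def convert_sample(src, dst=None):
    """Run the conversion and return the path of the WAV file."""
    command = build_convert_command(src, dst)
    subprocess.run(command, check=True)
    return command[-1]


def speak_with_sample(text, sample_wav, out_wav, language="en"):
    """Synthesize text in the voice of the given reference recording."""
    # Imported here so the ffmpeg helpers work without coqui-tts installed.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(text=text, speaker_wav=sample_wav, language=language,
                    file_path=out_wav)
```

For example, speak_with_sample("Hello there.", convert_sample("myvoicesample.mp3"), "hello.wav") would convert the recording and then use it as the voice reference.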

To list available preset voices, download my list-speakers.py script into the same directory and run it with python list-speakers.py.

Sample output

I’ve generated a sample from the beginning of chapter 77 of “Worth the Candle” to demonstrate the quality. Listen to the sample here.

Have you found other models that work even better? I’d love to hear about them.
