Toucan TTS: Ultimate 7000+ Language Text-to-Speech Tool

August 1, 2024

by kevin

Toucan TTS is a text-to-speech (TTS) toolkit developed by the Institute for Natural Language Processing (IMS) at the University of Stuttgart, Germany. It supports speech synthesis in over 7,000 languages, including a wide variety of dialects and language variants. Written in Python and built on the PyTorch framework, Toucan TTS offers multi-speaker voice synthesis, speech style imitation, and human-in-the-loop voice editing. This versatility makes it suitable for a wide range of applications, including education, reading, and multilingual software development. As an open-source project under the Apache 2.0 license, it lets users and developers freely use and modify the source code to meet their specific application needs.

Key Features

Multilingual Speech Synthesis

Toucan TTS can process and generate speech in more than 7,000 different languages, including various dialects and language variants, making it one of the most linguistically diverse TTS projects globally.

Multi-Speaker Support

The toolkit supports multi-speaker speech synthesis, allowing users to select or create speaker models with different voice characteristics for personalized speech output.

Human-in-the-Loop Editing

Toucan TTS offers human-in-the-loop editing capabilities, enabling users to fine-tune synthesized speech to suit different application scenarios, such as literary reading or educational materials.

Speech Style Cloning

Users can leverage Toucan TTS to clone the speech style of a specific speaker, including rhythm, stress, and intonation, making the synthesized speech more closely resemble the original speaker’s voice characteristics.

Speech Parameter Adjustment

Toucan TTS allows users to adjust speech parameters such as duration, pitch variation, and energy changes to control the fluency, emotional expression, and sound characteristics of the synthesized speech.

Pronunciation Clarity and Gender Characteristic Adjustment

Users can adjust the clarity and gender characteristics of the synthesized speech according to their needs, making it sound more natural and suitable for specific roles or scenarios.

Interactive Demos

Toucan TTS provides online interactive demos that let users test speech synthesis in real time through a web interface, helping them quickly understand and use the toolkit’s features.

Applications

Literary Reading

Synthesize speech for poetry, literary works, and web content for reading appreciation or as audiobooks.

Multilingual Application Development

Provide speech synthesis services for applications that require multilingual support, such as internationalized software and games.

Assistive Technology

Offer text-to-speech services for visually impaired or reading-challenged individuals, helping them better access information.

Customer Service

Utilize Toucan TTS in customer service systems to provide multilingual automated voice responses or interactive voice response systems.

News and Media

Automatically convert news articles into speech, providing busy listeners with a convenient way to consume news.

Film and Video Production

Generate voiceovers for movies, animations, or video content, especially when the original audio is unavailable or specific language versions are required.

Audiobook Creation

Convert e-books or documents into audiobooks for users who prefer listening to reading.

Installation

These instructions should work in most cases, but I have heard of some instances where espeak behaves unexpectedly, which a reinstall sometimes resolves and sometimes does not. Also, M1 and M2 MacBooks require a very different installation process, with which I am unfortunately not familiar.

Basic Requirements

To install this toolkit, clone it onto the machine you want to use it on. That machine should have at least one CUDA-enabled GPU if you intend to train models on it; for inference, you don’t need a GPU. Navigate to the directory you cloned the repository into. We recommend creating and activating a virtual environment to install the basic requirements into. The commands below summarize everything you need to do under Linux. If you are running Windows, the second line needs to be changed; please have a look at the venv documentation.

python -m venv <path_to_where_you_want_your_env_to_be>

source <path_to_where_you_want_your_env_to_be>/bin/activate

pip install --no-cache-dir -r requirements.txt

Run the second line every time you start using the tool again (for example, after logging out in the meantime) to reactivate the virtual environment. To make use of a GPU, you don’t need to do anything else on a Linux machine. On a Windows machine, have a look at the official PyTorch website for the install command that enables GPU support.
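To confirm whether the environment can actually see a GPU after installation, a quick check along the following lines may help. This is a minimal sketch, not part of the toolkit: the function name pick_device is made up for illustration, and the only library call it relies on is PyTorch's torch.cuda.is_available().

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, otherwise "cpu".

    Hypothetical helper for illustration; it is not part of Toucan TTS.
    """
    # find_spec avoids an ImportError if the requirements are not installed yet.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Device available in this environment: {pick_device()}")
```

If this reports "cpu" on a machine with a GPU, the installed PyTorch build is probably a CPU-only one, which is the situation the Windows note above refers to.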

Storage configuration

If you don’t want the pretrained and trained models as well as the cache files resulting from preprocessing your datasets to be stored in the default subfolders, you can set corresponding directories globally by editing Utility/storage_config.py to suit your needs (the path can be relative to the repository root directory or absolute).
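The "relative to the repository root or absolute" rule can be pictured with a few lines of Python. This is only a sketch of how such a setting could be interpreted: the function name resolve_storage_dir and the example paths are assumptions for illustration, and the actual variable names in Utility/storage_config.py are not shown here.

```python
from pathlib import Path


def resolve_storage_dir(repo_root: str, configured: str) -> Path:
    """Interpret a configured directory the way the text describes.

    Hypothetical helper: an absolute path is used as-is, while a relative
    path is taken to be relative to the repository root.
    """
    path = Path(configured)
    if path.is_absolute():
        return path
    return Path(repo_root) / path


if __name__ == "__main__":
    # Example values only; substitute your own locations.
    print(resolve_storage_dir("/home/user/IMS-Toucan", "Models"))
    print(resolve_storage_dir("/home/user/IMS-Toucan", "/data/toucan/models"))
```

An absolute path is the safer choice when the storage lives on a different disk than the repository, since it does not depend on where the repository was cloned.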

Pretrained Models

You don’t need to use pretrained models, but it can speed things up tremendously. Run the run_model_downloader.py script to automatically download them from the release page and put them into their appropriate locations with appropriate names.

[optional] eSpeak-NG

eSpeak-NG is an optional requirement that handles many special cases in many languages, so it’s good to have.

On most Linux environments it is already installed. If it is not, and you have sufficient rights, you can install it by simply running

apt-get install espeak-ng

For Windows, the eSpeak-NG project provides a convenient .msi installer file on its GitHub release page. After installation on non-Linux systems, you’ll also need to tell the phonemizer library where to find your espeak installation by setting the PHONEMIZER_ESPEAK_LIBRARY environment variable, which is discussed in this issue.
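Besides setting the variable system-wide, it can also be set from Python before the toolkit is imported. The sketch below assumes you already know where the eSpeak-NG library file was installed; the helper name configure_espeak is made up for illustration, and only the variable name PHONEMIZER_ESPEAK_LIBRARY comes from the instructions above.

```python
import os
from pathlib import Path


def configure_espeak(library_path: str) -> str:
    """Point the phonemizer library at an eSpeak-NG library file.

    Hypothetical helper: it checks that the file exists, then sets the
    PHONEMIZER_ESPEAK_LIBRARY environment variable for this process only.
    """
    lib = Path(library_path)
    if not lib.is_file():
        raise FileNotFoundError(f"eSpeak-NG library not found: {lib}")
    os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = str(lib)
    return os.environ["PHONEMIZER_ESPEAK_LIBRARY"]
```

Call it with the path from your own installation at the top of your script, before anything that imports the phonemizer library. The setting lasts only for the current Python process, so it does not replace a permanent system-wide variable.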

For Mac it’s unfortunately a lot more complicated. Thanks to Sang Hyun Park, here is a guide for installing it on Mac: For M1 Macs, the most convenient method to install espeak-ng onto your system is via a MacPorts port of espeak-ng. MacPorts itself can be installed from the MacPorts website, which also requires Apple’s Xcode. Once Xcode and MacPorts have been installed, you can install the port of espeak-ng via

sudo port install espeak-ng

As stated in the Windows install instructions, the phonemizer library needs to be told where the espeak-ng installation is through an environment variable. The environment variable is PHONEMIZER_ESPEAK_LIBRARY, as given in the GitHub thread linked above. However, on Mac the file you need to set this variable to is a .dylib file rather than a .dll file. To locate the espeak-ng library file, you can run port contents espeak-ng. The specific file you are looking for is named libespeak-ng.dylib.
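Scanning the output of port contents espeak-ng by eye works, but the lookup can also be scripted. The following is a sketch under assumptions: the function name find_espeak_dylib is hypothetical, and the sample listing in the usage section is illustrative text, not captured output from a real machine.

```python
from typing import Optional


def find_espeak_dylib(port_contents_output: str) -> Optional[str]:
    """Return the first listed path that ends in libespeak-ng.dylib, if any.

    Hypothetical helper that scans the text printed by
    `port contents espeak-ng`, one file path per line.
    """
    for line in port_contents_output.splitlines():
        candidate = line.strip()
        if candidate.endswith("/libespeak-ng.dylib"):
            return candidate
    return None


if __name__ == "__main__":
    # Illustrative listing only; your paths may differ.
    sample = (
        "Port espeak-ng contains:\n"
        "  /opt/local/bin/espeak-ng\n"
        "  /opt/local/lib/libespeak-ng.dylib\n"
    )
    print(find_espeak_dylib(sample))
```

The path it returns is the value to assign to PHONEMIZER_ESPEAK_LIBRARY. On a real system, you could feed it the captured output of the port command instead of the sample string.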

Usage and Experience

Non-developers can visit Hugging Face to experience Toucan TTS’s online text-to-speech and voice cloning demos:

https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS

Developers can clone the GitHub repository and run the code locally:

https://github.com/DigitalPhonetics/IMS-Toucan

With its extensive language support, advanced features, and wide range of applications, Toucan TTS is a powerful tool for anyone looking to generate high-quality, multilingual speech from text. As an open-source project, it encourages collaboration and customization, making it adaptable to various use cases and individual needs.

What is Toucan TTS and how does it function?

Toucan TTS is a text-to-speech toolkit developed by the Institute for Natural Language Processing (IMS) at the University of Stuttgart, supporting over 7,000 languages. It uses machine learning models built on PyTorch to convert written text into natural-sounding speech, making it suitable for diverse applications, from education to media production. For more information, visit the official Toucan TTS GitHub page at https://github.com/DigitalPhonetics/IMS-Toucan.

Can I modify the voice output in Toucan TTS?

Yes, Toucan TTS allows users to customize voice output extensively. You can adjust parameters such as pitch, speed, and emotional tone to create a more personalized audio experience. This flexibility makes it ideal for various applications, including audiobooks and educational tools. Learn more about voice customization on the Toucan TTS documentation page.

Is Toucan TTS free for commercial use?

Toucan TTS is an open-source project licensed under Apache 2.0, allowing both personal and commercial use without any licensing fees. This makes it a cost-effective solution for developers and businesses looking to implement text-to-speech technology. For licensing details, refer to the Apache License page.

How can I try Toucan TTS before using it?

You can test Toucan TTS through the interactive demos available on Hugging Face. These demos let you try the toolkit’s capabilities in real time, helping you assess its features and quality before integration. Check out the demo at https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS.

What are the main applications of Toucan TTS?

Toucan TTS can be utilized in various fields, including:
Assistive Technology: Aiding visually impaired users by converting text to speech.
Audiobook Production: Transforming written content into audio format.
Multilingual Development: Supporting applications that require diverse language outputs.
Media and Content Creation: Generating voiceovers for videos and presentations.
These applications highlight the versatility and effectiveness of Toucan TTS in meeting different user needs. For more insights, visit the Toucan TTS official page.
