SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

F5-TTS: Diffusion Transformer with ConvNeXt V2, faster training and inference.

E2 TTS: Flat-UNet Transformer, the closest reproduction of the paper.

Sway Sampling: An inference-time flow step sampling strategy that greatly improves performance.
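
As a rough illustration of what a sway sampling schedule looks like, the sketch below warps uniformly spaced flow steps with the function f(u; s) = u + s(cos(πu/2) - 1 + u) described in the F5-TTS paper. The function name `sway_schedule` and the choice of s = -1 are assumptions made for this example; it is not a copy of the repository's implementation.

```python
import math


def sway_schedule(num_steps: int, s: float = -1.0) -> list[float]:
    """Return num_steps + 1 flow times in [0, 1], warped by the sway function."""
    times = []
    for i in range(num_steps + 1):
        u = i / num_steps  # uniform position in [0, 1]
        # f(u; s) = u + s * (cos(pi/2 * u) - 1 + u); s < 0 shifts steps toward t = 0
        times.append(u + s * (math.cos(math.pi / 2 * u) - 1 + u))
    return times


schedule = sway_schedule(16)
print([round(t, 3) for t in schedule[:4]])
```

With a negative coefficient the endpoints stay at 0 and 1, while the intermediate steps are packed toward the start of the flow.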

Thanks to all the contributors!

News

Installation

Create a separate environment if needed

# Create a conda env with python_version>=3.10  (you could also use virtualenv)
conda create -n f5-tts python=3.11
conda activate f5-tts

# Install FFmpeg if you haven't yet
conda install ffmpeg

Install PyTorch for your device

<details> <summary>NVIDIA GPU</summary>
# Install pytorch with your CUDA version, e.g.
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

# And also possible previous versions, e.g.
pip install torch==2.4.0+cu124 torchaudio==2.4.0+cu124 --extra-index-url https://download.pytorch.org/whl/cu124
# etc.
</details> <details> <summary>AMD GPU</summary>
# Install pytorch with your ROCm version (Linux only), e.g.
pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
</details> <details> <summary>Intel GPU</summary>
# Install pytorch with your XPU version, e.g.
# Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit must be installed
pip install torch torchaudio --index-url https://download.pytorch.org/whl/test/xpu

# Intel GPU support is also available through IPEX (Intel® Extension for PyTorch)
# IPEX does not require the Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit
# See: https://pytorch-extension.intel.com/installation?request=platform
</details> <details> <summary>Apple Silicon</summary>
# Install the stable pytorch, e.g.
pip install torch torchaudio
</details>
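
After installing, it can help to confirm which backend PyTorch actually sees. The following is a minimal sketch, not part of the project: the helper name `pick_device` and the order in which backends are checked are assumptions for illustration.

```python
def pick_device() -> str:
    """Return the name of the first available PyTorch backend, or "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment

    if torch.cuda.is_available():  # NVIDIA CUDA builds; ROCm builds also report here
        return "cuda"
    xpu = getattr(torch, "xpu", None)  # only present in recent PyTorch builds
    if xpu is not None and xpu.is_available():
        return "xpu"
    mps = getattr(torch.backends, "mps", None)  # Apple Silicon (Metal) backend
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

If this prints `cpu` on a machine with a GPU, the installed wheel most likely does not match the device.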

Then choose one of the options below:

1. As a pip package (if just for inference)

pip install f5-tts

2. Local editable install (if you also want to train or finetune)

git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
# git submodule update --init --recursive  # (optional, if using BigVGAN as the vocoder)
pip install -e .

Docker usage is also available

# Build from Dockerfile
docker build -t f5tts:v1 .

# Run from GitHub Container Registry
docker container run --rm -it --gpus=all --mount 'type=volume,source=f5-tts,target=/root/.cache/huggingface/hub/' -p 7860:7860 ghcr.io/swivid/f5-tts:main

# Quickstart if you want to just run the web interface (not CLI)
docker container run --rm -it --gpus=all --mount 'type=volume,source=f5-tts,target=/root/.cache/huggingface/hub/' -p 7860:7860 ghcr.io/swivid/f5-tts:main f5-tts_infer-gradio --host 0.0.0.0

Runtime

Deployment solution with Triton and TensorRT-LLM.

Benchmark Results

Decoding on a single L20 GPU with 26 different prompt_audio and target_text pairs, at 16 NFE (number of function evaluations).

| Model | Concurrency | Avg Latency | RTF | Mode |
|---|---|---|---|---|
| F5-TTS Base (Vocos) | 2 | 253 ms | 0.0394 | Client-Server |
| F5-TTS Base (Vocos) | 1 (Batch_size) | - | 0.0402 | Offline TRT-LLM |
| F5-TTS Base (Vocos) | 1 (Batch_size) | - | 0.1467 | Offline Pytorch |

See detailed instructions for more information.
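
To make the RTF column easier to interpret, here is a small sketch that assumes the conventional definition of real-time factor (processing time divided by the duration of the generated audio). The README does not spell the definition out, and the helper `real_time_factor` is hypothetical. The two RTF values are copied from the benchmark results above.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# RTF values copied from the benchmark results above
rtf_trt_llm = 0.0402
rtf_pytorch = 0.1467

# Seconds of compute needed for 10 s of audio at each RTF
for name, rtf in [("Offline TRT-LLM", rtf_trt_llm), ("Offline Pytorch", rtf_pytorch)]:
    print(f"{name}: {rtf * 10:.3f} s per 10 s of audio")

speedup = rtf_pytorch / rtf_trt_llm
print(f"TRT-LLM speedup over PyTorch: {speedup:.2f}x")
```

An RTF below 1 means audio is generated faster than it plays back; under this definition, lower is better.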

Inference

  • To achieve the desired performance, take a moment to read the detailed guidance.
  • Searching the existing issues for keywords from the problem you encounter is often very helpful.

1. Gradio App

Currently supported features:

# Launch a Gradio app (web interface)
f5-tts_infer-gradio

# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0

# Launch a share link
f5-tts_infer-gradio --share
<details> <summary>NVIDIA device docker compose file example</summary>
services:
  f5-tts:
    image: ghcr.io/swivid/f5-tts:main
    ports:
      - "7860:7860"
    environment:
      GRADIO_SERVER_PORT: 7860
    entrypoint: ["f5-tts_infer-gradio", "--port", "7860", "--host", "0.0.0.0"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  f5-tts:
    driver: local
</details>

2. CLI Inference

# Run with flags
# Leaving --ref_text "" makes an ASR model transcribe the reference audio (uses extra GPU memory)
f5-tts_infer-cli --model F5TTS_v1_Base \
--ref_audio "provide_prompt_wav_path_here.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want TTS model generate for you."

# Run with the default settings from src/f5_tts/infer/examples/basic/basic.toml
f5-tts_infer-cli
# Or with your own .toml file
f5-tts_infer-cli -c custom.toml

# Multi voice. See src/f5_tts/infer/README.md
f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
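
For scripted or batch runs, the CLI call can be assembled from Python. The sketch below only builds the argument list from the four flags shown above (`--model`, `--ref_audio`, `--ref_text`, `--gen_text`). The helper name `build_infer_command` and the sample text are assumptions for illustration, and nothing is executed.

```python
import shlex


def build_infer_command(ref_audio: str, ref_text: str, gen_text: str,
                        model: str = "F5TTS_v1_Base") -> list[str]:
    """Assemble the argument list for f5-tts_infer-cli using the flags shown above."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


cmd = build_infer_command(
    ref_audio="provide_prompt_wav_path_here.wav",
    ref_text="",  # empty: the CLI falls back to ASR transcription
    gen_text="Hello from a scripted run.",
)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)  (requires f5-tts to be installed)
```

Passing the arguments as a list rather than one shell string avoids quoting problems when the text contains spaces or quotes.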

Training

1. With Hugging Face Accelerate

Refer to training & finetuning guidance for best practice.

2. With Gradio App

# Quick start with Gradio web interface
f5-tts_finetune-gradio

Read training & finetuning guidance for more instructions.

Evaluation

Development

Use pre-commit to ensure code quality (will run linters and formatters automatically):

pip install pre-commit
pre-commit install

When making a pull request, before each commit, run:

pre-commit run --all-files

Note: Some model components have linting exceptions for E722 to accommodate tensor notation.

Acknowledgements

Citation

If our work and codebase are useful to you, please cite as:

@article{chen-etal-2024-f5tts,
      title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}, 
      author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
      journal={arXiv preprint arXiv:2410.06885},
      year={2024},
}

License

Our code is released under the MIT License. The pre-trained models are licensed under the CC-BY-NC license because the training data, Emilia, is an in-the-wild dataset. Sorry for any inconvenience this may cause.