Automatic Speech Recognition for Chinese
Red Hen has a preliminary Automatic Speech Recognition (ASR) pipeline for Chinese. Would you like to help improve it?
If so, write to us and we will try to connect you with a mentor.
Related Scrolls
Related Links
- Red Hen Lab github repository: ASR for Chinese Pipeline (master)
- Suwei Xu's github repository (development of the ASR for Chinese Pipeline, Google Summer of Code 2018) -- blog
- Zhaoqing Xu's github repository (a fork of the master) -- blog
- A PaddlePaddle implementation of DeepSpeech2 architecture for ASR
- THCHS-30 (a free Chinese speech corpus released by CSLT@Tsinghua University)
- DeepSpeech (A TensorFlow implementation of Baidu's DeepSpeech architecture)
- Battenberg et al. (2017). Exploring Neural Transducers for End-to-End Speech Recognition (arXiv; written up as DeepSpeech3)
- Music removal by convolutional denoising autoencoder in speech recognition
- VAD (voice activity detection, used to cut audio between sentences) -- Python interface to the WebRTC VAD
- Diarization: https://arxiv.org/pdf/1810.04719.pdf
- Kur: Descriptive Deep Learning (blog)
- wav2letter++
More Information
Red Hen has a pipeline in production at the Case HPC that runs Chinese ASR using Baidu's DeepSpeech2 with PaddlePaddle inside a Singularity container built on Singularity Hub from a recipe. It starts with this command:
singularity exec -e --nv ../Chinese_Pipeline.simg bash infer.sh $DAY
In the Slurm job submission, the pipeline requests two GPUs on a K40 node:
#SBATCH -p gpu -C gpuk40 --mem=100gb --gres=gpu:2
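Putting these pieces together, a full submission script might look roughly like the sketch below. Only the #SBATCH resource line and the singularity command come from the pipeline above; the wall-clock limit, log file name, module load, and working directory are assumptions.

#!/bin/bash
#SBATCH -p gpu -C gpuk40 --mem=100gb --gres=gpu:2
#SBATCH --time=04:00:00        # assumed wall-clock limit
#SBATCH -o chinese_asr_%j.log  # assumed log file name

# DAY identifies the day of recordings to process; its exact format is an assumption.
module load singularity        # assumption: Singularity is available as an environment module

cd "$HOME/cp"                  # assumed working directory containing infer.sh
singularity exec -e --nv ../Chinese_Pipeline.simg bash infer.sh "$DAY"

The day would then be passed in at submission time, for example with sbatch --export=ALL,DAY=2019-01-15 chinese_asr.slurm (the script name and date format are hypothetical).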
Submitted jobs can then be monitored with squeue:
abc123@server:~/cp$ squeue -u abc123
     JOBID PARTITION     NAME   USER ST   TIME NODES NODELIST(REASON)
  12267389       gpu work.slu abc123 PD   0:00     1 (Priority)
  12267373       gpu work.slu abc123  R  27:51     1 gput025
  12267379       gpu work.slu abc123  R  15:41     1 gput026
It takes about four minutes to run ASR on a standard one-hour recording.
To Do
Chinese Red Hens report that the output makes sense but contains copious errors and disfluencies; to improve it, the audio should be cut at pauses or word breaks rather than mechanically at ten-second intervals, as sketched below. A training dataset of news content would also help.
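One lightweight way to experiment with pause-based cutting from the shell is sketched here. It uses ffmpeg's silencedetect filter as a simple stand-in for the WebRTC VAD interface linked above; the noise threshold, minimum pause length, and file names are assumptions that would need tuning on real broadcast audio.

#!/bin/bash
# Sketch: split a recording at detected pauses instead of fixed ten-second chunks.
IN="$1"                          # e.g. broadcast.wav (hypothetical file name)
OUTDIR="${2:-segments}"
mkdir -p "$OUTDIR"

# 1. Find pauses: at least 0.4 s quieter than -35 dB (both thresholds are guesses).
ffmpeg -hide_banner -i "$IN" -af silencedetect=noise=-35dB:d=0.4 -f null - 2>&1 \
  | grep -oE 'silence_(start|end): [0-9.]+' > silences.txt

# 2. Use the midpoint of each pause as a cut point.
awk '/silence_start/ {s=$2} /silence_end/ {print (s + $2) / 2}' silences.txt > cuts.txt

# 3. Extract each speech segment between consecutive cut points.
prev=0; i=0
while read -r cut; do
  ffmpeg -y -hide_banner -loglevel error -i "$IN" -ss "$prev" -to "$cut" \
    "$OUTDIR/$(printf 'seg_%04d.wav' "$i")"
  prev="$cut"; i=$((i + 1))
done < cuts.txt
# The tail after the last pause becomes the final segment.
ffmpeg -y -hide_banner -loglevel error -i "$IN" -ss "$prev" \
  "$OUTDIR/$(printf 'seg_%04d.wav' "$i")"

The resulting segments could then be fed to the recognizer in place of fixed ten-second chunks; the WebRTC VAD linked above is purpose-built for speech and should cope better with background music and noise than a plain energy threshold.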
Thoughts
Other approaches are also worth exploring, notably Baidu DeepSpeech3.