Red Hen uses Tesseract for Optical Character Recognition. Tesseract can be configured for Chinese characters. See
- Chinese Character Recognition Using Tessaract OCR, which says:
- You need to download the Chinese trained data (a file like chi_sim.traineddata) and add it to your tessdata folder. Download the file https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata and use it like this:
Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"chi_sim"];
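The quoted line above targets an Objective-C wrapper. For a server-side pipeline, the same idea can be sketched against the Tesseract command-line tool, here in Python. This is a minimal sketch under stated assumptions, not Red Hen's actual pipeline: the helper names `build_tesseract_command` and `run_ocr` and the file names are hypothetical, and it assumes the `tesseract` executable is on the PATH and that chi_sim.traineddata has been placed in the tessdata folder.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="chi_sim", tessdata_dir=None):
    """Return the argv list for a Tesseract CLI call (nothing is executed here)."""
    cmd = ["tesseract", str(image_path), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # --tessdata-dir points Tesseract at the folder holding <lang>.traineddata
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(image_path, output_base, lang="chi_sim", tessdata_dir=None):
    """Run Tesseract and return the recognized text, or None if it is not installed."""
    if shutil.which("tesseract") is None:
        return None  # assumption: a missing executable is reported as None
    if tessdata_dir is not None:
        model = Path(tessdata_dir) / f"{lang}.traineddata"
        if not model.is_file():
            raise FileNotFoundError(f"missing language data: {model}")
    cmd = build_tesseract_command(image_path, output_base, lang, tessdata_dir)
    subprocess.run(cmd, check=True)
    # Tesseract appends ".txt" to the output base name
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

A call such as `run_ocr("frame.png", "frame_text", tessdata_dir="tessdata")` would then write frame_text.txt and return its contents; both file names are placeholders.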
Would you like to establish a Chinese OCR pipeline for Red Hen's large Chinese audiovisual holdings?
If so, write to
and we will try to connect you with a mentor.
- Chinese Character Recognition Using Tessaract OCR
- Red Hen Lab GitHub repository: ASR for Chinese Pipeline (master)
- Suwei Xu's GitHub repository (development of the ASR for Chinese Pipeline, Google Summer of Code 2018) -- blog
- Zhaoqing Xu's GitHub repository (a fork of the master) -- blog
- A PaddlePaddle implementation of DeepSpeech2 architecture for ASR
- THCHS-30 (A Free Chinese Speech Corpus Released by CSLT@Tsinghua University)
- DeepSpeech (A TensorFlow implementation of Baidu's DeepSpeech architecture)
- Battenberg et al. (2017). Exploring Neural Transducers for End-to-End Speech Recognition. (arXiv; writeup as DeepSpeech3)
- STN-OCR: A single Neural Network for Text Detection and Text Recognition (2017)