Chinese Pipeline

Introduction

Red Hen gathers Chinese broadcasts to build datasets for NLP, OCR, audio, and video pipelines. Work on these pipelines began as part of Red Hen's Google Summer of Code 2018. The broadcasts include TV news from CCTV1, CCTV13, HNTV1, and other channels. We will attempt to use OCR to extract on-screen text, and deep learning (e.g. DeepSpeech, from Baidu) to convert audio to text. We will also implement basic natural language processing tasks such as Chinese word segmentation, part-of-speech tagging, and named-entity recognition. Future work includes data analysis of the text these technologies produce.
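To make the word segmentation task concrete, here is a minimal sketch of forward maximum matching, a classic dictionary baseline. It is only an illustration: it is not the algorithm used by jieba or pyltp, and the function name, the toy vocabulary, and the `max_word_len` parameter are all assumptions made for this example.

```python
def forward_max_match(text, vocabulary, max_word_len=4):
    """Greedy left-to-right segmentation using a word list (toy baseline)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary
        # contains it; a single character is always accepted as a fallback.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy vocabulary; a real pipeline would rely on a full lexicon.
TOY_VOCAB = {"我", "来到", "北京", "清华", "大学", "清华大学"}

sentence = "我来到北京清华大学"
tokens = forward_max_match(sentence, TOY_VOCAB)
print(" / ".join(tokens))
```

Because the matcher prefers the longest dictionary entry, 清华大学 is kept as one word instead of being split into 清华 and 大学. Libraries such as jieba handle ambiguity and unknown words far better than this baseline.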

Resources

  • THCHS-30 (A Free Chinese Speech Corpus Released by CSLT@Tsinghua University)
  • DeepSpeech (A TensorFlow implementation of Baidu's DeepSpeech architecture)

Getting started with Chinese Pipeline

Prerequisites

  • For Audio-only Speech Recognition:
    • Git Large File Storage
    • TensorFlow 1.0 or above
    • SciPy
    • PyXDG
    • python_speech_features
    • python_soxs
    • pandas
    • FFmpeg
  • For Audio-Visual Speech Recognition:
In addition to the requirements above, you will also need:
    • OpenCV 3.x for Python
    • scikit-image
    • Dlib for Python
  • For Natural Language Processing:
    • jieba
    • pyltp
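Before installing anything, it can help to see which of the prerequisites above are already present. The following is a minimal sketch using only the Python standard library. The import names (`xdg` for PyXDG, `cv2` for OpenCV, `skimage` for scikit-image) and the executable names (`git-lfs`, `ffmpeg`) are the usual ones for these packages but are assumptions here, and the sox binding is left out because its import name is not given on this page. The check only tests for presence, not versions.

```python
import importlib.util
import shutil

# Display label -> importable module name (assumed standard import names).
PYTHON_MODULES = {
    "TensorFlow": "tensorflow",
    "SciPy": "scipy",
    "PyXDG": "xdg",
    "python_speech_features": "python_speech_features",
    "pandas": "pandas",
    "OpenCV for Python": "cv2",
    "scikit-image": "skimage",
    "Dlib for Python": "dlib",
    "jieba": "jieba",
    "pyltp": "pyltp",
}

# Display label -> executable expected on PATH (assumed names).
COMMAND_LINE_TOOLS = {
    "Git Large File Storage": "git-lfs",
    "FFmpeg": "ffmpeg",
}


def find_missing(modules=PYTHON_MODULES, tools=COMMAND_LINE_TOOLS):
    """Return the labels of modules and tools that cannot be found."""
    missing = []
    for label, module_name in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module_name) is None:
            missing.append(label)
    for label, executable in tools.items():
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(executable) is None:
            missing.append(label)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All checked prerequisites were found.")
```

If you only need audio-only speech recognition or only the NLP tools, pass a smaller dictionary to `find_missing` so that unrelated packages are not reported.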

Installation

  • To be added

Data Preprocessing for Training

Audio-only Speech Recognition

Audio-Visual Speech Recognition (AVSR)

Training

Audio-only Model

Audio-Visual Model

Training Results