Chinese Pipeline

Introduction

Red Hen gathers Chinese broadcasts to build datasets for NLP, OCR, audio, and video pipelines. Work on these pipelines began as part of Red Hen's Google Summer of Code 2018. The broadcasts include TV news from CCTV1, CCTV13, HNTV1, and HNCSTV. We plan to use OCR to extract on-screen text. As of December 2018, a preliminary Automatic Speech Recognition pipeline is in place (see below), but it needs considerable improvement. We also plan to implement basic natural language processing tasks such as Chinese word segmentation, part-of-speech tagging, and named-entity recognition. Future work includes data analysis of the texts these technologies produce.

Getting started with Chinese Pipeline

Preliminary Automatic Speech Recognition pipeline in production

Prerequisites

  • For Audio-only Speech Recognition:
    • Git Large File Storage (Git LFS)
    • TensorFlow 1.0 or above
    • SciPy
    • PyXDG
    • python_speech_features
    • python_soxs
    • pandas
    • FFmpeg

  • For Audio-Visual Speech Recognition (in addition to the requirements above):
    • OpenCV 3.x for Python
    • scikit-image
    • Dlib for Python

  • For Natural Language Processing:
    • jieba
    • pyltp
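
The prerequisites above can be checked before running anything with a short script. The following is a minimal sketch, not part of the pipeline itself. The import names (for example `xdg` for PyXDG, `cv2` for OpenCV, `skimage` for scikit-image) and the executable names (`git-lfs`, `ffmpeg`) are assumptions based on how these packages are usually installed, and `check_prerequisites` is a hypothetical helper name. `python_soxs` is left out because this page does not give its import name. The script does not check versions, so the TensorFlow 1.0 and OpenCV 3.x requirements still need to be confirmed by hand.

```python
import importlib.util
import shutil

# Label -> import name. The import names are assumptions based on how each
# package is usually imported; adjust them if your installation differs.
PYTHON_MODULES = {
    "TensorFlow": "tensorflow",
    "SciPy": "scipy",
    "PyXDG": "xdg",
    "python_speech_features": "python_speech_features",
    "pandas": "pandas",
    "OpenCV for Python": "cv2",
    "scikit-image": "skimage",
    "Dlib for Python": "dlib",
    "jieba": "jieba",
    "pyltp": "pyltp",
}

# Label -> command name. These command names are also assumptions.
EXECUTABLES = {
    "Git Large File Storage": "git-lfs",
    "FFmpeg": "ffmpeg",
}


def check_prerequisites(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return a dict mapping each prerequisite label to True if it was found."""
    status = {}
    for label, module_name in modules.items():
        try:
            # find_spec locates a module without importing it.
            found = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            found = False
        status[label] = found
    for label, command in executables.items():
        # shutil.which searches PATH for the executable.
        status[label] = shutil.which(command) is not None
    return status


if __name__ == "__main__":
    for label, found in check_prerequisites().items():
        print("{:<28} {}".format(label, "found" if found else "MISSING"))
```

Running the script prints one line per prerequisite, marked either found or MISSING. Entries needed only for Audio-Visual Speech Recognition or Natural Language Processing can be ignored if you only need the audio-only pipeline.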

Installation

  • To be added.

Data Preprocessing for Training

Audio-only Speech Recognition

Audio-Visual Speech Recognition (AVSR)

Training

Audio-only Model

Audio-Visual Model

Training Results