Chinese Pipeline
Introduction
Red Hen gathers Chinese broadcasts to build datasets for NLP, OCR, audio, and video pipelines. Work on these pipelines began as part of Red Hen's Google Summer of Code 2018. The broadcasts include TV news from CCTV1, CCTV13, HNTV1, and HNCSTV. We plan to use OCR to extract on-screen text. As of December 2018, a preliminary Automatic Speech Recognition pipeline is in place (see below), but it needs considerable improvement. We also plan to implement basic natural language processing tasks such as Chinese word segmentation, part-of-speech tagging, and named-entity recognition. Future work includes data analysis of the text these technologies produce.
Related pages
- Automatic Speech Recognition on Chinese
- Audio processing pipeline
- Current state of text tagging
- Machine learning
- Overview of research
- Red Hen corpus data format
- Red Hen data format
- Video processing pipelines
Getting started with the Chinese Pipeline
Preliminary Automatic Speech Recognition pipeline in production
Prerequisites
- For Audio-only Speech Recognition:
  - Git Large File Storage (Git LFS)
  - TensorFlow 1.0 or above
  - SciPy
  - PyXDG
  - python_speech_features
  - python_soxs
  - pandas
  - FFmpeg
- For Audio-Visual Speech Recognition, in addition to the requirements above:
  - OpenCV 3.x for Python
  - scikit-image
  - Dlib for Python
- For Natural Language Processing:
  - jieba
  - pyltp
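Before running the pipeline, it can help to check which of the prerequisites above are already present on a machine. The following is a minimal sketch, not part of the Red Hen pipeline itself: the function name `missing_prerequisites` and the grouping labels are hypothetical, and the import names (`xdg` for PyXDG, `cv2` for OpenCV, `skimage` for scikit-image) are the usual ones for those packages. The entry listed as python_soxs is left out because its import name is not stated here, and the script does not check version requirements such as TensorFlow 1.0 or OpenCV 3.x.

```python
import importlib.util
import shutil

# Usual import names for the Python packages in the prerequisites list.
# The grouping labels are illustrative only.
PYTHON_MODULES = {
    "audio-only ASR": ["tensorflow", "scipy", "xdg",
                       "python_speech_features", "pandas"],
    "audio-visual ASR": ["cv2", "skimage", "dlib"],
    "NLP": ["jieba", "pyltp"],
}

# Command-line tools expected on PATH (Git LFS installs a `git-lfs` binary).
COMMANDS = ["git-lfs", "ffmpeg"]


def missing_prerequisites(modules=PYTHON_MODULES, commands=COMMANDS):
    """Return human-readable descriptions of prerequisites that were not found."""
    missing = []
    for group, names in modules.items():
        for name in names:
            # find_spec returns None when a top-level module cannot be located.
            if importlib.util.find_spec(name) is None:
                missing.append(f"{group}: Python module '{name}'")
    for command in commands:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(command) is None:
            missing.append(f"command-line tool '{command}'")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing prerequisites:")
        for item in problems:
            print("  -", item)
    else:
        print("All checked prerequisites were found.")
```

The check only looks up whether each module or executable can be located; it does not import the packages, so it runs quickly even when TensorFlow is installed.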
Installation
- To be added.