Red Hen gathers Chinese broadcasts to build datasets for NLP, OCR, audio, and video pipelines. Work on these pipelines began as part of Red Hen's Google Summer of Code 2018. The broadcasts include TV news from CCTV1, CCTV13, HNTV1, and HNCSTV. We plan to use OCR to extract on-screen text. As of December 2018, a preliminary Automatic Speech Recognition pipeline is in place (see below), but it needs considerable improvement. We will also implement basic natural language processing tasks such as Chinese word segmentation, part-of-speech tagging, and named-entity recognition. Future work includes data analysis of the text produced by these technologies.
In addition to the above requirements, you will also need: