Abstract
This study proposes automatic naturalness recognition for acted dialogue. The problem can be stated as follows: given speech utterances with their naturalness labels, is it possible to recognize these labels automatically? If so, by what methods, and how should these methods be evaluated? To investigate whether naturalness can be recognized automatically in acted speech, we evaluated two supervised classifiers: long short-term memory (LSTM) and multilayer perceptron (MLP) neural networks. These classifiers take as input acoustic features extracted from a speech dataset. Two kinds of acoustic features were evaluated: low-level and high-level features. This initial study on automatic naturalness recognition of speech yielded moderate performance for the assessed systems. We measured performance using the concordance correlation coefficient, the Pearson correlation coefficient, and the root mean square error. This study opens a potential application of speech processing techniques for measuring naturalness in acted dialogue, which may benefit drama and movie making in the future.