Visual speech recognition with lightweight psychologically motivated Gabor features

doi:10.3390/e22121367

Xuejie Zhang, Yan Xu, Andrew K. Abel*, Leslie S. Smith, Roger Watt, Amir Hussain, Chengxiang Gao

*Corresponding author for this work

Department of Computing

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

Extraction of relevant lip features is of continuing interest in the visual speech domain. Using end-to-end feature extraction can produce good results, but at the cost of the results being difficult for humans to comprehend and relate to. We present a new, lightweight feature extraction approach, motivated by human-centric glimpse-based psychological research into facial barcodes, and demonstrate that these simple, easy-to-extract 3D geometric features (produced using Gabor-based image patches) can successfully be used for speech recognition with LSTM-based machine learning. This approach can extract low-dimensionality lip parameters with a minimum of processing. One key difference between these Gabor-based features and other features, such as traditional DCT or the currently fashionable CNN features, is that they are human-centric and can be visualised and analysed by humans, which makes the results easier to explain. They can also be used for reliable speech recognition, as demonstrated using the Grid corpus. Results for overlapping speakers using our lightweight system gave a recognition rate of over 82%, which compares well to less explainable features in the literature.
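As a rough illustration of what a Gabor-based image patch response is, the minimal Python sketch below builds a real-valued 2D Gabor kernel and takes its inner product with a small grayscale patch. It is not the authors' pipeline: the function names (`gabor_kernel`, `patch_response`, `bar_patch`), the parameter values, and the toy bar-shaped patches are all assumptions chosen for illustration.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Return a size x size real Gabor kernel as a list of rows (size should be odd).

    All parameter names and defaults here are illustrative assumptions.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the cosine carrier varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope multiplied by a cosine carrier.
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def patch_response(patch, kernel):
    """Inner product of a patch with a kernel of the same shape."""
    return sum(p * k for prow, krow in zip(patch, kernel) for p, k in zip(prow, krow))


def bar_patch(size, horizontal):
    """Toy patch: a bright 3-pixel-wide bar through the centre on a dark background."""
    half = size // 2
    return [
        [1.0 if abs((r if horizontal else c) - half) <= 1 else 0.0 for c in range(size)]
        for r in range(size)
    ]


if __name__ == "__main__":
    # With theta = pi/2 the carrier varies vertically, so the kernel is tuned
    # to horizontal structure (hypothetical stand-in for a closed mouth line).
    kernel = gabor_kernel(size=9, wavelength=6.0, theta=math.pi / 2, sigma=2.0)
    print("horizontal bar:", patch_response(bar_patch(9, True), kernel))
    print("vertical bar:  ", patch_response(bar_patch(9, False), kernel))
```

Under these assumptions the kernel responds more strongly to the horizontal bar than to the vertical one, which is the kind of orientation selectivity that makes such responses easy for a person to inspect and interpret.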

Original language	English
Article number	1367
Pages (from-to)	1-24
Number of pages	24
Journal	Entropy
Volume	22
Issue number	12
DOIs	https://doi.org/10.3390/e22121367
Publication status	Published - Dec 2020

Keywords

Explainable
Gabor features
Image processing
Lip reading
Speech recognition

Cite this

@article{7cd059a8707e4453b22e41f4fa7c28b6,

title = "Visual speech recognition with lightweight psychologically motivated {G}abor features",

abstract = "Extraction of relevant lip features is of continuing interest in the visual speech domain. Using end-to-end feature extraction can produce good results, but at the cost of the results being difficult for humans to comprehend and relate to. We present a new, lightweight feature extraction approach, motivated by human-centric glimpse-based psychological research into facial barcodes, and demonstrate that these simple, easy-to-extract 3D geometric features (produced using Gabor-based image patches) can successfully be used for speech recognition with LSTM-based machine learning. This approach can extract low-dimensionality lip parameters with a minimum of processing. One key difference between these Gabor-based features and other features, such as traditional DCT or the currently fashionable CNN features, is that they are human-centric and can be visualised and analysed by humans, which makes the results easier to explain. They can also be used for reliable speech recognition, as demonstrated using the Grid corpus. Results for overlapping speakers using our lightweight system gave a recognition rate of over 82\%, which compares well to less explainable features in the literature.",

keywords = "Explainable, Gabor features, Image processing, Lip reading, Speech recognition",

author = "Zhang, Xuejie and Xu, Yan and Abel, {Andrew K.} and Smith, {Leslie S.} and Watt, Roger and Hussain, Amir and Gao, Chengxiang",

note = "Publisher Copyright: {\textcopyright} 2020 by the authors. Licensee MDPI, Basel, Switzerland.",

year = "2020",

month = dec,

doi = "10.3390/e22121367",

language = "English",

volume = "22",

pages = "1--24",

journal = "Entropy",

issn = "1099-4300",

number = "12",

}

TY - JOUR

T1 - Visual speech recognition with lightweight psychologically motivated Gabor features

AU - Zhang, Xuejie

AU - Xu, Yan

AU - Abel, Andrew K.

AU - Smith, Leslie S.

AU - Watt, Roger

AU - Hussain, Amir

AU - Gao, Chengxiang

PY - 2020/12

Y1 - 2020/12

N2 - Extraction of relevant lip features is of continuing interest in the visual speech domain. Using end-to-end feature extraction can produce good results, but at the cost of the results being difficult for humans to comprehend and relate to. We present a new, lightweight feature extraction approach, motivated by human-centric glimpse-based psychological research into facial barcodes, and demonstrate that these simple, easy-to-extract 3D geometric features (produced using Gabor-based image patches) can successfully be used for speech recognition with LSTM-based machine learning. This approach can extract low-dimensionality lip parameters with a minimum of processing. One key difference between these Gabor-based features and other features, such as traditional DCT or the currently fashionable CNN features, is that they are human-centric and can be visualised and analysed by humans, which makes the results easier to explain. They can also be used for reliable speech recognition, as demonstrated using the Grid corpus. Results for overlapping speakers using our lightweight system gave a recognition rate of over 82%, which compares well to less explainable features in the literature.

KW - Explainable

KW - Gabor features

KW - Image processing

KW - Lip reading

KW - Speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85097060874&partnerID=8YFLogxK

U2 - 10.3390/e22121367

DO - 10.3390/e22121367

M3 - Article

AN - SCOPUS:85097060874

SN - 1099-4300

VL - 22

SP - 1

EP - 24

JO - Entropy

JF - Entropy

IS - 12

M1 - 1367

ER -
