Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 56

Search results for: Speech Recognition

56 A Mixing Matrix Estimation Algorithm for Speech Signals under the Under-Determined Blind Source Separation Model

Authors: Jing Wu, Wei Lv, Yibing Li, Yuanfan You

Abstract:

The separation of speech signals has become a research hotspot in the field of signal processing in recent years. It has many applications and influences in teleconferencing, hearing aids, speech recognition of machines and so on. The sounds received are usually noisy. The issue of identifying the sounds of interest and obtaining clear sounds in such an environment becomes a problem worth exploring, that is, the problem of blind source separation. This paper focuses on the under-determined blind source separation (UBSS). Sparse component analysis is generally used for the problem of under-determined blind source separation. The method is mainly divided into two parts. Firstly, the clustering algorithm is used to estimate the mixing matrix according to the observed signals. Then the signal is separated based on the known mixing matrix. In this paper, the problem of mixing matrix estimation is studied. This paper proposes an improved algorithm to estimate the mixing matrix for speech signals in the UBSS model. The traditional potential algorithm is not accurate for the mixing matrix estimation, especially for low signal-to noise ratio (SNR).In response to this problem, this paper considers the idea of an improved potential function method to estimate the mixing matrix. The algorithm not only avoids the inuence of insufficient prior information in traditional clustering algorithm, but also improves the estimation accuracy of mixing matrix. This paper takes the mixing of four speech signals into two channels as an example. The results of simulations show that the approach in this paper not only improves the accuracy of estimation, but also applies to any mixing matrix.

Keywords: Clustering algorithm, potential function, speech signal, the UBSS model.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 119
55 Optimized Brain Computer Interface System for Unspoken Speech Recognition: Role of Wernicke Area

Authors: Nassib Abdallah, Pierre Chauvet, Abd El Salam Hajjar, Bassam Daya

Abstract:

In this paper, we propose an optimized brain computer interface (BCI) system for unspoken speech recognition, based on the fact that the constructions of unspoken words rely strongly on the Wernicke area, situated in the temporal lobe. Our BCI system has four modules: (i) the EEG Acquisition module based on a non-invasive headset with 14 electrodes; (ii) the Preprocessing module to remove noise and artifacts, using the Common Average Reference method; (iii) the Features Extraction module, using Wavelet Packet Transform (WPT); (iv) the Classification module based on a one-hidden layer artificial neural network. The present study consists of comparing the recognition accuracy of 5 Arabic words, when using all the headset electrodes or only the 4 electrodes situated near the Wernicke area, as well as the selection effect of the subbands produced by the WPT module. After applying the articial neural network on the produced database, we obtain, on the test dataset, an accuracy of 83.4% with all the electrodes and all the subbands of 8 levels of the WPT decomposition. However, by using only the 4 electrodes near Wernicke Area and the 6 middle subbands of the WPT, we obtain a high reduction of the dataset size, equal to approximately 19% of the total dataset, with 67.5% of accuracy rate. This reduction appears particularly important to improve the design of a low cost and simple to use BCI, trained for several words.

Keywords: Brain-computer interface, speech recognition, electroencephalography EEG, Wernicke area, artificial neural network.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 354
54 Speech Enhancement Using Wavelet Coefficients Masking with Local Binary Patterns

Authors: Christian Arcos, Marley Vellasco, Abraham Alcaim

Abstract:

In this paper, we present a wavelet coefficients masking based on Local Binary Patterns (WLBP) approach to enhance the temporal spectra of the wavelet coefficients for speech enhancement. This technique exploits the wavelet denoising scheme, which splits the degraded speech into pyramidal subband components and extracts frequency information without losing temporal information. Speech enhancement in each high-frequency subband is performed by binary labels through the local binary pattern masking that encodes the ratio between the original value of each coefficient and the values of the neighbour coefficients. This approach enhances the high-frequency spectra of the wavelet transform instead of eliminating them through a threshold. A comparative analysis is carried out with conventional speech enhancement algorithms, demonstrating that the proposed technique achieves significant improvements in terms of PESQ, an international recommendation of objective measure for estimating subjective speech quality. Informal listening tests also show that the proposed method in an acoustic context improves the quality of speech, avoiding the annoying musical noise present in other speech enhancement techniques. Experimental results obtained with a DNN based speech recognizer in noisy environments corroborate the superiority of the proposed scheme in the robust speech recognition scenario.

Keywords: Binary labels, local binary patterns, mask, wavelet coefficients, speech enhancement, speech recognition.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 481
53 The Capacity of Mel Frequency Cepstral Coefficients for Speech Recognition

Authors: Fawaz S. Al-Anzi, Dia AbuZeina

Abstract:

Speech recognition is of an important contribution in promoting new technologies in human computer interaction. Today, there is a growing need to employ speech technology in daily life and business activities. However, speech recognition is a challenging task that requires different stages before obtaining the desired output. Among automatic speech recognition (ASR) components is the feature extraction process, which parameterizes the speech signal to produce the corresponding feature vectors. Feature extraction process aims at approximating the linguistic content that is conveyed by the input speech signal. In speech processing field, there are several methods to extract speech features, however, Mel Frequency Cepstral Coefficients (MFCC) is the popular technique. It has been long observed that the MFCC is dominantly used in the well-known recognizers such as the Carnegie Mellon University (CMU) Sphinx and the Markov Model Toolkit (HTK). Hence, this paper focuses on the MFCC method as the standard choice to identify the different speech segments in order to obtain the language phonemes for further training and decoding steps. Due to MFCC good performance, the previous studies show that the MFCC dominates the Arabic ASR research. In this paper, we demonstrate MFCC as well as the intermediate steps that are performed to get these coefficients using the HTK toolkit.

Keywords: Speech recognition, acoustic features, Mel Frequency Cepstral Coefficients.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1298
52 A Two-Stage Adaptation towards Automatic Speech Recognition System for Malay-Speaking Children

Authors: Mumtaz Begum Mustafa, Siti Salwah Salim, Feizal Dani Rahman

Abstract:

Recently, Automatic Speech Recognition (ASR) systems were used to assist children in language acquisition as it has the ability to detect human speech signal. Despite the benefits offered by the ASR system, there is a lack of ASR systems for Malay-speaking children. One of the contributing factors for this is the lack of continuous speech database for the target users. Though cross-lingual adaptation is a common solution for developing ASR systems for under-resourced language, it is not viable for children as there are very limited speech databases as a source model. In this research, we propose a two-stage adaptation for the development of ASR system for Malay-speaking children using a very limited database. The two stage adaptation comprises the cross-lingual adaptation (first stage) and cross-age adaptation. For the first stage, a well-known speech database that is phonetically rich and balanced, is adapted to the medium-sized Malay adults using supervised MLLR. The second stage adaptation uses the speech acoustic model generated from the first adaptation, and the target database is a small-sized database of the target users. We have measured the performance of the proposed technique using word error rate, and then compare them with the conventional benchmark adaptation. The two stage adaptation proposed in this research has better recognition accuracy as compared to the benchmark adaptation in recognizing children’s speech.

Keywords: Automatic speech recognition system, children speech, adaptation, Malay.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1264
51 Possibilities, Challenges and the State of the Art of Automatic Speech Recognition in Air Traffic Control

Authors: Van Nhan Nguyen, Harald Holone

Abstract:

Over the past few years, a lot of research has been conducted to bring Automatic Speech Recognition (ASR) into various areas of Air Traffic Control (ATC), such as air traffic control simulation and training, monitoring live operators for with the aim of safety improvements, air traffic controller workload measurement and conducting analysis on large quantities controller-pilot speech. Due to the high accuracy requirements of the ATC context and its unique challenges, automatic speech recognition has not been widely adopted in this field. With the aim of providing a good starting point for researchers who are interested bringing automatic speech recognition into ATC, this paper gives an overview of possibilities and challenges of applying automatic speech recognition in air traffic control. To provide this overview, we present an updated literature review of speech recognition technologies in general, as well as specific approaches relevant to the ATC context. Based on this literature review, criteria for selecting speech recognition approaches for the ATC domain are presented, and remaining challenges and possible solutions are discussed.

Keywords: Automatic Speech Recognition, ASR, Air Traffic Control, ATC.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2902
50 Advances in Artificial Intelligence Using Speech Recognition

Authors: Khaled M. Alhawiti

Abstract:

This research study aims to present a retrospective study about speech recognition systems and artificial intelligence. Speech recognition has become one of the widely used technologies, as it offers great opportunity to interact and communicate with automated machines. Precisely, it can be affirmed that speech recognition facilitates its users and helps them to perform their daily routine tasks, in a more convenient and effective manner. This research intends to present the illustration of recent technological advancements, which are associated with artificial intelligence. Recent researches have revealed the fact that speech recognition is found to be the utmost issue, which affects the decoding of speech. In order to overcome these issues, different statistical models were developed by the researchers. Some of the most prominent statistical models include acoustic model (AM), language model (LM), lexicon model, and hidden Markov models (HMM). The research will help in understanding all of these statistical models of speech recognition. Researchers have also formulated different decoding methods, which are being utilized for realistic decoding tasks and constrained artificial languages. These decoding methods include pattern recognition, acoustic phonetic, and artificial intelligence. It has been recognized that artificial intelligence is the most efficient and reliable methods, which are being used in speech recognition.

Keywords: Speech recognition, acoustic phonetic, artificial intelligence, Hidden Markov Models (HMM), statistical models of speech recognition, human machine performance.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 6161
49 A Novel RLS Based Adaptive Filtering Method for Speech Enhancement

Authors: Pogula Rakesh, T. Kishore Kumar

Abstract:

Speech enhancement is a long standing problem with numerous applications like teleconferencing, VoIP, hearing aids and speech recognition. The motivation behind this research work is to obtain a clean speech signal of higher quality by applying the optimal noise cancellation technique. Real-time adaptive filtering algorithms seem to be the best candidate among all categories of the speech enhancement methods. In this paper, we propose a speech enhancement method based on Recursive Least Squares (RLS) adaptive filter of speech signals. Experiments were performed on noisy data which was prepared by adding AWGN, Babble and Pink noise to clean speech samples at -5dB, 0dB, 5dB and 10dB SNR levels. We then compare the noise cancellation performance of proposed RLS algorithm with existing NLMS algorithm in terms of Mean Squared Error (MSE), Signal to Noise ratio (SNR) and SNR Loss. Based on the performance evaluation, the proposed RLS algorithm was found to be a better optimal noise cancellation technique for speech signals.

Keywords: Adaptive filter, Adaptive Noise Canceller, Mean Squared Error, Noise reduction, NLMS, RLS, SNR, SNR Loss.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2589
48 Environmentally Adaptive Acoustic Echo Suppression for Barge-in Speech Recognition

Authors: Jong Han Joo, Jeong Hun Lee, Young Sun Kim, Jae Young Kang, Seung Ho Choi

Abstract:

In this study, we propose a novel technique for acoustic echo suppression (AES) during speech recognition under barge-in conditions. Conventional AES methods based on spectral subtraction apply fixed weights to the estimated echo path transfer function (EPTF) at the current signal segment and to the EPTF estimated until the previous time interval. However, the effects of echo path changes should be considered for eliminating the undesired echoes. We describe a new approach that adaptively updates weight parameters in response to abrupt changes in the acoustic environment due to background noises or double-talk. Furthermore, we devised a voice activity detector and an initial time-delay estimator for barge-in speech recognition in communication networks. The initial time delay is estimated using log-spectral distance measure, as well as cross-correlation coefficients. The experimental results show that the developed techniques can be successfully applied in barge-in speech recognition systems.

Keywords: Acoustic echo suppression, barge-in, speech recognition, echo path transfer function, initial delay estimator, voice activity detector.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1789
47 Recognizing an Individual, Their Topic of Conversation, and Cultural Background from 3D Body Movement

Authors: Gheida J. Shahrour, Martin J. Russell

Abstract:

The 3D body movement signals captured during human-human conversation include clues not only to the content of people’s communication but also to their culture and personality. This paper is concerned with automatic extraction of this information from body movement signals. For the purpose of this research, we collected a novel corpus from 27 subjects, arranged them into groups according to their culture. We arranged each group into pairs and each pair communicated with each other about different topics. A state-of-art recognition system is applied to the problems of person, culture, and topic recognition. We borrowed modeling, classification, and normalization techniques from speech recognition. We used Gaussian Mixture Modeling (GMM) as the main technique for building our three systems, obtaining 77.78%, 55.47%, and 39.06% from the person, culture, and topic recognition systems respectively. In addition, we combined the above GMM systems with Support Vector Machines (SVM) to obtain 85.42%, 62.50%, and 40.63% accuracy for person, culture, and topic recognition respectively. Although direct comparison among these three recognition systems is difficult, it seems that our person recognition system performs best for both GMM and GMM-SVM, suggesting that intersubject differences (i.e. subject’s personality traits) are a major source of variation. When removing these traits from culture and topic recognition systems using the Nuisance Attribute Projection (NAP) and the Intersession Variability Compensation (ISVC) techniques, we obtained 73.44% and 46.09% accuracy from culture and topic recognition systems respectively.

Keywords: Person Recognition, Topic Recognition, Culture Recognition, 3D Body Movement Signals, Variability Compensation.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1754
46 Hybrid Modeling Algorithm for Continuous Tamil Speech Recognition

Authors: M. Kalamani, S. Valarmathy, M. Krishnamoorthi

Abstract:

In this paper, Fuzzy C-Means clustering with Expectation Maximization-Gaussian Mixture Model based hybrid modeling algorithm is proposed for Continuous Tamil Speech Recognition. The speech sentences from various speakers are used for training and testing phase and objective measures are between the proposed and existing Continuous Speech Recognition algorithms. From the simulated results, it is observed that the proposed algorithm improves the recognition accuracy and F-measure up to 3% as compared to that of the existing algorithms for the speech signal from various speakers. In addition, it reduces the Word Error Rate, Error Rate and Error up to 4% as compared to that of the existing algorithms. In all aspects, the proposed hybrid modeling for Tamil speech recognition provides the significant improvements for speechto- text conversion in various applications.

Keywords: Speech Segmentation, Feature Extraction, Clustering, HMM, EM-GMM, CSR.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1637
45 Combined Automatic Speech Recognition and Machine Translation in Business Correspondence Domain for English-Croatian

Authors: Sanja Seljan, Ivan Dunđer

Abstract:

The paper presents combined automatic speech recognition (ASR) of English and machine translation (MT) for English and Croatian and Croatian-English language pairs in the domain of business correspondence. The first part presents results of training the ASR commercial system on English data sets, enriched by error analysis. The second part presents results of machine translation performed by free online tool for English and Croatian and Croatian-English language pairs. Human evaluation in terms of usability is conducted and internal consistency calculated by Cronbach's alpha coefficient, enriched by error analysis. Automatic evaluation is performed by WER (Word Error Rate) and PER (Position-independent word Error Rate) metrics, followed by investigation of Pearson’s correlation with human evaluation.

Keywords: Automatic machine translation, integrated language technologies, quality evaluation, speech recognition.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2370
44 Recognition of Noisy Words Using the Time Delay Neural Networks Approach

Authors: Khenfer-Koummich Fatima, Mesbahi Larbi, Hendel Fatiha

Abstract:

This paper presents a recognition system for isolated words like robot commands. It’s carried out by Time Delay Neural Networks; TDNN. To teleoperate a robot for specific tasks as turn, close, etc… In industrial environment and taking into account the noise coming from the machine. The choice of TDNN is based on its generalization in terms of accuracy, in more it acts as a filter that allows the passage of certain desirable frequency characteristics of speech; the goal is to determine the parameters of this filter for making an adaptable system to the variability of speech signal and to noise especially, for this the back propagation technique was used in learning phase. The approach was applied on commands pronounced in two languages separately: The French and Arabic. The results for two test bases of 300 spoken words for each one are 87%, 97.6% in neutral environment and 77.67%, 92.67% when the white Gaussian noisy was added with a SNR of 35 dB.

Keywords: Neural networks, Noise, Speech Recognition.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1625
43 Robust Features for Impulsive Noisy Speech Recognition Using Relative Spectral Analysis

Authors: Hajer Rahali, Zied Hajaiej, Noureddine Ellouze

Abstract:

The goal of speech parameterization is to extract the relevant information about what is being spoken from the audio signal. In speech recognition systems Mel-Frequency Cepstral Coefficients (MFCC) and Relative Spectral Mel-Frequency Cepstral Coefficients (RASTA-MFCC) are the two main techniques used. It will be shown in this paper that it presents some modifications to the original MFCC method. In our work the effectiveness of proposed changes to MFCC called Modified Function Cepstral Coefficients (MODFCC) were tested and compared against the original MFCC and RASTA-MFCC features. The prosodic features such as jitter and shimmer are added to baseline spectral features. The above-mentioned techniques were tested with impulsive signals under various noisy conditions within AURORA databases.

Keywords: Auditory filter, impulsive noise, MFCC, prosodic features, RASTA filter.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1994
42 Analysis of Combined Use of NN and MFCC for Speech Recognition

Authors: Safdar Tanweer, Abdul Mobin, Afshar Alam

Abstract:

The performance and analysis of speech recognition system is illustrated in this paper. An approach to recognize the English word corresponding to digit (0-9) spoken by 2 different speakers is captured in noise free environment. For feature extraction, speech Mel frequency cepstral coefficients (MFCC) has been used which gives a set of feature vectors from recorded speech samples. Neural network model is used to enhance the recognition performance. Feed forward neural network with back propagation algorithm model is used. However other speech recognition techniques such as HMM, DTW exist. All experiments are carried out on Matlab.

Keywords: Speech Recognition, MFCC, Neural Network, classifier.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2753
41 Bidirectional Dynamic Time Warping Algorithm for the Recognition of Isolated Words Impacted by Transient Noise Pulses

Authors: G. Tamulevičius, A. Serackis, T. Sledevič, D. Navakauskas

Abstract:

We consider the biggest challenge in speech recognition – noise reduction. Traditionally detected transient noise pulses are removed with the corrupted speech using pulse models. In this paper we propose to cope with the problem directly in Dynamic Time Warping domain. Bidirectional Dynamic Time Warping algorithm for the recognition of isolated words impacted by transient noise pulses is proposed. It uses simple transient noise pulse detector, employs bidirectional computation of dynamic time warping and directly manipulates with warping results. Experimental investigation with several alternative solutions confirms effectiveness of the proposed algorithm in the reduction of impact of noise on recognition process – 3.9% increase of the noisy speech recognition is achieved.

Keywords: Transient noise pulses, noise reduction, dynamic time warping, speech recognition.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1562
40 Efficient DTW-Based Speech Recognition System for Isolated Words of Arabic Language

Authors: Khalid A. Darabkh, Ala F. Khalifeh, Baraa A. Bathech, Saed W. Sabah

Abstract:

Despite the fact that Arabic language is currently one of the most common languages worldwide, there has been only a little research on Arabic speech recognition relative to other languages such as English and Japanese. Generally, digital speech processing and voice recognition algorithms are of special importance for designing efficient, accurate, as well as fast automatic speech recognition systems. However, the speech recognition process carried out in this paper is divided into three stages as follows: firstly, the signal is preprocessed to reduce noise effects. After that, the signal is digitized and hearingized. Consequently, the voice activity regions are segmented using voice activity detection (VAD) algorithm. Secondly, features are extracted from the speech signal using Mel-frequency cepstral coefficients (MFCC) algorithm. Moreover, delta and acceleration (delta-delta) coefficients have been added for the reason of improving the recognition accuracy. Finally, each test word-s features are compared to the training database using dynamic time warping (DTW) algorithm. Utilizing the best set up made for all affected parameters to the aforementioned techniques, the proposed system achieved a recognition rate of about 98.5% which outperformed other HMM and ANN-based approaches available in the literature.

Keywords: Arabic speech recognition, MFCC, DTW, VAD.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3712
39 Comparison of Parameterization Methods in Recognizing Spoken Arabic Digits

Authors: Ali Ganoun

Abstract:

This paper proposes evaluation of sound parameterization methods in recognizing some spoken Arabic words, namely digits from zero to nine. Each isolated spoken word is represented by a single template based on a specific recognition feature, and the recognition is based on the Euclidean distance from those templates. The performance analysis of recognition is based on four parameterization features: the Burg Spectrum Analysis, the Walsh Spectrum Analysis, the Thomson Multitaper Spectrum Analysis and the Mel Frequency Cepstral Coefficients (MFCC) features. The main aim of this paper was to compare, analyze, and discuss the outcomes of spoken Arabic digits recognition systems based on the selected recognition features. The results acqired confirm that the use of MFCC features is a very promising method in recognizing Spoken Arabic digits.

Keywords: Speech Recognition, Spectrum Analysis, Burg Spectrum, Walsh Spectrum Analysis, Thomson Multitaper Spectrum, MFCC.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1177
38 On Developing an Automatic Speech Recognition System for Standard Arabic Language

Authors: R. Walha, F. Drira, H. El-Abed, A. M. Alimi

Abstract:

The Automatic Speech Recognition (ASR) applied to Arabic language is a challenging task. This is mainly related to the language specificities which make the researchers facing multiple difficulties such as the insufficient linguistic resources and the very limited number of available transcribed Arabic speech corpora. In this paper, we are interested in the development of a HMM-based ASR system for Standard Arabic (SA) language. Our fundamental research goal is to select the most appropriate acoustic parameters describing each audio frame, acoustic models and speech recognition unit. To achieve this purpose, we analyze the effect of varying frame windowing (size and period), acoustic parameter number resulting from features extraction methods traditionally used in ASR, speech recognition unit, Gaussian number per HMM state and number of embedded re-estimations of the Baum-Welch Algorithm. To evaluate the proposed ASR system, a multi-speaker SA connected-digits corpus is collected, transcribed and used throughout all experiments. A further evaluation is conducted on a speaker-independent continue SA speech corpus. The phonemes recognition rate is 94.02% which is relatively high when comparing it with another ASR system evaluated on the same corpus.

Keywords: ASR, HMM, acoustical analysis, acoustic modeling, Standard Arabic language

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1401
37 Puff Noise Detection and Cancellation for Robust Speech Recognition

Authors: Sangjun Park, Jungpyo Hong, Byung-Ok Kang, Yun-keun Lee, Minsoo Hahn

Abstract:

In this paper, an algorithm for detecting and attenuating puff noises frequently generated under the mobile environment is proposed. As a baseline system, puff detection system is designed based on Gaussian Mixture Model (GMM), and 39th Mel Frequency Cepstral Coefficient (MFCC) is extracted as feature parameters. To improve the detection performance, effective acoustic features for puff detection are proposed. In addition, detected puff intervals are attenuated by high-pass filtering. The speech recognition rate was measured for evaluation and confusion matrix and ROC curve are used to confirm the validity of the proposed system.

Keywords: Gaussian mixture model, puff detection and cancellation, speech enhancement.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1813
36 Accent Identification by Clustering and Scoring Formants

Authors: Dejan Stantic, Jun Jo

Abstract:

There have been significant improvements in automatic voice recognition technology. However, existing systems still face difficulties, particularly when used by non-native speakers with accents. In this paper we address a problem of identifying the English accented speech of speakers from different backgrounds. Once an accent is identified the speech recognition software can utilise training set from appropriate accent and therefore improve the efficiency and accuracy of the speech recognition system. We introduced the Q factor, which is defined by the sum of relationships between frequencies of the formants. Four different accents were considered and experimented for this research. A scoring method was introduced in order to effectively analyse accents. The proposed concept indicates that the accent could be identified by analysing their formants.

Keywords: Accent Identification, Formants, Q Factor.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1652
35 Comparison among Various Question Generations for Decision Tree Based State Tying in Persian Language

Authors: Nasibeh Nasiri, Dawood Talebi Khanmiri

Abstract:

Performance of any continuous speech recognition system is highly dependent on performance of the acoustic models. Generally, development of the robust spoken language technology relies on the availability of large amounts of data. Common way to cope with little data for training each state of Markov models is treebased state tying. This tying method applies contextual questions to tie states. Manual procedure for question generation suffers from human errors and is time consuming. Various automatically generated questions are used to construct decision tree. There are three approaches to generate questions to construct HMMs based on decision tree. One approach is based on misrecognized phonemes, another approach basically uses feature table and the other is based on state distributions corresponding to context-independent subword units. In this paper, all these methods of automatic question generation are applied to the decision tree on FARSDAT corpus in Persian language and their results are compared with those of manually generated questions. The results show that automatically generated questions yield much better results and can replace manually generated questions in Persian language.

Keywords: Decision Tree, Markov Models, Speech Recognition, State Tying.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1398
34 Assamese Numeral Speech Recognition using Multiple Features and Cooperative LVQ -Architectures

Authors: Manash Pratim Sarma, Kandarpa Kumar Sarma

Abstract:

A set of Artificial Neural Network (ANN) based methods for the design of an effective system of speech recognition of numerals of Assamese language captured under varied recording conditions and moods is presented here. The work is related to the formulation of several ANN models configured to use Linear Predictive Code (LPC), Principal Component Analysis (PCA) and other features to tackle mood and gender variations uttering numbers as part of an Automatic Speech Recognition (ASR) system in Assamese. The ANN models are designed using a combination of Self Organizing Map (SOM) and Multi Layer Perceptron (MLP) constituting a Learning Vector Quantization (LVQ) block trained in a cooperative environment to handle male and female speech samples of numerals of Assamese- a language spoken by a sizable population in the North-Eastern part of India. The work provides a comparative evaluation of several such combinations while subjected to handle speech samples with gender based differences captured by a microphone in four different conditions viz. noiseless, noise mixed, stressed and stress-free.

Keywords: Assamese, Recognition, LPC, Spectral, ANN.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1626
33 Intelligibility of Cued Speech in Video

Authors: P. Heribanová, J. Polec, S. Ondrušová, M. Hosťovecký

Abstract:

This paper discusses the cued speech recognition methods in videoconference. Cued speech is a specific gesture language that is used for communication between deaf people. We define the criteria for sentence intelligibility according to answers of testing subjects (deaf people). In our tests we use 30 sample videos coded by H.264 codec with various bit-rates and various speed of cued speech. Additionally, we define the criteria for consonant sign recognizability in single-handed finger alphabet (dactyl) analogically to acoustics. We use another 12 sample videos coded by H.264 codec with various bit-rates in four different video formats. To interpret the results we apply the standard scale for subjective video quality evaluation and the percentual evaluation of intelligibility as in acoustics. From the results we construct the minimum coded bit-rate recommendations for every spatial resolution.

Keywords: cued speech, inteligibility, logatom, video

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1151
32 Recognition by Online Modeling – a New Approach of Recognizing Voice Signals in Linear Time

Authors: Jyh-Da Wei, Hsin-Chen Tsai

Abstract:

This work presents a novel means of extracting fixedlength parameters from voice signals, such that words can be recognized in linear time. The power and the zero crossing rate are first calculated segment by segment from a voice signal; by doing so, two feature sequences are generated. We then construct an FIR system across these two sequences. The parameters of this FIR system, used as the input of a multilayer proceptron recognizer, can be derived by recursive LSE (least-square estimation), implying that the complexity of overall process is linear to the signal size. In the second part of this work, we introduce a weighting factor λ to emphasize recent input; therefore, we can further recognize continuous speech signals. Experiments employ the voice signals of numbers, from zero to nine, spoken in Mandarin Chinese. The proposed method is verified to recognize voice signals efficiently and accurately.

Keywords: Speech Recognition, FIR system, Recursive LSE, Multilayer Perceptron

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1021
31 Applications of Support Vector Machines on Smart Phone Systems for Emotional Speech Recognition

Authors: Wernhuar Tarng, Yuan-Yuan Chen, Chien-Lung Li, Kun-Rong Hsie, Mingteh Chen

Abstract:

An emotional speech recognition system for the applications on smart phones was proposed in this study to combine with 3G mobile communications and social networks to provide users and their groups with more interaction and care. This study developed a mechanism using the support vector machines (SVM) to recognize the emotions of speech such as happiness, anger, sadness and normal. The mechanism uses a hierarchical classifier to adjust the weights of acoustic features and divides various parameters into the categories of energy and frequency for training. In this study, 28 commonly used acoustic features including pitch and volume were proposed for training. In addition, a time-frequency parameter obtained by continuous wavelet transforms was also used to identify the accent and intonation in a sentence during the recognition process. The Berlin Database of Emotional Speech was used by dividing the speech into male and female data sets for training. According to the experimental results, the accuracies of male and female test sets were increased by 4.6% and 5.2% respectively after using the time-frequency parameter for classifying happy and angry emotions. For the classification of all emotions, the average accuracy, including male and female data, was 63.5% for the test set and 90.9% for the whole data set.

Keywords: Smart phones, emotional speech recognition, socialnetworks, support vector machines, time-frequency parameter, Mel-scale frequency cepstral coefficients (MFCC).

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1496
30 Face Localization Using Illumination-dependent Face Model for Visual Speech Recognition

Authors: Robert E. Hursig, Jane X. Zhang

Abstract:

A robust still image face localization algorithm capable of operating in an unconstrained visual environment is proposed. First, construction of a robust skin classifier within a shifted HSV color space is described. Then various filtering operations are performed to better isolate face candidates and mitigate the effect of substantial non-skin regions. Finally, a novel Bhattacharyya-based face detection algorithm is used to compare candidate regions of interest with a unique illumination-dependent face model probability distribution function approximation. Experimental results show a 90% face detection success rate despite the demands of the visually noisy environment.

Keywords: Audio-visual speech recognition, Bhattacharyyacoefficient, face detection,

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1243
29 Comparison of MFCC and Cepstral Coefficients as a Feature Set for PCG Biometric Systems

Authors: Justin Leo Cheang Loong, Khazaimatol S Subari, Muhammad Kamil Abdullah, Nurul Nadia Ahmad, RosliBesar

Abstract:

Heart sound is an acoustic signal and many techniques used nowadays for human recognition tasks borrow speech recognition techniques. One popular choice for feature extraction of accoustic signals is the Mel Frequency Cepstral Coefficients (MFCC) which maps the signal onto a non-linear Mel-Scale that mimics the human hearing. However the Mel-Scale is almost linear in the frequency region of heart sounds and thus should produce similar results with the standard cepstral coefficients (CC). In this paper, MFCC is investigated to see if it produces superior results for PCG based human identification system compared to CC. Results show that the MFCC system is still superior to CC despite linear filter-banks in the lower frequency range, giving up to 95% correct recognition rate for MFCC and 90% for CC. Further experiments show that the high recognition rate is due to the implementation of filter-banks and not from Mel-Scaling.

Keywords: Biometric, Phonocardiogram, Cepstral Coefficients, Mel Frequency

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3141
28 3DARModeler: a 3D Modeling System in Augmented Reality Environment

Authors: Trien V. Do, Jong-Weon Lee

Abstract:

This paper describes a 3D modeling system in Augmented Reality environment, named 3DARModeler. It can be considered a simple version of 3D Studio Max with necessary functions for a modeling system such as creating objects, applying texture, adding animation, estimating real light sources and casting shadows. The 3DARModeler introduces convenient, and effective human-computer interaction to build 3D models by combining both the traditional input method (mouse/keyboard) and the tangible input method (markers). It has the ability to align a new virtual object with the existing parts of a model. The 3DARModeler targets nontechnical users. As such, they do not need much knowledge of computer graphics and modeling techniques. All they have to do is select basic objects, customize their attributes, and put them together to build a 3D model in a simple and intuitive way as if they were doing in the real world. Using the hierarchical modeling technique, the users are able to group several basic objects to manage them as a unified, complex object. The system can also connect with other 3D systems by importing and exporting VRML/3Ds Max files. A module of speech recognition is included in the system to provide flexible user interfaces.

Keywords: 3D Modeling, Augmented Reality, GeometricModeling, Virtual Reality

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2190
27 Text-independent Speaker Identification Based on MAP Channel Compensation and Pitch-dependent Features

Authors: Jiqing Han, Rongchun Gao

Abstract:

One major source of performance decline in speaker recognition system is channel mismatch between training and testing. This paper focuses on improving channel robustness of speaker recognition system in two aspects of channel compensation technique and channel robust features. The system is text-independent speaker identification system based on two-stage recognition. In the aspect of channel compensation technique, this paper applies MAP (Maximum A Posterior Probability) channel compensation technique, which was used in speech recognition, to speaker recognition system. In the aspect of channel robust features, this paper introduces pitch-dependent features and pitch-dependent speaker model for the second stage recognition. Based on the first stage recognition to testing speech using GMM (Gaussian Mixture Model), the system uses GMM scores to decide if it needs to be recognized again. If it needs to, the system selects a few speakers from all of the speakers who participate in the first stage recognition for the second stage recognition. For each selected speaker, the system obtains 3 pitch-dependent results from his pitch-dependent speaker model, and then uses ANN (Artificial Neural Network) to unite the 3 pitch-dependent results and 1 GMM score for getting a fused result. The system makes the second stage recognition based on these fused results. The experiments show that the correct rate of two-stage recognition system based on MAP channel compensation technique and pitch-dependent features is 41.7% better than the baseline system for closed-set test.

Keywords: Channel Compensation, Channel Robustness, MAP, Speaker Identification

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1201