Tuesday, August 6, 2019
Speaker Independent Speech Recognizer Development
Chapter 4 Methodology and Implementation

This chapter describes the methodology and implementation of the speaker-independent speech recognizer for the Sinhala language and of the Android mobile application for voice dialing. The research has two main phases. The first is to build the speaker-independent Sinhala speech recognizer, which recognizes digits spoken in Sinhala. The second is to build an Android application that integrates the trained speech recognizer. This chapter covers the tools, algorithms, theoretical aspects, models and file structures used throughout the research process.

4.1 Research phase 1: Build the speaker-independent Sinhala speech recognizer for recognizing the digits

This section describes, step by step, the development of the speaker-independent Sinhala speech recognizer. It covers the phonetic dictionary, the language model, the grammar file, the acoustic speech database and the creation of the trained acoustic model.

4.1.1 Data preparation

The system is a Sinhala voice dialer based on speech recognition. Since no previously built speech database was available, the speech had to be collected from scratch to develop the system.

Data collection

The first stage of building any speech recognizer is the collection of sound signals. The database should contain recordings from a sufficient variety of speakers, and its size depends on the task at hand. For this application only a small number of words was considered. The research targets only the written Sinhala vocabulary that is applicable to voice dialing. Altogether twelve words were considered: the ten digits and two initial calling words, "amatanna" and "katakaranna". The database has two parts, the training part and the testing part. Usually about one tenth of the full speech data is used for testing.
In this research, 3000 speech samples were used for training and 150 speech samples were used for testing.

Speech database

Before data collection, a speech database was set up. It holds Sinhala speech samples taken from a variety of people of different ages. Since no database relevant to voice dialing had been published for the Sinhala language, the speech had to be collected from native Sinhala speakers.

Prompt sheet

The first step in creating the speech database was to prepare a prompt sheet listing the sentences for all the recordings. The sheet contained 100 distinct sentences, with the numbers generated randomly. Fifty sentences start with the word "amatanna" and the other fifty start with the word "katakaranna". The prompt sheet used for this research is given in Appendix A.

Recording

The sentences in the prompt sheet were recorded by thirty (30) native speakers, since this is a speaker-independent application. The speakers were selected by age and divided into eight age groups. Seven of the groups contained four people each, two females and two males. The remaining group contained only two people, one female and one male. Each speaker was given 100 sentences to speak, so altogether 3000 speech samples were recorded for training. Details of the speakers, such as gender and age, can be found in Appendix A. If a recording contained an error caused by background noise or filler sounds, the speaker was asked to repeat it until a correct sound signal was obtained. Since the proposed system is a discrete system, the speakers had to make a short pause at the start and end of each recording and also between the words. The speech was recorded at night in a quiet room using a condenser microphone.
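The prompt sheet described above could be generated along the following lines. This is only a sketch under stated assumptions: the source does not say how many digits each sentence contained or which tool produced the sheet, so `digits_per_sentence`, the placeholder digit tokens and the function name `make_prompt_sheet` are all hypothetical.

```python
import random

CALL_WORDS = ("amatanna", "katakaranna")
# Placeholder tokens: the real sheet would use the Sinhala digit words.
DIGITS = [str(d) for d in range(10)]


def make_prompt_sheet(n_sentences=100, digits_per_sentence=10, seed=0):
    """Return n_sentences distinct prompts, half per calling word."""
    half = n_sentences // 2
    if n_sentences - half > len(DIGITS) ** digits_per_sentence:
        raise ValueError("not enough distinct digit sequences")
    rng = random.Random(seed)  # fixed seed so the sheet is reproducible
    starters = [CALL_WORDS[0]] * half + [CALL_WORDS[1]] * (n_sentences - half)
    seen = set()
    sheet = []
    for starter in starters:
        # Redraw until this (calling word, number) pair is new.
        while True:
            number = tuple(rng.choice(DIGITS) for _ in range(digits_per_sentence))
            if (starter, number) not in seen:
                break
        seen.add((starter, number))
        sheet.append(" ".join((starter,) + number))
    rng.shuffle(sheet)  # mix the two calling words on the sheet
    return sheet


if __name__ == "__main__":
    for line in make_prompt_sheet()[:5]:
        print(line)
```

With 30 speakers each reading such a 100-sentence sheet, the 3000 training samples mentioned above follow directly.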
The sounds were recorded at a sampling rate of 44.1 kHz on a mono channel and saved in *.wav format.

Sampling frequency and format of speech audio files

The speech recordings were saved in the MS WAV file format. The "Praat" software was used to convert the 44.1 kHz signals to 16 kHz, since the training samples must have a sampling frequency of 16 kHz. The audio files were of medium length, about 11 seconds each. The silence at the beginning and the end of an utterance should not exceed 0.2 seconds, so "Praat" was also used to edit all 3000 sound signals.

4.1.2 Pronunciation dictionary

The pronunciation dictionary was written by hand, since the voice dialing system uses very few words: only 12 words from the Sinhala vocabulary. The dictionary was created with the help of the International Phonetic Alphabet for the Sinhala language and dictionaries previously created for CMU Sphinx. The acoustic phones, however, were derived mostly by studying the various databases provided through the Carnegie Mellon University Sphinx Forum (CMU Sphinx Forum). Two dictionaries were implemented for this system, one for the speech utterances and one for filler sounds. The filler sounds cover the silences at the beginning, in the middle and at the end of the speech utterances. Both dictionaries can be found in Appendix A. They are referred to as the language dictionary and the filler dictionary.

4.1.3 Creating the grammar file

The grammar file was also written by hand, since the system uses very few words. It was implemented in JSGF (JSpeech Grammar Format). The grammar file can be found in Appendix A.

4.1.4 Building the language model

Word search is restricted by a language model.
The language model determines which words can follow the previously recognized words and restricts the matching process by excluding words that are not possible. N-gram language models, which contain statistics of word sequences, are the most common language models in use today; finite-state language models are the other common type. When the search space is restricted in this way, good accuracy can be obtained provided the language model is a good one, that is, provided it predicts the next word well. A language model usually restricts the word search to the words included in the vocabulary.

The language model was built using the cmuclmtk software. First, the reference text was created; this text (svd.text) can be found in Appendix A. It was written in a specific format, with each sentence delimited by <s> and </s> tags. Then the vocabulary file was generated with the following command.

    text2wfreq < svd.text | wfreq2vocab > svd.vocab

The generated vocabulary file was then edited to remove unwanted words (numbers and misspellings). When misspellings were found, they were fixed in the input reference text. The generated vocabulary file (svd.vocab) can be found in Appendix A. Then the ARPA format language model was generated using these commands.

    text2idngram -vocab svd.vocab -idngram svd.idngram < svd.text
    idngram2lm -vocab_type 0 -idngram svd.idngram -vocab svd.vocab -arpa svd.arpa

Finally, the CMU binary form of the language model (DMP file) was generated using the following command.

    sphinx_lm_convert -i svd.arpa -o svd.lm.DMP

The final output, svd.lm.DMP, is a binary file containing the language model needed for the training process.

4.1.5 Acoustic model

Before the acoustic model was created, the following file structure was arranged as described in the CMU Sphinx toolkit guide. The name of the speech database is "svd" (Sinhala Voice Dial). The content of these files is given in Appendix A.
    svd.dic                  - Phonetic dictionary
    svd.phone                - Phoneset file
    svd.lm.DMP               - Language model
    svd.filler               - List of fillers
    svd_train.fileids        - List of files for training
    svd_train.transcription  - Transcription for training
    svd_test.fileids         - List of files for testing
    svd_test.transcription   - Transcription for testing

All these files were placed in one directory named "etc". The wav speech samples were placed in another directory named "wav". These two directories were placed inside a directory named after the database (svd). Before training starts, there must be a parent directory that contains "svd" together with the required package directories "pocketsphinx", "sphinxbase" and "sphinxtrain". All the packages and the "svd" directory were put into one such directory, and the training process was started from there.

Setting up the training scripts

The scripts of the training process are run from the command prompt terminal. Before starting, the terminal was changed to the database directory "svd" and the following command was run.

    python ../sphinxtrain/scripts/sphinxtrain -t svd setup

This command copied all the required configuration files into the etc subdirectory of the database directory and prepared the database for training. The two configuration files created were feat.params and sphinx_train.cfg. Both are given in Appendix A.

Set up the database

These values were filled in at configuration time. The experiment name is used to name the model files and log files in the database.

    $CFG_DB_NAME = "svd";
    $CFG_EXPTNAME = "$CFG_DB_NAME";

Set up the format of database audio

Since the database contains speech utterances in 'wav' format that were recorded as MS WAV, the extension and the type were set to "wav" and "mswav" accordingly.
    $CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
    $CFG_WAVFILE_EXTENSION = 'wav';
    $CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw

Configure path to files

This was done automatically, provided the running directory had the right file structure. The files must be named exactly as expected. The paths were assigned to the variables used in the main training of the models.

    $CFG_DICTIONARY = "$CFG_LIST_DIR/$CFG_DB_NAME.dic";
    $CFG_RAWPHONEFILE = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";
    $CFG_FILLERDICT = "$CFG_LIST_DIR/$CFG_DB_NAME.filler";
    $CFG_LISTOFFILES = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.fileids";
    $CFG_TRANSCRIPTFILE = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.transcription";
    $CFG_FEATPARAMS = "$CFG_LIST_DIR/feat.params";

Configure model type and model parameters

Both continuous and semi-continuous model types can be used in PocketSphinx. The names refer to how the acoustic densities are modelled rather than to the speaking style: a semi-continuous model shares one codebook of Gaussians across all states, which makes it smaller and faster to decode than a fully continuous model. For this discrete-speech application, semi-continuous model training was used.

    #$CFG_HMM_TYPE = '.cont.'; # Sphinx 4, PocketSphinx
    $CFG_HMM_TYPE = '.semi.'; # PocketSphinx
    $CFG_FINAL_NUM_DENSITIES = 8;
    # Number of tied states (senones) to create in decision-tree clustering
    $CFG_N_TIED_STATES = 1000;

The last value sets the number of senones used to train the model. With more senones the model can discriminate sounds more precisely. With too many senones, however, the model may fail to recognize unseen sounds, so the word error rate on unseen data can become much higher. The approximate number of senones and number of densities is provided in the table below.

Configure sound feature parameters

The default for sound files in Sphinx is a sampling rate of 16 thousand samples per second (16 kHz). If this is the case, the etc/feat.params file is generated automatically with the recommended values.
The recommended values are:

    # Feature extraction parameters
    $CFG_WAVFILE_SRATE = 16000.0;
    $CFG_NUM_FILT = 40; # For wideband speech it's 40, for telephone 8 kHz a reasonable value is 31
    $CFG_LO_FILT = 133.3334; # For telephone 8 kHz speech the value is 200
    $CFG_HI_FILT = 6855.4976; # For telephone 8 kHz speech the value is 3500

Configure decoding parameters

The following were configured in etc/sphinx_train.cfg.

    $DEC_CFG_DICTIONARY = "$DEC_CFG_BASE_DIR/etc/$DEC_CFG_DB_NAME.dic";
    $DEC_CFG_FILLERDICT = "$DEC_CFG_BASE_DIR/etc/$DEC_CFG_DB_NAME.filler";
    $DEC_CFG_LISTOFFILES = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.fileids";
    $DEC_CFG_TRANSCRIPTFILE = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.transcription";
    $DEC_CFG_RESULT_DIR = "$DEC_CFG_BASE_DIR/result";
    # These variables, used by the decoder, have to be user defined, and
    # may affect the decoder output
    $DEC_CFG_LANGUAGEMODEL_DIR = "$DEC_CFG_BASE_DIR/etc";
    $DEC_CFG_LANGUAGEMODEL = "$DEC_CFG_LANGUAGEMODEL_DIR/${CFG_DB_NAME}.lm.DMP";

Training

After all these paths and parameters had been set in the configuration file as described above, training could proceed. The training process was started with the following command.

    python ../sphinxtrain/scripts/sphinxtrain run

The scripts launched jobs on the machine and took a few minutes to run.

Acoustic Model

After the training process, the acoustic model was located at the following path in the database directory. Only this folder is needed for the speech recognition tasks.

    model_parameters/svd.cd_semi_200

4.1.6 Testing results

150 speech samples were used as testing data. The alignment results were available after the training process, at the following path in the database directory.

    result/svd.align

4.1.7 Parameters to be optimized

Word error rate

The WER was given as a percentage value. It was calculated according to the following equation.

[Equation: WER]

Accuracy

Accuracy was also given as a percentage.
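Both measures are conventionally defined from a word-level alignment: with S substitutions, D deletions and I insertions against a reference of N words, WER = (S + D + I) / N × 100. The sketch below computes this in Python. It assumes whitespace-separated transcriptions and takes accuracy as 100 − WER, as the text describes; note that the CMU Sphinx tutorial defines accuracy as (N − D − S) / N, which ignores insertions. The function names are illustrative and are not part of SphinxTrain.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    # table[i][j] = (edits, subs, dels, ins) for ref[:i] against hyp[:j]
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (e, s, d, n)          # match, no cost
            else:
                best = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = table[i - 1][j]
            best = min(best, (e + 1, s, d + 1, n))  # deletion
            e, s, d, n = table[i][j - 1]
            best = min(best, (e + 1, s, d, n + 1))  # insertion
            table[i][j] = best
    _, subs, dels, ins = table[-1][-1]
    return subs, dels, ins


def wer_percent(reference, hypothesis):
    """Word error rate in percent: (S + D + I) / N * 100."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = align_counts(reference, hypothesis)
    return 100.0 * (subs + dels + ins) / n_words


def accuracy_percent(reference, hypothesis):
    """Accuracy taken as the complement of WER (an assumption, see lead-in)."""
    return 100.0 - wer_percent(reference, hypothesis)
```

For example, a four-word reference with one word dropped by the recognizer gives one deletion, a WER of 25% and an accuracy of 75% under this reading.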
Accuracy is the opposite of the WER. It was calculated using the following equation.

[Equation: accuracy]

To obtain an optimal recognition system, the WER should be minimized and the accuracy maximized. The parameters in the configuration file were adjusted repeatedly until the system reached its minimum WER with a high accuracy rate.

4.2 Research phase 2: Build the voice dialing mobile application

This section describes the implementation of the voice dialer as an Android mobile application. The application was developed in Java using the Eclipse IDE and was tested on both the emulator and an actual device. It is able to recognize the digits spoken by any speaker and to dial the recognized number. This requires the trained acoustic model, the pronunciation dictionary, the language model and the grammar files. Speech recognition was performed on the mobile device itself, using these models with the pocketsphinx library. Pocketsphinx is a library written in C for embedded speech recognition, and it can be used on the Android platform. The implementation and the integration of the necessary components are discussed step by step in this section.

Resource Files

The resource files were added to the assets/ directory of the Android project. The physical path was then given to make them available to pocketsphinx. After they were added, the assets directory contained the following resource files.
Dictionary

    svd.dic
    svd.dic.md5

Grammar

    digits.gram
    digits.gram.md5
    menu.gram
    menu.gram.md5

Language model

    svd.lm.DMP
    svd.lm.DMP.md5

Acoustic Model

    feat.params
    feat.params.md5
    mdef
    mdef.md5
    means
    means.md5
    mixture_weights
    mixture_weights.md5
    noisedict
    noisedict.md5
    transition_matrices
    transition_matrices.md5
    variances
    variances.md5

Assets.lst

    models/dict/svd.dic
    models/grammar/digits.gram
    models/grammar/menu.gram
    models/hmm/en-us-semi/feat.params
    models/hmm/en-us-semi/mdef
    models/hmm/en-us-semi/means
    models/hmm/en-us-semi/mixture_weights
    models/hmm/en-us-semi/noisedict
    models/hmm/en-us-semi/sendump
    models/hmm/en-us-semi/transition_matrices
    models/hmm/en-us-semi/variances
    models/lm/svd.lm.DMP

Setup the Recognizer

First, the recognizer is set up by adding the resource files. The model parameters obtained from the training process were added to the application as the HMM. The recognition process depends mainly on these resource files. Since the grammar files and the language model were added as assets, they can be used for recognition alongside the HMM. Utterances can be recognized using either the grammar files or the language model. The whole process is coded in the Java programming language.

4.3 Architecture of the developed Speech Recognition System