ㅇㅇ: Speech Recognition Notes (Kaldi) 17

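"To the original author"

I did not know how to contact you, so I regretfully translated this post without permission.
I would like to obtain formal permission for the translation.
Please contact me at gogyzzz@gmail.com.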

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/06/24/111611

Kaldi runs its commands from Bash scripts.

This time I want to take a look at those scripts.

The directory layout downloaded from GitHub is as follows.

egs (the part I want to look into)
src (source code)
misc (papers, etc.? not verified)
tools (external tools: OpenFST, ATLAS, etc.)
windows (for Windows OS)

For details, see the explanation on the official Kaldi site (the "Kaldi directories structure" section).

"egs" contains example scripts for each supported corpus.

egs – example scripts allowing you to quickly build ASR systems for over 30 popular speech corporas (documentation is attached for each project),

What should you do if you have prepared your own speech data?

According to the official Kaldi tutorial, you can use the scripts in "egs/wsj/s5".

From "Project finalization -> Tools attachment":

From kaldi-trunk/egs/wsj/s5 copy two folders (with the whole content) - utils and steps - and put them in your kaldi-trunk/egs/digits directory.
You can also create links to these directories.
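The linking option mentioned in the quote can be sketched as follows. This is a minimal illustration, not the tutorial's own script: it builds a throwaway directory tree under a temporary KALDI_ROOT (an assumption, so that the snippet runs without a Kaldi checkout) and links "steps" and "utils" from "egs/wsj/s5" into a "digits" project directory.

```bash
#!/usr/bin/env bash
set -eu

# Assumption: a throwaway tree standing in for a real Kaldi checkout.
# With a real checkout, set KALDI_ROOT to its path and skip the mkdir lines.
KALDI_ROOT=$(mktemp -d)
mkdir -p "$KALDI_ROOT/egs/wsj/s5/steps" "$KALDI_ROOT/egs/wsj/s5/utils"
mkdir -p "$KALDI_ROOT/egs/digits"

cd "$KALDI_ROOT/egs/digits"
for d in steps utils; do
  # Relative target: from egs/digits, ../wsj/s5 reaches the wsj recipe.
  ln -s "../wsj/s5/$d" "$d"
  echo "$d -> $(readlink "$d")"
done
```

Relative link targets keep working when the whole Kaldi tree is moved, which is presumably why the bundled recipes link this way too.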

"wsj" appears to refer to the Wall Street Journal corpus (read sentences taken from WSJ news text).

From egs/wsj/README.txt:

About the Wall Street Journal corpus:
This is a corpus of read sentences from the Wall Street Journal, recorded under clean conditions.
The vocabulary is quite large. About 80 hours of training data.
Available from the LDC as either: [ catalog numbers LDC93S6A (WSJ0) and LDC94S13A (WSJ1) ]
or: [ catalog numbers LDC93S6B (WSJ0) and LDC94S13B (WSJ1) ]
....

In the directories of other corpora (for example "egs/rm/s5/steps"), "steps" is a symbolic link to "egs/wsj/s5/steps".

/opt/kaldi/egs/rm/s5% ls -l steps
lrwxrwxrwx 1 ichou1 ichou1 18 Feb  5 19:46 steps -> ../../wsj/s5/steps
/opt/kaldi/egs/rm/s5% file steps 
steps: symbolic link to `../../wsj/s5/steps' 
/opt/kaldi/egs/rm/s5%

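For those who have no corpus but still want to try Kaldi, "egs/yesno" is provided.

It comes with the audio data (.wav), so it can be used right away.

(Each file contains eight utterances, each either "YES" or "NO", in a varying pattern. There are 31 files for training and 29 for testing.)

From egs/yesno/README: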

The "yesno" corpus is a very small dataset of recordings of one individual saying yes or no multiple times per recording, in Hebrew.

From egs/yesno/s5/waves_yesno/README:

The archive "waves_yesno.tar.gz" contains 60 .wav files, sampled at 8 kHz. 
All were recorded by the same male speaker, in English (although the individual is not a native speaker).
In each file, the individual says 8 words; 
each word is either "yes" or "no", so each file is a random sequence of 8 yes-es or noes.
There is no separate transcription provided; 
the sequence is encoded in the filename, with 1 for yes and 0 for no, for instance:
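The filename convention described in the README can be decoded with a few lines of shell. The function name decode_yesno_name is hypothetical (it is not part of the Kaldi recipe), and the sketch assumes names of the form 1_0_0_0_0_0_0_0.wav, the example file used later in this post.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of Kaldi): turn a yesno filename into
# the word sequence it encodes, with 1 for YES and 0 for NO.
decode_yesno_name() {
  local base
  base=$(basename "$1" .wav)   # strip the directory and the .wav suffix
  local words=()
  local IFS='_'                # split the basename on underscores
  local bit
  for bit in $base; do
    case "$bit" in
      1) words+=("YES") ;;
      0) words+=("NO") ;;
      *) echo "unexpected token: $bit" >&2; return 1 ;;
    esac
  done
  IFS=' '                      # join the words with single spaces
  echo "${words[*]}"
}

decode_yesno_name egs/yesno/s5/waves_yesno/1_0_0_0_0_0_0_0.wav
# prints: YES NO NO NO NO NO NO NO
```

Because the transcription can be derived from the name alone, no separate transcript file is needed for this corpus.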

How to run

```bash
cd egs/yesno/s5
./run.sh
```

### What happens internally

- Data preparation

--> Runs "local/prepare_data.sh", "local/prepare_dict.sh", "utils/prepare_lang.sh", and "local/prepare_lm.sh"

- Feature extraction

--> Runs "steps/make_mfcc.sh", "steps/compute_cmvn_stats.sh", and "utils/fix_data_dir.sh"

("steps" and "utils" are links to "egs/wsj/s5/steps" and "egs/wsj/s5/utils")

- Mono training

--> Runs "steps/train_mono.sh"

- Graph compilation

--> Runs "utils/mkgraph.sh"

- Decoding (recognition)

--> Runs "steps/decode.sh"

The WER (word error rate) is printed to the console where the script runs.

The decoding result can be checked in egs/yesno/s5/exp/mono0a/decode_test_yesno/log/decode.1.log.

### Example: recognition result for "egs/yesno/s5/waves_yesno/1_0_0_0_0_0_0_0.wav"

```plaintext
1_0_0_0_0_0_0_0 YES NO NO NO NO NO NO NO
```

With "/opt/kaldi" as the top-level directory, the commands for running the decoder directly are shown below (the result is written to standard output in text form).

### decode (without a lattice)

```bash
/opt/kaldi/src/gmmbin/gmm-decode-faster \
  --word-symbol-table=/opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/words.txt \
  /opt/kaldi/egs/yesno/s5/exp/mono0a/40.mdl \
  /opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/HCLG.fst \
  "ark,s,cs:/opt/kaldi/src/featbin/apply-cmvn --utt2spk=ark:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/utt2spk scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/cmvn.scp scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/feats.scp ark:- | /opt/kaldi/src/featbin/add-deltas ark:- ark:- |" \
  ark,t:-
```

For the parameters being passed, see the previous post.

### Result (without a lattice)

```plaintext
1_0_0_0_0_0_0_0 3 2 2 2 2 2 2 2
1_0_0_0_0_0_0_0 YES NO NO NO NO NO NO NO
LOG (gmm-decode-faster[5.3.106~1389-9e2d8]:main():gmm-decode-faster.cc:196) Log-like per frame for utterance 1_0_0_0_0_0_0_0 is -8.37946 over 668 frames.
```

In words.txt, "3" corresponds to "YES" and "2" to "NO".

### decode (with a lattice)

```bash
/opt/kaldi/src/gmmbin/gmm-latgen-faster \
  --word-symbol-table=/opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/words.txt \
  /opt/kaldi/egs/yesno/s5/exp/mono0a/final.mdl \
  /opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/HCLG.fst \
  "ark,s,cs:/opt/kaldi/src/featbin/apply-cmvn --utt2spk=ark:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/utt2spk scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/cmvn.scp scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/feats.scp ark:- | /opt/kaldi/src/featbin/add-deltas ark:- ark:- |" \
  ark,t:-
```

### Result (with a lattice)

```plaintext
1_0_0_0_0_0_0_0 YES NO NO NO NO NO NO NO
1_0_0_0_0_0_0_0
0 1 3 9.34174, 10746.4, 4111111618
1 2 2 3.00029, 3604.42, 151515151515
2 3 2 3.75534, 460.406, 2929
3 4 2 6.37105, 626.19,
4 5 2 5.32006, 589.474,
5 6 2 5.67636, 4377.79,
6 7 2 5.32006, 596.049,
7 8 2 4.3186, 6239.1,
8 9 2 5.85963, 5268.64, 29292929
8 9.50533, 28208.8, 292929294111
9 7.30095, 22958.9, 2628304161515_

LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance 1_0_0_0_0_0_0_0 is -8.37946 over 668 frames.
```

### Model (egs/yesno/s5/exp/mono0a/final.mdl) in text form

```plaintext
2 3
0 0 0 0.75 1 0.25
1 1 1 0.75 2 0.25
2 2 2 0.75 3 0.25
3
1
0 0 0 0.25 1 0.25 2 0.25 3 0.25
1 1 1 0.25 2 0.25 3 0.25 4 0.25
2 2 1 0.25 2 0.25 3 0.25 4 0.25
3 3 1 0.25 2 0.25 3 0.25 4 0.25
4 4 4 0.75 5 0.25
5
11
1 0 0
1 1 1
1 2 2
1 3 3
1 4 4
2 0 5
2 1 6
2 2 7
3 0 8
3 1 9
3 2 10
[ 0 -0.3016863 -4.60517 -2.116771 -2.040137 -0.05096635 -4.60517 -3.516702 -4.60517 -4.60517 -0.09362812 -2.668062 -4.60517 -4.60517 -4.60517 -0.1123881 -2.449803 -0.04502614 -3.122941 -0.3431785 -1.236192 -0.1315082 -2.09372 -0.07189104 -2.668334 -0.1359556 -2.062634 -0.09793975 -2.371973 -0.04792399 -3.062005 ]
39 11
[ -162.6711 -100.3258 -150.894 -774.145 ]
[ 0.02608728 0.03167231 0.03214631 0.03326807 0.01074118 ]
[ -3.798081 -5.357131 0.8406813 0.918729 1.014658 "snip"
  0.5328674 1.181959 -0.6352269 -0.7017035 -0.06531551 "snip" ]
[ 0.2399497 0.4042536 0.2387805 0.09193342 0.04029746 "snip"
  0.282881 0.1213772 0.07582887 0.03232023 0.03635461 "snip" ]
"snip" ( 10 times repeat )
```

"YES" (/jes/) is not split into context-dependent pieces such as "j-e+s"; it is treated as a single unit. (This is kept simple because it is only an example.)

### phone transcriptions (egs/yesno/s5/data/local/dict/lexicon.txt)

```plaintext
<SIL> SIL
YES Y
NO N
```
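The integer-to-word mapping noted earlier ("3" is "YES", "2" is "NO") can be sketched as a small shell function. Kaldi recipes normally use utils/int2sym.pl for this step; the awk version below is only a stand-in for illustration. The function name ids_to_words is hypothetical, and the table lists just the two entries stated in this post rather than the full contents of words.txt.

```bash
#!/usr/bin/env bash

# Hypothetical helper: map the integer IDs on each "utt-id id id ..." line
# read from stdin to words, using a "word id" table like Kaldi's words.txt.
ids_to_words() {
  awk '
    NR == FNR { word[$2] = $1; next }   # first file: the symbol table
    {
      line = $1                         # keep the utterance ID unchanged
      for (i = 2; i <= NF; i++)
        line = line " " (($i in word) ? word[$i] : "?" $i)
      print line
    }
  ' "$1" -
}

# Illustrative table: only the two entries stated in this post are listed.
table=$(mktemp)
printf 'NO 2\nYES 3\n' > "$table"

echo "1_0_0_0_0_0_0_0 3 2 2 2 2 2 2 2" | ids_to_words "$table"
# prints: 1_0_0_0_0_0_0_0 YES NO NO NO NO NO NO NO
```

IDs missing from the table are printed with a leading "?" instead of being dropped, so a mismatch between the table and the decoder output stays visible.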

Sunday, July 14, 2019

Speech Recognition Notes (Kaldi) 17 - Toolkit script


decision tree description (egs/yesno/s5/exp/mono0a/tree)
