Friday, November 17, 2017

Attention-based model for speech recognition

Friday, November 3, 2017

Summary of "Combining Residual Networks with LSTMs for Lipreading"

Vocabulary to know

Viseme
Any of a group of speech sounds that look the same, for example when lipreading.
Pronounced roughly "viz-eem" (written 비슴 in Korean).
Phonemes that cannot be told apart visually

HMM-based lipreading with hand-engineered features [11, 12, 13, 14]
Spatiotemporal descriptors such as active appearance models and optical flow, combined with SVMs [16]
???[17, 18]
A DBN improved lipreading performance by 21% [22]
Features from a deep autoencoder were concatenated with DCT features to train an LSTM [23]
End-to-end LSTM; outperformed previous methods on the GRID database [3]
LSTM combined with CTC; 95.2% recognition rate on GRID [1]
Attention-based encoder-decoder; 97% on the GRID database [2]

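Lipreading algorithms that outperform (human) lipreading experts on limited-vocabulary databases [1][2]
Two approaches to lipreading
Modeling words [3][4]
- Suited to isolated word recognition
Modeling visemes [1][2]
- Suited to LVCSR
Recent work suggests that word models can also work well for LVCSR [7, 8, 9]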

Problem

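The words in the database are not isolated; they occur inside utterances.
The recognizer therefore has to learn to tell the target word apart from the words it should ignore within a sentence.
Word boundaries are not given to the recognizer during training.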

Proposed

6.8% better than the state of the art in word-level lipreading
It is a word model, but viseme-level recognition is also possible at the softmax layer (how??)
Composed of three sub-networks
front-end
- Applies spatiotemporal convolution to the frame sequence
ResNet
Bi-LSTM
(final layer) softmax

LRW Database

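Based on BBC video
Very diverse speakers and poses
Far more target words (500) than other databases
Includes viseme-level confusable words such as singular/plural pairs (23 pairs) and present/past pairs (4 pairs)
The database was built fully automatically
Ground-truth text was obtained by OCR of the BBC subtitles
The words are not isolated; they occur inside utterances
- The recognizer therefore has to learn to tell the target word apart from the words it should ignore within a sentence.
- Word boundaries are not given to the recognizer during training.
Number of clips per word
Training set: 1000
Validation and test sets: 50 each
Fixed duration: 1.28 sec at a 25 fps frame rate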

Deep learning modeling and preprocessing

Facial landmarks and data augmentation

Uses a 2D version of the method from [27, 28]
66 facial landmarks
Crop size: 112 x 112
I am not sure exactly what the following means.
- A common cropping is applied to all frames of a given clip, using the median coordinates of each landmark. The frames are transformed to grayscale and are normalized with respect to the overall mean and variance.
Data augmentation is performed during training (I am not sure what this means)
Random cropping (±5 pixels) and horizontal flips, applied to all frames of a given clip
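
One way to read "applied to all frames of a given clip" is that the crop offset and the flip decision are drawn once per clip and then reused for every frame, so the mouth motion stays consistent. The sketch below illustrates that reading only. The `Frame` and `ClipAugmentation` types and all function names are hypothetical, and it assumes the input frames are slightly larger than the crop so that a ±5 pixel shift stays inside the image.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// One grayscale frame stored row-major: pixels[y * width + x].
struct Frame {
    int width = 0;
    int height = 0;
    std::vector<float> pixels;
};

// Augmentation parameters drawn once per clip (hypothetical struct).
struct ClipAugmentation {
    int dx = 0;         // horizontal crop shift in pixels
    int dy = 0;         // vertical crop shift in pixels
    bool flip = false;  // mirror left-right
};

// Draw a shift in [-maxShift, +maxShift] for each axis and a fair coin for the flip.
ClipAugmentation sampleAugmentation(std::mt19937& rng, int maxShift = 5) {
    std::uniform_int_distribution<int> shift(-maxShift, maxShift);
    std::bernoulli_distribution coin(0.5);
    ClipAugmentation aug;
    aug.dx = shift(rng);
    aug.dy = shift(rng);
    aug.flip = coin(rng);
    return aug;
}

// Crop a cropSize x cropSize window centred on the frame centre plus (dx, dy),
// then mirror it horizontally if requested.
Frame cropAndFlip(const Frame& in, int cropSize, const ClipAugmentation& aug) {
    const int x0 = (in.width - cropSize) / 2 + aug.dx;
    const int y0 = (in.height - cropSize) / 2 + aug.dy;
    assert(x0 >= 0 && y0 >= 0);
    assert(x0 + cropSize <= in.width && y0 + cropSize <= in.height);

    Frame out;
    out.width = cropSize;
    out.height = cropSize;
    out.pixels.resize(static_cast<std::size_t>(cropSize) * cropSize);
    for (int y = 0; y < cropSize; ++y) {
        for (int x = 0; x < cropSize; ++x) {
            const int srcX = aug.flip ? x0 + cropSize - 1 - x : x0 + x;
            const std::size_t dst = static_cast<std::size_t>(y) * cropSize + x;
            const std::size_t src = static_cast<std::size_t>(y0 + y) * in.width + srcX;
            out.pixels[dst] = in.pixels[src];
        }
    }
    return out;
}

// Apply the same augmentation to every frame of the clip.
std::vector<Frame> augmentClip(const std::vector<Frame>& clip, int cropSize,
                               const ClipAugmentation& aug) {
    std::vector<Frame> out;
    out.reserve(clip.size());
    for (const Frame& frame : clip) out.push_back(cropAndFlip(frame, cropSize, aug));
    return out;
}
```

Calling `sampleAugmentation` once per clip per epoch and passing the result to `augmentClip` would give a different crop/flip each time a clip is seen, which is probably what "during training" refers to.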

The block diagram of proposed network

Spatiotemporal front-end

Even without an RNN, a spatiotemporal conv. layer performs well at detecting short-duration motion [1]
Number of parameters: ~16K
See Fig. 2

Residual Network

Uses the 34-layer version proposed for ImageNet [31]
Pretrained models are not used, because they were trained for tasks such as ImageNet or CIFAR.
Number of parameters: ~21M

Bidirectional LSTM back-end and optimization criterion

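Number of parameters: ~2.4M
Setting visemes aside, there are several options for the optimization criterion
1. Attach a softmax layer to the last LSTM output and train with BPTT
2. Apply the criterion at every time step.
- Similar to speech recognition with LSTMs
- Instead of phoneme/viseme labels, the word label is repeated at every time step (every frame)
- This is because the word boundaries are unknown
- This works with a Bi-LSTM, because the hidden state must have seen the whole video sequence
Approach 2 is 3% better
- The total loss is the loss aggregated over all time steps
- i.e. the sum of the negative logs of the word posteriors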
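
A minimal sketch of the criterion in approach 2, assuming the softmax outputs are already available as a `posteriors[t][w]` table (time step `t`, word `w`). The function name and the data layout are my own choices, not taken from the paper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// posteriors[t][w]: softmax output for word w at time step t.
// The same word label is used at every time step, so the loss is
// the sum over t of -log p_t(targetWord).
double aggregatedWordLoss(const std::vector<std::vector<double>>& posteriors,
                          std::size_t targetWord) {
    assert(!posteriors.empty());
    double loss = 0.0;
    for (const std::vector<double>& frame : posteriors) {
        assert(targetWord < frame.size());
        loss += -std::log(frame[targetWord]);
    }
    return loss;
}
```

Approach 1 would correspond to evaluating only the last row of `posteriors` instead of summing over all rows.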

Implementation details

Titan X
SGD training with 0.9 momentum
BN applied to all conv. and linear layers except the softmax layer
Dropout is not used (it is not part of the ResNet recipe)
Initial lr: 0.0005 / final lr: 0.00005 (decreased on a log scale)
Training stops if validation performance does not improve for 3 epochs
Converges in 15~20 epochs
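
The schedule and stopping rule above can be sketched as follows. This assumes "decreased on a log scale" means geometric interpolation between the initial and final learning rates over a fixed number of epochs, and that "performance" is validation accuracy. Both are my assumptions, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Learning rate that moves from initialLr to finalLr in equal steps on a
// log scale (geometric interpolation). epoch is 0-based.
double learningRateAt(int epoch, int totalEpochs,
                      double initialLr = 5e-4, double finalLr = 5e-5) {
    assert(totalEpochs > 1 && epoch >= 0 && epoch < totalEpochs);
    const double fraction = static_cast<double>(epoch) / (totalEpochs - 1);
    return initialLr * std::pow(finalLr / initialLr, fraction);
}

// Early stopping: true when the best validation accuracy so far is at least
// `patience` epochs old, i.e. no improvement for `patience` epochs.
bool shouldStop(const std::vector<double>& valAccuracy, int patience = 3) {
    if (valAccuracy.empty()) return false;
    std::size_t best = 0;
    for (std::size_t i = 1; i < valAccuracy.size(); ++i) {
        if (valAccuracy[i] > valAccuracy[best]) best = i;
    }
    return valAccuracy.size() - 1 - best >= static_cast<std::size_t>(patience);
}
```

With 21 epochs, for example, the midpoint (epoch 10) would sit at 0.0005 x sqrt(0.1), roughly 0.00016, rather than at the arithmetic mean.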

How the whole system is trained

1) Use a temporal conv. back-end instead of the Bi-LSTM
2) Once 1) has converged, remove the temporal conv. back-end and attach the Bi-LSTM for training
Train for 5 epochs with the front-end and ResNet weights frozen
3) Train 2) end-to-end
The temporal conv. back-end and the Bi-LSTM back-end were compared

Experiments

Baseline results

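The current state of the art is the multi-tower VGG-M [4]
Top-1: accuracy when the word with the best score is taken as the prediction
Top-N: accuracy when a prediction counts as correct if the true word is among the N best scores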
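
The two metrics can be written down directly. This is a small sketch with hypothetical function names, assuming one score vector per test clip and treating ties in favour of the true word.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if fewer than n words score strictly higher than the true word,
// i.e. the true word is among the n best scores.
bool inTopN(const std::vector<double>& scores, std::size_t trueWord, std::size_t n) {
    assert(trueWord < scores.size());
    std::size_t better = 0;
    for (double s : scores) {
        if (s > scores[trueWord]) ++better;
    }
    return better < n;
}

// Fraction of clips whose true word is among the n best scores.
// n = 1 gives Top-1 accuracy.
double topNAccuracy(const std::vector<std::vector<double>>& allScores,
                    const std::vector<std::size_t>& labels, std::size_t n) {
    assert(allScores.size() == labels.size() && !labels.empty());
    std::size_t hits = 0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (inTopN(allScores[i], labels[i], n)) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(labels.size());
}
```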

Results using proposed network

N1: 2D conv. ---> ResNet ---> temporal conv.
BN--ReLU--MaxPooling between stages (size reduced to 1/2)
N2: 3D conv. ---> ResNet ---> temporal conv.
N3: 2D conv. ---> DNN ---> temporal conv.
Experiment to assess the contribution of the ResNet
The DNN was configured to have the same number of parameters as the ResNet. Even with the same parameter count, the ResNet performs better.

N4: 3D conv. ---> ResNet ---> Bi-LSTM
N5: 3D conv. ---> ResNet ---> 2 layers Bi-LSTM

N6: 3D conv. ---> ResNet ---> 2 layers Bi-LSTM
End-to-end training with the N5 weights as the starting point
The concatenate layer in Fig. 2 is an addition layer in this variant
(Does this mean N5 was not trained end-to-end, i.e. the front-end and ResNet were frozen and only the Bi-LSTM was trained?)
N7: 3D conv. ---> ResNet ---> 2 layers Bi-LSTM (concatenate layer instead of addition layer)
Almost the same structure as N6, and also trained end-to-end. N6's addition layer is replaced by a concatenate layer (as in Fig. 2)

Discussion and error analysis

baseline (VGG-M) vs. N1
- 8.5% improvement
N1 vs. N2 (3D conv.)
3D conv. emphasizes short-term motion (why?), giving a 5.0% improvement
And so on: performance improves each time a component is added

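Words containing many phonemes are recognized well
Words containing many visemes are often misrecognized
Errors occur mostly on words that undergo coarticulation with the preceding and following words
Visemes at the very beginning and very end are hard to recognize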

Sunday, October 22, 2017

arrayfire batch multiplication implementation

github repo

arrayfire batch matmul

Batched matrix multiplication for ArrayFire using cuBLAS.

supported data types and structure

The function names are af::matmul3CNN, af::matmul3CTN, af::matmul3CNT.

The '3' means "third dimension".

'N' means "normal" (no transpose), 'T' means "Hermitian (conjugate) transposed".

See the cuBLAS gemm documentation for details.

The batched multiplication computes the following, one matrix product per slice along the third dimension.

A (input matrix): (n x m x l)
B (input matrix): (m x k x l)
C (output matrix): (n x k x l) <= matmul3CXX(A x B)

af::matmul3CXX takes 3 matrix arguments: A, B, C.

All three matrices must be allocated beforehand.

|   | type                | dimension   |
|---|---------------------|-------------|
| A | af::array(c32, c64) | (n x m x l) |
| B | af::array(c32, c64) | (m x k x l) |
| C | af::array(c32, c64) | (n x k x l) |
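
To make the shapes concrete, here is a plain C++ reference of what the 'NN' case computes, written without ArrayFire or cuBLAS. It is a sketch for checking results on small inputs, not the library's implementation. The function names are hypothetical, and it assumes column-major storage (which ArrayFire uses) with the batch index as the slowest dimension.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Column-major offset of element (row, col) in slice `batch` of a
// (rows x cols x batches) array.
std::size_t offset3(std::size_t row, std::size_t col, std::size_t batch,
                    std::size_t rows, std::size_t cols) {
    return row + rows * (col + cols * batch);
}

// C(:, :, b) = A(:, :, b) * B(:, :, b) for every slice b.
// A is (n x m x l), B is (m x k x l), the result is (n x k x l).
std::vector<cfloat> batchMatmulReference(const std::vector<cfloat>& A,
                                         const std::vector<cfloat>& B,
                                         std::size_t n, std::size_t m,
                                         std::size_t k, std::size_t l) {
    assert(A.size() == n * m * l);
    assert(B.size() == m * k * l);
    std::vector<cfloat> C(n * k * l, cfloat(0.0f, 0.0f));
    for (std::size_t b = 0; b < l; ++b) {
        for (std::size_t col = 0; col < k; ++col) {
            for (std::size_t row = 0; row < n; ++row) {
                cfloat sum(0.0f, 0.0f);
                for (std::size_t inner = 0; inner < m; ++inner) {
                    sum += A[offset3(row, inner, b, n, m)] *
                           B[offset3(inner, col, b, m, k)];
                }
                C[offset3(row, col, b, n, k)] = sum;
            }
        }
    }
    return C;
}
```

With A = {1, 2, 3, 4} and B = {5, 6, 7, 8} in each of two slices, every output slice is {23, 34, 31, 46} in column-major order, which is the 2 x 2 matrix with rows (23, 31) and (34, 46).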

precaution

Transposing real-valued matrices is not supported.

example

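#include <arrayfire.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
// Also include the header from the repository that declares af::matmul3CNN.

int main(void)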
{
    af::setBackend(AF_BACKEND_CUDA);

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaError_t cudaStat;

    cudaStat = cudaSetDevice(0);
    if (cudaStat != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?\n");
        cublasDestroy(handle);
        return 1;
    }

    float Adata[4] = { 1, 2, 3, 4 };
    float Bdata[4] = { 5, 6, 7, 8 };

    auto A = af::array(2, 2, Adata, afHost).as(c32);
    auto B = af::array(2, 2, Bdata, afHost).as(c32);

    af_print(A);
    af_print(B);
    af_print(af::matmul(A, B));

    /*
    expect
        (23.0000,0.0000)          (31.0000,0.0000)
        (34.0000,0.0000)          (46.0000,0.0000)
    */

    // batched matmul via cuBLAS (matmul3CNN)

    auto batchA = af::tile(A, 1, 1, 2);
    auto batchB = af::tile(B, 1, 1, 2);
    auto batchC = af::constant(0.0f, 2, 2, 2, c32);

    af_print(af::matmul3CNN(handle, batchA, batchB, batchC)); // it returns batchC.

    /*
    expect
         (23.0000,0.0000)          (31.0000,0.0000)
         (34.0000,0.0000)          (46.0000,0.0000)
         (23.0000,0.0000)          (31.0000,0.0000)
         (34.0000,0.0000)          (46.0000,0.0000)

    */
    cublasDestroy(handle); // release the cuBLAS handle created above
    return 0;
}

Sunday, September 24, 2017

Let's take apart Kaldi's copy-feats program as a learning exercise

I searched with the queries below.

how to use kaldi library
how to use kaldi library on source code
kaldi library 日本語
kaldi 日本語

None of the results were useful.

I will have to read the source code myself.

Let's study the copy-feats source code, then modify it slightly and try to build it.

The source code is here:

https://github.com/kaldi-asr/kaldi/blob/master/src/featbin/copy-feats.cc

I printed it out and took notes on paper.

Let's find where the build of copy-feats (with gcc) is defined.

src/featbin/Makefile

This is the Makefile that builds copy-feats.

To do: capture the Makefile contents and annotate them here.

This is the part that builds SUBDIRS. For the .PHONY target, I referred to the page below.

http://pinocc.tistory.com/131

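Since everything there gets built, I made a directory of my own, src/hybin, and moved copy-feats.cc into it.

I looked into whether make can be told explicitly to use a file other than Makefile.
(Makefile may simply be the default.)

When a target name is passed to make directly, make starts from that target.

So it is enough to add a separate target to the existing src/Makefile.

I created src/hybin/Makefile.

Build copy-feats and check that it works.

Let's report the typo below.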

http://kaldi-asr.org/doc/group__table__group.html#ga0b3ec62400216b77be81739825c9ce28

Among the arguments listed there, wxfilename should be changed to rxfilename.

Friday, September 22, 2017

Git usage notes

Monday, September 11, 2017

Some features of arrayfire

http://www.evernote.com/l/AdxpZ1EBWE9Ey7Nb2Hy6G6P6hbycLDdQOKY/

JIT compile

lazy evaluation

some tips

Thursday, September 7, 2017

Complex double binary to matlab converter

[Github repository](https://github.com/gogyzzz/complex_double_binary_to_matlab)

You can easily load MATLAB complex data in C++ (and ArrayFire), and C++ complex data in MATLAB.

The binary file layout follows the 'interleave real and imaginary components' approach described in the [MATLAB R/W COMPLEX NUMBER GUIDE](https://www.mathworks.com/examples/matlab/mw/matlab-ex14013084-write-and-read-complex-numbers).

The binary data is stored as 'complex double' values.
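
To illustrate the layout, the sketch below converts between a vector of `std::complex<double>` and the flat sequence of doubles (re0, im0, re1, im1, ...) that would be written to the file. The helper names are hypothetical and are not the functions provided by this repository. File I/O is left out on purpose.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Flatten complex values into re0, im0, re1, im1, ... (one double each).
std::vector<double> interleaveComplex(const std::vector<std::complex<double>>& values) {
    std::vector<double> raw;
    raw.reserve(values.size() * 2);
    for (const std::complex<double>& v : values) {
        raw.push_back(v.real());
        raw.push_back(v.imag());
    }
    return raw;
}

// Rebuild complex values from an interleaved sequence of doubles.
std::vector<std::complex<double>> deinterleaveComplex(const std::vector<double>& raw) {
    assert(raw.size() % 2 == 0);
    std::vector<std::complex<double>> values;
    values.reserve(raw.size() / 2);
    for (std::size_t i = 0; i < raw.size(); i += 2) {
        values.emplace_back(raw[i], raw[i + 1]);
    }
    return values;
}
```

Writing the `raw` vector to disk as consecutive 8-byte doubles would give a file that stores each element as a (real, imaginary) pair.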

### input data type

The input must therefore be complex double data (or be converted to complex double).

When you use an ArrayFire matrix, however, you can pass a matrix of any type.

### output data type

- complex double in C++ function
- c64 in ArrayFire

## References
- [MATLAB R/W COMPLEX NUMBER GUIDE](https://www.mathworks.com/examples/matlab/mw/matlab-ex14013084-write-and-read-complex-numbers)
- [Arrayfire - Getting Started](http://arrayfire.org/docs/gettingstarted.htm)

## example in C++

```cpp
// Read a 2 x 2 complex double matrix from an interleaved binary file.
std::vector<std::complex<double>> a = read_complex64_binary("testmat.complex.interleaved.bin", 2, 2);

int nsamples = a.size();

std::cout << "check output\n";

std::copy(a.begin(), a.end(), std::ostream_iterator<std::complex<double>>(std::cout, "\n"));

std::cout << "check output from pointer\n";

// Copy the data into a raw buffer to show the pointer-based interface.
std::complex<double>* ptr = new std::complex<double>[nsamples];

memcpy(ptr, a.data(), nsamples * sizeof(std::complex<double>));

std::copy(&ptr[0], &ptr[nsamples], std::ostream_iterator<std::complex<double>>(std::cout, " "));

std::cout << std::endl << std::endl;

// The file is written as testmat.complex.interleaved.bin.

write_complex64_binary(ptr, "testmat", 2, 2);

// When loading a binary file, pass the full file name.

a = read_complex64_binary("A.complex.interleaved.bin", 3, 3, 3);

std::copy(a.begin(), a.end(), std::ostream_iterator<std::complex<double>>(std::cout, "\n"));

delete[] ptr;
```

## example in ArrayFire

```cpp
array afmat = randn(3, 3, 3, f32);

af_print(afmat);

std::string realfilename = write_complex64_binary(afmat, "testmat");

std::cout << "the real file name is " << realfilename << std::endl;

af_print(read_complex64_binary("A.complex.interleaved.bin", dim4(3, 3, 3, 1)));

system("pause");

return 0;
```

WaveManager: C/C++ buffer <-> PCM format wav file converter

[Github repository](https://github.com/gogyzzz/cpp_butter_to_wave)

## Supported buffer types

short, float, double

## Supported data structures

1-dimensional PCM buffer \[samplelength\], with channels interleaved as $0ch, 1ch, 0ch, 1ch, ...$ (maximum 8 channels)

2-dimensional PCM buffer \[ichannel\]\[samplelength\]
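
The relationship between the two layouts can be sketched as below: the 1-dimensional buffer holds one sample per channel for each time index in turn, and the 2-dimensional buffer holds one array per channel. `deinterleaveChannels` is a hypothetical helper for illustration and is not part of WaveManager.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert an interleaved buffer (0ch, 1ch, 0ch, 1ch, ...) into
// one vector per channel: out[channel][sampleIndex].
template <typename Sample>
std::vector<std::vector<Sample>> deinterleaveChannels(const std::vector<Sample>& interleaved,
                                                      std::size_t channels) {
    assert(channels > 0 && interleaved.size() % channels == 0);
    const std::size_t frames = interleaved.size() / channels;
    std::vector<std::vector<Sample>> out(channels, std::vector<Sample>(frames));
    for (std::size_t frame = 0; frame < frames; ++frame) {
        for (std::size_t ch = 0; ch < channels; ++ch) {
            out[ch][frame] = interleaved[frame * channels + ch];
        }
    }
    return out;
}
```

The template parameter covers the three supported sample types (short, float, double).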

## Examples

```cpp
printf("************************ Test Save function ************************\n");

// Statically allocated 1-D buffer
TYPE arr1dStatic[SAMPLESIZE];
for (int i = 0; i < SAMPLESIZE; i++)
{
    arr1dStatic[i] = TYPE(sin(i) * 0.95);
}

// Statically allocated 2-D buffer: [channel][sample]
TYPE arr2dStatic[NCH][SAMPLESIZE];
for (int j = 0; j < NCH; j++)
    for (int i = 0; i < SAMPLESIZE; i++)
    {
        arr2dStatic[j][i] = TYPE(sin(i) * 0.95);
    }

// Dynamically allocated 1-D buffer
TYPE* arr1dDyna = new TYPE[SAMPLESIZE];
for (int i = 0; i < SAMPLESIZE; i++)
    arr1dDyna[i] = TYPE(sin(i) * 0.95);

// Dynamically allocated 2-D buffer: [channel][sample]
TYPE** arr2dDyna = new TYPE*[NCH];
for (int j = 0; j < NCH; j++)
{
    arr2dDyna[j] = new TYPE[SAMPLESIZE];
    for (int i = 0; i < SAMPLESIZE; i++)
        arr2dDyna[j][i] = TYPE(sin(i) * 0.95);
}

wwrite("output_1d_static.wav", arr1dStatic, SAMPLESIZE);
wwrite("output_2d_static.wav", arr2dStatic, SAMPLESIZE, 2);
wwrite("output_1d_dynamic.wav", arr1dDyna, SAMPLESIZE);
wwrite("output_2d_dynamic.wav", arr2dDyna, SAMPLESIZE, 2);

printf("************************  Test Load function  ************************\n");

TYPE input1dStatic[SAMPLESIZE];
int nblock = wread("output_1d_static.wav", input1dStatic);
wwrite("output_1d_static_resave.wav", input1dStatic, nblock);

TYPE input2dStatic[2][SAMPLESIZE];
nblock = wread("output_2d_static.wav", input2dStatic);
wwrite("output_2d_static_resave.wav", input2dStatic, nblock, NCH);

TYPE* input1dDyna;
nblock = wread("output_1d_dynamic.wav", input1dDyna);
wwrite("output_1d_dynamic_resave.wav", input1dDyna, nblock);

TYPE** input2dDyna;
nblock = wread("output_2d_dynamic.wav", input2dDyna);
wwrite("output_2d_dynamic_resave.wav", input2dDyna, nblock, NCH);

// ************************ caution ************************
// Buffers allocated by wread through a pointer argument must be released by the caller.

delete[] input1dDyna;
for (int i = 0; i < NCH; i++)
    delete[] input2dDyna[i];
delete[] input2dDyna;
```

## precautions

- The code has little safety checking or error handling.
- Please do not expect it to work for uses other than those shown above.

Friday, August 18, 2017

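Slides: building a speech recognition system with the Kaldi toolkit (translation of the Shinozaki lab slides)

This post was moved from my old blog.

While looking for good Kaldi tutorials, I found tutorial material from Professor Shinozaki's lab at Tokyo Institute of Technology. It is not very long, so I converted the PDF to PPT and translated it.

The original:

Kaldiツールキットを用いた音声認識システムの構築 (Building a speech recognition system with the Kaldi toolkit)

Professor Shinozaki's lab website: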

http://www.ts.ip.titech.ac.jp/

Please contact me if you find any mistranslations or errors.

My e-mail address:

gogyzzz@gmail.com

HTK installation on ubuntu 16.04

(This post was moved from a [blog](https://gogyzzz.github.io/2017/03/15/htk-installation-on-ubuntu-16.04.html) I briefly used before.)

# Prerequisites

> $sudo apt-get update

> $sudo apt-get upgrade

> $sudo apt-get install libc6-dev-i386 libx11-dev gawk

Installing HTK on ubuntu 16.04 requires a specific version of gcc.

Follow the link below to get an older version of gcc.

[http://askubuntu.com/questions/778794/how-to-install-gcc-4-1-2-on-ubuntu-16-04](http://askubuntu.com/questions/778794/how-to-install-gcc-4-1-2-on-ubuntu-16-04)

You can follow the link above, or download the files below and double-click them to install.

[https://drive.google.com/file/d/0B7S255p3kFXNRTkzQnRSNXZ6UVU/view](https://drive.google.com/file/d/0B7S255p3kFXNRTkzQnRSNXZ6UVU/view)

[https://drive.google.com/file/d/0B7S255p3kFXNV3J3bnVoWGNWdG8/view](https://drive.google.com/file/d/0B7S255p3kFXNV3J3bnVoWGNWdG8/view)

Checking as below shows that they were installed.

> haeyong@haeyong:~$ ls /usr/bin/gcc

>gcc gcc-5 gcc-ar-5 gcc-nm-5 gcc-ranlib-5 
gcc34 gcc-ar gcc-nm gcc-ranlib

> haeyong@haeyong:~$ ls /usr/bin/g++

> g++ g++34 g++-5

# Installing HTK

Download HTK from the page below.

[http://htk.eng.cam.ac.uk/download.shtml](http://htk.eng.cam.ac.uk/download.shtml)

Most of the steps were taken from the site below.

[http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/how-to/download](http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/how-to/download)

Extract the file downloaded from the HTK site, go into the htk directory, and run the following.

> $linux32 bash     <--- this starts bash in a 32-bit environment.

> $./configure CC=gcc34

> $make

make fails with an error saying that spaces were found where a tab is expected.
Open htk/HLMTools/Makefile in an editor such as gedit and change the spaces on line 77 to a tab.

> $gedit /home/haeyong/htk/HLMTools/Makefile

Then run make again to finish the build.

> $make

Next, install as below. A plain make install fails because of permission problems.

> $sudo make install

After that, you can confirm that programs such as HCopy run.

# Installing HDecode

Download the HDecode stable version from the page below.

[http://htk.eng.cam.ac.uk/prot-docs/hdecode.shtml](http://htk.eng.cam.ac.uk/prot-docs/hdecode.shtml)

Extracting the downloaded tar.gz produces a directory named htk/HTKLVRec.

Copy it over the existing htk/HTKLVRec.

Then run the following. I added sudo just in case.

> $sudo make hdecode install-hdecode

After that, the HDecode installation is complete.

Emotion Recognition using GMM-HMM in Kaldi

This post was moved from [my previous blog](https://gogyzzz.github.io/2017/03/01/emotion-recognition-using-GMMHMM-in-kaldi.html).

I wanted to implement this paper:
[Hybrid Deep Neural Network - Hidden Markov Model (DNN-HMM) Based Speech Emotion Recognition](http://ieeexplore.ieee.org/document/6681449/#full-text-section).

So here I try to explain how to prepare the data set and build an implementation along the lines of that paper.

The implementation is based on the yesno recipe scripts of Kaldi.

First, you should have a little experience with using Kaldi in a Linux environment.

Here are several references for understanding Linux and Kaldi.

[Linux command line basics - Udacity](https://www.udacity.com/course/linux-command-line-basics--ud595)

[Introduction to the use of WFSTs in Speech and Language processing](http://www.lvcsr.com/static/pubs/apsipa_09_tutorial_dixon_furui.pdf)

[Kaldi ASR](http://kaldi-asr.org/)

[Josh Meyer's website](http://jrmeyer.github.io/) <- I think this is the best Kaldi material for beginners.

## Dataset preparation
You can download the [Berlin Database of Emotional Speech](http://emodb.bilderbar.info/docu/).

The database has 535 wave files.

Each wave file is labeled with one of 7 emotions (anger, boredom, disgust, anxiety, happiness, sadness, neutral).

After downloading, split the dataset into training and test sets (a validation set may also be needed when training a DNN).

In my case, I shuffled the dataset and split it into training (50%), test (40%) and validation (10%) sets, following the paper.

(The validation set was not used in this case.)
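
A shuffle-and-split along these lines can be sketched as follows. The function name, the fixed seed, and the choice to round the training and test sizes down (leaving the remainder for validation) are assumptions of this sketch, not a description of the script I actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

struct DatasetSplit {
    std::vector<std::string> train;
    std::vector<std::string> test;
    std::vector<std::string> validation;
};

// Shuffle the file list with a fixed seed, then cut it into
// training / test / validation parts. Whatever is left after the
// training and test parts goes to validation.
DatasetSplit splitDataset(std::vector<std::string> files, unsigned seed,
                          double trainFraction = 0.5, double testFraction = 0.4) {
    std::mt19937 rng(seed);
    std::shuffle(files.begin(), files.end(), rng);

    const std::size_t nTrain = static_cast<std::size_t>(files.size() * trainFraction);
    const std::size_t nTest = static_cast<std::size_t>(files.size() * testFraction);
    assert(nTrain + nTest <= files.size());

    const auto trainEnd = files.begin() + static_cast<std::ptrdiff_t>(nTrain);
    const auto testEnd = trainEnd + static_cast<std::ptrdiff_t>(nTest);

    DatasetSplit split;
    split.train.assign(files.begin(), trainEnd);
    split.test.assign(trainEnd, testEnd);
    split.validation.assign(testEnd, files.end());
    return split;
}
```

With the 535 files of this database, this rounding yields 267 training, 214 test and 54 validation files.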

Then I prepared the 'wav.scp', 'text', 'spk2utt' and 'utt2spk' files for each set.

You can make spk2utt and utt2spk more detailed by using the actual speaker identities.

But I did not want to take the speakers into account.

The files look like the examples below.

- files of training set
![](https://github.com/gogyzzz/gogyzzz.github.io/raw/master/_posts/train_data.PNG)

## Additional preparation

You should also prepare several files for the language model. It is really easy.

![](https://github.com/gogyzzz/gogyzzz.github.io/raw/master/_posts/lang_dict.PNG)

To remove cycles from the WFST, I made the G.fst file manually.

This tip came from [Dan Povey's comment](https://sourceforge.net/p/kaldi/discussion/1355348/thread/d927baef/).

To prepare a G.fst without cycles, write a text-format file like the one below.

![](https://github.com/gogyzzz/gogyzzz.github.io/raw/master/_posts/g_fst_txt.PNG)

Then compile it.

> fstcompile --isymbols=words.txt --osymbols=words.txt G.fst.txt G.fst

After removing the cycles, G.fst looks like the figure below. I used 'fstdraw' and 'dot' to produce a PDF file.

> fstdraw --portrait=true --isymbols=words.txt --osymbols=words.txt G.fst | dot -Tpdf > G.pdf

![](https://github.com/gogyzzz/gogyzzz.github.io/raw/master/_posts/g_fst_without_cyclic.PNG)

## Implementation

First, set the file and directory paths.
![](https://github.com/gogyzzz/gogyzzz.github.io/raw/master/_posts/prep_script.PNG)

Then build the language model. prepare_lang.sh creates the language model directory from your dictionary directory.

Note sil_prob, the probability of the silence phone. It should be set to 0.0 to get a better score (only for this experiment).
This tip is also from [Dan Povey](https://groups.google.com/forum/#!topic/kaldi-developers/z4km_Q8kO0U).

![](https://github.com/gogyzzz/gogyzzz.github.io/raw/master/_posts/prep_lang.PNG)

Extract 42-dimensional MFCC features using the configuration file below.
![](https://github.com/gogyzzz/gogyzzz.github.io/raw/master/_posts/mfcc_conf.PNG)

![](https://github.com/gogyzzz/gogyzzz.github.io/raw/master/_posts/prep_data.PNG)

Then train the GMM-HMM model, build the WFST graph, and decode.
![](https://github.com/gogyzzz/gogyzzz.github.io/raw/master/_posts/train_mkgraph_decode.PNG)

If everything was set up correctly, you will get a result.

See exp/your_experiment/decode/wer_*. 
(I think the weighting between the language model and the acoustic model is not important here.)

I got a WER of 26.51. That corresponds to about 73% accuracy (100 - 26.51), which is lower than the paper's.

If anybody knows a better implementation, please let me know.