ㅇㅇ: 2019

Monday, July 15, 2019

Speech recognition memo (Kaldi) 26 - parameter update, NG-SGD, Dan's DNN (nnet2)

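To the original author

I did not know how to contact you, so I have reluctantly translated this without permission.
I would like to obtain formal permission for the translation.
Please contact me at gogyzzz@gmail.com.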

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/10/06/124604

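This continues from the previous post. Let's read it alongside the paper.

What happens during the parameter update has been given the name "Natural Gradient for Stochastic Gradient Descent (NG-SGD)".

Terminology review

"Stochastic Gradient Descent" repeatedly computes the gradient on shuffled training data and updates the parameters. Using a single training example per update is online learning; using several is minibatch learning.

"Natural Gradient method" follows the terminology of earlier work: it uses an (approximate) inverse of the Fisher information matrix as the learning-rate matrix.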

previous work has used the term “Natural Gradient” to describe methods like ours which use an approximated inverse-Fisher matrix as the learning rate matrix, so we follow their precedent in calling our method “Natural Gradient”.

The core computation is this: given a matrix X with N rows, the i-th row vector x_i is multiplied by the inverse of the Fisher information matrix F_i.

Here F_i is estimated from the other rows x_j, excluding the i-th row.

On top of this idea, the following two extensions are said to have been added.

smoothing of F_i with the identity matrix
scaling the output to have the same Frobenius norm as the input

Given a matrix X_t with N rows and D columns, treating X_t column-wise makes the Fisher information matrix F_i a D x D matrix (X^T X).

The Fisher information matrix is approximated with a low rank R,

t : minibatch index

F_t : D rows x D columns

R_t : R rows x D columns

D_t : R rows x R columns

I : D rows x D columns, identity matrix

rho_t : 0 < rho_t

and its (approximate) inverse is then computed.

G_t : D rows x D columns

E_t : R rows x R columns

beta_t : scalar
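The formulas themselves did not survive in this text, so what follows is a reconstruction rather than a quotation. It assumes that D_t is diagonal with entries d_{t,ii}, that the rows of R_t are orthonormal, and that beta_t is the scalar added to the diagonal once smoothing is included. All three are assumptions chosen to be consistent with the dimensions listed above.

```latex
\begin{aligned}
F_t &\approx R_t^{\top} D_t R_t + \rho_t I, \qquad R_t R_t^{\top} = I_R \\
G_t &= \bigl(R_t^{\top} D_t R_t + \beta_t I\bigr)^{-1}
     = \frac{1}{\beta_t}\,\bigl(I - R_t^{\top} E_t R_t\bigr) \\
E_t &= \operatorname{diag}(e_{t,ii}), \qquad
e_{t,ii} = \frac{d_{t,ii}}{d_{t,ii} + \beta_t} = \frac{1}{\beta_t / d_{t,ii} + 1}
\end{aligned}
```

The closed form can be checked by multiplying the two factors: with R_t R_t^T = I the product reduces to I + (1/beta_t) R_t^T (D_t - D_t E_t - beta_t E_t) R_t, and the bracket vanishes exactly when E_t = D_t (D_t + beta_t I)^{-1}.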

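Because the formulation is column-wise, the product is applied from the right of X_t to update X_t.

This corresponds to the following part of the previous post.

Next comes the scaling.

This part corresponds to gamma_t in the previous post.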
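The two steps above (right-multiplication, then rescaling) can be tried on a toy matrix. This is a minimal sketch in plain Python, not Kaldi code: the helper names and the toy values of X and W are hypothetical, and it assumes the update has the form X_hat = X - (X W^T) W followed by a scale that restores the Frobenius norm of the input.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frob_sq(m):
    """Squared Frobenius norm, i.e. tr(M M^T)."""
    return sum(v * v for row in m for v in row)

def precondition(x, w):
    """Return (gamma * X_hat, gamma) with X_hat = X - (X W^T) W."""
    h = matmul(x, transpose(w))          # H: N x R
    hw = matmul(h, w)                    # N x D
    x_hat = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(x, hw)]
    gamma = math.sqrt(frob_sq(x) / frob_sq(x_hat))  # restores the input norm
    return [[gamma * v for v in row] for row in x_hat], gamma

# Toy sizes (hypothetical): N=3, D=4, R=1.
X = [[1.0, 2.0, 0.0, 1.0],
     [0.5, 1.0, 1.0, 0.0],
     [2.0, 0.0, 1.0, 1.0]]
W = [[0.3, 0.1, 0.0, 0.2]]

X_out, gamma = precondition(X, W)
print(gamma, frob_sq(X), frob_sq(X_out))
```

After the rescaling, the squared Frobenius norms of the input and the output agree up to rounding.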

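Let's look in a bit more detail.

Suppose the dimension of the parameters being updated (D) is 376, the minibatch size (N) is 128, and the dimension of the low-rank approximation (R) is 30.

The matrix X_t (N rows x D columns) is multiplied by its transpose to form a symmetric matrix before the computation.

The paper also assumes

parameter dimension (D) < minibatch size (N)

and is therefore formulated column-wise.

(As training progresses, the minibatch size gradually grows, for example to 512.)

The computation is performed not on the D-dimensional (= full-rank) symmetric matrix T_t (D rows x D columns),

S_t : D rows x D columns

eta : forgetting factor, 0 < eta < 1

but on the R-dimensional (= low-rank approximation) symmetric matrix Z_t (R rows x R columns).

Y_t : R rows x D columns

R_t : R rows x D columns

Scaling the R_t above gives W_t,

W_t : R rows x D columns

and the goal is to update it so that it takes appropriate values as the weight matrix.

First, initialize R_t.

It takes the form of R-dimensional orthogonal matrices laid out side by side.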

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::InitDefault()

// after the next line, W_t_ will store the orthogonal matrix R_t.
InitOrthonormalSpecial(&W_t_);

R_t (R x D)

Compute the initial value of E_t and multiply by its square root.

BaseFloat E_tii = 1.0 / ( 2.0 + (D + rank_) * alpha_ / D );

// W_t =(def) E_t^{0.5} R_t.
W_t_.Scale(sqrt(E_tii));
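To get a feel for the size of this initial scale, the expression in the excerpt can be evaluated for the dimensions used in this post (D = 376, R = 30). A small sketch: the function name is hypothetical, and alpha = 4.0 is an assumed value chosen for illustration, since the post does not state it.

```python
def initial_e_tii(dim, rank, alpha):
    """Initial diagonal value of E_t, mirroring the expression in InitDefault()."""
    return 1.0 / (2.0 + (dim + rank) * alpha / dim)

D, R = 376, 30
alpha = 4.0  # assumed smoothing constant, not stated in the post

e_tii = initial_e_tii(D, R, alpha)
w_scale = e_tii ** 0.5  # factor applied to the orthonormal R_t to obtain W_t
print(e_tii, w_scale)
```

Every row of the initial W_t is therefore the corresponding row of R_t shrunk by the same factor, which is below 1 for any non-negative alpha.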

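W_t is then updated for the next minibatch.

The update equations are as follows.

W_t1 : R rows x D columns
E_t1 : R rows x R columns, diagonal matrix
R_t1 : R rows x D columns
C_t : R rows x R columns, diagonal matrix
U_t : R rows x R columns, orthogonal matrix
Y_t : R rows x D columns
J_t : R rows x D columns
D_t : R rows x R columns
eta : forgetting factor, 0 < eta < 1

U_t (orthogonal matrix) and C_t (singular values) are obtained from an SVD of Z_t.

The steps for computing Z_t follow.

The original input X_t (N rows x D columns) is multiplied on the right by the transpose of W_t (R rows x D columns) to obtain H_t (N rows x R columns), where N is the minibatch size.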

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::PreconditionDirectionsInternal()

H_t.AddMatMat(1.0, *X_t, kNoTrans, W_t, kTrans, 0.0);  // H_t = X_t W_t^T

H_t (N x R)

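The input X_t with D (376) columns becomes a matrix with R (30) columns.

Next, compute J_t.

The original input X_t (N rows x D columns) is multiplied on the left by the transpose of H_t (N rows x R columns) to obtain J_t (R rows x D columns).

(It has the same number of rows and columns as W_t.)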

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::PreconditionDirectionsInternal()

J_t.AddMatMat(1.0, H_t, kTrans, *X_t, kNoTrans, 0.0);  // J_t = H_t^T X_t

Rearranging the expression for J_t:

J_t = H_t^T X_t
    = (X_t W_t^T)^T X_t
    = W_t X_t^T X_t

In other words, it is the uncentered covariance matrix (D rows x D columns) of X_t (N rows x D columns), multiplied on the left by the weight matrix (R rows x D columns).
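The identity J_t = H_t^T X_t = W_t X_t^T X_t can be confirmed numerically on a small example. This is a sketch in plain Python with hypothetical toy matrices (N = 4, D = 3, R = 2), not the Kaldi implementation.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Toy data (hypothetical): X is N x D, W is R x D.
X = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 1.0],
     [2.0, 1.0, 0.0],
     [1.0, 0.0, 1.0]]
W = [[0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]

H = matmul(X, transpose(W))                    # H = X W^T        (N x R)
J_from_H = matmul(transpose(H), X)             # J = H^T X        (R x D)
J_direct = matmul(W, matmul(transpose(X), X))  # J = W (X^T X)    (R x D)
print(J_from_H)
print(J_direct)
```

Both routes give the same R x D matrix; the first one never forms the D x D product X^T X, which is the point of going through H_t.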

J_t (R x D)

Next, compute K_t.

This is J_t (R rows x D columns) multiplied on the right by its transpose.

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::PreconditionDirectionsInternal()

K_t.SymAddMat2(1.0, J_t, kNoTrans, 0.0);  // K_t = J_t J_t^T

K_t (R x R, symmetric)

Next, compute L_t.

This is H_t (N rows x R columns) multiplied on the left by its transpose.

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::PreconditionDirectionsInternal()

L_t.SymAddMat2(1.0, H_t, kTrans, 0.0);  // L_t = H_t^T H_t

L_t (R x R, symmetric)

Z_t is computed from K_t and L_t.

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::PreconditionDirectionsInternal()

 SpMatrix<double> Z_t_double(R);
 ComputeZt(N, rho_t, d_t, inv_sqrt_e_t, K_t_cpu, L_t_cpu, &Z_t_double);

Z_t (R x R, symmetric)

After rescaling, the SVD is computed.

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::PreconditionDirectionsInternal()

Matrix<BaseFloat> U_t(R, R);
Vector<BaseFloat> c_t(R);
// do the symmetric eigenvalue decomposition Z_t = U_t C_t U_t^T.
Z_t_scaled.Eig(&c_t, &U_t);
SortSvd(&c_t, &U_t);
c_t.Scale(z_t_scale);

C_t (30 dimensions)

0.957  0.164  0.119  0.090  ...   0.00019  0.00016

U_t (R x R)

Update W_t.

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::PreconditionDirectionsInternal()

CuMatrix<BaseFloat> W_t1(R, D);  // W_{t+1}
ComputeWt1(N,
           d_t,
           d_t1,
           rho_t,
           rho_t1,
           U_t,
           sqrt_c_t,
           inv_sqrt_e_t,
           W_t,
           &J_t,
           &W_t1);

Compute B_t.

B_t (R x D)

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::ComputeWt1()

// B_t = J_t + (1-eta)/(eta/N) (D_t + rho_t I) W_t
J_t->AddDiagVecMat(1.0, w_t_coeff_gpu, W_t, kNoTrans, 1.0);

w_t_coeff_gpu : R rows x R columns, diagonal matrix; the coefficient for each row of W_t

The expression can be rearranged as follows.

B_t = J_t + ( W_t coefficient * W_t )
    = ( W_t X_t^T X_t ) + ( W_t coefficient * W_t )

Compute A_t.

A_t (R x R)

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::ComputeWt1()

// A_t = (eta/N) E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5} B_t
Matrix<BaseFloat> A_t(U_t, kTrans);
for (int32 i = 0; i < R; i++) {
    BaseFloat i_factor = (eta / N) * sqrt_e_t1(i) * inv_sqrt_c_t(i);
    for (int32 j = 0; j < R; j++) {
        BaseFloat j_factor = inv_sqrt_e_t(j);
        A_t(i, j) *= i_factor * j_factor;
    }
}

Compute W_t1.

[ nnet2/nnet-precondition-online.cc ] OnlinePreconditioner::ComputeWt1()

// W_{t+1} = A_t B_t
CuMatrix<BaseFloat> A_t_gpu(A_t);
W_t1->AddMatMat(1.0, A_t_gpu, kNoTrans, *J_t, kNoTrans, 0.0);

X_t is updated using W_t1.

Repeating this, the updated X_t corresponds to in_value_temp and out_deriv_temp seen in the previous post.

[ nnet2/nnet-component.cc ] AffineComponentPreconditionedOnline::Update()

preconditioner_in_.PreconditionDirections(&in_value_temp,
                                          &in_row_products,
                                          &in_scale);

preconditioner_out_.PreconditionDirections(&out_deriv_temp,
                                           &out_row_products,
                                           &out_scale);

These are then used as the correction directions to update the model parameters.

[ nnet2/nnet-component.cc ] AffineComponentPreconditionedOnline::Update()

bias_params_.AddMatVec(local_lrate, out_deriv_temp, kTrans, precon_ones, 1.0);
linear_params_.AddMatMat(local_lrate, out_deriv_temp, kTrans, in_value_precon_part, kNoTrans, 1.0);

Speech recognition memo (Kaldi) 25 - data types used internally

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/09/15/121225

A memo on the numeric precision of the data used inside the program.

With a default build, the type is float.

kaldi/configure at master · kaldi-asr/kaldi · GitHub

Excerpt from configure

# Default configuration
double_precision=false

(snip)

if $double_precision; then
  echo "DOUBLE_PRECISION = 1" >> kaldi.mk
else
  echo "DOUBLE_PRECISION = 0" >> kaldi.mk
fi

If DOUBLE_PRECISION is 0, the templated parts are compiled as float.

matrix/kaldi-matrix.cc

template<typename Real>
void Matrix<Real>::Read(std::istream & is, bool binary, bool add) {

    Real r;
    is >> r;

If the machine has hardware support for double-precision arithmetic, using double might be faster.

Below is a memo on reading data.

Given a text file written in exponent notation like the following,

-6.93889018e-18 -5.55112e-17 ...

it can be read with the code below.

float r;
is >> r;

Speech recognition memo (Kaldi) 24 - parameter update, Dan's DNN (nnet2)

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/09/04/165212

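Let's follow the parameter update of the AffineComponentPreconditionedOnline component.

The model is based on nnet4c, in its state before mixup

What we look at this time are the parameters in the dotted part of the figure below

Backpropagation uses the following two values for the update.

The minibatch size here is 128.

in_value_temp : the values after passing through the Tanh component in Propagation (128 rows x 375 cols), with 1.0 appended to the end of each row (128 rows x 376 cols)
out_deriv_temp : a copy of the values after passing through the Softmax component in Backpropagation (128 rows x 192 cols)

These two are passed to the two OnlinePreconditioner members of the AffineComponentPreconditionedOnline class, respectively.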

nnet2/nnet-component.cc

void AffineComponentPreconditionedOnline::Update(
 CuMatrixBase<BaseFloat> &in_value,
 const CuMatrixBase<BaseFloat> &out_deriv)
{
 ...
 preconditioner_in_.PreconditionDirections(&in_value_temp,
                                              &in_row_products,
                                              &in_scale);
 preconditioner_out_.PreconditionDirections(&out_deriv_temp,
                                               &out_row_products,
                                               &out_scale);
 ...

The details of this processing are described in Daniel Povey's paper.

in_value_temp before the update (128 rows x 376 cols; corresponds to X_t in the paper)

0.265  0.362  -0.999  -0.999  -0.946  ... 1
...

t : minibatch index

From this, X_hat_t is computed.

R = 30 (rankIn, the dimension of the low-rank approximation)

W_t is the R x D weight matrix (D = 376)

N = 128 (minibatch size)

in_value_temp after the update (128 rows x 376 cols; corresponds to X_hat_t in the paper)

0.035  0.562  -0.010  0.040  -0.575  ...
...

Corresponding source code (nnet-precondition-online.cc)

// X_hat_t = X_t - H_t W_t
X_t->AddMatMat(-1.0, H_t, kNoTrans, W_t, kNoTrans, 1.0);

Compute the inner product of each row of X_hat_t with itself ((0.035 * 0.035) + (0.562 * 0.562) + (-0.010 * -0.010) + ...).

This corresponds to in_row_products.

in_row_products (128 dim)

62.02  44.57  66.93  81.97  59.58 ...

Next, gamma_t is computed.

This corresponds to in_scale.

in_scale

2.04739451
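The formula for gamma_t is not reproduced in this text. Given the stated goal of giving the output the same Frobenius norm as the input, a form consistent with the quantities above would be the following (a reconstruction, not a quotation from the paper), where the denominator is simply the sum of the 128 entries of in_row_products:

```latex
\gamma_t = \sqrt{\frac{\operatorname{tr}\!\left(X_t X_t^{\top}\right)}
                      {\operatorname{tr}\!\left(\hat{X}_t \hat{X}_t^{\top}\right)}},
\qquad
\operatorname{tr}\!\left(\hat{X}_t \hat{X}_t^{\top}\right)
  = \sum_{n=1}^{N} \hat{x}_{t,n}\, \hat{x}_{t,n}^{\top}
```

A value of gamma_t above 1, as in the in_scale shown here, would mean that preconditioning shrank the overall norm and the scale restores it.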

The same processing is applied to preconditioner_out_.

out_deriv_temp before the update (128 rows x 192 cols; corresponds to X_t in the paper)

-0.000  -6.5e-06  -9.9e-06  -1.2e-06  -7.5e-07 ...
...

out_deriv_temp after the update (128 rows x 192 cols; corresponds to X_hat_t in the paper)

0.048  -5.5e-06  -0.000  5.9e-08  -0.000  ...
...

out_row_products (128 dim)

0.136  0.163  0.000  0.072  0.008  ...

out_scale

2.1778152

From in_value_temp (128 rows x 376 dim), in_value_precon_part (128 rows x 375 dim) and precon_ones (128 dim) are obtained.

Internal processing of AffineComponentPreconditionedOnline::Update()

CuSubMatrix<BaseFloat> in_value_precon_part(in_value_temp,
                                            0, in_value_temp.NumRows(),
                                            0, in_value_temp.NumCols() - 1);

CuVector<BaseFloat> precon_ones(in_value_temp.NumRows());
precon_ones.CopyColFromMat(in_value_temp, in_value_temp.NumCols() - 1);
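The split performed by the excerpt above is a column slice: everything but the last column goes to the linear part, and the last column (the former 1.0 column, now preconditioned) goes to the bias part. A toy Python sketch with hypothetical values, using 2 rows and 4 columns instead of 128 x 376:

```python
# Toy stand-in for the preconditioned in_value_temp (hypothetical values).
# The last column started out as 1.0 and is no longer exactly 1.0.
in_value_temp = [
    [0.035, 0.562, -0.010, 0.97],
    [0.120, -0.300, 0.450, 1.02],
]

# All columns except the last: pairs with the linear parameters.
in_value_precon_part = [row[:-1] for row in in_value_temp]

# The last column only: pairs with the bias parameters.
precon_ones = [row[-1] for row in in_value_temp]

print(in_value_precon_part)
print(precon_ones)
```

Note that the C++ version takes a sub-matrix view for the first part and a copy for the column, whereas this sketch copies both.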

From in_scale, out_scale, in_row_products and out_row_products, the "scale" and "minibatch_scale" for the minibatch are obtained.

The parameters are updated with the data described so far.

Internal processing of AffineComponentPreconditionedOnline::Update()

BaseFloat local_lrate = scale * minibatch_scale * learning_rate_;

bias_params_.AddMatVec(local_lrate, out_deriv_temp, kTrans,
                                    precon_ones, 1.0);
linear_params_.AddMatMat(local_lrate, out_deriv_temp, kTrans,
                                      in_value_precon_part, kNoTrans, 1.0);
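Written out element by element, the two calls above add lrate * out_deriv^T * ones to the bias and lrate * out_deriv^T * in_value to the linear parameters. The following plain-Python sketch shows that arithmetic on toy sizes; the function name and all values are hypothetical, and it is not the Kaldi implementation.

```python
def affine_update(linear, bias, out_deriv, in_value, ones, lrate):
    """In-place update:
    bias   += lrate * out_deriv^T ones      (out_dim)
    linear += lrate * out_deriv^T in_value  (out_dim x in_dim)
    """
    for n, deriv_row in enumerate(out_deriv):       # loop over minibatch rows
        for o, d in enumerate(deriv_row):           # loop over output units
            bias[o] += lrate * d * ones[n]
            for i, v in enumerate(in_value[n]):     # loop over input units
                linear[o][i] += lrate * d * v

# Toy sizes (hypothetical): N=2, in_dim=2, out_dim=2.
linear = [[0.0, 0.0], [0.0, 0.0]]
bias = [0.0, 0.0]
out_deriv = [[1.0, 0.0], [0.0, 2.0]]
in_value = [[1.0, 2.0], [3.0, 4.0]]
ones = [1.0, 1.0]

affine_update(linear, bias, out_deriv, in_value, ones, lrate=0.1)
print(linear, bias)
```

The sign convention follows the excerpt: the derivative is added, so out_deriv is assumed to already point in the direction that improves the objective.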

Sunday, July 14, 2019

Speech recognition memo (Kaldi) 23 - model update during training, Dan's DNN (nnet2)

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/08/12/123334

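A continuation of Speech recognition memo (Kaldi) 20 - Training Dan's DNN (nnet2).

Let's trace how final.mdl is generated from the 12 models (14.mdl to 25.mdl) produced after mixup.

Rather than picking the single model with the best result, nnet2 multiplies each model by a scale and adds them together.

The nnet2bin/nnet-combine-fast command is used to combine the models.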

nnet-combine-fast \
    --minibatch-size=129 \
    exp/nnet4c/14.mdl \
    exp/nnet4c/15.mdl \
    <snip> \
    exp/nnet4c/25.mdl \
    ark:exp/nnet4c/egs/combine.egs \    #  <valid-examples-in>
    exp/nnet4c/final.mdl
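The idea of "scale, then add" can be sketched as a weighted sum with one scale per model and per updatable component. This is an illustration in plain Python, not what nnet-combine-fast does internally: the data layout (flat parameter lists), the function name and the toy numbers are all hypothetical.

```python
def combine(models, scales):
    """Weighted sum of models.

    models[m][c] is the flat parameter list of updatable component c in model m.
    scales[m][c] is the scale applied to that component of that model.
    """
    num_models = len(models)
    combined = []
    for c in range(len(models[0])):
        size = len(models[0][c])
        combined.append([
            sum(scales[m][c] * models[m][c][k] for m in range(num_models))
            for k in range(size)
        ])
    return combined

# Toy example (hypothetical): 2 models, 2 updatable components each.
models = [
    [[1.0, 2.0], [10.0]],
    [[3.0, 4.0], [20.0]],
]
scales = [
    [0.25, 0.5],
    [0.75, 0.5],
]

final_model = combine(models, scales)
print(final_model)
```

With 12 models and 3 updatable components, as in this post, there would be 36 scales to determine.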

In this case the model has 9 components, 3 of which are updatable components.

info command (nnet2bin/nnet-am-info exp/nnet4c/14.mdl)

num-components 9
num-updatable-components 3
input-dim 40
output-dim 192
component 0 : SpliceComponent, input-dim=40, output-dim=360, ... 
component 1 : FixedAffineComponent, input-dim=360, output-dim=360, ...
component 2 : AffineComponentPreconditionedOnline, input-dim=360, output-dim=375, ...
component 3 : TanhComponent, input-dim=375, output-dim=375
component 4 : AffineComponentPreconditionedOnline, input-dim=375, output-dim=375, ...
component 5 : TanhComponent, input-dim=375, output-dim=375
component 6 : AffineComponentPreconditionedOnline, input-dim=375, output-dim=453, ...
component 7 : SoftmaxComponent, input-dim=453, output-dim=453
component 8 : SumGroupComponent, input-dim=453, output-dim=192

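The AffineComponentPreconditionedOnline components at index 2, 4 and 6 (the updatable components) are the ones that get combined.

To state the result first, each component of each model is assigned a scale as shown below.

Scale for each model and component

Let's follow how the scales are obtained.

The evaluation measure for the update is the Propagate value (a log value); the closer to zero (the smaller in magnitude), the better.

The values of the models on the same data (exp/nnet4c/egs/combine.egs) are as follows.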

14.mdl : -0.0753596
15.mdl : -0.0679429
16.mdl : -0.0373454
17.mdl : -0.0325585
18.mdl : -0.0286209
19.mdl : -0.0237487
20.mdl : -0.0209156
21.mdl : -0.00755152
22.mdl : -0.0101945
23.mdl : -0.00779286
24.mdl : -0.00615664
25.mdl : -0.00456561

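In this case the value of 25.mdl, -0.00456561, is the best, so scales that give an even better value are searched for with that value as the baseline.

The optimal scales are found with a quasi-Newton method (L-BFGS).

From the source code (nnet2/combine-nnet-fast.cc)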

OptimizeLbfgs<double> lbfgs(params_,
                            lbfgs_options);

// Loop 10 times
for (int32 i = 0; i < config_.num_lbfgs_iters; i++) {

    // Set the scales
    params_.CopyFromVec(lbfgs.GetProposedValue());

    // Compute the Propagate value and the gradient
    objf = ComputeObjfAndGradient(&gradient, &regularizer_objf);

    if (i == 0) {
        initial_objf = objf;
        initial_regularizer_objf = regularizer_objf;
    }
    // Judge the step and update the scales
    lbfgs.DoStep(objf, gradient);

} // end of for(i)

The Propagate value and the decision at each loop step were as follows.

(i=0) -0.00456561                  
(i=1) -0.310829      action = decrease 
(i=2) -0.0201834     action = decrease 
(i=3) -0.00512826    action = decrease 
(i=4) -0.00410415    action = accept   
(i=5) -0.00352887    action = accept
(i=6) -0.00315388    action = accept
(i=7) -0.00228854    action = accept
(i=8) -0.000916785   action = accept
(i=9) -0.000539656   action = accept

The value at loop counter 9, -0.000539656, was the best, so the scales at that point are the optimal solution.

params_ = lbfgs.GetValue(&objf);
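The log above can be replayed to see which iteration wins. In this sketch a step counts as an improvement when its objective is higher (closer to zero) than the best seen so far. That rule is a simplification chosen for illustration; the real L-BFGS acceptance test involves line-search conditions, although on this particular log the simple rule happens to mark the same steps that were reported as "accept".

```python
# Objective values copied from the log above, indexed by loop counter.
objf_by_iter = [
    -0.00456561, -0.310829, -0.0201834, -0.00512826, -0.00410415,
    -0.00352887, -0.00315388, -0.00228854, -0.000916785, -0.000539656,
]

best = objf_by_iter[0]     # baseline: the value at i=0
improved_iters = []
for i in range(1, len(objf_by_iter)):
    if objf_by_iter[i] > best:   # higher log value = closer to zero = better
        improved_iters.append(i)
        best = objf_by_iter[i]

best_iter = max(range(len(objf_by_iter)), key=lambda i: objf_by_iter[i])
print(improved_iters, best_iter, best)
```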

Speech recognition memo (Kaldi) 22 - alignment

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/07/29/104925

Let's dig into the decoding process.

What do the numbers output as the alignment (one per frame of the input MFCC features) represent?

This time, let's look at the output of the lattice-generation command called inside egs/wsj/s5/steps/decode.sh.

(As additional options, words-wspecifier and alignments-wspecifier are specified.)

gmm-latgen-faster \
--max-active=7000 \
--beam=13.0 \
--lattice-beam=6.0 \
--acoustic-scale=0.083333 \
--allow-partial=true \
--determinize-lattice=true \
--word-symbol-table=exp/mono/graph/words.txt \
exp/mono/final.mdl \        # model-in
exp/mono/graph/HCLG.fst \   # fst-in
ark:/tmp/utter_053.ark \    # features-rspecifier
ark:/tmp/lat.ark \          # lattice-wspecifier
ark:/tmp/word.ark \         # words-wspecifier
ark:/tmp/ali.ark            # alignments-wspecifier

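Input

Audio (/tmp/utter_053.ark)

The utterance 禁煙席お願いします (kin'en seki onegai shimasu: "A non-smoking seat, please") used for testing in the previous post

(323 frames, identifier "utterance_id_053", 39 dimensions)

It is generated with the command below.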

apply-cmvn \
--utt2spk=ark:data/test/split1/1/utt2spk \
scp:data/test/split1/1/cmvn.scp \
scp:data/test/split1/1/feats.scp \
ark:- | \
add-deltas \
ark:- \
ark:/tmp/hoge.ark

Model

The monophone model generated in the previous post (the language model is a 2-gram)

exp/mono/graph/words.txt

<eps> 0
!SIL 1
<UNK> 2
あり(ari) 3
お(o) 4
お願い(onegai) 5
し(shi) 10
ます(masu) 23
席(seki) 45
禁煙(kin'en) 53
#0 62
<s> 63
</s> 64

exp/mono/graph/phones/align_lexicon.txt

<eps> <eps> sil
禁煙(kin'en) 禁煙(kin'en) k_B i_I N_I e_I N_E
席(seki) 席(seki) s_B e_I k_I i_E
お願い(onegai) お願い(onegai) o_B n_I e_I g_I a_I i_E
し(shi) し(shi) sh_B i_E
ます(masu) ます(masu) m_B a_I s_I u_E

output

words (/tmp/word.ark)

utterance_id_053 53 45 5 10 23

Converted to symbols: 禁煙(kin'en)(53) 席(seki)(45) お願い(onegai)(5) し(shi)(10) ます(masu)(23)

lattice (/tmp/lat.ark)  * partially edited for explanation

utterance_id_053                                              
0   1   53  9.81125, 7750.02, 2_1_1_1_8_5_5_5_18_17_17_17_17_17_...
1   2   45  2.38995, 4952.66, 197_197_197_197_197_197_197_197_...
2   3   5   4.61863, 8155.52, 527_527_527_527_527_527_527_527_...
3   4   10  2.22007, 3096.31, 527_926_925_925_925_925_925_925_...
4   5   23  5.81478, 6656.75, 523_526_528_638_637_640_639_639_...
5                 0,       0, 917_917_917_917_917_917_917_917_...

Alignment result (/tmp/ali.ark)

utterance_id_053 2 1 1 1 8 5 5 5 18 17 17 17 17 17 17 17 17 17 17 ...

There are 323 entries, the same as the number of MFCC feature frames

Convert the alignment to phones.

ali-to-phones \
--write-lengths \
exp/mono/final.mdl \
ark:/tmp/ali.ark \
ark:/tmp/ali2phone.ark

Alignment result (/tmp/ali2phone.ark)  * phones replaced with symbols for explanation

utterance_id_053 sil 31 ; k_B 5 ; i_I 12 ; N_I 39 ; e_I 6 ; N_E 3 ; s_B 3 ; e_I 22 ; k_I 3 ; i_E 12 ; o_B 28 ; n_I 14 ; e_I 3 ; g_I 13 ; a_I 4 ; i_E 14 ; sh_B 28 ; i_E 5 ; m_B 38 ; a_I 14 ; s_I 21 ; u_E 5

The number after each symbol is its count. For example, "sil 31" means that "sil" continued for 31 frames.
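Since each count is a number of frames, the counts on this line should add up to the 323 frames of the utterance. A short Python sketch that parses the line shown above and checks this (the parsing function is a hypothetical helper, not a Kaldi tool):

```python
line = ("utterance_id_053 sil 31 ; k_B 5 ; i_I 12 ; N_I 39 ; e_I 6 ; N_E 3 ; "
        "s_B 3 ; e_I 22 ; k_I 3 ; i_E 12 ; o_B 28 ; n_I 14 ; e_I 3 ; g_I 13 ; "
        "a_I 4 ; i_E 14 ; sh_B 28 ; i_E 5 ; m_B 38 ; a_I 14 ; s_I 21 ; u_E 5")

def parse_phone_lengths(text):
    """Split '<utt-id> phone len ; phone len ; ...' into (utt_id, [(phone, len)])."""
    utt_id, rest = text.split(maxsplit=1)
    pairs = []
    for item in rest.split(";"):
        phone, length = item.split()
        pairs.append((phone, int(length)))
    return utt_id, pairs

utt_id, segments = parse_phone_lengths(line)
total_frames = sum(length for _, length in segments)
print(utt_id, len(segments), total_frames)
```

The line contains 22 phone segments whose lengths sum to 323, matching the frame count reported for the utterance.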

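Back to today's topic: what exactly is output in the alignment?

Looking at the inputs passed to the ali-to-phones command, it can be inferred from the information in the model (*.mdl).

(If all we do is convert the alignment to phones, the FST graph is not used.)

phones.txt, which is an input for model generation, contains 171 entries in total.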

exp/mono/phones.txt

<eps> 0
sil 1
sil_B 2
sil_E 3
sil_I 4
sil_S 5
spn 6
spn_B 7
spn_E 8
spn_I 9
spn_S 10
N_B 11
N_E 12
N_I 13
N_S 14
a_B 15
a_E 16
a_I 17
a_S 18
...
z_E 164
z_I 165
z_S 166
#0 167
#1 168
#2 169
#3 170

Of these, 166 are defined by a TopologyEntry in the model.

<ForPhones> 1 2 3 4 5 6 7 8 9 10 </ForPhones> 
<ForPhones> 11 12 13 14 15 16 17 18  ...  164 165 166 </ForPhones>

phone-ids 1 to 10 (silence phones) have 5 states, and 11 to 166 (non-silence phones) have 3 states.

3 state HMM

3 pdf-classes, 6 transitions

5 state HMM

5 pdf-classes, 18 transitions

The total number of phone-state pairs is 518 (5 states x 10 phones + 3 states x 156 phones).

Rather than defining a pdf for each of these 518, states of similar phones share a pdf.

With pdfs shared between states, the total number of pdfs becomes 127.

<NUMPDFS> 127

The total number of state transitions is 1116 (18 transitions x 10 phones + 6 transitions x 156 phones).

(The same number as the LogProbs entries)
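The two totals quoted above follow directly from the topology, and the arithmetic is easy to reproduce. A small Python check, using only the counts given in the text (variable names are hypothetical):

```python
silence_phones = 10       # phone-ids 1..10, 5-state HMM
nonsilence_phones = 156   # phone-ids 11..166, 3-state HMM

# States per phone: 5 for silence, 3 for non-silence.
total_states = 5 * silence_phones + 3 * nonsilence_phones

# Transitions per phone: 18 for the 5-state HMM, 6 for the 3-state HMM.
total_transitions = 18 * silence_phones + 6 * nonsilence_phones

print(silence_phones + nonsilence_phones, total_states, total_transitions)
```

These match the "number of transition-states 518" and "number of transition-ids 1116" that gmm-info reports for the monophone model elsewhere in this series.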

<LogProbs> 
 [ 0 -0.3281818 -1.378509 -4.027719 -4.60517 ... -1.386294 -0.2876821 -1.386294 ]
</LogProbs>

What is output in the alignment is the identifier of a state transition (transition-id).

For example, the "sil" at the start of the utterance is laid out as "2 1 1 1 8 5 5 5 18 17 17 17 17 17 17 17 17 17 17 ...".

Judging from how the numbers are assigned, the self-loop transitions appear to be added last.

This is because, for any given state, the self-loop has the larger transition-id.

Speech recognition memo (Kaldi) 21 - phoneme model

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/07/22/124409

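Moving on from isolated words as the recognition target, let's test a somewhat more practical example.

First, decompose the training utterances into phonemes.

Original sentence

オススメの料理は何ですか (osusume no ryōri wa nan desu ka: "What dish do you recommend?")

Split the sentence into words (word segmentation, using MeCab)

オススメ の 料理 は 何 です か (osusume no ryōri wa nan desu ka)

Split the words into phoneme sequences (using yomi2voca.pl from Julius)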

オススメ o s u s u m e
の n o
料理 ry o u r i
は w a
何 n a N
です d e s u
か k a
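Once each word has a phoneme sequence, the phoneme string for a whole sentence is a dictionary lookup followed by concatenation. A minimal Python sketch using exactly the word-to-phoneme pairs listed above (the variable names are hypothetical, and this does not reproduce what yomi2voca.pl does internally):

```python
# Word -> phoneme sequence, copied from the listing above.
lexicon = {
    "オススメ": "o s u s u m e",
    "の": "n o",
    "料理": "ry o u r i",
    "は": "w a",
    "何": "n a N",
    "です": "d e s u",
    "か": "k a",
}

# The segmented sentence, one token per word.
words = "オススメ の 料理 は 何 です か".split()

# Look up each word and flatten into a single phoneme list.
phones = [p for w in words for p in lexicon[w].split()]
print(" ".join(phones))
```

Multi-character symbols such as "ry" are single phonemes, which is why the phonemes are kept space-separated rather than joined into one string.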

Excluding silence (sp, sil), there are 40 phoneme types.

Phoneme table

(Source)

http://winnie.kuis.kyoto-u.ac.jp/dictation/doc/phone_m.pdf

Of these, "dy" (ぢゃ, ぢゅ, ぢょ) is excluded,

and 20 sentences that together use all 39 remaining phoneme types were written; each was uttered 3 times, giving 60 audio recordings.

Utterances (20 types)

オススメの料理は何ですか (osusume no ryōri wa nan desu ka: "What dish do you recommend?")

o s u s u m e   n o   ry o u r i   w a   n a N   d e s u   k a

百十番テーブルへどうぞ (hyaku jū ban tēburu e dōzo: "Please go to table 110")

hy a k u   j u u   b a N   t e: b u r u   e   d o u z o

ラーメンと餃子のセットを１つお願いします (rāmen to gyōza no setto o hitotsu onegai shimasu: "One ramen and gyoza set, please")

r a: m e N   t o   gy o u z a   n o   s e q t o   o   h i t o ts u   o n e g a i   sh i   m a s u

メニューお願いします (menyū onegai shimasu: "The menu, please")

m e ny u:   o n e g a i   sh i   m a s u

禁煙席お願いします (kin'en seki onegai shimasu: "A non-smoking seat, please")

k i N e N   s e k i   o n e g a i   sh i   m a s u

お水４つお願いします (o-mizu yottsu onegai shimasu: "Four waters, please")

o   m i z u   y o q ts u   o n e g a i   sh i   m a s u

フォーク２つお願いします (fōku futatsu onegai shimasu: "Two forks, please")

f o: k u   f u t a ts u   o n e g a i   sh i   m a s u

コーヒーは食後にお願いします (kōhī wa shokugo ni onegai shimasu: "Coffee after the meal, please")

k o: h i:   w a   sh o k u g o   n i   o n e g a i   sh i   m a s u

ソフトドリンクはありますか (sofuto dorinku wa arimasu ka: "Do you have soft drinks?")

s o f u t o d o r i N k u   w a   a r i   m a s u   k a

持ち帰りにできますか (mochikaeri ni dekimasu ka: "Can I get this to go?")

m o ch i k a e r i   n i   d e k i   m a s u   k a

別々にできますか (betsubetsu ni dekimasu ka: "Can we pay separately?")

b e ts u b e ts u   n i   d e k i   m a s u   k a

ごちそうさまでした (gochisōsama deshita: "Thank you for the meal")

g o ch i s o u s a m a   d e sh i   t a

牛肉にしてください (gyūniku ni shite kudasai: "Beef, please")

gy u u n i k u   n i   sh i   t e   k u d a s a i

キャベツはお代わり自由です (kyabetsu wa okawari jiyū desu: "Cabbage refills are free")

ky a b e ts u   w a   o   k a w a r i   j i y u u   d e s u

サプライズはできますか (sapuraizu wa dekimasu ka: "Can you arrange a surprise?")

s a p u r a i z u   w a   d e k i   m a s u   k a

シャンパンをください (shanpan o kudasai: "Champagne, please")

sh a N p a N   o   k u d a s a i

かんぴょうをください (kanpyō o kudasai: "Kanpyō (dried gourd strips), please")

k a N py o u   o   k u d a s a i

食事はビュッフェスタイルです (shokuji wa byuffe sutairu desu: "The meal is buffet style")

sh o k u j i   w a   by u q f e   s u t a i r u   d e s u

みょうがを添えてください (myōga o soete kudasai: "Please add myōga (Japanese ginger)")

my o u g a   o   s o e   t e   k u d a s a i

コーヒーと紅茶どちらにしますか (kōhī to kōcha dochira ni shimasu ka: "Would you like coffee or tea?")

k o: h i:   t o   k o u ch a   d o ch i r a   n i   sh i   m a s u   k a

(Reference) data/lang/phones.txt

<eps> 0
sil 1
sil_B 2
sil_E 3
sil_I 4
sil_S 5
spn 6
spn_B 7
spn_E 8
spn_I 9
spn_S 10
N_B 11
N_E 12
N_I 13
N_S 14
a_B 15
a_E 16
a_I 17
a_S 18
<snip>
z_B 163
z_E 164
z_I 165
z_S 166

One recording of 禁煙席お願いします (kin'en seki onegai shimasu: "A non-smoking seat, please") was used as test data,

and the remaining 59 recordings were used for training. The decoding results for each model were as follows.

Monophone

1-gram

utterance_id_053 禁煙 お願い し ます (kin'en onegai shi masu)
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -8.01345 over 323 frames.

2-gram

utterance_id_053 禁煙 席 お願い し ます (kin'en seki onegai shi masu)
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -7.97456 over 323 frames.

gmm-info output

gmm-info exp/mono/final.mdl 
number of phones 166
number of pdfs 127
number of transition-ids 1116
number of transition-states 518
feature dimension 39
number of gaussians 1004

NUMPDFS of the model (number of pdf-classes) : 127 ( 5 hmm states * 2 phones + 3 hmm states * 39 phones )

Triphone (tri1)

1-gram

utterance_id_053 禁煙 お願い し ます (kin'en onegai shi masu)
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -7.9652 over 323 frames.

2-gram

utterance_id_053 禁煙 席 お願い し ます (kin'en seki onegai shi masu)
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -7.92976 over 323 frames.

gmm-info output

gmm-info exp/tri1/final.mdl 
number of phones 166
number of pdfs 152
number of transition-ids 1740
number of transition-states 830
feature dimension 39
number of gaussians 977

Triphone (tri2b、LDA+MLLT)

1-gram

utterance_id_053 禁煙 お願い し ます (kin'en onegai shi masu)
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -4.92207 over 323 frames.

2-gram

utterance_id_053 禁煙 お願い し ます (kin'en onegai shi masu)
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -4.89113 over 323 frames.

gmm-info output

gmm-info exp/tri2b/final.mdl 
number of phones 166
number of pdfs 168
number of transition-ids 2246
number of transition-states 1083
feature dimension 40
number of gaussians 970

DNN(nnet4c)

1-gram

utterance_id_053 禁煙 お願い し ます (kin'en onegai shi masu)
LOG (nnet-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -0.543315 over 323 frames.

nnet-am-info output

nnet-am-info exp/nnet4c/final.mdl 
num-components 9
num-updatable-components 3
left-context 4
right-context 4
input-dim 40
output-dim 192
parameter-dim 446703
component 0 : SpliceComponent, input-dim=40, output-dim=360, context=-4 -3 -2 -1 0 1 2 3 4 
component 1 : FixedAffineComponent, input-dim=360, output-dim=360, <snip>
component 2 : AffineComponentPreconditionedOnline, input-dim=360, output-dim=375, <snip>
component 3 : TanhComponent, input-dim=375, output-dim=375
component 4 : AffineComponentPreconditionedOnline, input-dim=375, output-dim=375, <snip>
component 5 : TanhComponent, input-dim=375, output-dim=375
component 6 : AffineComponentPreconditionedOnline, input-dim=375, output-dim=453, <snip>
component 7 : SoftmaxComponent, input-dim=453, output-dim=453
component 8 : SumGroupComponent, input-dim=453, output-dim=192
prior dimension: 192, prior sum: 1, prior min: 1e-20

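(The test utterance is kin'en seki onegai shimasu: "A non-smoking seat, please")

In some cases the 席 (seki) that follows 禁煙 (kin'en) is dropped, but the results are expected to change with the amount of training data and the weight parameters.

Also, the word 料理 (ryōri), for example, was transcribed here as "ry o u r i", but it would be good if it could also be recognized as "ry o: r i". (In Japanese, "ou" is pronounced as a long "o".)

At this point one starts to appreciate how deep the field of conversational speech is.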

Speech recognition memo (Kaldi) 20 - Training Dan's DNN (nnet2)

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/07/15/121825

Let's follow the flow of nnet2 training.

This time the target is nnet4c, which uses tanh as the activation function.

${KALDI_ROOT}/egs/rm/s5/local/nnet2/run_4c.sh

# for CPU only (with --use-gpu false).
steps/nnet2/train_tanh_fast.sh \
    --stage -10 \
    --minibatch-size 128 \
    --num-epochs 20 \
    --add-layers-period 1 \
    --num-hidden-layers 2 \
    --mix-up 4000 \
    --initial-learning-rate 0.02 \
    --final-learning-rate 0.004 \
    --hidden-layer-dim 375 \
    data/train \                    # <data>
    data/lang \                     # <lang>
    exp/tri3b_ali \                 # <ali-dir>
    exp/nnet4c_manual               # <exp-dir>

Checking the internal processing

stage: -4

steps/nnet2/get_lda.sh \
    --transform-dir exp/tri3b_ali \
    --splice-width 4 \
    data/train \
    data/lang \
    exp/tri3b_ali \
    exp/nnet4c

stage: -3

Split the training data into "validation" and "training" sets.

steps/nnet2/get_egs.sh \
    --transform-dir exp/tri3b_ali \
    --splice-width 4 \
    --stage 0 \
    data/train \
    data/lang \
    exp/tri3b_ali \
    exp/nnet4c

stage: -2

Generate the initial model

nnet-am-init \
    exp/tri3b_ali/tree \
    data/lang/topo \
    'nnet-init exp/nnet4c/nnet.config -|' \
    exp/nnet4c/0.mdl

==> The model has 6 components

Splice / FixedAffine / AffinePre / Tanh / AffinePre / Softmax

"AffinePre" is short for "AffineComponentPreconditionedOnline"

stage: -1

Update the transition probabilities

nnet-train-transitions \
    exp/nnet4c/0.mdl \
    'ark:gunzip -c exp/tri3b_ali/ali.*.gz|' \
    exp/nnet4c/0.mdl

The loop starts here.

The loop runs 25 times ($num_epochs + $num_epochs_extra).

When the counter is 1, a hidden layer is added

==> The model now has 8 components ("Tanh" and "AffinePre" are added)

Splice / FixedAffine / AffinePre / Tanh / AffinePre / Tanh / AffinePre / Softmax

When the counter is 13, mixup is performed

==> The model now has 9 components ("SumGroup" is added)

Splice / FixedAffine / AffinePre / Tanh / AffinePre / Tanh / AffinePre / Softmax / SumGroup
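The way the component list grows over the loop can be written down as a small function of the loop counter. This is a sketch of the schedule described above, not of the training script itself: the function name is hypothetical, and it assumes each change is already in effect at the counter value where the text says it happens.

```python
def components_at(counter, add_layer_at=1, mixup_at=13):
    """Component list of the model once the given loop counter has been reached."""
    comps = ["Splice", "FixedAffine", "AffinePre", "Tanh", "AffinePre", "Softmax"]
    if counter >= add_layer_at:
        # The new hidden layer goes in just before the Softmax.
        comps[-1:-1] = ["Tanh", "AffinePre"]
    if counter >= mixup_at:
        # Mixup appends a SumGroup after the Softmax.
        comps.append("SumGroup")
    return comps

for counter in (0, 1, 13):
    comps = components_at(counter)
    print(counter, len(comps), " / ".join(comps))
```

The three printed lists have 6, 8 and 9 components, matching the three component lists shown above.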

What happens at each step of the loop is summarized below.

![](https://cdn-ak.f.st-hatena.com/images/fotolife/i/ichou1/20180715/20180715130237.png)

The models after mixup are used to generate final.mdl.

nnet-combine-fast \
    exp/nnet4c/14.mdl \
    exp/nnet4c/15.mdl \
    <snip> \
    exp/nnet4c/25.mdl \
    ark:exp/nnet4c/egs/combine.egs \    # <valid-examples-in>
    exp/nnet4c/final.mdl

Speech recognition memo (Kaldi) 19 - Toolkit script (3)

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/07/08/084903

The previous post, on the "Kaldi for Dummies tutorial", went as far as the initial triphone training.

TRI1 - simple triphone training (first triphone pass).

Let's look at the processing that comes after that.

egs/rm/s5/RESULTS lists the WER of each experiment; some of them are listed below.

mono

Monophone, MFCC + delta + accel

tri1

MFCC + delta + accel

tri2a

MFCC + delta + accel (on top of better alignments)

tri2b

LDA + MLLT

tri3b

LDA + MLLT + SAT

tri3c

raw-fMLLR ( fMLLR on the raw MFCCs )

sgmm2_4[a-c]

SGMM2 is a new version of the code that has tying of the substates a bit like "state-clustered tied mixture" systems; and which has speaker-dependent mixture weights.

nnet4[a-e]

Deep neural net -- various types of hybrid system.

dnn4b

MFCC, LDA, fMLLR feaures, (Karel - 30.7.2015)

cnn4c

FBANK + pitch features, (Karel - 30.7.2015)

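Of these, taking nnet4d (the primary nnet2 recipe) as the target, let's trace the flow backwards to the initial triphone model (tri1).

(This was checked under CPU-only conditions, since the environment used has no GPU.)

Following the explanation on the official site, rm/s5/local/run_nnet2.sh was confirmed to be the entry-point script.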

The first place to look to get a top level overview of the neural net training is probably the scripts. In the standard example scripts in egs/rm/s5, egs/wsj/s5 and egs/swbd/s5b, the top-level script is run.sh. This script calls (sometimes commented out) a script called local/run_nnet2.sh. This is the top-level example script for Dan's setup.

Excerpt from rm/s5/local/run_nnet2.sh

# **THIS IS THE PRIMARY RECIPE (40-dim + fMLLR + p-norm neural net)**
local/nnet2/run_4d.sh --use-gpu false

Excerpt from egs/rm/s5/local/nnet2/run_4d.sh

steps/nnet2/train_pnorm_fast.sh \
    data/train \
    data/lang \
    exp/tri3b_ali \
    exp/nnet4d

To generate the training output exp/nnet4d, the alignment data exp/tri3b_ali is needed as input.

Excerpt from egs/rm/s5/run.sh

# Align all data with LDA+MLLT+SAT system (tri3b)
steps/align_fmllr.sh \
    --use-graphs true \
    data/train \
    data/lang \
    exp/tri3b \
    exp/tri3b_ali

To generate the alignment output exp/tri3b_ali, exp/tri3b is needed as input.

Excerpt from egs/rm/s5/run.sh

## Do LDA+MLLT+SAT
steps/train_sat.sh \
 1800 \          # <#leaves>
 9000 \          # <#gauss>
 data/train \    # <data>
 data/lang \     # <lang>
 exp/tri2b_ali \ # <ali-dir>
 exp/tri3b       # <exp-dir>

To generate the training output exp/tri3b, the alignment data exp/tri2b_ali is needed as input.

Excerpt from egs/rm/s5/run.sh

# Align all data with LDA+MLLT system (tri2b)
steps/align_si.sh \
 --use-graphs true \
 data/train \
 data/lang \
 exp/tri2b \
 exp/tri2b_ali

To generate the alignment data exp/tri2b_ali, exp/tri2b is needed as input.

Excerpt from egs/rm/s5/run.sh

# train and decode tri2b [LDA+MLLT]
steps/train_lda_mllt.sh \
 1800 \         # <#leaves>
 9000 \         # <#gauss>
 data/train \   # <data>
 data/lang \    # <lang>
 exp/tri1_ali \ # <ali-dir>
 exp/tri2b      # <exp-dir>

To generate the trained model exp/tri2b, the alignment data exp/tri1_ali is needed as input.

Excerpt from egs/rm/s5/run.sh

# align tri1
steps/align_si.sh \
 --use-graphs true \
 data/train \
 data/lang \
 exp/tri1 \
 exp/tri1_ali

To generate the alignment data exp/tri1_ali, the trained model exp/tri1 is needed as input.

The flow from exp/tri1 to exp/nnet4d is summarized below.

1. Alignment using the triphone model (MFCC + delta + accel)

The output is exp/tri1_ali

2. Generation and training of the triphone model (LDA + MLLT)

The output is exp/tri2b

3. Alignment using the triphone model (LDA + MLLT)

The output is exp/tri2b_ali

4. Generation and training of the triphone model (LDA + MLLT + SAT)

The output is exp/tri3b

5. Alignment using the triphone model (LDA + MLLT + SAT)

The output is exp/tri3b_ali

6. Generation and training of the neural network model

The output is exp/nnet4d
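The six steps form a chain in which each output directory is the next step's input. As a sanity check, the chain can be written down as data and verified. The script names and directories are taken from the excerpts above (the final output follows the run_4d.sh excerpt, exp/nnet4d); the list layout and helper name are hypothetical.

```python
# (script, input directory, output directory), in execution order.
steps = [
    ("steps/align_si.sh",               "exp/tri1",      "exp/tri1_ali"),
    ("steps/train_lda_mllt.sh",         "exp/tri1_ali",  "exp/tri2b"),
    ("steps/align_si.sh",               "exp/tri2b",     "exp/tri2b_ali"),
    ("steps/train_sat.sh",              "exp/tri2b_ali", "exp/tri3b"),
    ("steps/align_fmllr.sh",            "exp/tri3b",     "exp/tri3b_ali"),
    ("steps/nnet2/train_pnorm_fast.sh", "exp/tri3b_ali", "exp/nnet4d"),
]

def is_chain(step_list):
    """True if every step consumes exactly what the previous step produced."""
    return all(prev[2] == nxt[1] for prev, nxt in zip(step_list, step_list[1:]))

print(is_chain(steps))
for script, src, dst in steps:
    print(f"{src} -> {dst}  ({script})")
```

Alignment and training alternate: each training step consumes alignments produced by the previous, simpler model.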

Speech recognition memo (Kaldi) 18 - Toolkit script

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/07/01/154347

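The procedure for recognizing audio data you prepared yourself is explained in the Kaldi for Dummies tutorial.

As the name "for Dummies" suggests, it is a good thing to try after the yes/no sample (previous post).

The overall flow is as follows.

Download Kaldi (clone from GitHub)
Data preparation (prepare audio data and language data)
Project finalization (copy the scoring script / install SRILM / write the config files)
Running scripts creation (create cmd.sh / path.sh / run.sh)
Getting results (run run.sh)

As for the language model, Julius offers an N-gram or a DFA (another post by the original author) for continuous words and the "-w" option (another post by the original author) for isolated words, whereas Kaldi uses only the N-gram.

Some language model toolkits for building N-grams

Language model processing tools

Notes on language model building tools - Negative/Positive Thinking

As in the tutorial, let's use SRILM.

The latest version at the time of the original post is 1.7.2 (updated 9 November 2016)

This time, let's follow the flow of run.sh from the "Running scripts creation" section.

The flow inside the script is broadly as follows.

Audio data preparation (mapping utterances to speakers (a warning is issued if there is only one speaker), feature extraction)
Language model preparation (conversion to WFST; Grammar and Lexicon)
Generation and training of the monophone model
Decoding with the monophone model
Alignment with the monophone model (this becomes the input for triphone model generation)
Generation and training of the triphone model
Decoding with the triphone model

To run run.sh, it is enough to have the following files in place.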

% tree --charset C    
.
|-- cmd.sh
|-- conf
|   |-- decode.config
|   `-- mfcc.conf
|-- data
|   |-- local
|   |   |-- corpus.txt
|   |   `-- dict
|   |       |-- lexicon.txt
|   |       |-- nonsilence_phones.txt
|   |       |-- optional_silence.txt
|   |       `-- silence_phones.txt
|   |-- test
|   |   |-- text
|   |   |-- utt2spk
|   |   `-- wav.scp
|   `-- train
|       |-- text
|       |-- utt2spk
|       `-- wav.scp
|-- local
|   `-- score.sh
|-- path.sh
|-- run.sh
|-- steps -> ${KALDI_ROOT}/egs/wsj/s5/steps
`-- utils -> ${KALDI_ROOT}/egs/wsj/s5/utils

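"data/train" and "data/test" hold the speech data.

"data/train" is for training and "data/test" is for testing. Here the utterance "moshimoshi" was recorded three times, giving three files.

(two files for training, one file for testing)

"data/local" holds the language data.

In addition to "moshimoshi", two more words that can be expressed with the same phonemes, "momo" (peach) and "imo" (a root vegetable such as sweet potato or potato), were added, for a total of three words.

Let's go through the processing inside "run.sh" in order.

1. Preparing the acoustic data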

# ===== PREPARING ACOUSTIC DATA =====

# Making spk2utt files
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
utils/utt2spk_to_spk2utt.pl data/test/utt2spk > data/test/spk2utt

# ===== FEATURES EXTRACTION =====

# Making feats.scp files
steps/make_mfcc.sh data/train exp/make_mfcc/train $mfccdir
steps/make_mfcc.sh data/test exp/make_mfcc/test $mfccdir

# Making cmvn.scp files
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test $mfccdir
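
The first step turns the utterance-to-speaker map (utt2spk, one "utterance-id speaker-id" pair per line) into the speaker-to-utterances map (spk2utt, one "speaker-id utt1 utt2 ..." line per speaker). What that conversion does can be sketched with awk as below. This is only an illustration and not the Perl script itself; the IDs are made-up sample data, and the shell function merely borrows the script's name.

```bash
#!/usr/bin/env bash
# Sketch of the utt2spk -> spk2utt conversion using awk.
# The utterance and speaker IDs below are made-up sample data.
utt2spk_sample='utt_001 spk_a
utt_002 spk_a
utt_003 spk_b'

utt2spk_to_spk2utt() {
  # Collect the utterances of each speaker, keeping first-seen speaker order.
  awk '{
    if (!($2 in utts)) { order[++n] = $2; utts[$2] = $1 }
    else               { utts[$2] = utts[$2] " " $1 }
  }
  END { for (i = 1; i <= n; i++) print order[i], utts[order[i]] }'
}

spk2utt=$(printf '%s\n' "$utt2spk_sample" | utt2spk_to_spk2utt)
echo "$spk2utt"
```

With a single speaker the result is a single line, which is the situation in which the warning mentioned earlier appears.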

2. Preparing the language model

# ===== PREPARING LANGUAGE DATA =====
utils/prepare_lang.sh \
    data/local/dict \
    "<UNK>" \
    data/local/lang \
    data/lang

# ===== MAKING lm.arpa =====
lm_order=1 # language model order (n-gram quantity)
ngram-count \
    -order $lm_order \
    -write-vocab \
    data/local/tmp/vocab-full.txt \
    -wbdiscount \
    -text data/local/corpus.txt \
    -lm data/local/tmp/lm.arpa

# ===== MAKING G.fst =====

arpa2fst \
    --disambig-symbol=#0 \
    --read-symbol-table=data/lang/words.txt \
    data/local/tmp/lm.arpa \
    data/lang/G.fst

Lexicon(L.fst)
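
With lm_order=1 the model built above is a unigram, so it depends only on how often each word occurs in corpus.txt. As a rough illustration, the raw counts can be obtained with standard tools. This is not what ngram-count writes to lm.arpa, since -wbdiscount applies Witten-Bell discounting on top of the counts. The sample corpus is invented for this sketch (only the three words come from the text above), and `unigram_counts` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: raw unigram counts from a tiny corpus (one sentence per line).
# The corpus content is invented; only the three words come from the text.
corpus_sample='moshimoshi
momo
imo
moshimoshi'

unigram_counts() {
  # Put one word per line, sort in the C locale for a stable order,
  # count identical lines, then print "word count".
  tr -s ' ' '\n' | LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}

counts=$(printf '%s\n' "$corpus_sample" | unigram_counts)
echo "$counts"
```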

3. Building and training the monophone model

steps/train_mono.sh \
    data/train \           # <data-dir>
    data/lang \            # <lang-dir>
    exp/mono               # <exp-dir>

The Kaldi commands called inside the script are listed below.

The script has a variable called "stage", which makes it possible to restart partway through.
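
The restart mechanism is a simple comparison idiom: each block runs only when its number is at least the requested stage. The following is a minimal sketch of that idiom, not the script itself. `run_from_stage` is a hypothetical function, and the echoed names stand in for the commands listed for each stage.

```bash
#!/usr/bin/env bash
# Sketch of the "stage" idiom: a block runs only if stage <= its number,
# so passing a later stage skips the blocks that already finished.
# run_from_stage is illustrative; it only echoes command names.
run_from_stage() {
  stage=$1
  if [ "$stage" -le -3 ]; then echo "gmm-init-mono"; fi
  if [ "$stage" -le -2 ]; then echo "compile-train-graphs"; fi
  if [ "$stage" -le -1 ]; then echo "align-equal-compiled gmm-acc-stats-ali"; fi
  if [ "$stage" -le 0 ];  then echo "gmm-est"; fi
}

full_run=$(run_from_stage -4)     # start from the very beginning
resumed_run=$(run_from_stage -1)  # resume after the graphs were compiled
echo "$full_run"
echo "---"
echo "$resumed_run"
```

The stage numbers and the commands that belong to each of them are listed next.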

stage: -3

# Initialize monophone GMM
gmmbin/gmm-init-mono

stage: -2

# Creates training graphs(without transition-probabilities, by default)
bin/compile-train-graphs

stage: -1

Statistics are generated from an equally spaced alignment.

# Write an equally spaced alignment(for getting training started)
bin/align-equal-compiled

# Accumulate stats for GMM training
gmmbin/gmm-acc-stats-ali

stage: 0

# Do Maximum Likelihood re-estimation of GMM-based acoustic model
gmmbin/gmm-est

iteration (the default number of training iterations is 40)

# Modify GMM-based model to boost
gmmbin/gmm-boost-silence

# Align features given [GMM-based] models
gmmbin/gmm-align-compiled

# Accumulate stats for GMM training
gmmbin/gmm-acc-stats-ali

# Same as in stage 0 above
gmmbin/gmm-est

"exp/mono/final.mdl" is the output.

4. Decoding with the monophone model

From the language data ("data/lang/L.fst", "data/lang/G.fst", and others), the word graph "HCLG.fst", whose input is HMM states, is built.

utils/mkgraph.sh \
    --mono \
    data/lang \      # <lang-dir>
    exp/mono \       # <model-dir>
    exp/mono/graph   # <graphdir>

The "--mono" option appears to be deprecated.

Note: the --mono, --left-biphone and --quinphone options are now deprecated and will be ignored.

Decoding is run with the "gmmbin/gmm-latgen-faster" command.

The speech data is the set prepared for testing (different from the training data).

steps/decode.sh \
    --config conf/decode.config \
    exp/mono/graph \             # <graph-dir>
    data/test \                  # <data-dir>
    exp/mono/decode              # <decode-dir>

5. Alignment with the monophone model

steps/align_si.sh \
    data/train \          # <data-dir>
    data/lang \           # <lang-dir>
    exp/mono \            # <src-dir>
    exp/mono_ali          # <align-dir>

Alignment result (exp/mono_ali/ali.1.gz)

utterance_id_001 2 1 1 1 1 1 8 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 18 17 17 206 208 207 210 242 244 246 245 245 245 245 266 265 265 265 268 267 267 267 267 267 267 267 267 270 269 269 269 269 269 194 193 193 193 196 195 195 195 195 198 197 197 197 218 217 217 217 217 220 219 219 219 219 222 221 221 242 244 246 245 245 245 245 245 245 245 245 245 245 245 245 245 245 245 266 268 270 269 269 269 269 269 269 269 269 269 269 269 188 190 189 189 189 189 189 189 192 191 191 191 191 191 3 1 1 1 1 1 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 7 5 5 14 15 15 15 15 15 15 15 15 15 15 12 10 10 10 10 10 10 10 10 10 10 10 10 10 18 
utterance_id_002 2 8 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 18 17 17 17 17 17 17 17 17 206 208 207 207 210 209 209 209 209 209 209 209 209 242 241 241 241 241 241 244 243 243 243 243 243 246 245 245 245 245 245 266 265 265 265 265 268 267 267 267 267 267 267 270 269 269 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 196 195 198 218 220 222 242 241 241 241 241 241 241 244 246 266 268 270 188 190 192 3 9 10 10 10 10 10 10 6 5 5 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 6 5 5 5 9 10 10 10 10 10 10 10 10 10 6 5 5 9 10 10 10 10 10 6 5 12 10 10 10 10 18
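
Each line of the alignment holds an utterance ID followed by one integer per feature frame, so the length of a line gives the number of frames in that utterance. A minimal sketch of that count, assuming the alignment has already been written out as text in the form shown above. The two short lines are made-up sample data and `count_frames` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: count frames per utterance in a text-format alignment.
# Each line is "<utterance-id> <id> <id> ...", one integer per frame.
# The two short lines below are made-up sample data, not real Kaldi output.
ali_sample='utt_a 2 1 1 8 5 5 5 18
utt_b 2 8 5 18'

count_frames() {
  # NF is the number of fields; subtract one for the utterance ID.
  awk '{ print $1, NF - 1 }'
}

frame_counts=$(printf '%s\n' "$ali_sample" | count_frames)
echo "$frame_counts"
```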

6. Building and training the triphone model

steps/train_deltas.sh \
    2000 \             # <num-leaves>
    11000 \            # <tot-gauss>
    data/train \       # <data-dir>
    data/lang \        # <lang-dir>
    exp/mono_ali \     # <alignment-dir>
    exp/tri1           # <exp-dir>

The Kaldi commands called inside the script are listed below.

stage: -3

# Accumulate statistics for phonetic-context tree building.
bin/acc-tree-stats

# Sum statistics for phonetic-context tree building.
bin/sum-tree-stats

stage: -2

# Cluster phones (or sets of phones) into sets for various purposes
bin/cluster-phones

# Compile questions
bin/compile-questions

# Train decision tree
bin/build-tree

# Initialize GMM from decision tree and tree stats
gmmbin/gmm-init-model

# Does GMM mixing up (and Gaussian merging)
gmmbin/gmm-mixup

stage: -1

# Convert alignments from one decision-tree/model to another
bin/convert-ali

stage: 0

# Creates training graphs (without transition-probabilities, by default)
bin/compile-train-graphs

iteration (the default number of training iterations is 35)

# Align features given [GMM-based] models.
gmmbin/gmm-align-compiled

# Accumulate stats for GMM training.
gmmbin/gmm-acc-stats-ali

# Do Maximum Likelihood re-estimation of GMM-based acoustic model
gmmbin/gmm-est

"exp/tri1/final.mdl" is the output.

7. Decoding with the triphone model

Same as for the monophone model.

utils/mkgraph.sh \
    data/lang \      # <lang-dir>
    exp/tri1 \       # <model-dir>
    exp/tri1/graph   # <graphdir> 

steps/decode.sh \
    --config conf/decode.config \
    exp/tri1/graph \             # <graph-dir>
    data/test \                  # <data-dir>
    exp/tri1/decode              # <decode-dir>

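Speech recognition memo (kaldi) 17 - Toolkit script

To the original author:

I did not know how to contact you, so I regret to say that I translated this post without permission.
I would like to obtain your formal permission for the translation.
Please contact me at gogyzzz@gmail.com.

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/06/24/111611

Kaldi is used by running its commands from Bash scripts.

This time, let's take a look at those scripts.

The directory layout downloaded from GitHub is as follows.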

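egs (the part we want to look at)
src (source code)
misc (papers and the like? not checked)
tools (external tools: OpenFST, ATLAS, etc.)
windows (for Windows)

For details, see the explanation on the official Kaldi site (the "Kaldi directories structure" part).

"egs" contains example scripts for each corpus.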

egs – example scripts allowing you to quickly build ASR systems for over 30 popular speech corporas (documentation is attached for each project),

What should you do when you have prepared your own speech data?

According to the official Kaldi tutorial, you can use the scripts in "egs/wsj/s5".

From Project finalization -> Tools attachment:

From kaldi-trunk/egs/wsj/s5 copy two folders (with the whole content) - utils and steps - and put them in your kaldi-trunk/egs/digits directory.
You can also create links to these directories.

"wsj" appears to be the Wall Street Journal news text corpus.

From egs/wsj/README.txt:

About the Wall Street Journal corpus:
This is a corpus of read sentences from the Wall Street Journal, recorded under clean conditions.
The vocabulary is quite large. About 80 hours of training data.
Available from the LDC as either: [ catalog numbers LDC93S6A (WSJ0) and LDC94S13A (WSJ1) ]
or: [ catalog numbers LDC93S6B (WSJ0) and LDC94S13B (WSJ1) ]
....

In the directories of other corpora too (for example "egs/rm/s5/steps"), "steps" is a symbolic link to "egs/wsj/s5/steps".

/opt/kaldi/egs/rm/s5% ls -l steps
lrwxrwxrwx 1 ichou1 ichou1 18 Feb  5 19:46 steps -> ../../wsj/s5/steps
/opt/kaldi/egs/rm/s5% file steps 
steps: symbolic link to `../../wsj/s5/steps' 
/opt/kaldi/egs/rm/s5%

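For those who have no corpus but still want to try Kaldi, "egs/yesno" is provided.

It ships with the speech data (.wav), so it can be tried right away.

(Each file contains eight utterances, each either "YES" or "NO", in varying patterns. 31 files are used for training and 29 for testing.)

From egs/yesno/README: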

The "yesno" corpus is a very small dataset of recordings of one individual saying yes or no multiple times per recording, in Hebrew.

From egs/yesno/s5/waves_yesno/README:

The archive "waves_yesno.tar.gz" contains 60 .wav files, sampled at 8 kHz. 
All were recorded by the same male speaker, in English (although the individual is not a native speaker).
In each file, the individual says 8 words; 
each word is either "yes" or "no", so each file is a random sequence of 8 yes-es or noes.
There is no separate transcription provided; 
the sequence is encoded in the filename, with 1 for yes and 0 for no, for instance:

How to run

```bash
cd egs/yesno/s5
./run.sh
```

### What happens inside

- Data preparation

--> runs "local/prepare_data.sh", "local/prepare_dict.sh", "utils/prepare_lang.sh", and "local/prepare_lm.sh"

- Feature extraction

--> runs "steps/make_mfcc.sh", "steps/compute_cmvn_stats.sh", and "utils/fix_data_dir.sh"

("steps" and "utils" are links to "egs/wsj/s5/steps" and "egs/wsj/s5/utils")

- Mono training

--> runs "steps/train_mono.sh"

- Graph compilation

--> runs "utils/mkgraph.sh"

- Decoding (recognition)

--> runs "steps/decode.sh"

The WER (word error rate) is printed to the console during the run.

The decoding result can be checked in egs/yesno/s5/exp/mono0a/decode_test_yesno/log/decode.1.log.

### Example: recognition result for "egs/yesno/s5/waves_yesno/1_0_0_0_0_0_0_0.wav"

```plaintext
1_0_0_0_0_0_0_0 YES NO NO NO NO NO NO NO
```

The command for running the decoder directly, with "/opt/kaldi" as the top-level directory (the result is written to standard output in text format):

### decode (without lattice)

```bash
/opt/kaldi/src/gmmbin/gmm-decode-faster \
  --word-symbol-table=/opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/words.txt \
  /opt/kaldi/egs/yesno/s5/exp/mono0a/40.mdl \
  /opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/HCLG.fst \
  "ark,s,cs:/opt/kaldi/src/featbin/apply-cmvn --utt2spk=ark:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/utt2spk scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/cmvn.scp scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/feats.scp ark:- | /opt/kaldi/src/featbin/add-deltas ark:- ark:- |" \
  ark,t:-
```

See the previous post for the parameters passed here.

### Result (without lattice)

```plaintext
1_0_0_0_0_0_0_0 3 2 2 2 2 2 2 2
1_0_0_0_0_0_0_0 YES NO NO NO NO NO NO NO
LOG (gmm-decode-faster[5.3.106~1389-9e2d8]:main():gmm-decode-faster.cc:196) Log-like per frame for utterance 1_0_0_0_0_0_0_0 is -8.37946 over 668 frames.
```

In words.txt, "3" corresponds to "YES" and "2" to "NO".

### decode (with lattice)

```bash
/opt/kaldi/src/gmmbin/gmm-latgen-faster \
  --word-symbol-table=/opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/words.txt \
  /opt/kaldi/egs/yesno/s5/exp/mono0a/final.mdl \
  /opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/HCLG.fst \
  "ark,s,cs:/opt/kaldi/src/featbin/apply-cmvn --utt2spk=ark:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/utt2spk scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/cmvn.scp scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/feats.scp ark:- | /opt/kaldi/src/featbin/add-deltas ark:- ark:- |" \
  ark,t:-
```

### Result (with lattice)

```plaintext
1_0_0_0_0_0_0_0 YES NO NO NO NO NO NO NO
1_0_0_0_0_0_0_0
0 1 3 9.34174, 10746.4, 4111111618
1 2 2 3.00029, 3604.42, 151515151515
2 3 2 3.75534, 460.406, 2929
3 4 2 6.37105, 626.19,
4 5 2 5.32006, 589.474,
5 6 2 5.67636, 4377.79,
6 7 2 5.32006, 596.049,
7 8 2 4.3186, 6239.1,
8 9 2 5.85963, 5268.64, 29292929
8 9.50533, 28208.8, 292929294111
9 7.30095, 22958.9, 2628304161515_

LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance 1_0_0_0_0_0_0_0 is -8.37946 over 668 frames.
```

### The model (egs/yesno/s5/exp/mono0a/final.mdl) shown in text format

```plaintext
2 3 0 0 0 0.75 1 0.25 1 1 1 0.75 2 0.25 2 2 2 0.75 3 0.25 3 1 0 0 0 0.25 1 0.25 2 0.25 3 0.25 1 1 1 0.25 2 0.25 3 0.25 4 0.25 2 2 1 0.25 2 0.25 3 0.25 4 0.25 3 3 1 0.25 2 0.25 3 0.25 4 0.25 4 4 4 0.75 5 0.25 5 11 1 0 0 1 1 1 1 2 2 1 3 3 1 4 4 2 0 5 2 1 6 2 2 7 3 0 8 3 1 9 3 2 10 [ 0 -0.3016863 -4.60517 -2.116771 -2.040137 -0.05096635 -4.60517 -3.516702 -4.60517 -4.60517 -0.09362812 -2.668062 -4.60517 -4.60517 -4.60517 -0.1123881 -2.449803 -0.04502614 -3.122941 -0.3431785 -1.236192 -0.1315082 -2.09372 -0.07189104 -2.668334 -0.1359556 -2.062634 -0.09793975 -2.371973 -0.04792399 -3.062005 ] 39 11 [ -162.6711 -100.3258 -150.894 -774.145 ] [ 0.02608728 0.03167231 0.03214631 0.03326807 0.01074118 ] [ -3.798081 -5.357131 0.8406813 0.918729 1.014658 "snip" 0.5328674 1.181959 -0.6352269 -0.7017035 -0.06531551 "snip" ] [ 0.2399497 0.4042536 0.2387805 0.09193342 0.04029746 "snip" 0.282881 0.1213772 0.07582887 0.03232023 0.03635461 "snip" ] "snip" ( 10 times repeat )
```

"YES" (/jes/) is not split into units such as "j-e+s"; it is treated as a single unit (a simplification, since this is only an example).

### phone transcriptions (egs/yesno/s5/data/local/dict/lexicon.txt)

```plaintext
<SIL> SIL
YES Y
NO N
```
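
As noted for the decoder output, the integers "3" and "2" are word IDs that words.txt maps to "YES" and "NO". The lookup can be sketched in plain shell as below. Kaldi ships utils/int2sym.pl for the real thing; this sketch does not call it. The two-entry table holds only the mappings stated in the text (a real words.txt has more symbols), and the function name `int2sym` is chosen for this illustration.

```bash
#!/usr/bin/env bash
# Sketch: map integer word IDs back to words with a symbol table.
# Only the two entries stated in the text (NO=2, YES=3) are included.
words_txt='NO 2
YES 3'

int2sym() {
  # Keep the first field (utterance ID) and translate every following ID.
  while read -r utt ids; do
    out=$utt
    for id in $ids; do
      word=$(printf '%s\n' "$words_txt" | awk -v id="$id" '$2 == id { print $1 }')
      out="$out $word"
    done
    echo "$out"
  done
}

decoded=$(echo "1_0_0_0_0_0_0_0 3 2 2 2 2 2 2 2" | int2sym)
echo "$decoded"
```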

decision tree description (egs/yesno/s5/exp/mono0a/tree)

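Speech recognition memo (kaldi) 16 - Backpropagation Dan's DNN(nnet2)

To the original author:

I did not know how to contact you, so I regret to say that I translated this post without permission.
I would like to obtain your formal permission for the translation.
Please contact me at gogyzzz@gmail.com.

This is a translation of the post below.

http://work-in-progress.hatenablog.com/entry/2018/06/17/100622

Continuing from the previous post.

Let's follow the parameter update performed by backpropagation.

Suppose the Softmax outputs (probabilities) are as follows (only the correct pdf-class is shown).

(digits from the sixth decimal place onward are truncated)

Next, compute the reciprocal of each probability multiplied by its weight (here all weights are 1).

(digits from the third decimal place onward are truncated)

The lower the probability, the larger this value (= the error).

Starting from this value, the error at each preceding component is computed in turn while the parameters are updated.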
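
The "reciprocal of the probability times the weight" is what one gets by differentiating a weighted log-probability with respect to the probability. A short worked form, assuming the per-frame objective is $w \log p$ (the symbols $w$ and $p$ and the sample value 0.25 are introduced here only for illustration):

```latex
\[
  \frac{\partial}{\partial p}\bigl(w \log p\bigr) = \frac{w}{p},
  \qquad \text{e.g. } w = 1,\; p = 0.25 \;\Rightarrow\; \frac{w}{p} = 4 .
\]
```

A probability closer to 0 therefore gives a larger value, in line with the observation above.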

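Suppose the frames pass through propagation as follows.

(starting from the state after splice and FixedAffine (the LDA transform) have been applied)

< Figure legend >

Orange circles represent frame data
"A x B" means the data has A rows and B columns ("mini" is the minibatch size, 64 here)
Arrows represent the individual components

error backpropagation(1)

First, the "Softmax" component.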

nnet2/nnet-component.cc

in_deriv->DiffSoftmaxPerRow(out_value, out_deriv);

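< Figure legend >

Green circles represent the error data of the frames (same dimensions as the orange circles)
1 in the figure corresponds to out_value in the source
2 in the figure corresponds to out_deriv in the source
3 in the figure corresponds to in_deriv in the source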
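
For reference, the standard softmax backward pass, which is presumably what DiffSoftmaxPerRow evaluates row by row, can be written as follows. Here $y$ stands for one row of out_value, $e$ for the matching row of out_deriv, and $d$ for the resulting row of in_deriv. These single-letter symbols are introduced for this sketch and do not appear in the source.

```latex
\[
  d_i \;=\; \sum_k e_k \,\frac{\partial y_k}{\partial x_i}
      \;=\; \sum_k e_k\, y_k\,(\delta_{ik} - y_i)
      \;=\; y_i \Bigl( e_i - \sum_k e_k\, y_k \Bigr)
\]
```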

error backpropagation(2)

Next, the "AffineComponentPreconditionedOnline" component.

nnet2/nnet-component.cc

in_deriv->AddMatMat(1.0, out_deriv, kNoTrans, linear_params_, kNoTrans, 0.0);

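< Figure legend >

1 in the figure corresponds to out_deriv in the source
2 in the figure corresponds to linear_params_ in the source (the model parameters)
3 in the figure corresponds to in_deriv in the source

In addition, the model parameters are updated from the layer input and the output error.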
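
In matrix form, assuming the affine layer computes $Y = X W^{\top} + \mathbf{1} b^{\top}$ with $W$ of size (output dim) x (input dim), the AddMatMat call above and a plain gradient step would read as below. $X$ (layer input), $E$ (out_deriv), $W$ (linear_params_), $b$ and the learning rate $\eta$ are symbols introduced here. The PreconditionedOnline variant additionally rescales $X$ and $E$ before the update, which this sketch leaves out.

```latex
\[
  \underbrace{\frac{\partial L}{\partial X}}_{\texttt{in\_deriv}}
    = E\,W ,
  \qquad
  \Delta W \;\propto\; \eta\, E^{\top} X ,
  \qquad
  \Delta b \;\propto\; \eta\, E^{\top} \mathbf{1}
\]
```

With $E$ of size $N \times D_{\text{out}}$ and $W$ of size $D_{\text{out}} \times D_{\text{in}}$, the product $E\,W$ has the same $N \times D_{\text{in}}$ shape as the layer input, which is why neither matrix is transposed in the call.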

error backpropagation(3)

Next, the "Normalize" component.

nnet2/nnet-component.cc

cu::DiffNormalizePerRow(in_value, out_deriv, BaseFloat(1), false, in_deriv);

< Figure legend >

1 in the figure corresponds to out_value in the source
2 in the figure corresponds to out_deriv in the source
3 in the figure corresponds to in_deriv in the source

error backpropagation(4)

Next, the "Pnorm" component.

nnet2/nnet-component.cc

in_deriv->DiffGroupPnorm(in_value, out_value, out_deriv, p_);

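< Figure legend >

1 in the figure corresponds to in_value in the source
2 in the figure corresponds to out_value in the source
3 in the figure corresponds to out_deriv in the source
4 in the figure corresponds to in_deriv in the source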

error backpropagation(5)

Next, the "AffineComponentPreconditionedOnline" component.

nnet2/nnet-component.cc

in_deriv->AddMatMat(1.0, out_deriv, kNoTrans, linear_params_, kNoTrans, 0.0);

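< Figure legend >

1 in the figure corresponds to out_deriv in the source
2 in the figure corresponds to linear_params_ in the source (the model parameters)
3 in the figure corresponds to in_deriv in the source

Likewise, the model parameters are updated from the input values and the output error.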
