ㅇㅇ: Speech Recognition Notes (kaldi) 13 - Feature Transforms in Dan's DNN (nnet2)

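To the original author:

Since I did not know your contact details, I have reluctantly translated this post without permission.
I would like to obtain formal permission for the translation.
Please contact me at gogyzzz@gmail.com.

This is a translation of the post below.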

http://work-in-progress.hatenablog.com/entry/2018/05/14/224658

Let's look inside Daniel Povey's implementation (nnet2).

According to the official documentation, the following transforms are applied to the input features (MFCC).

This seems to correspond to the baseline / Type I features in Dan's paper.

splice
LDA (linear discriminant analysis)
MLLT (maximum likelihood linear transform) / global STC (semi-tied covariance)
fMLLR (feature-space maximum likelihood linear regression)

The paper compares the word error rate (WER) of four patterns, and the one that performed best is the following.

The Type-IV features consist of our baseline 40-dimensional speaker adapted features that have been spliced again, followed by de-correlation and dimensionality reduction using another LDA.

The baseline features are spliced once more, and then a (second) LDA transform is applied.

The second LDA transform corresponds to the "FixedAffineComponent" part of the neural network model.

This part is not updated during training (fixed in advance and not trainable).

Model (excerpt)

<Nnet> <NumComponents> 7 <Components> 
<SpliceComponent> 
    <InputDim> 40 <Context> [ -4 -3 -2 -1 0 1 2 3 4 ] <ConstComponentDim> 0
</SpliceComponent> 
<FixedAffineComponent> 
    <LinearParams>
    [ 0.1481841 0.1649369 (snip)
      (snip)
      -0.0002072983 0.0001211765 (snip) ]
    <BiasParams>
    [ 7.769857 5.612672 (snip) ]
</FixedAffineComponent> 
(snip)
</Components> </Nnet>

For example, if the baseline features are 40-dimensional, the features after the second splice are 360-dimensional (40 x 9 frames).
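The 40 -> 360 arithmetic can be checked with a small sketch of splicing in plain Python. This is only an illustration, not Kaldi's code: the function name `splice` and the dummy frame values are made up here, and repeating the first/last frame at the edges is an assumption of this sketch. The context offsets are the ones shown in the SpliceComponent of the model excerpt.

```python
from typing import List, Sequence

Frame = List[float]


def splice(frames: Sequence[Frame], context: Sequence[int]) -> List[Frame]:
    """Concatenate each frame with its neighbours at the given offsets.

    Edge handling (clamping the index so the first or last frame is
    repeated) is an assumption made for this sketch.
    """
    num_frames = len(frames)
    spliced = []
    for t in range(num_frames):
        row: Frame = []
        for offset in context:
            # Clamp so that t + offset stays inside [0, num_frames - 1].
            src = min(max(t + offset, 0), num_frames - 1)
            row.extend(frames[src])
        spliced.append(row)
    return spliced


# 20 dummy frames of 40-dimensional features (values are arbitrary).
frames = [[float(t)] * 40 for t in range(20)]
context = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
spliced = splice(frames, context)
print(len(spliced), len(spliced[0]))  # 20 360
```

Each output row holds 9 frames of 40 dimensions side by side, which is where the 360 comes from.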

If the second LDA transform is applied without dimensionality reduction, the "LinearParams" parameter is 360 rows x 360 cols and the "BiasParams" parameter is 1 row x 360 cols.
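To make those shapes concrete, here is a minimal sketch of a fixed affine transform, y = linear * x + bias, in plain Python. The function name `fixed_affine` is hypothetical, and the parameters are placeholders (an identity matrix and a constant bias), not the values from the model above. Nothing in the sketch updates `linear` or `bias`, which mirrors the "not trainable" property only in spirit.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def fixed_affine(linear: Matrix, bias: Vector, x: Vector) -> Vector:
    """Compute y = linear * x + bias for a single input vector."""
    if any(len(row) != len(x) for row in linear):
        raise ValueError("each row of linear must match the input dimension")
    if len(bias) != len(linear):
        raise ValueError("bias must have one entry per output dimension")
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(linear, bias)]


dim = 360  # 40 dims x 9 spliced frames, as in the text
# Placeholder parameters (assumption): identity matrix and constant bias.
linear = [[1.0 if r == c else 0.0 for c in range(dim)] for r in range(dim)]
bias = [0.5] * dim

x = [float(i) for i in range(dim)]  # one spliced frame with dummy values
y = fixed_affine(linear, bias, x)
print(len(linear), len(linear[0]), len(bias), len(y))  # 360 360 360 360
```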

The data for the LDA transform is generated by "src/nnet2bin/nnet-get-feature-transform", which is called from inside "steps/nnet2/get_lda.sh".

Get feature-projection transform using stats obtained with acc-lda.
See comments in the code of nnet2/get-feature-transform.h for more information.
Usage:  nnet-get-feature-transform [options] <matrix-out> <lda-acc-1> <lda-acc-2> ...

Here, LDA is reportedly applied not to reduce the dimensionality but to decorrelate the data.

The wrapper script "steps/nnet2/get_lda.sh" also shows that, by default, no dimensionality reduction takes place.

lda_dim=  # This defaults to no dimension reduction.

As the paper also shows, whitening seems to be more beneficial than dimensionality reduction for the input to DNN training.
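To get a feel for what decorrelating without reducing the dimension looks like, here is a toy sketch of plain whitening in pure Python. Note that this is not the LDA-based estimation that nnet-get-feature-transform performs: it uses only the total covariance of made-up 2-D data and a Cholesky factor, and all function names are hypothetical.

```python
import random
from typing import List

Matrix = List[List[float]]


def covariance(data: Matrix) -> Matrix:
    """Sample covariance (normalised by n) of row-wise data."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    return [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data) / n
             for j in range(d)] for i in range(d)]


def cholesky(c: Matrix) -> Matrix:
    """Lower-triangular L with L * L^T == c (c must be positive definite)."""
    d = len(c)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = (c[i][i] - s) ** 0.5
            else:
                low[i][j] = (c[i][j] - s) / low[j][j]
    return low


def whiten(data: Matrix) -> Matrix:
    """Map data to y = L^{-1} (x - mean); the dimension is unchanged."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    low = cholesky(covariance(data))
    out = []
    for row in data:
        centered = [row[j] - mean[j] for j in range(d)]
        # Solve low * y = centered by forward substitution.
        y = [0.0] * d
        for i in range(d):
            y[i] = (centered[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
        out.append(y)
    return out


random.seed(0)
# Toy 2-D data whose second dimension is strongly correlated with the first.
data = []
for _ in range(1000):
    a = random.gauss(0.0, 1.0)
    data.append([a, 0.8 * a + 0.2 * random.gauss(0.0, 1.0)])

before = covariance(data)
after = covariance(whiten(data))
print(before[0][1])  # off-diagonal covariance before whitening
print(after[0][1])   # off-diagonal covariance after whitening, zero up to rounding
```

The output has the same number of dimensions as the input; only the correlation between dimensions is removed, which is the property the text attributes to the second LDA step.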

Sunday, July 14, 2019
