ㅇㅇ: Speech Recognition Notes (Kaldi) 15 - Activation Functions, Dan's DNN (nnet2)

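「To the original author」

I did not know how to contact you, so I have reluctantly translated this post without permission.
I would like to obtain formal permission for the translation.
Please contact me at gogyzzz@gmail.com.

This is a translation of the post below.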

http://work-in-progress.hatenablog.com/entry/2018/06/10/123041

Notes on activation functions.

The description of hidden layers on the official site mentions "tanh" and "p-norm".

According to the paper, "p-norm" appears to be an original method inspired by the "maxout" activation function.

As the comments in 「egs/rm/s5/local/run_nnet2.sh」 state, "p-norm" is the primary recipe in nnet2.

「egs/rm/s5/local/run_nnet2.sh」(use_gpu=false)

# This one is on top of 40-dim + fMLLR features, it's a fairly normal tanh system.
local/nnet2/run_4c.sh --use-gpu false

# **THIS IS THE PRIMARY RECIPE (40-dim + fMLLR + p-norm neural net)**
local/nnet2/run_4d.sh --use-gpu false

「local/nnet2/run_4d.sh」 internally calls 「steps/nnet2/train_pnorm_fast.sh」, which in turn runs the 「nnet2bin/nnet-train-simple」 command.

The 「nnet2bin/nnet-train-simple」 command

Train the neural network parameters with backprop and stochastic gradient descent using minibatches.  
Usage:  nnet-train-simple [options] <model-in> <training-examples-in> <model-out>

Below is an example configuration of a model whose input features are 13-dimensional and whose output has 6 pdf-classes.

It consists of 7 components in total, of which the following make up the hidden layer.

AffineComponentPreconditionedOnline
PnormComponent
NormalizeComponent

Model

<Nnet>
<NumComponents> 7 
<Components>
<SpliceComponent> 
    <InputDim> 13 <Context> [ -4 -3 -2 -1 0 1 2 3 4 ]<ConstComponentDim> 0 
</SpliceComponent> 
<FixedAffineComponent> 
    <LinearParams>  [ (snip) 117row x 117col ]
    <BiasParams>  [ (snip) 1row x 117col ]
</FixedAffineComponent> 
<AffineComponentPreconditionedOnline> 
    <LearningRate> 0.02 
    <LinearParams>  [ (snip) 1000row x 117col ]
    <BiasParams>  [ (snip) 1row x 1000col ]
    <RankIn> 20 <RankOut> 80 <UpdatePeriod> 4 <NumSamplesHistory> 2000 <Alpha> 4 <MaxChangePerSample> 0.075
</AffineComponentPreconditionedOnline> 
<PnormComponent> 
    <InputDim> 1000 <OutputDim> 200 <P> 2 
</PnormComponent> 
<NormalizeComponent>
     <Dim> 200 <ValueSum>  [ ]  <DerivSum>  [ ]  <Count> 0
</NormalizeComponent> 
<AffineComponentPreconditionedOnline>
    <LearningRate> 0.02 
    <LinearParams>  [ (snip) 6row x 200col ]
    <BiasParams>  [ (snip) 1row x 6col ]
    <RankIn> 20 <RankOut> 80 <UpdatePeriod> 4 <NumSamplesHistory> 2000 <Alpha> 4 <MaxChangePerSample> 0.075 
</AffineComponentPreconditionedOnline> 
<SoftmaxComponent> 
    <Dim> 6 <ValueSum>  [ (snip) 1row x 6col ]
    <DerivSum>  [ (snip) 1row x 6col ]
    <Count> 396 
</SoftmaxComponent> 
</Components> 
</Nnet>

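Let's trace the internal processing flow of the 「nnet2bin/nnet-train-simple」 command.

The minibatch-size was set to 「64」.

First, Propagate.

「SpliceComponent」 expands the 13 dimensions to 117 dimensions (9 context frames x 13 dimensions).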
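The 13 to 117 expansion follows from the `<Context>` of `[ -4 ... 4 ]` in the model: 9 frames of 13 dimensions each are concatenated. The following is a minimal plain-Python sketch of that idea, not Kaldi's implementation. The helper name `splice` is hypothetical, and handling the edges by clamping the frame index is an assumption made only so the example runs on a short dummy sequence.

```python
def splice(frames, context):
    """Concatenate each frame with its neighbours at the given offsets.

    frames:  list of T feature vectors (each a list of floats)
    context: list of frame offsets, e.g. [-4, ..., 4]
    Edge frames are handled by clamping the index (an assumption made here).
    """
    num_frames = len(frames)
    spliced = []
    for t in range(num_frames):
        row = []
        for offset in context:
            # Clamp so that t + offset stays inside [0, num_frames - 1].
            src = min(max(t + offset, 0), num_frames - 1)
            row.extend(frames[src])
        spliced.append(row)
    return spliced


context = list(range(-4, 5))                    # 9 offsets: -4 .. 4
frames = [[float(t)] * 13 for t in range(20)]   # 20 dummy 13-dim frames
spliced = splice(frames, context)
print(len(spliced[0]))                          # 9 offsets * 13 dims = 117
```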

Next, 「FixedAffineComponent」 decorrelates the features (the dimension stays at 117).

Next, 「AffineComponentPreconditionedOnline」 raises the dimension to 1000.

The output obtained so far is 64row x 1000col.
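The shape change in this step is that of an ordinary affine transform, y = x W^T + b, applied to every row of the minibatch. The sketch below only checks the shapes with random values; the function name `affine` and the random initialisation are assumptions for illustration, and the online preconditioning that gives the component its name affects only the parameter update, so it is not shown.

```python
import random


def affine(x, weight, bias):
    """Compute y = x * weight^T + bias for a minibatch.

    x:      N rows of in_dim values
    weight: out_dim rows of in_dim values (same layout as <LinearParams>)
    bias:   out_dim values
    """
    return [
        [sum(w_i * x_i for w_i, x_i in zip(w_row, x_row)) + b
         for w_row, b in zip(weight, bias)]
        for x_row in x
    ]


random.seed(0)
batch, in_dim, out_dim = 64, 117, 1000
x = [[random.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(batch)]
weight = [[random.gauss(0.0, 0.1) for _ in range(in_dim)]
          for _ in range(out_dim)]
bias = [0.0] * out_dim

y = affine(x, weight, bias)
print(len(y), len(y[0]))   # rows and columns of the output: 64 1000
```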

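Next comes 「PnormComponent」. The default is 「p=2」 (that is, the 2-norm), which the paper reports as the best-performing setting.

Since the input has 「1000」 dimensions and the output has 「200」 dimensions, every 5 dimensions are grouped into one.

Before grouping

After grouping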
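The grouping can be written as y_j = (sum over the j-th group of |x_i|^p)^(1/p), where each group is a run of consecutive input dimensions. Below is a minimal sketch under that reading; the function name `pnorm` is hypothetical and the input values are made up.

```python
def pnorm(x_row, output_dim, p=2.0):
    """Reduce one input row to output_dim values by taking the p-norm
    of consecutive, non-overlapping groups of equal size."""
    input_dim = len(x_row)
    if input_dim % output_dim != 0:
        raise ValueError("input_dim must be a multiple of output_dim")
    group_size = input_dim // output_dim
    out = []
    for j in range(output_dim):
        group = x_row[j * group_size:(j + 1) * group_size]
        out.append(sum(abs(v) ** p for v in group) ** (1.0 / p))
    return out


# Small case: 10 inputs, 2 outputs, so groups of 5.
small = pnorm([3.0, 4.0, 0.0, 0.0, 0.0, 1.0, 2.0, 2.0, 0.0, 0.0], 2)
print(small)   # sqrt(3^2 + 4^2) and sqrt(1^2 + 2^2 + 2^2)

# Same shape as in the model: 1000 inputs, 200 outputs, groups of 5.
row = [float(i % 7) - 3.0 for i in range(1000)]
print(len(pnorm(row, 200)))
```

With p=2 each output is the Euclidean length of its group, so it is never negative, unlike a tanh output.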

Next, 「NormalizeComponent」 normalizes the output.
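As far as I understand it, this normalization rescales each row so that its root-mean-square value becomes 1. The sketch below assumes that behaviour; the function name `normalize` is hypothetical, and the floor that guards against division by zero is an assumed value for illustration.

```python
import math


def normalize(x_row, floor=2.0 ** -66):
    """Scale a row so that its root-mean-square value becomes 1.

    floor is an assumed lower bound on the mean square, used only to
    avoid dividing by zero for an all-zero row.
    """
    mean_square = sum(v * v for v in x_row) / len(x_row)
    scale = 1.0 / math.sqrt(max(mean_square, floor))
    return [v * scale for v in x_row]


row = [0.5, 2.0, 1.5, 3.0]          # made-up p-norm outputs
normed = normalize(row)
rms = math.sqrt(sum(v * v for v in normed) / len(normed))
print(rms)                           # close to 1.0
```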

Next, 「AffineComponentPreconditionedOnline」 reduces the dimension to 6.

(The output obtained so far is 64row x 6col.)

Finally, 「SoftmaxComponent」 computes the output probability of each pdf-class.

Result for the first frame

 0.1746  0.0040  0.7098  0.0007  0.0889  0.0217

(Digits after the fourth decimal place are truncated.)
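A row like the one above is what a softmax produces from the 6 values of the last affine layer. Here is a minimal sketch; the input values are made up, since the post does not show the pre-softmax activations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of activations."""
    peak = max(logits)                        # subtract the max to avoid overflow
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, -2.0, 2.5, -4.0, 0.3, -1.0]    # made-up values, not from the post
probs = softmax(logits)
print([round(p, 4) for p in probs])
print(sum(probs))                              # the probabilities sum to about 1
```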

For this frame, the 4th pdf-class (probability 0.0007) is the correct answer.

This error is then used to update the model parameters through backpropagation.
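To make "this error" concrete, the sketch below assumes the objective is the log-probability of the correct class and computes its value and the error signal at the softmax input for the first frame. The probabilities are the ones listed in the post; the variable names are hypothetical, and the actual parameter update in Kaldi (including the preconditioning) is not reproduced.

```python
import math

# First-frame softmax output, as listed in the post.
probs = [0.1746, 0.0040, 0.7098, 0.0007, 0.0889, 0.0217]
label = 3   # zero-based index of the 4th pdf-class (the correct one)

# Log-probability of the correct class; very negative because p is tiny.
log_prob = math.log(probs[label])

# Gradient of -log(p[label]) with respect to the softmax input:
# p - one_hot(label). This is what gets backpropagated.
grad = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]

print(round(log_prob, 2))            # about -7.26
print([round(g, 4) for g in grad])
```

The gradient is largest in magnitude at the correct class (about -0.9993) and at the wrongly preferred 3rd class (0.7098), so those are the outputs the update pushes hardest.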


Sunday, July 14, 2019
