ㅇㅇ: 180909 CMU speech recognition slide notes

Link: http://www.cs.cmu.edu/afs/cs/user/bhiksha/WWW/courses/11-756.asr/spring2013/

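On Sunday I'll write up, in one go, what I read over the past six days.

Honestly, writing it all up again is a chore, but if I skip it entirely I'll be left with a pleasant feeling of understanding and no actual memory of any of it, so this is the bare minimum I should do.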

Spell Checking, The Trellis as a Product

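Let's think about how to do spell checking.
Suppose the input is UROP and we have three templates: ON, OR, and DROP.
We compute the distance between the input and each of the three templates and pick the template with the smallest distance; that corrects the misspelling.
The distance computation is described as a "product": as I understand the slide title, the trellis is the product of the input string and the template, pairing every input character with every template character. In vector terms it is loosely analogous to an inner product.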
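To make the template-matching idea concrete, here is a minimal sketch in Python. It assumes the distance is plain Levenshtein edit distance with unit costs for insertion, deletion, and substitution; the slides may use different costs. The function names `edit_distance` and `correct` are my own.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit insertion, deletion, and substitution costs."""
    # prev[j] = cost of turning the already-consumed prefix of source into target[:j]
    prev = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        curr = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete s_char
                curr[j - 1] + 1,                   # insert t_char
                prev[j - 1] + (s_char != t_char),  # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]


def correct(word: str, templates: list[str]) -> str:
    """Return the template with the smallest distance to the input word."""
    return min(templates, key=lambda t: edit_distance(word, t))


print(correct("UROP", ["ON", "OR", "DROP"]))
```

Under these unit costs DROP is one substitution away from UROP, so it is the template that gets picked.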

Continuous text: Looping around, Lextree, Continuous text with arbitrary spaces, Models with optional spaces

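Build a state model for each of the three templates ON, OR, and DROP, and tie the three models together in parallel by having them share a start state and an end state.
To model continuous text made up only of ON, OR, and DROP, add a loop-back edge from the end state to the start state. Better yet, merge the start and end states into one instead of keeping them separate.
By putting a suitable penalty on the loop-back edge, we can push the continuous-text model toward finding more word boundaries or fewer.
The default penalty is the same value as the insertion (-> direction) penalty, so that the cost is the same whether or not a boundary is hypothesized.
Templates can share their common leading letters to form a lextree (e.g., horrible and horse share hor).
The model described so far only handles input that contains no spaces (e.g., helloworld). What about text that does contain spaces?
Put the space (" ") into the model.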
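The looped model can be sketched as a small dynamic program. This is not the time-synchronous trellis from the slides: it is a simplified equivalent that tries every split point, charges the edit distance of each piece to its best template, and adds a hypothetical `boundary_penalty` for every word (standing in for the loop-back penalty). It also assumes every word consumes at least one input character. All names here are my own.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        curr = [i]
        for j, t_char in enumerate(target, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (s_char != t_char)))
        prev = curr
    return prev[-1]


def segment(text, templates, boundary_penalty=1.0):
    """Split unspaced text into template words with minimum total cost."""
    n = len(text)
    # best[i] = (cost, words) for the cheapest explanation of text[:i]
    best = [(float("inf"), [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            prev_cost, prev_words = best[start]
            for word in templates:
                cost = (prev_cost
                        + edit_distance(text[start:end], word)
                        + boundary_penalty)  # cost of taking the loop-back edge
                if cost < best[end][0]:
                    best[end] = (cost, prev_words + [word])
    return best[n]


cost, words = segment("ONDROPOR", ["ON", "OR", "DROP"])
print(cost, words)
```

Raising `boundary_penalty` makes the search prefer fewer, longer words; lowering it makes splitting into more words cheaper, which is the bias described above.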

Isolated Word vs Continuous Speech, Templates for "Sentences"

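How do we build templates for continuous speech recognition? Could we take the isolated-word templates and prepare every possible combination?
That is fine if only a specific set of sentences is needed, but the number of sentences that can be formed from isolated words is infinite.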

Other Issues with Continuous Speech

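Speaking rate differs from person to person.
Pronunciation can change when words are put together in a sentence (e.g., Did you -> Dijjou).
Colloquial, spontaneous speech can contain meaningless filler words, wrong words, and ungrammatical sentences.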

Treat it as a series of isolated word recognition problems?, Recording only Word Templates

Suppose the input speech gave us the text "thiscar". Should it be recognized as "this car" or as "the scar"?
What about something like "th iscar"?
It might even consist of three words; how do we know how many words it contains?

A Simple Solution, Building Sentence Templates, Handling Silence, Sentence HMM with Optional Silences, Composing HMMs for Word Sequences

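Prepare word templates, then build sentence templates from the word templates (composition). Take every kind of variation into account when building them.
Given word templates for red, green, and blue, we can simply concatenate them (???) to make the sentence template redgreenblue.
Built this way, recognition suffers when the speaker pauses between words, so create a "silence" state and insert it between words as an option.
- The silence state is trained on silent segments.
If we simply concatenate the red and green models, how should the last state of red (which has a self-loop) be connected to the first state of green? There is no way to set the transition probability!
- Add a loop-free start state and end state to each word model so that each model takes care of its own beginning and end. The start and end states are non-emitting states.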
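The composition above can be sketched as a graph-building routine. This only builds the topology (no transition probabilities or emission models), assumes a left-to-right word model with three emitting states, and uses state names of my own invention such as `red0.start` and `sil1`.

```python
def word_model(name, n_states=3):
    """Left-to-right word model with loop-free, non-emitting start and end states."""
    start, end = f"{name}.start", f"{name}.end"
    emitting = [f"{name}.{i}" for i in range(n_states)]
    edges = {(start, emitting[0])}
    for i, state in enumerate(emitting):
        edges.add((state, state))  # self-loop on every emitting state
        nxt = emitting[i + 1] if i + 1 < n_states else end
        edges.add((state, nxt))    # move on to the next state (or to the end state)
    return start, end, edges


def sentence_model(words, n_states=3):
    """Chain word models, with an optional silence state between adjacent words."""
    all_edges = set()
    first_start = prev_end = None
    for k, word in enumerate(words):
        start, end, edges = word_model(f"{word}{k}", n_states)
        all_edges |= edges
        if prev_end is None:
            first_start = start
        else:
            sil = f"sil{k}"
            all_edges |= {
                (prev_end, start),  # no pause: skip the silence
                (prev_end, sil),    # pause: enter the silence state
                (sil, sil),         # silence can last several frames
                (sil, start),       # leave silence into the next word
            }
        prev_end = end
    return first_start, prev_end, all_edges


start, end, edges = sentence_model(["red", "green", "blue"])
print(start, end, len(edges))
```

Because each word owns its non-emitting start and end states, the glue between words never has to touch a looping emitting state directly.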

Connecting Words with Final NULL States, Retaining a non-emitting state between words, Viterbi with NULL states

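Now, when joining the word1 and word2 models, we can drop word1's end state and word1 attaches naturally to word2.
However, it is more useful to join them while keeping word1's end state. (Why?)
- Because the probability of entering word2 is then always 1, and word2 can be entered only from the non-emitting state (or so the slides say?)
Non-emitting states affect Viterbi decoding.
They affect the process of obtaining the state segmentation (how?)
This matters when recognizing actual word sequences (why?)
- (My take) The reason keeping the non-emitting state is useful seems to be simply that it makes the math easier to handle. From the figures alone, I honestly don't see why it is better. If we remove the non-emitting state and rewrite the Viterbi decoding equation, we have to mark the 'previous word', so the equation becomes dependent on the previous word. If we keep the non-emitting state, the Viterbi equation does not care what the previous word was; we just add a probability term coming from the non-emitting state. That probably brings several advantages, such as being able to use a queue or to reuse computation (just a hunch on my part).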
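My hunch above can be illustrated with a tiny sketch. It assumes log scores and a single shared non-emitting state that all words exit into; the function names and the numbers are hypothetical and not taken from the slides.

```python
def null_state_score(word_exit_scores):
    """Best log score arriving at the shared non-emitting state at one time step."""
    return max(word_exit_scores.values())


def word_entry_scores(null_score, entry_log_probs):
    """Score for entering each word, independent of which word came before."""
    return {word: null_score + lp for word, lp in entry_log_probs.items()}


# Hypothetical best-path log scores at the final state of each word
exit_scores = {"red": -20.0, "green": -18.5, "blue": -25.0}
null_score = null_state_score(exit_scores)

# Hypothetical log transition scores from the non-emitting state into each word
entries = word_entry_scores(null_score, {"red": -1.0, "green": -2.0})
print(null_score, entries)
```

The previous word's identity is absorbed into one number, so connecting N words to N words needs about 2N edges through the non-emitting state instead of N x N direct edges.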

Speech Recognition as Bayesian Classification, Statistical pattern classification, Isolated Word Recognition as Bayesian Classification

Some things we know without even listening. In everyday life we naturally hear SEE far more often than ZEE.
Basic DTW did not take the word prior into account.
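A minimal sketch of what adding the prior changes, assuming the Bayesian rule "pick the word maximizing log P(X|word) + log P(word)". The acoustic scores and prior values below are made-up numbers for illustration only.

```python
import math


def classify(acoustic_log_likelihood, prior):
    """Pick the word maximizing log P(X|word) + log P(word)."""
    return max(
        acoustic_log_likelihood,
        key=lambda w: acoustic_log_likelihood[w] + math.log(prior[w]),
    )


# Hypothetical values: the audio alone slightly favors ZEE
acoustic = {"SEE": -10.2, "ZEE": -10.0}
prior = {"SEE": 0.9, "ZEE": 0.1}

without_prior = max(acoustic, key=acoustic.get)
with_prior = classify(acoustic, prior)
print(without_prior, with_prior)
```

With these numbers the acoustic score alone picks ZEE, while the prior tips the decision to SEE.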

Computing P(X|word), Factoring in a priori probability into Trellis, Time-Synchronous Trellis

Why Scores and not Probabilities

Using scores (log probabilities) means no multiplications are needed, and there is no underflow either.
Deeper reason:
With scores, fewer trellises are needed (saves memory).
Forward probabilities cannot be used here anyway.
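The underflow point is easy to check numerically. The sketch below assumes 1000 frames that each contribute a probability of 1e-5, which is an arbitrary choice for illustration.

```python
import math

frame_probs = [1e-5] * 1000  # hypothetical per-frame probabilities

# Multiplying raw probabilities underflows double precision
product = 1.0
for p in frame_probs:
    product *= p

# Adding log probabilities stays well within range
log_score = sum(math.log(p) for p in frame_probs)

print(product)    # underflows to 0.0
print(log_score)  # a finite negative number
```

The true product is 1e-5000, far below the smallest positive double, so only the log-domain score carries usable information.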

Statistical classification of word sequences ~ Decoding to classify between word sequences

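The sentence "close file" is usually used more often than "delete file". (What kind of example is that…)
This part is mostly trellis figures, so I'm skipping it.
Partway through, the slides say max(max(dogstar), max(rockstar)) = max(max(dogstar1, rockstar1), … ) holds because argmax is commutative, but I don't see what this has to do with commutativity. My best guess is that max can be reordered and regrouped freely, so comparing the two words path by path first gives the same result as taking the best of each word first.
For dogstar and rockstar, star is shared, and for time-synchronous states the paths coincide from the start state of star onward, so it is fine to compare with max early (?)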
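The regrouping of max can be checked with a small numeric sketch. It assumes log scores that add along a path, and that index i stands for the i-th way of entering the shared star part (for example, the i-th boundary time). The score values are invented.

```python
def best_per_word_first(dog, rock, star):
    """max(max(dogstar), max(rockstar)): find each word sequence's best path, then compare."""
    best_dogstar = max(d + s for d, s in zip(dog, star))
    best_rockstar = max(r + s for r, s in zip(rock, star))
    return max(best_dogstar, best_rockstar)


def best_merged(dog, rock, star):
    """max_i(max(dog_i, rock_i) + star_i): pick dog or rock before entering the shared star."""
    return max(max(d, r) + s for d, r, s in zip(dog, rock, star))


# Hypothetical log scores; index i is the i-th entry point into "star"
dog = [-12.0, -9.0, -15.0]
rock = [-11.0, -10.0, -8.0]
star = [-5.0, -7.0, -9.0]

print(best_per_word_first(dog, rock, star) == best_merged(dog, rock, star))
```

Both groupings give the same best score, which is why the losing prefix can be discarded at the entry to star instead of carrying two separate copies of star through the trellis.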

The Real "Classes"

There are no separate Dog Star and Rock Star classes; think of it as a single merged class of the form Dog,Rock - Star (from now on, really do think of it this way).
The graph can be reduced like this because the Viterbi algorithm uses only the best path score, and because forward probabilities cannot be used.

Language-HMMs for fixed length word sequences

The set of all possible word sequences that a word graph can represent is called the "language" (how profound…)
A word graph can also be viewed as an HMM.

Where does the graph come from

The graph has to be specified according to the recognition application.

Language HMMs for arbitrary long word sequences

We should also be able to produce word sequences of arbitrary length from a word graph. For that, the word graph itself needs loops.
Unless the target is recognition of natural language, using a constrained vocabulary is the practical approach.


Sunday, September 9, 2018

180909 CMU speech recognition slide notes - 11. Continuous Speech
