Yonsei University Artificial Intelligence Society (YAI)
[CS224N] Lecture 9, 10 : Pretraining, NLG (Natural Language Generation)
_YAI_ 2023. 3. 4. 16:08
CS224N, Winter 2023 : Lecture 9, 10
https://web.stanford.edu/class/cs224n/index.html
*This post was written by 신명진 (YAI 11th cohort) as part of the "Natural Language Lectures" team.
Subword Models
Mapping every unknown word to a single UNK token is suboptimal; we want an open vocabulary.
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
One solution: byte-pair encoding (BPE), reviewed from last week's material.
Unknown words are instead split into subwords:
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
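The splitting of unknown words into subwords can be illustrated with a toy byte-pair encoding learner. This is a minimal sketch of a common formulation of BPE, not the lecture's code: the `</w>` end-of-word marker, the function names, and the tiny corpus are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    # Start from characters; "</w>" (assumed marker) flags the end of a word.
    vocab = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into subwords by replaying the merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: "lowest" never appears, yet it can still be segmented.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=10)
print(segment("lowest", merges))
```

Because segmentation falls back to single characters when no merge applies, no word ever has to be mapped to UNK.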
Pretraining and Word Embeddings
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Word embeddings do not capture context. We can hope that the training
data for the downstream task is large enough to teach the model the
contextual aspects of language, but when it is not, this becomes a problem.
For example, the word embedding for "movie" is fixed, whatever sentence it appears in.
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.
Pretraining is analogous to Word2Vec training, in which the model learns
to predict context words from the centre word (and vice versa), but
pretraining usually works with a much larger context and trains the whole network jointly.
The model learns to represent entire sentences through pretraining.
Pretraining
Through language modelling: train a neural network to perform language
modelling on a large amount of text, then use the trained network weights
for downstream tasks.
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
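As a rough formalisation (the notation is assumed here, not taken from the post: $\theta$ for the network weights, $w_t$ for the token at position $t$, $T$ for the sequence length), the language-modelling pretraining objective can be written as the negative log-likelihood of each token given its left context:

$$
\mathcal{L}_{\text{LM}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_1, \dots, w_{t-1})
$$

The weights $\theta$ that approximately minimise this loss on a large corpus are then reused as the starting point for the downstream task.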
Pretraining and Finetuning Paradigm
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Step 2 starts training with network weights from Step 1.
Why is this so effective?
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
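One way to make the question concrete, sketched under assumed notation ($\mathcal{L}_{\text{pretrain}}$ and $\mathcal{L}_{\text{finetune}}$ for the two losses, $\hat{\theta}$ for the pretrained weights), is to view the two steps as two optimisation problems that share an initialisation:

$$
\hat{\theta} \approx \arg\min_{\theta} \mathcal{L}_{\text{pretrain}}(\theta),
\qquad
\theta^{*} \approx \arg\min_{\theta} \mathcal{L}_{\text{finetune}}(\theta) \ \text{ starting from } \theta = \hat{\theta}
$$

A commonly offered intuition is that gradient descent tends to stay relatively close to $\hat{\theta}$ during finetuning, so a good starting point matters. This is an intuition rather than a proven result.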
Pretraining for 3 Types of Architectures
Pretraining for Encoders
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Encoder models cannot be trained directly with the language-modelling
objective: an encoder sees the whole input in both directions, so it would
only need to repeat the input!
Idea: Masking. This is from BERT:
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
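The masking idea can be sketched in a few lines. The rates below (select about 15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged) are the ones reported for BERT; the function name, the `None` label convention, and the toy sentence are assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def bert_mask(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted inputs, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # the model must recover the original token here
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: replace with the mask token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)  # not selected: no loss at this position
            inputs.append(tok)
    return inputs, labels


# Hypothetical toy input, repeated so that some positions get selected.
tokens = "i went to the store to buy some milk".split() * 20
inputs, labels = bert_mask(tokens, vocab=sorted(set(tokens)))
print(list(zip(inputs, labels))[:12])
```

Because some selected tokens are left unchanged or swapped for random ones, the model cannot assume that every non-`[MASK]` token is correct, which pushes it to build a useful representation at every position.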
BERT's training includes additional tasks that make training harder;
this is done on purpose to make BERT a more powerful model.
We are not going to use BERT to predict masked words in our
downstream tasks; generation is better left to pretrained decoders. However,
we will use BERT for its strong representation of language, made
possible by its training objectives.
Even though BERT's learning objectives are ill-posed (a masked position
often has more than one plausible answer), BERT learns an average over
these cases throughout training and still learns to build strong representations.
Notes about pretraining and finetuning:
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Pretraining for Encoder-Decoders
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Teacher forcing is used during training.
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
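Teacher forcing means that, at every step, the decoder is given the gold previous token rather than its own prediction. A minimal sketch of how the decoder inputs and targets line up, assuming `<s>` and `</s>` as begin/end markers (hypothetical names chosen for this example):

```python
def teacher_forcing_pairs(target_tokens, bos="<s>", eos="</s>"):
    """Pair each decoder input (gold previous token) with the token it should predict."""
    decoder_inputs = [bos] + target_tokens   # gold sequence shifted right by one
    decoder_targets = target_tokens + [eos]  # what the decoder must output at each step
    return list(zip(decoder_inputs, decoder_targets))


pairs = teacher_forcing_pairs(["the", "cat", "sat"])
for given, expected in pairs:
    print(f"given {given!r} -> predict {expected!r}")
```

At test time no gold sequence exists, so the decoder consumes its own previous outputs instead.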
Pretraining for Decoders
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Example (see above): we can ignore that they were trained to model $p(w_t \mid w_{1:t-1})$;
i.e., we are only using the network for the initialisation of its parameters!
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
Another common usage: pretrain decoders as language models and use them
as generators.
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
GPT (2018) was a big success in pretraining a decoder.
Utilising a pretrained decoder (e.g., GPT): instead of changing the
architecture, we format the task to suit the architecture.
> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
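Formatting the task to suit the architecture can be as simple as flattening the inputs into a single token sequence. The sketch below shows this for an entailment-style task with two input sentences; the special-token names `[START]`, `[DELIM]`, and `[EXTRACT]` and the function name are placeholders chosen for illustration.

```python
def format_entailment(premise, hypothesis):
    """Flatten a two-sentence classification example into one decoder input string."""
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"


example = format_entailment("The man is in the doorway", "The person is near the door")
print(example)
```

Under this setup a small classifier would typically read the decoder's hidden state at the final `[EXTRACT]` position, so the pretrained network itself needs no architectural change.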
Natural Language Generation (NLG)
Example uses: machine translation, digital assistants, summarisation
systems, creative story generation, data-to-text (e.g., from tabular data),
and video description.
> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf
Greedy decoding → argmax: at each step, pick the most probable next token.
Another decoding algorithm: beam search
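The difference between the two decoding algorithms is easiest to see on a toy model where the locally best first token leads to a worse overall sequence. Everything below (the `toy_model` table, token names, function names) is hypothetical; probabilities are assumed to be strictly positive so that `math.log` is defined.

```python
import math


def toy_model(prefix):
    """Hypothetical next-token distribution, keyed by the tokens generated so far."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "x": 0.1},
    }
    return table.get(prefix, {"</s>": 1.0})  # any other prefix ends the sequence


def greedy_decode(next_probs, max_len, eos="</s>"):
    """Pick the argmax token at every step."""
    seq = []
    for _ in range(max_len):
        probs = next_probs(tuple(seq))
        tok = max(probs, key=probs.get)
        seq.append(tok)
        if tok == eos:
            break
    return seq


def beam_search(next_probs, beam_size, max_len, eos="</s>"):
    """Keep the `beam_size` highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]  # (sequence, total log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished hypotheses carry over
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return list(beams[0][0])


print(greedy_decode(toy_model, max_len=5))
print(beam_search(toy_model, beam_size=2, max_len=5))
```

In this toy table, greedy decoding commits to "a" (probability 0.6) and ends with a sequence of probability 0.3, while a beam of size 2 keeps "b" alive long enough to find the sequence with probability 0.36.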
Problems with Greedy Decoding
> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf
Humans don't speak with the goal of maximising
likelihood. They speak to make a point.
Incorporate randomness by sampling:
> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf
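Top-k sampling: sample only from the k most probable tokens, which prevents sampling very unlikely tokens.
To make the cut-off reflect the probability of each token, we can instead use top-p (nucleus) sampling, which keeps the smallest set of tokens whose cumulative probability reaches p.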
> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf
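Both truncation schemes can be sketched as filters over a next-token distribution, followed by renormalisation and sampling. The function names and the toy distribution are assumptions made for illustration.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p, then renormalise."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs, rng=random):
    """Draw one token according to the given distribution."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
print(sample(top_p_filter(probs, p=0.9), rng=random.Random(0)))
```

With top-k the number of candidates is fixed, whereas with top-p it grows when the distribution is flat and shrinks when it is peaked.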
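Softmax temperature scaling (the technique on the slide above) is not a decoding
algorithm itself; it is used in conjunction with one. It obviously does not affect
the argmax, because dividing the logits by a positive temperature preserves their order.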
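Assuming the technique referred to above is softmax temperature scaling (the slide itself is not reproduced here), the remark about the argmax can be checked with a small sketch. The function name and the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a positive temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)  # more peaked
flat = softmax_with_temperature(logits, temperature=2.0)   # more uniform
print(sharp, flat)
```

A temperature below 1 concentrates mass on the top tokens and a temperature above 1 spreads it out, which changes what sampling produces but leaves the most probable token the same.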
Rebalancing Distributions
(Khandelwal et al., ICLR 2020)
> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf
Search a separate corpus/database for phrases that are similar to
what the model would output, and use them to rebalance the model's distribution.
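In the nearest-neighbour language model of Khandelwal et al., this rebalancing takes the form of an interpolation between the model's own distribution and a distribution built from the retrieved neighbours. A sketch with assumed notation ($\lambda \in [0, 1]$ for the mixing weight, $x$ for the context, $y$ for the next token):

$$
P(y \mid x) = \lambda \, P_{\text{kNN}}(y \mid x) + (1 - \lambda) \, P_{\text{LM}}(y \mid x)
$$

Setting $\lambda = 0$ recovers the original language model, while larger values trust the retrieved phrases more.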
Exposure Bias
During training, the NLG model is fed gold text. During testing, however,
it is fed its own output.
Reinforcement Learning as a Solution
Learn behaviours by rewarding the model when it exhibits them.
If BLEU is used as the reward, only the final sequence is scored, and every
token in it receives that same reward rather than a reward at each token step.
But BLEU is also an evaluation metric, and optimising for the metric can
raise the metric score without improving translation quality!
Also, BLEU scores are never negative, so the tokens would be rewarded
all the time; it is therefore important to subtract the expected BLEU score
(a baseline) from the reward.
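The effect of subtracting the expected score can be sketched with a REINFORCE-style loss in which each sampled sequence gets one sequence-level reward shared by all its tokens. This is a minimal illustration, not the lecture's implementation: the function names are hypothetical, the batch mean stands in for the expected reward, and the rewards and log-probabilities are made-up numbers.

```python
def advantages(rewards):
    """Subtract a baseline (here: the batch mean, an estimate of the expected reward)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def reinforce_loss(token_log_probs, rewards):
    """Average of -(reward - baseline) * sum of token log-probs over sampled sequences."""
    total = 0.0
    for log_probs, adv in zip(token_log_probs, advantages(rewards)):
        # Every token in the sequence shares the same sequence-level advantage.
        total += -adv * sum(log_probs)
    return total / len(rewards)


# Hypothetical batch: three sampled sequences, each with one BLEU-like score.
rewards = [0.2, 0.4, 0.6]
token_log_probs = [[-1.0, -1.0], [-0.5, -0.5], [-0.1, -0.1]]
print(advantages(rewards))
print(reinforce_loss(token_log_probs, rewards))
```

Without the baseline all three sequences would be pushed up; with it, only sequences that score better than average are reinforced and those that score worse are discouraged.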
Takeaways
> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf
Evaluation
> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf
Discussion
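It was said that repetition is less severe when doing NLG with RNNs than with
Transformers, and that this is related to the time bottleneck. Why would that be?
RNNs seem to have weaker memory than Transformers, so even when the same phrase is
repeated, from some point on it no longer has much influence.
We think this needs further investigation.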