Yonsei University Artificial Intelligence Society (YAI)


Lectures & Books / CS224N

[CS224N] Lecture 9, 10 : Pretraining, NLG (Natural Language Generation)

_YAI_ 2023. 3. 4. 16:08

CS224N, Winter 2023 : Lecture 9, 10

https://web.stanford.edu/class/cs224n/index.html

 


*This post was written by 신명진 (YAI 11th cohort) as part of the "자연어 강의" (NLP lectures) team.


Subword Models

Mapping every unknown word to a single UNK token is suboptimal; we want an open vocabulary.

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

One solution: byte-pair encoding, a review from last week's material.

Unknown words are instead split into subwords:

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
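As a concrete illustration, here is a minimal sketch of greedy longest-match subword segmentation over a hypothetical, already-learned subword vocabulary (the vocabulary and the example word are made up; real BPE learns its vocabulary by repeatedly merging frequent symbol pairs over a corpus):

```python
# Toy greedy longest-match segmentation over a hypothetical subword vocabulary.
SUBWORD_VOCAB = {"taa", "aa", "asaa", "ntaa", "ta", "saa", "a", "n", "s", "t"}

def segment(word, vocab):
    """Split a word into the longest subwords found in the vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest substring starting at position i that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character (an "unknown" byte/character).
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("taaasaantaa", SUBWORD_VOCAB))  # ['taa', 'asaa', 'ntaa']
```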

Pretraining and Word Embeddings

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

Word embeddings do not capture context. We can hope that the training data for the downstream task contains enough examples to teach these contextual aspects, but when it does not, this becomes a problem.

The word embedding for "movie" is fixed, regardless of the context it appears in.

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.

Pretraining is analogous to Word2Vec training, in that Word2Vec learns to predict context words from the centre word (and vice versa); the difference is that pretraining usually uses a much larger context, and the whole network is trained jointly.

The model learns to represent entire sentences through pretraining.

Pretraining

Through language modelling: train a neural network to perform language modelling on a large amount of text, then use the trained network weights for downstream tasks.

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
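As a rough sketch of this objective, assuming a PyTorch-style `model` that maps token ids to next-token logits (the model itself is not defined here):

```python
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    """Next-token prediction loss on a batch of token ids of shape (batch, seq_len).

    `model` is assumed to return logits of shape (batch, seq_len, vocab_size);
    position t is trained to predict token t+1, so inputs and targets are shifted by one.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                                   # (B, T-1, V)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```

The weights that minimise this loss are what gets reused for downstream tasks.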

Pretraining and Finetuning Paradigm

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

Step 2 starts training with network weights from Step 1.
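A minimal PyTorch sketch of the two-step paradigm; `TinyEncoder` and the commented-out checkpoint path are hypothetical stand-ins for a real pretrained model:

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a pretrained encoder (hypothetical; real models are Transformers)."""
    def __init__(self, vocab_size=1000, hidden_size=128):
        super().__init__()
        self.hidden_size = hidden_size
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, token_ids):
        # Mean-pool token embeddings into one sentence vector (toy example).
        return self.embed(token_ids).mean(dim=1)

# Step 1: the pretrained weights would be loaded here (path is hypothetical).
encoder = TinyEncoder()
# encoder.load_state_dict(torch.load("pretrained_lm.pt"))

# Step 2: attach a task-specific head and finetune, starting from the pretrained weights.
num_classes = 2
model = nn.Sequential(encoder, nn.Linear(encoder.hidden_size, num_classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # a small LR is typical for finetuning
```

The key point is that the Step 2 optimiser starts from the Step 1 weights rather than from a random initialisation.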

Why is this so effective?

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

Pretraining for 3 Types of Architectures

Pretraining for Encoders

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

Encoder models cannot directly be used for language modelling: because they see bidirectional context, they would only need to copy the input token they are asked to predict!

Idea: Masking. This is from BERT:

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

There are additional tasks that make training BERT harder; this is done purposefully to make BERT a more powerful model.
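As a sketch of the masked-LM input corruption, here is a hypothetical implementation following the 80/10/10 recipe described in the BERT paper (of the roughly 15% of positions selected for prediction, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged):

```python
import torch

def mask_for_mlm(token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Corrupt inputs for masked language modelling (BERT-style 80/10/10)."""
    labels = token_ids.clone()
    selected = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    labels[~selected] = -100  # -100 is the default ignore_index of F.cross_entropy

    corrupted = token_ids.clone()
    roll = torch.rand_like(token_ids, dtype=torch.float)
    corrupted[selected & (roll < 0.8)] = mask_token_id            # 80%: [MASK]
    random_ids = torch.randint_like(token_ids, vocab_size)
    use_random = selected & (roll >= 0.8) & (roll < 0.9)          # 10%: random token
    corrupted[use_random] = random_ids[use_random]
    # The remaining 10% of selected positions are left unchanged.
    return corrupted, labels
```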

But we are not going to use BERT to predict masked words in our downstream tasks; that is better left to pretrained decoders. Instead, we use BERT for its strong representations of language, made possible by its training objectives.

Even though BERT's learning objective is somewhat ill-posed (a masked position often has several plausible fillers), over the course of training BERT learns an averaged prediction for such positions, and in doing so learns to build strong representations.

Notes about pretraining and finetuning:

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

Pretraining for Encoder-Decoders

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

Teacher forcing is used during training.

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
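A rough sketch of teacher forcing for an encoder-decoder, assuming a `model(src_ids, decoder_input)` interface that returns per-position logits (this interface is an assumption, not a specific library's API):

```python
import torch
import torch.nn.functional as F

def seq2seq_loss(model, src_ids, tgt_ids, bos_id):
    """Teacher forcing: the decoder is conditioned on the gold prefix at every step.

    `model(src_ids, decoder_input)` is assumed to return logits of shape
    (batch, tgt_len, vocab_size); `tgt_ids` has shape (batch, tgt_len).
    """
    # Shift the gold target right and prepend BOS: the decoder sees gold tokens,
    # not its own (possibly wrong) previous predictions.
    decoder_input = torch.cat(
        [torch.full_like(tgt_ids[:, :1], bos_id), tgt_ids[:, :-1]], dim=1
    )
    logits = model(src_ids, decoder_input)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt_ids.reshape(-1))
```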

Pretraining for Decoders

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

Example (see above): we can simply ignore that they were trained to model p(w_t | w_1, …, w_{t-1}) and finetune them on the downstream task.

 

i.e., we are only using the pretrained network as an initialisation of its parameters!

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

Another common usage: pretrain decoders as language models and use them
as generators.

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf

GPT (2018) was a big success in pretraining a decoder.

Utilising a pretrained decoder (e.g., GPT): instead of changing the
architecture, we format the task to suit the architecture.

> Source: John Hewitt, Natural Language Processing with Deep Learning (CS224N), Lecture 9: Pretraining. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture9-pretraining.pdf
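A small sketch of what "formatting the task for the architecture" can look like: a classification example is rewritten as a text prompt, and the decoder's continuation is read back as the label. The template and label words below are made up for illustration:

```python
# Hypothetical prompt template: the task is rewritten as text so that an unchanged
# left-to-right decoder can solve it simply by predicting the next tokens.
def format_entailment_example(premise, hypothesis):
    return f"{premise}\nquestion: {hypothesis} true or false?\nanswer:"

prompt = format_entailment_example("The man is sleeping.", "The man is awake.")
# generated = decoder.generate(prompt)  # would ideally continue with " false"
```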

Natural Language Generation (NLG)

Example uses: machine translation, digital assistants, summarisation systems, creative stories, data-to-text (e.g., from tabular data), video description.

> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf

Greedy decoding: at each step, take the argmax of the next-token distribution.

Another decoding algorithm: beam search
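A minimal sketch of greedy (argmax) decoding, assuming a `model` that returns next-token logits for a prefix of token ids:

```python
import torch

def greedy_decode(model, prefix_ids, eos_id, max_len=50):
    """Repeatedly append the single most likely next token (the argmax)."""
    ids = list(prefix_ids)
    for _ in range(max_len):
        logits = model(torch.tensor([ids]))   # assumed shape: (1, len(ids), vocab_size)
        next_id = int(logits[0, -1].argmax())
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```

Beam search generalises this by keeping the k highest-scoring partial hypotheses at every step instead of committing to a single one.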

Problems with Greedy Decoding

> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf

Humans don't speak with the goal of maximising
likelihood. They speak to make a point.

Incorporate randomness by sampling:

> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf

Top-k sampling: prevents sampling very unlikely tokens by restricting sampling to the k most probable tokens. To adapt the cutoff to the actual probability mass of the tokens, we can instead use top-p (nucleus) sampling.

> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf

Softmax temperature is another knob: it is not a decoding algorithm itself, but can be applied in conjunction with one. It obviously does not change the argmax, so greedy decoding is unaffected.
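A sketch of both truncation schemes, operating on a 1-D tensor of next-token logits; the default cutoffs (k=50, p=0.9) are just illustrative:

```python
import torch

def sample_top_k(logits, k=50):
    """Sample only among the k highest-probability next tokens."""
    top_logits, top_ids = torch.topk(logits, k)
    probs = torch.softmax(top_logits, dim=-1)
    return top_ids[torch.multinomial(probs, 1)].item()

def sample_top_p(logits, p=0.9):
    """Nucleus sampling: keep the smallest prefix of tokens whose total mass reaches p."""
    sorted_logits, sorted_ids = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cutoff = int((torch.cumsum(probs, dim=-1) < p).sum()) + 1  # tokens needed to reach mass p
    kept = torch.softmax(sorted_logits[:cutoff], dim=-1)       # renormalise the kept tokens
    return sorted_ids[torch.multinomial(kept, 1)].item()
```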

Rebalancing Distributions

(Khandelwal et al., ICLR 2020) > Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf

Search an external corpus/datastore for phrases similar to what the model is about to output, and use them to rebalance the model's output distribution.
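In kNN-LM terms, the retrieved neighbours are turned into their own distribution over the vocabulary and interpolated with the model's softmax. A sketch under the assumption that the neighbours (token ids and distances) have already been retrieved:

```python
import torch

def knn_distribution(neighbor_token_ids, neighbor_distances, vocab_size):
    """Turn retrieved (token id, distance) pairs into a distribution over the vocabulary.

    neighbor_token_ids: LongTensor of shape (K,); neighbor_distances: float tensor (K,).
    Closer neighbours get exponentially more weight.
    """
    weights = torch.softmax(-neighbor_distances, dim=-1)
    p_knn = torch.zeros(vocab_size)
    p_knn.index_add_(0, neighbor_token_ids, weights)  # accumulate weight per token id
    return p_knn

def rebalance(p_lm, p_knn, lam=0.25):
    """Interpolate the LM distribution with the retrieval-based distribution."""
    return lam * p_knn + (1.0 - lam) * p_lm
```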

Exposure Bias

During training, the NLG model is fed gold text; at test time, however, it is fed its own previous outputs.

Reinforcement Learning as a Solution

Learn behaviours by rewarding the model when it exhibits them.

If BLEU is used as the reward, the final sequence is scored once and every token receives that sequence-level reward, rather than a separate reward at each token step. But BLEU is also an evaluation metric, and optimising directly for the metric can improve the metric score while translation quality does not improve!

Also, BLEU scores are always > 0, meaning every sampled sequence would be rewarded; it is therefore important to subtract the expected BLEU score (a baseline) from the reward.
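A rough sketch of this sequence-level reward idea in REINFORCE style, assuming the per-token log-probabilities of the sampled sequence and the BLEU-like reward have already been computed:

```python
def sequence_rl_loss(log_probs, reward, baseline):
    """Policy-gradient loss for one sampled sequence.

    log_probs: tensor of per-token log-probabilities of the sampled sequence.
    reward:    scalar score for the whole sequence (e.g. sentence-level BLEU).
    baseline:  expected reward to subtract, so that better-than-average sequences
               are reinforced and worse-than-average ones are discouraged.
    """
    advantage = reward - baseline
    return -advantage * log_probs.sum()
```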

Takeaways

> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf

Evaluation

> Source: Xiang Lisa Li, Natural Language Processing with Deep Learning (CS224N), Lecture 10: Neural Language Generation. https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture10-nlg.pdf

Discussion

It was mentioned that repetition in NLG happens less with RNNs than with Transformers, and that this is related to a time bottleneck. Why would that be?

Our guess is that, because an RNN's memory is weaker than a Transformer's, repeated phrases stop having much influence past a certain point.

We think this needs further investigation.
