Yonsei University Artificial Intelligence Society (YAI)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
_YAI_ 2022. 2. 8. 22:12 Category: NLP
Created: July 20, 2021, 5:41 PM
Year: 2018
Authors: Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.
Status: Modifying
Keywords: Attention, Bidirectional, Transformer
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Overall
Summary
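BERT is an NLP model released by Google: a method for pre-training language representations. The pre-trained model is then fine-tuned and applied to a variety of NLP tasks such as question answering and sentiment analysis. The model has several advantages while also achieving state-of-the-art (SOTA) results, which can be summarized as follows.
- Pre-training uses the attention mechanism and unsupervised learning. Unlike earlier models, BERT is therefore deeply bidirectional, and because it learns from unlabeled text, training data should be relatively easy to collect.
- Pre-training is not tailored to a specific task. The model learns from large corpora such as Wikipedia and therefore acquires general language representations. The parameters are then updated for a specific task during fine-tuning, which takes far less time than pre-training. In addition, the same well-trained pre-trained model can be reused across many different tasks.
- Fine-tuning does not bolt a task-specific architecture onto the model. Apart from a single output layer, the pre-trained model is used as is. Instead of feeding fixed context vectors into a separately trained model, or training only some additional layers, the entire model is fine-tuned for the task, so it is intuitive to expect better performance.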
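Abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to learn deep bidirectional representations from unlabeled text by conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including a GLUE score of 80.5% (7.7 percentage point improvement), a MultiNLI score of 86.7% (4.6 percentage point improvement), a SQuAD v1.1 question answering Test F1 of 93.2 (1.5 point improvement), and a SQuAD v2.0 Test F1 of 83.1 (5.1 point improvement).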
Introduction
History of NLP
Natural Language Processing (NLP) is a subfield of AI that enables computers to understand and mimic human communication[1]. Machine Translation (MT), Machine Reading Comprehension (MRC), and Question Answering (QA) are major NLP tasks. Traditionally, the field relied on statistical methods and heuristic approaches (computational grammars in the 1980s, parsers in the 1990s). However, as newer machine learning techniques outperformed these traditional methods, neural networks such as RNNs became the most popular architectures in NLP.
Evolution in Machine Translation
The original RNN paper came out in 1986[4]. The RNN structure is intuitive and relatively simple, yet very powerful, and it successfully learns the sequential structure of the given data[5]. However, RNNs suffer from a critical limitation, the vanishing/exploding gradient problem. To address it, many RNN variants such as LSTM[3] and GRU have been proposed. In addition, the Seq2Seq[2] model and the encoder-decoder structure were introduced to handle variable-length sequential inputs and to compress the sequential data into a fixed-length context vector.
While the methods above focused mainly on the sequential nature of language, the Attention mechanism[6] made it possible to draw information from the whole input sequence rather than only from the preceding portion. The advent of the Transformer[7] then improved performance while lowering computational complexity, eliminating the need for RNN structures. The Transformer and Attention have become the basis of today's best-performing language models, and BERT, which builds on them, is one of the most commonly used state-of-the-art NLP models.
BERT
BERT stands for Bidirectional Encoder Representations from Transformers. It is a method of pre-training language representations: a general-purpose "language understanding" model is first trained on a large text corpus, and that model is then used for various downstream NLP tasks such as question answering.
Seq2Seq Model
Features and Advantages
- A DNN, especially one built from fully connected layers, is very powerful, but it has difficulty handling variable-length data. In short, a DNN can only be applied to problems whose inputs and targets can be expressed as vectors of fixed dimensionality.
- A Seq2Seq model, in contrast, uses two LSTMs, one as an encoder and one as a decoder. The encoder LSTM's final hidden state vector $\mathbf{v}$ is used as a representation of the original input, and the encoder's outputs are discarded.
- The decoder LSTM receives that vector (called the context vector) and generates outputs until it predicts the [EOS] token.
Limitation
- A Seq2Seq model compresses the information of an input sentence (sequence) into a single context vector $\mathbf{v}$. Information loss is therefore inevitable, and the longer the sentence, the worse the translation quality, because the output is generated from limited information.
- RNN structures such as LSTM and GRU require the output and hidden state of step $t-1$ to compute step $t$. Because of this dependency, parallelization is difficult and training is expensive.
- The context vector $\mathbf{v}$ is used only as the initial input of the decoder LSTM, which also limits how far the information flows toward the end of the model.
Masked Language Modeling (Masked LM)
Since bidirectional models are intuitively much more powerful than unidirectional language models, the researchers wanted to build a deeply bidirectional model rather than a unidirectional model such as GPT or a shallowly bidirectional model such as ELMo. However, existing language-model training objectives could not be applied to a bidirectional model, because conditioning on both directions lets each token indirectly see itself through the stacked layers, which makes the prediction trivial. The researchers therefore used a Masked LM: some words in a sentence are hidden at random, and the language model is trained to predict the hidden words.
However, the selected words are not always replaced with the mask token [MASK], because [MASK] tokens never appear at the fine-tuning stage. The researchers therefore choose 15% of the tokens in each input sequence for prediction, and
- 80% of the time, the token is replaced with [MASK],
- 10% of the time, the token is replaced with a random token,
- and 10% of the time, the token remains unchanged.
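The masking rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `mask_tokens`, the toy vocabulary, and the choice to select each position independently with probability 0.15 are all assumptions made for this example, and special tokens are not treated separately.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token at positions chosen for prediction
    and None everywhere else. (Hypothetical helper for illustration.)
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not chosen for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```

The model would then be trained to recover `labels[i]` at every position where it is not `None`, whether or not the input token at that position was actually changed.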
Transformer
- Left-side: Encoder
- Right-side: Decoder
[More Description Required]
Fine-Tuning
In the fine-tuning phase, the researchers used the [SEP] token to separate two inputs, such as sentence pairs in paraphrasing, hypothesis-premise pairs in entailment, question-passage pairs in question answering, and a degenerate text-null pair in text classification or sequence tagging. Feeding inputs this way means the model's architecture does not have to change to fit the task, and fine-tuning updates the parameters of the entire model rather than only a task-specific part. As a result, many task-specific models can be derived from a single pre-trained BERT model, and because fine-tuning is much cheaper than pre-training the whole language model, the same pre-trained model can be reused.
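As a rough sketch of this input format, the snippet below packs one or two token lists into a single sequence. The leading [CLS] token and the 0/1 segment ids follow the convention described in the BERT paper and are not covered in the text above. The function name `pack_inputs` is hypothetical, and the tokens are plain Python strings rather than real WordPiece ids.

```python
CLS, SEP = "[CLS]", "[SEP]"

def pack_inputs(tokens_a, tokens_b=None):
    """Build the token sequence and segment ids for one or two text segments."""
    tokens = [CLS] + list(tokens_a) + [SEP]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        # Second segment (e.g. the passage in question answering).
        tokens += list(tokens_b) + [SEP]
        segment_ids += [1] * (len(tokens_b) + 1)
    # With tokens_b omitted, this is the degenerate text-null pair.
    return tokens, segment_ids
```

For example, `pack_inputs(["who", "wrote", "it"], ["devlin", "et", "al"])` yields one sequence with two [SEP] separators, while `pack_inputs(["great", "movie"])` yields the single-segment form used for classification.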
Attention
What is Attention
The basic idea of Attention is that, each time it generates an output, the decoder looks over the entire input. The attention mechanism lets the decoder focus on the information needed to predict that output (or, more precisely, attend to each input position in proportion to its relevance). That is why it is called "Attention". As mentioned above, a Seq2Seq model relies on a fixed-length context vector, which inevitably leads to information loss. Attention partially solves this problem.
Scaled Dot-Product Attention
The two most commonly used attention functions are additive attention and dot-product (multiplicative) attention. Dot-product attention is identical to the scaled dot-product attention used in the Transformer, except for the scaling factor of $1/\sqrt{d_k}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.[8]
An attention function takes three inputs, $Q$ (Query), $K$ (Key), and $V$ (Value), and returns the attention value, which can be written as follows:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V = H
$$
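The formula can be checked with a small, dependency-free sketch. Matrices are stored as lists of rows, and the helper names (`softmax`, `matmul`, `transpose`, `attention`) are chosen for this example only. A real implementation would use an optimized matrix library, as the paragraph above points out.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])                   # dimension of the keys
    scores = matmul(q, transpose(k))  # one row of scores per query
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)         # weighted sum of the value rows
```

Each output row is a convex combination of the rows of $V$, with weights given by how strongly the corresponding query matches each key.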
Multihead Attention
$$
\begin{align}
\text{MultiHead}(Q, K, V) &= \text{Concat}[H_1, H_2, \dots , H_h]W^O \\
\text{where } H_i &= \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\end{align}
$$
- "Scaled Dot-Product Attention"
- $d_k$: the dimension of keys
- $W^O$: a weight matrix of a fully connected layer to match the dimensionality of the output of the attention layer to that of the input
The concatenated outputs of the attention functions can be thought of as a mixture of focused word-meaning vectors in a high-dimensional space. Multihead attention can also be thought of as several people with different perspectives looking at the same sentence and each forming a different point of view.
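To make the MultiHead formula concrete, here is a minimal pure-Python sketch. It assumes the per-head projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ are passed in as lists (`w_q`, `w_k`, `w_v`) and that matrices are lists of rows. All names are illustrative, and a real model would learn these weights rather than receive them as arguments.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(x * y for x, y in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value rows for this query.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """MultiHead(Q, K, V) = Concat[H_1, ..., H_h] W^O."""
    heads = [
        attention(matmul(q, wq_i), matmul(k, wk_i), matmul(v, wv_i))
        for wq_i, wk_i, wv_i in zip(w_q, w_k, w_v)
    ]
    # Concatenate the heads along the feature axis, row by row.
    concat = [sum((head[r] for head in heads), []) for r in range(len(q))]
    return matmul(concat, w_o)
```

With $h$ heads of width $d_k$, the concatenated matrix has $h \cdot d_k$ columns, and `w_o` maps it back to the model dimension so that the output has the same shape as the input.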
Visualization
How it works
Visualization
Source: https://trungtran.io/2019/03/29/neural-machine-translation-with-attention-mechanism/
Experiments
[More Description Required]
Appendix
References
Bert image from Bert's twitter (https://twitter.com/bertsesame)
[1] "Evolution and Future of Natural Language Processing", https://www.xenonstack.com/blog/evolution-of-nlp (Retrieved 20 July 2021)
[2] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.
[3] Sepp Hochreiter, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
[4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. "Learning internal representations by error propagation." California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[5] Keyulu Xu, et al. "How powerful are graph neural networks?." arXiv preprint arXiv:1810.00826 (2018).
[6] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", arXiv preprint arXiv:1409.0473 (2015).
[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, "Attention Is All You Need", Advances in neural information processing systems (2017).
[8] Ashish Vaswani, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Code
https://github.com/google-research/bert
More articles to read
https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf