7.1 GAT

yuuuun 2020. 11. 23. 00:19

www.youtube.com/watch?v=NSjpECvEf0Y

Attention
- The model itself also learns which parts it should focus on while learning
- Explainability + model performance
Graph Attention Neural Network
- A deep learning model that learns the structure of graph data, made up of nodes and edges, using an attention mechanism that assigns weights to the important nodes
Attention
- Text data & Recurrent Neural Network
  - An RNN is a neural network model suited to learning sequential data
  - Text data requires analysis that accounts for co-occurrence and sequential patterns.
- Seq2seq
  - RNN encoder + RNN decoder
  - Machine translation
  - Problems that arise
    - Long-term dependency
    - Vanishing/exploding gradient
- Key, Query, Value
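  - Dictionary data structure
  - If the query matches a key, the corresponding value is returned
  - How a dictionary lookup produces its result
    - Similarity(key, query): compute the similarity between each key and the query
    - SimXValue(sim, value): multiply the similarity by the value
    - Result(outputs): return the sum of the similarity-weighted values
    - $result = \sum_i similarity(key_i, query) \cdot value_i$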
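The lookup steps above can be sketched in a few lines of NumPy. This is only an illustration: the function names (`exact_match`, `dot_similarity`, `lookup`) and the toy keys and values are made up for this sketch and do not come from the lecture. With an exact-match similarity the sum reduces to an ordinary dictionary lookup; with a dot-product similarity every key contributes in proportion to how similar it is to the query.

```python
import numpy as np

def exact_match(key, query):
    # Similarity used by an ordinary dictionary: 1 if key == query, else 0.
    return 1.0 if np.array_equal(key, query) else 0.0

def dot_similarity(key, query):
    # Hypothetical "soft" similarity chosen for this sketch: the dot product.
    return float(np.dot(key, query))

def lookup(keys, values, query, similarity):
    # Step 1: Similarity(key, query) for every key.
    sims = np.array([similarity(k, query) for k in keys])
    # Step 2: SimXValue(sim, value), i.e. scale each value by its similarity.
    weighted = sims[:, None] * values
    # Step 3: Result = sum over all keys.
    return weighted.sum(axis=0)

keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 20.0], [5.0, 5.0]])
query = np.array([1.0, 0.0])

hard = lookup(keys, values, query, exact_match)     # only the first key matches
soft = lookup(keys, values, query, dot_similarity)  # first and third keys contribute
print(hard)  # [10.  0.]
print(soft)  # [15.  5.]
```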
  - Key, Query, Value in Attention
    - Attention: the process of computing the similarity between the query and the keys, then computing a weighted sum of the values
    - Attention score: the weight multiplied onto each value
    - Considerations
      - Key, Query, Value = Vectors (Matrix/Tensor)
      - Similarity Function
    - $A(q, K, V) = \sum_i softmax(f(K, q))_i V_i$
    - Example 1) Machine translation
    - Example 2) Document classification
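As a rough sketch of the generalized attention form for a single query, the snippet below normalizes the similarities with a softmax and uses them as weights on the values. The dot product is used as the similarity function `f` purely as an assumption for this example (the notes leave `f` open), and the toy matrices are invented for illustration.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum first for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(q, K, V, f=None):
    # f is the similarity (alignment) function; dot product is an assumed default.
    if f is None:
        f = lambda K, q: K @ q
    scores = f(K, q)            # one similarity per key, shape (n,)
    weights = softmax(scores)   # attention scores: non-negative, sum to 1
    return weights @ V, weights # weighted sum of the values

K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 20.0], [5.0, 5.0]])
q = np.array([2.0, 0.0])

context, weights = attention(q, K, V)
print(weights)  # keys 0 and 2 get equal, larger weights than key 1
print(context)
```

Unlike the raw similarity-weighted sum, the softmax keeps the weights on a common scale, so the result is always a convex combination of the values.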
  - Similarity Function (Alignment Model)
  - Attention + (alpha)
    - Feature Representation by RNN-Based Network
      - Attention is very frequently combined with an RNN
      - Example papers
        
        Bidirectional RNN with Attention
        
        Hierarchical Attention Network
    - Feature Representation by CNN-based Network
      - TextCNN(2014)
      - Character-level CNN (2015)
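    - From Attention to Self-Attention (problems)
      - RNN-Based Network
        
        Because it is built for sequential data, it is not suited to parallel computing (training has to proceed step by step, so it cannot be parallelized).
        
        Calculation time and complexity increase.
        
        Vanishing gradient and long-term dependency problems occur.
      - CNN-Based Network
        
        It has a long path length. The window size is a hyperparameter, so a problem can arise when the very first and very last words are both important. That is, the important parts may not be taken into account together by a convolution operation.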
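To make the "long path length" point for CNNs concrete, here is a small back-of-the-envelope calculation. It assumes plain stride-1, undilated 1-D convolutions, where each stacked layer widens the receptive field by `kernel_size - 1` tokens; the helper name `conv_layers_needed` is hypothetical. It counts how many layers are needed before one output position can see both the first and the last token.

```python
import math

def conv_layers_needed(seq_len, kernel_size):
    """Minimum number of stacked stride-1, undilated 1-D conv layers so that
    a single output position covers the whole sequence.

    Receptive field after L layers is 1 + L * (kernel_size - 1).
    """
    if kernel_size < 2:
        raise ValueError("kernel_size must be at least 2")
    if seq_len <= 1:
        return 0
    return math.ceil((seq_len - 1) / (kernel_size - 1))

print(conv_layers_needed(50, 3))  # 25
print(conv_layers_needed(50, 5))  # 13
```

Self-attention, by contrast, relates every pair of positions within a single layer, which is the motivation for the next section.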
  - Self-Attention
    Transformer - Attention Is All You Need
    - Transformer
      - Scaled dot-product attention (Self-Attention)
      - The hidden states of the word embedding vectors are denoted by X
      - Generalized Attention Form: $A(q, K, V) = \sum_i softmax(f(K, q))_i V_i$
        
        Similarity function = Dot-product
        
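        Matmul $f(K, Q) = QK^T$, where $K = XW^K$, $Q = XW^Q$, $V = XW^V$
        
        The same X is represented in several different ways through these linear transformations.
        
        Why the dot product is used: it is faster and more computationally efficient (it reduces to a matrix multiplication, which is easy to compute).
        
        Scaling $\frac{QK^T}{\sqrt{d_k}}$
        
        If the dimension $d_k$ becomes too large, the dot products can take extreme values, so they are divided by $\sqrt{d_k}$.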
        
        Softmax
        
        $softmax(\frac{QK^T}{\sqrt{d_k}})$
        
        Matmul
        
        $softmax(\frac{QK^T}{\sqrt{d_k}})V$
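The four steps (Matmul, Scaling, Softmax, Matmul) can be put together in a minimal NumPy sketch. The shapes, the random weights, and the function name `self_attention` are assumptions made for this example only; in a real model $W^Q$, $W^K$, $W^V$ are learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax with the usual max-subtraction for stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Linear transformations of the same input X.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T                      # Matmul: similarity of every query with every key
    scaled = scores / np.sqrt(d_k)        # Scaling by sqrt(d_k)
    weights = softmax(scaled, axis=-1)    # Softmax over the keys, one row per query
    return weights @ V, weights           # Matmul with V: weighted sum of values

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 3                 # toy sizes: 4 tokens, 8-dim embeddings
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)      # (4, 3)
print(weights.shape)  # (4, 4)
```

Each row of `weights` sums to 1 and says how much one token attends to every token in the sequence, including itself.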
      - Multi-head Attention
        
        Multi-head attention combines several instances of the expression above (in the Transformer, the outputs of the heads are concatenated).
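A minimal sketch of multi-head attention follows, assuming the Transformer convention that each head runs scaled dot-product attention with its own projections and that the head outputs are concatenated and passed through an output projection. The names `attention_head`, `multi_head_attention`, and `W_o`, the toy sizes, and the random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    # One head: softmax(Q K^T / sqrt(d_k)) V with its own projections.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

def multi_head_attention(X, heads, W_o):
    # Run every head on the same input, concatenate along the feature axis,
    # then mix the heads with an output projection W_o.
    outputs = [attention_head(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, num_heads = 4, 8, 2
d_k = d_model // num_heads                # per-head dimension
X = rng.normal(size=(n, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))  # (W_q, W_k, W_v)
    for _ in range(num_heads)
]
W_o = rng.normal(size=(num_heads * d_k, d_model))

out = multi_head_attention(X, heads, W_o)
print(out.shape)  # (4, 8)
```

Because each head has separate projections, different heads can attend to different relationships between the same tokens.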
        