Longformer

Posted on Sat, Mar 27, 2021 · NLP Paper Review

Table of Contents

What is the problem?

Transformer & Self-Attention

Related work

Transformer-family models for handling long sequences

Transformer-XL (ACL 2019)

Adaptive Span (ACL 2019)

Compressive

Reformer

Sparse Transformer

Routing

BP-Transformer

Blockwise

How does it solve this?

💡

The Longformer v1 paper came out in April 2020, and many long-sequence Transformers have appeared since then. This review is based on the v2 paper released in December 2020 (so it also includes comparisons with post-v1 papers such as BigBird).

Longformer = Windowed Local Attention + Global Attention
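
To make the combined pattern concrete, here is an illustrative sketch that materializes the full boolean attention mask. This is for intuition only; the actual implementation never builds this n × n matrix, which is precisely the point of Longformer.

import torch

def longformer_pattern(seqlen: int, w: int, global_positions):
    """Dense boolean mask of the Longformer attention pattern, for illustration only."""
    idx = torch.arange(seqlen)
    local = (idx[None, :] - idx[:, None]).abs() <= w   # sliding window band of width 2w + 1
    g = torch.zeros(seqlen, dtype=torch.bool)
    g[list(global_positions)] = True
    global_ = g[None, :] | g[:, None]                  # global tokens attend to, and are attended by, everyone
    return local | global_

mask = longformer_pattern(seqlen=16, w=2, global_positions=[0])
print(mask.int())  # diagonal band plus a full first row/column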

Windowed Local Attention

Sliding Window

# https://github.com/allenai/longformer/blob/master/longformer/sliding_chunks.py#L40
def sliding_chunks_matmul_qk(q: torch.Tensor, k: torch.Tensor, w: int, padding_value: float):
    '''Matrix multiplication of query x key tensors using a sliding window attention pattern.
    This implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer)
    with an overlap of size w.'''
    bsz, seqlen, num_heads, head_dim = q.size()
    assert seqlen % (w * 2) == 0
    assert q.size() == k.size()

    chunks_count = seqlen // w - 1

    # group bsz and num_heads dimensions into one, then chunk seqlen into chunks of size w * 2
    q = q.transpose(1, 2).reshape(bsz * num_heads, seqlen, head_dim)
    k = k.transpose(1, 2).reshape(bsz * num_heads, seqlen, head_dim)

    chunk_q = _chunk(q, w)
    chunk_k = _chunk(k, w)

    # matrix multiplication
    # bcxd: bsz*num_heads x chunks x 2w x head_dim
    # bcyd: bsz*num_heads x chunks x 2w x head_dim
    # bcxy: bsz*num_heads x chunks x 2w x 2w
    chunk_attn = torch.einsum('bcxd,bcyd->bcxy', (chunk_q, chunk_k))  # multiply

    # convert diagonals into columns
    diagonal_chunk_attn = _skew(chunk_attn, direction=(0, 0, 0, 1), padding_value=padding_value)

    # allocate space for the overall attention matrix where the chunks are combined. The last dimension
    # has (w * 2 + 1) columns. The first (w) columns are the w lower triangles (attention from a word to
    # w previous words). The following column is attention score from each word to itself, then
    # followed by w columns for the upper triangle.

    diagonal_attn = diagonal_chunk_attn.new_empty((bsz * num_heads, chunks_count + 1, w, w * 2 + 1))

    # copy parts from diagonal_chunk_attn into the combined matrix of attentions
    # - copying the main diagonal and the upper triangle
    diagonal_attn[:, :-1, :, w:] = diagonal_chunk_attn[:, :, :w, :w + 1]
    diagonal_attn[:, -1, :, w:] = diagonal_chunk_attn[:, -1, w:, :w + 1]
    # - copying the lower triangle
    diagonal_attn[:, 1:, :, :w] = diagonal_chunk_attn[:, :, - (w + 1):-1, w + 1:]
    diagonal_attn[:, 0, 1:w, 1:w] = diagonal_chunk_attn[:, 0, :w - 1, 1 - w:]

    # separate bsz and num_heads dimensions again
    diagonal_attn = diagonal_attn.view(bsz, num_heads, seqlen, 2 * w + 1).transpose(2, 1)

    mask_invalid_locations(diagonal_attn, w, 1, False)
    return diagonal_attn

# https://github.com/allenai/longformer/blob/master/longformer/sliding_chunks.py#L150
def sliding_chunks_no_overlap_matmul_qk(q: torch.Tensor, k: torch.Tensor, w: int, padding_value: float):
    '''Variant that splits seqlen into non-overlapping chunks of size w; each query chunk
    attends to its own chunk and its two neighboring chunks (3w keys per query).'''
    bsz, seqlen, num_heads, head_dim = q.size()
    assert seqlen % w == 0
    assert q.size() == k.size()
    # chunk seqlen into non-overlapping chunks of size w
    chunk_q = q.view(bsz, seqlen // w, w, num_heads, head_dim)
    chunk_k = k.view(bsz, seqlen // w, w, num_heads, head_dim)
    chunk_k_expanded = torch.stack((
        F.pad(chunk_k[:, :-1], (0, 0, 0, 0, 0, 0, 1, 0), value=0.0),
        chunk_k,
        F.pad(chunk_k[:, 1:], (0, 0, 0, 0, 0, 0, 0, 1), value=0.0),
    ), dim=-1)
    diagonal_attn = torch.einsum('bcxhd,bcyhde->bcxhey', (chunk_q, chunk_k_expanded))  # multiply
    return diagonal_attn.reshape(bsz, seqlen, num_heads, 3 * w)
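
A minimal shape-check sketch of `sliding_chunks_matmul_qk` (assuming the allenai/longformer package is importable; the tensor sizes are illustrative):

import torch
from longformer.sliding_chunks import sliding_chunks_matmul_qk

bsz, seqlen, num_heads, head_dim = 2, 1024, 12, 64
w = 256  # one-sided window size; seqlen must be a multiple of 2 * w

q = torch.randn(bsz, seqlen, num_heads, head_dim)
k = torch.randn(bsz, seqlen, num_heads, head_dim)

attn_scores = sliding_chunks_matmul_qk(q, k, w, padding_value=0.0)
# each token gets scores for w tokens to its left, itself, and w tokens to its right
print(attn_scores.shape)  # (2, 1024, 12, 513) == (bsz, seqlen, num_heads, 2 * w + 1)

Instead of the full seqlen × seqlen score matrix, memory scales linearly in seqlen with a factor of 2w + 1.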

Dilated Window

Dilated CNN

Global Attention

Linear Projection for Global Attention

class LongformerSelfAttention(nn.Module):
    def __init__(self, config, layer_id):
        [...omitted...]
        self.query = nn.Linear(config.hidden_size, self.embed_dim)
        self.key = nn.Linear(config.hidden_size, self.embed_dim)
        self.value = nn.Linear(config.hidden_size, self.embed_dim)

        self.query_global = nn.Linear(config.hidden_size, self.embed_dim)
        self.key_global = nn.Linear(config.hidden_size, self.embed_dim)
        self.value_global = nn.Linear(config.hidden_size, self.embed_dim)
        [...omitted...]
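
A conceptual sketch of what the separate `*_global` projections are for (illustration only, not the library's actual forward pass, which computes local and global attention in separate passes): positions flagged as global are projected with the global weights, everything else with the local ones.

import torch

def project_queries(hidden_states, is_global, self_attn):
    # hidden_states: (bsz, seqlen, hidden_size); is_global: (bsz, seqlen) bool mask
    q_local = self_attn.query(hidden_states)           # used for sliding-window attention
    q_global = self_attn.query_global(hidden_states)   # used for global attention
    # pick the global projection only at globally-attending positions (illustration)
    return torch.where(is_global.unsqueeze(-1), q_global, q_local)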

Longformer Implementation

💡

The custom CUDA kernel currently provided for PyTorch does not support dilated windows (they are not needed for fine-tuning).

Longformer for Autoregressive LM

Attention Pattern for AR LM

Training for AR LM

Longformer Performance on AR LM

Ablation Study

Longformer for Pretraining & Finetuning (MLM)

allenai/longformer (GitHub): Longformer: The Long-Document Transformer

from transformers import RobertaForMaskedLM, RobertaTokenizerFast

def create_long_model(save_model_to, attention_window, max_pos):
    model = RobertaForMaskedLM.from_pretrained('roberta-base')
    tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=max_pos)
    config = model.config

Attention Pattern for MLM

Position Embeddings for MLM

# extend position embeddings
tokenizer.model_max_length = max_pos
tokenizer.init_kwargs['model_max_length'] = max_pos
current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
config.max_position_embeddings = max_pos
assert max_pos > current_max_pos
# allocate a larger position embedding matrix
new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
# copy position embeddings over and over to initialize the new position embeddings
k = 2
step = current_max_pos - 2
while k < max_pos - 1:
    new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
    k += step
model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed
model.roberta.embeddings.position_ids.data = torch.tensor([i for i in range(max_pos)]).reshape(1, max_pos)
# replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
config.attention_window = [attention_window] * config.num_hidden_layers
for i, layer in enumerate(model.roberta.encoder.layer):
    longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
    longformer_self_attn.query = layer.attention.self.query
    longformer_self_attn.key = layer.attention.self.key
    longformer_self_attn.value = layer.attention.self.value

    longformer_self_attn.query_global = copy.deepcopy(layer.attention.self.query)
    longformer_self_attn.key_global = copy.deepcopy(layer.attention.self.key)
    longformer_self_attn.value_global = copy.deepcopy(layer.attention.self.value)

    layer.attention.self = longformer_self_attn
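
A usage sketch, assuming `create_long_model` saves and returns the converted model and tokenizer as in the allenai conversion notebook (the output path is illustrative):

model_path = 'roberta-base-4096'  # illustrative output directory

model, tokenizer = create_long_model(
    save_model_to=model_path,
    attention_window=512,  # w = 512, as in the pretrained Longformer
    max_pos=4096,          # extend position embeddings from 512 to 4096
)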

Training for MLM

BPC Performance for MLM

Downstream Task Performance for MLM

QA

→ Best among non-GNN models!

Coreference Resolution

Document Classification

Longformer-Encoder-Decoder, a.k.a. "LED"

💡

Not present in the April 2020 v1 paper; added in the December 2020 v2!

💡

Looking at the official Longformer GitHub, there appear to be ongoing experiments applying Longformer to the T5 model.

Longformer on Transformers 🤗

(Fortunately) Longformer has an implementation in the Hugging Face Transformers library! (LED does too)
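
For reference, a minimal sketch with the Hugging Face implementation (the model name and the choice of which token gets global attention are illustrative):

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')

text = ' '.join(['Hello world!'] * 500)  # a long document
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=4096)

# in the Hugging Face API: 1 = global attention, 0 = local (sliding window) attention
global_attention_mask = torch.zeros_like(inputs['input_ids'])
global_attention_mask[:, 0] = 1  # e.g. give the CLS token global attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)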

Training Longformer

Longformer Tokenizer

import torch
from longformer.longformer import Longformer, LongformerConfig
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer

config = LongformerConfig.from_pretrained('longformer-base-4096/')
# choose the attention mode: 'n2', 'tvm' or 'sliding_chunks'
# 'n2': regular full n^2 attention
# 'tvm': a custom CUDA kernel implementation of our sliding window attention
# 'sliding_chunks': a PyTorch implementation of our sliding window attention
config.attention_mode = 'sliding_chunks'

model = Longformer.from_pretrained('longformer-base-4096/', config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.model_max_length = model.config.max_position_embeddings
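
Continuing the snippet above, usage roughly follows the pattern in the allenai/longformer README. Note that this repo encodes the attention type in `attention_mask` itself: 0 = padding, 1 = local attention, 2 = global attention (which positions get global attention below is an illustrative choice).

SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # a long input document

input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
attention_mask[:, [0, 1, 2]] = 2  # give a few tokens global attention (illustrative)

# pad seqlen to a multiple of the window size, as the sliding_chunks implementation requires
input_ids, attention_mask = pad_to_window_size(
    input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)

output = model(input_ids, attention_mask=attention_mask)[0]
print(output.shape)  # (1, padded seqlen, hidden_size)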

Hands-on: Longformer for Sequence Classification with IMDB

Google Colaboratory