Updated:

torchtext, spaCy를 활용하여 Vocab을 만드는 실습해볼 것이다.

  1. spaCy의 Tokenizer를 활용해서 vocab을 직접 구현해본다.

  2. torchtext의 메소드를 활용해서 vocab을 만들어본다.

1. WikiText-2 데이터 불러오기

torchtext의 데이터셋인 WikiText-2를 사용하기 위해 데이터를 불러온다.

1.1. torchdata 설치

torchtext에서 데이터셋을 불러오려면 먼저 torchdata를 설치해야한다. PyTorch의 version에 주의하여 적합한 version을 설치하면 된다.

torchdata를 설치할 때 아래와 같은 ERROR가 발생할 수 있다.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.

해결 방법

  • folium==0.2.1을 먼저 설치한 다음 torchdata를 재설치한다.
    • torchdata가 잘못된 방법으로 했을 때 설치된 경우 torchdata를 삭제한 후 folium부터 다시 설치한다.
  • colab을 사용하고 있는 경우 런타임을 재실행한 다음 순서대로 설치하면된다.
!pip install folium==0.2.1
!pip install torchdata==0.4.0
!pip show torchdata

1.2. WikiText-2 데이터셋 불러오기

  • torchtext.datasets.WikiText2(root: str = ‘.data’, split: Union[Tuple[str], str] = (‘train’, ‘valid’, ‘test’))
    • split을 활용하여 필요한 데이터셋만 가져올 수 있다.
      • 여기서는 train 데이터만 활용한다.
      • split=’train’으로 train 데이터만 불러올 수 있다.
from torchtext.datasets import WikiText2
train = WikiText2(split='train')

데이터를 보면 <unk>를 확인할 수 있는데 unknown token을 가리킨다.

for i, text in enumerate(train):
    if i == 5: break
    print(text)
 = Valkyria Chronicles III = 

 

 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . 

 The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n . 

2. spaCy Tokenizer를 이용하여 vocab 직접 구현하기

2.1. spaCy 데이터 설치

!python -m spacy download en_core_web_sm

2.2. Tokenizer 불러오기

import spacy
from spacy.symbols import ORTH

<unk>라는 speical token이 있기 때문에 spacy tokenizer가 <unk>를 하나의 token으로 인식할 수 있도록 special case를 추가해주어야한다.

  • Adding special case tokenization rules, Pattern format
  • ORTH를 이용하여 special_case라는 list에 special token을 추가한다.
  • tokenizer에 add_special_case를 이용하여 ‘<unk>‘를 만날 때 <unk>로 토큰화를 진행할 수 있도록 한다.
spacy_en = spacy.load('en_core_web_sm')
special_case = [{ORTH:'<unk>'}]
spacy_en.tokenizer.add_special_case('<unk>', special_case) 
# TEST
text = 'I use <unk> things.'

for token in spacy_en.tokenizer(text):
    print(token)
I
use
<unk>
things
.

2.3. Vocab 클래스 구현하기

Vocab 클래스는 다음과 같은 역할을 한다.

  • token2id를 이용하여 token이 어떤 id와 매핑되는지 저장한다.
  • id2token을 이용하여 id가 어떤 token과 매핑되는지 저장한다.
  • encode를 이용하여 문장을 토큰화하고 id값으로 바꾼다.
  • decode를 이용하여 id들이 있을 때 적합한 토큰들로 바꾼이후 원래 문장으로 복원한다.
  • special token과 매핑되는 id를 클래스 변수로 저장해서 사용한다.
    • special token으로 unknown token을 사용한다.
    • task에 따라 <sos>, <eos> 같은 special token을 추가로 사용할 수 있다.

※ spacy tokenizer의 반환값인 token을 사용하는 것보다 token.text가 더 정확한 결과를 반환한다.

from collections import Counter
from tqdm.notebook import tqdm
class Vocab:
    UNK_TOKEN = '<unk>'
    UNK_TOKEN_ID = 0

    def __init__(self, data, tokenizer, min_freq):
        self.data = [text for text in data]
        self.en = tokenizer
        self.id2token = list()
        self.token2id = dict()

        self.build_vocab(min_freq)

    def build_vocab(self, min_freq):

        counter = Counter()
        for tokens in tqdm(map(self.en.tokenizer, self.data), total=len(self.data), desc='Building Vocab'):
            counter.update(map(lambda x: x.text, tokens))

        self.id2token = [Vocab.UNK_TOKEN] + [ token for token, freq in counter.items() if freq >= min_freq and token != Vocab.UNK_TOKEN]
        self.token2id = { token:i for i, token in enumerate(self.id2token)}
    
    def encode(self, text):
        encoded = [self.token2id.get(token.text, UNK_TOKEN_ID) for token in self.en.tokenizer(text)]
        return encoded

    def decode(self, sequence):
        decoded = " ".join([self.id2token[token_id] for token_id in sequence])
        return decoded
corpus = Vocab(train, spacy_en, 3)
Building Vocab:   0%|          | 0/36718 [00:00<?, ?it/s]
len(corpus.token2id), len(corpus.id2token)
(33242, 33242)
corpus.token2id['<unk>'], corpus.id2token[0]
(0, '<unk>')
train_text = [text for text in train]
encoded = corpus.encode(train_text[4])

encoded
[2,
 86,
 35,
 87,
 88,
 46,
 89,
 15,
 90,
 91,
 29,
 92,
 93,
 18,
 19,
 94,
 95,
 96,
 4,
 5,
 97,
 17,
 98,
 49,
 99,
 19,
 100,
 101,
 18,
 19,
 51,
 15,
 49,
 102,
 103,
 104,
 105,
 15,
 106,
 25,
 107,
 19,
 35,
 108,
 0,
 42,
 51,
 109,
 17,
 110,
 111,
 0,
 112,
 39,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 15,
 121,
 122,
 4,
 5,
 97,
 123,
 124,
 125,
 17,
 126,
 92,
 127,
 18,
 128,
 129,
 19,
 130,
 17,
 86,
 35,
 131,
 132,
 133,
 134,
 135,
 37,
 136,
 137,
 138,
 17,
 7]
print(f"decode  : {corpus.decode(encoded)}")
print(f"original: {train_text[4]}")
decode  :   The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May ' n . 

original:  The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n . 

3. torchtext를 이용해서 vocab 만들기

torchtext의 get_tokenizerbuild_vocab_from_iterator를 사용하여 비교적 쉽게 vocab을 구성할 수 있다.

  • build_vocab_from_iterator()
    • iterator를 이용하여 vocab을 만든다.
    • parameter
      • iterator: vocab을 만들때 사용되는 iterator
      • min_freq: vocab에 포함되기 위한 최소 빈도수
      • specials: special token의 list
    • torchtext.vocab.Vocab 클래스를 반환한다.
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
torch_tokenizer = get_tokenizer('basic_english')
torch_tokenizer('I use <unk> thing.')
['i', 'use', '<unk>', 'thing', '.']
torch_vocab = build_vocab_from_iterator(map(torch_tokenizer, train), min_freq=3, specials=['<unk>'])

build_vocab_from_iterator()은 torchtext.vocab.Vocab 클래스의 object를 반환한다. 반환된 Vocab object를 이용하여 아래와 같은 일들을 할 수 있다.

  • get_stoi(): token2id를 반환한다.
  • get_itos(): id2token을 반환한다.
  • __getitem__(): token에 매핑되는 id값을 반환한다.
  • lookup_token(): id에 매핑되는 token을 반환한다.
  • forward(): encode 한다.(문장을 토큰화하고 id값으로 바꾼다.)
    • nn.Module의 forward()처럼 작동한다.
  • lookup_indices(): encode 한다.(문장을 토큰화하고 id값으로 바꾸어 바꾼다.)
  • lookup_tokens(): decode 한다.(id들을 적합한 토큰들로 바꾼다.)
# get_stoi(), get_itos()
p_token2id = torch_vocab.get_stoi()
p_id2token = torch_vocab.get_itos()

print(len(p_token2id.keys()), len(p_id2token))
print(p_token2id['<unk>'], p_id2token[0])
28782 28782
0 <unk>
# __getitem__, lookup_token()
torch_vocab['<unk>'], torch_vocab.lookup_token(0)
(0, '<unk>')
# 토큰화 테스트 문장
train_text = [text for text in train]
# forward(), lookup_indices()
encoded1 = torch_vocab(torch_tokenizer(train_text[4]))
encoded2 = torch_vocab.lookup_indices(torch_tokenizer(train_text[4]))

print(f"encoded1: {encoded1}")
print(f"encoded2: {encoded2}")
encoded1: [1, 67, 135, 369, 6, 297, 2, 3245, 65, 8, 184, 1742, 4, 1, 138, 1177, 13, 3849, 3869, 304, 3, 66, 24, 3277, 1, 1176, 579, 4, 1, 93, 2, 24, 44, 4380, 1842, 18273, 2, 89, 14, 407, 1, 67, 61, 0, 17, 93, 19588, 3, 278, 3749, 0, 25905, 5, 3024, 25883, 19949, 99, 435, 25, 479, 11649, 2, 163, 18, 3849, 3869, 304, 537, 17954, 27012, 3, 8, 184, 157, 4, 1145, 3886, 1, 1623, 3, 1, 67, 11, 15, 658, 1071, 10, 3610, 19, 75, 11, 1586, 3]
encoded2: [1, 67, 135, 369, 6, 297, 2, 3245, 65, 8, 184, 1742, 4, 1, 138, 1177, 13, 3849, 3869, 304, 3, 66, 24, 3277, 1, 1176, 579, 4, 1, 93, 2, 24, 44, 4380, 1842, 18273, 2, 89, 14, 407, 1, 67, 61, 0, 17, 93, 19588, 3, 278, 3749, 0, 25905, 5, 3024, 25883, 19949, 99, 435, 25, 479, 11649, 2, 163, 18, 3849, 3869, 304, 537, 17954, 27012, 3, 8, 184, 157, 4, 1145, 3886, 1, 1623, 3, 1, 67, 11, 15, 658, 1071, 10, 3610, 19, 75, 11, 1586, 3]
# lookup_tokens()
decoded = torch_vocab.lookup_tokens(encoded1)
decoded_sentence = " ".join(decoded)
print(f"decoded: {decoded_sentence}")
print(f"original: {train_text[4]}")
decoded: the game began development in 2010 , carrying over a large portion of the work done on valkyria chronicles ii . while it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . character designer <unk> honjou and composer hitoshi sakimoto both returned from previous entries , along with valkyria chronicles ii director takeshi ozawa . a large team of writers handled the script . the game ' s opening theme was sung by may ' n .
original:  The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n . 

Leave a comment