[NLP] Building a Vocab with torchtext and spaCy
In this post we practice building a Vocab with torchtext and spaCy.
- Implement a vocab by hand using spaCy's Tokenizer.
- Build a vocab using the methods torchtext provides.
1. Loading the WikiText-2 data
We load WikiText-2, one of the datasets shipped with torchtext.
1.1. Installing torchdata
To load a dataset from torchtext, torchdata must be installed first. Choose the torchdata version that matches your PyTorch version.
Installing torchdata may produce an ERROR like the one below.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
How to fix it
- Install folium==0.2.1 first, then reinstall torchdata.
- If torchdata was already installed in the wrong order, uninstall it and start again from folium.
- If you are using Colab, restart the runtime and then install the packages in order.
!pip install folium==0.2.1
!pip install torchdata==0.4.0
!pip show torchdata
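Since the right torchdata release depends on the installed PyTorch version, it can help to print the versions that are actually installed before and after running pip. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the package list is simply the packages mentioned in this post.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages mentioned in this post; adjust the list as needed.
for name in ["torch", "torchtext", "torchdata", "folium"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running this once before installing and once after restarting the runtime shows whether the pinned versions took effect.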
1.2. Loading the WikiText-2 dataset
- torchtext.datasets.WikiText2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
- The split argument selects which subsets to load.
- Only the train data is used here.
- Passing split='train' loads just the train data.
from torchtext.datasets import WikiText2
train = WikiText2(split='train')
Looking at the data, you can see <unk>, which denotes the unknown token.
for i, text in enumerate(train):
    if i == 5: break
    print(text)
= Valkyria Chronicles III =
Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .
The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
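To get a feel for how often the unknown token appears, the lines can be split on whitespace and counted, since the WikiText-2 text is already separated by spaces. The sketch below uses two shortened sample lines instead of the full dataset so that it runs on its own. `unk_ratio` is a hypothetical helper; passing the `train` lines loaded above should work the same way, assuming they are plain strings.

```python
def unk_ratio(lines, unk="<unk>"):
    """Return (number of unk tokens, total number of tokens) over whitespace-split lines."""
    unk_count = 0
    total = 0
    for line in lines:
        tokens = line.split()
        total += len(tokens)
        unk_count += tokens.count(unk)
    return unk_count, total


# Shortened sample lines taken from the output above.
sample = [
    "Senjō no Valkyria 3 : <unk> Chronicles",
    "making the game more <unk> for series newcomers .",
]

unk_count, total = unk_ratio(sample)
print(f"{unk_count} of {total} tokens are <unk>")
```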
2. Implementing a vocab by hand with the spaCy Tokenizer
2.1. Installing the spaCy model data
!python -m spacy download en_core_web_sm
2.2. Loading the Tokenizer
import spacy
from spacy.symbols import ORTH
Because the data contains the special token <unk>, a special case must be added so that the spaCy tokenizer treats <unk> as a single token.
- Reference: spaCy documentation, "Adding special case tokenization rules" and "Pattern format"
- Use ORTH to add the special token to a list named special_case.
- Register it on the tokenizer with add_special_case so that the string '<unk>' is tokenized as the single token <unk>.
spacy_en = spacy.load('en_core_web_sm')
special_case = [{ORTH:'<unk>'}]
spacy_en.tokenizer.add_special_case('<unk>', special_case)
# TEST
text = 'I use <unk> things.'
for token in spacy_en.tokenizer(text):
    print(token)
I
use
<unk>
things
.
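The idea behind a special case can be illustrated without spaCy. The sketch below is a simplified stand-in, not spaCy's actual algorithm: a whitespace-separated chunk that appears in a special-case table is emitted unchanged, and every other chunk is split into word and punctuation pieces. `SPECIAL_CASES` and `toy_tokenize` are hypothetical names.

```python
import re

# Strings that must survive as a single token (hypothetical table).
SPECIAL_CASES = {"<unk>"}

# A word, or a single non-word, non-space character such as punctuation.
_PIECE = re.compile(r"\w+|[^\w\s]")


def toy_tokenize(text, special_cases=frozenset()):
    """Split on whitespace, then split punctuation unless the chunk is a special case."""
    tokens = []
    for chunk in text.split():
        if chunk in special_cases:
            tokens.append(chunk)  # keep the chunk whole
        else:
            tokens.extend(_PIECE.findall(chunk))
    return tokens


text = "I use <unk> things."
print(toy_tokenize(text))                 # '<unk>' is broken into '<', 'unk', '>'
print(toy_tokenize(text, SPECIAL_CASES))  # '<unk>' stays a single token
```

Without the special case, any tokenizer that separates punctuation would turn <unk> into three tokens, and the vocab would never see the unknown token as one unit.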
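2.3. Implementing the Vocab class
The Vocab class does the following.
- token2id stores which id each token maps to.
- id2token stores which token each id maps to.
- encode tokenizes a sentence and converts the tokens to ids.
- decode converts a sequence of ids back into tokens and joins them to restore the sentence.
- The special token and its id are stored as class variables.
- The unknown token is used as the special token.
- Depending on the task, additional special tokens such as <sos> and <eos> can be used.
Note: use token.text rather than the Token object returned by the spaCy tokenizer. token.text is a plain string, so counting and dictionary lookups behave as expected.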
from collections import Counter
from tqdm.notebook import tqdm

class Vocab:
    UNK_TOKEN = '<unk>'
    UNK_TOKEN_ID = 0

    def __init__(self, data, tokenizer, min_freq):
        self.data = [text for text in data]  # materialize the iterator so len() works
        self.en = tokenizer
        self.id2token = list()
        self.token2id = dict()
        self.build_vocab(min_freq)

    def build_vocab(self, min_freq):
        # Count how often each token string occurs in the data.
        counter = Counter()
        for tokens in tqdm(map(self.en.tokenizer, self.data), total=len(self.data), desc='Building Vocab'):
            counter.update(map(lambda x: x.text, tokens))
        # <unk> always gets id 0; other tokens are kept only if frequent enough.
        self.id2token = [Vocab.UNK_TOKEN] + [token for token, freq in counter.items() if freq >= min_freq and token != Vocab.UNK_TOKEN]
        self.token2id = {token: i for i, token in enumerate(self.id2token)}

    def encode(self, text):
        # Class variables must be qualified; a bare UNK_TOKEN_ID raises NameError here.
        encoded = [self.token2id.get(token.text, Vocab.UNK_TOKEN_ID) for token in self.en.tokenizer(text)]
        return encoded

    def decode(self, sequence):
        decoded = " ".join([self.id2token[token_id] for token_id in sequence])
        return decoded
corpus = Vocab(train, spacy_en, 3)
Building Vocab: 0%| | 0/36718 [00:00<?, ?it/s]
len(corpus.token2id), len(corpus.id2token)
(33242, 33242)
corpus.token2id['<unk>'], corpus.id2token[0]
(0, '<unk>')
train_text = [text for text in train]
encoded = corpus.encode(train_text[4])
encoded
[2, 86, 35, 87, 88, 46, 89, 15, 90, 91, 29, 92, 93, 18, 19, 94, 95, 96, 4, 5, 97, 17, 98, 49, 99, 19, 100, 101, 18, 19, 51, 15, 49, 102, 103, 104, 105, 15, 106, 25, 107, 19, 35, 108, 0, 42, 51, 109, 17, 110, 111, 0, 112, 39, 113, 114, 115, 116, 117, 118, 119, 120, 15, 121, 122, 4, 5, 97, 123, 124, 125, 17, 126, 92, 127, 18, 128, 129, 19, 130, 17, 86, 35, 131, 132, 133, 134, 135, 37, 136, 137, 138, 17, 7]
print(f"decode : {corpus.decode(encoded)}")
print(f"original: {train_text[4]}")
decode : The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May ' n .
original: The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
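As noted in section 2.3, some tasks need extra special tokens such as <sos> and <eos>. A minimal sketch of that extension follows. To stay self-contained it uses whitespace splitting in place of the spaCy tokenizer, and the class name `SeqVocab` and the id layout (0, 1, 2 reserved for the special tokens) are assumptions, not part of the Vocab class above.

```python
from collections import Counter


class SeqVocab:
    UNK, SOS, EOS = "<unk>", "<sos>", "<eos>"
    SPECIALS = [UNK, SOS, EOS]  # reserved ids 0, 1, 2

    def __init__(self, texts, min_freq=1):
        # Whitespace splitting stands in for a real tokenizer.
        counter = Counter(tok for text in texts for tok in text.split())
        words = [t for t, f in counter.items()
                 if f >= min_freq and t not in self.SPECIALS]
        self.id2token = self.SPECIALS + words
        self.token2id = {t: i for i, t in enumerate(self.id2token)}

    def encode(self, text):
        """Map tokens to ids and wrap the sequence in <sos> ... <eos>."""
        unk_id = self.token2id[self.UNK]
        ids = [self.token2id.get(t, unk_id) for t in text.split()]
        return [self.token2id[self.SOS]] + ids + [self.token2id[self.EOS]]

    def decode(self, ids):
        """Map ids back to tokens, dropping the <sos> and <eos> markers."""
        skip = {self.token2id[self.SOS], self.token2id[self.EOS]}
        return " ".join(self.id2token[i] for i in ids if i not in skip)


vocab = SeqVocab(["the game began", "the game ended"])
ids = vocab.encode("the game paused")  # 'paused' is not in the vocab
print(ids)
print(vocab.decode(ids))
```

Reserving the lowest ids for special tokens keeps them stable when the corpus changes, which is convenient when a model refers to them by id.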
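3. Building a vocab with torchtext
With torchtext's get_tokenizer and build_vocab_from_iterator, a vocab can be built with relatively little code.
- build_vocab_from_iterator()
- Builds a vocab from an iterator.
- Parameters
- iterator: the iterator used to build the vocab; it must yield lists (or iterators) of tokens
- min_freq: the minimum frequency a token needs in order to be included in the vocab
- specials: a list of special tokens
- Returns a torchtext.vocab.Vocab object.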
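What min_freq and specials do can be mimicked in plain Python. The sketch below is not torchtext's implementation; `build_vocab_sketch` is a hypothetical function, and the ordering rule (specials first, then tokens by descending frequency with ties broken alphabetically) is an assumption about the default behavior that should be checked against the installed torchtext version.

```python
from collections import Counter


def build_vocab_sketch(iterator, min_freq=1, specials=()):
    """Return (stoi, itos) built from an iterator that yields lists of tokens."""
    counter = Counter()
    for tokens in iterator:
        counter.update(tokens)

    # Special tokens get fixed positions, so they are excluded from the counts.
    for tok in specials:
        counter.pop(tok, None)

    # Assumed ordering: most frequent first, ties broken alphabetically.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, freq in ordered if freq >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


docs = [["the", "game", "<unk>"], ["the", "series"], ["the", "game"]]
stoi, itos = build_vocab_sketch(docs, min_freq=2, specials=["<unk>"])
print(itos)  # 'series' occurs once, so min_freq=2 drops it
```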
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
torch_tokenizer = get_tokenizer('basic_english')
torch_tokenizer('I use <unk> thing.')
['i', 'use', '<unk>', 'thing', '.']
torch_vocab = build_vocab_from_iterator(map(torch_tokenizer, train), min_freq=3, specials=['<unk>'])
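build_vocab_from_iterator() returns an object of the torchtext.vocab.Vocab class. The returned Vocab object supports the following.
- get_stoi(): returns token2id, a dict mapping tokens to ids.
- get_itos(): returns id2token, a list mapping ids to tokens.
- __getitem__(): returns the id mapped to a token.
- lookup_token(): returns the token mapped to an id.
- forward(): encodes by converting a list of tokens into ids. The text must already be tokenized.
- It works like nn.Module's forward(), so it is invoked by calling the object itself.
- lookup_indices(): encodes in the same way, converting a list of tokens into ids.
- lookup_tokens(): decodes by converting a list of ids into the corresponding tokens.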
# get_stoi(), get_itos()
p_token2id = torch_vocab.get_stoi()
p_id2token = torch_vocab.get_itos()
print(len(p_token2id.keys()), len(p_id2token))
print(p_token2id['<unk>'], p_id2token[0])
28782 28782
0 <unk>
# __getitem__, lookup_token()
torch_vocab['<unk>'], torch_vocab.lookup_token(0)
(0, '<unk>')
# Sentences for the tokenization test
train_text = [text for text in train]
# forward(), lookup_indices()
encoded1 = torch_vocab(torch_tokenizer(train_text[4]))
encoded2 = torch_vocab.lookup_indices(torch_tokenizer(train_text[4]))
print(f"encoded1: {encoded1}")
print(f"encoded2: {encoded2}")
encoded1: [1, 67, 135, 369, 6, 297, 2, 3245, 65, 8, 184, 1742, 4, 1, 138, 1177, 13, 3849, 3869, 304, 3, 66, 24, 3277, 1, 1176, 579, 4, 1, 93, 2, 24, 44, 4380, 1842, 18273, 2, 89, 14, 407, 1, 67, 61, 0, 17, 93, 19588, 3, 278, 3749, 0, 25905, 5, 3024, 25883, 19949, 99, 435, 25, 479, 11649, 2, 163, 18, 3849, 3869, 304, 537, 17954, 27012, 3, 8, 184, 157, 4, 1145, 3886, 1, 1623, 3, 1, 67, 11, 15, 658, 1071, 10, 3610, 19, 75, 11, 1586, 3]
encoded2: [1, 67, 135, 369, 6, 297, 2, 3245, 65, 8, 184, 1742, 4, 1, 138, 1177, 13, 3849, 3869, 304, 3, 66, 24, 3277, 1, 1176, 579, 4, 1, 93, 2, 24, 44, 4380, 1842, 18273, 2, 89, 14, 407, 1, 67, 61, 0, 17, 93, 19588, 3, 278, 3749, 0, 25905, 5, 3024, 25883, 19949, 99, 435, 25, 479, 11649, 2, 163, 18, 3849, 3869, 304, 537, 17954, 27012, 3, 8, 184, 157, 4, 1145, 3886, 1, 1623, 3, 1, 67, 11, 15, 658, 1071, 10, 3610, 19, 75, 11, 1586, 3]
# lookup_tokens()
decoded = torch_vocab.lookup_tokens(encoded1)
decoded_sentence = " ".join(decoded)
print(f"decoded: {decoded_sentence}")
print(f"original: {train_text[4]}")
decoded: the game began development in 2010 , carrying over a large portion of the work done on valkyria chronicles ii . while it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . character designer <unk> honjou and composer hitoshi sakimoto both returned from previous entries , along with valkyria chronicles ii director takeshi ozawa . a large team of writers handled the script . the game ' s opening theme was sung by may ' n .
original: The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
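The decoded sentence differs from the original in casing and in how apostrophes are split. A small sketch with difflib from the standard library makes those differences explicit by comparing the two token sequences. `token_diff` is a hypothetical helper, and the two strings are the last sentence of the output above.

```python
from difflib import SequenceMatcher


def token_diff(original, decoded):
    """Return (original tokens, decoded tokens) pairs for every span that differs."""
    a, b = original.split(), decoded.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "The game 's opening theme was sung by May 'n ."
decoded = "the game ' s opening theme was sung by may ' n ."

for before, after in token_diff(original, decoded):
    print(before, "->", after)
```

If exact reconstruction matters, the hand-written Vocab from section 2 is the closer fit, since it keeps the original casing.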