※ The original text of this post can be found here. ※ It does not cover everything in the original and also includes newly organized content.

With the pipeline() function, you can run natural language processing tasks with a pretrained model even if you do not know the model's architecture in detail.

You can browse pretrained models on the Model Hub. Each model supports different tasks, so you need to find one that fits your task.

Performing the fill-mask task with pipeline()

Once you have found a model that fits the task, you can easily download the model and its tokenizer with the AutoModel and AutoTokenizer classes.

  • AutoClass is covered in the next post.
  • Because this post performs fill-mask, the model is loaded with the AutoModelForMaskedLM class. (Using AutoModel, which loads the model without a masked-LM head, causes an error.)
!pip install transformers

To perform fill-mask in Korean, load bert-base-multilingual-cased, one of the pretrained BERT models.

  • It is a multilingual model that can handle many languages.

Passing a model name to from_pretrained() loads the pretrained model and tokenizer with little effort.

  • Models generally differ in their configuration and tokenizer, so you must load the configuration and tokenizer that match the model you intend to use.
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = 'bert-base-multilingual-cased'
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

First, check that the tokenizer works as expected.

  • Original: 이순신은 조선 중기의 무신이다. (Yi Sun-sin was a military official of the mid-Joseon period.)
  • mask: 이순신은 [MASK] 중기의 무신이다.

To perform the fill-mask task, the text must contain the [MASK] special token.
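As a small illustration of preparing such an input, the sketch below replaces one word of the original sentence with the mask token and counts the masks in the result. The helper names `mask_word` and `count_masks` are hypothetical and are not part of transformers; the mask string is hard-coded to BERT's `[MASK]`.

```python
MASK_TOKEN = "[MASK]"  # BERT's mask token; other models may use a different string


def mask_word(sentence: str, word: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of `word` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def count_masks(text: str, mask_token: str = MASK_TOKEN) -> int:
    """Count how many mask tokens the text contains."""
    return text.count(mask_token)


original = "이순신은 조선 중기의 무신이다."
masked = mask_word(original, "조선")

print(masked)               # 이순신은 [MASK] 중기의 무신이다.
print(count_masks(masked))  # 1
```

With a loaded tokenizer, `tokenizer.mask_token` gives the mask string for that particular model, which avoids hard-coding it.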

text = "이순신은 [MASK] 중기의 무신이다."

tokenizer.tokenize(text)
['이', '##순', '##신', '##은', '[MASK]', '중', '##기의', '무', '##신', '##이다', '.']

Because BERT uses WordPiece tokenization, some tokens carry the special prefix ##.

  • ## means the token was originally attached to the preceding token. e.g.) 이순신 → 이, ##순, ##신
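To make the meaning of the ## prefix concrete, here is a minimal sketch that glues WordPiece tokens back into words by attaching every ##-prefixed piece to the token before it. `merge_wordpieces` is a hypothetical helper written for this post, not a transformers function, and it ignores details such as special tokens.

```python
def merge_wordpieces(tokens):
    """Join '##'-prefixed pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the prefix and append to the last word.
            words[-1] += token[2:]
        else:
            # A token without the prefix starts a new word.
            words.append(token)
    return words


tokens = ['이', '##순', '##신', '##은', '[MASK]', '중', '##기의', '무', '##신', '##이다', '.']
print(merge_wordpieces(tokens))
# ['이순신은', '[MASK]', '중기의', '무신이다', '.']
```

Note that the final period stays a separate item because the tokenizer emitted it without the ## prefix.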

Use pipeline() to create a function that performs the Korean fill-mask task.

from transformers import pipeline

kor_mask_fill = pipeline(task='fill-mask', model=model, tokenizer=tokenizer)

Use the kor_mask_fill function to perform the fill-mask task.

text = "이순신은 [MASK] 중기의 무신이다."

kor_mask_fill(text)
[{'score': 0.874712347984314,
  'token': 59906,
  'token_str': '조선',
  'sequence': '이순신은 조선 중기의 무신이다.'},
 {'score': 0.0643644854426384,
  'token': 9751,
  'token_str': '청',
  'sequence': '이순신은 청 중기의 무신이다.'},
 {'score': 0.010954903438687325,
  'token': 9665,
  'token_str': '전',
  'sequence': '이순신은 전 중기의 무신이다.'},
 {'score': 0.004647187888622284,
  'token': 22200,
  'token_str': '##종',
  'sequence': '이순신은종 중기의 무신이다.'},
 {'score': 0.0036106701008975506,
  'token': 12310,
  'token_str': '##기',
  'sequence': '이순신은기 중기의 무신이다.'}]

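It returns the candidate tokens for the [MASK] position as a list, ordered from highest to lowest score.

  • score: the probability the model assigns to the token
  • token: token id
  • token_str: token text
  • sequence: the input text with [MASK] replaced by the token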
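The returned list is ordinary Python data, so it can be post-processed directly. The sketch below copies the result shown above into a literal (so it runs without downloading the model), picks the highest-scoring candidate, and drops candidates whose token_str starts with ##, since those are subword continuations rather than standalone words. `whole_word_candidates` is a hypothetical helper, not part of transformers.

```python
# Result copied from the kor_mask_fill output above ('sequence' omitted for brevity).
results = [
    {"score": 0.874712347984314, "token": 59906, "token_str": "조선"},
    {"score": 0.0643644854426384, "token": 9751, "token_str": "청"},
    {"score": 0.010954903438687325, "token": 9665, "token_str": "전"},
    {"score": 0.004647187888622284, "token": 22200, "token_str": "##종"},
    {"score": 0.0036106701008975506, "token": 12310, "token_str": "##기"},
]


def whole_word_candidates(candidates):
    """Keep only candidates that are not '##' continuation pieces."""
    return [c for c in candidates if not c["token_str"].startswith("##")]


best = max(results, key=lambda c: c["score"])
print(best["token_str"])  # 조선

whole = whole_word_candidates(results)
print([c["token_str"] for c in whole])  # ['조선', '청', '전']
```

Filtering this way is a design choice for readability of the output; the model itself still assigns probability to the subword pieces.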
