Evaluating Chunking Strategies for Retrieval | Chroma Research

A) 📘 Overview: Why AI Applications Need a New Retrieval Evaluation

In AI applications, and in retrieval-augmented generation (RAG) in particular, document chunking is a core preprocessing step. Existing information retrieval (IR) benchmarks, however, focus on the ranking and relevance of whole documents, so they cannot measure the token-level efficiency that actually matters to an LLM. An LLM is largely insensitive to where relevant information sits inside its context window, and the relevant information may be scattered across several documents. To address this, the report proposes a new framework that evaluates retrieval performance at the token level.

B) 🧪 New Evaluation Framework and Metrics

The study proposes a new retrieval evaluation framework tailored to AI application settings.

  • Generated evaluation dataset: An LLM is used to generate fact-based queries and the matching verbatim excerpts from a given document corpus. This makes it possible to build an evaluation dataset for a specific domain, and because the generated data is not part of any embedding model's training data, it avoids that source of bias.
  • Data filtering: The generated data goes through two filtering stages to raise its quality. First, the queries are embedded and near-duplicate queries with high cosine similarity are removed. Next, the cosine similarity between each query and its relevant excerpts is computed, and pairs with low relevance are removed.
  • Token-level evaluation metrics:
    • IoU (Intersection over Union): Measures how much the retrieved tokens overlap with the truly relevant tokens, based on Jaccard similarity. It reflects retrieval efficiency.
    • Precision: The fraction of retrieved tokens that are relevant.
    • Recall: The fraction of all relevant tokens that were retrieved.
    • PrecisionΩ: The theoretical maximum precision, assuming every relevant chunk was retrieved. It gives the upper bound on token efficiency under perfect recall.
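The two filtering stages described above can be sketched as follows. This is a minimal illustration, not the report's implementation: the function names and both threshold values are hypothetical, and the embeddings are assumed to be precomputed lists of floats.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def deduplicate_queries(query_vecs, dup_threshold=0.9):
    """Stage 1: keep a query only if it is not too similar to one already kept.

    dup_threshold is a hypothetical value; tune it per corpus.
    """
    kept = []
    for i, q in enumerate(query_vecs):
        if all(cosine(q, query_vecs[j]) < dup_threshold for j in kept):
            kept.append(i)
    return kept

def filter_by_relevance(indices, query_vecs, excerpt_vecs, min_similarity=0.4):
    """Stage 2: drop pairs whose excerpt is only weakly related to the query.

    min_similarity is a hypothetical value; tune it per corpus.
    """
    return [i for i in indices
            if cosine(query_vecs[i], excerpt_vecs[i]) >= min_similarity]

# Toy 2-d "embeddings": query 1 nearly duplicates query 0,
# and query 2 is unrelated to its excerpt.
query_vecs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
excerpt_vecs = [[1.0, 0.1], [1.0, 0.0], [1.0, 0.0]]
unique = deduplicate_queries(query_vecs)
survivors = filter_by_relevance(unique, query_vecs, excerpt_vecs)
```

Running the stages in sequence mirrors the order given in the text: duplicates are removed first, and only the remaining pairs are checked for relevance.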

C) 📚 Corpora Used in the Evaluation

The evaluation is based on five corpora that differ in domain and in how clean the data is, totaling 328,208 tokens and 472 queries.

| Corpus | Tokens | Characteristics |
|---|---|---|
| State of the Union 2024 | 10,444 | A clearly written, well-structured speech |
| Wikitext | 26,649 | Vetted, high-quality Wikipedia articles |
| Chatlogs | 7,727 | Raw ChatGPT conversations (including JSON formatting) |
| Finance | 166,177 | Question-answering data based on financial reports |
| Pubmed | 117,211 | Papers in the life sciences and biomedicine |

D) 🧩 Chunking Algorithms Evaluated

The report evaluates widely used chunking algorithms alongside algorithms newly proposed in the study. ★ marks an algorithm developed in this study.

| Algorithm | Approach | Characteristics |
|---|---|---|
| RecursiveCharacterTextSplitter | Character-based splitting | A heuristic that splits recursively on separators (\n\n, ., and so on). |
| TokenTextSplitter | Fixed-length splitting | Splits the text into a set number of tokens without regard to meaning. |
| KamradtSemanticChunker | Semantic splitting | Sets chunk boundaries by detecting discontinuities in the cosine distance between sentence embeddings. |
| ★ KamradtModifiedChunker | Improved Kamradt method | Adjusts the threshold with a binary search so the user can set the maximum chunk length directly. |
| ★ ClusterSemanticChunker | Semantic similarity optimization | Uses dynamic programming to globally maximize the sum of cosine similarities between the pieces inside each chunk. |
| ★ LLMSemanticChunker | Direct LLM prompting | Asks an LLM to split the text and return the indices at which to split. |
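As a point of reference for the fixed-length approach listed above, the following sketch splits a token sequence into windows of a given size with an optional overlap. It is an illustration only: `split_fixed` is a hypothetical helper, and whitespace-separated words stand in for the tokens a real tokenizer would produce.

```python
def split_fixed(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

# Assumption: words approximate tokens for the purpose of this example.
words = "the quick brown fox jumps over the lazy dog again".split()
no_overlap = split_fixed(words, chunk_size=4)
half_overlap = split_fixed(words, chunk_size=4, overlap=2)
```

With an overlap of half the chunk size, every token except those at the edges appears in two chunks, which is one reason heavily overlapped settings retrieve many redundant tokens.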

E) 📊 Summary of Key Experimental Results

The results below are average scores when retrieving 5 chunks with the OpenAI text-embedding-3-large embedding model.

  • RecursiveCharacterTextSplitter (chunk size 200, no overlap) did not reach the top recall, but it scored well on precision (7.0%) and IoU (6.9%) and was consistently solid overall.
  • The OpenAI default settings (TokenTextSplitter, chunk size 800, overlap 400) gave below-average recall and ranked at or near the bottom on every other metric.
  • ★ **ClusterSemanticChunker** recorded the second-highest recall (91.3%) at chunk size 400 and the highest precision (8.0%) and IoU (8.0%) at chunk size 200.
  • ★ **LLMSemanticChunker** recorded the highest recall (91.9%), but its other metrics were about average.
  • KamradtSemanticChunker performed below average with its default settings, whereas the modified ★ **KamradtModifiedChunker** improved both recall and the other metrics.

F) ⚠️ Limitations and Future Work

  • Dataset limitations: LLM-generated questions can be monotonous in style, and the evaluation dataset is limited in both size and domain diversity.
  • Time and cost not measured: For slow algorithms such as LLMSemanticChunker, runtime cost matters in practice, but the study does not address it.
  • Possibly missing relevant excerpts: The study does not account for the possibility that the corpus contains other relevant excerpts that the LLM did not generate.

G) 🧠 Conclusion

The report proposes a new retrieval evaluation framework suited to AI applications and uses it to quantify how chunking strategy affects retrieval performance. In particular, the newly proposed ClusterSemanticChunker and LLMSemanticChunker achieved high recall through their semantics-based approach. The full code is published on GitHub, which supports reproducibility and extension in further research.


H) QnA

H.1) How Precision, Recall, and IoU Are Measured (with an Example)

The metrics used in the report, including Precision, Recall, and IoU, are computed at the token level. That is, they check matches token by token rather than by whole document or paragraph.

Key idea: Compare the “ground truth” generated by the LLM (the relevant tokens contained in the relevant excerpts) with the “result” returned by the retrieval system after chunking (the tokens in the retrieved chunks).

Definitions:

  • Relevant token set (te): The set of tokens judged to be the “ground truth” for a query. (Example: for the query “What were the main characteristics of the armor used on the ship Atlanta?”, the tokens in the ground-truth excerpt that carry meanings such as ‘Atlanta’, ‘armor’, ‘main characteristics’, ‘3 inches’, and ‘thick’.)
  • Retrieved token set (tr): The set of tokens in the chunks the system returns when the chunked documents are searched. (Example: if a returned chunk contains ‘Atlanta’s hull received two inches of armor’, the tokens of that sentence.)

Measurement method (example):

Document corpus: “The battleship Atlanta’s hull armor was 3 inches thick. It was backed by oak. The upper portion of Atlanta’s hull received two inches of armor.”
Query: “What were the main characteristics of the armor used on the ship Atlanta?”

  1. Generating the “ground truth” with an LLM:

    • Given the document and query above, the LLM produces relevant excerpts containing tokens such as “Atlanta”, “ship”, “armor”, “main characteristics”, “3 inches”, “thick”, and “upper portion”.
    • The **relevant token set (te)** is extracted from these excerpts. (To keep things simple, assume it is “Atlanta”, “ship”, “armor”, “main”, “characteristics”, “3 inches”, “thick”, “upper”, “portion”. In practice there would be more tokens.)
    • te = {"Atlanta", "ship", "armor", "main", "characteristics", "3 inches", "thick", "upper", "portion"} (9 tokens in total)
  2. Producing the retrieval system's “result” after chunking:

    • Suppose the chunking method splits by sentence and the retrieval system returns the first and third sentences.
    • Retrieved chunks: “The battleship Atlanta’s hull armor was 3 inches thick.” (A) + “The upper portion of Atlanta’s hull received two inches of armor.” (B)
    • The retrieved token set (tr) is extracted from these chunks. (To keep things simple, assume it is “Atlanta”, “ship”, “armor”, “3 inches”, “thick”, “upper”, “portion”, “2 inches”.)
    • tr = {"Atlanta", "ship", "armor", "3 inches", "thick", "upper", "portion", "2 inches"} (8 tokens in total)
  3. Computing the metrics:

    • Intersection (te ∩ tr): the tokens present in both sets
      • te ∩ tr = {"Atlanta", "ship", "armor", "3 inches", "thick", "upper", "portion"} (7 tokens in total)
    • Union (te ∪ tr): all tokens present in either set
      • te ∪ tr = {"Atlanta", "ship", "armor", "main", "characteristics", "3 inches", "thick", "upper", "portion", "2 inches"} (10 tokens in total)
    • Precision: |te ∩ tr| / |tr|
      • 7 / 8 = 0.875
      • Interpretation: 7 of the 8 retrieved tokens were actually relevant. (87.5% precision)
    • Recall: |te ∩ tr| / |te|
      • 7 / 9 = 0.778
      • Interpretation: 7 of the 9 relevant tokens were successfully retrieved. (77.8% recall)
    • IoU (Intersection over Union): |te ∩ tr| / |te ∪ tr|
      • 7 / 10 = 0.7
      • Interpretation: The overlap makes up 70% of the union of relevant and retrieved tokens, which shows how closely the retrieved result matches the ground truth overall. (The denominator is commonly computed as |te| + |tr| - |te ∩ tr|, which equals |te ∪ tr|.)
    • F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
      • 2 * (0.875 * 0.778) / (0.875 + 0.778) ≈ 0.824
      • Interpretation: The harmonic mean of Precision and Recall, which combines the two metrics into one score.
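The arithmetic above can be checked with a few lines of set operations. This is a minimal sketch of the simplified example only: `token_metrics` is a hypothetical helper, and the token sets are the illustrative ones assumed in the walkthrough rather than real tokenizer output.

```python
def token_metrics(relevant, retrieved):
    """Token-level precision, recall, IoU, and F1 for two token collections."""
    relevant, retrieved = set(relevant), set(retrieved)
    overlap = relevant & retrieved
    union = relevant | retrieved
    precision = len(overlap) / len(retrieved) if retrieved else 0.0
    recall = len(overlap) / len(relevant) if relevant else 0.0
    iou = len(overlap) / len(union) if union else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "iou": iou, "f1": f1}

# Illustrative sets from the worked example (te: relevant, tr: retrieved).
t_e = {"Atlanta", "ship", "armor", "main", "characteristics",
       "3 inches", "thick", "upper", "portion"}
t_r = {"Atlanta", "ship", "armor", "3 inches", "thick",
       "upper", "portion", "2 inches"}

scores = token_metrics(t_e, t_r)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because the inputs are sets, each token counts once; a position-based implementation that tracks where tokens occur in the corpus would need to handle repeated tokens explicitly.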

By checking the presence of each token and computing the metrics this way, the framework evaluates at the token level how accurately and efficiently the retrieval system surfaced the relevant information.


H.2) KamradtModifiedChunker vs. ClusterSemanticChunker: What Is the Difference?

Both algorithms attempt semantic chunking, but they differ in approach and in what they optimize.

  • KamradtSemanticChunker (default):
    • Method: The text is first split into sentences, and a sliding window is then moved across them. At each step, the cosine distance between the embedding of the current window (a chunk fragment) and the embedding of the next window is computed. Points where this distance exceeds a threshold (a large distance) are treated as chunk boundaries.
    • Characteristics: A greedy approach: each boundary is decided locally, step by step. This is relatively simple to implement, but it can settle on a locally optimal split.
    • Limitation: With the default settings, a threshold that is too high or too low can make the chunks too large or too small.
  • ★ KamradtModifiedChunker (modified version):
    • Improvement: Keeps the greedy approach of KamradtSemanticChunker but lets the user specify the desired maximum chunk length directly.
    • How it works: It uses the same threshold-based boundary detection as KamradtSemanticChunker, but when the detected boundaries produce chunks longer than the maximum length, it adjusts the threshold with a method such as binary search. The goal is to preserve semantic boundaries as far as possible without exceeding the specified maximum length.
    • Key point: More user control. It keeps the benefits of semantic splitting while the length constraint makes it easier to fit the input limits of a RAG system.
  • ★ ClusterSemanticChunker (new proposal):
    • Goal: Overcome the greedy limitation of the Kamradt method and optimize semantic similarity globally.
    • Method:
      1. Split the source text into small pieces first (for example, 50 tokens each) and compute an embedding for each piece.
      2. Apply dynamic programming over these pieces.
      3. When partitioning the text into a sequence of chunks, the dynamic program finds the split points that maximize the sum of cosine similarities between the pieces inside every chunk.
    • Key point: Global optimization. It looks at the whole document and produces the most semantically cohesive chunks, an attempt to find the “most natural units of meaning”.
    • Drawback: Because the optimization relies on statistics of the whole corpus, the chunks must be recomputed when data is added. It can also be more expensive to compute than KamradtModifiedChunker.
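The dynamic-programming step for ClusterSemanticChunker can be sketched as below. This is an illustration under stated assumptions, not the study's code: `cluster_chunks` is a hypothetical name, the similarity matrix is assumed to be precomputed from piece embeddings, and centering the similarities on their mean is an assumption of this sketch, added so that larger chunks are not always preferred.

```python
def cluster_chunks(sim, max_pieces):
    """Partition pieces 0..n-1 into contiguous chunks of at most max_pieces
    pieces, maximizing the summed pairwise similarity inside the chunks."""
    n = len(sim)
    # Assumption: center similarities on their mean so that merging unrelated
    # pieces is penalized instead of always adding a positive reward.
    pairs = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    mean = sum(pairs) / len(pairs) if pairs else 0.0

    def reward(start, end):
        # Sum of centered similarities over all pairs within pieces[start..end].
        return sum(sim[a][b] - mean
                   for a in range(start, end + 1)
                   for b in range(a + 1, end + 1))

    best = [0.0] + [float("-inf")] * n  # best[k]: top score for the first k pieces
    cut = [0] * (n + 1)                 # cut[k]: start index of the last chunk
    for k in range(1, n + 1):
        for size in range(1, min(max_pieces, k) + 1):
            score = best[k - size] + reward(k - size, k - 1)
            if score > best[k]:
                best[k], cut[k] = score, k - size

    # Walk the cut points backwards to recover the chunk boundaries.
    chunks, k = [], n
    while k > 0:
        chunks.append((cut[k], k - 1))
        k = cut[k]
    return chunks[::-1]

# Toy similarity matrix: pieces 0-1 belong together, as do pieces 2-3.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
boundaries = cluster_chunks(sim, max_pieces=3)
```

Each returned tuple is an inclusive (first piece, last piece) range. Because `best[k]` considers every allowed size for the final chunk, the split is optimal for this objective over the whole sequence rather than chosen one boundary at a time.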

In short:

  • Kamradt Modified: “Find where the meaning breaks, and keep the chunks from getting too long.” (user-defined length constraint + semantics-based)
  • Cluster Semantic: “Build the most tightly cohesive groups of meaning across the whole document.” (global optimization + semantics-based)
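The threshold adjustment described for KamradtModifiedChunker can be sketched as a binary search. The names below are hypothetical and the inputs are assumed to be precomputed: `distances[i]` is the cosine distance between sentence i and sentence i+1, and `lengths[i]` is the token count of sentence i. This is one plausible reading of the idea, not the study's implementation.

```python
def chunk_lengths(distances, lengths, threshold):
    """Token length of each chunk when a boundary is placed after every
    sentence whose distance to the next sentence exceeds `threshold`."""
    chunks, current = [], lengths[0]
    for dist, length in zip(distances, lengths[1:]):
        if dist > threshold:
            chunks.append(current)  # close the current chunk at this boundary
            current = length
        else:
            current += length
    chunks.append(current)
    return chunks

def find_threshold(distances, lengths, max_tokens, iterations=30):
    """Binary-search the largest threshold whose chunks all fit in max_tokens.

    Assumes a threshold of 0 is feasible; a single sentence longer than
    max_tokens cannot be fixed by any threshold.
    """
    low, high = 0.0, max(distances, default=0.0)
    if max(chunk_lengths(distances, lengths, high)) <= max_tokens:
        return high  # the whole text already fits in one chunk
    for _ in range(iterations):
        mid = (low + high) / 2
        if max(chunk_lengths(distances, lengths, mid)) <= max_tokens:
            low = mid   # chunks fit: try a higher threshold (fewer splits)
        else:
            high = mid  # a chunk is too long: lower the threshold (more splits)
    return low

distances = [0.1, 0.8, 0.2, 0.6]   # toy distances between 5 sentences
lengths = [50, 50, 50, 50, 50]     # toy token counts per sentence
threshold = find_threshold(distances, lengths, max_tokens=120)
sizes = chunk_lengths(distances, lengths, threshold)
```

The search works because lowering the threshold can only add boundaries, so the longest chunk never grows as the threshold decreases.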

H.3) How Should You Chunk in General? (Based on the Study's Results)

Taken together, the study's results offer no single “right answer”, but they do point to several trends and recommendations.

  1. Recognize the limits of existing methods:

    • TokenTextSplitter (fixed length, the OpenAI default) performed surprisingly poorly. This appears to be because it ignores meaning and structure entirely.
    • With only the default separators of RecursiveCharacterTextSplitter (\n\n, \n, and so on), short and irrelevant chunks can be produced, which degrades performance. Adding . ? ! and similar separators helps.
  2. The importance of semantic chunking:

    • Strategies that take meaning into account, such as KamradtModifiedChunker, ClusterSemanticChunker, and LLMSemanticChunker, tended to perform well in this study; the default KamradtSemanticChunker was the exception. Because LLMs are strong at understanding context, they work more efficiently when given information that is cleanly divided into units of meaning.
  3. Recommendations based on the experimental results:

    • RecursiveCharacterTextSplitter (with improved separators): Stable, solid performance with a suitable chunk size (200 tokens in the study) and no overlap. It is also simple to implement and fast.
    • ClusterSemanticChunker: Achieved very high PrecisionΩ and IoU by optimizing semantic similarity globally. It can be a good choice when semantic cohesion matters, and it performed especially well at chunk size 200.
    • LLMSemanticChunker: Recorded the highest recall, which is an advantage when the aim is to find as much relevant information as possible even when it is scattered. Handing the task directly to an LLM is conceptually interesting, but cost and speed need to be considered.
  4. Considerations:

    • Chunk size: If chunks are too small, relevant information is fragmented and retrieval efficiency drops. If they are too large, a single chunk carries too much unnecessary information, which can hurt cost or accuracy. In the study, the 200 to 400 token range performed well across several strategies.
    • Overlap: Letting adjacent chunks share some text can prevent information from being lost when meaning straddles a chunk boundary. In this study many configurations performed well without overlap, but in practice adding overlap may be worth considering depending on the situation.
    • Domain characteristics: Domains such as finance and medicine can have their own terminology and structure, so a chunking strategy specialized for the domain may be needed.
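The separator recommendation in point 1 can be made concrete with a small recursive splitter. This is a simplified sketch, not LangChain's `RecursiveCharacterTextSplitter`: the function names are hypothetical, length is measured in characters rather than tokens, and whitespace-only pieces are dropped.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", ".", "?", "!", " ")):
    """Split text into pieces of at most max_chars, trying coarse separators first."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep in text:
            parts = text.split(sep)
            pieces = []
            for j, part in enumerate(parts):
                # Keep the separator attached to the part that precedes it.
                piece = part + (sep if j < len(parts) - 1 else "")
                # Pieces that are still too long are split on finer separators.
                pieces.extend(recursive_split(piece, max_chars, separators[i + 1:]))
            return merge(pieces, max_chars)
    # No separator left: fall back to a hard cut.
    return [text[k:k + max_chars] for k in range(0, len(text), max_chars)]

def merge(pieces, max_chars):
    """Greedily join adjacent pieces while the result stays within max_chars."""
    merged = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(piece) <= max_chars:
            merged[-1] += piece
        else:
            merged.append(piece)
    return merged

text = "First sentence. Second one? Third!\n\nNew paragraph here."
chunks = recursive_split(text, max_chars=20)
```

Including ".", "?", and "!" in the separator list lets an overlong paragraph break at sentence ends before the splitter resorts to breaking between words.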

In conclusion, there is no single “best” chunking strategy. In the study, ClusterSemanticChunker and LLMSemanticChunker performed very well, but RecursiveCharacterTextSplitter also delivers reasonable performance with improved settings. In practice, the choice should weigh the characteristics of the data, the requirements of the RAG system (precision vs. recall), and the computational cost.


H.4) Does Generating the Evaluation Dataset Require Looking at the Entire Corpus?

Yes. In general, generating an evaluation dataset requires understanding and using the entire corpus on which the evaluation is based.

The reasons are as follows.

  1. Scope of what the LLM can generate: When the LLM generates queries and relevant excerpts, it should be able to draw on all of the information in the given corpus. If only part of the corpus is provided, the LLM generates queries and excerpts limited to that part, and the evaluation is no longer representative.
  2. Ensuring excerpt accuracy: The LLM must produce excerpts that match text in the provided corpus exactly. Only with access to the whole corpus can it reliably find excerpts that are “fact-based” and “identical to the source”.
  3. Generating diverse queries: By working through the whole corpus, the LLM can generate more varied and complex queries, such as queries about information spread across several sections or about the causes and effects of a particular event.
  4. Filtering accuracy: Understanding the context of the whole corpus also matters when computing query-excerpt similarity for filtering.

However, “looking at the whole corpus” does not necessarily mean “loading it into memory all at once”.

  • Chunking & Sampling: In practice it is difficult to give an LLM the entire corpus in one pass. The corpus can therefore be split into smaller chunks, or several parts of the corpus can be sampled and supplied when the LLM is asked to generate queries and excerpts.
  • Prompt Engineering: The prompt gives clear instructions such as “Referring to the following document…”, and where needed includes excerpts from specific parts of the document, so that the information in the whole corpus is put to use.
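The sampling idea and the exact-match requirement on excerpts can be sketched as follows. The helper names are hypothetical, the LLM call itself is left out, and the window size is an arbitrary illustrative value; this is not the study's generation pipeline.

```python
import random

def sample_window(corpus, window_chars, rng):
    """Pick a random contiguous slice of the corpus to show to the LLM."""
    start = rng.randrange(max(1, len(corpus) - window_chars + 1))
    return corpus[start:start + window_chars]

def locate_excerpts(corpus, excerpts):
    """Return (start, end) character spans for excerpts found verbatim in the
    corpus, or None if any excerpt is not an exact substring."""
    spans = []
    for excerpt in excerpts:
        start = corpus.find(excerpt)
        if start == -1:
            return None  # paraphrased or invented excerpt: reject the item
        spans.append((start, start + len(excerpt)))
    return spans

corpus = ("The battleship Atlanta's hull armor was 3 inches thick. "
          "It was backed by oak. "
          "The upper portion of Atlanta's hull received two inches of armor.")

rng = random.Random(0)  # seeded so the sampled window is repeatable
window = sample_window(corpus, 40, rng)

# Excerpts as an LLM might return them: one verbatim, one paraphrased.
good = locate_excerpts(corpus, ["It was backed by oak."])
bad = locate_excerpts(corpus, ["The hull had two inches of steel."])
```

Storing character spans rather than raw strings makes it straightforward to compare the ground truth against any chunking of the same corpus later.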

This study also states that it “provides the document corpus to the LLM…”, and that corpus is the basis for generating the evaluation dataset. The point is therefore that the information in the whole corpus must be available to the LLM as the starting point of the evaluation framework.