
[Paper Review] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

해드위그 2025. 2. 12. 14:58

Intro

* A backbone built on top of ViT.

* Swin Transformer can be summarized as ViT + (1) a hierarchical structure + (2) shifted windows.

 

* Attention is first computed only inside each window, after which the windows exchange information with one another.

 

Method

The overall architecture is as shown above.

Network

  • input: an image of resolution H x W x 3 is fed in and split into non-overlapping patches.
  • stage 1: each patch is mapped to a user-defined dimension C for the Transformer (Linear Embedding), then passed through 2 Swin Transformer blocks, which keep the shape unchanged (H/4 x W/4 x C).
  • stage 2: Patch Merging reduces the (H/4 x W/4 x C) feature map to (H/8 x W/8 x 2C), which then passes through 2 Swin Transformer blocks that keep the shape unchanged.
  • stages 3 and 4: only the dimensions and the number of blocks differ; everything else is the same.
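
The shape bookkeeping in the list above can be sketched in a few lines of Python. This is only an illustration: the function name `swin_stage_shapes` is hypothetical, the patch size of 4 is implied by the H/4 x W/4 output of stage 1, and the 224 x 224 input with C = 96 is just an example setting.

```python
def swin_stage_shapes(height, width, embed_dim, patch_size=4, num_stages=4):
    """Return the (H, W, C) feature-map shape produced by each stage.

    Hypothetical helper for illustration only. Stage 1 splits the image into
    patch_size x patch_size patches and embeds them into embed_dim channels;
    each later stage halves the resolution and doubles the channels.
    """
    total_stride = patch_size * 2 ** (num_stages - 1)
    if height % total_stride or width % total_stride:
        raise ValueError("height and width must be divisible by the total stride")

    h, w, c = height // patch_size, width // patch_size, embed_dim
    shapes = [(h, w, c)]
    for _ in range(num_stages - 1):
        h, w, c = h // 2, w // 2, c * 2  # effect of Patch Merging
        shapes.append((h, w, c))
    return shapes


shapes = swin_stage_shapes(224, 224, 96)
for stage, shape in enumerate(shapes, start=1):
    print(f"stage {stage}: {shape}")
```

With these example values the stage 1 output is 56 x 56 x 96 and the stage 2 output is 28 x 28 x 192, matching the H/4 x W/4 x C and H/8 x W/8 x 2C pattern.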

Patch Merging: the step that reduces the resolution

  • The stage 1 output (H/4 x W/4 x C) is split into 2 x 2 groups.
  • Each resulting group has shape (H/8 x W/8 x C), and the 4 groups are concatenated along the channel axis.
  • The merged (H/8 x W/8 x 4C) tensor is then reduced to half its channels, 2C.
  • The same procedure is applied in every subsequent stage.
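
A minimal NumPy sketch of the steps above, assuming a single (H, W, C) feature map and a plain matrix multiply for the 4C to 2C reduction. The function name `patch_merging` and the random `weight` are placeholders for illustration, not the official implementation, and the normalization layer used in practice is left out.

```python
import numpy as np


def patch_merging(x, weight):
    """Merge every 2x2 neighborhood of an (H, W, C) map into one position.

    x:      feature map of shape (H, W, C), with H and W even
    weight: projection matrix of shape (4C, 2C) (learned in a real model)
    """
    h, w, _ = x.shape
    if h % 2 or w % 2:
        raise ValueError("H and W must be even")

    # The four positions of each 2x2 group; every slice is (H/2, W/2, C).
    x0 = x[0::2, 0::2, :]
    x1 = x[1::2, 0::2, :]
    x2 = x[0::2, 1::2, :]
    x3 = x[1::2, 1::2, :]

    merged = np.concatenate([x0, x1, x2, x3], axis=-1)  # (H/2, W/2, 4C)
    return merged @ weight                               # (H/2, W/2, 2C)


rng = np.random.default_rng(0)
c = 96                                            # example channel count
x = rng.standard_normal((56, 56, c))              # stands in for the stage 1 output
weight = rng.standard_normal((4 * c, 2 * c)) * 0.02
out = patch_merging(x, weight)
print(out.shape)
```

The spatial size halves in each direction while the channel count doubles, so each merging step cuts the number of tokens by a factor of 4.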

=> The hierarchical structure helps the model learn representations better and also gives an advantage in computation speed.

 

Swin Transformer Block

Window

By splitting the feature map into windows, Swin has a computational advantage over ViT.

The cost of the attention computation was estimated as follows.

For Swin, the window size is fixed and can be treated as a constant, so the amount of computation grows only linearly with HW, whereas global self-attention as in ViT grows quadratically with it.

 

Shift Window

  • The feature map is split into windows, and self-attention is performed independently in each one.
  • Thanks to cyclic shifting, almost no additional computation is required.
  • The partition moves from the top-left toward the bottom-right.
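
The window partition and the cyclic shift can be sketched with NumPy as below. The helper names `window_partition` and `cyclic_shift` and the tiny 8 x 8 map with window size 4 are assumptions for illustration. The attention itself, and the masking the paper applies so that wrapped-around regions do not attend to each other, are omitted.

```python
import numpy as np


def window_partition(x, m):
    """Split an (H, W, C) map into non-overlapping windows: (num_windows, m, m, C)."""
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m, m, c)


def cyclic_shift(x, shift):
    """Roll the map toward the top-left by `shift` positions along both spatial axes."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))


m = 4                                      # example window size
x = np.arange(8 * 8).reshape(8, 8, 1)      # toy feature map; values encode positions

regular = window_partition(x, m)                        # regular partition
shifted = window_partition(cyclic_shift(x, m // 2), m)  # shifted partition
restored = cyclic_shift(cyclic_shift(x, m // 2), -(m // 2))  # undo the shift

print(regular.shape, shifted.shape)
print(shifted[0, :, :, 0])  # covers rows 2..5 and cols 2..5 of the original map
```

After the shift, the first window straddles all four of the original windows, which is how information crosses window boundaries while the number of windows, and therefore the cost, stays the same.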

Relative Position Bias

  • Relative distances are computed with respect to the current position.
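
As a rough sketch of how such relative distances can be turned into a lookup index: inside an M x M window, the offset between two positions ranges from -(M-1) to M-1 along each axis, so a table with (2M-1)² entries covers every case. The function name `relative_position_index` and the zero-filled `bias_table` are placeholders assumed for illustration; in a real model the table is learned and the looked-up bias is added to the attention scores.

```python
import numpy as np


def relative_position_index(m):
    """For every pair of positions in an m x m window, return a table index.

    Output shape is (m*m, m*m); values lie in [0, (2m-1)**2 - 1].
    """
    coords = np.stack(np.meshgrid(np.arange(m), np.arange(m), indexing="ij"))
    coords = coords.reshape(2, -1)                  # (2, m*m): row and column of each token
    rel = coords[:, :, None] - coords[:, None, :]   # (2, m*m, m*m), offsets in [-(m-1), m-1]
    rel = rel + (m - 1)                             # shift offsets to [0, 2m-2]
    return rel[0] * (2 * m - 1) + rel[1]            # flatten (row, col) offset to one index


m = 7                                      # example window size
index = relative_position_index(m)
bias_table = np.zeros((2 * m - 1) ** 2)    # placeholder; learned parameters in practice
bias = bias_table[index]                   # (m*m, m*m) bias, one value per token pair
print(index.shape, index.min(), index.max())
```

Pairs with the same relative offset share one table entry, so the bias depends only on how far apart two tokens are, not on where they sit in the window.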

 
