
Martin Arjovsky et al.: https://arxiv.org/abs/1701.07875v3

 

Wasserstein GAN

We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches.


 

1. Introduction

Unsupervised Learning์€ ๋ฐ์ดํ„ฐ (x)์˜ ํ™•๋ฅ  ๋ถ„ํฌ ( P(x))๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ํ•™์Šตํ•œ๋‹ค๋Š” ๊ฒƒ์€ ํ™•๋ฅ  ๋ฐ€๋„๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค. ( P(x))๋ฅผ parameter (θ)์— ๋Œ€ํ•ด ์•„๋ž˜์™€ ๊ฐ™์ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

์‹ค์ œ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ์™€ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๋‚˜ํƒ€๋‚ด์ง€๋Š” ๋ฐ€๋„ ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์ตœ์†Œํ™” ํ•ด์„œ ์‹ค์ œ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ์— ๊ฐ€๊น๊ฒŒ ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ((KL(P_θ | P_r))์„ minimizeํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ํ•™์Šตํ•œ๋‹ค.)

๊ทธ๋Ÿฌ๊ธฐ ์œ„ํ•ด์„œ model density ( P_θ)

์ด์— ๋Œ€ํ•œ ์ „ํ˜•์ ์ธ ํ•ด๊ฒฐ์ฑ…์€ noise ๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ํ•˜์ง€๋งŒ ๊ธฐ์กด์˜ ์—ฐ๊ตฌ๋“ค์—์„œ ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด ์ด๋ฏธ์ง€ GAN์—์„œ noise๋Š” ์ƒ˜ํ”Œ์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ๋‚ฎ์ถ”๊ณ , blurryํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค. maximum likelihood ์ ‘๊ทผ๋ฒ•์—๋Š” noise๊ฐ€ ํ•„์ˆ˜์ ์ด์ง€๋งŒ, ์ด๊ฒƒ์ด ์ •๋‹ต์€ ์•„๋‹ ๊ฒƒ์ด๋‹ค.

 

A more recent approach, rather than computing the probability density directly to obtain P_θ, fixes one simple distribution and passes it through a parameterized function, so that samples from P_θ are generated directly. This approach can represent distributions confined to a low-dimensional manifold and can generate samples quickly. Well-known generative models such as VAEs and GANs work this way.

 

Among these, GANs are flexible in the choice of objective, but GAN training so far has been unstable, suffering from mode dropping and generator/discriminator imbalance. With the standard GAN objective, a discriminator that saturates too quickly can drive the generator to exploit only particular modes: the generator fools the discriminator with samples built from similar information, so the learned distribution ends up representing only part of the data distribution instead of covering it globally, contrary to the goal.

 

To address these problems, the authors look at the measure that defines the distance (or divergence) between P_θ and P_r. The paper is concerned with the various ways of defining such distances, i.e., of measuring how close the model distribution is to the real distribution. With the right choice:

  • There is no need to carefully monitor the balance between the discriminator and the generator during training.
  • Mode dropping, a problem commonly seen in GANs, can be resolved.

 

2. Different Distances

Wasserstein GAN์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๋Œ€ํ•ด ์†Œ๊ฐœํ•˜๊ธฐ ์ „์—, ์—ฌ๊ธฐ์„œ ์‚ฌ์šฉํ•œ ํ™•๋ฅ  ๊ฑฐ๋ฆฌ ์ฒ™๋„์˜ ๋‹น์œ„์„ฑ์— ๋Œ€ํ•ด ์ด์•ผ๊ธฐ ํ•˜๋Š” ๋ถ€๋ถ„์ด๋‹ค.

4๊ฐ€์ง€์˜ distance๋ฅผ ๋น„๊ตํ•˜๊ณ , ์ƒˆ distance๋ฅผ ์ œ์•ˆํ•œ๋‹ค.

[1] Definitions of the four distances

1. Total Variation (TV)

The largest value by which the measures of the two probability distributions can differ:

δ(P_r, P_g) = sup_A |P_r(A) − P_g(A)|

That is, in the figure below, substituting each event A contained in the red region, it is the largest gap between P_r(A) and P_g(A).

Source: Wasserstein GAN ์ˆ˜ํ•™ ์ดํ•ดํ•˜๊ธฐ 1 (see the links at the bottom)
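For discrete distributions the supremum over events is attained at A = {x : p(x) > q(x)} and works out to half the L1 distance; a small numpy sketch with made-up distributions:

```python
import numpy as np

# Two made-up discrete distributions over four outcomes.
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.3, 0.3, 0.3, 0.1])

# sup_A |P(A) - Q(A)| is attained by A = {x : p(x) > q(x)},
# which equals half of the L1 distance between p and q.
tv = 0.5 * np.abs(p - q).sum()
print(tv)  # approximately 0.2
```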

 

2. Kullback-Leibler (KL) divergence

A measure of how similar two distributions are:

the better a model preserves the information content of the original data, the more similar it is to the original.

KL divergence is the expected information loss incurred by the approximation,

KL(P_r ‖ P_g) = ∫ log( p_r(x) / p_g(x) ) p_r(x) dx

and the smaller this value, the closer the approximation.
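A quick numpy check with made-up distributions, which also exposes the asymmetry of KL that motivates the next distance:

```python
import numpy as np

p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.3, 0.3, 0.3, 0.1])

# KL(p || q): expected information loss when q is used to approximate p.
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))
print(kl_pq, kl_qp)  # the two directions differ: KL is not symmetric
```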

 

3. Jensen-Shannon (JS) divergence

KL divergence is not symmetric. The Jensen-Shannon divergence is its symmetrized version: with the mixture P_m = (P_r + P_g)/2,

JS(P_r, P_g) = (1/2) KL(P_r ‖ P_m) + (1/2) KL(P_g ‖ P_m)

Being symmetric, it can play the role of a distance between two probability distributions.

 

4. Earth-Mover (EM) distance

The distance the authors adopt: over the set of all possible joint distributions of P_r and P_g, it takes the infimum of the expected distance.

Among the joint distributions γ ∈ Π(P_r, P_g), it is the smallest achievable expectation of d(x, y), the distance between x and y:

W(P_r, P_g) = inf_{γ ∈ Π(P_r, P_g)} E_{(x, y)∼γ}[ ‖x − y‖ ]

That is, in the figure below, the blue circles are the distribution of X, the red circles the distribution of Y, and γ is a joint distribution; the length of each green line is ‖x − y‖, and the EM distance is the smallest expected green-line length over all such couplings.

Source: Wasserstein GAN ์ˆ˜ํ•™ ์ดํ•ดํ•˜๊ธฐ 1 (see the links at the bottom)
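In one dimension the optimal transport plan has a closed form (match the sorted samples / CDFs), which scipy implements; a sketch with made-up sample sets:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two made-up 1-D sample sets; wasserstein_distance computes the
# 1-D EM distance in closed form via the difference of the CDFs.
x = np.array([0.0, 1.0, 2.0])
y = np.array([5.0, 6.0, 7.0])
d = wasserstein_distance(x, y)
print(d)  # approximately 5.0: every unit of mass moves 5 units to the right
```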

 

[2] Why the EM distance is justified

The paper argues for the EM distance through Example 1: two distributions P_0 and P_θ are defined, and the probability distances between them are computed.

Over all joint distributions of the two data distributions, the EM distance asks how much mass must be transported to turn one distribution into the other; it is the minimal (optimal) cost over such transport plans.

 

To see how the EM distance differs from the other three, convergence serves as the example.

Let Z be uniform on [0, 1], let P_0 be the distribution of (0, Z) ∈ R², and let P_θ be that of (θ, Z); now examine convergence as θ → 0. For θ ≠ 0 the distances are W = |θ|, JS = log 2, KL = +∞, and δ = 1; at θ = 0 all of them are 0.

Except for the EM distance (W, at the top), none of these quantities is continuous at 0: as θ → 0 they fail to converge (KL even diverges to +∞). This shows that on a low-dimensional manifold only the EM distance can be learned by gradient descent; the rest are unusable because they do not converge.
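Example 1 can be checked numerically. Restricting to the first coordinate (the shared Z coordinate transports for free), P_0 and P_θ reduce to point masses at 0 and θ; a numpy sketch:

```python
import numpy as np

def kl(p, q):
    # 0 * log 0 is taken as 0 by convention.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# P_0 puts all mass at x = 0 and P_theta at x = theta; for any theta != 0
# the supports are disjoint, so on the joint support {0, theta}:
p0 = np.array([1.0, 0.0])
ptheta = np.array([0.0, 1.0])
jsval = js(p0, ptheta)
print(jsval)  # log 2, no matter how small theta is

# The EM distance, by contrast, is simply |theta| and shrinks smoothly to 0.
```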

 

Looking at Figure 1, the JS divergence takes the same value in most cases, so it is hard to obtain a useful gradient from it. The EM distance, in contrast, yields a meaningful gradient in most cases.

 

[3] Conditions for using the EM distance

To use the EM distance as a loss function, it must be differentiable (with respect to θ).

P_r is the target distribution we want to learn, and P_θ is the current distribution being trained.

z is a latent variable in the latent space Z, and g is the function mapping a latent variable z to x; the distribution of g_θ(z) is then P_θ.

 

Under these definitions,

1. If g is continuous in θ, then the EM distance between P_r and P_θ is continuous in θ as well.

2. If g is locally Lipschitz, then the EM distance between P_r and P_θ is continuous everywhere and differentiable almost everywhere.

 

Lipschitz:
A Lipschitz function is one whose rate of change is bounded (in two dimensions, the slope of the line between two points is the rate of change).
That is, the function's rate of change must always be smaller than a value called the Lipschitz constant.
Keeping the rate of change below a fixed bound is what guarantees bounded, meaningful gradient values.

 

WGAN์€ ํ›ˆ๋ จ ๊ณผ์ •์—์„œ, K-Lipschitzํ•œ ์„ฑ์งˆ์„ ๋งŒ์กฑ์‹œํ‚ค๊ธฐ ์œ„ํ•ด weight์„ clipping ์‹œ์ผœ์ค€๋‹ค.

 

3. Wasserstein GAN

The objective is rewritten using the Kantorovich-Rubinstein duality.

Because the infimum in the original EM formula cannot be computed directly, it is replaced by the dual form:

W(P_r, P_θ) = sup_{‖f‖_L ≤ 1} E_{x∼P_r}[f(x)] − E_{x∼P_θ}[f(x)]

 

ํŒŒ๋ผ๋ฏธํ„ฐ ์ถ”๊ฐ€, (P_0)๋ฅผ Gθ์— ๋Œ€ํ•œ ์‹์œผ๋กœ ๋ฐ”๊พผ๋‹ค.

์•ž์˜ (P_r)์ด ์žˆ๋Š” ํ–‰์€ ํ•™์Šต๋œ discriminator:์‚ฌ์‹ค์€ critic ์ด๊ธฐ๋•Œ๋ฌธ์—, Pr์˜ ์—ญํ• ์„ ํ•ด์ค€๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์œ„์™€ ๊ฐ™์ด gradient update๋ฅผ ํ•  ๋•Œ์—๋Š” θ์— ๋Œ€ํ•ด ๋ฏธ๋ถ„ํ•˜๋ฉด ์•ž์˜ ํ•ญ์ด ์‚ฌ๋ผ์ง€๊ฒŒ ๋œ๋‹ค.

 

WGAN์˜ ์ตœ์ข…์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค.

- (critic) ์„ ํ•™์Šต

1. (n_critic)๋ฒˆ ๋งŒํผ ์ง„ํ–‰

2. Pr๊ณผ p$(z)$ (Pθ์—ญํ• )๋ฅผ ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋งŒํผ ์ƒ˜ํ”Œ๋ง

3. critic์˜ loss function์„ ์ด์šฉํ•˜์—ฌ parameter w: ํ•จ์ˆ˜ f๋ฅผ update์‹œํ‚ด

5. RMSProp ์„ ์“ด๋‹ค!

6. update ํ›„ clip$(w, -c, c)$๋ผ๋Š” ๋ถ€๋ถ„์ด ์žˆ๋Š”๋ฐ, Lipschitz์กฐ๊ฑด์„ ๋งŒ์กฑํ•˜๋„๋ก parameter w๊ฐ€ [-c, c]๊ณต๊ฐ„์— ์•ˆ์ชฝ์— ์กด์žฌํ•˜๋„๋ก ๊ฐ•์ œํ•˜๋Š” ๊ฒƒ, ์ด๋ฅผ Weight clipping์ด๋ผ๊ณ  ํ•จ.
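The loop above can be sketched in PyTorch; this is a minimal illustration with made-up toy data and layer sizes, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy critic and generator (made-up sizes, not the paper's networks).
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
c, n_critic, batch = 0.01, 5, 64

# --- critic loop: repeated n_critic times per generator step ---
for _ in range(n_critic):
    real = torch.randn(batch, 2) + 3.0      # stand-in samples from P_r
    z = torch.randn(batch, 8)               # latent samples from p(z)
    fake = generator(z).detach()            # samples playing the role of P_theta
    # Critic maximizes E[f(real)] - E[f(fake)], so minimize the negative.
    loss_c = -(critic(real).mean() - critic(fake).mean())
    opt_c.zero_grad()
    loss_c.backward()
    opt_c.step()
    # Weight clipping: clip(w, -c, c) keeps every w inside [-c, c].
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)

# --- generator step: the E[f(real)] term has no theta, so it drops out ---
z = torch.randn(batch, 8)
loss_g = -critic(generator(z)).mean()
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```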

 

clipping์€ ์ข‹์€ ๋ฐฉ๋ฒ•์ด ์•„๋‹ˆ๋ผ๊ณ  ๋…ผ๋ฌธ์—์„œ๋„ ์ฃผ์žฅํ•œ๋‹ค. ์ด๋Š” WGAN์˜ ํ•œ๊ณ„์ ์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์‹คํ—˜ ๊ฒฐ๊ณผ clipping parameter c ๊ฐ€ ํฌ๋ฉด limit$(c๋‚˜ -c)$๊นŒ์ง€ ๋„๋‹ฌํ•˜๋Š” ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์—, ํ•™์Šตํ•˜๋Š” ๋ฐ ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฐ๋‹ค. ๋ฐ˜๋ฉด c๊ฐ€ ์ž‘์œผ๋ฉด, gradient vanish ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜์˜€๋‹ค๊ณ  ํ•œ๋‹ค.

 

[Discriminator vs Critic]

The discriminator, like an ordinary classification network, judges whether an image is real or fake via a sigmoid probability.

The critic, however, uses the Wasserstein objective itself, so its output is an unbounded scalar: a score for how real the image looks. Unlike a sigmoid it does not saturate, so it produces good gradients.
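The saturation difference is easy to see numerically: a sigmoid's gradient collapses for confident scores, while the critic's raw score keeps a constant gradient (numpy sketch, scores are made up):

```python
import numpy as np

def sigmoid_grad(s):
    # d(sigmoid)/ds = sigmoid(s) * (1 - sigmoid(s))
    sig = 1.0 / (1.0 + np.exp(-s))
    return sig * (1.0 - sig)

scores = np.array([0.0, 5.0, 10.0])
grads = sigmoid_grad(scores)
print(grads)  # shrinks rapidly as the score grows: the sigmoid saturates
# A critic outputs the score itself, so d(output)/d(score) = 1 everywhere.
```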

 

๋”ฐ๋ผ์„œ ์ง„์งœ optimal:์ตœ์ ์˜ ์ง€์ ๊นŒ์ง€ ์‰ฝ๊ฒŒ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•˜๊ณ , ์•ž์„œ ์–ธ๊ธ‰ํ–ˆ๋˜

  • discriminator์™€ generator๊ฐ„์˜ ๊ท ํ˜• ๋งž์ถ”๊ธฐ
  • mode dropping $(mode collapse)$ ๋ฌธ์ œ

๋‘๊ฐ€์ง€๊ฐ€ ํ•ด๊ฒฐ๋œ๋‹ค.

 

[Why RMSProp]

Experimentally, training the critic with a momentum-based optimizer such as Adam was unstable.

The reason: when the loss spikes and the samples are poor (typically early in training), the cosine between the direction Adam wants to move (its remembered step) and the current gradient becomes negative. For nonstationary problems (where the optimum keeps moving), RMSProp is known to perform better than momentum-based methods, and the problem defined here is nonstationary.

 

4. Empirical Results

This section covers the experimental results and performance.

๋งจ ์œ„์˜ ๊ทธ๋ž˜ํ”„๋“ค์€ discriminator๋Œ€์‹ ์— critic์„ ์ ์šฉํ•œ ๊ฒƒ์ด๊ณ ,

์™ผ์ชฝ์€ generator๋กœ Multi Layer Perceptron, ์˜ค๋ฅธ์ชฝ์€ DCGAN์„ ์ด์šฉํ•œ ๊ฒฐ๊ณผ์ด๋‹ค.

sigmoid๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•„ wasserstein๊ฑฐ๋ฆฌ๊ฐ€ ์ ์ฐจ์ ์œผ๋กœ ์ค„์–ด๋“ค๊ณ , sample์˜ ๊ฒฐ๊ณผ๋„ ํ›จ์”ฌ ์ข‹์•„์ง„ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

์•„๋ž˜ ๊ทธ๋ฆผ์€ discriminator์™€ generator๋ชจ๋‘ MLP๋ฅผ ์‚ฌ์šฉํ•œ ๊ฒฐ๊ณผ์ด๋‹ค. Sample ๊ทธ๋ฆผ์€ ๋ฌด์—‡์ธ์ง€ ์•Œ์•„๋ณด๊ธฐ ์–ด๋ ต๊ณ , ๊ฐ sample์— ๋Œ€ํ•ด wasserstein distance๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ๋ณด์•˜์„ ๋•Œ ์ƒ์ˆ˜๊ฐ’์œผ๋กœ ๋ณ€ํ™”ํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

์œ„์™€ ๊ฐ™์€ ๋ชจ๋ธ ๊ตฌ์กฐ$(critic + MLP, critic + DCGAN, MLP + MLP)$๋ฅผ ์‚ฌ์šฉํ•˜์˜€์ง€๋งŒ generator iteration๋งˆ๋‹ค JS ๊ฑฐ๋ฆฌ๋ฅผ ์ธก์ •ํ•˜์—ฌ ๊ทธ๋ž˜ํ”„๋ฅผ ์ธก์ •ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค.

sample quality๊ฐ€ ์ข‹์•„์ ธ๋„ JS distance๋Š” ์ฆ๊ฐ€ํ•˜๊ฑฐ๋‚˜ ์ƒ์ˆ˜ ๊ฐ’์„ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

EM์„ ์ž˜ ์‚ฌ์šฉํ–ˆ๋‹ค๋Š” ๊ฒƒ์„ ๋งํ•˜๋Š” ๊ฒƒ์ด๋‹ค! WGAN์€ GAN์—ญ์‚ฌ์ƒ ์ฒ˜์Œ์œผ๋กœ ์ˆ˜๋ ด$(convergence)$ํ•œ ๋ชจ์Šต์„ ๋ณด์—ฌ ์ค€ ๊ฒฝ์šฐ๋ผ๊ณ  ํ•œ๋‹ค.

 

 

Figure 5 shows images generated by a standard GAN and by WGAN (both using the DCGAN generator). Both produce good-quality samples.

 

In Figure 6, however, batch normalization is removed and the number of filters in the generator's DCGAN is held constant, reducing the overall parameter count. WGAN still works well, while the standard GAN does not.

 

=> ๊ฒฐ๋ก ์ ์œผ๋กœ, discriminator์™€ critic๊ฐ„์˜ balance๋ฅผ ๋” ์ด์ƒ ์‹ ๊ฒฝ์“ฐ์ง€ ์•Š์•„๋„ ๋˜๋ฉฐ,

์‹คํ—˜์—์„œ WGAN์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ผ์„ ๋•Œ๋Š” mode collapseํ˜„์ƒ์ด ๋ฐœ์ƒํ•˜์ง€ ์•Š์•˜๋‹ค!!!๋ผ๊ณ  ํ•œ๋‹ค.

 

Figure 7 shows results with the generator changed to an MLP with ReLU. The left is WGAN, the right a standard GAN. The quality is lower than with DCGAN, but the comparison exposes mode collapse: on the right, many images look alike and certain images are never generated at all, while on the left WGAN generates varied images of comparable quality.

 

5. Related Work

* ์ƒ๋žต

 

6. Conclusion

The authors introduce an algorithm called WGAN as an alternative to traditional GAN training.

By reformulating the discriminator's objective with the EM distance, gradients flow well and the GAN can reach the actual optimum.

As a result, GAN training is stabilized and even the mode collapse problem is resolved.

 

 

References:

https://www.slideshare.net/ssuser7e10e4/wasserstein-gan-i

https://ahjeong.tistory.com/7

 

[๋…ผ๋ฌธ ์ฝ๊ธฐ] Wasserstein GAN

๋…ผ๋ฌธ ๋งํฌ : https://arxiv.org/pdf/1701.07875.pdf ๋ถˆ๋Ÿฌ์˜ค๋Š” ์ค‘์ž…๋‹ˆ๋‹ค... ์•„๋ž˜ ๋ธ”๋กœ๊ทธ๊ฐ€ ์ •๋ง ์•Œ๊ธฐ์‰ฝ๊ฒŒ ์„ค๋ช…์ด ์ž˜ ๋˜์–ด์žˆ์Šต๋‹ˆ๋‹ค!! ๋งŽ์ด ์ฐธ๊ณ ํ•˜์˜€๊ณ  ๋‹ค๋ฅธ ๋ถ„๋“ค๋„ ์ฐธ๊ณ ํ•˜์‹œ๋ฉด ์ข‹์„๊ฑฐ ๊ฐ™์Šต๋‹ˆ๋‹คใ…Žใ…Ž https://medium

ahjeong.tistory.com

https://seokdonge.tistory.com/29

 


 

 

๋ฐ˜์‘ํ˜•
