Image-to-Image Translation
Generative Adversarial Networks really took the field by storm last year when the iconic Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks paper was published. Since then, every couple of weeks you're sure to come across some application built on the core DCGAN that creates something astonishing. A couple of days back, researchers over at BAIR came up with one such result.
Their paper, titled Image-to-Image Translation with Conditional Adversarial Networks, applies the conditional GAN framework to translate between different representations of images. What I personally found interesting is that they use a single, general-purpose architecture and loss for every translation task. Just a couple of years ago, even translating between two classes of representations required a crazy amount of task-specific model tuning, but GANs have truly risen to the challenge.
This paper really feels like it was written by an engineer at heart. Not only is there a GitHub repo with some really clean code, but the paper itself discusses many of its optimisations the way an engineer, rather than a researcher, would. Let's get to the interesting bits:
First, they give the generator a U-Net shape, based on the observation that the input and output share a lot of low-level information (edges and overall structure) that should be passed straight across the network. As the paper puts it:
To give the generator a means to circumvent the bottleneck for information like this, we add skip connections. Specifically, we add skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.
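Here's roughly what that looks like in code. This is just a minimal PyTorch sketch of the idea, not the authors' exact generator (theirs is deeper, uses dropout, and has its own filter counts); the point is the torch.cat calls that wire encoder layer i into decoder layer n − i.

```python
# Minimal U-Net-style generator sketch: the depths and channel counts here
# are illustrative, not the pix2pix paper's exact configuration.
import torch
import torch.nn as nn

class UNetGenerator(nn.Module):
    def __init__(self, in_channels=3, out_channels=3, base=64):
        super().__init__()
        # Encoder: progressively downsample (assuming 256x256 inputs).
        self.down1 = self._down(in_channels, base)        # 256 -> 128
        self.down2 = self._down(base, base * 2)           # 128 -> 64
        self.down3 = self._down(base * 2, base * 4)       # 64  -> 32
        self.bottleneck = self._down(base * 4, base * 8)  # 32  -> 16
        # Decoder: each block consumes the previous decoder output
        # concatenated with the mirrored encoder output (the skip connection).
        self.up3 = self._up(base * 8, base * 4)           # 16 -> 32
        self.up2 = self._up(base * 4 * 2, base * 2)       # 32 -> 64
        self.up1 = self._up(base * 2 * 2, base)           # 64 -> 128
        self.final = nn.Sequential(
            nn.ConvTranspose2d(base * 2, out_channels, 4, stride=2, padding=1),
            nn.Tanh(),                                    # 128 -> 256
        )

    def _down(self, c_in, c_out):
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.2),
        )

    def _up(self, c_in, c_out):
        return nn.Sequential(
            nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(),
        )

    def forward(self, x):
        d1 = self.down1(x)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        b = self.bottleneck(d3)
        # Skip connections: concatenate channels at layer i with layer n - i.
        u3 = self.up3(b)
        u2 = self.up2(torch.cat([u3, d3], dim=1))
        u1 = self.up1(torch.cat([u2, d2], dim=1))
        return self.final(torch.cat([u1, d1], dim=1))
```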
Next observation: using just an L1/L2 reconstruction loss produces blurry images, because those losses only capture low-frequency correctness and provide no incentive for high-frequency correctness (sharp edges, texture). So they design a discriminator that focuses specifically on high-frequency structure:
In order to model high-frequencies, it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture – which we term a PatchGAN – that only penalizes structure at the scale of patches. This discriminator tries to classify if each N × N patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D.
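Again, a rough PyTorch sketch rather than the paper's exact 70×70 PatchGAN (the class name and filter counts are mine): the key idea is that the discriminator is fully convolutional, scores each local patch as real or fake, and averages those per-patch scores.

```python
# PatchGAN-style discriminator sketch: each output unit only "sees" a local
# patch of the input, and the per-patch scores are averaged into one decision.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=6, base=64):  # 6 = input + output image, concatenated
        super().__init__()
        def block(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.2),
            )
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            block(base, base * 2, 2),
            block(base * 2, base * 4, 2),
            block(base * 4, base * 8, 1),
            nn.Conv2d(base * 8, 1, 4, stride=1, padding=1),  # one logit per patch
        )

    def forward(self, x, y):
        # Conditional discriminator: it sees the input image and the
        # (real or generated) output image together.
        scores = self.net(torch.cat([x, y], dim=1))  # (B, 1, H', W') patch logits
        return scores.mean(dim=[1, 2, 3])            # average over all patches
```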
They also keep an L1 loss in the objective so the output stays close to the ground truth and isn't dominated by the GAN term alone, which tends to hallucinate overly colourful detail and artifacts (a bit like the Deep Dream filter).
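Putting the two pieces together, the generator's objective is roughly the adversarial term plus a weighted L1 term; the λ = 100 weighting comes from the paper, while the helper below is a hypothetical sketch that assumes the discriminator sketch above.

```python
# Sketch of the generator's combined objective: the GAN term pushes toward
# realistic-looking patches, the L1 term keeps the output near the ground
# truth. Names and the BCE-with-logits formulation are just one way to
# write it down; only lambda_l1 = 100 is taken from the paper.
import torch
import torch.nn.functional as F

def generator_loss(discriminator, x, y_real, y_fake, lambda_l1=100.0):
    patch_scores = discriminator(x, y_fake)           # averaged patch logits, shape (B,)
    adv = F.binary_cross_entropy_with_logits(
        patch_scores, torch.ones_like(patch_scores))  # "fool the discriminator"
    l1 = F.l1_loss(y_fake, y_real)                    # stay close to ground truth
    return adv + lambda_l1 * l1
```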
It’s interesting to note that despite training on limited data, the results they got are pretty sweet. The maps-to-aerial-photos model was trained on just 1096 images, yet it fooled Mechanical Turk workers almost one in every five times, which is impressive, to say the least. I can already imagine use cases for such methods in the design industry, be it for designing clothes, or shoes, or even toys!
Go read the paper, and check out the code as well (I’ll probably do another post on the code, if I can get through it soon enough).
Image-to-Image Translation with Conditional Adversarial Networks, by Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros