The image transformation network is shown below. For a given style image, the network is trained on the MS-COCO dataset to minimize a perceptual loss, regularized by total variation. The perceptual loss combines a feature reconstruction loss and a style reconstruction loss, both computed from the activations of pretrained VGG16 layers. The feature reconstruction loss is the mean squared error between feature representations, while the style reconstruction loss is the squared Frobenius norm of the difference between the Gram matrices of the feature maps.
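To make the loss terms concrete, here is a minimal PyTorch sketch of the three components. It assumes the activations have already been extracted from chosen VGG16 layers; the function names, the Gram normalization, and the use of a mean rather than a sum are illustrative assumptions, not the exact configuration used to train these models.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feats):
    # feats: (batch, channels, height, width) activations from a VGG16 layer
    b, c, h, w = feats.size()
    flat = feats.view(b, c, h * w)
    # Normalizing by c*h*w keeps the scale comparable across layers (an assumption)
    return torch.bmm(flat, flat.transpose(1, 2)) / (c * h * w)

def feature_reconstruction_loss(gen_feats, content_feats):
    # Mean squared error between feature representations
    return F.mse_loss(gen_feats, content_feats)

def style_reconstruction_loss(gen_feats, style_feats):
    # Squared Frobenius norm of the Gram-matrix difference
    # (F.mse_loss averages the squared entries, so this matches up to a constant)
    return F.mse_loss(gram_matrix(gen_feats), gram_matrix(style_feats))

def total_variation_loss(img):
    # Regularizer that penalizes abrupt pixel-to-pixel changes in the output
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw
```

In training, these terms would be summed with per-term weights, with the style loss typically accumulated over several VGG16 layers and the feature loss taken from a single deeper layer.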
--gpu: ID of the GPU to use (if not specified, training runs on the CPU)
--visualize: save the style transfer result for a predefined image every 1,000 iterations during training to a folder called "visualize"
So, to train on a GPU with mosaic.jpg as my style image, MS-COCO downloaded into a folder named coco, and a sample image visualized throughout training, I would use the following command:
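The exact command isn't reproduced here, but a plausible invocation would look like the sketch below. Only --gpu and --visualize come from the flag list above; the script name train.py and the --style-image/--dataset flags are hypothetical placeholders.

```
# Hypothetical invocation: train.py, --style-image, and --dataset are assumptions;
# --gpu and --visualize are the flags documented above.
python train.py --style-image mosaic.jpg --dataset coco --gpu 0 --visualize
```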
Model trained on mosaic.jpg applied to a few images:
And here is a GIF showing how the output changes during training. Notably, the network produces qualitatively appealing output within 1,000 iterations.
Udine
Model trained on udine.jpg applied to a few images: