I’m pretty late to the neural style transfer bandwagon but lately, I have been getting more and more interested in the applications of Deep Learning to Arts. So I finally decided to get my hands dirty and actually play around with Neural Style Transfer. Though the idea is not new and I’d read this paper some time in 2016, I finally got around to implementing it on my own.
The first paper published on Neural Style Transfer (NST) used an optimization technique: the authors start off with a random noise image and make it more and more desirable with every training iteration. In this technique we do not use a neural network in the true sense, i.e. we aren’t training a network to do anything. We simply take advantage of backpropagation to minimize two defined losses. The tensor we backpropagate into is the stylized image we wish to achieve, which I’ll call the pastiche from here on out. As inputs we have the artwork whose style we wish to transfer, known as the style image, and the picture onto which we wish to transfer that style, known as the content image.
The pastiche is initialized to random noise. This is passed through a network such as VGG-19 pretrained on ImageNet. We use the outputs of various intermediate layers to compute two types of losses: a style loss and a content loss, i.e. how close the pastiche is to the style image in terms of style, and how close it is to the content image in terms of content. These losses are minimized by changing our pastiche image via backpropagation. After a few iterations, our pastiche image is similar to the content image in terms of content and looks similar to the style image in terms of style. The main reason for using intermediate layers of the network is that they provide different semantic representations of the input image, and we use those representations to compare our pastiche to the content and style images.
Let’s define the loss functions for style transfer. The content loss is simply the squared Euclidean distance between the intermediate representations of the content image and the pastiche. The equation for the content loss is —
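In symbols, writing a(C) and a(G) for the activations of the content image and the pastiche at a chosen layer l of shape (n_H, n_W, n_C), and matching the normalization used by the code later in this post:

```latex
J_{\text{content}}^{[l]}(C, G) = \frac{1}{4 \, n_H \, n_W \, n_C} \sum_{\text{all entries}} \left( a^{(C)[l]} - a^{(G)[l]} \right)^{2}
```

The full content loss sums this over the chosen layers and is scaled by the content weight (the hyperparameter discussed below).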
Basically, we make a list of intermediate layers at which we want to compute the content loss; the outer summation in the equation runs over these layers. We pass the content image and the pastiche through the network up to each layer in the list, take the output of that layer, square the difference between each pair of corresponding values, and then sum them all up. One thing to note here is that the authors introduce a hyperparameter that multiplies this loss, so that its overall effect can be tuned at a high level.
The style loss is very similar, except that instead of comparing the raw activations of the style and pastiche images, we compare their Gram matrices at various layers. In linear algebra, the Gram matrix is nothing but the multiplication of a matrix with its own transpose. It can be denoted as follows:
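For a matrix A of shape (n_C, n_H*n_W), where each row is one unrolled channel of activations:

```latex
G = A A^{\top}, \qquad G_{ij} = \sum_{k} A_{ik} A_{jk}
```

Each entry G_ij measures the correlation between channels i and j across all spatial positions, which is why the spatial layout of the image is discarded.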
The Gram matrix captures non-localized information about the image, such as textures and shapes: in essence, style.
Now that we have defined the Gram matrix as containing information about the style of an image, we can take the Euclidean distance between the Gram matrices of the style image and the pastiche across the layers of interest. The style loss is thus defined as follows:
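Writing G(S) and G(G) for the Gram matrices of the style image and the pastiche at a layer l, and using the same normalization as the code below:

```latex
J_{\text{style}}^{[l]}(S, G) = \frac{1}{4 \, n_C^{2} \, (n_H n_W)^{2}} \sum_{i=1}^{n_C} \sum_{j=1}^{n_C} \left( G^{(S)}_{ij} - G^{(G)}_{ij} \right)^{2}
```

The overall style loss is a weighted sum of these per-layer terms over the chosen layers.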
Similar to our content loss equation, we multiply this equation by another hyperparameter, beta, known as the style weight.
Summing the content and style losses gives the total loss for each iteration of training. To summarize: the content loss measures how close the content of the pastiche is to that of the content image, and the style loss measures how close the style of the pastiche is to that of the style image. We then backpropagate this loss through the network to obtain a gradient on the pastiche image, and iteratively change it to make it look more and more like a stylized version of the content image. This is all described in more rigorous detail in the original paper on the topic by Gatys et al.
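In symbols, with content weight alpha and style weight beta, the quantity we minimize is:

```latex
J(G) = \alpha \, J_{\text{content}}(C, G) + \beta \, J_{\text{style}}(S, G)
```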
Now that we know how style transfer works, let’s build it.
def compute_content_cost(a_C, a_G):
    """
    Computes the content cost

    Arguments:
    a_C -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G

    Returns:
    J_content -- scalar content cost computed using the equation above
    """
    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Unroll a_C and a_G into vectors
    a_C_unrolled = tf.reshape(a_C, [-1])
    a_G_unrolled = tf.reshape(a_G, [-1])

    # Compute the cost with tensorflow
    J_content = tf.reduce_sum((a_C_unrolled - a_G_unrolled)**2) / (4 * n_H * n_W * n_C)

    return J_content
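As a quick sanity check of the same formula outside TensorFlow, here is a NumPy sketch; the activation arrays and their shapes are made up purely for illustration:

```python
import numpy as np

# Hypothetical activations of shape (1, n_H, n_W, n_C)
np.random.seed(0)
a_C = np.random.randn(1, 4, 4, 3)
a_G = np.random.randn(1, 4, 4, 3)

_, n_H, n_W, n_C = a_C.shape

# Same formula as compute_content_cost above
J_content = np.sum((a_C - a_G) ** 2) / (4 * n_H * n_W * n_C)
print(J_content)  # a positive scalar; 0 when a_C and a_G are identical
```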
Now we will define the Gram Matrix.
def gram_matrix(A):
    """
    Argument:
    A -- matrix of shape (n_C, n_H*n_W)

    Returns:
    GA -- Gram matrix of A, of shape (n_C, n_C)
    """
    # Note: use tf.matmul for matrix multiplication ('*' would be elementwise)
    GA = tf.matmul(A, tf.transpose(A))

    return GA
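To see what this computes, here is the same operation in NumPy on a tiny made-up matrix (two channels, three spatial positions):

```python
import numpy as np

# Hypothetical unrolled activations: shape (n_C, n_H*n_W) = (2, 3)
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])

G = A @ A.T  # Gram matrix, shape (n_C, n_C)
print(G)
# [[14. 32.]
#  [32. 77.]]
```

Note that the result is symmetric, and each entry is the dot product of two channel rows.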
After generating the style matrix (Gram matrix), our goal is to minimize the distance between the Gram matrix of the “style” image S and that of the “generated” image G.
def compute_layer_style_cost(a_S, a_G):
    """
    Arguments:
    a_S -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G

    Returns:
    J_style_layer -- tensor representing a scalar value, style cost defined above by equation (2)
    """
    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Reshape the images to have them of shape (n_H*n_W, n_C)
    a_S = tf.reshape(a_S, [n_H*n_W, n_C])
    a_G = tf.reshape(a_G, [n_H*n_W, n_C])

    # Compute Gram matrices for both images S and G
    # (notice that gram_matrix expects a matrix of shape (n_C, n_H*n_W))
    GS = gram_matrix(tf.transpose(a_S))
    GG = gram_matrix(tf.transpose(a_G))

    # Compute the loss
    J_style_layer = tf.reduce_sum((GS - GG)**2) / (4 * n_C**2 * (n_W * n_H)**2)

    return J_style_layer
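The same computation in NumPy, with hypothetical shapes, for intuition:

```python
import numpy as np

np.random.seed(1)
n_H, n_W, n_C = 4, 4, 3

# Hypothetical activations, already reshaped to (n_H*n_W, n_C)
a_S = np.random.randn(n_H * n_W, n_C)
a_G = np.random.randn(n_H * n_W, n_C)

# Gram matrices of shape (n_C, n_C); a.T @ a multiplies the transposed unrolling by itself
GS = a_S.T @ a_S
GG = a_G.T @ a_G

J_style_layer = np.sum((GS - GG) ** 2) / (4 * n_C**2 * (n_H * n_W)**2)
print(J_style_layer)  # a positive scalar; 0 when the two Gram matrices match
```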
We get better results if we “merge” styles from different layers as opposed to just one layer. So we will go ahead and define that now.
def compute_style_cost(model, STYLE_LAYERS):
    """
    Computes the overall style cost from several chosen layers

    Arguments:
    model -- our tensorflow model
    STYLE_LAYERS -- A python list containing:
                    - the names of the layers we would like to extract style from
                    - a coefficient for each of them

    Returns:
    J_style -- tensor representing a scalar value, style cost defined above by equation (2)
    """
    # Initialize the overall style cost
    J_style = 0

    for layer_name, coeff in STYLE_LAYERS:
        # Select the output tensor of the currently selected layer
        out = model[layer_name]

        # Set a_S to be the hidden layer activation from the layer we have selected,
        # by running the session on out
        a_S = sess.run(out)

        # Set a_G to be the hidden layer activation from the same layer. Here, a_G references
        # model[layer_name] and isn't evaluated yet. Later in the code, we'll assign the image G
        # as the model input, so that when we run the session, this will be the activations
        # drawn from the appropriate layer, with G as input.
        a_G = out

        # Compute style cost for the current layer
        J_style_layer = compute_layer_style_cost(a_S, a_G)

        # Add coeff * J_style_layer of this layer to the overall style cost
        J_style += coeff * J_style_layer

    return J_style
Now we will define the total cost to optimize.
def total_cost(J_content, J_style, alpha = 10, beta = 40):
    """
    Computes the total cost function

    Arguments:
    J_content -- content cost coded above
    J_style -- style cost coded above
    alpha -- hyperparameter weighting the importance of the content cost
    beta -- hyperparameter weighting the importance of the style cost

    Returns:
    J -- total cost as defined by the formula above.
    """
    J = alpha * J_content + beta * J_style

    return J
Solving the optimization problem
Here’s what the final script will have to do:
- Create an Interactive Session
- Load the content image
- Load the style image
- Randomly initialize the image to be generated
- Load the VGG-19 model
- Build the TensorFlow graph:
  - Run the content image through the VGG-19 model and compute the content cost
  - Run the style image through the VGG-19 model and compute the style cost
  - Compute the total cost
- Define the optimizer and the learning rate
- Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step.
# Reset the graph
tf.reset_default_graph()

# Start interactive session
sess = tf.InteractiveSession()

content_image = scipy.misc.imread("images/louvre_small.jpg")
content_image = reshape_and_normalize_image(content_image)

style_image = scipy.misc.imread("images/monet.jpg")
style_image = reshape_and_normalize_image(style_image)

generated_image = generate_noise_image(content_image)
imshow(generated_image)

model = load_vgg_model("pretrained-model/imagenet-vgg-verydeep-19.mat")

# Assign the content image to be the input of the VGG model
sess.run(model['input'].assign(content_image))

# Select the output tensor of layer conv4_2
out = model['conv4_2']
a_C = sess.run(out)
a_G = out

# Compute the content cost
J_content = compute_content_cost(a_C, a_G)

# Assign the input of the model to be the "style" image
sess.run(model['input'].assign(style_image))

# Compute the style cost
J_style = compute_style_cost(model, STYLE_LAYERS)

J = total_cost(J_content, J_style, alpha = 10, beta = 40)

optimizer = tf.train.AdamOptimizer(2.0)
train_step = optimizer.minimize(J)

def model_nn(sess, input_image, num_iterations = 200):
    # Initialize global variables (you need to run the session on the initializer)
    sess.run(tf.global_variables_initializer())

    # Run the noisy input image (initial generated image) through the model. Use assign().
    sess.run(model['input'].assign(input_image))

    for i in range(num_iterations):
        # Run the session on the train_step to minimize the total cost
        _ = sess.run(train_step)

        # Compute the generated image by running the session on the current model['input']
        generated_image = sess.run(model['input'])

        # Print every 20 iterations
        if i % 20 == 0:
            Jt, Jc, Js = sess.run([J, J_content, J_style])
            print("Iteration " + str(i) + " :")
            print("total cost = " + str(Jt))
            print("content cost = " + str(Jc))
            print("style cost = " + str(Js))

            # Save the current generated image in the "/output" directory
            save_image("output/" + str(i) + ".png", generated_image)

    # Save the last generated image
    save_image('output/generated_image.jpg', generated_image)

    return generated_image

model_nn(sess, generated_image)
And that’s it! We’re done. We used an optimization method to generate a stylized version of the content image. You can now render any image into the style of any painting — albeit slowly, as the optimization process is iterative. Here’s an example that I made from our code — Illini Quad in spring stylized as winter.
Hope you enjoyed this article and found this to be as fascinating and fun as I do. Happy learning!
1. Here is the original paper on neural style transfer, which proposed the optimization process.