A shot at Neural Style Transfer

I’m pretty late to the neural style transfer bandwagon, but lately I have been getting more and more interested in the applications of deep learning to art. So I finally decided to get my hands dirty and actually play around with Neural Style Transfer. The idea is not new, and I’d read this paper some time in 2016, but I only now got around to implementing it on my own.

The first paper published on Neural Style Transfer (NST) used an optimization technique: the authors start with a random noise image and make it more and more desirable with every training iteration. In this technique we do not train a neural network in the usual sense, i.e. we aren’t updating any network weights. We simply take advantage of backpropagation to minimize two defined losses. The tensor we backpropagate into is the stylized image we wish to produce, which I’ll call the pastiche from here on out. The inputs are the artwork whose style we wish to transfer, known as the style image, and the picture we wish to transfer that style onto, known as the content image.

The pastiche is initialized to be random noise. It is passed through a network pretrained on ImageNet, such as VGG-16. We use the outputs of various intermediate layers to compute two types of losses, style loss and content loss: how close the pastiche is to the style image in style, and how close it is to the content image in content. Those losses are minimized by changing the pastiche itself via backpropagation. After enough iterations, the pastiche resembles the content image in content and the style image in style. The main reason for using intermediate layers is that they give representations of the input at different levels of abstraction, and we compare the pastiche to the content and style images in those representations.


Let’s define the loss function for style transfer. The content loss is simply the squared Euclidean distance between the intermediate representations of the content image and the pastiche. The equation for the content loss is:

Content Loss

Basically, we make a list of the intermediate layers at which we compute the content loss; the outer summation is the sum across all of these layers. We pass the content image and the pastiche through the network up to a given layer in the list, take that layer’s output for each, square the difference between corresponding values, and then sum them all up. One thing to note is that the authors introduce a hyperparameter that multiplies this sum, so the influence of the content loss can be tuned at a high level.
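Written out from the description above, the content loss looks roughly like this. The notation is my own assumption rather than the paper’s exact formula: $F^{l}(x)$ is the output of layer $l$ for image $x$, $\mathcal{C}$ is the list of content layers, $c$ is the content image, $p$ is the pastiche, and $\alpha$ is the content weight.

```latex
\mathcal{L}_{\text{content}}(c, p)
  = \alpha \sum_{l \in \mathcal{C}} \sum_{i,j}
    \left( F^{l}_{ij}(c) - F^{l}_{ij}(p) \right)^{2}
```

The code later in this post uses a single content layer and also divides by 4 · n_H · n_W · n_C, which only rescales the loss.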

The style loss is very similar, except that instead of comparing the raw layer outputs for the style image and the pastiche, we compare the Gram matrices of those outputs at various layers. In linear algebra, the Gram matrix of a matrix is simply its product with its own transpose. It can be denoted as follows:

Gram Matrix Equation

The Gram matrix captures non-localized information about the image, such as texture, shapes, and weights: in essence, style.
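To make “non-localized” concrete, here is a small NumPy sketch (not part of the TensorFlow code in this post). It assumes the features are laid out as a (channels, positions) matrix, and the helper name `gram` is just for illustration. Shuffling the spatial positions leaves the Gram matrix unchanged, because each entry is a sum over all positions.

```python
import numpy as np

def gram(features):
    """Gram matrix of a (n_C, n_H*n_W) feature matrix; result has shape (n_C, n_C)."""
    return features @ features.T

rng = np.random.default_rng(0)
features = rng.standard_normal((3, 16))       # 3 channels, 4x4 positions flattened
shuffled = features[:, rng.permutation(16)]   # same values, positions scrambled

G = gram(features)
G_shuffled = gram(shuffled)
print(G.shape)                      # (3, 3)
print(np.allclose(G, G_shuffled))   # True
```

Entry (i, j) of the result measures how strongly channels i and j fire together across the whole image, regardless of where.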

Now that we have defined the Gram matrix as containing information about the style of the image, we can find the Euclidean distance between the Gram matrices of the style image and the pastiche across the layers of interest. The style loss is thus defined as follows:

Style Loss Equation

Similar to our content loss equation, we multiply this equation by another hyperparameter, beta, known as the style weight.
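Written out from the description above, the style loss looks roughly like this. The notation is my own assumption rather than the paper’s exact formula: $F^{l}(x)$ is the (channels × positions) matrix of layer-$l$ outputs for image $x$, $\mathcal{S}$ is the list of style layers, $s$ is the style image, and $p$ is the pastiche.

```latex
G^{l}(x) = F^{l}(x)\, F^{l}(x)^{\top}, \qquad
\mathcal{L}_{\text{style}}(s, p)
  = \beta \sum_{l \in \mathcal{S}} \sum_{i,j}
    \left( G^{l}_{ij}(s) - G^{l}_{ij}(p) \right)^{2}
```

The code later in this post additionally gives each layer its own coefficient and divides each layer’s term by 4 · n_C² · (n_H · n_W)², which changes the scaling but not the idea.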

Summing the content and style losses gives the total loss for each iteration of training. To summarize, the content loss measures how close the content of the pastiche is to the content image, and the style loss measures how close the style of the pastiche is to the style image. We then backpropagate this loss through the network to get a gradient on the pastiche image, and iteratively update the pastiche so that it looks more and more like a stylized content image. This is all described in more rigorous detail in the original paper on the topic by Gatys et al.
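The key point, that we descend on the image rather than on any weights, can be seen in a toy NumPy sketch. This is deliberately not the real method: there is no CNN, the “features” are just the matrix itself, the gradient is written by hand, and all names and constants are made up for illustration.

```python
import numpy as np

def gram(X):
    return X @ X.T

def total_loss(X, C, G_S, alpha, beta):
    content = np.sum((X - C) ** 2)        # squared distance to the content features
    style = np.sum((gram(X) - G_S) ** 2)  # squared distance between Gram matrices
    return alpha * content + beta * style

def loss_gradient(X, C, G_S, alpha, beta):
    # Gradient of alpha*||X - C||^2 + beta*||X X^T - G_S||_F^2 with respect to X
    return 2 * alpha * (X - C) + 4 * beta * (gram(X) - G_S) @ X

rng = np.random.default_rng(0)
C = rng.standard_normal((3, 8))           # stand-in for content features
G_S = gram(rng.standard_normal((3, 8)))   # stand-in for the style Gram matrix
X = rng.standard_normal((3, 8))           # the "pastiche", initialized as noise

alpha, beta, lr = 1.0, 0.1, 1e-3
initial = total_loss(X, C, G_S, alpha, beta)
for _ in range(200):
    X -= lr * loss_gradient(X, C, G_S, alpha, beta)   # update the image itself
final = total_loss(X, C, G_S, alpha, beta)
print(initial, final)
```

In the real thing the features come from a pretrained network and the gradient comes from backpropagation, but the loop has the same shape: only `X` changes.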


Now that we know how style transfer works, let’s build it.

Content Cost

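import scipy.misc        # imread is used further down; it was removed in SciPy 1.2
import tensorflow as tf  # this post uses the TensorFlow 1.x API (sessions, tf.train)

def compute_content_cost(a_C, a_G):
    """
    Computes the content cost

    Arguments:
    a_C -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G

    Returns:
    J_content -- scalar that you compute using equation 1 above.
    """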

    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Flatten a_C and a_G into 1-D vectors
    a_C_unrolled = tf.reshape(a_C, [-1])
    a_G_unrolled = tf.reshape(a_G, [-1])

    # compute the cost with tensorflow
    J_content = tf.reduce_sum((a_C_unrolled - a_G_unrolled)**2) / (4 * n_H * n_W * n_C)

    return J_content
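To sanity-check the normalization by hand, here is a NumPy mirror of the same arithmetic. It is a sketch for checking numbers, not the TensorFlow function itself, and `content_cost_np` is a name I made up.

```python
import numpy as np

def content_cost_np(a_C, a_G):
    """NumPy mirror of compute_content_cost for activations of shape (1, n_H, n_W, n_C)."""
    _, n_H, n_W, n_C = a_G.shape
    return np.sum((a_C - a_G) ** 2) / (4 * n_H * n_W * n_C)

a_C = np.ones((1, 2, 2, 3))
a_G = np.zeros((1, 2, 2, 3))

# 12 squared differences of 1, divided by 4 * 2 * 2 * 3 = 48
cost = content_cost_np(a_C, a_G)
print(cost)  # 0.25
```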

Now we will define the Gram Matrix.

def gram_matrix(A):
    """
    Argument:
    A -- matrix of shape (n_C, n_H*n_W)

    Returns:
    GA -- Gram matrix of A, of shape (n_C, n_C)
    """
    GA = tf.matmul(A, tf.transpose(A))  # matrix product of A with its transpose, not the elementwise '*'

    return GA

After generating the style matrix (the Gram matrix), our goal is to minimize the distance between the Gram matrix of the “style” image S and that of the “generated” image G.

def compute_layer_style_cost(a_S, a_G):
    """
    Arguments:
    a_S -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G

    Returns:
    J_style_layer -- tensor representing a scalar value, style cost defined above by equation (2)
    """
    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Reshape the images to have them of shape (n_H*n_W, n_C)
    a_S = tf.reshape(a_S, [n_H*n_W, n_C])
    a_G = tf.reshape(a_G, [n_H*n_W, n_C])

    # Computing gram_matrices for both images S and G
    GS = gram_matrix(tf.transpose(a_S)) #notice that the input of gram_matrix is A: matrix of shape (n_C, n_H*n_W)
    GG = gram_matrix(tf.transpose(a_G))

    # Computing the loss
    J_style_layer = tf.reduce_sum((GS - GG)**2) / (4 * n_C**2 * (n_W * n_H)**2)

    return J_style_layer
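The reshape-then-transpose step is the part that is easiest to get wrong, so here is a NumPy mirror of the same computation with a value small enough to check by hand. It is a sketch for checking shapes and numbers, not the TensorFlow function itself, and `layer_style_cost_np` is a name I made up.

```python
import numpy as np

def layer_style_cost_np(a_S, a_G):
    """NumPy mirror of compute_layer_style_cost for activations of shape (1, n_H, n_W, n_C)."""
    _, n_H, n_W, n_C = a_G.shape
    # (1, n_H, n_W, n_C) -> (n_H*n_W, n_C) -> transpose -> (n_C, n_H*n_W)
    A_S = a_S.reshape(n_H * n_W, n_C).T
    A_G = a_G.reshape(n_H * n_W, n_C).T
    GS = A_S @ A_S.T   # (n_C, n_C)
    GG = A_G @ A_G.T   # (n_C, n_C)
    return np.sum((GS - GG) ** 2) / (4 * n_C**2 * (n_H * n_W) ** 2)

a_S = np.zeros((1, 2, 2, 3))
a_G = np.ones((1, 2, 2, 3))

# GG is a 3x3 matrix of 4s, so the numerator is 9 * 16 = 144;
# the denominator is 4 * 3**2 * 4**2 = 576
cost = layer_style_cost_np(a_S, a_G)
print(cost)  # 0.25
```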

We get better results if we “merge” styles from different layers as opposed to just one layer. So we will go ahead and define that now.

def compute_style_cost(model, STYLE_LAYERS):
    """
    Computes the overall style cost from several chosen layers

    Relies on the global session `sess`, with the style image already
    assigned as the model input when this function is called.

    Arguments:
    model -- our tensorflow model
    STYLE_LAYERS -- A python list containing:
                        - the names of the layers we would like to extract style from
                        - a coefficient for each of them

    Returns:
    J_style -- tensor representing a scalar value, style cost defined above by equation (2)
    """

    # initialize the overall style cost
    J_style = 0

    for layer_name, coeff in STYLE_LAYERS:

        # Select the output tensor of the currently selected layer
        out = model[layer_name]

        # Set a_S to be the hidden layer activation from the layer we have selected, by running the session on out
        a_S = sess.run(out)

        # Set a_G to be the hidden layer activation from same layer. Here, a_G references model[layer_name]
        # and isn't evaluated yet. Later in the code, we'll assign the image G as the model input, so that
        # when we run the session, this will be the activations drawn from the appropriate layer, with G as input.
        a_G = out

        # Compute style_cost for the current layer
        J_style_layer = compute_layer_style_cost(a_S, a_G)

        # Add coeff * J_style_layer of this layer to overall style cost
        J_style += coeff * J_style_layer

    return J_style
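`STYLE_LAYERS` is never defined in this post, so here is one plausible definition. The layer names and the equal 0.2 weights are assumptions for illustration (the names only follow the `conv4_2` naming used for the content layer below); pick whatever layers your model dictionary actually exposes. The helper `weighted_sum` is also made up, and just shows how the coefficients combine per-layer costs.

```python
# Hypothetical choice: one early conv layer from each block, equally weighted.
STYLE_LAYERS = [
    ("conv1_1", 0.2),
    ("conv2_1", 0.2),
    ("conv3_1", 0.2),
    ("conv4_1", 0.2),
    ("conv5_1", 0.2),
]

def weighted_sum(per_layer_costs, style_layers):
    """Combine per-layer costs as compute_style_cost does: sum of coeff * cost."""
    return sum(coeff * per_layer_costs[name] for name, coeff in style_layers)

# Made-up per-layer costs, only to show the arithmetic.
example_costs = {name: 1.0 for name, _ in STYLE_LAYERS}
total = weighted_sum(example_costs, STYLE_LAYERS)
print(total)
```

Weighting the earlier layers more heavily should emphasize fine texture, while weighting the deeper layers emphasizes larger-scale structure.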

Now we will define the total cost to optimize.

def total_cost(J_content, J_style, alpha = 10, beta = 40):
    """
    Computes the total cost function

    Arguments:
    J_content -- content cost coded above
    J_style -- style cost coded above
    alpha -- hyperparameter weighting the importance of the content cost
    beta -- hyperparameter weighting the importance of the style cost

    Returns:
    J -- total cost as defined by the formula above.
    """

    J = alpha * J_content + beta * J_style

    return J

Solving the optimization problem

Here’s what the final script will have to do:

  1. Create an Interactive Session
  2. Load the content image
  3. Load the style image
  4. Randomly initialize the image to be generated
  5. Load the pretrained VGG model (the weights file used below is the 19-layer VGG-19)
  6. Build the TensorFlow graph:
    • Run the content image through the VGG model and compute the content cost
    • Run the style image through the VGG model and compute the style cost
    • Compute the total cost
    • Define the optimizer and the learning rate
  7. Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step.

# Reset the graph
tf.reset_default_graph()

# Start interactive session
sess = tf.InteractiveSession()

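# reshape_and_normalize_image, generate_noise_image, load_vgg_model and save_image are
# helper functions that this post does not define.
content_image = scipy.misc.imread("images/louvre_small.jpg")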
content_image = reshape_and_normalize_image(content_image)

style_image = scipy.misc.imread("images/monet.jpg")
style_image = reshape_and_normalize_image(style_image)

generated_image = generate_noise_image(content_image)

model = load_vgg_model("pretrained-model/imagenet-vgg-verydeep-19.mat")

# Assign the content image to be the input of the VGG model.
sess.run(model['input'].assign(content_image))

# Select the output tensor of layer conv4_2
out = model['conv4_2']

a_C = sess.run(out)

a_G = out

# Compute the content cost
J_content = compute_content_cost(a_C, a_G)

# Assign the input of the model to be the "style" image
sess.run(model['input'].assign(style_image))

# Compute the style cost
J_style = compute_style_cost(model, STYLE_LAYERS)

J = total_cost(J_content, J_style,  alpha = 10, beta = 40)

optimizer = tf.train.AdamOptimizer(2.0)

train_step = optimizer.minimize(J)

def model_nn(sess, input_image, num_iterations = 200):

    # Initialize global variables (you need to run the session on the initializer)
    sess.run(tf.global_variables_initializer())

    # Run the noisy input image (initial generated image) through the model. Use assign().
    sess.run(model['input'].assign(input_image))

    for i in range(num_iterations):

        # Run the session on the train_step to minimize the total cost
        _ = sess.run(train_step)

        # Compute the generated image by running the session on the current model['input']
        generated_image = sess.run(model['input'])

        # Print every 20 iterations.
        if i%20 == 0:
            Jt, Jc, Js = sess.run([J, J_content, J_style])
            print("Iteration " + str(i) + " :")
            print("total cost = " + str(Jt))
            print("content cost = " + str(Jc))
            print("style cost = " + str(Js))

            # save current generated image in the "output/" directory
            save_image("output/" + str(i) + ".png", generated_image)

    # save last generated image
    save_image('output/generated_image.jpg', generated_image)

    return generated_image

model_nn(sess, generated_image)
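The script calls `generate_noise_image` without showing it. As a minimal sketch of what such a helper might do, the version below blends uniform noise with the content image so the optimization starts from something already loosely resembling the content. The function name, the noise range of ±20, and the 0.6 noise ratio are all assumptions of mine, not something this post specifies.

```python
import numpy as np

def generate_noise_image_sketch(content_image, noise_ratio=0.6, seed=None):
    """Hypothetical stand-in: mix uniform noise with the content image.

    content_image -- array of shape (1, n_H, n_W, n_C)
    noise_ratio   -- 1.0 gives pure noise, 0.0 returns the content image unchanged
    """
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-20.0, 20.0, size=content_image.shape).astype("float32")
    return noise * noise_ratio + content_image * (1.0 - noise_ratio)

content = np.zeros((1, 4, 4, 3), dtype="float32")
noisy = generate_noise_image_sketch(content, seed=0)
print(noisy.shape)  # (1, 4, 4, 3)
```

Setting `noise_ratio=1.0` gives the pure-noise start described at the top of this post.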

And that’s it! We’re done. We used an optimization method to generate a stylized version of the content image. You can now render any image in the style of any painting, albeit slowly, since the optimization process is iterative. Here’s an example that I made with our code: Illini Quad in spring stylized as winter.

Content image

Style Image


Generated Image

Hope you enjoyed this article and found this to be as fascinating and fun as I do. Happy learning!


1. Here is the original paper on neural style transfer, which proposed the optimization process. 

2. Coursera’s Course on Convolutional Neural Networks
