What Anime CLIP Sees
CLIP is a neural network that scores the similarity between a text phrase and an image. If the image is generated by another neural network (or any differentiable program), we can differentiate that score with respect to the generator's parameters. For a given set of parameters we then get not only a score but also a gradient: a direction showing how to change the parameters so that the generated image better matches the target phrase.
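The loop described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not CLIP itself: the names `generate`, `score`, `gradient`, and `optimize` are hypothetical, the "image" is just three numbers, cosine similarity stands in for CLIP's text-image score, and finite differences stand in for the automatic differentiation a real setup would use.

```python
import math

def generate(params):
    """Hypothetical differentiable 'generator': 2 parameters -> 3-value 'image'."""
    a, b = params
    return [math.tanh(a), math.tanh(b), math.tanh(a + b)]

def score(image, text_embedding):
    """Cosine similarity, a stand-in for the text-image score (assumption)."""
    dot = sum(x * t for x, t in zip(image, text_embedding))
    norm_i = math.sqrt(sum(x * x for x in image))
    norm_t = math.sqrt(sum(t * t for t in text_embedding))
    return dot / (norm_i * norm_t + 1e-12)

def gradient(params, text_embedding, eps=1e-5):
    """Central finite differences of the score with respect to each parameter.

    A real pipeline would obtain this by backpropagation instead.
    """
    grad = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        diff = score(generate(up), text_embedding) - score(generate(down), text_embedding)
        grad.append(diff / (2 * eps))
    return grad

def optimize(params, text_embedding, steps=200, lr=0.1):
    """Gradient ascent: repeatedly step the parameters in the direction that raises the score."""
    params = list(params)
    for _ in range(steps):
        g = gradient(params, text_embedding)
        params = [p + lr * gi for p, gi in zip(params, g)]
    return params

# Made-up 'text embedding' and starting parameters, for illustration only.
target = [1.0, -1.0, 0.0]
start = [0.1, 0.2]

before = score(generate(start), target)
final = optimize(start, target)
after = score(generate(final), target)
print("score before:", before)
print("score after: ", after)
```

The generator itself is never told what the phrase means. It only follows the gradient of the score, which is why the choice of scoring network shapes what ends up in the image.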
The results here show that CLIP can express semantic concepts in this domain. "This is isolation" yields a man alone, his face covered. With "This is the world going insane." the faces laugh madly. "This is a hero." yields dynamic figures and flashy yellows and reds.