NeuralNet Storyteller: Letting AI write stories based on Photographs

I have been into photography for quite some time now, so it was inevitable that my little brain would want to try something out with the photographs. Here is another fun little project I did to explore the combination of art and machine learning. I had developed a habit of publishing a photograph only if I could pair it with a story that strongly resonated with it. That introspection led me to investigate whether I could let AI write stories by looking at photographs. And that is when I stumbled upon the NeuralNet Storyteller developed by Ryan Kiros et al. at the University of Toronto.

What does it do?

Given a photograph, it writes a story in the artistic style of a romance novel without human intervention. Think of it as writing a passage for a romance novel given a visually descriptive image. Before heading on to the technical summary, let's look at the results of NeuralNet Storyteller on some of the photographs:

OUTPUT: My descent to the moon , blinking in the evening sky . It was as if I had spent the last thousand years trying to find a way out of it , but I did n’t know what else to do . Instead , I felt a sinking feeling in the pit of my stomach . The reality of the situation was so much more than that Elizabeth had caught up with him , and she simply held his arm out for her . The sun ‘s rays rose like stars on the horizon , enveloping us . I wanted to stay alive .

 

OUTPUT: Images wing of the plane was spectacular , and I wondered what it was like to fly out of the sky . For the first time in millennia , I could n't find the truth in the news . In fact , it had been more than ever since my plane landed on Earth . It was as if he breathed in and out of the air , looking up at the sky above us . The plane had begun to take its toll , but that 's the only way possible . I wanted to rescue her and hold my breath .

 

OUTPUT: I bird barely touched the water as it was in front of me . By the time I reached the source of the fog , I could hear the rise and fall of his chest , leaving her gasping for air . I wanted to talk to her , but I had no intention of letting her go . In fact , it was the most beautiful thing I 'd ever seen . The sun rose above the horizon as a bird swam in and out , leaving me gasping for air . It seems like the right person to ask who he is , that I had fallen asleep in the water .

And How?

The NeuralNet Storyteller takes an image, recognizes the objects in it, produces a caption based on them, and then transforms the caption into a short romantic passage using what is called style shifting.

The only part trained in a supervised manner is the caption generator, which uses Microsoft COCO data. An RNN is trained on romance novels: an encoder converts passages from the novels to skip-thought vector representations, and a decoder conditioned on those vectors is trained to regenerate the passages that produced them. To obtain the artistic style of a romance novel, romance novels from the BookCorpus dataset were used. In order to embed new images and retrieve captions, a visual-semantic embedding is trained between COCO images and their captions, mapping both into a common vector space.
The skip-thought vectors come from an unsupervised approach for training a generic, distributed sentence encoder. Sentences that share semantic and syntactic properties are mapped to similar vector representations. These vectors make it possible to construct, in a simple way, a style-shifting function that bridges the gap between retrieved image captions and passages in novels.
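As a toy illustration of the retrieval step, once images and captions live in a common vector space, finding a caption for a new image reduces to a nearest-neighbour lookup by cosine similarity. The sketch below uses random vectors in place of the trained embeddings, and `retrieve_captions` is a made-up helper, not code from the paper:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_captions(image_vec, caption_vecs, k=3):
    # Rank all caption embeddings by cosine similarity to the
    # embedded image and return the indices of the k closest.
    scores = [cosine(image_vec, c) for c in caption_vecs]
    return sorted(range(len(caption_vecs)), key=lambda i: -scores[i])[:k]

# Toy example: five random "caption embeddings" plus one near-duplicate
# of the image embedding, which should rank first.
rng = np.random.default_rng(0)
image_vec = rng.standard_normal(300)
caption_vecs = [rng.standard_normal(300) for _ in range(5)]
caption_vecs.append(image_vec + 0.01 * rng.standard_normal(300))

top = retrieve_captions(image_vec, caption_vecs, k=3)
```

In the real system the top few retrieved captions are averaged before style shifting, but the ranking mechanism is the same idea.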

  1. An RNN is trained on romance novels to encode passages to skip-thought vectors and decode them back.
  2. Simultaneously, a visual-semantic embedding is trained to retrieve captions for given photographs.
  3. Input: a photograph is given as input.
  4. Obtain a caption for the photograph: the visual-semantic embedding predicts the most suitable caption for the image.
  5. Style shifting: the caption is translated into a romantic-story passage by keeping the 'thought' of the caption and replacing its descriptive style with a romantic story style.
  6. Output: TA DA! A romantic-story-style caption for the photograph.
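The style-shifting step above is, roughly, a piece of vector arithmetic on skip-thought vectors: F(x) = x - c + b, where c is the mean skip-thought vector of the image captions and b is the mean vector of romance-novel passages. A minimal numpy sketch with toy vectors standing in for real skip-thought embeddings:

```python
import numpy as np

def style_shift(caption_vec, caption_mean, passage_mean):
    # Keep the "thought" of the caption but swap its style:
    # subtract the mean caption-style vector and add the mean
    # romance-passage-style vector.
    return caption_vec - caption_mean + passage_mean

# Toy 4-dimensional vectors; real skip-thought vectors are much larger.
rng = np.random.default_rng(1)
caption_vec = rng.standard_normal(4)
caption_mean = rng.standard_normal(4)
passage_mean = rng.standard_normal(4)

shifted = style_shift(caption_vec, caption_mean, passage_mean)
```

The shifted vector is then fed to the decoder trained on romance novels, which generates the final passage.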

I hope you are as impressed with the results as I am. If you have ideas for fun AI projects or if you are looking to collaborate, give me a shout-out here or simply comment below. Also, check out my photography work on Facebook and Instagram.

Reference:

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. “Skip-Thought Vectors.” arXiv preprint arXiv:1506.06726 (2015).
Code: https://github.com/ryankiros/neural-storyteller

 

Machine Desi Hip Hop: A Fun Experiment with RNN

There have been a lot of exciting developments in the use of Recurrent Neural Networks lately. After Andrej Karpathy's post on the unreasonable effectiveness of RNNs, a lot of cool experiments followed, among them TEDRnn, Find Your Dream Job, RNN Bible, and DrumpfRNN. All of them use LSTM, a special kind of RNN that can connect previous information to the present task even when the gap between the relevant information and the point of prediction is large. You can find out more about LSTMs in these amazing blogs written by Christopher Olah and Nikhil Buduma.

I was aware that LSTMs are pretty popular for generating text that sounds grammatically plausible, albeit less meaningful. This made me wonder: would they be any good for transliterated text? I had to scratch the itch. Having been heavily influenced by Bollywood, a monkey in my head wanted to test it on Desi hip-hop party songs, probably because it thinks most of the lyrics are meaningless anyway. Would non-native speakers figure out right away that it is gibberish? Maybe not.

Machine Desi Hip Hop Poster

Transliteration is the conversion of text from one script to another. For instance, the English transliteration of the Hindi word 'नमस्ते' is 'NAMASTE', while its translation is 'HELLO'.

I managed to scrape just over 100 song lyrics transliterated into English by various Bollywood hip-hop artists. The data consists of only about 157,000 characters (certainly not a huge dataset). In case you are not very familiar with this genre, most of the songs, although written in Hindi, are swayed by Punjabi, with English terms creeping in every now and then.

The model generates lyrics by predicting one character after another, and this is where the LSTM's long-term context memory comes into play. The LSTM model was built using Keras with a Theano backend. I rented an Amazon AWS GPU instance (a g2.2xlarge with a GRID K520) and ran about 120 iterations over the data, which took approximately 6 hours. I also tried running the algorithm on my local 4-core machine and compared the time per epoch: it was roughly 10 times slower than the GPU.

MDHH_poster2

I set the seed text (required to start text generation) to something like "Ish your boy Ierr", hoping it would pick up some rapper's pattern and generate something interesting. The model started learning, and at each iteration it generated lyrics character by character. I ignored the first few iterations since the output was undecipherable. I checked the outputs after 30 iterations, then 60, then 100, only to find garbage text that resembled no language at all. I began to wonder whether LSTMs were really not that good for transliteration, or whether I was doing something funny.
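The per-character generation step usually draws from the model's softmax output with a temperature knob that trades off safety against diversity. A numpy sketch of such a sampler (the exact helper used in this experiment is an assumption; this is the standard char-RNN recipe):

```python
import numpy as np

def sample_char(preds, temperature=1.0, rng=None):
    # Re-weight a softmax distribution by temperature and draw one
    # character index: low temperature -> near-greedy, high
    # temperature -> more diverse (and more gibberish).
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(np.asarray(preds, dtype=np.float64) + 1e-12) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return int(rng.choice(len(probs), p=probs))

# At a very low temperature the most likely character wins
# almost every time.
dist = [0.1, 0.7, 0.2]
idx = sample_char(dist, temperature=0.05)
```

In generation, the sampled character is appended to the seed, the window slides forward by one, and the process repeats until the desired length is reached.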

Sure enough, it was the latter. I realized that the seed given to the model contained only English words, while the already sparse training data consisted mainly of Hindi/Punjabi words and very few English terms. I decided to give it one more try with a randomly chosen seed (of 20 characters) from the training data. As expected, the first few iterations produced results like:

Iteration1:

Seed: “r rani, teri jawani”
r rani, teri jawani
a
e r  tuit
uioner b6 hltuh
a¸aned ieyeuaoutuvd t. dua.s m6 pa irvendnaij
ur
n epneaalwa. btt raamqe m kdhinaa

d  ynea  ahhd a teaa
iapa ch
ao nhaoeyant  o i  imo
ih  h lihl mioy
¢   i
t  ?t
urkho?,s  ny a a .h s u bra_y
n ga c l ecmo
aa np  ghdoeuaana oimarhaq, aat
ma
egooanoyhnaaa rutan.hto  srd
ie e a
unojo  efmadkevenh a
wpoadao mwruuh
animy soht godedeiaasu2ouaaaakzm  tubehmee oagk d

I checked the results after the 10th iteration; still nothing distinctly resembled Hindi. It was at the 19th iteration that I could see a ray of hope. The text sounded very much like Hindi/Punjabi sentences and, as expected, made absolutely no sense.

Iteration 19:

Seed: “ata been mujhko toh”

ata been
mujhko toh na saanu jaane lakh te hai
te hain bab saate hai
aas konge gara pe gava paana hai

aaja je baban meri hoos
mera baal laghi hai
pehi paar te peeth pata loogl na chaladi
ka sunda mere saan da main te saani naar gaana main taa karang

mere maan rati, mera naal te ni labni
mere laal na waalh kao
tere yaar main taan pakh li lai laaye
ni chaalo te lo khar de sapte ho saaye
na lo tu main to

 

As the iterations progressed, the output sounded like what was intended, but as noticed earlier, iterations seeded with text containing many English terms generated words that were difficult to read. The outputs of all the iterations can be found here. Having seen the results, and considering that this was run on a scant dataset, it is reasonable to say that LSTMs could very well be used for transliteration problems.

MDHH_poster3

The generated lyrics were taken one step further. A few rhyming couplets (in aabb and abab form) were randomly chosen and manually ordered, and I then had machine-generated Desi hip-hop song lyrics ready. To do justice to the lyrics, I collaborated with some talented people and came up with a Desi hip-hop music video. Don't understand the lyrics? Don't panic, nobody does...!

Machine Desi Hip Hop Lyrics:

main hoon meri naat hai
mere saa mujhe di lagaar
hai manna se magaa jaaya
jam san meri gala maar

disco vich ghaa pe gaya
disco vich ghaa pe gaya
disco vich ghaa pe gaya
disco vich ghaa pe gaya

bad te mere gee main sebaati
hai apna dil tohaaroor beti
par jaane weh laalu kich jar
oh seakh i’ta saapa raho kar
mera baa mera naal mera jaan nahi naar
nachre challe nikh tere nari tere jaar
sboni main tune mainu na baan hai
meri gal meri nahi rani da jaan hai

naal ni bhaam bara dhoon
hai aan hooj shaub de khoon
hi sanna tunne saar laar hai
mundeya de saad bab yehar hai

choda co paari hai, palle khila
apre kara hoon kishi jide gila
khwaban vich khona ni, kinni raata soya ni
tere ghar ke hoya ni

kubi keri annihon phori na choon
niscon te boon la di kon ku joon
rako kar de vod mera naah, nahi,
dal ee lyee kichune saan hai nahi

khir tere choori nachre phoom lega saye
chaki samriyaale kehon lond, nakhaye
mujhke kudiya makhle ni kare ishare
nakhre dikhaave ji main ke main laare

disco vich ghaa pe gaya
disco vich ghaa pe gaya
disco vich ghaa pe gaya
disco vich ghaa pe gaya

The model overfit on the line 'disco vich ghaa pe gaya' from the training dataset, a line from the existing song 'Take your sandals off', which is why it kept popping up repeatedly in the generated output.