Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér. Ever thought about the fact that we have a
stupendously large number of unlabeled videos on the internet? And, of course, with the ascendancy of machine learning algorithms that can learn by themselves, it would be a huge missed opportunity not to
make any use of all this free data. This is a crazy piece of work where the idea
is to unleash a neural network on a large number of publicly uploaded videos on the
internet, and see how well it does when we ask it to generate new videos from scratch. Here, unlabeled means there is no information as to what we see in these videos; they are just provided as-is. Machine learning methods that work on this kind of data are called unsupervised learning techniques. This work is based on a generative adversarial
network. Wait, what does this mean exactly? This means that we have two neural networks
that race each other: one tries to generate more and more real-looking animations and passes them over to the other, which learns to tell real footage from fake. The first we call the generator network, and
the second is the discriminator network. They try to outperform each other, and this
rivalry goes on for quite a while and improves the quality of output for both neural networks,
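As a quick aside, this adversarial game is easy to sketch in code. The toy below is purely illustrative and is not the paper's video model: the "real" data are just numbers drawn around 4, the generator is a one-parameter affine map of noise, the discriminator is a single logistic unit, and the learning rate, clipping, and step count are all made-up choices.

```python
import numpy as np

# Illustrative toy GAN (not the paper's model): real data are scalars
# near 4; G(z) = w_g*z + b_g maps noise to samples; D(x) = sigmoid(w_d*x + b_d)
# tries to tell real samples from generated ones.
rng = np.random.default_rng(0)
w_g, b_g = 1.0, 0.0
w_d, b_d = 0.1, 0.0
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
clip = lambda g: float(np.clip(g, -1.0, 1.0))  # keep the toy updates stable
lr = 0.02

for _ in range(3000):
    real = rng.normal(4.0, 0.5, size=32)
    z = rng.normal(size=32)
    fake = w_g * z + b_g

    # Discriminator step: ascend E[log D(real)] + E[log(1 - D(fake))]
    d_real = sigmoid(w_d * real + b_d)
    d_fake = sigmoid(w_d * fake + b_d)
    w_d += lr * clip(np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    b_d += lr * clip(np.mean(1 - d_real) - np.mean(d_fake))

    # Generator step: ascend E[log D(fake)] to fool the discriminator
    d_fake = sigmoid(w_d * fake + b_d)
    w_g += lr * clip(np.mean((1 - d_fake) * w_d * z))
    b_g += lr * clip(np.mean((1 - d_fake) * w_d))

# The generator's offset b_g drifts toward the real mean of 4 as the
# two networks push against each other.
```

(The generator here ascends log D(fake) instead of descending log(1 - D(fake)), a standard trick to avoid vanishing gradients early in training.)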
hence the name, generative adversarial networks. We first covered this concept when it
was used to generate images from written text descriptions. The shortcoming of this approach was the slow
training time, which led to extremely tiny, low-resolution output images. This was remedied by a follow-up work that
proposed a two-stage version of this architecture. We have covered this in an earlier Two Minute
Papers episode; as always, the link is available in the video description. It is no exaggeration to say that I nearly fell off my chair when I saw these incredible results. So, where do we go from here? What should the next step be? Well, of course, video! However, the implementation of such a technique
is far from trivial. In this piece of work, the generator network learns not from the original representation of the videos, but from separate foreground and background video streams, and it also has to learn which combination of these yields
realistic footage. This two-stream architecture is particularly
useful in modeling real-world videos where the background is mostly stationary and the movement happens in the foreground. A train passing a station or people playing
golf on a field are excellent examples of this kind of separation. We definitely need a high-quality discriminator
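To make the two-stream idea concrete, here is a tiny sketch of how such an output can be combined. The random tensors are placeholders standing in for what the generator's networks would produce: a moving foreground stream, a single static background image, and a per-pixel mask deciding which of the two shows through at each location.

```python
import numpy as np

# Illustrative two-stream combination: a moving foreground stream f, a
# static background image b, and a soft mask m in [0, 1]; the output
# video is the per-pixel blend  m * f + (1 - m) * b.
T, H, W, C = 32, 64, 64, 3              # frames, height, width, channels
rng = np.random.default_rng(1)

foreground = rng.random((T, H, W, C))   # one image per frame (placeholder)
background = rng.random((H, W, C))      # a single image shared by all frames
mask = rng.random((T, H, W, 1))         # "where is the foreground?" weights

video = mask * foreground + (1.0 - mask) * background  # broadcasts over T, C
print(video.shape)                      # (32, 64, 64, 3)
```

Because the background has no time axis, broadcasting reuses it for every frame for free, which is exactly why this decomposition suits scenes whose backdrop barely moves.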
network as well: in the final synthesized footage, not only must the foreground and background go well together, but the synthesized animations also have to be believable for human beings. This human being, in our case, is represented by the discriminator network. Needless to say, this problem is extremely
difficult, and the quality of the discriminator network makes or breaks this magic trick. And of course, the all-important question immediately arises: if there are multiple algorithms performing this task, how do
we decide which one is the best? Generally, we get a few people, show them footage synthesized by this algorithm and by previous works, and have them decide which they deem more realistic. This is still the first step; I expect these
techniques to improve so rapidly that we’ll soon find ourselves testing against real-world
footage. And who knows, sometimes perhaps failing to
recognize which is which. The results in the paper show that this new
technique beats the previous techniques by a significant margin, and that users have
a strong preference towards the two-stream architecture. The previous technique they compare against
is an autoencoder, which we discussed in a previous Two Minute Papers episode; check it out, it is available in the video description! The disadvantages of this approach are quite
easy to identify this time around: we have a very limited resolution for these output
video streams, that is, 64×64 pixels for 32 frames, which, even at a modest framerate, is
just slightly over one second of footage. The synthesized results vary greatly in quality,
but it is remarkable to see that the machine can form a rough understanding of a large variety of movement and animation types. It is really incredible to see that the neural
network learns about the representations of these objects and how they move, even when
it wasn’t explicitly instructed to do so. We can also visualize what the neural network
has learned. This is done by finding different image inputs
that make a particular neuron extremely excited. Here, we see a collection of such inputs, including
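One common way to do this, often called activation maximization, is to start from noise and repeatedly nudge the input image in the direction that increases a chosen unit's activation. The sketch below is an illustrative simplification in which a single linear unit stands in for a neuron of the network:

```python
import numpy as np

# Illustrative activation maximization: ascend the INPUT so that one
# unit's activation grows, revealing the pattern the unit responds to.
# A single linear unit with weights w stands in for a network neuron.
rng = np.random.default_rng(2)
w = rng.normal(size=(8, 8))                 # the unit's weights (made up)

x = rng.normal(scale=0.1, size=(8, 8))      # start from a faint random image
start_activation = float(np.sum(w * x))
for _ in range(100):
    # activation = sum(w * x), so its gradient with respect to x is w
    x += 0.1 * w                            # gradient-ascent step on the input
    x = np.clip(x, -1.0, 1.0)               # keep "pixels" in a valid range

# x now aligns with the unit's preferred pattern, so the activation is
# far larger than it was for the random starting image.
```

For a real network one would backpropagate to obtain the gradient with respect to the input; the linear stand-in just makes that gradient equal to w.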
these activations for images of people and trains. The authors’ website is definitely worth checking out, as some of the submenus are packed with results. Some amazing, some, well, a bit horrifying,
but what is sure is that all of them are quite interesting. And before we go, a huge shoutout to László
Csöndes, who helped us quite a bit in sorting out a number of technical issues with the
series. Thanks for watching and for your generous
support, and I’ll see you next time!