Deep Learning in Multimedia

The advances in Artificial Intelligence (AI), and more specifically in Deep Learning, have left hardly any technical domain unaffected. The idea of Artificial Intelligence is not new, so what happened to make it suddenly gain so much attention? What is the impact on the multimedia landscape, and what are the emerging challenges for broadcasters?

Shortly before the beginning of the past decade, Fei-Fei Li, a professor at Stanford, launched the publicly available ImageNet database of more than 14 million labeled images in more than 20,000 categories. While most AI research at that time focused on improving existing models or creating new ones, Professor Li focused instead on data and managed to expand the datasets available for training AI algorithms. This was a seminal contribution, mainly because appropriately labeling huge amounts of internet images is an essential requirement for training neural networks successfully. At the same time, the significant increase in contemporary hardware capabilities (such as Graphics Processing Units and Tensor Processing Units) enabled the training of very complex and resource-demanding models, namely the Deep Learning models that outperformed traditional computer vision approaches.

Deep Learning is a subfield of machine learning, which in turn belongs to the broader field of Artificial Intelligence and comprises technologies that enable computer systems to improve with experience and data. Deep Learning methods are based on neural network architectures that contain multiple hidden layers. A large number of neurons is essential, and the success of neural networks nowadays is also due to the dramatic increase in the size of the networks we can train today. The above-mentioned introduction of ImageNet was a milestone in the Deep Learning revolution: in the following years, it led to dramatically improved network architectures and to applications in many new fields. This, on the one hand, has given rise to new challenges and, on the other, promises significant improvements to journalistic workflows.

In this post, we give an overview of three emerging topics related to Deep Learning that are shaping the media landscape: Deepfakes, Natural Language Processing and Open Source Intelligence. Furthermore, we identify emerging challenges for broadcasters and discuss our current focus at IRT.

Deepfakes

The first thing that probably comes to mind when thinking of Deepfakes is videos, such as those that first appeared a couple of years ago depicting celebrities, for example Barack Obama mouthing words that another person had recorded. Deepfakes are synthetic media, manipulated by Deep Learning algorithms, in which the person saying or doing something is realistically replaced by another person. Neural network architectures such as Generative Adversarial Networks (GANs) are commonly used to create Deepfakes. To make the Deepfake sound or look like a target person, one needs to train such a GAN on the target person’s speaking voice, videos or even just photos. Public personalities, who appear in numerous videos online, are an easy “target” to begin with.
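To make the adversarial idea concrete: the generator and the discriminator are trained against each other with opposing objectives. The sketch below illustrates only the two standard GAN loss functions, not a full training loop or an actual Deepfake pipeline; the scores stand in for discriminator outputs on real and generated samples.

```python
import numpy as np

def discriminator_loss(real_scores, fake_scores):
    # Binary cross-entropy: the discriminator wants real samples scored
    # near 1 and generated ("fake") samples scored near 0.
    return -np.mean(np.log(real_scores) + np.log(1.0 - fake_scores))

def generator_loss(fake_scores):
    # Non-saturating generator loss: the generator wants its samples
    # to be scored near 1, i.e. to fool the discriminator.
    return -np.mean(np.log(fake_scores))

# A discriminator that separates real from fake well has a low loss ...
d_good = discriminator_loss(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
# ... while the generator's loss drops as its fakes become convincing.
g_early = generator_loss(np.array([0.1, 0.1]))
g_late = generator_loss(np.array([0.9, 0.9]))
```

Training alternates gradient steps on these two losses until the generated samples become hard to distinguish from real ones.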

Figure 1: GANs can synthetically generate faces using Source A and Source B. The faces in the middle do not exist.

The technology for producing Deepfakes is evolving rapidly, and the contest between Deepfake creators and the experts who try to catch them resembles a cat-and-mouse game. As Deepfake creators can access publicly available content and create any kind of new content, they tend to be one step ahead of the experts who try to detect them. As soon as a tell-tale feature for Deepfake detection is identified, for example that the region around the mouth is blurred, Deepfake creators can quickly come up with a new algorithm that remedies the problem. To accelerate advances in Deepfake detection technology, Facebook has collaborated with several academic institutions to construct and publicly release the Deepfake Detection Challenge dataset.
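As a toy illustration of the kind of low-level cue mentioned above: blurring suppresses high-frequency detail, so an unusually smooth region can be flagged by measuring its sharpness. The sketch below is a simplification, not a production detector; it scores an image patch by the variance of a discrete Laplacian, and the patch data here is synthetic.

```python
import numpy as np

def laplacian_variance(patch):
    # 4-neighbour discrete Laplacian; low variance means the patch lacks
    # high-frequency detail, i.e. it may have been blurred.
    lap = (-4.0 * patch[1:-1, 1:-1]
           + patch[:-2, 1:-1] + patch[2:, 1:-1]
           + patch[1:-1, :-2] + patch[1:-1, 2:])
    return float(lap.var())

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))                 # noisy, detail-rich patch
# Crude box blur: average each pixel with its four neighbours.
blurred = (sharp[1:-1, 1:-1] + sharp[:-2, 1:-1] + sharp[2:, 1:-1]
           + sharp[1:-1, :-2] + sharp[1:-1, 2:]) / 5.0
```

Modern detectors are learned end to end rather than hand-crafted; fixed cues like this one are exactly what Deepfake creators patch in the next iteration.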

Of course, Deepfakes constitute only one facet of the general spread of misinformation. Others are, for instance, fake news, i.e. misleading human-written texts, and fake text, i.e. text generated (synthesized) purely by AI. A year ago, OpenAI published a powerful model named GPT-2 (Generative Pre-trained Transformer 2) which can generate articles, namely fake text, convincing enough to pass as human-written. This release gave rise to concerns about potential misuse of the model and led OpenAI to release, at first, only a cut-down version. These technological advances, alongside rising public concern about misinformation, seem likely to make an impact on the media industry in the coming years.
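Models like GPT-2 produce text one token at a time by sampling from a predicted probability distribution over the vocabulary. The following sketch shows only this final sampling step, with made-up logits rather than a real GPT-2 vocabulary; the temperature parameter trades off diversity against predictability.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0):
    # Softmax over temperature-scaled logits; a low temperature makes the
    # most likely token dominate, a high one flattens the choice.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                          # for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

# With one dominant logit and a low temperature, the same token is
# chosen almost every time.
token = sample_next_token([8.0, 1.0, 0.5], temperature=0.3)
```

In a full language model this step is repeated, feeding each sampled token back in as context for the next prediction.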

However, it has to be noted that the technology behind Deepfakes can also be used for legitimate (commercial) purposes, such as dubbing foreign-language films. For example, Deep Video Portraits, a video editing technique that uses machine learning to transfer the head pose, facial expression and eye movement of the dubbing actor onto the target actor so as to accurately sync the lips and facial movements to the dubbing audio, could save time and reduce costs for the film industry.

Figure 2: Deepfakes generated from a single image can be used for educational or other legitimate purposes.


Natural Language Processing 

Media is all about communication, and language is a vital part of communication. Natural Language Processing (NLP) is the technological field that deals with machine processing (text reading, text generation etc.) and understanding of human (natural) language. Initially, NLP was handled by rule-based systems which used handwritten rules for grammar, sentence structure and so on. Even though NLP has been an active research field for decades, it has gained remarkable attention lately thanks to the advances in Deep Learning.

In recent years, researchers have shown that techniques with confirmed value in computer vision may also be useful in the field of NLP. For example, Transfer Learning, i.e. pre-training a neural network on a known task and then fine-tuning the trained network as the basis of a new, purpose-specific model, has brought major improvements in NLP performance. The recent introduction of language models such as GPT-2 and BERT triggered even more interest and enabled noteworthy advances on NLP-related tasks. One of the most popular and widely used applications of NLP is language translation (e.g. between English and French). Online translation services improved dramatically by switching from older phrase-based statistical machine translation algorithms to Deep Learning-based ones. In parallel, the introduction of the Tensor Processing Unit (TPU) helped bring the dream of a superior, almost human-quality translation system closer to reality.
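A minimal sketch of the transfer-learning idea, using a random stand-in "pretrained" feature extractor and synthetic data rather than an actual pretrained vision or language model: the extractor's weights stay frozen, and only a small task-specific head is trained during fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained network: these weights are frozen and are
# never updated during fine-tuning.
W_frozen = rng.normal(scale=0.3, size=(10, 32))

def features(x):
    return np.tanh(x @ W_frozen)

# Synthetic downstream task with binary labels.
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def accuracy(w, b):
    p = 1.0 / (1.0 + np.exp(-(features(X) @ w + b)))
    return float(((p > 0.5) == y).mean())

# Fine-tuning: gradient descent on a small logistic-regression head only.
w, b = np.zeros(32), 0.0
acc_before = accuracy(w, b)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(features(X) @ w + b)))
    grad = p - y                          # cross-entropy gradient
    w -= 0.1 * features(X).T @ grad / len(y)
    b -= 0.1 * float(grad.mean())
acc_after = accuracy(w, b)
```

Because only the head is trained, fine-tuning needs far less task-specific data than training a full network from scratch, which is what makes the technique so attractive for NLP.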

In addition to translation, NLP tasks include language identification, text summarization, natural language generation (converting information into readable language), natural language understanding (machine reading comprehension) and speech recognition, among many others. Each NLP task may find applications in multiple fields of the media sector. Automatic subtitle creation (speech-to-text) used for accessibility as well as archiving, automatic production of content that is not based on a specific template, hate-speech detection in user-generated content, and virtual assistants for customer service are just a few examples. Special language cases, such as spoken local dialects, are also important and can be supported by automatic speech recognition tools.
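To give a flavour of the simplest of these tasks, language identification can be approximated by matching a text against small profiles of very common words per language. This is a toy sketch with ad hoc, hand-picked word lists; real systems instead learn character n-gram statistics from large corpora.

```python
# Tiny, hand-picked stop-word profiles -- a real identifier would learn
# character n-gram statistics from large corpora instead.
PROFILES = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "et", "les", "des", "est", "un", "une"},
}

def identify_language(text):
    words = set(text.lower().split())
    # Score each language by how many of its common words appear.
    scores = {lang: len(words & profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)

# identify_language("der Sender und die Redaktion") -> "de"
```

The weakness of such fixed profiles on dialects and short texts is precisely why the dataset problem discussed below matters.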

As expected, the rapid advances in NLP currently concern mainly the English language. The German-speaking journalistic community currently faces the challenge of transposing and implementing these advances for German. One of the major challenges here is creating appropriate datasets for training the learning algorithms, especially in special cases such as local dialects.

Open Source Intelligence

User Generated Content (UGC) is continuously produced and brings with it the challenge of moderating or exploiting it. On one side, there is growing awareness among the public, businesses and policy makers of the potential damage of harmful online material. Online platforms are already taking measures to protect the public from harmful content such as nudity, fake news and hate speech, using algorithms that automatically detect it in newly appearing UGC. On the other side, the wealth of digital data available is an extraordinary opportunity to gain insights and extract valuable information about behavior, trends and correlations that are “hidden” in large data volumes from heterogeneous data sources. This is what data journalism teams pursue: combining journalistic know-how in reporting, storytelling and finding stories with the capabilities of automated approaches, driven by statistics and machine learning, to process large data volumes more effectively. The in-depth examination of different kinds and sources of data, such as the routes of ships and planes, publicly available user data and public authorities’ documents, can lead to more concrete results and observations on specific topics of interest, or even reveal hidden issues that were not initially deemed important.

Furthermore, investigative journalists, who rely on Open Source INTelligence (OSINT), can also benefit from the capabilities that the above-mentioned technologies offer. Open Source Intelligence is the collection and analysis of information gathered from publicly available, open sources. Investigative journalism in the digital era has been pioneered by groups such as Bellingcat, who use, among other methods, open source and social media investigation in their work. According to Harvard’s Nieman Foundation for Journalism: “An OSINT investigation is not one single method to get at truth, but rather a combination of creative and critical thinking to navigate digital sources on the web”. Their prediction that OSINT will increasingly attract attention among journalistic communities is also in line with the BBC’s recent decision to make training journalists “in the art of open source media” a top priority. An increasing number of journalists will therefore need to become familiar with this method.

OSINT communities research and verify available online content with the goal of trustworthy reporting, which is more critical today than ever. Achieving this requires research workflows that are automated or improved in accuracy. Open source intelligence for journalism builds on a wide range of digital data, such as satellite imagery, social media and databases of wind, weather or any other kind, in order to better understand what happened at a specific place and point in time. Sophisticated machine learning algorithms can support the extraction of the desired information from these data sources, improving the collection of intelligence both qualitatively and quantitatively.

Emerging challenges for Broadcasters  

Deep Learning offers new opportunities for automatically recognizing features in data, allowing the analysis of complex inputs such as human speech, images, videos and text. Some technologies, such as recommendation systems, speech-to-text or text-to-speech conversion, are more mature; others, such as misinformation detection, manipulation detection, automated summary generation or automated article generation, are relatively new.

In the case of mature technologies, there is often still a need for tailored solutions: for example, the transcription of spoken dialects, which requires the creation of annotated datasets before the neural network can be trained, or the creation of user interfaces, which are essential for bringing AI capabilities into journalistic workflows, as with the in-browser face recognition extension that you can try in this demo. In this way, automation can free up journalists’ time so that they can focus more on creative work.

The IRT is focusing on identifying and understanding how AI-driven automation can improve the efficiency of algorithms (i.e. quality of results, execution time or both) as well as open up innovative ways of tackling new challenges. Thematically, our current focus lies on all three topics presented above, with the goal of supporting journalistic workflows both in the newsroom and in production. Within the field of NLP, we are currently exploring ways to automate subtitle creation, as well as automatic summarization of news articles. Regarding OSINT, we are focusing on building tools that employ AI/NLP technologies to support journalistic work on misinformation detection and automation in general. And last but not least, we are investigating state-of-the-art Deepfake detection technologies, with the ultimate aim of using them for media literacy and education purposes.
