A system for Deepfake Detection: DFirt

Synthetic media manipulated by deep learning algorithms have become a matter of public concern: they are increasingly realistic and believable and can be used to intentionally mislead the public. The term Deepfake refers to synthetic images and videos manipulated by deep learning algorithms that achieve realistic results by swapping a person’s face, or parts of it, with that of another person. To remedy the malicious usage of Deepfakes, technological countermeasures have to be devised together with policy, legislation and media literacy. From the technology side, the countermeasures that can be used to mitigate the impact of Deepfakes fall into three categories: media authentication, media provenance and Deepfake detection.

In this project, we compiled a comprehensive dataset that consists of Deepfakes that contain several manipulation types in order to train a neural network in a supervised fashion towards Deepfake detection and we built a webservice to demonstrate this Deepfake detection system. 

Types of manipulations

Deepfake manipulations are usually categorized into four main types based on the category of facial manipulation: Entire Face Synthesis, Attribute Manipulation, Identity Swap and Expression Swap. Entire Face Synthesis, as the name suggests, synthesizes an entirely fictional face using powerful Generative Adversarial Networks (GANs). In contrast, Attribute Manipulation modifies an existing face, a modification that can likewise be achieved with GANs. Examples of facial modifications include aging or de-aging, changing hair or skin color, changing gender, adding a beard, etc. Identity Swap, the most common type of manipulation, replaces the face of person A with the face of person B. This manipulation is often carried out using a so-called autoencoder, as in the example of the tool faceswap. Expression Swap modifies facial expressions, usually in the mouth area, by replacing the motion of a certain region of the face of person A with the motion of the corresponding area of person B. Popular approaches for Expression Swap are Face2Face and NeuralTextures. Examples of the different types of Deepfake manipulation can be seen in Figure 1.

Types of deepfake manipulations

Figure 1. Real and fake examples for each of the four manipulation types. Reprinted from “DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection”, by R. Tolosana, R. Vera-Rodriguez, J. Fierrez, A. Morales, & J. Ortega-Garcia (2020).

System Description

After compiling the dataset, the first step was to apply several preprocessing steps and prepare the data for training. For the detection of manipulated images and videos of faces, we used a family of models known as EfficientNets, which perform very well compared to other state-of-the-art Deepfake detection approaches. Moreover, we combined EfficientNet with an attention module, which has been suggested to enhance the performance of neural networks on classification tasks.


Publicly available Deepfake datasets do not include or provide all manipulation types. Hence, for the purposes of this project, we have compiled the DFirt dataset, which covers all the above-mentioned Deepfake manipulation types, from several known datasets in order to train our model. The Attribute Manipulation and Entire Face Synthesis classes derive from the DFFD dataset of Michigan State University. The Expression Swap class is collected from FaceForensics++ of the Technische Universität München, and for the Identity Swap class we used a combination of Google’s DeepFakeDetection dataset, the FaceForensics++ dataset, and Facebook’s DFDC dataset. Moreover, the class Real has been included, which consists of faces that have not been forged. This class includes the unforged faces of the DFFD, FaceForensics++, Facebook’s DFDC and CelebA datasets.

Preprocessing of Training Data

The first preprocessing step of the training data was to extract two frames per second of each input video and to detect and crop the faces they contain. Next, random samples of the obtained set were compressed to 20% of their original quality in order to increase robustness against compression artefacts, so that our final model generalizes better to different video and image qualities. Then, two augmentation steps were applied. The first step aims at a more balanced class distribution: to classes that have fewer samples than others, it randomly applies one of the following operations: 1) horizontal flip, 2) rotation by 30 degrees and 3) rotation by -30 degrees. The second step applies, to all samples, a random number of random augmentation types that include: 1) color jittering, 2) grayscaling, 3) affine transformation, 4) perspective transformation, 5) random erasing and 6) Gaussian blurring. The goal of the second augmentation step is to blend fake and real faces in order to obtain better generalization. Figure 2 illustrates several examples from the DFirt dataset before and after preprocessing.

Figure 2. Examples of the DFirt dataset before and after preprocessing.
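The two augmentation steps above can be sketched as follows. This is a minimal illustration of the sampling logic only; the operation names stand in for the actual image transforms (which in practice would come from a library such as torchvision), and the function names are our own:

```python
import random

# Operation names for the two augmentation steps; the actual image
# transforms are abstracted away in this sketch.
BALANCE_OPS = ["horizontal_flip", "rotate_+30", "rotate_-30"]
AUG_OPS = ["color_jitter", "grayscale", "affine", "perspective",
           "random_erasing", "gaussian_blur"]

def balance_op(class_count, max_class_count, rng=random):
    """Step 1: classes with fewer samples than the largest class get
    one randomly chosen balancing operation; the largest class gets none."""
    if class_count < max_class_count:
        return rng.choice(BALANCE_OPS)
    return None

def sample_augmentations(rng=random):
    """Step 2: every sample receives a random number of distinct,
    randomly chosen augmentation types."""
    k = rng.randint(0, len(AUG_OPS))
    return rng.sample(AUG_OPS, k)
```
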


In our system we employ Google’s EfficientNet, a powerful convolutional neural network (CNN) architecture that has been used in top-performing models in the Deepfake Detection Challenge, as announced last June by Facebook. EfficientNet uses a mobile inverted bottleneck convolution (MBConv) and is optimized by a neural architecture search using the AutoML MNAS framework. Figure 3 shows the architecture of EfficientNet-B0.

EfficientNet-B0 with Attention Module

Figure 3. Architecture of EfficientNet-B0. Dashed lines show the outputs taken from intermediate layers for the attention module. Adapted from “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, by M. Tan and Q. V. Le (2020).
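For intuition, the MBConv building block mentioned above can be sketched in PyTorch. This is a simplified, assumed version (squeeze-and-excitation and stochastic depth, which EfficientNet also uses, are omitted for brevity):

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Simplified sketch of the mobile inverted bottleneck convolution:
    expand (1x1) -> depthwise conv -> project (1x1), with a residual
    connection when input and output shapes match."""

    def __init__(self, channels, expand_ratio=6, kernel_size=3):
        super().__init__()
        mid = channels * expand_ratio
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),        # expand
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, kernel_size,
                      padding=kernel_size // 2,
                      groups=mid, bias=False),              # depthwise
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, channels, 1, bias=False),        # project
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # residual connection (stride 1, equal channel count assumed)
        return x + self.block(x)
```
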

Attention Module

Previous studies have observed that specific face regions have a higher impact on Deepfake detection than others. For this reason, we incorporated an attention module whose main idea is to force EfficientNet-B0 to pay attention to the regions of a face that maximize the prediction. Our attention module takes the outputs of the 8th, 11th, 12th, 15th and 16th intermediate layers (see Figure 3), and each of these outputs is added pixelwise to the output of the last MBConv layer of EfficientNet-B0. Afterwards, the obtained outputs are concatenated and fed through a fully connected layer. The second and third rows in Figure 4 show the attention maps of EfficientNet-B0 without and with the attention module, respectively. We observe that when the attention module is used (third row), the attention (highlighted pixels) is more clustered and falls within the manipulated face region, compared to the case where no attention module is used (second row).

Figure 4. First row: original images. Second row: attention map of EfficientNet-B0 without attention module highlighted in green. Third row: attention map of EfficientNet-B0 with attention module highlighted in green.
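The wiring of the attention module described above can be sketched as follows. This is a hypothetical reconstruction: in the real model the five taps come from layers 8, 11, 12, 15 and 16 and would first be resized/projected to a common shape, which is assumed here; channel count and class count are illustrative:

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """Sketch: intermediate feature maps are added pixelwise to the
    output of the last MBConv layer, the results are concatenated
    (after global pooling) and fed through a fully connected layer."""

    def __init__(self, n_taps=5, channels=320, num_classes=5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(n_taps * channels, num_classes)

    def forward(self, taps, final_feat):
        # taps: list of (B, C, H, W) intermediate feature maps,
        # final_feat: (B, C, H, W) output of the last MBConv layer
        fused = [t + final_feat for t in taps]            # pixelwise add
        pooled = [self.pool(f).flatten(1) for f in fused]
        return self.fc(torch.cat(pooled, dim=1))          # concat + FC
```
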


We implemented EfficientNet-B0, -B4 and -B7, as well as -B0 with the attention module. All models were trained for 12 epochs with an input size of 256×256, the RMSprop optimizer and a learning rate of 1e-4. For EfficientNet-B0 without the attention module, we used a batch size of 128. The batch size was decreased to 64 for -B0 with the attention module, to 32 for EfficientNet-B4, and to 16 for -B7, due to computational constraints.
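A minimal training-loop sketch with the hyperparameters above (12 epochs, 256×256 inputs, RMSprop, learning rate 1e-4). The tiny linear model is only a stand-in for EfficientNet-B0 (in practice one would load it e.g. via the timm library), and the batch size would be set when building the DataLoader:

```python
import torch
from torch import nn, optim

# Stand-in model: replace with EfficientNet-B0 in a real run.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256 * 256, 5))
optimizer = optim.RMSprop(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train(loader, epochs=12):
    """One pass of supervised training with the reported settings."""
    model.train()
    for _ in range(epochs):
        for images, labels in loader:  # images: (B, 3, 256, 256)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```
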

System Performance on the FaceForensics++ Benchmark

Our system’s performance has been evaluated on the FaceForensics++ automated benchmark, a publicly available benchmark for facial manipulations in a realistic scenario, i.e., with random compression and random dimensions. FaceForensics++ provides a test set of 1000 frames of forged and original (pristine) faces, randomly extracted from 1000 videos. The forged faces have been manipulated by four manipulation systems, namely: a) Face2Face, b) NeuralTextures, c) Deepfakes and d) FaceSwap. Manipulations a) and b) are categorized as Expression Swaps, and c) and d) as Identity Swaps. The outcomes of our classification predictions (labels) on this test set are submitted to the FaceForensics++ server and evaluated against the hidden (not publicly available) ground-truth data by calculating the binary classification accuracy.
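The benchmark metric itself is simply the fraction of correctly labeled frames; a minimal sketch (label values here are illustrative, the server-side implementation is not public):

```python
def binary_accuracy(predicted, ground_truth):
    """Fraction of test frames whose predicted fake/real label
    matches the hidden ground truth."""
    correct = sum(p == t for p, t in zip(predicted, ground_truth))
    return correct / len(ground_truth)
```
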

In Figure 5 we observe that our system outperforms XceptionNet, the model used by the creators of the FaceForensics++ dataset, in all cases, i.e., for all of our models: EfficientNet-B0, -B4 and -B7, and -B0 with the attention module. Considering the currently submitted results, the performance of our system (DFirt) using EfficientNet-B0 without attention is comparable to state-of-the-art approaches. Moreover, in our experiments the incorporation of the attention module did not improve the performance of the system as expected. This is an indication that the design and proper incorporation of the attention module need further investigation.

FaceForensics++ Benchmark

Figure 5. Binary classification accuracy of different models on the test set of FaceForensics++ dataset. XceptionNet is the model used by Nießner et al.


Within this project we also created a webservice to demonstrate Deepfake detection. The webservice accepts a file (image or video) or a URL (pointing to a YouTube video or to an image) as input and calculates the probability that it is Deepfake or real. Technical details are included in the technical description page, which explains how the webservice processes the input and classifies a Deepfake. For this webservice, we deployed the EfficientNet-B0 model due to its good performance across all classes.
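A webservice of this kind could be sketched as follows. This is an assumed minimal example using Flask; the endpoint name, field name and the constant prediction are hypothetical, and the real service would decode the media, extract and crop faces, and run the EfficientNet-B0 model:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_deepfake_probability(data: bytes) -> float:
    """Stand-in for the deployed classifier; the constant below is
    purely illustrative."""
    return 0.5

@app.route("/detect", methods=["POST"])
def detect():
    # The deployed service also accepts a URL pointing to a YouTube
    # video or an image; only direct file upload is sketched here.
    uploaded = request.files.get("media")
    if uploaded is None:
        return jsonify(error="no file provided"), 400
    prob = predict_deepfake_probability(uploaded.read())
    return jsonify(deepfake_probability=prob)
```
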


Our system serves as a proof of concept for a tool that can support human decisions on whether visual content is Deepfake or real, and it could be a basis for further research on model implementation and data preprocessing, which is an important step for such tasks. Furthermore, specifically for journalistic use cases, the definition of specific requirements and needs should be investigated, and aspects such as explainability and reliability should be prioritized, in order to provide journalists with valuable and useful tools in their fight against misinformation and fake news.

Deepfake detection is a very challenging task, especially in the wild, with real-world data. This is one of the key learnings of the Facebook Deepfake Detection Challenge (DFDC), based on its published results. Moreover, Deepfake creators constantly improve their algorithms as soon as they are introduced to new detection methods, so any detector will have a short life if the creator has access to it. This resembles a “cat-and-mouse” chase, which will require fast reactions and the development of robust detection methods. For these reasons, other technical solutions such as media authentication (verifying content authenticity, e.g. using watermarking) and media provenance (identifying the content’s origin, e.g. using reverse image search) have to be considered, along with the establishment of legislation and policies. Last but not least, Deepfake media literacy, namely educating the people that create and consume media so that they can question and challenge content and context, is a fundamental step towards mitigating the consequences of misinformation caused by Deepfakes.
