future internet
Article
Reducing Videoconferencing Fatigue through Facial Emotion
Recognition
Jannik Rößler 1 , Jiachen Sun 2 and Peter Gloor 3,*
1 Cologne Institute for Information Systems, University of Cologne, 50923 Cologne, Germany;
roessler@wim.uni-koeln.de
2 School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China;
sunjch6@mail2.sysu.edu.cn
3 MIT Center for Collective Intelligence, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
* Correspondence: pgloor@mit.edu
Abstract: In the last 14 months, COVID-19 made face-to-face meetings impossible and this has led to
rapid growth in videoconferencing. As highly social creatures, humans strive for direct interpersonal
interaction, which means that in most of these video meetings the webcam is switched on and people
are “looking each other in the eyes”. However, it is far from clear what the psychological conse-
quences of this shift to virtual face-to-face communication are and if there are methods to alleviate
“videoconferencing fatigue”. We have studied the influence of emotions of meeting participants
on the perceived outcome of video meetings. Our experimental setting consisted of 35 participants
collaborating in eight teams over Zoom in a one semester course on Collaborative Innovation Net-
works in bi-weekly video meetings, where each team presented its progress. Emotion was tracked
through Zoom face video snapshots using facial emotion recognition that recognized six emotions
(happy, sad, fear, anger, neutral, and surprise). Our dependent variable was a score given after each
presentation by all participants except the presenter. We found that the happier the speaker is, the
happier and less neutral the audience is. More importantly, we found that the presentations that
triggered wide swings in “fear” and “joy” among the participants are correlated with a higher rating.
Our findings provide valuable input for online video presenters on how to conduct better and less
tiring meetings; this will lead to a decrease in “videoconferencing fatigue”.
Keywords: facial emotion recognition; social network analysis; video meetings
Citation: Rößler, J.; Sun, J.; Gloor, P. Reducing Videoconferencing Fatigue through Facial Emotion
Recognition. Future Internet 2021, 13, 126. https://doi.org/10.3390/fi13050126
Academic Editor: Mehrdad Jalali
Received: 17 April 2021
Accepted: 10 May 2021
Published: 12 May 2021
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
Famous speeches such as Martin Luther King’s “I have a dream” in 1963 or Barack
Obama’s election victory speech in 2008 are known for their rhetorical manifestations, the
selection of inspiring words and the use of vivid and metaphorical language. However,
another important driver for a great speech or presentation is the method by which
presenters express their information by using emotions [1]. Presentations which make use
of emotional experiences and trigger an emotional response from the listeners are more
likely to gain the attention of the audience than emotionless presentations [2].
For example, consumers primarily use emotions rather than information when evaluating
brands [3]. Moreover, politicians often express anger and sadness to convey empathy and
concern about the topic at hand [1]. Furthermore, brain researchers have shown that
humans have to be emotional in order to be memorable [4].
Emotions in presentations can be expressed with multiple methods: the choice of
vocabulary, e.g., using emotional words such as “sad”, “cry”, and “loss” when expressing
sadness; the manner in which the presenter communicates, e.g., aggressively vs. sensitively; or
the gestures and facial expressions the presenter makes with his/her body and face, e.g.,
looking happy, sad, angry, disgusted, or surprised [5]. All of the above conveys the
presenter’s feelings. In particular, facial emotion recognition is a broad and well-studied
research stream which has recently received a lot of attention due to the rising development
of deep learning based methods. Thanks to deep neural networks with convolutional
neural networks (CNN), recurrent neural networks (RNN), and long short-term memory
(LSTM) models in particular, researchers can extract emotions from facial expressions
with a high degree of accuracy. For example, a large number of researchers demonstrated
that their deep learning based methods perform very well on various facial emotion
recognition data sets (see [6] for an overview). In addition, Choi and Song [7] showed that
their framework—a combination of CNN and LSTM—is not only robust and effective on
artificial facial expression data sets (actors playing various emotions), but is also robust
and effective on data sets collected in the wild.
Leveraging facial emotion recognition, some researchers analyzed the relationship be-
tween facial expressions and the learning process, especially in e-learning environments, to
monitor and measure student engagement [8], to create personalized educational e-learning
platforms [9], or to provide personalized feedback to improve the learning experience [10].
Our study joins this research stream by analyzing the relationship between speaker and
listener’s facial expressions and the quality of the presentations. Such an evaluation could
help researchers and practitioners better understand the inherent relationship between
emotions and presentations. Moreover, the insights could be used to create better pre-
sentations, which renders the presentation content more memorable and fosters better
knowledge transfer. Hence, our research aims to answer the question of which sequences
of emotions among the audience, expressed through their facial expressions, lead to good
presentations and which do not. In particular, we hypothesize that certain combinations
of emotions conveyed by the presenter or shown by the audience are indicators of higher
perceived presentation quality.
The lack of such an analysis is likely due to the difficulty of analyzing facial expres-
sions not only of the presenter but also of the audience. Furthermore, presentations need
to be quantified objectively. The current Coronavirus pandemic presents a unique op-
portunity for such an analysis, as pupils, students, companies, and other organizations
are shifting towards online platforms such as Zoom, Microsoft Teams, and Skype, which
allow presentations to be held in front of a camera. This makes it easy to determine the
presenter’s and listeners’ emotions by using deep learning. Furthermore, such platforms
often offer built-in functions which can be used to assess the quality of the presentation
through the audience, for example, by using real-time questionnaires.
For this reason, we recorded and evaluated presentations from a virtual seminar called
Collaborative Innovation Networks (COINs) over the course of twelve weeks with a total
of seven two-hour meetings using Zoom. There were eight student teams that consisted of
three to five students. Each team presented its project progress in each of the seven
two-hour sessions spread out over the twelve weeks. In each team presentation, one
or more team members presented their progress in ten minutes. After each presentation
was held, the supervisor and the audience immediately evaluated it via a questionnaire.
The facial recordings of the presentations were used to determine emotions of presenter,
supervisor, and audience. We extracted the emotions using a deep neural network, a CNN
based on the VGG16 architecture [11], which was trained prior to the recordings using
self-labelled data and various facial emotion recognition data sets such as the Extended
Cohn-Kanade (CK+) [12], the Japanese Female Facial Expressions (JAFFE) [13], and the
BU-3DFE [14] data set.
Using the emotions from the supervisor and the audience as well as the presentation
scores, we found that:
• The happier the speaker is, the happier and less neutral the audience is;
• The more neutral the speaker is, the less surprised the audience is;
• Triggering diverse emotions such as happiness, neutrality, and fear leads to a higher
presentation score;
• Triggering too much neutrality among the participants leads to a lower presentation
score.
The remainder of this paper is organized as follows. In Section 2, we review relevant
literature with respect to facial emotion recognition and the relationship between emotions
and the quality of presentations. In Section 3, we introduce the experimental setup and
the research method. Results and discussions are presented in Sections 4–6, respectively.
Finally, Section 7 summarizes the paper.
2. Related Work
We review the relevant literature with respect to, firstly, the application of deep
learning in the context of facial emotion recognition, and secondly, the relationship between
emotions and the quality of presentations.
2.1. Facial Emotion Recognition
Facial emotion recognition (FER) is a stream of work in which researchers in the area of
computer vision, affective computing, human–computer interaction, and human behavior
deal with the prediction of emotions using facial expressions in images or videos [15].
FER literature can be divided into two groups according to whether the features are
handcrafted or automatically generated through the output of a deep neural network [6].
Furthermore, FER research can be distinguished according to whether the underlying
emotional model is based on discrete emotional states [16] or on continuous dimensions,
such as valence and arousal [17]. In the former, researchers share a consistent notion of
emotions as discrete states, although different determinations exist with respect to what
these basic emotions are. For example, Ekman and Oster [16] identified the five basic
emotions as happiness, anger, disgust, sadness, and fear/surprise. Panksepp [18] defines
play, panic, fear, rage, seeking, lust, and care as basic emotions. In the continuous dimension
model, emotions are described by two or three dimensions containing valence or pleasant
as one dimension and arousal or activation as the other dimension [19]. Considering
that we leverage discrete emotional states in our work as well as the fact that relatively
few studies develop and evaluate algorithms for the continuous dimension model [19],
especially for the handcrafted feature generation process, we will only focus on the discrete
emotional model type that distinguishes between handcrafted and deep learning based
approaches.
FER approaches that use handcrafted features are usually deployed in three steps: face
and facial component detection, feature extraction, and expression classification [6]. In the
first step, faces and facial components (e.g., eyes and mouth) are detected in an input image.
In the second step, various spatial and temporal features such as Histogram of Gradients
(HoG), Local Binary Pattern (LBP), or Gabor Filters are extracted from the facial components.
Finally, machine learning algorithms such as Support Vector Machine (SVM) or Random
Forests use the extracted features to recognize emotional states. Researchers have shown
that manually extracting features can lead to accurate results [20–23] and practitioners use
such approaches because, compared to deep learning based methods, manually extracting
features requires much less computational resources [6].
However, due to the rise in the size and variety of data sets and thanks to recent
developments in deep learning, deep neural networks have been the most appropriate
technique in all computer vision tasks including FER [15]. Many researchers showed
the superiority of deep learning algorithms, with CNNs, RNNs, and LSTM models in
particular, over handcrafted approaches [24–33]. The main advantage of neural networks
is their ability to enable “end-to-end” learning, whereby features are learned automatically
from the input images [6]. One of the most deployed neural networks in the context of facial
emotion recognition is the CNN [6], which is a special kind of neural network for processing
images by using a convolutional operation [34]. The advantage of such a neural network is
that it can take into account spatial information, that is, location-based information. However,
CNNs “cannot reflect temporal variations in the facial components” [6] (p. 8) and thus,
more recently, RNNs or LSTM models in particular, have been combined with CNNs to
capture not only the spatial information but also the temporal features [7,26,30–33].
In our study, we use a deep learning based method, a CNN, to recognize a subset of
the Ekman model’s emotions which are the following: anger, fear, happiness, sadness, and
surprise. Furthermore, we augmented the emotions with the neutral expression as a state
of control for the recognition results on emotions, similar to Franzoni et al. [35]. Moreover,
as suggested by Kim et al. [36] and Kuo et al. [37], we trained our CNN on a merged
data set that consisted of various original FER data sets such as CK+ [12], JAFFE [13], and
BU-3DFE [14] to alleviate the problem of overfitting and to further improve its robustness.
Finally, we leveraged a well-known pre-trained model architecture, VGG16 [11], to further
improve the effectiveness of our CNN [38].
2.2. Presentations and Emotions
Deep neural networks have proved successful in a plethora of emotion recognition
challenges such as facial emotion recognition [39], speech emotion recognition [40], or
multimodal emotion recognition [41]. However, the use of such methods to predict emotions
and the subsequent investigation of the extent to which emotions influence the quality of
a presentation is limited, although some studies analyze the relationship between facial
expressions and the learning process, especially in e-learning environments.
Zeng et al. [5] developed a prototype system which uses multimodal features in-
cluding emotion information from facial expressions, text, and audio to explore emotion
coherence in presentations. By analyzing 30 TED talk videos and examining two semi-
structured expert interviews, the authors demonstrated that the proposed system can be
used to, firstly, teach speakers to express emotions more effectively improving presen-
tations and, secondly, teach presenters to include joke-telling to promote personalized
learning. Although the authors also stress the importance of emotions in presentations,
our study differs significantly from the work by Zeng et al. [5] in that we incorporate the
emotions from the speaker and the audience as opposed to only analyzing the emotions
from the speaker and, rather than considering expert knowledge, we use a score given
after each presentation by all participants except the presenter to analyze the influence
of emotions on the perceived outcome of video meetings. Finally, while Zeng et al. [5]
developed a system which describes emotion coherence on different channels throughout
a presentation, we investigate which emotional patterns lead to great presentations.
In their work, Chen et al. [42] use multimodal features including speech content,
speech delivery (fluency, pronunciation, and prosody), and nonverbal behaviors (head,
body, and hand motions) to automatically assess the quality of public speeches. The authors
collected 56 presentations from 17 speakers whereby each speaker had to perform four
different tasks. The presentations were scored by human raters on ten dimensions, such as
vocal expression and paralinguistic cues, to engage the audience. The authors showed that
multimodal features can be leveraged by machine learning algorithms, namely random
forests and support vector machines, to assess the performance of public speeches. Although
the study took into account verbal and nonverbal behaviors, it considered emotions neither
from speech nor from nonverbal behaviors such as facial expressions. Furthermore, the
authors did not use sophisticated deep learning algorithms (e.g., convolutional neural
networks) to predict emotions.
In another stream of research, researchers leveraged facial emotion recognition to
improve educational e-learning platforms and thus the learning experience by providing
personalized programs [9], personalized feedback [10], and by measuring student engage-
ment [8]. For example, Carolis et al. [10] developed a tool for emotion recognition from
facial expressions to analyze difficulties and problems of ten students during the learning
process in a first year psychology course. The system was used to detect various emotions,
such as enthusiasm, interest, concentration, and frustration during two situations: while
presenting prerecorded video lectures to the students and while the students participated
in an online chat with a teacher. The authors found that emotions can be an “indicator of
the quality of the student’s learning process” [10] (p. 102). For example, they argued that
energy is a key factor to avoid boredom and frustration.
The study by Carolis et al. [8] is most similar to our work. The authors automatically
measured the engagement of 19 students by analyzing facial expressions, head movements,
and gaze behavior from 33 videos which contained more than five and a half hours of
recordings. The collected data were related to a subjective evaluation of the engagement
coming from a questionnaire with four dimensions: challenge, skill, engagement, and
perceived learning. The authors found that the less stressed and more relaxed students
are, the more engaged they appeared to be. Furthermore, they demonstrated that the more
excitement and engagement the students felt during a presentation (TED videos), the more
engagement was perceived.
Although our work also takes place in a learning environment, a virtual seminar
held at multiple universities simultaneously, it differs from the studies which analyze
the relationship between emotions and the learning process in that we investigate the
relationship between the perceived quality of a presentation and the emotions among the
audiences and the presenter. In other words, we focus on finding an emotional pattern that
leads to a great presentation.
3. Data and Methods
3.1. Experimental Setup
We collected data from a virtual seminar called Collaborative Innovation Networks
(COINs 2020) [43], which involved three instructors and 35 students from MIT, the Uni-
versity of Cologne and the University of Bamberg. Students formed virtual teams with
three to five participants from different locations, resulting in a total of eight teams, each of
which investigated a given complex business topic independently. Afterwards, the seminar
was organized as a virtual meeting by using Zoom (https://zoom.us/, accessed on 10
May 2021) which was held every two weeks from 6 April 2020 to 14 July 2020. In each
meeting, each team gave a PowerPoint-supported presentation of 10–15 min in rotating
order, reporting the current progress of the team project to the audiences, i.e., the other
teams and instructors.
During the virtual meeting, both the speakers and audiences were asked to keep the
built-in camera on their local devices active to ensure that the faces would clearly appear
on the video. We then recorded each meeting using Zoom’s built-in recording function. In
order to capture all participants’ faces, we recorded the video in Zoom’s Gallery View,
which can display up to 49 thumbnails in a grid pattern on a single screen. The recorded
videos are the main analytical material used in this work.
After each presentation, the audience was asked to rate the overall performance by
using a pre-designed anonymized poll published in a Google Form. Specifically, individuals
were asked to answer “How many points will you give this presentation” on a numeric
rating scale between 1 and 5, with 5 being the maximum number of points for the given
presentation. Moreover, we provided an additional option, “I am a speaker”, in the poll to
prevent subjective scoring by the speaker. After collecting the poll data, we denoted the
collective score y for a presentation as the mean rating across all audience members. We consider
this a reasonable method to align the poll score with the presentation and, most importantly,
to reduce the bias of an individual’s subjective evaluation. These ground-truth collective
scores are used as the dependent variable for further analysis.
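As a minimal sketch of how such a collective score could be computed, the snippet below averages the 1 to 5 ratings of one presentation and skips responses that selected the “I am a speaker” option. The function name, the response encoding, and the sample values are illustrative assumptions, not the authors’ code or data.

```python
from statistics import mean

# Poll option named in the text; how it is stored in the poll export is an assumption.
SPEAKER_OPTION = "I am a speaker"


def collective_score(responses):
    """Return the mean of the numeric ratings, ignoring speaker responses."""
    ratings = [r for r in responses if r != SPEAKER_OPTION]
    if not ratings:
        raise ValueError("no audience ratings available")
    return mean(ratings)


# Hypothetical poll responses for a single presentation (not study data).
responses = [4, 5, "I am a speaker", 3, 4]
y = collective_score(responses)
```

Averaging over all raters is what dampens the influence of any single subjective rating on y.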
3.2. Data Pre-Processing
We applied different pre-processing steps to clean up the recorded data and to arrange
them in a suitable form for subsequent analysis. Each recording was first divided into eight
sequences and each of them contained the presentation of one group. We then converted
each video presentation into a sequence of images. More specifically, we extracted one
image (frame) per second from the video using Moviepy (https://zulko.github.io/moviepy/,
accessed on 10 May 2021). As each image contained a grid of up to 49 faces, we localized
and extracted individual faces from each image using face-recognition
(https://github.com/ageitgey/face_recognition, accessed on 10 May 2021), which is a python
library for detecting faces in a given image. Moreover, we utilized the same library to divide
the individual faces into three groups, namely: audience, presenter, and supervisor. The
library automatically detects similar faces given an example. Hence, we manually selected
sample images for supervisor and presenter, respectively, and stored them separately using
face-recognition. Note that in some presentations, neither the presenter nor the supervisor
could be identified due to poor lighting conditions or due to the person of interest not
being recorded at all. We then discarded images with low quality and this resulted in a
total of 41 presentations with the number of individual faces ranging from 3600 to 15,863
for each presentation. Finally, we converted each image to greyscale and reshaped its size
to 48 × 48 pixels. During model training, we used data augmentation, that is, randomly
flipping the images horizontally (see Figure 1 for an illustration).
[Figure 1: flow diagram. Zoom recording; divide into eight videos, each containing a different presentation; for each presentation, extract one image per second; extract faces from each image (up to 49 faces per image); classify faces as presenter, supervisor, or audience.]
Figure 1. Overview of the data pre-processing steps. Each recording is divided into eight videos and each contains a
different presentation. We then extracted one image per second for each presentation video. Finally, we extracted all faces
from a single image and classified them into presenter, supervisor, and audience.
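The grouping of faces into presenter, supervisor, and audience can be sketched as a nearest-reference match on face embeddings. The example below is a simplified illustration that uses plain Python lists in place of real face encodings. The helper names, the three-dimensional vectors, and the fallback to “audience” are assumptions; the threshold of 0.6 mirrors the default tolerance of the face-recognition library’s `compare_faces` function, and the source does not state which threshold the authors used.

```python
import math

# Assumed distance threshold; 0.6 is the default tolerance of compare_faces.
TOLERANCE = 0.6


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_role(face, references, tolerance=TOLERANCE):
    """Return the closest reference role within the tolerance, else 'audience'."""
    best_role, best_dist = "audience", tolerance
    for role, ref in references.items():
        dist = euclidean(face, ref)
        if dist <= best_dist:
            best_role, best_dist = role, dist
    return best_role


# Hypothetical 3-dimensional embeddings; real face encodings are 128-dimensional.
references = {"presenter": [0.1, 0.2, 0.3], "supervisor": [0.9, 0.8, 0.7]}
faces = [[0.12, 0.21, 0.33], [0.88, 0.79, 0.72], [0.5, 0.1, 0.9]]
roles = [assign_role(face, references) for face in faces]
```

Any face that matches neither manually selected reference image is treated as an audience member in this sketch.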
3.3. Facial Emotion Recognition
In order to identify participant’s emotions, we used convolutional neural networks
(CNNs) to employ facial emotion recognition (FER). In particular, we focused on six
different emotions, inspired by Ekman and Oster [16], namely: anger, fear, happiness,
neutral, sadness, and surprise. The classifier was trained on a variety of available FER data
sets where each face was labelled manually with a corresponding emotion. Specifically, we
collected the following data sets:
• 5405 images from AffectNet [19], which we labelled manually;
• 31,051 images from FERPlus (https://github.com/microsoft/FERPlus, accessed on 10
May 2021);
• 250 images from the Extended Cohn-Kanade Data set (CK+) [12];
• 184 images from the Japanese Female Facial Expressions (JAFFE) database [13];
• 522 images from BU-3DFE [14];
• 3455 images from FFQH (https://github.com/NVlabs/ffhq-dataset, accessed on 10
May 2021), which we labelled manually.
The images from AffectNet and FFQH were labelled by three researchers. All re-
searchers received the same images and had to choose one of the following classes for each
image: anger, fear, surprise, sadness, neutral, happiness, or unknown. We only considered
images for which two of the three researchers agreed on the same emotional state, and we
ignored images where the majority vote was the unknown class.
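The labelling rule can be illustrated with a small majority-vote helper. This is a sketch under the assumption that each image receives exactly three votes; the function name and the use of `None` for rejected images are illustrative choices, not part of the original procedure.

```python
from collections import Counter


def consensus_label(votes, unknown="unknown"):
    """Return the label chosen by at least two annotators, or None if unusable."""
    label, count = Counter(votes).most_common(1)[0]
    if count >= 2 and label != unknown:
        return label
    return None  # no majority, or the majority voted "unknown"


# Hypothetical votes from three annotators for three images.
kept = consensus_label(["fear", "fear", "anger"])
dropped_unknown = consensus_label(["unknown", "unknown", "anger"])
dropped_split = consensus_label(["fear", "anger", "sadness"])
```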
Finally, we combined all images, which resulted in a large and heterogeneous FER data
set containing 40,867 images (with 8.58% of anger, 3.26% of fear, 13.81% of surprise, 11.45%
of sadness, 33.13% of neutral, and 29.77% happiness). Table 1 presents the distribution
of emotional states along the six data sets. Note that we used 80% of the images for
training, 10% for testing, and 10% for validation. The validation set was used to estimate
the prediction error for model selection. That is, the training of the model was terminated
prematurely (before the maximum epoch) once the performance on the validation set
became worse.
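The split and the stopping rule described above can be sketched as follows. The random shuffling, the seed, the rounding of the subset sizes, and the validation accuracies are assumptions made for illustration; the source states only the 80/10/10 proportions and that training stopped once validation performance became worse.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle and split items into training, test, and validation subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # assumed: a simple random split
    n_train = int(0.8 * len(items))
    n_test = int(0.1 * len(items))
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]
    return train, test, validation


def epochs_until_stop(val_accuracies, max_epochs=100):
    """Return the number of epochs run when stopping at the first validation drop."""
    best = float("-inf")
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        if acc < best:
            return epoch  # performance became worse: stop here
        best = acc
    return min(len(val_accuracies), max_epochs)


train, test, validation = split_80_10_10(range(40867))
# Hypothetical validation accuracies per epoch (not results from the study).
stopped_at = epochs_until_stop([0.60, 0.70, 0.75, 0.74, 0.80])
```

In this sketch, training halts in the fourth epoch even though a later epoch would have scored higher, which is the usual trade-off of stopping at the first decline.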
Table 1. Distribution of different emotions along the various FER data sets.
Data set     Anger    Fear   Surprise   Sadness   Neutral   Happiness    Total
AffectNet      473     512       1379       569      1873         599     5405
FERPlus       2606     648       3950      3770    11,011        9066   31,051
CK+             45      25         83        28         0          69      250
JAFFE           30      32         30        31        30          31      184
BU-3DFE         92      92         89        88        84          77      522
FFQH           260      22        114       193       540        2326     3455
Total         3506    1331       5645      4679    13,538      12,168   40,867
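The class shares reported for the merged data set follow directly from the column totals in Table 1. The short check below recomputes them; the variable names are illustrative.

```python
# Column totals taken from the "Total" row of Table 1.
totals = {
    "anger": 3506,
    "fear": 1331,
    "surprise": 5645,
    "sadness": 4679,
    "neutral": 13538,
    "happiness": 12168,
}

n_images = sum(totals.values())

# Share of each emotion in percent, rounded to two decimals as in the text.
shares = {emotion: round(100 * count / n_images, 2) for emotion, count in totals.items()}
```

The resulting shares make the class imbalance explicit: neutral and happiness together account for well over half of the images, while fear is the rarest class.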
Regarding the deep models, we considered several widely-used convolutional neural
network (CNN) architectures including VGG [11] and Xception [44]. We created these
models from scratch by utilizing Keras. We adopted the cross-entropy criterion as the loss
function which is minimized using the Adam optimizer with a learning rate of 0.025. We
trained each model up to 100 epochs. After training, we evaluated each model on the
same testing set of the heterogeneous FER data set and chose the model with the highest
performance in terms of accuracy. Subsequently, the selected model was used to predict
the emotions for all faces that we had previously recorded in the 41 presentations.
3.4. Feature Engineering
In total, we calculated 18 audience and 6 speaker features for each presentation using
the predicted emotions from the recordings. The audience features were created as follows.
Firstly, for each emotion, we determined the ratio between the frequency of occurrences
of a given emotion by the audience and all the emotions the audience expressed during
the presentation. We denote this feature as ratio_audience (E), where E refers to a specific
emotion, such as anger, fear, surprise, sadness, neutral, or happiness. For each emotion
expressed by the audience, we then calculated the number of times it occurred at least
once per second in a given presentation. Subsequently, we put this number in relation
to the total recorded time of the same presentation. We denote this audience feature as
density (E), with E referring to an emotion. Finally, for each presentation, we normalized
the frequency of an emotion expressed by the audience between 0 and 1 and calculated its
standard deviation over the entire presentation. This feature is referred to as deviation (E),
with E representing the given emotion.
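One possible reading of the three audience features is sketched below. It assumes that the predictions are available as one list of audience emotion labels per recorded second, that “normalized between 0 and 1” means min-max scaling of the per-second counts, and that the standard deviation is the population standard deviation. None of these details is specified in the text, so the function is an illustration rather than the authors’ implementation.

```python
from collections import Counter
from statistics import pstdev

EMOTIONS = ("anger", "fear", "surprise", "sadness", "neutral", "happiness")


def audience_features(frames):
    """Compute ratio, density, and deviation per emotion.

    frames: one list of predicted audience emotions per recorded second.
    """
    all_labels = [label for frame in frames for label in frame]
    counts = Counter(all_labels)
    features = {}
    for e in EMOTIONS:
        per_second = [frame.count(e) for frame in frames]
        # Share of emotion e among all audience predictions.
        features[f"ratio_audience({e})"] = counts[e] / len(all_labels)
        # Share of seconds in which emotion e occurred at least once.
        features[f"density({e})"] = sum(c > 0 for c in per_second) / len(frames)
        # Min-max scale the per-second counts, then take the standard deviation.
        lo, hi = min(per_second), max(per_second)
        scaled = [(c - lo) / (hi - lo) if hi > lo else 0.0 for c in per_second]
        features[f"deviation({e})"] = pstdev(scaled)
    return features


# Hypothetical predictions for four seconds with two audience faces each.
frames = [
    ["happiness", "neutral"],
    ["neutral", "neutral"],
    ["happiness", "happiness"],
    ["neutral", "fear"],
]
features = audience_features(frames)
```

With six emotions and three features per emotion, the function returns the 18 audience features mentioned above.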
Since there is only one speaker at a given second in a presentation, we could not
compute the same features for the speaker as we could for the audience. More specifically,
we only calculated the ratio between the frequency of a specific emotion the speaker
expressed and the number of all recorded emotional states the speaker exhibited during
the presentation. We denote this feature as ratio_speaker (E), where E refers to a given
emotion.
The above described features were used to calculate correlations between audience,
speakers, and presentation scores. We also used some of the features for an ordinary least
squares regression.
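A correlation between two feature vectors, with one value per presentation, can be computed as in the sketch below. It assumes the Pearson coefficient, which the text does not name explicitly, and the input values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Hypothetical feature values for four presentations (not study data).
ratio_speaker_happy = [0.10, 0.30, 0.20, 0.50]
ratio_audience_happy = [0.15, 0.35, 0.20, 0.40]
r = pearson_r(ratio_speaker_happy, ratio_audience_happy)
```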
4. Results
We found that the VGG16 model performed best with a test accuracy of 84.0%, closely
followed by the Xception model with 83.7%, the VGG19 model with 83.6%, and the VGG13
model with 82.6% (see the confusion matrices on the test set for each model in Figures 2–5).
Figure 2 illustrates the confusion matrix on the test set for the VGG16 model including
the in-class precision for each emotion: 84% neutral, 93% happiness, 83% surprise, 70%
sadness, 83% anger, and 64% fear.
Figure 2. VGG16 confusion matrix on the test set.
Figure 3. VGG19 confusion matrix on the test set.
Figure 4. VGG13 confusion matrix on the test set.
Figure 5. Xception confusion matrix on the test set.
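The relation between a confusion matrix, the overall test accuracy, and a per-class precision can be sketched as follows. The counts are hypothetical and cover only three classes. The sketch assumes that rows hold the true class and columns the predicted class; if the plotted matrices are normalized per true class instead, the diagonal corresponds to recall rather than precision.

```python
def per_class_precision(matrix, labels):
    """Correct predictions of a class divided by all predictions of that class.

    Assumes matrix[i][j] counts samples with true class i predicted as class j.
    """
    result = {}
    for j, label in enumerate(labels):
        predicted_as_j = sum(row[j] for row in matrix)
        result[label] = matrix[j][j] / predicted_as_j if predicted_as_j else 0.0
    return result

def accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Hypothetical counts, not the values behind Figures 2-5.
labels = ["happy", "neutral", "fear"]
matrix = [
    [90, 8, 2],    # true happy
    [10, 85, 5],   # true neutral
    [5, 15, 30],   # true fear
]
precision = per_class_precision(matrix, labels)
print(precision, accuracy(matrix))
```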
After predicting the emotions of all faces recorded during the Zoom presentations
using the VGG16 model, we calculated the 18 audience features and the 6 speaker features
described in Section 3.4. We first correlated the emotions of the audience with the emotions
of the speakers for each presentation using the ratio_audience(E) and ratio_speaker(E)
features. The coefficients are captured in Table 2. We found that, firstly, the happier the
speaker was, the happier and less neutral the audience was and, secondly, the more neutral
the speaker was, the less surprised the audience was.
Table 2. Correlation coefficients between the emotions of the audience and the emotions of the speaker along all presentations
(N = 41) (*** Correlation significant at 0.01 level; ** Correlation significant at 0.05 level).
                          ratio_speaker(E) by speaker emotion E
                          Happy        Neutral      Fear         Sad          Surprise     Angry
ratio_audience(happy)     0.5107 ***   −0.0754      0.0544       −0.0532      −0.2369      −0.2082
ratio_audience(neutral)   −0.4347 ***  0.0909       −0.0432      −0.0283      0.1850       0.1340
ratio_audience(fear)      0.2240       0.0983       −0.1953      0.0210       −0.3097      0.1471
ratio_audience(sad)       −0.1185      0.0592       0.0165       −0.0780      −0.0153      0.2119
ratio_audience(surprise)  0.2059       −0.3147 **   0.1163       0.1995       0.1952       0.1602
ratio_audience(angry)     −0.0123      0.0013       −0.0569      0.1650       0.0089       −0.1450
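A correlation coefficient and its significance marker, as used in Table 2, can be sketched with the standard library alone. The sketch assumes a Pearson correlation tested with a two-sided t-test on N − 2 degrees of freedom, which is the common default but is not stated explicitly in the text. The p-value is obtained by numerically integrating the Student t density, and all function names are illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p(r, n, steps=2000):
    """Two-sided p-value for correlation r over n observations (requires |r| < 1)."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1.0 - r * r))
    # Simpson's rule (steps must be even) for the density integral from 0 to t.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3
    return max(0.0, 1.0 - 2.0 * central)

def stars(p):
    """Significance marker following the legend of Table 2."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    return ""

# Coefficient taken from Table 2 (speaker happy vs. audience happy), N = 41.
print(stars(two_sided_p(0.5107, 41)))
```

Under these assumptions, the two starred coefficients of Table 2 (0.5107 and −0.3147 with N = 41) receive the same markers as in the table.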
Next, we computed the correlations between all features (18 audience and 6 speaker
features) and the presentation scores. The significant features, their coefficients, and
p-values are provided in Table 3.
Table 3. Correlation coefficients between the emotions of the audience and speakers and the
presentation scores along all presentations (N = 41).
Feature                     Presentation Score   p-Value
deviation (happy)           0.73                 7 × 10^−8
deviation (neutral)         0.50                 8 × 10^−4
deviation (fear)            0.34                 0.025
ratio_audience (happy)      0.55                 2 × 10^−4
ratio_audience (neutral)    −0.44                4 × 10^−3
ratio_audience (fear)       0.38                 0.010
density (happy)             0.44                 4 × 10^−3
ratio_speaker (happy)       0.35                 0.026
We found that the deviation in the audience’s happiness, neutrality, and fear; the ratio
of the audience’s happiness and fear; the density of the audience’s happiness; and the
ratio of the speaker’s happiness are positively correlated with the presentation score. In contrast,
the ratio of the audience’s neutrality is negatively correlated with the presentation score.
All of these correlations are significant with a p-value smaller than 0.05. The correlation
between the deviation in the audience’s happiness and the presentation score is the strongest
and most significant, with a coefficient of 0.73 and a p-value of 7 × 10^−8. Recall that the deviation
in the audience’s happiness measures the variation in happiness throughout the presentation.
This correlation is also illustrated in Figure 6, where we plotted the development
of happiness for some selected presentations. We can see that the more variation in the
audience’s happiness we have, the better the score (see the blue dot in the upper right
corner of Figure 6a) and vice versa (see the blue dot in the lower left corner of Figure 6a).
Figure 6. (a) Illustration of the correlation between presentation score and the audience’s deviation in happiness. We can
see that presentations which caused wide fluctuations in happiness among the spectators achieved higher scores (b,c) than
presentations which did not cause fluctuations in happiness among the spectators (d,e).
Finally, we used an ordinary least squares (OLS) regression to predict the presentation
score using the audience’s deviation in happiness and the audience’s deviation in fear
as independent variables. Both features were chosen after several tests. A summary of
the OLS regression model is provided in Table 4. We found that the R² of the OLS model
was 0.55, which indicates that the regression model described 55% of the variation in the
presentation scores using only two features, namely the deviation in the audience’s
happiness and fear.
Table 4. Summary statistics of the OLS regression with the audience’s deviation in happiness and the
audience’s deviation in fear as independent variables and the presentation score as the dependent
variable (N = 41).
Variables            Coefficient   Standard Error   T-Statistics   p-Value
Intercept            −0.076        0.071            −1.063         0.295
deviation (happy)    7.866         1.411            5.574          2.8 × 10^−6
deviation (fear)     12.026        6.372            1.888          0.067
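The regression step can be sketched as an OLS fit with an intercept and two predictors, solved through the normal equations, followed by the computation of R². The data values below are hypothetical, the function names are illustrative, and the sketch makes no claim about which statistics package produced Table 4.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_ols(features, y):
    """Return [intercept, b1, ..., bk] minimising the sum of squared residuals."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def r_squared(features, y, coef):
    """Share of the variance in y explained by the fitted model."""
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], f)) for f in features]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: (deviation(happy), deviation(fear)) and score per presentation.
features = [(0.05, 0.01), (0.10, 0.02), (0.15, 0.01), (0.20, 0.03), (0.12, 0.04)]
scores = [0.5, 1.0, 1.2, 1.9, 1.4]
coef = fit_ols(features, scores)
print(coef, r_squared(features, scores, coef))
```

Solving the normal equations directly is adequate for two predictors; for larger or ill-conditioned designs a QR-based solver would be the more robust choice.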
5. Discussion
In this work, we have experimentally verified our hypothesis that certain combinations
of emotions conveyed by the presenter, or shown by the audience, are indicators of
higher perceived presentation quality. Based on recent advances in FER in deep learning
based methods, particularly CNN, we compared the emotions of participants in a seminar
taught over a semester through their facial expression captured in videoconferencing. As
the regression coefficients in Table 4 show, the larger the deviations of happiness and fear
among the audience, the higher the presentation is scored. In other words, if the spectators
experience a broad range of emotions with wide fluctuations in happiness and fear, they
score the presentation the highest. The same insight can also be gained from Table 3,
which correlates the presentation score with the different emotions. We find that neutral
faces, probably indicating the boredom of the audience, are negatively correlated with
the score. The highest positive correlation with the presentation score is, again, found
with the deviation of happy, neutral, and fear, which confirms the regression result. Sur-
prisingly, the appearance of fear on the faces of the audience correlates positively with
the perceived quality of the presentation. Simply providing constant happiness is also
positively associated with the average score of all presentations and this is shown by the
positive correlation of ratio_audience (happy) with the score. Presenters achieve the most
engaged audience and obtain the highest score when they smile and show a happy face, as
illustrated by the positive correlation between the happiness in the face of the speaker and
the score of their presentation. Happy presenters will also reduce neutral faces among the
audience, as illustrated in Table 2. Having fewer neutral faces is associated with a higher
presentation score.
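As a worked illustration of the regression result, the sketch below plugs deviation values into the fitted equation using the coefficients reported in Table 4. The two input pairs are hypothetical. Since the coefficient for the deviation in fear is only marginally significant (p = 0.067), the outputs should be read as illustrative rather than as reliable predictions.

```python
# Coefficients as reported in Table 4.
INTERCEPT = -0.076
COEF_DEV_HAPPY = 7.866
COEF_DEV_FEAR = 12.026

def predicted_score(dev_happy, dev_fear):
    """Presentation score predicted by the two-feature OLS model."""
    return INTERCEPT + COEF_DEV_HAPPY * dev_happy + COEF_DEV_FEAR * dev_fear

# Hypothetical presentations: one with flat and one with fluctuating audience emotions.
flat = predicted_score(dev_happy=0.05, dev_fear=0.01)
lively = predicted_score(dev_happy=0.15, dev_fear=0.03)
print(flat, lively)
```

Because both coefficients are positive, any increase in either deviation raises the predicted score, which is the pattern described in the paragraph above.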
6. Limitations
While this project provides interesting insights, much further work is needed. In
particular, we are not making a claim with respect to causality. Rather, we take triggering
a wider range of emotional expressions of the audience as an indicator of a successful
presentation. In order to solidly claim a causal link between triggering a broad range of
emotions and producing a great presentation, we would need to conduct controlled
experiments, for instance comparing presentations triggering fear by showing horror movie
snippets with presentations including humorous parts. Nevertheless, there are results from
other areas, such as marketing and advertisement, where it has been found that emotional
advertisements are more successful [45], suggesting that there might indeed be a causal
link between experienced emotionality of the audience and presentation quality.
An additional limitation is the low number of measurements (N = 41).
In an ideal world, the experiments should be repeated with a larger N to gain more
statistical power. Furthermore, our training data set for the CNN was unbalanced
as there were ten times more happy and neutral faces in the dataset than fearful faces.
However, as shown in the confusion matrices in the results section, this did not negatively
impact our accuracy. Additionally, it is well known that facial recognition systems, which
are not restricted to emotion recognition, are biased towards Caucasian males [46] and
discriminate against females and non-Caucasians. Finally, it is also worth mentioning
that video presentations are restricted in communicating emotions since there are only
the contents of the PowerPoint slides and the voice and face of the presenter in the little
“talking head” to get emotionality across to the audience. In real-classroom scenarios, the
body language of the presenter and other interaction channels and environmental cues,
such as smell, convey a much richer emotional experience for the audience; this introduces
other factors that influence the perceived quality of a great presentation.
7. Conclusions
This work contributes to alleviating the restrictions imposed by COVID-19, by trying
to develop recommendations to tackle “videoconferencing fatigue” resulting from long
video meetings because of home office work. While there is no substitute for face-to-face
interaction, we have tried to identify predictors of more highly rated and, thus, less stressful
video meetings. We find that trying to provide “unlimited bliss” by keeping the audience
constantly happy is not the best method for high-quality video presentations. Rather, a
good presenter needs to challenge the audience by puzzling it and providing unexpected
and even temporarily painful information, which then will be resolved over the course of
the presentation. On the other hand, the presenter should constantly provide a positive
attitude to convey enthusiasm and positive energy to the audience. In this manner, even
lengthy video meetings will lead to positive experiences for the audience and, thus, result
in a similar experience for the presenter.
Author Contributions: P.G. conceived of the project; J.S. and J.R. designed the experiments and
analyzed the results; P.G., J.S. and J.R. wrote the manuscript. All authors have read and agreed to the
published version of the manuscript.
Funding: This research has received no external funding.
Institutional Review Board Statement: The study was conducted according to the guidelines of
the Declaration of Helsinki and approved by the Institutional Review Board of MIT (protocol code
170181783) on 27 March 2019.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the
study.
Data Availability Statement: The data presented in this study are available upon request from the
corresponding author. The data are not publicly available due to privacy reasons.
Acknowledgments: We thank Yucong Lin for his assistance in collecting the data. We also thank the
students in the course for participating in our experiment.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. D’Errico, F.; Poggi, I. Tracking a Leader’s Humility and Its Emotions from Body, Face and Voice. Web Intell. 2019, 17, 63–74.
[CrossRef]
2. Gallo, C. Talk Like TED: The 9 Public-Speaking Secrets of the World’s Top Minds; Macmillan: London, UK, 2014; ISBN 978-1-4472-6113-1.
3. Damasio, A.R. Descartes’ Error: Emotion, Reason and the Human Brain; rev. ed. with a new preface; Vintage: London, UK, 2006;
ISBN 978-0-09-950164-0.
4. Tyng, C.M.; Amin, H.U.; Saad, M.N.M.; Malik, A.S. The Influences of Emotion on Learning and Memory. Front. Psychol. 2017, 8,
1454. [CrossRef] [PubMed]
5. Zeng, H.; Wang, X.; Wu, A.; Wang, Y.; Li, Q.; Endert, A.; Qu, H. EmoCo: Visual Analysis of Emotion Coherence in Presentation
Videos. IEEE Trans. Visual. Comput. Graph. 2019, 26, 927–937. [CrossRef] [PubMed]
6. Ko, B.C. A Brief Review of Facial Emotion Recognition Based on Visual Information. Sensors 2018, 18, 401. [CrossRef] [PubMed]
7. Choi, D.Y.; Song, B.C. Facial Micro-Expression Recognition Using Two-Dimensional Landmark Feature Maps. IEEE Access 2020,
8, 121549–121563. [CrossRef]
8. De Carolis, B.; D’Errico, F.; Macchiarulo, N.; Palestra, G. “Engaged Faces”: Measuring and Monitoring Student Engagement
from Face and Gaze Behavior. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence-Companion
Volume, New York, NY, USA, 14 October 2019; pp. 80–85.
9. De Carolis, B.; D’Errico, F.; Macchiarulo, N.; Paciello, M.; Palestra, G. Recognizing Cognitive Emotions in E-Learning Environment.
In Proceedings of the Bridges and Mediation in Higher Distance Education; Agrati, L.S., Burgos, D., Ducange, P., Limone, P.,
Perla, L., Picerno, P., Raviolo, P., Stracke, C.M., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 17–27.
10. De Carolis, B.; D’Errico, F.; Paciello, M.; Palestra, G. Cognitive Emotions Recognition in E-Learning: Exploring the Role of
Age Differences and Personality Traits. In Proceedings of the Methodologies and Intelligent Systems for Technology Enhanced
Learning, 9th International Conference; Gennari, R., Vittorini, P., De la Prieta, F., Di Mascio, T., Temperini, M., Azambuja Silveira,
R., Ovalle Carranza, D.A., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 97–104.
11. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. Available online:
https://arxiv.org/pdf/1409.1556.pdf (accessed on 10 May 2021).
12. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The Extended Cohn-Kanade Dataset (CK+): A Complete
Dataset for Action Unit and Emotion-Specified Expression. In Proceedings of the 2010 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101.
13. Lyons, M.; Kamachi, M.; Gyoba, J. The Japanese Female Facial Expression (JAFFE) Dataset 1998; 1998. Available online:
https://zenodo.org/record/3451524#.YJtUMqgzbIU (accessed on 10 May 2021).
14. Yin, L.; Wei, X.; Sun, Y.; Wang, J.; Rosato, M.J. A 3D Facial Expression Database for Facial Behavior Research. In Proceedings of
the 7th International Conference on Automatic Face and Gesture Recognition (FGR06), Southampton, UK, 10–12 April 2006; pp.
211–216.
15. Jain, D.K.; Shamsolmoali, P.; Sehdev, P. Extended Deep Neural Network for Facial Emotion Recognition. Pattern Recognit. Lett.
2019, 120, 69–74. [CrossRef]
16. Ekman, P.; Oster, H. Facial Expressions of Emotion. Annu. Rev. Psychol. 1979, 30, 527–554. [CrossRef]
17. Rubin, D.C.; Talarico, J.M. A Comparison of Dimensional Models of Emotion: Evidence from Emotions, Prototypical Events,
Autobiographical Memories, and Words. Memory 2009, 17, 802–808. [CrossRef] [PubMed]
18. Panksepp, J. Affective Neuroscience: The Foundations of Human and Animal Emotions; Oxford University Press: Oxford, UK, 2004;
ISBN 978-0-19-802567-2.
19. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the
Wild. IEEE Trans. Affect. Comput. 2019, 10, 18–31. [CrossRef]
20. Suk, M.; Prabhakaran, B. Real-Time Mobile Facial Expression Recognition System-A Case Study. Available online:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1011.4398&rep=rep1&type=pdf (accessed on 10 May 2021).
21. Ghimire, D.; Lee, J. Geometric Feature-Based Facial Expression Recognition in Image Sequences Using Multi-Class AdaBoost and
Support Vector Machines. Sensors 2013, 13, 7714–7734. [CrossRef] [PubMed]
22. Happy, S.L.; George, A.; Routray, A. A Real Time Facial Expression Classification System Using Local Binary Patterns. In
Proceedings of the 2012 4th International Conference on Intelligent Human Computer Interaction (IHCI), Kharagpur, India, 27–29
December 2012; pp. 1–5.
23. Szwoch, M.; Pieniążek, P. Facial Emotion Recognition Using Depth Data. In Proceedings of the 2015 8th International Conference
on Human System Interaction (HSI), Warsaw, Poland, 25–27 June 2015; pp. 271–277.
24. Jung, H.; Lee, S.; Yim, J.; Park, S.; Kim, J. Joint Fine-Tuning in Deep Neural Networks for Facial Expression Recognition. 2015,
pp. 2983–2991. Available online:
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Jung_Joint_Fine-Tuning_in_ICCV_2015_paper.pdf (accessed on 10 May 2021).
25. Breuer, R.; Kimmel, R. A Deep Learning Perspective on the Origin of Facial Expressions. Available online:
https://arxiv.org/pdf/1705.01842.pdf (accessed on 10 May 2021).
26. Hasani, B.; Mahoor, M.H. Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Networks. Available
online: https://arxiv.org/pdf/1705.07871.pdf (accessed on 10 May 2021).
27. Kim, D.H.; Baddar, W.J.; Jang, J.; Ro, Y.M. Multi-Objective Based Spatio-Temporal Feature Representation Learning Robust to
Expression Intensity Variations for Facial Expression Recognition. IEEE Trans. Affect. Comput. 2019, 10, 223–236. [CrossRef]
28. Ng, H.-W.; Nguyen, V.D.; Vonikakis, V.; Winkler, S. Deep Learning for Emotion Recognition on Small Datasets Using Transfer
Learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, New York, NY, USA, 9
November 2015; pp. 443–449.
29. Gervasi, O.; Franzoni, V.; Riganelli, M.; Tasso, S. Automating Facial Emotion Recognition. Web Intell. 2019, 17, 17–27. [CrossRef]
30. Chu, W.-S.; De la Torre, F.; Cohn, J.F. Learning Spatial and Temporal Cues for Multi-Label Facial Action Unit Detection. In
Proceedings of the 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), Washington, DC,
USA, 30 May–3 June 2017; pp. 25–32.
31. Graves, A.; Mayer, C.; Wimmer, M.; Radig, B. Facial Expression Recognition with Recurrent Neural Networks. Available online:
https://www.cs.toronto.edu/~graves/cotesys_2008.pdf (accessed on 10 May 2021).
32. Jain, D.K.; Zhang, Z.; Huang, K. Multi Angle Optimal Pattern-Based Deep Learning for Automatic Facial Expression Recognition.
Pattern Recognit. Lett. 2020, 139, 157–165. [CrossRef]
33. Ebrahimi Kahou, S.; Michalski, V.; Konda, K.; Memisevic, R.; Pal, C. Recurrent Neural Networks for Emotion Recognition in
Video. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, New York, NY, USA, 9 November
2015; pp. 467–474.
34. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; Adaptive Computation and Machine Learning; The MIT Press: Cambridge,
MA, USA, 2016; ISBN 978-0-262-03561-3.
35. Franzoni, V.; Biondi, G.; Perri, D.; Gervasi, O. Enhancing Mouth-Based Emotion Recognition Using Transfer Learning. Sensors
2020, 20, 5222. [CrossRef] [PubMed]
36. Kim, J.H.; Poulose, A.; Han, D.S. The Extensive Usage of the Facial Image Threshing Machine for Facial Emotion Recognition
Performance. Sensors 2021, 21, 2026. [CrossRef] [PubMed]
37. Kuo, C.; Lai, S.; Sarkis, M. A Compact Deep Learning Model for Robust Facial Expression Recognition. In Proceedings of the 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June
2018; pp. 2202–22028.
38. Kaya, H.; Gürpınar, F.; Salah, A.A. Video-Based Emotion Recognition in the Wild Using Deep Transfer Learning and Score Fusion.
Image Vis. Comput. 2017, 65, 66–75. [CrossRef]
39. Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2020. [CrossRef]
40. Zhao, H.; Ye, N.; Wang, R. A Survey on Automatic Emotion Recognition Using Audio Big Data and Deep Learning Architectures.
In Proceedings of the 2018 IEEE 4th International Conference on Big Data Security on Cloud (BigDataSecurity), Omaha, NE, USA,
3–5 May 2018; pp. 139–142.
41. Zhang, J.; Yin, Z.; Chen, P.; Nichele, S. Emotion Recognition Using Multi-Modal Data and Machine Learning Techniques: A
Tutorial and Review. Inf. Fusion 2020, 59, 103–126. [CrossRef]
42. Chen, L.; Feng, G.; Joe, J.; Leong, C.W.; Kitchen, C.; Lee, C.M. Towards Automated Assessment of Public Speaking Skills Using
Multimodal Cues. In Proceedings of the 16th International Conference on Multimodal Interaction, Istanbul,
Turkey, 12 November 2014; pp. 200–203.
43. Gloor, P.A.; Paasivaara, M.; Miller, C.Z.; Lassenius, C. Lessons from the Collaborative Innovation Networks Seminar. IJODE 2016,
4, 3. [CrossRef]
44. Chollet, F. Xception: Deep Learning With Depthwise Separable Convolutions. Available online:
https://arxiv.org/pdf/1610.02357.pdf (accessed on 10 May 2021).
45. Hamelin, N.; El Moujahid, O.; Thaichon, P. Emotion and advertising effectiveness: A novel facial expression analysis approach. J.
Retail. Consum. Serv. 2017, 36, 103–111. [CrossRef]
46. Franzoni, V.; Vallverdù, J.; Milani, A. Errors, biases and overconfidence in artificial emotional modeling. Available online:
https://www.researchgate.net/publication/336626687_Errors_Biases_and_Overconfidence_in_Artificial_Emotional_Modeling (accessed
on 10 May 2021).