\documentclass[10pt,twocolumn,letterpaper]{article}

\usepackage{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}

% Include other packages here, before hyperref.

% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex.  (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[breaklinks=true,bookmarks=false]{hyperref}

\cvprfinalcopy % *** Uncomment this line for the final submission

\def\cvprPaperID{****} % *** Enter the CVPR Paper ID here
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}

% Pages are numbered in submission mode, and unnumbered in camera-ready
%\ifcvprfinal\pagestyle{empty}\fi
\setcounter{page}{1}
\begin{document}

%%%%%%%%% TITLE
\title{Data Augmentation of Semantic Embeddings for Skeleton-Based Zero-Shot Gesture Recognition}

\author{David Heiming\\
Karlsruhe Institute of Technology\\
{\tt\small uween@student.kit.edu}
% For a paper whose authors are all at the same institution,
% omit the following lines up until the closing ``}''.
% Additional authors and addresses can be added with ``\and'',
% just like the second author.
% To save space, use either the email address or home page, not both
\and
Hannes Uhl\\
Karlsruhe Institute of Technology\\
{\tt\small ujjmv@student.kit.edu}
\and
Jonas Linkerhägner\\
Karlsruhe Institute of Technology\\
{\tt\small uoega@student.kit.edu}
}

\maketitle
%\thispagestyle{empty}

%%%%%%%%% ABSTRACT
\begin{abstract}
\noindent Interaction with computer systems is one of the central topics of the digital age. Interfacing with a system through body movements rather than tactile controls can provide significant advantages, but it requires the system to reliably detect the performed gestures. Systems using conventional deep learning methods are therefore trained on all possible gestures beforehand. Zero-Shot learning models, on the other hand, aim to also recognize gestures not seen during training when given their labels. The model thus needs to extract information about an unseen gesture's visual appearance from its label. With typical text embedding modules such as BERT, this information is focused on the semantics of the label rather than on its visual characteristics. In this work, we present several forms of data augmentation that are applied to the semantic embeddings of the class labels in order to increase their visual information content. This approach yields a significant performance increase for a Zero-Shot gesture recognition model.
\end{abstract}

%%%%%%%%% BODY TEXT
\section{Introduction}

\noindent Gesture recognition in videos is a rapidly growing field of research and is becoming an important component of controlling consumer products such as drones or televisions without dedicated input devices. While various past works have focused on the classification of gestures known in advance \cite{marinov2021pose2drone, kopuk2019realtime}, this work deals with gesture recognition using Zero-Shot learning. This approach makes it possible to use unseen gestures, i.e. gestures that the model has not encountered during training. The user is thus offered the opportunity to expand the command set for controlling the device.

In order to be able to classify samples of an unseen class, a network needs to form an expectation of what the gesture looks like based on its label. This is usually done through the use of text embeddings \cite{estevam2020zeroshot}: Trained on unannotated text data, language embedding models extract meaning from words or sentences by converting them into a semantic embedding vector. After creating a semantic embedding for each class label, it is possible to compare the embeddings of an unseen class with those of the seen classes to find similarities between them. Based on those similarities, the network can construct an expectation of what a sample of that unseen class might look like.

It is common to apply data augmentation techniques such as cropping, scaling or flipping to the video input of a network in order to increase the amount of available training samples \cite{perez2017effectiveness}. In Zero-Shot learning, however, there are two different, equally important kinds of training information for each class: visual and semantic. Common data augmentation strategies only multiply the amount of visual training data, while the semantic information remains minimal, usually restricted to the plain class label. We aim to provide the network with more relevant semantic information about the different classes by applying several forms of data augmentation to the semantic embeddings of the class labels.
\section{Method}
\label{method}
\noindent First, we build a network capable of Zero-Shot learning for gesture recognition. Then, we define different forms of data augmentation for the semantic embeddings of the class labels and specify the experimental setting.

\begin{figure}[t]
	\begin{center}
		%\fbox{\rule{0pt}{2in} \rule{0.9\linewidth}{0pt}}
		\includegraphics[width=1\linewidth]{Architektur6.png}
	\end{center}
	\caption{Overview of the network modules.}
	\label{architecture}
\end{figure}

\subsection{Architecture}
\noindent The architecture chosen for our experiments largely corresponds to the model presented in \cite{jasani2019skeleton}. We rebuild its modular architecture using the information published in the paper and replace or slightly modify certain modules to fit our specific task. Here, we only give a brief overview of how the model approaches the Zero-Shot task and which changes we make. For detailed information on the network modules, which are illustrated in Figure \ref{architecture}, refer to \cite{jasani2019skeleton, yan2018spatial, reimers2019sentencebert, sung2018learning}.

As visual input, we use a temporal series of skeletons instead of RGB videos to remove unnecessary details such as the background or a person’s clothing. Each skeleton is a graph whose nodes represent the person’s joints. A full input sample consists of a series of one skeleton graph per frame. Such skeleton data can be obtained from RGB video using a framework like \emph{Openpose} \cite{cao2019openpose}. To extract visual features from these input samples, a Graph Convolutional Network (GCN) \cite{yan2018spatial} is used. It consists of 9 spatial-temporal graph convolution layers with residual connections. The resulting output is a 256-dimensional vector containing the visual features.
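
To make this concrete, the following minimal PyTorch sketch only illustrates the assumed tensor layout of a skeleton batch and the dimensionality of the visual features; the \texttt{ToyVisualEncoder} is a hypothetical stand-in for the GCN, and all shapes except the 256-dimensional output are illustrative assumptions rather than our actual implementation.
\begin{verbatim}
# Hypothetical stand-in for the GCN visual path
# (not the real ST-GCN). Assumed batch layout:
# (N, C, T, V, M) = (batch, 3 coordinates,
# frames, joints, persons).
import torch
import torch.nn as nn

class ToyVisualEncoder(nn.Module):
    def __init__(self, in_channels=3, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_channels, out_dim)

    def forward(self, x):  # x: (N, C, T, V, M)
        # pool over frames, joints and persons
        x = x.mean(dim=(2, 3, 4))
        return self.proj(x)  # (N, 256) features

skeletons = torch.randn(8, 3, 300, 25, 2)
features = ToyVisualEncoder()(skeletons)  # (8, 256)
\end{verbatim}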

Parallel to this visual path, a semantic feature extraction of the vocabulary, i.e. all possible class labels, is performed in two steps. First, a \emph{Sentence-BERT} (SBERT) module \cite{reimers2019sentencebert} transforms the class labels into semantic embeddings. This differs from the original architecture in \cite{jasani2019skeleton}, where an older \emph{Sent2Vec} module \cite{Pagliardini_2018} is used. The SBERT module takes a sentence as input, analyzes it and yields two kinds of outputs: a cls-token vector, which is a representation of the entire sentence, and a series of token embedding vectors that each represent one word of the input sentence in its context. A 768-dimensional mean-token vector can be created from this secondary output by applying an attention mask to the series of tokens and averaging them into a single vector. We use the mean-token output of the SBERT module instead of the cls-token as our semantic embedding because it resulted in better performance.
In the second step, the attribute network (AN) transforms the semantic embeddings into semantic features by mapping them into the 256-dimensional visual feature space. Compared to its original form in \cite{sung2018learning}, we apply dropout with a factor of 0.5 to the first layer of this multilayer perceptron (MLP).
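
As a rough sketch of this semantic path, the snippet below computes mean-token embeddings with the \texttt{sentence-transformers} library and maps them into the visual feature space; the pretrained model name and the hidden width of the AN are placeholders, and only the 768- and 256-dimensional interfaces and the dropout factor of 0.5 reflect the setup described above.
\begin{verbatim}
# Sketch of the semantic path; the model name and
# the hidden width of the AN are placeholders.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer(
    'bert-base-nli-mean-tokens')  # mean pooling, 768-d
labels = ["A human crouches down by bending "
          "their knees."]
emb = torch.from_numpy(sbert.encode(labels))  # (1, 768)

attribute_net = nn.Sequential(
    nn.Linear(768, 1024),  # first layer (width assumed)
    nn.Dropout(p=0.5),     # dropout factor 0.5
    nn.ReLU(),
    nn.Linear(1024, 256),  # 256-d semantic features
)
semantic_features = attribute_net(emb)
\end{verbatim}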

Finally, the visual and semantic feature outputs are combined by forming relation pairs. Each pair is a concatenation of the visual features of our input sample with the semantic features of one class. These relation pairs are then fed into the relation network (RN) introduced in \cite{sung2018learning}. The RN applies a similarity metric in order to assess the resemblance of the semantic and visual features within each relation pair. This way, it computes a similarity score for each pair, which represents the input sample's similarity to the corresponding class. The similarity scores are then compared to a one-hot vector representing the ground truth class using a mean squared error (MSE) loss. In contrast to previous works, this architecture does not use a fixed similarity metric. Instead, the RN is an MLP that learns a deep similarity metric during training, which was introduced and shown to improve performance in \cite{sung2018learning}. We add an additional linear layer to the RN and apply dropout to the first and second layer with a factor of 0.5.
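
A minimal sketch of this scoring step is given below; the layer widths of the RN are placeholders, while the concatenation, the per-class similarity scores, the sigmoid output and the MSE loss against a one-hot target follow the description above.
\begin{verbatim}
# Sketch of relation-pair scoring;
# RN layer widths are placeholders.
import torch
import torch.nn as nn

num_classes = 35
visual = torch.randn(1, 256)              # one sample
semantic = torch.randn(num_classes, 256)  # all classes

pairs = torch.cat(
    [visual.expand(num_classes, -1), semantic], dim=1)

relation_net = nn.Sequential(
    nn.Linear(512, 1024), nn.Dropout(0.5), nn.ReLU(),
    nn.Linear(1024, 256), nn.Dropout(0.5), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid(),
)
scores = relation_net(pairs).squeeze(1)  # per class

target = torch.zeros(num_classes)  # one-hot ground truth
target[3] = 1.0                    # assumed true class
loss = nn.MSELoss()(scores, target)
\end{verbatim}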
\subsection{Data augmentation}
\noindent In this section, we present three data augmentation methods for the semantic embeddings of the default class labels provided by the dataset. We apply these methods directly to the labels because they are more tangible than the abstract embeddings. This still results in an augmentation of the semantic embeddings, since these are created from the labels. The goal of these methods is to increase the visual information content of the semantic embeddings of our gesture classes in order to improve the classification performance. To demonstrate them, we apply each augmentation to the class with the default label ``squat down'' as an example.
\subsubsection{Descriptive labels}

In a first step, we provide more visual information by substituting the class labels, which in their original form mostly consist of one or two words, with a complete sentence. We use sentences that give a more precise description of the movements required to perform a particular gesture. This way, the default label ``squat down'' is manually augmented to create the new descriptive label: ``A human crouches down by bending their knees''. During training and testing, every default label is replaced by its manually written descriptive counterpart.

\subsubsection{Multiple labels per class}

We now increase the information content of the semantic embeddings even further by labeling each gesture with several different descriptions. Thus, we manually create additional descriptions that use different wording for each gesture. An example of using three descriptive labels per class is shown in Table \ref{tab:multi_label}. Since the network computes a similarity score for each possible label, there are now three times as many similarity scores due to the expanded vocabulary. In each iteration of the training process, the ground truth of a sample is randomly selected from one of its three possible labels. During inference, a prediction is considered correct if the network predicts any of the three labels belonging to the sample's class.
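
The bookkeeping behind this procedure can be sketched as follows; the label lists and helper names are illustrative and not our actual training code.
\begin{verbatim}
# Illustrative bookkeeping for multiple labels.
import random

labels_per_class = {
    "squat down": [
        "A human crouches down by bending their knees.",
        "A person is bending their legs to squat down.",
        "Someone crouches down from a standing position.",
    ],
    # ... one list of descriptions per gesture class
}

def sample_ground_truth(class_name):
    # training: pick one description per iteration
    return random.choice(labels_per_class[class_name])

def is_correct(predicted_label, class_name):
    # inference: any description of the true class counts
    return predicted_label in labels_per_class[class_name]
\end{verbatim}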
\begin{table}
	\begin{center}
		\begin{tabular}{|lc|}
			\hline
			\textbf{1:} & A human crouches down by bending their knees. \\
			\hline
			\textbf{2:} & A person is bending their legs to squat down. \\
			\hline
			\textbf{3:} & Someone crouches down from a standing position.\\
			\hline
		\end{tabular}
	\end{center}
	\caption{Three descriptive labels for the class ``squat down''.}
	\label{tab:multi_label}
\end{table}
\subsubsection{Automatic augmentation}
\label{autoaug}
To reduce the manual annotation effort, we now generate additional labels automatically for the multiple labels approach. For this purpose, we use an augmenter from \emph{nlpaug} \cite{ma2019nlpaug} with the \emph{RoBERTa} language model \cite{liu2019roberta} to insert words into a manually created descriptive label. We do not use word substitutions, since it is often impossible for the automatic text augmentation to find multiple suitable synonyms for specific words. Word deletions are also suboptimal, because removing key words leads to a sentence that does not describe the given action appropriately. An example label set is shown in Table \ref{tab:auto_aug}. One can see that the reduced manual annotation effort sometimes comes at the cost of grammatically incorrect sentences.
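
As a sketch, such insertions can be produced with the contextual word-embedding augmenter of \emph{nlpaug} roughly as follows; the exact augmenter settings are assumptions, and the example output in the comment is taken from Table \ref{tab:auto_aug}.
\begin{verbatim}
# Sketch of the word-insertion augmentation with
# nlpaug; the exact settings are assumed.
import nlpaug.augmenter.word as naw

augmenter = naw.ContextualWordEmbsAug(
    model_path='roberta-base',  # RoBERTa picks the words
    action='insert',            # no substitution/deletion
)
desc = "A human crouches down by bending their knees."
augmented = [augmenter.augment(desc) for _ in range(4)]
# e.g. "A small human crouches duck down by
#       bending their knees."
\end{verbatim}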
\begin{table}
	\begin{center}
		\begin{tabular}{|lc|}
			\hline
			\textbf{Description:} & A human crouches down by \\
			 & bending their knees. \\
			\hline
			\textbf{Augmentation 1:} & A \textit{small} human crouches \textit{duck}\\
			 & down by bending their knees.\\
			\hline
			\textbf{Augmentation 2:} & A human crouches \textit{fall} down \\
			 & \textit{somewhat} by bending their knees.\\
			\hline
		\end{tabular}
	\end{center}
	\caption{Descriptive label and two automatic augmentations for ``squat down''.}
	\label{tab:auto_aug}
\end{table}  
  

\subsection{Experiments}
\noindent In this work, we use the \emph{NTU RGB+D 120} dataset \cite{Liu_2020}, which contains 3D skeleton data for 114,480 samples of 120 different human action classes. To evaluate our model, we pick a subset of 40 gesture classes and execute four performance tests: one with the default labels as a baseline, and one per augmentation method. A performance test consists of eight training runs on 35/5 (seen/unseen) splits, which are randomized in such a way that every class is unseen in exactly one training run.
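
The eight splits can be generated, for example, by shuffling the 40 classes once and taking disjoint groups of five as the unseen set of each run; the class indices and the seed in the sketch below are illustrative.
\begin{verbatim}
# Illustrative generation of the eight 35/5 splits.
import random

classes = list(range(40))  # the 40 selected classes
random.seed(0)             # illustrative seed
random.shuffle(classes)

splits = []
for i in range(8):
    unseen = classes[i * 5:(i + 1) * 5]
    seen = [c for c in classes if c not in unseen]
    splits.append((seen, unseen))
# every class appears in exactly one unseen set
\end{verbatim}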

During a training run, only the weights of the AN and RN modules are adjusted. The visual feature extractor is trained beforehand on the 80 unused classes of the \emph{NTU} dataset to ensure that the unseen gestures have not been encountered at any earlier stage of the training process. The SBERT module has already been trained on large text corpora by \emph{Sentence-Transformers} \cite{reimers2019sentencebert}.

We test the performance in two scenarios for each augmentation method: in the ZSL scenario, the model only predicts on the unseen classes, while in the GZSL scenario it predicts on all classes (seen and unseen). In the latter, we measure the accuracy for seen and unseen samples, as well as their harmonic mean, following recent works \cite{gupta2021syntactically}. In each scenario, the results are averaged over the eight individual training runs of a performance test. For default and descriptive labels, we train our network with a batch size of 32, as in the original paper \cite{sung2018learning}. When using multiple labels, we increase the batch size to 128 and add batch normalization at the input of the RN.
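
Here, the harmonic mean $h$ follows the common GZSL evaluation convention and combines the seen and unseen accuracies as
\begin{equation}
h = \frac{2 \cdot \mathrm{acc}_{\mathrm{seen}} \cdot \mathrm{acc}_{\mathrm{unseen}}}{\mathrm{acc}_{\mathrm{seen}} + \mathrm{acc}_{\mathrm{unseen}}}.
\end{equation}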
\section{Results}
\noindent All our results are generated following the procedure described in the experiments section. For the multiple labels approach, three manually created labels per class are used. The automatic augmentation approach utilizes five labels: one manually created label and four augmented versions. Table \ref{tab:ZSL_GZSL} shows the ZSL, seen and unseen accuracies, as well as the harmonic mean. Table \ref{tab:top1_top5} provides a more detailed view of the achieved unseen accuracies: it lists the top-1 and top-5 accuracies of our approaches with their standard deviations (std) over the eight splits.

The descriptive labels improve the ZSL accuracy, the unseen accuracy and the harmonic mean. The accuracies increase even further with the multiple labels approach. Automatic augmentation performs worse than multiple manually created labels, but still constitutes a relative 23\% increase in unseen accuracy over using only one descriptive label.

The seen accuracy stays within the same range, only experiencing a marginal increase for the two cases that use multiple labels. This behaviour, along with a decrease in unseen accuracy, is observed whenever batch normalization is applied to any of our approaches. It is therefore only applied in the cases where multiple labels are used, because these require batch normalization for the training to converge.

Table \ref{tab:top1_top5} shows that the top-5 accuracies behave similarly to their top-1 counterparts, with the exception of a less severe performance decrease when using automatic augmentation. The standard deviations of the top-1 accuracies are in the same range for all approaches based on descriptive labels. The standard deviation of the top-5 accuracies decreases for the multiple label approaches, which indicates a higher prediction consistency.

\begin{table}[t]
	\begin{center}
		\begin{tabular}{|l|c|c|c|c|}
			\hline
			Augmentation & ZSL & Seen & Unseen & h\\
			\hline\hline
			Baseline & 0.4739 & 0.8116 & 0.1067 & 0.1877\\
			Descriptive & 0.5186 & 0.8104 & 0.1503 & 0.2495\\
			Multiple & \textbf{0.6558} & 0.8283 & \textbf{0.2182} & \textbf{0.3417}\\
			Automatic & 0.5865 & \textbf{0.8290} & 0.1856 & 0.3003\\
			\hline
		\end{tabular}
	\end{center}
	\caption{ZSL and GZSL results for different approaches.}
	\label{tab:ZSL_GZSL}
\end{table}

\begin{table}[t]
	\begin{center}
		\begin{tabular}{|l|c|c|}
			\hline
			Augmentation & top-1 ${\pm}$ std & top-5 ${\pm}$ std \\
			\hline\hline
			Baseline & ${0.1067\pm 0.0246}$ & ${0.5428\pm 0.0840}$ \\
			Descriptive & ${0.1503\pm 0.0553}$ & ${0.6460\pm 0.1250}$ \\
			Multiple & ${\textbf{0.2182}\pm 0.0580}$ & ${\textbf{0.8580}\pm 0.0657}$ \\
			Automatic & ${0.1856\pm 0.0499}$ & ${0.8272\pm 0.0476}$ \\
			\hline
		\end{tabular}
	\end{center}
	\caption{Unseen top-1 and top-5 accuracies (GZSL).}
	\label{tab:top1_top5}
\end{table}
\subsection{Discussion}
\subsubsection{From default to descriptive labels}

The improvement from the use of descriptive labels shows that incorporating more visual information into the semantic embeddings helps the network to find a general relation between the semantic and the visual space. Plainly speaking, the network can find more similarities between the class labels. This is important since the expected visual features of an unseen class are determined based on the similarities between its label and the seen labels. One might expect these similarities to already be present in the embeddings of the default labels, because SBERT should generate representative embeddings that share characteristics across similar classes. While such similarities are present in the SBERT embeddings, they are not focused on the visual appearance of the gestures. For example, the embeddings of the class labels ``sit down'' and ``drink water'' might be somewhat similar, because those words frequently appear together in the large text corpora that SBERT was trained on. Visually, however, those classes look vastly different from each other. The embeddings falsely suggest a similarity between the classes, which is less likely to happen if the embeddings are created from visual descriptions of the actions.

\subsubsection{Using multiple labels}
When using multiple labels, the idea is somewhat different. The main motivation is that training on a larger and more diverse set of semantic data is generally beneficial. Here, the description and therefore the embedding of each sample is chosen randomly among the three possibilities during training. This forces the network to assign a high similarity to all three labels corresponding to a sample, which leads to a more general mapping between the semantic and the visual feature space, as the model has to adapt to the greater diversity of the semantic embeddings. This improved generalization on seen training data then helps the network to classify the unseen samples better.

For the methods using multiple labels per class, the batch size during training is increased from 32 to 128. Since the network needs to learn a mapping for a greater number of labels, increasing the batch size is necessary to find relations between more of them at once. Increasing the batch size does not benefit the single-label approaches.

\subsubsection{Automatic augmentation}
With automatic augmentation, the individual labels of a class are much more similar to each other than with multiple manually created labels, since only a few additional words are inserted in each version. The diversity of the semantic embeddings is therefore less pronounced, which leads to worse performance. However, compared to the single-label approach, where the semantic embeddings contain no diversity at all, the performance is significantly better.

If diversifying the semantic embeddings is the key to improving the performance, one might expect that generating the additional embeddings by adding random noise to a single embedding could also work. This would obviate the need for a text augmentation module. However, this method does not improve the performance compared to the single-label approach when tested on our model. This shows that a specific kind of diversity is needed to obtain an improvement. Using word insertions clearly provides a suitable diversity, since there is an improvement despite the resulting grammatical errors described in Section \ref{autoaug}.

\section{Conclusion}
\noindent In this work, we demonstrate the potential of applying data augmentation to the semantic embeddings of a Zero-Shot gesture recognition model. By including more visual information in the class labels and combining multiple descriptions per class, we are able to improve the performance of a model based on \cite{jasani2019skeleton} by a significant margin. The use of automatic text augmentation still leads to a sizable performance gain, while keeping the manual annotation effort low.

Future work might investigate the following topics: First, generating descriptive sentences from the default labels, e.g. with methods from Natural Language Processing (NLP), would further reduce the manual annotation effort. Second, our methods could be tested on different Zero-Shot architectures to verify our improvements. Finally, different kinds or combinations of automatic text augmentation methods could be evaluated.

With these advances, data augmentation of the semantic embeddings can prove useful for optimizing the performance of other Zero-Shot approaches in the future.


{\small
\bibliographystyle{ieee_fullname}
\bibliography{egbib}
}

\end{document}