\documentclass[10pt,twocolumn,letterpaper]{article}

\usepackage{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}

% Include other packages here, before hyperref.

% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex.  (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[breaklinks=true,bookmarks=false]{hyperref}

\cvprfinalcopy % *** Uncomment this line for the final submission

\def\cvprPaperID{****} % *** Enter the CVPR Paper ID here
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}

% Pages are numbered in submission mode, and unnumbered in camera-ready
%\ifcvprfinal\pagestyle{empty}\fi
\setcounter{page}{1}
\begin{document}

%%%%%%%%% TITLE
\title{Data Augmentation of Semantic Embeddings for Skeleton-Based Zero-Shot Gesture Recognition}

\author{David Heiming\\
Karlsruhe Institute of Technology\\
{\tt\small uween@student.kit.edu}
% For a paper whose authors are all at the same institution,
% omit the following lines up until the closing ``}''.
% Additional authors and addresses can be added with ``\and'',
% just like the second author.
% To save space, use either the email address or home page, not both
\and
Hannes Uhl\\
Karlsruhe Institute of Technology\\
{\tt\small ujjmv@student.kit.edu}
\and
Jonas Linkerhägner\\
Karlsruhe Institute of Technology\\
{\tt\small uoega@student.kit.edu}
}

\maketitle
%\thispagestyle{empty}

%%%%%%%%% ABSTRACT
\begin{abstract}
Interaction with computer systems is one of the most important topics of the digital age. Recent advances in gesture recognition show that controlling a system solely by moving parts of one's own body can provide advantages over physical interaction. To perform this task, the system needs to reliably detect the performed gestures. For current systems based on deep learning, this means they have to be trained on all possible gestures beforehand. This is where zero-shot learning comes in, as it also allows the recognition of gestures not seen during training.
Here, one of the big challenges is to translate the semantic information given about an unseen class into an expectation of what visual features a sample of that class would have. With typical semantic embeddings such as BERT, this semantic information focuses more on the semantic meaning of the label than on its visual characteristics. In this work, we present different forms of data augmentation that can be applied to the semantic embeddings of the class labels to increase their visual information content. This approach achieves a significant performance improvement for a zero-shot gesture recognition model.

\end{abstract}

%%%%%%%%% BODY TEXT
\section{Introduction}

Gesture recognition in videos is a rapidly growing field of research and could become an important component for input-device-less control of consumer products such as drones or televisions. While various past works have focused on the classification of gestures known in advance, this work deals with gesture recognition using the zero-shot learning approach. This task is interesting because the zero-shot approach is not limited to using learned gestures for fixed commands; it also makes it possible to incorporate untrained gestures. The user of the product is thus offered the opportunity to expand the command set for controlling the device.\\
\indent In order to classify samples of untrained (also called ``unseen'') classes, a network needs to have an expectation of what the gesture corresponding to such a class's label might look like. This is usually achieved through text embeddings \cite{estevam2020zeroshot}: trained on unannotated text data, language embedding models extract meaning from words or sentences by converting them into a semantic embedding vector. After creating a semantic embedding for each class label, it is possible to compare the embeddings of unseen classes with those of seen classes to determine which characteristics the classes share. If those similarities are also present in the visual input samples, the network can deduce that an input sample belongs to a specific unseen class.\\
\indent It is quite common to apply data augmentation techniques such as cropping, scaling or flipping to the video input of a network in order to increase the amount of available training samples. However, in zero-shot learning there are two different, equally important kinds of training information for each class: visual and semantic. Common data augmentation strategies make it possible to multiply the amount of visual training data, but the semantic information remains minimal, usually restricted to the simple label of the class.
We aim to provide the network with more relevant semantic information about the different classes by applying several forms of data augmentation to the semantic embeddings of the class labels.


\section{Method}
\label{method}
First, we build a network capable of zero-shot gesture recognition and assess its baseline performance. Then we apply different forms of data augmentation to the semantic embeddings of the class labels and compare the resulting classification accuracies.
\\
In \cite{jasani2019skeleton}, a zero-shot classification network for the NTU dataset was presented. Its architecture features a multilayer perceptron (MLP) that maps the semantic embeddings of the class labels into the visual feature space and another MLP that learns a deep similarity metric between those semantic features and the visual features of a given input sample.\\

\begin{figure}[t]
	\begin{center}
		%\fbox{\rule{0pt}{2in} \rule{0.9\linewidth}{0pt}}
		\includegraphics[width=1\linewidth]{Architektur2.png}
	\end{center}
	\caption{Overview of the network modules.}
	\label{architecture}
\end{figure}

\subsection{Architecture}
The architecture chosen for our experiments largely corresponds to the model presented in \cite{jasani2019skeleton}. We reverse-engineer it from its individual modules based on the information published in the paper. Certain modules are replaced or slightly modified in favor of better performance on our specific task. Here, we give only a brief overview of the functionality, explaining how the model tries to solve the zero-shot task and which changes we make. For detailed information on the network modules, which are illustrated in figure \ref{architecture}, refer to the original papers \cite{jasani2019skeleton, yan2018spatial, reimers2019sentencebert, sung2018learning}.
The architecture consists of three parts described in the following sections.

\subsubsection{Visual path}

RGB videos contain a lot of information that is not necessary to recognize the performed gesture, such as the background or a person's clothing. To reduce the amount of unnecessary detail, we use a temporal series of skeletons as input data. Each skeleton is a graph whose nodes represent the person's joints. A full input sample consists of a series of one skeleton graph per frame. Such skeleton data can be obtained from RGB video using a framework like OpenPose \cite{cao2019openpose}. Since gestures are fully defined by the motion of a person's limbs, it is possible for an appropriate network to recognize them based on an input of this form \cite{duan2021revisiting}.

The task of the visual path is to extract features from a video sample, given in the form of a temporal series of 3D skeletons. The graph convolutional network (GCN) from \cite{yan2018spatial} is used as the feature extractor; its output is a $1\times 256$ feature vector.
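The following minimal sketch illustrates only this interface (skeleton tensor in, 256-dimensional feature out); the \texttt{DummyGCN} module is a stand-in we introduce purely for illustration, not the actual ST-GCN implementation of \cite{yan2018spatial}:
{\small
\begin{verbatim}
import torch
import torch.nn as nn

class DummyGCN(nn.Module):
    # Stand-in for the ST-GCN backbone: it only mimics the
    # interface (skeleton sequence in, 256-d feature out).
    def __init__(self, in_channels=3, out_features=256):
        super().__init__()
        self.proj = nn.Linear(in_channels, out_features)

    def forward(self, x):          # x: (batch, C, T, V, M)
        x = x.mean(dim=(2, 3, 4))  # pool over time, joints, persons
        return self.proj(x)        # -> (batch, 256)

gcn = DummyGCN()
skeletons = torch.randn(1, 3, 300, 25, 2)  # one NTU-style sample
visual_features = gcn(skeletons)           # torch.Size([1, 256])
\end{verbatim}
}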

\subsubsection{Semantic path}


Sentence-BERT (SBERT) \cite{reimers2019sentencebert} is a text embedding module that takes a sentence as input, analyzes it and produces two kinds of outputs: a cls-token vector, which is a representation of the entire sentence, and a series of token embedding vectors, each of which represents one word of the input sentence in its context. A mean-token vector can be created from this second output by applying the attention mask to the token series and averaging them into a single vector. In this way, two separate semantic embeddings can be generated for each input sentence: a cls-token and a mean-token.
\\
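The following sketch shows how the two embeddings can be obtained with the \texttt{transformers} library; the checkpoint name is an assumption and the exact SBERT model we use may differ:
{\small
\begin{verbatim}
import torch
from transformers import AutoTokenizer, AutoModel

name = "sentence-transformers/bert-base-nli-mean-tokens"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

enc = tokenizer(["A human crouches down by bending their knees."],
                padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

tokens = out.last_hidden_state              # (batch, seq_len, dim)
mask = enc["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
cls_token = tokens[:, 0]                    # sentence-level cls-token
mean_token = (tokens * mask).sum(1) / mask.sum(1)  # masked mean
\end{verbatim}
}
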
The semantic path consists of two modules. The first is an SBERT module, which transforms the vocabulary, i.e., all possible class labels, into semantic embeddings. This differs from the original architecture in \cite{jasani2019skeleton}, where a Sent2Vec module \cite{Pagliardini_2018} is used. We use the mean-token output of the SBERT module instead of the cls-token as our semantic embedding because it resulted in better performance. The attribute network (AN) then transforms the semantic embeddings into semantic features by mapping them into the visual feature space. The AN was introduced in \cite{sung2018learning}, where, together with the relation network (RN) explained in more detail in the following section, it contributes a significant part of the solution to the ZSL task. We apply dropout with a rate of 0.5 to the first layer of the AN.
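A minimal sketch of the AN, assuming 768-dimensional SBERT embeddings; the hidden width is an illustrative assumption, while the dropout rate of 0.5 on the first layer and the 256-dimensional output follow the description above:
{\small
\begin{verbatim}
import torch.nn as nn

# Attribute network (AN): semantic embedding -> visual feature space.
attribute_net = nn.Sequential(
    nn.Linear(768, 512),   # first layer (hidden width assumed)
    nn.ReLU(),
    nn.Dropout(p=0.5),     # dropout on the first layer
    nn.Linear(512, 256),   # map into the 256-d visual feature space
    nn.ReLU(),
)
\end{verbatim}
}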

\subsubsection{Similarity learning part}
Here we first form the relation pairs by pairwise concatenating the visual features of our sample with the semantic features of each class. These relation pairs are then fed into the relation network (RN) introduced in \cite{sung2018learning}, which is another MLP. We add an additional linear layer and apply dropout with a rate of 0.5 to the first and second layers. The RN applies a similarity metric in order to assess the similarity of the semantic and visual features within each relation pair. In contrast to previous work, we do not use a fixed similarity metric. Instead, the RN learns a deep similarity metric during training, which was introduced and shown to improve performance in \cite{sung2018learning}.

In this way, the RN computes a similarity score for each pair, which represents the input sample's similarity to each possible class. The loss is calculated by comparing the similarity scores to a one-hot representation of the ground truth using the mean squared error (MSE).
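The following sketch illustrates this step; the layer widths of the RN are illustrative assumptions, while the pairwise concatenation, the dropout on the first two layers and the MSE loss against a one-hot target follow the description above:
{\small
\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

relation_net = nn.Sequential(
    nn.Linear(256 + 256, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(128, 1), nn.Sigmoid(),
)

def similarity_scores(visual_feat, semantic_feats):
    # visual_feat: (256,), semantic_feats: (num_classes, 256)
    repeated = visual_feat.unsqueeze(0).expand_as(semantic_feats)
    pairs = torch.cat([repeated, semantic_feats], dim=1)
    return relation_net(pairs).squeeze(1)   # (num_classes,)

scores = similarity_scores(torch.randn(256), torch.randn(40, 256))
target = F.one_hot(torch.tensor(3), num_classes=40).float()
loss = F.mse_loss(scores, target)
\end{verbatim}
}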


\subsection{Augmentation}

In this section, we present three approaches that we use to perform data augmentation on the default class labels. The goal of these methods is to increase the visual information content in the semantic description of our gesture classes in order to improve classification performance. The individual methods are explained in more detail using the class with the default label ``Squat down'' as an example.

\subsubsection{Descriptive labels}
In a first step, we increase the information content by replacing the class labels, which in their original form consist of only one or two words, with a complete sentence. We use sentences that give a more precise description of the movements the person in the video makes when performing a particular gesture. In this way, the default label ``Squat down'' was manually augmented to create the new label ``\textit{A human crouches down by bending their knees}''. In both the training and testing phases, every default label is replaced by its manually written descriptive label.

\subsubsection{Multiple labels per class}

We now increase the information content of the semantic embeddings even further by labeling each gesture with several different descriptions. To this end, we manually create two additional descriptions for each gesture using different wording, so that each class now has three descriptive labels. An example label set is shown in table \ref{tab:multi_label}. The network computes a similarity score for each possible label, meaning that due to the increased vocabulary there are now three times as many similarity scores. In each iteration of the training process, the ground truth of a training video sample is randomly selected from one of the three possible labels. During inference, a prediction is considered correct if the network predicts any of the three labels belonging to the corresponding sample's class.
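The label handling can be sketched as follows (the dictionary contents are the example labels from table \ref{tab:multi_label}; the helper functions are hypothetical):
{\small
\begin{verbatim}
import random

labels = {
    "squat_down": [
        "A human crouches down by bending their knees.",
        "A person is bending their legs to squat down.",
        "Someone crouches down from a standing position.",
    ],
    # ... one entry per gesture class
}

def sample_ground_truth(class_name):
    # Training: pick one description at random per iteration.
    return random.choice(labels[class_name])

def is_correct(predicted_label, class_name):
    # Inference: any description of the class counts as correct.
    return predicted_label in labels[class_name]
\end{verbatim}
}
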
\begin{table}
	\begin{center}
		\begin{tabular}{|lc|}
			\hline	
			\textbf{1:} & A human crouches down by bending their knees. \\
			\hline
			\textbf{2:}  & A person is bending their legs to squat down. \\
			\hline
			\textbf{3:}  & Someone crouches down from a standing position.\\	
			\hline
		\end{tabular}
	\end{center}
	\caption{Three descriptive labels for the class ``Squat down''.}
	\label{tab:multi_label}
\end{table}

\subsubsection{Automatic augmentation}
To reduce the manual annotation effort, we would like to generate additional labels automatically for the multi-label approach. Therefore, we use an augmenter from \verb'nlpaug' \cite{ma2019nlpaug} with the RoBERTa language model \cite{liu2019roberta} to insert words into a manually created descriptive label. We use insertions rather than substitutions or deletions, since it was often impossible for the automatic text augmentation to find multiple suitable synonyms for specific words, and deleting key words would lead to a sentence that does not describe the given action appropriately.
An example label set is shown in table \ref{tab:auto_aug}. One can see that the reduced manual annotation effort sometimes comes at the cost of generating grammatically incorrect sentences.
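A minimal sketch of how such an insertion augmenter can be invoked with \verb'nlpaug' (the parameters shown are assumptions based on the library's documented interface, not necessarily our exact settings):
{\small
\begin{verbatim}
import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(
    model_path="roberta-base",  # RoBERTa language model
    action="insert",            # insert words, do not substitute
)

description = "A human crouches down by bending their knees."
augmented = aug.augment(description, n=2)  # two augmented variants
print(augmented)
\end{verbatim}
}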


\begin{table}
	\begin{center}
		\begin{tabular}{|lc|}
			\hline	
			\textbf{Description}: & A human crouches down by  \\
			 & bending their knees. \\
			 \hline
			\textbf{Augmentation 1:} & A \textit{small} human crouches \textit{duck}\\
			 & down by bending their knees.\\
			 \hline
			\textbf{Augmentation 2:} & A human crouches \textit{fall} \\
			 & \textit{somewhat} by bending their knees.\\		
			\hline
		\end{tabular}
	\end{center}
	\caption{Descriptive label and two automatic augmentations for ``Squat down''.}
	\label{tab:auto_aug}
\end{table}  
  


\subsection{Experiments}



In this work we use the NTU RGB+D 120 dataset \cite{Liu_2020}, which contains 3D skeleton data for 114,480 samples of 120 different human action classes. For our tests, we use 40 gesture classes from this dataset. We train the GCN exclusively on the remaining 80 classes, which ensures that the unseen gestures have not already appeared at some earlier point in the training process before inference.
 
In order to evaluate an augmentation method, we perform training runs on eight random 35/5 (seen/unseen) splits, chosen such that every class is unseen in exactly one training run. During training, only the weights of the AN and RN modules are adjusted; all other modules remain unchanged after their individual training. After testing, the accuracies are averaged over the eight individual experiments. For each augmentation method, we test the performance in two scenarios: in the ZSL scenario, the model predicts only among the unseen classes, while in the GZSL scenario it predicts among all classes (seen and unseen). In the latter, we measure the accuracy for seen and unseen samples, as well as their harmonic mean, following recent works \cite{jasani2019skeleton}. For the default and descriptive labels, we train our network with a batch size of 32 and without batch normalization, as was done in the original paper \cite{sung2018learning}. For the multi-label approaches, however, we use a batch size of 128 and batch normalization at the input of the RN. This was done mainly for performance reasons, as the multi-label approach with more than three labels did not learn at all without batch normalization. %batchnorm in general -> decrease in unseen
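For reference, the harmonic mean reported in the GZSL scenario is computed from the seen and unseen accuracies in the standard way:
\begin{equation}
h = \frac{2 \cdot \mathit{Acc}_{\mathrm{seen}} \cdot \mathit{Acc}_{\mathrm{unseen}}}{\mathit{Acc}_{\mathrm{seen}} + \mathit{Acc}_{\mathrm{unseen}}}.
\end{equation}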

\section{Results}

\begin{table}
	\begin{center}
		\begin{tabular}{|l|c|c|c|c|}
			\hline
			Approach & ZSL & Seen & Unseen & h\\
			\hline\hline
			Baseline & 0.4739 & 0.8116 & 0.1067 & 0.1877\\
			Descriptive & 0.5186 & 0.8104 & 0.1503 & 0.2495\\
			Multiple & \textbf{0.6558} & 0.8283 & \textbf{0.2182} & \textbf{0.3417}\\
			Automatic & 0.5865 & \textbf{0.8290} & 0.1856 & 0.3003\\			
			\hline
		\end{tabular}
	\end{center}
	\caption{ZSL and GZSL results for the different approaches; h denotes the harmonic mean of the seen and unseen accuracies.}
	\label{tab:ZSL_GZSL}
\end{table}

\begin{table}
	\begin{center}
		\begin{tabular}{|l|c|c|}
			\hline
			Approach & top-1 $\pm$ std & top-5 $\pm$ std \\
			\hline\hline
			Baseline & ${0.1067\pm 0.0246}$ & ${0.5428\pm 0.0840}$ \\
			Descriptive & ${0.1503\pm 0.0553}$ & ${0.6460\pm 0.1250}$ \\
			Multiple & ${\textbf{0.2182}\pm 0.0580}$ & ${\textbf{0.8580}\pm 0.0657}$ \\
			Automatic & ${0.1856\pm 0.0499}$ & ${0.8272\pm 0.0476}$ \\			
			\hline
		\end{tabular}
	\end{center}
	\caption{Unseen top-1 and top-5 accuracies (mean $\pm$ standard deviation over the eight splits).}
	\label{tab:top1_top5}
\end{table}

All our results were generated following the procedure described in the Experiments section. Table \ref{tab:ZSL_GZSL} shows the ZSL, seen and unseen accuracies, as well as the harmonic mean. Table \ref{tab:top1_top5} gives a more detailed view of the achieved unseen accuracies: it lists the top-1 and top-5 accuracies for our approaches together with their standard deviations over the eight splits. Improvements in the ZSL accuracy, the unseen accuracy and the harmonic mean were achieved using the descriptive labels, and even more so with three descriptive labels per class. Using only one manually created descriptive label and four automatic augmentations of this description in a multi-label approach performs worse than three descriptive labels, but still constitutes a relative 23\% increase over using only one descriptive label. The seen accuracy is quite similar for all approaches; still, it is slightly higher for the multi-label approaches, which we attribute to the use of batch normalization, which consistently raised the seen accuracy slightly at the cost of lowering the unseen accuracy. \\
In the more detailed table, one can see that the top-5 accuracies increase similarly to their top-1 counterparts, with the exception of a less severe performance decrease when using automatic augmentations. This behavior was frequently observed in experiments with the multi-label approach. As for the standard deviations, the top-1 accuracies of all approaches based on descriptive labels lie in the same range. For the top-5 accuracies, we even observe a decrease in standard deviation alongside higher accuracy values, which indicates a higher consistency for the multi-label approach.


\subsection{Discussion}

\subsubsection{From default to descriptive labels}
The improvement from using descriptive labels instead of the default labels shows that incorporating more visual information into the semantic embedding, by using visual descriptions as class labels, helps the network to find a general relation between the semantic and the visual space that is learned only on the seen training data. Plainly speaking, the network can find more similarities between the descriptive sentences than between the one- or two-word labels. One would expect this to already be possible with the default labels due to the use of text embeddings, but the issue lies with the way the embedding modules are trained. The embeddings of the class labels `sit down' and `drink water' might be somewhat similar, because those words frequently appear together in the large training text corpora, but visually those classes look vastly different from each other. The embeddings falsely suggest a similarity between the classes, which is less likely to happen if the embeddings are created from visual descriptions of the actions.

\subsubsection{Using multiple labels}
For the multiple labels approach, the idea is somewhat different. The main motivation here is that using more data is generally beneficial. In our case, the network is forced to learn a more general mapping between the semantic and the visual feature space, since the descriptions, and therefore also the embeddings, change randomly during training. It has to adapt to the greater diversity of the label embeddings. This improved generalization on the seen training data then helps the network to better understand and classify the unseen samples.

\subsubsection{Automatic augmentation}
As described in section \ref{method}, using automatic augmentation methods introduces diversity into the different embeddings. Since this augmentation does not focus solely on the visual description of the classes, and therefore differs from the manually created multiple labels, it can be viewed as a form of noise. In contrast to simply adding random noise to the embedding vector, however, it keeps semantic information and relationships intact, which helps the network to generalize its mapping. Experiments using only random noise to generate diverse label embeddings led to no improvement in top-1 accuracy.

\section{Conclusion}

In this work, we highlighted the importance of the semantic embeddings in the context of skeleton-based zero-shot gesture recognition by showing how performance can increase based solely on the augmentation of those embeddings. By including more visual information in the class labels and combining multiple descriptions per class, we improved the model based on \cite{jasani2019skeleton} by a significant margin. The use of automatic text augmentation methods such as \cite{ma2019nlpaug} already reduces the manual annotation effort considerably while maintaining most of the performance gain.


Future work could investigate the following topics. First, descriptive sentences could be generated from the default labels using methods from natural language processing (NLP) to further reduce the manual annotation effort. Second, additional tests on different zero-shot architectures could be performed to verify the improvements shown in our work. Finally, different kinds or combinations of automatic text augmentation methods could be evaluated.

With these advances, data augmentation of the semantic embeddings could prove useful for optimizing the performance of any zero-shot learning approach in the future.


{\small
\bibliographystyle{ieee_fullname}
\bibliography{egbib}
}

\end{document}