\documentclass[10pt,twocolumn,letterpaper]{article}

\usepackage{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}

% Include other packages here, before hyperref.

% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex.  (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[breaklinks=true,bookmarks=false]{hyperref}

\cvprfinalcopy % *** Uncomment this line for the final submission

\def\cvprPaperID{****} % *** Enter the CVPR Paper ID here
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}

% Pages are numbered in submission mode, and unnumbered in camera-ready
%\ifcvprfinal\pagestyle{empty}\fi
\setcounter{page}{1}
\begin{document}

%%%%%%%%% TITLE
\title{Data Augmentation of Semantic Embeddings for Skeleton-Based Zero-Shot Gesture Recognition}

\author{David Heiming\\
Karlsruhe Institute of Technology\\
{\tt\small uween@student.kit.edu}
% For a paper whose authors are all at the same institution,
% omit the following lines up until the closing ``}''.
% Additional authors and addresses can be added with ``\and'',
% just like the second author.
% To save space, use either the email address or home page, not both
\and
Hannes Uhl\\
Karlsruhe Institute of Technology\\
{\tt\small ujjmv@student.kit.edu}
\and
Jonas Linkerhägner\\
Karlsruhe Institute of Technology\\
{\tt\small uoega@student.kit.edu}
}

\maketitle
%\thispagestyle{empty}

%%%%%%%%% ABSTRACT
\begin{abstract}
Interaction with computer systems is one of the central topics of the digital age. Recent advances in gesture recognition show that controlling a system merely by moving parts of one's own body can offer advantages over physical interaction. To perform this task, the system needs to detect the performed gestures reliably. For current systems based on deep learning, this means they have to be trained on all possible gestures beforehand. This is where zero-shot learning comes in, which makes it possible to also recognize gestures not seen during training. One of the big challenges here is to translate the semantic information given about an unseen class into an expectation of the visual features a sample of that class would have. With typical semantic embeddings such as BERT, that semantic information focuses more on the semantic meaning of the label than on its visual characteristics. In this work, we present different forms of data augmentation that can be applied to the semantic embeddings of the class labels to increase their visual information content. This approach achieves a significant performance improvement for a zero-shot gesture recognition model.
\end{abstract}

%%%%%%%%% BODY TEXT
\section{Introduction}

Gesture recognition in videos is a rapidly growing field of research and could become an important component of input-device-less control of consumer products such as drones or televisions. While various past works have focused on the classification of gestures known in advance, this work deals with gesture recognition using the zero-shot learning approach.
The task is interesting because the zero-shot approach does not only allow the use of learned gestures for fixed commands; it also makes it possible to incorporate untrained gestures. The user of the product is thus offered the opportunity to expand the command set for controlling the device.\\
\indent In order to classify samples of untrained (also called ``unseen'') classes, a network needs an expectation of what the gesture corresponding to such a class's label might look like. This is usually achieved through text embeddings \cite{estevam2020zeroshot}: trained on unannotated text data, language embedding models extract the meaning of words or sentences by converting them into a semantic embedding vector. After creating a semantic embedding for each class label, the embeddings of unseen classes can be compared with those of seen classes to determine which characteristics the classes share. If those similarities are also present in the visual input samples, the network can deduce that an input sample belongs to a specific unseen class.\\
\indent It is quite common to apply data augmentation techniques such as cropping, scaling or flipping to the video input of a network in order to increase the amount of available training samples. However, in zero-shot learning there are two different, equally important kinds of training information for each class: visual and semantic. These common augmentation strategies multiply the amount of visual training data, but the semantic information remains minimal, usually restricted to the bare label of the class. We aim to provide the network with more relevant semantic information about the different classes by applying several forms of data augmentation to the semantic embeddings of the class labels.

\section{Method}
\label{method}
First, we build a network capable of zero-shot gesture recognition and assess its baseline performance. Then we apply different forms of data augmentation to the semantic embeddings of the class labels and compare the resulting classification accuracies.\\
In \cite{jasani2019skeleton}, a zero-shot classification network for the NTU dataset was created. Its architecture features a multilayer perceptron (MLP) that maps the semantic embeddings of the class labels into the visual feature space and another MLP that learns a deep similarity metric between those semantic features and the visual features of a given input sample.\\

\begin{figure}[t]
\begin{center}
%\fbox{\rule{0pt}{2in} \rule{0.9\linewidth}{0pt}}
   \includegraphics[width=1\linewidth]{Architektur2.png}
\end{center}
   \caption{Overview of the network modules.}
\label{architecture}
\end{figure}

\subsection{Architecture}
The architecture chosen for our experiments largely corresponds to the model presented in \cite{jasani2019skeleton}. We reverse engineer it from its individual modules using the information published in the paper. Certain modules are replaced or slightly modified in favor of better performance on our specific task. Here, we give only a brief overview of the functionality, explaining how the model tries to solve the zero-shot task and which changes we make. For detailed information on the network modules, which are illustrated in figure \ref{architecture}, refer to the original papers \cite{jasani2019skeleton, yan2018spatial, reimers2019sentencebert, sung2018learning}. The architecture consists of three parts, described in the following sections.

\subsubsection{Visual path}
RGB videos contain a lot of information that is not necessary for recognizing the performed gesture, such as the background or a person's clothing. To reduce the amount of unnecessary detail, we use a temporal series of skeletons as input data. Each skeleton is a graph whose nodes represent the person's joints; a full input sample consists of one skeleton graph per frame. Such skeleton data can be obtained from RGB video using a framework like OpenPose \cite{cao2019openpose}. Since gestures are fully defined by the motion of a person's limbs, it is possible for an appropriate network to recognize them based on an input of this form \cite{duan2021revisiting}. The task of the visual path is to extract the features of a video sample given as a temporal series of 3D skeletons. The graph convolutional network (GCN) from \cite{yan2018spatial} is used as the feature extractor; its output is a $1 \times 256$ feature vector per sample.
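As an illustration, the following sketch shows the input and output interface we assume for this path. The tensor layout follows the convention of \cite{yan2018spatial}; the pre-trained backbone itself is only indicated by a placeholder:

\begin{verbatim}
import torch

# Assumed skeleton tensor layout (ST-GCN
# convention): N samples, C=3 coordinates,
# T frames, V=25 joints, M=2 bodies.
N, C, T, V, M = 8, 3, 300, 25, 2
skeletons = torch.randn(N, C, T, V, M)

# 'gcn' stands for the pre-trained ST-GCN
# feature extractor (hypothetical handle);
# it maps every sample to one visual
# feature vector:
#   visual_features = gcn(skeletons)
#   visual_features.shape == (N, 256)
\end{verbatim}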
\subsubsection{Semantic path}
Sentence-BERT (SBERT) \cite{reimers2019sentencebert} is a text embedding module that takes a sentence as input, analyzes it and produces two kinds of outputs: a cls-token vector, which is a representation of the entire sentence, and a series of embedding vectors, each representing one word of the input sentence in its context. A mean-token vector can be created from this secondary output by applying the attention mask to the series of tokens and averaging them into a single vector. This way, two separate semantic embeddings can be generated for each input sentence: a cls-token and a mean-token.\\
The semantic path consists of two modules. The first is an SBERT module, whose task is to transform the vocabulary, i.e. all possible class labels, into semantic embeddings. This differs from the original architecture in \cite{jasani2019skeleton}, where a Sent2Vec module \cite{Pagliardini_2018} is used. We use the mean-token output of the SBERT module rather than the cls-token as our semantic embedding because it resulted in better performance. The attribute network (AN) then transforms the semantic embeddings into semantic features by mapping them into the visual feature space. The AN is introduced in \cite{sung2018learning}, where it contributes a significant part to the solution of the ZSL task together with the relation network (RN), which is explained in more detail in the following section. We apply dropout with a factor of 0.5 to the first layer of the AN.
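For illustration, the two embeddings of a label can be computed from the token outputs and the attention mask roughly as follows (a minimal sketch; the concrete Hugging Face checkpoint name is an assumption, not necessarily the model we use):

\begin{verbatim}
import torch
from transformers import (AutoModel,
                          AutoTokenizer)

name = ("sentence-transformers/"
        "bert-base-nli-mean-tokens")
tok = AutoTokenizer.from_pretrained(name)
sbert = AutoModel.from_pretrained(name)

enc = tok(["A human crouches down by "
           "bending their knees."],
          padding=True, return_tensors="pt")
with torch.no_grad():
    out = sbert(**enc).last_hidden_state

cls_token = out[:, 0]  # whole-sentence token
mask = enc["attention_mask"]
mask = mask.unsqueeze(-1).float()
# masked average of the word tokens
mean_token = (out * mask).sum(1) / mask.sum(1)
\end{verbatim}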
\subsubsection{Similarity learning part}
Here, we first form relation pairs by pairwise concatenating the visual features of our sample with the semantic features of each class. These relation pairs are then fed into the relation network (RN) introduced in \cite{sung2018learning}, which is another MLP. We add an additional linear layer and apply dropout with a factor of 0.5 to the first and second layers. The RN applies a similarity metric in order to assess the similarity of the semantic and visual features within each relation pair. In contrast to previous work, we do not use a fixed similarity metric; instead, the RN learns a deep similarity metric during training, which was introduced and shown to improve performance in \cite{sung2018learning}. In this way, the RN computes a similarity score for each pair, representing the input sample's similarity to each possible class. The loss is the mean squared error (MSE) between the similarity scores and a one-hot representation of the ground truth.
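The following PyTorch sketch summarizes this part. The layer widths are assumptions chosen for illustration, since only the dropout factors and the additional layer are fixed by the description above:

\begin{verbatim}
import torch
import torch.nn as nn

# attribute network: semantic embedding
# (e.g. 768-d) -> visual feature space (256-d)
an = nn.Sequential(
    nn.Linear(768, 512), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512, 256), nn.ReLU())

# relation network: learned similarity metric
# on concatenated (visual, semantic) pairs
rn = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(64, 1), nn.Sigmoid())

def scores(vis, sem):
    # vis: (N, 256) visual features,
    # sem: (K, 768) class label embeddings
    s = an(sem)                      # (K, 256)
    v = vis[:, None, :].expand(
        -1, s.size(0), -1)
    c = s[None, :, :].expand(
        vis.size(0), -1, -1)
    pairs = torch.cat([v, c], dim=-1)
    return rn(pairs).squeeze(-1)     # (N, K)

# loss: MSE against one-hot ground truth, e.g.
# nn.functional.mse_loss(scores(v, s), onehot)
\end{verbatim}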
\subsection{Augmentation}
In this section, we present three approaches we use to perform data augmentation on the default class labels. The goal of these methods is to increase the visual information content in the semantic description of our gesture classes in order to improve classification performance. The individual methods are explained in more detail using the class with the default label ``Squat down'' as an example.

\subsubsection{Descriptive labels}
In a first step, we increase the information content by replacing the class labels, which in their original form consist of only one or two words, with a complete sentence. We use sentences that give a more precise description of the movements the person in the video makes when performing a particular gesture. In this way, the default label ``Squat down'' is manually augmented to create the new label ``\textit{A human crouches down by bending her knees}''. In the training process as well as in the testing phase, every default label is replaced by its manually written descriptive label.

\subsubsection{Multiple labels per class}
We now increase the information content of the semantic embeddings even further by labeling each gesture with several different descriptions. To this end, we manually create two additional descriptions for each gesture using different wording, so that each class now has three descriptive labels. An example label set is shown in table \ref{tab:multi_label}. The network computes a similarity score for each possible label, meaning that due to the increased vocabulary there are now three times as many similarity scores. In each iteration of the training process, the ground truth of a training video sample is randomly selected from one of the three possible labels. During inference, all three labels are considered correct if the network predicts one of them for the corresponding sample, as sketched below.

\begin{table}
\begin{center}
\begin{tabular}{|lc|}
\hline
\textbf{1:} & A human crouches down by bending their knees. \\
\hline
\textbf{2:} & A person is bending their legs to squat down. \\
\hline
\textbf{3:} & Someone crouches down from a standing position.\\
\hline
\end{tabular}
\end{center}
\caption{Three descriptive labels for the class ``Squat down''.}
\label{tab:multi_label}
\end{table}
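A minimal sketch of the random label selection and the relaxed evaluation follows; the helper functions and the consecutive grouping of the labels are our own illustration, not a fixed implementation detail:

\begin{verbatim}
import random
import torch

K, L = 35, 3  # training classes, labels each

def one_hot_target(class_idx):
    # pick one of the L descriptions at random
    j = random.randrange(L)
    t = torch.zeros(K * L)
    t[class_idx * L + j] = 1.0
    return t

def is_correct(sim_scores, class_idx):
    # any of the L labels of the true class
    # counts as a correct prediction
    pred = int(sim_scores.argmax())
    return pred // L == class_idx
\end{verbatim}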
\subsubsection{Automatic augmentation}
To reduce the manual annotation effort, we would like to generate the additional labels for the multi-label approach automatically. We therefore use an augmenter from \verb'nlpaug' \cite{ma2019nlpaug} with the RoBERTa language model \cite{liu2019roberta} to insert words into a manually created descriptive label. We use insertions rather than substitutions or deletions, since the automatic text augmentation was often unable to find multiple suitable synonyms for specific words, and deleting key words would lead to sentences that no longer describe the given action appropriately. An example label set is shown in table \ref{tab:auto_aug}. One can see that the reduced manual annotation effort sometimes comes at the cost of grammatically incorrect sentences.

\begin{table}
\begin{center}
\begin{tabular}{|lc|}
\hline
\textbf{Description}: & A human crouches down by \\
 & bending their knees. \\
\hline
\textbf{Augmentation 1:} & A \textit{small} human crouches \textit{duck}\\
 & down by bending their knees.\\
\hline
\textbf{Augmentation 2:} & A human crouches \textit{fall} \\
 & \textit{somewhat} by bending their knees.\\
\hline
\end{tabular}
\end{center}
\caption{Descriptive label and two automatic augmentations for ``Squat down''.}
\label{tab:auto_aug}
\end{table}
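The augmentation step can be sketched with a few lines of \verb'nlpaug' code; the \verb'roberta-base' checkpoint and the cap on insertions per sentence are assumptions made for illustration:

\begin{verbatim}
import nlpaug.augmenter.word as naw

# contextual word insertion with RoBERTa;
# no substitutions or deletions
aug = naw.ContextualWordEmbsAug(
    model_path="roberta-base",
    action="insert",
    aug_max=2)

desc = ("A human crouches down by "
        "bending their knees.")
# e.g. four automatic variants of one
# manually written description
variants = [aug.augment(desc)
            for _ in range(4)]
# note: recent nlpaug versions return a
# list per call
\end{verbatim}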
\subsection{Experiments}
In this work we use the NTU RGB+D 120 dataset \cite{Liu_2020}, which contains 3D skeleton data for 114,480 samples of 120 different human action classes. For our tests we use 40 gesture classes from this dataset. We train the GCN exclusively on the 80 remaining classes, which ensures that the unseen gestures have not already appeared at some early point in the training process before inference.
In order to evaluate an augmentation method, we perform training runs on eight random 35/5 (seen/unseen) splits, chosen such that every class is unseen in exactly one training run. During training, only the weights of the AN and RN modules are adjusted; all other modules remain unchanged after their individual training. After testing, the accuracies are averaged over the eight individual experiments. For each augmentation method we test the performance in two scenarios: in the ZSL scenario, the model predicts only among the unseen classes, while in the GZSL scenario it predicts among all classes (seen and unseen). In the latter we measure the accuracy for seen and unseen samples, as well as their harmonic mean, following recent works \cite{jasani2019skeleton}.
For the default and descriptive labels, we train our network with a batch size of 32 and without batch norm, as was done in the original paper \cite{sung2018learning}. For the multi-label approaches, however, we use a batch size of 128 and batch norm at the input of the RN. This was done mainly for performance reasons, because the multi-label approach with more than three labels did not learn anything at all without batch norm.
%batchnorm in general -> decrease in unseen
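For reference, the harmonic mean reported in the GZSL scenario is computed from the seen and unseen accuracies as
\begin{equation}
h = \frac{2 \cdot \mathrm{acc}_{\mathrm{seen}} \cdot \mathrm{acc}_{\mathrm{unseen}}}{\mathrm{acc}_{\mathrm{seen}} + \mathrm{acc}_{\mathrm{unseen}}},
\end{equation}
and the eight splits can be generated by shuffling the 40 classes once and taking consecutive groups of five as the unseen set, as in the following sketch:

\begin{verbatim}
import random

classes = list(range(40))
random.shuffle(classes)
# eight 35/5 splits; every class is unseen
# in exactly one of them
splits = []
for i in range(0, 40, 5):
    unseen = classes[i:i + 5]
    seen = [c for c in classes
            if c not in unseen]
    splits.append((seen, unseen))

def harmonic_mean(acc_seen, acc_unseen):
    return (2 * acc_seen * acc_unseen
            / (acc_seen + acc_unseen))
\end{verbatim}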
\section{Results}

\begin{table}
\begin{center}
\begin{tabular}{|l|c|c|c|c|}
\hline
Approach & ZSL & Seen & Unseen & h\\
\hline\hline
Baseline & 0.4739 & 0.8116 & 0.1067 & 0.1877\\
Descriptive & 0.5186 & 0.8104 & 0.1503 & 0.2495\\
Multiple & \textbf{0.6558} & 0.8283 & \textbf{0.2182} & \textbf{0.3417}\\
Automatic & 0.5865 & \textbf{0.8290} & 0.1856 & 0.3003\\
\hline
\end{tabular}
\end{center}
\caption{ZSL and GZSL results for different approaches.}
\label{tab:ZSL_GZSL}
\end{table}

\begin{table}
\begin{center}
\begin{tabular}{|l|c|c|}
\hline
Approach & top-1 ${\pm}$ std & top-5 ${\pm}$ std \\
\hline\hline
Baseline & ${0.1067\pm 0.0246}$ & ${0.5428\pm 0.0840}$ \\
Descriptive & ${0.1503\pm 0.0553}$ & ${0.6460\pm 0.1250}$ \\
Multiple & ${\textbf{0.2182}\pm 0.0580}$ & ${\textbf{0.8580}\pm 0.0657}$ \\
Automatic & ${0.1856\pm 0.0499}$ & ${0.8272\pm 0.0476}$ \\
\hline
\end{tabular}
\end{center}
\caption{Unseen top-1 and top-5 accuracies in detail.}
\label{tab:top1_top5}
\end{table}

All our results were generated following the procedure described in the Experiments section. Table \ref{tab:ZSL_GZSL} shows the ZSL, seen and unseen accuracies, as well as the harmonic mean. Table \ref{tab:top1_top5} gives a more detailed view of the achieved unseen accuracies, listing the top-1 and top-5 accuracies of our approaches with their standard deviation over the eight splits. The descriptive labels improve the ZSL accuracy, the unseen accuracy and the harmonic mean, and the approach with three descriptive labels improves them even further. Using one manually created descriptive label and four automatic augmentations of this description in a multi-label approach performs worse than three descriptive labels, but still constitutes a relative 23\% increase over using only one descriptive label. The seen accuracy is quite similar for all approaches; still, it is slightly higher for the multi-label approaches, which is due to the use of batch norm: batch norm always raised the seen accuracy slightly at the cost of lowering the unseen accuracy.\\
In the more detailed table one can see that the top-5 accuracies increase similarly to their top-1 counterparts, with the exception of a less severe performance decrease when using automatic augmentations; this behavior was often observed in experiments with the multi-label approach. As for the standard deviations, all approaches based on descriptive labels are in the same range for the top-1 accuracies. For the top-5 accuracies, the standard deviation even decreases as the accuracy values rise, which indicates a higher consistency of the multi-label approach.

\subsection{Discussion}

\subsubsection{From default to descriptive labels}
The improvement of the descriptive labels over the default labels shows that incorporating more visual information into the semantic embedding by using visual descriptions as class labels helps the network to find a general relation between the semantic and the visual space learned only on the seen training data. Plainly speaking, the network can find more similarities between describing sentences than between one- or two-word labels. One would expect this to already be possible with the default labels due to the use of text embeddings, but the issue lies in the way the embedding modules are trained. The embeddings of the class labels ``sit down'' and ``drink water'' might be somewhat similar because those words frequently appear together in the large training text corpora, but visually those classes look vastly different from each other. The embeddings falsely suggest a similarity between the classes, which is less likely to happen if the embeddings are created from visual descriptions of the actions.

\subsubsection{Using multiple labels}
For the multiple-labels approach, the idea is somewhat different. The main motivation here is that using more data is generally a good idea. In our case, the network is forced to learn a more general mapping between the semantic and the visual feature space, since the descriptions, and therefore also the embeddings, change randomly during training. It has to adapt to the greater diversity of the label embeddings. This improved generalization on the seen training data then helps the network to better understand and classify the unseen samples.

\subsubsection{Automatic augmentation}
As described in section \ref{method}, using automatic augmentation methods introduces diversity into the different embeddings. Since this diversity is not focused purely on the visual description of the classes, and therefore differs from the manually created multi labels, it could be modeled as noise. In contrast to adding random noise to the embedding vector, however, it keeps the semantic information and relationships intact, which helps the network to generalize its mapping. Experiments using only random noise to generate diverse label embeddings led to no performance improvement in top-1 accuracy.

\section{Conclusion}

In this work, we highlighted the importance of the semantic embeddings in the context of skeleton-based zero-shot gesture recognition by showing how the performance can be increased based solely on the augmentation of those embeddings. By including more visual information in the class labels and combining multiple descriptions per class, we improved the model based on \cite{jasani2019skeleton} by a significant margin. The use of automatic text augmentation methods like \cite{ma2019nlpaug} already reduces the effort of manual annotation significantly, while maintaining most of the performance gain.

Future works could further investigate the following topics: First, descriptive sentences could be generated from the default labels using methods from natural language processing (NLP) to further reduce the manual annotation effort. Second, additional tests on different zero-shot architectures could verify the improvements shown in our work. Finally, different kinds or combinations of automatic text augmentation methods could be evaluated. With these advances, data augmentation of the semantic embeddings could prove useful in optimizing the performance of any zero-shot learning approach in the future.

{\small
\bibliographystyle{ieee_fullname}
\bibliography{egbib}
}

\end{document}