\documentclass[10pt,twocolumn,letterpaper]{article}

\usepackage{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}

% Include other packages here, before hyperref.

% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex.  (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[breaklinks=true,bookmarks=false]{hyperref}

\cvprfinalcopy % *** Uncomment this line for the final submission

\def\cvprPaperID{****} % *** Enter the CVPR Paper ID here
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}

% Pages are numbered in submission mode, and unnumbered in camera-ready
%\ifcvprfinal\pagestyle{empty}\fi
\setcounter{page}{1}
\begin{document}

%%%%%%%%% TITLE
\title{Data Augmentation of Semantic Embeddings for Skeleton-based Zero-Shot Gesture Recognition}

\author{David Heiming\\
Karlsruhe Institute of Technology\\
{\tt\small uween@student.kit.edu}
\and
Hannes Uhl\\
Karlsruhe Institute of Technology\\
{\tt\small ujjmv@student.kit.edu}
\and
Jonas Linkerhägner\\
Karlsruhe Institute of Technology\\
{\tt\small uoega@student.kit.edu}
}

\maketitle
%\thispagestyle{empty}

%%%%%%%%% ABSTRACT
\begin{abstract}
\noindent Interaction with computer systems is one of the most important topics of the digital age.
Interfacing with a system through body movements rather than tactile controls can provide significant advantages. To make that possible, the system needs to reliably detect the performed gestures. Systems using conventional deep learning methods are therefore trained on all possible gestures beforehand. Zero-Shot learning models, on the other hand, aim to also recognize gestures not seen during training when given their labels. The model thus needs to extract information about an unseen gesture's visual appearance from its label. Using typical text embedding modules like BERT, that information will be focused on the semantics of the label rather than its visual characteristics. In this work, we present several forms of data augmentation that can be applied to the semantic embeddings of the class labels in order to increase their visual information content. This approach achieves a significant performance increase for a Zero-Shot gesture recognition model.
\end{abstract}

%%%%%%%%% BODY TEXT
\section{Introduction}
\noindent Gesture recognition in videos is a rapidly growing field of research and is becoming an important component of input-device-less control of consumer products such as drones or televisions. While various past works have focused on the classification of gestures known in advance \cite{marinov2021pose2drone, kopuk2019realtime}, this work deals with gesture recognition using Zero-Shot learning. This approach makes it possible to use unseen gestures, i.e. gestures that the model has not seen during training. The user of the product is thus offered the opportunity to expand the command set for controlling the device.
In order to classify samples of an unseen class, a network needs to form an expectation of what the gesture looks like based on its label. This is usually done through the use of text embeddings \cite{estevam2020zeroshot}: trained on unannotated text data, language embedding models extract meaning from words or sentences by converting them into a semantic embedding vector. After creating a semantic embedding for each class label, it is possible to compare the embeddings of an unseen class with those of the seen classes to find similarities between them. Based on those similarities, the network can construct an expectation of what a sample of that unseen class might look like.

It is quite common to apply data augmentation techniques such as cropping, scaling or flipping to the video input of a network in order to increase the amount of available training samples \cite{perez2017effectiveness}. However, in Zero-Shot learning there are two different, equally important kinds of training information for each class: visual and semantic. The common data augmentation strategies only multiply the amount of visual training data, while the semantic information remains minimal, usually restricted to the simple label of the class. We aim to provide the network with more relevant semantic information about the different classes by applying several forms of data augmentation to the semantic embeddings of the class labels.

\section{Method}
\label{method}
\noindent First we build a network capable of Zero-Shot learning for gesture recognition.
Then we define different forms of data augmentation for the semantic embeddings of the class labels and specify the experimental setting.

\begin{figure}[t]
\begin{center}
\includegraphics[width=1\linewidth]{Architektur6.png}
\end{center}
\caption{Overview of the network modules.}
\label{architecture}
\end{figure}

\subsection{Architecture}
\noindent The architecture chosen for our experiments largely corresponds to the model presented in \cite{jasani2019skeleton}. We rebuild its modular architecture using the information published in the paper, replacing or slightly modifying certain modules to fit our specific task. Here, we only give a brief overview of how the model approaches the Zero-Shot task and which changes we make. For detailed information on the network modules, which are illustrated in Figure \ref{architecture}, refer to \cite{jasani2019skeleton, yan2018spatial, reimers2019sentencebert, sung2018learning}.
As visual input we use a temporal series of skeletons instead of RGB videos, removing unnecessary details such as the background or a person's clothing. Each skeleton is a graph whose nodes represent the person's joints, and a full input sample consists of one skeleton graph per frame. Such skeleton data can be obtained from RGB video using a framework like \emph{OpenPose} \cite{cao2019openpose}. To extract visual features from these input samples, a Graph Convolutional Network (GCN) \cite{yan2018spatial} is used. It consists of 9 spatial-temporal graph convolution layers with residual connections. The resulting output is a 256-dimensional vector containing the visual features.

Parallel to this visual path, a semantic feature extraction of the vocabulary, i.e. all possible class labels, is performed in two steps.
First, a \emph{Sentence-BERT} (SBERT) module \cite{reimers2019sentencebert} transforms the class labels into semantic embeddings. This differs from the original architecture in \cite{jasani2019skeleton}, where an older \emph{Sent2Vec} module \cite{Pagliardini_2018} is used. The SBERT module takes a sentence as input and yields two kinds of outputs: a cls-token vector, which is a representation of the entire sentence, and a series of embedding vectors, each representing one word of the input sentence in its context. A 768-dimensional mean-token vector can be created from this secondary output by applying an attention mask to the series of tokens and averaging them into a single vector. We use the mean-token output of the SBERT module instead of the cls-token as our semantic embedding because it resulted in better performance. In the second step, the attribute network (AN) transforms the semantic embeddings into semantic features by mapping them into the 256-dimensional visual feature space. Compared to its original form in \cite{sung2018learning}, we apply dropout with a factor of 0.5 to the first layer of this multilayer perceptron (MLP).
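As an illustration (not part of our implementation), the masked mean pooling over token embeddings can be sketched as follows; the array names are hypothetical, and a toy dimension of 2 stands in for the 768-dimensional SBERT output:

```python
import numpy as np

def mean_token_embedding(token_embeddings, attention_mask):
    """Combine per-token embeddings into one sentence embedding by
    averaging only over real (non-padding) tokens.

    token_embeddings: (seq_len, dim) array of contextual token vectors
    attention_mask:   (seq_len,) array, 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    count = mask.sum()                              # number of real tokens
    return summed / count                           # (dim,) mean-token vector

# toy example: 3 real tokens and 1 padding token, dim = 2
tokens = np.array([[1.0, 0.0], [3.0, 0.0], [2.0, 0.0], [9.0, 9.0]])
mask = np.array([1, 1, 1, 0])
print(mean_token_embedding(tokens, mask))  # → [2. 0.]
```

The padding row is zeroed out by the mask, so it contributes neither to the sum nor to the token count.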
Finally, the visual and semantic feature outputs are combined by forming relation pairs. Each pair is a concatenation of the visual features of our input sample with the semantic features of one class. These relation pairs are then fed into the relation network (RN) introduced in \cite{sung2018learning}. The RN applies a similarity metric to assess the resemblance of the semantic and visual features within each relation pair. This way, it computes a similarity score for each pair, which represents the input sample's similarity to the corresponding class.
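A minimal sketch of the pair formation and scoring step, with a random linear map plus sigmoid standing in for the trained RN (all names and the stand-in scorer are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def relation_scores(visual_feat, semantic_feats, score_fn):
    """Concatenate one sample's visual features with each class's
    semantic features and score every resulting relation pair."""
    pairs = np.stack([np.concatenate([visual_feat, s]) for s in semantic_feats])
    return np.array([score_fn(p) for p in pairs])  # one score per class

# toy stand-in for the trained relation network: fixed linear map + sigmoid
W = rng.normal(size=512)  # pair dimension: 256 visual + 256 semantic
def toy_rn(pair):
    return 1.0 / (1.0 + np.exp(-pair @ W / 512))

visual = rng.normal(size=256)          # GCN output for one sample
semantic = rng.normal(size=(5, 256))   # AN output for 5 class labels
scores = relation_scores(visual, semantic, toy_rn)
print(scores.shape)  # one similarity score per candidate class
```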
Then, the similarity scores are compared to a one-hot vector representing the ground-truth class using a mean squared error (MSE) loss. In contrast to previous works, this architecture does not use a fixed similarity metric. Instead, the RN is an MLP that learns a deep similarity metric during training, which was introduced and shown to improve performance in \cite{sung2018learning}. We add an additional linear layer to the RN and apply dropout with a factor of 0.5 to the first and second layer.

\subsection{Data augmentation}

\noindent In this section, we present three data augmentation methods for the semantic embeddings of the default class labels provided by the dataset. We apply these methods directly to the labels, because they are more tangible than the abstract embeddings.
This still results in an augmentation of the semantic embeddings, since they are created from the labels. The goal of these methods is to increase the visual information content of the semantic embeddings of our gesture classes in order to improve the classification performance. To demonstrate them, we apply each augmentation to the class with the default label ``squat down'' as an example.

\subsubsection{Descriptive labels}
In a first step, we provide more visual information by substituting the class labels, which in their original form mostly consist of one or two words, with a complete sentence. We use sentences that give a more precise description of the movements required to perform a particular gesture. This way, the default label ``squat down'' is manually augmented to create the new descriptive label ``A human crouches down by bending their knees''. During training and testing, every default label is replaced by its manually written descriptive counterpart.

\subsubsection{Multiple labels per class}
We now increase the information content of the semantic embeddings even further by labeling each gesture with several different descriptions. To this end, we manually create additional descriptions that use different wording for each gesture. An example with three descriptive labels per class is shown in Table \ref{tab:multi_label}. Since the network computes a similarity score for each possible label, the expanded vocabulary triples the number of similarity scores. In each iteration of the training process, the ground truth of a sample is randomly selected from one of the three possible labels. During inference, all three possibilities are considered correct if the network predicts one of them for the corresponding sample.
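The random label selection during training and the relaxed evaluation during inference can be sketched as follows; the Python names are illustrative, and the label set is taken from Table \ref{tab:multi_label}:

```python
import random

# three descriptions per gesture class (example set for "squat down")
label_sets = {
    "squat down": [
        "A human crouches down by bending their knees.",
        "A person is bending their legs to squat down.",
        "Someone crouches down from a standing position.",
    ],
}

def sample_ground_truth(class_name):
    """Training: the ground truth of a sample is drawn at random
    from the class's description set in every iteration."""
    return random.choice(label_sets[class_name])

def is_correct(predicted_label, class_name):
    """Inference: predicting any of the class's descriptions counts
    as a correct classification."""
    return predicted_label in label_sets[class_name]
```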
\begin{table}
\begin{center}
\begin{tabular}{|lc|}
\hline
\textbf{1:} & A human crouches down by bending their knees. \\
\hline
\textbf{2:} & A person is bending their legs to squat down. \\
\hline
\textbf{3:} & Someone crouches down from a standing position.\\
\hline
\end{tabular}
\end{center}
\caption{Three descriptive labels for the class ``squat down''.}
\label{tab:multi_label}
\end{table}

\subsubsection{Automatic augmentation}
\label{autoaug}
To reduce the manual annotation effort, we now generate additional labels automatically for the multiple labels approach. For this purpose, we use an augmenter from \emph{nlpaug} \cite{ma2019nlpaug} with the \emph{RoBERTa} language model \cite{liu2019roberta} to insert words into a manually created descriptive label. We do not use word substitutions, since it is often impossible for the automatic text augmentation to find multiple suitable synonyms for specific words. Word deletions are also suboptimal, because removing key words leads to a sentence that does not describe the given action appropriately. An example label set is shown in Table \ref{tab:auto_aug}. One can see that the reduced manual annotation effort sometimes comes at the cost of grammatically incorrect sentences.

\begin{table}
\begin{center}
\begin{tabular}{|lc|}
\hline
\textbf{Description}: & A human crouches down by \\
& bending their knees. \\
\hline
\textbf{Augmentation 1:} & A \textit{small} human crouches \textit{duck}\\
& down by bending their knees.\\
\hline
\textbf{Augmentation 2:} & A human crouches \textit{fall} down \\
& \textit{somewhat} by bending their knees.\\
\hline
\end{tabular}
\end{center}
\caption{Descriptive label and two automatic augmentations for ``squat down''.}
\label{tab:auto_aug}
\end{table}

\subsection{Experiments}

\noindent In this work, we use the \emph{NTU RGB+D 120} dataset \cite{Liu_2020}, which contains 3D skeleton data for 114,480 samples of 120 different human action classes. To evaluate our model, we pick a subset of 40 gesture classes and execute four performance tests: one with the default labels as a baseline, and one per augmentation method. A performance test consists of eight training runs on 35/5 (seen/unseen) splits, which are randomized in such a way that every class is unseen in exactly one training run.

During a training run, only the weights of the AN and RN modules are adjusted. The visual feature extractor is trained beforehand on the 80 unused classes of the \emph{NTU} dataset to ensure that the unseen gestures do not appear at any stage of the training process. The SBERT module has already been trained on large text corpora by \emph{Sentence-Transformers} \cite{reimers2019sentencebert}.
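The split randomization, where every class is unseen in exactly one of the eight runs, can be sketched as follows (a minimal illustration with placeholder class names, not our actual experiment code):

```python
import random

def make_splits(classes, n_runs=8, n_unseen=5, seed=0):
    """Shuffle the class list once, then partition it into n_runs
    disjoint unseen groups so that every class is unseen exactly once."""
    shuffled = classes[:]
    random.Random(seed).shuffle(shuffled)
    splits = []
    for i in range(n_runs):
        unseen = shuffled[i * n_unseen:(i + 1) * n_unseen]
        seen = [c for c in shuffled if c not in unseen]
        splits.append((seen, unseen))
    return splits

classes = [f"class_{i}" for i in range(40)]   # 40 evaluated gesture classes
splits = make_splits(classes)                 # eight 35/5 (seen/unseen) splits
```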
We test the performance of each augmentation method in two scenarios: in the ZSL scenario, the model only predicts on the unseen classes, while in the GZSL scenario it predicts on all classes (seen and unseen). In the latter we measure the accuracy for seen and unseen samples, as well as the harmonic mean, following recent works \cite{gupta2021syntactically}. In each scenario the results are averaged over the eight individual training runs of a performance test. For default and descriptive labels, we train our network with a batch size of 32, as was done in the original paper \cite{sung2018learning}. When using multiple labels, we increase the batch size to 128 and add batch normalization at the input of the RN.

\section{Results}

\noindent All our results are generated following the procedure described in the experiments section. For the multiple labels approach, three manually created labels per class are used. The automatic augmentation approach utilizes five labels: one manually created label and four augmented versions. Table \ref{tab:ZSL_GZSL} shows the ZSL, seen and unseen accuracies, as well as the harmonic mean. Table \ref{tab:top1_top5} displays a more detailed view of the achieved unseen accuracies: the top-1 and top-5 accuracies for our approaches with their standard deviations (std) over the eight splits.

The descriptive labels improve the ZSL accuracy, the unseen accuracy and the harmonic mean. The accuracies increase even further with the multiple labels approach.
Using automatic augmentation performs worse than multiple manually created labels, but it still constitutes a relative 23\% increase in unseen accuracy over using only one descriptive label.

The seen accuracy stays within the same range, only experiencing a marginal increase for the two cases that use multiple labels. This behaviour, along with a decrease in unseen accuracy, is observed whenever batch normalization is applied to any of our approaches. Therefore it is only applied in the cases where multiple labels are used, because these require batch normalization in order for the training to converge.

Table \ref{tab:top1_top5} shows that the top-5 accuracies behave similarly to their top-1 counterparts, with the exception of a less severe performance decrease when using automatic augmentations. The standard deviations of the top-1 accuracies are in the same range for all approaches based on the descriptive labels. The standard deviation of the top-5 accuracies decreases for the multiple label approaches, which indicates a higher prediction consistency.
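The harmonic mean $h$ of seen accuracy $s$ and unseen accuracy $u$ is $h = 2su/(s+u)$; note that the tabulated values are averages over the eight splits, so they need not satisfy the formula exactly when applied to the averaged accuracies. A minimal sketch:

```python
def harmonic_mean(seen_acc, unseen_acc):
    """GZSL harmonic mean of seen and unseen accuracy: h = 2su/(s+u).
    It is dominated by the smaller of the two accuracies, so a model
    cannot score well by being accurate on seen classes alone."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

print(round(harmonic_mean(0.8, 0.2), 2))  # → 0.32
```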
\begin{table}[t]
\begin{center}
\begin{tabular}{|l|c|c|c|c|}
\hline
Augmentation & ZSL & Seen & Unseen & h\\
\hline\hline
Baseline & 0.4739 & 0.8116 & 0.1067 & 0.1877\\
Descriptive & 0.5186 & 0.8104 & 0.1503 & 0.2495\\
Multiple & \textbf{0.6558} & 0.8283 & \textbf{0.2182} & \textbf{0.3417}\\
Automatic & 0.5865 & \textbf{0.8290} & 0.1856 & 0.3003\\
\hline
\end{tabular}
\end{center}
\caption{ZSL and GZSL results for different approaches.}
\label{tab:ZSL_GZSL}
\end{table}

\begin{table}[t]
\begin{center}
\begin{tabular}{|l|c|c|}
\hline
Augmentation & top-1 ${\pm}$ std & top-5 ${\pm}$ std \\
\hline\hline
Baseline & ${0.1067\pm 0.0246}$ & ${0.5428\pm 0.0840}$ \\
Descriptive & ${0.1503\pm 0.0553}$ & ${0.6460\pm 0.1250}$ \\
Multiple & ${\textbf{0.2182}\pm 0.0580}$ & ${\textbf{0.8580}\pm 0.0657}$ \\
Automatic & ${0.1856\pm 0.0499}$ & ${0.8272\pm 0.0476}$ \\
\hline
\end{tabular}
\end{center}
\caption{Unseen top-1 and top-5 accuracies (GZSL).}
\label{tab:top1_top5}
\end{table}

\subsection{Discussion}

\subsubsection{From default to descriptive labels}
The improvement from the use of
descriptive labels shows that incorporating more visual information into the semantic embeddings helps the network to find a general relation between the semantic and the visual space. Plainly speaking, the network can find more similarities between the class labels. This is important since the expected visual features of an unseen class are determined based on the similarities between its label and the seen labels. One might expect these similarities to also be present in the embeddings of the default labels, because SBERT should be able to generate representative embeddings that share characteristics with similar classes. While such similarities are present in the SBERT embeddings, they are not focused on the visual appearance of the gestures. For example, the embeddings of the class labels ``sit down'' and ``drink water'' might be somewhat similar, because those words appear together frequently in the large text corpora that SBERT was trained on. Visually, however, those classes look vastly different from each other. The embeddings falsely suggest that a similarity between the classes exists, which is less likely to happen if the embeddings are created from visual descriptions of the actions.

\subsubsection{Using multiple labels}
When using multiple labels, the idea is somewhat different. The main motivation is that training on larger amounts of data generally helps. Here, the description and therefore the embedding of each sample is chosen randomly among the three possibilities during training. This forces the network to assign a high similarity to all three labels corresponding to a sample, which leads to a more general mapping between the semantic and the visual feature space. The model has to adapt to the greater diversity of the used semantic embeddings. This improved generalization on seen training data then helps the network understand, and therefore classify, the unseen samples better.

For the methods using multiple labels per class, the batch size during training is increased from 32 to 128. Since the network needs to learn a mapping for a greater number of classes, increasing the batch size is necessary to find relations between more classes at once. Increasing the batch size does not benefit the single label approaches.
\subsubsection{Automatic augmentation}
Compared to multiple manually created labels, the individual labels of a class are very similar when using automatic augmentation, since only a few additional words are inserted for each version. The diversity in the semantic embeddings is therefore less pronounced, which leads to a worse performance. However, compared to the single labels, whose semantic embeddings contain no diversity at all, the performance is significantly better.

If diversifying the semantic embeddings is the key to improving the performance, one might expect that generating the additional embeddings by adding random noise to a single embedding could also work. This would obviate the need for a text augmentation module. However, this method does not improve the performance compared to the single label approach when tested on our model. This shows that a specific kind of diversity is needed to get an improvement.
Using word insertions clearly provides a suitable kind of diversity, since there is an improvement despite the resulting grammatical errors described in section \ref{method}.
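For illustration, a toy version of the insertion step can be sketched as follows; it is a simplified stand-in for the contextual augmentation module we actually use, with a purely hypothetical filler-word list:

```python
import random

def insert_words(label, fillers, k=2, seed=0):
    """Toy augmenter: insert k filler words at random positions.
    Like the real insertion step, the result may be ungrammatical."""
    rng = random.Random(seed)
    tokens = label.split()
    for _ in range(k):
        pos = rng.randrange(len(tokens) + 1)
        tokens.insert(pos, rng.choice(fillers))
    return " ".join(tokens)

fillers = ["slowly", "person", "then", "upward"]  # illustrative only
variants = [insert_words("stand up from a chair", fillers, seed=s)
            for s in range(3)]
```

All original words survive each insertion, so the mean-token embeddings of the variants stay close to one another while still differing slightly.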
\section{Conclusion}

\noindent In this work, we demonstrate the potential of applying data augmentation to the semantic embeddings of a Zero-Shot gesture recognition model. By including more visual information in the class labels and combining multiple descriptions per class, we are able to improve the performance of a model based on \cite{jasani2019skeleton} by a significant margin. The use of automatic text augmentation still leads to a sizable performance gain while keeping the manual annotation effort low.

Future work might investigate the following topics: Firstly, generating descriptive sentences from the default labels, e.g.\ by using methods from Natural Language Processing (NLP), would further reduce the manual annotation effort. Secondly, our methods could be tested on different Zero-Shot architectures to verify our improvements. Finally, different kinds or combinations of automatic text augmentation methods could be evaluated.
With these advances, data augmentation of the semantic embeddings can prove useful in optimizing the performance of any Zero-Shot approach in the future.

{\small
\bibliographystyle{ieee_fullname}
\bibliography{egbib}
}

\end{document}