Commit 62280d54 authored by Tediloma

paper revised, part 2

parent cf852020
......@@ -33,10 +33,12 @@
\citation{reimers2019sentencebert}
\citation{sung2018learning}
\citation{sung2018learning}
\citation{sung2018learning}
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Architecture of the network.}}{2}{figure.1}\protected@file@percent }
\newlabel{architecture}{{1}{2}{Architecture of the network}{figure.1}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {1.4}\hskip -1em.\nobreakspace {}Data augmentation}{2}{subsection.1.4}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {2}\hskip -1em.\nobreakspace {}Method}{2}{section.2}\protected@file@percent }
\newlabel{method}{{2}{2}{\hskip -1em.~Method}{section.2}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.1}\hskip -1em.\nobreakspace {}Architecture}{2}{subsection.2.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.1.1}Visual path}{2}{subsubsection.2.1.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.1.2}Semantic Path}{2}{subsubsection.2.1.2}\protected@file@percent }
......@@ -56,11 +58,11 @@
\@writefile{lot}{\contentsline {table}{\numberline {3}{\ignorespaces ZSL and GZSL results for different approaches.}}{3}{table.3}\protected@file@percent }
\newlabel{tab:ZSL_GZSL}{{3}{3}{ZSL and GZSL results for different approaches}{table.3}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.3}\hskip -1em.\nobreakspace {}Experiments}{3}{subsection.2.3}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {3}\hskip -1em.\nobreakspace {}Results}{3}{section.3}\protected@file@percent }
\citation{jasani2019skeleton}
\citation{ma2019nlpaug}
\@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces Unseen top-1 and top-5 accuracies in detail.}}{4}{table.4}\protected@file@percent }
\newlabel{tab:top1_top5}{{4}{4}{Unseen top-1 and top-5 accuracies in detail}{table.4}{}}
\@writefile{toc}{\contentsline {section}{\numberline {3}\hskip -1em.\nobreakspace {}Results}{4}{section.3}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {3.1}\hskip -1em.\nobreakspace {}Discussion}{4}{subsection.3.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.1}From default to descriptive labels}{4}{subsubsection.3.1.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.2}Using multiple labels}{4}{subsubsection.3.1.2}\protected@file@percent }
......
This is pdfTeX, Version 3.141592653-2.6-1.40.22 (MiKTeX 21.6) (preloaded format=pdflatex 2021.7.25) 25 JUL 2021 20:33
This is pdfTeX, Version 3.141592653-2.6-1.40.22 (MiKTeX 21.6) (preloaded format=pdflatex 2021.7.25) 26 JUL 2021 02:08
entering extended mode
**./paper_working_design.tex
(paper_working_design.tex
......@@ -391,28 +391,19 @@ Underfull \hbox (badness 10000) in paragraph at lines 72--75
<Architektur2.png, id=24, 885.6839pt x 440.77171pt>
File: Architektur2.png Graphic file (type png)
<use Architektur2.png>
Package pdftex.def Info: Architektur2.png used on input line 87.
Package pdftex.def Info: Architektur2.png used on input line 88.
(pdftex.def) Requested size: 213.4209pt x 106.21107pt.
[2 <./Architektur2.png>]
Underfull \hbox (badness 4181) in paragraph at lines 136--138
Underfull \hbox (badness 4181) in paragraph at lines 137--139
\OT1/ptm/m/n/10 To re-duce the man-ual an-no-ta-tion ef-fort, we would
[]
Underfull \hbox (badness 6477) in paragraph at lines 136--138
\OT1/ptm/m/n/10 like to gen-er-ate ad-di-tional la-bels au-to-mat-i-cally for
Underfull \hbox (badness 3039) in paragraph at lines 137--139
\OT1/ptm/m/n/10 multi la-bel ap-proach. There-fore we are us-ing the
[]
Underfull \hbox (badness 1888) in paragraph at lines 136--138
\OT1/ptm/m/n/10 the multi la-bel ap-proach. There-for we're us-ing the
[]
Underfull \vbox (badness 10000) has occurred while \output is active []
[3]
[4] (paper_working_design.bbl
[3] [4] (paper_working_design.bbl
Underfull \hbox (badness 10000) in paragraph at lines 29--32
[]\OT1/ptm/m/n/9 Edward Ma. Nlp aug-men-ta-tion.
[]
......@@ -421,13 +412,13 @@ Underfull \hbox (badness 10000) in paragraph at lines 29--32
] (paper_working_design.aux) )
Here is how much of TeX's memory you used:
9385 strings out of 478864
135257 string characters out of 2860441
459257 words of memory out of 3000000
27116 multiletter control sequences out of 15000+600000
9386 strings out of 478864
135265 string characters out of 2860441
459293 words of memory out of 3000000
27117 multiletter control sequences out of 15000+600000
428885 words of font info for 82 fonts, out of 8000000 for 9000
1141 hyphenation exceptions out of 8191
72i,13n,80p,1149b,362s stack positions out of 5000i,500n,10000p,200000b,80000s
72i,13n,80p,1230b,362s stack positions out of 5000i,500n,10000p,200000b,80000s
{K:/Programme/MiKTeX/fonts/enc/dvips/base/8r.enc}
<K:/Programme/MiKTeX/fonts/type1/public/amsfonts/cm/cmmi10.pfb><K:/Programme/Mi
KTeX/fonts/type1/public/amsfonts/cm/cmr10.pfb><K:/Programme/MiKTeX/fonts/type1/
......@@ -435,9 +426,9 @@ public/amsfonts/cm/cmsy10.pfb><K:/Programme/MiKTeX/fonts/type1/urw/courier/ucrr
8a.pfb><K:/Programme/MiKTeX/fonts/type1/urw/times/utmb8a.pfb><K:/Programme/MiKT
eX/fonts/type1/urw/times/utmr8a.pfb><K:/Programme/MiKTeX/fonts/type1/urw/times/
utmri8a.pfb>
Output written on paper_working_design.pdf (5 pages, 251827 bytes).
Output written on paper_working_design.pdf (5 pages, 252280 bytes).
PDF statistics:
134 PDF objects out of 1000 (max. 8388607)
136 PDF objects out of 1000 (max. 8388607)
41 named destinations out of 1000 (max. 500000)
6 words of extra memory for PDF output out of 10000 (max. 10000000)
......@@ -79,6 +79,7 @@ We aim to provide the network more relevant semantic information about the diffe
\section{Method}
\label{method}
First we need to build a network capable of zero-shot learning for gesture recognition, and assess the baseline performance. Then we apply different forms of data augmentation to the semantic embeddings of the class labels and compare the classification accuracy. For our tests we use 40 gesture classes from the NTU RGB+D 120 dataset.
\begin{figure}[t]
......@@ -98,23 +99,23 @@ The architecture consists of three parts described in the following sections.
The task of the visual path is the feature extraction of a video sample (in the form of a temporal series of 3D skeletons). The Graph Convolutional Net (GCN) from \cite{yan2018spatial} is used as the feature extractor, which in our case has been trained exclusively on the 80 unused classes of the NTU RGB+D 120 dataset. This ensures that the unseen gestures have not already appeared at some early point in the training process before inference. Further details on feature extraction can be found in the referenced paper \cite{yan2018spatial}. No significant changes were made to this part of the network.
\subsubsection{Semantic Path}
The semantic path consists of two modules. First is an SBERT module, that has the task of transforming the vocabulary, i.e. all possible class labels, into semantic embeddings. This is different from the original architecture in \cite{jasani2019skeleton}, where a Sent2Vec module is used. We chose to use the mean-token output of the SBERT module over the cls-token as our semantic embedding because it resulted in a better performance. The details of this model, which translates the class labels into representative 768-dimensional vectors, can be found in \cite{reimers2019sentencebert}. The attribute network (AN) then transforms the semantic embeddings into semantic features by mapping them into the visual feature space. The AN is introduced in \cite{sung2018learning} where it contributes a significant part to the solution of the ZSL task together along with the Relation Net (RN), which is explained in more detail in the following section. Minor changes have also been made to the AN, namely the output size of the first layer has been changed to 1200 and it's dropout factor to 0.5.
The semantic path consists of two modules. The first is an SBERT module, which transforms the vocabulary, i.e. all possible class labels, into semantic embeddings. This is different from the original architecture in \cite{jasani2019skeleton}, where a Sent2Vec module is used. We chose the mean-token output of the SBERT module over the cls-token as our semantic embedding because it resulted in better performance. The details of this model, which translates the class labels into representative 768-dimensional vectors, can be found in \cite{reimers2019sentencebert}. The attribute network (AN) then transforms the semantic embeddings into semantic features by mapping them into the visual feature space. The AN is introduced in \cite{sung2018learning}, where it contributes a significant part to the solution of the ZSL task along with the Relation Net (RN), which is explained in more detail in the following section. We apply dropout to the first layer of the AN with a factor of 0.5.
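As an illustration, the mean-token pooling described above can be sketched as follows; this is a minimal example assuming the HuggingFace \verb'transformers' library, and the checkpoint name is a placeholder rather than our exact SBERT model:
\begin{verbatim}
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder SBERT checkpoint; any 768-dimensional model works analogously.
name = "sentence-transformers/bert-base-nli-mean-tokens"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

labels = ["A human crouches down by bending her knees"]
enc = tokenizer(labels, padding=True, return_tensors="pt")

with torch.no_grad():
    tokens = model(**enc).last_hidden_state        # (batch, tokens, 768)

# Mean pooling over tokens, ignoring padding via the attention mask.
mask = enc["attention_mask"].unsqueeze(-1).float() # (batch, tokens, 1)
embedding = (tokens * mask).sum(1) / mask.sum(1)   # (batch, 768)
\end{verbatim}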
\subsubsection{Similarity-Learning-Part}
Here we first form the relation pairs by pairwise concatenating the visual features of our sample with the semantic features of each class. These relation pairs are are then fed into the relation netwoek (RN) introduced in \cite{sung2018learning}. The RN learns a deep similarity metric in order to assess the similarity of the semantic and visual features within each relation pair. This way, it computes a similarity score for each pair, which symbolizes the input sample's similarity to each possible class. Again, we refer to the originators of the architecture for more details. Following changes have been made to the RN: we added another linear layer and a drop-out layer in the MLP. Furthermore, a batch norm layer has also been introduced for some of our experiments. The latter one is used to speed up the training process.
The AN module as well as the RN module from [Sung2018L2C] are really trained during training process. All other modules mentioned remain unchanged after their pretraining.
Here we first form the relation pairs by pairwise concatenating the visual features of our sample with the semantic features of each class. These relation pairs are then fed into the relation network (RN) introduced in \cite{sung2018learning}, which is simply another MLP. The RN learns a deep similarity metric in order to assess the similarity of the semantic and visual features within each relation pair. This way, it computes a similarity score for each pair, which represents the input sample's similarity to each possible class. Again, we refer to the originators of the architecture for more details. We add an additional linear layer and apply dropout to the first and second layer with a factor of 0.5.
The AN module as well as the RN module from \cite{sung2018learning} are the only modules that are actually trained during our training process. All other modules mentioned remain unchanged after their individual training.
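A minimal PyTorch sketch of such a relation module is given below; the layer sizes and feature dimensions are illustrative placeholders, not our exact configuration:
\begin{verbatim}
import torch
import torch.nn as nn

class RelationNet(nn.Module):
    # Scores every (visual, semantic) relation pair with an MLP.
    def __init__(self, vis_dim=256, sem_dim=256, hidden=400):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + sem_dim, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # similarity score in [0, 1]
        )

    def forward(self, visual, semantic):
        # visual: (batch, vis_dim), semantic: (num_classes, sem_dim)
        b, c = visual.size(0), semantic.size(0)
        pairs = torch.cat([
            visual.unsqueeze(1).expand(b, c, -1),
            semantic.unsqueeze(0).expand(b, c, -1),
        ], dim=-1)                               # (batch, classes, vis+sem)
        return self.mlp(pairs).squeeze(-1)       # (batch, classes) scores
\end{verbatim}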
\subsection{Augmentation}
In the following, we present three approaches we used to perform data augmentation on the default class labels. The goal of these methods is to increase the information content in the semantic description of our gesture classes in order to improve the performance of the classification. The individual methods for data augmentation are explained in more detail using the example label "Squat Down".
In this section, we present three approaches we used to perform data augmentation on the default class labels. The goal of these methods is to increase the visual information content in the semantic description of our gesture classes in order to improve classification performance. The individual methods are explained in more detail using the example default label "Squat down".
\subsubsection{Descriptive labels}
In a first step, we tried to increase the information content by replacing the class labels, which in their original form only consist of one or two words, with a complete sentence. The short sentence should give a more precise description of the movements the person in the video makes when performing a particular gesture. In this way, the default label "Squat Down" was manually annotated to create the new label: "\textit{A human crouches down by bending her knees}". In the training process as well as in the inference phase, every default label was then replaced by its manually written counterpart label.
In a first step, we increase the information content by replacing the class labels, which in their original form consist of only one or two words, with a complete sentence. We use sentences that give a more precise description of the movements the person in the video makes when performing a particular gesture. In this way, the default label "Squat down" was manually annotated to create the new label: "\textit{A human crouches down by bending her knees}". In the training process as well as in the testing phase, every default label is replaced by its manually written descriptive label.
\subsubsection{Multiple labels per class}
Based on the promising results of the descriptive labels approach, we wanted to increase the information content in the semantic space even further. In another experiment, we therefore wanted to label each action with several different descriptions. To accomplish this, we manually created two additional descriptions for each gesture that were slightly different from each other. Consequently each class now has three descriptive labels. An example for the class "Squat down" can be seen in \ref{tab:multi_label}. In each iteration of the training process, the corresponding class label to a training video sample was randomly selected from one of the three possibilities. Later, during inference, all three possibilities were considered correct if the network predicted one of them for the corresponding video.
We now increase the information content of the semantic embeddings even further by labeling each gesture with several different descriptions. Thus, we manually created two additional descriptions for each gesture that use different wording. Consequently, each class now has three descriptive labels. An example for the class "Squat down" can be seen in Table \ref{tab:multi_label}. In each iteration of the training process, the ground truth label of a training video sample is randomly selected from one of the three possibilities. During inference, all three possibilities are considered correct if the network predicts one of them for the corresponding sample.
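To make the protocol concrete, a minimal sketch of the label handling is shown below; apart from the quoted example sentence, the additional descriptions and helper names are purely illustrative:
\begin{verbatim}
import random

# Hypothetical label store: three hand-written descriptions per class.
multi_labels = {
    "squat down": [
        "A human crouches down by bending her knees.",
        "A person lowers their body by bending both legs.",   # illustrative
        "Somebody goes down into a squatting position.",      # illustrative
    ],
    # ... one entry per gesture class
}

def training_label(class_name):
    # Each training iteration picks one of the descriptions at random.
    return random.choice(multi_labels[class_name])

def prediction_is_correct(predicted_label, true_class):
    # At test time, predicting any description of the class counts as correct.
    return predicted_label in multi_labels[true_class]
\end{verbatim}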
\begin{table}
\begin{center}
......@@ -133,8 +134,8 @@ Based on the promising results of the descriptive labels approach, we wanted to
\end{table}
\subsubsection{Automatic augmentation}
To reduce the manual annotation effort, we would like to generate additional labels automatically for the multi label approach. Therefor were using the \verb'ContextualWordEmbsAug' Augmenter with RoBERTa \cite{liu2019roberta} language model from \verb'nlpaug' \cite{ma2019nlpaug} to insert words into a descriptive label. We decided on insertions and not substitutions or deletions, since these did not perform well in our tests. % sentence for synonyms, only one can be found
For the class squat down an example can be seen in \ref{tab:auto_aug}. One can see, that the augmented sentences are not necessarily grammatically correct and less human readable. But as our semantic embedding is generated using a weighted average of the tokens of every word from SBERT with an attention mask, it introduces some kind of variance/diversity into the different embeddings of the descriptive labels. We expect this to perform worse compared to the three manually created descriptive label approach but still leading to some improvements compared to just using one descriptive label.
To reduce the manual annotation effort, we would like to generate additional labels automatically for the multi label approach. Therefore, we use the \verb'ContextualWordEmbsAug' augmenter from \verb'nlpaug' \cite{ma2019nlpaug} with the RoBERTa \cite{liu2019roberta} language model to insert words into a manually created descriptive label. We decided on insertions rather than substitutions or deletions, since it was often impossible for the automatic text augmentation to find multiple suitable synonyms for specific words, and deleting key words would lead to a sentence that no longer describes the given action.
An example for the class "Squat down" can be seen in Table \ref{tab:auto_aug}. One can see that the augmented sentences are not necessarily grammatically correct and are less human-readable. But as our semantic embedding is generated using a weighted average of the tokens of every word from SBERT with an attention mask, the augmentation introduces a certain diversity into the different embeddings of the descriptive labels. We expect this to perform worse than the previous approach, but better than the single descriptive label method, while requiring the same manual annotation effort as the latter.
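A minimal usage sketch of this augmentation is given below; the checkpoint name, augmentation probability and number of variants are assumptions for illustration rather than our exact settings:
\begin{verbatim}
import nlpaug.augmenter.word as naw

# Contextual word insertion with a RoBERTa language model.
aug = naw.ContextualWordEmbsAug(
    model_path="roberta-base",  # assumed checkpoint
    action="insert",            # insert words, no substitution/deletion
    aug_p=0.3,                  # fraction of candidate positions (illustrative)
)

label = "A human crouches down by bending her knees."
# Generate four automatically augmented variants of the manual description.
# (Recent nlpaug versions return a list from augment().)
variants = [aug.augment(label) for _ in range(4)]
\end{verbatim}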
\begin{table}
\begin{center}
......@@ -159,7 +160,7 @@ For the class squat down an example can be seen in \ref{tab:auto_aug}. One can s
\subsection{Experiments}
For evaluating our model, we do training runs on eight random 35/5 splits, which include every class once, such that every class is used as an unseen class once. The accuracies however are averaged of the eight individual experiments. For each approach we’re calculating the top-1 accuracy over only the 5 unseen classes (ZSL) and on seen and unseen test data and the harmonic mean, following recent works \cite{jasani2019skeleton} (GZSL). For default and descriptive labels, we train our Network with a batch size of 32 and without batch norm like in the original paper \cite{sung2018learning} For the multi labels however, we used a batch size of 128 and batch norm. This was mainly done due to performance reasons because the multi label approach with more than 3 labels did not learn anything without batch norm at all. %batchnorm in general -> decrease in unseen
In order to evaluate an augmentation method, we do training runs on eight random 35/5 (seen/unseen) splits, chosen such that every single class is unseen in exactly one training run. The accuracies are then averaged over the eight individual experiments. For each augmentation method we test the performance in two scenarios: in the ZSL scenario, the model only predicts on the unseen classes, while in the GZSL scenario it predicts on all classes (seen and unseen). In the latter we measure the accuracy for seen and unseen samples, as well as the harmonic mean, following recent works \cite{jasani2019skeleton}. For default and descriptive labels, we train our network with a batch size of 32 and without batch norm, as was done in the original paper \cite{sung2018learning}. For the multi labels however, we use a batch size of 128 and batch norm. This was done mainly for performance reasons, because the multi label approach with more than three labels did not learn at all without batch norm. %batchnorm in general -> decrease in unseen
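For clarity, the harmonic mean reported in the GZSL scenario is computed from the seen and unseen accuracies as
\begin{equation}
H = \frac{2 \cdot \mathrm{Acc}_{\mathrm{seen}} \cdot \mathrm{Acc}_{\mathrm{unseen}}}{\mathrm{Acc}_{\mathrm{seen}} + \mathrm{Acc}_{\mathrm{unseen}}} \, .
\end{equation}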
\section{Results}
......@@ -197,27 +198,29 @@ For evaluating our model, we do training runs on eight random 35/5 splits, which
\label{tab:top1_top5}
\end{table}
All our results were generated following the procedure described in the Experiments section. In table \ref{tab:ZSL_GZSL}, one can see the ZSL, seen and unseen accuracy and the harmonic mean. Table \ref{tab:top1_top5} shows a more detailed view on the unseen accuracies achieved. It shows the top-1 and top-5 accuracies for our approaches with standard deviation. One can see the baseline results in the line indicated by “Default Labels”. Improvements on the zsl accuracy, the unseen accuracy and the harmonic mean were achieved using the descriptive labels and the three descriptive labels approach. Using only one manually created descriptive label and four automatic augmentations of this description in a multi label approach achieves lower values compared to three descriptive labels, but still improves the unseen performance of using only one descriptive label by 23\%. The seen accuracy is quite similar for all approaches; still, it is slightly higher for the multi labels approaches, which occurred due to the use of batch norm. \\
In the more detailed table one can see, that the top-5 accuracy increases similar to the top-1 accuracies. The decrease however is much smaller when using the automatic augmentation. This behavior was often observed for experiments with the multi label approach. As for the standard deviations, one can see that for the top-1 all based on the descriptive labels are in the same range. For the top-5 accuracies we even get a decrease in standard deviation with higher accuracy values which shows the advantages of the multi label approach.
All our results were generated following the procedure described in the Experiments section. Table \ref{tab:ZSL_GZSL} shows the ZSL, seen and unseen accuracies, as well as the harmonic mean. Table \ref{tab:top1_top5} gives a more detailed view of the achieved unseen accuracies: it lists the top-1 and top-5 accuracies for our approaches together with their standard deviation over the eight splits. The baseline results are shown in the row labeled “Default Labels”. Improvements in the ZSL accuracy, the unseen accuracy and the harmonic mean were achieved using the descriptive labels, and even more so with the three descriptive labels approach. Using only one manually created descriptive label and four automatic augmentations of this description in a multi label approach achieves lower values than three descriptive labels, but still constitutes a relative 23\% increase over using only one descriptive label. The seen accuracy is quite similar for all approaches; still, it is slightly higher for the multi label approaches. This is caused by batch norm, which consistently raised the seen accuracy slightly at the cost of the unseen accuracy. \\
In the more detailed table, one can see that the top-5 accuracies increase similarly to their top-1 counterparts, with the exception of a less severe performance decrease when using automatic augmentations. This behavior was often observed in experiments with the multi label approach. As for the standard deviations, one can see that for the top-1 accuracies all approaches based on descriptive labels lie in the same range. For the top-5 accuracies we even observe a decrease in standard deviation alongside higher accuracy values, which indicates a higher consistency of the multi label approach.
\subsection{Discussion}
\subsubsection{From default to descriptive labels}
The improvement from the use of descriptive labels over the use of the default labels shows that incorporating more visual information into the semantic embedding by using visual descriptions as class labels helps the network to find a general relation between the semantic and the visual space learned only on the seen training data. Plainly speaking the network finds more similarities between the describing sentences compared to just one-word labels. Usually this should already be enabled by using text embedding techniques that were trained on large text corpora to find semantic relationships. But the problem with that is that the texts it was trained on contains the words used to describe motions in many different contexts and usually not visually describing it. The main reason for this is, that most humans do not need e.g. an explanation on what “stand up” looks like. But for our task the visual relationships are needed which could explain why using descriptive labels leads to improvements.
The improvement from descriptive labels over the default labels shows that incorporating more visual information into the semantic embedding, by using visual descriptions as class labels, helps the network to find a general relation between the semantic and the visual space that is learned only on the seen training data. Plainly speaking, the network can find more similarities between the describing sentences than between simple one- or two-word labels. One would expect this to already be possible with the default labels due to the use of text embeddings, but the issue lies in the way the embedding modules are trained. The embeddings of the class labels 'sit down' and 'drink water' might be somewhat similar, because those words frequently appear together in the large training text corpora, but visually those classes look vastly different from each other. The embeddings thus falsely suggest a similarity between the classes, which is less likely to happen if the embeddings are created from visual descriptions of the actions.
\subsubsection{Using multiple labels}
For the multi label approach the idea is little bit different. The main motivation here was that using more data is generally a good idea. In our case the network is forced to learn a more general mapping between the semantic and the visual feature space since the descriptions and therefor also the embeddings change during training randomly. It has to adapt to the greater variance of the used label semantic embeddings. This better generalization on seen training data then helps to better understand and also classify the unseen samples.
For the multi label approach the idea is a little bit different. The main motivation here was that using more data is generally a good idea. In our case the network is forced to learn a more general mapping between the semantic and the visual feature space, since the descriptions, and therefore also the embeddings, change randomly during training. It has to adapt to the greater diversity of the label embeddings used. This improved generalization on the seen training data then helps the network to better understand and classify the unseen samples.
\subsubsection{Automatic augmentation}
As described in Method, using automatic augmentation methods introduces diversity into the different embeddings. As this does not only focus on the visual description of the classes and therefore differs from the manually created multi labels, it could be modeled as noise. But in contrast to just adding random noise to the embedding vector, it keeps semantic information and relationships. This helps the network to generalize its mapping. Experiments using only random noise to generate diverse label embeddings lead to no performance improvements in top-1 accuracy.
As described in Section \ref{method}, using automatic augmentation methods introduces diversity into the different embeddings. Since this does not only focus on the visual description of the classes and therefore differs from the manually created multi labels, it can be modeled as noise. But in contrast to simply adding random noise to the embedding vector, it keeps semantic information and relationships intact. This helps the network to generalize its mapping. Experiments using only random noise to generate diverse label embeddings led to no performance improvements in top-1 accuracy.
\section{Conclusion}
In this work, we highlighted the importance of the semantic embeddings in the context of skeleton based Zero-Shot Gesture Recognition by using data augmentation of semantic embeddings. By including more visual information in the sentence labels that describe the classes and combining multiple descriptions per class we could improve the model based on \cite{jasani2019skeleton} by a significant margin. The use of automatic text augmentation methods like \cite{ma2019nlpaug} already reduces the effort of manual annotation significantly, while maintaining most of the performance. Together with a further reduction of the manual annotation effort in the future, data augmentation of the semantic embedding in Zero-Shot Learning could prove useful in optimizing the performance of any Zero-Shot approach.
In this work, we highlighted the importance of the semantic embeddings in the context of skeleton-based zero-shot gesture recognition by showing how the performance can be increased based only on the augmentation of those embeddings. By including more visual information in the class labels and combining multiple descriptions per class, we could improve the model based on \cite{jasani2019skeleton} by a significant margin. The use of automatic text augmentation methods like \cite{ma2019nlpaug} already reduces the manual annotation effort significantly, while maintaining most of the performance gain.
To achieve this, future works could further investigate the following topics: First, generating sentences from the default labels using methods from Natural Language Processing (NLP) could be implemented to further reduce the manual annotation effort. Second, additional tests on different zero-shot architectures to verify the improvements shown in our work, could be performed. Finally different kinds or combinations of automatic text augmentation methods could be evaluated.
Future work could further investigate the following topics: First, generating descriptive sentences from the default labels using methods from Natural Language Processing (NLP) could further reduce the manual annotation effort. Second, additional tests on different zero-shot architectures could be performed to verify the improvements shown in our work. Finally, different kinds or combinations of automatic text augmentation methods could be evaluated.
With these advances, data augmentation of the semantic embedding could prove useful in optimizing the performance of any zero-shot learning approach in the future.
......