title={RoBERTa: A Robustly Optimized BERT Pretraining Approach},
author={Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov},
year={2019},
journal={arXiv preprint arXiv:1907.11692}
}
@article{reimers2019sentencebert,
title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
\subsection{Augmentations}
\subsubsection{Automatic Augmentation}
To reduce the manual annotation effort, we would like to generate additional labels automatically for the multi-label approach. Therefore, we use the \verb'ContextualWordEmbsAug' augmenter with the RoBERTa \cite{liu2019roberta} language model from \verb'nlpaug' \cite{ma2019nlpaug} to insert words into a descriptive label. We decided on insertions rather than substitutions or deletions, since the latter did not perform well in our tests. (For substitutions with synonyms we would have expected better performance, but it turned out that there were not enough synonyms for the key words in our sentences.) For the class “squat down”, an example of the word insertions would be:\\
\noindent
{\bf Description:} A human crouches down by bending their knees.\\
...
...
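For illustration, a minimal sketch of how such insertions can be generated with \verb'nlpaug'; the parameters shown are illustrative and may differ from our exact configuration:
\begin{verbatim}
import nlpaug.augmenter.word as naw

# RoBERTa-based contextual augmenter in insertion mode: new words are
# inserted at positions where the language model finds them plausible.
aug = naw.ContextualWordEmbsAug(model_path='roberta-base', action='insert')

description = "A human crouches down by bending their knees."
print(aug.augment(description))
\end{verbatim}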
\subsection{Experiments}
For evaluating our model, we do training runs on eight random 35/5 splits that together cover every class once, such that each class is used as an unseen class exactly once. The reported accuracies are averaged over the eight individual experiments. For each approach we calculate the top-1 accuracy over only the five unseen classes (ZSL) as well as, following recent work \cite{jasani2019skeleton}, the accuracies on seen and unseen test data and their harmonic mean (GZSL). For default and descriptive labels, we train our network with a batch size of 32 and without batch normalization, as in the original paper \cite{sung2018learning}. For the multi labels, however, we used a batch size of 128 and batch normalization. This was mainly done for performance reasons, because with more than three labels the multi-label approach did not learn at all without batch normalization.
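Concretely, with $A_s$ and $A_u$ denoting the top-1 accuracies on the seen and unseen test data, the harmonic mean used in the GZSL evaluation is
\[
H = \frac{2 \, A_s A_u}{A_s + A_u},
\]
which only becomes large when both accuracies are high.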
\section{Results}
...
...
\end{tabular}
\end{center}
\caption{ZSL and GZSL results for different approaches.}
\label{tab:ZSL_GZSL}
\end{table}
\begin{table}
...
...
\end{tabular}
\end{center}
\caption{Unseen top-1 and top-5 accuracies results in detail.}
\label{tab:top1_top5}
\end{table}
All our results were generated following the procedure described in the Experiments section. Table \ref{tab:ZSL_GZSL} shows the ZSL accuracy as well as the seen and unseen accuracies and their harmonic mean. Table \ref{tab:top1_top5} gives a more detailed view of the unseen accuracies: the top-1 and top-5 accuracies for our approaches together with their standard deviations. The baseline results are given in the row “Default Labels”. Improvements on the ZSL accuracy, the unseen accuracy and the harmonic mean were achieved using the descriptive labels and the three descriptive labels approach. Using only one manually created descriptive label plus four automatic augmentations of this description in a multi-label approach achieves lower values than three descriptive labels, but still improves the unseen performance over a single descriptive label by 23\%. The seen accuracy is quite similar for all approaches; still, it is slightly higher for the multi-label approaches, which we attribute to the use of batch normalization. \\
In the more detailed table one can see that the top-5 accuracies increase similarly to the top-1 accuracies; the drop for the automatic augmentation, however, is much smaller. We observed this behavior frequently in experiments with the multi-label approach. As for the standard deviations, all approaches based on descriptive labels lie in the same range for the top-1 accuracy. For the top-5 accuracies, the standard deviation even decreases as the accuracy increases, which shows an advantage of the multi-label approach.
\subsection{Discussion}
\subsubsection{From default to descriptive labels}
The improvement from the use of descriptive labels over the default labels shows that incorporating more visual information into the semantic embedding, by using visual descriptions as class labels, helps the network find a general relation between the semantic and the visual space that is learned only on the seen training data. Plainly speaking, the network finds more similarities between the describing sentences than between one-word labels. Usually this should already be enabled by text embedding techniques that were trained on large text corpora to capture semantic relationships. The problem, however, is that the texts these embeddings were trained on contain the words used to describe motions in many different contexts, and usually do not describe them visually. The main reason for this is that most humans do not need an explanation of what, e.g., “stand up” looks like. For our task, however, the visual relationships are needed, which could explain why using descriptive labels leads to improvements.
\subsubsection{Using multiple labels}
For the multi-label approach the idea is a little different. The main motivation here was that using more data is generally a good idea. In our case the network is forced to learn a more general mapping between the semantic and the visual feature space, since the descriptions, and therefore also the embeddings, change randomly during training. It has to adapt to the greater variance of the label semantic embeddings used, as sketched below. This better generalization on the seen training data then helps the network to better understand, and with that classify, the unseen samples.
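A minimal sketch of this mechanism; the function and data structure shown are illustrative, not our actual implementation:
\begin{verbatim}
import random

# Illustrative: each class maps to several precomputed label embeddings;
# drawing one at random per training step makes the semantic input for
# the same class vary across iterations.
def sample_label_embedding(class_id, label_embeddings):
    # label_embeddings: dict mapping class_id -> list of embedding vectors
    return random.choice(label_embeddings[class_id])
\end{verbatim}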
\subsubsection{Automatic augmentation}
As described in the Method section, using automatic augmentation introduces a certain variance, or diversity, into the different embeddings. Since this variance does not focus solely on the visual description of the classes, and therefore differs from the manually created multi labels, it could be modeled as noise. In contrast to simply adding random noise to the embedding vector, however, it preserves semantic information and relationships. This still helps the network to generalize its mapping: experiments using only random noise to generate diverse label embeddings, as sketched below, led to no improvements in top-1 accuracy.
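The random-noise baseline can be sketched as follows; the noise scale \verb'sigma' is illustrative:
\begin{verbatim}
import numpy as np

# Illustrative baseline: perturb the single label embedding with Gaussian
# noise instead of using augmented descriptions; unlike the augmentations,
# this adds diversity without preserving semantic relationships.
def noisy_embedding(embedding, sigma=0.1):
    return embedding + np.random.normal(0.0, sigma, size=embedding.shape)
\end{verbatim}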
\section{Conclusion}
In this work, we presented a proof of concept for the importance of the semantic embeddings in skeleton-based zero-shot gesture recognition by applying data augmentation to the semantic embeddings. By including more visual information in the sentence labels that describe the classes and by combining multiple descriptions per class, we could improve the model based on \cite{jasani2019skeleton} by a significant margin. The use of automatic text augmentation methods like \cite{ma2019nlpaug} already reduces the manual annotation effort significantly while maintaining most of the performance. Together with a further reduction of the manual annotation effort in the future, data augmentation of the semantic embedding could prove useful in optimizing the performance of any zero-shot approach.
To achieve this, future work could investigate the following topics: First, generating sentences from the default labels using methods from Natural Language Processing (NLP) could further reduce the manual annotation effort. Second, additional tests on different zero-shot architectures could verify the improvements shown in our work. Finally, different kinds or combinations of automatic text augmentation methods could be evaluated.