Commit 6e9f42a3 authored by uoega

paper ver 2.5

parent d14ece55
......@@ -54,23 +54,23 @@
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.2.2}Multiple labels per class}{3}{subsubsection.2.2.2}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.2.3}Automatic augmentation}{3}{subsubsection.2.2.3}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.3}\hskip -1em.\nobreakspace {}Experiments}{3}{subsection.2.3}}
\@writefile{lot}{\contentsline {table}{\numberline {3}{\ignorespaces ZSL and GZSL results for different approaches.}}{3}{table.3}}
\newlabel{tab:ZSL_GZSL}{{3}{3}{ZSL and GZSL results for different approaches}{table.3}{}}
\@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces Unseen top-1 and top-5 accuracies (GZSL).}}{3}{table.4}}
\newlabel{tab:top1_top5}{{4}{3}{Unseen top-1 and top-5 accuracies (GZSL)}{table.4}{}}
\@writefile{toc}{\contentsline {section}{\numberline {3}\hskip -1em.\nobreakspace {}Results}{3}{section.3}}
\citation{jasani2019skeleton}
\citation{ma2019nlpaug}
\bibstyle{ieee_fullname}
\bibdata{egbib}
\bibcite{cao2019openpose}{1}
\bibcite{estevam2020zeroshot}{2}
\bibcite{gupta2021syntactically}{3}
\@writefile{lot}{\contentsline {table}{\numberline {3}{\ignorespaces ZSL and GZSL results for different approaches.}}{4}{table.3}}
\newlabel{tab:ZSL_GZSL}{{3}{4}{ZSL and GZSL results for different approaches}{table.3}{}}
\@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces Unseen top-1 and top-5 accuracies (GZSL).}}{4}{table.4}}
\newlabel{tab:top1_top5}{{4}{4}{Unseen top-1 and top-5 accuracies (GZSL)}{table.4}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.1}\hskip -1em.\nobreakspace {}Discussion}{4}{subsection.3.1}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.1}From default to descriptive labels}{4}{subsubsection.3.1.1}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.2}Using multiple labels}{4}{subsubsection.3.1.2}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.3}Automatic augmentation}{4}{subsubsection.3.1.3}}
\@writefile{toc}{\contentsline {section}{\numberline {4}\hskip -1em.\nobreakspace {}Conclusion}{4}{section.4}}
\bibcite{cao2019openpose}{1}
\bibcite{estevam2020zeroshot}{2}
\bibcite{gupta2021syntactically}{3}
\bibcite{jasani2019skeleton}{4}
\bibcite{kopuk2019realtime}{5}
\bibcite{Liu_2020}{6}
......
This is pdfTeX, Version 3.14159265-2.6-1.40.19 (MiKTeX 2.9.6840 64-bit) (preloaded format=pdflatex 2018.10.16) 29 JUL 2021 22:39
This is pdfTeX, Version 3.14159265-2.6-1.40.19 (MiKTeX 2.9.6840 64-bit) (preloaded format=pdflatex 2018.10.16) 30 JUL 2021 09:03
entering extended mode
**./paper_working_design.tex
(paper_working_design.tex
......@@ -400,17 +400,7 @@ LaTeX Font Info: Font shape `OT1/ptm/bx/n' in size <10> not available
Underfull \vbox (badness 4467) has occurred while \output is active []
[2 <./Architektur4.png>]
Underfull \hbox (badness 10000) in paragraph at lines 213--214
[]
Underfull \hbox (badness 10000) in paragraph at lines 215--216
[]
[3] (paper_working_design.bbl [4]
[2 <./Architektur4.png>] [3] (paper_working_design.bbl [4]
Underfull \hbox (badness 2080) in paragraph at lines 52--57
[]\OT1/ptm/m/n/9 Zdravko Mari-nov, Stanka Vasileva, Qing Wang, Con-
[]
......@@ -436,24 +426,24 @@ Underfull \hbox (badness 2941) in paragraph at lines 78--82
[]
)
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 260.
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 266.
[5
]
Package atveryend Info: Empty hook `AfterLastShipout' on input line 260.
Package atveryend Info: Empty hook `AfterLastShipout' on input line 266.
(paper_working_design.aux)
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 260.
Package atveryend Info: Empty hook `AtEndAfterFileList' on input line 260.
Package atveryend Info: Empty hook `AtVeryVeryEnd' on input line 260.
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 266.
Package atveryend Info: Empty hook `AtEndAfterFileList' on input line 266.
Package atveryend Info: Empty hook `AtVeryVeryEnd' on input line 266.
)
Here is how much of TeX's memory you used:
6269 strings out of 492970
92305 string characters out of 3126593
191406 words of memory out of 3000000
189406 words of memory out of 3000000
10015 multiletter control sequences out of 15000+200000
29095 words of font info for 69 fonts, out of 3000000 for 9000
1141 hyphenation exceptions out of 8191
32i,13n,27p,1173b,466s stack positions out of 5000i,500n,10000p,200000b,50000s
32i,13n,27p,1173b,374s stack positions out of 5000i,500n,10000p,200000b,50000s
{C:/Users/XPS15/AppData/Local/Programs/MiKTeX 2.9/fonts/enc/dvips/base/8r.enc
}<C:/Users/XPS15/AppData/Local/Programs/MiKTeX 2.9/fonts/type1/public/amsfonts/
cm/cmmi10.pfb><C:/Users/XPS15/AppData/Local/Programs/MiKTeX 2.9/fonts/type1/pub
......@@ -463,7 +453,7 @@ iKTeX 2.9/fonts/type1/urw/courier/ucrr8a.pfb><C:/Users/XPS15/AppData/Local/Prog
rams/MiKTeX 2.9/fonts/type1/urw/times/utmb8a.pfb><C:/Users/XPS15/AppData/Local/
Programs/MiKTeX 2.9/fonts/type1/urw/times/utmr8a.pfb><C:/Users/XPS15/AppData/Lo
cal/Programs/MiKTeX 2.9/fonts/type1/urw/times/utmri8a.pfb>
Output written on paper_working_design.pdf (5 pages, 340586 bytes).
Output written on paper_working_design.pdf (5 pages, 340968 bytes).
PDF statistics:
134 PDF objects out of 1000 (max. 8388607)
40 named destinations out of 1000 (max. 500000)
......
......@@ -168,15 +168,23 @@ To reduce the manual annotation effort, we would like to generate additional lab
In this work we use the NTU RGB+D 120 dataset \cite{Liu_2020}, which contains 3D skeleton data for 114,480 samples of 120 different human action classes. To evaluate our model we pick a subset of 40 gesture classes and execute four performance tests: one with our default labels as a baseline, and one per augmentation method. A performance test consists of eight training runs on 35/5 (seen/unseen) splits, randomized in such a way that every single class is unseen in exactly one training run.
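The split scheme can be made concrete with a short sketch; this is one plausible reading of the randomization described above, not the paper's actual code: shuffle the 40 classes once and cut them into eight disjoint groups of five, so each class is unseen in exactly one run.

```python
# Sketch: eight 35/5 seen/unseen splits of 40 classes, each class unseen once.
import random

def make_splits(num_classes=40, unseen_per_split=5, seed=0):
    classes = list(range(num_classes))
    random.Random(seed).shuffle(classes)
    splits = []
    for i in range(0, num_classes, unseen_per_split):
        unseen = set(classes[i:i + unseen_per_split])
        seen = [c for c in classes if c not in unseen]
        splits.append((sorted(seen), sorted(unseen)))
    return splits

splits = make_splits()
assert len(splits) == 8  # every class appears exactly once as unseen
```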
During a training run, only the weights of the AN and RN modules are adjusted. The GCN is trained beforehand on the 80 unused classes of the NTU dataset to ensure that the unseen gestures have not appeared in the training process at some early stage. The SBERT module has already been trained on large text corpora by \cite{reimers2019sentencebert}.
During a training run, only the weights of the AN and RN modules are adjusted. The GCN is trained beforehand on the 80 unused classes of the NTU dataset to ensure that the unseen gestures have not appeared at any earlier stage of the training process. The SBERT module has already been trained on large text corpora by \cite{reimers2019sentencebert}. %is used as provided by \cite
For testing, the accuracies are averaged over the eight individual experiments of a performance test. We test the performance in two scenarios for each augmentation method: In the ZSL scenario, the model only predicts on the unseen classes, while it predicts on all classes (seen and unseen) in the GZSL scenario. In the latter we measure the accuracy for seen and unseen samples, as well as the harmonic mean, following recent works \cite{gupta2021syntactically}. For default and descriptive labels, we train our Network with a batch size of 32 and without batch norm, as was done in the original paper \cite{sung2018learning}. When using multiple labels, we instead use a batch size of 128 and batch norm at the input of the RN.
For testing, the accuracies are averaged over the eight individual experiments of a performance test. We test the performance in two scenarios for each augmentation method: in the ZSL scenario, the model predicts only on the unseen classes, while in the GZSL scenario it predicts on all classes (seen and unseen). In the latter we measure the accuracy for seen and unseen samples, as well as the harmonic mean, following recent works \cite{gupta2021syntactically}. For default and descriptive labels, we train our network with a batch size of 32 and without batch normalization, as was done in the original paper \cite{sung2018learning}. When using multiple labels, we instead use a batch size of 128 and batch normalization at the input of the RN.
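For reference, the GZSL metrics described above can be sketched as follows; per-sample predicted and true labels, plus the set of unseen classes, are assumed to be available as arrays (names are illustrative).

```python
# Sketch of the GZSL evaluation: seen accuracy, unseen accuracy, harmonic mean.
import numpy as np

def gzsl_metrics(y_true, y_pred, unseen_classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    unseen = np.isin(y_true, list(unseen_classes))
    acc_unseen = (y_pred[unseen] == y_true[unseen]).mean()
    acc_seen = (y_pred[~unseen] == y_true[~unseen]).mean()
    harmonic = 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
    return acc_seen, acc_unseen, harmonic
```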
\section{Results}
\begin{table}
All our results are generated following the procedure described in the Experiments section. Table \ref{tab:ZSL_GZSL} shows the ZSL, seen, and unseen accuracies, as well as the harmonic mean. Table \ref{tab:top1_top5} gives a more detailed view of the achieved unseen accuracies: it shows the top-1 and top-5 accuracies for our approaches with their standard deviations (std) over the eight splits.
The descriptive labels improve the ZSL accuracy, the unseen accuracy and the harmonic mean, and the three-descriptive-labels approach improves them even further. Using one manually created descriptive label plus four automatic augmentations of this description in a multiple-label approach performs worse than three descriptive labels, but still constitutes a relative 23\% increase over using only one descriptive label. The seen accuracy experiences only a marginal increase for the two cases that use multiple labels. This behaviour, a slight increase in seen accuracy along with a decrease in unseen accuracy, is observed whenever batch normalization is applied to any of our approaches. It is therefore only applied in the cases where multiple labels are used, because these require batch normalization for the training to converge.
Table \ref{tab:top1_top5} shows that the top-5 accuracies behave similarly to their top-1 counterparts, except for a less severe decrease when using automatic augmentations. The standard deviations of the top-1 accuracies are in the same range for all approaches based on the descriptive labels. The standard deviation of the top-5 accuracies decreases for the multiple-label approaches, which indicates a higher prediction consistency.
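A minimal sketch of how the top-1/top-5 figures and their std could be computed from per-split class scores (function and variable names are assumptions for illustration):

```python
# Sketch: top-k accuracy per split, then mean and std over the eight splits.
import numpy as np

def topk_accuracy(scores, y_true, k=5):
    # scores: (num_samples, num_classes); y_true: (num_samples,)
    topk = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([y in row for y, row in zip(y_true, topk)]))

# accs = [topk_accuracy(s, y, k=5) for s, y in per_split_results]
# print(np.mean(accs), np.std(accs))
```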
\begin{table}[t]
\begin{center}
\begin{tabular}{|l|c|c|c|c|}
\hline
......@@ -193,7 +201,7 @@ For testing, the accuracies are averaged over the eight individual experiments o
\label{tab:ZSL_GZSL}
\end{table}
\begin{table}
\begin{table}[t]
\begin{center}
\begin{tabular}{|l|c|c|}
\hline
......@@ -210,20 +218,13 @@ For testing, the accuracies are averaged over the eight individual experiments o
\label{tab:top1_top5}
\end{table}
All our results are generated following the procedure described in the Experiments section. In table \ref{tab:ZSL_GZSL} one can see the ZSL, seen and unseen accuracies, as well as the harmonic mean. Table \ref{tab:top1_top5} displays a more detailed view of the achieved unseen accuracies. It shows the top-1 and top-5 accuracies for our approaches with their standard deviations (std) over the eight splits.\\
Improvements on the ZSL accuracy, the unseen accuracy and the harmonic mean were achieved using the descriptive labels and even more so with the three descriptive labels approach. Using only one manually created descriptive label and four automatic augmentations of this description in a multiple label approach, performs worse compared to three descriptive labels. But it still constitutes a relative 23\% increase over using only one descriptive label. The seen accuracy only experiences a marginal increase for the two cases that use multiple labels. This behaviour is observed whenever batch normalization is applied to any of our approaches along with a decrease in unseen accuracy. Therefore it is only applied in the cases were multiple labels are used because they require batch normalization in order for the training to converge. \\
Table \ref{tab:top1_top5} shows that the top-5 accuracies behave similarly to their top-1 counterparts, with the exception of a less severe decrease when using automatic augmentations. The standard deviations of the top-1 accuracies are in the same range for all approaches based on the descriptive labels. The standard deviation belonging to the top-5 accuracies decreases for the multiple label aproaches, which indicates a higher prediction consistency.
\subsection{Discussion}
\subsubsection{From default to descriptive labels}
The improvement from the use of descriptive labels shows that incorporating more visual information into the semantic embeddings helps the network to find a general relation between the semantic and the visual space based only on the seen training data. Plainly speaking, the network can find more similarities between the class labels. This is important since the assumed visual features of an unseen class are determined based on the similarities between its label and the seen labels.
One might expect these similarities to also be present in the embeddings of the default labels, because SBERT should be able to generate representative embeddings that share characteristics with similar classes.
While such similarities are present in the SBERT embeddings, they are not focused on the visual appearance of the gestures.
For example, the embeddings of the class labels 'sit down' and 'drink water' might be somewhat similar, because those words appear together frequently in the large training text corpora (which SBERT was trained on), but visually those classes look vastly different from each other. The embeddings falsely suggest, that a similarity between the classes is there, which is less likely to happen if the embeddings are created from visual descriptions of the actions.
For example, the embeddings of the class labels "sit down" and "drink water" might be somewhat similar, because those words frequently appear together in the large text corpora on which SBERT was trained, but visually those classes look vastly different from each other. The embeddings falsely suggest a similarity between the classes, which is less likely to happen if the embeddings are created from visual descriptions of the actions.
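This point can be probed directly with the sentence-transformers package; the model name and the descriptive sentences below are assumptions for illustration, not the paper's exact labels.

```python
# Sketch: compare SBERT similarities of default vs. descriptive labels.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")  # SBERT, mean pooling
default = model.encode(["sit down", "drink water"])
descriptive = model.encode([
    "a person bends the knees and lowers the body onto a chair",
    "a person raises one hand holding a cup to the mouth",
])
print(util.cos_sim(default[0], default[1]))          # default-label similarity
print(util.cos_sim(descriptive[0], descriptive[1]))  # expected lower for
                                                     # visually distinct classes
```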
%this to already be possible on the default labels due to the use of text embeddings, but the issue there lies with the way that the embedding modules are trained. The embeddings of the class labels 'sit down' and 'drink water' might be somewhat similar, because those words appear together frequently in the large training text corpora, but visually those classes look vastly different from each other. The embeddings falsely suggest, that a similarity between the classes is there, which is less likely to happen if the embeddings are created from visual descriptions of the actions. Usually this should already be enabled by using text embedding techniques that were trained on large text corpora to find semantic relationships. But the problem with that is that the texts it was trained on contains the words used to describe motions in many different contexts and usually not visually describing it. The main reason for this is, that most humans do not need e.g. an explanation on what “stand up” looks like. But for our task the visual relationships are needed which could explain why using descriptive labels leads to improvements.
......@@ -231,17 +232,22 @@ For example, the embeddings of the class labels 'sit down' and 'drink water' mig
%maybe briefly discuss the significance of the batch size 32->128 because of more classes since the unseen labels can only be categorized based on the relations between and the seen labels. to also be present in the default label by using visual descriptions as class labels
For the "multiple labels" approach the idea is somewhat different. The main motivation here was that using more data is generally a good idea. In our case the network is forced to learn a more general mapping between the semantic and the visual feature space since the descriptions and therefore also the embeddings change randomly during training. It has to adapt to the greater diversity of the used label semantic embeddings. This improved generalization on seen training data then helps to better understand and also classify the unseen samples.
When using multiple labels the idea is somewhat different. The main motivation here is that using larger amounts of data is generally a good idea. In our case the network is forced to learn a more general mapping between the semantic and the visual feature space, since the descriptions, and therefore also the embeddings, change randomly during training. It has to adapt to the greater diversity of the label semantic embeddings. This improved generalization on the seen training data then helps to better understand, and therefore classify, the unseen samples.
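A minimal sketch of this random selection; the class name and descriptions are made up for illustration:

```python
# Sketch: draw a random description per class at each training step, so the
# semantic embedding seen by the network varies during training.
import random

labels_per_class = {
    "hand waving": [
        "a person lifts one arm and moves it from side to side",
        "one arm swings left and right above the shoulder",
        "a raised hand sways repeatedly in the air",
    ],
    # ... one entry per seen class
}

def sample_label(class_name):
    return random.choice(labels_per_class[class_name])
```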
%------Jonas
To achieve this, the batch size during training is increased from 32 to 128. As the network needs to learn a mapping between a greater number of different class descriptions, increasing the batch size helps it to find relations between more classes at once. This benefit was not observed when using one label per class.
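A sketch of an RN head with batch normalization at its input, as used for the multiple-label runs; the layer sizes and activations are assumptions, not the paper's exact architecture.

```python
# Sketch (PyTorch): relation network head with batch norm at the input.
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, in_dim, hidden_dim=512):
        super().__init__()
        self.bn = nn.BatchNorm1d(in_dim)        # batch norm at the RN input
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, pair):  # concatenated semantic + visual features
        x = self.bn(pair)
        x = torch.relu(self.fc1(x))
        return torch.sigmoid(self.fc2(x))       # relation score in [0, 1]

# Trained with batch size 128 in the multiple-label setting (32 otherwise).
```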
%
\subsubsection{Automatic augmentation}
As described in chapter \ref{method}, using automatic augmentation methods introduces diversity into the different embeddings. As this does not only focus on the visual description of the classes and therefore differs from the manually created multi labels, it could be modeled as noise. But in contrast to just adding random noise to the embedding vector, it keeps semantic information and relationships intact. This helps the network to generalize its mapping. Experiments using only random noise to generate diverse label embeddings lead to no performance improvements in top-1 accuracy.
%------Jonas
As described in section \ref{method}, inserting words with automatic augmentation methods does not necessarily yield grammatically correct sentences. But as the semantic embeddings are generated using the SBERT mean-token, the averaging introduces diversity into the different embeddings. Still, the additional information usually does not focus on the visual description of the classes and therefore differs from the manually created multiple labels. It can rather be modeled as adding noise to the embedding vectors. But in contrast to just adding random noise, it keeps semantic information and relations between different embeddings intact. This helps the network to better generalize its mapping. Experiments using only random noise to generate diverse label embeddings led to no performance improvement in top-1 unseen accuracy.
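The augmentation-plus-embedding step can be sketched with the nlpaug package \cite{ma2019nlpaug} and sentence-transformers; the augmenter settings and model names are assumptions, not the paper's exact configuration.

```python
# Sketch: contextual word insertion, then SBERT mean-token embeddings.
import nlpaug.augmenter.word as naw
from sentence_transformers import SentenceTransformer

aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="insert")
base = "a person lifts one arm and moves it from side to side"
variants = [base] + aug.augment(base, n=4)  # 1 manual + 4 augmented labels

model = SentenceTransformer("bert-base-nli-mean-tokens")  # mean-token pooling
embeddings = model.encode(variants)  # insertions may break grammar, but the
                                     # averaged token embeddings stay related
```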
%------
%As described in chapter \ref{method}, using automatic augmentation methods introduces diversity into the different embeddings. As this does not only focus on the visual description of the classes and therefore differs from the manually created multi labels, it could be modeled as noise. But in contrast to just adding random noise to the embedding vector, it keeps semantic information and relationships intact. This helps the network to generalize its mapping. Experiments using only random noise to generate diverse label embeddings lead to no performance improvements in top-1 accuracy.
\section{Conclusion}
In this work, we highlight the importance of the semantic embeddings in the context of skeleton based zero-shot gesture recognition by showing how the performance can increase based only on the augmentation of those embeddings. By including more visual information in the class labels and combining multiple descriptions per class we can improve the model based on \cite{jasani2019skeleton} by a significant margin. The use of automatic text augmentation methods like \cite{ma2019nlpaug} already reduces the effort of manual annotation significantly, while maintaining most of the performance gain.
In this work, we highlight the importance of the semantic embeddings in the context of skeleton-based zero-shot gesture recognition by showing how the performance can increase based only on the augmentation of those embeddings. By including more visual information in the class labels and combining multiple descriptions per class, we improve the model based on \cite{jasani2019skeleton} by a significant margin. The use of automatic text augmentation methods \cite{ma2019nlpaug} already reduces the manual annotation effort significantly, while maintaining most of the performance gain.
Future works might further investigate the following topics: First, generating descriptive sentences from the default labels using methods from Natural Language Processing (NLP) could be implemented to further reduce the manual annotation effort. Second, additional tests on different zero-shot architectures to verify the improvements shown in our work can be performed. Finally, different kinds or combinations of automatic text augmentation methods can be evaluated.
Future work might further investigate the following topics: First, descriptive sentences could be generated from the default labels, e.g. using methods from Natural Language Processing (NLP), to further reduce the manual annotation effort. Second, additional tests on different zero-shot architectures could verify the improvements shown in our work. Finally, different kinds or combinations of automatic text augmentation methods could be evaluated.
With these advances, data augmentation of the semantic embeddings can prove useful in optimizing the performance of any zero-shot approach in the future.
......