Commit cf852020 authored by Tediloma

paper revised (up to similarity learning)

parent 8ed37647
......@@ -20,44 +20,52 @@
\citation{Liu_2020}
\citation{jasani2019skeleton}
\citation{reimers2019sentencebert}
\@writefile{toc}{\contentsline {section}{\numberline {1}\hskip -1em.\nobreakspace {}Introduction}{1}{section.1}}
\@writefile{toc}{\contentsline {subsection}{\numberline {1.1}\hskip -1em.\nobreakspace {}Zero-shot learning}{1}{subsection.1.1}}
\@writefile{toc}{\contentsline {subsection}{\numberline {1.2}\hskip -1em.\nobreakspace {}Skeleton-based visual recognition}{1}{subsection.1.2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {1.3}\hskip -1em.\nobreakspace {}Related work}{1}{subsection.1.3}}
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Architecture of the network.}}{2}{figure.1}}
\newlabel{fig:long}{{1}{2}{Architecture of the network}{figure.1}{}}
\newlabel{fig:onecol}{{1}{2}{Architecture of the network}{figure.1}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {1.4}\hskip -1em.\nobreakspace {}Data augmentation}{2}{subsection.1.4}}
\@writefile{toc}{\contentsline {section}{\numberline {2}\hskip -1em.\nobreakspace {}Method}{2}{section.2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.1}\hskip -1em.\nobreakspace {}Architecture}{2}{subsection.2.1}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.1.1}Visual path}{2}{subsubsection.2.1.1}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.1.2}Semantic Path}{2}{subsubsection.2.1.2}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.1.3}Similarity-Learning-Part}{2}{subsubsection.2.1.3}}
\@writefile{toc}{\contentsline {section}{\numberline {1}\hskip -1em.\nobreakspace {}Introduction}{1}{section.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {1.1}\hskip -1em.\nobreakspace {}Zero-shot learning}{1}{subsection.1.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {1.2}\hskip -1em.\nobreakspace {}Skeleton-based visual recognition}{1}{subsection.1.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {1.3}\hskip -1em.\nobreakspace {}Related work}{1}{subsection.1.3}\protected@file@percent }
\citation{jasani2019skeleton}
\citation{jasani2019skeleton}
\citation{jasani2019skeleton}
\citation{yan2018spatial}
\citation{yan2018spatial}
\citation{jasani2019skeleton}
\citation{reimers2019sentencebert}
\citation{sung2018learning}
\citation{sung2018learning}
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Architecture of the network.}}{2}{figure.1}\protected@file@percent }
\newlabel{architecture}{{1}{2}{Architecture of the network}{figure.1}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {1.4}\hskip -1em.\nobreakspace {}Data augmentation}{2}{subsection.1.4}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {2}\hskip -1em.\nobreakspace {}Method}{2}{section.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {2.1}\hskip -1em.\nobreakspace {}Architecture}{2}{subsection.2.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.1.1}Visual path}{2}{subsubsection.2.1.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.1.2}Semantic Path}{2}{subsubsection.2.1.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.1.3}Similarity-Learning-Part}{2}{subsubsection.2.1.3}\protected@file@percent }
\citation{liu2019roberta}
\citation{ma2019nlpaug}
\citation{jasani2019skeleton}
\citation{sung2018learning}
\@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces Three descriptive labels for class "Squat down".}}{3}{table.1}}
\@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces Three descriptive labels for class "Squat down".}}{3}{table.1}\protected@file@percent }
\newlabel{tab:multi_label}{{1}{3}{Three descriptive labels for class "Squat down"}{table.1}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.2}\hskip -1em.\nobreakspace {}Augmentation}{3}{subsection.2.2}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.2.1}Descriptive labels}{3}{subsubsection.2.2.1}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.2.2}Multiple labels per class}{3}{subsubsection.2.2.2}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.2.3}Automatic augmentation}{3}{subsubsection.2.2.3}}
\@writefile{lot}{\contentsline {table}{\numberline {2}{\ignorespaces Descriptive label and two automatic augmentations for "Squat down".}}{3}{table.2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.2}\hskip -1em.\nobreakspace {}Augmentation}{3}{subsection.2.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.2.1}Descriptive labels}{3}{subsubsection.2.2.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.2.2}Multiple labels per class}{3}{subsubsection.2.2.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {2.2.3}Automatic augmentation}{3}{subsubsection.2.2.3}\protected@file@percent }
\@writefile{lot}{\contentsline {table}{\numberline {2}{\ignorespaces Descriptive label and two automatic augmentations for "Squat down".}}{3}{table.2}\protected@file@percent }
\newlabel{tab:auto_aug}{{2}{3}{Descriptive label and two automatic augmentations for "Squat down"}{table.2}{}}
\@writefile{lot}{\contentsline {table}{\numberline {3}{\ignorespaces ZSL and GZSL results for different approaches.}}{3}{table.3}}
\@writefile{lot}{\contentsline {table}{\numberline {3}{\ignorespaces ZSL and GZSL results for different approaches.}}{3}{table.3}\protected@file@percent }
\newlabel{tab:ZSL_GZSL}{{3}{3}{ZSL and GZSL results for different approaches}{table.3}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.3}\hskip -1em.\nobreakspace {}Experiments}{3}{subsection.2.3}}
\@writefile{toc}{\contentsline {subsection}{\numberline {2.3}\hskip -1em.\nobreakspace {}Experiments}{3}{subsection.2.3}\protected@file@percent }
\citation{jasani2019skeleton}
\citation{ma2019nlpaug}
\@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces Unseen top-1 and top-5 accuracies in detail.}}{4}{table.4}}
\@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces Unseen top-1 and top-5 accuracies in detail.}}{4}{table.4}\protected@file@percent }
\newlabel{tab:top1_top5}{{4}{4}{Unseen top-1 and top-5 accuracies in detail}{table.4}{}}
\@writefile{toc}{\contentsline {section}{\numberline {3}\hskip -1em.\nobreakspace {}Results}{4}{section.3}}
\@writefile{toc}{\contentsline {subsection}{\numberline {3.1}\hskip -1em.\nobreakspace {}Discussion}{4}{subsection.3.1}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.1}From default to descriptive labels}{4}{subsubsection.3.1.1}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.2}Using multiple labels}{4}{subsubsection.3.1.2}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.3}Automatic augmentation}{4}{subsubsection.3.1.3}}
\@writefile{toc}{\contentsline {section}{\numberline {4}\hskip -1em.\nobreakspace {}Conclusion}{4}{section.4}}
\@writefile{toc}{\contentsline {section}{\numberline {3}\hskip -1em.\nobreakspace {}Results}{4}{section.3}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {3.1}\hskip -1em.\nobreakspace {}Discussion}{4}{subsection.3.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.1}From default to descriptive labels}{4}{subsubsection.3.1.1}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.2}Using multiple labels}{4}{subsubsection.3.1.2}\protected@file@percent }
\@writefile{toc}{\contentsline {subsubsection}{\numberline {3.1.3}Automatic augmentation}{4}{subsubsection.3.1.3}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {4}\hskip -1em.\nobreakspace {}Conclusion}{4}{section.4}\protected@file@percent }
\bibstyle{ieee_fullname}
\bibdata{egbib}
\bibcite{duan2021revisiting}{1}
......@@ -67,3 +75,5 @@
\bibcite{ma2019nlpaug}{5}
\bibcite{reimers2019sentencebert}{6}
\bibcite{sung2018learning}{7}
\bibcite{yan2018spatial}{8}
\gdef \@abspage@last{5}
......@@ -41,4 +41,10 @@ Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H.~S. Torr, and
\newblock Learning to compare: Relation network for few-shot learning.
\newblock {\em arXiv:1711.06025}, 2018.
\bibitem{yan2018spatial}
Sijie Yan, Yuanjun Xiong, and Dahua Lin.
\newblock Spatial temporal graph convolutional networks for skeleton-based
action recognition.
\newblock {\em arXiv:1801.07455}, 2018.
\end{thebibliography}
This is BibTeX, Version 0.99d
Capacity: max_strings=200000, hash_size=200000, hash_prime=170003
The top-level auxiliary file: paper_working_design.aux
Reallocating 'name_of_file' (item size: 1) to 14 items.
The style file: ieee_fullname.bst
Reallocating 'name_of_file' (item size: 1) to 6 items.
Database file #1: egbib.bib
You've used 8 entries,
2120 wiz_defined-function locations,
540 strings with 5197 characters,
and the built_in function-call counts, 2678 in all, are:
= -- 235
> -- 203
< -- 0
+ -- 80
- -- 72
* -- 201
:= -- 498
add.period$ -- 24
call.type$ -- 8
change.case$ -- 60
chr.to.int$ -- 0
cite$ -- 8
duplicate$ -- 78
empty$ -- 156
format.name$ -- 72
if$ -- 514
int.to.chr$ -- 0
int.to.str$ -- 8
missing$ -- 7
newline$ -- 43
num.names$ -- 16
pop$ -- 62
preamble$ -- 1
purify$ -- 52
quote$ -- 0
skip$ -- 58
stack$ -- 0
substring$ -- 73
swap$ -- 8
text.length$ -- 0
text.prefix$ -- 0
top$ -- 0
type$ -- 32
warning$ -- 0
while$ -- 17
width$ -- 9
write$ -- 83
......@@ -61,7 +61,7 @@ Gesture recognition in videos is a rapidly growing field of research and could b
%-------------------------------------------------------------------------
\subsection{Zero-shot learning}
In order to be able to classify samples of untrained classes a network needs to have an expectation of what the gesture corresponding to that class’s label might look like. This is usually done through text embeddings: Trained on unannotated text data, language embedding models extract meaning from words or sentences by converting them into a semantic embedding vector. After creating a semantic embedding for each class label, it is possible to compare the embeddings of unseen classes with those of seen classes to determine what characteristics those classes share. If those similarities are also present in the visual input samples, the network can deduce that the input sample belongs to that specific unseen class.
In order to be able to classify samples of untrained (also called 'unseen') classes, a network needs to have an expectation of what the gesture corresponding to that class's label might look like. This is usually done through text embeddings: trained on unannotated text data, language embedding models extract meaning from words or sentences by converting them into a semantic embedding vector. After creating a semantic embedding for each class label, it is possible to compare the embeddings of unseen classes with those of seen classes to determine which characteristics those classes share. If those similarities are also present in the visual input samples, the network can deduce that an input sample belongs to a specific unseen class.
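The core of this idea can be illustrated with a minimal, purely illustrative sketch: class labels are represented by semantic vectors in a shared space, and a sample is assigned to the class whose embedding is most similar to the sample's (projected) visual feature. The cosine-similarity choice, the toy labels and the 768-dimensional vectors below are assumptions for illustration, not details of our network.
\begin{verbatim}
# Minimal sketch of embedding-based zero-shot classification (illustrative only).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def classify_zero_shot(visual_feature, class_embeddings):
    """Return the label whose semantic embedding best matches the visual feature."""
    scores = {label: cosine_similarity(visual_feature, emb)
              for label, emb in class_embeddings.items()}
    return max(scores, key=scores.get)

# Toy example with random 768-dimensional embeddings for two unseen classes.
rng = np.random.default_rng(0)
class_embeddings = {"squat down": rng.normal(size=768),
                    "wave hand": rng.normal(size=768)}
sample_feature = class_embeddings["squat down"] + 0.1 * rng.normal(size=768)
print(classify_zero_shot(sample_feature, class_embeddings))   # -> "squat down"
\end{verbatim}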
\subsection{Skeleton-based visual recognition}
......@@ -69,7 +69,7 @@ RGB videos contain a lot of information that is not necessary to recognize the p
%-------------------------------------------------------------------------
\subsection{Related work}
In \cite{jasani2019skeleton} a zero-shot classification network for the NTU RGB + D dataset was created. Their architecture features an MLP to map the semantic embeddings of the class labels into the visual features space and another MLP that learns a deep similarity metric between those semantic features and the visual features of a given input sample.\\
In \cite{jasani2019skeleton}, a zero-shot classification network for the NTU RGB+D dataset was created. Their architecture features an MLP that maps the semantic embeddings of the class labels into the visual feature space and another MLP that learns a deep similarity metric between those semantic features and the visual features of a given input sample.\\
\\
Sentence BERT (SBERT) \cite{reimers2019sentencebert} is a text embedding module that takes a sentence as input, analyzes it and produces two kinds of output: a cls-token vector, which represents the entire sentence, and a series of embedding vectors, each of which represents one word of the input sentence in its context. A mean-token vector can be created from this second output by applying an attention mask to the series of tokens and averaging them into a single vector. This way, two separate semantic embeddings can be generated for each input sentence: a cls-token and a mean-token.
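As an illustration, both embedding types can be obtained from a BERT-style encoder roughly as follows. This is a hedged sketch using the Hugging Face transformers API; the checkpoint name is an assumption and not necessarily the exact SBERT model used here.
\begin{verbatim}
# Sketch: cls-token and mean-token embeddings from a BERT-style encoder.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/bert-base-nli-mean-tokens"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state            # (1, seq_len, 768)
    cls_token = out[:, 0]                               # whole-sentence representation
    mask = enc["attention_mask"].unsqueeze(-1)          # ignore padding positions
    mean_token = (out * mask).sum(1) / mask.sum(1)      # weighted average of word tokens
    return cls_token, mean_token

cls_vec, mean_vec = embed("a person squats down by bending their knees")
print(cls_vec.shape, mean_vec.shape)                    # both torch.Size([1, 768])
\end{verbatim}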
......@@ -79,7 +79,7 @@ We aim to provide the network more relevant semantic information about the diffe
\section{Method}
Transition + overall approach
First, we build a network capable of zero-shot learning for gesture recognition and assess its baseline performance. Then we apply different forms of data augmentation to the semantic embeddings of the class labels and compare the resulting classification accuracies. For our tests we use 40 gesture classes from the NTU RGB+D 120 dataset.
\begin{figure}[t]
\begin{center}
......@@ -87,22 +87,21 @@ We aim to provide the network more relevant semantic information about the diffe
\includegraphics[width=0.9\linewidth]{Architektur2.png}
\end{center}
\caption{Architecture of the network.}
\label{fig:long}
\label{fig:onecol}
\label{architecture}
\end{figure}
\subsection{Architecture}
The architecture chosen for our experiments corresponds in large parts to the model presented in [jas2019Skelet]. We tried to reengineer it from its individual modules using the information published in the paper. Certain modules were replaced or slightly modified in favor of better performance. Detailed information on the model used (which is illustrated in [Figure...]) should be taken from the publication of Jas. et. al [jas2019Skelet]. Here, however, only a brief overview of the functionality is given, according to how the model tries to solve the zero-shot task, and which changes have been made compared to [jas2019Skelet].
The architecture chosen for our experiments largely corresponds to the model presented in \cite{jasani2019skeleton}. We reengineered it from its individual modules using the information published in the paper, replacing or slightly modifying certain modules in favor of better performance on our specific task. Detailed information on the model (illustrated in Figure~\ref{architecture}) can be found in the publication of Jasani et al.\ \cite{jasani2019skeleton}. Here, we only give a brief overview of how the model approaches the zero-shot task and which changes have been made compared to \cite{jasani2019skeleton}.
The architecture consists of three parts described in the following sections.
\subsubsection{Visual path}
The task of the visual path is the feature extraction of the video sample to be classified. The Graph Convolutional Net (GCN) from [Yan2018GCN] is used as the feature extractor, which in our case has been trained exclusively with the 80 unused classes of the NTU-RGB+D 120 dataset in order not to violate the zero-shot approach. This ensures that the unseen gestures have not already appeared at some early point in the training process before inference. Further details on feature extraction can be found in the referenced paper [Yan2018GCN]. No significant changes were made to this part of the network.
The task of the visual path is the feature extraction of a video sample (in the form of a temporal series of 3D skeletons). The Graph Convolutional Network (GCN) from \cite{yan2018spatial} is used as the feature extractor; in our case it has been trained exclusively on the 80 unused classes of the NTU RGB+D 120 dataset. This ensures that the unseen gestures have not already appeared at some earlier point in the training process before inference. Further details on the feature extraction can be found in the referenced paper \cite{yan2018spatial}. No significant changes were made to this part of the network.
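Schematically, this path maps a skeleton sequence to a single visual feature vector. The sketch below uses a stand-in backbone rather than the actual GCN of \cite{yan2018spatial}, and the tensor shapes are assumptions based on the usual NTU RGB+D skeleton format.
\begin{verbatim}
# Illustrative sketch of the visual path with a placeholder backbone.
import torch
import torch.nn as nn

class DummySkeletonBackbone(nn.Module):
    """Stand-in for a GCN backbone pretrained on the 80 disjoint classes."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(out_dim), nn.ReLU())

    def forward(self, x):             # x: (batch, 3, frames, 25 joints, 2 persons)
        return self.net(x)            # (batch, out_dim) visual features

backbone = DummySkeletonBackbone()
backbone.eval()                       # frozen after pretraining
skeleton_clip = torch.randn(1, 3, 300, 25, 2)
with torch.no_grad():
    visual_feature = backbone(skeleton_clip)
print(visual_feature.shape)           # torch.Size([1, 256])
\end{verbatim}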
\subsubsection{Semantic Path}
The semantic path consists of two modules. The first one has the task of transforming the vocabulary, i.e. all possible class labels, into a semantic embedding. For this, in contrast to the original architecture, no Sent2Vec module is used, but an sBert module. For our experiments we used the weighted average of the word-tokens and not as one would maybe expect the class tokens. We did this due to better performance. The details of this model, which translates the class labels into representative 768-dimensional vectors, can be found in [citation Bert paper]. The semantic embedding is followed by the mapping of the semantic features into the visual context. This task is performed by a multi-layer perceptron (MLP), which will be referred to as an attribute network (AN) in the following. The AN is located at the boundary between the semantic path and the similarity learning part. It is introduced in [Sung2018L2C] where it contributes a significant part to the solution of the ZSL task together with the Relation Net (RN), which is explained in more detail in the following section. Minor changes have also been made to the AN. These can be seen in the dimensionality of the individual layers and the added drop-out layer, with a drop-out factor of 0.5.
The semantic path consists of two modules. The first is an SBERT module, which transforms the vocabulary, i.e., all possible class labels, into semantic embeddings. This differs from the original architecture in \cite{jasani2019skeleton}, where a Sent2Vec module is used. We chose the mean-token output of the SBERT module over the cls-token as our semantic embedding because it resulted in better performance. The details of this model, which translates the class labels into representative 768-dimensional vectors, can be found in \cite{reimers2019sentencebert}. The attribute network (AN) then transforms the semantic embeddings into semantic features by mapping them into the visual feature space. The AN is introduced in \cite{sung2018learning}, where it contributes a significant part to the solution of the ZSL task together with the Relation Network (RN), which is explained in more detail in the following section. Minor changes have been made to the AN: the output size of its first layer has been changed to 1200 and its dropout factor to 0.5.
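A possible PyTorch sketch of the AN is given below. Only the first-layer width of 1200 and the dropout factor of 0.5 come from the description above; the output dimension and the activation choices are assumptions.
\begin{verbatim}
# Hedged sketch of the attribute network (AN): maps 768-dim SBERT embeddings
# into an assumed 256-dimensional visual feature space.
import torch
import torch.nn as nn

class AttributeNetwork(nn.Module):
    def __init__(self, sem_dim=768, hidden_dim=1200, vis_dim=256, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sem_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, vis_dim),
            nn.ReLU(),
        )

    def forward(self, sem_emb):                   # (num_classes, 768)
        return self.net(sem_emb)                  # (num_classes, vis_dim)

an = AttributeNetwork()
semantic_features = an(torch.randn(40, 768))      # one embedding per class label
print(semantic_features.shape)                    # torch.Size([40, 256])
\end{verbatim}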
\subsubsection{Similarity-Learning-Part}
The Similariy-Learning part consists only of another MLP, which is called Relation Net in [Sung2018L2C]. According to their idea, the RN should learn in the training process to compare the semantic features of the individual labels in the vocabulary with the visual features of the video sample to be classified, in order to be able to make a correct assignment (even for unseen classes). Again, we refer to the originators of the architecture for more details. Following changes have been made to the RN: we added another linear layer and a drop-out layer in the MLP. Furthermore, a batch norm layer has also been introduced for some of our experiments. The latter one is used to speed up the training process.
Here we first form the relation pairs by pairwise concatenating the visual features of our sample with the semantic features of each class. These relation pairs are then fed into the relation network (RN) introduced in \cite{sung2018learning}. The RN learns a deep similarity metric in order to assess the similarity of the semantic and visual features within each relation pair. In this way, it computes a similarity score for each pair, which represents the input sample's similarity to each possible class. Again, we refer to the originators of the architecture for more details. The following changes have been made to the RN: we added another linear layer and a dropout layer to the MLP. Furthermore, a batch norm layer has been introduced for some of our experiments to speed up the training process.
The AN and the RN from \cite{sung2018learning} are the only modules that are actually trained during the training process; all other modules mentioned remain unchanged after their pretraining. A sketch of the pairing and scoring step follows below.
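In the sketch below, each visual feature is concatenated with the semantic feature of every class and the RN scores each pair. Layer widths, the sigmoid output and the placement of the optional batch norm layer are assumptions in the spirit of \cite{sung2018learning}, not the exact configuration of our RN.
\begin{verbatim}
# Sketch of the similarity-learning part: relation pairs + relation network.
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, vis_dim=256, sem_dim=256, hidden_dim=400, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim + sem_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),       # optional, used in some experiments
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, visual, semantic):
        # visual: (batch, vis_dim), semantic: (num_classes, sem_dim)
        b, c = visual.size(0), semantic.size(0)
        pairs = torch.cat([visual.unsqueeze(1).expand(b, c, -1),
                           semantic.unsqueeze(0).expand(b, c, -1)], dim=-1)
        return self.net(pairs.reshape(b * c, -1)).reshape(b, c)  # similarity scores

rn = RelationNetwork()
scores = rn(torch.randn(4, 256), torch.randn(40, 256))
print(scores.shape)   # torch.Size([4, 40]); argmax over dim 1 gives the predicted class
\end{verbatim}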
......