Commit 05d655ba authored by uoega

update readme, first deletions

parent ed9b0661
## Learning to Compare ZSL
This module is located in the folder **LearningToCompare_ZSL**. Our version is based on the implementation of the [original paper](https://arxiv.org/abs/1711.06025) from [LearningToCompare_ZSL](https://github.com/lzrobots/LearningToCompare_ZSL). Be careful with CUDA 11; we experienced errors there. Use CUDA 10.2 instead.
The main file to train the zero-shot part of the architecture is `NTU_RelationNet_copy.py`. There are several input arguments, most importantly the unseen classes (from 1 to 40) and the label embedding to use. For example:
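A plausible single-label invocation (argument names assumed by analogy with the multi-label example below, not verified against the script):
```python
python NTU_RelationNet_copy.py -u 2 9 11 18 38 -s sentence_40_mean_ver1_norm
```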
This uses the classes 2, 9, 11, 18 and 38 as unseen classes and the ver1 descriptive label.
### Multiple Labels
Here the file `NTU_RelationNet_random_multi_label.py` is used for training. The input arguments are the same as for the single-label approach, but now more than one label embedding can be used. For example:
```python
python NTU_RelationNet_random_multi_label.py -u 2 9 11 18 38 -s sentence_40_mean_ver1_norm sentence_40_mean_ver2_norm sentence_40_mean_ver5_norm
```
This uses the SBERT mean label embeddings, versions 1, 2 and 5.
### Inference
To test the performance of a trained model, use the `NTU_RelationNet_test.py` script, which expects the folder containing the trained network as input. Keep in mind that the network parameters are not initialized from the imported data but from the other input arguments. For the inference of a single sample, please refer to `NTU_RelationNet_test_single_sample.py`. For top-5 accuracies use `NTU_RelationNet_test_top5.py`. When using multiple labels, refer to the `...test_multi_label...` files.
### Experimental Setups
Architectures differing from the one proposed in the paper were tested separately:
- Skip Connections: `NTU_RelationNet_skip_con.py`
- Two Skip Connections: `NTU_RelationNet_skip_con2.py`
- Self Attention layer after concatenation: `NTU_RelationNet_att.py`
- Self Attention as additional layer: `NTU_RelationNet_att2.py`
- Double Mapping, where both visual and semantic features are mapped into a joint space: `NTU_RelationNet_copy_double_mapping.py`
## Additional files
As usual, more experiments were performed than are presented in the paper. The corresponding files are collected here.
### Siamese Networks
Located in the folder **siamese-triplet** are all files used for experiments with a Siamese network to cluster the visual features from the ST-GCN.
## Results
Accuracies averaged over 10 different splits with 35 seen and 5 unseen classes each (one row per augmentation method).
|Augmentation | ZSL | Seen | Unseen | h |
|---|---|---|---|---|
|Baseline | 0.4739 | 0.8116 | 0.1067 | 0.1877 |
|Descriptive Labels | 0.5186 | 0.8104 | 0.1503 | 0.2495 |
|Multiple Descriptive Labels | **0.6558** | 0.8283 | **0.2182** | **0.3417** |
|Automatic Augmentation | 0.5865 | **0.8290** | 0.1856 | 0.3003 |
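
Here h presumably denotes the harmonic mean of the seen and unseen accuracies, h = 2 · Seen · Unseen / (Seen + Unseen), computed per split and then averaged.
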
|Augmentation | top-1 ± std | top-5 ± std |
|---|---|---|
|Baseline | 0.1067 ± 0.0246 | 0.5428 ± 0.0840 |
|Descriptive Labels | 0.1503 ± 0.0553 | 0.6460 ± 0.1250 |
|Multiple Descriptive Labels | **0.2182 ± 0.0580** | **0.8580 ± 0.0657** |
|Automatic Augmentation | 0.1856 ± 0.0499 | 0.8272 ± 0.0476 |
%% Cell type:code id: tags:
``` python
import os
import time
import sys
import re
from subprocess import call
import numpy as np
from nltk import TweetTokenizer
from nltk.tokenize import StanfordTokenizer
```
%% Cell type:markdown id: tags:
# Downloading the models
As mentioned in the readme, here are the pretrained models you can download:
- [sent2vec_wiki_unigrams](https://drive.google.com/open?id=0B6VhzidiLvjSa19uYWlLUEkzX3c) 5GB (600dim, trained on english wikipedia)
- [sent2vec_wiki_bigrams](https://drive.google.com/open?id=0B6VhzidiLvjSaER5YkJUdWdPWU0) 16GB (700dim, trained on english wikipedia)
- [sent2vec_twitter_unigrams](https://drive.google.com/open?id=0B6VhzidiLvjSaVFLM0xJNk9DTzg) 13GB (700dim, trained on english tweets)
- [sent2vec_twitter_bigrams](https://drive.google.com/open?id=0B6VhzidiLvjSeHI4cmdQdXpTRHc) 23GB (700dim, trained on english tweets)
- [sent2vec_toronto books_unigrams](https://drive.google.com/open?id=0B6VhzidiLvjSOWdGM0tOX1lUNEk) 2GB (700dim, trained on the [BookCorpus dataset](http://yknzhu.wixsite.com/mbweb))
- [sent2vec_toronto books_bigrams](https://drive.google.com/open?id=0B6VhzidiLvjSdENLSEhrdWprQ0k) 7GB (700dim, trained on the [BookCorpus dataset](http://yknzhu.wixsite.com/mbweb))
%% Cell type:markdown id: tags:
---
From here, one simple way to get sentence embeddings is to use the `print-sentence-vectors` command as shown in the README. To properly use our models you ideally need to use the same preprocessing used during training. We provide here some simple code wrapping around the `print-sentence-vectors` command and handling the tokenization to match our models properly.
%% Cell type:markdown id: tags:
# Linking things together
In order to use the Stanford NLP tokenizer with NLTK, you need to get the `stanford-postagger.jar` available in the [CoreNLP library package](http://stanfordnlp.github.io/CoreNLP/).
You can then proceed to link things by modifying the paths in the following cell:
%% Cell type:code id: tags:
``` python
FASTTEXT_EXEC_PATH = os.path.abspath("./fasttext")
BASE_SNLP_PATH = "/home/path/to/stanford_NLP/stanford-postagger-2016-10-31/"
SNLP_TAGGER_JAR = os.path.join(BASE_SNLP_PATH, "stanford-postagger.jar")
MODEL_WIKI_UNIGRAMS = os.path.abspath("./wiki_unigrams.bin")
MODEL_WIKI_BIGRAMS = os.path.abspath("./wiki_bigrams.bin")
MODEL_TWITTER_UNIGRAMS = os.path.abspath('./twitter_unigrams.bin')
MODEL_TWITTER_BIGRAMS = os.path.abspath('./twitter_bigrams.bin')
```
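%% Cell type:markdown id: tags:
Optionally, you can check that the configured paths point to existing files before running the rest of the notebook (a small sanity-check cell, not required by the pipeline):
%% Cell type:code id: tags:
``` python
# Report which of the configured paths exist on disk.
for p in [FASTTEXT_EXEC_PATH, SNLP_TAGGER_JAR,
          MODEL_WIKI_UNIGRAMS, MODEL_WIKI_BIGRAMS,
          MODEL_TWITTER_UNIGRAMS, MODEL_TWITTER_BIGRAMS]:
    print(p, '->', 'found' if os.path.exists(p) else 'missing')
```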
%% Cell type:markdown id: tags:
# Generating sentence embeddings
Now you can just run the following cells:
## Utils for tokenization
%% Cell type:code id: tags:
``` python
def tokenize(tknzr, sentence, to_lower=True):
    """Arguments:
        - tknzr: a tokenizer implementing the NLTK tokenizer interface
        - sentence: a string to be tokenized
        - to_lower: lowercasing or not
    """
    sentence = sentence.strip()
    sentence = ' '.join([format_token(x) for x in tknzr.tokenize(sentence)])
    if to_lower:
        sentence = sentence.lower()
    sentence = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+)|(http?://[^\s]+))', '<url>', sentence)  # replace urls by <url>
    sentence = re.sub(r'(\@[^\s]+)', '<user>', sentence)  # replace @user268 by <user>
    return sentence

def format_token(token):
    """Map Penn Treebank bracket tokens back to their original characters."""
    if token == '-LRB-':
        token = '('
    elif token == '-RRB-':
        token = ')'
    elif token == '-RSB-':
        token = ']'
    elif token == '-LSB-':
        token = '['
    elif token == '-LCB-':
        token = '{'
    elif token == '-RCB-':
        token = '}'
    return token

def tokenize_sentences(tknzr, sentences, to_lower=True):
    """Arguments:
        - tknzr: a tokenizer implementing the NLTK tokenizer interface
        - sentences: a list of sentences
        - to_lower: lowercasing or not
    """
    return [tokenize(tknzr, s, to_lower) for s in sentences]
```
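%% Cell type:markdown id: tags:
As a quick illustration (not part of the original pipeline), the utils above lowercase the text and replace URLs and user mentions by placeholder tokens:
%% Cell type:code id: tags:
``` python
# Example only: TweetTokenizer is imported at the top of the notebook.
tknzr = TweetTokenizer()
print(tokenize_sentences(tknzr, ['Hello @user268, see https://example.com !']))
# expected output along the lines of: ['hello <user> , see <url> !']
```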
%% Cell type:markdown id: tags:
## Utils for inferring embeddings
%% Cell type:code id: tags:
``` python
def get_embeddings_for_preprocessed_sentences(sentences, model_path, fasttext_exec_path):
    """Arguments:
        - sentences: a list of preprocessed sentences
        - model_path: a path to the sent2vec .bin model
        - fasttext_exec_path: a path to the fasttext executable
    """
    timestamp = str(time.time())
    test_path = os.path.abspath('./' + timestamp + '_fasttext.test.txt')
    embeddings_path = os.path.abspath('./' + timestamp + '_fasttext.embeddings.txt')
    dump_text_to_disk(test_path, sentences)
    # shell out to the fasttext binary: one input sentence per line in, one vector per line out
    call(fasttext_exec_path +
         ' print-sentence-vectors ' +
         model_path + ' < ' +
         test_path + ' > ' +
         embeddings_path, shell=True)
    embeddings = read_embeddings(embeddings_path)
    os.remove(test_path)
    os.remove(embeddings_path)
    assert len(sentences) == len(embeddings)
    return np.array(embeddings)

def read_embeddings(embeddings_path):
    """Arguments:
        - embeddings_path: path to the embeddings
    """
    with open(embeddings_path, 'r') as in_stream:
        embeddings = []
        for line in in_stream:
            # each line is a space-separated list of floats
            embeddings.append([float(x) for x in line.split()])
        return embeddings

def dump_text_to_disk(file_path, X, Y=None):
    """Arguments:
        - file_path: where to dump the data
        - X: list of sentences to dump
        - Y: labels, if any
    """
    with open(file_path, 'w') as out_stream:
        if Y is not None:
            for x, y in zip(X, Y):
                out_stream.write('__label__' + str(y) + ' ' + x + ' \n')
        else:
            for x in X:
                out_stream.write(x + ' \n')

def get_sentence_embeddings(sentences, ngram='bigrams', model='concat_wiki_twitter'):
    """Returns a numpy matrix of embeddings for one of the published models. It
    handles tokenization and can be given raw sentences.
    Arguments:
        - ngram: 'unigrams' or 'bigrams'
        - model: 'wiki', 'twitter', or 'concat_wiki_twitter'
        - sentences: a list of raw sentences ['Once upon a time', 'This is another sentence.', ...]
    """
    wiki_embeddings = None
    twitter_embeddings = None
    if model == 'wiki' or model == 'concat_wiki_twitter':
        # wiki models were trained on text tokenized with the Stanford tokenizer
        tknzr = StanfordTokenizer(SNLP_TAGGER_JAR, encoding='utf-8')
        s = ' <delimiter> '.join(sentences)  # just a trick to make things faster
        tokenized_sentences_SNLP = tokenize_sentences(tknzr, [s])
        tokenized_sentences_SNLP = tokenized_sentences_SNLP[0].split(' <delimiter> ')
        assert len(tokenized_sentences_SNLP) == len(sentences)
        if ngram == 'unigrams':
            wiki_embeddings = get_embeddings_for_preprocessed_sentences(
                tokenized_sentences_SNLP, MODEL_WIKI_UNIGRAMS, FASTTEXT_EXEC_PATH)
        else:
            wiki_embeddings = get_embeddings_for_preprocessed_sentences(
                tokenized_sentences_SNLP, MODEL_WIKI_BIGRAMS, FASTTEXT_EXEC_PATH)
    if model == 'twitter' or model == 'concat_wiki_twitter':
        # twitter models were trained on text tokenized with the NLTK TweetTokenizer
        tknzr = TweetTokenizer()
        tokenized_sentences_NLTK_tweets = tokenize_sentences(tknzr, sentences)
        if ngram == 'unigrams':
            twitter_embeddings = get_embeddings_for_preprocessed_sentences(
                tokenized_sentences_NLTK_tweets, MODEL_TWITTER_UNIGRAMS, FASTTEXT_EXEC_PATH)
        else:
            twitter_embeddings = get_embeddings_for_preprocessed_sentences(
                tokenized_sentences_NLTK_tweets, MODEL_TWITTER_BIGRAMS, FASTTEXT_EXEC_PATH)
    if model == 'twitter':
        return twitter_embeddings
    elif model == 'wiki':
        return wiki_embeddings
    elif model == 'concat_wiki_twitter':
        return np.concatenate((wiki_embeddings, twitter_embeddings), axis=1)
    sys.exit(-1)
```
%% Cell type:markdown id: tags:
## Usecase
To get embeddings you can now use the `get_sentence_embeddings` function; the parameters are:
- sentences: a list of unprocessed sentences
- ngram: either `bigrams` or `unigrams`
- model: `wiki`, `twitter` or `concat_wiki_twitter`
Loading the models can take some time, but once they are loaded, inference is fast.
%% Cell type:code id: tags:
``` python
sentences = ['Once upon a time.', 'And now for something completely different.']
my_embeddings = get_sentence_embeddings(sentences, ngram='unigrams', model='twitter')
print(my_embeddings.shape)
```
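%% Cell type:markdown id: tags:
If both the wiki and twitter models are linked above, the concatenated variant can be called the same way; per `get_sentence_embeddings`, the wiki and twitter vectors are concatenated along the feature axis (an illustrative variation of the call above):
%% Cell type:code id: tags:
``` python
# Assumes the wiki and twitter model paths above point to downloaded models.
concat_embeddings = get_sentence_embeddings(sentences, ngram='unigrams', model='concat_wiki_twitter')
print(concat_embeddings.shape)  # second dimension = wiki dim + twitter dim
```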
%% Cell type:markdown id: tags:
et voila :)
%% Cell type:code id: tags:
``` python
```
FROM ubuntu
RUN mkdir -p /opt/sent2vec/src
ADD setup.py /opt/sent2vec/
ADD src /opt/sent2vec/src/
ADD Makefile /opt/sent2vec/
ADD requirements.txt /opt/sent2vec/
RUN apt-get update
RUN apt-get install -y python3-pip python3-dev build-essential libevent-pthreads-2.1-6
WORKDIR /opt/sent2vec
RUN pip3 install --upgrade pip
RUN pip3 install -r requirements.txt
RUN pip3 install .
RUN make
BSD License
For fastText software
Copyright (c) 2016-present, Facebook, Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name Facebook nor the names of its contributors may be used to
endorse or promote products derived from this software without specific
prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Copyright (c) 2016-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree. An additional grant
# of patent rights can be found in the PATENTS file in the same directory.
#
CXX = c++
CXXFLAGS = -pthread -std=c++0x
OBJS = args.o dictionary.o productquantizer.o matrix.o shmem_matrix.o qmatrix.o vector.o model.o utils.o fasttext.o
INCLUDES = -I.
ifneq ($(shell uname),Darwin)
LINK_RT := -lrt
endif
opt: CXXFLAGS += -O3 -funroll-loops
opt: fasttext
debug: CXXFLAGS += -g -O0 -fno-inline
debug: fasttext
args.o: src/args.cc src/args.h
$(CXX) $(CXXFLAGS) -c src/args.cc
dictionary.o: src/dictionary.cc src/dictionary.h src/args.h
$(CXX) $(CXXFLAGS) -c src/dictionary.cc
productquantizer.o: src/productquantizer.cc src/productquantizer.h src/utils.h
$(CXX) $(CXXFLAGS) -c src/productquantizer.cc
matrix.o: src/matrix.cc src/matrix.h src/utils.h
$(CXX) $(CXXFLAGS) -c src/matrix.cc
shmem_matrix.o: src/shmem_matrix.cc src/shmem_matrix.h
$(CXX) $(CXXFLAGS) -c src/shmem_matrix.cc
qmatrix.o: src/qmatrix.cc src/qmatrix.h src/utils.h
$(CXX) $(CXXFLAGS) -c src/qmatrix.cc
vector.o: src/vector.cc src/vector.h src/utils.h
$(CXX) $(CXXFLAGS) -c src/vector.cc
model.o: src/model.cc src/model.h src/args.h
$(CXX) $(CXXFLAGS) -c src/model.cc
utils.o: src/utils.cc src/utils.h
$(CXX) $(CXXFLAGS) -c src/utils.cc
fasttext.o: src/fasttext.cc src/*.h
$(CXX) $(CXXFLAGS) -c src/fasttext.cc
fasttext: $(OBJS) src/fasttext.cc
$(CXX) $(CXXFLAGS) $(OBJS) src/main.cc -o fasttext $(LINK_RT)
clean:
rm -rf *.o fasttext
## Updates
Code and pre-trained models related to [Bi-Sent2vec](https://arxiv.org/abs/1912.12481), a cross-lingual extension of Sent2Vec, can be found [here](https://github.com/epfml/Bi-sent2vec).
# Sent2vec
TLDR: This library provides numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task.
### Table of Contents
* [Setup and Requirements](#setup-and-requirements)
* [Sentence Embeddings](#sentence-embeddings)
- [Generating Features from Pre-Trained Models](#generating-features-from-pre-trained-models)
- [Downloading Sent2vec Pre-Trained Models](#downloading-sent2vec-pre-trained-models)
- [Train a New Sent2vec Model](#train-a-new-sent2vec-model)
- [Nearest Neighbour Search and Analogies](#nearest-neighbour-search-and-analogies)
* [Word (Unigram) Embeddings](#unigram-embeddings)
- [Extracting Word Embeddings from Pre-Trained Models](#extracting-word-embeddings-from-pre-trained-models)
- [Downloading Pre-Trained Models](#downloading-pre-trained-models)
- [Train a CBOW Character and Word Ngrams Model](#train-a-cbow-character-and-word-ngrams-model)
* [References](#references)
# Setup and Requirements
Our code builds upon [Facebook's FastText library](https://github.com/facebookresearch/fastText); see also their nice documentation and Python interfaces.
To compile the library, simply run a `make` command.
A Cython module allows you to keep the model in memory while inferring sentence embeddings. In order to compile and install the module, run the following from the project root folder:
```
pip install .
```
## Note
If you install sent2vec using
```
$ pip install sent2vec
```
then you will get the wrong package. Please follow the instructions in this README to install it correctly.
# Sentence Embeddings
For the purpose of generating sentence representations, we introduce our sent2vec method and provide code and models. Think of it as an unsupervised version of [FastText](https://github.com/facebookresearch/fastText), and an extension of word2vec (CBOW) to sentences.
The method uses a simple but efficient unsupervised objective to train distributed representations of sentences. The algorithm outperforms state-of-the-art unsupervised models on most benchmark tasks, and on many tasks even beats supervised models, highlighting the robustness of the produced sentence embeddings; see [*the paper*](https://aclweb.org/anthology/N18-1049) for more details.
## Generating Features from Pre-Trained Models
### Directly from Python
If you've installed the Cython module, you can infer sentence embeddings while keeping the model in memory:
```python
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")
embs = model.embed_sentences(["first sentence .", "another sentence"])
```
Text preprocessing (tokenization and lowercasing) is not handled by the module; check `wikiTokenize.py` for tokenization using NLTK and Stanford NLP.
An alternative to the Cython module is using the python code provided in the `get_sentence_embeddings_from_pre-trained_models` notebook. It handles tokenization and can be given raw sentences, but does not keep the model in memory.
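As a rough sketch of such preprocessing (assuming a twitter model and the `model` object loaded as above; see `tweetTokenize.py` / `wikiTokenize.py` for the exact pipelines used during training):

```python
from nltk.tokenize import TweetTokenizer

# Sketch only: lowercasing + NLTK tweet tokenization, roughly matching the twitter models.
tknzr = TweetTokenizer()
raw = "Once upon a time."
preprocessed = ' '.join(tknzr.tokenize(raw.lower()))
emb = model.embed_sentence(preprocessed)  # 'model' is the Sent2vecModel loaded above
```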
#### Running Inference with Multiple Processes
The Cython module also offers an 'inference' mode for loading the model: it loads the model's input matrix into a shared memory segment and skips the output matrix, which is not needed for inference. This is an optimization for running inference with multiple independent processes, which would otherwise each need to load their own copy of the model into their address space. To use it:
```python
model.load_model('model.bin', inference_mode=True)
```
The model is loaded into a shared memory segment named after the model. It will stay in memory until you explicitly remove the shared memory segment. To do so from Python:
```python
model.release_shared_mem('model.bin')
```
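A minimal sketch of the multi-process use case (assuming only the API shown above and a local `model.bin`): each worker loads the shared-memory model independently, and the segment is released once all workers are done.

```python
import multiprocessing as mp

import sent2vec

def embed_batch(sentences):
    # Each worker attaches to the shared input matrix instead of loading a private copy.
    model = sent2vec.Sent2vecModel()
    model.load_model('model.bin', inference_mode=True)
    return model.embed_sentences(sentences)

if __name__ == '__main__':
    batches = [['first sentence .'], ['another sentence .']]
    with mp.Pool(processes=2) as pool:
        results = pool.map(embed_batch, batches)
    # Remove the named shared memory segment once no process needs it anymore.
    model = sent2vec.Sent2vecModel()
    model.load_model('model.bin', inference_mode=True)
    model.release_shared_mem('model.bin')
```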
### Using the Command-line Interface
Given a pre-trained model `model.bin` (see the download links below), here is how to generate sentence features for an input text: use the `print-sentence-vectors` command and provide the input text file with one sentence per line:
```
./fasttext print-sentence-vectors model.bin < text.txt
```
This will output sentence vectors (the features for each input sentence) to the standard output, one vector per line.
This can also be used with pipes:
```
cat text.txt | ./fasttext print-sentence-vectors model.bin
```
## Downloading Sent2vec Pre-Trained Models
- [sent2vec_wiki_unigrams](https://drive.google.com/open?id=0B6VhzidiLvjSa19uYWlLUEkzX3c) 5GB (600dim, trained on english wikipedia)
- [sent2vec_wiki_bigrams](https://drive.google.com/open?id=0B6VhzidiLvjSaER5YkJUdWdPWU0) 16GB (700dim, trained on english wikipedia)
- [sent2vec_twitter_unigrams](https://drive.google.com/open?id=0B6VhzidiLvjSaVFLM0xJNk9DTzg) 13GB (700dim, trained on english tweets)
- [sent2vec_twitter_bigrams](https://drive.google.com/open?id=0B6VhzidiLvjSeHI4cmdQdXpTRHc) 23GB (700dim, trained on english tweets)
- [sent2vec_toronto books_unigrams](https://drive.google.com/open?id=0B6VhzidiLvjSOWdGM0tOX1lUNEk) 2GB (700dim, trained on the [BookCorpus dataset](http://yknzhu.wixsite.com/mbweb))
- [sent2vec_toronto books_bigrams](https://drive.google.com/open?id=0B6VhzidiLvjSdENLSEhrdWprQ0k) 7GB (700dim, trained on the [BookCorpus dataset](http://yknzhu.wixsite.com/mbweb))
(as used in the NAACL2018 paper)
Note: users who downloaded models prior to [this release](https://github.com/epfml/sent2vec/releases/tag/v1) will encounter compatibility issues when trying to use the old models with the latest commit. Those users can still use the code in the release to keep using old models.
### Tokenizing
Both feature generation (as above) and training (as below) require the input texts (sentences) to be already tokenized. To tokenize and preprocess text for the above models, you can use
```
python3 tweetTokenize.py <tweets_folder> <dest_folder> <num_process>
```
for tweets, or the following for Wikipedia:
```
python3 wikiTokenize.py corpora > destinationFile
```
Note: For `wikiTokenize.py`, set the `SNLP_TAGGER_JAR` parameter to be the path of `stanford-postagger.jar` which you can download [here](http://www.java2s.com/Code/Jar/s/Downloadstanfordpostaggerjar.htm)
## Train a New Sent2vec Model
To train a new sent2vec model, you first need a large training text file containing one sentence per line. The provided code does not perform tokenization and lowercasing; you have to preprocess your input data yourself (see above).
You can then train a new model. Here is an example command:
```
./fasttext sent2vec -input wiki_sentences.txt -output my_model -minCount 8 -dim 700 -epoch 9 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000 -maxVocabSize 750000 -numCheckPoints 10
```
Here is a description of all available arguments:
```
sent2vec -input train.txt -output model
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-lr learning rate [0.2]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim dimension of word and sentence vectors [100]
-epoch number of epochs [5]
-minCount minimal number of word occurences [5]
-minCountLabel minimal number of label occurences [0]
-neg number of negatives sampled [10]
-wordNgrams max length of word ngram [2]
-loss loss function {ns, hs, softmax} [ns]
-bucket number of hash buckets for vocabulary [2000000]
-thread number of threads [2]
-t sampling threshold [0.0001]
-dropoutK number of ngrams dropped when training a sent2vec model [2]
-verbose verbosity level [2]
-maxVocabSize vocabulary exceeding this size will be truncated [None]
-numCheckPoints number of intermediary checkpoints to save when training [1]
```
## Nearest Neighbour Search and Analogies
Given a pre-trained model `model.bin`, here is how to use these features. For the nearest-neighbour sentence feature, you need the model as well as a corpus in which to search for the nearest neighbouring sentence to your input sentence. We use cosine distance as the distance metric. Use the command `nnSent`; the input should be one sentence per line:
```
./fasttext nnSent model.bin corpora [k]
```
k is optional and is the number of nearest sentences that you want to output.
For `analogiesSent`, the user inputs three sentences A, B and C, and the tool finds the sentence from the corpus that is closest to D in the A:B::C:D analogy pattern.
```
./fasttext analogiesSent model.bin corpora [k]
```
k is optional and is the number of nearest sentences that you want to output.
# Unigram Embeddings
For the purpose of generating word representations, we compared word embeddings obtained by training sent2vec models with other word embedding models, including a novel method we refer to as CBOW char + word ngrams (`cbow-c+w-ngrams`). This method augments the FastText character-augmented CBOW with word n-grams. You can see the full comparison of results in [*this paper*](https://www.aclweb.org/anthology/N19-1098).
## Extracting Word Embeddings from Pre-Trained Models
If you have the Cython wrapper installed, you can also work with the word embeddings obtained from `sent2vec` or `cbow-c+w-ngrams` models:
```python
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin') # The model can be sent2vec or cbow-c+w-ngrams
vocab = model.get_vocabulary() # Return a dictionary with words and their frequency in the corpus
uni_embs, vocab = model.get_unigram_embeddings() # Return the full unigram embedding matrix
uni_embs = model.embed_unigrams(['dog', 'cat']) # Return unigram embeddings given a list of unigrams
```
Asking for a unigram embedding that is not present in the vocabulary returns a zero vector in the case of sent2vec. The `cbow-c+w-ngrams` method can still use the character ngrams to infer some representation.
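A small illustration of this behaviour (with a made-up out-of-vocabulary token; `model` is the plain sent2vec model loaded above):

```python
import numpy as np

oov = model.embed_unigrams(['thiswordisdefinitelynotinthevocabulary'])
print(np.allclose(oov, 0))  # expected to be True for a plain sent2vec model
```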