Monolingual and Multilingual Image Captioning
This is the source code that accompanies "Multilingual Image Description with Neural Sequence Models". You can use it to train multilingual multimodal language models for image description.
Dependencies:
- CUDA 6.5, 7.0, or 7.5
- Python 2.7
- numpy 1.9.1
- scipy 0.15
- h5py 2.5.0
- dominate, to visualise the generated descriptions
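Assuming pip is available, the Python dependencies can be installed along these lines; the version pins mirror the list above, and Theano (implied by the THEANO_FLAGS training commands below) is added without a pin:

```
pip install numpy==1.9.1 scipy==0.15.0 h5py==2.5.0 dominate
pip install theano  # used by the training commands below; version not pinned here
```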
Download a pre-processed version of the IAPRTC-12 dataset for English and German from Dropbox. Unzip the archives into iaprtc12_eng and iaprtc12_ger, respectively.
Run python util/makejson.py --path iaprtc12_eng followed by python util/jsonmat2h5.py --path iaprtc12_eng to create the dataset.h5 file expected by GroundedTranslation. Repeat this process, replacing eng with ger, to create the German dataset.h5 file; the full sequence is shown below.
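For reference, the complete preprocessing sequence for both languages:

```
python util/makejson.py --path iaprtc12_eng
python util/jsonmat2h5.py --path iaprtc12_eng
python util/makejson.py --path iaprtc12_ger
python util/jsonmat2h5.py --path iaprtc12_ger
```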
Run THEANO_FLAGS=floatX=float32,device=gpu0 python train.py --dataset iaprtc12_eng --hidden_size=256 --fixed_seed --run_string=fixed_seed-eng256mlm to train an English Vision-to-Language one-layer LSTM. Training takes 500s/epoch on a Tesla K20X.
By default, this uses --optimiser=adam, --batch_size=100 instances, --big_batch=10000, and --l2reg=1e-8 weight regularisation. The hidden units have --hidden_size=256 dimensions, with a dropout parameter of --dropin=0.5 and an --unk=3 threshold for pruning the word vocabulary.
This model should report a maximum BLEU4 of 15.21 (PPLX 6.898) on the val split, using a fixed seed of 1234.
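The same run with the default flags spelled out should be equivalent to the shorter command above (a sketch, assuming each documented default can also be passed explicitly on the command line):

```
THEANO_FLAGS=floatX=float32,device=gpu0 python train.py \
  --dataset iaprtc12_eng --hidden_size=256 \
  --optimiser=adam --batch_size=100 --big_batch=10000 --l2reg=1e-8 \
  --dropin=0.5 --unk=3 \
  --fixed_seed --run_string=fixed_seed-eng256mlm
```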
Run THEANO_FLAGS=floatX=float32,device=gpu0 python train.py --dataset iaprtc12_ger --hidden_size=256 --fixed_seed --run_string=fixed_seed-ger256mlm to train a German Vision-to-Language one-layer LSTM. Training also takes 500s/epoch on a Tesla K20X.
The default hyperparameters are the same as for the English model above.
This model should report a maximum BLEU4 of 11.91 (PPLX 9.347) on the val split, using a fixed seed of 1234.
Run THEANO_FLAGS=floatX=float32,device=gpu0 python extract_hidden_features.py --dataset=iaprtc12_eng --model_checkpoints=PATH_TO_MODEL_CHECKPOINTS --hidden_size=256 --h5_writeable to extract the final hidden state representations from a saved model state. The representations will be stored in dataset/dataset.h5 in the gold-hidden_feats-vis_enc-256 field.
You can add --use_predicted_tokens, --hidden_size, and --no_image to affect the label of the storage field. Specifically, --hidden_size can only be varied with an appropriately trained model; --no_image can only be used with a model trained over word inputs alone; and --use_predicted_tokens only makes sense with an MLM.
- --hidden_size=512 -> gold-hidden_feats-vis_enc-512 (multimodal hidden features with 512 dimensions)
- --use_predicted_tokens -> predicted-hidden_feats-vis_enc-256 (hidden features from predicted descriptions)
- --no_image -> gold-hidden_feats-mt_enc-256 (LM-only hidden features)
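To confirm that the features were written, the file can be inspected with h5ls from the standard HDF5 command-line tools, assuming they are installed; the exact group layout inside dataset.h5 may differ from this sketch:

```
# list all entries in the English dataset file and keep the hidden-feature fields
h5ls -r iaprtc12_eng/dataset.h5 | grep hidden_feats
```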
If you want to train a German model with transferred features from English, run THEANO_FLAGS=floatX=float32,device=gpu0 python train.py --dataset iaprtc12_ger --hidden_size=256 --fixed_seed --source_vectors=iaprtc12_eng --source_type=gold --source_enc=vis_enc --run_string=fixed_seed-eng256mlm-ger256mlm to train a German one-layer LSTM conditioned on the English hidden features.
The default hyperparameters are again the same as for the monolingual models above.
This model should report a maximum BLEU4 of 14.79 (PPLX 9.525) on the val split, using a fixed seed of 1234. This represents a 2.88 BLEU point improvement over the German monolingual baseline.
In the other direction, let's train an English model with transferred German features: THEANO_FLAGS=floatX=float32,device=gpu0 python train.py --dataset iaprtc12_eng --hidden_size=256 --fixed_seed --source_vectors=iaprtc12_ger --source_type=gold --source_enc=vis_enc --run_string=fixed_seed-ger256mlm-eng256mlm. This model should report a maximum BLEU4 of 19.78 (PPLX 6.148) on the val split, using a fixed seed of 1234. This represents a 4.57 BLEU point improvement over the English monolingual baseline.
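For reference, the validation results reported above, side by side:

| Model | Max BLEU4 | PPLX |
| --- | --- | --- |
| English monolingual | 15.21 | 6.898 |
| German monolingual | 11.91 | 9.347 |
| German + English source features | 14.79 | 9.525 |
| English + German source features | 19.78 | 6.148 |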