@inproceedings{cho-lebanoff-foroosh-liu:2019,
Author = {Sangwoo Cho and Logan Lebanoff and Hassan Foroosh and Fei Liu},
Title = {Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization},
Booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)},
Year = {2019}}
This repository contains the code for a similarity measure network based on a Capsule network.
Dependencies
This code is developed with the following environment:
Train and evaluate on the CNN/DM summary pair dataset
Set up directory for training/testing data
$ git clone https://github.com/sangwoo3/summarization-dpp-capsnet.git && cd summarization-dpp-capsnet
$ mkdir data && cd data
Download the data
Download the CNN/DM summary pair dataset from HERE and extract it under the /data directory.
This dataset is pre-processed with a vocabulary of the 50k most frequent words in the CNN/DM summary pair dataset. The label is 1 for a positive sentence pair and 0 for a negative pair. A positive pair consists of a summary sentence and the sentence in the source document most similar to it, that is, the one that yields the highest ROUGE scores. A negative pair consists of the same summary sentence and a random sentence from the same document.
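The pair construction described above can be sketched roughly as follows. This is only an illustration, not the repository's preprocess.py: the helper names `unigram_f1` and `make_pairs` are hypothetical, a simple unigram-overlap F1 stands in for the real ROUGE scores, and excluding the positive sentence when drawing the negative is an assumption.

```python
import random
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1, a crude stand-in for ROUGE (assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def make_pairs(summary_sentence, source_sentences, rng):
    """Return (summary, source, label) tuples: one positive, one negative."""
    # Positive pair: the source sentence scoring highest against the summary.
    best = max(source_sentences, key=lambda s: unigram_f1(s, summary_sentence))
    pairs = [(summary_sentence, best, 1)]
    # Negative pair: a random sentence from the same document.
    # Skipping the positive sentence here is an assumption of this sketch.
    others = [s for s in source_sentences if s != best]
    if others:
        pairs.append((summary_sentence, rng.choice(others), 0))
    return pairs


source = [
    "The city council approved the new budget on Monday.",
    "Residents gathered outside the hall before the vote.",
    "Rain is expected later this week.",
]
pairs = make_pairs("The council approved the budget.", source, random.Random(0))
for summary, sentence, label in pairs:
    print(label, "|", sentence)
```

Passing a seeded `random.Random` keeps the negative sampling reproducible across runs.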
Download the GloVe word vectors for the 50k vocabulary from HERE and place them under the /data directory.
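If the downloaded vectors use the plain-text GloVe layout (one word per line followed by its space-separated float values), they could be read along these lines. The file format of the linked download is not stated here, so treat that layout as an assumption; `load_glove` is a hypothetical helper, and the temporary file only stands in for the real download.

```python
import os
import tempfile


def load_glove(path):
    """Read GloVe-style text vectors into a dict: word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny stand-in file so the sketch runs without the real download.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the 0.1 0.2 0.3\nsummary -0.4 0.5 0.6\n")
    tmp_path = tmp.name

glove = load_glove(tmp_path)
os.remove(tmp_path)
print(len(glove), "words,", len(glove["the"]), "dimensions")
```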
If you want the raw CNN/DM summary dataset, download it from HERE.
This data contains candidate summary sentences for each document. It is pre-processed with the preprocess.py file to generate the above CNN/DM summary pair dataset.
Download the pre-trained model from HERE and place it under the /result/capnet_sim directory.
/result/capnet_sim is the default directory for training results.
Download the model fine-tuned on the STS dataset from HERE.
This model is trained on the CNN/DM summary pair dataset and then fine-tuned on STS.
It can be used to evaluate STS prediction accuracy.
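STS predictions are conventionally scored by the Pearson correlation between predicted and gold similarity scores. Which metric this repository reports is not stated here, so the sketch below is an assumption about the evaluation rather than the repository's own script; `pearson` is a hypothetical helper and the scores are made-up illustrative values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: gold STS scores lie in [0, 5].
gold = [5.0, 3.2, 0.0, 4.1]
predicted = [4.6, 2.9, 0.8, 3.5]
score = pearson(gold, predicted)
print(round(score, 3))
```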
System summary
We provide our best system summaries for DUC04 and TAC11. They were generated with DPP and are located in the system_summary directory. Due to licensing restrictions, we do not provide the DPP code or the multi-document datasets. Please refer to the DPP code, and obtain the DUC 03/04 and TAC 08/09/10/11 datasets by submitting a request and receiving approval.
License
This project is licensed under the BSD License - see the LICENSE.md file for details.