Abstract
Large transformer-based generative models trained on huge corpora have shown unparalleled language generation ability. While these models are powerful, fine-grained control over attributes of the generated language (e.g., gradually switching topic or sentiment) is difficult without modifying the model architecture to allow extra attribute inputs or fine-tuning with attribute-specific data. Both approaches entirely change the original generative function (a change that, if done poorly, cannot be undone) and also entail the significant cost of retraining. We instead propose the simple Plug and Play Language Model (PPLM) approach for controlled language generation. PPLM consists of plugging in simple attribute classifiers (which may be single-layer models or even a bag of words) and making updates in the activation space, without changing any model parameters. Such a control scheme provides vast flexibility and allows full recovery of the original generative function. The results demonstrate fine-grained control over a range of topics and sentiment styles, as well as the ability to detoxify generated text. Our experiments, including human evaluation studies, show that text generated via this control scheme is aligned with the desired attributes while retaining fluency.
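The core mechanism summarized above can be illustrated with a toy numerical sketch. This is not the authors' implementation: the linear "LM head" `W`, the vocabulary, the bag-of-words indices, and the step size are all hypothetical stand-ins. The sketch shows the essential move of the approach: score an attribute with a bag-of-words classifier on the output distribution, then take gradient ascent steps on the hidden activation (here derived analytically through the softmax) while leaving the model weights untouched.

```python
import math

# Toy sketch of the PPLM idea, not the paper's implementation.
# A fixed linear "LM head" W maps a hidden state h to word logits;
# the attribute model is a bag-of-words whose score is the total
# probability mass on the bag. We nudge h by gradient ascent on
# log p(attribute | h). W itself is never updated.

VOCAB = ["cat", "dog", "stock", "market", "bank"]  # hypothetical vocabulary
BAG = {2, 3, 4}  # hypothetical "finance" bag-of-words indices

# Hypothetical fixed LM head weights (vocab x hidden), never updated.
W = [
    [0.9, -0.2, 0.1],
    [0.4, 0.8, -0.5],
    [-0.3, 0.2, 0.7],
    [0.1, -0.6, 0.9],
    [-0.5, 0.3, 0.4],
]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def bag_prob(h):
    """Probability mass the LM head assigns to the bag-of-words."""
    z = [sum(w * x for w, x in zip(row, h)) for row in W]
    p = softmax(z)
    return sum(p[i] for i in BAG), p

def pplm_step(h, step_size=0.5):
    """One activation-space update: gradient ascent on log p(bag | h)."""
    pb, p = bag_prob(h)
    # d log(sum_{i in BAG} p_i) / d z_j = p_j * (1{j in BAG}/pb - 1)
    gz = [p[j] * ((1.0 / pb if j in BAG else 0.0) - 1.0)
          for j in range(len(p))]
    # Chain rule through the linear head: grad_h = W^T @ grad_z.
    gh = [sum(W[j][k] * gz[j] for j in range(len(W)))
          for k in range(len(h))]
    return [x + step_size * g for x, g in zip(h, gh)]

h = [1.0, 0.5, -0.5]          # hidden state from the unmodified model
before, _ = bag_prob(h)
for _ in range(10):
    h = pplm_step(h)
after, _ = bag_prob(h)
print(before < after)         # steering raises the attribute likelihood
```

Dropping the perturbation (i.e., never calling `pplm_step`) recovers the original distribution exactly, which mirrors the claim that the original generative function is fully recoverable; the real method applies such updates to transformer key-value activations at each decoding step rather than to a single vector.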