A set of scripts for fine-tuning GPT-2 and BERT models on Reddit data to generate realistic replies.
Jupyter notebooks are also available on Google Colab here
See my blog post for a walkthrough of running the scripts
Processing training data
I use pandas read_gbq to read from Google BigQuery; get_reddit_from_gbq.py automates the download. prep_data.py then cleans and transforms the data into a format usable by the GPT-2 and BERT fine-tuning scripts. I manually upload the results from prep_data.py to Google Drive for use in the Google Colab notebooks.
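As a rough sketch of the kind of cleaning a script like prep_data.py might perform (the column name 'body', the placeholder strings, and the length threshold below are my own assumptions for illustration, not the script's actual logic):

```python
import pandas as pd

def clean_comments(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning pass over comments pulled from BigQuery."""
    # Drop rows with no comment text at all.
    df = df.dropna(subset=['body'])
    # Reddit replaces deleted/removed comment text with these placeholders.
    df = df[~df['body'].isin(['[deleted]', '[removed]'])]
    # Normalize whitespace and keep comments long enough to be useful
    # training examples (threshold is an illustrative choice).
    df = df.assign(body=df['body'].str.strip())
    df = df[df['body'].str.len() >= 10]
    return df.reset_index(drop=True)
```

For example, a frame containing a deleted comment, a null row, and a very short comment would be reduced to just the usable rows.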
import praw

# Authenticate against the Reddit API; replace the placeholder strings
# with your own app credentials.
reddit = praw.Reddit(client_id='client_id',
                     client_secret='client_secret',
                     password='reddit_password',
                     username='reddit_username',
                     user_agent='reddit user agent name')
...
subreddit = reddit.subreddit(subreddit_name)
for h in subreddit.rising(limit=5):
    # Expand "load more comments" stubs so iteration sees real comments.
    h.comments.replace_more(limit=0)
    for c in h.comments:
        ...  # process each top-level comment c here
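The comments collected above eventually have to be flattened into plain text for GPT-2 fine-tuning. A minimal sketch of one way to format parent/reply pairs into training examples (the pair structure and the use of GPT-2's end-of-text token as a separator are my own convention here, not necessarily what the scripts produce):

```python
# GPT-2's standard document separator; it keeps examples from bleeding
# into each other during fine-tuning.
END_TOKEN = '<|endoftext|>'

def format_pair(parent: str, reply: str) -> str:
    """Join one (parent, reply) pair into a single training example."""
    return f'{parent.strip()}\n{reply.strip()}\n{END_TOKEN}\n'

# Illustrative pairs; in practice these would come from the cleaned data.
pairs = [('What editor do you use?', 'Mostly vim, with a few plugins.')]
training_text = ''.join(format_pair(p, r) for p, r in pairs)
```

The resulting text file can then be fed directly to a GPT-2 fine-tuning script that splits on the separator token.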