
bert-token-embeddings


Bert Pretrained Token Embeddings

BERT (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding) yields pretrained token (i.e. subword) embeddings. Let's extract them and save them in the word2vec format so that they can be used for downstream tasks.

Requirements

  • pytorch_pretrained_bert

  • NumPy

  • tqdm

Extraction

  • Check extract.py.
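
  The sketch below shows one way the extraction could look with pytorch_pretrained_bert; the model name and output filename are illustrative assumptions, and extract.py remains the authoritative script.

  ```python
  from tqdm import tqdm
  from pytorch_pretrained_bert import BertModel, BertTokenizer

  model_name = "bert-base-multilingual-cased"  # assumption: any model from the table below works
  tokenizer = BertTokenizer.from_pretrained(model_name)
  model = BertModel.from_pretrained(model_name)

  # The input token embedding matrix has shape (vocab_size, hidden_dim).
  embeddings = model.embeddings.word_embeddings.weight.detach().numpy()
  vocab = [tokenizer.ids_to_tokens[i] for i in range(len(tokenizer.ids_to_tokens))]

  # word2vec text format: a "<vocab_size> <dim>" header, then one token per line.
  with open(f"{model_name}.vec", "w", encoding="utf-8") as fout:
      fout.write(f"{len(vocab)} {embeddings.shape[1]}\n")
      for token, vec in tqdm(zip(vocab, embeddings), total=len(vocab)):
          fout.write(token + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")
  ```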

Bert (Pretrained) Token Embeddings in word2vec format

Models                         | # Vocab | # Dim | Notes
bert-base-uncased              | 30,522  | 768   |
bert-large-uncased             | 30,522  | 1024  |
bert-base-cased                | 28,996  | 768   |
bert-large-cased               | 28,996  | 1024  |
bert-base-multilingual-cased   | 119,547 | 768   | Recommended
bert-base-multilingual-uncased | 30,522  | 768   | Not recommended
bert-base-chinese              | 21,128  | 768   |

Example

  • Check example.ipynb to see how to load the (sub-)word vectors with gensim and plot them in 2D space using t-SNE; a minimal sketch of the same steps follows this list.

  • Related tokens to look [figure]

  • Related tokens to ##go [figure]
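
The sketch below illustrates the loading and plotting steps; example.ipynb walks through them interactively. The .vec filename carries over from the extraction sketch above, the query token "look" is assumed to be in the vocabulary (it appears in the figures), and scikit-learn's t-SNE plus matplotlib are used here for the 2D projection.

```python
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

# Load the word2vec-format text file produced by the extraction step.
kv = KeyedVectors.load_word2vec_format("bert-base-multilingual-cased.vec", binary=False)

# Nearest subwords to "look" by cosine similarity.
print(kv.most_similar("look", topn=10))

# Project the neighbourhood of "look" to 2D with t-SNE and plot it.
tokens = ["look"] + [w for w, _ in kv.most_similar("look", topn=50)]
vectors = np.array([kv[w] for w in tokens])
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)

plt.figure(figsize=(8, 8))
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), token in zip(coords, tokens):
    plt.annotate(token, (x, y))
plt.show()
```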
