pytextrank
PyTextRank is a Python implementation of TextRank as a spaCy extension, used to:
extract the top-ranked phrases from text documents
infer links from unstructured text into structured data
run extractive summarization of text documents
Note that PyTextRank is intended to provide support for entity linking, in contrast to the more commonplace usage of named entity recognition. These approaches can be used together in complementary ways to improve the results overall. The introduction of graph algorithms, notably eigenvector centrality, provides a more flexible and robust basis for integrating additional techniques that enhance the natural language work being performed.
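To make the graph-ranking idea concrete, here is a minimal sketch of PageRank, the eigenvector-centrality variant used in the TextRank paper, computed by power iteration over a small undirected graph. The toy lemmas, the `pagerank` function, and its default parameters are assumptions for illustration; this is not PyTextRank's internal code.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Rank the nodes of an undirected graph by power iteration.

    `graph` maps each node to the set of its neighbours.
    Every node is assumed to have at least one neighbour.
    """
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its current rank equally among its own links.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Hypothetical toy graph: lemmas that co-occur in a short text.
toy_graph = {
    "system": {"linear", "constraint", "solution", "equation"},
    "linear": {"system", "constraint", "equation"},
    "constraint": {"system", "linear"},
    "solution": {"system", "set"},
    "equation": {"system", "linear"},
    "set": {"solution"},
}

ranks = pagerank(toy_graph)
for lemma, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{score:.4f}  {lemma}")
```

Lemmas with more, and better-connected, neighbours accumulate higher scores, which is the signal a TextRank-style ranker relies on.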
Internally, PyTextRank constructs a lemma graph to represent links among the candidate phrases (e.g., unrecognized entities) and their supporting language. Generally speaking, any means of enriching that graph prior to phrase ranking will tend to improve results. Possible ways to enrich the lemma graph include coreference resolution and semantic relations, as well as leveraging knowledge graphs in the general case.
For example, WordNet and DBpedia both provide means for inferring links among entities, and purpose-built knowledge graphs can be applied for specific use cases. These can help enrich a lemma graph even in cases where links are not explicit within the text. Consider a paragraph that mentions cats and kittens in different sentences: an implied semantic relation exists between the two nouns since the lemma kitten is a hyponym of the lemma cat, such that an inferred link can be added between them.
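The cat and kitten case can be sketched roughly as follows. The `HYPERNYMS` table is a hypothetical stand-in for a WordNet lookup, and the two sentences and the helper functions are invented for illustration; none of this is part of the PyTextRank API.

```python
from collections import defaultdict

# Hypothetical stand-in for a WordNet lookup: lemma -> its hypernym.
HYPERNYMS = {"kitten": "cat"}


def add_edge(graph, a, b):
    """Add an undirected edge between lemmas a and b."""
    graph[a].add(b)
    graph[b].add(a)


def enrich(graph, hypernyms):
    """Link each lemma to its hypernym when both already appear in the graph."""
    added = []
    for hyponym, hypernym in hypernyms.items():
        if hyponym in graph and hypernym in graph and hypernym not in graph[hyponym]:
            add_edge(graph, hyponym, hypernym)
            added.append((hyponym, hypernym))
    return added


def connected(graph, start, goal):
    """Return True if goal is reachable from start (depth-first search)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return False


lemma_graph = defaultdict(set)
# Sentence 1: "The cat sleeps on the sofa."
add_edge(lemma_graph, "cat", "sleep")
add_edge(lemma_graph, "sleep", "sofa")
# Sentence 2: "A kitten chases the yarn."
add_edge(lemma_graph, "kitten", "chase")
add_edge(lemma_graph, "chase", "yarn")

before = connected(lemma_graph, "sofa", "yarn")
inferred = enrich(lemma_graph, HYPERNYMS)
after = connected(lemma_graph, "sofa", "yarn")
print(before, inferred, after)
```

Before enrichment the two sentences form separate components; the single inferred edge joins them, so ranking can propagate across both.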
This has an additional benefit of linking parsed and annotated documents into more structured data, and can also be used to support knowledge graph construction.
The TextRank algorithm used here is based on research published in:
"TextRank: Bringing Order into Text"
Rada Mihalcea, Paul Tarau
Empirical Methods in Natural Language Processing (2004)
Several modifications in PyTextRank improve on the algorithm originally described in the paper:
fixed a bug: see Java impl, 2008
use lemmatization in place of stemming
include verbs in the graph (but not in the resulting phrases)
leverage preprocessing via noun chunking and named entity recognition
provide extractive summarization based on ranked phrases
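As a rough illustration of two of the modifications listed above (lemmas instead of stems, and verbs kept in the graph but excluded from phrases), the sketch below builds a lemma graph from pre-tagged tokens. The input format, the `window` parameter, and the part-of-speech sets are assumptions made for this example; PyTextRank's actual construction differs in its details.

```python
GRAPH_POS = {"ADJ", "NOUN", "PROPN", "VERB"}   # kept as graph nodes
PHRASE_POS = GRAPH_POS - {"VERB"}              # allowed in output phrases


def build_lemma_graph(sentences, window=3):
    """Link each kept lemma to the next `window - 1` kept lemmas in its sentence."""
    graph = {}
    for sentence in sentences:
        kept = [(lemma, pos) for lemma, pos in sentence if pos in GRAPH_POS]
        for i, node in enumerate(kept):
            graph.setdefault(node, set())
            for other in kept[i + 1 : i + window]:
                if other != node:
                    graph.setdefault(other, set())
                    graph[node].add(other)
                    graph[other].add(node)
    return graph


# Hypothetical pre-tagged input: one list of (lemma, POS) pairs per sentence,
# such as a parser like spaCy might produce.
sentences = [
    [("criterion", "NOUN"), ("of", "ADP"), ("compatibility", "NOUN"),
     ("be", "AUX"), ("consider", "VERB")],
    [("algorithm", "NOUN"), ("construct", "VERB"), ("minimal", "ADJ"),
     ("set", "NOUN")],
]

graph = build_lemma_graph(sentences)
phrase_candidates = sorted(lemma for lemma, pos in graph if pos in PHRASE_POS)
print(phrase_candidates)
```

The verbs still contribute edges, so they influence the ranking of neighbouring nouns and adjectives without appearing among the candidate phrases.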
This implementation was inspired by the Williams 2016 talk on text summarization. Note that while there are better approaches for summarizing text, questions linger about some of the top contenders; see: 1, 2. Arguably, having alternatives such as this allows for cost trade-offs.
Prerequisites:
To install from PyPI:
pip install pytextrank
If you install directly from this Git repo, be sure to install the dependencies as well:
pip install -r requirements.txt
import spacy
import pytextrank

# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

doc = nlp(text)

# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d} {}".format(p.rank, p.count, p.text))
    print(p.chunks)
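Ranked phrases like those printed above can also drive extractive summarization. The following is a simplified sketch under stated assumptions, not PyTextRank's own scoring: each sentence is scored by the summed rank of the phrases it contains. The `summarize` helper is hypothetical, the rank values are made up rather than real output, and the sentences are shortened from the example text.

```python
def summarize(sentences, phrase_ranks, limit_sentences=2):
    """Return the highest-scoring sentences, kept in document order."""
    scored = []
    for index, sentence in enumerate(sentences):
        lowered = sentence.lower()
        score = sum(rank for phrase, rank in phrase_ranks.items() if phrase in lowered)
        scored.append((score, index))
    # Highest score first; ties resolved in favour of earlier sentences.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:limit_sentences]
    return [sentences[index] for _, index in sorted(best, key=lambda pair: pair[1])]


sentences = [
    "Compatibility of systems of linear constraints over the set of natural numbers.",
    "Criteria of compatibility of a system of linear Diophantine equations are considered.",
    "Upper bounds for components of a minimal set of solutions are given.",
]

# Hypothetical phrase ranks, made up for illustration (not real PyTextRank output).
phrase_ranks = {
    "minimal set": 0.15,
    "linear constraints": 0.12,
    "natural numbers": 0.10,
    "linear diophantine equations": 0.08,
}

summary = summarize(sentences, phrase_ranks)
for sentence in summary:
    print(sentence)
```

Keeping the selected sentences in their original order tends to make the extract easier to read than ordering them by score.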
For other example usage, see the PyTextRank wiki. If you need to troubleshoot any problems:
use GitHub issues (most recommended)
tweet to #textrank on Twitter (cc @pacoid)
For related course materials and training, please check for calendar updates in the article "Natural Language Processing in Python".
Let us know if you find this package useful, tell us about use cases, describe what else you would like to see integrated, etc. For inquiries about consulting work in machine learning, natural language, knowledge graph, and other AI applications, contact Derwen, Inc.
PyTextRank has an MIT license, which is succinct and simplifies use in commercial applications.
Please use the following BibTeX entry for citing PyTextRank in publications:
@Misc{PyTextRank,
  author = {Nathan, Paco},
  title = {PyTextRank, a Python implementation of TextRank for phrase extraction and summarization of text documents},
  howpublished = {\url{https://github.com/DerwenAI/pytextrank/}},
  year = {2016}
}
PR to propose adding PyTR to the spaCy Universe
update the wiki for version 2.x
include the unit tests
fix Sphinx errors, generate docs
build a conda package
show examples of spacy-wordnet to enrich the lemma graph
leverage neuralcoref to enrich the lemma graph
generate a phrase graph, with entity linking to DBpedia, etc.
Many thanks to contributors: @htmartin, @williamsmj, @mattkohl, @vanita5, @HarshGrandeur, @mnowotka, @kjam, @dvsrepo, @SaiThejeshwar, @laxatives, @dimmu, plus the support from Derwen, Inc.