stanford_corenlp_pywrapper
UPDATE MARCH 2018: this code is obsolete, beyond the issue of version compatibility issues, because CoreNLP has an easy-to-use and well-documented server mode now (I think for a while?). Example of using it: https://gist.github.com/brendano/29d9dc619bd7e087b459e6027a52af89
The only possible advantage of this wrapper code is that it does process management for you under the python process, which might be slightly convenient since you don't have to run a separate server. But this architecture is much worse with regards to parallelization (the external server can load resources only once and use threads to parallelize for multiple clients) and certain types of development convenience (with an external server, you don't have to re-load the models during development). I guess this code could be useful if you have to use an older CoreNLP version (for example, if you want to replicate older research results that depend on older formats of things).
=========================================================
This is a Python wrapper for the Stanford CoreNLP library for Unix (Mac, Linux), allowing sentence splitting, POS/NER, temporal expression, constituent and dependency parsing, and coreference annotations. It runs the Java software as a subprocess and communicates with named pipes or sockets. It can also be used to get a JSON-formatted version of the NLP annotations.
Alternatives you may want to consider:
Many others listed here
Obsolete notice?: This was written around 2015 or so. But at some point (later?) CoreNLP added its own server mode, which is better to use than the server included inside of this package -- it stays up to date with their system, presumably -- and also it now has native JSON output support. This package should probably be replaced with a python client for that server and possibly the process management support.
You need to have CoreNLP already downloaded. If you want to, you can install this software with something like:
git clone https://github.com/brendano/stanford_corenlp_pywrapper cd stanford_corenlp_pywrapper pip install .
Or you can just put the stanford_corenlp_pywrapper
subdirectory into your
project (or use virtualenv, etc.). For example:
git clone https://github.com/brendano/stanford_corenlp_pywrapper scp_repo ln -s scp_repo/stanford_corenlp_pywrapper .
Java needs to be a version that CoreNLP is happy with; perhaps version 8.
See proc_text_files.py
for an example of processing text files.
Note that you'll have to edit it to specify the jar paths as described below.
The basic arguments to open a server are (1) the pipeline mode (or alternatively, the annotator pipeline), and (2) the path to the CoreNLP jar files (passed on to the Java classpath)
The pipeline modes are just quick shortcuts for some pipeline configurations we
commonly use. They are defined near the top ofstanford_corenlp_pywrapper/sockwrap.py
and include
ssplit
: tokenization and sentence splitting (included in all subsequent ones)
pos
: POS (and lemmas)
ner
: POS and NER (and lemmas)
parse
: fairly basic parsing with POS, lemmas, trees, dependencies
nerparse
: parsing with NER, POS, lemmas, depenencies.
coref
: Coreference, including constituent parsing.
Here we assume the program has been installed using pip install
. You will
have to change corenlp_jars
to where you have them on your system.
Here's how to initialize the pipeline with the pos
mode:
>>> from stanford_corenlp_pywrapper import CoreNLP >>> proc = CoreNLP("pos", corenlp_jars=["/home/sw/corenlp/stanford-corenlp-full-2015-04-20/*"])
If things are working there will be lots of messages looking something like:
INFO:CoreNLP_PyWrapper:mode given as 'pos' so setting annotators: tokenize, ssplit, pos, lemma INFO:CoreNLP_PyWrapper:Starting java subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -XX:ParallelGCThreads=1 -cp '/Users/brendano/sw/nlp/stanford_corenlp_pywrapper/stanford_corenlp_pywrapper/lib/*:/home/sw/corenlp/stanford-corenlp-full-2015-04-20/*:/home/sw/stanford-srparser-2014-10-23-models.jar' corenlp.SocketServer --outpipe /tmp/corenlp_pywrap_pipe_pypid=140_time=1435943221.14 --configdict '{"annotators":"tokenize, ssplit, pos, lemma"}' Adding annotator tokenize TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer. Adding annotator ssplit Adding annotator pos Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec]. Adding annotator lemma INFO:CoreNLP_JavaServer: CoreNLP pipeline initialized. INFO:CoreNLP_JavaServer: Waiting for commands on stdin INFO:CoreNLP_PyWrapper:Successful ping. The server has started. INFO:CoreNLP_PyWrapper:Subprocess is ready.
Now it's ready to parse documents. You give it a string and it returns JSON-safe data structures:
>>> proc.parse_doc("hello world. how are you?") {u'sentences': [ {u'tokens': [u'hello', u'world', u'.'], u'lemmas': [u'hello', u'world', u'.'], u'pos': [u'UH', u'NN', u'.'], u'char_offsets': [[0, 5], [6, 11], [11, 12]] }, {u'tokens': [u'how', u'are', u'you', u'?'], u'lemmas': [u'how', u'be', u'you', u'?'], u'pos': [u'WRB', u'VBP', u'PRP', u'.'], u'char_offsets': [[13, 16], [17, 20], [21, 24], [24, 25]] } ] }
You can also specify the annotators directly. For example,
say we want to parse but don't want lemmas. This can be done
with the configdict
option:
>>> p = CoreNLP(configdict={'annotators':'tokenize, ssplit, pos, parse'}, output_types=['pos','parse'])
Or use an external configuration file (of the same sort the original CoreNLP commandline uses):
>>> p = CoreNLP(configfile='sample.ini') >>> p.parse_doc("hello world. how are you?") ...
The annotators
configuration option is explained more on the CoreNLP webpage.
Another example: using the shift-reduce constituent parser. The jar paths will have to be changed for your system.
>>> p = CoreNLP(configdict={ 'annotators': "tokenize,ssplit,pos,lemma,parse", 'parse.model': 'edu/stanford/nlp/models/srparser/englishSR.ser.gz'}, corenlp_jars=["/path/to/stanford-corenlp-full-2015-04-20/*", "/path/to/stanford-srparser-2014-10-23-models.jar"])
Another example: coreference. This tool does not annotatecoreference in
the same way that it annotates other linguistic features. Where the other kinds
of annotation (for instance, part of speech tagging) are collected in the
top-level sentences
attribute of the json output, coreference annotations get
collected in the attribute, entities
.
In the example below, there are 3 entities in the two sentences. The first is "Fred", the second is "her", and the third is the telescope. The telescope is mentioned twice, ("telescope" and "It"), so there are two mention objects in the json. "It" and "telescope" are said to co-refer.
>>> proc = CoreNLP("coref") >>> proc.parse_doc("Fred saw her through a telescope. It was broken.")['entities'] [{u'entityid': 1, u'mentions': [{u'animacy': u'ANIMATE', u'gender': u'MALE', u'head': 0, u'mentionid': 1, u'mentiontype': u'PROPER', u'number': u'SINGULAR', u'representative': True, u'sentence': 0, u'tokspan_in_sentence': [0, 1]}]}, {u'entityid': 2, u'mentions': [{u'animacy': u'ANIMATE', u'gender': u'FEMALE', u'head': 2, u'mentionid': 2, u'mentiontype': u'PRONOMINAL', u'number': u'SINGULAR', u'representative': True, u'sentence': 0, u'tokspan_in_sentence': [2, 3]}]}, {u'entityid': 3, u'mentions': [{u'animacy': u'INANIMATE', u'gender': u'NEUTRAL', u'head': 5, u'mentionid': 3, u'mentiontype': u'NOMINAL', u'number': u'SINGULAR', u'representative': True, u'sentence': 0, u'tokspan_in_sentence': [4, 6]}, {u'animacy': u'INANIMATE', u'gender': u'NEUTRAL', u'head': 0, u'mentionid': 4, u'mentiontype': u'PRONOMINAL', u'number': u'SINGULAR', u'sentence': 1, u'tokspan_in_sentence': [0, 1]}]}]
We always use 0-indexed numbering conventions for token, sentence, and character indexes. Spans are always inclusive-exclusive pairs, just like Python slicing.
You can get the raw unserialized JSON with the option raw=True
: e.g.,parse_doc("Hello world.", raw=True)
. The python<->java communication is
based on JSON and this just hands it back without deserializing it. In
fact you can run the Java code as a standalone commandline program to just
produce the JSON format. This can be helpful for storing parses from large
corpora. (Even though this format is pretty repetitive, it is much more
compact than CoreNLP's XML format. Though of course protobuf or something
should be better.)
To use a different CoreNLP version, just update corenlp_jars
to what you want. If a future CoreNLP breaks binary (Java API)
compatibility, you'll have to edit the Java server code and re-compile with./build.sh
.
To change the Java settings, see the java_command
and java_options
arguments.
Output messages (on standard error) that start with INFO:CoreNLP_PyWrapper
,INFO:CoreNLP_JavaServer
, or INFO:CoreNLP_RWrapper
are from our code.
Other output is probably from CoreNLP.
Only works on Unix (Linux and Mac). Does not currently work on Windows.
If you want to know the latest version of CoreNLP this has been tested with, look at the paths in the default options in the Python source code.
SOCKET
mode: By default, the inter-process communication is
through named pipes, established with Unix calls. As an alternative, there is
also a socket server mode (comm_mode='SOCKET'
) which is sometimes more robust,
but requires using a port number, which you have to ensure does not
conflict with any other processes running at the same time. (It's not much
of a server since the python code assumes it's the only process
communicating with it.) One advantage of `SOCKET' mode is that it has
a timeout, in case CoreNLP is taking a very long time to return an answer.
Question: do JPype orPy4J work well? They seemed complex which is why we wrote our own IPC mechanism. But if there's a better alternative, no need.
There's a tiny amount of pytest-style tests.
py.test -v sockwrap.py
Major changes include
2015-07-03: add pipe mode and make it default (the namedpipe branch), plus an R wrapper.
2015-05-15: no longer need to specify output_types
(the outputs to include are inferred from the annotators
setting).
For details see the commit log.
Copyright Brendan O'Connor (http://brenocon.com).
License GPL version 2 or later.
上一篇:spark-corenlp
还没有评论,说两句吧!
热门资源
seetafaceJNI
项目介绍 基于中科院seetaface2进行封装的JAVA...
spark-corenlp
This package wraps Stanford CoreNLP annotators ...
Keras-ResNeXt
Keras ResNeXt Implementation of ResNeXt models...
capsnet-with-caps...
CapsNet with capsule-wise convolution Project ...
shih-styletransfer
shih-styletransfer Code from Style Transfer ...
智能在线
400-630-6780
聆听.建议反馈
E-mail: support@tusaishared.com