Abstract
Representing words as low-dimensional vectors is
very useful in many natural language processing
tasks. This idea has been extended to the medical domain, where medical codes listed in medical claims
are represented as vectors to facilitate exploratory
analysis and predictive modeling. However, depending on the type of medical provider, medical
claims can use medical codes from different ontologies or from a combination of ontologies, which
complicates learning of the representations. To be
able to properly utilize such multi-source medical
claim data, we propose an approach that represents
medical codes from different ontologies in the same
vector space. We first modify the Pointwise Mutual
Information (PMI) measure of similarity between
the codes. We then develop a new negative sampling method for the word2vec model that implicitly
factorizes the modified PMI matrix. The new approach was evaluated on the code cross-reference
problem, which aims at identifying similar codes
across different ontologies. In our experiments,
we evaluated cross-referencing between ICD-9 and
CPT medical code ontologies. Our results indicate that vector representations of codes learned
by the proposed approach provide superior cross-referencing when compared to several existing approaches.
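As background for the measure named above: the abstract does not specify the proposed modification, so the following is only a minimal sketch of the standard PMI matrix computed from a code co-occurrence count matrix, with illustrative names; it is not the paper's modified measure.

```python
import numpy as np

def pmi_matrix(cooc):
    """Compute PMI(i, j) = log(p(i, j) / (p(i) p(j))) from a
    co-occurrence count matrix; undefined entries are set to 0."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    p_joint = cooc / total                            # p(i, j)
    p_row = cooc.sum(axis=1, keepdims=True) / total   # p(i)
    p_col = cooc.sum(axis=0, keepdims=True) / total   # p(j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_row * p_col))
    pmi[~np.isfinite(pmi)] = 0.0                      # zero out log(0) cells
    return pmi
```

Learning embeddings whose inner products approximate such a matrix is the factorization view of word2vec with negative sampling that the abstract refers to.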