Abstract — Large-scale customer service call records contain a wealth of valuable information for business intelligence. However, such records were not utilized for analysis before the big data era. Two fundamental problems must be addressed before mining and analysis: 1) a telephone conversation mixes the utterances of agents and users, which have to be separated before analysis; 2) the speakers in a conversation do not come from a pre-defined set. These problems are new challenges that have not been well studied in previous work. In this paper, we propose a four-phase framework for role labeling in real customer service telephone conversations, which benefits from integrating multi-modal features, i.e., both low-level acoustic features and semantic-level textual features. First, we conduct Bayesian Information Criterion (BIC) based speaker diarization to obtain two segment clusters from an audio stream. Second, the segments are transcribed into text in an Automatic Speech Recognition (ASR) phase with a DNN-HMM deep learning model. Third, by integrating acoustic and textual features, dialog-level role labeling is proposed to map the two clusters to the agent and the user. Finally, sentence-level role correction is designed to correct the labeling results at a fine-grained granularity, reducing the errors made in the previous phases. The proposed framework is tested on two real datasets: mobile and bank customer service call datasets. The precision of dialog-level labeling exceeds 99.0%. At the sentence level, the labeling accuracy reaches 90.4%, greatly outperforming a traditional method based only on acoustic features, which achieves just 78.5% accuracy.
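The BIC-based speaker diarization step above typically decides whether two adjacent audio segments belong to the same speaker via a ΔBIC merge test; the abstract does not spell out the criterion, so the following standard single-Gaussian formulation is an assumption for illustration:

\[
\Delta\mathrm{BIC} = \frac{N}{2}\log|\Sigma| \;-\; \frac{N_1}{2}\log|\Sigma_1| \;-\; \frac{N_2}{2}\log|\Sigma_2| \;-\; \lambda\,\frac{P}{2}\log N,
\qquad P = d + \frac{d(d+1)}{2},
\]

where \(N = N_1 + N_2\) is the total number of frames, \(\Sigma\), \(\Sigma_1\), \(\Sigma_2\) are the covariance matrices of the merged segment and the two individual segments, \(d\) is the acoustic feature dimension, and \(\lambda\) is a penalty weight. Under this convention, the two segments are merged into one speaker cluster when \(\Delta\mathrm{BIC}\) falls below zero (sign conventions vary across implementations).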