Authors
Chaofan Dai, Qideng Tang, Wubin Ma, Yahui Wu, Haohao Zhou and Huahua Ding, National University of Defense Technology, China
Abstract
Entity resolution (ER), which aims to identify whether data records from various sources refer to the same real-world entity, is a crucial part of data integration systems. Traditional ER solutions assume that data records are stored in relational tables with an aligned schema. However, in practical applications, the data records to be matched often have different formats (e.g., relational, semi-structured, or textual types). To support ER for data records with varying formats, Generalized Entity Resolution has been proposed and has recently gained much attention. In this paper, we propose PromptER, a model based on pre-trained language models that offers an efficient and effective approach to Generalized Entity Resolution tasks. PromptER starts with a supervised contrastive learning process to train a Transformer encoder, which is afterward used for blocking and fine-tuned for matching. Specifically, in the record embedding process, PromptER uses the proposed prompt embedding technique to better utilize the pre-trained language model layers and avoid embedding bias. Moreover, we design a novel data augmentation method and an evaluation method to enhance the performance of the proposed model. We conduct experiments on the Generalized Entity Resolution dataset Machamp, and the results show that PromptER significantly outperforms other state-of-the-art methods in the blocking and matching tasks.
Keywords
Entity resolution, data integration, deep learning, contrastive learning, prompt learning