How to Repair Data? The HoloClean Framework (unpublished paper)

Report on the HoloClean framework (GitHub repository), written for the seminar Horrible Data at HPI. The paper on which this work is based ([1]) can be found online.

The PDF version of this paper is included in this repository. You can download it from holoclean.pdf.

Abstract

In a world where big data and machine learning approaches are used in a growing number of enterprise applications, maintaining data quality has become more important than ever. Data repairing is one way to address this. Most repairing tools limit themselves to a single input signal, such as quantitative statistics, integrity constraints, or external data, when computing repairs. Rekatsinas et al. therefore propose a new holistic framework, called HoloClean [1], that takes multiple input signals into account and generates a probabilistic model from the unclean dataset and those signals. It uses DeepDive to perform statistical learning and probabilistic inference to compute the marginal probabilities of repairs, while scaling to datasets with millions of tuples. Rekatsinas et al. show that HoloClean achieves an average accuracy improvement of 2x over single-input state-of-the-art approaches across four different datasets.
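To make the idea of combining signals more concrete, the following is a minimal toy sketch and not HoloClean's or DeepDive's actual API. It assumes a tiny table of (zip, city) pairs, a functional dependency zip -> city as the integrity constraint, and co-occurrence frequencies as the statistical signal. All function names and the hand-set weights `w_stat` and `w_fd` are hypothetical; in the framework described above, such weights would be learned rather than fixed.

```python
import math
from collections import Counter

# Toy dirty table of (zip, city) pairs; the last row is the cell we want to repair.
rows = [
    ("60608", "Chicago"),
    ("60608", "Chicago"),
    ("60608", "Chicago"),
    ("60609", "Chicago"),
    ("60608", "Cicago"),
]

def cooccurrence_signal(rows, zip_code):
    """Statistical signal: empirical P(city | zip) from the observed data."""
    counts = Counter(city for z, city in rows if z == zip_code)
    total = sum(counts.values())
    return {city: n / total for city, n in counts.items()}

def constraint_signal(rows, zip_code, candidate, skip_index):
    """Constraint signal: number of other rows with the same zip whose city
    would violate the assumed dependency zip -> city if `candidate` were chosen."""
    return sum(
        1
        for i, (z, city) in enumerate(rows)
        if i != skip_index and z == zip_code and city != candidate
    )

def repair_marginals(rows, index, w_stat=1.0, w_fd=1.0):
    """Combine both signals in a log-linear score and normalize over the
    candidate values to get a probability for each possible repair."""
    zip_code, _ = rows[index]
    stats = cooccurrence_signal(rows, zip_code)
    scores = {
        c: w_stat * math.log(p) - w_fd * constraint_signal(rows, zip_code, c, index)
        for c, p in stats.items()
    }
    norm = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / norm for c, s in scores.items()}

marginals = repair_marginals(rows, index=4)
best = max(marginals, key=marginals.get)
print(best, marginals)
```

In this sketch both signals agree, so "Chicago" receives most of the probability mass. The output is a distribution over candidate values rather than a single hard repair, which is the sense in which the abstract speaks of marginal probabilities.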

Sources

[1]: Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic data repairs with probabilistic inference. In Proceedings of the VLDB Endowment, Vol. 10. VLDB Endowment, 1190–1201.
