
holoclean-summary-paper

2020-01-02

How to Repair Data? The HoloClean Framework (unpublished paper)

Report on the HoloClean framework (GitHub repository), written for the seminar Horrible Data at HPI. The paper on which this work is based ([1]) can be found online.

The PDF version of this paper is included in this repository. You can download it from holoclean.pdf.

Abstract

In a world where big data and machine learning approaches are used in a growing number of enterprise applications, maintaining data quality has become more important than ever. Data repairing is one way to address this. Most repairing tools limit themselves to a single input signal, such as quantitative statistics, integrity constraints, or external data, when computing repairs. Rekatsinas et al. therefore propose a holistic framework called HoloClean [1], which combines multiple input signals and builds a probabilistic model from the unclean dataset and those signals. It uses DeepDive to perform statistical learning and probabilistic inference, computing the marginal probability of each candidate repair while scaling to datasets with millions of tuples. Rekatsinas et al. report that HoloClean achieves an average F1 improvement of more than 2x over state-of-the-art approaches across four different datasets.
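To make the idea of combining several signals into one probabilistic model more concrete, here is a minimal sketch in Python. It is not HoloClean's implementation and does not use DeepDive. The table, the function `repair_marginals`, the three features, and the fixed values in `WEIGHTS` are all assumptions made for illustration; HoloClean learns its weights from the data instead of fixing them by hand.

```python
import math
from collections import Counter

# Toy table (hypothetical data): the dependency zip -> city should hold,
# but the row at index 3 contains a typo.
rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
    {"zip": "10001", "city": "New York"},
]

# Hypothetical hand-picked weights; a real system would learn these.
WEIGHTS = {"cooccurrence": 2.0, "constraint": 1.0, "minimality": 0.5}


def repair_marginals(rows, index, weights=WEIGHTS):
    """Return a probability for each candidate city of rows[index]."""
    target = rows[index]
    others = [r for i, r in enumerate(rows) if i != index]
    # Candidate repairs: every city value observed in the table.
    candidates = sorted({r["city"] for r in rows})
    same_zip = [r for r in others if r["zip"] == target["zip"]]
    counts = Counter(r["city"] for r in same_zip)

    scores = {}
    for value in candidates:
        # Signal 1: how often the candidate co-occurs with this zip code.
        cooc = counts[value] / len(same_zip) if same_zip else 0.0
        # Signal 2: number of zip -> city violations the candidate would cause.
        violations = sum(1 for r in same_zip if r["city"] != value)
        # Signal 3: minimality prior that favours keeping the observed value.
        minimal = 1.0 if value == target["city"] else 0.0
        scores[value] = (
            weights["cooccurrence"] * cooc
            - weights["constraint"] * violations
            + weights["minimality"] * minimal
        )

    # Softmax turns the weighted scores into a distribution over candidates.
    z = sum(math.exp(s) for s in scores.values())
    return {value: math.exp(s) / z for value, s in scores.items()}


marginals = repair_marginals(rows, 3)
best = max(marginals, key=marginals.get)
print(best, round(marginals[best], 3))
```

In this toy setting the statistics and the constraint both point away from the typo, so the minimality prior is outweighed and "Chicago" receives most of the probability mass. Treating each cell independently keeps the sketch short; it is a simplification chosen here, not a description of how the framework performs inference.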

Sources

[1]: Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic data repairs with probabilistic inference. In Proceedings of the VLDB Endowment, Vol. 10. VLDB Endowment, 1190–1201.

