SPADE: A Semi-supervised Probabilistic Approach for Detecting Errors in Tables

Minh Pham, Craig A. Knoblock, Muhao Chen, Binh Vu, Jay Pujara

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
Main Track. Pages 3543-3551. https://doi.org/10.24963/ijcai.2021/488

Error detection is one of the most important steps in data cleaning and usually requires extensive human interaction to ensure quality. Existing supervised methods for error detection require a significant amount of training data, while unsupervised methods rely on fixed inductive biases that are usually hard to generalize. In this paper, we present SPADE, a novel semi-supervised probabilistic approach for error detection. SPADE introduces a probabilistic active learning model in which the system suggests examples to be labeled based on the agreement between user labels and indicative signals, which are designed to capture potential errors. SPADE uses a two-phase data augmentation process to enrich the dataset before training a deep learning classifier to detect errors in the unlabeled data. In our evaluation, SPADE achieves an average F1-score of 0.91 over five datasets and yields a 10% improvement compared with state-of-the-art systems.
Keywords:
Machine Learning Applications: Applications of Supervised Learning
Data Mining: Anomaly/Outlier Detection