Abstract
A Distributed and Scalable Machine Learning Approach for Big Data
Hongliang Guo, Jie Zhang
Rapid advances in data sensing and collection technologies make it easy to obtain large volumes of data (big data). However, big data poses serious challenges for many popular machine learning techniques that process all the data at once. To address these challenges, we first partition the data along its feature space and apply the parallel block coordinate descent algorithm for distributed computation; we then further partition the data along the sample space and propose a novel matrix decomposition and combination approach for distributed processing. The final results from all the entities are guaranteed to be the same as the centralized solution. Extensive experiments on Hadoop confirm that the proposed approach outperforms canonical distributed machine learning techniques for big data in both test error and convergence rate (computation time).
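As an illustration only, the following minimal sketch shows one way a sample-wise split can reproduce a centralized solution exactly. It assumes a least-squares objective, which the abstract does not specify, and it is not the authors' decomposition: each shard computes its own X^T X and X^T y, the shards are summed, and the combined system is solved by cyclic coordinate descent with feature blocks of size one, run sequentially as a stand-in for the parallel block variant. All function names and the toy data are hypothetical.

```python
def gram_and_moment(rows, targets):
    """Return (X^T X, X^T y) for one shard of samples."""
    d = len(rows[0])
    G = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for x, y in zip(rows, targets):
        for i in range(d):
            b[i] += x[i] * y
            for j in range(d):
                G[i][j] += x[i] * x[j]
    return G, b


def combine(stats):
    """Sum the per-shard (X^T X, X^T y) pairs element-wise."""
    d = len(stats[0][1])
    G = [[sum(s[0][i][j] for s in stats) for j in range(d)] for i in range(d)]
    b = [sum(s[1][i] for s in stats) for i in range(d)]
    return G, b


def coordinate_descent(G, b, sweeps=200):
    """Solve G w = b by cyclic coordinate descent (feature blocks of size one).

    Assumes G is symmetric positive definite, so each G[j][j] > 0.
    """
    d = len(b)
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            # Exact minimization over coordinate j with the others held fixed.
            rest = sum(G[j][k] * w[k] for k in range(d) if k != j)
            w[j] = (b[j] - rest) / G[j][j]
    return w


# Hypothetical toy data generated by y = 1*x1 + 2*x2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 2.0, 3.0, 4.0]

# Centralized statistics versus two sample shards combined afterwards.
G_central, b_central = gram_and_moment(X, y)
shards = [(X[:2], y[:2]), (X[2:], y[2:])]
G_dist, b_dist = combine([gram_and_moment(xs, ys) for xs, ys in shards])

w_central = coordinate_descent(G_central, b_central)
w_dist = coordinate_descent(G_dist, b_dist)
```

In this sketch the combined statistics match the centralized ones because X^T X and X^T y are sums over samples, so splitting the samples only reorders the additions. The paper's own guarantee may rest on a different construction.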