Streaming Algorithms and Lower Bounds for Estimating Correlation Clustering Cost
Authors:
Sepehr Assadi, Vihan Shah, Chen Wang.
Abstract:
Correlation clustering is a fundamental optimization problem at the intersection of machine learning and theoretical computer science.
Motivated by applications to big data processing, recent years have witnessed a flurry of results on this problem in the streaming model.
In this model, the algorithm needs to process the input n-vertex graph by making one or few passes over the stream of its edges and using a limited
memory, much smaller than the input size.
All previous work on streaming correlation clustering have focused on semi-streaming algorithms with Ω(n) memory, whereas in this work, we study streaming algorithms with much smaller memory requirement of only polylog(n) bits. This stringent memory requirement is in the same spirit of classical streaming algorithms that instead of recovering a full solution to the problem---which can be prohibitively large with such small memory as is the case in our problem---, aimed to learn certain statistical properties of their inputs. In our case, this translates to determining the ``(correlation) clusterability'' of input graphs, or more precisely, estimating the cost of the optimal correlation clustering solution.
As our main result, we present two novel algorithms that in only polylog(n) space are able to estimate the optimal correlation clustering cost up to some constant multiplicative factor plus some extra additive error. One of the algorithms outputs a 3-multiplicative approximation plus o(n^2) additive approximation, and the other one improves the additive error further down at the cost of increasing the multiplicative factor to some large constant. We then present new lower bounds that justify these mix of both multiplicative and additive error approximation in our algorithms.
All previous work on streaming correlation clustering have focused on semi-streaming algorithms with Ω(n) memory, whereas in this work, we study streaming algorithms with much smaller memory requirement of only polylog(n) bits. This stringent memory requirement is in the same spirit of classical streaming algorithms that instead of recovering a full solution to the problem---which can be prohibitively large with such small memory as is the case in our problem---, aimed to learn certain statistical properties of their inputs. In our case, this translates to determining the ``(correlation) clusterability'' of input graphs, or more precisely, estimating the cost of the optimal correlation clustering solution.
As our main result, we present two novel algorithms that in only polylog(n) space are able to estimate the optimal correlation clustering cost up to some constant multiplicative factor plus some extra additive error. One of the algorithms outputs a 3-multiplicative approximation plus o(n^2) additive approximation, and the other one improves the additive error further down at the cost of increasing the multiplicative factor to some large constant. We then present new lower bounds that justify these mix of both multiplicative and additive error approximation in our algorithms.
Conference version:
[Website] (This also includes a short talk and its slides)