Abstract
By employing time-varying proximal functions, adaptive subgradient methods (ADAGRAD) improve the regret bound and have been widely used in online learning and optimization. However, ADAGRAD with full-matrix proximal functions (ADA-FULL) cannot handle large-scale problems, owing to its impractical time and space complexities, even though it performs better when gradients are correlated. In this paper, we propose ADA-FD, an efficient variant of ADA-FULL based on a deterministic matrix sketching technique called frequent directions. Following ADA-FULL, we incorporate ADA-FD into both the primal-dual subgradient method and the composite mirror descent method to develop two efficient algorithms. By maintaining and manipulating low-rank matrices, at each iteration the space complexity is reduced from O(d²) to O(τd) and the time complexity is reduced from O(d³) to O(τ²d), where d is the dimensionality of the data and τ ≪ d is the sketching size. Theoretical analysis reveals that the regret of our methods is close to that of ADA-FULL as long as the outer product matrix of the gradients is approximately low-rank. Experimental results show that ADA-FD is comparable to ADA-FULL and outperforms other state-of-the-art algorithms in online convex optimization as well as in training convolutional neural networks (CNNs).
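As a rough illustration of the sketching primitive the abstract refers to (not the paper's actual algorithm), frequent directions maintains a small ell × d sketch B whose Gram matrix BᵀB approximates the gradient outer-product matrix GᵀG; the function name, the halving variant, and the error bound 2·‖G‖_F²/ell used below are standard for frequent directions but are assumptions here, not taken from this abstract:

```python
import numpy as np

def frequent_directions(G, ell):
    """Deterministic sketch: return B (ell x d) with B.T @ B ~= G.T @ G.

    This is the classic halving variant: whenever the buffer fills,
    an SVD is taken and all squared singular values are shrunk by the
    (ell//2)-th one, zeroing out the bottom half of the rows.
    """
    n, d = G.shape
    B = np.zeros((ell, d))
    next_zero = 0  # index of the first all-zero row in B
    for g in G:
        B[next_zero] = g
        next_zero += 1
        if next_zero == ell:  # buffer full: shrink
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s[:, None] * Vt  # rows ell//2 .. ell-1 are now zero
            next_zero = ell // 2
    return B
```

Only the ell × d sketch is ever stored, which is where the O(τd) space and O(τ²d) per-step cost quoted in the abstract come from; the sketch satisfies the deterministic guarantee ‖GᵀG − BᵀB‖₂ ≤ 2‖G‖_F²/ell.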