1. Introduction
In many real-world problems, information from a single view is insufficient to describe the data adequately. Learning from representations over multiple views, i.e., multi-view learning [1], has therefore become an important approach. When multi-view learning is applied to clustering, the resulting task is multi-view clustering [2], which integrates the multiple views to perform clustering jointly. Recently, much progress has been made in this field, especially in multi-view subspace clustering [3]. Multi-view subspace clustering attracts much attention because learning a unified representation for the data on all views makes high-dimensional data easier to handle. For example, in [4], the authors separate noise from each view and learn a shared low-rank transition probability matrix for multi-view spectral clustering. In [5], low-rank sparse subspace clustering is extended to multi-view clustering, and consistent clustering results are obtained. In addition, to make full use of the complementary information across views, work [6] uses the Hilbert-Schmidt independence criterion to explore the complementary information between views and reduces redundancy in the self-representation of each view. In [7], a consistency term is developed on the basis of [6], and the diversity representation and spectral clustering are optimized jointly. All these approaches aim to maximize the complementary information between views but ignore the differences between views; reasonably assigning different weights to the views can therefore make the clustering results more accurate. In [8], by considering the difference between the data on each view, the authors propose a self-adaptive weighting method to assign a different weight to each view. In [9], a multi-graph fusion algorithm for multi-view spectral clustering is proposed, where the self-adaptive weighting method is also adopted.
In the works mentioned above, the complementary information between views is captured directly in the original data space. However, when the original data in each view are very diverse, capturing the complementary information directly in the data space may cause a large loss. Therefore, different from previous works and inspired by the recent observation that the data over all views share a very similar spectral block structure, we construct the complementary information in the spectral embedding domain. Besides, to better capture the global structure of the given data in subspace clustering, we propose a novel rank-norm approximation method. Experimental results demonstrate that the proposed method achieves a better tradeoff between clustering performance and robustness.
2. Preliminary
2.1 Low-rank Sparse Representation
The aim of low-rank representation (LRR) [10] is to find a low-rank representation matrix C ∈ ℝN×N for input data X = {xi ∈ ℝD}Ni=1, where D and N denote the dimension of each data point and the number of data points, respectively. Mathematically, the LRR problem is described as
\(\begin{aligned}\min _{\mathbf{C}} \operatorname{rank}(\mathbf{C}), \quad s.t.\; \mathbf{X}=\mathbf{X C},\end{aligned}\) (1)
where rank(C) is the rank function. Since LRR is able to reconstruct the data space under the low-rank constraint, it is widely leveraged to capture the global structure of the given data.
Besides the low-rank constraint, to represent each data point by a small number of data points from the same subspace, the sparse constraint is also imposed on the matrix C. Thus, we have
\(\begin{aligned}\min _{\mathbf{C}}\|\mathbf{X}-\mathbf{X C}\|_{F}^{2}+\beta_{1} \operatorname{rank}(\mathbf{C})+\beta_{2}\|\mathbf{C}\|_{1}, s.t.\operatorname{diag}(\mathbf{C})=\mathbf{0},\end{aligned}\) (2)
where ║·║1 is the ℓ1 norm, the tightest convex relaxation of the ℓ0 sparsity norm, and β1 and β2 are the balance parameters for the low-rank constraint and the sparse constraint, respectively. Here, diag(C) = 0 means that the diagonal of C is a zero vector, which avoids the trivial solution.
2.2 Multi-view Subspace Clustering
Given an M-view dataset X={X(1), X(2), …, X(M)}, where M is the number of views and X(m) is the data on the m-th view, optimization problem (2) can be extended as
\(\begin{aligned}\begin{array}{l}\min _{\mathbf{C}^{(m)}}\left\|\mathbf{X}^{(m)}-\mathbf{X}^{(m)} \mathbf{C}^{(m)}\right\|_{F}^{2}+\beta_{1} \operatorname{rank}\left(\mathbf{C}^{(m)}\right)+\beta_{2}\left\|\mathbf{C}^{(m)}\right\|_{1} \\ \text { s.t. } \operatorname{diag}\left(\mathbf{C}^{(m)}\right)=\mathbf{0}, m=1,2 \ldots, M,\end{array},\end{aligned}\) (3)
where C(m) ∈ ℝN×N is the self-representation matrix of the m-th view. Since there exists common information shared among all views [11], we need to find a unified representation matrix S across all views. Consequently, the optimization problem about the unified representation matrix S ∈ ℝN×N can be written as
\(\begin{aligned}\min _{\mathbf{S}} \sum_{m=1}^{M} p_{m}\left\|\mathbf{U}^{(m)}-\mathbf{S}\right\|_{F}^{2},s.t. \operatorname{rank}\left(\mathbf{L}_{s}\right)=N-k,\end{aligned}\) (4)
where U(m) is the affinity matrix of the m-th view, expressed as U(m) = (C(m) + (C(m))T) / 2, Ls = IN - D(-1/2)SD(-1/2) is the Laplacian matrix of S with D being its degree matrix whose diagonal elements are Dii = ∑j≠iSij, the constraint rank(Ls) = N − k guarantees that S contains exactly k connected components [12], and pm is the weight characterizing the importance of the m-th view. The weight of the m-th view should be inversely proportional to ║U(m) - S║2F; thus, we adopt the weighting scheme in [8], where pm is set as 1 / (2║U(m) - S║2F).
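To make this weighting scheme concrete, a minimal NumPy sketch is given below; the function name and the small constant added to avoid division by zero are our own illustrative choices.

```python
import numpy as np

def view_weights(U_list, S, eps=1e-12):
    """Adaptive view weights p_m = 1 / (2 * ||U^(m) - S||_F^2), following [8]."""
    return np.array([1.0 / (2.0 * np.linalg.norm(U - S, 'fro') ** 2 + eps)
                     for U in U_list])

# toy usage: two 4x4 affinity matrices and a current consensus S
rng = np.random.default_rng(0)
U_list = [rng.random((4, 4)) for _ in range(2)]
S = sum(U_list) / len(U_list)
print(view_weights(U_list, S))
```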
Moreover, the optimization problem of multi-view subspace clustering under low-rank sparse constraint can be formulated as
\(\begin{aligned}\begin{array}{l}\min _{\mathbf{C}^{(m)}, \mathbf{S}, \mathbf{F}} \sum_{m=1}^{M}\left\{\left\|\mathbf{X}^{(m)}-\mathbf{X}^{(m)} \mathbf{C}^{(m)}\right\|_{F}^{2}+\beta_{1} \operatorname{rank}\left(\mathbf{C}^{(m)}\right)+\beta_{2}\left\|\mathbf{C}^{(m)}\right\|_{1}\right. \\ \left.\qquad+p_{m}\left\|\mathbf{U}^{(m)}-\mathbf{S}\right\|_{F}^{2}\right\}+\eta \operatorname{Tr}\left(\mathbf{F}^{T} \mathbf{L}_{s} \mathbf{F}\right), \\ \text { s.t. } \mathbf{C}_{i j}^{(m)}>0, \operatorname{diag}\left(\mathbf{C}^{(m)}\right)=\mathbf{0}, \mathbf{F}^{T} \mathbf{F}=\mathbf{I}_{k},\end{array}\end{aligned}\) (5)
where minimizing \(\begin{aligned}\operatorname{Tr}\left(\mathbf{F}^{T} \mathbf{L}_{s} \mathbf{F}\right)\end{aligned}\) subject to FTF = Ik is equivalent to the constraint rank(Ls) = N − k, and η represents the corresponding balance parameter. Once problem (5) is solved, the clustering results can be obtained directly without performing an additional clustering step.
3. The Proposed Approach
In this section, we first propose a novel rank-norm approximation, which provides a better tradeoff between accuracy and robustness for LRR. Then, we propose an alternative way to integrate the complementary information across different views.
3.1 Novel Rank-norm Approximation
As is known, the LRR problem inevitably involves a rank minimization problem, which is non-convex and NP-hard. A common alternative is to relax the rank function by the nuclear norm, so that the relaxed problem is convex and can be readily solved by a soft-thresholding operation on the singular values. However, the nuclear norm and its enhanced variants are essentially the sum of singular values, leading to a biased estimation of the rank function. To address this issue, a series of non-convex relaxations have been proposed, such as the Schatten-p norm [13] and the recently proposed Gamma norm [14]. The Gamma norm is an extension of the min-max concave plus function and can approximate the rank function in a nearly unbiased way [15]. To illustrate it in more detail, we first assume that the singular value decomposition (SVD) of the matrix C can be written as C = UΣVT, where U = [u1, u2, …, uN] and V = [v1, v2, …, vN] are unitary matrices and Σ = diag(σ1, σ2,…, σN) is a diagonal matrix with σ1 ≥ σ2 ≥ … ≥ σN ≥ 0. Then, the Gamma norm of C is defined as
\(\begin{aligned}\|\mathbf{C}\|_{\gamma}=\sum_{i=1}^{N} J_{\gamma}\left(\sigma_{i}\right),\end{aligned}\) (6)
where Jγ(σi) is a piecewise function written as
\(\begin{aligned}J_{\gamma}\left(\sigma_{i}\right)=\int_{0}^{\sigma_{i}}\left[1-\frac{x}{\gamma}\right]_{+} d x=\left\{\begin{array}{ccc}\frac{\gamma}{2}, & \text { if } & \sigma_{i} \geq \gamma , \\ \sigma_{i}-\frac{\sigma_{i}^{2}}{2 \gamma}, & \text { if } & \sigma_{i}<\gamma\end{array}\right.\end{aligned}\) (7)
where [x]+ = max(x, 0). As shown in (7), Jγ is a concave function of the singular values, so the approximation error can be kept very small. Besides, one may observe that the parameter γ decides which piece of the function is used for Jγ(σi). However, the piecewise form in (7) makes Jγ(σi) heavily dependent on the parameter γ [16]. Hence, the Gamma-norm based approximation of the rank function is very sensitive to γ, which leads to weak robustness.
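For illustration, a small NumPy sketch of the Gamma norm in (6)-(7) is given below; the function name and the toy matrix are our own, and the scaled value 2║C║γ/γ is printed only to show that, after this scaling, every singular value no smaller than γ contributes exactly 1.

```python
import numpy as np

def gamma_norm(C, gamma):
    """Gamma norm of C, Eqs. (6)-(7): sum of J_gamma over the singular values."""
    sigma = np.linalg.svd(C, compute_uv=False)
    J = np.where(sigma >= gamma, gamma / 2.0, sigma - sigma ** 2 / (2.0 * gamma))
    return J.sum()

# a rank-2 toy matrix
C = np.outer([1., 2., 3.], [1., 0., 1.]) + np.outer([0., 1., 1.], [2., 1., 0.])
for g in (1.0, 0.1):
    val = gamma_norm(C, g)
    print(g, val, 2.0 * val / g)   # the scaled value tends to rank(C)=2 as gamma shrinks
```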
To obtain a better tradeoff between the approximation error and robustness, we propose a novel rank-norm approximation method. By resorting to a sequence of matrix transformations, a small approximation error and good robustness can be achieved simultaneously. Specifically, let r and σi denote the rank and the i-th singular value of the matrix C, respectively. Then, for the rank function of C, we have
\(\begin{aligned}\operatorname{rank}(\mathbf{C})=r=\sum_{i=1}^{r} \frac{\sigma_{i}^{2}}{\sigma_{i}^{2}} \geq \sum_{i=1}^{r} \frac{\sigma_{i}^{2}}{\sigma_{i}^{2}+\varepsilon} \triangleq \varphi_{\varepsilon}(\mathbf{C}).\end{aligned}\) (8)
By observing the above relation, one may notice that the error between rank(C) and its lower bound φε (C) approaches 0 as ε approaches 0. Therefore, φε (C) can approximate the rank function very accurately. On the other hand, φε (C) varies continuously with ε and, most importantly, when σi is much larger than ε, φε (C) barely changes with ε, thereby exhibiting an ideal robustness to ε. However, φε (C) in (8) is expressed implicitly through the rank r, which makes the resulting optimization problem intractable. Hence, we need a further transformation of φε (C) to obtain an expression that no longer depends explicitly on r. To be specific, for φε (C), we have
\(\begin{aligned}\begin{array}{l}\varphi_{\varepsilon}(\mathbf{C})=\operatorname{Tr}\left(\begin{array}{cccccc}\sigma_{1}^{2}\left(\sigma_{1}^{2}+\varepsilon\right)^{-1} & 0 & \cdots & \cdots & \cdots & 0 \\ 0 & \ddots & 0 & 0 & \ddots & \vdots \\ \vdots & \ddots & \sigma_{r}^{2}\left(\sigma_{r}^{2}+\varepsilon\right)^{-1} & \ddots & \ddots & \vdots \\ \vdots & \ddots & 0 & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0\end{array}\right) \\ =\operatorname{Tr}\left(\left(\begin{array}{cccccc}\left(\sigma_{1}^{2}+\varepsilon\right)^{-1} & 0 & \cdots & \cdots & \cdots & 0 \\ 0 & \ddots & 0 & 0 & \ddots & \vdots \\ \vdots & \ddots & \left(\sigma_{r}^{2}+\varepsilon\right)^{-1} & \ddots & \ddots & \vdots \\ \vdots & \ddots & 0 & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0\end{array}\right)\left(\begin{array}{cccccc}\sigma_{1}^{2} & 0 & \cdots & \cdots & \cdots & 0 \\ 0 & \ddots & 0 & 0 & \ddots & \vdots \\ \vdots & \ddots & \sigma_{r}^{2} & \ddots & \ddots & \vdots \\ \vdots & \ddots & 0 & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0\end{array}\right)\right) \\ =\operatorname{Tr}\left(\left(\begin{array}{cccccc}\left(\sigma_{1}^{2}+\varepsilon\right)^{-1} & 0 & \cdots & \cdots & \cdots & 0 \\ 0 & \ddots & 0 & 0 & \ddots & \vdots \\ \vdots & \ddots & \left(\sigma_{r}^{2}+\varepsilon\right)^{-1} & \ddots & \ddots & \vdots \\ \vdots & \ddots & 0 & \varepsilon^{-1} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & 0 & \varepsilon^{-1}\end{array}\right)\left(\begin{array}{cccccc}\sigma_{1}^{2} & 0 & \cdots & \cdots & \cdots & 0 \\ 0 & \ddots & 0 & 0 & \ddots & \vdots \\ \vdots & \ddots & \sigma_{r}^{2} & \ddots & \ddots & \vdots \\ \vdots & \ddots & 0 & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0\end{array}\right)\right) \\ =\operatorname{Tr}\left(\left(\begin{array}{cc}\operatorname{diag}\left(\sigma^{2}(\mathbf{C})\right)+\varepsilon \mathbf{I}_{r} & \mathbf{0} \\ \mathbf{0} & \varepsilon \mathbf{I}_{N-r}\end{array}\right)^{-1}\left(\begin{array}{cc}\operatorname{diag}\left(\sigma^{2}(\mathbf{C})\right) & \mathbf{0} \\ \mathbf{0} & \mathbf{0}\end{array}\right)\right), \\\end{array}\end{aligned}\) (9)
where diag(σ2 (C)) is the r×r diagonal matrix containing the squares of the non-zero singular values of C. Then, based on the properties of unitary matrices, φε (C) can be further written as
\(\begin{aligned}\begin{array}{l}\varphi_{\varepsilon}(\mathbf{C})=\operatorname{Tr}\left(\left(\begin{array}{cc}\operatorname{diag}\left(\sigma^{2}(\mathbf{C})\right)+\varepsilon \mathbf{I}_{r} & \mathbf{0} \\ \mathbf{0} & \varepsilon \mathbf{I}_{N-r}\end{array}\right)^{-1}\left(\mathbf{\Sigma}^{T} \boldsymbol{\Sigma}\right)\right) \\ =\operatorname{Tr}\left(\mathbf{U} \boldsymbol{\Sigma}\left(\begin{array}{cc}\operatorname{diag}\left(\sigma^{2}(\mathbf{C})\right)+\varepsilon \mathbf{I}_{r} & \mathbf{0} \\ \mathbf{0} & \varepsilon \mathbf{I}_{N-r}\end{array}\right)^{-1} \boldsymbol{\Sigma}^{T} \mathbf{U}^{T}\right) \\ =\operatorname{Tr}\left(\mathbf{U} \mathbf{\Sigma} \mathbf{V}^{T} \mathbf{V}\left(\begin{array}{cc}\operatorname{diag}\left(\sigma^{2}(\mathbf{C})\right)+\varepsilon \mathbf{I}_{r} & \mathbf{0} \\ \mathbf{0} & \varepsilon \mathbf{I}_{N-r}\end{array}\right)^{-1} \mathbf{V}^{T} \mathbf{V} \boldsymbol{\Sigma}^{T} \mathbf{U}^{T}\right) \\ =\operatorname{Tr}\left(\mathbf{C} \mathbf{V}\left(\begin{array}{cc}\operatorname{diag}\left(\sigma^{2}(\mathbf{C})\right)+\varepsilon \mathbf{I}_{r} & \mathbf{0} \\ \mathbf{0} & \varepsilon \mathbf{I}_{N-r}\end{array}\right)^{-1} \mathbf{V}^{T} \mathbf{C}^{T}\right) \\\end{array}\end{aligned}\) (10)
Then, by resorting to the property of the inversion of unitary matrix, φε (C) can be further written as
\(\begin{aligned}\begin{array}{l}\varphi_{\varepsilon}(\mathbf{C})=\operatorname{Tr}\left(\mathbf{C}\left(\mathbf{V}^{T}\right)^{-1}\left(\begin{array}{cc}\operatorname{diag}\left(\sigma^{2}(\mathbf{C})\right)+\varepsilon \mathbf{I}_{r} & \mathbf{0} \\ \mathbf{0} & \varepsilon \mathbf{I}_{N-r}\end{array}\right)^{-1} \mathbf{V}^{-1} \mathbf{C}^{T}\right) \\ =\operatorname{Tr}\left(\mathbf{C}\left(\mathbf{V}^{T}\right)^{-1}\left(\mathbf{\Sigma}^{T} \mathbf{\Sigma}+\varepsilon \mathbf{I}_{N}\right)^{-1} \mathbf{V}^{-1} \mathbf{C}^{T}\right) \\ =\operatorname{Tr}\left(\mathbf{C}\left(\mathbf{V}\left(\boldsymbol{\Sigma}^{T} \boldsymbol{\Sigma}+\varepsilon \mathbf{I}_{N}\right) \mathbf{V}^{T}\right)^{-1} \mathbf{C}^{T}\right) \\\end{array}\end{aligned}\) (11)
Based on (11), the closed-form expression of φε (C) can be derived by reversing the SVD of C:
\(\begin{aligned} \varphi_{\varepsilon}(\mathbf{C}) & =\operatorname{Tr}\left(\mathbf{C}\left(\mathbf{V}\left(\mathbf{\Sigma}^{T} \boldsymbol{\Sigma}\right) \mathbf{V}^{T}+\varepsilon \mathbf{I}_{N}\right)^{-1} \mathbf{C}^{T}\right) \\ & =\operatorname{Tr}\left(\mathbf{C}\left(\mathbf{V}\left(\mathbf{\Sigma}^{T} \mathbf{U}^{T} \mathbf{U} \boldsymbol{\Sigma}\right) \mathbf{V}^{T}+\varepsilon \mathbf{I}_{N}\right)^{-1} \mathbf{C}^{T}\right) \\ & =\operatorname{Tr}\left(\mathbf{C}\left(\mathbf{C}^{T} \mathbf{C}+\varepsilon \mathbf{I}_{N}\right)^{-1} \mathbf{C}^{T}\right) .\end{aligned}\) (12)
So far, we have obtained the novel closed-form approximation of the rank function of the matrix C.
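As a quick numerical check, the closed form in (12) can be evaluated directly; in the NumPy sketch below (with an illustrative random low-rank matrix), φε(C) approaches rank(C) as ε decreases.

```python
import numpy as np

def phi_eps(C, eps):
    """Closed-form rank surrogate of Eq. (12): Tr(C (C^T C + eps I)^{-1} C^T)."""
    N = C.shape[1]
    M = C.T @ C + eps * np.eye(N)
    return np.trace(C @ np.linalg.solve(M, C.T))

rng = np.random.default_rng(0)
C = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 8))   # rank-3 matrix
print("rank:", np.linalg.matrix_rank(C))
for eps in (1e-1, 1e-3, 1e-6):
    print(eps, phi_eps(C, eps))
```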
3.2 The Integration of Multi-view Data on Spectral Structure
Since the data may diverge dramatically across views, integrating the data from multiple views directly in the data space may cause a large information loss. However, work [17] reveals that the data from different views still share a very similar spectral block structure. Inspired by this observation, we instead establish a unified spectral block structure across all views in the spectral embedding space [18]. In this way, the issue caused by the obvious inconsistency of the data representations among views can be avoided. Specifically, to obtain the spectral block structure of each view, we incorporate spectral clustering into the optimization and then perform an adaptive integration that minimizes the difference between the spectral block structure of each view and the unified spectral block structure. The objective function is formulated as
\(\begin{aligned}\begin{array}{l}\min _{\mathbf{S}, \mathbf{F}_{m}} \sum_{m=1}^{M} p_{m}\left\|\mathbf{F}_{m} \mathbf{F}_{m}^{T}-\mathbf{S}\right\|_{F}^{2}+\beta_{3} \operatorname{Tr}\left(\mathbf{F}_{m}^{T} \mathbf{L}_{m} \mathbf{F}_{m}\right), \\ \text { s.t. } \mathbf{F}_{m}^{T} \mathbf{F}_{m}=\mathbf{I}_{k}, \operatorname{rank}\left(\mathbf{L}_{s}\right)=N-k, \mathbf{S} \mathbf{1}=\mathbf{1},\end{array}\end{aligned}\) (13)
where Lm = IN - Dm(-1/2)U(m)Dm(-1/2) is the Laplacian matrix of the affinity matrix U(m) with Dm being its degree matrix, Fm ∈ ℝN×k denotes the spectral embedding matrix of the m-th view, FmFmT is the spectral block structure of the m-th view, S is the unified spectral structure across all views, 1 is the column vector whose elements are all 1, and β3 denotes the balance parameter for the spectral embedding term.
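For concreteness, a minimal NumPy sketch of how the per-view spectral block structure FmFmT can be computed from an affinity matrix is given below; it takes the full row sum as the degree (which coincides with ∑j≠i when the diagonal is zero), and the function names are illustrative.

```python
import numpy as np

def normalized_laplacian(U):
    """L = I - D^{-1/2} U D^{-1/2} for a symmetric affinity matrix U."""
    d = U.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(U.shape[0]) - (d_inv_sqrt[:, None] * U) * d_inv_sqrt[None, :]

def spectral_block_structure(U, k):
    """F_m F_m^T, where F_m holds the k eigenvectors of L_m with smallest eigenvalues."""
    L = normalized_laplacian(U)
    _, vecs = np.linalg.eigh(L)     # eigenvalues returned in ascending order
    F = vecs[:, :k]                 # N x k spectral embedding
    return F @ F.T

# toy usage on a random symmetric affinity matrix
rng = np.random.default_rng(0)
A = rng.random((6, 6))
U = (A + A.T) / 2.0
print(spectral_block_structure(U, k=2).shape)
```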
So far, the optimization problem (5) can be reformulated as
\(\begin{aligned}\begin{array}{l}\min _{\mathrm{C}^{(m)}, \mathbf{S}, \mathbf{F}, \mathbf{F}_{m}} \sum_{m=1}^{M}\left\{\left\|\mathbf{X}^{(m)}-\mathbf{X}^{(m)} \mathbf{C}^{(m)}\right\|_{F}^{2}+\beta_{1} \varphi_{\varepsilon}\left(\mathbf{C}^{(m)}\right)+\beta_{2}\left\|\mathbf{C}^{(m)}\right\|_{1}\right\} \\ \quad+\sum_{m=1}^{M}\left\{p_{m}\left\|\mathbf{F}_{m} \mathbf{F}_{m}^{T}-\mathbf{S}\right\|_{F}^{2}+\beta_{3} \operatorname{Tr}\left(\mathbf{F}_{m}^{T} \mathbf{L}_{m} \mathbf{F}_{m}\right)\right\}+\eta \operatorname{Tr}\left(\mathbf{F}^{T} \mathbf{L}_{s} \mathbf{F}\right), \\ \text { s.t. } \mathbf{C}_{i j}^{(m)}>0, \operatorname{diag}\left(\mathbf{C}^{(m)}\right)=\mathbf{0}, \mathbf{F}_{m}^{T} \mathbf{F}_{m}=\mathbf{I}_{k}, \mathbf{F}^{T} \mathbf{F}=\mathbf{I}_{k} .\end{array}\end{aligned}\) (14)
4. Optimization Algorithms
To solve the optimization problem (14), the Alternating Direction Method of Multipliers (ADMM) is adopted, and the variables are updated alternately as follows.
4.1 Update C(m)
With S, F and Fm fixed, the optimization problem reduces to
\(\begin{aligned}\min _{\mathbf{C}}\|\mathbf{X}-\mathbf{X C}\|_{F}^{2}+\beta_{1} \varphi_{\varepsilon}(\mathbf{C})+\beta_{2}\|\mathbf{C}\|_{1}-\beta_{3} \operatorname{Tr}\left(\mathbf{T}_{m}^{T} \mathbf{C} \mathbf{T}_{m}\right), s.t. \operatorname{diag}(\mathbf{C})=\mathbf{0}\end{aligned}\) (15)
where Tm = Dm(-1/2)Fm, and the view superscript (m) is omitted for notational convenience. Then, by introducing the auxiliary variables {Ci}3i=1 and W, (15) can be reformulated as
\(\begin{aligned}\begin{array}{l}\min _{\mathbf{W},\left\{\mathbf{C}_{i}\right\}_{i=1}^{3}}\|\mathbf{X}-\mathbf{X} \mathbf{W}\|_{F}^{2}+\beta_{1} \varphi_{\varepsilon}\left(\mathbf{C}_{1}\right)+\beta_{2}\left\|\mathbf{C}_{2}\right\|_{1}-\beta_{3} \operatorname{Tr}\left(\mathbf{T}_{m}^{T} \mathbf{C}_{3} \mathbf{T}_{m}\right), \\ \text { s.t. } \mathbf{W}=\mathbf{C}_{2}-\operatorname{diag}\left(\mathbf{C}_{2}\right), \mathbf{W}=\mathbf{C}_{1}, \mathbf{W}=\mathbf{C}_{3} .\end{array}\end{aligned}\) (16)
Moreover, its augmented Lagrangian is
\(\begin{aligned} L\left(\mathbf{W},\left\{\mathbf{C}_{i}\right\}_{i=1}^{3},\left\{\boldsymbol{\Omega}_{i}\right\}_{i=1}^{3}\right)= & \|\mathbf{X}-\mathbf{X} \mathbf{W}\|_{F}^{2}+\beta_{1} \varphi_{\varepsilon}\left(\mathbf{C}_{1}\right)+\beta_{2}\left\|\mathbf{C}_{2}\right\|_{1}-\beta_{3} \operatorname{Tr}\left(\mathbf{T}_{m}^{T} \mathbf{C}_{3} \mathbf{T}_{m}\right) \\ & +\frac{v}{2}\left(\left\|\mathbf{W}-\mathbf{C}_{2}+\operatorname{diag}\left(\mathbf{C}_{2}\right)\right\|_{F}^{2}+\left\|\mathbf{W}-\mathbf{C}_{1}\right\|_{F}^{2}+\left\|\mathbf{W}-\mathbf{C}_{3}\right\|_{F}^{2}\right) \\ & +\operatorname{Tr}\left(\boldsymbol{\Omega}_{1}^{T}\left(\mathbf{W}-\mathbf{C}_{2}+\operatorname{diag}\left(\mathbf{C}_{2}\right)\right)\right) \\ & +\operatorname{Tr}\left(\boldsymbol{\Omega}_{2}^{T}\left(\mathbf{W}-\mathbf{C}_{1}\right)\right)+\operatorname{Tr}\left(\mathbf{\Omega}_{3}^{T}\left(\mathbf{W}-\mathbf{C}_{3}\right)\right),\end{aligned}\) (17)
where v is the penalty parameter and {Ωi}3i=1 are the Lagrange dual variables.
4.1.1 Update W
By setting the partial derivative of (17) with respect to W to 0, W can be updated directly as
W =(2XTX + 3vIN)-1 (2XTX + v(C1 + C2 + C3) - Ω1 - Ω2 - Ω3). (18)
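A minimal NumPy sketch of the closed-form update (18) is given below; the function name is illustrative, and a linear solve is used instead of an explicit matrix inverse.

```python
import numpy as np

def update_W(X, C1, C2, C3, O1, O2, O3, v):
    """Closed-form W update of Eq. (18)."""
    N = X.shape[1]
    XtX = X.T @ X
    A = 2.0 * XtX + 3.0 * v * np.eye(N)
    B = 2.0 * XtX + v * (C1 + C2 + C3) - O1 - O2 - O3
    return np.linalg.solve(A, B)

# toy usage with zero-initialized auxiliary and dual variables
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Z = np.zeros((8, 8))
print(update_W(X, Z, Z, Z, Z, Z, Z, v=1.0).shape)   # (8, 8)
```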
4.1.2 Update C1
Based on the gradient descent method, C1 can be updated iteratively as
C1(t+1) = C1(t) - ηC1▽C1, (19)
where t denotes the iteration index, ηC1 denotes the learning rate, and ∇C1 represents the gradient with respect to C1. To obtain ∇C1, we need the partial derivative of (17) with respect to C1. First, for the partial derivative of φε(C1) with respect to C1, based on the derivative rule for the trace of matrices, we have
∂(φε(C1))/∂(C1) = 2C1((C1)T C1 + εIN)-1(IN - (C1)T C1((C1)T C1 + εIN)-1). (20)
In addition, we need to calculate the partial derivative of ║W - C1║2F with respect to C1. To be specific, by resorting to the derivative rule about the squared Frobenius norm of the matrix, we can obtain
∂(║W - C1║2F)/∂(C1) =-2W + 2C1. (21)
Finally, we need to calculate the partial derivative of Tr(ΩT2(W - C1)) about C1. Specifically, based on the derivative rule about the trace of the matrix, we have
∂(Tr(ΩT2(W - C1)))/∂(C1) =-Ω2. (22)
In the end, based on (20), (21) and (22), the gradient ∇C1 in (19) can be further written as
∇C1 = vC1 - vW - Ω2 + 2β1C1(CT1C1 + εIN)-1(IN - CT1C1(CT1C1 + εIN)-1). (23)
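The gradient step (19) with the gradient (23) can be sketched as follows (assuming NumPy; the default learning rate and the function names are illustrative).

```python
import numpy as np

def grad_C1(C1, W, O2, v, beta1, eps):
    """Gradient of the augmented Lagrangian w.r.t. C1, following Eqs. (20)-(23)."""
    N = C1.shape[1]
    Minv = np.linalg.inv(C1.T @ C1 + eps * np.eye(N))
    d_phi = 2.0 * C1 @ Minv @ (np.eye(N) - C1.T @ C1 @ Minv)   # Eq. (20)
    return v * (C1 - W) - O2 + beta1 * d_phi                   # Eq. (23)

def update_C1(C1, W, O2, v, beta1, eps, lr=0.05):
    """One gradient-descent step, Eq. (19)."""
    return C1 - lr * grad_C1(C1, W, O2, v, beta1, eps)
```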
4.1.3 Update C2
From [19] [20], the update rule of C2 is
C2 = C'2 - diag(C'2), (24)
where \(\begin{aligned}\mathbf{C}_{2}^{\prime}=\pi_{\frac{\beta_{2}}{v}}\left(\mathbf{W}+\boldsymbol{\Omega}_{1} / v\right)\end{aligned}\) and \(\begin{aligned}\pi_{\frac{\beta_{2}}{v}}(\cdot)\end{aligned}\) is the soft-thresholding operator applied entry-wise to (W + Ω1 / v) [20].
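A minimal NumPy sketch of the C2 update (24) is given below; the soft-thresholding operator is applied entry-wise, and the diagonal is zeroed afterwards.

```python
import numpy as np

def soft_threshold(A, tau):
    """Entry-wise soft-thresholding operator pi_tau(.)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def update_C2(W, O1, v, beta2):
    """C2 update of Eq. (24): soft-threshold, then remove the diagonal."""
    C2p = soft_threshold(W + O1 / v, beta2 / v)
    return C2p - np.diag(np.diag(C2p))
```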
4.1.4 Update C3
Based on the gradient descent method, C3 can be updated iteratively as
C3(t+1) = C3(t) - ηC3 ▽C3 , (25)
where the gradient ▽C3 =vC3 - vW - β3TmTmT - Ω3.
4.1.5 Update {Ωi}3i=1
Given W and {Ci}3i=1, by the update rules for the dual variables in [5], dual variables {Ωi}3i=1 can be updated as
Ω1t+1 = Ω1t + v(W - C2),
Ω2t+1 = Ω2t + v(W - C1),
Ω3t+1 = Ω3t + v(W - C3). (26)
4.2 Update S
With C(m), F and Fm fixed, the optimization problem reduces to
\(\begin{aligned}\min _{\mathbf{S}} \sum_{m=1}^{M}\left\{p_{m}\left\|\overline{\mathbf{F}}_{m}-\mathbf{S}\right\|_{F}^{2}\right\}+\eta \operatorname{Tr}\left(\mathbf{F}^{T} \mathbf{L}_{s} \mathbf{F}\right), s.t. \mathbf{F}^{T} \mathbf{F}=\mathbf{I}_{k},\end{aligned}\) (27)
where \(\begin{aligned}\overline{\mathbf{F}}_{m} \triangleq \mathbf{F}_{m} \mathbf{F}_{m}^{T}.\end{aligned}\) Let qi ∈ ℝN×1 be the vector whose j-th entry is ║F(i,:) - F(j,:)║22; then the optimization problem (27) with respect to the i-th column of S can be formulated as
\(\begin{aligned}\min _{\mathbf{S}(:, i)} \sum_{m=1}^{M} p_{m}\left\|\overline{\mathbf{F}}_{m}(:, i)-\mathbf{S}(:, i)\right\|_{2}^{2}+\eta \mathbf{q}_{i}^{T} \mathbf{S}(:, i).\end{aligned}\) (28)
By setting the partial derivative of (28) with respect to S(:, i) to 0, we can obtain
\(\begin{aligned}\mathbf{S}(:, i)=\left(\sum_{m} p_{m} \overline{\mathbf{F}}_{m}(:, i)-\frac{\eta \mathbf{q}_{i}}{2}\right) / \sum_{m} p_{m}.\end{aligned}\) (29)
Thus, S can be obtained by applying (29) to each of its columns.
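A column-wise NumPy implementation of (29) can be sketched as follows; note that the constraint S1 = 1 in (13) is not enforced in this simple sketch, and the function name is illustrative.

```python
import numpy as np

def update_S(Fbar_list, p, F, eta):
    """Column-wise S update of Eq. (29); q_i(j) = ||F(i,:) - F(j,:)||_2^2."""
    N = F.shape[0]
    # pairwise squared distances between the rows of F: sq[i, j] = q_i(j)
    sq = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=2)
    weighted = sum(pm * Fbar for pm, Fbar in zip(p, Fbar_list))  # sum_m p_m * Fbar_m
    p_sum = float(np.sum(p))
    S = np.zeros((N, N))
    for i in range(N):
        S[:, i] = (weighted[:, i] - eta * sq[i, :] / 2.0) / p_sum
    return S
```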
4.3 Update F
With C(m), S and Fm fixed, the optimization problem reduces to
\(\begin{aligned}\min _{\mathbf{F}} \operatorname{Tr}\left(\mathbf{F}^{T} \mathbf{L}_{s} \mathbf{F}\right), s.t. \mathbf{F}^{T} \mathbf{F}=\mathbf{I}_{k}. \end{aligned}\) (30)
Obviously, the optimal solution of F is the matrix consisting of the eigenvectors of Ls corresponding to the k smallest eigenvalues.
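Concretely, assuming NumPy and the normalized Laplacian of Section 2.2, the F update can be sketched as below; the explicit symmetrization is added only for numerical stability.

```python
import numpy as np

def update_F(S, k):
    """F = eigenvectors of L_s for the k smallest eigenvalues, solving Eq. (30)."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    Ls = np.eye(S.shape[0]) - (d_inv_sqrt[:, None] * S) * d_inv_sqrt[None, :]
    Ls = (Ls + Ls.T) / 2.0                 # symmetrize for numerical stability
    _, vecs = np.linalg.eigh(Ls)           # eigenvalues in ascending order
    return vecs[:, :k]
```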
4.4 Update Fm
With C(m), S and F fixed, the optimization problem reduces to
\(\begin{aligned}\min _{\mathbf{F}_{m}} \sum_{m=1}^{M} p_{m}\left\|\mathbf{F}_{m} \mathbf{F}_{m}^{T}-\mathbf{S}\right\|_{F}^{2}+\beta_{3} \operatorname{Tr}\left(\mathbf{F}_{m}^{T} \mathbf{L}_{m} \mathbf{F}_{m}\right), s.t. \mathbf{F}_{m}^{T} \mathbf{F}_{m}=\mathbf{I}_{k}\end{aligned}\) (31)
Similarly, Fm can be updated iteratively as
Fm(t+1) = Fm(t) - ηFm ▽Fm , (32)
where the gradient can be written as
\(\begin{aligned}\nabla_{\mathbf{F}_{m}}=\sum_{m=1}^{M} 2 p_{m}\left(\mathbf{F}_{m}-\mathbf{S}^{T} \mathbf{F}_{m}-\mathbf{S F}_{m}\right)+\beta_{3}\left(\mathbf{L}_{m} \mathbf{F}_{m}+\mathbf{L}_{m}^{T} \mathbf{F}_{m}\right).\end{aligned}\) (33)
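A sketch of the Fm update (32)-(33) for a single view is given below (assuming NumPy); the QR re-orthogonalization used to maintain FmTFm = Ik after the gradient step is our own simple choice and is not part of the stated update rule.

```python
import numpy as np

def grad_Fm(Fm, S, Lm, pm, beta3):
    """Gradient for one view, following Eq. (33)."""
    return 2.0 * pm * (Fm - S.T @ Fm - S @ Fm) + beta3 * (Lm @ Fm + Lm.T @ Fm)

def update_Fm(Fm, S, Lm, pm, beta3, lr=0.05):
    """One gradient step, Eq. (32), followed by re-orthogonalization."""
    Fm = Fm - lr * grad_Fm(Fm, S, Lm, pm, beta3)
    Q, _ = np.linalg.qr(Fm)                # project back so that Fm^T Fm = I_k
    return Q
```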
So far, the update rules for all variables have been provided; these rules are implemented repeatedly until convergence or until the maximum number of iterations is reached. The update rules described above are summarized in the following algorithm.
Algorithm 1. ADMM-based optimization of the proposed method
Input: X = {X(m)}Mm=1, {βi}3i=1, pm, η, N, k, v
Outputs: Assignment of the data points to k clusters
1: Initialize: {Ci}3i=1 = 0, W = 0, {Ωi}3i=1 = 0, m = 1, 2, …, M
2: while not converged do
3: for m=1 to M do
4: Fix others and update W by solving (18)
5: Fix others and update C1 by solving (19)
6: Fix others and update C2 by solving (24)
7: Fix others and update C3 by solving (25)
8: Fix others and update {Ωi}3i=1 by solving (26)
9: Fix others and update S by solving (29)
10: Fix others and update F by solving (30)
11: Fix others and update Fm by solving (32)
12: end for
13: Update v
14:end while
15: Combine {C(m)}Mm=1 by the method of spectral structure fusion
4.5 Complexity Analysis
The main computational cost of the algorithm lies in the updates of C(m) and F. Specifically, the complexity of updating C(m) is O(N3) due to the matrix inversion and multiplication, and the complexity of updating F is O(N3) due to the eigendecomposition of Ls.
5. Experiments
5.1 Dataset Descriptions
To evaluate the performance of the proposed approach, four real-world multi-view datasets are used: Reuters [21], 3-sources, Prokaryotic [22] and UCI Digit. Reuters is a dataset containing documents in 5 languages, from which 600 documents with 6 clusters are randomly sampled for this experiment. 3-sources is a dataset of news articles collected from three online news sources, from which 169 articles with 3 views and 6 clusters are adopted. Prokaryotic is a dataset describing 551 prokaryotic species in a heterogeneous multi-view way, including text and several genomic representations; 551 samples with 3 views and 4 clusters are adopted. UCI Digit is a dataset of handwritten digits (0-9), from which 2000 examples with 3 views and 10 clusters are chosen.
5.2 Experiment Setting
We evaluate clustering performance using five metrics: recall, precision, F-score, normalized mutual information (NMI) and adjusted Rand index (Adj-RI). For all these metrics, a higher value indicates better performance.
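For reference, a sketch of how these metrics could be computed is shown below, using scikit-learn for NMI and Adj-RI and a pair-counting implementation of precision, recall and F-score; the exact metric definitions used in the experiments may differ, and the toy labels are illustrative only.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def pairwise_prf(labels_true, labels_pred):
    """Pair-counting precision, recall and F-score from the contingency table."""
    c = contingency_matrix(labels_true, labels_pred)
    same_pred = np.sum(c.sum(axis=0) * (c.sum(axis=0) - 1)) / 2.0   # pairs sharing a predicted cluster
    same_true = np.sum(c.sum(axis=1) * (c.sum(axis=1) - 1)) / 2.0   # pairs sharing a true cluster
    tp = np.sum(c * (c - 1)) / 2.0                                  # pairs sharing both
    precision, recall = tp / same_pred, tp / same_true
    return precision, recall, 2 * precision * recall / (precision + recall)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
print(pairwise_prf(y_true, y_pred))
print(normalized_mutual_info_score(y_true, y_pred), adjusted_rand_score(y_true, y_pred))
```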
Besides, we compare the proposed approach with state-of-the-art solutions, namely Robust Multi-view Spectral Clustering (RMSC) [4], Convex Sparse Multi-view Spectral Clustering (CSMSC) [3], Pairwise Multi-view Low-rank Sparse Subspace Clustering (Pairwise MLRSSC) [5], Centroid-based Multi-view Low-rank Sparse Subspace Clustering (Centroid MLRSSC) [5] and Multi-graph Fusion for Multi-view Spectral Clustering (GFSC) [9]. Apart from that, we also verify the effectiveness of the proposed rank-norm approximation. For fairness, the same framework is compared under different rank-norm approximations, i.e., the framework of unifying multi-view data on the spectral structure under the nuclear norm (MVSS (Nuclear)), the Gamma norm (MVSS (Gamma)) and the proposed rank-norm approximation (MVSS (Ours)). In particular, MVSS (Nuclear) adopts the weighted nuclear norm [23], a recent nuclear-norm based method, so that the comparison is made against a cutting-edge baseline.
Moreover, all parameters of the existing approaches mentioned above are set according to the respective parameter-searching strategies provided by their authors. Besides, all balance parameters of the proposed method are tuned over {50-3, 10-3, 50-2, 10-2, 50-1, 10-1, 1, 5, 10}. In addition, the parameter ε of the proposed rank-norm approximation is tuned over {50-3, 10-3, 50-2, 10-2, 50-1}, while the parameter γ of the Gamma norm is varied from 50-3 to 10-1 with step 50-3. Apart from that, the learning rate is set to 0.05, and the maximum number of iterations is 300.
5.3 Experiment Results
Table 1 compares the clustering performance of all algorithms. As shown in Table 1, the MVSS series (including MVSS (Nuclear), MVSS (Gamma) and MVSS (Ours)) generally outperforms the other algorithms, which verifies the effectiveness of the proposed integration of multi-view data on the spectral structure.
Table 1. Clustering performance of different algorithms
Moreover, one may observe that the MVSS series outperforms the other approaches by a large margin on 3-sources but only slightly on Reuters. For example, the F-score of the MVSS series is clearly higher than that of the other approaches on 3-sources, while it is only slightly higher on Reuters. This is because the divergence across the views of 3-sources is more pronounced than that of Reuters. Besides, within the proposed MVSS series, one may find that the clustering performance of MVSS (Ours), which is based on the proposed rank-norm approximation, is almost the same as that of MVSS (Gamma). Hence, the proposed rank-norm approximation achieves almost the same approximation quality of the rank function as the Gamma norm. Nevertheless, to achieve this performance, the Gamma-norm based method may require a finer-grained grid search over γ than the proposed method requires over ε, because the proposed rank-norm approximation is more robust.
To further compare the robustness of the Gamma norm with that of the proposed method, Tables 2-4 show how changing the parameter of each rank-norm approximation (i.e., ε in the proposed approximation or γ in the Gamma norm) affects the clustering performance. Due to the length limit, only the comprehensive metrics F-score, NMI and Adj-RI are reported here. In Tables 2-4, the change of the parameter is denoted by f, which is the percentage change relative to the value chosen for Table 1. For example, f = 5% means that the parameter is increased and decreased by 5% from the value chosen for Table 1, and the value reported in Tables 2-4 is the better of the two results. It can be seen that the impact of the variation of f on the performance of MVSS (Ours) is smaller than its impact on MVSS (Gamma), which shows that the Gamma norm is more sensitive to its parameter γ. Therefore, we can conclude that the proposed rank-norm approximation provides better robustness than the Gamma norm.
Table 2. F-score under different f
Table 3. NMI under different f
Table 4. Adj-RI under different f
The parameter sensitivity analysis is shown in Fig. 1, where different pairs of balance parameters among β1, β2, β3 and η are varied and the F-score is taken as the performance metric. As shown in Fig. 1, the proposed method obtains stable performance under most parameter perturbations, although it is relatively sensitive to the combination of η and β2.
Fig. 1. Illustration of the parameter sensitivity of the proposed method on UCI digit dataset (a) β1 vs β2. (b) β1 vs β3. (c) β2 vs β3 . (d) β1 vs η . (e) η vs β2 . (f) η vs β3.
6. Conclusion
In multi-view clustering, to address the performance degradation caused by the large diversity of data representations across views, we propose to construct the unified representation in the spectral embedding domain. In this way, the complementary information between views can be integrated effectively. In addition, to better capture the global structure of the data in subspace clustering, we propose a novel relaxation of the low-rank constraint via a tight lower bound on the rank function and develop an optimization algorithm to solve the resulting multi-view clustering problem. Finally, experimental results demonstrate that the proposed method achieves a better tradeoff between clustering performance and robustness.
References
- C. Xu, D. Tao and C. Xu, "A survey on multi-view learning," arXiv preprint arXiv:1304.5634, Apr. 2013.
- G. Chao, S. Sun, and J. Bi, "A survey on multi-view clustering," arXiv preprint arXiv:1712.06246, Dec. 2017.
- C. Lu, S. Yan and Z. Lin, "Convex sparse spectral clustering: Single-view to multi-view," IEEE Trans. Image Process., vol. 25, no. 6, pp. 2833-2843, Apr. 2016. https://doi.org/10.1109/TIP.2016.2553459
- R. Xia, Y. Pan, L. Du and J. Yin, "Robust multi-view spectral clustering via low-rank and sparse decomposition," in Proc of the AAAI conference on artificial intelligence, Quebec, Canada, vol. 28, no. 1, pp. 2149-2155, Jun. 2014.
- M. Brbic, I. Kopriva, "Multi-view low-rank sparse subspace clustering," Pattern Recognit., vol. 73, pp. 247-258, Jan. 2018. https://doi.org/10.1016/j.patcog.2017.08.024
- X. Cao, C. Zhang, H. Fu, S. Liu and H. Zhang, "Diversity-induced multi-view subspace clustering," in Proc of the IEEE conference on computer vision and pattern recognition, pp. 586-594, 2015.
- X. Wang, X. Guo, Z. Lei, C. Zhang and S. Z. Li, "Exclusivity-consistency regularized multi-view subspace clustering," in Proc of the IEEE conference on computer vision and pattern recognition, pp. 923-931, 2017.
- F. Nie, J. Li and X. Li, "Self-weighted Multiview Clustering with Multiple Graphs," in Proc of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pp. 2564-2570, 2017.
- Z. Kang, G. Shi, S. Huang, W. Chen, X. Pu, J. T. Zhou and Z. Xu, "Multi-graph fusion for multi-view spectral clustering," Knowl. Based Syst., vol. 189, pp. 105102-105110, Feb. 2020. https://doi.org/10.1016/j.knosys.2019.105102
- G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu and Y. Ma, "Robust recovery of subspace structures by low-rank representation," IEEE Trans. Pattern Anal., vol. 35, no. 1, pp. 171-184, Jan. 2013. https://doi.org/10.1109/TPAMI.2012.88
- S. Liu, X. Liu, S. Wang and K. Muhammad, "Fuzzy-aided solution for out-of-view challenge in visual tracking under iot-assisted complex environment," Neural. Comput. Appl., vol. 33, pp.1055-1065, Feb. 2021. https://doi.org/10.1007/s00521-020-05021-3
- K. Fan, "On a theorem of Weyl concerning eigenvalues of linear transformations I," in Proc of the National Academy of Sciences of the United States of America, vol. 35, no. 11, pp. 652-655, Nov. 1949. https://doi.org/10.1073/pnas.35.11.652
- J. Cao, Y. Fu, X. Shi and B. W. Ling, "Subspace Clustering Based on Latent Low Rank Representation with Schatten-p Norm," in Proc of the 2020 2nd World Symposium on Artificial Intelligence, Guangzhou, China, pp. 58-62, Jul. 2020.
- H. Zhang, J. Yang, J. Xie, J. Qin and B. Zhang, "Weighted sparse coding regularized nonconvex matrix regression for robust face recognition," Inf. Sci., vol. 394, pp. 1-17, Jul. 2017. https://doi.org/10.1016/j.ins.2017.02.020
- S. Wang, D. Liu, and Z. Zhang, "Nonconvex relaxation approaches to robust matrix recovery," in Proc of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 1764-1770, Aug. 2013.
- S. Liu, S. Wang, X. Liu, A. H. Gandomi, M. Daneshmand, K. Muhammad and V. H. C. De Albuquerque, "Human memory update strategy: a multi-layer template update mechanism for remote visual monitoring," IEEE Trans. Multimedia, vol. 23, pp. 2188-2198, Mar. 2021. https://doi.org/10.1109/TMM.2021.3065580
- Y. Wang, L. Wu, X. Lin and J. Gao, "Multiview spectral clustering via structured low-rank matrix factorization," IEEE Trans. Neural Networks Learn. Syst., vol. 29, no. 10, pp. 4833-4843, 2018. https://doi.org/10.1109/tnnls.2017.2777489
- S. Liu, S. Wang, X. Liu, C. T. Lin and Z. Lv, "Fuzzy detection aided real-time and robust visual tracking under complex environments," IEEE Trans. Fuzzy Syst., vol. 29, pp. 90-102, Jun. 2020.
- I. Daubechies, M. Defrise and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413-1457, Aug. 2004. https://doi.org/10.1002/cpa.20042
- D. L. Donoho, "De-noising by soft-thresholding," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 613-627, May 1995. https://doi.org/10.1109/18.382009
- D. D. Lewis, Y. Yang, T.G. Rose and F. Li, "Rcv1: A new benchmark collection for text categorization research," J. Mach. Learn. Res., vol. 5, pp. 361-397, Apr. 2004.
- M. Brbic, M. Piskorec, V. Vidulin, A. Krisko, T. Smuc and F. Supek, "The landscape of microbial phenotypic traits and associated genes," Nucleic Acids Res., vol. 44, no. 21, pp. 10074-10090, Dec. 2016. https://doi.org/10.1093/nar/gkw964
- S. Gu, Q. Xie, D. Meng, W. Zuo, X. Feng and L. Zhang, "Weighted nuclear norm minimization and its applications to low level vision," Int. J. Comput. Vis., vol. 121, no. 2, pp. 183-208, Jan. 2017. https://doi.org/10.1007/s11263-016-0930-5