Correspondence Analysis

Correspondence analysis is a multivariate statistical technique that summarizes a set of categorical data in a two dimensional form. It’s like the equivalent of Principal Component Analysis but for categorical data.

Correspondence analysis is usually applied to contigency tables. In this post, we will apply it to a frequency matrix (term document matrix from bag of words representation).

The analysis can be done by row or by column. Below is an implementation of correspondence analysis, where row and column analysis are done at the same time.

correspondence <- function(ct, ind){
  
  #Parameters
  #ct : contingency table (or frequency table)
  #ind: which eigenvectors (first eigenvector is ommited)
  
  n <- sum(ct)
  rows <- nrow(ct)
  cols <- ncol(ct)
  
  #Correspondence Matrix
  F_fisher<-(ct)/n
  
  #Relative frequencies
  rtot<-(apply(ct,1,sum))/n
  ctot<-apply(ct,2,sum)/n
  Dr<-diag(rtot)
  Dc<-diag(ctot)
  
  
  Z<-(sqrt(solve(Dr)))%*%F_fisher%*%(sqrt(solve(Dc)))
  
  #Eigenvalues and eigenvector are obtained with SVD
  dvalsing<-svd(Z)
  
  #Two dimensional representation
  #Row analysis
  Cr<-(sqrt(solve(Dr)))%*%Z%*%dvalsing$v[,ind]
  #Column analysis
  Cc<-(sqrt(solve(Dc)))%*%t(Z)%*%dvalsing$u[,ind]
  
  return(list("Cr" = Cr, "Cc" = Cc))
}

Mexican discourses

In this post we will analize the discourses of mexican politicians, in particular, candidates for Mexico presidency. We have 11 discourses in total:

Our objective is to find patterns in the two dimensional of the discourses, that reflect information of the actual Mexico context regarding politics.

Putting it all together

We will use the bag of words representation for the discourses. The most frequent 500 words will be chosen for the analysis, and our final term document matrix will be a 11 x 500 matrix.

Next, we see the results of the correspondence analysis appplied to our term document matrix:

open

Insights

We can see that Ricardo Anaya and Roberto Madrazo are the furthest. That means in this context that they use words in their discourses that the other candidates don’t use frequently.

The three discourses from Andres Manuel are near from each other, and that was expected. And Margarita Zavala is close to Josefina Vazquez Mota. That makes sense, as their campaings are based on the idea of a woman in the presidency, so it’s logical that they use similar words in their discourses.

Another interesting insight is the closeness between Felipe Calderon and Margarita Zavala. It turns out that the team that helped Zavala in her campaign were former collaborators of Felipe Calderon, so maybe she was advised in the same way that Calderon. Check this new here.

The final insight was the closeness between Margarita Zavala and Jose Antonio Meade. Recently, Zavala has resigned from her candidacy, and, surprisingly, Jorge Camacho (former campaign chief from Zavala campaign) has anounced that he intends to vote for Meade. Perhaps he intends to vote for the candidate with the most similar ideas, and that would explain the closeness in our analysis. Check this new here.

Final thoughts

Correspondence analysis has proven to be useful in finding patterns on frequencey matrices. We saw how some of the political news can be reflected in a discourse analysis. For future work, we can use MDS in the term frequency matrix to obtain “data points” and train a classificator! But correspondence analysis is good for a initial representation.

Discourses

Discourses obtained from animalpolitico.com