Correspondence Analysis
Correspondence analysis is a multivariate statistical technique that summarizes a set of categorical data in a two dimensional form. It’s like the equivalent of Principal Component Analysis but for categorical data.
Correspondence analysis is usually applied to contigency tables. In this post, we will apply it to a frequency matrix (term document matrix from bag of words representation).
The analysis can be done by row or by column. Below is an implementation of correspondence analysis, where row and column analysis are done at the same time.
correspondence <- function(ct, ind){
#Parameters
#ct : contingency table (or frequency table)
#ind: which eigenvectors (first eigenvector is ommited)
n <- sum(ct)
rows <- nrow(ct)
cols <- ncol(ct)
#Correspondence Matrix
F_fisher<-(ct)/n
#Relative frequencies
rtot<-(apply(ct,1,sum))/n
ctot<-apply(ct,2,sum)/n
Dr<-diag(rtot)
Dc<-diag(ctot)
Z<-(sqrt(solve(Dr)))%*%F_fisher%*%(sqrt(solve(Dc)))
#Eigenvalues and eigenvector are obtained with SVD
dvalsing<-svd(Z)
#Two dimensional representation
#Row analysis
Cr<-(sqrt(solve(Dr)))%*%Z%*%dvalsing$v[,ind]
#Column analysis
Cc<-(sqrt(solve(Dc)))%*%t(Z)%*%dvalsing$u[,ind]
return(list("Cr" = Cr, "Cc" = Cc))
}
Mexican discourses
In this post we will analize the discourses of mexican politicians, in particular, candidates for Mexico presidency. We have 11 discourses in total:
- Roberto Madrazo Pintado (PRI 2006)
- Andres Manuel Lopez Obrador (PRD 2006) (PRD 2012) (MORENA 2018)
- Enrique Peña Nieto (PRI 2012 before and after being elected)
- Josefina Vazquez Mota (PAN 2012)
- Felipe Calderon (PAN 2006)
- Ricardo Anaya Cortes (PAN 2018)
- Jose Antonio Meade Kuribreña (PRI 2018)
- Margarita Ester Zavala Gomez del Campo (Independiente 2018)
Our objective is to find patterns in the two dimensional of the discourses, that reflect information of the actual Mexico context regarding politics.
Putting it all together
We will use the bag of words representation for the discourses. The most frequent 500 words will be chosen for the analysis, and our final term document matrix will be a 11 x 500 matrix.
Next, we see the results of the correspondence analysis appplied to our term document matrix:
Insights
We can see that Ricardo Anaya and Roberto Madrazo are the furthest. That means in this context that they use words in their discourses that the other candidates don’t use frequently.
The three discourses from Andres Manuel are near from each other, and that was expected. And Margarita Zavala is close to Josefina Vazquez Mota. That makes sense, as their campaings are based on the idea of a woman in the presidency, so it’s logical that they use similar words in their discourses.
Another interesting insight is the closeness between Felipe Calderon and Margarita Zavala. It turns out that the team that helped Zavala in her campaign were former collaborators of Felipe Calderon, so maybe she was advised in the same way that Calderon. Check this new here.
The final insight was the closeness between Margarita Zavala and Jose Antonio Meade. Recently, Zavala has resigned from her candidacy, and, surprisingly, Jorge Camacho (former campaign chief from Zavala campaign) has anounced that he intends to vote for Meade. Perhaps he intends to vote for the candidate with the most similar ideas, and that would explain the closeness in our analysis. Check this new here.
Final thoughts
Correspondence analysis has proven to be useful in finding patterns on frequencey matrices. We saw how some of the political news can be reflected in a discourse analysis. For future work, we can use MDS in the term frequency matrix to obtain “data points” and train a classificator! But correspondence analysis is good for a initial representation.
Discourses
Discourses obtained from animalpolitico.com
Comments