What are some of the deciding factors to take into consideration when choosing a similarity index? In what cases is Euclidean distance preferred over Pearson correlation, and vice versa?
2 Answers
Correlation is unit-independent: if you scale one of the objects by a factor of ten, the Euclidean distances change but the correlation distances stay the same. Correlation-based measures are therefore an excellent choice when you want to measure the distance between objects such as genes defined by their expression profiles.
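To make the scaling argument concrete, here is a minimal sketch in Python with NumPy. The two vectors and the helper names `euclidean_distance` and `correlation_distance` are made up for illustration, and the correlation distance is assumed to be 1 - r, which is one common convention rather than the only one.

```python
import numpy as np

def euclidean_distance(x, y):
    """Plain Euclidean (L2) distance between two vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.linalg.norm(x - y))

def correlation_distance(x, y):
    """Assumed convention: 1 minus the Pearson correlation coefficient."""
    r = np.corrcoef(x, y)[0, 1]
    return float(1.0 - r)

# Two hypothetical profiles (illustrative values only).
a = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
b = np.array([2.0, 1.0, 5.0, 3.0, 6.0])
b_scaled = 10.0 * b  # same profile measured in a unit ten times smaller

d_euc = euclidean_distance(a, b)
d_euc_scaled = euclidean_distance(a, b_scaled)
d_cor = correlation_distance(a, b)
d_cor_scaled = correlation_distance(a, b_scaled)

print("Euclidean:  ", d_euc, "->", d_euc_scaled)
print("Correlation:", d_cor, "->", d_cor_scaled)
```

The Euclidean distance changes when `b` is rescaled, while the correlation distance stays the same up to floating-point error.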
Often the distance is based on the absolute or squared correlation (for example, 1 - |r|), because we are more interested in the strength of the relationship than in its sign.
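As a rough illustration of why the sign is often dropped, the sketch below compares three assumed conventions, 1 - r, 1 - |r| and 1 - r², on a profile and its mirror image. The function names and the data are hypothetical.

```python
import numpy as np

def signed_corr_distance(x, y):
    """1 - r: anti-correlated objects end up maximally far apart."""
    return float(1.0 - np.corrcoef(x, y)[0, 1])

def abs_corr_distance(x, y):
    """1 - |r|: only the strength of the relationship matters."""
    return float(1.0 - abs(np.corrcoef(x, y)[0, 1]))

def sq_corr_distance(x, y):
    """1 - r**2: like 1 - |r|, but weak correlations are penalized more."""
    return float(1.0 - np.corrcoef(x, y)[0, 1] ** 2)

# A made-up profile and a perfectly anti-correlated copy of it.
profile = np.array([0.5, 1.5, 1.0, 3.0, 2.5])
mirrored = -profile + 4.0

d_signed = signed_corr_distance(profile, mirrored)
d_abs = abs_corr_distance(profile, mirrored)
d_sq = sq_corr_distance(profile, mirrored)

print(d_signed, d_abs, d_sq)
```

With the signed version the two profiles are as far apart as possible (distance close to 2), whereas the absolute and squared versions treat them as essentially identical (distance close to 0).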
However, correlation is only suitable for high-dimensional data; there is hardly any point in calculating it for two- or three-dimensional data points.
Also note that "Pearson distance" is a weighted type of Euclidean distance, and is not the same thing as the "correlation distance" based on the Pearson correlation coefficient.
It really depends on the application scenario you have in hand. Very briefly, if you are dealing with data where the actual difference in the values of the attributes is important, go with Euclidean distance. If you are looking for trend or shape similarity, go with correlation. Also note that if you apply z-score normalization to each object, Euclidean distance behaves like the Pearson correlation coefficient: for n-dimensional objects standardized with the population standard deviation, the squared Euclidean distance equals 2n(1 - r), so both measures order pairs of objects identically. Pearson is invariant to linear transformations of the data (shifting, and scaling by a positive factor; a negative factor only flips the sign). There are other types of correlation coefficients, such as Spearman's, that take only the ranks of the values into account and are therefore insensitive to any monotonically increasing transformation, linear or non-linear. Note that the usual way to turn correlation into a dissimilarity is 1 - correlation, which does not respect all the rules for a metric distance (for example, the triangle inequality can fail).
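The relationships described above can be checked numerically. The following is a small sketch on synthetic data, not taken from any of the cited work: `zscore`, `pearson` and `spearman_no_ties` are hypothetical helpers, the z-score uses the population standard deviation, and the rank correlation assumes there are no tied values.

```python
import numpy as np

def zscore(v):
    """Standardize one object using the population std (ddof=0)."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def pearson(x, y):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman_no_ties(x, y):
    """Rank correlation: Pearson on the ranks (valid when no values tie)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

# Synthetic, loosely correlated data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(size=50)
n = len(x)

# 1) After z-scoring, squared Euclidean distance equals 2n(1 - r).
sq_euclid_z = float(np.sum((zscore(x) - zscore(y)) ** 2))
from_pearson = 2 * n * (1.0 - pearson(x, y))

# 2) Pearson is unchanged by a positive linear transformation.
r_original = pearson(x, y)
r_linear = pearson(x, 3.0 * y + 7.0)

# 3) A monotone non-linear transformation changes Pearson but not the ranks.
y_exp = np.exp(y)
r_nonlinear = pearson(x, y_exp)
rho_original = spearman_no_ties(x, y)
rho_nonlinear = spearman_no_ties(x, y_exp)

print(sq_euclid_z, from_pearson)
print(r_original, r_linear, r_nonlinear)
print(rho_original, rho_nonlinear)
```

If the data may contain ties, a library routine that assigns average ranks would be a safer choice than the double `argsort` used here.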
There are some studies on which proximity measure to select for a particular application, for instance:
Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Ivan G. Costa Filho, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 99, no. PrePrints, p. 1, 2013.