Popular metrics for measuring data similarity

Minh Nguyen
6 min read · Nov 1, 2021

i. Introduction

Working with data, we often want to make comparisons in the hope that similar objects will behave in a similar manner or belong to the same group. In the Information Retrieval (IR) context, we might compare two queries so we can return already well-established results to a search engine’s user. In a Recommender System (RS), we compare two items so we can show a customer something similar to what she has bought before, which increases purchases on our system.

There exist various such similarity (or dissimilarity) methods, depending on how we define what similar objects should be. For instance, if two objects are lists of discrete items, the lists can be considered similar if they share many common items. Or if two objects are represented as vectors, their similarity can be measured by the distance or angle between them. However, due to the complex nature of data and the variety of task objectives, finding the most suitable metric is often a trial-and-error process that we cannot settle at the beginning of a project. Below, I explain some of the methods I frequently encountered while working as a data scientist.

ii. Similarity methods

1. Common items-based methods

1.1 Jaccard similarity

Jaccard similarity is a metric that computes the similarity between two objects, mostly lists or sets of discrete items such as tokenized textual sequences. Its similarity mechanism is based on the common items between the two sets: the size of their intersection divided by the size of their union.

This is a simple method and has been used in a wide range of applications. For example, under this method, two textual sequences are considered similar if they share many tokens. However, due to its simplicity, precautions must be taken. In the Natural Language Processing (NLP) context, the method does not consider token frequency; it just counts the tokens that are common between the two sequences. This is not ideal, since token frequency carries important information. Also, Jaccard similarity is sensitive to set size: when one set is much larger than the other, the score stays low even if the smaller set is entirely contained in the larger one. That might be less than ideal in some scenarios.
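As a minimal sketch, Jaccard similarity over two tokenized sequences can be computed like this (the sentences are just illustrative):

```python
def jaccard_similarity(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| for two collections of items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)

s1 = "the quick brown fox jumps".split()
s2 = "the lazy brown dog sleeps".split()
print(jaccard_similarity(s1, s2))  # 2 shared tokens of 8 unique -> 0.25
```

Note that converting to `set` is exactly where token frequency is discarded, as discussed above.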

1.2 Hamming similarity

This classic method is the inverse of the Hamming distance, which is also based on common items between two lists and is typically used for categorical variables. At each position, if value x and value y are the same, the distance D is 0; otherwise, D is 1. The total distance is the number of positions at which the two lists differ.

While this is a simple method, it does have real applications, such as comparing two binary strings. In the recommender system context, the Hamming distance between the rating lists of two items can reveal meaningful information about their similarity.
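A short sketch of both the distance and its similarity form (here similarity is the fraction of matching positions, one simple way to invert the distance):

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))

def hamming_similarity(x, y):
    """Fraction of matching positions, in [0, 1]."""
    return 1 - hamming_distance(x, y) / len(x)

print(hamming_distance("10110", "10011"))    # differs at positions 2 and 4 -> 2
print(hamming_similarity("10110", "10011"))  # 3 of 5 positions match -> 0.6
```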

2. Distance based similarity metrics

Distance-based similarity metrics rely on the idea that similar vectors should be close together, while different ones lie far apart in the vector space. To obtain a similarity score, we often adopt some inverse form of the distance score. One common example adds 1 to the denominator so that the similarity score falls within the [0, 1] interval: similarity = 1 / (1 + distance).

2.1 Euclidean Distance

Perhaps the most well-known metric in this class is Euclidean Distance, aka L2 distance or L2 norm. The method represents the shortest distance between two points.

Image source: https://en.wikipedia.org/wiki/Euclidean_distance

Although Euclidean distance is very common in clustering, it has a problem shared by the Minkowski family of metrics it belongs to: the largest-scaled feature dominates the others, and the similarity score is very sensitive to outliers. Normalizing continuous features is one solution to this problem; another is to take the average Euclidean distance. For two data points x, y in n-dimensional space, the average distance is defined as d_avg(x, y) = √( (1/n) Σᵢ (xᵢ − yᵢ)² ).

If the relative importance of each attribute is available, the weighted Euclidean distance can be used: d_w(x, y) = √( Σᵢ wᵢ (xᵢ − yᵢ)² ), where wᵢ is the weight of the i-th attribute.
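The three variants above, plus the 1 / (1 + distance) similarity transform from earlier, can be sketched in a few lines (the vectors are illustrative):

```python
import math

def euclidean(x, y):
    """Plain Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def avg_euclidean(x, y):
    """Average Euclidean distance: the squared sum is divided by n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance with per-attribute weights w."""
    return math.sqrt(sum(wi * (a - b) ** 2 for a, b, wi in zip(x, y, w)))

def dist_to_similarity(d):
    """Map a distance in [0, inf) to a similarity score in (0, 1]."""
    return 1.0 / (1.0 + d)

x, y = [0.0, 3.0], [4.0, 0.0]
print(euclidean(x, y))                      # sqrt(16 + 9) = 5.0
print(dist_to_similarity(euclidean(x, y)))  # 1 / 6 ≈ 0.167
```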

2.2 Manhattan Distance

Another method in this class is Manhattan distance, aka L1 distance or L1 norm. This is the sum of absolute differences between two points across all dimensions: d(x, y) = Σᵢ |xᵢ − yᵢ|.

In the image below, the Manhattan distance between point A and point B is the total length of all the blue dotted lines.

Image source: http://bloggity.nurdz.com/gamedev-math/manhattan/

Compared to Euclidean distance, because this method does not involve any exponentiation, Manhattan distance is said to be more robust to outliers. For the same reason, it is also faster to calculate. That is a huge benefit, especially if we have to deal with the “curse of dimensionality” while building a model.
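Using the same example points as before, a minimal sketch:

```python
def manhattan(x, y):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = [0.0, 3.0], [4.0, 0.0]
print(manhattan(x, y))  # |0 - 4| + |3 - 0| = 7.0
```

For the same pair of points, the L1 distance (7.0) is larger than the L2 distance (5.0); the squaring in L2 is also what amplifies the contribution of any outlying coordinate.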

3. Angle based similarity metrics

Working in the NLP field, we often face the situation where two textual sequences look very different, but in fact, they share similar meanings. Here is one example:

Sentence 1: I asked her to go out, but she refused it.

Sentence 2: She did not want to hang out with me tonight when I brought it up.

If the two sentences are represented as vectors, then because of their syntactic difference, the vectors might lie very far apart. In such a scenario, measuring similarity based on the angle between the two vectors is a good alternative.

Cosine similarity is defined as the inner product of two vectors divided by the product of their L2 norms: cos(x, y) = (x · y) / (‖x‖ ‖y‖).

The division makes the cosine similarity score bounded between -1 and 1. If both x and y are non-negative, the score is actually bounded between 0 and 1.

This is a very popular similarity metric in NLP tasks such as word embeddings and sentence similarity. However, it is not perfect: cosine similarity is not invariant to shifts, so if x were shifted to x + 1, the score would change. It is, on the other hand, independent of vector length (invariant to scaling).
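A short sketch, which also demonstrates the two properties just mentioned: scaling a vector leaves the score unchanged, while shifting it does not.

```python
import math

def cosine_similarity(x, y):
    """Inner product divided by the product of L2 norms; in [-1, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [1.0, 2.0, 3.0]
print(cosine_similarity(x, [2.0 * a for a in x]))  # 1.0: scaling is ignored
print(cosine_similarity(x, [a + 1.0 for a in x]))  # < 1.0: shifting is not
```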

In item-based Collaborative Filtering systems, item similarity is computed along the rating columns: each item is represented by the vector of ratings it received from users. The basic form of cosine similarity has a major drawback here: it does not take into account differences in users’ rating scales. The adjusted cosine similarity metric addresses this issue by subtracting each user’s average rating from that user’s ratings. The metric is given by: sim(i, j) = Σᵤ (Rᵤ,ᵢ − R̄ᵤ)(Rᵤ,ⱼ − R̄ᵤ) / ( √Σᵤ (Rᵤ,ᵢ − R̄ᵤ)² · √Σᵤ (Rᵤ,ⱼ − R̄ᵤ)² ), where R̄ᵤ is user u’s average rating and the sums run over users who rated both items.

Centered cosine similarity, or the Pearson correlation coefficient, instead measures how far the ratings of an item deviate from that item’s average rating over all users. Centering the ratings this way also helps mitigate the missing-rating issue in a Recommender System.
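A minimal sketch of adjusted cosine similarity, assuming a toy user-item rating matrix (rows are users, columns are items, `None` marks a missing rating; the numbers are purely illustrative):

```python
import math

# Toy ratings matrix: rows = users, columns = items; None = missing rating.
ratings = [
    [5, 3, 4],
    [4, None, 5],
    [1, 2, 1],
]

def adjusted_cosine(ratings, i, j):
    """Cosine similarity of items i and j after subtracting each user's
    average rating (computed over that user's observed ratings)."""
    num = den_i = den_j = 0.0
    for row in ratings:
        if row[i] is None or row[j] is None:
            continue  # only users who rated both items contribute
        observed = [r for r in row if r is not None]
        mean_u = sum(observed) / len(observed)
        di, dj = row[i] - mean_u, row[j] - mean_u
        num += di * dj
        den_i += di * di
        den_j += dj * dj
    return num / (math.sqrt(den_i) * math.sqrt(den_j))

print(adjusted_cosine(ratings, 0, 2))  # ≈ -0.198
```

Centering by each user’s mean is what removes the rating-scale bias: a generous user’s 4 and a harsh user’s 2 can both become small positive deviations.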

