Evaluation in designing a recommender system (Part 3)

Minh Nguyen
4 min read · Oct 10, 2021

This is the third part of my series on evaluating recommender systems. Please find the previous parts here:

2. Coverage

One of the issues a large-scale system often has to deal with is large-scale data. In such a system, data don't just sit still; they are highly dynamic. That poses a real challenge for any recommender system, because providing high-quality recommendations alone is no longer enough. The system also has to cover as many items as possible, giving users more choices and further optimizing sales. Coverage is the property that addresses this issue.

2.1 Item space coverage

Most commonly, the term coverage refers to the proportion of items that the recommendation system can recommend.

Here, we denote $I$ as the set of available items and $I_p$ as the set of items for which a prediction can be made. Prediction coverage is then:

$$\text{Coverage} = \frac{|I_p|}{|I|}$$
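As a minimal sketch, assuming the recommender exposes some way to check whether an item can be scored (the `can_predict` callable below is a hypothetical stand-in for that check):

```python
# Prediction coverage: the fraction of the catalog the system can score.
# `can_predict` is a hypothetical stand-in for whatever check your
# recommender exposes (e.g., "does this item have enough ratings?").
def prediction_coverage(all_items, can_predict):
    predictable = [item for item in all_items if can_predict(item)]
    return len(predictable) / len(all_items)

# Toy example: the model can only score items with at least one rating.
ratings_count = {"a": 5, "b": 0, "c": 2, "d": 0}
print(prediction_coverage(ratings_count, lambda i: ratings_count[i] > 0))  # 0.5
```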

In some cases, it may be desirable to weight the items, for example by their popularity or utility:

$$\text{Coverage}_w = \frac{\sum_{x \in I_p} r(x)}{\sum_{x \in I} r(x)}$$

where $r(x)$ gives the usefulness of item $x$.
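Under the same assumptions, the weighted variant only needs a utility function $r(x)$ on top; here we use a hypothetical popularity score:

```python
# Weighted coverage: items are weighted by a utility r(x), e.g. popularity,
# so failing to cover a popular item hurts more than missing a fringe one.
def weighted_coverage(all_items, predictable_items, r):
    return sum(r(x) for x in predictable_items) / sum(r(x) for x in all_items)

# Toy example with popularity as the utility.
popularity = {"a": 100, "b": 50, "c": 10, "d": 1}
print(weighted_coverage(popularity, ["a", "c"], popularity.get))  # ~0.68
```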

The catalog coverage metric is the proportion of distinct items that are recommended to users over a period of time:

$$\text{Catalog coverage} = \frac{\left|\bigcup_{j=1}^{N} I_L^{(j)}\right|}{|I|}$$

where $I_L^{(j)}$ is the list of items recommended in the $j$-th of the $N$ recommendations made during that period.
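A sketch of this metric, assuming we have the recommendation lists served during the measurement window in hand (e.g., from logs):

```python
# Catalog coverage: distinct items recommended over a period, divided by the
# catalog size. `served_lists` would come from your recommendation logs.
def catalog_coverage(served_lists, catalog_size):
    distinct_items = set().union(*served_lists)
    return len(distinct_items) / catalog_size

served_lists = [["a", "b"], ["b", "c"], ["a", "c"]]  # toy logged lists
print(catalog_coverage(served_lists, catalog_size=10))  # 0.3
```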

Another popular alternative is sales diversity. This property measures how unequally different items are chosen by users when a particular recommender system is used. One way to measure it is the Shannon entropy:

$$H = -\sum_{i=1}^{n} p(i)\log p(i)$$

where $p(i)$ is the proportion of user choices (or recommendations) that fall on item $i$.

The entropy is 0 when a single item is always chosen or recommended, and $\log n$ when $n$ items are chosen or recommended equally often.
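To make this concrete, a small sketch that computes the entropy from raw choice counts (the counts below are made up for illustration):

```python
import math

# Shannon entropy of item choices: 0 when a single item dominates completely,
# log(n) when all n items are chosen equally often.
def sales_diversity(choice_counts):
    total = sum(choice_counts.values())
    probs = (c / total for c in choice_counts.values() if c > 0)
    return sum(-p * math.log(p) for p in probs)

print(sales_diversity({"a": 10, "b": 0, "c": 0}))  # 0.0: one item only
print(sales_diversity({"a": 5, "b": 5, "c": 5}))   # ~1.099 = log(3)
```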

2.2 User space coverage

Ideally, we want all users to benefit from the recommendations an RS gives. However, due to low confidence or other issues, a system sometimes cannot provide recommendations to all of its users.

Expanding the set of users the RS serves is desirable, but we also need to consider the trade-off between coverage and accuracy. Measuring this property is straightforward: divide the number of users who receive recommendations by the total number of users in the system. A more suitable choice is to consider only the set of active users rather than all users.
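A minimal sketch, assuming we know which users received recommendations and which users count as active:

```python
# User space coverage: fraction of users the RS actually serves.
# Restricting the denominator to active users usually gives a fairer picture.
def user_coverage(users_with_recs, base_users):
    return len(users_with_recs & base_users) / len(base_users)

all_users    = {"u1", "u2", "u3", "u4", "u5"}
served_users = {"u1", "u3"}
active_users = {"u1", "u2", "u3"}
print(user_coverage(served_users, all_users))     # 0.4 over all users
print(user_coverage(served_users, active_users))  # ~0.67 over active users
```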

2.3 Cold start

Cold start refers to the system's performance on new items and new users. Here, we apply methods similar to the ones above to compute the coverage property over “cold” items. Thus, it is important to define up front what a “cold” item is. There are various ways to consider an item cold, depending on the domain. One option is to set a threshold on an item's “coldness”, for example the amount of time the item has existed in the system, or how many ratings/purchases it has received so far. Some cold items are great to surface, while recommending others might lower users' impression of the system's product quality. So, once again, we should weigh the trade-off between coverage and accuracy carefully.
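As a sketch, one way to operationalize this is to flag items as cold with thresholds on age and rating count, then compute coverage over that subset only. The thresholds below are illustrative assumptions, not recommendations:

```python
# Cold-start coverage: the usual coverage computation, restricted to "cold"
# items. What counts as cold is domain-specific; the thresholds here are
# purely illustrative.
def cold_items(item_meta, max_age_days=7, max_ratings=5):
    return {item for item, meta in item_meta.items()
            if meta["age_days"] <= max_age_days or meta["ratings"] <= max_ratings}

def cold_start_coverage(item_meta, recommended):
    cold = cold_items(item_meta)
    return len(cold & recommended) / len(cold) if cold else 0.0

item_meta = {
    "new_song": {"age_days": 2,   "ratings": 1},     # cold: brand new
    "old_hit":  {"age_days": 900, "ratings": 5000},  # warm
    "niche":    {"age_days": 400, "ratings": 3},     # cold: barely rated
}
print(cold_start_coverage(item_meta, recommended={"new_song"}))  # 0.5
```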

3. Diversity

In recommenders that assist in information search, we can assume that more diverse recommendations will result in shorter search interactions. In business, diversity can help users explore more areas of interest, which is likely to improve cross-selling. Diversity is generally defined as the opposite of similarity. The most explored class of methods for measuring diversity captures item-item similarity, typically based on item content. Given a recommendation list $R$ (with $|R| > 1$), the measure below takes the average pairwise distance between items in the list:

$$\text{Diversity}(R) = \frac{\sum_{i \in R} \sum_{j \in R,\, j \neq i} d(i, j)}{|R|\,(|R| - 1)}$$

where $d(i, j)$ is a distance (dissimilarity) between items $i$ and $j$.

Alternatively, the same idea can be expressed indirectly through the “intra-list similarity” metric, which is the aggregate pairwise similarity of the items in the list:

$$\text{ILS}(R) = \frac{\sum_{i \in R} \sum_{j \in R,\, j \neq i} \text{sim}(i, j)}{2}$$

Please note that with this metric, a higher score denotes lower diversity of the list.
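A sketch of both measures, assuming items are represented by content vectors and using cosine similarity, with distance taken as 1 − similarity; the vectors below are made up for illustration:

```python
import itertools
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Average pairwise distance: higher means a MORE diverse list.
def diversity(item_vectors):
    pairs = list(itertools.combinations(item_vectors, 2))
    return sum(1 - cosine_sim(u, v) for u, v in pairs) / len(pairs)

# Intra-list similarity: aggregate pairwise similarity;
# higher means a LESS diverse list.
def intra_list_similarity(item_vectors):
    return sum(cosine_sim(u, v)
               for u, v in itertools.combinations(item_vectors, 2))

recs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]  # toy item content vectors
print(diversity(recs))               # average pairwise distance
print(intra_list_similarity(recs))   # aggregate pairwise similarity
```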
