Evaluation in designing a recommender system (Part 1)

Minh Nguyen
3 min read · Sep 23, 2021

Evaluation settings

1. Offline experiment

An offline experiment is performed by using a pre-collected dataset of users choosing or rating items to simulate the behavior of users interacting with a recommender system. This experiment is valid only under the assumption that user behavior at the time the dataset was collected is similar enough to user behavior at the time the recommender system is deployed.

Advantages: Offline experiments require no interaction with real users, and thus allow a wide range of candidate algorithms to be compared at low cost.

Disadvantages: This experiment relies heavily on the quality of the test dataset, and collecting a large dataset with a high degree of relevancy is challenging and not always possible in practice. Another drawback of offline experiments is that they can answer only a narrow set of questions, typically questions about the predictive power of an algorithm. Therefore, the goal of offline experiments is usually to filter out inappropriate approaches, leaving a relatively small set of candidate algorithms to be tested in the more costly user studies or online experiments.

Figure 1. Preprocessing data using k-fold cross-validation for an offline experiment. Falk, K. (2019). https://learning.oreilly.com/library/view/practical-recommender-systems/9781617292705/
Figure 2. The evaluation pipeline for the offline setting. Falk, K. (2019). https://learning.oreilly.com/library/view/practical-recommender-systems/9781617292705/
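
To make the offline setting concrete, below is a minimal sketch of the k-fold setup from Figure 1. It assumes a pre-collected file named ratings.csv with user_id, item_id, and rating columns (a hypothetical name, not from the article), and uses a trivial per-user-mean baseline to stand in for the recommender under test; in a real pipeline you would plug in your candidate algorithm and, often, split per user or by time rather than on raw rows.

```python
# A minimal sketch of an offline experiment with k-fold cross-validation,
# assuming a pre-collected file "ratings.csv" with columns user_id, item_id, rating.
# The baseline "model" (per-user mean rating) stands in for the algorithm under test.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

ratings = pd.read_csv("ratings.csv")  # hypothetical dataset path

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmse = []
for train_idx, test_idx in kf.split(ratings):
    train, test = ratings.iloc[train_idx], ratings.iloc[test_idx]

    # "Train": learn each user's mean rating plus a global fallback.
    user_means = train.groupby("user_id")["rating"].mean()
    global_mean = train["rating"].mean()

    # "Predict": look up the user's mean, fall back to the global mean for unseen users.
    preds = test["user_id"].map(user_means).fillna(global_mean)

    fold_rmse.append(np.sqrt(np.mean((test["rating"] - preds) ** 2)))

print(f"Mean RMSE over {kf.get_n_splits()} folds: {np.mean(fold_rmse):.4f}")
```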

2. User study (controlled experiment)

A user study is conducted by recruiting a set of testers and asking them to perform several tasks that require interacting with the recommender system. Through this, a record of behavior, quantitative measurements, and even qualitative data can be collected.

Advantages: Because it involves direct interaction with users, this is the only setting that allows us to capture qualitative data by questioning testers before, during, and after the tasks are completed. With a carefully designed experiment, we can collect valuable feedback from users and test the influence of a recommender system (RS) on users’ behavior.

Disadvantages: Since it involves recruiting and paying testers, this is perhaps the most expensive setting, and not every company can afford to run it.
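
As a small illustration of the quantitative side of a user study (my own example, not from the article), the sketch below assumes a within-subjects design in which each tester rated two recommender variants, and compares the satisfaction scores with a paired t-test from SciPy. The scores are purely illustrative numbers.

```python
# A hypothetical analysis of user-study data: each tester rated both
# recommender variants (within-subjects design), so a paired t-test applies.
import numpy as np
from scipy.stats import ttest_rel

# Satisfaction scores (1-5) per tester; purely illustrative numbers.
scores_variant_a = np.array([4, 3, 5, 4, 2, 4, 3, 5, 4, 3])
scores_variant_b = np.array([3, 3, 4, 3, 2, 3, 3, 4, 3, 2])

t_stat, p_value = ttest_rel(scores_variant_a, scores_variant_b)
print(f"mean A={scores_variant_a.mean():.2f}, mean B={scores_variant_b.mean():.2f}")
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.3f}")
```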

3. Online experiment

An online setting is one in which the RS is used by real users performing real tasks, oblivious to the experiment. Typically, such a setup redirects a small percentage of the traffic to the new recommendation engine and records the users’ interactions with it.

Advantages: This setting is the most trustworthy of the three because, if conducted properly, it yields realistic, unbiased data from users. This data enables a more comprehensive evaluation of a recommendation engine.

Disadvantages: Since it involves real users, running this experiment can be risky: if the RS performs poorly, it can degrade the users’ experience.

Figure 3. A/B testing is a typical choice for the online experiment. In this setup, visitors are split into two groups: the test group that sees the new feature and a control group that continues as usual. Falk, K. (2019). https://learning.oreilly.com/library/view/practical-recommender-systems/9781617292705/
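
To sketch how the A/B split in Figure 3 might be implemented and evaluated (an illustrative sketch, not the book’s code), the example below assigns users deterministically to a control or test group by hashing their IDs, then compares click-through rates with a two-proportion z-test. The 10% traffic split and the click counts are assumed numbers.

```python
# A sketch of an A/B test for an online experiment: users are bucketed
# deterministically by hashing their ID, and click-through rates of the
# control and test groups are compared with a two-proportion z-test.
import hashlib
import numpy as np
from scipy.stats import norm

def assign_group(user_id: str, test_fraction: float = 0.10) -> str:
    """Send a fixed fraction of users to the new recommender, the rest to control."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "test" if bucket < test_fraction * 100 else "control"

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """z-test for the difference between two click-through rates."""
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (clicks_a / n_a - clicks_b / n_b) / se
    return z, 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

# Deterministic bucketing: the same user always lands in the same group.
print(assign_group("user_12345"))

# Illustrative outcome counts collected during the experiment.
z, p = two_proportion_ztest(clicks_a=420, n_a=5000, clicks_b=455, n_b=5200)
print(f"control CTR={420/5000:.3f}, test CTR={455/5200:.3f}, z={z:.2f}, p={p:.3f}")
```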
