Understanding User Behavior For Document Recommendation

Personalized document recommendation systems aim to provide users with a quick shortcut to the documents they may want to access next, usually with an explanation about why the document is recommended. Previous work explored various methods for better recommendations and better explanations in different domains. However, there are few efforts that closely study how users react to the recommended items in a document recommendation scenario. We conducted a large-scale log study of users’ interaction behavior with the explainable recommendation on one of the largest cloud document platforms office.com. Our analysis reveals a number of factors, including display position, file type, authorship, recency of last access, and most importantly, the recommendation explanations, that are associated with whether users will recognize or open the recommended documents. Moreover, we specifically focus on explanations and conduct an online experiment to investigate the influence of different explanations on user behavior. Our analysis indicates that the recommendations help users access their documents significantly faster, but sometimes users miss a recommendation and resort to other more complicated methods to open the documents. Our results suggest opportunities to improve explanations and more generally the design of systems that provide and explain recommendations for documents.


INTRODUCTION
Personalized recommendation is taking place in almost every aspect of our life. It offers users more exposure to what they may be interested in, and helps them save time finding what they need. For cloud-based document platforms such as Microsoft Office 365 and Google Drive, recommendations aim to provide users with quick shortcuts to the documents they may want to access next, alleviating the burden of memorizing folder structure and easing the document management and access processes. Document recommendation has important differences compared to other recommendations such as movies and shopping items. People typically know a lot about the documents (e.g., the type of document, the author of the document and when they last interacted with it), and they often have a clear goal of finding or re-finding specific documents when they visit the document platform. There are also situations where users will want to open a document shared through collaborative work effort, even if they haven't seen the document before.
An accurate recommendation algorithm is important for the success of a document recommendation system. Additionally, explanations of why a document was recommended help users recognize the document. Explanations can enhance the effectiveness, persuasiveness, and user satisfaction of personalized recommendation systems [29,30]. Recently, the topic of explainable recommendation has received increasing attention [31]. Various methods have been proposed to provide explanations of the recommendation results (e.g., [2,13,23]). However, there is less work that specifically focuses on users' interactions with explanations and their effectiveness in document recommendation. Studying the effects of explanations on users' behavior is important to understand how they perceive explanations, identify better explanation designs, and improve the overall user experience.
In this paper, we focus on online document platforms, where recommendations and explanations are generated based on the documents users can access, their interaction history with the documents, and their network of collaborators. The study aims to answer three main research questions. Our first question aims to understand user behavior towards recommendations (RQ1): what are the characteristics of users' interaction with recommended documents on a cloud document platform? Beyond the basic characterization, we are particularly interested in the relationship between recommendation explanations and users' behavior, which leads to our second question (RQ2): how are explanations that reflect various interaction histories associated with user behavior for the recommended documents? As correlation does not indicate causality, knowing their association does not tell us what effect explanations have on users. We therefore examine a third question (RQ3): how is user behavior influenced by different explanations?
To answer these questions, we used large-scale log data from users' interactions with a major document platform, Microsoft Office 365. Figure 1 shows the interface on the initial page of the main website office.com, with a Recommended Document Pane (RDP) in the middle. Our observational log study characterizes users' interaction behavior towards the RDP, which answers RQ1. We further conducted an online randomization study on explanations to better characterize the influence of explanations on user behavior, which answers RQ2 and RQ3. Our results reveal interesting characteristics of user behavior towards various factors. The RDP helps users access their documents significantly faster. But there are also opportunities to improve explanations, e.g., users sometimes missed the document in the RDP and resorted to other more complicated methods to find the file. Our findings shed light on better designs of the recommendation explanations.
The contributions of this paper are threefold:
• Using large-scale observational log analysis, we provide the first characterization of users' behavior towards document recommendations in an online document platform.
• We examine, in detail, how explanations that reflect different interaction histories are correlated with user interaction with the recommended documents.
• Using an online randomization study, we investigate the impact of different explanations on users' behavior.

RELATED WORK

Characterizing User Behavior using Log Data
The development of centralized computing and the Internet makes it possible to capture users' interaction with web services at a tremendous scale [12]. Large-scale log analysis enables researchers to understand and characterize user behavior in a wide range of scenarios, such as search engines [15,24,26], web browsing [1,25], and email [3,4,11]. There are two major types of log studies [12]: 1) observational log studies, where massive amounts of log data are observed and collected to provide a descriptive overview of user behavior, such as [3,4,27,28], and 2) experimental log studies, where in situ experiments are conducted and log data is collected and compared between the experiment group(s) and a control group. We conduct both types of log studies in this paper, involving over a million users. To the best of our knowledge, we are the first to deeply investigate user behavior towards recommendation explanations with such large-scale log analysis.

Explainable Recommendation
Explainable recommendation refers to recommendation systems that provide an explanation of why an item is recommended [30]. Two main strategies are used to generate explanations. One line of research focuses on the interpretability of the recommendation model, such as topic modeling [19], matrix factorization [13], and deep learning [22]. Another strategy is post-hoc analysis, where the recommendation model is treated as a black box and separate methods are used to generate explanations. Examples include Markov logic networks [7] and association rule mining [21]. Since our focus is user behavior towards explanations rather than explanation generation, we treat our recommendation algorithm as a black box and employ a post-hoc heuristic explanation annotator. Explanations can be expressed in different styles, such as content-based [16] and context-based [17]. Moreover, explanations can be displayed in different ways, e.g., text sentences [9] and graphics [8]. We refer readers to [30] for a comprehensive review of explainable recommendation. We display explanations in natural language based on users' actions on the documents and their collaboration network.

User Reactions Towards Explanations
The effect of recommendation explanations needs to be evaluated with real users [14]. Existing works usually ask participants to answer surveys after the explanations are displayed. The metrics include participants' subjective ratings on quality, trust, satisfaction, efficiency, etc. [9,10,23]. However, these evaluations usually happen under an experimental setting such as Amazon MTurk that does not reflect real user behavior. Only a few studies evaluate explanations' influence under real situations. Zhang et al. [31] evaluated their explanations on an online shopping platform using customers' click-through rate (CTR) and purchase rate. McInerney et al. [20] evaluated explanations on a music platform using the rate at which users played at least one song from the recommended playlist. The metrics in both works are some form of click rate, while the interactions with online documents are much richer. In this paper, we investigate various behavior metrics to characterize user behavior, including searching, recognizing, and clicking behavior. To our knowledge, we are the first to investigate rich user behavior towards recommendation explanations.

ANALYSIS SCOPE AND LOG DATA
We first introduce our log data and analysis scope. More importantly, we introduce the concept of users' intent to open a document to pinpoint our focus on the right population.

Log Data
We analyze random samples of log data of the office.com web client in North America from two periods of time. The first period is for the observation log analysis (RQ1), with the range from May 1 to 31, 2019 involving millions of users. The second period is for the explanation-randomization study on 10% of these users (RQ2, RQ3), at the time period from August 19 to September 1, 2019.
As we focus on the RDP (see Figure 1), we only study users who have enough candidates in the RDP (i.e., 4 or more, so that they could see a full RDP page when visiting the website) and who clicked on the RDP at least once during the analysis period. From this subset, we sample approximately 800K users and their (millions of) visits to office.com in the first period, and randomly sample 10% of the users in the second period to receive the randomization treatment.
The log data contains two types of interactions: 1) Interactions on the cloud platform: interactions on items, apps, and other links on office.com. 2) Interactions with the document: open, edit, comment, etc. In addition to user behavior information, the logs also contain rich metadata of the documents in the RDP, including a unique document id, display position index of every document, the type of recommendation explanations, document size, etc. The logs do not contain any document or explanation content or personally identifiable information (PII). Note that we treat the recommendation model and the explanation generation model as black boxes and only log the documents recommended to the user.

Users' Intent on office.com
Users visit office.com (Figure 1) for a variety of reasons, including to find documents or to navigate to Office apps or sites. Figure 2 illustrates some examples of users' actions after visiting office.com. Sometimes users use the website as a hub to open a web app (e.g., Outlook), sometimes they have the intent to find and open a document. In this paper, we are interested in cases where users have the intent to open a document when they visit office.com. To capture the intent, we examine the subset of visits where users open a document somewhere within 3 minutes after they visit office.com. We select 3 minutes as the threshold since this is the 99th percentile of the interval between visiting and document opening according to our log data. It is noteworthy that somewhere includes all cases, such as RDP, the recent document list, etc. All of our analyses only involve these visits with users' intent to open a document.
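The intent filter described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the `Visit` record, its field names, and the sample timestamps are hypothetical and do not reflect the platform's actual log schema.

```python
from dataclasses import dataclass
from typing import Optional

INTENT_THRESHOLD_SECONDS = 3 * 60  # 3-minute threshold stated in the text


@dataclass
class Visit:
    """Hypothetical visit record; field names are illustrative only."""
    visit_time: float                 # seconds since some reference time
    first_open_time: Optional[float]  # time of the first document open, if any


def has_open_intent(visit: Visit,
                    threshold: float = INTENT_THRESHOLD_SECONDS) -> bool:
    """True if a document was opened within `threshold` seconds of the visit."""
    if visit.first_open_time is None:
        return False
    delay = visit.first_open_time - visit.visit_time
    return 0 <= delay <= threshold


# Made-up visits for illustration.
visits = [
    Visit(0.0, 45.0),     # opened after 45 s: counted as intent
    Visit(100.0, 400.0),  # opened after 300 s: outside the window
    Visit(200.0, None),   # no document opened
]
intent_visits = [v for v in visits if has_open_intent(v)]
```

Whether the boundary case of exactly 3 minutes is included is an assumption of this sketch; the text only states "within 3 minutes".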
According to the log data, the most common area that users resort to when they intend to open a file is the RDP (65.5%). The second most common area, the recent document list (20.4%), is more transparent because its order is based only on recency, and is thus less interesting to study. As such, we focus on understanding user behavior on the RDP in this paper.

USERS' INTERACTIONS WITH RDP
In this section, we provide a comprehensive analysis of log data to answer RQ1 by investigating various aspects of users' behavior before and after opening the document. We study one of the most fundamental yet important factors: display position.

What Is Users' Click Behavior on The RDP?
To characterize users' click behavior on the RDP, we use a common metric click through rate (CTR) defined as follows.

$$\mathrm{CTR} = \frac{\text{Number of Clicks}}{\text{Number of Visits}} \tag{1}$$

There are up to 16 candidates in total in the RDP, ranked by the recommendation model. When users visit office.com, the first four are shown and users can navigate to the other three pages (see Figure 1). We identify a few interesting findings from the figure.
• Documents on the left side have higher CTR than those of the right side on each page. The four pages share a similar pattern: the CTR decreases from the left to the right in one page. This can be caused by two factors: 1) ranking bias, the ranking order by the recommendation model, 2) interface bias, that users usually scan the RDP starting from the left to the right and may pay more attention to the documents at the beginning.
• The CTR jumps up between two pages, especially from the first to the second page (position 4 to 5). This reveals an interesting interface effect. If users navigate to the next page, especially at the first navigation, it indicates that they notice the RDP and are leveraging it to find the document, thus leading to a higher CTR. Similar behavior is also observed in web search [6].
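As a rough illustration of Equation (1) broken down by display position, the sketch below counts clicks per position and divides by the number of visits. Using the total number of visits as the denominator for every position is an assumption of this sketch, and the input data is made up.

```python
from collections import Counter
from typing import Dict, List, Optional


def ctr_by_position(clicked_positions: List[Optional[int]]) -> Dict[int, float]:
    """Per-position CTR: clicks at a position divided by the number of visits.

    Each list entry is one visit: the 1-based RDP position that was clicked,
    or None if nothing in the RDP was clicked during that visit.
    """
    n_visits = len(clicked_positions)
    if n_visits == 0:
        return {}
    clicks = Counter(p for p in clicked_positions if p is not None)
    return {pos: count / n_visits for pos, count in sorted(clicks.items())}


# Toy data: 8 visits, made up for illustration.
sample = [1, 1, 2, None, 1, 5, None, 3]
ctr = ctr_by_position(sample)
overall_ctr = sum(ctr.values())  # fraction of visits with any RDP click
```

With at most one click recorded per visit, summing the per-position values gives the overall CTR of the pane.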

Is the RDP Really Helpful?
The CTR only reflects the ratio of whether users click on the documents in the RDP. It does not indicate whether the RDP benefits users when they want to find a document. To capture this, we further define two metrics: the recognize rate and the time to open.

Recognize Rate.
Given that the recommendation algorithm successfully predicts the document that is eventually opened by the user and displays it in the RDP, will the user recognize it and open it from the RDP? Our analysis indicates that the algorithm often makes a good recommendation, i.e., the documents that are eventually opened somewhere are recommended by the model and shown to users in the RDP. However, users recognize the document and open it from the RDP in only 73.4% of these cases. In the remaining 26.6% of cases, although the documents are shown in the RDP where users can see them, users miss the documents or rely on their habitual practice and open the documents elsewhere. Among these cases, 38.9% are opened from the recent document list and the remaining 61.1% are opened through means other than direct access (i.e., one click) on office.com, such as through email, browsing, etc.
We define the recognize rate (RR) as follows:

$$\mathrm{RR} = \frac{\text{Docs Opened from the RDP}}{\text{Eventually-opened Docs Shown in the RDP}}$$

The RR is interestingly different from the CTR: the RR is based on an accurate recommendation and measures an interesting aspect of "success rate" for users to recognize the right recommendation, while the CTR depicts the interaction frequency. Figure 3b shows the interesting effect of position.
• Documents on the left side have higher RR than those on the right side on the first page. On the first page, the RR is similar to the CTR in that it decreases from position 1 to 4.
• Once users navigate to other pages, the RR remains very high. After a big jump of the RR when users navigate to the second page, the RR remains at a high level, which is different from the CTR. This indicates that although not often (as indicated by the green dashed line), when users are actively looking for specific documents, they will maintain an active recognition behavior after navigating to later pages of the RDP, leading to the high RR.
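The RR definition above can be sketched in a few lines. The `OpenEvent` record and its fields are hypothetical names introduced for illustration, and the events are made up.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class OpenEvent:
    """Hypothetical record of one eventually-opened document."""
    shown_in_rdp: bool     # was the document displayed in the RDP?
    opened_from_rdp: bool  # was it opened by clicking it in the RDP?


def recognize_rate(events: Iterable[OpenEvent]) -> float:
    """RR = docs opened from the RDP / eventually-opened docs shown in the RDP."""
    shown = [e for e in events if e.shown_in_rdp]
    if not shown:
        raise ValueError("no eventually-opened document was shown in the RDP")
    return sum(e.opened_from_rdp for e in shown) / len(shown)


# Made-up events for illustration.
events = [
    OpenEvent(True, True),
    OpenEvent(True, True),
    OpenEvent(True, True),
    OpenEvent(True, False),   # shown in the RDP but opened elsewhere (missed)
    OpenEvent(False, False),  # not shown in the RDP: excluded from the RR
]
rr = recognize_rate(events)
```

Unlike the CTR, the denominator here only contains cases where the recommendation was correct, which is why the two metrics can move in opposite directions.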

Time to Open (TTO).
Our results show that the RDP significantly shortens the time to open the document. Opening a document from the RDP takes only 52.6% of the time needed when the document is in the RDP but opened elsewhere, and 38.3% of the time needed when the document is not shown in the RDP. Figure 3c shows the TTO at different positions for the documents opened from the RDP. The larger the position number, the longer it takes. Moreover, the increase in time between pages is larger than the increase within a page. This reflects the time needed for users to scan from left to right and to navigate to the next page.
In the rest of the analysis, we normalize out the effect of display position (and file type) by dividing by the corresponding marginal value.
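The text does not spell out the normalization procedure, so the following is only one plausible reading, sketched for CTR and display position: each impression's click outcome is divided by the marginal CTR of its position (pooled over all explanation groups) before averaging per group. The function names and toy impressions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (display position, explanation group, clicked?): toy, made-up impressions
Impression = Tuple[int, str, bool]


def marginal_ctr(impressions: List[Impression]) -> Dict[int, float]:
    """CTR of each position pooled over all explanation groups."""
    shown: Dict[int, int] = defaultdict(int)
    clicked: Dict[int, int] = defaultdict(int)
    for pos, _, was_clicked in impressions:
        shown[pos] += 1
        clicked[pos] += int(was_clicked)
    return {pos: clicked[pos] / shown[pos] for pos in shown}


def normalized_ctr(impressions: List[Impression]) -> Dict[str, float]:
    """Per-group average of click / marginal CTR of the impression's position."""
    marginal = marginal_ctr(impressions)
    total: Dict[str, float] = defaultdict(float)
    count: Dict[str, int] = defaultdict(int)
    for pos, group, was_clicked in impressions:
        if marginal[pos] == 0:
            continue  # position never clicked: ratio undefined, so skip it
        total[group] += int(was_clicked) / marginal[pos]
        count[group] += 1
    return {g: total[g] / count[g] for g in count}


impressions = [
    (1, "Edit by You", True), (1, "Edit by You", True),
    (1, "Open by You", True), (1, "Open by You", False),
    (2, "Edit by You", True), (2, "Open by You", False),
]
scores = normalized_ctr(impressions)
```

Under this reading, a score above 1 means the group is clicked more often than the average document at the same positions, and a score below 1 means less often.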

HOW DO EXPLANATIONS ASSOCIATE WITH USERS' INTERACTION?
Given the basic characterization of the user behavior with the RDP, we answer the RQ2 by investigating the association between the recommendation explanations and the three behavior metrics described in the preceding section. Moreover, users' perception of documents builds on their historical interactions with the documents, which may affect users' reaction to the explanations. Therefore, we further investigate the relationship between the explanations and two aspects of users' historical interactions: the authorship and the time since last-open.

Explanations and Randomization Study
There are 14 predefined explanation types in the generator (see Table 1). An explanation generation model (independent of the recommendation model) ranks them and the corresponding language is generated from a pre-defined template. Note that since a document can have different activities during its lifecycle, the same document can show up with different explanations at different times. For simplicity, we group the 14 explanation types into four action groups (edit, comment, open, and share). As editing is one of the most common actions, we further divide the edit action by the subject (me versus others), giving five explanation groups in total, as summarized in Table 1.
To remove the bias of the explanation generator while maintaining the validity of the explanation, we randomly sampled a subset of users and conducted an explanation-randomization study for two weeks (from August 19 to September 1, 2019). When a user visits office.com, for each document recommended by the model, we select the top four explanations and randomly pick one as the explanation displayed to the user. To reduce the bias of the documents with fewer explanations, we excluded the documents that have fewer than four explanation candidates.
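The randomization step can be sketched as below. This is a simplified illustration, not the production logic: `pick_explanation` is a hypothetical helper, uniform sampling is an assumption (the text only says "randomly pick one"), and the candidate list uses explanation group names as stand-ins for the 14 explanation types.

```python
import random
from typing import List, Optional

TOP_K = 4  # number of top-ranked candidates considered, as stated in the text


def pick_explanation(ranked: List[str], rng: random.Random) -> Optional[str]:
    """Return one explanation drawn uniformly from the top-four candidates.

    Documents with fewer than four candidates are excluded (None).
    """
    if len(ranked) < TOP_K:
        return None
    return rng.choice(ranked[:TOP_K])


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
candidates = ["Comment by Others", "Edit by You", "Open by You",
              "Share by Others", "Edit by Others"]
shown = pick_explanation(candidates, rng)         # one of the first four
excluded = pick_explanation(candidates[:3], rng)  # too few candidates: None
```

Because every displayed explanation comes from the ranked candidate list, it remains valid for the document, while the random draw decouples what is shown from the generator's own ranking.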

Behavior Metrics among Explanations
We investigate the associations between the five explanation groups and the three behavior metrics as defined in Section 4.
All ANOVAs (with Greenhouse-Geisser correction if there is a sphericity violation) and pairwise post hoc t-tests (with Holm's sequential Bonferroni procedure correction) show significance (p < 0.05), thus we omit these statistics in the rest of this section.
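For readers unfamiliar with Holm's sequential Bonferroni procedure, a minimal sketch follows. The p-values are made up for illustration and are not results from this study.

```python
from typing import List


def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm's sequential Bonferroni procedure.

    Returns, for each input p-value, whether its null hypothesis is rejected
    while controlling the family-wise error rate at `alpha`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # against alpha / (m - k + 1), i.e. alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Made-up p-values for illustration.
decisions = holm_reject([0.01, 0.04, 0.03, 0.005])
```

The procedure is less conservative than a plain Bonferroni correction because only the smallest p-value faces the full `alpha / m` threshold.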

CTR.
We highlight a few findings from Figure 4a.
• Among the collaborative explanation groups, Comment by Others has the highest CTR. This reflects that compared to co-workers' editing and sharing action, commenting usually indicates feedback from collaborators, which requires more involvement and thus triggers more attention that leads to higher CTR.
• For the individual explanation groups, Edit by You has higher CTR. Although opening a document is the most frequent explanation, our results reveal that users may be more familiar with and react more actively to the documents they opened and edited than the documents they just opened and read.
• Share by Others has the lowest CTR. This indicates that users less frequently use the RDP to access shared documents compared to documents with other reasons.

RR.
We notice two interesting explanation groups that have reversed results in the RR, as shown in Figure 4b.
• Comment by Others has the highest CTR but the lowest RR. Users are very likely to click on documents with Comment by Others explanation (high CTR). However, if a document with this explanation is shown in the RDP, users are also likely to miss it (low RR) and open this document elsewhere. This reflects that users not only frequently use the RDP for these documents but also resort to other methods such as email to open them.
• Share by Others has the lowest CTR but the highest RR. Users are less likely to click on the shared documents in the RDP (low CTR). However, if they eventually open a shared file after visiting office.com, in most cases they access it through the RDP (high RR). This shows that the RDP works effectively for users to open shared documents once they pay attention to them.

TTO.
Figure 4c indicates that Comment by Others requires significantly less time than documents with other explanations. This is in line with the findings that documents with Comment by Others usually require more engagement, thus faster reactions.

How's Authorship × Explanations?
Whether the user is the author of the document (i.e., creator) will affect the user's reaction to the document. Understanding this is important to customize the explanations for documents with different authorship conditions. Figure 5 reveals several significant interactions, as highlighted below.
• Comment by Others and Edit by Others have higher CTR, RR, and lower TTO when the user is the author. This shows that users are more likely to react actively to others' actions on the documents if they created these documents. They may be more interested in checking these activities since these documents are "theirs".
• Share by Others has lower CTR, lower RR, and higher TTO when the user is the author of the file. Shared documents show a reversed trend: users react less actively when others share documents that the users originally created. Users initiated the documents, and when the documents are shared back to them by others, they may feel they are already aware of the documents' content, leading to less reaction.

How's Last-Open Interval × Explanations?
Another interesting user-behavior factor is the interval between the last and the current open time, i.e., time since last-open. We select four different intervals in Figure 6. The findings are summarized as follows:
• Generally, the longer the time since last-open, the lower the CTR and the higher the TTO. The older the documents are, the less likely users will interact with them. Our finding suggests that similar to emails, the lifecycle of documents is also quite short [4].
• The RR is low if documents were opened earlier today. It becomes high once documents were opened earlier than yesterday. We observe a reverse trend between the CTR and the RR. As the CTR decreases, the RR increases. This indicates that the RDP can "remind" users about the old documents and becomes the major channel to access them. However, when documents are recent (i.e., opened earlier today), although users open them frequently through the RDP, users also use other methods to open the file.

HOW IS USER BEHAVIOR INFLUENCED BY DIFFERENT EXPLANATIONS?
We further answer RQ3 by conducting a pairwise comparison between different explanation groups. Our results indicate the differences between explanation pairs: when two explanations are valid for a document, showing one explanation will trigger more active reactions than the other. This reveals that there are opportunities to improve explanations under different contexts. For each pair among the five explanation groups (10 pairs in total), e.g., Comment by Others and Edit by Others, we first narrow down the cases where both explanations are in the top four candidate explanation list and either of them is displayed. Then, we compare user behavior between two cases, one with Comment by Others shown in the RDP, the other with Edit by Others shown in the RDP. Note that there is bias introduced by the candidate explanation list, i.e., the candidate list may reflect certain properties of the document. To remove the bias, we further normalize by dividing by the marginal value of each candidate explanation list.
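The case selection for one pair can be sketched as follows. The `Case` tuple layout and the toy cases are hypothetical, and for simplicity the candidate list is expressed in explanation groups rather than the 14 explanation types; the normalization by candidate list is omitted.

```python
from itertools import combinations
from typing import List, Tuple

GROUPS = ["Edit by You", "Edit by Others", "Comment by Others",
          "Open by You", "Share by Others"]

# All unordered pairs of the five explanation groups (10 in total).
PAIRS = list(combinations(GROUPS, 2))

# (groups in the top-four candidate list, group displayed, clicked?)
Case = Tuple[List[str], str, bool]


def cases_for_pair(cases: List[Case], a: str, b: str
                   ) -> Tuple[List[Case], List[Case]]:
    """Keep cases where both groups are candidates and one of them is shown,
    then split them by which of the two was displayed."""
    eligible = [c for c in cases
                if a in c[0] and b in c[0] and c[1] in (a, b)]
    shown_a = [c for c in eligible if c[1] == a]
    shown_b = [c for c in eligible if c[1] == b]
    return shown_a, shown_b


full = ["Comment by Others", "Edit by Others", "Open by You", "Edit by You"]
other = ["Comment by Others", "Share by Others", "Open by You", "Edit by You"]
cases = [
    (full, "Comment by Others", True),
    (full, "Edit by Others", False),
    (full, "Open by You", True),         # neither of the pair is displayed
    (other, "Comment by Others", True),  # Edit by Others is not a candidate
]
shown_a, shown_b = cases_for_pair(cases, "Comment by Others", "Edit by Others")
```

Behavior metrics such as the CTR would then be computed separately on `shown_a` and `shown_b` and compared with a t-test.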
We summarize the comparisons that lead to significantly different user behavior (based on t-tests, with the significance level at p = 0.05 and the marginal level at p = 0.1). We particularly focus on the results that are not in line with the results in Figure 4.
• Although Comment by Others has the highest CTR in Figure 4a, the pairwise comparison indicates that its CTR is only marginally significantly higher than that of Open by You and not higher than others. When Comment by Others and other explanations are both in the candidate list, which explanation is displayed does not significantly affect users' behavior. The CTR stays high.
• The RRs of Comment by Others and Share by Others have a reversed order. In Figure 4b, Comment by Others (the lowest RR) and Share by Others (the highest RR) are at opposite positions. However, in Figure 7b they are reversed. This reveals that when documents are shared and have comments by others, users are more likely to recognize them from the RDP when they are displayed with the Comment by Others explanation.
• Open by You has significantly lower CTR and RR than all other explanations. Although this explanation is the most frequent one in the candidate list (see Table 1), it contains the least information, leading to inactive reactions.

DISCUSSION
Our results not only reveal the characteristics of user behavior towards explainable recommendations, but also suggest better designs of the explanations for document recommendation systems. These findings can potentially be generalized to other known-item and navigational recommendation systems, e.g., [5,18]. We summarize a few potential suggestions driven by our findings.
• Section 5.3 reveals that when the user is the author of the file, showing Comment by Others or Edit by Others explanations can trigger more active reactions than others (see Figure 5).
• Section 5.4 suggests that if the RDP is showing an old document that has not been opened for a long time, the explanation Share by Others can help users better recognize the file and access it faster (see Figure 6).
• As shown in the pairwise comparison, Open by You contains the least information and does not trigger many reactions. Whenever other explanations are available, documents should be shown with those explanations instead.
• Comparison matrices in Figure 7 can serve as a good reference when deciding between two explanations, depending on designers' goal. For instance, if recognition is the major concern, Comment by Others is preferred over Share by Others. If time is the concern, then the preference order is reversed.
There are some important limitations of this work. First, the three behavior metrics only depict certain aspects of user behavior. Other behaviors such as collaborative actions and detailed editing actions will be included in future work. Second, we did not analyze user behavior in interacting with other aspects of the website. The recent document list is of special interest because it represents a sizeable proportion of how users access files. We will compare the RDP and the recent document list in future work. Third, in the explanation-randomization study, we did not experiment with a "no explanation" option since this could adversely affect users' experience. We hope to try a limited study of this baseline to understand the effectiveness of the explanations in future work.

CONCLUSION
In this paper, we conduct large-scale log studies to characterize user behavior towards explainable recommendations. Our analysis leverages the data from a major cloud document platform office.com.
We define three metrics to depict user behavior before opening the documents through the Recommended Document Pane (RDP). We first study one month of data involving millions of users to understand behavior characteristics in light of these metrics. Then, through an explanation-randomization study, we analyze two weeks' worth of data involving hundreds of thousands of users to understand the association between recommendation explanations and user behavior, as well as the influence of explanations on user behavior. Our results reveal a number of interesting findings that shed light on better explanation design in the future.