
Julian Runge, Mark Williams, James Marr, Hernan Silberman, Yavuz Acikalin, Eric Seufert, Steve Detwiler, Dan Barnes

Reinforcement learning is the least widely understood and adopted of the three pillars of machine learning. While the term may sound complicated at first, the essence of reinforcement learning is something we all do every day: do something, observe an outcome, change the course of action, over and over again. Simple forms of reinforcement learning can be a powerful tool for personalized and dynamic content delivery in digital contexts. This post tries to give a concise, accessible example of an application, focusing on content personalization in mobile apps.

Reinforcement learning in a nutshell: The agent (1) takes an action (2) that impacts its environment (3) and observes the impact on the environment, in the form of a pre-specified reward (4). It does this over and over again, increasingly favoring actions that lead to high rewards.

1. A primer on reinforcement learning and why we use it

Reinforcement learning is often regarded as the third main pillar of machine learning in addition to supervised and unsupervised learning. In a nutshell:

  • Supervised learning infers associations of certain inputs to certain outputs by observing a ground truth mostly given in the form of historic observations of said inputs and outputs.
  • Unsupervised learning learns a “new”, previously unspecified output from a set of given inputs following a fixed set of rules, e.g. inferring relevant clusters of users from log data of user behavior.
  • Reinforcement learning (RL), finally, introduces agency on the part of the machine: It can take certain actions on instances, then observe a specific reward (that measures desirable impact on the mentioned instances) and adjust its decision policy to maximize said reward. The machine hence reinforces its decision policy by being rewarded (or punished) by its environment.

To date, examples of successful applications of RL are substantially harder to come by than examples of (un)supervised learning and they often revolve around robotics. In a directly consumer-facing context, digitization makes its application more attractive for three reasons:

  1. Digital products such as mobile apps have close to zero marginal cost of production and distribution, making it easy to offer services and products for free or ‘freemium’. This, in turn, leads many companies to interface with many more users (customers) than in the past in non-digital settings, providing a larger environment to learn from (per the title figure at the top of this post).
  2. Companies can observe and log plentiful data on their users’ interaction with the product.
  3. Digital products can be continuously adapted and flexibly personalized (see Figure 1 for examples from one of our apps), even well after users’ first adoption, leading most companies to regularly run experiments (A/B tests) on their websites or apps.

Even small companies hence easily have thousands of users with non-trivial amounts of data per user and use A/B tests to improve their offering. Simple forms of RL such as bandits and contextual bandits (more in the next section) can help make experimentation much more cost effective, especially when combined with existing institutional knowledge and collected user data. This is the methodological approach we chose.

Figure 1: Content tiles in THE N3TWORK app — digital surfaces can be flexibly personalized

2. How reinforcement learning is more effective than A/B tests

Multi-armed bandits are arguably the simplest form of RL. You can think of them as A/B tests that automatically roll out the best performing variant being tested. This also exemplifies how they can lower the cost of experimentation: While a two-armed revenue-facing A/B test would have each treatment live for 50% of users for the whole duration of the test, a bandit would increasingly send users to the treatment that generates more revenue. In the end, the bandit will expose fewer users to the treatment yielding lower revenue, increasing overall revenue from the same number of users.
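The difference in traffic allocation can be illustrated with a minimal simulation. The sketch below is an illustration only, not the system described later in this post: it assumes two hypothetical treatments with made-up conversion rates of 5% and 15%, and it uses epsilon-greedy, which is just one of several possible bandit strategies. Setting epsilon to 1.0 reproduces the random 50/50 split of a classic A/B test.

```python
import random

def simulate(n_users, conversion_rates, epsilon, seed=0):
    """Assign n_users to arms with epsilon-greedy and return (total reward, pulls per arm).

    epsilon=1.0 means every user is assigned at random, i.e. a plain A/B test.
    The conversion rates are hypothetical and only drive the simulation.
    """
    rng = random.Random(seed)
    n_arms = len(conversion_rates)
    pulls = [0] * n_arms       # users sent to each arm so far
    rewards = [0.0] * n_arms   # total reward observed per arm
    total = 0.0
    for _ in range(n_users):
        if rng.random() < epsilon or 0 in pulls:
            arm = rng.randrange(n_arms)            # explore (or arm not tried yet)
        else:
            means = [rewards[a] / pulls[a] for a in range(n_arms)]
            arm = means.index(max(means))          # exploit the best arm so far
        reward = 1.0 if rng.random() < conversion_rates[arm] else 0.0
        pulls[arm] += 1
        rewards[arm] += reward
        total += reward
    return total, pulls

ab_total, ab_pulls = simulate(10_000, [0.05, 0.15], epsilon=1.0)
bandit_total, bandit_pulls = simulate(10_000, [0.05, 0.15], epsilon=0.1)
print("A/B test users per arm:", ab_pulls)
print("Bandit users per arm:  ", bandit_pulls)
```

In this toy setup the bandit ends up sending most users to the better arm, while the A/B test keeps splitting traffic evenly until someone stops it.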

Contextual bandits take things one step further: They not only optimize for the variant with the overall highest reward, but can also take contextual data into account by optimizing for the variant with the highest reward in each context. In this way, they can automatically learn a personalization policy. In a simple example: Let’s say we A/B test two designs A and B, and we have two user segments defined by observed contexts C and D. 30% of our users are C, 70% are D. Design A leads to average user engagement X in context C and 2X in context D. Design B leads to average user engagement X in context D and 2X in context C. If we run an A/B test sending 50% of users to each design, we get user engagement:

Engagement(A/B test)
= 0.5 * (0.3 * X + 0.7 * 2X) + 0.5 * (0.3 * 2X + 0.7 * X)
= (0.15 + 0.7 + 0.3 + 0.35) * X
= 1.5X

If we instead have a contextual bandit learn an optimal policy (and assume that costs to learn the policy are negligible), we get user engagement:

Engagement(Contextual bandit)
= 0.3 * 2X + 0.7 * 2X
= 2X

Clearly, 2X > 1.5X.

We hence have ~33% higher user engagement using the contextual bandit, as it automatically learns from observed engagement which treatment (here design A or B) to give to which user segment (here C or D).
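The arithmetic of this example can be checked with a few lines of Python. The sketch below encodes the segment shares and engagement levels from the example (in units of X); the dictionary layout and function name are our own choices for illustration. It also evaluates the two unpersonalized policies "always A" and "always B", which the example above does not spell out.

```python
# Segment shares and engagement (in units of X) from the example above.
segment_share = {"C": 0.3, "D": 0.7}
engagement = {("A", "C"): 1.0, ("A", "D"): 2.0,
              ("B", "C"): 2.0, ("B", "D"): 1.0}

def expected_engagement(policy):
    """policy maps each segment to a dict {design: assignment probability}."""
    return sum(share * prob * engagement[(design, segment)]
               for segment, share in segment_share.items()
               for design, prob in policy[segment].items())

ab_test = {segment: {"A": 0.5, "B": 0.5} for segment in segment_share}
always_a = {segment: {"A": 1.0} for segment in segment_share}
always_b = {segment: {"B": 1.0} for segment in segment_share}
contextual = {"C": {"B": 1.0}, "D": {"A": 1.0}}  # best design per segment

for name, policy in [("A/B test", ab_test), ("always A", always_a),
                     ("always B", always_b), ("contextual", contextual)]:
    print(f"{name:>10}: {expected_engagement(policy):.2f} X")

lift = expected_engagement(contextual) / expected_engagement(ab_test) - 1
print(f"lift of contextual policy over A/B test: {lift:.0%}")  # 33%
```

Note that the best unpersonalized design (always A, at 1.7X) already beats the 50/50 split, so the lift of the contextual policy over a converged non-contextual bandit would be smaller than the 33% over the running A/B test.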

3. Getting real: The tech stack

Over the years, N3TWORK has assembled an impressive set of shared technology facilitating the production, operation, and marketing of mobile apps. We like to call the tools and systems that form this shared technology ‘butlers’. Each of them helps achieve goals derived from a concrete business need. The complete set of butlers establishes the backbone of our app publishing business.

Examples of butlers are:

  • Dobby, which assigns users to certain content configurations,
  • Meeseeks, which allows data scientists to build tailored algorithms for content delivery,
  • Ferguson, which provides data-driven personalization decisions in real time, and
  • Poole, which ensures all collected data are safe and sound.

These butlers collaborate with each other and the app servers and clients to create delightful user experiences. They further allow the data science team to personalize content delivery leveraging data, methods of choice and continuous experimentation. We will provide concrete examples a bit further down the text (Section 6).

4. Offline evaluation

Prior to using models in the field, data scientists tend to evaluate their methods and resulting policies offline, as Airbnb’s data science team does, for example. If you work on personalization, existing A/B tests with interesting treatments are a great resource. Say game designers ran an A/B test trying different difficulty configurations in a game. Per the example in section 2, data scientists can then use this A/B test’s data to assess whether different user segments (as defined by logged data) react differently to the treatments: Users with previous high engagement and skill may find a low difficulty boring and react poorly, while users who are new to the game and have not yet built their skills may prefer a lower difficulty.

Historic A/B tests allow data scientists to look for the presence of such heterogeneous treatment effects (i.e. whether different users react differently to the given treatment) and to evaluate different policies for personalization of the treatment. (The paper by Athey and Imbens (2016) is a good starting point for further reading on treatment effect heterogeneity. Hitsch and Misra (2018) provide a comprehensive exposition of how the effects of different personalization policies can be assessed from historic random trials.)
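A first look for heterogeneous treatment effects can be as simple as comparing mean outcomes per segment and treatment. The sketch below does this on a tiny, entirely made-up log of the difficulty example; the segment names, treatment names and outcome values are hypothetical. Comparing cell means is a simplification of the regression-based and policy-evaluation approaches in the papers cited above.

```python
from collections import defaultdict

# Hypothetical A/B test log rows: (segment, treatment, outcome).
rows = [
    ("veteran", "easy", 2.0), ("veteran", "easy", 1.0),
    ("veteran", "hard", 5.0), ("veteran", "hard", 4.0),
    ("newcomer", "easy", 4.0), ("newcomer", "easy", 3.0),
    ("newcomer", "hard", 1.0), ("newcomer", "hard", 2.0),
]

def mean_outcome_by_cell(rows):
    """Average outcome for every (segment, treatment) combination."""
    sums, counts = defaultdict(float), defaultdict(int)
    for segment, treatment, outcome in rows:
        sums[(segment, treatment)] += outcome
        counts[(segment, treatment)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

def segment_effects(rows, treatment, control):
    """Treatment effect (treatment minus control) within each segment."""
    means = mean_outcome_by_cell(rows)
    segments = {segment for segment, _, _ in rows}
    return {s: means[(s, treatment)] - means[(s, control)] for s in segments}

effects = segment_effects(rows, treatment="hard", control="easy")
print(effects)
```

Effects with opposite signs across segments, as in this toy data, are the pattern that makes a treatment a candidate for personalization. With real data, sample sizes and statistical uncertainty per segment need to be taken into account before drawing that conclusion.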

In this post, we consider an A/B test of different starter packs in a mobile game. The baseline results of the A/B test were promising in that the different packs led to pretty different outcomes in terms of revenue and conversion (=share of app users buying something in the app). Initial assessments using linear models and treatment dummies indicated that treatment effects on revenue and conversion were heterogeneous across users. Our user data hence seemed to capture some relevant heterogeneity among our users in regards to this starter pack, overall indicating an interesting candidate for personalization.

We then took to offline evaluation of Vowpal Wabbit’s contextual bandit (CB) implementation on data from the starter pack A/B test. Vowpal Wabbit is a fast out-of-core learning system originally developed at Yahoo Research and now with Microsoft Research. Its CB module has been successfully applied in the field, e.g. Li et al. (2012) or Bietti, Agarwal & Langford (2018). We shared a small piece of sample code that presents the core approach we used for offline evaluation in Python. Figure 2 shows treatment assignment results for two policies learned by CBs optimizing for different rewards. The policies learned by maximizing the two different rewards assign users quite differently to the three different starter packs. For example, CB 1 assigns many more users to pack 3 than CB 2, while the inverse is true for pack 2. This offline analysis confirms that the CB is able to learn different treatment assignment policies based on different rewards.
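To feed A/B test data to Vowpal Wabbit's contextual bandit mode, each logged interaction is written as one text line of the form `action:cost:probability | features`, with actions numbered from 1 and costs to be minimized (so a reward becomes a negative cost). The helper below is a minimal sketch of that conversion, not the sample code referred to above; the function name, the feature names and the way categorical values are encoded as `name=value` tokens are assumptions made for illustration.

```python
def to_vw_cb_line(action, reward, probability, features):
    """Format one logged interaction in Vowpal Wabbit's contextual bandit text format.

    action:      1-based index of the treatment the user received
    reward:      observed reward; Vowpal Wabbit minimizes cost, so we negate it
    probability: probability with which the logging policy chose this action
                 (e.g. 1/3 in a uniform three-arm A/B test)
    features:    dict of context features (names here are hypothetical)
    """
    cost = 0.0 - float(reward)
    tokens = []
    for name, value in features.items():
        if isinstance(value, (int, float)):
            tokens.append(f"{name}:{value}")   # numeric feature with a value
        else:
            tokens.append(f"{name}={value}")   # categorical feature as one token
    return f"{action}:{cost}:{probability} | {' '.join(tokens)}"

line = to_vw_cb_line(action=2, reward=1.0, probability=0.25,
                     features={"sessions": 3, "country": "US"})
print(line)  # 2:-1.0:0.25 | sessions:3 country=US
```

Lines like this can be written to a file and passed to the `vw` command-line tool with the `--cb` option and the number of actions, e.g. `--cb 3` for three starter packs.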

Figure 2: Number of users assigned to different starter packs by two contextual bandits optimizing different rewards

Results were obtained using the first 100,000 arriving users of the A/B test for policy learning and the remaining A/B test users for evaluation, based on 30 bootstrap samples of 50% each. This approach is akin to an epsilon-first learning strategy. While simple, it is intuitive and effective for the purposes of this offline evaluation. The last columns of the panels in Figure 2 show the number of overlapping assignments in each bootstrap. These are the users for whom random assignment and policy-based assignment “agree”. We can use this sample to estimate the counterfactual lift of the policy learned by the CB (see Hitsch and Misra 2018 for a detailed exposition). Figure 3 shows the distribution of reward lift under the respective policies over the best unpersonalized treatment for the 30 bootstraps. The CB is hence able to learn different policies based on different rewards, creating a significant lift in the respective reward over the best unpersonalized pack. This is essentially what we needed to know before taking this algorithm to the field for a production run with actual users.
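The idea of evaluating a policy on the users where random and policy-based assignment agree can be sketched in a few lines. The code below is a simplified illustration on synthetic data, not the analysis behind Figures 2 and 3: the logs, rewards and the "learned" policy are made up (reusing the designs and segments from Section 2), the policy is taken as given rather than learned on a separate first batch of users, and we assume that each bootstrap draws 50% of the evaluation users with replacement. Averaging over matched users is only valid because every treatment was assigned with equal probability; unequal assignment probabilities would require reweighting.

```python
import random

# Hypothetical rewards per (context, design), in the spirit of Section 2.
REWARD = {("C", "A"): 1.0, ("C", "B"): 2.0, ("D", "A"): 2.0, ("D", "B"): 1.0}

def make_random_logs(n, seed=1):
    """Synthetic A/B test log: uniform random assignment of designs A and B."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        context = "C" if rng.random() < 0.3 else "D"
        action = rng.choice(["A", "B"])
        logs.append((context, action, REWARD[(context, action)]))
    return logs

def matched_value(logs, policy):
    """Mean reward over users whose random assignment agrees with the policy."""
    matched = [reward for context, action, reward in logs
               if policy(context) == action]
    return sum(matched) / len(matched), len(matched)

def bootstrap_lift(logs, policy, baseline_action, n_boot=30, frac=0.5, seed=0):
    """Lift of the policy over always giving baseline_action, per bootstrap."""
    rng = random.Random(seed)
    size = int(len(logs) * frac)
    lifts = []
    for _ in range(n_boot):
        sample = [rng.choice(logs) for _ in range(size)]  # with replacement
        policy_value, _ = matched_value(sample, policy)
        baseline_value, _ = matched_value(sample, lambda context: baseline_action)
        lifts.append(policy_value / baseline_value - 1.0)
    return lifts

def learned_policy(context):
    """Stand-in for a policy learned by a contextual bandit."""
    return "B" if context == "C" else "A"

logs = make_random_logs(2000)
lifts = bootstrap_lift(logs, learned_policy, baseline_action="A")
print(f"mean lift over {len(lifts)} bootstraps: {sum(lifts) / len(lifts):.1%}")
```

The spread of the values in `lifts` plays the role of the distributions shown in Figure 3.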

Figure 3: Reward lift achieved by contextual bandit policies learned from A/B test data

5. Institutional knowledge and managerial control

Before we get to the field runs, a note on institutional knowledge. In addition to A/B tests, companies tend to have a pool of skilled people who often have incredible domain expertise. For example, an experienced product manager is likely to understand users better than someone who has read decades of user research in scholarly journals. Such institutional knowledge, ingrained in the minds of managers and in informal conversations, can be crucial to devising effective personalization policies, e.g. by providing a managerial prior to start from. The example described in the next section (6.1) does this. It can also help further reduce the cost of experimentation by ruling out poor treatments and policy candidates from the get-go.

Further, managers like to exert control over outcomes; doing so is ultimately the reason for their existence. While applications of artificial intelligence (in the sense of machines with actual agency over outcomes) to well-defined technical problems such as gameplay or image recognition have found wide acceptance and are barely perceived to encroach on managerial territory, product design and delivery of immediately consumer-facing content oftentimes still lie at the center of managerial focus. Product managers may be hesitant to give up control to a machine that they do not understand, particularly when that machine can make autonomous decisions. The use of managerial priors per the previous paragraph can help alleviate such concerns. In the case of RL, the choice of learning policy can also be important. Policies such as epsilon-greedy or epsilon-first are easy to explain to stakeholders who are familiar with A/B testing.

6. Online runs

6.1 Real-time personalization at app download

The first production example covers real-time personalization of the mentioned starter pack at app download. Figure 4 shows the decision architecture based on N3TWORK’s shared technology. A user device downloads the app from an app store. Upon the first launch of the app, the user device authenticates with the app server (step 1 in Figure 4), which in turn asks the real-time personalization service Ferguson for a personalization decision, also sending relevant contextual data with the request (step 2). Based on Ferguson’s decision, the app server then tells Dobby, the content configuration store, what configuration the user should receive (step 3). Dobby continuously keeps the user device (=app client) apprised of this configuration assignment. The app server also sends data describing the personalization decision to Poole (step 4), where the data are written to a well-behaved database. Both the app client and server continuously send data to Poole to provide analysts and data scientists with users’ behavioral traces. These form the base of decision support, and of functional and operational analytics more broadly. They are also the base for model building by data scientists and for automated reinforcement of the real-time decision model stored in Ferguson (step 5).

Figure 4: Schematic depiction of the real-time personalization system

To avoid a cold start with costly full exploration (=A/B test), we seeded Vowpal Wabbit’s CB with data to execute a managerial prior for starter pack assignment to contexts and added 50% exploration on top using epsilon-greedy as a learning policy. After we had collected sufficient data, we lowered exploration and switched on reinforcement. The bandit made use of its agency: It substantially changed the mix of starter packs sold. Overall purchasing increased compared to both the managerial prior and a randomized hold-out group receiving the best unpersonalized starter pack.
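The logic of starting from a managerial prior and then letting observed rewards take over can be sketched with a small tabular bandit. This is a conceptual stand-in, not Vowpal Wabbit and not our production setup: the class, the context and pack names, the pseudo-observation weighting and the deterministic rewards are all hypothetical and chosen only to keep the example short.

```python
import random
from collections import defaultdict

class SeededEpsilonGreedy:
    """Tabular epsilon-greedy contextual bandit seeded with a managerial prior."""

    def __init__(self, actions, prior, prior_weight=20, seed=0):
        self.actions = list(actions)
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.sums = defaultdict(float)
        # Seed with pseudo-observations (reward 1.0 each, an arbitrary scale)
        # so the initial greedy choice reproduces the managers' assignment.
        for context, action in prior.items():
            self.counts[(context, action)] = prior_weight
            self.sums[(context, action)] = float(prior_weight)

    def _mean(self, context, action):
        n = self.counts[(context, action)]
        return self.sums[(context, action)] / n if n else 0.0

    def greedy(self, context):
        return max(self.actions, key=lambda a: self._mean(context, a))

    def choose(self, context, epsilon):
        if self.rng.random() < epsilon:
            return self.rng.choice(self.actions)   # explore
        return self.greedy(context)                # follow current policy

    def update(self, context, action, reward):
        self.counts[(context, action)] += 1
        self.sums[(context, action)] += reward

# Hypothetical managerial prior and (unknown to the managers) true rewards.
prior = {"new": "pack1", "returning": "pack2"}
true_reward = {("new", "pack1"): 1.0, ("new", "pack2"): 0.5, ("new", "pack3"): 3.0,
               ("returning", "pack1"): 1.0, ("returning", "pack2"): 2.0,
               ("returning", "pack3"): 0.5}

bandit = SeededEpsilonGreedy(["pack1", "pack2", "pack3"], prior)
initial_policy = {c: bandit.greedy(c) for c in prior}

# Phase 1: serve the prior with 50% exploration and only collect data.
traffic = random.Random(42)
logs = []
for _ in range(200):
    context = traffic.choice(["new", "returning"])
    action = bandit.choose(context, epsilon=0.5)
    logs.append((context, action, true_reward[(context, action)]))

# Phase 2: switch on reinforcement; afterwards exploration can be lowered.
for context, action, reward in logs:
    bandit.update(context, action, reward)
updated_policy = {c: bandit.greedy(c) for c in prior}
print(initial_policy, "->", updated_policy)
```

In this toy run the bandit keeps the managers' choice where it was already best and overrides it where the collected data point to a better pack.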

6.2 Personalizing the downstream user experience — dynamic content delivery

The previous example focused on app download as a decision point and provided on-demand personalization in real time. There are, however, many other cases where user experiences can be improved by proactively changing configurations over time. To accomplish this goal, we set up another butler: Meeseeks. It is named after Mr. Meeseeks from Rick and Morty, as it essentially completes whatever assignment(s) data scientists tell it to. Figure 5 visualizes how. App server and client keep sending behavioral traces to Poole (step 1), generating a large pool of data. Data scientists use this data to segment, cluster and predict user behavior. Meeseeks allows them to come up with whatever personalization / content delivery logic they deem fit by simply accepting configuration assignments from offline (or online) models and implementing them by calling Dobby (step 3), which remains in a constant loop with the client device to deliver the appropriate configuration.

Figure 5: Dynamic content delivery system to personalize the downstream user experience

Coming back to our starter pack example: We use this workflow to re-assign users to a different starter pack if it seems that they did not like the initial one. Currently, we employ a sequence of bandits at different points in the downstream user experience to perform these re-assignments. We keep tweaking the personalization logic and experimenting with different learning regimes. We constantly evaluate conversion, revenue, and engagement compared to a randomized hold-out group receiving the best non-personalized pack. Things are looking good.

7. The gist of it all

Reinforcement learning may be the least widely understood and adopted of the three pillars of machine learning. One reason, which this post tries to work against, is the paucity of accessible reports from the frontline of business. Much of the writing on applications remains academic to date, e.g. Li et al. (2012) on news article recommendation or Schwartz, Bradlow & Fader (2017) on ad delivery.

Reinforcement learning holds massive potential in an era of abundant data and widespread experimentation. This post shows how it can be used for personalization and dynamic content delivery in mobile apps, from both an engineering and a data science point of view. We give an example of dynamic personalization of a starter pack in a mobile game that increased the app’s profitability by about 10%.

If you think this post was useful, please give it a clap and share it.