User Experience Optimization: From Heuristic Intervention to Unified Value Modeling
In the evolution of Search, Ads, and Recommendation systems, User Experience (UX) is an unavoidable core challenge. In this article, we quantify UX as LT (Lifetime), specifically referring to user retention (e.g., the number of days a user opens the app within a 7-day window).
Unlike pure content recommendation, which seeks to maximize total watch time, optimization in commercial or marketing-oriented sectors (Ads, E-commerce, Live Streaming) is about maximizing business value (Cost, GMV, etc.) while staying above a "UX Redline." More accurately, it is about maximizing the efficiency of exchanging "Unit LT" for "Business Metrics."
We typically use Holdout Experiments (Reverse Experiments) to measure how a business strategy affects LT. The factors influencing these results are multifaceted:
- Explicit Factors: These are the most direct and well-known, involving the position and density of business items (e.g., start_pos, gap, load).
- Implicit Factors: Supply quality and ranking accuracy. The quality and diversity of ad creatives or live-streaming content determine the appeal of the distribution queue. If supply is insufficient or ranking is inaccurate, users are less likely to be attracted to the platform.
- Opportunity Cost (Backfill Logic): A frequently overlooked point. A holdout experiment compares the "Business Queue" with a "Backfill Queue" (usually the organic recommendation queue). The final impact on LT is essentially:
\[\Delta LT = LT_{Business} - LT_{Backfill}\]
The negative impact on LT is minimized only when the business content is as attractive as—or more attractive than—the organic content it replaces.
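To make the holdout comparison concrete, here is a minimal sketch of estimating \(\Delta LT\) from the two arms (assuming per-user 7-day open-days have already been aggregated; the function and array names are illustrative, not a specific production pipeline):

```python
import numpy as np

def estimate_delta_lt(lt_business: np.ndarray, lt_backfill: np.ndarray) -> dict:
    """Estimate the LT impact of a business strategy from a holdout experiment.

    lt_business: per-user open-days (7-day window) in the arm serving business content.
    lt_backfill: per-user open-days in the holdout arm, backfilled with organic content.
    """
    diff = lt_business.mean() - lt_backfill.mean()
    # Standard error of a two-sample mean difference, for a rough 95% interval.
    se = np.sqrt(lt_business.var(ddof=1) / len(lt_business)
                 + lt_backfill.var(ddof=1) / len(lt_backfill))
    return {"delta_lt": diff, "ci95": (diff - 1.96 * se, diff + 1.96 * se)}
```

In practice the same comparison is usually sliced by segment and observation window, since LT effects accumulate slowly.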
While better supply and more accurate ranking are the fundamental drivers of LT, they take time to yield results. In day-to-day engineering, adjusting load, start_pos, and gap remains the most immediate lever. Furthermore, we must establish strict defensive mechanisms to prevent short-term gains from masking long-term, cumulative UX damage.
This article explores the evolution of UX optimization through three stages: Short-term Defense (Heuristic Protection), Mid-term Tuning (Experience Modeling), and Long-term Alignment (Unified Value Modeling).
Short-term Strategy (Heuristic Defense & Protection)
In the early stages, the most effective tools are rule-based strategies keyed to traffic attributes or user segments. The core logic is "Rapid Loss Prevention" and "Layered Defense."
By manually or semi-manually "patching" the system (e.g., raising thresholds for specific segments), we protect high-risk or sensitive users. This includes protection for new users, returning users, or low-activity users. In search, it might involve setting different thresholds for "active intent" vs. "passive browsing" traffic.
Implementation Logic
We segment traffic by request features (e.g., channel entry) or user profiles (e.g., historical report/bounce rates). Mechanistically, we set independent, higher eCPM thresholds or apply larger start_pos/gap losses at different ranking stages, so that business content is shown less often, or not at all, to these segments.
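A minimal sketch of such a segment-level patch is below; the segment names, multipliers, and base threshold are illustrative assumptions rather than recommended settings:

```python
# Hypothetical protection multipliers: sensitive segments face a higher eCPM bar.
SEGMENT_THRESHOLD_MULTIPLIER = {
    "new_user": 1.5,
    "returning_user": 1.3,
    "low_activity": 1.2,
}

def effective_ecpm_threshold(segment: str, base_threshold: float) -> float:
    """Raise the eCPM threshold for protected segments; others keep the base value."""
    return base_threshold * SEGMENT_THRESHOLD_MULTIPLIER.get(segment, 1.0)

def passes_threshold(predicted_ecpm: float, segment: str, base_threshold: float = 0.05) -> bool:
    """Gate a business item: it must clear the (possibly raised) threshold to be shown."""
    return predicted_ecpm >= effective_ecpm_threshold(segment, base_threshold)
```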
Pros: Simple to implement, low experiment cost.
Cons: Purely reactive (post-hoc), lacks generalization, and can easily become obsolete.
The Paradox: Why do these "patches" work?
If a system has a Static Threshold, it has theoretically fixed the exchange efficiency between business goals and LT. For example, in ads, a fixed eCPM threshold dictates that a "show" must bring in at least \(X\) revenue to justify the LT loss.
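A purely hypothetical calculation makes this concrete. Suppose the static threshold requires \(X = 0.05\) units of eCPM per show, and one show is estimated to cost \(0.001\) open-days of LT; the threshold then implicitly fixes the exchange rate at
\[\frac{X}{\Delta LT_{\text{show}}} = \frac{0.05}{0.001} = 50 \ \text{eCPM units per open-day},\]
and every request trades LT for revenue at this single rate, regardless of how sensitive the user is.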
In an ideal world, a single static threshold should be globally optimal. However, these patches work because of three realities:
1. Value Estimation Bias: The system may overestimate the eCPM for certain groups (like new users). Raising the threshold acts as a manual calibration for this overestimation.
2. UX Loss Heterogeneity: Even if value estimation is accurate, the LT loss from the same "show" varies by user. A sensitive user might churn after one bad ad, while a resilient user remains unaffected.
3. Suboptimal Exchange Efficiency: Even with perfect predictions for value and experience, the "weight" given to each in the ranking formula might not be globally optimal.
However, while patches are necessary today, they should not exist permanently. The long-term goal is to make both the value and experience estimates accurate, optimize their exchange ratio, and then regulate the trade-off through a unified threshold.
Mid-term Strategy (Experience Signal Modeling)
Heuristic strategies rely on the assumption that certain segments are sensitive, and that assumption can break as distributions shift. Mid-term optimization therefore replaces manual interventions with dynamic, model-driven ones.
We explicitly model the user's experience loss. We can categorize these modeling approaches into two types:
1. Direct Experience Signal Modeling (Correlation)
- Offline: Predict the probability of negative behaviors (leave, report, dislike) after a given exposure: \(P(Negative | Context)\).
- Online: Based on the predicted probability, dynamically increase thresholds or gaps to reduce "Load" for that specific request.
2. Uplift Modeling (Causality)
- Offline: Use causal inference to model the change in LT (\(\Delta LT\)) and the change in business value (\(\Delta Cost\)) when a "Treatment" (showing a business item) is applied.
- Online: Calculate the Marginal Exchange Efficiency
\[Efficiency = \frac{\Delta LT}{\Delta Cost}\]
We select traffic and users with the highest efficiency to perform "LT recovery" (reducing load), maximizing the LT regained for every dollar of revenue sacrificed.
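A minimal sketch of how these uplift estimates might drive LT recovery under a fixed revenue budget (the dataclass fields, the greedy selection, and the budget parameter are illustrative assumptions; a production system would apply this per request with calibrated uplift predictions):

```python
from dataclasses import dataclass

@dataclass
class RecoveryCandidate:
    request_id: str
    delta_lt: float    # predicted LT regained if load is reduced on this traffic
    delta_cost: float  # predicted revenue sacrificed by the same load reduction

def select_for_lt_recovery(candidates: list[RecoveryCandidate],
                           revenue_budget: float) -> list[str]:
    """Greedily pick the traffic with the best LT-per-revenue exchange efficiency."""
    ranked = sorted(candidates,
                    key=lambda c: c.delta_lt / max(c.delta_cost, 1e-9),
                    reverse=True)
    chosen, spent = [], 0.0
    for c in ranked:
        if spent + c.delta_cost > revenue_budget:
            break
        chosen.append(c.request_id)
        spent += c.delta_cost
    return chosen
```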
However, uplift modeling faces the following challenges:
(1) Label Sparsity: LT is a long-term, sparse metric. We often rely on Proxy Metrics (e.g., short-term stay time, interaction), and the correlation between these proxies and true LT determines the model's success.
(2) Counterfactual Data: Uplift models require both "Treatment" and "Control" data. This calls for a small slice of "exploration traffic" where business items are withheld, which has a real cost online (though the loss is controllable by limiting that slice).
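The article does not commit to a specific uplift estimator, but as one simple illustration of how such exploration traffic could be used, here is a two-model (T-learner) sketch; the feature matrix, the treatment flag, and the proxy-LT label are assumptions, and the regressor choice is arbitrary:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_delta_lt_estimator(X: np.ndarray, treated: np.ndarray, proxy_lt: np.ndarray):
    """Two-model (T-learner) estimate of the treatment effect on a proxy-LT label.

    X: request/user features; treated: 1 if the business item was shown,
    0 on the randomized exploration traffic where it was withheld.
    """
    model_t = GradientBoostingRegressor().fit(X[treated == 1], proxy_lt[treated == 1])
    model_c = GradientBoostingRegressor().fit(X[treated == 0], proxy_lt[treated == 0])

    def predict_delta_lt(X_new: np.ndarray) -> np.ndarray:
        # Predicted change in the proxy LT caused by showing the business item.
        return model_t.predict(X_new) - model_c.predict(X_new)

    return predict_delta_lt
```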
Long-term Strategy (Unified Value Modeling)
The ultimate goal is to move away from "patching LT" and instead treat experience signals as a "Unified Currency" that is interchangeable with business metrics (Revenue, GMV).
We no longer treat LT as an external filter but as an internal cost/benefit within the Ranking Formula, enabling automated end-to-end optimization.
Deriving the Optimal Ranking Formula
Most ranking problems can be formalized as a constrained optimization problem. Take an ad queue as an example: suppose there are \(n\) requests, the ad on the \(i\)-th request has predicted value \(ecpm_i\) and predicted retention \(lt_i\) (using the raw definition of LT for now; in practice a proxy metric replaces it), and the LT constraint is \(LT^{*}\) (only the LT constraint is considered here; multi-constraint derivations are analogous):
\[\begin{aligned} \max_x \quad & \sum_{i=0}^{n-1} ecpm_i \\ \text{s.t.} \quad & \frac{1}{n} \sum_{i=0}^{n-1} lt_i \ge LT^{*} \end{aligned}\]
Using the Lagrangian dual, we can derive the per-item ranking score (a similar derivation process can be found in 《搜索相关性:从建模到排序机制》):
\[Score_{i} = ecpm_i + \lambda \cdot (lt_{i} - LT^{*})\]
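The key step is short: the constraint \(\frac{1}{n}\sum_i lt_i \ge LT^{*}\) is equivalent to \(\sum_i (lt_i - LT^{*}) \ge 0\), so the Lagrangian of the problem is
\[\mathcal{L}(x,\lambda) = \sum_{i=0}^{n-1} ecpm_i + \lambda \sum_{i=0}^{n-1} \left(lt_i - LT^{*}\right) = \sum_{i=0}^{n-1} \left[\, ecpm_i + \lambda \left(lt_i - LT^{*}\right) \right], \quad \lambda \ge 0,\]
which, for a fixed \(\lambda\), decomposes item by item, and each item's contribution is exactly the \(Score_i\) above.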
Here, \(\lambda\) acts as the "Shadow Price" of LT: the exchange rate between LT and eCPM. If a user's predicted \(lt_i\) for a specific request is high, the final \(Score_i\) increases, making the item more likely to win.
There are several key considerations in this method:
- Proxy Metrics: Since direct LT prediction is difficult, we use easily observable process metrics as proxies.
- Listwise Context: The best place for this prediction is the Evaluator/Rerank stage, where the model has the most context to predict how a specific sequence of items affects LT.
- Solving for \(\lambda\): \(\lambda\) is not static. It can be solved via offline replay of historical data to find the global optimum or adjusted in real-time via a PID Controller to ensure the LT constraint (\(LT^*\)) is met.
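As a minimal sketch of the real-time option, the controller below nudges \(\lambda\) toward the LT target; the gains, update cadence, and the use of a windowed proxy-LT observation are illustrative assumptions:

```python
class LambdaPIDController:
    """PID-style adjustment of the LT shadow price so observed (proxy) LT tracks LT*."""

    def __init__(self, lt_target: float, kp: float = 0.5, ki: float = 0.05, kd: float = 0.1):
        self.lt_target = lt_target
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, observed_lt: float) -> float:
        # If observed LT drops below target, the error is positive and lambda rises,
        # giving experience more weight in Score_i = ecpm_i + lambda * (lt_i - LT*).
        error = self.lt_target - observed_lt
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)

def score(ecpm_i: float, lt_i: float, lt_target: float, lam: float) -> float:
    """Per-item ranking score with LT treated as an internal cost/benefit."""
    return ecpm_i + lam * (lt_i - lt_target)
```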
Though these key considerations are only touched on briefly here, each of them requires detailed investigation to solve the problem properly.
Conclusion
User experience optimization in complex search, ads, and recommendation systems is not a one-time fix but a layered, systematic challenge.
Short-term (Heuristic Defense): Focuses on rapid identification and loss prevention. It uses "patches" to protect the system when value estimation or exchange rates are not yet optimal.
Mid-term (Experience Modeling): Uses correlation or Uplift models to scientifically quantify experience. It makes hidden costs explicit and provides the data foundation for long-term integration.
Long-term (Unified Value): The final form of optimization. It breaks the duality of "Business vs. Experience" by converting UX into a "Unified Currency." LT becomes a native value participating in automated resource allocation.
These three stages represent an evolution from Local Optima toward Global Optima. In practice, they should coexist: use short-term rules to hold the redline, mid-term models to improve efficiency, and continuously iterate toward the ideal of unified modeling.
Chinese Version
In the evolution of Search, Ads, and Recommendation businesses, user experience (quantified in this article as LT, i.e., user retention on the app, such as how many days within a 7-day window the user opens the app) is an unavoidable core topic. Unlike pure content recommendation, which pursues maximum user watch time, in commercial or marketing-oriented businesses such as ads, e-commerce, and live streaming, the essence of UX optimization is to maximize business value (Cost, GMV, etc.) while staying above the experience redline; or, more precisely, to maximize the efficiency with which each unit of LT is exchanged for business metrics.
Typically, we measure the impact of a business strategy on LT through holdout (reverse) experiments. The factors that influence the LT readings of such experiments are numerous:
- Explicit factors: the most common and best-known ones, namely the position and density of the business items when shown (e.g., start_pos, gap, load).
- Implicit factors: supply-side quality and ranking-side accuracy. For example, the quality and diversity of ad creatives or live-stream content directly determine how attractive the distributed queue is, and therefore whether users are easily drawn to the platform's content. This is also tied to the system's ranking ability: besides sufficient supply, accurate ranking is needed to push the right content to different users.
- Opportunity cost (backfill logic): a point that is often overlooked. A holdout experiment essentially compares the value of the "business queue" against the "backfill queue" (usually the organic recommendation queue). The final impact of a business strategy on LT is effectively (LT of the business content) - (LT of the displaced organic content). Only when the business content is roughly as attractive as, or more attractive than, the organic content is the negative impact on LT minimized.
Therefore, although high-quality supply and accurate ranking are the fundamental drivers of LT, they take a long time to pay off. In day-to-day engineering iteration, adjusting load, start_pos, and gap is usually the most direct and fastest-acting lever on LT. In addition, while pursuing business growth we must establish strict defensive mechanisms, so that experience damage that is hard to observe in the short term does not, because of the lagging observation window, accumulate into a large negative effect in the long run.
This article explores the evolution of UX optimization, organized into three stages: short-term defense (rule-based segmentation), mid-term tuning (experience modeling), and long-term alignment (unified value modeling).
Short-term Strategy: Rule-based Defense and Segment Protection
In the early stage of optimization, the most common and effective tools are rule-based strategies built on traffic attributes or user-segment features. The core logic is "rapid loss prevention" plus "layered defense": manually or semi-manually identify high-risk / highly sensitive traffic, and for segments that are poorly predicted or sensitive, apply differentiated load policies (thresholds, start_pos, gap) to immediately reduce the risk of experience damage.
The familiar protection policies that raise thresholds by user segment or traffic type belong to this category. The core idea is to use manual or semi-manual "patches" (raising thresholds per segment or per traffic type) to quickly protect high-risk / highly sensitive users, and its effectiveness has been validated in many business scenarios. Typical examples include protection for new users, returning users, and non-MAU (low-activity) users, or, in search, setting different thresholds for active versus passive traffic.
The implementation logic is fairly simple. On the traffic side, requests are split statically by request features (e.g., channel entry); on the user side, dynamic audience packages are selected based on user profiles / historical behavior (e.g., new users, low-activity users, users with historically high report or bounce rates). Some data analysis is usually needed to justify that the segmentation is meaningful.
Mechanistically, at different ranking stages we set independent, higher thresholds for these traffic slices / audience packages, or apply larger start_pos loss / gap loss, so that such content is shown less often or not at all.
The advantages of this class of methods are simplicity and low experiment cost; the drawbacks are that they are purely post-hoc rules, lack generalization, and easily become stale.
Although these rule-based short-term strategies can stop losses quickly, a question worth examining is: why do these per-traffic, per-segment policies work at all?
From the traffic perspective, if the system has a static threshold, that threshold has already fixed the exchange efficiency between the business objective and LT. Take ads as an example: the objective being maximized is revenue (ecpm), and a fixed ecpm threshold constrains the marginal exchange rate between ecpm and LT (more intuitively, the LT loss caused by a show must bring in at least that much ecpm). Replace ecpm with any other business objective and the reasoning is the same.
The static threshold therefore controls the overall exchange efficiency between the business objective and LT, and at equal load the optimal threshold should be a single static one.
So why is raising thresholds per traffic type or per audience package effective as experience protection? We can attribute it to three key facts:
1) Value estimation bias: for some segments (such as new users), the system may overestimate their business value (e.g., ecpm). Raising the threshold is equivalent to manually increasing the exchange ratio for this traffic, which to some extent calibrates the overestimation.
2) Experience-loss heterogeneity: even if value estimation is accurate, the same exposure (show) causes different LT losses for different users (a sensitive user may churn after one inappropriate exposure, while a resilient user barely notices). This requires the system to model user experience accurately.
3) Suboptimal exchange efficiency: even if both value and experience are predicted accurately, the trade-off (exchange ratio) between them in the ranking formula may still not be globally optimal.
Therefore, the preconditions for a single static threshold to achieve the global optimum are very demanding: accurate value prediction, accurate experience prediction, and an already optimal exchange rate between the two. Whenever any of these conditions fails, per-segment / per-traffic policies become effective "ranking patches."
In the long run, however, the value term and the experience term should be made accurate and their exchange ratio optimized, with a unified threshold regulating the trade-off between the business objective and LT; per-traffic and per-segment policies are short-term patches and should not stay online indefinitely.
Mid-term Strategy: Experience Signal Modeling and Intervention
The segmentation strategies above are based on one or a few features (e.g., manually chosen dimensions for splitting users and traffic). They then rely on the assumption that "these segments have a poor exchange rate between LT and business metrics," and apply load reductions to them to limit the impact on LT while keeping the business metrics as intact as possible.
This approach generalizes poorly, however, and the underlying assumption may stop holding as the distribution shifts. A more general approach is to model and predict the changes in these experience metrics, making the user's experience loss explicit, and to move from the manually defined threshold interventions above toward dynamic interventions based on predicted experience.
The core idea is to model the user's experience signals and then, based on the predicted values, apply more personalized and more efficient recovery policies. Depending on how the modeling is done, the common approaches fall into two categories:
- Direct experience-signal modeling: directly model the probability of negative behaviors related to LT (leave, report, dislike), then apply different load-reduction policies based on the predicted probability; this is correlational modeling.
- Uplift modeling: model the relationship between per-unit exposure (load) and LT for different users, improving the marginal exchange efficiency between the business objective (e.g., awld for live streaming) and LT and making LT recovery more efficient; this is causal modeling.
Since plenty of articles already cover the modeling itself, here we only briefly sketch the main idea of each approach.
(1) Experience-signal modeling (correlational)
- Offline modeling: predict the probability that the user triggers negative experience behaviors after a given exposure.
- Online serving: based on the predicted probability of negative behavior, reduce the load of live-stream items the user sees by raising thresholds, pushing back the start position, or enlarging the gap.
(2) Uplift modeling (causal)
- Offline modeling: use causal inference methods to model the LT change (\(\Delta LT\)) and the business-metric change (e.g., \(\Delta cost\) for ads) caused by applying a treatment to the user (e.g., whether a business item is shown).
- Online serving: based on the predicted values, compute the marginal exchange efficiency \(\frac{\Delta LT}{\Delta cost}\) and choose the traffic and users with the highest exchange efficiency for LT recovery (again by limiting load as above: raising thresholds, pushing back the start position, or enlarging the gap), so that with business-metric losses held flat, the recovered LT is maximized.
Compared with rule-based segmentation, the advantage of this approach is better generalization, and uplift modeling additionally optimizes the marginal exchange efficiency between the business objective and LT. The challenge is that the modeling is considerably harder, mainly in two respects. 1) Both correlational and causal modeling face the label-acquisition problem: LT is a sparse, long-horizon metric that is extremely hard to model directly, so related proxy metrics are needed, and both how hard those proxies are to model and how well they correlate with LT determine the final modeling effect.
2) Uplift modeling depends on the accuracy of the counterfactual data (i.e., of the labels). Strictly speaking, absolute ground truth cannot be obtained and is approximated in various ways in practice; moreover, obtaining labels requires a slice of exploration traffic that shows less business content or none at all, which has a cost online (by withholding items, the holdout loss stays controllable).
Long-term Strategy: Unified Experience-Value Modeling and Exchange
Both the short-term and mid-term approaches start from prior knowledge, or from model-predicted experience signals, and then take effect in the final ranking through hand-crafted rules (e.g., showing nothing in the first x screens) or listwise losses (gap loss, start_pos loss). They all depend on manually constructed operations (pushing back start_pos, shrinking the gap, and so on) to realize the experience gain or contain the loss online.
The goal of the long-term approach is to abandon the "patching separately for LT" logic altogether: convert the experience signal (LT loss / gain) into a "unified currency" that can be exchanged directly with business metrics (revenue, GMV, live-stream watch time), and let the trade-off play out automatically end to end in the delivery pipeline. The core idea is to stop treating LT as an external gating condition and instead quantify it as an experience cost / benefit that enters the ranking formula directly.
Deriving the Optimal Ranking Formula
Since each queue in the unified mixed ranking has its own core objective, in theory all such problems can be cast as a constrained optimization problem. Take the ad queue as an example, with ecpm as the objective to maximize. Suppose there are \(n\) requests, the ad on the \(i\)-th request has ecpm \(ecpm_i\) and predicted LT value \(lt_i\) (the raw definition of LT is used here, i.e., how many days within n days the app is opened; in practice a proxy metric usually replaces it), and the LT constraint is \(LT^{*}\) (only the LT constraint is considered here; the multi-constraint derivation is analogous). The problem to be solved can be formalized as follows:
\[\begin{aligned} \max_x \quad & \sum_{i=0}^{n-1} ecpm_i \\ \text{s.t.} \quad & \frac{1}{n} \sum_{i=0}^{n-1} lt_i \ge LT^{*} \end{aligned}\]
Via the Lagrangian dual, the ranking formula below can be derived; the derivation is similar to the one in 《搜索相关性:从建模到排序机制》.
\[score_{i}= ecpm_i + \lambda \cdot (lt_{i} - LT^{*})\]
Here, \(\lambda \cdot (lt_{i} - LT^{*})\) is the experience value expressed in the same units as ecpm. Intuitively, the larger the predicted \(lt_i\) of the current user on the current request, the larger the final \(score_i\) and the easier it is for the item to win; conversely, the harder it is to win.
There are a few key points in this approach:
1) Since predicting LT directly is quite hard, easily observable process metrics are usually needed as proxies. This is the same issue as in the mid-term approach; in fact, every modeling route runs into it.
2) Where to predict: the final evaluator stage of the mixed ranking, which selects the candidates, has the richest context, so in theory it can fully model the impact of that context on LT. It is therefore natural to model the LT impact there with a listwise model.
3) Solving for the \(\lambda\) parameter: solve for the optimal \(\lambda\) by replaying historical data, or adjust it in real time with a controller. Since LT is not a constraint that must hold at every single moment, a day-level replay over historical data can be used to search for the optimal \(\lambda\).
Although these key points are only mentioned in passing, in practice each of them has a major impact on the final result, and each deserves an article of its own (a topic to revisit later).
Summary
User experience optimization, especially in complex Search/Ads/Recommendation scenarios, is not a simple, one-shot engineering problem but a systematic challenge that has to be tackled in stages and in layers. This article organizes it into three progressive strategy paradigms, short-term, mid-term, and long-term, which correspond respectively to emergency defense, precise intervention, and system-level restructuring.
The short-term strategy (rule-based defense) centers on rapid identification and loss prevention. Through per-segment, per-traffic rule "patches," it gives the system the most direct protection while value-estimation bias and experience-loss heterogeneity remain unsolved. Its value lies in engineering simplicity and efficiency, but it is essentially a passive adaptation to a non-ideal status quo (inaccurate prediction, suboptimal exchange rates) and lacks generalization and foresight.
The mid-term strategy (experience modeling) builds experience-signal models (correlational) or uplift models (causal) to quantify experience losses and gains more scientifically, and intervenes dynamically and individually based on the predictions. Its core contribution is to make the implicit experience impact explicit and to lay the data and model foundation for long-term integration. Its challenges are sparse signals, modeling complexity, and the cost of exploration.
The long-term vision (unified value) is the end state of UX optimization, aiming to dissolve the "business versus experience" dichotomy. Its core idea is to convert experience value into a "unified currency" that is homogeneous with, and freely exchangeable against, business objectives, and to internalize it into the ranking formula via optimization theory. This is no longer "patching for LT"; LT participates as a native value in the end-to-end automated trade-off and resource allocation. The keys to realizing this vision are finding reliable experience proxy metrics and robustly solving for the dynamic trade-off parameter between the two values (the \(\lambda\) in the formula above).
In essence, these three stages trace an evolution from local optima toward the global optimum. In practice, they are not substitutes for one another but should coexist: use short-term rules to hold the bottom line, mid-term models to improve efficiency, and keep iterating toward the ideal architecture of unified modeling. Future UX optimization will inevitably mean deeper value quantification, smarter real-time control, and more thorough system-level value alignment.