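Machine learning essentially learns the distribution of data. Its effectiveness rests on the assumption that the data seen at training and at serving time is independent and identically distributed (IID). In practice, however, biased sampling and scenario-specific constraints mean that training samples and serving samples are not IID. In advertising, the most typical case is the CVR model: its training samples are all post-click, but at serving time the model faces every retrieved candidate. This problem is known as exposure bias or sample selection bias; position bias is another common bias.

This article first briefly introduces some common biases in machine learning, then focuses on current approaches to the exposure bias (also called sample selection bias) mentioned above, which I group into three families: Data Augmentation, IPS, and Domain Adaptation.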
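To make the IPS family concrete, here is a minimal sketch of inverse propensity weighting. It assumes the probability that each sample was observed (for example, clicked) is already known; the function name `ips_weighted_mean` and all numbers are illustrative and do not come from the post.

```python
def ips_weighted_mean(labels, propensities, clip=1e-3):
    """Self-normalized IPS estimate of the mean label over the full population,
    computed from observed samples only. Propensities are clipped from below
    to avoid exploding weights (the clip value is an assumption)."""
    weights = [1.0 / max(p, clip) for p in propensities]
    return sum(w * y for w, y in zip(weights, labels)) / sum(weights)


# Hypothetical population: two equally sized groups of 1000 impressions.
# Group A: observed with probability 0.5, true conversion rate 0.20.
# Group B: observed with probability 0.1, true conversion rate 0.05.
labels = [1] * 100 + [0] * 400 + [1] * 5 + [0] * 95
propensities = [0.5] * 500 + [0.1] * 100

naive = sum(labels) / len(labels)               # biased toward group A
ips = ips_weighted_mean(labels, propensities)   # reweighted to the population

print(f"naive={naive:.3f} ips={ips:.3f}")  # naive=0.175 ips=0.125
```

The naive average over observed samples overestimates the population conversion rate of 0.125 because group A is over-represented; reweighting by inverse propensity recovers it in this toy case.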

Read more »

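I haven't published a technical post in a long time. A few half-finished drafts have been sitting in my drafts folder, delayed for weeks on the excuse that weekdays are too busy and weekends are for rest. Now, standing at the start of 2021 and looking back at the tail end of 2020, I can't help feeling that the year slipped by in a flash, and I keep wanting to write something to review it. The last year-end summary I wrote was 2017 小结, back in my first year of graduate school. Rereading it now still stirs some feelings, and I rather admire that younger self, full of passion and energy and curious about all kinds of knowledge. With a few free days over the New Year holiday, I decided to write a short summary of 2020; looking back on it in a few years may bring a different perspective.

Read more »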

Conversions are delayed: a user may convert only some time after clicking, and the deeper the conversion sits in the funnel, the longer the delay tends to be. In computational advertising, delayed feedback mainly affects the following two scenarios:

  1. CVR model training
  2. Posterior-based bidding strategy adjustment

For scenario 1, the impact is twofold: (1) if samples are sent to the model too early, events that will eventually convert but whose labels have not yet been reported are treated as negatives, so the model underestimates; (2) if samples are sent too late, that is, every sample waits a sufficiently long time before being sent, the model is not updated in a timely manner.
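The trade-off in scenario 1 can be illustrated with a small sketch. The delays below are made-up numbers, and `observed_cvr` is a hypothetical helper, not something taken from the papers discussed.

```python
def observed_cvr(delays_hours, wait_hours):
    """CVR seen by the trainer if every click is labeled after `wait_hours`.
    A click counts as positive only if its conversion arrived within the wait;
    `None` means the click never converts."""
    positives = sum(
        1 for d in delays_hours if d is not None and d <= wait_hours
    )
    return positives / len(delays_hours)


# Ten hypothetical clicks: four eventually convert, with different delays.
delays = [0.5, 2, 30, 80, None, None, None, None, None, None]

for wait in (1, 24, 168):
    print(f"wait={wait}h observed_cvr={observed_cvr(delays, wait):.1f}")
# wait=1h observed_cvr=0.1
# wait=24h observed_cvr=0.2
# wait=168h observed_cvr=0.4
```

Waiting one hour labels three of the four eventual conversions as negatives, while waiting a week recovers the true rate of 0.4 at the cost of week-old training data.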

For scenario 2, the impact is that when the controller tries to hold cost / value = target, the observed denominator is smaller than its actual value, which makes the control unstable.
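A small numeric sketch of this effect, with made-up cost and conversion values and a hypothetical helper `observed_ratio`:

```python
def observed_ratio(cost, conversion_values, reported):
    """cost / value computed only from conversions already reported."""
    observed_value = sum(
        v for v, is_reported in zip(conversion_values, reported) if is_reported
    )
    return cost / observed_value if observed_value else float("inf")


cost = 100.0
values = [10.0] * 10                   # ten conversions worth 10 each
reported = [True] * 6 + [False] * 4    # four labels have not arrived yet

true_ratio = cost / sum(values)
seen_ratio = observed_ratio(cost, values, reported)
print(f"true={true_ratio:.2f} observed={seen_ratio:.2f}")  # true=1.00 observed=1.67
```

With a target of 1.0, the campaign is actually on target, but a controller that only sees reported conversions would observe 1.67 and react as if it were overspending.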

This article mainly introduces how three papers approach this problem in scenario 1. Some of the methods involved can also be applied to scenario 2 (and if scenario 1 is solved well, bids can also be adjusted based on predicted values rather than posterior data).

Read more »

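Conversions are delayed: a user may convert only some time after clicking, and the deeper the conversion sits in the funnel, the longer the delay tends to be. In computational advertising, delayed feedback mainly affects the following two scenarios:

  1. CVR model training
  2. Posterior-based bidding strategy adjustment

For scenario 1, the impact is twofold: (1) if samples are sent to the model too early, events that will eventually convert but whose labels have not yet been reported are treated as negatives, so the model underestimates; (2) if samples are sent too late, that is, every sample waits a sufficiently long time before being sent, the model is not updated in a timely manner.

For scenario 2, the impact is that when the controller tries to hold cost / value = target, the observed denominator is smaller than its actual value, which makes the control unstable.

This article mainly introduces how three papers approach this problem in scenario 1. Some of the methods involved can also be applied to scenario 2 (and if scenario 1 is solved well, bids can also be adjusted based on predicted values rather than posterior data).

Read more »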

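This article covers weeks 10 and 11 of the course 程序的表示、转换与链接. It introduces the steps for producing an executable from source files (preprocessing, compilation, assembly, linking), describes in detail the two main phases of linking, symbol resolution and relocation, and compares the relocatable object files that the linker takes as input with the executable object file it outputs. It helps in understanding how a file goes from compilation to execution, and can be read together with my reading notes on 《链接、装载与库》.

Read more »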

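Embedding-based Retrieval in Facebook Search is a paper Facebook published in 2020 on vector-based retrieval for search. It reads like a frontline engineer walking you through how they built a retrieval system from 0 to 1: from choosing training data and features, to model training and serving, to integrating the new retrieval strategy into the existing ranking system. The paper has few formulas or derivations, but many lessons drawn from practice, and I believe these lessons also carry over from search to recommendation and advertising. This article mainly distills my understanding of the paper; I recommend reading the original.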
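As a rough illustration of what vector-based retrieval means, here is a brute-force sketch that scores documents by cosine similarity to a query embedding and keeps the top k. The embeddings and names are invented for the example, and a production system would use approximate nearest neighbor search rather than scanning every document.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Hypothetical 2-dimensional embeddings.
docs = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
print(retrieve_top_k([1.0, 0.0], docs, k=2))  # ['a', 'b']
```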

Read more »

Embedding-based Retrieval in Facebook Search is a 2020 Facebook paper on vector-based retrieval for search. It reads like a frontline engineer explaining how they built a retrieval system from 0 to 1: from choosing training data and features, to model training and serving, to integrating the new retrieval strategy into the existing ranking system. The paper has few formulas or derivations, but many practical lessons that can be applied beyond search to recommendation and advertising. This article mainly distills my understanding of the paper; I recommend reading the original.

Read more »

Recommendation and advertising are two important businesses for many internet companies. Recommendation aims at DAU growth, or traffic growth, while advertising monetizes that traffic. The two solve similar problems: each time a request arrives, they select the top-k candidates from a very large candidate set. Both mostly use a retrieval + ranking architecture, possibly with coarse ranking in between. Essentially, both are making a trade-off between effectiveness and engineering cost.

If I had to name the biggest technical difference between the two, I think it is bidding. Advertising introduces the role of the advertiser, so besides user experience we also need to satisfy advertisers' needs (such as volume and cost) to bring sustained revenue growth. The most direct way advertisers express those needs is the bid, meaning how much they are willing to pay per click / conversion (truthful telling). This leads to bidding as a research area; many related papers are collected in rtb-papers.

This article mainly discusses Alibaba’s 2019 KDD paper Bid Optimization by Multivariable Control in Display Advertising. This paper solves two core bidding problems: bid formula and price adjustment strategy. From derivation of optimal bid formula to construction of bid controller, the paper’s overall modeling approach is worth learning. The entire derivation paradigm can be extended to more general bidding scenarios with strong practicality. I recommend reading the original paper.

Read more »

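Recommendation and advertising are two important businesses for many internet companies. Recommendation aims at DAU growth, or traffic growth, while advertising monetizes that traffic. The two solve similar problems: each time a request arrives, they select the top-k candidates from a very large candidate set. Both mostly use a retrieval + ranking architecture, possibly with coarse ranking in between. Essentially, both are making a trade-off between effectiveness and engineering cost.

If I had to name the biggest technical difference between the two, I think it is bidding. Advertising introduces the role of the advertiser, so besides user experience we also need to satisfy advertisers' needs (such as volume and cost) to bring sustained revenue growth. The most direct way advertisers express those needs is the bid, meaning how much they are willing to pay per click / conversion (truthful telling). This leads to bidding as a research area; many related papers are collected in rtb-papers.

This article mainly discusses Alibaba's 2019 KDD paper Bid Optimization by Multivariable Control in Display Advertising. The paper solves two core bidding problems: the bid formula and the bid adjustment strategy. From the derivation of the optimal bid formula to the construction of the bid controller, its overall modeling approach is well worth learning. The derivation paradigm extends to more general bidding scenarios and is quite practical. I recommend reading the original paper.

Read more »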

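This article covers week 7 of the course 程序的表示、转换与链接. It introduces the machine-level representation of procedure calls, that is, function calls, in C programs, including how arguments are passed, how control is transferred to the called procedure, register usage conventions, and how recursive functions are implemented. Understanding these topics makes it clearer how the machine actually executes a program and how the stack changes during a function call. The course uses the IA-32 instruction set introduced earlier.

Read more »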