Wu Liangchao's Study Notes

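How to Become a Fast Reader

Posted on 2021-07-25 Tags: Reading, Borrowed Wisdom

While looking for material recently, I stumbled on a useful resource on speed reading: a course on the 得到 (Dedao) app called 《怎样成为快速阅读的高手》 (How to Become a Master of Speed Reading). It presents a fairly systematic methodology for speed reading, which boils down to a three-step approach: assessment, skimming, and close reading. Many of its points resonated with me, so this post excerpts the parts that left the deepest impression. I recommend listening to the original course: you may get more out of it, and it also supports the course's author.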

An Overview Of An Ad System

Posted on 2021-05-05 Tags: Computational Advertising

From my internship to full-time work, I have worked with advertising systems of various sizes, from small DSPs that are "small but complete" to large media platforms that bundle SSP, ADX, and DSP together. This has given me a preliminary understanding of the advertising systems used in the industry. Taking advantage of the May Day holiday, I am organizing what I currently know about advertising systems, mainly to consolidate scattered knowledge. The concept of an advertising system is vast and involves many components, so I cannot cover everything. This article concisely describes the parts I care about most from several perspectives (technical, business, product). The content may be incomplete; feedback and corrections are welcome.

Disclaimer: the content of this article is unrelated to my employer and is based on my current understanding. I only reference public information and do not include any unpublished internal information from my employer. If any colleagues find sensitive content, please contact me for removal. In today's era of open source, widely available papers, and increasing personnel mobility, I believe these general techniques are not the core; what matters is data, an understanding of the business, and the flexible assembly of these general techniques.

Exposure Bias In Machine Learning

Posted on 2021-04-03 Tags: Computational Advertising, Machine Learning

Machine learning essentially learns the data distribution. Its effectiveness assumes that training and serving data are Independent and Identically Distributed (IID). However, in practice, due to biased sampling and specific scenario constraints, training samples and serving samples are not IID. In advertising scenarios, the most typical example is CVR model training, where training samples are all post-click, but during serving, the CVR model faces all retrieved samples. This type of problem is called exposure bias or sample selection bias. Besides exposure bias, position bias is also a common bias.

This article first briefly introduces common biases in machine learning, then focuses on exposure bias (also called sample selection bias) and current solution approaches. The author summarizes them into three main categories: Data Augmentation, IPS, and Domain Adaptation.
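
As a rough illustration of the IPS idea mentioned above, here is a minimal sketch of an inverse-propensity-weighted log loss for a CVR model trained only on clicked samples. It is not taken from any of the referenced work: the function names, the self-normalized form, the `min_propensity` clipping value, and the toy numbers are all assumptions made for this example. The propensity of each sample is assumed to be its click probability (for instance, a predicted CTR).

```python
import math


def log_loss(y, p, eps=1e-7):
    """Binary cross-entropy for one sample, with the prediction clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def ips_log_loss(labels, preds, propensities, min_propensity=0.01):
    """Self-normalized IPS loss: each clicked sample is weighted by 1 / propensity.

    Propensities are clipped from below (an assumed choice) to keep the
    weights, and therefore the variance, bounded.
    """
    weights = [1.0 / max(q, min_propensity) for q in propensities]
    losses = [log_loss(y, p) for y, p in zip(labels, preds)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Toy data (hypothetical): conversion labels, predicted CVR, click propensities.
labels = [1, 0, 0]
preds = [0.3, 0.2, 0.1]
click_propensities = [0.5, 0.1, 0.1]

# With uniform propensities the IPS loss reduces to the plain average loss.
uniform = ips_log_loss(labels, preds, [1.0, 1.0, 1.0])
# Samples that were unlikely to be clicked get larger weights.
weighted = ips_log_loss(labels, preds, click_propensities)
```

Samples with a low click propensity stand in for the many similar impressions that were never clicked, so up-weighting them moves the training distribution closer to the serving distribution.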

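2020 Summary

Posted on 2021-01-02 Tags: Casual Remarks

I haven't published a technical post in a long time. A few half-finished drafts have been sitting in my drafts folder, delayed for weeks on the excuse that weekdays are too busy and weekends are for rest. Now, standing at the start of 2021 and looking back at the tail end of 2020, I can't help marveling at how quickly the year slipped by, and I keep wanting to write something to review it. The last year-end summary I wrote was the 2017 summary (2017 小结), back when I was in my first year of graduate school. Rereading it now is a little moving, and I still admire that version of myself: full of passion and energy, curious about every kind of knowledge. With a few free days over the New Year holiday, I decided to write a brief summary of 2020. Looking back on it a few years from now may bring different insights.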

Delayed FeedBack In Computational Advertising

Posted on 2020-12-05 Tags: Computational Advertising, Machine Learning

Conversions are delayed: a user may convert some time after clicking, and the deeper the conversion funnel, the longer the delay tends to be. In computational advertising, delayed feedback mainly affects the following two scenarios:

1. CVR model training
2. Posterior-based bid adjustment

For scenario 1, the impact is twofold: (1) feeding samples to the model too early treats events that will eventually convert, but whose labels have not yet been reported, as negatives, so the model underestimates CVR; (2) feeding samples too late, i.e., holding every sample for a sufficiently long window before training, means the model is not updated in time.

For scenario 2, when the controller steers cost / value toward a target, the observed denominator is smaller than its true value because some conversions have not been reported yet, which makes the control unstable.
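
To make both effects concrete, here is a small sketch on made-up data. Everything in it is an assumption for illustration: the click log, the observation time `now`, the spend `cost`, and the choice to treat "value" as the number of conversions.

```python
# Each click is (click_hour, conversion_delay_hours); a delay of None means
# the click never converts. The data is hypothetical.
clicks = [(0, 1), (0, 30), (1, None), (2, 5), (3, None), (4, 48), (5, None), (6, 2)]


def observed_conversions(clicks, now):
    """Conversions whose label has already been reported by hour `now`."""
    return sum(1 for t, d in clicks if d is not None and t + d <= now)


def final_conversions(clicks):
    """Conversions that will eventually be reported, regardless of delay."""
    return sum(1 for _, d in clicks if d is not None)


now = 12      # hour at which the samples are read (assumed)
cost = 40.0   # total spend on these clicks (assumed)

obs = observed_conversions(clicks, now)
final = final_conversions(clicks)

# Scenario 1: labels read too early make the CVR look lower than it is.
observed_cvr = obs / len(clicks)
true_cvr = final / len(clicks)

# Scenario 2: the denominator of cost / value is too small, so the
# controller sees a higher cost per conversion than the true one.
observed_cost_per_conversion = cost / obs
true_cost_per_conversion = cost / final
```

In this toy log only 3 of the 5 eventual conversions are visible at hour 12, so a model trained at that point would see a CVR of 0.375 instead of 0.625, and a controller would see a cost per conversion of about 13.3 instead of 8.0.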

This article introduces how three papers approach this problem in scenario 1. Some of the methods can also be applied to scenario 2 (and if scenario 1 is solved well, bids can be adjusted from predicted values rather than posterior data).

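Program Representation, Transformation, and Linking - Weeks 10 and 11

Posted on 2020-10-03 Tags: Program Representation, Transformation, and Linking

This post covers weeks 10 and 11 of the course 程序的表示、转换与链接 (Program Representation, Transformation, and Linking). It describes the steps for producing an executable from source files (preprocessing, compilation, assembly, linking) and details the two main phases of linking: symbol resolution and relocation. It also compares the relocatable object files that the linker takes as input with the executable object file it produces. It should help in understanding how a file goes from compilation to execution, and can be read alongside the reading notes on 《链接、装载与库》.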

Reading Notes on "Embedding-based Retrieval in Facebook Search"

Posted on 2020-08-30 Tags: Computational Advertising, Machine Learning

Embedding-based Retrieval in Facebook Search is a 2020 Facebook paper on vector retrieval for search. It reads like a frontline engineer explaining how they built a retrieval system from 0 to 1, covering training data and feature selection, model training and serving, and the integration of the new retrieval strategy into the existing ranking system. The paper has few formulas or derivations, but it offers many practical lessons that apply beyond search to recommendation and advertising. This article mainly distills the author's understanding; reading the original paper is recommended.