An Overview Of An Ad System

From internship to full-time work, I’ve encountered various advertising systems of different sizes—from small DSPs that are “small but complete,” to large media platforms that bundle SSP, ADX, and DSP together. This has given me a preliminary understanding of the advertising systems in the industry. Taking advantage of the May Day holiday, I’m organizing my current knowledge of advertising systems. Since the concept of an advertising system is vast and involves many components, I cannot cover everything comprehensively. This article mainly describes several aspects I care about from multiple perspectives (technical, business, product) in a concise manner. The content may not be complete—feedback and corrections are welcome.

Disclaimer: The content of this article is unrelated to my employer and is mainly based on my current understanding. I will only reference public information and will not involve any unpublished internal information from my employer. If any colleagues find sensitive content, please contact me for removal. In today’s era of open source, widely available papers, and increasing personnel mobility, I believe these general techniques are not the core—data + understanding of business + flexible assembly of these general techniques is what matters.

Technical Perspective

The technical perspective mainly covers modules that technical teams iterate on daily. The overall architecture can be divided into retrieval + ranking, with two main areas of focus: bidding and modeling.

Retrieval and Ranking

Any recommendation or advertising system of scale follows a “retrieval + ranking” process to select top-k candidates for each request, sometimes with a coarse ranking stage in between. The reason is that the candidate set is too large, requiring a trade-off between engineering and effectiveness. If candidates are few and latency allows, directly ranking all candidates is undoubtedly best for effectiveness.

For a comprehensive overview of retrieval and ranking, I recommend reading 推荐系统技术演进趋势:从召回到排序再到重排 (Trends in Recommendation System Technology: From Retrieval to Ranking to Re-ranking).

Current retrieval systems mostly adopt multi-path retrieval, where each path has its own business purpose and goal, often complementary to each other—similar to the bagging idea in ensemble learning. For efficiency, most systems now use ANN (Approximate Nearest Neighbor) retrieval. The basic idea is to partition all candidates into \(m\) subspaces, then select \(n\) (\(m > n\)) subspaces during retrieval, and feed the top-k candidates with highest similarity to the query into the ranking stage. For more on ANN retrieval, see 图像检索:再叙 ANN Search; IVF-PQ is a commonly used retrieval method in the industry. Facebook’s “Embedding-based Retrieval in Facebook Search” is also worth reading—it reads like an engineer on the front lines walking you through how they built a retrieval system from scratch. I also wrote Embedding-based Retrieval in Facebook Search 阅读笔记 for reference.
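To make the IVF idea concrete, here is a minimal sketch using the faiss library, which implements IVF-PQ; `nlist` plays the role of \(m\) and `nprobe` the role of \(n\) above. All dimensions and data are illustrative.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 64                        # embedding dimension (illustrative)
nlist = 1024                  # number of coarse clusters: the "m" subspaces
xb = np.random.rand(100_000, d).astype("float32")  # candidate embeddings
xq = np.random.rand(5, d).astype("float32")        # query embeddings

quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, 8, 8)  # 8 PQ sub-vectors, 8 bits each
index.train(xb)               # learn clusters and PQ codebooks
index.add(xb)
index.nprobe = 16             # clusters probed per query: the "n < m"
D, I = index.search(xq, 10)   # top-10 nearest candidates per query
```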

Retrieval aims to select good candidates as quickly as possible, often using simple features and models. Ranking, with far fewer candidates than retrieval, can use more complex features and models—various cross features and complex structures can be tried in ranking. Ranking is essentially about mining features based on business needs. Airbnb’s paper “Real-time Personalization using Embeddings for Search Ranking at Airbnb” is worth learning from—it doesn’t involve esoteric model structures or hyperparameters, but shows the author’s deep understanding of business and how to incorporate that understanding into the model. I wrote Real-time Personalization using Embeddings for Search Ranking at Airbnb 阅读笔记 for reference.

An important note: splitting the ranking process into multiple stages to select top-k candidates can cause the optimization objectives of each stage to diverge. This is mentioned in the article 《推荐系统技术演进趋势:从召回到排序再到重排》:

If model-based retrieval is used in the retrieval stage, theoretically the same optimization objective as the ranking model should be adopted. Especially if ranking uses multi-objective optimization, the retrieval model should also adopt the same multi-objective optimization. Similarly, if coarse ranking is included, it should also use the same multi-objective optimization as fine ranking—optimization objectives across stages should be consistent. Otherwise, if objectives are inconsistent, high-quality candidates for ranking objectives may be filtered out in earlier stages, affecting overall effectiveness.

If the above view is that retrieval should adapt to ranking, Facebook’s “Embedding-based Retrieval in Facebook Search” proposes letting ranking adapt to new retrieval. The paper notes that new ANN retrieval results might not be recognized by ranking:

since the current ranking stages are designed for existing retrieval scenarios, this could result in new results returned from embedding based retrieval to be ranked sub-optimally by the existing rankers

Modeling

The importance of modeling goes without saying—often a few per mille improvement in offline metrics can bring significant online gains, and almost all rule-based constraints in the system can be replaced by models. Here are several aspects I consider important for model effectiveness: data, training, and prediction calibration.

Data

Data is arguably the factor with the greatest impact on model effectiveness. This can be divided into data pipeline and feature engineering.

The core goal of the data pipeline is to obtain ground truth for various events (click, convert) timely and accurately. “Timely” means events should be fed to the model as quickly as possible; “accurate” means events should be correctly labeled.

The data pipeline is closely related to conversion attribution. Attribution can be understood as obtaining and joining labels. The most common is last-touch attribution, but there are other methods like multi-touch attribution. This typically involves advertiser-side reporting and the actual log joining, which I won’t detail here—for more, see rtb-papers. Here are several issues I consider worth attention: delayed feedback, bias, and cross-channel data.

  • Delayed feedback

In advertising, the CVR model is a classic example of this problem. Conversions are delayed—users may convert some time after clicking, and deeper conversion funnels tend to have longer delays.

There are two choices: (1) Wait for labels to fully return before training, e.g., if true labels fully return within a day, do daily training—but this violates the “timely” principle; (2) Feed data to the model in real-time for online training—but this violates the “accurate” principle since some labels haven’t returned. In practice, timeliness and accuracy are a trade-off.

This is formalized as the delayed feedback problem. See Delayed FeedBack In Computational Advertising for methods to address it in online-training scenarios—mainly using importance sampling to weight samples, or letting samples enter the model multiple times and deriving new probability expressions to keep the estimates unbiased.
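As one concrete instance, here is a minimal sketch of the importance-weighted cross entropy used under the fake-negative scheme (every example is ingested as a negative immediately and re-ingested as a positive once the conversion arrives); the weights follow the importance-sampling derivation in that line of work and must use a gradient-detached copy of the prediction. A sketch of the idea, not any one production system’s exact formulation.

```python
import torch

def fake_negative_weighted_loss(logits, labels):
    """Importance-weighted cross entropy for the fake-negative ingestion
    scheme. `labels` is 1 for (re-ingested) positives, 0 for negatives."""
    p_hat = torch.sigmoid(logits)
    p = p_hat.detach()                 # weights must not receive gradients
    pos_w = 1.0 + p                    # w(y=1) = 1 + p
    neg_w = (1.0 + p) * (1.0 - p)      # w(y=0) = (1 + p)(1 - p)
    loss = -(labels * pos_w * torch.log(p_hat + 1e-8)
             + (1 - labels) * neg_w * torch.log(1 - p_hat + 1e-8))
    return loss.mean()
```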

  • Bias

Training data can have various biases, commonly exposure bias and position bias.

  1. Exposure bias: Only exposed samples can enter the training set, so training samples are a small subset of serving samples.

  2. Position bias: Ads in more prominent positions have higher click probability.

For exposure bias, see Exposure Bias In Machine Learning, which summarizes methods into three categories: Data Augmentation, IPS (Inverse Propensity Scoring), and Domain Adaptation.
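Of the three, IPS is the simplest to illustrate: divide each exposed example’s loss by its exposure probability, so that the exposed subset approximates the full serving distribution. A minimal sketch, assuming the propensities are estimated elsewhere (e.g., from exposure logs) and clipped to keep the weights bounded:

```python
import torch
import torch.nn.functional as F

def ips_weighted_loss(logits, labels, propensity):
    """Inverse propensity scoring: reweight observed examples by 1/propensity.
    Clipping the propensity is a common practical guard against huge weights."""
    w = 1.0 / propensity.clamp(min=0.05)
    bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    return (w * bce).mean()
```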

In retrieval, exposure bias is often simplified to a negative sampling problem. The article SENet 双塔模型:在推荐领域召回粗排的应用及其它 lists several possible negative sampling methods:

Option 1: Exposed but not clicked data

This causes Sample Selection Bias. Our experience is this data is still needed, but should be mixed with other negative sampling methods at certain ratios to mitigate the bias.

Option 2: Global random negative sampling

Randomly sample from the global item pool as retrieval/coarse ranking negatives. YouTube DNN two-tower model does this. This ensures distribution consistency, but negatives are too different from positives, so how much the model learns is questionable.

Option 3: In-batch random negative sampling

Only include positives, then during training, select items from other users’ positives in the batch as negatives. Given a user, this randomly selects from all other users’ positives to construct negatives. Google’s two-tower retrieval model uses this.

Option 4: Random negative sampling from exposed data

Randomly select from all exposed data as negatives. We’ve tested this and it works in some scenarios.

Option 5: Popularity-based random negative sampling

Global random selection, but more popular items have higher probability of being selected as negatives. Many studies show this has positive effect. The implicit assumption: if a popular item wasn’t clicked/viewed by a user, it’s more likely a true negative. This also suppresses popular items, increasing personalization.

Option 6: Hard negative sampling

Select difficult examples as negatives. Hard examples have more loss and information content. But what counts as “hard” varies—Airbnb and others explore various hard negative mining methods.

In summary, negative sampling in retrieval and coarse ranking is an area worth exploring. Our experience: in 2019, mixing Option 1 + Option 3 worked well for FM retrieval, but not for two-tower models later. Global random didn’t work for either FM or two-tower, sometimes showing significant negative impact. There’s no unified conclusion—this area is something of an art.
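For Option 3, here is a minimal sketch of an in-batch sampled softmax for a two-tower model; the diagonal of the batch similarity matrix holds the positives, everything else serves as negatives. The temperature is illustrative, and Google’s paper additionally subtracts the log of each item’s sampling probability from the logits to correct for popularity (which connects back to Option 5).

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, temperature=0.05):
    """In-batch negatives: for each user, the clicked item is the positive
    and the other users' positives in the batch act as negatives."""
    user_emb = F.normalize(user_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)
    logits = user_emb @ item_emb.T / temperature                 # [B, B]
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal
    return F.cross_entropy(logits, labels)
```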

  • Cross-channel data

Besides own data, leveraging cross-channel data can improve effectiveness. The core is recovering users whose CVR is underestimated through cross-channel data.

Traditional advertising divides bidding into SSP ↔︎ ADX ↔︎ DSP. Large media like WeChat and TikTok contain all three—they’re the media (SSP), the auction platform (ADX), and also run advertiser campaign delivery (DSP). The models discussed above mainly concern the DSP part of the media.

Media often only use data generated on their own platform—WeChat can only access user behavior on WeChat, not TikTok. But advertisers often advertise across multiple media. A user might click an ad on media A but convert on media B. If media A only uses its own data, it treats this as a negative example, underestimating CVR and abandoning this high-potential conversion user. Using media B’s data lets the model learn the true CVR.

How to get this data? One way is advertisers attribute media A data and report back to media B; another is media cooperation, like federated learning in advertising applications.

After obtaining this data, the simplest approach is adding it to the original model’s training data. Alternatively, build an independent model and apply its output in bidding or as an adjustment to the original CVR estimate.

  • Feature engineering

If the data pipeline decides which data to use, feature engineering determines whether we can fully mine that data’s value—“data and features determine the upper bound of machine learning, while models and algorithms only approach that bound.”

Feature engineering mines potentially useful features based on business characteristics, then validates with A/B experiments. Though features are business-specific, there are common patterns: attribute features, statistical features, sequence features.

  1. Attribute features: User/ad properties like user age, gender; ad category, style.

  2. Statistical features: Count of specific operations (click, view) by user on specific dimensions (category, position, advertiser) in specific time ranges (7d, 3d, 12h, 1h).

  3. Sequence features: User behavior sequence in a time period, like last 30 clicked ads/items. See Alibaba’s DIN for typical application.

Another important part is feature selection. The simplest approach is deleting features and retraining, but this is expensive for long training times. A better approach is obtaining feature importance during training. For interpretable models like logistic regression and tree models, this is straightforward. But for embedding-based DNNs, each feature only has an embedding—how to do feature selection? Common methods:

  • Attention unit: Add an attention unit for each feature (like DIN’s activation unit). The attention output measures feature importance. The assumption is the model can learn weights for important features.

  • Embedding weight: Judge importance by embedding magnitude, commonly the L1 or L2 norm. The assumption is that important features have larger embeddings—by analogy with LR weights, since an NN can be loosely viewed as stacked LRs.

Both methods have strong assumptions. Is there a more direct method? Going back to the original idea—delete a feature, retrain, and evaluate—can we approximate this during training? Yes: mask a feature’s embedding to zero to “drop” it, then compute a separate AUC for each masked variant through multiple heads.
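A minimal evaluation-time sketch of the masking idea; `model.embed` and `model.head` are hypothetical hooks standing in for your model’s real forward pass, and production variants do the masking during training with multiple heads instead.

```python
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def per_feature_auc_drop(model, batch, labels, n_features):
    """Approximate "delete a feature and re-evaluate": zero one feature's
    embedding at a time and measure the AUC drop. Assumes `model.embed`
    returns a [batch, n_features, emb_dim] tensor (hypothetical API)."""
    emb = model.embed(batch)
    def auc(e):
        scores = model.head(e).squeeze(-1).sigmoid().cpu().numpy()
        return roc_auc_score(labels, scores)
    base = auc(emb)
    drops = {}
    for f in range(n_features):
        masked = emb.clone()
        masked[:, f, :] = 0.0            # "drop" feature f
        drops[f] = base - auc(masked)    # larger drop = more important
    return drops
```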

Additionally, in retrieval, for efficiency, online systems often only extract single-side user/ad features. Cross features may work better but can’t be directly extracted online—consider methods like distillation, see Alibaba’s “Privileged Features Distillation for E-Commerce Recommendations” or Distillation 简介.

Training

After data determines the upper bound, how does the model approach it? Several aspects deserve attention: model structure, initialization, optimizer, and loss function. This mainly concerns deep neural networks—once compute and data reach a certain scale, NNs achieve state-of-the-art results.

  • Model structure

NN structures are diverse—recommendation: Wide&Deep, DeepFM, DIN, DCN; CV: AlexNet, VGG, Inception, ResNet; NLP: Transformer, BERT. These are just papers from when I was in school—there are many more recent ones.

Due to deep learning’s lack of interpretability, model building has little theoretical foundation—mostly trial and error, inferring causes from effects. Hence AutoML for automatic architecture and hyperparameter search. But beyond the esoteric parts, there are practical rules of thumb:

  1. VC Dimension: Describes the relationship between data size and model capacity—model size should be matched to data size, otherwise the model overfits or underfits.

  2. Attention: Simply put, dynamic weighting. Intuitively, different features have different importance—give higher weights to important ones. In NN, attention is used for embeddings and hidden units. For embedding attention, see SENet 双塔模型; for hidden unit attention, see Learning Hidden Unit Contributions.

  3. Multitask: Two common uses: (a) Multiple related tasks can effectively increase data and improve jointly; (b) For scenarios requiring accurate estimates (like ad CTR/CVR), the data mixes multiple scenarios with different priors—use a separate head per scenario to maintain accuracy (a minimal sketch follows this list).
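A minimal sketch of case (b): a shared bottom with one output head per scenario, so each head’s calibration reflects its own scenario’s prior. Layer sizes and scenario count are illustrative.

```python
import torch
import torch.nn as nn

class SharedBottomMultiHead(nn.Module):
    """Shared representation layers plus per-scenario output heads."""
    def __init__(self, in_dim, hidden=128, n_scenarios=3):
        super().__init__()
        self.bottom = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_scenarios))

    def forward(self, x, scenario_id):
        h = self.bottom(x)                                            # shared
        logits = torch.cat([head(h) for head in self.heads], dim=1)   # [B, S]
        return logits.gather(1, scenario_id.unsqueeze(1)).squeeze(1)  # own head
```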

  • Model training

The training process involves initialization, the optimizer, and the loss function.

Initialization affects results. From the optimization perspective, the NN loss surface is non-convex—a bad initialization starts the search from a poor position. From the backpropagation perspective, initial values that are too small or too large cause vanishing or exploding gradients. See deeplearning.ai’s Initializing neural networks.

Training is an optimization problem—backpropagation continuously adjusts parameters to minimize the loss. See Parameter optimization in neural networks. This involves hyperparameter choices: learning rate, batch size, and optimizer. See 一个框架看懂优化算法之异同 for the differences between optimizers.

Loss functions derive from MLE or MAP—assuming the training samples follow certain distributions, training maximizes their joint probability. MSE assumes the error follows a normal distribution; cross entropy assumes the labels follow a Bernoulli distribution—both are unified under the GLM framework.

In industry, problems are often converted to classification with a cross entropy loss. Some regression objectives can be converted via weighted logistic regression—see “Modeling Expected Watch Time” in YouTube’s Deep Neural Networks for YouTube Recommendations (a sketch follows the list below). Common cross entropy variants:

  1. Reweight: Various sample weighting schemes—weights with a direct physical meaning (e.g., watch time), or weights derived via importance sampling.

  2. Auxiliary/regularization: Add auxiliary tasks or regularization like center loss.
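As promised above, a minimal sketch of the watch-time reweighting idea from the YouTube paper—my simplification, not the paper’s exact pipeline. Positives are weighted by watch time, negatives by 1; at serving, exp(logit) then approximates expected watch time.

```python
import torch
import torch.nn.functional as F

def watch_time_weighted_loss(logits, clicked, watch_time):
    """Weighted logistic regression: per-sample weights rescale the loss."""
    weights = torch.where(clicked > 0, watch_time, torch.ones_like(watch_time))
    return F.binary_cross_entropy_with_logits(logits, clicked.float(), weight=weights)
```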

Prediction Calibration

In recommendation, only ranking matters—overall over/under estimation doesn’t affect ranking. But advertising involves billing—accuracy matters. Over-estimation charges advertisers more; under-estimation loses platform revenue.

So in advertising, models are evaluated on both AUC and calibration error. To calibrate while preserving ranking, isotonic regression is commonly used. See 使用 Isotonic Regression 校准分类器.
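A minimal calibration sketch using scikit-learn’s isotonic regression. The scores and labels here are illustrative; in practice the mapping is fit on a sizable holdout set, and because the fitted mapping is monotone, the ranking (and hence AUC) is preserved.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

scores = np.array([0.1, 0.3, 0.35, 0.6, 0.8])  # raw model scores (illustrative)
labels = np.array([0, 0, 1, 1, 1])             # observed outcomes

iso = IsotonicRegression(out_of_bounds="clip")  # clip unseen scores into range
iso.fit(scores, labels)
calibrated = iso.predict(np.array([0.2, 0.7]))  # calibrated probabilities
```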

Also, sample reweighting in the loss changes the positive/negative distribution, biasing the estimates. Strategies: calibrate during training, or transform the predictions directly—see “Fake negative calibration” in Delayed FeedBack In Computational Advertising.

Bidding

Retrieval + ranking and modeling apply to both recommendation and advertising. Bidding is what I consider the biggest difference—in advertising, the advertiser role is introduced, and bidding lets advertisers express their willingness to pay for impression/click/conversion.

Ad platforms offer various products for different advertiser needs: cost-control products, volume-focused products, brand ads with strict impression guarantees. For these products, the strategy ultimately lands on bidding—in eCPM-ranked systems, bid is the most flexible factor.

For these various needs, the same optimization modeling approach applies. Alibaba published a paper describing this, providing good guidance for bid formula derivation and controller construction. See my notes: 《Bid Optimization by Multivariable Control in Display Advertising》阅读笔记.
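Without reproducing the paper’s derivation, here is a minimal PID-controller sketch of the general feedback-bidding idea: adjust a bid multiplier so realized CPA tracks the target. The gains and clamp bounds are illustrative, not from the paper; real systems add smoothing and many safeguards.

```python
class PIDBidController:
    """Feedback control of cost: error > 0 means cost is running below target
    (room to bid up); error < 0 means cost is overrunning (bid down)."""
    def __init__(self, kp=0.4, ki=0.05, kd=0.1):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def adjust(self, bid, target_cpa, realized_cpa):
        error = (target_cpa - realized_cpa) / target_cpa
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        factor = 1.0 + self.kp * error + self.ki * self.integral + self.kd * derivative
        return bid * min(max(factor, 0.5), 2.0)  # clamp the multiplier
```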

Business Perspective

Business perspective covers problems directly faced in iteration, affecting customer experience and platform revenue. The techniques above serve these business needs. I summarize as: ramp-up, cost, sustained volume, and cold start.

Ramp-up

Ramp-up is the first challenge—materials approved, campaign created, budget ready, but will it ramp up? Not necessarily.

广告投放不起量都是怎么解决的? (How do people fix campaigns that won’t ramp up?) describes various “tricks” optimizers use. Generally: optimize creatives, relax targeting, increase bid and budget, and create more campaigns.

From media/platform perspective, what can help advertisers with ramp-up?

Ramp-up difficulty grows as ad candidates grow, which can be understood through exploration & exploitation (E&E). With stable DAU, a media platform has limited ad impressions; giving new campaigns more exposure squeezes mature campaigns. And whether a new campaign can sustain volume after ramping up is unknown—showing it is exploration.

A simple idea: reserve a fixed exploration quota for ramp-stage campaigns—a “green channel.” The next question is which campaigns get through. Equal opportunity isn’t optimal, since campaigns perform differently after ramp-up; better to model expected post-ramp performance and admit accordingly. Duplicate campaigns should also be filtered out, so that more advertisers get the exploration opportunity (a toy sketch follows).
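A toy sketch of that admission logic; `predict_value` and `dedupe_key` are hypothetical callables (a post-ramp value model and an advertiser/creative key), not a real API.

```python
def fill_exploration_quota(candidates, quota, predict_value, dedupe_key):
    """Rank ramp-stage campaigns by predicted post-ramp value, keep at most
    one campaign per dedupe key so duplicates don't crowd the green channel,
    and admit until the quota is used."""
    seen, admitted = set(), []
    for c in sorted(candidates, key=predict_value, reverse=True):
        key = dedupe_key(c)
        if key in seen:
            continue
        seen.add(key)
        admitted.append(c)
        if len(admitted) >= quota:
            break
    return admitted
```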

Cost

Most advertisers care about ROI (brand ads can be seen as long-term ROI). ROI products require sensitive payment data, so adoption takes time. Cost-control products are more common: assuming truthful bidding, the advertiser’s bid is treated as the cost target, and the platform keeps actual cost close to it.

In eCPM-ranked systems, the bid and the CTR/CVR estimates all need to be accurate for cost control.

The first factor is CTR/CVR estimation accuracy. The difficulty: training labels are 0/1, yet we need to estimate a rate. There is no absolute ground truth—training only approximates the positive/negative ratio in the samples, so any change to the sample distribution requires a correction in the loss or in the predictions. Moreover, by the law of large numbers, the sample mean approaches the expected value only with enough samples—but many campaigns have few clicks or conversions, so their CVR estimates are poor.

For a problem with no ground truth and limited applicability of the law of large numbers, additional strategies are needed—the most common is isotonic regression, calibrating against posterior data. One should also check whether the posterior data at the calibration granularity is sufficient.

The second factor is bidding. If the estimates were perfect, the advertiser’s bid would already be optimal—but they clearly aren’t. Hence controllers adjust bids to keep cost on target—bidding is the fallback for estimation errors.

Cold Start

Cold start is a persistent problem in both ads and recommendation—new users/campaigns lack history, models/strategies don’t learn well, performance suffers. In ads, new campaign cold start is more common and severe—DAU/MAU is stable, but advertisers keep creating campaigns.

Cold start worsens the problems above: estimation accuracy is harder to achieve and cost is harder to control. Solutions come from two angles: modeling and strategy.

Modeling: Many cold-start papers target embedding-based models. The basic idea: push a cold item’s embedding close to its warmed-up state. Two common approaches:

  1. Use meta-network to generate ID embedding for cold items

  2. MAML-based training for faster embedding convergence

First approach: Feed the item’s meta info (which even cold items have) through a small network to generate the embedding, and train it with auxiliary tasks. The tasks vary with business understanding—e.g., minimize the error between the meta-network’s output and the mature-stage embedding, or retrain on cold items (a sketch follows the references below).

See papers:

  • Warm Up Cold-start Advertisements
  • Learning to Warm Up Cold Item Embeddings
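A minimal sketch of the meta-network idea with an L2 auxiliary task that pulls the generated embedding toward the mature-stage embedding—one possible task among many; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MetaEmbeddingGenerator(nn.Module):
    """Generate an ID embedding for a cold item from side information that
    exists on day one (category, advertiser, creative features)."""
    def __init__(self, meta_dim, emb_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(meta_dim, hidden), nn.ReLU(), nn.Linear(hidden, emb_dim))

    def forward(self, meta_features):
        return self.net(meta_features)

def warmup_loss(generator, meta_features, mature_embeddings):
    # Pull generated embeddings toward the converged ("warm") embeddings.
    return ((generator(meta_features) - mature_embeddings) ** 2).mean()
```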

Second approach: Meta learning, i.e. “learning to learn”—MAML lets models learn quickly even with few samples. See:

  • Model-Agnostic Meta-Learning
  • MeLU

The methods above help a cold item’s ID embedding converge faster. Another option: add the cold item’s generalizable features—image and text passed through pretrained models (VGG, BERT)—as embeddings.

Strategy: Cold-start campaigns need extra measures—give them “green channel” exploration quota, as in ramp-up (ramp-up is essentially a cold start problem). This involves E&E (see EE 问题概述). The quota should be personalized based on expected performance—personalized allocation beats equal allocation. But this squeezes mature campaigns, so a trade-off between cold-start and mature campaigns is needed.

Beyond extra support, bidding for cold-start campaigns also needs thought, since there is no posterior data. Some options: borrow data from similar campaigns; trust the model and bid from the summed predictions; or relax cost requirements for cold-start campaigns at the product level.

In practice, advertisers also need education: cold start brings instability, and costs are more likely to overrun. If customer loyalty is low or the market offers competitive alternatives, the platform often bears these losses.

Sustained Volume

Sustained volume is the problem after campaigns pass cold start and enter mature stage. Advertisers pursue two things: cost and volume. With cost controlled, more volume is better. But reality: campaigns often lose volume soon after mature stage, often never recover—short campaign lifecycle.

Causes include:

  1. Advertisers frequently modify bid, targeting, and budget, disturbing stability.

  2. Inaccurate estimates or unstable bid control cause sudden volume spikes or drops.

  3. Fierce competition: impressions are limited while campaigns keep increasing, so some mature campaigns lose volume.

  4. Natural decay: the target population has mostly been exposed, or the creative naturally wears out.

This leads to advertisers creating duplicate campaigns for volume, worsening ramp-up, cold start, competition, and system load. So sustaining volume is critical for all ad systems.

Aside from hard-to-change factor #3, some ideas:

  • Advertisers should reduce frequent modifications—the platform should educate them. What frequency counts as “frequent”? How much should targeting be relaxed? A/B tests are needed to provide reference values.

  • Reduce system fluctuation—ensure infrastructure high availability, and be cautious with A/B experiments that might affect campaign stability.

  • Apply extra strategies for campaigns losing volume: identify them and determine strategy—similar to cold start support.

Summary from the business perspective: ramp-up, cold start, cost, and sustained volume are key advertiser concerns and directly measure platform value. They are less purely technical and require more thought about product forms and customer service.

Product Perspective

The product perspective is the advertiser’s perspective—the product forms the platform exposes to advertisers. These are generally divided into brand advertising and performance advertising. Here’s a brief overview of the basics.

Brand Advertising

Brand advertising is often the initial ad mode for a developing system. The advertiser pays first and the platform guarantees impressions—CPT (cost per time) and GD (guaranteed delivery) are typical. CPT is simple: a fixed ad slot at a specific position for a period of time. GD involves inventory estimation and allocation.

  • Inventory estimation: GD commits future impressions, so sold inventory can’t exceed future supply. Time series models are used to estimate it.

  • Inventory allocation: Construct a bipartite graph with inventory and GD plans as supply and demand, and allocate via algorithms like HWM and SHALE (a toy sketch follows).
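A toy HWM sketch under heavy simplification—serve the most contended contracts first, each taking one uniform serving fraction (its “water mark”) of its eligible remaining supply. Real HWM and SHALE handle feasibility and representativeness far more carefully; the data layout here is assumed for illustration.

```python
def hwm_allocate(supply, contracts):
    """`supply`: supply node -> remaining impressions. Each contract is a dict
    {"id": ..., "demand": ..., "eligible": set of supply nodes}."""
    def contention(c):
        total = sum(supply[s] for s in c["eligible"])
        return c["demand"] / max(total, 1)

    plan = {}
    for c in sorted(contracts, key=contention, reverse=True):
        avail = sum(supply[s] for s in c["eligible"])
        frac = min(1.0, c["demand"] / max(avail, 1))        # the water mark
        plan[c["id"]] = {s: frac * supply[s] for s in c["eligible"]}
        for s, x in plan[c["id"]].items():
            supply[s] -= x
    return plan
```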

The above addresses GD’s impression guarantee. With finer optimization, other issues arise:

  • GD and performance ads compete (share limited impressions)—need joint optimization for maximum revenue.

  • GD advertisers also care about effectiveness—can’t just give low-quality traffic. This becomes multi-constraint optimization.

Considering these, traditional allocation algorithms show their limits. Inventory estimation gives total impressions, but how much of it GD may take still needs a hard number or a CPM-based split. Also, allocation assigns impressions directly without judging their quality—we want neither to hand poor traffic to GD advertisers nor to squeeze performance bidding too much.

In summary, GD needs to consider impression guarantee, effectiveness, premium rate, and impact on performance ads—all must be modeled.

Performance Advertising

Performance advertising dominates the market—most advertisers care about ROI. The techniques above largely serve performance ads: cost control, volume, etc.

Deep Conversion Products

With deeper optimization, common performance ad objectives may not satisfy all advertisers. Some industries want to optimize directly to ROI, not just cost, and can provide data.

Two options: (1) Advertiser reports data to platform for conventional training; (2) Advertiser won’t share labels—use Federated Learning. I have practical experience with this—see 字节跳动联邦学习平台 Fedlearner.

After modeling, the result is typically applied via bidding, unless the advertiser directly provides a bid for the deep objective. This is often a multi-objective constrained optimization—see the bidding section above.

Summary

This article summarizes computational advertising knowledge from three perspectives:

  • Technical perspective: “Retrieval + ranking” structure; focus on modeling and bidding
  • Business perspective: Four problems—ramp-up, cost, sustained volume, cold start
  • Product perspective: Brand advertising, performance advertising, deep conversion advertising

The content is extensive and messy >_<, but only the tip of the iceberg in advertising systems. Due to confidentiality, many specific methods aren’t mentioned, but the ideas should be universally applicable. Feedback welcome.