
TABLE I: DNN model performance on the Moments dataset when ID features are removed. Ad ID, campaign ID, advertiser ID, and product ID are removed one by one from V1 to V4; V5 further removes the other 9 ID features.

Model  New Ads  Old Ads  Overall
DNN    0.8175   0.8449   0.8436
V1     0.8206   0.8386   0.8375
V2     0.8233   0.8354   0.8332
V3     0.8267   0.8326   0.8322
V4     0.8269   0.8323   0.8315
V5     0.8293   0.8304   0.8299
TABLE II: DNN model performance on the KDD2012 Cup dataset when ID features are removed. Ad ID, description ID, and advertiser ID are removed one by one from V1 to V3. Removing more ID features steadily improves performance on new ads but causes a severe decrease on old ads and overall.

Model  New Ads  Old Ads  Overall
DNN    0.6924   0.7503   0.7414
V1     0.6942   0.7458   0.7372
V2     0.6941   0.7436   0.7350
V3     0.6963   0.7407   0.7327
is also challenging, since there are large numbers of small advertisers and niche products with narrow appeal.
Table I shows the performance of the base model (DNN [3], currently used on the Tencent platform) as the features mentioned above are gradually removed. We observe that performance increases on new ads but decreases severely on old ads, so the overall performance deteriorates. Similar results are reproduced on the public dataset, as shown in Table II. This demonstrates the duality of these features: beneficial for old ads but ineffectual for new ones. However, the distributions of some features are moderate, e.g., product category and industry category. Such features tend to have small cardinality, and the training samples for their tail elements are sufficient. We denote the former type of feature as fine-grained and the latter as coarse-grained. From this perspective, the cold-start issue can be characterized as the inability to provide effective embeddings for fine-grained features, thereby impairing the final condensed ad representation.
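To make the distinction concrete, the following minimal Python sketch partitions features into the two groups by embedding-table cardinality. The feature names, cardinalities, and threshold here are hypothetical illustrations, not values from our platform.

```python
# Illustrative sketch: split ad features into fine- and coarse-grained
# groups by cardinality. All numbers below are made-up examples.
FEATURE_CARDINALITIES = {
    "ad_id": 5_000_000,        # fine-grained: huge ID space, sparse tails
    "campaign_id": 800_000,
    "advertiser_id": 200_000,
    "product_id": 1_500_000,
    "product_category": 300,   # coarse-grained: small space, dense tails
    "industry_category": 120,
}

CARDINALITY_THRESHOLD = 10_000  # hypothetical cutoff

def partition_features(cardinalities, threshold=CARDINALITY_THRESHOLD):
    """Fine-grained features have large cardinality, so their embeddings
    are likely under-trained for new ads; coarse-grained ones do not."""
    fine = [f for f, c in cardinalities.items() if c > threshold]
    coarse = [f for f, c in cardinalities.items() if c <= threshold]
    return fine, coarse

fine_grained, coarse_grained = partition_features(FEATURE_CARDINALITIES)
print("fine-grained:", fine_grained)      # ad_id, campaign_id, ...
print("coarse-grained:", coarse_grained)  # product_category, industry_category
```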
Existing methods for CVR prediction employ techniques similar to those developed for click-through rate (CTR) prediction [4]–[6]. These methods aim to learn effective low- and high-order feature interactions and thus significantly improve performance. However, they treat all features equally, without distinction. As mentioned above, new ads tend to have unreliable embeddings for fine-grained features. Interacting with such features may not yield the expected benefit and can further impair performance on new ads. Thus, special attention should be paid to feature grouping in cold-start scenarios.
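To see why such interactions can hurt, consider a minimal FM-style sketch (a stand-in for the interaction layers of the methods above, not their exact formulation): the pairwise term is a dot product of embeddings, so a near-random embedding for a brand-new ad ID injects noise into every interaction it joins. All vectors below are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Well-trained coarse-grained embedding vs. a near-random fine-grained
# one, as for a brand-new ad ID. Values are synthetic for illustration.
category_emb = np.array([0.8, -0.3, 0.5, 0.1])
new_ad_id_emb = rng.normal(scale=1.0, size=4)   # effectively untrained
user_emb = np.array([0.6, 0.2, -0.4, 0.7])

# FM-style second-order terms: dot products between feature embeddings.
stable_term = category_emb @ user_emb   # informative interaction
noisy_term = new_ad_id_emb @ user_emb   # noise masquerading as signal

print(f"user x category : {stable_term:.3f}")
print(f"user x new ad ID: {noisy_term:.3f}  (arbitrary for cold ads)")
```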
Intuitively, as shown in Figure 2(a), disregarding the difference between feature types and applying a single model to all ads may not work well on new ads, despite its good performance on old ads. On the other hand, directly discarding the fine-grained features, as in Figure 2(b), can significantly improve performance on new ads, but sacrifices accuracy on old ads, since fine-grained features carry unique characteristics and are indispensable. Figure 2(c) combines the two models above and builds separate models for different ads, using the model in Figure 2(a) to predict CVR for old ads and that in Figure 2(b) for new ads. It performs much worse, since new ads have far less data and far more intrinsic noise, which makes complicated models hard to train. Moreover, decoupling suffers from unequal model update effectiveness, which results in a large precision gap between old and new ads. Meanwhile, maintaining multiple models simultaneously consumes large amounts of computation and storage resources, as well as human effort. Hence, a more powerful and efficient model is desired.
In a real production scenario, new ads tend not to be isolated, but instead implicitly exist in smaller groups and larger clusters. An advertiser may share the same creative features among its ads within one campaign. Ads within the same category often have overlapping features and collective commonalities. Moreover, statistics show that duplicated ads exist universally on our platform. These are just a few examples, and the intrinsic connections between ads are far more complicated than they appear. Traditional models view an ad as an isolated individual and ignore the potential of collective patterns. Although ad-specific representations contain ample information for old ads, we argue that new ads should exploit collective patterns to obtain more robust and general representations. Since different new ads share some identical feature values, these values naturally divide new ads into groups. The model can then extract informative group-level representations for new ads with few conversions. In this way, the cold-start issue can be effectively alleviated.
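As a toy illustration of this natural grouping, the sketch below buckets hypothetical new ads by their shared coarse-grained feature values; the records and field names are made-up examples, not platform data.

```python
from collections import defaultdict

# Hypothetical new-ad records; only coarse-grained fields form the key.
new_ads = [
    {"ad_id": 101, "product_category": "games",  "industry_category": "apps"},
    {"ad_id": 102, "product_category": "games",  "industry_category": "apps"},
    {"ad_id": 103, "product_category": "retail", "industry_category": "ecom"},
]

groups = defaultdict(list)
for ad in new_ads:
    key = (ad["product_category"], ad["industry_category"])
    groups[key].append(ad["ad_id"])

# Ads 101 and 102 share a group, so sparse per-ad signals can be pooled:
# {('games', 'apps'): [101, 102], ('retail', 'ecom'): [103]}
print(dict(groups))
```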
To this end, we propose the Automatic Fusion Network (AutoFuse) to simultaneously learn multiple levels of representation from different feature groups. A simplified architecture of AutoFuse is illustrated in Figure 2(d). For each ad, we first learn an ad-level representation from all features to depict its unique individual character. Then, we discard the fine-grained features and use only the coarse-grained features to learn a group-level representation that portrays collective information from a broad, abstract perspective. Finally, through dynamic fusion, we integrate these two levels of representation to obtain a robust and general representation for the ad. The dynamic fusion module combines the ad-level and group-level representations selectively according to each ad's popularity feature, thereby producing a synergy between ad- and group-level learning.
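As one plausible instantiation of the dynamic fusion idea, the PyTorch sketch below blends the two representations with a popularity-conditioned sigmoid gate. The gate form, dimensions, and popularity encoding are illustrative assumptions, not the exact specification of our module.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Minimal sketch: gate ad- and group-level representations by a
    popularity feature. Hypothetical, for illustration only."""
    def __init__(self, repr_dim: int, popularity_dim: int):
        super().__init__()
        # Gate network: maps the popularity feature to a per-dimension
        # mixing weight in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(popularity_dim, repr_dim),
            nn.Sigmoid(),
        )

    def forward(self, ad_repr, group_repr, popularity):
        g = self.gate(popularity)  # (batch, repr_dim)
        # Popular (old) ads can lean on the ad-level representation;
        # cold ads fall back toward the group-level one.
        return g * ad_repr + (1.0 - g) * group_repr

# Usage with random tensors:
fusion = DynamicFusion(repr_dim=64, popularity_dim=8)
fused = fusion(torch.randn(32, 64), torch.randn(32, 64), torch.randn(32, 8))
print(fused.shape)  # torch.Size([32, 64])
```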
To evaluate the proposed method, we conduct extensive experiments on public datasets and on real-world industrial conversion datasets sampled from the Tencent ad platform. Experimental results demonstrate that AutoFuse consistently outperforms state-of-the-art baselines across all datasets, showing significant improvement not only on old ads but especially on new ads.
The main contributions of this work are summarized as
follows: