inter-layer relationships simultaneously, but they still suf-
fer from poor graphic performance, such as a lack of lay-
out variety or spatial non-alignment. To this end, we pro-
pose a CNN-LSTM-based generative adversarial network
(GAN) conditioned by the input canvases to generate lay-
outs, which has a balanced performance on both graphic
and content-aware metrics.
CNN-LSTM has proved effective in time series forecasting
and behavior analysis tasks [6, 14]. To enable this
time-sensitive model in layout generation, we propose design
sequence formation (DSF) to generate design sequences that
imitate the design processes of human designers. In
particular, elements in layouts are reorganized to carry
implicit temporal features, and less important ones can be
discarded painlessly. This is in line with the logic of
human-computer interaction [5] and has the potential to help
train the LSTM model on a training set of size smaller than
20,000 [18]. A GAN is a generative model in which a
discriminator and a generator compete against each other to
learn the distribution of training data. In the proposed
design sequence GAN (DS-GAN), the discriminator is
design-sequence-aware and will supervise the "design" process,
i.e., generated layouts, of the generator under the constraints
of the given canvas. As far as we know, this paper is the first
to adopt CNN-LSTM in layout generation.
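For reference, the adversarial game described above is commonly written as the conditional GAN minimax objective. This is the textbook form and not necessarily the loss used by DS-GAN. The symbols x (a real design sequence), c (the canvas condition), and z (a noise vector) are notation assumed here for illustration.

```latex
\min_{G}\,\max_{D}\;
\mathbb{E}_{(x,\,c)\sim p_{\text{data}}}\bigl[\log D(x \mid c)\bigr]
+ \mathbb{E}_{z\sim p_{z},\; c\sim p_{\text{data}}}
  \bigl[\log\bigl(1 - D(G(z \mid c) \mid c)\bigr)\bigr]
```

The discriminator D raises the objective by scoring real sequences high and generated ones low, while the generator G lowers it by producing sequences that D accepts for the given canvas.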
Since content-aware visual-textual presentation layout
remains a novel task, there is only one public dataset in the
field, and it has insufficient variety. In this paper, we first
construct and release a new dataset and benchmark named
PKU PosterLayout, which consists of 9,974 poster-layout
pairs and 905 images, i.e., non-empty canvases. Each lay-
out is represented by a set of elements labeled with class
and bounding box. We collect data from multiple sources to
guarantee diversity and variety in content, domain, and lay-
out, supporting it as a challenging benchmark expected to
encourage further research. Besides the dataset, we propose
and define new metrics to complement the existing ones, for
a total of eight graphic and content-aware metrics. They
evaluate the layouts in terms of utilization, non-occlusion,
and aesthetics. Both quantitative and visualized results
show that the proposed approach outperforms other approaches
by generating proper layouts on diverse canvases.
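The layout representation described above (a set of elements, each labeled with a class and a bounding box) can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the names `Element`, `Box`, and `Layout`, the class labels, and the normalized (left, top, right, bottom) box convention are hypothetical and are not taken from the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: boxes are (left, top, right, bottom) in canvas-normalized
# coordinates in [0, 1]; the dataset's real storage format may differ.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Element:
    cls: str  # hypothetical class label, e.g. "text" or "logo"
    box: Box

    def area(self) -> float:
        """Area of the box as a fraction of the canvas area."""
        left, top, right, bottom = self.box
        return max(0.0, right - left) * max(0.0, bottom - top)

    def inside_canvas(self) -> bool:
        """True if the box is well-formed and lies within the canvas."""
        left, top, right, bottom = self.box
        return 0.0 <= left <= right <= 1.0 and 0.0 <= top <= bottom <= 1.0


# A layout is simply a collection of elements placed on one canvas.
Layout = List[Element]

example_layout: Layout = [
    Element("logo", (0.05, 0.05, 0.25, 0.15)),
    Element("text", (0.10, 0.20, 0.50, 0.40)),
]
```

Keeping the elements in a plain list leaves their order free, which is the property that a sequence-forming step such as DSF would then fix.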
We summarize the contributions of this paper as follows:
• A new and more challenging dataset and benchmark
for content-aware visual-textual presentation layout,
PKU PosterLayout, consists of 9,974 poster-layout
pairs and 905 images, with greater diversity and va-
riety in content, domain, and layout.
• An algorithm for design sequence formation (DSF)
converts plain layout data into design sequences in-
volving temporal features by imitating the design pro-
cess of human designers.
• A CNN-LSTM-based GAN, design sequence GAN
(DS-GAN), is conditioned by images and learns the
distribution of design sequences to generate content-aware
visual-textual presentation layouts. It achieves a good
trade-off between graphic and content-aware metrics and
outperforms the other approaches.
2. Related Work
Research on content-agnostic visual-textual presentation
layout, which assumes the given canvas is empty, has a
relatively long history. O’Donovan et al. [15] proposed an
energy-based model that penalizes the parts of a layout that
violate pre-defined, complex design principles and can thus
obtain a more desirable layout after non-linear inverse
optimization. The authors further presented a system [16]
adopting this model with simpler principles, such as the size
of elements and pair alignment, to alleviate the
time-consuming problem of heuristics.
Li et al. proposed LayoutGAN [12], taking a big step
forward in data-driven approaches by introducing GANs in
layout tasks. It adopted a differentiable wireframe rendering
layer that flattens layouts and canvases into wireframe
images, keeping the discrimination process an image
classification problem. It differed from a conventional GAN,
however, in starting from a random initial layout that is
primitively valid and modulating it into an eligible one
instead of
synthesizing layouts from fully random noise. The authors
further presented an attribute-conditioned LayoutGAN [13]
that guides the layout with the given element attributes, such
as minimum size, fixed aspect ratio, and reading order of
elements. Moreover, it applied element dropout in
the discrimination process, forcing the discriminator to be
aware of the local pattern of layouts, which is helpful in
visual-textual presentation layout. Besides the element
attributes, Zheng et al. [22] demonstrated the effectiveness
of considering the visual and textual semantics of the elements
and presentation topics. They proposed an embedding net-
work fusing cross-modal features to condition the GAN.
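The element dropout mentioned above for the attribute-conditioned LayoutGAN [13] can be illustrated with a short sketch. How [13] actually implements it is not stated here, so the function name `drop_elements`, the per-element drop probability, and the rule of always keeping at least one element are assumptions made for illustration only.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def drop_elements(layout: Sequence[T], drop_prob: float,
                  rng: random.Random) -> List[T]:
    """Return a copy of `layout` with each element independently removed
    with probability `drop_prob`.

    Assumption: at least one element is always kept so that the
    discriminator never receives an empty layout.
    """
    if not layout:
        return []
    kept = [el for el in layout if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(list(layout))]
    return kept


# Usage: a partial layout of this kind would be fed to the discriminator
# so that it must judge local patterns rather than only the full layout.
rng = random.Random(0)
partial = drop_elements(["logo", "title", "subtitle", "underlay"], 0.5, rng)
```

Passing an explicit `random.Random` instance keeps the dropout reproducible across runs, which helps when comparing discriminator behavior.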
Kikuchi et al. proposed LayoutGAN++ [9], demonstrating
an improvement in handling user-specific constraints by
optimizing layout in latent space. It dispensed with
wireframe images, following the finding that the rendering
layer is unstable with a dataset of limited size. Similarly,
Lee et al. [10] were concerned with user-specific constraints
and dealt with them using a graph neural network modeling
elements as nodes and their relationships as edges. Note
that these user-specific constraints are merely inter-layout
and insufficient for the task of interest in this paper.
Specifically, content-aware visual-textual presentation
layout concerns both inter-layout and inter-layer
relationships, i.e., layout and canvas, and is driven by the
canvas with no mandatory constraints attached. However,
the ideas behind these content-agnostic approaches are still