1 INTRODUCTION
Across the last several years, more and more enterprises have rapidly migrated cloud-native applications to cloud-native infrastructures [7], in the form of microservices [5], serverless and container [25, 32] technologies. With a vast number of applications running on hundreds to thousands of machines, this distributed architecture is highly complex yet extremely fragile [41], prone to interruptions due to failures [19], and can even lead to partial paralysis of the Internet [29]. Therefore, monitoring is crucial for checking the operational status of applications. It not only requires issuing alerts when failures occur but also demands early detection of bugs and issues hidden in the development environment that may be exposed in the production environment, aiming to prevent system interruptions [29]. However, compared to previous architectures, the unique characteristics of cloud-native architecture (e.g., intrusiveness, resilience, reliability, etc. [1]) make traditional monitoring solutions and strategies inadequate for monitoring tasks [16, 35].
In recent years, observability, as an extension of monitoring, has become an indispensable feature of cloud-native environments [27]. Logs, metrics, and traces, known as the three pillars of observability, are the raw data needed to obtain an internal view of the health and behavior of applications and microservices [26]. Logs, as a crucial data source for monitoring and observability, capture the details of each request and can be used for debugging [37], root cause analysis [41], exploratory troubleshooting [14], and other applications, making them indispensable for any production-grade system [29].
In any production-grade system, the volume of logs increases significantly over time and with business growth. Building a low-cost log engine for an observability platform is therefore a demanding task. We summarize the challenges we encountered in our production environment as follows:
Challenge 1: Heavy and Skewed Log Writes. The hundreds or thousands of microservices and programs running on cloud-native infrastructures generate a large volume of logs every day, with log generation concentrated in time and subject to frequent bursts of write demand. For example, in our production environment, many users generate several hundred terabytes of logs daily, and the total volume of logs produced each day continues to increase with business growth. Within a day, log writes are mainly concentrated within a few hours. Therefore, the ability to store and rapidly write such massive log data at low cost is crucial.
Challenge 2: Low-Frequency and Heavy Log Queries. Compared to write operations, the frequency of log queries is much lower, and the majority of logs will never be queried. However, executing precise queries over such a vast volume of data within an acceptable latency (ranging from hundreds of milliseconds to a few seconds) is undeniably challenging. Moreover, many queries involve a wide time span, often ranging from a day to a week, and sometimes even longer, up to a month or more. Therefore, establishing reasonable data partitioning and designing efficient and practical indexes and caches are essential.
Challenge 3: Various Log Queries and Important Log Aggregations. In addition to basic full-text queries, a log engine needs to support several other crucial types of queries to meet the requirements of monitoring and observability. AND queries are essential for filtering relevant events or operations that meet multiple conditions, providing a more comprehensive context. Additionally, prefix fuzzy queries can be employed to quickly locate or filter logs related to services or components with specific prefixes, facilitating further analysis and issue resolution. Log aggregation is crucial for identifying trends and helping users recognize bottlenecks, performance issues, or even network threats based on data collected over a period. However, histogram queries for high-frequency words suffer from significant time and resource consumption, limiting their capability for rapid trend analysis. Therefore, designing a system that can optimize various queries and efficiently index data is paramount.
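To make these query types concrete, the sketch below (Python, purely illustrative; the record fields, values, and in-memory evaluation are our own assumptions, not ESTELLE's API) expresses an AND query, a prefix query, and a histogram aggregation over a handful of log records.

    # Illustrative only: the query types from Challenge 3, evaluated over a tiny
    # in-memory log set. Record fields and values are hypothetical examples.
    from collections import Counter
    from datetime import datetime, timedelta

    logs = [
        {"ts": datetime(2024, 1, 1, 10, m), "service": "checkout-api",
         "level": "ERROR", "msg": "timeout calling payment"}
        for m in (1, 2, 2, 30, 31)
    ] + [
        {"ts": datetime(2024, 1, 1, 10, 5), "service": "search-api",
         "level": "INFO", "msg": "query served"},
    ]

    # AND query: every condition must hold, narrowing results to one context.
    and_hits = [r for r in logs if r["level"] == "ERROR" and "timeout" in r["msg"]]

    # Prefix fuzzy query: locate logs from services sharing a common prefix.
    prefix_hits = [r for r in logs if r["service"].startswith("checkout")]

    # Histogram aggregation: bucket matching logs by time to expose trends.
    bucket = timedelta(minutes=10)
    hist = Counter((r["ts"] - datetime(2024, 1, 1)) // bucket for r in and_hits)

    print(len(and_hits), len(prefix_hits), dict(hist))

In a production engine these predicates must be answered from indexes rather than by scanning raw records, which is why the index design discussed below is critical.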
Challenge 4: Low-Cost Log Engine System. Because log volume grows with time and with the business, low cost is a necessary requirement for a log engine, and an indispensable part of a low-cost observability platform. Here, low cost refers to efficient writes, storage, and queries of massive log data with fewer resources within an acceptable time frame, where resources primarily include CPU, memory, I/O, etc. Therefore, utilizing low-cost storage for massive logs and designing a dedicated cost-controllable index framework for this specific scenario is both necessary and practical.
However, no existing log engine meets all of the above requirements. Among existing log engines, some choose to have no index at all [9, 18, 23], while others build inverted indexes in real time as logs are written [2, 3, 7, 9, 40]. Specifically, SLS [9] offers two modes: one with no index and another utilizing inverted indexes. Additionally, ClickHouse [39] offers an index-free architecture and uses the standard Bloom filter [6] as the index. Having no index at all allows the log engine to write logs quickly but sacrifices support for efficient queries. Constructing an inverted index of a size comparable to the data size during log writing can lead to slow writing speeds and high storage costs. An index-free architecture with Bloom filters as the log index supports log queries with minimal impact on log writing speed. However, the standard Bloom filter is not suitable for a low-cost log engine. When using the standard Bloom filter for word filtering, fetching all the Bloom filters into memory at once would incur significant I/O overhead. On the other hand, if only the word-related bits from all the Bloom filters are retrieved into memory, the storage medium needs to have efficient random-access capability. Neither of these approaches aligns with our definition of low cost. Furthermore, none of the aforementioned log engines optimize for various critical queries, especially histogram queries for high-frequency words.
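As a concrete illustration of this trade-off, the following minimal Bloom filter sketch (Python; the class, hashing scheme, and sizes are assumptions for exposition, not ClickHouse's or ESTELLE's actual structures) shows why a membership check touches k scattered bits per filter, so a query over many log blocks must either load every block's whole filter or issue many small random reads.

    # A minimal standard Bloom filter; sizes and hashing are illustrative only.
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=8192, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray((num_bits + 7) // 8)

        def _positions(self, word):
            # Double hashing: derive k bit positions from one digest of the word.
            d = hashlib.sha256(word.encode()).digest()
            h1 = int.from_bytes(d[:8], "little")
            h2 = int.from_bytes(d[8:16], "little") | 1
            return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

        def add(self, word):
            for p in self._positions(word):
                self.bits[p // 8] |= 1 << (p % 8)

        def contains(self, word):
            # k scattered bit reads per filter: cheap in RAM, costly as random I/O.
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(word))

    # One filter per log block: a word query over a long time range must consult
    # every block's filter, either by fetching whole filters (heavy I/O) or by
    # reading only the k word-related bits (requires fast random access).
    block_filters = [BloomFilter() for _ in range(4)]
    block_filters[2].add("error")
    print([i for i, f in enumerate(block_filters) if f.contains("error")])

Both access patterns conflict with cheap object storage, which is what motivates a cost-controllable index framework.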
In this paper, we propose a cost-effective cloud-native log engine, called ESTELLE, equipped with a low-cost pluggable log index framework to address the challenges mentioned above. To address heavy and skewed log writes, we introduce object storage to enable low-cost storage of logs and their indexes. We apply a cloud-native architecture with storage-compute separation to support linear scaling of write capacity, and we carefully design an approximately lock-free log writing process. To handle low-frequency and heavy log queries, we adopt a dual time filtering strategy, implement multiple caches, and introduce an efficient indexing framework. To address various log queries and important log aggregations, we
configure an index set with multiple pluggable components for these queries and aggregations.