nsDB: Architecting the Next Generation Database by Integrating Neural and Symbolic Systems

Ye Yuan

Beijing Institute of Technology

Beijing, China

yuan-ye@bit.edu.cn

Bo Tang

Southern Univ. of Sci. and Tech.

Shenzhen, China

tangb3@sustech.edu.cn

Tianfei Zhou

Beijing Institute of Technology

Beijing, China

ztfei.debug@gmail.com

Zhiwei Zhang

Beijing Institute of Technology

Beijing, China

zwzhang@bit.edu.cn

Jianbin Qin

Shenzhen University

Shenzhen, China

qinjianbin@szu.edu.cn

ABSTRACT

In this paper, we propose nsDB, a novel neuro-symbolic database system that natively integrates neural and symbolic system architectures to address the weaknesses of each, providing a strong database capable of data management, model learning, and complex analytical query processing over multi-modal data. We use a real-world NBA analytical query as an example to illustrate the functionality of each component in nsDB and to highlight the research challenges in building it. We then present the key design principles and our preliminary attempts to address these challenges.

In a nutshell, we envision that the next generation database system nsDB integrates the complex neural system with the simple symbolic system. nsDB will serve as a bridge between databases and AI models, abstracting away the complexities of AI while allowing end users to enjoy its strong capabilities. We are in the early stages of the journey to build nsDB, and many open challenges remain, e.g., in-database model training, multi-objective query optimization, and database agent development. We hope that researchers from different communities (e.g., systems, architecture, databases, artificial intelligence) can tackle them together.

PVLDB Reference Format:

Ye Yuan, Bo Tang, Tianfei Zhou, Zhiwei Zhang, and Jianbin Qin. nsDB: Architecting the Next Generation Database by Integrating Neural and Symbolic Systems. PVLDB, 17(11): 3283-3289, 2024.

doi:10.14778/3681954.3682000

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.

Proceedings of the VLDB Endowment, Vol. 17, No. 11 ISSN 2150-8097.

1 INTRODUCTION

On the one hand, both traditional relational database systems (e.g., PostgreSQL [ ], MySQL [ ]) and modern big data systems (e.g., Spark [ ], Flink [ ], Hive [ ]) employ a symbolic system (a.k.a. algebraic computation [ ]) as the building block of their architecture. In particular, they translate complex data processing into exact computation over expressions whose variables are manipulated as symbols, i.e., relational algebra. The major advantage of a symbolic system is that its computation is exact and its computation procedure is step-by-step and explicit. On the other hand, both machine learning and deep learning in the field of artificial intelligence use mathematical models (a.k.a. the neuro system [ ]) to learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. The representative mathematical models are statistical algorithms and artificial neural networks. In recent years, the neuro system has attracted huge attention owing to its success in natural language processing (NLP), computer vision (CV), speech recognition, etc. The most important properties of the neuro system are that it is intuitive and unconscious.

In recent years, many applications have emerged in various domains [ ] that cannot be processed efficiently by either a symbolic-based data management system or a neuro-based artificial intelligence system on its own.

[Figure 1: User query in video database. The example query "Find clips of LeBron James dunking from the Los Angeles Lakers' regular season videos where he scored at least 30 points in those games" is posed over the NBA statistics table S and the NBA game video V, and returns the matching clips as the query result.]

Example. Consider the example illustrated in Figure 1. The data analysts in the NBA marketing team want to advertise the NBA All-Star Game by promoting the NBA superstar "LeBron James" [ ]. Hence, they want to find the clips in the NBA data repository in which LeBron James is dunking, restricted to games where his team is the "Los Angeles Lakers" and he scored at least 30 points. This query is not trivial to answer with either symbolic-based databases or neuro-based AI systems alone, as it includes two fundamental tasks: (i) identifying the specific frames in the video database, i.e., the frames in which LeBron James is dunking; and (ii) finding all such frames in a large video database under attribute constraints, e.g., he scored at least 30 points and played for the Los Angeles Lakers.
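To make the two tasks concrete, the following is a minimal sketch of how they could be combined by hand, assuming a toy statistics table S and a toy clip table V. All names (`GameStat`, `Clip`, `is_dunk`, `find_dunk_clips`) and all data values are hypothetical illustrations, not part of nsDB, and the neural model is replaced by a stub that reads a precomputed label.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GameStat:
    """One row of a hypothetical statistics table S."""
    game_id: int
    player: str
    team: str
    points: int


@dataclass(frozen=True)
class Clip:
    """One row of a hypothetical video table V; `label` stands in for raw frames."""
    game_id: int
    clip_id: int
    label: str


def is_dunk(clip: Clip, player: str) -> bool:
    """Stub for task (i); a real system would run an action-recognition model here."""
    return clip.label == f"{player}:dunk"


def find_dunk_clips(stats: List[GameStat], clips: List[Clip],
                    player: str, team: str, min_points: int) -> List[Clip]:
    # Task (ii): exact, symbolic filtering on the attribute constraints.
    game_ids = {s.game_id for s in stats
                if s.player == player and s.team == team and s.points >= min_points}
    # Task (i): the (stubbed) neural predicate, applied only to clips of qualifying games.
    return [c for c in clips if c.game_id in game_ids and is_dunk(c, player)]


# Made-up toy data, for illustration only.
stats = [
    GameStat(1, "LeBron James", "Los Angeles Lakers", 35),
    GameStat(2, "LeBron James", "Los Angeles Lakers", 22),
    GameStat(3, "LeBron James", "Cleveland Cavaliers", 40),
]
clips = [
    Clip(1, 10, "LeBron James:dunk"),
    Clip(1, 11, "LeBron James:layup"),
    Clip(2, 20, "LeBron James:dunk"),
    Clip(3, 30, "LeBron James:dunk"),
]
result = find_dunk_clips(stats, clips, "LeBron James", "Los Angeles Lakers", 30)
```

In this sketch only clip 10 qualifies: the clips of games 2 and 3 are discarded by the symbolic filter before the model stub is ever invoked.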

A straightforward idea to address the above query is to combine the abilities of the symbolic system and the neuro system. In the literature, integrating ML tasks into database systems has been studied since the beginning of the 2000s [ ]. Many techniques have been proposed over the years in both academia [ ] and industry [ ]. In particular, these system architectures can be classified into three categories: AI-centric, UDF-centric, and relation-centric. Zhou et al. [ ] proposed a novel RDBMS that seamlessly integrates these three architectures. However, none of these existing solutions can natively process the analytical query in the above example. The core reason is that they ignore the result accuracy of the AI models, as they assume that the models used are given and well-trained. For example, when processing the above query, none of them takes the accuracy of different dunking action recognition models into account.

The nsDB vision. To overcome the limitations of existing solutions, in this work we envision a novel type of neuro-symbolic database system, nsDB, to process these newly emerged queries. It integrates neural and symbolic systems to address the weaknesses of each, providing a strong database capable of data management, model learning, and complex, multi-modal analytical query processing. Specifically, nsDB abstracts away the complexities of AI models and allows end users to build AI projects and use them for their individual upstream applications, even if they have no coding skills, AI expertise, or system development experience. To achieve that, the neuro system is abstracted as a natively supported module in nsDB, and result accuracy and processing latency are considered simultaneously during query optimization. However, this goal is not trivial to achieve, as the implicit nature of the neuro system inherently forces a trade-off between accuracy and performance. For example, the more accurate the dunk action detection model, the higher its inference latency.
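One simple way to picture this trade-off is an optimizer that, among several candidate models for the same task, picks the fastest one that still meets an accuracy requirement. The sketch below assumes hypothetical model profiles (names, accuracies, and latencies are invented for illustration) and is not the optimizer proposed for nsDB.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical profile of one candidate model for a task."""
    name: str
    accuracy: float    # measured on some validation set, in [0, 1]
    latency_ms: float  # average inference latency per input


def pick_model(candidates: Sequence[ModelProfile],
               min_accuracy: float) -> Optional[ModelProfile]:
    """Return the lowest-latency model whose accuracy meets the floor, if any."""
    feasible = [m for m in candidates if m.accuracy >= min_accuracy]
    return min(feasible, key=lambda m: m.latency_ms, default=None)


# Invented numbers: more accurate models are assumed to be slower.
dunk_models = [
    ModelProfile("small", accuracy=0.80, latency_ms=5.0),
    ModelProfile("medium", accuracy=0.90, latency_ms=20.0),
    ModelProfile("large", accuracy=0.95, latency_ms=80.0),
]
chosen = pick_model(dunk_models, min_accuracy=0.85)
```

With an accuracy floor of 0.85 the sketch selects the "medium" profile; raising the floor above 0.95 leaves no feasible model and the function returns `None`.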

The rest of the paper is organized as follows. We briefly analyze the unique aspects of nsDB within the context of extensive ongoing work in Section 2. In Section 3, we first introduce the system architecture of nsDB, then highlight the research challenges of each component, and finally present our design paradigms and preliminary ideas to address them. We discuss the generality of nsDB in Section 4 and conclude this vision paper in Section 5.

2 RELATED WORK

In this section, we differentiate our proposal nsDB from the most relevant systems and techniques in the literature.

Neuro-Symbolic database system. Numerous studies [ – ] have sought to integrate DB and AI workloads in both academia and industry since 2000. The architectures of existing solutions can be classified into three representative categories: (i) AI-centric, (ii) UDF-centric, and (iii) relation-centric. To overcome the limitations of the solutions in each category, a mixed solution was proposed [ ] that integrates the above three architecture categories. All these solutions (including our nsDB) provide AI model inference for various analytical tasks. However, none of the existing solutions has emerged as the de facto standard so far. The major reasons can be summarized in three aspects: (i) model training, (ii) performance goal, and (iii) optimization strategy, as shown in Table 1.

Table 1: Comparison of AI and DB integration solutions

Architecture category   Model training   Performance goal   Optimization strategy
AI-centric [33]         Yes              Latency-only       Symbolic
UDF-centric [35]        No               Latency-only       Symbolic
Rel.-centric [55]       No               Latency-only       Symbolic
Mixed sol. [57]         No               Latency-only       Neuro-symbolic
Our nsDB                Yes              Latency-accuracy   Neuro-symbolic

(I) Model training: Almost all existing AI and DB integrated systems assume that the underlying AI models are well-trained, and the integrated systems are designed for efficient model inference. However, model training cannot be ignored in real-world applications. To make matters worse, model training is not trivial to support in the above integrated systems, as they are designed to provide excellent model inference performance. Existing systems in the AI-centric category (e.g., Google BigQuery [ ], Amazon Redshift ML [ ]) train these models by offloading the work to underlying DL systems (e.g., PyTorch, TensorFlow). This is not efficient, as the training data must be prepared in advance and training relies on other systems.

(II) Performance goal: The performance goal of existing AI and DB integrated systems is only the query processing latency, because the underlying AI models are assumed to be well-trained, which means their accuracy is fixed. However, the same task can be processed by multiple AI models, and different models achieve different result accuracy on the same task. For example, ArcFace [ ], FaceNet [ ], and EigenFace [ ] are typical models for the face recognition task, and their result accuracy differs.
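When several models can serve the same task, only those that are not beaten on both accuracy and latency by some other model are worth considering. The sketch below computes this Pareto front over candidate models; the model names and numbers are placeholders invented for illustration and do not describe ArcFace, FaceNet, EigenFace, or any real model.

```python
from typing import List, Tuple

# (name, accuracy in [0, 1], latency in ms); all values are made up.
Model = Tuple[str, float, float]


def pareto_front(models: List[Model]) -> List[str]:
    """Return the names of models not dominated in (accuracy, latency)."""
    front = []
    for name, acc, lat in models:
        # A model is dominated if another one is at least as good on both
        # criteria and strictly better on at least one of them.
        dominated = any(
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for _, a, l in models
        )
        if not dominated:
            front.append(name)
    return front


candidates = [
    ("model_a", 0.99, 40.0),
    ("model_b", 0.97, 15.0),
    ("model_c", 0.95, 25.0),  # slower and less accurate than model_b
    ("model_d", 0.85, 2.0),
]
front = pareto_front(candidates)
```

Here `model_c` drops out because `model_b` is both faster and more accurate, while the remaining three each represent a different point on the latency-accuracy trade-off.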

(III) Optimization strategy: Existing systems [ – ] with an AI-centric architecture offload the inference computation to decoupled AI runtimes. Thus, their query optimizers only use symbolic rules to process the predicates in the analytical query and ignore the optimization of the complex neuro-based computations. The UDF-centric systems [ ] use UDFs to model the neuro-based computations and apply symbolic-based optimization strategies to the UDF-based logical plan. The relation-centric systems [ ] employ relations to represent the model parameter tensors, extend traditional relational algebra to tensor relational algebra, and optimize queries in a holistic symbolic manner. A simple co-optimization idea (i.e., devising novel query transformation rules) for the symbolic and neuro operators in complex analytical queries has been proposed in [ ]. However, it cannot achieve the goals of low latency and high accuracy simultaneously.
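As an illustration of what a purely symbolic, latency-only optimization looks like when a neural predicate is treated as an expensive UDF, the sketch below applies the classical rank-based ordering of independent predicates, sorting by (selectivity - 1) / cost. This is a textbook technique shown under assumed costs and selectivities; it is not claimed to be the strategy of any system cited above, and the predicate names and numbers are hypothetical.

```python
from typing import List, NamedTuple


class Predicate(NamedTuple):
    name: str
    cost: float         # per-tuple evaluation cost, arbitrary units
    selectivity: float  # fraction of input tuples that pass


def expected_cost(order: List[Predicate]) -> float:
    """Expected per-tuple cost of evaluating independent predicates in sequence."""
    total, passing = 0.0, 1.0
    for p in order:
        total += passing * p.cost   # only tuples that survived so far are evaluated
        passing *= p.selectivity
    return total


def order_by_rank(preds: List[Predicate]) -> List[Predicate]:
    """Sort by rank = (selectivity - 1) / cost, ascending."""
    return sorted(preds, key=lambda p: (p.selectivity - 1.0) / p.cost)


# Assumed numbers: a cheap attribute comparison and a costly model invocation.
symbolic = Predicate("points >= 30", cost=0.001, selectivity=0.1)
neural = Predicate("is_dunk(frame)", cost=50.0, selectivity=0.05)

plan = order_by_rank([neural, symbolic])
```

Under these assumptions the cheap symbolic predicate is placed first, so the model runs on only a tenth of the tuples. Note that the cost model contains no accuracy term at all, which is the limitation of latency-only optimization that the text points out.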

In this work, we envision the next generation database system nsDB, which provides in-database model training and a novel neuro-symbolic query optimizer that co-optimizes the processing latency and result accuracy of complex analytical queries. The last row of Table 1 shows the unique aspects of nsDB w.r.t. the existing DB and AI integrated systems.

Query processing over multi-modal data. Processing complex queries over multi-modal data has been an active research topic [ ] in the database community in recent years. The general idea of these approaches is to decompose the complex query into several subqueries and execute them on different systems [ ]. Our nsDB differs from them in two ways: (i) it integrates both symbolic and neural operators to process the different tasks in the complex query over multi-modal data,