A Relational Model of Data for Large Shared Data Banks.pdf

chirpyli

123

10页

3次

2022-11-07

免费下载

A Relational Model of Data for Large Shared Data Banks

E. F. Codd

∗

IBM Research Laboratory, San Jose, California

ABSTRACT

Future users of large data banks must be protected from

having to know how the data is organized in the machine

(the internal representation). A prompting service which

supplies such information is not a satisfactory solution. Ac-

tivities of users at terminals and most application programs

should remain unaﬀected when the internal representation of

data is changed and even when some aspects of the external

representation are changed. Changes in data representation

will often be needed as a result of changes in query, update,

and report traﬃc and natural growth in the types of stored

information.

Existing noninferential, formatted data systems provide

users with tree-structured ﬁles or slightly more general net-

work models of the data. In Section 1, inadequacies of these

models are discussed. A model based on n-ary relations,

a normal form for data base relations, and the concept of

a universal data sublanguage are introduced. In Section 2,

certain operations on relations (other than logical inference)

are discussed and applied to the problems of redundancy

and consistency in the user’s model.

1. RELATIONAL MODEL AND NORMAL

FORM

1.1 Introduction

This paper is concerned with the application of elementary

relation theory to systems which provide shared access to

large banks of formatted data. Except for a paper by Childs

[1], the principal application of relations to data systems has

been to deductive question-answering systems. Levein and

Maron [2] provide numerous references to work in this area.

In contrast, the problems treated here are those of data in-

dependence—the independence of application programs and

terminal activities from growth in data types and changes

in data representation—and certain kinds of data inconsis-

tency which are expected to become troublesome even in

nondeductive systems.

The relational view (or model) of data described in Sec-

tion 1 appears to be superior in several respects to the graph

or network model [3, 4] presently in vogue for noninferential

systems. It provides a means of describing data with its nat-

ural structure only—that is, without superimposing any ad-

ditional structure for machine representation purposes. Ac-

cordingly, it provides a basis for a high level data language

∗

E. F. Codd. 1970. A relational model of data

for large shared data banks. Commun. ACM 13,

6 (June 1970), 377-387. DOI=10.1145/362384.362685

http://doi.acm.org/10.1145/362384.362685

which will yield maximal independence between programs

on the one hand and machine representation and organiza-

tion of data on the other.

A further advantage of the relational view is that it forms

a sound basis for treating derivability, redundancy, and con-

sistency of relations—these are discussed in Section 2. The

network model, on the other hand, has spawned a number of

confusions, not the least of which is mistaking the derivation

of connections for the derivation of relations (see remarks in

Section 2 on the “connection trap”).

Finally, the relational view permits a clearer evaluation of

the scope and logical limitations of present formatted data

systems, and also the relative merits (from a logical stand-

point) of competing representations of data within a single

system. Examples of this clearer perspective are cited in

various parts of this paper. Implementations of systems to

support the relational model are not discussed.

1.2 Data Dependencies in Present Systems

The provision of data description tables in recently de-

veloped information systems represents a major advance to-

ward the goal of data independence [5, 6, 7]. Such tables

facilitate changing certain characteristics of the data rep-

resentation stored in a data bank. However, the variety

of data representation characteristics which can be changed

without logically impairing some application programs is still

quite limited. Further, the model of data with which users

interact is still cluttered with representational properties,

particularly in regard to the representation of collections of

data (as opposed to individual items). Three of the principal

kinds of data dependencies which still need to be removed

are: ordering dependence, indexing dependence, and access

path dependence. In some systems these dependencies are

not clearly separable from one another.

1.2.1 Ordering Dependence

Elements of data in a data bank may be stored in a va-

riety of ways, some involving no concern for ordering, some

permitting each element to participate in one ordering only,

others permitting each element to participate in several or-

derings. Let us consider those existing systems which either

require or permit data elements to be stored in at least one

total ordering which is closely associated with the hardware-

determined ordering of addresses. For example, the records

of a ﬁle concerning parts might be stored in ascending order

by part serial number. Such systems normally permit appli-

cation programs to assume that the order of presentation of

records from such a ﬁle is identical to (or is a subordering

of) the stored ordering. Those application programs which

take advantage of the stored ordering of a ﬁle are likely to

fail to operate correctly if for some reason it becomes nec-

essary to replace that ordering by a diﬀerent one. Similar

remarks hold for a stored ordering implemented by means

of pointers.

It is unnecessary to single out any system as an exam-

ple, because all the well-known information systems that

are marketed today fail to make a clear distinction between

order of presentation on the one hand and stored ordering

on the other. Signiﬁcant implementation problems must be

solved to provide this kind of independence.

1.2.2 Indexing Dependence

In the context of formatted data, an index is usually

thought of as a purely performance-oriented component of

the data representation. It tends to improve response to

queries and updates and, at the same time, slow down re-

sponse to insertions and deletions. From an informational

standpoint, an index is a redundant component of the data

representation. If a system uses indices at all and if it is to

perform well in an environment with changing patterns of

activity on the data bank, an ability to create and destroy

indices from time to time will probably be necessary. The

question then arises: Can application programs and termi-

nal activities remain invariant as indices come and go?

Present formatted data systems take widely diﬀerent ap-

proaches to indexing. TDMS [7] unconditionally provides

indexing on all attributes. The presently released version

of IMS [5] provides the user with a choice for each ﬁle: a

choice between no indexing at all (the hierarchic sequential

organization) or indexing on the primary key only (the hi-

erarchic indexed sequential organization). In neither case is

the user’s application logic dependent on the existence of the

unconditionally provided indices. IDS [8], however, permits

the ﬁle designers to select attributes to be indexed and to

incorporate indices into the ﬁle structure by means of ad-

ditional chains. Application programs taking advantage of

the performance beneﬁt of these indexing chains must refer

to those chains by name. Such programs do not operate

correctly if these chains are later removed.

1.2.3 Access Path Dependence

Many of the existing formatted data systems provide users

with tree-structured ﬁles or slightly more general network

models of the data. Application programs developed to work

with these systems tend to be logically impaired if the trees

or networks are changed in structure. A simple example

follows.

Suppose the data bank contains information about parts

and projects. For each part, the part number, part name,

part description, quantity-on-hand, and quantity-on-order

are recorded. For each project, the project number, project

name, project description are recorded. Whenever a project

makes use of a certain part, the quantity of that part com-

mitted to the given project is also recorded. Suppose that

the system requires the user or ﬁle designer to declare or

deﬁne the data in terms of tree structures. Then, any one

of the hierarchical structures may be adopted for the infor-

mation mentioned above (see Structures l–5).

Now, consider the problem of printing out the part num-

ber, part name, and quantity committed for every part used

in the project whose project name is “alpha.” The follow-

ing observations may be made regardless of which available

tree-oriented information system is selected to tackle this

problem. If a program P is developed for this problem as-

suming one of the ﬁve structures above—that is, P makes

Structure 1. Projects Subordinate to Parts

File Segment Fields

F PART part #

part name

part description

quantity-on-hand

quantity-on-order

PROJECT project #

project name

project description

quantity committed

Structure 2. Parts Subordinate to Projects

File Segment Fields

F PROJECT project #

project name

project description

PART part #

part name

part description

quantity-on-hand

quantity-on-order

quantity committed

Structure 3. Parts and Projects as Peers

Commitment Relationship Subordinate to Projects

File Segment Fields

F PART part #

part name

part description

quantity-on-hand

quantity-on-order

G PROJECT project #

project name

project description

PART part #

quantity committed

Structure 4. Parts and Projects as Peers

Commitment Relationship Subordinate to Parts

File Segment Fields

F PART part #

part description

quantity-on-hand

quantity-on-order

PROJECT project #

quantity committed

G PROJECT project #

project name

project description

Structure 5. Parts, Projects, and Commitment

Relationship as Peers

File Segment Fields

F PART part #

part name

part description

quantity-on-hand

quantity-on-order

G PROJECT project #

project name

project description

H COMMIT part #

project #

quantity committed

no test to determine which structure is in eﬀect—then P

will fail on at least three of the remaining structures. More

speciﬁcally, if P succeeds with structure 5, it will fail with

all the others; if P succeeds with structure 3 or 4, it will fail

with at least 1,2, and 5; if P succeeds with 1 or 2, it will

fail with at least 3, 4, and 5. The reason is simple in each

case. In the absence of a test to determine which structure

is in eﬀect, P fails because an attempt is made to exceute a

reference to a nonexistent ﬁle (available systems treat this

as an error) or no attempt is made to execute a reference to

a ﬁle containing needed information. The reader who is not

convinced should develop sample programs for this simple

problem.

Since, in general, it is not practical to develop application

programs which test for all tree structurings permitted by

the system, these programs fail when a change in structure

becomes necessary.

Systems which provide users with a network model of

the data run into similar diﬃculties. In both the tree and

network cases, the user (or his program) is required to ex-

ploit a collection of user access paths to the data. It does

not matter whether these paths are in close correspondence

with pointer-deﬁned paths in the stored representation—in

IDS the correspondence is extremely simple, in TDMS it is

just the opposite. The consequence, regardless of the stored

representation, is that terminal activities and programs be-

come dependent on the continued existence of the user access

paths.

One solution to this is to adopt the policy that once a user

access path is deﬁned it will not be made obsolete until all

application programs using that path have become obsolete.

Such a policy is not practical, because the number of access

paths in the total model for the community of users of a

data bank would eventually become excessively large.

1.3 A Relational View of Data

The term relation is used here in its accepted mathemat-

ical sense. Given sets S

,...,S

(not necessarily dis-

tinct), R is a relation on these n sets if it is a set of n-tuples

each of which has its ﬁrst element from S

, its second ele-

ment from S

, and so on.

We shall refer to S

as the jth

domain of R. As deﬁned above, R is said to have degree n.

Relations of degree 1 are often called unary, degree 2 binary,

degree 3 ternary, and degree n n-ary.

For expository reasons, we shall frequently make use of an

array representation of relations, but it must be remembered

that this particular representation is not an essential part

of the relational view being expounded. An array which

represents an n-ary relation R has the following properties:

1. Each row represents an n-tuple of R.

2. The ordering of rows is immaterial.

3. All rows are distinct.

4. The ordering of columns is signiﬁcant—it corresponds

to the ordering S

,...,S

of the domains on which

R is deﬁned (see, however, remarks below on domain-

ordered and domain-unordered relations).

5. The signiﬁcance of each column is partially conveyed

by labeling it with the name of the corresponding do-

main.

More concisely, R is a subset of the Cartesian product S

× ...× S

supply (supplier part project quantity)

1 2 5 17

1 3 5 23

2 37 9

2 75 4

4 1 1 12

Fig. 1. A relation of degree 4

component (part part quantity)

15 9

25 7

35 2

2 6 12

36 3

47 1

67 1

Fig. 2. A relation with two identical domains

The example in Figure 1 illustrates a relation of degree

4, called supply, which reﬂects the shipments-in-progress of

parts from speciﬁed suppliers to speciﬁed projects in speci-

ﬁed quantities.

One might ask: If the columns are labeled by the name of

corresponding domains, why should the ordering of columns

matter? As the example in Figure 2 shows, two columns

may have identical headings (indicating identical domains)

but possess distinct meanings with respect to the relation.

The relation depicted is called component. It is a ternary

relation, whose ﬁrst two domains are called part and third

domain is called quantity. The meaning of component (x, y,

z) is that part x is an immediate component (or subassem-

bly) of part y, and z units of part x are needed to assemble

one unit of part y. It is a relation which plays a critical role

in the parts explosion problem.

It is a remarkable fact that several existing information

systems (chieﬂy those based on tree-structured ﬁles) fail to

provide data representations for relations which have two or

more identical domains. The present version of IMS/360 [5]

is an example of such a system.

The totality of data in a data bank may be viewed as a

collection of time-varying relations. These relations are of

assorted degrees. As time progresses, each n-ary relation

may be subject to insertion of additional n-tuples, deletion

of existing ones, and alteration of components of any of its

existing n-tuples.

In many commercial, governmental, and scientiﬁc data

banks, however, some of the relations are of quite high de-

gree (a degree of 30 is not at all uncommon). Users should

not normally be burdened with remembering the domain

ordering of any relation (for example, the ordering supplier,

then part, then project, then quantity in the relation sup-

ply). Accordingly, we propose that users deal, not with

relations which are domain-ordered, but with relationships

which are their domain-unordered counterparts.

To accom-

plish this, domains must be uniquely identiﬁable at least

within any given relation, without using position. Thus,

where there are two or more identical domains, we require

in each case that the domain name be qualiﬁed by a distinc-

tive role name, which serves to identify the role played by

that domain in the given relation. For example, in the rela-

tion component of Figure 2, the ﬁrst domain part might be

In mathematical terms, a relationship is an equivalence

class of those relations that are equivalent under permu-

tation of domains (see Section 2.1.1).

of 10

免费下载

数据库关系模型

文档被以下合辑收录

数据库论文（共5篇）

收录数据库领域相关论文

文档被以下合辑收录

相关文档

评论