暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
A Relational Model of Data for Large Shared Data Banks.pdf
123
10页
3次
2022-11-07
免费下载
A Relational Model of Data for Large Shared Data Banks
E. F. Codd
IBM Research Laboratory, San Jose, California
ABSTRACT
Future users of large data banks must be protected from
having to know how the data is organized in the machine
(the internal representation). A prompting service which
supplies such information is not a satisfactory solution. Ac-
tivities of users at terminals and most application programs
should remain unaffected when the internal representation of
data is changed and even when some aspects of the external
representation are changed. Changes in data representation
will often be needed as a result of changes in query, update,
and report traffic and natural growth in the types of stored
information.
Existing noninferential, formatted data systems provide
users with tree-structured files or slightly more general net-
work models of the data. In Section 1, inadequacies of these
models are discussed. A model based on n-ary relations,
a normal form for data base relations, and the concept of
a universal data sublanguage are introduced. In Section 2,
certain operations on relations (other than logical inference)
are discussed and applied to the problems of redundancy
and consistency in the user’s model.
1. RELATIONAL MODEL AND NORMAL
FORM
1.1 Introduction
This paper is concerned with the application of elementary
relation theory to systems which provide shared access to
large banks of formatted data. Except for a paper by Childs
[1], the principal application of relations to data systems has
been to deductive question-answering systems. Levein and
Maron [2] provide numerous references to work in this area.
In contrast, the problems treated here are those of data in-
dependence—the independence of application programs and
terminal activities from growth in data types and changes
in data representation—and certain kinds of data inconsis-
tency which are expected to become troublesome even in
nondeductive systems.
The relational view (or model) of data described in Sec-
tion 1 appears to be superior in several respects to the graph
or network model [3, 4] presently in vogue for noninferential
systems. It provides a means of describing data with its nat-
ural structure only—that is, without superimposing any ad-
ditional structure for machine representation purposes. Ac-
cordingly, it provides a basis for a high level data language
E. F. Codd. 1970. A relational model of data
for large shared data banks. Commun. ACM 13,
6 (June 1970), 377-387. DOI=10.1145/362384.362685
http://doi.acm.org/10.1145/362384.362685
which will yield maximal independence between programs
on the one hand and machine representation and organiza-
tion of data on the other.
A further advantage of the relational view is that it forms
a sound basis for treating derivability, redundancy, and con-
sistency of relations—these are discussed in Section 2. The
network model, on the other hand, has spawned a number of
confusions, not the least of which is mistaking the derivation
of connections for the derivation of relations (see remarks in
Section 2 on the “connection trap”).
Finally, the relational view permits a clearer evaluation of
the scope and logical limitations of present formatted data
systems, and also the relative merits (from a logical stand-
point) of competing representations of data within a single
system. Examples of this clearer perspective are cited in
various parts of this paper. Implementations of systems to
support the relational model are not discussed.
1.2 Data Dependencies in Present Systems
The provision of data description tables in recently de-
veloped information systems represents a major advance to-
ward the goal of data independence [5, 6, 7]. Such tables
facilitate changing certain characteristics of the data rep-
resentation stored in a data bank. However, the variety
of data representation characteristics which can be changed
without logically impairing some application programs is still
quite limited. Further, the model of data with which users
interact is still cluttered with representational properties,
particularly in regard to the representation of collections of
data (as opposed to individual items). Three of the principal
kinds of data dependencies which still need to be removed
are: ordering dependence, indexing dependence, and access
path dependence. In some systems these dependencies are
not clearly separable from one another.
1.2.1 Ordering Dependence
Elements of data in a data bank may be stored in a va-
riety of ways, some involving no concern for ordering, some
permitting each element to participate in one ordering only,
others permitting each element to participate in several or-
derings. Let us consider those existing systems which either
require or permit data elements to be stored in at least one
total ordering which is closely associated with the hardware-
determined ordering of addresses. For example, the records
of a le concerning parts might be stored in ascending order
by part serial number. Such systems normally permit appli-
cation programs to assume that the order of presentation of
records from such a file is identical to (or is a subordering
of) the stored ordering. Those application programs which
take advantage of the stored ordering of a file are likely to
fail to operate correctly if for some reason it becomes nec-
essary to replace that ordering by a different one. Similar
remarks hold for a stored ordering implemented by means
of pointers.
It is unnecessary to single out any system as an exam-
ple, because all the well-known information systems that
are marketed today fail to make a clear distinction between
order of presentation on the one hand and stored ordering
on the other. Significant implementation problems must be
solved to provide this kind of independence.
1.2.2 Indexing Dependence
In the context of formatted data, an index is usually
thought of as a purely performance-oriented component of
the data representation. It tends to improve response to
queries and updates and, at the same time, slow down re-
sponse to insertions and deletions. From an informational
standpoint, an index is a redundant component of the data
representation. If a system uses indices at all and if it is to
perform well in an environment with changing patterns of
activity on the data bank, an ability to create and destroy
indices from time to time will probably be necessary. The
question then arises: Can application programs and termi-
nal activities remain invariant as indices come and go?
Present formatted data systems take widely different ap-
proaches to indexing. TDMS [7] unconditionally provides
indexing on all attributes. The presently released version
of IMS [5] provides the user with a choice for each le: a
choice between no indexing at all (the hierarchic sequential
organization) or indexing on the primary key only (the hi-
erarchic indexed sequential organization). In neither case is
the user’s application logic dependent on the existence of the
unconditionally provided indices. IDS [8], however, permits
the file designers to select attributes to be indexed and to
incorporate indices into the file structure by means of ad-
ditional chains. Application programs taking advantage of
the performance benefit of these indexing chains must refer
to those chains by name. Such programs do not operate
correctly if these chains are later removed.
1.2.3 Access Path Dependence
Many of the existing formatted data systems provide users
with tree-structured files or slightly more general network
models of the data. Application programs developed to work
with these systems tend to be logically impaired if the trees
or networks are changed in structure. A simple example
follows.
Suppose the data bank contains information about parts
and projects. For each part, the part number, part name,
part description, quantity-on-hand, and quantity-on-order
are recorded. For each project, the project number, project
name, project description are recorded. Whenever a project
makes use of a certain part, the quantity of that part com-
mitted to the given project is also recorded. Suppose that
the system requires the user or file designer to declare or
define the data in terms of tree structures. Then, any one
of the hierarchical structures may be adopted for the infor-
mation mentioned above (see Structures l–5).
Now, consider the problem of printing out the part num-
ber, part name, and quantity committed for every part used
in the project whose project name is “alpha.” The follow-
ing observations may be made regardless of which available
tree-oriented information system is selected to tackle this
problem. If a program P is developed for this problem as-
suming one of the five structures above—that is, P makes
Structure 1. Projects Subordinate to Parts
File Segment Fields
F PART part #
part name
part description
quantity-on-hand
quantity-on-order
PROJECT project #
project name
project description
quantity committed
Structure 2. Parts Subordinate to Projects
File Segment Fields
F PROJECT project #
project name
project description
PART part #
part name
part description
quantity-on-hand
quantity-on-order
quantity committed
Structure 3. Parts and Projects as Peers
Commitment Relationship Subordinate to Projects
File Segment Fields
F PART part #
part name
part description
quantity-on-hand
quantity-on-order
G PROJECT project #
project name
project description
PART part #
quantity committed
Structure 4. Parts and Projects as Peers
Commitment Relationship Subordinate to Parts
File Segment Fields
F PART part #
part description
quantity-on-hand
quantity-on-order
PROJECT project #
quantity committed
G PROJECT project #
project name
project description
Structure 5. Parts, Projects, and Commitment
Relationship as Peers
File Segment Fields
F PART part #
part name
part description
quantity-on-hand
quantity-on-order
G PROJECT project #
project name
project description
H COMMIT part #
project #
quantity committed
no test to determine which structure is in effect—then P
will fail on at least three of the remaining structures. More
specifically, if P succeeds with structure 5, it will fail with
all the others; if P succeeds with structure 3 or 4, it will fail
with at least 1,2, and 5; if P succeeds with 1 or 2, it will
fail with at least 3, 4, and 5. The reason is simple in each
case. In the absence of a test to determine which structure
is in effect, P fails because an attempt is made to exceute a
reference to a nonexistent file (available systems treat this
as an error) or no attempt is made to execute a reference to
a file containing needed information. The reader who is not
convinced should develop sample programs for this simple
problem.
Since, in general, it is not practical to develop application
programs which test for all tree structurings permitted by
the system, these programs fail when a change in structure
becomes necessary.
Systems which provide users with a network model of
the data run into similar difficulties. In both the tree and
network cases, the user (or his program) is required to ex-
ploit a collection of user access paths to the data. It does
not matter whether these paths are in close correspondence
with pointer-defined paths in the stored representation—in
IDS the correspondence is extremely simple, in TDMS it is
just the opposite. The consequence, regardless of the stored
representation, is that terminal activities and programs be-
come dependent on the continued existence of the user access
paths.
One solution to this is to adopt the policy that once a user
access path is defined it will not be made obsolete until all
application programs using that path have become obsolete.
Such a policy is not practical, because the number of access
paths in the total model for the community of users of a
data bank would eventually become excessively large.
1.3 A Relational View of Data
The term relation is used here in its accepted mathemat-
ical sense. Given sets S
1
,S
2
,...,S
n
(not necessarily dis-
tinct), R is a relation on these n sets if it is a set of n-tuples
each of which has its first element from S
1
, its second ele-
ment from S
2
, and so on.
1
We shall refer to S
j
as the jth
domain of R. As defined above, R is said to have degree n.
Relations of degree 1 are often called unary, degree 2 binary,
degree 3 ternary, and degree n n-ary.
For expository reasons, we shall frequently make use of an
array representation of relations, but it must be remembered
that this particular representation is not an essential part
of the relational view being expounded. An array which
represents an n-ary relation R has the following properties:
1. Each row represents an n-tuple of R.
2. The ordering of rows is immaterial.
3. All rows are distinct.
4. The ordering of columns is significant—it corresponds
to the ordering S
1
,S
2
,...,S
n
of the domains on which
R is defined (see, however, remarks below on domain-
ordered and domain-unordered relations).
5. The significance of each column is partially conveyed
by labeling it with the name of the corresponding do-
main.
1
More concisely, R is a subset of the Cartesian product S
1
×
S
2
× ...× S
n
.
supply (supplier part project quantity)
1 2 5 17
1 3 5 23
2 37 9
2 75 4
4 1 1 12
Fig. 1. A relation of degree 4
component (part part quantity)
15 9
25 7
35 2
2 6 12
36 3
47 1
67 1
Fig. 2. A relation with two identical domains
The example in Figure 1 illustrates a relation of degree
4, called supply, which reflects the shipments-in-progress of
parts from specified suppliers to specified projects in speci-
fied quantities.
One might ask: If the columns are labeled by the name of
corresponding domains, why should the ordering of columns
matter? As the example in Figure 2 shows, two columns
may have identical headings (indicating identical domains)
but possess distinct meanings with respect to the relation.
The relation depicted is called component. It is a ternary
relation, whose first two domains are called part and third
domain is called quantity. The meaning of component (x, y,
z) is that part x is an immediate component (or subassem-
bly) of part y, and z units of part x are needed to assemble
one unit of part y. It is a relation which plays a critical role
in the parts explosion problem.
It is a remarkable fact that several existing information
systems (chiefly those based on tree-structured files) fail to
provide data representations for relations which have two or
more identical domains. The present version of IMS/360 [5]
is an example of such a system.
The totality of data in a data bank may be viewed as a
collection of time-varying relations. These relations are of
assorted degrees. As time progresses, each n-ary relation
may be subject to insertion of additional n-tuples, deletion
of existing ones, and alteration of components of any of its
existing n-tuples.
In many commercial, governmental, and scientific data
banks, however, some of the relations are of quite high de-
gree (a degree of 30 is not at all uncommon). Users should
not normally be burdened with remembering the domain
ordering of any relation (for example, the ordering supplier,
then part, then project, then quantity in the relation sup-
ply). Accordingly, we propose that users deal, not with
relations which are domain-ordered, but with relationships
which are their domain-unordered counterparts.
2
To accom-
plish this, domains must be uniquely identifiable at least
within any given relation, without using position. Thus,
where there are two or more identical domains, we require
in each case that the domain name be qualified by a distinc-
tive role name, which serves to identify the role played by
that domain in the given relation. For example, in the rela-
tion component of Figure 2, the rst domain part might be
2
In mathematical terms, a relationship is an equivalence
class of those relations that are equivalent under permu-
tation of domains (see Section 2.1.1).
of 10
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。