Data Model - Intro/Issues
Why produce a data model?
The data model should help us assess whether we are
keeping the right information about the right entities
so that the system will meet the requirements of users.
But what are the requirements of users? For traditional
IT systems, this is usually answered by referring to the
business needs the system must meet. Even for library systems,
this question is relatively straight-forward to answer in terms of
meeting the data processing requirements of the day to day
operations of a library (catalogue, acquisitions, circulation).
But ALEG is an unusual system ...
But for ALEG, this task isn't quite so simple. ALEG is
designed to be a tool to aid inquiry and research, to further
understanding of Australia's literature. Some requirements
and usages are easy to understand: provide a simple way to
find out basic information about a work or an author.
However, the potential value of a research tool relates not
only to the breadth of the resource on which it operates
but also to how flexible it is: how the researcher can use it to
answer questions the designers of the system did not
anticipate.
A key part of ALEG's potential is the way it can help
to reveal and elucidate the relationships between the
'entities' making up Australian literature: the authors,
works, publishers, movements, genres, cultural and political
forces.
So, the research value of ALEG is not so much in the 'raw'
data of who wrote what and when (vital as that is), but
how that 'raw' data can be viewed as coherent clumps - the
relationships that are unveiled when the core data is
analysed.
So the data model is even more important than usual
Much of information technology is "cut and paste". Problems
form recurring patterns (often referred to as design patterns),
and the job of
system designers and implementors is to analyse the problem, match
it against well known patterns and then use the appropriate
technology to implement the solution (not a trivial task!).
"Different" problems (such as ALEG) are harder to match at the
"global" level (it isn't an inventory management system, a
human resources system or even a library catalogue system).
And because we all like to think in metaphors, unless
we really understand what we are trying to do with ALEG, we'll
end up using the wrong metaphor and building the wrong system.
So the purpose of these data model documents is to make explicit our understanding
This data model should make it clear what data we are storing,
what relationships we are representing, and why.
OK - What is ALEG?
I assert that ALEG is more like a police investigation system.
These systems accept large amounts of data about suspects,
crimes, relationships and rumours, and allow investigators to
trawl through it and discover relationships.
Many of these systems have been built, but from my own limited
experience of them, they've had mixed success. Early attempts
were hampered by hardware limitations - it takes a lot of
horsepower to represent complex relationships, especially when
you aren't using an appropriate data model!
One approach - Topic Maps
Recently the ISO published a new standard which attempts to
define an approach to representing complex classifications
and representations of relationships on an underlying
data set. This standard is called Topic Maps, and
although the ISO defining document isn't especially edifying,
there are several resources offering a more approachable
introduction to Topic Maps:
In a nutshell, Topic Maps provide a framework for defining topics of interest
separate from the material being linked to the topics. A Topic Map allows the
definition of:
Topics. Topics can be assigned topic "types", which group
related types of topics together. For example, "Australia" and "Oman" may
be topics, both of type "Country". The topic type is just another topic.
Associations between topics. Topics can be linked by topic associations.
For example, the "Queensland" topic may be linked to the "Australia" topic by the
"is part of" association. Association types ("is part of") are themselves
topics.
Occurrences. Topics can be linked to parts of the underlying information
resource being described. For example, the "Queensland" topic could be linked
to a document describing the Great Barrier Reef as a holiday destination. An
occurrence role can also be provided to describe the type of information resource
being linked (advertisement, image, etc). Occurrence roles are also topics in
their own right.
Facets. The underlying information resources can be described by
arbitrary property-name/property-value pairs (which themselves can both be "topics").
Facets allow information resources to be filtered based on their properties, much
as is possible with standard metadata properties.
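To make these four constructs concrete, here is a minimal sketch of how they might be held in memory. It is not the ISO standard's interchange syntax; the class names, field names and the example.org address are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Topic:
    name: str
    topic_type: Optional["Topic"] = None  # a topic's type is just another topic


@dataclass(frozen=True)
class Association:
    association_type: Topic  # e.g. "is part of", itself a topic
    source: Topic
    target: Topic


@dataclass(frozen=True)
class Occurrence:
    topic: Topic
    resource: str  # address of the underlying resource
    role: Topic    # occurrence role, also a topic


@dataclass
class TopicMap:
    associations: list = field(default_factory=list)
    occurrences: list = field(default_factory=list)
    # Facets: resource address -> {property-name topic: property-value topic}
    facets: dict = field(default_factory=dict)


country = Topic("Country")
australia = Topic("Australia", country)
oman = Topic("Oman", country)
queensland = Topic("Queensland")
part_of = Topic("is part of")
advertisement = Topic("advertisement")
resource = "http://example.org/reef-holiday"  # hypothetical address

tm = TopicMap()
tm.associations.append(Association(part_of, queensland, australia))
tm.occurrences.append(Occurrence(queensland, resource, advertisement))
tm.facets[resource] = {Topic("language"): Topic("English")}  # hypothetical facet
```

Note that topic types, association types, occurrence roles and facet names are all plain Topic values, which mirrors the "everything is a topic" flavour of the description above.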
Topic Associations could allow very powerful automated processing where
the right semantics are defined and understood. For example, an application
that understood that the "is part of" association type was transitive would
know that if Topic X (eg, Sunnybank) "is part of" Topic Y (eg, Brisbane)
and Topic Y "is part of" Topic Z (eg, Queensland), then Topic X "is
part of" Topic Z.
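The transitive reasoning described here can be sketched in a few lines. This is only an illustration under assumptions: the function name and the representation of associations as (part, whole) pairs are invented for the example, and a real Topic Map engine may do this quite differently.

```python
def is_part_of(part, whole, links):
    """Return True if `part` is (transitively) part of `whole`.

    `links` is an iterable of (part, whole) pairs, each one a stored
    "is part of" association.
    """
    parents = {}
    for child, parent in links:
        parents.setdefault(child, set()).add(parent)

    seen = set()
    stack = [part]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent == whole:
                return True
            if parent not in seen:  # guard against cycles in the data
                seen.add(parent)
                stack.append(parent)
    return False


links = [("Sunnybank", "Brisbane"), ("Brisbane", "Queensland")]
print(is_part_of("Sunnybank", "Queensland", links))  # True
print(is_part_of("Queensland", "Sunnybank", links))  # False
```

Only the two direct associations are stored; the Sunnybank/Queensland relationship is inferred at query time.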
Topics can be involved in multiple associations. For example,
Sunnybank can be associated with the "urban area" topic
and/or the "suburb" topic. Brisbane can be associated with the
"city" topic, as could the Sydney and Melbourne topics.
What is so special about Topic Maps?
Topic Maps are interesting for several reasons:
They define a flexible
way to represent relationships between underlying data and
arbitrary topics, and between topics and other topics.
They push a lot of the representation of relationships
from 'program code' to 'data'. That is, relationships are
represented less by hard-coded program logic and more by
relationships stored as data.
Topic Maps build relationships between 'entities' in the
underlying data. They do so not by altering the underlying
data but by building independent topic structures which
overlay this data. Hence (and intriguingly), Topic Maps
can be built by systems quite independent from the systems
which hold the core data set - the Topic Map systems just need
a way to address the entities in the core data set.
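As a rough sketch of this overlay idea, the example below keeps the core records and the Topic Map associations in separate structures, with the overlay referring to core entities only by address. The record IDs, field names and the "written by" association type are hypothetical, not taken from any actual ALEG design.

```python
# Core data set: owned by the "core" system. The record IDs and field
# names here are hypothetical.
core_records = {
    "work:1": {"title": "Example Novel"},
    "creator:7": {"name": "Example Author"},
}
core_snapshot = {key: dict(value) for key, value in core_records.items()}

# Topic Map overlay: held separately, referring to core entities only
# by their addresses (here, the dictionary keys).
overlay = [
    # (association type, subject address, object address)
    ("written by", "work:1", "creator:7"),
]


def related(address, association_type, overlay):
    """Addresses linked from `address` by the given association type."""
    return [obj for kind, subj, obj in overlay
            if kind == association_type and subj == address]


authors = [core_records[a]["name"]
           for a in related("work:1", "written by", overlay)]
print(authors)  # ['Example Author']
```

The core records are never modified by the overlay, so in principle the two could live in different systems, provided the addresses stay stable.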
This all sounds good to me:
Relationships are represented more in the data structures,
less in the program code. This allows the system to evolve
with less reliance on the program code. Want a new relationship
to be represented? Don't plead with a programmer to implement
it, just augment the Topic Map data to represent it.
Relationships are represented in Topic Map data structures, not
the "core data" data structures. This might mean that the core
data structures are less likely to change over time, and hence the
code/user interface which manages them is less likely to require
change. New relationships are hence cheaper to implement, and
less likely to cause system-wide ripples.
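A small sketch of what "augment the data, not the code" might look like in practice. The tuple layout, the find() helper and the sample names are assumptions for illustration only.

```python
associations = [
    ("written by", "Work A", "Author X"),
    ("published by", "Work A", "Publisher P"),
]


def find(associations, association_type):
    """Return (subject, object) pairs for one association type."""
    return [(s, o) for kind, s, o in associations if kind == association_type]


# A researcher wants a relationship nobody planned for. Recording it is
# a data change; find() is untouched.
associations.append(("influenced by", "Author X", "Author Y"))

print(find(associations, "influenced by"))  # [('Author X', 'Author Y')]
```

The query code knows nothing about any particular relationship type, so a new type costs one more row of data rather than a programming change.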
Related work
The Australian Science and Technology Heritage Centre has produced
the Online Heritage
Resource Manager. An interesting part of this system is the
set of relationship tables which record the relationships between
entities.
The International Council on Archives
ISAAR (CPF)
document (International Standard Archival Authority Record
for Corporate Bodies, Persons and Families) defines an approach
for representing relationships between corporate/human entities.
The World Wide Web Consortium (W3C)
Resource Description Framework (RDF) standard defines a technique
for associating metadata with resources. Essentially, this achieves
the same thing as Topic Maps, except that Topic Maps contain the
inbuilt semantics of topics, associations, occurrences and facets
which RDF does not define. For more information on RDF,
visit Dave Beckett's
Resource Description Framework (RDF) Resources.
The International Council of Museums has produced
the CIDOC
Conceptual Reference Model which represents an "ontology" for
cultural heritage information. Their model uses an Object Oriented (O-O)
approach to organising the entities of their systems.
What next
Is this all jumping the gun? Isn't discussion of the way we're
going to represent the ALEG data preempting a thorough analysis
of what we are going to store?
Yes and no...
The language you use determines the approach you take when
thinking about a problem. If you think in terms of punched
cards or hierarchical databases or XML data structures, you'll
find that your approach is couched in these terms.
Coming up with a strong language to represent the problem
is often half the battle. I assert that the ALEG data modelling
problem consists of two sub-problems:
a fairly standard data processing problem to deal with
500,000-odd records of three or four basic record types (work,
creator, publication/edition/instantiation, maybe holdings or
other references to accessible copies of works), some of which contain
a large amount of text.
a relationship recording/inferring/querying problem which
allows the representation of a rich network of relationships
between these 500,000-odd records.
In the data modelling documents which follow, I'm taking
a Topic Map bias. That is, I'm assuming that Topic Maps
are a good way to represent the complex mesh of relationships
which will make ALEG a valuable research tool.
Data modelling documents: