ALEG
Weekly Report - Week 18, 1 September 2000
What I've done
- On Monday Judith Pearce from the NLA held a mini workshop
on the use-case methodology using ALEG as the example. This
lead to some interesting discussion on how some things would
be done (and from my point of view, especially how record
status would be managed and move from 'unauthorised' or
'in-progress' or 'hidden' to 'authorised' or 'public'
- Continued work defining a Java class hiearchy to represent
the objects ALEG needs to manage. The approach I'm taking
is to separate the 'logical' view of the FRBR / INDECS / Harmony ABC
models from the database implementation, which is basically
a completely normalised structure modelled on Topic Maps.
You'd expect ALEG to have tables (data base record structures)
with names such as "Agent", "Work", "Expression", "Manifestation",
"Holding". However, it doesn't. Instead, it has "Topics" and
"Relationships" between topics. So, an Agent is just a topic
which usually happens to have a bunch of relationships (such
as a relationship with a "Gender" topic (male or female).
Actually, some relationships are a bit more complex, and
are modelled as INDECS style events - a birth event for
an agent, a publication event for an expression resulting
in a manifestation.
But this simple topic-map style database approach benefits
from being mapped into the familiar subject-matter specific
'objects' such as agent and work, which is what the Java
class hierarchy does - one part of it abstracts away the
storage and retrieval mechanisms, which happen to be based
on a relational database and topic maps.
- A major chunk of work this week was creating and tuning a
database access layer which will hopefully be high
performance and make the writing of the classes which interact
with it as simple and robust as possible. Because of the
'radical' normalisation, this system will splatter information
across many database records rather than clumping it in
a mega-agent or work record. The upside of this approach is
extreme flexibility - adding new attributes is simple, adding
them into search structures or using thesaurus based browsing
should be simple. The downside is performance - retrieving
a logical agent or work 'record' requires many, possibly dozens
of database accesses. So, making these accesses as fast as possible
is imperative.
To this end, I've been testing and tuning the database
infrastructure against 2000 dummy "event" records, each
of which has 10 event-relationship records (20,000 event-relationships).
An initial benchmark run to read all of these records (4,000
select statements retrieving 22,000 records) took 185 seconds
on Pentium Celeron 500MHz machine with 128MB of RAM running
NT4 and Oracle 8.0.6 and Java Run time 1.2.2. This hardware
configuration is worth about $1500. (We will be
transfering the system to Solaris soon - the NT system is just
interim whilst awaiting disks for the Library's Sparc machine.)
This time was progressively improved as follows:
- Use the newly released Hotspot 2 JVM: 90 seconds
- Use prepared SQL statements (where the Oracle DB does
not have to continually parse the SQL): 63 seconds
- Add indexes to the tables to avoid full-table scans
for the records: 35 seconds (the results of this 'tuning'
step where not that dramatic as these tables fit into
memory - hopefully we'll have enough memory on the final
system so that most of the database fits into memory!)
- Replace the JDBC-ODBC bridge drivers with Oracle native
OCI drivers: 22 seconds
- Force Oracle to use a particular index which the
Oracle query optimizer didnt choose by default: 18 seconds
At this point, we were reading over 1,000 rows per second
and issuing over 220 select statements per second, which
should provide the level of response times we are
seeking, even with our radically normalised database design.
But a goal of the ALEG database infrastructure is to
avoid issuing queries at all by caching as much as
possible of the database in the Java server. For example,
looking up the "Gender" topics with a value of "male" or
"female" from the database for every query isn't very
sensible.
SO, a topic cache is a fundamental structure in the infrastructure.
All queries go through the cache, which returns results directly
if found in the cache, and issues requests to the database and
stores the result back in the cache if not. Updates have to
also be processed through the cache to keep the cache consistent
with the database. A single cache structure will service all ALEG
users: maintainers and query-ers.
Although Java provides some primative tools for cache
management, it is likely something more sophisticated will be
required at some stage, based on a Least-Recently-Used
algorithim to flush large, unused items from the cache.
The currently cache structure has 3 separate 'bins':
- A large object cache, which needs to be aged and objects
discarded. These objects are also diffusely referenced - works,
agents, names, etc
- A small, highly referenced object cache containing
simple attribute values (gender), dates, concepts, places
etc. It is anticipated that objects in this cache will
never be flushed.
- Objects which are not cached, because they are accessed
infrequently and/or are very large - lumps of text, update
history, etc
Something else keeping in mind is making the core
retrieval functions parallelisable. The Library's Sparc
computer has 2 processors, so *if* this system were to
end up on this machine (or any multi-processor machine),
it would probably be worthwhile making the retrieval of
large objects such as works and agents into multiple
processors which can be executed in parallel. The
improvements are often not that dramatic - adding another
processor to a system which does lots of memory and disk
access rarely improves throughput by more than 60-70%,
and splitting a task into several bits that require
synchronisation adds overhead as well.
I decided to spend time on this 'server side' infrastructure
now rather than later, and defer the data loading of
AUSTLIT and BAL data because whilst working out how to
load that data I concluded that it needed a lot of
infrastructure in place to do it efficiently. The data
in AUSTLIT and BAL requires a lot of processing. For example,
an AUSTLIT name field often contains a pseudonym name
and the author's real name, possibly including multiple variant
names and designators such as "nee", plus a birth/death date.
Efficiently processing these and matching names is non-trivial,
and requires lots of programming code, much of it provided
by the infrastructure.
- Upgraded Oracle 8.0.6 to 8.1.6. This was a frustrating
experience, particularly as Oracle silently installed its
own (old!) version of Java which overrode the existing
Java installation and because the OCI JDBC drivers required
a different database name from the JDBC drivers (which took
me a while to discover!). Also, the installation just
took a long time, and required several reruns to get it to
install what I wanted. Even now, although it says that the
Oracle InterMedia option is installed, it doesn't seem to
work properly, with one tool saying that it isn't installed!
I wont waste any more time on this on the NT platform,
as Oracle 8.1.5 is already installed in the Solaris machine
we'll be moving to in a few weeks.
What I haven't done but need to do soon!
- Document how ALEG will handle some tricky cases - The "Poets of the
Month" works from the mid 1970's and "Down the Lake with Half a Chook".
These are amongst the most "difficult" cases Tessa and Kathy can
come up with, so if we think the proposed data model can handle these,
we'll be happy!
Next week
- Build more infrastructure classes
- Parse and load the rest of the test AUSTLIT data
- If all goes well (think it will), ask Fran to dump the
entire database, and load that!
- Load at least some parts of the BAL/LAW data
Summary
- Although I'm doing some things out of the order
I originally thought, it makes sense to build the
infrastructure now and use it to load the AUSTLIT and BAL
data, although it does remove the instant gratification
of seeing lots of data in the new database structure.
I'm happy with the performance of the database (I hope
I'm not suprised when we load more data!), and with the
performance of the database infrastructure classes.