ALEG
Weekly Report - Week 19, 8 September 2000
What I've done
- Created a new ALEG database. This seemed to be necessary
to install the Oracle 8.1.6 interMedia component, as the
installation process failed repeatedly on the database
converted from Oracle 8.0.5 (complaining that it could not
be performed on a database which was compatable with Oracle
8.0.5, even after I removed the compatability parameter from
the init.ora parm file). After importing the old database
into the new database and setting the appropriate permissions,
the interMedia index worked fine. However, once again the
Oracle database creation process unaccountably and silently
changed the system PATH, but after the previous experience
with installing Oracle 8.1.6, I was alert to this!
- Developed the topic retrieval and searching infrastructure.
This was something else I was going to delay until loading
more test data, but even loading Fran's AUSTLIT Authior file
extract actually requires sophisticated searching & matching, so
it is another thing better done properly up-front.
A major consumer of time this week was tuning the search
infrastructure using some sample 'difficult' searches, such
as "males, born between 1912 and 1920 in NSW, Newcastle or Richmond
with a name of Smith or Smithers, died in 1960, with a language
attribute of English or Thai" (that last bit - the language
attribute, isnt part of the agent data model, but it might
be following discussions with Annette in the previous week
on classifing non-ALEG agents....). This query has a few
interesting aspects. It used the interMedia text searching
(for a name of Smith or Smithers, although not strictly
necessary because that is an exact match on a complete
field (name-suffix (aka Surname)). Also, it uses the "Place"
hierarchy because "NSW" includes many places (some with
nested places themselves), and Newcastle and Richmond are the
names of several place topics, possibly with their own
hierarchies (suburbs of Newcastle...).
The tuning process involved:
- defining appropriate indices
- avoiding some joins by pre-lookup from a Java cache the
topic id's associated with very common topics ("Male", "English",
"Thai")
- deciding to use a pure relational join approach (which
used approximately 20 joins) versus using sub-queries
Defining a few indices completely eliminated full table access.
Using some sub-queries greatly improved performance (as reported
by the EXPLAIN PLAN cost and elapsed time), although it was hard
to predict the effect of adding more sub-queries - at one point
more sub-queries resulted in a greater cost. Caching the tables
in memory eliminated disk IO and resulted in a one-second
response to the above query, with 60,000 topic records and
relationships loaded, although I'm very wary of extrapolating
beyond this to the 1 million records plus system we will be
operating with. I'm cautiously optimistic that tuning of the
system will result in very good response times. The programmatic
interface to the database has been logically decoupled from the
query interface - that is, the query interface defined a "logical"
query, not SQL or even table names or column names. Hence the
tuning can be performed without requiring changes to the query
interface.
- Used the new infrastructure to load Fran's AUSTLIT author extract
plus an extra 35 author descriptions which I'd handpicked as
being especially difficult to process (such as pseuds linked to
agents with variant and nee names), and this seems to have worked
well (more checking required Monday!)
What I haven't done but need to do soon!
- Document how ALEG will handle some tricky cases - The "Poets of the
Month" works from the mid 1970's and "Down the Lake with Half a Chook".
These are amongst the most "difficult" cases Tessa and Kathy can
come up with, so if we think the proposed data model can handle these,
we'll be happy!
Next week
- Decide how to implement object identifiers (such as
AUSTLIT author Id, ISBN, ABN immutable numbers). Treating
these as full topics is probably over the top, but a more
light-weight and hence faster treatment introduces a second
type of attribute (currently we just have topics - even events
are just fancy topics with their own topics and topic
relationships). Notes, update info, source info and other
topic relationships are automatically available to all
topics, however humble (such as a place name), but wont be
available to a simply implemented object identifier. But,
maybe this is not necessary, so I'm leaning to an efficient
implementation at the expense of flexability.
- Test and tune the infrastructure classes by loading the
AUSTLIT data
- Load at least some parts of the BAL/LAW data
- The disks for the Solaris system have arrived,
so I'll work with Fran configuring them next week and
possibly migrating the ALEG database fron NT to Solaris.
Summary
- The backend query system was a major item on the
'to do' list, so I'm relieved that it is largely completed
and that the performance looks like it will be good.
I'm a bit frustrated that I still haven't loaded
much data - the 'instant gratification' of seeing the
data in the new data model is eagerly awaited, especially
as once it is loaded we'll be able to get much better
estimates of query times.