ALEG
Weekly Report - Week 24, 13 October 2000
What I've done
- Deleted the 30,000 experimental review records erroneously
loaded into ALEG from AUSTLIT (this required developing the infrastructure
to delete works, expressions, manifestations and dependent data)
- Constructed expression and manifestation records for periodical issues
referenced in sources and linked to these from expressions. This went
well; over 180,000 links created, 38,000 periodical issues created (for
1610 periodicals). Just over 300 links were not created automatically
because the source information was too incomplete or 'suspicious':
I manually classified some of them (which were unambiguously just
typos), but the rest will be looked at by experts!
- With the bulk of data loaded, I thought it was a good time to revisit
some performance tests. We now have 1.3 million "topics" and
3.2 million relationships between topics, and the earlier tests
were conducted on a dataset one tenth this size. After some initial
(disturbing!) results, some simple Oracle configuration to prod
the query optimiser gave encouraging results. For example,
a query to return the works with these characteristics:
- poems
- written between 1900 and 1999
- topics "bush" and "isolation"
- either title or first-line-of-verse containing the word "bush"
- authored by a male born between 1865 and 1879
- published in a periodical with a name containing the word "bulletin"
returns 3 works and executes in 1 second (well, the first
execution takes 3 seconds, as it takes 2 seconds for Oracle to
work out how to most efficiently process the query). That's not
bad, as some of the selection criteria (eg, "poems" and "works
written between 1900 and 1999") involve quite large numbers of
works.
Although I'm hesitant about predicting performance for the whole
system based on this query, the result is encouraging. If performance
wasn't good with the current strange, completely normalised database
structure, now would be the time to change it. But I'm currently
comfortable that the design will be OK.
I'm slightly concerned about how best to handle queries on
attributes that can appear against more than one type of record
(for example, alternative titles, which could be recorded against
the expression rather than the work, or genders, which could be
assigned to a pseudonym rather than the natural person), but I'll
let those concerns bubble away in the background for now...
- Hooked up a very basic ALEG record-dump function to the Apache/Tomcat
web frontend/java servlet infrastructure, just to verify there would be
no unexpected problems running the ALEG java classes and accessing the
Oracle database using the JDBC drivers. This worked fine.
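As a rough illustration of the multi-criteria performance-test query and
the JDBC check described above, here is a minimal Java sketch. It assumes
a fully normalised layout in which every criterion is expressed as a
relationship between topics. The table and column names (topic,
relationship, topic_type, rel_type and so on) and the WorkQuery class are
hypothetical, as this report does not describe the actual ALEG schema or
classes, and only exact-name criteria are covered (not date ranges or
word matches).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: the schema (topic, relationship and their columns) is hypothetical. */
class WorkQuery {
    private final StringBuilder sql =
        new StringBuilder("SELECT w.topic_id FROM topic w WHERE w.topic_type = ?");
    private final List<Object> params = new ArrayList<>();

    WorkQuery() {
        params.add("work");
    }

    /** Require a relationship of the given (hypothetical) type to a topic with this name. */
    WorkQuery relatedTo(String relType, String topicName) {
        sql.append(" AND EXISTS (SELECT 1 FROM relationship r")
           .append(" JOIN topic t ON t.topic_id = r.to_topic_id")
           .append(" WHERE r.from_topic_id = w.topic_id")
           .append(" AND r.rel_type = ? AND t.name = ?)");
        params.add(relType);
        params.add(topicName);
        return this;
    }

    String sql() {
        return sql.toString();
    }

    List<Object> params() {
        return params;
    }

    /** Run the query on any JDBC connection and return the matching work ids. */
    List<Long> run(Connection conn) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql())) {
            // Bind parameters in the order the criteria were added (JDBC is 1-based).
            for (int i = 0; i < params.size(); i++) {
                ps.setObject(i + 1, params.get(i));
            }
            List<Long> ids = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong(1));
                }
            }
            return ids;
        }
    }
}
```

Each call such as new WorkQuery().relatedTo("form", "poem") adds one
EXISTS clause with bound parameters. Using bind parameters rather than
literal values lets the database reuse a parsed statement, which is
relevant to the slower first execution noted above.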
What I haven't done but need to do soon!
- Document how ALEG will handle some tricky cases: the "Poets of the
Month" works from the mid-1970s and "Down the Lake with Half a Chook".
These are amongst the most "difficult" cases Tessa and Kathy can
come up with, so if we think the proposed data model can handle these,
we'll be happy!
Next week
- Think about incorporating the BAL data, or at least think about
talking to Kerry about it next week!
- Start work on the web-specific infrastructure, especially
partitioning the system into database interaction, XML generation,
XML translation (initially into HTML) and HTML delivery components.
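One possible shape for the four-way partitioning above, sketched in Java
with the standard javax.xml.transform API. Everything here is an
assumption for illustration: the RecordPipeline class, the stubbed
fetchRecord method (standing in for database interaction), the record
element names and the inline stylesheet are all hypothetical, and HTML
delivery is reduced to returning a string rather than writing to a
servlet response.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Sketch only: a hypothetical four-stage pipeline, not the ALEG design. */
class RecordPipeline {
    // Hypothetical stylesheet: turns a <record> into a minimal HTML page.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/record\">"
        + "<html><body><h1><xsl:value-of select=\"title\"/></h1></body></html>"
        + "</xsl:template></xsl:stylesheet>";

    /** Stage 1, database interaction: stubbed with fixed data for this sketch. */
    static Map<String, String> fetchRecord(long id) {
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("id", Long.toString(id));
        rec.put("title", "Example title");
        return rec;
    }

    /** Stage 2, XML generation: one child element per field, values escaped. */
    static String toXml(Map<String, String> rec) {
        StringBuilder sb = new StringBuilder("<record>");
        for (Map.Entry<String, String> e : rec.entrySet()) {
            sb.append('<').append(e.getKey()).append('>')
              .append(escape(e.getValue()))
              .append("</").append(e.getKey()).append('>');
        }
        return sb.append("</record>").toString();
    }

    /** Escape the characters that are significant in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Stage 3, XML translation: apply the XSLT stylesheet to produce HTML. */
    static String toHtml(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT translation failed", e);
        }
    }

    /** Stage 4, HTML delivery: a servlet would write this string to its response. */
    static String render(long id) {
        return toHtml(toXml(fetchRecord(id)));
    }
}
```

Keeping each stage behind its own method means the XML translation step
could later target formats other than HTML without touching the database
or XML generation code.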
Summary