ALEG
Weekly Report - Week 28, 10 November 2000
What I've done
- Spruced up the demonstration searching and retrieval web interface
so that it is basically functional. This generated lots of suggestions,
comments and further questions among Kerry, Annette, Marie-Louise and me,
especially about what is displayed, how it is rendered and the sort options.
- The good thing about having something like this interface is that even though
it isn't slick it lets people enter queries and find things which are badly
wrong! Some of the things highlighted were poor performance on some queries
and issues with the scalability of server infrastructure, and it is these
things which I spent most of the week fiddling with; testing different
approaches and tuning the system.
On the query side, the most interesting results were to do with queries
involving the gender of the creator. For example: works created in 1930 by men
on the subject of "greed" (one of Kerry's queries). Originally this took over a
minute to find the single work, because the database engine thought that
gender was as good a discriminator as the subject. So, it found all
men, found their creations, discarded those which weren't on the subject
of greed and from the rest, discarded those not written in 1930.
This is a poor approach, as there are lots of men, and they've
produced lots of works, very few of which have a subject of "greed".
A far better approach is to start with all the works with a subject
of greed, then whittle those down to those produced in 1930 and then
finally apply the gender test.
Unfortunately, the database thinks the date test is very hard to
apply because checking for a date of 1930 isn't that simple. Works
may be recorded with a date of "somewhere between 1928 and 1930", or
"somewhere between 1929 and 1933", and these must both be returned
to a query asking for a date of "1930". So, the date is out as a
good discriminator.
When asked to evaluate whether subject or gender was the better
discriminator, the Oracle database chose gender (sadly!). A change I made
to the query made it choose subject instead, cutting the query time to a few
seconds.
I must look into other ways to make the Oracle database better
evaluate options, perhaps by generating more statistics about the
data in our tables.
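To make the date test above concrete, here is a minimal Java sketch of the matching rule. It is not the ALEG query code and says nothing about how Oracle evaluates the query; the names DatedWork, mayDateFrom and DateFilter are invented for illustration. The point is that "date is 1930" is really two range comparisons against a recorded earliest and latest year, rather than one equality lookup, which may be part of why the database rates it as a poor discriminator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical work record: the date is stored as a range of possible years.
final class DatedWork {
    final String title;
    final int earliestYear;
    final int latestYear;

    DatedWork(String title, int earliestYear, int latestYear) {
        this.title = title;
        this.earliestYear = earliestYear;
        this.latestYear = latestYear;
    }

    // True if the queried year falls anywhere inside the recorded range,
    // so both 1928-1930 and 1929-1933 match a query for 1930.
    boolean mayDateFrom(int year) {
        return earliestYear <= year && year <= latestYear;
    }
}

final class DateFilter {
    // Keep only the works whose recorded range could include the given year.
    static List<DatedWork> worksPossiblyFrom(List<DatedWork> works, int year) {
        List<DatedWork> hits = new ArrayList<>();
        for (DatedWork w : works) {
            if (w.mayDateFrom(year)) {
                hits.add(w);
            }
        }
        return hits;
    }
}
```

In the greed query, a test like this would only need to run over the handful of works that survive the subject test.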
- Another tuning issue was the operation of the Java Virtual Machine (JVM) which
runs the web scripts. We go to a lot of effort to build up information
about works, agents, etc, gathering it together from the relationships
recorded in our topic maps and presenting it in response to a query.
The natural reaction is to cache as much of this as possible, so that
should the same agent appear as author of another work, or another work
should appear as the source of this work, then if those works/agents/etc
are already cached inside the JVM, we don't have to go back to the database
and find the components and construct it again.
However, counterbalancing this is the extra work we and the JVM must
do to manage the cache. Java has an interesting way of managing memory -
very advanced and robust, but it doesn't scale very well (in current
implementations) to very large memory spaces. So, there is a tradeoff
between discarding objects and rebuilding them on the one hand, and managing
a large collection of them on the other. As a natural hoarder, I'm into caching, blind to
the hidden costs of managing my "cache". But in the hard, cold figures
revealed by running the ALEG JVM with different settings, it was apparent
that it is much faster to keep a modest cache, discard the oldest entries
and rebuild objects as required.
The JVM has lots of memory management settings to fiddle with, and
we have the size of the cache (how big we allow it to grow, and how hard
we prune it back) to consider, and the algorithm to manage keeping track
of what items are the most valuable to keep (most likely to be reused in
the near future).
So, I experimented with these settings using the Sun HotSpot JVM and
found that things seem to work fastest with:
- a Java heap of 40MB (growable to 120MB)
- a new object ("nursery") size of 8MB - this is where the JVM allocates
newly created objects - see "The Java HotSpot Performance Engine Architecture"
(slightly propagandarish) and "Garbage Collection" by Bill Venners for lots
of fascinating details.
- no incremental garbage collection - this just adds overhead in our application
- a maximum cache size of 6000 items, which we prune down to the most recently
used 3000 when this limit is reached
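The heap figures in the list above can be checked from inside a running JVM. The sketch below is illustrative only: the class name HeapReport is invented, and the launch flags in the comment are an assumption about how such settings are usually passed to HotSpot, since the exact command line used for ALEG isn't recorded here.

```java
// Assumed launch command (not taken from this report):
//   java -Xms40m -Xmx120m -XX:NewSize=8m -XX:MaxNewSize=8m HeapReport
// No flag is passed for incremental collection, on the assumption that it is off by default.
final class HeapReport {
    // Convert a byte count to whole megabytes.
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Summarise the heap as the running JVM currently sees it.
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        return "heap now " + toMegabytes(rt.totalMemory()) + "MB, "
             + "limit " + toMegabytes(rt.maxMemory()) + "MB, "
             + "free " + toMegabytes(rt.freeMemory()) + "MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```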
As well, a separate permanent cache is managed of universally "interesting"
and widely referenced topics, such as gender, workType, subjects, etc.
Managing the 6000 item cache so that we can record which items have been
most recently used and discard the rest at prune time is an interesting
problem. I think there are 2 approaches:
- maintain a linked list, and as an item is referenced, move it to the end
of the list. At prune time, discard those objects at the start of the list.
This is quite a bit of work on each reference, but discarding is very fast.
- maintain a sequence number starting at 1 and incrementing on each object
reference. On each reference, store the current value in the object
being referenced. At prune time, examine each object in the cache and
discard those with a sequence number under some threshold. This is
quick on each reference, but a lot of work at discard time!
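The first approach can be sketched in a few lines on a modern JDK. This is not the ALEG implementation: it assumes java.util.LinkedHashMap in access-order mode as a stand-in for a hand-built linked list (it keeps its entries on a doubly linked list and moves an entry to the end each time it is read). The class name PrunedLruCache and its parameters are invented; the 6000 and 3000 limits would be passed in as maxSize and pruneTo.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical cache: grows to maxSize, then is pruned back to the pruneTo
// most recently used entries.
final class PrunedLruCache<K, V> {
    private final int maxSize;
    private final int pruneTo;
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<K, V> entries = new LinkedHashMap<>(16, 0.75f, true);

    PrunedLruCache(int maxSize, int pruneTo) {
        if (maxSize <= 0 || pruneTo < 0 || pruneTo >= maxSize) {
            throw new IllegalArgumentException("need 0 <= pruneTo < maxSize");
        }
        this.maxSize = maxSize;
        this.pruneTo = pruneTo;
    }

    // A successful read moves the entry to the end of the internal list.
    V get(K key) {
        return entries.get(key);
    }

    void put(K key, V value) {
        entries.put(key, value);
        if (entries.size() >= maxSize) {
            prune();
        }
    }

    // Discard from the start of the list (the oldest entries) until pruneTo remain.
    private void prune() {
        Iterator<Map.Entry<K, V>> it = entries.entrySet().iterator();
        while (entries.size() > pruneTo && it.hasNext()) {
            it.next();
            it.remove();
        }
    }

    // containsKey does not count as a reference, so it leaves the order alone.
    boolean contains(K key) {
        return entries.containsKey(key);
    }

    int size() {
        return entries.size();
    }
}
```

For the settings described above, this would be constructed as new PrunedLruCache<>(6000, 3000).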
I implemented the first method, and measured the work to manage the linked
list as very small (well under 1 millisecond), but I'll consider implementing
the second option and comparing the two some rainy day.
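For comparison, here is a rough sketch of what the second (sequence number) option might look like. It hasn't been built or measured for ALEG; the class name SequenceStampedCache, the inner Slot class and the way the threshold is chosen (sort the stamps, keep the pruneTo largest) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical cache that stamps each entry with a sequence number on every reference.
final class SequenceStampedCache<K, V> {
    private static final class Slot<T> {
        final T value;
        long lastUsed;
        Slot(T value, long lastUsed) {
            this.value = value;
            this.lastUsed = lastUsed;
        }
    }

    private final int maxSize;
    private final int pruneTo;
    private final Map<K, Slot<V>> slots = new HashMap<>();
    private long sequence = 0; // the first reference is stamped 1

    SequenceStampedCache(int maxSize, int pruneTo) {
        if (pruneTo < 1 || pruneTo >= maxSize) {
            throw new IllegalArgumentException("need 1 <= pruneTo < maxSize");
        }
        this.maxSize = maxSize;
        this.pruneTo = pruneTo;
    }

    // Cheap on each reference: one increment and one store.
    V get(K key) {
        Slot<V> slot = slots.get(key);
        if (slot == null) {
            return null;
        }
        slot.lastUsed = ++sequence;
        return slot.value;
    }

    void put(K key, V value) {
        slots.put(key, new Slot<>(value, ++sequence));
        if (slots.size() >= maxSize) {
            prune();
        }
    }

    // Expensive at prune time: every entry is examined and the stamps are sorted.
    private void prune() {
        List<Long> stamps = new ArrayList<>(slots.size());
        for (Slot<V> slot : slots.values()) {
            stamps.add(slot.lastUsed);
        }
        Collections.sort(stamps);
        // Stamps are unique, so exactly pruneTo entries are at or above this value.
        long threshold = stamps.get(stamps.size() - pruneTo);
        slots.values().removeIf(slot -> slot.lastUsed < threshold);
    }

    boolean contains(K key) {
        return slots.containsKey(key);
    }

    int size() {
        return slots.size();
    }
}
```

The cost profile is the reverse of the linked list: almost nothing on each reference, and a full pass plus a sort at each prune.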
So, in the production environment, our JVM will probably use around 100MB.
With our 1GB of memory, this will leave lots of space for an Oracle database
cache, which should result in most of the frequently accessed indices to
the data being fully cached in memory. I won't be able to blame slow disk I/O
as a cause of poor performance!
- Traced the operation of the frequently executed Java code
and optimized it as much as possible for performance. Some of the improvements
were quite significant. I'm very wary of premature optimization
(see the "Parting Thoughts" section at the bottom of this
page!), but if things aren't fast, people don't care that they are right (and vice
versa)!
- Started designing the data maintenance system (at last!). Worked out
a basic internal component diagram and the flows between the
browser and the server.
- Removed a bunch of duplicate manifestations and embodiment events
which somehow were created during loading/matching of the sources.
- With Marie-Louise, Annette and Tony Ralli, met with Peter Higgs
(IPR Systems), Virgina Gordon (New Media Connections) and Libby Gleeson
(Chair of the Australian Society of Authors) about possible areas of mutual
interest between ALEG and OzAuthors.
What I haven't done but need to do soon!
- Document how ALEG will handle some tricky cases - the "Poets of the
Month" works from the mid-1970s and "Down the Lake with Half a Chook".
These are amongst the most "difficult" cases Tessa and Kathy can
come up with, so if we think the proposed data model can handle these,
we'll be happy!
Next week
- ALEG Partner meeting
- Continue on the data maintenance (input) suite - develop prototype agent
maintenance screens.
- Continue discussions on user interface presentation options, sort orders, long -v- brief displays
Summary
- Well, I'd hoped last week to have some prototype data maintenance screens
to discuss at the Partners meeting, but this won't be done. Still,
I'm very happy that the demo search system is exercising all the vital
components of the query system, and they are all working well and
performing acceptably.