ALEG
Data Model Issues as at 1 August 2000
- Holdings
- Issues
- Finding the holdings of something is not trivial. Does the user
want to find the work, any or a specific expression of
the work, or any or a specific manifestation of any
or a specific expression? ISBNs may be reasonably
good for finding expressions at the exact manifestation
level, but no good for the expression and work level.
- Searches on Kinetica at the author/title (work, expression
sort-of) level have been demonstrated to be not very
precise (but maybe they are precise enough).
- Searching against Kinetica is expensive (pay per
search). It would be possible to query a range of
likely sources of holdings information (Uni, State
libraries) in parallel using Z39.50, but problems with
author/title matching remain.
- The system should rank
holdings so that resources closest or cheapest to the
user are shown first. But this requires the system to
"know" where the user is, and where the resources are,
and how much it will cost the user to get them.
- Imagine a user wants to read "Core of My Heart" by
"Mackellar, Dorothea". Would ALEG allow them to just directly
find out holdings information for all anthologies, collected
works, selected works, periodical issues containing
that poem (about 22 recorded on AUSTLIT), or would it require
the user to nominate one work and find the holdings for that,
nominate another work, find the holdings for that and
so on?
- Impact
- Can't implement the holdings part of the system until we
have a strategy to address the above issues. Showing
users holdings information is potentially a very useful
function of ALEG.
- Options
- For ranking results:
- Use any knowledge
we have about the user to rank holdings based on
location. Try to guess the user's location based
on domain name, and/or ask them which state they are
in and remember what they tell us using an HTTP cookie.
- Let the user register a preference and remember it by
an HTTP cookie.
- For registered users, record their location.
Otherwise, group results by state so that they can easily
find local resources.
- For finding holdings:
- Give the user the option of finding
holdings for just the manifestation or for all author/title
matches.
- Base the holdings information on Kinetica. Negotiate with
Kinetica to somehow limit or control costs, absorb
them within the ALEG project, or charge them back to the user (!)
somehow. For exact manifestation matches, use the
Kinetica immutable number (if we can via Z39.50!).
Otherwise, do an author/title search on the work/
expression. If possible, use the Kinetica author
control number to id the author!
- Ditto - don't use Z39.50, but some 'special' rumoured
direct access to Kinetica (???)
- Use parallel Z39.50 queries to significant Z39.50
targets: Universities and State Libraries
- Recommendation
- Ranking:
- The first time a user does a holdings search, ask them how
they'd like the results ordered. Remember this, but allow
the user to change it later. The user should be able to
specify their preference at the state or library level.
When asking them, we'll default the preference list
based on any domain name or registered user info we can find
or guess about them.
Retain this information based on their HTTP cookie, not
their userid (as many users will not have userids, or
will share userids).
Regardless of heuristics used or settings remembered, the system
must allow the user to change the basis of ranking (because the
user may move around, or be conducting a query for someone else).
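The ranking recommendation can be sketched as follows. This is a minimal illustration only: the `Holding` record, the `rank_holdings` function and the sample holdings are hypothetical, and the preference list is assumed to be a list of state codes restored from the user's cookie. A library-level preference could be handled the same way by keying on the library name instead of the state.

```python
from dataclasses import dataclass


@dataclass
class Holding:
    library: str  # name of the holding library (sample data only)
    state: str    # state or territory code, e.g. "QLD"


def rank_holdings(holdings, preferred_states):
    """Order holdings so that preferred states come first, in preference order.

    Holdings in states that are not on the preference list are placed last
    and keep their original relative order (sorted() is stable).
    """
    position = {state: i for i, state in enumerate(preferred_states)}
    return sorted(holdings, key=lambda h: position.get(h.state, len(position)))


# Made-up holdings and a made-up preference list, as if read from a cookie.
holdings = [
    Holding("State Library of Victoria", "VIC"),
    Holding("University of Queensland Library", "QLD"),
    Holding("National Library of Australia", "ACT"),
    Holding("State Library of Queensland", "QLD"),
]
ranked = rank_holdings(holdings, ["QLD", "ACT"])
for holding in ranked:
    print(holding.state, holding.library)
```

Because the preference list is just an argument, the user can change the basis of ranking at any time without the system having to re-query the holdings sources.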
- Finding:
- Option a (let the user specify a particular manifestation
or author/title).
Regarding the source for holdings, defer the decision pending discussions with NLA. If the Kinetica
approach fails for any reason, discuss the parallel Z39.50
search option with ALEG partners.
Regarding issue "v." (the "Core of My Heart" example), initially the user would have to
nominate a work containing the poem and search for
holdings on that - ie, ALEG would not automatically
search for all holdings of all works in which the
poem had appeared.
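If the parallel search option is pursued, the fan-out itself is straightforward. The sketch below assumes a hypothetical `search_fn(target, author, title)` that performs one author/title search against one target; no real Z39.50 client is used here, and the target names are placeholders. It only shows how the queries could be issued in parallel and how a failing target can be tolerated.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_all(targets, author, title, search_fn):
    """Run search_fn against every target in parallel.

    Returns a dict mapping each target to its list of holdings, or to None
    if the search against that target raised an exception.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        futures = {pool.submit(search_fn, t, author, title): t for t in targets}
        for future in as_completed(futures):
            target = futures[future]
            try:
                results[target] = future.result()
            except Exception:
                results[target] = None  # treat the target as unavailable
    return results


def fake_search(target, author, title):
    """Stand-in for a real search; one target is made to fail on purpose."""
    if target == "uni-b":
        raise ConnectionError("target unavailable")
    return [f"{target}: {title} / {author}"]


found = search_all(
    ["uni-a", "uni-b", "state-c"],
    "Mackellar, Dorothea",
    "Core of My Heart",
    fake_search,
)
```

This does nothing about the author/title matching problems noted above; it only removes the need to wait for each target in turn.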
- Expression - formOfExpression values
- Issue
- ALEG would like to use interoperable metadata descriptions
as far as possible to describe the expression
form. But the closest 'standard' list, the Dublin Core Type
element values, is a bit light on - no 'moving image'
for example.
- Recommendation
- Use the FRBR nominated list rather than the DC:Type vocabulary, as it
is felt the FRBR list is a better match for this attribute.
Map these names to DC:Type when describing the expression
in DC terms.
- Physically Describing Manifestations
- Issue
- ALEG would like to use interoperable metadata descriptions
as far as possible to describe the manifestation's format (such as the
Dublin Core format element). It would also like to describe manifestations
in terms likely to be useful for users, eg, 'Braille', 'Talking Book'.
But sticking to a strict DC Format recommended vocabulary doesn't
allow these user friendly descriptions.
- Impact
- Interoperability -v- perceived user friendliness.
- Options
- Use DC:Format's vocabulary (MIME types) but augment the
vocabulary with local terms such as 'Braille',
'Talking Book', 'Large Print', 'Handwritten/Manuscript'
- Ditto (option a) but down-map these local terms to DC format
terms when generating DC descriptions and supporting DC
element searching
- Create a separate manifestation attribute
specialPhysicalCharacteristics which can be used to enter
this information (from a defined vocabulary) and used by the
user as a search field ('find all large print books published
since 1995 with a subject of the Vietnam War')
- Recommendation
- Implement option b (augment the DC:Format vocabulary with local terms and
down-map them to DC format terms). Splitting the format description over
two fields (one DC:format, one not) would unnecessarily complicate
searching.
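The down-mapping in option b could look something like the sketch below. Everything in it is an assumption made for illustration: the table name `LOCAL_TO_DC_FORMAT`, the helper names, and especially the DC-side terms chosen for each local term ('text', 'audio'), which would need to be agreed before implementation.

```python
# Hypothetical down-map from ALEG local format terms to DC format terms.
# The right-hand values are placeholders, not an agreed mapping.
LOCAL_TO_DC_FORMAT = {
    "Braille": "text",
    "Large Print": "text",
    "Handwritten/Manuscript": "text",
    "Talking Book": "audio",
}


def to_dc_format(term):
    """Return the DC format term for a stored format term.

    Terms that are already DC format values (e.g. MIME types) pass through
    unchanged.
    """
    return LOCAL_TO_DC_FORMAT.get(term, term)


def dc_formats(manifestation_formats):
    """Formats to emit when generating a DC description, without duplicates."""
    seen = []
    for term in manifestation_formats:
        dc_term = to_dc_format(term)
        if dc_term not in seen:
            seen.append(dc_term)
    return seen


def matches_dc_format(manifestation_formats, dc_query):
    """True if a DC element search on format should match this manifestation."""
    return dc_query in dc_formats(manifestation_formats)


stored = ["Large Print", "text/html"]
print(dc_formats(stored))
```

The single stored field keeps the user friendly term for display and searching, while DC descriptions and DC element searches only ever see the down-mapped value.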
- Works which are Websites
- Issue
- The web is being used to publish much material which
is of interest to ALEG. There is no great issue where
a single work/expression is manifested on a web page with
a URL (other than concerns over its longevity). The issue being
raised here concerns works which are web sites, such
as:
- Impact
- It is reasonable to expect that web users of ALEG
will be biased towards identifying and accessing web based
resources, because they are likely to be free and easy
to access. So, it is probably useful for ALEG to
include resources such as Tirra Lirra and identify
them as being web based collections of material.
- Recommendations
- Treat electronically published periodicals as just periodicals
which happen to have a manifestation which is represented
by a URL, eg, Gangway can be dealt with as a periodical
(usual worry of link rot when zine folds, so prefer
publications archived by Pandora).
- Treat websites which are pure autobiographies, biographies,
novels, collections of poems, collections of short stories
as works (with one of the above workTypes) which just
happen to have been manifested electronically, with
manifestation formats of 'text/html', 'image/gif', 'video/mpeg' (however
many apply) and a medium of 'website'.
- For websites which contain a collection of material and
would not likely exist in any other form
(ie, they make use of web-specific features such as
hyperlinking and multi-media; examples are Tirra Lirra
and John Tranter's site), describe them as a work
and assign a generic workType "collection",
use however many work formTypes and genres as
apply, assign the formOfExpression "collection" to the
expression, and to the manifestation assign
formats of 'text/html', 'image/gif', 'video/mpeg' (however
many apply) and a medium of 'website'.
It is important that the URLs referenced by ALEG are
monitored for continuing existence and relevance. Perhaps
ALEG should incorporate a link checker which periodically (fortnightly?)
checks all URLs and reports unavailable addresses to the ALEG
Web Link Monitor for manual investigation.
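A link checker of this kind could be quite small. The sketch below is an assumption about how it might work, not a design: `url_is_available` and `find_unavailable` are hypothetical names, and the check uses an HTTP HEAD request via Python's standard `urllib`. It can only test continuing existence; relevance would still need the manual investigation described above. Some servers reject HEAD requests, so a real checker might fall back to GET.

```python
import urllib.error
import urllib.request


def url_is_available(url, timeout=10):
    """Return True if the URL answers an HTTP HEAD request without error."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP errors, unreachable hosts, timeouts and malformed URLs
        # are all reported as unavailable.
        return False


def find_unavailable(urls, check=url_is_available):
    """Return the URLs that fail the check, in their original order."""
    return [url for url in urls if not check(url)]


# Example run with a stand-in checker so that no network access is needed.
known_good = {"http://example.org/alive"}
report = find_unavailable(
    ["http://example.org/alive", "http://example.org/gone"],
    check=lambda url: url in known_good,
)
print(report)
```

The returned list is what would be passed to the ALEG Web Link Monitor; scheduling the run is left to whatever job scheduler the host system provides.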
- References to Manuscript Collections held in RAAM
- Issue
- ALEG will often want to reference material held by the
NLA's RAAM
(Register of Australian Archives and Manuscripts) system.
- Impact
- RAAM contains valuable information on many ALEG
agents. Showing the user that this information is available
would be an important service delivered by ALEG.
- Options
- Search RAAM dynamically when a user views information
about an agent, and if information is found, display
a link to RAAM
- Search RAAM periodically (biannually?) to perform a
"sweep" across all ALEG agents and find/confirm
whether RAAM has information on the agent, and record
the RAAM URL, which can be shown to the user
- Receive notifications from RAAM when new material is
made available, update the ALEG agent entity appropriately
to link to the agent information on RAAM
- Manually record links to RAAM URLs in ALEG agent entities
as part of creating/maintaining the agent entity
- Recommendation
- Initially implement option d (manual linkage). Investigate
automated options b (periodic sweep) and c (notifications from RAAM)
as time permits and in conjunction with the
RAAM redevelopment.
- Thesaurus structure
- Issues
- This is an unstructured ramble attempting to discuss some
approaches we could use for thesaurus construction and
representation.
- Thesaurus -v- Topic Map
When you think "thesaurus" maybe you think of a simple hierarchy
with broader and narrower terms, and perhaps even related terms
("teaching" and "learning")
and "use for" terms to catch misspellings and define preferred
terms ("Kosciuszko/Kosciusko", "cemetery for graveyard").
Maybe the thesaurus you are thinking about even represents antonyms.
Related terms, synonyms, antonyms and preferred terms move
a thesaurus away from a strict hierarchy into representing
something much messier (and more useful).
But a problem remains - it is still hard to choose into which
single hierarchy you're going to place a term.
There is no problem with terms with multiple meanings - they
can go into different hierarchies with the meaning
disambiguated by the broader, narrower, related terms. For
example, two terms named "fencing" can be added - one to the
"sport" broader term, one to the "rural activities" broader
term.
But where in a thesaurus would you place Charles Dickens?
In the "non ALEG agent" hierarchy, to be sure, but how would
that hierarchy be organised to represent the orthogonal
"broader terms" of Charles Dickens, such as "novelist",
"English", "19th Century", "male"?
It is hard
to classify the term "Dickens, Charles" as being all of
novelist, male, English, 19th century without creating a
deep thesaurus structure which duplicates those terms in
some arbitrary order over and over again. And describing
"Dickens, Charles" as a narrower term for "male", itself
a narrower term for "novelist", itself a narrower term for
"England" is completely ridiculous.
But the facts that Charles Dickens "is a" man, and "is a"
19th Century figure, and "is a" novelist and "is a" English
person, and furthermore that a novelist "is a" writer are all useful
if ALEG is to be able to answer questions
such as "which 19th Century English writers have been
described as influencing 20th Century Australian novels?".
What data structures could support us answering such questions
without committing us to maintaining horrendous amounts of
detailed data?
Topic Maps are one approach. With
Topic Maps, "Dickens, Charles" could be represented as a topic
of type 'non-ALEG agent'. There would be other topics:
- "novelist", of topic type "occupation"
- "writer" of topic type "occupation"
- "19th Century" of topic type "era"
- "English" of topic type "nationality"
Some simple, universal relationships between these topics
could be defined:
- the occupation topic "writer" is a broader term for the occupation topic "novelist"
- the non-ALEG agent topic "Dickens, Charles" is linked to the occupation topic "novelist"
- the non-ALEG agent topic "Dickens, Charles" is linked to the era topic "19th Century"
- the non-ALEG agent topic "Dickens, Charles" is linked to the nationality topic "English"
The final three links may be established when "Dickens, Charles" is entered into
the system. When the non-ALEG agent topic "Dickens, Charles" is
linked to a work as being an influence on that work, these other
existing links on the "Dickens, Charles" topic automatically mean the
work is also recorded as influenced by a novelist (and hence a writer),
by a 19th Century figure, and by an English person.
It is important to note that the topics can be linked at any time,
incrementally enriching the system. For example, it may be
decided that "British" nationality is at least as important as
"English" nationality. Adding in this concept requires:
- defining a new topic "British" of topic type "nationality"
- linking this new topic as being a broader term for the nationality
topic "English" (note: temporal scopes can complicate this
relationship, as "Irish", "Welsh", "Scottish" nationality may
all be 'narrower' terms for British nationality but only
for defined time frames!)
It is specifically worth noting that this exercise did not
require revisiting all the topics (such as "Dickens, Charles") linked
to "English" nationality, so extending the "topic map" can be
incremental and often very simple, yet allows significant new
groupings and relationships to be accessed.
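The structure described above can be sketched in a few lines. This is an illustration only and not a Topic Maps implementation: the `TopicMap` class and its method names are hypothetical, topics are plain strings, and only two kinds of association are modelled (broader term, and an untyped link between topics).

```python
class TopicMap:
    def __init__(self):
        self.topic_type = {}  # topic name -> topic type
        self.broader = {}     # topic name -> set of broader topics
        self.links = {}       # topic name -> set of linked topics

    def add_topic(self, name, topic_type):
        self.topic_type[name] = topic_type

    def add_broader(self, narrower, broader):
        self.broader.setdefault(narrower, set()).add(broader)

    def link(self, topic, other):
        self.links.setdefault(topic, set()).add(other)

    def with_broader(self, topic):
        """The topic itself plus every broader term above it."""
        seen, stack = set(), [topic]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.broader.get(current, ()))
        return seen

    def is_a(self, topic, wanted):
        """True if any topic linked to `topic` is `wanted` or narrower than it."""
        return any(
            wanted in self.with_broader(linked)
            for linked in self.links.get(topic, ())
        )


tm = TopicMap()
dickens = "Dickens, Charles"
tm.add_topic(dickens, "non-ALEG agent")
tm.add_topic("novelist", "occupation")
tm.add_topic("writer", "occupation")
tm.add_topic("19th Century", "era")
tm.add_topic("English", "nationality")
tm.add_broader("novelist", "writer")
for linked in ("novelist", "19th Century", "English"):
    tm.link(dickens, linked)

writer_before = tm.is_a(dickens, "writer")    # found via novelist -> writer
british_before = tm.is_a(dickens, "British")  # "British" does not exist yet

# Adding the "British" concept touches only the nationality topics.
tm.add_topic("British", "nationality")
tm.add_broader("English", "British")
british_after = tm.is_a(dickens, "British")
```

Note that the "Dickens, Charles" topic is never revisited when "British" is added. Temporal scoping of the English/British relationship is not modelled in this sketch.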
How would Charles Dickens' works be represented? Perhaps the
simplest way is to define them as new instances of "non-ALEG
works" topics, so:
- we'd define a new topic of type "non-ALEG work" named "Great Expectations"
- we'd link that topic to the "non-ALEG agent" topic "Dickens, Charles"
in a relationship of type "author"
Possibly we'd link "Great Expectations" to other topics as well,
depending on how important it was to be able to classify the
"Great Expectations" work. For example, we could link
"Great Expectations" to:
- a date topic ([1860?]) with a "when published" relationship
- a form topic (novel) with a "has form" relationship
- a genre topic (?) with a "has genre" relationship
- a subject topic ("poverty"?) with a "has subject" relationship
Maybe it is outside the scope to answer questions such as
"which Australian works have been influenced by 19th Century
novels (Australian or non-Australian)
which have 'hardship' as a subject?" (the "poverty" subject may be
related somehow to the "hardship" subject).
But simpler
questions may be in scope: "which Australian works have as
subjects Vietnamese poets who have written about the Vietnam
War?", and the same flexible linking of structures is required
to answer both.
Just because the structure is in place does not mean it has to
be completed, but when adding a Vietnamese writer as a subject,
maybe the indexer thinks it worth linking them to the "poet"
occupation topic, and linking them to the "Vietnam War" subject topic with
a "has written about topic" relationship.
- Should the ALEG Thesaurus be deeply nested?
Popular directories such as Yahoo!
and the Open Directory Project
provide deeply nested (and cross linked) structures which allow the
user to browse and dig deeper through an immense subject catalogue.
Pure search engines such as Google
use no subject catalogue at all - they don't support browse at all, just
search.
But if you have a subject catalogue, it seems to make sense to allow
users to browse it. But anything containing tens of thousands
of terms (or even hundreds of terms) cannot be successfully browsed -
it is too big!
So how many layers of nesting do you need? I think the answer is
to keep nesting until the subjects remaining at the lowest level seem
to sensibly form a group and are few enough in number to be conveniently
scanned. Rarely would a level contain more than 50 terms, because it
is hard to scan and manipulate a list much bigger than that.
By using Topic Maps and associations between topics, it is easy
to build a 'many headed thesaurus', where a term is represented as part
of many broader terms: "Dickens, Charles" is part of many orthogonal
hierarchies:
- nationality: non Australian/British/English
- occupation: creator/writer/novelist
- era: 19th Century
- political orientation: ...
- literary movements: ...
- Thesaurus and non-thesaurus subjects
If it is worth subject cataloging using a term, is it worth
letting a user browse by that term?
- Getting there - building the ALEG Thesaurus
Moving/Restructuring AUSTLIT's thesaurus - how (to be developed)
- Recommendation
- The first 3 issues were discussed by the AUSTLIT
team at some length on 26 July and these approaches were suggested:
- Non-ALEG agents and works will be stored in the same
data structures as ALEG agents and works, but will be marked
as 'non-ALEG'. Non-ALEG works include works by ALEG agents
which are not 'creative literary' (for example, Judith Wright's essays
on conservation, Harry Feroka's magnificent oil-on-canvas portraits).
Non-ALEG agents and works will not be searchable by name or
title, but will be searchable as subjects.
Non-ALEG agents and works can be described in whatever detail
the indexer sees fit.
This approach, for example, would allow not just Charles Dickens,
Don Bradman and Mr Ed to be represented using the same framework as
ALEG agents, but also, for example, Vincent Van Gogh (as an agent)
and "Starry Night" as a work (hmmm.. what would the workType be???).
So, "Starry Night" could be happily represented as the subject of
a work, and topic mapped and hence be used to answer queries such
as "find all works which have as a subject European oil paintings"
- We like the operation of the Open
Directory Project (Humans do it better) as
implemented by the Google Web Directory.
Questions of 'how far to nest?' are best answered by assessing
the ODP example directories against the ALEG topics and considering
what will be most effective for the user.
- Non-thesaurus terms will be the exception rather than the
rule. Non-thesaurus terms are those subjects which are not
linked into the thesaurus structure, and hence are not browsable
by the user. Getting rid of non-thesaurus terms as soon as possible
will be the goal of a thesaurus control task/team/system, by:
- adding them
to one or more places in the thesaurus structure (hence converting them
into thesaurus terms) or
- replacing them with another term or
- simply deleting them.
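A thesaurus control task would first need a list of the non-thesaurus terms. One way to produce it is sketched below, under stated assumptions: the thesaurus is held as a hypothetical `broader` mapping from each term to its broader terms, `top_terms` is the set of top-level headings, and a subject counts as a thesaurus term only if it can reach a top term by following broader-term links. The sample terms are made up.

```python
def non_thesaurus_terms(subjects_in_use, broader, top_terms):
    """Return the subjects that are not linked into the thesaurus structure."""

    def linked(term):
        seen, stack = set(), [term]
        while stack:
            current = stack.pop()
            if current in top_terms:
                return True
            if current not in seen:
                seen.add(current)
                stack.extend(broader.get(current, ()))
        return False

    return sorted(term for term in subjects_in_use if not linked(term))


# Sample data only.
broader = {
    "novelist": {"writer"},
    "writer": {"occupation"},
    "fencing (sport)": {"sport"},
}
top_terms = {"occupation", "sport"}
in_use = {"novelist", "fencing (sport)", "bushrangers"}

orphans = non_thesaurus_terms(in_use, broader, top_terms)
print(orphans)
```

Each term reported this way would then be added to the structure, replaced, or deleted, as listed above.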