ALEG
Data Model Issues as at 1 August 2000

  1. Holdings

    Issues

    1. Finding the holdings of something is not trivial. Does the user want to find the work, any or a specific expression of the work, or any or a specific manifestation of any or a specific expression? ISBNs may be reasonably good for finding holdings at the exact manifestation level, but are no good at the expression and work levels.

    2. Searches on Kinetica at the author/title (work, expression sort-of) level have been demonstrated to be not very precise (but maybe they are precise enough).

    3. Searching against Kinetica is expensive (pay per search). It would be possible to query a range of likely sources of holdings information (Uni, State libraries) in parallel using Z39.50, but problems with author/title matching remain.

    4. The system should rank holdings so that resources closest or cheapest to the user are shown first. But this requires the system to "know" where the user is, and where the resources are, and how much it will cost the user to get them.

    5. Imagine a user wants to read "Core of My Heart" by "Mackellar, Dorothea". Would ALEG allow them to just directly find out holdings information for all anthologies, collected works, selected works, periodical issues containing that poem (about 22 recorded on AUSTLIT), or would it require the user to nominate one work and find the holdings for that, nominate another work, find the holdings for that and so on?

    Impact

    Can't implement the holdings part of the system until we have a strategy to address the above issues. Showing users holdings information is potentially a very useful function of ALEG.

    Options

    For ranking results:

    1. Use any knowledge we have about the user to rank holdings based on location. Try to guess the user's location based on domain name, and/or ask them which state they are in and remember what they tell us using an HTTP cookie.

    2. Let the user register a preference and remember it by an HTTP cookie.

    3. For registered users, record their location. Otherwise, group results by state so that they can easily find local resources.

    For finding holdings:

    1. Give the user the option of finding holdings for just the manifestation or for all author/title matches.

    2. Base the holdings information on Kinetica. Negotiate with Kinetica to somehow limit or control costs, absorb them within the ALEG project, or charge them back to the user (!) somehow. For exact manifestation matches, use the Kinetica immutable number (if we can via Z39.50!). Otherwise, do an author/title search on the work/expression. If possible, use the Kinetica author control number to identify the author!

    3. Ditto - don't use Z39.50, but some 'special' rumoured direct access to Kinetica (???)

    4. Use parallel Z39.50 queries to significant Z39.50 targets: Universities and State Libraries
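
    The parallel query option could be sketched roughly as follows. This is only an illustration using the Python standard library: search_target is a hypothetical stand-in for a real Z39.50 client call, and the target names and canned records are invented. The author/title matching problem noted above is not addressed here; the placeholder simply does an exact match.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Invented data standing in for remote catalogues (assumption, not real holdings).
FAKE_CATALOGUES = {
    "State Library A": [("Mackellar, Dorothea", "Core of My Heart")],
    "University B": [("Mackellar, Dorothea", "Core of My Heart"),
                     ("Author, Another", "Another Title")],
    "University C": [],
}


def search_target(target, author, title):
    """Hypothetical placeholder for a Z39.50 author/title search on one target."""
    return [rec for rec in FAKE_CATALOGUES[target] if rec == (author, title)]


def find_holdings(targets, author, title, timeout=10):
    """Query all targets in parallel; return {target: hit count} for targets with hits."""
    holdings = {}
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        futures = {pool.submit(search_target, t, author, title): t for t in targets}
        for future in as_completed(futures, timeout=timeout):
            target = futures[future]
            try:
                hits = future.result()
            except Exception:
                continue  # one failing target should not sink the whole search
            if hits:
                holdings[target] = len(hits)
    return holdings


result = find_holdings(list(FAKE_CATALOGUES), "Mackellar, Dorothea", "Core of My Heart")
```

    Because each target is queried in its own thread, the total wait is roughly that of the slowest target rather than the sum of all of them.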

    Recommendation

    Ranking:

    The first time a user does a holdings search, ask them how they'd like the results ordered. Remember this, but allow the user to change it later. The user should be able to specify their preference at the state or library level.

    When asking them, we'll default the preference list based on any domain name or registered user info we can find or guess about them.

    Retain this information based on their HTTP cookie, not their userid (as many users will not have userids, or will share userids).

    Regardless of heuristics used or settings remembered, the system must allow the user to change the basis of ranking (because the user may move around, or be conducting a query for someone else).
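
    A minimal sketch of this ranking, assuming the user's preference has already been read (for example from their cookie) as an ordered list of libraries and an ordered list of states. The Holding class, the function name and the library names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Holding:
    library: str
    state: str


def rank_holdings(holdings, preferred_libraries=(), preferred_states=()):
    """Order holdings: preferred libraries first, then preferred states,
    then everything else grouped by state."""
    lib_rank = {name: i for i, name in enumerate(preferred_libraries)}
    state_rank = {name: i for i, name in enumerate(preferred_states)}

    def sort_key(h):
        if h.library in lib_rank:
            return (0, lib_rank[h.library], "", h.library)
        if h.state in state_rank:
            return (1, state_rank[h.state], h.state, h.library)
        return (2, 0, h.state, h.library)

    return sorted(holdings, key=sort_key)


holdings = [
    Holding("Library D", "VIC"),
    Holding("Library B", "NSW"),
    Holding("Library C", "QLD"),
    Holding("Library A", "ACT"),
]
ranked = rank_holdings(holdings,
                       preferred_libraries=["Library C"],
                       preferred_states=["ACT", "NSW"])
```

    Since the preferences are plain arguments, the user can change the basis of ranking on any search without the stored setting being touched.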

    Finding:

    Option 1 (let the user specify a particular manifestation or all author/title matches).

    Regarding the source for holdings, defer the decision pending discussions with NLA. If the Kinetica approach fails for any reason, discuss the parallel Z39.50 search option with ALEG partners.

    Regarding issue 5, initially the user would have to nominate a work containing the poem and search for holdings on that - ie, ALEG would not automatically search for all holdings of all works in which the poem had appeared.

  2. Expression - formOfExpression values

    Issue

    ALEG would like to use interoperable metadata descriptions as far as possible to describe the expression form. But the closest 'standard' list, the Dublin Core Type element values, is a bit light - no 'moving image', for example.

    Recommendation

    Use the FRBR nominated list rather than the DC:Type vocabulary, as it is felt the FRBR list is a better match for this attribute. Map these names to DC:Type when describing the expression in DC terms.

  3. Physically Describing Manifestations

    Issue

    ALEG would like to use interoperable metadata descriptions as far as possible to describe the manifestation's format (such as the Dublin Core format element). It would also like to describe manifestations in terms likely to be useful for users, eg, 'Braille', 'Talking Book'.

    But sticking to a strict DC Format recommended vocabulary doesn't allow these user friendly descriptions.

    Impact

    Interoperability -v- perceived user friendliness.

    Options

    1. Use DC:Format's vocabulary (MIME types) but augment the vocabulary with local terms such as 'Braille', 'Talking Book', 'Large Print', 'Handwritten/Manuscript'

    2. As for option 1, but down-map these local terms to DC format terms when generating DC descriptions and supporting DC element searching

    3. Create a separate manifestation attribute specialPhysicalCharacteristics which can be used to enter this information (from a defined vocabulary) and used by the user as a search field ('find all large print books published since 1995 with a subject of the Vietnam War')

    Recommendation

    Implement option 2. Splitting the format description over two fields (one DC:format, one not) would unnecessarily complicate searching.

  4. Works which are Websites

    Issue

    The web is being used to publish much material which is of interest to ALEG. There is no great issue where a single work/expression is manifested on a web page with a URL (other than concerns over its longevity). The issue being raised here concerns works which are web sites, such as:

    Impact

    It is reasonable to expect that web users of ALEG will be biased towards identifying and accessing web based resources, because they are likely to be free and easy to access. So, it is probably useful for ALEG to include resources such as Tirra Lirra and identify them as being web based collections of material.

    Recommendations

    1. Treat electronically published periodicals as just periodicals which happen to have a manifestation which is represented by a URL, eg, Gangway can be dealt with as a periodical (usual worry of link rot when zine folds, so prefer publications archived by Pandora).

    2. Treat websites which are pure autobiographies, biographies, novels, collections of poems or collections of short stories as works (with one of the above workTypes) which just happen to have been manifested electronically. Give them manifestation formats of 'text/html', 'image/gif', 'video/mpeg' (however many apply) and a medium of 'website'.

    3. For websites which contain a collection of material and would not likely exist in any other form (ie, they make use of web-specific features such as hyperlinking and multi-media, such as Tirra Lirra and John Tranter's site), describe them as a work and assign a generic workType "collection", use however many work formTypes and genres as apply, assign the formOfExpression "collection" to the expression, and to the manifestation assign formats of 'text/html', 'image/gif', 'video/mpeg' (however many apply) and a medium of 'website'

    It is important that the URLs referenced by ALEG are monitored for continuing existence and relevance. Perhaps ALEG should incorporate a link checker which periodically (fortnightly?) checks all URLs and reports unavailable addresses to the ALEG Web Link Monitor for manual investigation.
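
    A link checker of this kind could be sketched as below, using only the Python standard library. The function names are hypothetical, and how the list of URLs is obtained and how the report reaches the ALEG Web Link Monitor are left open. Note that this only tests existence: a page that still answers but has changed its content would need the manual relevance check.

```python
import urllib.error
import urllib.request


def url_is_available(url, timeout=10):
    """Return True if the URL answers an HTTP HEAD request without an error."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP errors, DNS failures, timeouts and malformed URLs.
        return False


def find_dead_links(urls, check=url_is_available):
    """Return the URLs that failed the check, for manual investigation."""
    return [url for url in urls if not check(url)]
```

    The check function is passed in as a parameter so that the sweep can be exercised without touching the network, and so that a stricter test (for example one that retries before reporting) can be swapped in later.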

  5. References to Manuscript Collections held in RAAM

    Issue

    ALEG will often want to reference material held by the NLA's RAAM (Register of Australian Archives and Manuscripts) system.

    Impact

    RAAM contains valuable information on many ALEG agents. Showing the user that this information is available would be an important service delivered by ALEG.

    Options

    1. Search RAAM dynamically when a user views information about an agent, and if information is found, display a link to RAAM

    2. Search RAAM periodically (biannually?) to perform a "sweep" across all ALEG agents, find/confirm whether RAAM has information on each agent, and record the RAAM URL which can be shown to the user

    3. Receive notifications from RAAM when new material is made available, update the ALEG agent entity appropriately to link to the agent information on RAAM

    4. Manually record links to RAAM URLs in ALEG agent entities as part of creating/maintaining the agent entity

    Recommendation

    Initially implement option 4 (manual linkage). Investigate automated options 2 and 3 as time permits and in conjunction with the RAAM redevelopment.

  6. Thesaurus structure

    Issues

    This is an unstructured ramble attempting to discuss some approaches we could use for thesaurus construction and representation.

    1. Thesaurus -v- Topic Map

      When you think "thesaurus" maybe you think of a simple hierarchy with broader and narrower terms, and perhaps even related terms ("teaching" and "learning") and "use for" terms to catch misspellings and define preferred terms ("Kosciuszko/Kosciusko", "cemetery for graveyard"). Maybe the thesaurus you are thinking about even represents antonyms.

      Related terms, synonyms, antonyms and preferred terms move a thesaurus away from a strict hierarchy into representing something much messier (and more useful). But a problem remains - it is still hard to choose into which single hierarchy you're going to place a term.

      There is no problem with terms with multiple meanings - they can go into different hierarchies with the meaning disambiguated by the broader, narrower and related terms. For example, two terms named "fencing" can be added - one to the "sport" broader term, one to the "rural activities" broader term.

      But where in a thesaurus would you place Charles Dickens? In the "non ALEG agent" hierarchy, to be sure, but how would that hierarchy be organised to represent the orthogonal "broader terms" of Charles Dickens, such as "novelist", "English", "19th Century", "male"?

      It is hard to classify the term "Dickens, Charles" as being all of novelist, male, English, 19th century without creating a deep thesaurus structure which duplicates those terms in some arbitrary order over and over again. And describing "Dickens, Charles" as a narrower term for "male", itself a narrower term for "novelist", itself a narrower term for "England" is completely ridiculous.

      But the facts that Charles Dickens "is a" man, and "is a" 19th Century figure, and "is a" novelist and "is an" English person, and furthermore that a novelist "is a" writer are all useful if ALEG is to be able to answer questions such as "which 19th Century English writers have been described as influencing 20th Century Australian novels?".

      What data structures could support us answering such questions without committing us to maintaining horrendous amounts of detailed data?

      Topic Maps are one approach. With Topic Maps, "Dickens, Charles" could be represented as a topic of type 'non-ALEG agent'. There would be other topics:

      • "novelist", of topic type "occupation"
      • "writer" of topic type "occupation"
      • "19th Century" of topic type "era"
      • "English" of topic type "nationality"

      Some simple, universal relationships between these topics could be defined:

      • the occupation topic "writer" is a broader term for the occupation topic "novelist"
      • the non-ALEG agent topic "Dickens, Charles" is linked to the occupation topic "novelist"
      • the non-ALEG agent topic "Dickens, Charles" is linked to the era topic "19th Century"
      • the non-ALEG agent topic "Dickens, Charles" is linked to the nationality topic "English"

      The final three links may be established when "Dickens, Charles" is entered into the system. When the non-ALEG agent topic "Dickens, Charles" is linked to a work as being an influence on that work, the other existing links from "Dickens, Charles" automatically apply, so the work is also found as influenced by a novelist (and hence a writer), by a 19th Century figure and by an English person.

      It is important to note that the topics can be linked at any time, incrementally enriching the system. For example, it may be decided that "British" nationality is at least as important as "English" nationality. Adding in this concept requires:

      1. defining a new topic "British" of topic type "nationality"
      2. linking this new topic as being a broader term for the nationality topic "English" (note: temporal scopes can complicate this relationship, as "Irish", "Welsh", "Scottish" nationality may all be 'narrower' terms for British nationality but only for defined time frames!)

      It is specifically worth noting that this exercise did not require revisiting all the topics (such as "Dickens, Charles") linked to "English" nationality, so extending the "topic map" can be incremental and often very simple, yet allows significant new groupings and relationships to be accessed.
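
      The Dickens example can be sketched in a few lines of Python. This is a toy in-memory structure, not a proposal for the ALEG schema or an implementation of the Topic Maps standard: the class and method names are hypothetical, and links are untyped for brevity. It shows that adding "British" as a broader term for "English" changes what can be found without touching the "Dickens, Charles" topic.

```python
from collections import defaultdict


class TopicMap:
    def __init__(self):
        self.topic_type = {}             # topic name -> topic type
        self.broader = defaultdict(set)  # topic -> directly broader topics
        self.links = defaultdict(set)    # topic -> topics it is linked to

    def add_topic(self, name, topic_type):
        self.topic_type[name] = topic_type

    def add_broader(self, narrower, broader):
        self.broader[narrower].add(broader)

    def link(self, source, target):
        self.links[source].add(target)

    def with_broader(self, topic):
        """Return the topic plus every broader topic reachable from it."""
        seen, stack = set(), [topic]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.broader.get(current, ()))
        return seen

    def describes(self, topic, *required):
        """True if the topic is linked, directly or through broader terms,
        to every one of the required topics."""
        reachable = set()
        for linked in self.links.get(topic, ()):
            reachable |= self.with_broader(linked)
        return all(r in reachable for r in required)


tm = TopicMap()
tm.add_topic("Dickens, Charles", "non-ALEG agent")
tm.add_topic("novelist", "occupation")
tm.add_topic("writer", "occupation")
tm.add_topic("19th Century", "era")
tm.add_topic("English", "nationality")
tm.add_broader("novelist", "writer")
tm.link("Dickens, Charles", "novelist")
tm.link("Dickens, Charles", "19th Century")
tm.link("Dickens, Charles", "English")

# Is Dickens a British writer? Not yet: "British" is unknown to the map.
before = tm.describes("Dickens, Charles", "writer", "British")

# Incremental enrichment: two statements, and no change to the Dickens topic.
tm.add_topic("British", "nationality")
tm.add_broader("English", "British")
after = tm.describes("Dickens, Charles", "writer", "British")
```

      Here before is False and after is True. The temporal scoping noted above (nationalities that are narrower terms of "British" only for certain periods) is not modelled in this sketch.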

      How would Charles Dickens' works be represented? Perhaps the simplest way is to define them as new instances of "non-ALEG works" topics, so:

      • we'd define a new topic of type "non-ALEG work" named "Great Expectations"
      • we'd link that topic to the "non-ALEG agent" topic "Dickens, Charles" in a relationship of type "author"

      Possibly we'd link "Great Expectations" to other topics as well, depending on how important it was to be able to classify the "Great Expectations" work. For example, we could link "Great Expectations" to:

      • a date topic ([1860?]) with a "when published" relationship
      • a form topic (novel) with a "has form" relationship
      • a genre topic (?) with a "has genre" relationship
      • a subject topic ("poverty"?) with a "has subject" relationship

      Maybe it is outside the scope to answer questions such as "which Australian works have been influenced by 19th Century novels (Australian or non-Australian) which have 'hardship' as a subject?" (the "poverty" subject may be related somehow to the "hardship" subject).

      But simpler questions may be in scope: "which Australian works have as subjects Vietnamese poets who have written about the Vietnam War?", and the same flexible linking of structures is required to answer both.

      Just because the structure is in place does not mean it has to be completed, but when adding a Vietnamese writer as a subject, maybe the indexer thinks it worth linking them to the "poet" occupation topic, and linking them to the "vietnam war" subject topic with a "has written about topic" relationship.

    2. Should the ALEG Thesaurus be deeply nested?

      Popular directories such as Yahoo! and the Open Directory Project provide deeply nested (and cross linked) structures which allow the user to browse and dig deeper through an immense subject catalogue.

      Pure search engines such as Google use no subject catalogue at all - they don't support browse at all, just search.

      But if you have a subject catalogue, it seems to make sense to allow users to browse it. But anything containing tens of thousands of terms (or even hundreds of terms) cannot be successfully browsed - it is too big!

      So how many layers of nesting do you need? I think the answer is to keep nesting until the subjects remaining at the lowest level seem to sensibly form a group and are few enough in number to be conveniently scanned. Rarely would a level contain more than 50 terms, because it is hard to scan and manipulate a list much bigger than that.

      By using Topic Maps and associations between topics, it is easy to build a 'many headed thesaurus', where a term is represented as part of many broader terms: "Dickens, Charles" is part of many orthogonal hierarchies:

      • nationality: non Australian/British/English
      • occupation: creator/writer/novelist
      • era: 19th Century
      • political orientation: ...
      • literary movements: ...
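
      As an illustration of how such browse paths could be derived, the sketch below walks a mapping from each term to its broader terms and lists every path from a top-level term down to the term of interest. The function name and the dictionary layout are assumptions, and the mapping is assumed to contain no cycles.

```python
def hierarchy_paths(term, broader):
    """Return every top-level-to-term path, given {term: [broader terms]}."""
    parents = broader.get(term, [])
    if not parents:
        return [[term]]
    paths = []
    for parent in parents:
        for path in hierarchy_paths(parent, broader):
            paths.append(path + [term])
    return paths


broader = {
    "Dickens, Charles": ["English", "novelist", "19th Century"],
    "English": ["British"],
    "British": ["non Australian"],
    "novelist": ["writer"],
    "writer": ["creator"],
}

for path in hierarchy_paths("Dickens, Charles", broader):
    print("/".join(path))
```

      Each printed line is one of the orthogonal hierarchies in which the term can be reached when browsing.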

    3. Thesaurus and non-thesaurus subjects

      If it is worth subject cataloging using a term, is it worth letting a user browse by that term?

    4. Getting there - building the ALEG Thesaurus

      Moving/Restructuring AUSTLIT's thesaurus - how (to be developed)

    Recommendation

    The first 3 issues were discussed by the AUSTLIT team at some length on 26 July and these approaches were suggested:

    1. Non-ALEG agents and works will be stored in the same data structures as ALEG agents and works, but will be marked as 'non-ALEG'. Non-ALEG works includes works by ALEG agents which are not 'creative literary' (for example, Judith Wright's essays on conservation, Harry Feroka's magnificent oil-on-canvas portraits).

      Non-ALEG agents and works will not be searchable by name or title, but will be searchable as subjects.

      Non-ALEG agents and works can be described in whatever detail the indexer sees fit.

      This approach, for example, would allow not just Charles Dickens, Don Bradman and Mr Ed to be represented using the same framework as ALEG agents, but also, for example, Vincent Van Gogh (as an agent) and "Starry Night" as a work (hmmm.. what would the workType be???). So, "Starry Night" could be happily represented as the subject of a work, and topic mapped and hence be used to answer queries such as "find all works which have as a subject European oil paintings"

    2. We like the operation of the Open Directory Project (Humans do it better ™) as implemented by the Google Web Directory.

      Questions of 'how far to nest?' are best answered by assessing the ODP example directories against the ALEG topics and considering what will be most effective for the user.

    3. Non-thesaurus terms will be the exception rather than the rule. Non-thesaurus terms are those subjects which are not linked into the thesaurus structure, and hence are not browsable by the user. Getting rid of non-thesaurus terms as soon as possible will be the goal of a thesaurus control task/team/system, by:

      • adding them to one or more places in the thesaurus structure (hence converting them into thesaurus terms) or
      • replacing them with another term or
      • simply deleting them.
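
      A thesaurus control task could start from a report of the subjects still outside the structure. The sketch below treats a subject as a thesaurus term if it is a top-level term or has a chain of broader terms leading to one; that definition, the function names and the idea of an explicit set of top-level terms are assumptions made for illustration.

```python
def in_thesaurus(term, broader, top_level):
    """True if the term is, or leads via broader terms to, a top-level term."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current in top_level:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, ()))
    return False


def non_thesaurus_terms(subjects, broader, top_level):
    """Return, in sorted order, the subjects not yet linked into the thesaurus."""
    return sorted(s for s in subjects if not in_thesaurus(s, broader, top_level))


broader = {"fencing": ["sport"], "poverty": ["hardship"]}
top_level = {"sport"}
report = non_thesaurus_terms(["fencing", "poverty", "hardship"], broader, top_level)
```

      In this example "poverty" has a broader term but is still reported, because "hardship" itself has not been placed anywhere in the structure.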


Kent Fitch
k.fitch@adfa.edu.au
1 August 2000