ALEG
Implementation Options

Draft 0.31

...the neurotic has problems,
the psychotic has solutions.
Thomas Szasz

Never undertake a project unless it is manifestly
important and nearly impossible.
Edwin Land, founder, Polaroid Corporation

Index

  1. Introduction
  2. Risks - introduction
  3. Off the shelf solutions
  4. Proprietary Text/Document Databases
  5. The proposed custom built approach
  6. An alternative custom built approach
  7. Operating System and Hardware
  8. Risks - the custom built approach


  1. Introduction

    The purpose of this document is to discuss implementation options for the ALEG project. That is, how to take the design and turn it into a working prototype by September 2000 and a production system by January 2001 with the least risk (both developmental and maintenance) and cost.

  2. Risks - introduction

    Humans are notoriously poor at assessing and dealing with risk. Risk is often considered too hard to assess or actively manage, so on a personal level, we usually "package up" the risk and pay for someone else to manage it (which is why insurance companies have such huge advertising budgets, expense accounts, flash buildings and comfortable executives). However thorough the analyses showing that it is 'purely' rational to self-insure on everything from cars to health care, few people take this approach, because the worst-case outcome (a fire destroying their $200,000 mortgaged home) is so much more feared than death by a thousand insurance premium payments.

    For many people and organisations risk is not a threat or a cost, but something to be sought, embraced, managed and used for competitive advantage. However, unless risk management is your 'job', this pursuit of risk must be at the expense of doing something else, presumably something which you are comparatively good at, or want to do, such as running a library or playing golf.

    In order to "manage" risk, you have to know

    People rarely assess any of these things well. Whether you're a householder who thought they'd offloaded storm-damage risk or a computer centre that thinks it has paid for 24x7 1 hour replacement service on critical computer gear, it often comes as a surprise to the seller of the risk how hard it can be to get the risk buyer to see things their way!

    Software development is a particularly risky process. Some estimates suggest that only 24% of IT projects undertaken by large companies are completed successfully. Another commentary claims that "in 1995 American companies and government agencies will spend $81 billion for canceled software projects" - that is, projects which delivered nothing at all.

    At this stage in the ALEG project, the most significant traditional project risks have been dealt with. We are confident that if the project as designed is successfully implemented it will make a worthwhile contribution to the study of Australian Literature. An enthusiastic and capable project team is in place. The business model and its risks are addressed elsewhere.

    This just leaves the comparatively minor implementation risks. A major aim of this document is to understand and address those risks in the context of the recommended implementation option.

  3. Off the shelf solutions

    There are no known off the shelf solutions which implement the ALEG data model. The closest category of tool is the library catalogue. However, these tools do not support:

    Without the richness of the FRBR and INDECS contributions to the ALEG data model, ALEG would become just another catalogue and would not be able to support the research activities which are its reason for existence.

  4. Proprietary Text/Document Databases

    A possible approach is to start with a proprietary database which specialises in the management of text and documents. Such products are becoming a large industry segment with the rise in popularity of XML.

    Some of the offerings in this domain are:

    Is a text database of particular relevance to this project? I (Kent Fitch) assert not. As the data model shows, ALEG is interested in and is required to represent strongly typed attributes and relationships (ie, well defined, constrained, chosen from controlled vocabulary) rather than blobs of free text. In this respect, ALEG is much more like a traditional database than would appear at first glance.
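
    To make the distinction between strongly typed attributes and blobs of free text concrete, here is a minimal Java sketch. It is not taken from the ALEG data model: the class name TypedAttributeSketch, the Form vocabulary and the Expression fields are all hypothetical, chosen only to show a value being checked against a controlled vocabulary before it is accepted.

```java
import java.util.Locale;

final class TypedAttributeSketch {

    // Hypothetical controlled vocabulary; these values are illustrative only
    // and are not taken from the ALEG data model.
    enum Form { NOVEL, POEM, SHORT_STORY, PLAY }

    // A strongly typed record: every attribute has a declared type and the
    // constructor rejects values that break the constraints.
    static final class Expression {
        final long workId; // relationship to a Work, held as a key
        final Form form;   // must come from the controlled vocabulary
        final int year;    // constrained numeric attribute

        Expression(long workId, Form form, int year) {
            if (form == null) {
                throw new IllegalArgumentException("form is required");
            }
            if (year <= 0) {
                throw new IllegalArgumentException("year must be positive");
            }
            this.workId = workId;
            this.form = form;
            this.year = year;
        }
    }

    // Maps user input onto the vocabulary. Unknown terms are rejected with an
    // IllegalArgumentException instead of being stored as free text.
    static Form parseForm(String text) {
        String key = text.trim().toUpperCase(Locale.ROOT).replace(' ', '_');
        return Form.valueOf(key);
    }

    public static void main(String[] args) {
        Expression e = new Expression(42L, parseForm("short story"), 1901);
        System.out.println(e.workId + " " + e.form + " " + e.year);
    }
}
```

    In a relational database the same constraints would typically be expressed as column types, foreign keys and lookup tables, which is the kind of processing a "standard" database is built for.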

    There are 'unusual' requirements for a traditional database, such as a Z39.50 interface and the ability to export and manipulate data in XML. However, it is easier (less risky) to layer these behaviours on top of a robust, scalable, efficient "standard" relational database rather than to bend a text or document processing technology (which may come with Z39.50 and even "native" XML) so that it can implement the core traditional database processing we require.

    In many ways, the current hype surrounding XML databases is reminiscent of the short-lived excitement surrounding object databases in the early 1990s. The arguments in favour of storing objects natively seemed to make sense; however, it turned out not to be that difficult to serialise or "flatten" objects into a relational structure, and the benefits of relational technology (wide range of products, costs, stability, speed, third-party and industry support, widespread expertise) were pretty compelling. Standard, well understood and general solutions won out over specialist, niche and technically optimised solutions.

    ALEG is not a toy-sized database. Initially there will be around 500,000 titles which, given the Work/Expression/Manifestation model and coupled with agents, events and subjects, will result in several million 'records'. Proven speed, scalability, tuning and performance tools and expertise, backup and recovery and reliability are the most important criteria, not an interesting architecture or peripheral or emerging technologies such as XPath and XQL support.

  5. The proposed custom built approach

    The proposed custom-built system is based on these components:

    1. Relational database

      Oracle8 is a widely used, powerful, scalable database. It is already licensed by ADFA.

    2. Free text search capability

      The Oracle interMedia Text product supports free text searching (including phrase searching) on Oracle database tables.

    3. Web server

      Apache is the most widely used web server by a large margin. Free, open source, actively maintained.

    4. Application environment

      Tomcat is the reference implementation of the Java Servlet 2.2 specification, largely sponsored by Sun and donated to the Apache project. Free, open source and actively maintained.

    5. XML/XSL support

      Xerces is a fully functional XML parser supporting the W3C Document Object Model (DOM) Level 2 (contributed to the Apache project by IBM). The DOM defines an interface which a program can use to access and manipulate an XML document. Xalan is a commercial quality and complete implementation of the W3C Extensible Stylesheet Language (XSL).

      Both Xerces and Xalan are free, open source, widely used and actively maintained.

    6. Z39.50 support

      YAZ is a free, open source and actively maintained Z39.50 toolkit with prebuilt sample targets and origins.

    7. DHTML Web browser

      Microsoft's Internet Explorer 5.5 supports a flexible and fast client-side programming environment, allowing an application to be integrated with the browser. Users do not need to download new versions, as standard browser cache management ensures the current version of the application is in use by the user. IE 5.5 is free and available for all Windows operating systems since Windows 95 and the Mac.

    Notes on the above model:

    1. The custom ALEG code running in the web server (Apache/Tomcat) environment will be written in Java, and split into 2 parts - a "business logic" layer which implements the core functionality of the system and a "presentation/formatting" layer. The business logic layer generates XML to be formatted by the presentation/formatting layer.

    2. The Xerces XML parser and DOM implementation is used to manipulate XML.

    3. The Xalan XSL processor is used to format/translate XML into HTML and other required representations.

    4. The business logic layer communicates with the Oracle database using the standard JDBC API. (JDBC: "Java DataBase Connectivity" is an Application Programming Interface which defines how a program written in the Java language communicates with a database.)

    5. Z39.50 services will be based on the YAZ Z39.50 toolkit. Z39.50 clients will connect to the ALEG Z39.50 target which in turn will issue requests via HTTP to the standard ALEG web server.

    6. The business logic layer may need to query Z39.50 targets (depending on how access to holdings is implemented - for example, apparently a 'direct' non Z39.50 connection from ALEG to Kinetica has not been ruled out as a possibility). If this is required, the business logic layer will request information from the ALEG Z39.50 origin based on the YAZ Z39.50 toolkit, which will issue the queries and pass information back to the business logic layer. The communication mechanism between the business logic layer and the ALEG Z39.50 origin has not yet been decided.

    7. General users wanting to access ALEG will do so using either an HTML 2 capable browser (not shown) connecting to the ALEG web server, or a Z39.50 origin.
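
    The split between the business logic layer (which generates XML through the DOM) and the presentation/formatting layer (which applies XSL) described in notes 1 to 3 can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard JAXP interfaces (javax.xml.parsers and javax.xml.transform) bundled with current JDKs rather than calling Xerces and Xalan classes directly, and the class name TwoLayerSketch, the element names "work" and "title" and the stylesheet are hypothetical, not the ALEG design.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class TwoLayerSketch {

    // Presentation layer input: a hypothetical stylesheet that renders a
    // <work> element as an HTML heading.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/work'>"
      + "<h1><xsl:value-of select='title'/></h1>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Business logic layer: builds an XML document through the DOM interface.
    // Elements are created with createElementNS (null namespace) so that the
    // XSLT processor sees proper local names.
    static Document buildWork(String title) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element work = doc.createElementNS(null, "work");
            Element titleElement = doc.createElementNS(null, "title");
            titleElement.appendChild(doc.createTextNode(title));
            work.appendChild(titleElement);
            doc.appendChild(work);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Presentation/formatting layer: applies the stylesheet to the DOM tree
    // and returns the resulting HTML as a string.
    static String toHtml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(buildWork("An Example Title")));
    }
}
```

    Because the business logic only ever produces XML, a different stylesheet could in principle produce another representation of the same record without touching the business logic code.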

    The proposed custom built approach has the advantages of:

  6. An alternative custom built approach

    If a Microsoft-based solution were preferred, the above proposal could be modified to use the following technologies:

    The development time required would be similar and the functionality delivered would be largely identical to the proposed approach.

  7. Operating System and Hardware

    As noted above, the proposed approach could be delivered on a variety of operating systems.

    The Linux/Intel combination is probably the most cost effective, but the operational preferences of the hosting institution (probably ADFA) are often the most important determinant of operating system and hardware selection.

    As an indicative sizing exercise, based on the following assumptions:

    1. database size - 500,000 titles, 20,000 agents, 10 million table rows, 500 MB of data
    2. peak loads of 10 concurrent users, 1 search per second, 5 HTTP requests per second

    then a relatively small amount of hardware would be required to run in a Linux/Intel environment. There are many determinants of the required configuration, but a sample hardware inventory would be:

    Such a machine would probably cost between $4,000 and $5,000. (As an indicative cost, a DELL 1300 configured as above has a list price of $US3,180 ($A5,500), but ADFA as a tertiary institution would get a much better price than this from its regular supplier!) If more memory is required, a further 512MB would typically cost between $1,000 and $2,000, depending on supplier.

    Depending on a business-case analysis, more hardware may be required to provide a fault-tolerant service. For example, disks could be mirrored and/or a hot-swappable box could be configured and constantly updated to take over in case of hardware failure of the main machine.
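
    The sizing assumptions listed earlier in this section can be sanity-checked with some simple arithmetic. The sketch below only restates the figures given above; the class name SizingCheck is hypothetical, and treating 1 MB as one million bytes is an assumption made for round numbers.

```java
final class SizingCheck {

    // Figures from the sizing assumptions in this section.
    static final long TITLES = 500_000L;
    static final long TABLE_ROWS = 10_000_000L;
    static final long DATA_BYTES = 500L * 1_000_000L; // 500 MB, assuming 1 MB = 10^6 bytes
    static final long PEAK_HTTP_REQUESTS_PER_SECOND = 5L;

    // Average size of a table row implied by the assumptions.
    static long averageBytesPerRow() {
        return DATA_BYTES / TABLE_ROWS;
    }

    // Average number of table rows per title implied by the assumptions.
    static long rowsPerTitle() {
        return TABLE_ROWS / TITLES;
    }

    // HTTP requests served in one hour if the peak rate were sustained.
    static long httpRequestsPerPeakHour() {
        return PEAK_HTTP_REQUESTS_PER_SECOND * 3600L;
    }

    public static void main(String[] args) {
        System.out.println("bytes per row:      " + averageBytesPerRow());      // 50
        System.out.println("rows per title:     " + rowsPerTitle());            // 20
        System.out.println("requests per hour:  " + httpRequestsPerPeakHour()); // 18000
    }
}
```

    On these figures the data averages about 50 bytes per row and 20 rows per title. Index and free text search storage are not counted here, so actual disk usage would be higher.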

  8. Risks - the custom built approach

    1. The open source software will not be supported

      A commonly cited risk with open source software (OSS) is that it is not supported by any particular company and hence may become unsupportable. The counter argument is that proprietary software supported by companies can become unsupported if the company goes out of business, or just decides to no longer support it.

      But the OSS technologies selected above have some very significant advantages over most proprietary software:

      1. They are very widely used. There are millions of web sites based on Apache, and tens of thousands based on Java servlets. Even if for some reason the OSS became unsupported by the OSS community, there would be tremendous incentives for commercial support to be offered.

      2. They implement well documented and standard interfaces. The OSS should be 'plug compatible' with other offerings implementing the key standards we are relying upon to build our software: HTTP, XML, DOM, XSL, Servlets API, Z39.50, JDBC. There are currently many (sometimes dozens of) independent implementations of these standards to choose from.

      3. These products are very well documented and the source code is published. This means that in the last resort it is possible to maintain them yourself (although the chances of this happening given the above points are very remote).

    2. The custom built code will not be supportable.

      Given the unique nature of ALEG, any implementation will require a substantial code base, so this risk is not unique to the custom built approach. In fact, because this code will be written in a widely used language (Java) and will interface with other components using widely used and industry standard interfaces and application programming interfaces (APIs) such as HTTP, XML, DOM, XSL and JDBC, it could be argued that the risk is much less than if a proprietary and not-so-widely used coding "platform" were used.

      Nevertheless, the risk remains and must be addressed by:

      1. rigorous code-level documentation using the facilities of Javadoc to generate as much documentation from the source-code comments and structures as possible.

      2. a system level design and implementation document describing the overview of the design, aimed explicitly at giving a new programmer enough information to tackle immediate problems and plan extensions by explaining the background of why the system is as it is.

      3. making the development of adequate system documentation a requirement of system acceptance.
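
      As an illustration of the code-level documentation meant in point 1, here is a minimal sketch of a Javadoc-commented class. The class TitleNormaliser and its behaviour are hypothetical and are not part of the ALEG design; the point is only the shape of the comments (summary sentence, @param, @return, @throws) from which the javadoc tool generates reference pages.

```java
import java.util.Locale;
import java.util.Objects;

/**
 * Hypothetical helper that prepares titles for comparison.
 *
 * <p>Declared package-private so the sketch compiles in any source file.
 * Real API classes would normally be public, since the javadoc tool
 * documents only public and protected members by default.
 */
final class TitleNormaliser {

    private TitleNormaliser() {
        // Utility class: not meant to be instantiated.
    }

    /**
     * Normalises a title by trimming leading and trailing whitespace,
     * collapsing internal runs of whitespace to a single space and
     * converting the result to lower case.
     *
     * @param rawTitle the title as entered; must not be {@code null}
     * @return the normalised title, never {@code null}
     * @throws NullPointerException if {@code rawTitle} is {@code null}
     */
    static String normalise(String rawTitle) {
        Objects.requireNonNull(rawTitle, "rawTitle");
        return rawTitle.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalise("  An   Example  Title "));
    }
}
```

      Keeping the description next to the code it describes makes it more likely that the two are updated together, which is the main reason to prefer generated documentation for this level of detail.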


Kent Fitch
k.fitch@adfa.edu.au
28 July 2000