ALEG
Weekly Report - Week 32, 7 December 2001
What I've done
- The popup standard text selector in the maintenance system is now
context sensitive - ie, you get different standard text depending on
which text field you are in when you invoke the text selector (F2 key).
- Database table definitions for the customer accounting and user database
have been created, but no actual code has been written yet.
- The system has been stable this week. For the first week ever, the server
Java class base has not been changed, and hence the server has not been restarted
for a week. (Changes were made to stylesheets, but these are detected and
dynamically reloaded.) It is encouraging that after a week of continuous
operation, memory use is stable (ie, there appear to be no memory leaks).
- Most of the week was spent with the CD-ROMs containing the scanned "full text" (and
picture!) images of works and reviews. Although copyright issues regarding these
are far from finalised, we've decided to load them now to at least provide a backup
of the CD-ROMs and to (hopefully) allow AustLit staff to use them for research
purposes.
I think everything was scanned and saved to CD in the "TIFF" image format, and
many or most were also saved in PDF format. I've been working with the TIFF
images, converting them to GIF format and reducing the colour depth to 8 greyscales.
This greatly reduces the image size, from a typical 3 - 4 MB for the TIFF to
50-500 KB for the GIF; still very large images!
The reason for the size is that the scanned images are very large - often a substantial
part of a broadsheet page, and the scanning process (and original material) means
lots of noise, which doesn't compress well!
I think there are 56 CDs in the main series plus about 6 "extras" which I think
were done a bit later. The 56 CDs have now been processed with only 2 images
failing - one due to a read error on the CD, one due to an error in the encoded
TIFF file. I'll try to find these particular images on the PDF format CDs next
week.
Most of the CDs were 80-90% full. There were 2 different directory structures on the CDs,
and neither was going to be helpful for automating the image conversion process, so I wrote
a simple program to read the images from the CDs and give them "better" names. Unfortunately,
my PC reads CDs quite slowly, so to speed things up I read them on the Sun box as well (which
gave me some exercise).
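A rough sketch of what such a renaming program could look like is given here. The naming scheme (CD label plus the flattened, lowercased path) and the function names flat_name and copy_images are assumptions for illustration only; the actual program and its naming rules are not described in this report.

```python
import shutil
from pathlib import Path


def flat_name(cd_label: str, relative_path: Path) -> str:
    """Build a flat, lowercase file name from a CD label and a path on that CD.

    Hypothetical scheme: directory names are folded into the file name so
    that images from either directory layout end up in one flat directory.
    """
    parts = [cd_label]
    parts += [p.lower().replace(" ", "_") for p in relative_path.parts]
    return "_".join(parts)


def copy_images(cd_root: Path, cd_label: str, dest: Path) -> int:
    """Copy every TIFF under cd_root into dest with a flat name; return the count."""
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for src in sorted(cd_root.rglob("*")):
        if src.is_file() and src.suffix.lower() in {".tif", ".tiff"}:
            target = dest / flat_name(cd_label, src.relative_to(cd_root))
            shutil.copyfile(src, target)
            copied += 1
    return copied
```

Because the CD label is part of every name, the same function can be run on several machines at once (the PC and the Sun box, say) without the output names colliding.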
I experimented with a few different conversion parameters using the freeware
XnView v1.25 graphics manipulation program.
Unfortunately, the images are huge (2408 x 3426 is a common size), and were scanned at
either 200 dpi or 300 dpi in 256 greyscales. Converting to binary (black/white) didn't
work very well due to the grainy nature of some of the text and images. Converting to
4 greyscales worked well for most, and 8 greyscales for almost all images, so we decided
to go for 8 greyscales (undithered) even though this increased file sizes slightly.
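To make "8 greyscales (undithered)" concrete, here is a minimal sketch of mapping each 8-bit grey value to the nearest of 8 evenly spaced grey levels. This is an illustration only: the function name quantise is made up, and XnView's actual reduction method may well differ.

```python
def quantise(value: int, levels: int = 8) -> int:
    """Map an 8-bit grey value (0-255) to the nearest of `levels` evenly spaced greys.

    No dithering: each pixel is treated independently, so neighbouring
    pixels never exchange rounding error.
    """
    if not 0 <= value <= 255:
        raise ValueError("grey value must be in 0..255")
    step = 255 / (levels - 1)           # distance between adjacent output greys
    return round(round(value / step) * step)


# The 8 output greys that a 256-level scan collapses to.
palette = sorted({quantise(v) for v in range(256)})
```

With 8 levels each pixel needs only 3 bits rather than 8, and calling quantise(v, 2) gives the black/white case that proved too coarse for the grainy scans.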
I thought about converting to PNG (see also the W3C PNG page) rather than
GIF - certainly the images were generally 10% smaller in PNG, but Netscape 4.x did not
display the images at their correct size, and couldn't print them at all. (But printing
images of this size is problematic anyway...).
Late Friday, with CDs all over the place, I think all except the 2 images mentioned
above have been converted - 16,968 images, 6.0 GB (excluding the "extra" CDs).
Now all we have to do is link them to our work/expression/manifestations! The images
are keyed on BRN, but reviews in the old AustLit did not have a BRN, so images of
reviews use the BRN of the work plus a running number to give them a unique
name. This will force manual intervention in the linking of images to works, but
hopefully a program will make this as simple as possible.
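As a starting point for such a linking program, the sketch below groups image file names by BRN so that all candidate images for a work can be presented together. The file name pattern used here ("<BRN>.gif" for a work, "<BRN>_<n>.gif" for a review) is purely an assumption; the real names may be laid out differently, and parse_name and group_by_brn are hypothetical helpers.

```python
import re
from collections import defaultdict

# Assumed naming: "<BRN>.gif" for a work, "<BRN>_<n>.gif" for review number n.
NAME_RE = re.compile(r"^(?P<brn>\d+)(?:_(?P<seq>\d+))?\.gif$")


def parse_name(filename):
    """Return (brn, running_number) or None if the name does not fit the pattern.

    running_number is None when the image belongs to the work itself.
    """
    match = NAME_RE.match(filename)
    if match is None:
        return None
    seq = match.group("seq")
    return match.group("brn"), (int(seq) if seq is not None else None)


def group_by_brn(filenames):
    """Collect file names under their BRN, skipping names that do not parse."""
    groups = defaultdict(list)
    for name in filenames:
        parsed = parse_name(name)
        if parsed is not None:
            groups[parsed[0]].append(name)
    return {brn: sorted(names) for brn, names in groups.items()}
```

A person would still have to decide which review record each numbered image belongs to; the grouping only narrows the choice to images sharing the work's BRN.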
Maybe someday someone could use a program such as XnView and process each
image manually to reduce the size - reduce noise, reduce number of greyscales where appropriate,
crop image, occasionally resize...
Next Week
- More tasks from Kerry and Annette's list.
- Off site backup procedures.
- Implement customer accounting and new user identification functions.
- Program to help linking of full text images to works.
- Process the "extra" image CDs.
Next few weeks
- Multiple creation events for a work as a mechanism for allowing date ranges
to be associated with agents responsible for works, eg editors of a periodical.
- Refining NBD Holdings searches.
- Search usage analysis.
- Combining searches.