DLM Forum '99 Europe

Digital Archive Costs: Facts and Fallacies

by Kevin Ashley

Introduction

This paper was produced some time before DLM Forum '99 and is an indication of what was presented at the forum itself. New information is being made available week by week, and I will do my best to reflect as much recent material as possible. Nonetheless, many of the basic parameters one uses to begin a costing exercise for a digital archive are well understood, and they are summarised here.

The problem of costing

Many organisations now feel that they can no longer delay the implementation of some form of digital archive, whether its goal be purely the preservation of digital records, or also the provision of access to them. The work of the DLM Forum and other similar initiatives world-wide means that many of the technical issues are now resolved, or at least more clearly understood. A wealth of advice is available on matters such as suitable storage media, metadata management, copyright and legal issues, file formats, migration methods and access protocols. Faced with an increasing volume of digital records to deal with, the knowledge that the technical issues are either solved or can be overcome, and public or internal demands for increased access to modern archival material, it is difficult to postpone the day when some form of digital archiving takes place. The easy availability of network access and the increasing problem of dealing with older digital material also contribute to the urgency felt by many organisations. The main barrier many face, however, is the difficulty of deciding how much such a service might cost.

Similar problems are faced by libraries, archives and museums seeking to use digitisation as a new form of surrogate, increasing access to valuable but fragile original documents and objects. Although the creation of digital surrogates involves a very specific set of costs which do not apply to records that are ‘born digital’ (that is, those which begin their life in a computer), many of the ongoing costs - preservation, access, support for users - are very similar in overall character. One must beware, however, of drawing too close a parallel between the cost profiles of a digitisation project and a digital records project: it is important to understand which aspects are common, and which set each apart. Part of the difficulty in understanding costs has been the lack of working examples from which to learn, and the difficulty of extrapolating costs from pilot projects (of which there have been many) to full-scale public services.

What answers can I offer ?

I do not claim a perfect understanding of all the parameters that may affect the costs of such an exercise in your organisation. I can, however, speak from experience gained over 25 years of digital preservation of one sort or another at my organisation, the University of London Computer Centre (ULCC). Not all of this experience has been in operating true archives (in the sense of repositories which permanently preserve their contents), and not all of it has been subject to commercial and competitive pressure. Nor has all of it been successful: we have made mistakes during that period, but those mistakes have taught us valuable lessons. We have reached a position where we now operate digital preservation services under contract to a number of other bodies in the UK, including the British Library, and we have advised many others on how to set up such services for themselves. Our best-known contract is probably that with the Public Record Office of England and Wales, for whom we operate NDAD (http://ndad.ulcc.ac.uk/) - the National Digital Archive of Datasets.

For ULCC to do this cost-effectively we have had to develop a keen understanding of the factors that influence costs and know how to translate knowledge from pilots, demonstrations and proof-of-concept studies to large-scale public services. We have learnt through experience, and I hope to be able to share this with you.

Lack of knowledge creates fallacies

Lack of knowledge and facts in any area creates a vacuum which inevitably becomes filled with rumour, legend, fears and half-truths. This has certainly been the case in recent years with digital archives. Often this is the result of asking those whose experience covers only part of the necessary area of expertise - whether they be archivists, computing specialists, records managers or administrators - to deal with something where they cannot even identify the main influences on cost, much less the costs themselves. None of these professional groups can be blamed for not being able to cost such services fully. Until very recently there were almost no useful sources of information available on the costs of a digital preservation service. Those that did exist were either difficult to extrapolate, because they applied to subtly different types of service, or difficult to scale, because they came from pilot studies and research programmes.

For instance, work was done in the 1980s in the UK and elsewhere by the scientific research community on the relative costs of data storage and recomputation. Many of these groups, working in areas such as climate research, computational chemistry and engineering, were using computers whose cost was measured in millions of euros to carry out calculations which could take days or even weeks of computer time. One naturally wished to save the results of such computations, since they represented a significant capital investment. However, the results were often extremely large, whereas the information needed to recreate them - the input data and programs - was tiny in comparison. Although all large computer centres (such as ULCC) offered managed digital preservation facilities, the costs of these were also high and users were often limited in how much space they could consume. It was sensible, therefore, to examine case by case whether it was cheaper overall to save the data and pay storage costs, or to discard it and possibly pay for its recomputation in the future. This analysis was influenced by two variables: the likely time before the data would be needed again, and the likelihood that it would be needed at all (illustrations 1-3; a simple sketch of the calculation follows them). Recomputation costs fall with time in the short term, since computer power gets faster and cheaper. (In the longer term they rise: when all those who did the original computations have gone, it may be difficult for someone else to recreate their work.) Taking the likelihood of reuse into account is simply to evaluate the risk that one must incur any cost at all. If reuse is very unlikely, then it pays to discard everything, even if recomputation is very expensive.

(Illustration 1)

(Illustration 2)

(Illustration 3)
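
To make the trade-off concrete, here is a minimal sketch in Python. All of the figures - the storage price, the recomputation price, the probability of reuse and the rate at which computing gets cheaper - are invented for illustration; only the shape of the calculation comes from the discussion above.

    def storage_cost(annual_cost_eur, years):
        """Cost of keeping the results on managed storage until reuse."""
        return annual_cost_eur * years

    def recomputation_cost(original_cost_eur, years, p_reuse,
                           halving_period_years=2.0):
        """Expected cost of discarding now and recomputing on demand.

        In the short term recomputation gets cheaper as hardware improves
        (modelled here, as an assumption, by a halving of cost every
        halving_period_years); the chance that the data is never needed
        again discounts the cost further.
        """
        future_cost = original_cost_eur * 0.5 ** (years / halving_period_years)
        return p_reuse * future_cost

    # Results costing 50,000 EUR to compute and 2,000 EUR/year to store,
    # with a 30% chance of being needed again in five years:
    print(storage_cost(2_000, 5))                      # 10000
    print(recomputation_cost(50_000, 5, p_reuse=0.3))  # ~2652: discarding wins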

Interesting as these studies are - they tell us a good deal about the raw costs of storage at the time - they are of little use to us now. Storage costs have changed so much that it is very difficult to extrapolate from the mid-1980s to the present. But more importantly, with organisational records we do not have the option of recomputation. If we do not preserve information, it is lost forever.

This lack of information has led, in my experience, to the promulgation of two fallacious beliefs regarding digital archives: that the cost of a digital archive is determined almost entirely by the volume of data it holds, and that this cost is forbiddingly high.

Put another way, the first belief says that you can reduce the costing for an archive to a unit cost per megabyte or gigabyte. The second says that when you do this reduction, you won’t like the answer you get. Experience shows both beliefs to be ill-founded. One can reduce costs to euros per gigabyte, but in most archives the ratio will change continually as the holdings change. In some instances the ratio will hold true for one archive, yet be completely different for another with very similar facilities. Data storage costs are not zero, but they are by no means the dominant costs. I suspect the wish to characterise storage costs in this way comes from the dominance of storage costs in the world of traditional paper archives, where the space consumed by the archive and the costs of maintaining its environment are indeed a dominant factor.

In fact, the primary influences on the cost of a digital archive of any type are analogous to those which influence a more traditional archive or library. However, the terms used to describe them are often different, and the way in which we measure certain influences must often differ too. In addition, the relative importance of different factors - such as the size of the archive and the frequency of access - is quite different for a paper archive and a digital one.

The range of costs, recurrent and capital, is great. One can invest as little as 1,500 euros in a very basic digital preservation project, or as much as 2 million euros or more. Ongoing costs, expressed in those pointless costs per gigabyte, range from 4 euros/Gbyte/year to 400 euros/Gbyte/year.
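
A small sketch illustrates why a single cost-per-gigabyte figure misleads: once an archive carries substantial fixed costs (staff, access systems), the ratio falls continuously as holdings grow, so the ‘same’ service can sit almost anywhere in that range. The fixed and marginal figures below are invented, not drawn from any real archive.

    def cost_per_gigabyte(fixed_annual_eur, marginal_eur_per_gb, holdings_gb):
        """Unit cost when most costs do not scale with volume."""
        total = fixed_annual_eur + marginal_eur_per_gb * holdings_gb
        return total / holdings_gb

    for holdings in (10, 100, 1_000, 10_000):
        print(holdings, "GB ->",
              round(cost_per_gigabyte(100_000, 2.0, holdings)), "EUR/GB/year")
    # 10 GB -> 10002, 100 GB -> 1002, 1000 GB -> 102, 10000 GB -> 12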

What information is available

I have said that little or no information is available to help us understand costing issues. At the time I submitted my abstract to DLM99 this was true. In recent months, two publications have gone some way towards improving matters. Charles Dollar’s recent book, ‘Authentic Electronic Records’, deals with a wealth of issues concerning digital archiving. Its appendices contain detailed cost information from the US National Archives (NARA) on their digital preservation programme. It is difficult, however, to extrapolate these costs to a much larger or smaller archive than that operated by NARA. It is also difficult to work out how those costs might change if one wanted to operate a very different type of service - one which offered public network access to digital archives as well as digital surrogates, or one which dealt with a different range of input material.

Another recent publication of interest comes from the UK National Preservation Office. "digital culture: maximising the nation’s investment" (ed. Mary Feeney, ISBN 0-7123-4645-7) summarises the results of a number of studies funded by the NPO and JISC in the UK relating to all aspects of digital collection management. Costing models are covered in one chapter, although again one must be aware that the possibilities covered include digital surrogate services and non-permanent preservation archives. I will deal with this recent study in more depth at DLM99.

Primary cost influences

Before attempting to model costs, one needs to assess what type of service one is dealing with. Digital preservation services fall into a variety of models. At one extreme is the ‘safe-deposit box’ model offered by a number of scientific data archives. In these systems, the material deposited is not expected to be available to anyone other than the depositor or the depositor’s research group. The owners of the information are expected to deal with future data conversion and access issues, although the service provided may provide assistance with conversion of some common data formats. The service provider usually only guarantees two things:

The depositor is usually not concerned with the mechanics of how data is preserved or what it is preserved on. This service model is one of the few where data volumes really are the dominant cost factor. Even here, however, a number of other factors come into play:

In some extreme cases, these can alter our costs considerably. In particular, almost all digital storage systems will incur large overheads when dealing with a very large number of very small files. Even though total data volumes may be small, the total system cost will increase dramatically. In the worst cases, more space will be used for storing file metadata than is used for storing the data itself.
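
A minimal sketch of that small-file effect, assuming (purely for illustration) a fixed per-file metadata overhead of 4 kilobytes; real systems vary widely, but the shape of the result does not.

    PER_FILE_METADATA_BYTES = 4_096   # assumed catalogue/index entry per file

    def stored_bytes(n_files, avg_file_bytes):
        """Split total storage into the data itself and per-file metadata."""
        data = n_files * avg_file_bytes
        metadata = n_files * PER_FILE_METADATA_BYTES
        return data, metadata

    data, meta = stored_bytes(1_000_000, 1_024)   # a million 1 KB files
    print(f"data {data / 2**30:.2f} GiB, metadata {meta / 2**30:.2f} GiB")
    # data 0.95 GiB, metadata 3.81 GiB: the metadata outweighs the data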

Most archives will want to offer something substantially more than this basic safe-deposit service. Assuming that you do, evaluate which of the service elements discussed in the sections that follow - acquiring material, preservation, preparing for access, supporting access - you may be responsible for; it is assumed throughout that the basic task of preservation is being undertaken.

Once you have established which of these service aspects you will have to undertake, you can begin to evaluate the influences on your overall costs.

Acquiring material

Who is responsible for selection/appraisal ? Can this be handled by defining a policy whose application is simple (and therefore cheap), or must decisions be made on a case-by-case basis ? The cost implications here are exactly the same as for any other type of archive. The only special consideration may be that appraising some types of digital material may require staff to have specialist training.

What volumes of material will you be dealing with ? This needs to be measured using a number of metrics. One is the simple data volume, in gigabytes, terabytes or whatever unit is appropriate. Another is the number of files involved: a thousand one-megabyte files take the same space as a single one-gigabyte file, but will require a great deal more effort to accession. Another is the frequency of accessions. If our one thousand files come in a single transfer, and are all related, then costs will inevitably be lower than if each is transferred as a separate object at different times.
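
These three metrics can be combined into a rough accession-cost model. The unit costs below are invented; the point is only that per-transfer and per-file effort, not raw data volume, dominates when material arrives in many small pieces.

    def accession_cost(n_transfers, n_files, volume_gb,
                       per_transfer_eur=200.0, per_file_eur=1.0,
                       per_gb_eur=0.5):
        """Rough model: cost rises with transfers and files, barely with volume."""
        return (n_transfers * per_transfer_eur
                + n_files * per_file_eur
                + volume_gb * per_gb_eur)

    one_big = accession_cost(n_transfers=1, n_files=1, volume_gb=1.0)
    many_small = accession_cost(n_transfers=1_000, n_files=1_000, volume_gb=1.0)
    print(one_big, many_small)   # 201.5 vs 201000.5 EUR for the same gigabyte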

How many depositors are involved ? The greater the number of people you must liaise with, the greater your costs. In particular, if each depositing organisation has a different internal structure and working practices, it may be very difficult to impose a single transfer mechanism on them all. This will also have a considerable influence on costs.

Do you have control over the format of deposited material ? Dealing with a small number of file formats, or even one, allows for simpler procedures at the time of deposition and future migration. Each additional format imposes a one-off cost to develop procedures to deal with it.

Do you have control over when material will arrive and how quickly it must be processed ? Again, the issues here are no different than for a traditional archive. If you have the freedom to plan or spread your workload you can reduce costs.

Metadata - will it arrive with the deposits ? Must it be recreated from other sources (e.g. paper documents) ? Will these be provided automatically, or must the archive seek them out ? Deposits which arrive with little or no metadata can impose very great demands on archives which are responsible for locating that metadata. One must either place a cap on the amount of effort expended on this, or accept possibly huge spiralling costs for some transfers of material.

Preservation

Do you need to undertake conversion of data or metadata after deposit for permanent preservation ? Are there any other conversions needed to provide access to the data (if this is part of your service) ?

How often will media need to be replaced, and at what cost ? The choice of media will be influenced by data volumes and access requirements. Sometimes it is cheaper to use media with a short but predictable lifetime which can be replaced cheaply, particularly if frequent access is required. For other uses it is better to choose longer-lived media with higher replacement costs. Neither choice avoids the need to monitor the data regularly for signs of loss.
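
A sketch of the comparison, with invented prices and lifetimes; the only point is that the calculation is worth doing with your own figures before choosing media.

    def media_strategy_cost(horizon_years, media_life_years,
                            media_cost_eur, refresh_cost_eur):
        """Total cost of buying, then periodically replacing, one media set."""
        replacements = -(-horizon_years // media_life_years)  # ceiling division
        return replacements * (media_cost_eur + refresh_cost_eur)

    # Cheap short-lived media versus dearer long-lived media over 30 years:
    print(media_strategy_cost(30, 5, media_cost_eur=100, refresh_cost_eur=50))
    # 900: six cheap replacement cycles
    print(media_strategy_cost(30, 15, media_cost_eur=500, refresh_cost_eur=50))
    # 1100: two expensive cycles - here the short-lived media wins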

How often is migration to new file formats likely to be required ? You cannot predict migration costs until the need to migrate occurs, but you can make a reasoned guess as to when this might happen.

What are the basic costs of your storage system (including tape/disk drives, robotic systems, supporting storage management software) and media ? How efficiently will files be stored on the media ? Again, many small files will probably take up a lot more space than you might imagine: anything up to two or three times as much in the worst cases.
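
The inefficiency arises because files occupy whole allocation blocks on most media. A minimal sketch, assuming a 64-kilobyte block size (an assumption - check your own system’s figure):

    BLOCK_BYTES = 64 * 1024   # assumed allocation unit

    def media_bytes_used(file_sizes):
        """Round every file up to a whole number of storage blocks."""
        blocks = sum(-(-size // BLOCK_BYTES) for size in file_sizes)
        return blocks * BLOCK_BYTES

    small_files = [10 * 1024] * 10_000        # ten thousand 10 KB files
    print(media_bytes_used(small_files) / sum(small_files))
    # 6.4: the files occupy over six times their nominal volume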

Can you share your preservation system with other archives/projects/organisations ? There are considerable economies of scale in large storage systems, ‘large’ currently meaning hundreds of terabytes upwards. You can often share these basic facilities with other parts of your organisation, or even with other organisations. You may be able to purchase digital storage services from a supplier who will still allow you intellectual control of your deposits, and so avoid any capital investment at all.

Preparations for access

How much cataloguing is necessary ? How much information can be extracted automatically from the resources (e.g. dates of creation, authors and titles from document metadata), and how much must be written manually (e.g. administrative histories, contextual background) ?

To what extent can users be expected to make sense of the resources themselves (e.g. specialist researchers accessing a scientific data archive), and to what extent must this be explained via the catalogues ? Can you get depositors, or other researchers, to share part of this work ? The concept of a central physical archive whose catalogues and metadata are supplied by a distributed group of specialists is a practical one, and worth exploring in some environments - particularly where the level of work means that a particular type of expertise may only be needed for a few days or weeks a year.

How much work is necessary to deal with issues of closure/data protection ? If manual effort is required to remove names from digital documents or databases, then per-file costs will become very high. Again, this situation is analogous to that with paper archives. Some types of digital archive (such as databases) can, however, be dealt with much more straightforwardly. For these, anonymisation may mean simply removing one data element (a column in a table.)
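
For a simple tabular dataset, such anonymisation can be close to a one-line operation. A sketch in Python using the standard csv module; the field names are invented for illustration:

    import csv, io

    def drop_column(csv_text, column):
        """Return the CSV text with one identifying column removed."""
        rows = list(csv.DictReader(io.StringIO(csv_text)))
        fields = [f for f in rows[0].keys() if f != column]
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=fields,
                                extrasaction="ignore", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return out.getvalue()

    table = "name,year,amount\nJ. Smith,1965,120\nA. Jones,1966,95\n"
    print(drop_column(table, "name"))
    # year,amount
    # 1965,120
    # 1966,95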

Supporting access

How frequent will access be ? This needs to be measured by a number of metrics: the frequency of access to any one digital object; the total access per day, in terms of the number of objects and gigabytes supplied (assuming network access); and the distribution pattern of these accesses (time of day and sections of the archive.) These all correspond directly with paper archives. It is still likely that a small proportion of holdings will account for a great deal of access. We also know that dealing with many small documents is more expensive than dealing with a small number of large ones, even if the total volume (in weight of paper or megabytes delivered) is the same.
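
These metrics can be combined into a simple load model. The 80/20 split below is an assumption standing in for the skewed access pattern described above, not a measured figure:

    def daily_load(n_objects, requests_per_day,
                   hot_fraction=0.2, hot_share=0.8):
        """Split expected requests between 'hot' and 'cold' holdings."""
        hot_objects = int(n_objects * hot_fraction)
        hot_requests = requests_per_day * hot_share
        return hot_objects, hot_requests, requests_per_day - hot_requests

    hot, hot_req, cold_req = daily_load(100_000, 5_000)
    print(hot, hot_req, cold_req)
    # 20000 4000.0 1000.0: a fifth of the holdings takes 80% of the traffic,
    # which tells you how much of the archive needs fast (expensive) storage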

What is the expectation of those accessing the archive in terms of access times ? People are notoriously impatient with online services. They may be content to wait two hours for a paper document to be delivered from your stacks, but after three seconds waiting for an online document they are convinced the system has failed and only incessant, repeated clicking of all available buttons will make it work again. Supporting high-speed access usually demands expensive peripherals and sophisticated storage management algorithms (but not necessarily expensive storage media.)

What is the unit of access, and are accesses ‘bundled’ ? If you need to provide access to (say) individual pages of a document, this can be relatively more expensive than providing access only to an entire document collection. Conversely, if you can predict later accesses from recent access patterns (e.g. following pages 1, 2 and 3, page 4 of the same document is very likely to be accessed), your storage system can deliver greater apparent performance than it might appear capable of.
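
A minimal sketch of such predictive bundling: after a run of consecutive page requests, the next page is staged before it is asked for. This illustrates the principle only; it is not any particular product’s algorithm.

    class SequentialPrefetcher:
        def __init__(self, fetch, run_length=3):
            self.fetch = fetch            # function that retrieves a page
            self.run_length = run_length  # consecutive hits before prefetching
            self.last_page = None
            self.run = 0
            self.cache = {}

        def get(self, doc_id, page):
            if (doc_id, page) not in self.cache:
                self.cache[(doc_id, page)] = self.fetch(doc_id, page)
            # Track runs of consecutive pages of the same document.
            self.run = self.run + 1 if self.last_page == (doc_id, page - 1) else 1
            self.last_page = (doc_id, page)
            if self.run >= self.run_length:   # e.g. pages 1, 2 and 3 seen...
                nxt = (doc_id, page + 1)      # ...so stage page 4 in advance
                self.cache.setdefault(nxt, self.fetch(*nxt))
            return self.cache[(doc_id, page)]

    p = SequentialPrefetcher(lambda d, pg: f"{d}:page{pg}")
    for page in (1, 2, 3):
        p.get("doc", page)
    print(("doc", 4) in p.cache)   # True: page 4 was staged before any request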

Can you charge for access ? Is that a fixed charge or a unit charge ? Clearly, unit charges are likely to reduce overall demand. They will, however, create greater expectations in your users. When they are paying for something, they demand more of it than when it is free at the point of use.

What level of support will your users need ? Expert users need little or no assistance; inexperienced users can consume a great deal of staff time.

What total number of users do you expect ? You have fixed costs associated with every user you register, even if they make little or no use of your service. One member of staff per 1000 users is not unusual, purely to deal with issues of registration and account management.

How many modes of access must you cater for ? For some types of digital resource, different users will expect very different views of the archive to be provided. Additional view types impose initial development costs, and ongoing support costs.

Experience which may or may not be useful

Our own experience is that staff account for 70% or more of our total costs. This is for a service which must provide a full range of public access services, catalogue its holdings, and spend a great deal of time on depositor liaison and on locating metadata and contextual information. The next greatest costs are the capital and maintenance costs of the software and hardware associated with access (not with data preservation.) These costs are influenced primarily by the number of users and the frequency of access to holdings, not by the total volume of the archive. Most of the remaining costs are, in economic terms, relatively inelastic with respect to volume - that is, they vary with the volume of the archive, but only slightly.

A much greater influence is the number of accessions we deal with. Each deposit takes up (expensive) staff time in liaison and in the creation of finding aids. Whether the end result is 10 kilobytes of data or 100 gigabytes does not make a great deal of difference (although it has some bearing), even though one takes up 10 million times as much storage space as the other. Looked at another way, we could double the total size of our repository - from 300 to 600 terabytes - and add at most another 12% per year to our overall costs. This takes into account both servicing the capital debt and dealing with the additional recurrent costs. Data volume, then, is not the biggest influence, and the simple cost of a gigabyte of storage is not frightening, unless you are easily scared.
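
As a back-of-envelope check of those figures: doubling the holdings is a 100% volume increase against at most a 12% cost increase, an implied cost elasticity of about 0.12 with respect to volume.

    volume_increase = (600 - 300) / 300      # doubling the repository: +100%
    cost_increase = 0.12                     # at most +12% per year
    print(cost_increase / volume_increase)   # 0.12: cost elasticity w.r.t. volume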

Kevin Ashley (K.Ashley@ulcc.ac.uk; http://www.ulcc.ac.uk/Staff/Kevin+Ashley/)

September 1999

 
