ESA/ESO, Garching, Germany
December 1997
This is the last lecture of a series of lectures given during the Canary Islands Winter School for Astrophysics. The topic of the school was Astrophysics from Large Data Bases - the Internet Age.
The first four lectures dealt with the requirements and implementation considerations of scientific data archives for "big science" space and ground observational facilities. Handouts containing copies of the viewgraphs were given to the participants. This last lecture attempts to provide an overview of concepts for the future.
Technology evolution
Technology is still proceeding exponentially
Technology driver is entertainment and consumer electronics, not science
Multimedia applications require enormous processing power
Interactive TV will define bandwidth requirements
Networks will be ubiquitous and indispensable...
...and they will be used in different and unexpected ways
Crystal ball gazing
There will be a quantum jump in storage technology: nanotechnology and holographic storage devices
Low cost supercomputers: "piles of PCs" (Beowulf machines).
Bandwidth will become a non-issue
The Web will provide computing services, not just information
The differences between local and remote will vanish
Computers will become as unobtrusive as electric motors are today
Data Mining
In view of the very large amounts of data generated by state-of-the-art observing facilities, the selection of data for a particular archive research project quickly becomes an unmanageable task.
Hence, archive researchers first have to pre-select the possibly interesting data sets on the basis of the catalogue, then assess each observation by visually examining it (preview) and/or running an automated task to determine its suitability.
Such procedures are currently used for archive research with the HST Science Archive. This is only acceptable if the data volume is limited.
The ESO/CDS Data Mining Project aims to close this gap and to develop methods and techniques that will allow a thorough exploitation of the VLT Science Archive.
Data Mining Approaches
The basic concept is to not have to ask for individual data sets, but instead to be able to ask for all information pertaining to a set of search criteria.
In addition to parameters contained in the information catalogue, the search criteria should include parameters which pertain to the science content of the observations.
This implies that parameters which describe the science content have to be generated after the observations. The proper time is during the ingest of the data into the archive.
These parameters can then be correlated with other information
The concept is to create an environment that contains both parametric information extracted from the data and references to existing data bases and catalogues.
The environment then establishes a link between the raw data and the published knowledge with the immediate result of having the possibility to derive classification and other statistical samples
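The approach outlined above can be illustrated with a minimal sketch. It assumes a toy in-memory archive; the names `Observation`, `ingest` and `search`, and the two derived parameters (`mean_flux`, `peak_flux`), are hypothetical and stand in for whatever a real ingest pipeline would compute. The point is only that science-content parameters are generated once, at ingest, and that a query then asks for everything matching a set of criteria rather than for individual data sets.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    dataset_id: str
    catalogue: dict                               # known before observing (filter, ...)
    science: dict = field(default_factory=dict)   # derived during ingest


def ingest(archive, obs, pixels):
    """Derive simple science-content parameters while the data enter the archive."""
    obs.science["mean_flux"] = sum(pixels) / len(pixels)
    obs.science["peak_flux"] = max(pixels)
    archive.append(obs)


def search(archive, **criteria):
    """Return all observations whose parameters satisfy every criterion.

    Each criterion is a predicate applied to a catalogue or science parameter.
    """
    hits = []
    for obs in archive:
        params = {**obs.catalogue, **obs.science}
        if all(name in params and test(params[name])
               for name, test in criteria.items()):
            hits.append(obs)
    return hits


archive = []
ingest(archive, Observation("a001", {"filter": "V"}), [1.0, 2.0, 9.0])
ingest(archive, Observation("a002", {"filter": "V"}), [1.0, 1.0, 1.0])
ingest(archive, Observation("a003", {"filter": "R"}), [2.0, 4.0, 12.0])

# Ask for "all V-band data containing a bright source", not for a data set by name.
bright_v = search(archive, filter=lambda f: f == "V", peak_flux=lambda p: p > 5.0)
print([o.dataset_id for o in bright_v])  # prints ['a001']
```

In a real archive the criteria would of course be evaluated by the data base engine; the sketch only shows where the derived parameters enter.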
Knowledge Discovery in Data Bases (KDD)
Data mining as described above is a step in the KDD process: application of specific algorithm(s) to produce a particular enumeration of patterns over the data base
Knowledge Discovery in Data Bases is the extraction of implicit, previously unknown, and potentially useful knowledge from data bases.
Determination of science related parameters
The aim of parametrization must be the enumeration of statistically relevant and physically meaningful parameters. Examples: integrated energy fluxes of objects, colors, morphology, distribution. IDENTIFICATION.
This will lead to data archives which are organized by objects rather than by data sets (Albrecht, R., Albrecht, M.A., Adorf, H.M., Hook, R., Jenkner, H., Murtagh, F., Pirenne, P., Rasmussen, B.F., Archival Research with the ESO Very Large Telescope. In: Proceedings of the Workshop on Astronomical Archives, Trieste, Albrecht M.A. and Pasian, F. (Eds.), ESO Workshop and Conference Proceedings Series No. 50, pg. 133-141, 1994.)
Tools like SExtractor are a promising beginning. This software package allows the extraction and parametrization of objects on large image frames. (Bertin, E., Arnouts, S., Astronomy & Astrophysics Supplement, Vol. 117, pg. 393, 1996).
Electronic links will be used to collect parameters on the objects thus extracted from other sources, for instance data bases from other wavelength regions. These parameters will either be physically imported and added, or they will be attached to the objects through hyperlinks.
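As a hedged illustration of attaching parameters from other sources, the sketch below links each extracted object to the nearest entry of a second catalogue within a search radius. The record layout (`id`, `ra`, `dec`), the function names and the 2 arcsecond radius are assumptions made for the example; the stored identifier plays the role of the hyperlink mentioned above.

```python
import math


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation from the haversine formula; inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600.0


def attach(objects, other, radius_arcsec, label):
    """Link each object to the nearest entry of `other` inside the radius."""
    for obj in objects:
        best, best_sep = None, radius_arcsec
        for entry in other:
            sep = separation_arcsec(obj["ra"], obj["dec"], entry["ra"], entry["dec"])
            if sep <= best_sep:
                best, best_sep = entry, sep
        if best is not None:
            # Store a reference only; the parameters stay in the other data base.
            obj.setdefault("links", {})[label] = best["id"]
    return objects


optical = [
    {"id": "opt-1", "ra": 150.0000, "dec": 2.0000},
    {"id": "opt-2", "ra": 150.1000, "dec": 2.1000},
]
xray = [{"id": "x-7", "ra": 150.0002, "dec": 2.0001}]

attach(optical, xray, radius_arcsec=2.0, label="xray")
print(optical[0].get("links"), optical[1].get("links"))
```

The double loop is adequate for a handful of objects only; a production cross-match would use a spatial index.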
Processing
Classification. The challenge of classification is to select the minimum number of classes such that objects in the same class are as similar as possible and objects in different classes are as dissimilar as possible.
However, this has to be done in such a way that membership of an object in a particular class is meaningful in terms of the physical processes which are responsible for the condition of the object.
This is not always the case for traditional classification systems in astronomy: the binning criteria were determined by the characteristics of the detector and the physiology of the human classifier.
This is also not necessarily the case for statistical approaches (clustering, pattern recognition, neural network), because no physics is involved in establishing the classes.
The current emphasis is on automatic classification and on the data base access mechanisms to mine terabyte-sized data bases.
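To make the statistical side of this concrete, here is a minimal sketch of k-means clustering, one common way of grouping objects so that members of a class are close to each other in parameter space. The two parameters (a colour index and a logarithmic flux), the sample values and the starting centres are invented for illustration. As cautioned above, nothing in this procedure knows any physics: the classes are whatever the chosen parameters and the chosen number of centres produce.

```python
def kmeans(points, centres, iterations=20):
    """Plain k-means: assign each point to its nearest centre, then move each
    centre to the mean of its members, until the centres stop changing."""
    centres = [tuple(c) for c in centres]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centres)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centres[k])))
            for pt in points
        ]
        new_centres = []
        for k, old in enumerate(centres):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centres.append(old)  # keep an empty class where it was
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres


# (colour index, log flux) for six hypothetical objects
objects = [(0.10, 1.0), (0.20, 1.1), (0.15, 0.9),
           (1.40, 3.0), (1.50, 3.2), (1.60, 3.1)]

labels, centres = kmeans(objects, centres=[(0.0, 0.0), (2.0, 4.0)])
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

Note that the number of classes is an input here. Finding the minimum number of classes that is also physically meaningful is exactly the part the algorithm does not solve.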
The Archive Research Environment
It is evident that the optimum exploitation of the above concepts requires a special computational infrastructure.
Given the large data volumes and the need to access heterogeneous data bases and to execute different software packages, we need an environment tailored to these requirements.
The Research Station
The Research Station is the next step after the personal workstation
It consists of a powerful local processor which is networked to other machines and to data bases and knowledge bases.
It has a configurable personalized interface which allows access to all services and functions in a consistent and efficient manner.
The emphasis should be on visualization and conceptualisation (model building). This can be realized through multiple screens, big screen projection ("flight simulator/planetarium"), or through virtual reality (VR).
Demonstrational Interfaces/Activity Programming
Demonstrational interfaces let the user perform actions on concrete example objects, while at the same time constructing abstract "programs" (better: operational sequences).
Involves "guessing" by the computer, i.e. the dynamic change of default values based on context.
Context can be derived from a model of the operations (e.g. CCD calibration), or from the occurrence of previous instances.
Operations models can be represented as metacode, or as rule bases, or a combination.
"Learning" from previous instances can be done through a trainable neural net
The combination of demonstrational and active notebook interface (e.g. the interface to the Mathematica package on the Macintosh computer) is probably the ideal research-oriented interface.
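The "guessing" of defaults from previous instances can be sketched very simply. The example below is an assumption-laden stand-in: instead of a trainable neural net it merely counts which value the user chose most often in a given context and proposes that value next time. The class name `DefaultGuesser`, the context label and the parameter values are hypothetical.

```python
from collections import Counter, defaultdict


class DefaultGuesser:
    """Propose parameter defaults from what the user did before in the same context."""

    def __init__(self, fallback):
        self.fallback = dict(fallback)        # static defaults, used when nothing is known
        self.history = defaultdict(Counter)   # (context, parameter) -> value counts

    def record(self, context, parameter, value):
        """Remember one concrete choice made by the user."""
        self.history[(context, parameter)][value] += 1

    def suggest(self, context, parameter):
        """Return the most frequent earlier choice, or the static default."""
        seen = self.history.get((context, parameter))
        if seen:
            return seen.most_common(1)[0][0]
        return self.fallback[parameter]


guesser = DefaultGuesser({"bias_method": "median"})
guesser.record("ccd_calibration", "bias_method", "overscan")
guesser.record("ccd_calibration", "bias_method", "median")
guesser.record("ccd_calibration", "bias_method", "overscan")

print(guesser.suggest("ccd_calibration", "bias_method"))  # prints overscan
print(guesser.suggest("spectroscopy", "bias_method"))     # prints median
```

A model of the operation (for instance a rule base for CCD calibration) would replace or refine the static fallback.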
What about Virtual Reality
Emerging technology. Will be available in about 5 years.
"Virtual reality techniques hold the key to the ultimate user interface"
Applications: military, aerospace, education, ENTERTAINMENT.
VR removes the limitations of the 2-D screen
Provides 3-D display capability through immersive VR
With improved natural language interfaces data analysis/data base exploration/"discussions" become possible
Most astronomers' offices already constitute a mild non-immersive VR environment
Natural Language Interfaces
Would require the use of carefully controlled terminology in order to prevent unwanted operations
Requires instant written feedback to keep track of what is going on
Personally, I cannot see myself sitting in my office talking to my screen. This could be overcome by changing habits (everybody is doing it), or by suitable technological aids (telephone headset or similar).
At this time NL interfaces are too cumbersome and too expensive. This might change soon, at which point we should re-evaluate.
NL will be introduced through the audio capabilities of the WWW
An alternative approach
All the developments described above are, or will be, the result of technological progress (more powerful hardware) and the application of concepts which already exist, but have so far not been possible to apply.
This approach can be characterized as: more of the same.
However, what we have to aim for is a paradigm change, in order to achieve results which are not just more accurate, but which are of a different nature.
And we have to get the computer to help us do this.
A past example is the introduction of numerical integration in stellar evolution models, after analytical models had reached their limits.
In other words, we have to examine the process of doing astronomical research and investigate where and how state of the art computing technology can be employed.
The Astronomical Research Process
Has only recently been defined in epistemological terms: the model of the research process as developed by Sir Karl Popper (1972) comes closest to what most natural scientists do when they "do science"
The research process starts with the input of signals, either through sensory perception, or through measuring devices which register signals which are either too faint or not suited for our senses. We know this step as data acquisition.
The next step is the transformation of the input data into meaningful values, quite often literally the "data reduction" from a jumble of instrument dependent individual measurements to a much smaller, coherent and consistent set of parameters.
By injecting concepts into the collection of parameters we construct models. Concepts range from the very simple, such as a linear correlation, to the very complex, like evaporating black holes. The injection of concepts happens spontaneously and associatively; it is a result of the evolution of our brain.
Models come in two flavors, hypotheses and theories, the difference being that a hypothesis is an as yet unsubstantiated and incomplete theory. Given the fact that no theory is ever complete it is more correct to say that all models are hypotheses. This is in agreement with the historical observation that even "wrong" models served well as good hypotheses in a heuristic sense.
Good models allow us to make predictions as to future observations. They also allow us to add to our pool of concepts by abstraction and generalization. If a model conflicts with observations we have to discard it. Since we can never be certain that any model will forever withstand the test of future observations, Popper concludes that in science we can never demonstrably attain the "truth".
If we ask where in this process the most progress has been made historically, we tend to think that it has been in the first step: the introduction of ever more powerful telescopes and detectors, and the opening of more spectral windows, has allowed us to quite literally include observations of the whole universe in the building of models.
I would contend, however, that the most progress has been made in the application of concepts: the scientific revolution (i.e. paradigm change) during the period of enlightenment removed concepts like that of the supernatural, of magic and of the subjective from our model building tools, which indeed provided us with the very basis of what we today call scientific thinking.
The Scientific Library
Models found through the process described above are described by the scientists using a combination of natural language (with exactly defined semantic content of crucial elements usually called technical terms) and mathematical representation.
In other words, a scientific publication, and, more generally, the scientific library constitute a knowledge base, right now encoded in the idiosyncratic literary style of different authors with different cultural and language backgrounds.
In astronomy we have converged on one main representation language which we call scientific English, the quality of which, however, differs considerably between scientists, limiting their ability to convey, as an author, or to internalize, as a reader, a scientific model.
It is thus desirable to define a meta language for conveying scientific information, which is both human readable and computer processable.
On Language Standardization
For the past 20 years essentially all important astronomical publications have been published in English.
While this is a disadvantage for the non-native English speakers it is an enormous advantage for the science of astronomy
In no other science are all active scientists able to communicate with each other so easily.
All activities which have the potential of a deviation from this situation must therefore be forcefully resisted.
A meta-language for representation and processing
Even with all-English publications human-to-human knowledge transfer is suboptimal
Computer-assisted processing of published knowledge is impossible
First step towards a meta language: data dictionary and thesaurus
VISION: "publishing" will not be done in the form of papers, but as additions or modifications to a global knowledge base
Consistency checking, novelty, truth maintenance, etc., are immediately and easily possible - this eliminates refereeing
Byproduct: the knowledge base, or segments of it, can be mapped into different natural languages, (even languages which the contributors do not speak) and at different levels (such as textbook, or popular description).
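The first step named above, a data dictionary and thesaurus, can be pictured with a small sketch. The thesaurus entries and the function name `controlled_terms` are invented for illustration and are not taken from any existing astronomical thesaurus. The idea shown is only that different wordings in different papers are mapped onto one preferred term, so that a computer can treat them as statements about the same concept.

```python
import re

# Hypothetical thesaurus: variant wording -> preferred (controlled) term
THESAURUS = {
    "qso": "quasar",
    "quasi-stellar object": "quasar",
    "quasar": "quasar",
    "sn": "supernova",
    "supernova": "supernova",
}


def controlled_terms(text):
    """Return the set of preferred terms whose variants occur in the text."""
    lowered = text.lower()
    found = set()
    for variant, preferred in THESAURUS.items():
        # Whole-word match, so that e.g. "sn" does not fire inside "snapshot".
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.add(preferred)
    return found


print(controlled_terms("A QSO behind the cluster"))            # prints {'quasar'}
print(controlled_terms("Quasi-stellar object candidates"))     # prints {'quasar'}
print(controlled_terms("A snapshot survey of nearby dwarfs"))  # prints set()
```

This simple matcher does not handle plurals or ambiguous abbreviations; a real thesaurus would also record broader and narrower terms.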
Electronic Publishing
In the beginning "electronic publishing" was little more than preparing a publication on a word processor and sending it to another computer
The American Astronomical Society (AAS) started to accept abstracts for AAS meetings in 1981. Because of early problems with standards and conventions related to formatting and special characters there was a pause and a re-start with the introduction of TeX/LaTeX
TeX/LaTeX provided a very important service to our science during the past decade. It is now anachronistic and should be discontinued.
The AAS has pioneered electronic publishing: ApJ and ApJ Letters have been electronically available for years
"Electronic-only" journals have started.
Advantages: quick, no shipping required, searchable, potentially processable
Problems: too quick? What constitutes the "paper": what's on my disk, or what's on the disk of the publisher? Copyright, referencing, quoting.
For e-only publications: refereeing
Search Services
In addition to improved access and timely availability electronic publications have the advantage of being searchable
There are organisations like the NASA Astrophysics Data System (ADS) which specialise in such services. The aim is to free the user from having to read an increasingly enormous amount of material in order to find the desired information
Advanced search services are becoming available for a medium in which reading through all available material is totally prohibitive: the World Wide Web
The above search services are convenient and useful. However, as of today they are still mainly text-string oriented and not context oriented
Long Term Goals
The long term goal has to be to consider the body of electronically available publications as a data base much like an astronomical catalogue
In analogy to knowledge discovery in a numerical data base we then do knowledge discovery in this data base which contains concepts, models, and hypotheses: discovery of implied, previously unknown, and potentially useful knowledge from such a data base.
Alternatively, candidate models can be injected into such a data base with the aim of either supporting or disproving the model.
Having the contents of this data base represented in a meta language would facilitate this process enormously. However, I contend that some advances should be possible even on the basis of just text in scientific English.
It is obvious that the capability to do this would immediately lead to enormous advances in scientific productivity.