#########################
                 # The Satimage database #
                 #########################

1. Sources:

     (*) This database is taken from the ftp anonymous "UCI Repository Of 
	 Machine Learning Databases and Domain Theories" 
	 (ics.uci.edu: pub/machine-learning-databases).

    The  database was in use in the European StatLog project, which
    involves comparing the performances of machine learning,
    statistical, and neural network algorithms on data sets from real-world
    industrial areas including medicine, finance, image analysis, and
    engineering design.


     (a) Author:

        This database was provided to UCI by:

	        Ross D. King
                Department of Statistics and Modelling Science
                University of Strathclyde
                Glasgow G1 1XH
                Scotland U.K.
                +44 41 552-4400 x 3033
                Fax +44 41 552-4711
                ross@turing.uk.ac

     (b) Original source:
	
	The original Landsat data for this database was generated
	from data purchased from NASA by the Australian Centre
	for Remote Sensing, and used for research at:

		The Centre for Remote Sensing
		University of New South Wales
		Kensington, PO Box 1
		NSW 2033 Australia.



2. Past Usage:

    Feng,C., Sutherland,A., King,S., Muggleton,S. & Henery,R.
    (1993). Comparison of Machine Learning Classifiers to 
    Statistics and Neural Networks. AI & Stats Conf. 93.

    D. Michie, D.J. Spiegelhalter, and C.C. Taylor, editors.
    Machine learning, Neural and Statistical Classification.
    Ellis Horwood Series In Artificial Intelligence,
    England, 1994.

    Voz J.L., Verleysen M., Thissen P. and Legat J.D.,
    Suboptimal Bayesian classification by vector quantization with small clusters
    ESANN95-European Symposium on Artificial Neural Networks,
    April 1995, M. Verleysen editor, D facto publications, Brussels, Belgium.
    
    Guerin-Dugue, A. and others,
    Deliverable R3-B4-P - Task B4: Benchmarks,  Technical report,
    Elena-NervesII "Enhanced Learning for Evolutive Neural Architecture", 
    ESPRIT-Basic Research Project  Number 6891,
    June 1995



3. Relevant Information:

    This database was generated from Landsat Multi-Spectral Scanner
    image data.  These and other forms of remotely sensed imagery can
    be purchased at a price from relevant governmental authorities. The
    data is usually in binary form, and distributed on magnetic
    tape(s).

    The Landsat satellite data is one of the many sources of information
    available for a scene. The interpretation of a scene by integrating
    spatial data of diverse types and resolutions including multispectral
    and radar data, maps indicating topography, land use etc. is expected
    to assume significant importance with the onset of an era characterised
    by integrative approaches to remote sensing (for example, NASA's Earth
    Observing System commencing this decade). Existing statistical methods 
    are ill-equipped for handling such diverse data types. Note that this
    is not true for Landsat MSS data considered in isolation (as in
    this database). This data satisfies the important requirements
    of being numerical and at a single resolution, and standard maximum-
    likelihood classification performs very well. Consequently,
    for this data, it should be interesting to compare the performance
    of other methods against the statistical approach.

    One frame of Landsat MSS imagery consists of four digital images
    of the same scene in different spectral bands. Two of these are
    in the visible region (corresponding approximately to green and
    red regions of the visible spectrum) and two are in the (near)
    infra-red. Each pixel is a 8-bit binary word, with 0 corresponding
    to black and 255 to white. The spatial resolution of a pixel is about
    80m x 80m. Each image contains 2340 x 3380 such pixels.

    The present database is a (tiny) sub-area of a scene, consisting of 
    82 x 100 pixels.
    The binary values were converted to their present ASCII form by Ashwin  
    Srinivasan. The classification for each pixel was performed on the basis 
    of an actual site visit by Ms. Karen Hall, when working for Professor
    John A. Richards, at the Centre for Remote Sensing at the University
    of New South Wales, Australia. Conversion to 3x3 neighbourhoods was done 
    by Alistair Sutherland.
    The initial test and training sets available at the "UCI Repository Of 
    Machine Learning Databases" were concatanated and mixed to obtain this 
    "satimage" database.  

    Each line of data corresponds to a 3x3 square neighbourhood
    of pixels completely contained within the 82x100 sub-area. Each line
    contains the pixel values in the four spectral bands 
    (converted to ASCII) of each of the 9 pixels in the 3x3 neighbourhood
    and a number indicating the classification label of the central pixel.
    The aim is to predict this classification, given the multi-spectral     
    values. 

    The database contains thus 6435 patterns with 36 attributes (4 spectral 
    bands x 9 pixels in neighbourhood) plus the class label.
    The attributes are numerical, in the range 0 to 255 (8 bits).
    The class label is a code for the following classes:

    Number	Class

    1		red soil
    2		cotton crop
    3		grey soil
    4		damp grey soil
    5		soil with vegetation stubble
    6		mixture class (all types present)
    7		very damp grey soil

    NB. There are no examples with class 6 in this dataset-
    they have all been removed because of doubts about the 
    validity of this class.

    
    The data is given in random order and certain lines of data
    have been removed so you cannot reconstruct the original image
    from this dataset.
    
    In each line of data the four spectral values for the top-left
    pixel are given first followed by the four spectral values for
    the top-middle pixel and then those for the top-right pixel,
    and so on with the pixels read out in sequence left-to-right and
    top-to-bottom. Thus, the four spectral values for the central
    pixel are given by attributes 17,18,19 and 20. If you like you
    can use only these four attributes, while ignoring the others.
    This avoids the problem which arises when a 3x3 neighbourhood
    straddles a boundary.




4. Summary Statistics:

   The dynamic of the attributes is in [27-157], with a mean value 83.47
   and a standard deviation egal to 17.6.  
   The database resulting from the centering and reduction by attribute of the Satimage 
   database is on the ftp server in the `REAL/satimage/satimage_CR.dat.Z' file.

   Class Distribution: 

     Class  Instances Percentage
	1     1533      23.82 %
 	2      703      10.92 %
	3     1358      21.10 %
	4      626       9.73 %
	5      707      10.99 %
	7     1508      23.43 %




5. Confusion matrix obtained with the k_NN classifier on the
   satimage_CR.dat database (test with the Leave_One_Out method). 
   k was set to 3 in order to reach the minimum error rate : 8.89 +/- 1.6%.

      {{0,     1,    2,    3,    4,    5,    7},
       {1,  98.1,  0.2,  1.1,  0.1,  0.5,  0.0},
       {2,   0.0, 96.5,  0.1,  0.7,  2.0,  0.7},
       {3,   0.5,  0.1, 93.4,  4.6,  0.0,  1.4},
       {4,   0.0,  0.8, 13.7, 70.6,  0.8, 14.1},
       {5,   3.1,  0.8,  0.1,  0.8, 89.7,  5.5},
       {7,   0.0,  0.1,  1.9,  7.3,  2.0, 88.7}}




6. Result of the Principal Component Analysis:

The Principal Components Analysis is a very classical method in pattern
recognition [Duda73].
PCA reduces the sample dimension in a linear way for the best
representation in lower dimensions keeping the maximum of inertia. The
best axe for the representation is however not necessary the best axe
for the discrimination. After PCA, features are selected according to
the percentage of initial inertia which is covered by the different
axes and the number of features is determined according to the
percentage of initial inertia to keep for the classification process.
This selection method has been applied on the satimage_CR database.
When quasi-linear correlations exists between some initial features,
these redundant dimensions are removed by PCA and this preprocessing is
then recommended.  In this case, before a PCA, the determinant of the
data covariance matrix is near zero; this database is thus badly
conditioned for all process which use this information (the quadratic
classifier for example).

The following files are available for the satimage database:
  - ``satimage_PCA.dat.Z'', the projection of the ``satimage_CR'' database on its 
        principal components (sorted in a decreasing order of the related 
        inertia percentage; so, if you desire to work on the database projected on 
        its x first principal components you only have to keep the x first attributes 
        of the satimage_PCA.dat database and the class labels (last attribute)). 

  - ``satimage_corr_circle.ps'', a graphical representation of the 
        correlation between the initial attributes and the two first 
        principal components,

  - ``satimage_proj_PCA.ps'', a graphical representation of the 
        projection of the initial database on the two first principal 
        components,


Table here below provides the inertia percentages associated to the 
eigenvalues corresponding to the principal component axis sorted in 
the decreasing order of their associated inertia percentage.
99 percent of the total database inertia will remain if the 17 first principal
components are kept.


      Eigen   Value   Inertia    Cumulated
      value          percentage   inertia
  
	1    16.3274  45.35       45.35
	2    14.3575  39.88       85.24
	3    1.57658   4.38       89.61
	4    0.88933   2.47       92.09
	5    0.65945   1.83       93.92
	6    0.60908   1.69       95.61
	7    0.37060   1.03       96.64
	8    0.19197   0.53       97.17
	9    0.12981   0.36       97.53
       10    0.12588   0.35       97.88
       11    0.08386   0.23       98.11
       12    0.06657   0.18       98.30
       13    0.06449   0.18       98.48
       14    0.05722   0.16       98.64
       15    0.04557   0.13       98.77
       16    0.04422   0.12       98.89
       17    0.04078   0.11       99.00
       18    0.03677   0.10       99.10
       19    0.02896   0.08       99.18
       20    0.02773   0.08       99.26
       21    0.02622   0.07       99.33
       22    0.02480   0.07       99.40
       23    0.02224   0.06       99.46
       24    0.02053   0.06       99.52
       25    0.01918   0.05       99.57
       26    0.01866   0.05       99.63
       27    0.01798   0.05       99.68
       28    0.01728   0.05       99.72
       29    0.01540   0.04       99.77
       30    0.01494   0.04       99.81
       31    0.01449   0.04       99.85
       32    0.01285   0.04       99.88
       33    0.01212   0.03       99.92
       34    0.01082   0.03       99.95
       35    0.01005   0.03       99.98
       36    0.00844   0.02      100.00
  
This matrix can be found in the satimage_EV.dat file.
    
The Discriminant Factorial Analysis (DFA) can be applied to a learning
database where each learning sample belongs to a particular class
[Duda73].  The number of discriminant features selected by DFA is fixed
in function of the number of classes (c) and of the number of input
dimensions (d); this number is equal to the minimum between d and c-1.
In the usual case where d is greater than c, the output dimension is
fixed equal to the number of classes minus one and the discriminant
axes are selected in order to maximize the between-variance and to
minimize the within-variance of the classes.
The discrimination power (ratio of the projected between-variance over
the projected within-variance) is not the same for each discriminant
axis: this ratio decreases for each axis. So for a problem with many
classes, this preprocessing will not be always efficient as the last
output features will not be so discriminant. This analysis uses the
information of the inverse of the global covariance matrix, so the
covariance matrix must be well conditioned (for example, a preliminary
PCA must be applied to remove the linearly correlated dimensions).

The Discriminant Factorial Analysis (DFA) has been applied on the 18
first principal components of the satimage_PCA database (thus by
keeping only the 18 first attributes of these databases before to apply
the DFA preprocessing) in order to build the satimage_DFA.dat.Z
database file, having 5 dimensions (the satimage database having 6
classes).


[Duda73]
Duda, R.O. and Hart, P.E.,
Pattern Classification and Scene Analysis,
John Wiley & Sons, 1973.