#########################
# The Satimage database #
#########################

1. Sources:

(*) This database was obtained by anonymous ftp from the "UCI Repository Of Machine Learning Databases and Domain Theories" (ics.uci.edu: pub/machine-learning-databases). The database was used in the European StatLog project, which compares the performance of machine learning, statistical, and neural network algorithms on data sets from real-world industrial areas including medicine, finance, image analysis, and engineering design.

(a) Author: This database was provided to UCI by:

    Ross D. King
    Department of Statistics and Modelling Science
    University of Strathclyde
    Glasgow G1 1XH
    Scotland
    U.K.
    +44 41 552-4400 x 3033
    Fax +44 41 552-4711
    ross@turing.uk.ac

(b) Original source: The original Landsat data for this database was generated from data purchased from NASA by the Australian Centre for Remote Sensing, and used for research at:

    The Centre for Remote Sensing
    University of New South Wales
    Kensington, PO Box 1
    NSW 2033
    Australia.

2. Past Usage:

Feng, C., Sutherland, A., King, S., Muggleton, S. & Henery, R. (1993). Comparison of Machine Learning Classifiers to Statistics and Neural Networks. AI & Stats Conf. 93.

D. Michie, D.J. Spiegelhalter, and C.C. Taylor, editors. Machine learning, Neural and Statistical Classification. Ellis Horwood Series In Artificial Intelligence, England, 1994.

Voz J.L., Verleysen M., Thissen P. and Legat J.D., Suboptimal Bayesian classification by vector quantization with small clusters. ESANN95-European Symposium on Artificial Neural Networks, April 1995, M. Verleysen editor, D facto publications, Brussels, Belgium.

Guerin-Dugue, A. and others, Deliverable R3-B4-P - Task B4: Benchmarks, Technical report, Elena-NervesII "Enhanced Learning for Evolutive Neural Architecture", ESPRIT-Basic Research Project Number 6891, June 1995.

3. Relevant Information:

This database was generated from Landsat Multi-Spectral Scanner image data.
These and other forms of remotely sensed imagery can be purchased at a price from relevant governmental authorities. The data is usually in binary form, and distributed on magnetic tape(s).

The Landsat satellite data is one of the many sources of information available for a scene. The interpretation of a scene by integrating spatial data of diverse types and resolutions, including multispectral and radar data, maps indicating topography, land use etc., is expected to assume significant importance with the onset of an era characterised by integrative approaches to remote sensing (for example, NASA's Earth Observing System commencing this decade). Existing statistical methods are ill-equipped for handling such diverse data types. Note that this is not true for Landsat MSS data considered in isolation (as in this database). This data satisfies the important requirements of being numerical and at a single resolution, and standard maximum-likelihood classification performs very well. Consequently, for this data, it should be interesting to compare the performance of other methods against the statistical approach.

One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to the green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each image contains 2340 x 3380 such pixels.

The present database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. The binary values were converted to their present ASCII form by Ashwin Srinivasan. The classification for each pixel was performed on the basis of an actual site visit by Ms. Karen Hall, when working for Professor John A. Richards, at the Centre for Remote Sensing at the University of New South Wales, Australia.
Conversion to 3x3 neighbourhoods was done by Alistair Sutherland. The initial test and training sets available at the "UCI Repository Of Machine Learning Databases" were concatenated and mixed to obtain this "satimage" database.

Each line of data corresponds to a 3x3 square neighbourhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighbourhood, and a number indicating the classification label of the central pixel. The aim is to predict this classification, given the multi-spectral values.

The database thus contains 6435 patterns with 36 attributes (4 spectral bands x 9 pixels in the neighbourhood) plus the class label. The attributes are numerical, in the range 0 to 255 (8 bits). The class label is a code for the following classes:

    Number    Class
    1         red soil
    2         cotton crop
    3         grey soil
    4         damp grey soil
    5         soil with vegetation stubble
    6         mixture class (all types present)
    7         very damp grey soil

NB. There are no examples with class 6 in this dataset; they have all been removed because of doubts about the validity of this class.

The data is given in random order and certain lines of data have been removed, so you cannot reconstruct the original image from this dataset.

In each line of data the four spectral values for the top-left pixel are given first, followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on, with the pixels read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are given by attributes 17, 18, 19 and 20. If you like, you can use only these four attributes and ignore the others. This avoids the problem which arises when a 3x3 neighbourhood straddles a boundary.

4. Summary Statistics:

The attribute values lie in the range [27, 157], with a mean value of 83.47 and a standard deviation equal to 17.6.
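As an illustration of the record layout described in section 3 and of the global statistics quoted above, the following Python sketch parses one pattern, picks out the central pixel (attributes 17 to 20), and computes the range, mean and standard deviation over all attribute values. It assumes whitespace-separated integer values with the class label last; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator produced the figure of 17.6.

```python
import statistics

N_ATTRIBUTES = 36  # 4 spectral bands x 9 pixels of the 3x3 neighbourhood


def parse_pattern(line):
    """Split one data line into (attributes, label).

    Assumption: whitespace-separated integers with the class label last.
    """
    values = [int(token) for token in line.split()]
    if len(values) != N_ATTRIBUTES + 1:
        raise ValueError(f"expected {N_ATTRIBUTES + 1} values, got {len(values)}")
    return values[:-1], values[-1]


def central_pixel(attributes):
    """Return the four band values of the central pixel.

    Attributes 17..20 in the 1-based numbering of the text are
    indices 16..19 in Python's 0-based numbering.
    """
    return attributes[16:20]


def global_statistics(patterns):
    """Range, mean and population standard deviation over all attribute values."""
    flat = [value for attributes in patterns for value in attributes]
    return min(flat), max(flat), statistics.fmean(flat), statistics.pstdev(flat)


if __name__ == "__main__":
    # Synthetic line (not real data): attributes 1..36 followed by label 3.
    demo = " ".join(str(i) for i in range(1, 37)) + " 3"
    attributes, label = parse_pattern(demo)
    print(central_pixel(attributes), label)
```

Keeping only the output of `central_pixel` gives the four-attribute variant of the problem mentioned in section 3.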
The database obtained by centring and reducing each attribute of the Satimage database (that is, standardising each attribute to zero mean and unit variance) is on the ftp server in the `REAL/satimage/satimage_CR.dat.Z' file.

Class Distribution:

    Class    Instances    Percentage
    1        1533         23.82 %
    2         703         10.92 %
    3        1358         21.10 %
    4         626          9.73 %
    5         707         10.99 %
    7        1508         23.43 %

5. Confusion matrix obtained with the k_NN classifier on the satimage_CR.dat database (test with the Leave_One_Out method).

k was set to 3 in order to reach the minimum error rate: 8.89 +/- 1.6%. The first row and the first column give the class labels; the entries are percentages, and each row sums to 100.

    Class       1       2       3       4       5       7
    1        98.1     0.2     1.1     0.1     0.5     0.0
    2         0.0    96.5     0.1     0.7     2.0     0.7
    3         0.5     0.1    93.4     4.6     0.0     1.4
    4         0.0     0.8    13.7    70.6     0.8    14.1
    5         3.1     0.8     0.1     0.8    89.7     5.5
    7         0.0     0.1     1.9     7.3     2.0    88.7

6. Result of the Principal Component Analysis:

Principal Component Analysis (PCA) is a classical method in pattern recognition [Duda73]. PCA reduces the sample dimension linearly, giving the best representation in fewer dimensions in the sense of keeping the maximum inertia. The best axis for representation is, however, not necessarily the best axis for discrimination. After PCA, features are selected according to the percentage of the initial inertia covered by the different axes, and the number of features is determined by the percentage of initial inertia to be kept for the classification process.

This selection method has been applied to the satimage_CR database. When quasi-linear correlations exist between some of the initial features, these redundant dimensions are removed by PCA, and this preprocessing is then recommended. In that case the determinant of the data covariance matrix before PCA is near zero; the database is thus badly conditioned for any process which uses this information (the quadratic classifier, for example).
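The ideas in this section can be sketched in plain Python on a toy example: centre and reduce each attribute, form the covariance matrix of the result (which is then the correlation matrix), and estimate the leading eigenvalue by power iteration. Power iteration is used here only to keep the sketch free of dependencies; nothing in this document says it is the method that produced the satimage_PCA files. All function names and the toy data are hypothetical.

```python
import math


def centre_and_reduce(rows):
    """Standardise each attribute to zero mean and unit standard deviation.

    A constant attribute (standard deviation 0) is not handled in this sketch.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def correlation_matrix(z):
    """Covariance matrix of centred-and-reduced data (the correlation matrix)."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(d)]
            for i in range(d)]


def leading_eigenpair(m, iterations=500):
    """Largest eigenvalue and its axis for a symmetric matrix, by power iteration."""
    d = len(m)
    # Deliberately non-uniform start, to reduce the chance of starting
    # orthogonal to the leading axis.
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the leading axis")
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))  # Rayleigh quotient
    return eigenvalue, v


def inertia_percentage(eigenvalue, n_attributes):
    """Share of total inertia; the trace of a correlation matrix is n_attributes."""
    return 100.0 * eigenvalue / n_attributes


if __name__ == "__main__":
    # Toy data (not satimage): the second attribute is 2 * first + 1.
    toy = [[1, 3], [2, 5], [3, 7], [4, 9]]
    corr = correlation_matrix(centre_and_reduce(toy))
    value, axis = leading_eigenpair(corr)
    print(value, inertia_percentage(value, len(corr)))
```

In the toy data the two attributes are exactly linearly related, so the first axis carries all the inertia and the remaining eigenvalue is zero. This is the extreme form of the badly conditioned case described above, where the determinant of the covariance matrix vanishes.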
The following files are available for the satimage database:

- ``satimage_PCA.dat.Z'', the projection of the ``satimage_CR'' database on its principal components, sorted in decreasing order of the related inertia percentage. To work on the database projected on its first x principal components, keep only the first x attributes of the satimage_PCA.dat database and the class labels (last attribute).

- ``satimage_corr_circle.ps'', a graphical representation of the correlation between the initial attributes and the first two principal components.

- ``satimage_proj_PCA.ps'', a graphical representation of the projection of the initial database on the first two principal components.

The table below gives the inertia percentages associated with the eigenvalues of the principal component axes, sorted in decreasing order of inertia percentage. Keeping the first 17 principal components retains 99 percent of the total database inertia.

    Eigenvalue    Value       Inertia       Cumulated
                              percentage    inertia
     1            16.3274     45.35          45.35
     2            14.3575     39.88          85.24
     3             1.57658     4.38          89.61
     4             0.88933     2.47          92.09
     5             0.65945     1.83          93.92
     6             0.60908     1.69          95.61
     7             0.37060     1.03          96.64
     8             0.19197     0.53          97.17
     9             0.12981     0.36          97.53
    10             0.12588     0.35          97.88
    11             0.08386     0.23          98.11
    12             0.06657     0.18          98.30
    13             0.06449     0.18          98.48
    14             0.05722     0.16          98.64
    15             0.04557     0.13          98.77
    16             0.04422     0.12          98.89
    17             0.04078     0.11          99.00
    18             0.03677     0.10          99.10
    19             0.02896     0.08          99.18
    20             0.02773     0.08          99.26
    21             0.02622     0.07          99.33
    22             0.02480     0.07          99.40
    23             0.02224     0.06          99.46
    24             0.02053     0.06          99.52
    25             0.01918     0.05          99.57
    26             0.01866     0.05          99.63
    27             0.01798     0.05          99.68
    28             0.01728     0.05          99.72
    29             0.01540     0.04          99.77
    30             0.01494     0.04          99.81
    31             0.01449     0.04          99.85
    32             0.01285     0.04          99.88
    33             0.01212     0.03          99.92
    34             0.01082     0.03          99.95
    35             0.01005     0.03          99.98
    36             0.00844     0.02         100.00

This matrix can be found in the satimage_EV.dat file.
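The inertia columns of the table follow directly from the eigenvalues. The Python sketch below recomputes them and finds how many components are needed to reach a target percentage. The eigenvalues are copied from the table (they sum to approximately 36, the number of attributes, as expected for centred and reduced data); the function names are hypothetical, and the shares are normalised by the sum of the listed values, which is an assumption about how the table was produced.

```python
# Eigenvalues copied from the table above, in decreasing order.
EIGENVALUES = [
    16.3274, 14.3575, 1.57658, 0.88933, 0.65945, 0.60908,
    0.37060, 0.19197, 0.12981, 0.12588, 0.08386, 0.06657,
    0.06449, 0.05722, 0.04557, 0.04422, 0.04078, 0.03677,
    0.02896, 0.02773, 0.02622, 0.02480, 0.02224, 0.02053,
    0.01918, 0.01866, 0.01798, 0.01728, 0.01540, 0.01494,
    0.01449, 0.01285, 0.01212, 0.01082, 0.01005, 0.00844,
]


def inertia_table(eigenvalues):
    """Return rows of (rank, eigenvalue, inertia %, cumulated inertia %)."""
    total = sum(eigenvalues)
    rows = []
    running = 0.0
    for rank, value in enumerate(eigenvalues, start=1):
        running += value
        rows.append((rank, value, 100.0 * value / total, 100.0 * running / total))
    return rows


def components_needed(eigenvalues, target_percent):
    """Smallest number of leading components whose cumulated inertia
    reaches target_percent (all components if it is never reached)."""
    for rank, _, _, cumulated in inertia_table(eigenvalues):
        if cumulated >= target_percent:
            return rank
    return len(eigenvalues)


if __name__ == "__main__":
    for rank, value, share, cumulated in inertia_table(EIGENVALUES)[:5]:
        print(f"{rank:2d}  {value:9.5f}  {share:6.2f}  {cumulated:6.2f}")
    print(components_needed(EIGENVALUES, 99.0))
```

With these values, `components_needed(EIGENVALUES, 99.0)` agrees with the 17 components quoted above, and a 95 percent target is reached with 6 components.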
Discriminant Factorial Analysis (DFA) can be applied to a learning database in which each learning sample belongs to a particular class [Duda73]. The number of discriminant features selected by DFA is determined by the number of classes (c) and the number of input dimensions (d); it is equal to the minimum of d and c-1. In the usual case where d is greater than c, the output dimension is equal to the number of classes minus one, and the discriminant axes are selected so as to maximize the between-class variance and to minimize the within-class variance.

The discrimination power (the ratio of the projected between-class variance to the projected within-class variance) is not the same for each discriminant axis: this ratio decreases from one axis to the next. So for a problem with many classes, this preprocessing will not always be efficient, as the last output features will be less discriminant. The analysis uses the inverse of the global covariance matrix, so the covariance matrix must be well conditioned (for example, a preliminary PCA must be applied to remove the linearly correlated dimensions).

DFA has been applied to the first 18 principal components of the satimage_PCA database (that is, only the first 18 attributes of that database were kept before applying the DFA preprocessing) in order to build the satimage_DFA.dat.Z database file, which has 5 dimensions (the satimage database having 6 classes).

[Duda73] Duda, R.O. and Hart, P.E., Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.