Correspondence Analysis
Correspondence analysis is a principal components analysis method for the display of rows and columns of a two-way contingency table as points in a low-dimensional vector space (Carr 1990). The geometry of the rows, which in our data set represent the individual sand samples, is related to the geometry of the columns, which represent the point count parameters, resulting in a "correspondence" between the rows and columns (Greenacre 1984). Carr (1990) provides a short, but useful, summary of the technique:
A relatively simple transformation is applied to a contingency table to yield a square, symmetric matrix for which eigenvalues and eigenvectors are calculated. From the eigenvalues and eigenvectors, factor loadings are calculated separately for the individuals and the attributes. By combining factor loadings, individuals and attributes can be plotted simultaneously in a two-dimensional plan to yield a clustering pattern.
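The steps in Carr's summary can be sketched in a few lines of Python. This is a minimal illustration and not the software used in the study. It assumes NumPy is available, uses a singular value decomposition of the standardized residuals (equivalent to the eigen-decomposition of the square, symmetric matrix mentioned above), and runs on a small, invented table of counts.

```python
import numpy as np

def correspondence_analysis(counts):
    """Simple correspondence analysis of a samples-by-parameters count table."""
    N = np.asarray(counts, dtype=float)
    if (N < 0).any() or (N.sum(axis=0) == 0).any() or (N.sum(axis=1) == 0).any():
        raise ValueError("counts must be non-negative with nonzero row and column totals")
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (samples)
    c = P.sum(axis=0)                    # column masses (parameters)
    # Standardized residuals from the independence model r * c^T
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sv ** 2                    # eigenvalues of S^T S
    row_coords = (U * sv) / np.sqrt(r)[:, None]     # principal coordinates of rows
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal coordinates of columns
    return row_coords, col_coords, inertia

# Hypothetical point counts: 4 sand samples (rows) by 3 parameters (columns)
counts = [[30, 10, 5],
          [25, 15, 10],
          [5, 20, 40],
          [10, 25, 35]]
row_coords, col_coords, inertia = correspondence_analysis(counts)
explained = inertia / inertia.sum()      # share of total inertia per axis
print("share of inertia on axes 1 and 2:", explained[:2])
```

Plotting the first two columns of row_coords and col_coords on the same axes gives the simultaneous display of samples and parameters described in the quotation.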
Although correspondence analysis is statistically based, it is primarily a geometric technique (Greenacre 1984). As Ringrose (1992) has noted, the algebraic technique employed by correspondence analysis is purely deterministic; therefore it provides little indication of the strength of any apparent relationships. For that reason many authors (Baxter 1991, Escoufier and Junca 1986) emphasize the exploratory, as opposed to confirmatory (Lewis 1986), nature of its results. Melguen (1974) first recognized the usefulness of correspondence analysis as a tool for recognizing and characterizing sedimentary facies.
One of the primary reasons we chose correspondence analysis as a method of data reduction and exploration is that the technique requires only that all values in the data matrix be non-negative (zeros are acceptable) and that all row and column totals be greater than zero (Hill 1979:10; Weller and Romney 1990:72). These minimal requirements are important for a database that contains point count data gathered from samples derived from an extremely heterogeneous set of source rocks. Many of the grain type parameters recorded are not present in all portions of the study area. This type of fundamental between-sample compositional variability, necessary to petrofacies model building, represents one of the greatest differences separating the analytical requirements of petrological analysis from those of instrumental characterization studies of clay chemistry. In instrumental studies there is an expectation that all of the elements and compounds under study will be present in all of the samples; usually only the relative concentrations of elements and compounds exhibit variation between samples.
In general, several correspondence analysis trials are run over the length of a project. The first trial uses the data from all sand samples collected, and all point count parameters that occur at a rate of greater than one percent in the data set. Subsequent trials limit the data to minimize the "noise" contributed by samples with extremely different compositions. For instance, a sand with unusually high calcite will plot as a point at one end of the correspondence axes, with all other samples grouped at the other end. No exploratory information is gained from this type of plot, so known "odd" samples are progressively excluded as we explore the relationships in the overall data set. In this process, we may utilize a subcomposition of the point count parameters, or create recalculated parameters by summing the counts for parameters representing rock and mineral types that are formed together geologically and that plotted close together in earlier correspondence analysis trials.
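The data preparation for these trials can be sketched as follows. This is an illustration only: the parameter names and counts are invented, and the one percent rule is interpreted here as a parameter's share of all grains counted across the data set, which is an assumption about how the rate is computed.

```python
def filter_parameters(table, names, min_rate=0.01):
    """Keep parameters whose share of all counted grains exceeds min_rate (assumed rule)."""
    grand_total = sum(sum(row) for row in table)
    col_totals = [sum(row[j] for row in table) for j in range(len(names))]
    keep = [j for j, total in enumerate(col_totals) if total / grand_total > min_rate]
    kept_names = [names[j] for j in keep]
    kept_table = [[row[j] for j in keep] for row in table]
    return kept_names, kept_table

def combine_parameters(table, names, members, new_name):
    """Sum the counts of the listed parameters into one recalculated parameter."""
    idx = [names.index(m) for m in members]
    rest = [j for j in range(len(names)) if j not in idx]
    new_names = [names[j] for j in rest] + [new_name]
    new_table = [[row[j] for j in rest] + [sum(row[j] for j in idx)] for row in table]
    return new_names, new_table

# Hypothetical parameter names and point counts for three sand samples
names = ["quartz", "kspar", "plag", "lvf", "calcite"]
table = [[120, 40, 60, 78, 2],
         [100, 50, 70, 79, 1],
         [90, 60, 80, 70, 0]]

kept_names, kept_table = filter_parameters(table, names)
new_names, new_table = combine_parameters(kept_table, kept_names,
                                          ["kspar", "plag"], "feldspar")
print(kept_names)
print(new_names, new_table[0])
```

Excluding a known "odd" sample is simply a matter of dropping its row from the table before the next trial.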
Often, the first two factors will account for approximately half of the variation, and are readily interpretable in terms of tectonic origin of the sand sample when the ranked parameter optimal scores are examined. For example, on the first factor the lithic parameters generally receive the lowest (negative) factor scores while the minerals receive the highest (positive) factor scores. Each axis can thus be interpreted by examining the parameters with the lowest and highest rankings. The opposition of factors on the axes of the correspondence analysis reflects the fundamental data structure, making correspondence analysis an excellent petrofacies refinement tool.
Discriminant Analysis
Criticism of petrofacies modeling has focused on two issues: (1) the need to evaluate the degree of intrapetrofacies compositional variability (Lindauer 1992b:278); and (2) the need for a rigorous test statistic to evaluate sherd membership in a given petrofacies (Cable 1989:8). The use of discriminant function analysis to evaluate the group membership of sand samples using the point-counted compositional data allows both issues to be addressed.
Because petrofacies are defined using compositional and geographic criteria, it is expected that some compositional overlap will occur between sand samples collected from different areas in a region. The discriminant analysis permits us to estimate the degree of intrapetrofacies variability by comparing assigned group memberships (based on compositional and geographic knowledge) with the posterior probability of group membership (based on assignment by the discriminant functions). The discriminant analysis also identifies other source zones with which a given petrofacies' sands may exhibit compositional overlap. Finally, the sand temper point count data, recorded from sherd thin sections, can be treated as "unknown" cases, and classified as to petrofacies membership by the discriminant functions. This allows the probabilistic assignment of sherd samples to petrofacies and serves as an independent check on the binocular microscopic characterization of temper composition.
This map illustrates how the statistical modeling was conducted in overlapping zones for the San Pedro Valley. The first model was constructed to evaluate the northern half of the main valley, an area that is geographically and culturally separate from the southern half. The two petrofacies that border the northern and southern units were included in each model (N and S) to prevent the creation of an artificial division between the models. A more extensive boundary was used between the Aravaipa and northern San Pedro models since they share a mountain range.
Discriminant analysis is a statistical technique that is designed to study the differences between two or more groups of objects with respect to several variables simultaneously (Klecka 1980). In our study, the data are grouped by petrofacies. Individual sand and sherd samples are the objects, and point count parameters are the variables. Discriminant analysis is used to address two distinct problems. The first problem is to determine the set of functions that best discriminate between the sand groups and to test the accuracy of the resulting discrimination. The second problem involves the use of the discriminant functions to assign sherds, inferred to have originated from one of the petrofacies, to these groups in a probabilistic manner.
Often, multiple discriminant analyses are conducted. The first tests the assignment of sands to generic groups, such as a mineral-rich group of petrofacies or a rock-fragment-rich group of petrofacies. Subsequent models may divide the generic groups further, such as dividing a rock-fragment-rich group into volcanic-rich versus metamorphic-rich petrofacies. The final models assign sands to individual petrofacies within the generic groups. Fortunately, we have not yet had to resort to more than three nested levels of discriminant analyses. The logratio transform, developed by Aitchison (1986) for the analysis of compositional data, is applied to the point count data prior to the analyses.
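A logratio transform of point count data can be sketched as below. The text does not specify which of Aitchison's logratio transforms is used or how zero counts are handled, so this example assumes the centered logratio and replaces zeros with a small constant; both choices are assumptions made for illustration.

```python
import math

def clr(counts, zero_replacement=0.5):
    """Centered logratio of one sample's counts.

    Zero counts are replaced by zero_replacement before the transform,
    because the logarithm of zero is undefined (an assumed convention).
    """
    adjusted = [c if c > 0 else zero_replacement for c in counts]
    total = sum(adjusted)
    parts = [a / total for a in adjusted]          # closed composition (sums to 1)
    mean_log = sum(math.log(p) for p in parts) / len(parts)
    return [math.log(p) - mean_log for p in parts]

# Hypothetical counts for one sand sample across four parameters
sample = [120, 60, 0, 20]
transformed = clr(sample)
print([round(v, 3) for v in transformed])
```

The transformed values sum to zero for every sample, and they depend only on the ratios between parameters, not on the total number of grains counted.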
The results of the nested discriminant analyses can be brought together in an overall classification matrix (example from Tonto Basin) that shows how many sand samples were accurately predicted using the discriminating variables. This raw accuracy is the sum of correct predictions divided by the total number of cases. Klecka (1980:50) notes that while the percentage of cases predicted accurately is the most intuitive measure of discrimination, the magnitude of this percentage should be judged in relation to the expected percentage of correct classifications made by random assignment. A proportional reduction in error statistic, tau, can be calculated. Tau gives a standard measure of improvement over a random assignment regardless of the number of groups (Klecka 1980:50-51). The maximum value for tau is 1.0; this value represents no errors in prediction. A value of zero indicates no improvement over random assignment.
The equation for tau is presented here:
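Following Klecka (1980:50-51), and with symbol names chosen here for illustration, tau can be written as:

```latex
\tau = \frac{n_c - \sum_{i=1}^{g} p_i n_i}{n - \sum_{i=1}^{g} p_i n_i}
```

Here n_c is the number of cases classified correctly, n_i is the number of cases in group i, p_i is the prior probability of membership in group i, g is the number of groups, and n is the total number of cases. The sum in the numerator and denominator is the number of correct classifications expected from random assignment, so tau equals 1.0 when n_c = n and zero when n_c matches that expectation.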
The calculation of tau is included on the classification tables for each basin.
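The raw accuracy and tau described above can be computed directly from a classification matrix. The sketch below uses an invented three-group matrix and assumes equal prior probabilities unless others are supplied; the priors actually used in the basin models are not stated here.

```python
def accuracy_and_tau(matrix, priors=None):
    """Raw accuracy and Klecka's tau from a square classification matrix.

    matrix[i][j] is the number of cases from actual group i that were
    assigned to group j. Equal priors are assumed when none are given.
    """
    g = len(matrix)
    group_sizes = [sum(row) for row in matrix]
    n = sum(group_sizes)
    n_correct = sum(matrix[i][i] for i in range(g))
    if priors is None:
        priors = [1.0 / g] * g
    # Number of correct classifications expected from random assignment
    expected = sum(p * n_i for p, n_i in zip(priors, group_sizes))
    accuracy = n_correct / n
    tau = (n_correct - expected) / (n - expected)
    return accuracy, tau

# Hypothetical classification matrix for three petrofacies (rows = actual group)
matrix = [[8, 1, 1],
          [2, 7, 1],
          [0, 2, 8]]
accuracy, tau = accuracy_and_tau(matrix)
print(round(accuracy, 3), round(tau, 3))
```

In this invented example, 23 of 30 sands are classified correctly, while 10 correct assignments would be expected by chance, so the classification makes 65 percent fewer errors than random assignment.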
Once the discriminant functions have been created for the sands of a basin, they remain unchanged pending the addition of sand samples to the data set. As sherds are point counted, they can be added to the analysis as unweighted samples, that is, they are unknowns to be classified by the discriminant functions. They are not used to create the discriminant functions or the classification groups.
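The separation between model building (sands) and classification (sherds) can be illustrated with a small linear discriminant sketch. This is a stand-in and not the statistical package used in the study: it assumes NumPy is available, a pooled within-group covariance, normally distributed variables, and equal priors, and it uses invented logratio-style values for two hypothetical petrofacies, "A" and "B".

```python
import numpy as np

def fit_sands(X, labels):
    """Estimate group means and the inverse pooled covariance from sand samples only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = sorted(set(labels.tolist()))
    means = np.array([X[labels == g].mean(axis=0) for g in groups])
    # Center each sand on its own group mean, then pool the scatter
    centered = X - means[[groups.index(g) for g in labels.tolist()]]
    pooled = centered.T @ centered / (len(X) - len(groups))
    return groups, means, np.linalg.inv(pooled)

def posterior(x, means, pooled_inv, priors=None):
    """Posterior probability of each group for one unknown case (equal priors by default)."""
    diffs = means - np.asarray(x, dtype=float)
    d2 = np.einsum("ij,jk,ik->i", diffs, pooled_inv, diffs)  # squared Mahalanobis distances
    k = len(means)
    priors = np.full(k, 1.0 / k) if priors is None else np.asarray(priors, dtype=float)
    weights = priors * np.exp(-0.5 * (d2 - d2.min()))
    return weights / weights.sum()

# Hypothetical transformed values for sands of known petrofacies
sands = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.4], [1.1, 0.3],
         [-0.8, 1.0], [-1.0, 1.2], [-0.9, 0.8], [-1.1, 1.1]]
sand_groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
groups, means, pooled_inv = fit_sands(sands, sand_groups)

# A sherd is classified as an unknown; it never enters fit_sands
sherd = [0.95, 0.3]
probs = posterior(sherd, means, pooled_inv)
assigned = groups[int(np.argmax(probs))]
print(assigned, probs)
```

Because the sherd values are passed only to posterior, adding more sherds leaves the group means and covariance, and therefore the discriminant model, unchanged.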
In some cases, we have found it necessary to delete certain parameters from the discriminant analysis in order to analyze sherds. For example, at certain times and in certain places in prehistoric Arizona, pottery was tempered with a mixture of crushed mica schist and sand. In this case, it may be impossible to estimate the amount of mica attributable to the sand temper, and mica may need to be left out of the discriminant analysis in order to analyze that set of sherds. Leaving out a parameter generally results in a less accurate model, but it is preferable to being unable to compare the sherds to their sand sources at all.