Introduction
Germplasm banks play an important role in the conservation of genetic resources. Characterization of genetic resources is the process of describing accessions with respect to a particular set of characters (6). The evaluation process has several objectives: i) to measure the genetic variability of the studied group; ii) to establish the representativeness of the collection in relation to the genetic variability of the species in a region or to intra-species genetic variability; and iii) to identify genes that can be used for research and practical purposes, such as biotic resistance (11). The evaluation of all conserved accessions is constrained by experimental plot area, labor availability and logistics. These limitations call for trait evaluations in consecutive field assays, each with a manageable subset of accessions. The complete evaluation of a germplasm bank by agronomic and morphological descriptors therefore involves two considerations: a) conducting consecutive assays to evaluate a large number of accessions, and b) using multivariate statistical methods to analyze sets of quantitative traits. It also requires the inclusion of a set of checks, or controls, in all assays performed over the seasons. These checks connect the assays and allow accessions from different assays to be compared.
Taba et al. (1998) reported the results of field evaluations to develop a core subset of Caribbean maize (Zea mays L.) accessions from the maize germplasm bank of CIMMYT (Centro Internacional de Mejoramiento de Maíz y Trigo). A total of 498 accessions were evaluated in two sets of 249 accessions each, with seven common checks. The authors presented a combined analysis of these data sets. First, they proposed a mixed linear model including the following effects: assay, accession and their interaction, plus other design-associated factors. These effects were estimated and removed from the observed value of each entry for each trait, leaving only the genotype effect. Second, the adjusted trait means were used to cluster accessions into homogeneous groups through a multivariate analysis. Following the same approach, Reeb et al. (2007) developed a model to evaluate 145 maize accessions from the Active Germplasm Bank (BAP) of the Experimental Station at the Instituto Nacional de Tecnología Agropecuaria (EEA INTA) in Pergamino, Argentina. Both models are highly unbalanced, and the interactions produce a large number of empty cells, which makes the effects difficult to estimate. In addition, because they are univariate models, they ignore the structure of relationships among variables within the data sets.
Alternatively, the data from each assay can be arranged into accession-by-trait matrices that share, over time, a group of accessions characterized in all assays. This may be seen as a three-way data structure in which the input matrices are incomplete but connected through the common checks. The first way comprises the accessions; the second, the characterization traits; and the third, the assays. Thus, three-way data analysis should be a useful tool for dealing with the multidimensional nature of the information.
Materials and methods
Plant material
Maize accessions were evaluated in three assays. Assay 1 was planted in Pergamino (Buenos Aires, Argentina) during the 2006/2007 season and included 23 accessions of different racial forms: Cristalino Colorado 11; Cristalino Amarillo 1; Dentado Amarillo 6; and Dentado Blanco 8. Assay 2 was conducted in Ferré (Buenos Aires, Argentina) during the 2006/2007 season and included 24 accessions of different racial forms: Cristalino Colorado 12; Avatí Morotí Ti 1; Dentado Amarillo 3; Dentado Blanco 8; and non-classifiable type 1. Seven common checks were included in both assays: four synthetic open-pollinated varieties (BS13p, Candelaria INTA, Payagua INTA and SP1234) and three accessions (ARZM14007 of Dentado Blanco, ARZM14023 of Dentado Amarillo and ARZM18004 of Cristalino Colorado). Assay 3 included all 47 accessions of assays 1 and 2 plus the seven common checks, and was carried out in Ferré during the 2011/2012 season. Assay 3 was used as a reference for the comparative analysis of the proposed methodology. A randomized complete block design with two replications was used in the three assays. Twelve morphological, agronomic and phenological traits were measured according to IBPGR descriptors (12): plant height (PL.HEIGHT-cm); ear height (E.HEIGHT-cm); kernel width (K.WIDTH-mm); kernel length (K.LENGTH-mm); number of kernel rows (N.ROW); ear diameter (E.DIAM-mm); 1000-kernel weight (W1000-g); yield (YIELD-kg); prolificacy (PROL); root lodging (ROOT.L-%); days to anthesis (D.ANT) and days to silking (D.SILK) (3).
Principal Components Analysis on the concatenated table
Principal Component Analysis (PCA) describes a set of n individuals and p variables through a small set of new variables expressed as linear combinations of the original ones. In this way, the information is optimally represented in a space of reduced dimension. In addition, the initial variables are usually correlated, whereas the new ones are not. This transformation facilitates data interpretation because it allows inferences about linear relationships among variables and similarities among individuals (5, 27). In this study, a PCA was performed on a single matrix of 54 accessions by 12 variables, built by averaging each accession over replications within each assay and then concatenating the assays.
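As an illustration of the procedure described above, the following is a minimal Python sketch of a PCA computed through the singular value decomposition. It is not the routine used in the study: the function name `pca`, the standardization of each trait and the synthetic data are assumptions made only for the example.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA of an individuals-by-variables matrix via SVD of the standardized data."""
    X = np.asarray(X, dtype=float)
    # ASSUMPTION: each trait is standardized so that different units are comparable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # coordinates of the individuals
    loadings = Vt[:n_components].T                    # weights of the original variables
    explained = s ** 2 / np.sum(s ** 2)               # share of variability per axis
    return scores, loadings, explained[:n_components]

# Synthetic stand-in for two tables of assay means (not the study's data).
rng = np.random.default_rng(0)
assay_1 = rng.normal(size=(30, 12))
assay_2 = rng.normal(size=(31, 12))
table = np.vstack([assay_1, assay_2])   # concatenation of the assays
scores, loadings, explained = pca(table)
```

The two columns of `scores` are uncorrelated by construction, which is the property of the new variables mentioned in the text, and `explained.sum()` gives the share of total variability represented on the principal plane.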
Principal Component Analysis on the matrix obtained after the estimation and elimination of model effects
A mixed linear model was fitted for each variable according to Reeb et al. (2007) and Taba et al. (1998). This model first estimates the effect of each factor and then eliminates it from the value of each variable obtained for each accession:
$$y_{ikl} = \mu + G_i + E_k + B_{l(k)} + \varepsilon_{ikl}$$
where:
$y_{ikl}$ = the observed value
$\mu$ = the general mean
$G_i$ = the effect of the i-th accession, i = 1, …, 54
$E_k$ = the effect of the k-th assay, k = 1, 2
$B_{l(k)}$ = the effect of the l-th block within the k-th assay, l = 1, 2
$\varepsilon_{ikl}$ = the random error.
As a result, a matrix of 54 accessions by 12 adjusted variables was used to perform a PCA.
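One possible reading of the elimination step, given only as a sketch: hats denote estimates from the fitted model, E and B are the assay and block-within-assay effects, and n_i is the number of observations of accession i. The tilde notation and the averaging over observations are assumptions for illustration; the source does not state this formula.

```latex
\tilde{y}_{ikl} = y_{ikl} - \hat{E}_k - \hat{B}_{l(k)},
\qquad
\bar{\tilde{y}}_{i} = \frac{1}{n_i} \sum_{k,l} \tilde{y}_{ikl}
```

Under this reading, each row of the 54 × 12 matrix holds the adjusted means of one accession for the 12 traits.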
Principal Component Analysis on the reference assay
A matrix of individuals-by-variables was obtained using the average of the replications, based on data from the reference trial conducted in the 2011/2012 season. This matrix was used to perform the PCA.
Generalized Procrustes Analysis on incomplete but connected trials
An algorithm based on Generalized Procrustes Analysis (GPA) (8, 9, 25) was developed to generate a consensus configuration of the common checks from a multivariate approach. GPA can be applied to a set of individuals described by the same variables under different conditions. This technique searches for an optimal consensus configuration of the different individuals-by-variables data matrices. The consensus is obtained through a series of iterative algebraic steps that include translation, rotation, reflection and scaling of each individual configuration, optimizing a goodness-of-fit criterion. The criterion relies on maintaining the relative distances among the elements of each individual configuration and on minimizing the sum of squares between analogous points, i.e., points that correspond to the same element under different configurations. An iteration is complete once the initial standardization or translation has been done and all the configurations have been transformed. The process is repeated until the change in the residual sum of squares between two consecutive steps is smaller than a set tolerance. The consensus configuration is obtained as the average of all transformed individual configurations. Once the iterative process of the GPA ends, the total variability can be partitioned in the form of an analysis of variance (ANOVA) table.
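The rotation and reflection step at the core of GPA can be illustrated with the orthogonal Procrustes solution for a single pair of configurations. This is a minimal Python sketch under stated assumptions, not the authors' routine: the function name `procrustes_rotation` and the synthetic configurations are hypothetical.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Orthogonal matrix Q (rotation or reflection) minimizing ||X @ Q - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic example: Y is a centered configuration of 7 points in the plane
# and X is the same configuration after a rotation.
rng = np.random.default_rng(1)
Y = rng.normal(size=(7, 2))
Y -= Y.mean(axis=0)                       # translation: center on the origin
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = Y @ R
Q = procrustes_rotation(X, Y)
residual = np.sum((X @ Q - Y) ** 2)       # sum of squares between analogous points
```

Because `X` differs from `Y` only by a rotation, the residual sum of squares after fitting is zero up to rounding error. In GPA the same step is applied to every configuration against the current consensus.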
Some applications of this technique involve measuring the consensus between agronomic and molecular information (1, 3, 10, 18, 20, 22, 26). GPA can also be applied in crop characterization and for other agronomic objectives (4, 19, 23, 27).
The experimental situation under study can be analyzed using incomplete matrices that contain the information from q assays. In the k-th assay (k = 1, …, q), a set of $(n + n_k)$ accessions is assessed by p variables. Each set of $(n + n_k)$ accessions includes the n common checks, which are measured under all q conditions, and $n_k$ other accessions, which differ from assay to assay. In matrix notation, the information on the conditions is represented by q matrices $X_k$. Each has n rows representing the individuals measured under all conditions (hereafter referred to as individuals in common) and $n_k$ rows representing the individuals that correspond only to the k-th condition. The latter are unique to each condition and differ among conditions. Therefore, each data matrix $X_k$ can be partitioned into two submatrices: one of order (n × p), with the coordinates of the individuals in common in the k-th condition, and one of order ($n_k$ × p), with the coordinates of the individuals unique to the k-th condition (Figure 1).
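The partition can be written compactly as a block matrix. The superscripts (c) for the individuals in common and (u) for the individuals unique to a condition are labels assumed here for illustration only.

```latex
X_k =
\begin{pmatrix}
X_k^{(c)} \\
X_k^{(u)}
\end{pmatrix},
\qquad
X_k^{(c)} \in \mathbb{R}^{n \times p},
\quad
X_k^{(u)} \in \mathbb{R}^{n_k \times p},
\quad
k = 1, \ldots, q
```

The rows of the upper block refer to the same n individuals in every assay, whereas the lower blocks have no rows in common.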
Analyzing the relationships between the individuals requires a common space onto which all of them can be projected. However, GPA assumes that all objects are measured under all conditions.
Thus, the GPA-based algorithm must be applied to the n individuals in common, while the individuals unique to each condition are considered as supplementary elements (17). The $n_k$ individuals in the k-th condition are centered, scaled and rotated through the same transformations applied to the individuals in common, but are excluded from the calculation of the parameters involved in these operations. The individuals in common act as pivots for the rest of the accessions, being the geometric reference for the successive rotations. Therefore, it is essential to select individuals in common that satisfy some requirements for obtaining reliable and accurate results.
The proposal assumes that the individuals in common cluster into groups that differ significantly in the traits considered, and that these groups remain stable across conditions. These assumptions are verified by MANOVA, checking for the absence of genotype-by-assay interaction and the presence of two or more significantly different groups. If they hold, the relationships between individuals in each condition can be analyzed through the set of individuals in common (Figure 2).
That is to say, the individuals that belong only to the k-th condition are compared with the individuals in common as observed in that same condition. Then, using the GPA performed on the individuals in common, the individuals in common in the k-th condition are compared with those in the k'-th condition through the consensus configuration Y. Finally, the individuals in common within the k'-th condition are compared with the individuals corresponding only to that condition.
In this multivariate solution, GPA is used as a basis for obtaining a factorial plane where all individuals are projected (7, 13, 14, 15, 17). Thus, individuals are grouped according to the similarity of their multivariate behavior (2).
A brief description of the steps involved in the methodological proposal is presented here:
1. Fit a MANOVA model to the individuals in common, taking into account the main effects, the interactions and the design terms. The aim is to verify the two assumptions: absence of a genotype-by-assay interaction effect and presence of at least two significantly different groups of individuals in common (significant accession effect). The multivariate model is:
$$\mathbf{y}_{ikl} = \boldsymbol{\mu} + \mathbf{G}_i + \mathbf{E}_k + \mathbf{B}_{l(k)} + \boldsymbol{\varepsilon}_{ikl}$$
where:
$\mathbf{y}_{ikl}$ = the l-th multivariate replication of the p observable variables, $\mathbf{y}_{ikl} = (y_{ikl1}, y_{ikl2}, \ldots, y_{iklp})'$;
$\boldsymbol{\mu}$ = the general mean;
$\mathbf{G}_i$ = the effect of the i-th accession, i = 1, …, 7;
$\mathbf{E}_k$ = the effect of the k-th assay, k = 1, 2;
$\mathbf{B}_{l(k)}$ = the effect of the l-th block within the k-th assay, l = 1, 2;
$\boldsymbol{\varepsilon}_{ikl}$ = the random error.
2. Generate the assay configurations $X_k$, k = 1, …, q, by carrying out a PCA on each separate assay. Each matrix $X_k$ is partitioned into two submatrices: one contains the coordinates of the individuals in common (controls) in the k-th condition, and the other contains the coordinates of the individuals belonging only to the k-th condition.
3. Center each configuration $X_k$, k = 1, …, q, on the center of gravity of the individuals in common.
4. Scale each matrix $X_k$, k = 1, …, q, initialize the scale factors and calculate the initial residual sum of squares.
5. Rotate the individuals in common: compute the rotation matrix from the submatrix of individuals in common of each $X_k$, and use it to rotate the individuals belonging only to that condition as well.
6. Calculate the consensus configuration Y as the average of the configurations of the rotated individuals in common, and obtain a new residual sum of squares.
7. If necessary, adjust the scaling factors and then recalculate the consensus configuration Y and the residual sum of squares.
8. If the difference in the residual sum of squares between consecutive iterations is greater than the set tolerance, go to step 5; otherwise, concatenate the transformed configurations into a single matrix X.
9. Finally, perform a PCA on X to obtain the coordinates of the individuals.
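The iterative part of the proposal (centering, scaling, rotation, consensus, convergence check and final PCA) can be sketched in Python as follows. This is a simplified illustration, not the MATLAB routine used in the study. It assumes that the common individuals occupy the first rows of every configuration, uses the first assay as the starting reference, omits the optional adjustment of the scaling factors, and concatenates the consensus with the transformed unique individuals. All function names are hypothetical, and synthetic configurations replace the per-assay PCA.

```python
import numpy as np

def rotation_to(target, source):
    """Orthogonal Q (rotation/reflection) minimizing ||source @ Q - target||_F."""
    U, _, Vt = np.linalg.svd(source.T @ target)
    return U @ Vt

def gpa_with_supplementary(configs, n_common, tol=1e-10, max_iter=100):
    """Align assay configurations using only their common individuals.

    configs: list of (n_common + n_k) x p arrays. ASSUMED layout: the common
    individuals are the first n_common rows, in the same order in every assay.
    """
    X = []
    for A in configs:
        A = np.asarray(A, dtype=float)
        A = A - A[:n_common].mean(axis=0)            # center on the common centroid
        A = A / np.sqrt(np.sum(A[:n_common] ** 2))   # common part scaled to unit size
        X.append(A)
    Y = X[0][:n_common].copy()   # ASSUMPTION: first assay as starting reference
    rss = np.inf
    for _ in range(max_iter):
        # Rotation estimated on the common rows, applied to every row.
        X = [A @ rotation_to(Y, A[:n_common]) for A in X]
        # Consensus = average of the rotated common individuals.
        Y = np.mean([A[:n_common] for A in X], axis=0)
        new_rss = sum(np.sum((A[:n_common] - Y) ** 2) for A in X)
        # Stop when the residual sum of squares no longer decreases.
        if rss - new_rss < tol:
            rss = new_rss
            break
        rss = new_rss
    supplementary = [A[n_common:] for A in X]
    return Y, supplementary, rss

def principal_coordinates(X, n_components=2):
    """PCA scores of the concatenated configuration (centered, not rescaled)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

# Synthetic check (not the study's data): one true configuration seen in two
# assays under different rotations, scales and translations.
rng = np.random.default_rng(42)
n_common, p = 7, 3
checks = rng.normal(size=(n_common, p))
only_1 = rng.normal(size=(5, p))
only_2 = rng.normal(size=(6, p))

def observe(rows, scale, shift):
    Q, _ = np.linalg.qr(rng.normal(size=(p, p)))   # random orthogonal matrix
    return scale * rows @ Q + shift

assay_1 = observe(np.vstack([checks, only_1]), 2.0, 1.5)
assay_2 = observe(np.vstack([checks, only_2]), 0.5, -3.0)

Y, supp, rss = gpa_with_supplementary([assay_1, assay_2], n_common)
X = np.vstack([Y] + supp)            # ASSUMED content of the concatenated matrix
coords = principal_coordinates(X)    # coordinates on the principal plane
```

In this noise-free example the checks coincide after alignment, so the residual sum of squares is zero up to rounding. The unique individuals of both assays recover their true relative distances even though they were never used to estimate the transformations, which is the role the text assigns to the common checks as pivots.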
The consensus configuration obtained was compared with those from the PCAs performed as described in (2.2), (2.3) and (2.4). For the comparison, the correlation coefficient between the matrices of distances among individuals projected on the principal plane was calculated for each strategy. The proposed algorithm was implemented in an ad hoc MATLAB routine (16).
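A comparison of this kind can be sketched in Python as shown below. The source does not specify the type of correlation coefficient, so the Pearson coefficient over the off-diagonal distances is an assumption, as are the function names and the synthetic coordinates.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distances between all pairs of rows."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def distance_matrix_correlation(coords_a, coords_b):
    """Pearson correlation between the upper-triangle distances of two
    configurations of the same individuals (rows in the same order)."""
    iu = np.triu_indices(coords_a.shape[0], k=1)
    da = pairwise_distances(coords_a)[iu]
    db = pairwise_distances(coords_b)[iu]
    return np.corrcoef(da, db)[0, 1]

# Synthetic principal-plane coordinates for 54 individuals (not the study's data).
rng = np.random.default_rng(3)
plane_a = rng.normal(size=(54, 2))
theta = 1.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
plane_b = plane_a @ R                                    # same plane, rotated
plane_c = plane_a + rng.normal(scale=0.5, size=(54, 2))  # perturbed plane

r_rotated = distance_matrix_correlation(plane_a, plane_b)
r_noisy = distance_matrix_correlation(plane_a, plane_c)
```

Working on distances rather than on coordinates makes the comparison insensitive to the arbitrary orientation of each principal plane: the rotated copy gives a correlation of one up to rounding, whereas the perturbed copy gives a lower value.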
Results and discussion
Principal Components Analysis on the concatenated matrix
The principal plane obtained by applying the PCA to the concatenated matrix (2.2) explained 68.54% of the total variability. The PCA plot (Figure 3) shows a horizontal gradient, increasing from left to right, of phenological traits, plant height, ear height, kernel width, kernel length and 1000-kernel weight, and, in the opposite direction, of number of kernel rows.
Yield and Prolificacy contributed to the vertical axis formation, establishing an upward gradient, whereas root lodging established a downward gradient.
The analysis of the factorial plane of individuals (Figure 4) shows that the accessions of Dentado Blanco are placed at the right end.
These accessions include the plants of largest size and longest flowering cycle, with ARZM14066 and ARZM14090 standing out from the rest.
The accessions of Dentado Amarillo and Cristalino are placed at the left end. These accessions showed a larger number of kernel rows and a shorter flowering cycle than those of Dentado Blanco, particularly ARZM18046 and ARZM18012. The accessions with the highest yield values are situated at the upper end of the vertical axis. This group is composed of the synthetic varieties used as checks (Candelaria INTA, Payagua INTA, BS13p and SP1234), the accessions ARZM18055 and ARZM18056 of Dentado Amarillo, and ARZM18035 and ARZM18049 of Cristalino Colorado. In contrast, the accessions ARZM14003 of Dentado Blanco and ARZM14418 of Avatí Morotí Ti are located at the lower end of the vertical axis, with the lowest yield and the highest root lodging percentage.
The accessions of Dentado Amarillo are dispersed, with the following distribution: the largest plants are grouped with the accessions of Dentado Blanco, those with high yield values are grouped with the synthetic varieties used as checks, and those with low yield values are grouped with Avatí Morotí Ti. The accessions of Cristalino Colorado constitute a homogeneous group in terms of plant size, kernel size and cycle duration, but they separate into two subgroups with high yield (upper left quadrant) and low yield (lower left quadrant).
Principal Components Analysis on the matrix of adjusted means
The principal plane of the PCA performed after the estimation and elimination of the model effects (2.3) explained 65.14% of the total variability. Traits behaved as in the previous analysis, except for number of kernel rows and prolificacy, which contributed less to axis formation. In addition, ear diameter and yield contributed to the vertical axis. Accessions fell into the same groups as in 3.1.
Principal Components Analysis on the reference trial
The principal plane of the PCA on data from the reference trial (2.4) explained 61.24% of the total variability. Traits behaved as in the previous analyses, except for number of kernel rows, which contributed to the formation of the vertical axis. Accessions were grouped as before.
Generalized Procrustes Analysis on incomplete but connected trials
Verification of the assumptions detected no genotype-by-assay interaction and revealed highly significant differences among three groups of individuals in common. Group 1: ARZM14007; Group 2: ARZM14023, ARZM18004, SP1234; and Group 3: BS13p, Candelaria INTA, Payagua INTA. In step 2 of the methodological proposal, the replicates of each accession were averaged and a PCA was performed on each assay. Within the algorithm, and before the iterative process, both configurations were centered and scaled (Steps 3 and 4) and the initial residual sum of squares was calculated. Then, rotation and scaling were repeated until the convergence criterion was satisfied (Steps 5 to 8). The algorithm converged in two iterations (Table 1).
Figure 5 shows the configuration of individuals obtained by the methodology proposed in this research (Step 9).
The accessions are grouped according to their race. Dentado Blanco landraces were separated from Dentado Amarillo and Cristalino along the first principal axis. The synthetic varieties are positioned at the upper end of the vertical axis, which is associated with yield. Accessions were grouped as in the other cases already described.
Table 2 shows the correlation coefficients between matrices of distance among individuals obtained from the final configurations of the different strategies.
The correlation coefficients indicate a high degree of agreement among the strategies. The highest correlation coefficient was observed between the methodological proposal presented here and the strategy based on the elimination of model effects. This is a promising result, since it may improve the way in which germplasm collections are characterized. So far, this characterization has been carried out through the elimination of model effects, a univariate approach that requires a laborious process because the model is highly unbalanced, which hinders the estimation of the effects. In this context, the new proposal stands as a useful tool for data analysis in germplasm characterization and evaluation, providing good results and being easy to implement.
The common checks were clustered into significantly different groups that serve as centers of gravity for the accessions similar to them. Figure 5 shows that the non-synthetic common checks were positioned close to the rest of the accessions belonging to their race. Common checks are required to be stable (without genotype-by-assay interaction) and to show distinct behavior, so that they can serve as a geometric reference for the accessions belonging only to a given assay.
Conclusions
The objective of this study was to develop a multivariate methodology based on Generalized Procrustes Analysis (GPA) to define groups of similar accessions in the context of characterization assays. The GPA is applied to a group of checks common to all assays. The accessions that are present in only one assay are considered as supplementary elements. Common checks must fulfill certain assumptions, i.e., they must not interact with the evaluation assays and must therefore be stable across assays. The efficiency of the proposal was illustrated with seven common checks. After verifying the above-mentioned assumptions, the proposed technique was applied to obtain a factorial plane onto which all evaluated accessions are projected. This configuration was compared with those obtained from three strategies: concatenation of trials, estimation and elimination of the model effects, and a reference assay. The correlations between the matrices of distances among individuals reveal that the proposal provides results similar to those given by the currently used methodology. This traditional methodology is based on the estimation and elimination of effects, and has several disadvantages: a high degree of unbalance and many empty cells that complicate the estimates, in addition to being a univariate approach. These results support the usefulness of the proposal for identifying sets of similar accessions among those involved in germplasm evaluation trials, taking into account the multivariate structure of the data set.