Optimal sample size and data arrangement method in estimating correlation matrices with lesser collinearity: A statistical focus in maize breeding

Olivoto T., Nardino M., Carvalho I., Follmann D., Ferrari M., de Pelegrin A., Szareski V., de Oliveira A., Caron B., de Souza V.

January 2017

Abstract

Information about data arrangement methodologies and optimal sample size for estimating the Pearson correlation coefficient (r) among maize traits is still limited. Furthermore, some data arrangement methodologies currently in use may increase multicollinearity in multiple regression analysis. This study aimed to investigate the statistical behavior of r and the multicollinearity of correlation matrices among maize traits under different data arrangement scenarios and different sample sizes. Data from 45 treatments [15 simple maize hybrids (Zea mays L.) evaluated in three locations] were used. Eleven traits were assessed and three datasets (scenarios) were formed: (1) all sampled observations (plants), n = 900; (2) the average of five plants per plot, n = 180; and (3) the average of treatments, n = 45. In each scenario, one thousand estimates of r were obtained for each of 60 sample sizes by bootstrap simulation with replacement, and confidence intervals (CI) were estimated. One hundred eighty correlation matrices were estimated and the condition number (CN) of each was calculated. Data based on plot averages and treatment averages overestimate r by up to 24 and 34 percent, respectively, resulting in increases of 24 and 131 percent in the CN of the matrices. Trait pairs with high r require a smaller number of plants, since the CI is inversely proportional to the magnitude of r. Two hundred and ten plants are sufficient to estimate r with a 95 percent CI < 0.30.

Key words: average values, bootstrap, confidence intervals, sample tracking, Zea mays L.
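The bootstrap procedure and the condition number described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the synthetic data, the percentile method for the CI, and the definition of CN as the ratio of the largest to the smallest eigenvalue of the correlation matrix are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_r_ci(x, y, sample_size, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for Pearson's r at a given sample size.

    Hypothetical helper: draws `sample_size` plants with replacement,
    `n_boot` times, and returns the lower and upper CI bounds.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample plant indices with replacement
        idx = rng.integers(0, len(x), size=sample_size)
        estimates[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    alpha = 1.0 - level
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper

def condition_number(data):
    """CN of the correlation matrix of `data` (rows = observations).

    Assumed definition: largest eigenvalue divided by smallest eigenvalue.
    """
    corr = np.corrcoef(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)  # returned in ascending order
    return eigvals[-1] / eigvals[0]

# Synthetic stand-in for two correlated traits measured on 900 plants
rng = np.random.default_rng(42)
n_plants = 900
x = rng.normal(size=n_plants)
y = 0.6 * x + 0.8 * rng.normal(size=n_plants)
z = rng.normal(size=n_plants)

ci_small = bootstrap_r_ci(x, y, sample_size=30)
ci_large = bootstrap_r_ci(x, y, sample_size=210)
width_small = ci_small[1] - ci_small[0]
width_large = ci_large[1] - ci_large[0]
cn = condition_number(np.column_stack([x, y, z]))

print(f"CI width at n=30:  {width_small:.3f}")
print(f"CI width at n=210: {width_large:.3f}")
print(f"Condition number:  {cn:.2f}")
```

Running the sketch on the synthetic traits shows the CI narrowing as the sample size grows, which is the behavior the study quantifies for real maize data.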

Type

Journal article

Publication

In: African Journal of Agricultural Research, 12(2):93–103, 10.5897/AJAR2016.11799