The impact of disease prevalence rate in training set on performance of random forest and threshold Bayes A methods

Document Type : Research Paper

Author

Assistant Professor in Genetics and Animal Breeding, Animal Science Department, Islamic Azad University, Astara Branch, P.O. Box 1141, Astara, Iran.

Abstract

The objective of the current study was to investigate the effect of the disease prevalence rate in the training set and of genomic architecture on the performance of random forest (RF) and threshold Bayes A (TBA) methods for threshold traits. For this purpose, genomic populations with 30 chromosomes were simulated to reflect variation in heritability (0.05 and 0.25), number of QTL (150 and 600), and linkage disequilibrium (low and high). To create binary phenotypes with different disease prevalence rates, the 5% of training-set animals with the lowest average phenotype were first coded 1 (diseased) and the remaining 95% were coded 0 (healthy). The proportion of animals coded 1 was then raised in steps of 5 percentage points until 50% of the animals in the training set were coded 1. With both RF and TBA, genomic accuracy increased as the disease prevalence rate rose from 5% to 20% and then decreased as it rose further to 50%. High disease prevalence rates reduced genomic accuracy more than low prevalence rates did. Overall, the performance of RF fluctuated with changes in genetic architecture and disease prevalence rate. Although TBA was more accurate in the different scenarios overall, RF performed better when high-heritability traits were controlled by a large number of QTL. Beyond the important role of the genetic basis of the population analyzed, the best method for predicting genomic breeding values of threshold traits also depends on the disease prevalence rate.
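The phenotype coding step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `code_binary_phenotype` and `prevalence_scenarios`, the simulated standard-normal phenotypes, and the sample size of 1000 animals are all assumptions made for illustration. The sketch only shows how the lowest-ranking fraction of animals could be coded 1 (diseased) at prevalence rates of 5%, 10%, ..., 50%.

```python
import random


def code_binary_phenotype(phenotypes, prevalence_percent):
    """Code the lowest `prevalence_percent` of animals as 1 (diseased), the rest as 0 (healthy)."""
    n = len(phenotypes)
    # Integer arithmetic avoids floating-point surprises when computing the cut-off count.
    n_diseased = n * prevalence_percent // 100
    # Animal indices ordered from lowest to highest phenotype (ties keep input order).
    order = sorted(range(n), key=lambda i: phenotypes[i])
    codes = [0] * n
    for i in order[:n_diseased]:
        codes[i] = 1
    return codes


def prevalence_scenarios(phenotypes, start=5, stop=50, step=5):
    """Return {prevalence_percent: binary codes} for each prevalence rate from start to stop."""
    return {
        p: code_binary_phenotype(phenotypes, p)
        for p in range(start, stop + 1, step)
    }


if __name__ == "__main__":
    # Hypothetical training set: 1000 animals with standard-normal phenotypes.
    rng = random.Random(1)
    phenotypes = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    for percent, codes in prevalence_scenarios(phenotypes).items():
        print(percent, sum(codes))  # prevalence rate and number of animals coded 1
```

Each resulting set of 0/1 codes would then serve as the training phenotype for one prevalence scenario, with the same genotypes reused across scenarios.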
