NextGP.jl: Modules and Functions
NextGP.jlrelies on several other Julia packages, which could be easily avoided and may be removed in the later releases.- Currently one core dependency is the
StatsModels.jlpackage, for model expression, and fixed effect design matrix generation. - Checking
StatsModels.jl's manual for at leastformulaandcategorical datacould be useful.
Basic Model
NextGP.jl uses the following basic model:
\mathbf{y}=\mathbf{X}\mathbf{b}+\sum{\mathbf{Zu}}+\sum{\mathbf{M\beta}}+\mathbf{e} $
b is a vector of fixed effects
u is a vector of polygenic effects in the model with a covariance structure $\mathbf{u}\sim N\left(0,\mathbf{A}\sigma_{a}^{2}\right)$
A is a relationship matrix from the pedigree
Z is a design matrix allocating animals to records
M are matrices of genotypes
\[\mathbf{\beta}\]
are vectors of marker effects
Public Functions
Public functions of NextGP are documented here
NextGP.MCMC.runLMEM — Functionfunction runLMEM(formula,userData,nChain,nBurn,nThin;myHints=Dict{Symbol,Any}(),blockThese=[],outFolder="outMCMC",VCV=[],userPedData=[],summaryStat=Dict{Any,Any}())formulais the model equatio as made available by theStatsModels.jlpackageuserDatais aDataFrameincludinglhsandrhsvariables (other than defined byPEDandSNP)nChain,nBurnandnThinare the chain length, burn-in period, and the thining interval used for the MCMC sampling- Users can define coding of their variables (e.g. full dummy coding) by providing
myHints. CheckStatsModels.jl's manual forcategorical datacould be useful.
NextGP.makeA — FunctionmakeA(s::Any, d::Any)Makes pedigree-based relationship matrix. adapted from http://morotalab.org/Mrode2005/relmat/createA.txt
NextGP.makePed — FunctionmakePed(inputFile::String,userData)Makes pedigree using PedigreeBase package
NextGP.makeG — FunctionmakeG(inputFile::String;method=1)Makes genomic relationship matrix based on vanRaden method 1 (defult) or method 2
makeG(inputData::Array{Float64,2};method=1)Makes genomic relationship matrix based on vanRaden method 1 (defult) or method 2
NextGP.SNP — Function function SNP(name,path;map="")- Defines SNP information for further analysis.
pathis the path for the marker file, or the matrix for the marker genotypes.- Marker files are currently expected to be ordered as the phenotype data.
- The method to be applied to the data,
GBLUP,BayesB,BayesRetc, is defined in the prior setting. - Map file is optional. If not provided, a Bayesian Regression model with common variance for all SNPs will be applied. If provided, shoul match the order in the genotype file.
- One most avoid overlapping marker sets by using different
names.
NextGP.BayesPR — Function function BayesPR(r,v)ris the region size. In other words, the number of SNPs that share a common variance.1: each SNP has its own (co)variance99: SNPs on the same chromosome has the same (co)variance9999: All SNPs have the same (co)variance- One can define any other region size, for example, 30, 40 or 100
vis an estimate of the variance for the distribution of SNPs
NextGP.BayesB — Function function BayesB(pi,v;estimatePi=false)piis the proportion of SNPs to be included in the model at each McMC cycel. IfestimatePi=true, it is only used as a starting value.vis the variance for the prior distribution of SNPs.estimatePiistrueifpiis estimated. By default it is ´false´
NextGP.BayesC — Function function BayesC(pi,v;estimatePi=false)piis the proportion of SNPs to be included in the model at each McMC cycel. IfestimatePi=true, it is only used as a starting value.vis the variance for the prior distribution of SNPs.estimatePiistrueifpiis estimated. By default it is ´false´
NextGP.BayesR — Function function BayesR(pi,class,v;estimatePi=false)piis the vector of proportion of SNPs for each variance class. IfestimatePi=true, it is only used as a starting value.classis the vector of scales of common SPN variance for each variance class. The scales should be in the increasing order. For example, [0.0,0.0001,0.001,0.01].vis the variance for the prior distribution of SNPs.estimatePiistrueifpiis estimated. By default it is ´false´
NextGP.BayesLV — Function function BayesLV(v,f,covariates,zeta)vis the variance for the prior distribution of SNPs.fis the model formula for the variancecovariatesis theDataFramethat includes explanatory varibles for the variance of each SNP.zetais the error variance in the model for the SNP variances.- If
estimateVarZetaistrue, it assumes that the error variance for the model for the SNP variances is 0.01 percent of the variance of the log-variances. IfestimateVarZetaisfalse, it useszetaas the error variance for the log-linear variances (fixed value). IfestimateVarZetais aFloat64, it assumes that the error variance for the model for the SNP variances isestimateVarZetapercent of the variance of the log-variances.
Internals
Functions that are used internally are documented here
NextGP.prepMatVec.prep — Functionfunction prep(f::StatsModels.TermOrTerms, inputData::DataFrame;userHints=Dict{Symbol,Any}(),path2ped=[],priorVCV=[])NextGPrelies onStatsModels.jlpackage for model expression (f), and fixed effect design matrix generation.- Details for the model expression (
f), and fixed effects coding specifications (e.g., effect or dummy coding) can be found atStatsModels.jl. - Design matrices for random effects are generated either own internal functions or using
StatsModels.jlsmodelcols, depending on how user defined the model term in the model. - Reads in marker data, and mean-centers the columns.
- Finally returns lhs vector and rhs matrices.
- By default:
- all
Intrhs variables are madeCategorical, - all
Stringrhs variables (also those madeCategorical) are dummy coded, except those defined by the user inuserHints, - all
Floatrhs variables are centered.
- all
Convenience functions
Convergency statistics of parameter param, using MCMCChains.jl
using MCMCChains,StatsPlots,StatsBase,CSV
function convergencyMCMC(param;summary=false,plots=false,outFolder=pwd()*"/outMCMC")
param = CSV.read("$outFolder/$(param)Out",DataFrame,header=true)
namesParam = names(param)
param = Matrix(param)
if summary==true
chn = Chains(param,namesParam)
display(chn)
if plots==true
display(plot(chn))
end
param = mean(Matrix(param),dims=1)
else
param = mean(Matrix(param),dims=1)
end
return param
end
- If
summary=true, will print convergency statistics for McMC - If
plots=true, will print trace plot(s) of McMC outFolderis the folder for the McMC output. By default it searches for the folder "outMCMC" in the current directory.