NextGP.jl: Modules and Functions
NextGP.jl
relies on several other Julia packages, which could be easily avoided and may be removed in the later releases.- Currently one core dependency is the
StatsModels.jl
package, for model expression, and fixed effect design matrix generation. - Checking
StatsModels.jl
's manual for at leastformula
andcategorical data
could be useful.
Basic Model
NextGP.jl
uses the following basic model:
\mathbf{y}=\mathbf{X}\mathbf{b}+\sum{\mathbf{Zu}}+\sum{\mathbf{M\beta}}+\mathbf{e} $
b is a vector of fixed effects
u is a vector of polygenic effects in the model with a covariance structure $\mathbf{u}\sim N\left(0,\mathbf{A}\sigma_{a}^{2}\right)$
A is a relationship matrix from the pedigree
Z is a design matrix allocating animals to records
M are matrices of genotypes
\[\mathbf{\beta}\]
are vectors of marker effects
Public Functions
Public functions of NextGP are documented here
NextGP.MCMC.runLMEM
— Functionfunction runLMEM(formula,userData,nChain,nBurn,nThin;myHints=Dict{Symbol,Any}(),blockThese=[],outFolder="outMCMC",VCV=[],userPedData=[],summaryStat=Dict{Any,Any}())
formula
is the model equatio as made available by theStatsModels.jl
packageuserData
is aDataFrame
includinglhs
andrhs
variables (other than defined byPED
andSNP
)nChain
,nBurn
andnThin
are the chain length, burn-in period, and the thining interval used for the MCMC sampling- Users can define coding of their variables (e.g. full dummy coding) by providing
myHints
. CheckStatsModels.jl
's manual forcategorical data
could be useful.
NextGP.makeA
— FunctionmakeA(s::Any, d::Any)
Makes pedigree-based relationship matrix. adapted from http://morotalab.org/Mrode2005/relmat/createA.txt
NextGP.makePed
— FunctionmakePed(inputFile::String,userData)
Makes pedigree using PedigreeBase package
NextGP.makeG
— FunctionmakeG(inputFile::String;method=1)
Makes genomic relationship matrix based on vanRaden method 1 (defult) or method 2
makeG(inputData::Array{Float64,2};method=1)
Makes genomic relationship matrix based on vanRaden method 1 (defult) or method 2
NextGP.SNP
— Function function SNP(name,path;map="")
- Defines SNP information for further analysis.
path
is the path for the marker file, or the matrix for the marker genotypes.- Marker files are currently expected to be ordered as the phenotype data.
- The method to be applied to the data,
GBLUP
,BayesB
,BayesR
etc, is defined in the prior setting. - Map file is optional. If not provided, a Bayesian Regression model with common variance for all SNPs will be applied. If provided, shoul match the order in the genotype file.
- One most avoid overlapping marker sets by using different
name
s.
NextGP.BayesPR
— Function function BayesPR(r,v)
r
is the region size. In other words, the number of SNPs that share a common variance.1
: each SNP has its own (co)variance99
: SNPs on the same chromosome has the same (co)variance9999
: All SNPs have the same (co)variance- One can define any other region size, for example, 30, 40 or 100
v
is an estimate of the variance for the distribution of SNPs
NextGP.BayesB
— Function function BayesB(pi,v;estimatePi=false)
pi
is the proportion of SNPs to be included in the model at each McMC cycel. IfestimatePi=true
, it is only used as a starting value.v
is the variance for the prior distribution of SNPs.estimatePi
istrue
ifpi
is estimated. By default it is ´false´
NextGP.BayesC
— Function function BayesC(pi,v;estimatePi=false)
pi
is the proportion of SNPs to be included in the model at each McMC cycel. IfestimatePi=true
, it is only used as a starting value.v
is the variance for the prior distribution of SNPs.estimatePi
istrue
ifpi
is estimated. By default it is ´false´
NextGP.BayesR
— Function function BayesR(pi,class,v;estimatePi=false)
pi
is the vector of proportion of SNPs for each variance class. IfestimatePi=true
, it is only used as a starting value.class
is the vector of scales of common SPN variance for each variance class. The scales should be in the increasing order. For example, [0.0,0.0001,0.001,0.01].v
is the variance for the prior distribution of SNPs.estimatePi
istrue
ifpi
is estimated. By default it is ´false´
NextGP.BayesLV
— Function function BayesLV(v,f,covariates,zeta)
v
is the variance for the prior distribution of SNPs.f
is the model formula for the variancecovariates
is theDataFrame
that includes explanatory varibles for the variance of each SNP.zeta
is the error variance in the model for the SNP variances.- If
estimateVarZeta
istrue
, it assumes that the error variance for the model for the SNP variances is 0.01 percent of the variance of the log-variances. IfestimateVarZeta
isfalse
, it useszeta
as the error variance for the log-linear variances (fixed value). IfestimateVarZeta
is aFloat64
, it assumes that the error variance for the model for the SNP variances isestimateVarZeta
percent of the variance of the log-variances.
Internals
Functions that are used internally are documented here
NextGP.prepMatVec.prep
— Functionfunction prep(f::StatsModels.TermOrTerms, inputData::DataFrame;userHints=Dict{Symbol,Any}(),path2ped=[],priorVCV=[])
NextGP
relies onStatsModels.jl
package for model expression (f
), and fixed effect design matrix generation.- Details for the model expression (
f
), and fixed effects coding specifications (e.g., effect or dummy coding) can be found atStatsModels.jl
. - Design matrices for random effects are generated either own internal functions or using
StatsModels.jl
smodelcols
, depending on how user defined the model term in the model. - Reads in marker data, and mean-centers the columns.
- Finally returns lhs vector and rhs matrices.
- By default:
- all
Int
rhs variables are madeCategorical
, - all
String
rhs variables (also those madeCategorical
) are dummy coded, except those defined by the user inuserHints
, - all
Float
rhs variables are centered.
- all
Convenience functions
Convergency statistics of parameter param
, using MCMCChains.jl
using MCMCChains,StatsPlots,StatsBase,CSV
function convergencyMCMC(param;summary=false,plots=false,outFolder=pwd()*"/outMCMC")
param = CSV.read("$outFolder/$(param)Out",DataFrame,header=true)
namesParam = names(param)
param = Matrix(param)
if summary==true
chn = Chains(param,namesParam)
display(chn)
if plots==true
display(plot(chn))
end
param = mean(Matrix(param),dims=1)
else
param = mean(Matrix(param),dims=1)
end
return param
end
- If
summary=true
, will print convergency statistics for McMC - If
plots=true
, will print trace plot(s) of McMC outFolder
is the folder for the McMC output. By default it searches for the folder "outMCMC" in the current directory.