NextGP.jl: Modules and Functions

NextGP.jl relies on several other Julia packages, which could be easily avoided and may be removed in the later releases.
Currently one core dependency is the StatsModels.jl package, for model expression, and fixed effect design matrix generation.
Checking StatsModels.jl's manual for at least formula and categorical data could be useful.

Basic Model

NextGP.jl uses the following basic model:

\mathbf{y}=\mathbf{X}\mathbf{b}+\sum{\mathbf{Zu}}+\sum{\mathbf{M\beta}}+\mathbf{e} $

b is a vector of fixed effects

u is a vector of polygenic effects in the model with a covariance structure $\mathbf{u}\sim N\left(0,\mathbf{A}\sigma_{a}^{2}\right)$

A is a relationship matrix from the pedigree

Z is a design matrix allocating animals to records

M are matrices of genotypes

\[\mathbf{\beta}\]

are vectors of marker effects

Public Functions

Public functions of NextGP are documented here

NextGP.MCMC.runLMEM — Function

function runLMEM(formula,userData,nChain,nBurn,nThin;myHints=Dict{Symbol,Any}(),blockThese=[],outFolder="outMCMC",VCV=[],userPedData=[],summaryStat=Dict{Any,Any}())

formula is the model equatio as made available by the StatsModels.jl package
userData is a DataFrame including lhs and rhs variables (other than defined by PED and SNP)
nChain,nBurn and nThin are the chain length, burn-in period, and the thining interval used for the MCMC sampling
Users can define coding of their variables (e.g. full dummy coding) by providing myHints. Check StatsModels.jl's manual for categorical data could be useful.

source

NextGP.makeA — Function

makeA(s::Any, d::Any)

Makes pedigree-based relationship matrix. adapted from http://morotalab.org/Mrode2005/relmat/createA.txt

source

NextGP.makePed — Function

makePed(inputFile::String,userData)

Makes pedigree using PedigreeBase package

source

NextGP.makeG — Function

makeG(inputFile::String;method=1)

Makes genomic relationship matrix based on vanRaden method 1 (defult) or method 2

source

    makeG(inputData::Array{Float64,2};method=1)

Makes genomic relationship matrix based on vanRaden method 1 (defult) or method 2

source

NextGP.SNP — Function

    function SNP(name,path;map="")

Defines SNP information for further analysis.
path is the path for the marker file, or the matrix for the marker genotypes.
Marker files are currently expected to be ordered as the phenotype data.
The method to be applied to the data, GBLUP, BayesB, BayesR etc, is defined in the prior setting.
Map file is optional. If not provided, a Bayesian Regression model with common variance for all SNPs will be applied. If provided, shoul match the order in the genotype file.
One most avoid overlapping marker sets by using different names.

source

NextGP.BayesPR — Function

    function BayesPR(r,v)

r is the region size. In other words, the number of SNPs that share a common variance.
- 1: each SNP has its own (co)variance
- 99: SNPs on the same chromosome has the same (co)variance
- 9999: All SNPs have the same (co)variance
- One can define any other region size, for example, 30, 40 or 100
v is an estimate of the variance for the distribution of SNPs

source

NextGP.BayesB — Function

    function BayesB(pi,v;estimatePi=false)

pi is the proportion of SNPs to be included in the model at each McMC cycel. If estimatePi=true, it is only used as a starting value.
v is the variance for the prior distribution of SNPs.
estimatePi is trueif pi is estimated. By default it is ´false´

source

NextGP.BayesC — Function

    function BayesC(pi,v;estimatePi=false)

pi is the proportion of SNPs to be included in the model at each McMC cycel. If estimatePi=true, it is only used as a starting value.
v is the variance for the prior distribution of SNPs.
estimatePi is trueif pi is estimated. By default it is ´false´

source

NextGP.BayesR — Function

    function BayesR(pi,class,v;estimatePi=false)

pi is the vector of proportion of SNPs for each variance class. If estimatePi=true, it is only used as a starting value.
class is the vector of scales of common SPN variance for each variance class. The scales should be in the increasing order. For example, [0.0,0.0001,0.001,0.01].
v is the variance for the prior distribution of SNPs.
estimatePi is trueif pi is estimated. By default it is ´false´

source

NextGP.BayesLV — Function

    function BayesLV(v,f,covariates,zeta)

v is the variance for the prior distribution of SNPs.
f is the model formula for the variance
covariatesis the DataFrame that includes explanatory varibles for the variance of each SNP.
zeta is the error variance in the model for the SNP variances.
If estimateVarZeta is true, it assumes that the error variance for the model for the SNP variances is 0.01 percent of the variance of the log-variances. If estimateVarZeta is false, it uses zeta as the error variance for the log-linear variances (fixed value). If estimateVarZeta is a Float64, it assumes that the error variance for the model for the SNP variances is estimateVarZeta percent of the variance of the log-variances.

source

Internals

Functions that are used internally are documented here

NextGP.prepMatVec.prep — Function

function prep(f::StatsModels.TermOrTerms, inputData::DataFrame;userHints=Dict{Symbol,Any}(),path2ped=[],priorVCV=[])

NextGP relies on StatsModels.jl package for model expression (f), and fixed effect design matrix generation.
Details for the model expression (f), and fixed effects coding specifications (e.g., effect or dummy coding) can be found at StatsModels.jl.
Design matrices for random effects are generated either own internal functions or using StatsModels.jls modelcols, depending on how user defined the model term in the model.
Reads in marker data, and mean-centers the columns.
Finally returns lhs vector and rhs matrices.
By default:
- all Int rhs variables are made Categorical,
- all String rhs variables (also those made Categorical) are dummy coded, except those defined by the user in userHints,
- all Float rhs variables are centered.

source

Convenience functions

Convergency statistics of parameter `param`, using `MCMCChains.jl`

using MCMCChains,StatsPlots,StatsBase,CSV

function convergencyMCMC(param;summary=false,plots=false,outFolder=pwd()*"/outMCMC")
        param = CSV.read("$outFolder/$(param)Out",DataFrame,header=true)
        namesParam = names(param)
        param = Matrix(param)
                if summary==true
                        chn = Chains(param,namesParam)
                        display(chn)
                        if plots==true
                                display(plot(chn))
                        end
                        param = mean(Matrix(param),dims=1)
                else
                        param = mean(Matrix(param),dims=1)
                end
        return param
end

If summary=true, will print convergency statistics for McMC
If plots=true, will print trace plot(s) of McMC
outFolder is the folder for the McMC output. By default it searches for the folder "outMCMC" in the current directory.

NextGP.jl: Modules and Functions

Basic Model

Public Functions

Internals

Convenience functions

Convergency statistics of parameter param, using MCMCChains.jl

Convergency statistics of parameter `param`, using `MCMCChains.jl`