NextGP.jl: Modules and Functions

  • NextGP.jl relies on several other Julia packages, which could be easily avoided and may be removed in the later releases.
  • Currently one core dependency is the StatsModels.jl package, for model expression, and fixed effect design matrix generation.
  • Checking StatsModels.jl's manual for at least formula and categorical data could be useful.

Basic Model

NextGP.jl uses the following basic model:

$

\mathbf{y}=\mathbf{X}\mathbf{b}+\sum{\mathbf{Zu}}+\sum{\mathbf{M\beta}}+\mathbf{e} $

b is a vector of fixed effects

u is a vector of polygenic effects in the model with a covariance structure $\mathbf{u}\sim N\left(0,\mathbf{A}\sigma_{a}^{2}\right)$

A is a relationship matrix from the pedigree

Z is a design matrix allocating animals to records

M are matrices of genotypes

\[\mathbf{\beta}\]

are vectors of marker effects


Public Functions

Public functions of NextGP are documented here

NextGP.MCMC.runLMEMFunction
function runLMEM(formula,userData,nChain,nBurn,nThin;myHints=Dict{Symbol,Any}(),blockThese=[],outFolder="outMCMC",VCV=[],userPedData=[],summaryStat=Dict{Any,Any}())
  • formula is the model equatio as made available by the StatsModels.jl package
  • userData is a DataFrame including lhs and rhs variables (other than defined by PED and SNP)
  • nChain,nBurn and nThin are the chain length, burn-in period, and the thining interval used for the MCMC sampling
  • Users can define coding of their variables (e.g. full dummy coding) by providing myHints. Check StatsModels.jl's manual for categorical data could be useful.
source
NextGP.makeAFunction
makeA(s::Any, d::Any)

Makes pedigree-based relationship matrix. adapted from http://morotalab.org/Mrode2005/relmat/createA.txt

source
NextGP.makePedFunction
makePed(inputFile::String,userData)

Makes pedigree using PedigreeBase package

source
NextGP.makeGFunction
makeG(inputFile::String;method=1)

Makes genomic relationship matrix based on vanRaden method 1 (defult) or method 2

source
    makeG(inputData::Array{Float64,2};method=1)

Makes genomic relationship matrix based on vanRaden method 1 (defult) or method 2

source
NextGP.SNPFunction
    function SNP(name,path;map="")
  • Defines SNP information for further analysis.
  • path is the path for the marker file, or the matrix for the marker genotypes.
  • Marker files are currently expected to be ordered as the phenotype data.
  • The method to be applied to the data, GBLUP, BayesB, BayesR etc, is defined in the prior setting.
  • Map file is optional. If not provided, a Bayesian Regression model with common variance for all SNPs will be applied. If provided, shoul match the order in the genotype file.
  • One most avoid overlapping marker sets by using different names.
source
NextGP.BayesPRFunction
    function BayesPR(r,v)
  • r is the region size. In other words, the number of SNPs that share a common variance.
    • 1: each SNP has its own (co)variance
    • 99: SNPs on the same chromosome has the same (co)variance
    • 9999: All SNPs have the same (co)variance
    • One can define any other region size, for example, 30, 40 or 100
  • v is an estimate of the variance for the distribution of SNPs
source
NextGP.BayesBFunction
    function BayesB(pi,v;estimatePi=false)
  • pi is the proportion of SNPs to be included in the model at each McMC cycel. If estimatePi=true, it is only used as a starting value.
  • v is the variance for the prior distribution of SNPs.
  • estimatePi is trueif pi is estimated. By default it is ´false´
source
NextGP.BayesCFunction
    function BayesC(pi,v;estimatePi=false)
  • pi is the proportion of SNPs to be included in the model at each McMC cycel. If estimatePi=true, it is only used as a starting value.
  • v is the variance for the prior distribution of SNPs.
  • estimatePi is trueif pi is estimated. By default it is ´false´
source
NextGP.BayesRFunction
    function BayesR(pi,class,v;estimatePi=false)
  • pi is the vector of proportion of SNPs for each variance class. If estimatePi=true, it is only used as a starting value.
  • class is the vector of scales of common SPN variance for each variance class. The scales should be in the increasing order. For example, [0.0,0.0001,0.001,0.01].
  • v is the variance for the prior distribution of SNPs.
  • estimatePi is trueif pi is estimated. By default it is ´false´
source
NextGP.BayesLVFunction
    function BayesLV(v,f,covariates,zeta)
  • v is the variance for the prior distribution of SNPs.
  • f is the model formula for the variance
  • covariatesis the DataFrame that includes explanatory varibles for the variance of each SNP.
  • zeta is the error variance in the model for the SNP variances.
  • If estimateVarZeta is true, it assumes that the error variance for the model for the SNP variances is 0.01 percent of the variance of the log-variances. If estimateVarZeta is false, it uses zeta as the error variance for the log-linear variances (fixed value). If estimateVarZeta is a Float64, it assumes that the error variance for the model for the SNP variances is estimateVarZeta percent of the variance of the log-variances.
source

Internals

Functions that are used internally are documented here

NextGP.prepMatVec.prepFunction
function prep(f::StatsModels.TermOrTerms, inputData::DataFrame;userHints=Dict{Symbol,Any}(),path2ped=[],priorVCV=[])
  • NextGP relies on StatsModels.jl package for model expression (f), and fixed effect design matrix generation.
  • Details for the model expression (f), and fixed effects coding specifications (e.g., effect or dummy coding) can be found at StatsModels.jl.
  • Design matrices for random effects are generated either own internal functions or using StatsModels.jls modelcols, depending on how user defined the model term in the model.
  • Reads in marker data, and mean-centers the columns.
  • Finally returns lhs vector and rhs matrices.
  • By default:
    • all Int rhs variables are made Categorical,
    • all String rhs variables (also those made Categorical) are dummy coded, except those defined by the user in userHints,
    • all Float rhs variables are centered.
source

Convenience functions

Convergency statistics of parameter param, using MCMCChains.jl

using MCMCChains,StatsPlots,StatsBase,CSV

function convergencyMCMC(param;summary=false,plots=false,outFolder=pwd()*"/outMCMC")
        param = CSV.read("$outFolder/$(param)Out",DataFrame,header=true)
        namesParam = names(param)
        param = Matrix(param)
                if summary==true
                        chn = Chains(param,namesParam)
                        display(chn)
                        if plots==true
                                display(plot(chn))
                        end
                        param = mean(Matrix(param),dims=1)
                else
                        param = mean(Matrix(param),dims=1)
                end
        return param
end
  • If summary=true, will print convergency statistics for McMC
  • If plots=true, will print trace plot(s) of McMC
  • outFolder is the folder for the McMC output. By default it searches for the folder "outMCMC" in the current directory.