grapp API#

Linear Algebra#

Non-standardized Linear Operators#

These operators work with scipy.sparse.linalg, and provide the ability to do matrix products against the unmodified genotype matrix that is represented by a GRG.

class grapp.linalg.ops_scipy.SciPyXOperator(*args, **kwargs)#

A scipy.sparse.linalg.LinearOperator on the genotype matrix represented by the GRG, which allows for multiplication between the GRG and a matrix or vector. This is for the non-standardized matrix, which just contains discrete allele counts.

Can perform the operation \(X \times A\) (_matmat) or \(X \times \overrightarrow{v}\) (_matvec).

Parameters:

grg (pygrgl.GRG) – The GRG the operator will multiply against.
direction (pygrgl.TraversalDirection) – Determines whether the matrix is \(X\) (pygrgl.TraversalDirection.UP) or \(X^T\) (pygrgl.TraversalDirection.DOWN).
dtype (TypeAlias) – The numpy.dtype to use.
haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.
miss_values (Optional[numpy.typing.NDArray]) – If non-None, must be a vector of length num_mutations, which provides a per-mutation value for missingness (applied per haplotype). Usually the per-Mutation mean value (e.g., missingness-adjusted allele frequency) is provided. Default: None.
mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: no filter.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be QxM (where Q is the length of sample_filter) instead of NxM. Default: no filter.
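
As a rough illustration of the two operations above, the sketch below wraps a small dense NumPy array in a scipy.sparse.linalg.LinearOperator. It is a stand-in for the GRG-backed operator, not grapp code: the array X and the helper dense_x_operator are hypothetical, and a real SciPyXOperator works from the GRG rather than from a dense array.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator

# Hypothetical dense genotype matrix: N=4 individuals (rows), M=3 mutations
# (columns), holding diploid allele counts in {0, 1, 2}.
X = np.array([[0, 1, 2],
              [1, 0, 0],
              [2, 1, 0],
              [0, 0, 1]], dtype=np.float64)


def dense_x_operator(mat):
    """Wrap a dense matrix so SciPy solvers can call matvec/matmat on it."""
    return LinearOperator(
        shape=mat.shape,
        matvec=lambda v: mat @ v,
        matmat=lambda A: mat @ A,
        dtype=mat.dtype,
    )


op_x = dense_x_operator(X)      # plays the role of X   (N x M)
op_xt = dense_x_operator(X.T)   # plays the role of X^T (M x N)

xv = op_x.matvec(np.ones(3))    # X times a vector: one total per individual
xta = op_xt.matmat(np.eye(4))   # X^T times the identity recovers X^T

# A filter only shrinks the implicit matrix: keeping mutations [0, 2]
# yields an N x P operator with P = 2.
op_filtered = dense_x_operator(X[:, [0, 2]])
```

The same shape rules apply to sample_filter, which would reduce the row count from N to Q instead.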

class grapp.linalg.ops_scipy.SciPyXTXOperator(*args, **kwargs)#

A scipy.sparse.linalg.LinearOperator on the matrix \(X^TX\) represented by the GRG. This is for the non-standardized matrix, which just contains discrete allele counts, but it is not centered at the mean, so it is not quite the covariance matrix.

Can perform the operation \(X^T \times X \times A\) (_matmat) or \(X^T \times X \times \overrightarrow{v}\) (_matvec).

Parameters:

grg (pygrgl.GRG) – The GRG the operator will multiply against.
dtype (TypeAlias) – The numpy.dtype to use.
haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.
miss_values (Optional[numpy.typing.NDArray]) – If non-None, must be a vector of length num_mutations, which provides a per-mutation value for missingness (applied per haplotype). Usually the per-Mutation mean value (e.g., missingness-adjusted allele frequency) is provided. Default: None.
mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: no filter.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be QxM (where Q is the length of sample_filter) instead of NxM. Default: no filter.
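
The remark that \(X^TX\) is "not quite the covariance matrix" can be checked on a toy example. The sketch below uses a hypothetical dense matrix (not a GRG) to show that the product \(X^T \times X \times \overrightarrow{v}\) can be evaluated as two successive matrix-vector products without ever forming \(X^TX\), and that the uncentered \(X^TX\) differs from the centered cross-product.

```python
import numpy as np

# Hypothetical dense allele-count matrix: 4 individuals x 2 mutations.
X = np.array([[0, 1],
              [1, 1],
              [2, 0],
              [1, 2]], dtype=np.float64)
v = np.array([1.0, -1.0])

# Two matrix-vector products; the M x M matrix X^T X is never built.
two_step = X.T @ (X @ v)

# Reference result that does build X^T X explicitly.
xtx = X.T @ X
explicit = xtx @ v

# Centering each column at its mean gives a different cross-product matrix,
# which is why the uncentered X^T X is not the covariance matrix.
centered = X - X.mean(axis=0)
centered_xtx = centered.T @ centered
```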

class grapp.linalg.ops_scipy.SciPyXXTOperator(*args, **kwargs)#

A scipy.sparse.linalg.LinearOperator on the matrix \(XX^T\) represented by the GRG. This is for the non-standardized matrix, which just contains discrete allele counts.

Can perform the operation \(X \times X^T \times A\) (_matmat) or \(X \times X^T \times \overrightarrow{v}\) (_matvec).

Parameters:

grg (pygrgl.GRG) – The GRG the operator will multiply against.
dtype (TypeAlias) – The numpy.dtype to use.
haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.
miss_values (Optional[numpy.typing.NDArray]) – If non-None, must be a vector of length num_mutations, which provides a per-mutation value for missingness (applied per haplotype). Usually the per-Mutation mean value (e.g., missingness-adjusted allele frequency) is provided. Default: None.
mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: no filter.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be QxM (where Q is the length of sample_filter) instead of NxM. Default: no filter.

Standardized Linear Operators#

These operators work with scipy.sparse.linalg, and provide the ability to do matrix products against the genotype matrix that is represented by a GRG, except that genotype matrix is implicitly standardized by subtracting the mean and dividing by the standard deviation.

class grapp.linalg.ops_scipy.SciPyStdXOperator(*args, **kwargs)#

A scipy.sparse.linalg.LinearOperator on the genotype matrix represented by the GRG, which allows for multiplication between the GRG and a matrix or vector. This is for the standardized matrix, which is centered at the mean (based on allele frequencies) and scaled by the standard deviation (based on the binomial distribution where each individual is the result of \(p\), the ploidy, trials).

Can perform the operation \(X \times A\) (_matmat) or \(X \times \overrightarrow{v}\) (_matvec).

Parameters:

grg (pygrgl.GRG) – The GRG the operator will multiply against.
direction (pygrgl.TraversalDirection) – Determines whether the matrix is \(X\) (pygrgl.TraversalDirection.UP) or \(X^T\) (pygrgl.TraversalDirection.DOWN).
freqs (numpy.ndarray) – A vector of length num_mutations, containing the allele frequency for all mutations. Indexed by the mutation ID of the mutation.
dtype (TypeAlias) – The numpy.dtype to use.
haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.
mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: no filter.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be QxM (where Q is the length of sample_filter) instead of NxM. Default: no filter.
alpha (float) – Alpha model coefficient (e.g., Speed et al., 2012) for variance, which multiplicatively scales the genotype matrix by sqrt(variance^alpha). By default alpha=-1, which corresponds to the “standard” binomial variance scaling.
custom_variance (numpy.ndarray) – Instead of using binomial variance, use provided custom variance for mutations. Must be an array of length num_mutations, for example the result from grapp.util.variance(). Default: None.
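
The standardization and the alpha scaling described above can be sketched on a dense matrix. This is an illustration under stated assumptions, not grapp's implementation: it assumes the per-mutation mean is ploidy × frequency and the binomial variance is ploidy × frequency × (1 − frequency), and the helper name standardize is hypothetical.

```python
import numpy as np

# Hypothetical dense diploid genotype matrix: 4 individuals x 3 mutations.
X = np.array([[0, 1, 2],
              [1, 0, 0],
              [2, 1, 0],
              [0, 0, 1]], dtype=np.float64)
ploidy = 2

# Allele frequency per mutation: allele count over number of haplotypes.
freqs = X.sum(axis=0) / (ploidy * X.shape[0])


def standardize(X, freqs, ploidy=2, alpha=-1.0):
    """Center by the binomial mean, then scale by sqrt(variance ** alpha)."""
    mean = ploidy * freqs                         # assumed binomial mean
    variance = ploidy * freqs * (1.0 - freqs)     # assumed binomial variance
    return (X - mean) * np.sqrt(variance ** alpha)


# alpha = -1 divides each column by its binomial standard deviation.
Z = standardize(X, freqs, ploidy)

# alpha = 0 leaves the scale untouched, so only the centering is applied.
Z_centered_only = standardize(X, freqs, ploidy, alpha=0.0)
```

A custom_variance array would take the place of the variance vector computed inside the helper.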

class grapp.linalg.ops_scipy.SciPyStdXTXOperator(*args, **kwargs)#

A scipy.sparse.linalg.LinearOperator on the matrix \(X^TX\) represented by the GRG. This is for the standardized matrix, which is centered at the mean (based on allele frequencies) and scaled by the standard deviation (based on the binomial distribution where each individual is the result of \(p\), the ploidy, trials).

This operator performs multiplications against the correlation matrix of the genotype matrix underlying the GRG. Can perform the operation \(X^T \times X \times A\) (_matmat) or \(X^T \times X \times \overrightarrow{v}\) (_matvec).

Parameters:

grg (pygrgl.GRG) – The GRG the operator will multiply against.
freqs (numpy.ndarray) – A vector of length num_mutations, containing the allele frequency for all mutations. Indexed by the mutation ID of the mutation.
dtype (TypeAlias) – The numpy.dtype to use.
haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.
mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: no filter.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be QxM (where Q is the length of sample_filter) instead of NxM. Default: no filter.
alpha (float) – Alpha model coefficient (e.g., Speed et al., 2012) for variance, which multiplicatively scales the genotype matrix by sqrt(variance^alpha). By default alpha=-1, which corresponds to the “standard” binomial variance scaling.
custom_variance (numpy.ndarray) – Instead of using binomial variance, use provided custom variance for mutations. Must be an array of length num_mutations, for example the result from grapp.util.variance(). Default: None.

class grapp.linalg.ops_scipy.SciPyStdXXTOperator(*args, **kwargs)#

A scipy.sparse.linalg.LinearOperator on the matrix \(XX^T\) represented by the GRG. This is for the standardized matrix, which is centered at the mean (based on allele frequencies) and scaled by the standard deviation (based on the binomial distribution where each individual is the result of \(p\), the ploidy, trials).

This operator performs multiplications against the correlation matrix of the genotype matrix underlying the GRG. Can perform the operation \(X \times X^T \times A\) (_matmat) or \(X \times X^T \times \overrightarrow{v}\) (_matvec).

Parameters:

grg (pygrgl.GRG) – The GRG the operator will multiply against.
freqs (numpy.ndarray) – A vector of length num_mutations, containing the allele frequency for all mutations. Indexed by the mutation ID of the mutation.
dtype (TypeAlias) – The numpy.dtype to use.
haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.
mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: no filter.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be QxM (where Q is the length of sample_filter) instead of NxM. Default: no filter.
alpha (float) – Alpha model coefficient (e.g., Speed et al., 2012) for variance, which multiplicatively scales the genotype matrix by sqrt(variance^alpha). By default alpha=-1, which corresponds to the “standard” binomial variance scaling.
custom_variance (numpy.ndarray) – Instead of using binomial variance, use provided custom variance for mutations. Must be an array of length num_mutations, for example the result from grapp.util.variance(). Default: None.

Linear Operators for Multiple GRGs#

These operators are the same as the ones above, except they allow for multiple GRGs to be used for a single product. For example, if you have GRGs for each of the 22 autosomes, you can construct these operators and pass in all 22 GRGs. The resulting operator will perform matrix multiplications against all of the autosomes combined. You can use multiple threads to parallelize by GRG.

class grapp.linalg.ops_scipy.MultiSciPyXOperator(*args, **kwargs)#

A scipy.sparse.linalg.LinearOperator on multiple GRGs. Same as SciPyXOperator, except if the input GRGs have mutation counts M1, M2, …, MK, then the dimension of the implicit underlying genotype matrix is Nx(M1 + M2 + … + MK).

Parameters:

grgs (List[pygrgl.GRG]) – The GRGs the operator will multiply against. They must all have the same samples, and the mutations are expected to differ (e.g., one GRG per chromosome of the same dataset).
direction (pygrgl.TraversalDirection) – Determines whether the matrix is \(X\) (pygrgl.TraversalDirection.UP) or \(X^T\) (pygrgl.TraversalDirection.DOWN).
dtype (TypeAlias) – The numpy.dtype to use.
haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.
miss_values (Optional[numpy.typing.NDArray]) – If non-None, must be a vector of length num_mutations (for all GRGs), which provides a per-mutation value for missingness (applied per haplotype). Usually the per-Mutation mean value (e.g., missingness-adjusted allele frequency) is provided. Default: None.
mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Here the mutation filter follows the same numbering as the input/output matrices: for example, if grgs=[grg1, grg2] then indexes 0…(grg1.num_mutations-1) will be for grg1, and grg1.num_mutations…(grg1.num_mutations+grg2.num_mutations-1) will be the mutations for grg2. Then if you have a mutation_filter containing the number grg1.num_mutations + 4 it means it will keep grg2’s mutation with ID 4. Default: no filter.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be QxM (where Q is the length of sample_filter) instead of NxM. Since all GRGs have the same samples, this behavior is the same as the non-Multi operators. Default: no filter.
threads (int) – Number of threads for performing the multiplication. Each GRG can be done in parallel.
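
The concatenated mutation numbering used by mutation_filter can be sketched with plain arithmetic. The helper below is hypothetical (it is not part of grapp) and only needs the per-GRG mutation counts, which stand in for grg1.num_mutations, grg2.num_mutations, and so on.

```python
import numpy as np


def split_global_mutation_index(global_index, mutation_counts):
    """Map an index into the concatenated mutation axis to (GRG position, local mutation ID)."""
    offsets = np.cumsum(mutation_counts)   # end (exclusive) of each GRG's index range
    if global_index < 0 or global_index >= offsets[-1]:
        raise IndexError("mutation index out of range")
    grg_position = int(np.searchsorted(offsets, global_index, side="right"))
    start = 0 if grg_position == 0 else int(offsets[grg_position - 1])
    return grg_position, global_index - start


# Stand-ins for grg1.num_mutations and grg2.num_mutations.
counts = [10, 7]

# The example from the text: grg1.num_mutations + 4 is mutation 4 of the second GRG.
located = split_global_mutation_index(counts[0] + 4, counts)

# The last index of the first GRG and the first index of the second GRG.
last_of_first = split_global_mutation_index(9, counts)
first_of_second = split_global_mutation_index(10, counts)
```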

class grapp.linalg.ops_scipy.MultiSciPyXTXOperator(*args, **kwargs)#

A scipy.sparse.linalg.LinearOperator on multiple GRGs. Same as SciPyXTXOperator, except if the input GRGs have mutation counts M1, M2, …, MK, then the dimension of the implicit underlying genotype matrix is Nx(M1 + M2 + … + MK).

Parameters:

grgs (List[pygrgl.GRG]) – The GRGs the operator will multiply against. They must all have the same samples, and the mutations are expected to differ (e.g., one GRG per chromosome of the same dataset).
dtype (TypeAlias) – The numpy.dtype to use.
haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.
miss_values (Optional[numpy.typing.NDArray]) – If non-None, must be a vector of length num_mutations (for all GRGs), which provides a per-mutation value for missingness (applied per haplotype). Usually the per-Mutation mean value (e.g., missingness-adjusted allele frequency) is provided. Default: None.
mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Here the mutation filter follows the same numbering as the input/output matrices: for example, if grgs=[grg1, grg2] then indexes 0…(grg1.num_mutations-1) will be for grg1, and grg1.num_mutations…(grg1.num_mutations+grg2.num_mutations-1) will be the mutations for grg2. Then if you have a mutation_filter containing the number grg1.num_mutations + 4 it means it will keep grg2’s mutation with ID 4. Default: no filter.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be QxM (where Q is the length of sample_filter) instead of NxM. Since all GRGs have the same samples, this behavior is the same as the non-Multi operators. Default: no filter.
threads (int) – Number of threads for performing the multiplication. Each GRG can be done in parallel.

class grapp.linalg.ops_scipy.MultiSciPyStdXOperator(*args, **kwargs)#

A scipy.sparse.linalg.LinearOperator on multiple GRGs. Same as SciPyStdXOperator, except if the input GRGs have mutation counts M1, M2, …, MK, then the dimension of the implicit underlying genotype matrix is Nx(M1 + M2 + … + MK).

Parameters:

grgs (List[pygrgl.GRG]) – The GRGs the operator will multiply against. They must all have the same samples, and the mutations are expected to differ (e.g., one GRG per chromosome of the same dataset).
direction (pygrgl.TraversalDirection) – Determines whether the matrix is \(X\) (pygrgl.TraversalDirection.UP) or \(X^T\) (pygrgl.TraversalDirection.DOWN).
dtype (TypeAlias) – The numpy.dtype to use.
haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.
mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Here the mutation filter follows the same numbering as the input/output matrices: for example, if grgs=[grg1, grg2] then indexes 0…(grg1.num_mutations-1) will be for grg1, and grg1.num_mutations…(grg1.num_mutations+grg2.num_mutations-1) will be the mutations for grg2. Then if you have a mutation_filter containing the number grg1.num_mutations + 4 it means it will keep grg2’s mutation with ID 4. Default: no filter.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be QxM (where Q is the length of sample_filter) instead of NxM. Since all GRGs have the same samples, this behavior is the same as the non-Multi operators. Default: no filter.
threads (int) – Number of threads for performing the multiplication. Each GRG can be done in parallel.
alpha (float) – Alpha model coefficient (e.g., Speed et al., 2012) for variance, which multiplicatively scales the genotype matrix by sqrt(variance^alpha). By default alpha=-1, which corresponds to the “standard” binomial variance scaling.
custom_variance (Optional[Union[numpy.ndarray, List[numpy.ndarray]]]) – Instead of using binomial variance, use provided custom variance for mutations. Either a single array of length num_mutations (applied to every GRG), or a list of per-GRG arrays. Default: None.

class grapp.linalg.ops_scipy.MultiSciPyStdXTXOperator(*args, **kwargs)#

A scipy.sparse.linalg.LinearOperator on multiple GRGs. Same as SciPyStdXTXOperator, except if the input GRGs have mutation counts M1, M2, …, MK, then the dimension of the implicit underlying genotype matrix is Nx(M1 + M2 + … + MK).

Parameters:

grgs (List[pygrgl.GRG]) – The GRGs the operator will multiply against. They must all have the same samples, and the mutations are expected to differ (e.g., one GRG per chromosome of the same dataset).
dtype (TypeAlias) – The numpy.dtype to use.
haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.
mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Here the mutation filter follows the same numbering as the input/output matrices: for example, if grgs=[grg1, grg2] then indexes 0…(grg1.num_mutations-1) will be for grg1, and grg1.num_mutations…(grg1.num_mutations+grg2.num_mutations-1) will be the mutations for grg2. Then if you have a mutation_filter containing the number grg1.num_mutations + 4 it means it will keep grg2’s mutation with ID 4. Default: no filter.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be QxM (where Q is the length of sample_filter) instead of NxM. Since all GRGs have the same samples, this behavior is the same as the non-Multi operators. Default: no filter.
threads (int) – Number of threads for performing the multiplication. Each GRG can be done in parallel.
alpha (float) – Alpha model coefficient (e.g., Speed et al., 2012) for variance, which multiplicatively scales the genotype matrix by sqrt(variance^alpha). By default alpha=-1, which corresponds to the “standard” binomial variance scaling.
custom_variance (Optional[Union[numpy.ndarray, List[numpy.ndarray]]]) – Instead of using binomial variance, use provided custom variance for mutations. Either a single array of length num_mutations (applied to every GRG), or a list of per-GRG arrays. Default: None.

PCA and other helper methods#

Linear algebra-related operations on GRG. These are typically “generic” operations that could apply to many different types of analyses.

class grapp.linalg.MatrixSelection(*values)#

Bases: Enum

X = 1#

XT = 2#

XTX = 3#

grapp.linalg.PCs(grgs: GRG | List[GRG] | GRGCalcInterface | List[GRGCalcInterface], k: int, include_eig: bool = False, use_pro_pca: bool = False, sample_window: int = 1, threads: int = 1, init_vector: NDArray | None = None, tol: float = 0)#

Get the principal components for each sample corresponding to the first \(k\) eigenvectors from a GRG.

Parameters:

grgs (Union[pygrgl.GRG, List[pygrgl.GRG]]) – The GRG or list of GRGs to perform PCA on.
k (int) – The number of eigenvectors/values to use. These correspond to the k largest eigenvalues.
include_eig (bool) – When True, the return value is a triple of (DataFrame, EigenValues, EigenVectors), where the eigen values are as returned by scipy.sparse.linalg.eigsh(). This can increase RAM usage because it forces the use of \(X^TX\) instead of \(XX^T\). Default: False.
sample_window (Optional[int]) – If provided, defines a window width in base pairs. Within each window (starting at the Mutation with the lowest coordinate) randomly choose a single SNP and use that for performing PCA. Default: 1 (use every SNP).
threads (int) – Number of threads to use. Will never use more than the number of input GRGs. Default: 1.
init_vector (Optional[numpy.typing.NDArray]) – Optional starting vector for the iterative solver, passed to eigsh as v0.
tol (float) – Convergence tolerance for the iterative solver, passed to eigsh. 0 means machine precision. Default: 0.

Returns:

A pandas.DataFrame with a row per individual and a column per principal component. Or, if include_eig then a triple (dataframe, eigen values, eigen vectors), where eigen vectors are None unless use_pro_pca was True.

Return type:

Union[pandas.DataFrame, Tuple[pandas.DataFrame, numpy.array, Optional[numpy.array]]]

grapp.linalg.eigs(matrix: MatrixSelection, grg: GRG | GRGCalcInterface, k: int, standardized: bool = True, haploid: bool = False, op_kwargs: Dict[str, Any] = {}) → Tuple[NDArray, NDArray]#

Get the first K eigen values and vectors from a GRG.

Parameters:

matrix (MatrixSelection) – Which matrix derived from the GRG should be used: the genotype matrix (MatrixSelection.X), the transposed genotype matrix (MatrixSelection.XT), or the covariance/correlation matrix (MatrixSelection.XTX).
grg (pygrgl.GRG) – The GRG to operate on.
k (int) – The number of (largest) eigen values/vectors to retrieve.
standardized (bool) – Set to False to use the non-standardized matrix. Default: True.
haploid (bool) – Set to True to use the haploid values (0,1) instead of diploid values (0,1,2).

Returns:

(eigen_value, eigen_vectors) as defined by scipy.sparse.linalg.eigs

grapp.linalg.get_eig_pcs(grgs: GRG | List[GRG] | GRGCalcInterface | List[GRGCalcInterface], k: int, op_kwargs: Dict[str, Any] = {}, threads: int = 1, verbose: bool = True, do_xtx: bool = False, init_vector: NDArray | None = None, tol: float = 0) → Tuple[NDArray, NDArray, NDArray | None]#

Get the principal components for each sample corresponding to the first \(k\) eigenvectors from a GRG, using an iterative eigenvector decomposition method.

The computation is routed through the backend-agnostic operator selector on GRGCalcInterface, so it runs on either the NumPy/CPU backend (GRGCalculator) or the CuPy/GPU backend (GRGSpMVCalculator) depending on the GRG passed in. Regardless of backend, the returned arrays are host NumPy arrays.

Parameters:

grgs (Union[pygrgl.GRG, List[pygrgl.GRG]]) – The GRG or list of GRGs to perform PCA on.
k (int) – The number of eigenvectors/values to use. These correspond to the k largest eigenvalues.
op_kwargs (Dict[str, Any]) – A dictionary of keyword arguments to pass to the underlying SciPyStdXTXOperator.
threads (int) – Maximum number of threads to use. At most len(grgs) tasks can be done in parallel.
verbose (bool) – Emit information on stderr.
do_xtx (bool) – Use eigsh(X^TX) instead of the default eigsh(XX^T). Default: False.
init_vector (Optional[numpy.typing.NDArray]) – Optional starting vector for the iterative solver, passed to eigsh as v0.
tol (float) – Convergence tolerance for the iterative solver, passed to eigsh. 0 means machine precision. Default: 0.

Returns:

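A tuple whose first two elements are (PC_scores, eigen_values), each a numpy array. Per the return type, a third element is also returned, which is either a numpy array or None.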

Return type:

Tuple[numpy.ndarray, numpy.ndarray, Optional[numpy.ndarray]]
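
The choice between eigsh(XX^T) and eigsh(X^TX) does not change the principal component scores, only the size of the matrix being decomposed. The sketch below checks this on a small random dense matrix with numpy.linalg.eigh; it is a dense illustration of the linear algebra, not the iterative solver path grapp uses, and the helper top_k_eigh is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))   # 6 samples x 4 features
X -= X.mean(axis=0)               # center each column
k = 2


def top_k_eigh(sym, k):
    """Return the k largest eigenvalues of a symmetric matrix and their eigenvectors."""
    values, vectors = np.linalg.eigh(sym)     # eigh returns ascending order
    order = np.argsort(values)[::-1][:k]
    return values[order], vectors[:, order]


# Route 1: decompose the N x N matrix XX^T; scores are eigenvectors scaled
# by the square root of the eigenvalues.
vals_xxt, U = top_k_eigh(X @ X.T, k)
scores_xxt = U * np.sqrt(vals_xxt)

# Route 2: decompose the M x M matrix X^T X; scores are the projection X V.
vals_xtx, V = top_k_eigh(X.T @ X, k)
scores_xtx = X @ V
```

The two score matrices agree up to the sign of each column, since eigenvectors are only determined up to sign.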

grapp.linalg.sort_by_eigvalues(eigen_values: NDArray, eigen_vectors: NDArray)#

Reorder the eigen value and vector arrays so that they are in descending order of the corresponding eigen value.

Parameters:

eigen_values (numpy.typing.NDArray) – The vector of eigen values, of length k.
eigen_vectors (numpy.typing.NDArray) – The matrix of eigen vectors, with k columns.
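
A minimal sketch of this kind of reordering, assuming eigenvectors are stored as columns as described above. The function sort_desc is a hypothetical stand-in and makes no claim about what grapp.linalg.sort_by_eigvalues returns or whether it sorts in place.

```python
import numpy as np


def sort_desc(eigen_values, eigen_vectors):
    """Return copies ordered by descending eigenvalue, keeping columns paired."""
    order = np.argsort(eigen_values)[::-1]
    return eigen_values[order], eigen_vectors[:, order]


values = np.array([0.5, 3.0, 1.5])
vectors = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])   # column j belongs to values[j]

sorted_values, sorted_vectors = sort_desc(values, vectors)
```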

Association Studies (GWAS)#

grapp.assoc.linear_assoc_covar(grg: GRG | GRGCalcInterface, Y: NDArray, C: NDArray, only_beta: bool = False, hide_covars: bool = True, standardize: bool = False, method: str = 'QR', dist: str = 'sample') → DataFrame#

Performs regression for each mutation with covariate adjustment. Missing data is treated as the mean genotype value (allele frequency for the relevant variant). Uses QR decomposition to project out covariate effects from the phenotype and genotype vectors.

Parameters:

Y (numpy.ndarray) – Phenotype vector of shape (n_samples,), with missing values specified as NaN.
C (numpy.ndarray) – Covariate matrix of shape (n_samples, n_covariates). Should include intercept.
only_beta (bool) – If True, returns only the BETA column in the output.
hide_covars (bool) – If False, includes estimated covariate effects (GAMMA_i) in the output.
standardize (bool) – If True, standardize X and Y (after adjusting for covariates).
method (str) – Either “QR” (default) or “regress”. “QR” uses QR decomposition to adjust both \(X\) and \(Y\) for covariates (\(C\)), but if standardize=True then it assumes that \(X\) and \(C\) are independent (the more correlated they are, the less “standardized” the result will be). “regress” uses linear regression between \(Y\) and \(C\) (to get \(B_c\)), and then performs GWAS against \(Y'\) (\(Y' = Y - C \times B_c\)).
dist (str) – How to compute the \(diag(X^T X)\) term. Options are: “sample” (use individual coalescence information to compute sample mean and variance), “binomial” (assume the diploid data follows a binomial distribution, for mean and variance). Default: “sample”.

Returns:

A DataFrame containing at least BETA, SE, T, and P columns. If hide_covars is False, also includes GAMMA columns.

Return type:

pandas.DataFrame
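
The "QR" adjustment can be sketched on dense arrays. This is an illustration of the projection idea only, with randomly generated stand-in data; it does not reproduce grapp's handling of missing data, standard errors, or the GRG-based products.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 3
X = rng.integers(0, 3, size=(n, m)).astype(float)                # stand-in genotypes
C = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])   # intercept + 2 covariates
Y = rng.standard_normal(n)                                       # stand-in phenotype

# Reduced QR of the covariates: the columns of Q span the column space of C.
Q, _ = np.linalg.qr(C)

# Project the covariates out of both the phenotype and every genotype column.
Y_res = Y - Q @ (Q.T @ Y)
X_res = X - Q @ (Q.T @ X)

# Per-mutation effect size from the residualized vectors.
beta = (X_res * Y_res[:, None]).sum(axis=0) / (X_res ** 2).sum(axis=0)
```

Each beta equals the coefficient that an ordinary least-squares fit of Y on that mutation plus all covariates would give, which is why the covariates only need to be factored once.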

grapp.assoc.linear_assoc_no_covar(grg: GRG | GRGCalcInterface, Y: NDArray, only_beta: bool = False, standardize: bool = False, dist: str = 'sample') → DataFrame#

Performs regression for each mutation without adjusting for covariates. Missing data is treated as the mean genotype value (allele frequency for the relevant variant).

Parameters:

Y (numpy.ndarray) – Phenotype vector of shape (n_samples,), with missing values specified as NaN.
only_beta (bool) – If True, returns a DataFrame with only the BETA column.
standardize (bool) – If True, standardize X and Y.
dist (str) – How to compute the \(diag(X^T X)\) term. Options are: “sample” (use individual coalescence information to compute sample mean and variance), “binomial” (assume the diploid data follows a binomial distribution, for mean and variance). Default: “sample”.

Returns:

A DataFrame containing statistics for each mutation: POS, ALT, COUNT, BETA, B0, SE, R2, T, and P.

Return type:

pandas.DataFrame

grapp.assoc.read_pheno(filename: str, return_indivs: bool = False, verbose: bool = True) → NDArray | Tuple[NDArray, List[str]]#

Reads a PLINK/GCTA/GRG-style phenotype file and returns the phenotype vector. In all cases, the row order of the file must match the individual indexing order of the GRG.

PLINK-style: Optional header “FID IID PHEN”. The first column is the family ID (ignored by grapp), the second column is the IID (used for validation by grapp), and the third column is the numerical phenotype value. The IID order is checked against the GRG individual ID order, when IDs are present in the GRG (and also not “NA” in the plink input file).

GRG-style: Optional header “person_id phenotypes”. The first column is checked against the GRG individual ID order, when IDs are present in GRG (and also not “NA” in the phenotype file). The second column is the numerical phenotype value.

If there is no header, the number of columns determines the type (3 = plink, 2 = GRG).

In all cases, columns can be tab-separated or space separated.

Parameters:

filename (str) – Path to the phenotype file.
return_indivs (bool) – If True, return a list of individual identifiers (IIDs from the plink file).
verbose (bool) – Emit warnings/information about the file if True. Default: True.

Returns:

A one-dimensional NumPy array of phenotype values. If return_indivs is True, a tuple of that array and the list of individual IDs (strings).

Return type:

Union[numpy.typing.NDArray, Tuple[numpy.typing.NDArray, List[str]]]
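
The file layouts described above can be sketched with a small parser. This is a simplified, hypothetical stand-in (parse_pheno_text is not a grapp function): it works on a string instead of a file, treats a first row whose last field is not numeric as the header, always infers the layout from the column count, and does not validate IDs against a GRG.

```python
import numpy as np


def _is_number(token):
    """True when the token parses as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_pheno_text(text):
    """Parse 2-column (ID, value) or 3-column (FID, IID, value) phenotype text."""
    # str.split() with no argument handles both tabs and spaces.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if rows and not _is_number(rows[0][-1]):
        rows = rows[1:]                   # drop the optional header line
    if not rows:
        raise ValueError("no phenotype rows found")
    n_cols = len(rows[0])
    if n_cols == 3:
        id_col, value_col = 1, 2          # PLINK-style: FID IID PHEN
    elif n_cols == 2:
        id_col, value_col = 0, 1          # GRG-style: person_id phenotypes
    else:
        raise ValueError("expected 2 or 3 columns")
    ids = [row[id_col] for row in rows]
    values = np.array([float(row[value_col]) for row in rows])
    return values, ids


plink_values, plink_ids = parse_pheno_text("FID IID PHEN\nf1 a 1.5\nf2\tb\t-0.25\n")
grg_values, grg_ids = parse_pheno_text("a 2.0\nb 3.0\n")
```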

grapp.assoc.read_plink_covariates(covar_path: str, return_indivs: bool = False, verbose: bool = False) → NDArray | Tuple[NDArray, List[str]]#

Reads a PLINK-style covariate file: Optional header line, FID and IID in first two columns, covariates in remaining columns. The first two columns (FID/IID) are ignored. Does not allow

Parameters:

covar_path (str) – Path to the covariate file.
return_indivs (bool) – If True, return a list of individual identifiers (IIDs from the plink file).
verbose (bool) – Emit warnings/information about the file if True. Default: False.

Returns:

The covariates as a numpy array of shape (n_samples, n_covariates). If return_indivs is True, then also return a list of individual IDs (strings).

Return type:

Union[numpy.typing.NDArray, Tuple[numpy.typing.NDArray, List[str]]]

Nearest Neighbor Comparisons#

class grapp.nn.NearestNeighborContext(grg: GRG)#

Bases: object

The main class for performing neighbor queries against the GRG format. Holds cached information related to nearest-neighbor queries on a specific GRG.

exact_hamming_dists(seeds: NDArray, direction: TraversalDirection, emit_all_nodes: bool = False) → NDArray#

Using exact computations, get the Hamming distances from the matrix of input seeds to every other sample in the GRG.

Parameters:

seeds (numpy.ndarray) – A two-dimensional numpy array. Each row corresponds to a single “query” for distances, and contains a ‘1’ for every mutation (downward direction) or sample (upward direction) that is used by the query item.
direction (pygrgl.TraversalDirection) – Whether to find the distances to Samples (pygrgl.TraversalDirection.DOWN) or the distances to Mutations (pygrgl.TraversalDirection.UP). The number of columns in the seeds input matrix must match the direction, so columns(seeds) == grg.num_mutations if direction is down, and columns(seeds) == grg.num_samples if direction is up.

Returns:

A two-dimensional numpy array where the number of rows matches the input matrix; i.e. each row is a result from each query. The number of columns is the opposite of the input (similar to pygrgl.matmul), so if the seeds have grg.num_mutations columns then the result will have grg.num_samples columns.

Return type:

numpy.ndarray

exact_hamming_dists_by_mutation(mutation_ids: List[int], emit_all_nodes: bool = False) → NDArray#

Using exact computations, get the Hamming distances from the list of input mutation IDs (for Mutations in the GRG) to every other mutation in the GRG.

Parameters:

mutation_ids (List[int]) – List of GRG Mutation IDs, each of which will be queried for distance to all other Mutations.
emit_all_nodes (bool) – Set to True to compute distances to every _node_ in the graph, not just every other Mutation. The output Matrix will have num_nodes columns when True.

Returns:

Matrix of distances, where each row corresponds to input Mutation IDs, and each column is the distance from the “other” Mutation ID. For example, if the 0th input is Mutation ID “m0”, then the 0th row of output is [D(m0, 0), D(m0, 1), …, D(m0, M-1)] where M is the number of mutations.

Return type:

numpy.ndarray

exact_hamming_dists_by_sample(sample_ids: List[int], emit_all_nodes: bool = False) → NDArray#

Using exact computations, get the Hamming distances from the list of input sample IDs (for samples in the GRG) to every other sample in the GRG.

Parameters:

sample_ids (List[int]) – List of GRG Node IDs for sample nodes, each of which will be queried for distance to all other samples.
emit_all_nodes (bool) – Set to True to compute distances to every _node_ in the graph, not just every other sample. The output Matrix will have num_nodes columns when True.

Returns:

Matrix of distances, where each row corresponds to input sample IDs, and each column is the distance from the “other” sample ID. For example, if the 0th input is sample ID “n0”, then the 0th row of output is [D(n0, 0), D(n0, 1), …, D(n0, N-1)] where N is the number of haploid samples.

Return type:

numpy.ndarray
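
As a point of reference for what these distances mean, here is a minimal pure-Python sketch. It assumes each haploid sample is represented by the set of Mutation IDs it carries, so the Hamming distance is the number of Mutations carried by exactly one of the two samples. The toy data and the `hamming_rows` helper are hypothetical and do not touch a GRG; they only illustrate the shape of the output described above.

```python
from typing import Dict, List, Set


def hamming_rows(carried: Dict[int, Set[int]], sample_ids: List[int]) -> List[List[int]]:
    """Return one row per queried sample: [D(q, 0), D(q, 1), ..., D(q, N-1)]."""
    num_samples = len(carried)
    rows = []
    for query in sample_ids:
        # Symmetric difference = mutations present in exactly one of the two samples.
        rows.append([len(carried[query] ^ carried[other]) for other in range(num_samples)])
    return rows


# Toy data (hypothetical): three haploid samples and the Mutation IDs each carries.
carried = {0: {0, 1, 2}, 1: {1, 2}, 2: {3}}
dists = hamming_rows(carried, [0, 2])
```

Each row has one entry per sample, and the entry for the queried sample itself is always zero.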

fast_pairwise_hamming(node1: int, node2: int, direction: TraversalDirection) → int#

Compute the Hamming distance between a pair of samples or mutations (or arbitrary nodes in the graph, but that has a less well-defined “meaning”). This calculation is extremely fast for a pair that are highly similar (low Hamming distance), as it shortcuts the graph traversal by making use of pygrgl.shared_frontier().

Note: Ensure that node1 != node2 prior to calling.

Parameters:

node1 (int) – The first node ID (e.g., sample ID or node associated with a mutation).
node2 (int) – The second node ID (e.g., sample ID or node associated with a mutation).
direction (pygrgl.TraversalDirection) – The direction to use for distance calculation. pygrgl.TraversalDirection.UP means to compare the sets of Mutations of the two nodes (the distance counts the differing Mutations) and pygrgl.TraversalDirection.DOWN means to compare the sets of Samples.

property grg#

property muts_above: NDArray#: Vector of length grg.num_nodes, where each node’s value is the number of Mutations above that node in the graph.

property samps_below: NDArray#: Vector of length grg.num_nodes, where each node’s value is the number of sample nodes below that node in the graph.

Filtering, Export, etc.#

Filtering GRGs#

Functions for filtering data out of a GRG to create a new, smaller GRG.

grapp.util.filter.grg_save_freq(grg_or_filename: GRG | str, out_filename: str, freq_range: Tuple[float, float])#

Given a GRG filename or object, save a new GRG that contains only the Mutations in the given frequency range.

Parameters:

grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.
out_filename (str) – The filename of the to-be-created GRG.
freq_range (Tuple[float, float]) – A pair (lower, upper), where the Mutations will be kept if lower <= frequency(Mutation) < upper. I.e., lower is inclusive and upper is exclusive.
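
The half-open interval is easy to get wrong at the upper boundary, so here is a minimal sketch of the selection rule on a plain list of frequencies. The `in_freq_range` helper and the toy frequencies are hypothetical; only the `lower <= frequency < upper` rule comes from the description above.

```python
from typing import List, Tuple


def in_freq_range(freqs: List[float], freq_range: Tuple[float, float]) -> List[int]:
    """Return the indexes whose frequency lies in [lower, upper)."""
    lower, upper = freq_range
    return [mut_id for mut_id, freq in enumerate(freqs) if lower <= freq < upper]


# Toy allele frequencies indexed by MutationID (hypothetical values).
freqs = [0.0, 0.01, 0.05, 0.5]
kept = in_freq_range(freqs, (0.01, 0.05))  # 0.01 is kept, 0.05 is not
```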

grapp.util.filter.grg_save_individuals(grg_or_filename: GRG | str, out_filename: str, individual_ids: List[str], allow_extra: bool = False, verbose: bool = False)#

Save a GRG, keeping only the individuals with the IDs given in the list.

Parameters:

grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.
out_filename (str) – The new GRG file to create.
individual_ids (List[str]) – List of individual identifiers to be kept.
allow_extra (bool) – When False, throw an exception if individual_ids contains any identifier not found in the GRG. Default: False.

grapp.util.filter.grg_save_mut_filter(grg_or_filename: GRG | str, out_filename: str, mut_filter: Callable[[GRG, int], bool | str], bp_range: Tuple[int, int] = (0, 0), apply_to_sites: bool = False, min_variants: int = 0, max_variants: int = 4294967296, ignore_empty: bool = False)#

Given a GRG filename or object, save a new GRG that contains only the Mutations selected by the given filter function.

Parameters:

grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.
out_filename (str) – The filename of the to-be-created GRG.
mut_filter (Callable[[pygrgl.GRG, int], bool]) – Callback (function) that takes the GRG and a MutationID (int) as input and returns true if that mutation should be kept.
bp_range (Tuple[int, int]) – The range to associate with the GRG, as metadata. DOES NOT IMPACT THE FILTERING AT ALL.
apply_to_sites (bool) – By default, the filter applies to each variant independently. This flag will cause an entire site to be dropped if any variants are filtered out.
min_variants (int) – Any site with fewer variants than this will be dropped.
max_variants (int) – Any site with more variants than this will be dropped.
ignore_empty (bool) – When True, just skip the creation of GRGs that would be empty. Otherwise, an exception will be raised if you try to create an empty GRG.

Returns:

Tuple (mutations kept, mutations dropped)

Return type:

Tuple[int, int]
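
A filter callback could look like the sketch below, which keeps only single-base substitutions. To stay runnable without a real GRG, it uses stub classes. The stub's `get_mutation_by_id` method and the `ref_allele` / `allele` attributes are assumptions about the pygrgl interface, and the file names in the trailing comment are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StubMutation:
    # Stand-in for a pygrgl Mutation; the attribute names are assumptions.
    position: int
    ref_allele: str
    allele: str


class StubGRG:
    """Hypothetical stand-in so the callback can be exercised without a real GRG."""

    def __init__(self, mutations: List[StubMutation]):
        self._mutations = mutations
        self.num_mutations = len(mutations)

    def get_mutation_by_id(self, mut_id: int) -> StubMutation:
        return self._mutations[mut_id]


def keep_single_base(grg, mut_id: int) -> bool:
    """Callback with the documented (GRG, MutationID) -> bool shape."""
    mut = grg.get_mutation_by_id(mut_id)
    return len(mut.ref_allele) == 1 and len(mut.allele) == 1


stub = StubGRG([StubMutation(100, "A", "T"), StubMutation(150, "AC", "A")])
decisions = [keep_single_base(stub, i) for i in range(stub.num_mutations)]

# With grapp installed, the same callback would be passed along the lines of:
# kept, dropped = grg_save_mut_filter("in.grg", "out.grg", keep_single_base)
```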

grapp.util.filter.grg_save_populations(grg_or_filename: GRG | str, out_filename: str, populations: List[str], allow_extra: bool = False, verbose: bool = False)#

Save a GRG, keeping only the samples with populations matching the given population list.

Parameters:

grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.
out_filename (str) – The new GRG file to create.
populations (List[str]) – List of population names to be kept.
allow_extra (bool) – When False, throw an exception if populations contains any identifier not found in the GRG. Default: False.

grapp.util.filter.grg_save_range(grg_or_filename: GRG | str, out_filename: str, bp_range: Tuple[int, int], ignore_empty: bool = False)#

Given a GRG filename or object, save a new GRG that contains only the Mutations in the given basepair range.

Parameters:

grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.
out_filename (str) – The filename of the to-be-created GRG.
bp_range (Tuple[int, int]) – A pair (lower, upper), where both are in units basepair, and the Mutations will be kept if lower <= Mutation.position < upper. I.e., lower is inclusive and upper is exclusive.

grapp.util.filter.grg_save_samples(grg_or_filename: GRG | str, out_filename: str, sample_nodes: List[int], verbose: bool = False)#

Save a GRG, keeping only the haploid samples corresponding to the NodeIDs (indexes) given. See grg_save_individuals() for a version that uses identifiers to more “safely” down sample a GRG dataset.

Parameters:

grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.
out_filename (str) – The new GRG file to create.
sample_nodes (List[int]) – List of NodeIDs (indexes) for the haploid samples. If a GRG has N samples, then they are numbered 0…(N-1). The ordering matches the order of the input file that the GRG was constructed from.

grapp.util.filter.multi_grg_save_mut_filter(grgs_or_filenames: List[GRG] | List[str], out_filenames: List[str], mut_filter: Callable[[GRG, int, int], bool])#

Given a list of GRG filenames or GRG objects, save a new GRG for each that contains only the Mutations selected by the given filter function. The callback takes the GRG, the MutationID within that GRG, and the “cumulative MutationID” when considering all GRGs sequentially (e.g. the second GRG’s mutations start counting right after the last MutationID of the first GRG).

Parameters:

grgs_or_filenames (Union[List[pygrgl.GRG], List[str]]) – Either a list of pygrgl.GRG objects, or a list of GRG filenames.
out_filenames (List[str]) – The list of filenames of the to-be-created GRGs.
mut_filter (Callable[[pygrgl.GRG, int, int], bool]) – Callback (function) that takes the GRG, the MutationID (int) within that GRG, and the cumulative MutationID (int) as input and returns true if that mutation should be kept.
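
The cumulative numbering can be sketched in plain Python as follows. The helper `enumerate_cumulative` and the toy mutation counts are hypothetical; the sketch only illustrates how the third callback argument relates to the per-GRG MutationID, assuming the GRGs are visited in list order.

```python
from typing import List, Tuple


def enumerate_cumulative(mutation_counts: List[int]) -> List[Tuple[int, int, int]]:
    """Return (grg_index, local MutationID, cumulative MutationID) triples."""
    triples = []
    offset = 0
    for grg_index, count in enumerate(mutation_counts):
        for local_id in range(count):
            triples.append((grg_index, local_id, offset + local_id))
        offset += count  # the next GRG continues where this one stopped
    return triples


# Hypothetical: two GRGs holding 3 and 2 Mutations respectively.
triples = enumerate_cumulative([3, 2])

keep_ids = {1, 3}  # cumulative IDs we want to retain


def keep_selected(grg, mut_id: int, cumulative_id: int) -> bool:
    """Callback with the documented (GRG, MutationID, cumulative ID) -> bool shape."""
    return cumulative_id in keep_ids


kept = [(g, m) for g, m, c in triples if keep_selected(None, m, c)]
```

Here cumulative ID 3 maps to MutationID 0 of the second GRG, which is why a filter defined over a concatenated dataset needs the third argument.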

grapp.util.filter.split_by_ranges(grg_filename: str, ranges: List[Tuple[int, int]], jobs: int = 1, out_dir: str | None = None) → List[str]#

Split a GRG into multiple parts, spanning the list of basepair ranges given.

Parameters:

grg_filename (str) – The input GRG filename.
ranges (List[Tuple[int, int]]) – A list of (lower, upper) pairs, where lower and upper are in units basepair, and lower is inclusive while upper is exclusive.
jobs (int) – Number of processes/threads to use. Default: 1.
out_dir (Optional[str]) – Output directory to put the split pieces into. If None, then use the current working directory. Default: None.

Returns:

List of filenames for the resulting GRG files. If a listed file does not exist, the corresponding piece would have been an empty graph.

Return type:

List[str]
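
One way to build the ranges argument is to tile a region with fixed-width, half-open windows. The `make_ranges` helper below is a hypothetical convenience, not part of grapp; it only produces (lower, upper) pairs in the inclusive/exclusive form described above.

```python
from typing import List, Tuple


def make_ranges(start: int, end: int, width: int) -> List[Tuple[int, int]]:
    """Tile [start, end) with half-open windows of at most `width` basepairs."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [(lower, min(lower + width, end)) for lower in range(start, end, width)]


ranges = make_ranges(0, 25, 10)
```

Because each upper bound equals the next lower bound, every position in the region falls into exactly one window.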

Exporting to IGD#

grapp.util.igd.export_igd(grg_or_filename: GRG | str, out_filename: str, jobs: int = 1, batch_size: str | int = 'auto', temp_dir: str | None = None, no_merge: bool = False, split_threshold: int = 5000000, verbose: bool = False)#

Export a GRG to a phased IGD file, which is a sparse matrix representation of the same data. An IGD will almost always be larger than a GRG, but it can be useful because the rows are variants, giving fast access to specific variants and their list of samples. Instead of having to traverse many graph edges to get the sample list for a variant, you can just read the row from the IGD.

Parameters:

grg_or_filename – The GRG to convert, either as a pygrgl.GRG or the filename of a GRG.
out_filename (str) – The IGD file to create. The path up to the filename must already exist, and the file itself must not exist.
jobs (int) – The number of parallel processes to use to do the conversion. The speed-up is essentially linear. Default: 1.
batch_size (Union[str, int]) – The number of Mutations to process simultaneously, or the string “auto” if you want a reasonable value to be chosen for you.
temp_dir (Optional[str]) – The directory to use for intermediate IGD files. The GRG is split into multiple pieces and placed in this directory, and then each piece gets converted to an IGD file, and then those IGD files are merged into the final result. If temp_dir is None, these files are placed in a temporary directory which is then deleted upon completion.
no_merge (bool) – Set to True to get all the intermediate files, but not merge them into a final IGD. In this case, out_filename will not be created.
split_threshold (int) – Basepair threshold for splitting the GRG into chunks for processing. A split GRG is much faster to operate on than a full sized GRG, plus this is how we parallelize the conversion. Default: 5000000 (5 Mbp).

grapp.util.igd.export_vcf(grg_or_filename: GRG | str, out_file_obj: TextIO, contig: str = 'unknown', jobs: int = 1, batch_size: str | int = 'auto', temp_dir: str | None = None, split_threshold: int = 5000000, verbose: bool = False)#

WARNING: Incredibly slow for large datasets! You should only use this for exporting subsets of GRGs (e.g., after filtering) and even then it is slow.

Export a GRG to a phased VCF file. Typical usage is to either use a Gzip file object for the output, or write to stdout and pipe the results to bgzip.

Parameters:

grg_or_filename – The GRG to convert, either as a pygrgl.GRG or the filename of a GRG.
out_file_obj – The file handle to write VCF data to.
contig (str) – The contig name to use in the VCF. Default: “unknown”.
jobs (int) – The number of parallel processes to use to do the conversion. The speed-up is essentially linear. Default: 1.
batch_size (Union[str, int]) – The number of Mutations to process simultaneously, or the string “auto” if you want a reasonable value to be chosen for you.
temp_dir (Optional[str]) – The directory to use for intermediate IGD files. The GRG is split into multiple pieces and placed in this directory, and then each piece gets converted to an IGD file, and then those IGD files are merged into the final result. If temp_dir is None, these files are placed in a temporary directory which is then deleted upon completion.
split_threshold (int) – Basepair threshold for splitting the GRG into chunks for processing. A split GRG is much faster to operate on than a full sized GRG, plus this is how we parallelize the conversion. Default: 5000000 (5 Mbp).
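
Since out_file_obj is a text handle, a Gzip file object has to be opened in text mode. The sketch below shows that pattern with the standard library only. The export call itself is left as a comment because it needs a real GRG, and the file names and contig value in it are placeholders.

```python
import gzip
import os
import tempfile

# Write into a scratch directory so the sketch has no side effects elsewhere.
out_path = os.path.join(tempfile.mkdtemp(), "example.vcf.gz")

# "wt" opens the gzip stream in text mode, which is what a TextIO consumer expects.
with gzip.open(out_path, "wt") as out_file_obj:
    # With grapp installed, the export would write into this handle, e.g.:
    # export_vcf("input.grg", out_file_obj, contig="chr1")
    out_file_obj.write("##fileformat=VCFv4.2\n")  # placeholder line for the sketch

# Read the file back to confirm it is a valid gzip text stream.
with gzip.open(out_path, "rt") as check:
    first_line = check.readline()
```

Note that plain gzip output is not block-compressed; piping stdout to bgzip remains the route to take when the result must be indexed.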

grapp.util.igd.igd_to_vcf(igd_filename: str, out_file_obj: TextIO, contig: str, buffer_lines: int = 1000)#

WARNING: Incredibly slow for large datasets!

Convert an IGD file to VCF. Typical usage is to either use a Gzip file object for the output, or write to stdout and pipe the results to bgzip.

This method produces a VCF file with the variants expanded just like the IGD file. To “unexpand” the VCF file, use bcftools norm -m +any input.vcf -o output.vcf.

Parameters:

igd_filename (str) – The input IGD filename.
out_file_obj – The file handle to write VCF data to.
contig (str) – The contig name to use in the VCF.
buffer_lines (int) – The number of lines to buffer before flushing to disk. Default: 1000.

Simple Calculations#

Simple utility functions.

class grapp.util.simple.VariantType(*values)#

Bases: Enum

INDELS = 'indels'#

MNPS = 'mnps'#

OTHER = 'other'#

SNPS = 'snps'#

grapp.util.simple.allele_counts(grg: GRG | GRGCalcInterface, return_missing: bool = False, sample_filter: List[int] | NDArray | None = None) → NDArray | Tuple[NDArray, NDArray]#

Get the allele counts for the mutations in the given GRG.

Parameters:

grg (pygrgl.GRG) – The GRG.
return_missing (bool) – Return two arrays: the allele counts, and the missingness counts.
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Only consider the samples listed in the filter. Default: no filter.

Returns:

A vector of length grg.num_mutations, containing allele counts indexed by MutationID.

Return type:

numpy.ndarray

grapp.util.simple.allele_frequencies(grg: GRG | GRGCalcInterface, adjust_missing: bool = False, sample_filter: List[int] | NDArray | None = None) → NDArray#

Get the allele frequencies for the mutations in the given GRG.

Parameters:

grg (pygrgl.GRG) – The GRG.
adjust_missing (bool) – Optional. Set to true to adjust each allele frequency to be \(\frac{count_i}{total - missing_i}\) instead of \(\frac{count_i}{total}\).
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Only consider the samples listed in the filter. Default: no filter.

Returns:

A vector of length grg.num_mutations, containing allele frequencies indexed by MutationID.

Return type:

numpy.ndarray
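
The two frequency formulas can be compared on toy numbers with the sketch below. The helper name and the input lists are hypothetical, and `total` is assumed to be the number of haplotypes considered. The NaN returned for a zero denominator is a choice made for this sketch, not documented behaviour.

```python
from typing import List


def allele_frequencies_sketch(
    counts: List[int], missing: List[int], total: int, adjust_missing: bool = False
) -> List[float]:
    """Compute count_i / total, or count_i / (total - missing_i) when adjusting."""
    freqs = []
    for count_i, missing_i in zip(counts, missing):
        denominator = total - missing_i if adjust_missing else total
        freqs.append(count_i / denominator if denominator > 0 else float("nan"))
    return freqs


# Toy data (hypothetical): two mutations observed across 10 haplotypes.
counts = [2, 5]
missing = [0, 2]
plain = allele_frequencies_sketch(counts, missing, total=10)
adjusted = allele_frequencies_sketch(counts, missing, total=10, adjust_missing=True)
```

Only the second mutation changes, because it is the only one with missing data.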

grapp.util.simple.common_mut_dataframe(grg: GRGCalcInterface, **kwargs)#

Generate the “common” output format for mutation-based dataframes, which has “POS”, “ALT”, and “REF” in the first three columns, and then whatever extra columns the user provides.

Parameters:

kwargs – Keyword arguments are just passed through to pandas.DataFrame({}).

Returns:

The dataframe, with copy=False.

Return type:

pandas.DataFrame

grapp.util.simple.get_variant_type(mut: Mutation) → VariantType#

grapp.util.simple.get_zygosities(grg: GRG) → NDArray#

For a diploid dataset, return information about the homo/heterozygosity of every variant. Result is a matrix with 4 rows and grg.num_mutations columns. The rows are:

The number of homozygotes for each mutation (ALT of the variant)
The number of heterozygotes for each mutation
The number of homozygote-missing alleles for each mutation (i.e., corresponding site)
The number of heterozygote-missing alleles for each mutation

Parameters:

grg (pygrgl.GRG) – The GRG.

Returns:

\(4 \times M\) matrix, with rows as described above.

Return type:

numpy.ndarray

grapp.util.simple.hwe(grg: GRG, jobs: int = 1, show_progress: bool = False, return_counts: bool = False) → NDArray | Tuple[NDArray, NDArray]#

Compute Hardy-Weinberg p-values for all variants in the GRG. Missing data is not yet supported.

NOTES:

Multi-allelic sites only have p-values calculated for the REF/ALT combinations that are present, and the calculations are based on hetALT, homALT, and other, where other is the number of genotypes that do not contain ALT. We do not “flip” the ALT and REF and test hetREF, homREF, etc.

Parameters:

grg (pygrgl.GRG) – The GRG.
jobs (int) – Number of parallel jobs to run (threads). Default: 1.
show_progress (bool) – Show progress bar on sys.stderr. Default: False.

Returns:

A numpy array of length num_mutations, containing a p-value for each mutation. If the

Return type:

numpy.ndarray

grapp.util.simple.hwe_df(grg: GRG, jobs: int = 1, show_progress: bool = False, all_multi: bool = True) → DataFrame#

Compute Hardy-Weinberg p-values for all variants in the GRG. Missing data is not yet supported.

NOTES:

Multi-allelic sites only have p-values calculated for the REF/ALT combinations that are present, and the calculations are based on hetALT, homALT, and other, where other is the number of genotypes that do not contain ALT. We do not “flip” the ALT and REF and test hetREF, homREF, etc.

Parameters:

grg (pygrgl.GRG) – The GRG.
jobs (int) – Number of parallel jobs to run (threads). Default: 1.
show_progress (bool) – Show progress bar on sys.stderr. Default: False.
all_multi (bool) – Compute p-values for all combinations of multi-allelic sites (e.g., including the REF allele). For a bi-allelic site, there is a single p-value that represents the pair (REF, ALT). However, for a multi-allelic site, e.g. (REF, A1, A2), there are three combos (A1, not A1), (A2, not A2), and (REF, not REF). Setting this parameter to False will only compute two p-values: (A1, not A1) and (A2, not A2). Leaving it as True will additionally compute (REF, not REF).

Returns:

A DataFrame containing “POS”, “ALT”, “COUNT”, and “P”. If all_multi=True, then also includes column “REFP” for the REF allele’s p-value.

Return type:

pandas.DataFrame

grapp.util.simple.hwe_from_counts(het_A: List[int], hom_A: List[int], other: List[int], jobs: int = 1, show_progress: bool = False) → List[float]#

For the given heterozygous, homozygous, and “other” counts, compute the HWE exact p-values.

Parameters:

het_A (List[int]) – List of integer counts for the number of heterozygous individuals (in a focal allele).
hom_A (List[int]) – List of integer counts for the number of homozygous individuals (in a focal allele).
other (List[int]) – List of integer counts for the number of individuals that do not contain the focal allele at all.
jobs (int) – Number of threads to use.
show_progress (bool) – Write progress information to stderr? Default: False.

Returns:

A list of p-values, one for each focal allele.

Return type:

List[float]
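
For orientation, a conventional exact test for one focal allele can be sketched with exact rational arithmetic as below. This assumes the common definition in which the p-value sums the probabilities of all heterozygote counts that are no more likely than the observed one, conditional on the allele counts. The source does not state which variant grapp implements, so treat `hwe_exact_p` as a hypothetical reference, practical only for small counts.

```python
from fractions import Fraction
from math import factorial


def hwe_exact_p(het_a: int, hom_a: int, other: int) -> float:
    """Exact HWE p-value for one focal allele from (het, hom, other) genotype counts."""
    n = het_a + hom_a + other      # number of individuals
    if n == 0:
        return 1.0
    n_a = 2 * hom_a + het_a        # copies of the focal allele
    n_b = 2 * n - n_a              # copies of everything else

    def prob(hets: int) -> Fraction:
        # Probability of `hets` heterozygotes given n individuals and n_a focal alleles.
        hom_aa = (n_a - hets) // 2
        hom_bb = (n_b - hets) // 2
        arrangements = Fraction(
            2**hets * factorial(n),
            factorial(hom_aa) * factorial(hets) * factorial(hom_bb),
        )
        return arrangements * Fraction(factorial(n_a) * factorial(n_b), factorial(2 * n))

    # Heterozygote counts must share the parity of n_a.
    probs = {h: prob(h) for h in range(n_a % 2, min(n_a, n_b) + 1, 2)}
    observed = probs[het_a]
    return float(sum(p for p in probs.values() if p <= observed))


p_no_hets = hwe_exact_p(het_a=0, hom_a=1, other=1)
p_all_hets = hwe_exact_p(het_a=2, hom_a=0, other=0)
```

With two individuals and two focal alleles, the only possible outcomes are zero heterozygotes (probability 1/3) or two heterozygotes (probability 2/3), which gives the two p-values above.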

grapp.util.simple.multi_allelic_muts(grg: GRG) → List[List[int]]#

Return a list of MutationId lists, where each sublist represents a set of Mutations that exist at the same site (base-pair position). An empty list implies the data is bi-allelic.

Parameters:

grg (pygrgl.GRG) – The GRG containing the mutations.

Returns:

A list of lists [i, i+1, …, i+k], which are MutationIds that have the same underlying base-pair position (site). An empty list implies the data is bi-allelic.

Return type:

List[List[int]]
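
The grouping can be pictured with the following sketch, which works on a plain list of positions indexed by MutationID. It assumes MutationIDs are ordered by position, so that mutations at the same site have consecutive IDs, as the [i, i+1, …, i+k] form above suggests. The helper name and the toy positions are hypothetical.

```python
from typing import List


def group_multi_allelic(positions: List[int]) -> List[List[int]]:
    """Return runs of consecutive indexes that share a position (runs of length > 1 only)."""
    groups = []
    start = 0
    for i in range(1, len(positions) + 1):
        # Close the current run at the end of the list or when the position changes.
        if i == len(positions) or positions[i] != positions[start]:
            if i - start > 1:
                groups.append(list(range(start, i)))
            start = i
    return groups


# Toy positions indexed by MutationID (hypothetical).
positions = [100, 200, 200, 200, 350, 400, 400]
sites = group_multi_allelic(positions)
```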

grapp.util.simple.ref_hwe(grg: GRG, jobs: int = 1, show_progress: bool = False, default: NDArray | float = nan) → NDArray#

For every mutation, return the HWE p-value comparing REF against not-REF. For bi-allelic sites return a default value, since the (REF, not REF) p-value is the same as the (ALT, not ALT) p-value. For multi-allelic sites, performs multiple graph traversals (slow) to retrieve the REF sample list to explicitly compute the homozygous/heterozygous counts.

WARNING: This is an expensive operation on large datasets.

Parameters:

grg (pygrgl.GRG) – The GRG.
jobs (int) – The number of threads to use when computing p-values.
show_progress (bool) – When True, show a progress bar.
default (Union[numpy.ndarray, float]) – Either a scalar value or an array of length grg.num_mutations. When a site is bi-allelic, use this default value. Default: NaN.

Returns:

Array of grg.num_mutations p-values.

Return type:

numpy.ndarray

grapp.util.simple.site_alleles(grg: GRG, alt_only: bool = False, mut_ids: List[int] = []) → NDArray#

Compute the number of alleles at the site associated with each mutation (variant). For example, if there is a site with 3 variants A>T, A>G, A>C, then each of those variants (mutations) will have a “4” in their result. Each variant is always bi-allelic, but the site it is associated with can have an arbitrary number of alleles. This function counts the number of distinct REF alleles, so the result is count(REF) + count(ALT).

Parameters:

grg (pygrgl.GRG) – The GRG.
alt_only (bool) – Only count ALT alleles, not REF alleles. Default: False.
mut_ids (List[int]) – Restrict to the MutationIDs given (e.g., to look only at SNPs).

Returns:

A numpy array of length num_mutations, containing an allele count for each mutation.

Return type:

numpy.ndarray
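
The count(REF) + count(ALT) rule can be reproduced on toy data as sketched below. The input is a hypothetical list of (position, REF, ALT) triples rather than a GRG, and the helper name is an assumption. The sketch simply counts distinct REF and distinct ALT alleles per position and reports that site total for every variant.

```python
from collections import defaultdict
from typing import List, Tuple


def site_alleles_sketch(muts: List[Tuple[int, str, str]], alt_only: bool = False) -> List[int]:
    """For each variant, count the distinct alleles at its site."""
    refs = defaultdict(set)
    alts = defaultdict(set)
    for pos, ref, alt in muts:
        refs[pos].add(ref)
        alts[pos].add(alt)
    # Every variant at a site receives the same site-level total.
    return [len(alts[pos]) + (0 if alt_only else len(refs[pos])) for pos, _, _ in muts]


# Toy variants (hypothetical): three at position 100, one at position 200.
muts = [(100, "A", "T"), (100, "A", "G"), (100, "A", "C"), (200, "G", "A")]
with_ref = site_alleles_sketch(muts)
alts_only = site_alleles_sketch(muts, alt_only=True)
```

The three variants at position 100 each get 4, matching the A>T, A>G, A>C example in the description.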

grapp.util.simple.site_samples(grg: GRG, multi_list: List[List[int]]) → NDArray#

Given a list of sites (each site being a list of MutationIDs), return a bool numpy matrix that represents the samples that have either the ALT or a missing allele at that site.

Parameters:

grg (pygrgl.GRG) – The GRG.
multi_list (List[List[int]]) – A list of “sites”, where each site is a list of integer MutationIDs. Those mutations all have the same base-pair position, hence are at the same site.

Returns:

Numpy matrix of dimension \(K \times N\), where \(K\) is the number of sites that was passed in, and \(N\) is grg.num_samples.

Return type:

numpy.ndarray

grapp.util.simple.variance(grg: GRG | GRGCalcInterface, dist: str = 'binomial', adjust_missing: bool = False, sample_filter: List[int] | NDArray | None = None, haploid: bool = False)#

Compute the variance of the mutations. You can use the dist parameter to choose between the sample variance and the binomial variance.

Parameters:

grg (pygrgl.GRG) – The GRG.
dist (str) – Either “sample” or “binomial”.
adjust_missing (bool) – Optional. Set to true to adjust each allele frequency to be \(\frac{count_i}{total - missing_i}\) instead of \(\frac{count_i}{total}\).
sample_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Only consider the samples listed in the filter. Default: no filter.

Returns:

A vector of length grg.num_mutations, containing variances indexed by MutationID.

Return type:

numpy.ndarray
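
The difference between the two options can be seen on a single toy variant. The sketch assumes that the binomial option is the textbook variance of a binomial allele count, ploidy × f × (1 − f), and that the sample option is an ordinary sample variance of the genotype values. The source does not spell out either formula or the degrees of freedom used, so both functions here are illustrative assumptions.

```python
import statistics
from typing import List


def binomial_variance(count: int, num_haplotypes: int, ploidy: int = 2) -> float:
    """Variance of a Binomial(ploidy, f) allele count, with f = count / num_haplotypes."""
    freq = count / num_haplotypes
    return ploidy * freq * (1.0 - freq)


# Toy diploid genotypes for one variant (hypothetical): allele counts per individual.
genotypes: List[int] = [0, 1, 2, 1]
count = sum(genotypes)
num_haplotypes = 2 * len(genotypes)

binom_var = binomial_variance(count, num_haplotypes)
# statistics.variance uses the n - 1 denominator; this is an assumption of the sketch.
sample_var = statistics.variance(genotypes)
```

The two values differ here (0.5 versus 2/3), which is why the choice of dist matters when standardizing.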

grapp.util.simple.variants_of_types(grg: GRG, types: Set[VariantType]) → List[int]#

Return the list of MutationIDs for variants of the given types. For example, passing types={VariantType.SNPS, VariantType.MNPS} will return every mutation that is either a SNP or MNP.

Parameters:

grg (pygrgl.GRG) – The GRG.
types (Set[VariantType]) – Set of VariantType that is the union of types to return.

Returns:

A list of MutationIDs.

Return type:

List[int]
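
The union-of-types selection can be sketched as below. The enum is a local mirror of the documented VariantType members. The classification rule in `classify` (based only on allele lengths and ACGT content) is a hypothetical stand-in, since the source does not describe how get_variant_type decides.

```python
from enum import Enum
from typing import List, Set, Tuple


class VariantType(Enum):
    # Local mirror of the documented enum members, for a self-contained sketch.
    INDELS = "indels"
    MNPS = "mnps"
    OTHER = "other"
    SNPS = "snps"


def classify(ref: str, alt: str) -> VariantType:
    """Hypothetical rule: classify by allele lengths; non-ACGT alleles are OTHER."""
    bases = set("ACGT")
    if not ref or not alt or not (set(ref) <= bases and set(alt) <= bases):
        return VariantType.OTHER
    if len(ref) == 1 and len(alt) == 1:
        return VariantType.SNPS
    if len(ref) == len(alt):
        return VariantType.MNPS
    return VariantType.INDELS


def variants_of_types_sketch(alleles: List[Tuple[str, str]], types: Set[VariantType]) -> List[int]:
    """Return the indexes whose (REF, ALT) pair classifies into any of `types`."""
    return [i for i, (ref, alt) in enumerate(alleles) if classify(ref, alt) in types]


# Toy (REF, ALT) pairs indexed by MutationID (hypothetical).
alleles = [("A", "T"), ("AC", "GT"), ("A", "ATT"), ("A", "<DEL>")]
selected = variants_of_types_sketch(alleles, {VariantType.SNPS, VariantType.MNPS})
```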

Parallelized Operations#

See also “Linear Operators for Multiple GRGs”, which is how mathematical operations can be parallelized.

grapp.util.parallel.split_and_run(grg_or_filename: GRG | str, operation: Callable[[str | GRG, Dict[str, Any]], Any], merge_operation: Callable[[List[str | GRG], List[Any], Dict[str, Any]], Any], context: Dict[str, Any], jobs: int = 1, temp_dir: str | None = None, split_threshold: int = 1000000, verbose: bool = False) → Any#

Perform an arbitrary GRG operation in parallel by splitting the GRG into smaller graphs, running the operation on each subgraph, and then merging the results. This can be used for mutable or immutable operations. For a mutable operation, the GRGs will be merged into a final GRG, with the filename given.

The context dictionary that is passed between callback functions will always contain the “dir” key, which is the directory that the splitting and running is occurring in, and is where any intermediate result files (temp files) should be placed by the operation.

Parameters:

grg_or_filename – The GRG to operate on, either as a pygrgl.GRG or the filename of a GRG.
operation (Callable[[Union[str, pygrgl.GRG], Dict[str, Any]], Any]) – Function that takes (GRG, context_dict) and returns a result (of any type). The GRG is either a string or a pygrgl.GRG object, and this operation is performed on that GRG after it was split out from the larger GRG. Use context_dict to pass information to the operation if needed. The result from this operation will be collected in a list, and that list will be passed to the merge_operation.
merge_operation (Callable[[List[Union[str, pygrgl.GRG]], List[Any], Dict[str, Any]], Any]) – Function that takes (list(GRG), list(results), context_dict) and returns a result (of any type). The list(results) are all the return values from operation. The list of GRGs are all the GRGs that the operation was run on. The context dictionary can be used to pass information to the merge operation.
context (Dict[str, Any]) – The context dictionary, can be empty, or contain any information that you want to pass to both callback functions. The key “dir” is reserved (see above).
jobs (int) – The number of parallel processes to use to run the operation. The speed-up is essentially linear. Default: 1.
temp_dir (Optional[str]) – The directory to use for intermediate files. The GRG is split into multiple pieces and placed in this directory, and the operation is then run on each piece. If temp_dir is None, these files are placed in a temporary directory which is then deleted upon completion.
split_threshold (int) – Basepair threshold for splitting the GRG into chunks for processing. A split GRG can be much faster to operate on than a full sized GRG, plus this is how we parallelize the operation. Default: 1000000 (1 Mbp).
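
The data flow between the two callbacks can be sketched without grapp as follows. The callbacks use the documented argument shapes, while `serial_stand_in` is a hypothetical, single-process substitute for split_and_run that operates on plain lists instead of split GRGs. The "min_value" context key is likewise made up for the example.

```python
from typing import Any, Callable, Dict, List


def count_items(part: List[int], context: Dict[str, Any]) -> int:
    """operation: receives one split piece plus the context, returns a per-piece result."""
    return sum(1 for value in part if value >= context["min_value"])


def add_up(parts: List[List[int]], results: List[int], context: Dict[str, Any]) -> int:
    """merge_operation: receives all pieces, all per-piece results, and the context."""
    return sum(results)


def serial_stand_in(
    parts: List[Any],
    operation: Callable[[Any, Dict[str, Any]], Any],
    merge_operation: Callable[[List[Any], List[Any], Dict[str, Any]], Any],
    context: Dict[str, Any],
) -> Any:
    # Hypothetical stand-in: run the operation on each piece in order, then merge.
    results = [operation(part, context) for part in parts]
    return merge_operation(parts, results, context)


# Toy "pieces" (hypothetical): each list plays the role of one split GRG.
parts = [[1, 5, 9], [2, 8], [7]]
total = serial_stand_in(parts, count_items, add_up, {"min_value": 5})
```

Keeping the per-piece results small and independent of each other, as in this counting example, is what lets the real function run the pieces in parallel.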