grapp API

Linear Algebra

Non-standardized Linear Operators

These operators work with scipy.sparse.linalg, and provide the ability to do matrix products against the unmodified genotype matrix that is represented by a GRG.

class grapp.linalg.ops_scipy.SciPyXOperator(*args, **kwargs)

A scipy.sparse.linalg.LinearOperator on the genotype matrix represented by the GRG, which allows for multiplication between the GRG and a matrix or vector. This is for the non-standardized matrix, which just contains discrete allele counts.

Can perform the operation \(X \times A\) (_matmat) or \(X \times \overrightarrow{v}\) (_matvec).

Parameters:
  • grg (pygrgl.GRG) – The GRG the operator will multiply against.

  • direction (pygrgl.TraversalDirection) – Determines whether the matrix is \(X\) (pygrgl.TraversalDirection.UP) or \(X^T\) (pygrgl.TraversalDirection.DOWN).

  • dtype (TypeAlias) – The numpy.dtype to use.

  • haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.

  • miss_values (Optional[numpy.typing.NDArray]) – If non-None, must be a vector of length num_mutations, which provides a per- mutation value for missingness (applied per haplotype). Usually the per-Mutation mean value (e.g., missingness-adjusted allele frequency) is provided. Default: None.

  • mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: empty filter.

  • mask_samples (Union[List[int], numpy.typing.NDArray]) – Ignore any contribution from the samples listed in this array (e.g., pretend they don’t exist in the GRG). This does not affect the dimensions of the multiplication operations. If haploid=True, these should be sample node IDs. Otherwise, they should be individual IDs.

class grapp.linalg.ops_scipy.SciPyXTXOperator(*args, **kwargs)

A scipy.sparse.linalg.LinearOperator on the matrix \(X^TX\) represented by the GRG. This is for the non-standardized matrix, which just contains discrete allele counts, but it is not centered at the mean, so it is not quite the covariance matrix.

Can perform the operation \(X^T \times X \times A\) (_matmat) or \(X^T \times X \times \overrightarrow{v}\) (_matvec).

Parameters:
  • grg (pygrgl.GRG) – The GRG the operator will multiply against.

  • dtype (TypeAlias) – The numpy.dtype to use.

  • haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.

  • miss_values (Optional[numpy.typing.NDArray]) – If non-None, must be a vector of length num_mutations, which provides a per- mutation value for missingness (applied per haplotype). Usually the per-Mutation mean value (e.g., missingness-adjusted allele frequency) is provided. Default: None.

  • mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: empty filter.

class grapp.linalg.ops_scipy.SciPyXXTOperator(*args, **kwargs)

A scipy.sparse.linalg.LinearOperator on the matrix \(XX^T\) represented by the GRG. This is for the non-standardized matrix, which just contains discrete allele counts.

Can perform the operation \(X \times X^T \times A\) (_matmat) or \(X \times X^T \times \overrightarrow{v}\) (_matvec).

Parameters:
  • grg (pygrgl.GRG) – The GRG the operator will multiply against.

  • dtype (TypeAlias) – The numpy.dtype to use.

  • haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.

  • miss_values (Optional[numpy.typing.NDArray]) – If non-None, must be a vector of length num_mutations, which provides a per- mutation value for missingness (applied per haplotype). Usually the per-Mutation mean value (e.g., missingness-adjusted allele frequency) is provided. Default: None.

  • mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: empty filter.

Standardized Linear Operators

These operators work with scipy.sparse.linalg, and provide the ability to do matrix products against the genotype matrix that is represented by a GRG, except that genotype matrix is implicitly standardized by subtracting the mean and dividing by the standard deviation.

class grapp.linalg.ops_scipy.SciPyStdXOperator(*args, **kwargs)

A scipy.sparse.linalg.LinearOperator on the genotype matrix represented by the GRG, which allows for multiplication between the GRG and a matrix or vector. This is for the standardized matrix, which is centered to the mean (based on allele frequencies) and standard devation (based on the binomial distribution where each individual is the result of \(p\), the ploidy, trials).

Can perform the operation \(X \times A\) (_matmat) or \(X \times \overrightarrow{v}\) (_matvec).

Parameters:
  • grg (pygrgl.GRG) – The GRG the operator will multiply against.

  • direction (pygrgl.TraversalDirection) – Determines whether the matrix is \(X\) (pygrgl.TraversalDirection.UP) or \(X^T\) (pygrgl.TraversalDirection.DOWN).

  • freqs (numpy.ndarray) – A vector of length num_mutations, containing the allele frequency for all mutations. Indexed by the mutation ID of the mutation.

  • dtype (TypeAlias) – The numpy.dtype to use.

  • haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.

  • mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: empty filter.

  • mask_samples (Union[List[int], numpy.typing.NDArray]) – Ignore any contribution from the samples listed in this array (e.g., pretend they don’t exist in the GRG). This does not affect the dimensions of the multiplication operations. If haploid=True, these should be sample node IDs. Otherwise, they should be individual IDs.

class grapp.linalg.ops_scipy.SciPyStdXTXOperator(*args, **kwargs)

A scipy.sparse.linalg.LinearOperator on the matrix \(X^TX\) represented by the GRG. This is for the standardized matrix, which is centered to the mean (based on allele frequencies) and standard devation (based on the binomial distribution where each individual is the result of \(p\), the ploidy, trials).

This operator performs multiplications against the correlation matrix of the genotype matrix underlying the GRG. Can perform the operation \(X^T \times X \times A\) (_matmat) or \(X \times X \times \overrightarrow{v}\) (_matvec).

Parameters:
  • grg (pygrgl.GRG) – The GRG the operator will multiply against.

  • freqs (numpy.ndarray) – A vector of length num_mutations, containing the allele frequency for all mutations. Indexed by the mutation ID of the mutation.

  • dtype (TypeAlias) – The numpy.dtype to use.

  • haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.

  • mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: empty filter.

class grapp.linalg.ops_scipy.SciPyStdXXTOperator(*args, **kwargs)

A scipy.sparse.linalg.LinearOperator on the matrix \(XX^T\) represented by the GRG. This is for the standardized matrix, which is centered to the mean (based on allele frequencies) and standard devation (based on the binomial distribution where each individual is the result of \(p\), the ploidy, trials).

This operator performs multiplications against the correlation matrix of the genotype matrix underlying the GRG. Can perform the operation \(X \times X^T \times A\) (_matmat) or \(X \times X^T \times \overrightarrow{v}\) (_matvec).

Parameters:
  • grg (pygrgl.GRG) – The GRG the operator will multiply against.

  • freqs (numpy.ndarray) – A vector of length num_mutations, containing the allele frequency for all mutations. Indexed by the mutation ID of the mutation.

  • dtype (TypeAlias) – The numpy.dtype to use.

  • haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.

Linear Operators for Multiple GRGs

These operators are the same as the ones above, except they allow for multiple GRGs to be used for a single product. For example, if you have GRGs for each of the 22 autosomes, you can construct these operators and pass in all 22 GRGs. The resulting operator will perform matrix multiplications against the entire autosome. You can use multiple threads to parallelize by GRG.

class grapp.linalg.ops_scipy.MultiSciPyXOperator(*args, **kwargs)

A scipy.sparse.linalg.LinearOperator on multiple GRGs. Same as SciPyXOperator, except if the input GRGs have mutation counts M1, M2, …, MK, then the dimension of the implicit underlying genotype matrix is Nx(M1 + M2 + … + MK).

Parameters:
  • grgs (List[pygrgl.GRG]) – The GRGs the operator will multiply against. They must all have the same samples, and the mutations are expected to differ (e.g., one GRG per chromosome of the same dataset).

  • direction (pygrgl.TraversalDirection) – Determines whether the matrix is \(X\) (pygrgl.TraversalDirection.UP) or \(X^T\) (pygrgl.TraversalDirection.DOWN).

  • dtype (TypeAlias) – The numpy.dtype to use.

  • haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.

  • miss_values (Optional[numpy.typing.NDArray]) – If non-None, must be a vector of length num_mutations (for all GRGs), which provides a per-mutation value for missingness (applied per haplotype). Usually the per-Mutation mean value (e.g., missingness-adjusted allele frequency) is provided. Default: None.

  • mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: empty filter.

  • threads (int) – Number of threads for performing the multiplication. Each GRG can be done in parallel.

class grapp.linalg.ops_scipy.MultiSciPyXTXOperator(*args, **kwargs)

A scipy.sparse.linalg.LinearOperator on multiple GRGs. Same as SciPyXTXOperator, except if the input GRGs have mutation counts M1, M2, …, MK, then the dimension of the implicit underlying genotype matrix is Nx(M1 + M2 + … + MK).

Parameters:
  • grgs (List[pygrgl.GRG]) – The GRGs the operator will multiply against. They must all have the same samples, and the mutations are expected to differ (e.g., one GRG per chromosome of the same dataset).

  • dtype (TypeAlias) – The numpy.dtype to use.

  • haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.

  • miss_values (Optional[numpy.typing.NDArray]) – If non-None, must be a vector of length num_mutations (for all GRGs), which provides a per-mutation value for missingness (applied per haplotype). Usually the per-Mutation mean value (e.g., missingness-adjusted allele frequency) is provided. Default: None.

  • mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: empty filter.

  • threads (int) – Number of threads for performing the multiplication. Each GRG can be done in parallel.

class grapp.linalg.ops_scipy.MultiSciPyStdXOperator(*args, **kwargs)

A scipy.sparse.linalg.LinearOperator on multiple GRGs. Same as SciPyStdXOperator, except if the input GRGs have mutation counts M1, M2, …, MK, then the dimension of the implicit underlying genotype matrix is Nx(M1 + M2 + … + MK).

Parameters:
  • grgs (List[pygrgl.GRG]) – The GRGs the operator will multiply against. They must all have the same samples, and the mutations are expected to differ (e.g., one GRG per chromosome of the same dataset).

  • direction (pygrgl.TraversalDirection) – Determines whether the matrix is \(X\) (pygrgl.TraversalDirection.UP) or \(X^T\) (pygrgl.TraversalDirection.DOWN).

  • dtype (TypeAlias) – The numpy.dtype to use.

  • haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.

  • mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: empty filter.

  • threads (int) – Number of threads for performing the multiplication. Each GRG can be done in parallel.

class grapp.linalg.ops_scipy.MultiSciPyStdXTXOperator(*args, **kwargs)

A scipy.sparse.linalg.LinearOperator on multiple GRGs. Same as SciPyStdXTXOperator, except if the input GRGs have mutation counts M1, M2, …, MK, then the dimension of the implicit underlying genotype matrix is Nx(M1 + M2 + … + MK).

Parameters:
  • grgs (List[pygrgl.GRG]) – The GRGs the operator will multiply against. They must all have the same samples, and the mutations are expected to differ (e.g., one GRG per chromosome of the same dataset).

  • dtype (TypeAlias) – The numpy.dtype to use.

  • haploid (bool) – Perform calculations on the {0, 1} haploid genotype matrix, instead of the {0, …, grg.ploidy} genotype matrix. Default: False.

  • mutation_filter (Optional[Union[List[int], numpy.typing.NDArray]]) – Changes the dimensions of \(X\) to be NxP (where P is the length of mutation_filter) instead of NxM. Default: empty filter.

  • threads (int) – Number of threads for performing the multiplication. Each GRG can be done in parallel.

PCA and other helper methods

Linear algebra-related operations on GRG. These are typically “generic” operations that could apply to many different types of analyses.

class grapp.linalg.MatrixSelection(*values)

Bases: Enum

X = 1
XT = 2
XTX = 3
grapp.linalg.PCs(grg: GRG, first_k: int, unitvar: bool = True, include_eig: bool = False, use_pro_pca: bool = False)

Get the principle components for each sample corresponding to the first \(k\) eigenvectors from a GRG.

Parameters:
  • grg (pygrgl.GRG) – The GRG to perform PCA on.

  • first_k (int) – The number of eigenvectors/values to use. These correspond to the k largest eigenvalues.

  • unitvar (bool) – When True, normalize the PCs by dividing by the square root of each corresponding eigenvalue. Default: True.

  • include_eig (bool) – When True, the return value is a triple of (DataFrame, EigenValues, EigenVectors), where the eigen values and vectors are as returned by scipy.sparse.linalg.eigs(). Default: False.

Returns:

A pandas.DataFrame with a row per individual and a column per principle component.

Return type:

pandas.DataFrame

grapp.linalg.eigs(matrix: MatrixSelection, grg: GRG, first_k: int, standardized: bool = True, haploid: bool = False) Tuple[ndarray[tuple[Any, ...], dtype[_ScalarT]], ndarray[tuple[Any, ...], dtype[_ScalarT]]]

Get the first K eigen values and vectors from a GRG.

Parameters:
  • matrix (MatrixSelection) – Which matrix derived from the GRG should be used: the genotype matrix (MatrixSelection.X), the transposed genotype matrix (MatrixSelection.XT), or the covariance/correlation matrix (MatrixSelection.XTX).

  • grg (pygrgl.GRG) – The GRG to operate on.

  • first_k (int) – The number of (largest) eigen values/vectors to retrieve.

  • standardized (bool) – Set to False to use the non-standardized matrix. Default: True.

  • haploid (bool) – Set to True to use the haploid values (0,1) instead of diploid values (0,1,2).

Returns:

(eigen_value, eigen_vectors) as defined by scipy.sparse.linalg.eigs

grapp.linalg.get_eig_pcs(grg: GRG, first_k: int) Tuple[ndarray[tuple[Any, ...], dtype[_ScalarT]], ndarray[tuple[Any, ...], dtype[_ScalarT]], ndarray[tuple[Any, ...], dtype[_ScalarT]]]

Get the principle components for each sample corresponding to the first \(k\) eigenvectors from a GRG, using an iterative eigenvector decomposition method.

Parameters:
  • grg (pygrgl.GRG) – The GRG to perform PCA on.

  • first_k (int) – The number of eigenvectors/values to use. These correspond to the k largest eigenvalues.

Returns:

A triple (PC_scores, eigen_values, eigen_vectors) where each is a numpy array.

Return type:

numpy.ndarray

grapp.linalg.sort_by_eigvalues(eigen_values: ndarray[tuple[Any, ...], dtype[_ScalarT]], eigen_vectors: ndarray[tuple[Any, ...], dtype[_ScalarT]])

Reorder the eigen value and vector arrays so that they are in descending order of the corresponding eigen value.

Parameters:
  • eigen_values (numpy.typing.NDArray) – The vector of eigen values, of length k.

  • eigen_vectors (numpy.typing.NDArray) – The matrix of eigen vectors, with k columns.

Association Studies (GWAS)

grapp.assoc.linear_assoc_covar(grg: GRG, Y: ndarray[tuple[Any, ...], dtype[_ScalarT]], C: ndarray[tuple[Any, ...], dtype[_ScalarT]], only_beta: bool = False, hide_covars: bool = True, standardize: bool = False, method: str = 'QR', dist: str = 'sample') DataFrame

Performs regression for each mutation with covariate adjustment. Missing data is treated as the mean genotype value (allele frequency for the relevant variant). Uses QR decomposition to project out covariate effects from the phenotype and genotype vectors.

Parameters:
  • Y (numpy.ndarray) – Phenotype vector of shape (n_samples,), with missing values specified as NaN.

  • C (numpy.ndarray) – Covariate matrix of shape (n_samples, n_covariates). Should include intercept.

  • only_beta (bool) – If True, returns only the BETA column in the output.

  • hide_covars (bool) – If False, includes estimated covariate effects (GAMMA_i) in the output.

  • standardize (bool) – If True, standardize X and Y (after adjusting for covariates).

  • method (str) – Either “QR” (default) or “regress”. “QR” uses QR decomposition to adjust both \(X\) and \(Y\) for covariates (\(C\)), but if standardize=True then it assumes that \(X\) and \(C\) are independent (the more correlated they are, the less “standardized” the result will be). “regress” uses linear regression between \(Y\) and \(C\) (to get \(B_c\)), and then performs GWAS against \(Y'\) (\(Y' = Y - C \times B_c\)).

  • dist (str) – How to compute the \(diag(X^T X)\) term. Options are: “sample” (use individual coalescence information to compute sample mean and variance), “binomial” (assume the diploid data follows a binomial distribution, for mean and variance). Default: “sample”.

Returns:

A DataFrame containing at least BETA, SE, T, and P columns. If hide_covars is False, also includes GAMMA columns.

Return type:

pandas.DataFrame

grapp.assoc.linear_assoc_no_covar(grg: GRG, Y: ndarray[tuple[Any, ...], dtype[_ScalarT]], only_beta: bool = False, standardize: bool = False, dist: str = 'sample') DataFrame

Performs regression for each mutation without adjusting for covariates. Missing data is treated as the mean genotype value (allele frequency for the relevant variant).

Parameters:
  • Y (numpy.ndarray) – Phenotype vector of shape (n_samples,), with missing values specified as NaN.

  • only_beta (bool) – If True, returns a DataFrame with only the BETA column.

  • standardize (bool) – If True, standardize X and Y (after adjusting for covariates).

  • dist (str) – How to compute the \(diag(X^T X)\) term. Options are: “sample” (use individual coalescence information to compute sample mean and variance), “binomial” (assume the diploid data follows a binomial distribution, for mean and variance). Default: “sample”.

Returns:

A DataFrame containing statistics for each mutation: - POS, ALT, COUNT, BETA, B0, SE, R2, T, and P.

Return type:

pandas.DataFrame

grapp.assoc.read_pheno(filename: str) ndarray[tuple[Any, ...], dtype[_ScalarT]]

Reads a PLINK/GCTA/GRG-style phenotype file and returns the phenotype vector.

Parameters:

path (str) – Path to the phenotype file.

Returns:

A one-dimensional NumPy array of phenotype values.

Return type:

numpy.ndarray

Reads a PLINK-style covariate file (no headers) and returns a NumPy matrix of covariate values. The first two columns (FID/IID) are ignored. Optionally adds an intercept column of 1s.

Parameters:

path (str) – Path to the covariate file.

Returns:

A NumPy array of shape (n_samples, n_covariates [+1 if intercept]).

Return type:

numpy.ndarray

Nearest Neighbor Comparisons

class grapp.nn.NearestNeighborContext(grg: GRG)

Bases: object

The main class for performing neighbor queries against the GRG format. Holds cached information related to nearest-neighbor queries on a specific GRG.

exact_hamming_dists(seeds: ndarray[tuple[Any, ...], dtype[_ScalarT]], direction: TraversalDirection, emit_all_nodes: bool = False) ndarray[tuple[Any, ...], dtype[_ScalarT]]

Using exact computations, get the Hamming distances from the matrix of input seeds to every other sample in the GRG.

Parameters:
  • seeds (numpy.ndarray) – A two-dimensional numpy array. Each row corresponds to a single “query” for distances, and contains a ‘1’ for every mutation (downward direction) or sample (upward direction) that is used by the query item.

  • direction (pygrgl.TraversalDirection) – Whether to find the distances to Samples (pygrgl.TraversalDirection.DOWN) or the distances to Mutations (pygrgl.TraversalDirection.UP). The number of columns in the seeds input matrix must match the direction, so columns(seeds) == grg.num_mutations if direction is down, and columns(seeds) == grg.num_samples if direction is up.

Returns:

A two-dimensional numpy array where the number of rows matches the input matrix; i.e. each row is a result from each query. The number of columns is the opposite of the input (similar to pygrgl.matmul), so if the seeds have grg.num_mutations columns then the result will have grg.num_samples columns.

Return type:

numpy.ndarray

exact_hamming_dists_by_mutation(mutation_ids: List[int], emit_all_nodes: bool = False) ndarray[tuple[Any, ...], dtype[_ScalarT]]

Using exact computations, get the Hamming distances from the list of input mutation IDs (for Mutations in the GRG) to every other mutation in the GRG.

Parameters:
  • mutation_ids (List[int]) – List of GRG Mutation IDs, each of which will be queried for distance to all other Mutations.

  • emit_all_nodes (bool) – Set to True to compute distances to every _node_ in the graph, not just every other Mutation. The output Matrix will have num_nodes columns when True.

Returns:

Matrix of distances, where each row corresponds to input Mutation IDs, and each column is the distance from the “other” Mutation ID. For example, if the 0th input is Mutation ID “m0”, then the 0th row of output is [D(m0, 0), D(m0, 1), …, D(m0, M-1)] where M is the number of mutations.

Return type:

numpy.ndarray

exact_hamming_dists_by_sample(sample_ids: List[int], emit_all_nodes: bool = False) ndarray[tuple[Any, ...], dtype[_ScalarT]]

Using exact computations, get the Hamming distances from the list of input sample IDs (for samples in the GRG) to every other sample in the GRG.

Parameters:
  • sample_ids (List[int]) – List of GRG Node IDs for sample nodes, each of which will be queried for distance to all other samples.

  • emit_all_nodes (bool) – Set to True to compute distances to every _node_ in the graph, not just every other sample. The output Matrix will have num_nodes columns when True.

Returns:

Matrix of distances, where each row corresponds to input sample IDs, and each column is the distance from the “other” sample ID. For example, if the 0th input is sample ID “n0”, then the 0th row of output is [D(n0, 0), D(n0, 1), …, D(n0, N-1)] where N is the number of haploid samples.

Return type:

numpy.ndarray

fast_pairwise_hamming(node1: int, node2: int, direction: TraversalDirection) int

Compute the Hamming distance between a pair of samples or mutations (or arbitrary nodes in the graph, but that has a less well-defined “meaning”). This calculation is extremely fast for a pair that are highly similar (low Hamming distance), as it shortcuts the graph traversal by making use of pygrgl.shared_frontier().

Note: Ensure that node1 != node2 prior to calling.

Parameters:
  • node1 (int) – The first node ID (e.g., sample ID or node associated with a mutation).

  • node2 (int) – The second node ID (e.g., sample ID or node associated with a mutation).

  • direction – The direction to use for distance calculation. pygrgl.TraversalDirection.UP means to compare the sets of Mutations shared by the nodes (distance is on differing Mutations) and pygrgl.TraversalDirection.DOWN means to compare sets of Samples.

property grg
property muts_above: ndarray[tuple[Any, ...], dtype[_ScalarT]]

Vector of length grg.num_nodes, where each node’s value is the number of Mutations above that node in the graph.

property samps_below: ndarray[tuple[Any, ...], dtype[_ScalarT]]

Vector of length grg.num_nodes, where each node’s value is the number of sample nodes below that node in the graph.

Filtering, Export, etc.

Filtering GRGs

Functions for filtering data out of a GRG to create a new, smaller GRG.

grapp.util.filter.grg_save_freq(grg_or_filename: GRG | str, out_filename: str, freq_range: Tuple[float, float])

Given a GRG filename or object, save a new GRG that contains only the Mutations in the given frequency range.

Parameters:
  • grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.

  • out_filename (str) – The filename of the to-be-created GRG.

  • freq_range (Tuple[float, float]) – A pair (lower, upper), where the Mutations will be kept if lower <= frequency(Mutation) < upper. I.e., lower is inclusive and upper is exclusive.

grapp.util.filter.grg_save_individuals(grg_or_filename: GRG | str, out_filename: str, individual_ids: List[str], allow_extra: bool = False, verbose: bool = False)

Save a GRG, keeping only the individuals with the IDs given in the list.

Parameters:
  • grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.

  • out_filename (str) – The new GRG file to create.

  • individual_ids (List[str]) – List of individual identifiers to be kept.

  • allow_extra (bool) – When False, throw an exception if individual_ids contains any identifier not found in the GRG. Default: False.

grapp.util.filter.grg_save_mut_filter(grg_or_filename: GRG | str, out_filename: str, mut_filter: Callable[[GRG, int], bool], bp_range: Tuple[int, int] = (0, 0))

Given a GRG filename or object, save a new GRG that contains only the Mutations selected by the given filter function.

Parameters:
  • grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.

  • out_filename (str) – The filename of the to-be-created GRG.

  • mut_filter (Callable[[pygrgl.GRG, int], bool]) – Callback (function) that takes a MutationID (int) as input and returns true if that mutation should be kept.

grapp.util.filter.grg_save_populations(grg_or_filename: GRG | str, out_filename: str, populations: List[str], allow_extra: bool = False, verbose: bool = False)

Save a GRG, keeping only the samples with populations matching the given population list.

Parameters:
  • grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.

  • out_filename (str) – The new GRG file to create.

  • populations (List[str]) – List of population names to be kept.

  • allow_extra (bool) – When False, throw an exception if populations contains any identifier not found in the GRG. Default: False.

grapp.util.filter.grg_save_range(grg_or_filename: GRG | str, out_filename: str, bp_range: Tuple[int, int])

Given a GRG filename or object, save a new GRG that contains only the Mutations in the given basepair range.

Parameters:
  • grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.

  • out_filename (str) – The filename of the to-be-created GRG.

  • bp_range (Tuple[int, int]) – A pair (lower, upper), where both are in units basepair, and the Mutations will be kept if lower <= Mutation.position < upper. I.e., lower is inclusive and upper is exclusive.

grapp.util.filter.grg_save_samples(grg_or_filename: GRG | str, out_filename: str, sample_nodes: List[int], verbose: bool = False)

Save a GRG, keeping only the haploid samples corresponding to the NodeIDs (indexes) given. See grg_save_individuals() for a version that uses identifiers to more “safely” down sample a GRG dataset.

Parameters:
  • grg_or_filename (Union[pygrgl.GRG, str]) – Either a pygrgl.GRG object, or the filename of a GRG.

  • out_filename (str) – The new GRG file to create.

  • sample_nodes (List[str]) – List of NodeIDs (indexes) for the haploid samples. If a GRG has N samples, then they are numbered 0…(N-1). The ordering matches the order of the input file that the GRG was constructed from.

grapp.util.filter.multi_grg_save_mut_filter(grgs_or_filenames: List[GRG] | List[str], out_filenames: List[str], mut_filter: Callable[[GRG, int, int], bool])

Given a list of GRG filenames or GRG objects, save a new GRG for each that contains only the Mutations selected by the given filter function. The callback takes the GRG, the MutationID within that GRG, and the “cumulative MutationID” when considering all GRGs sequentially (e.g. the second GRG’s mutations start counting right after the last MutationID of the first GRG).

Parameters:
  • grgs_or_filenames (Union[List[pygrgl.GRG], List[str]]) – Either a pygrgl.GRG object, or the filename of a GRG.

  • out_filenames – The list of filenames of the to-be-created GRGs.

  • mut_filter (Callable[[pygrgl.GRG, int, int], bool]) – Callback (function) that takes a MutationID (int) as input and returns true if that mutation should be kept.

grapp.util.filter.split_by_ranges(grg_filename: str, ranges: List[Tuple[int, int]], jobs: int = 1, out_dir: str | None = None) List[str]

Split a GRG into multiple parts, spanning the list of basepair ranges given.

Parameters:
  • grg_filename (str) – The input GRG filename.

  • ranges (List[Tuple[int, int]]) – A list of (lower, upper) pairs, where lower and upper are in units basepair, and lower is inclusive while upper is exclusive.

  • jobs (int) – Number of processes/threads to use. Default: 1.

  • out_dir (Optional[str]) – Output directory to put the split pieces into. If None, then use the current working directory. Default: None.

Returns:

List of filenames for the resulting GRG files.

Return type:

List[str]

Exporting to IGD

grapp.util.igd.export_igd(grg_or_filename: GRG | str, out_filename: str, jobs: int = 1, batch_size: str | int = 'auto', temp_dir: str | None = None, no_merge: bool = False, split_threshold: int = 5000000, verbose: bool = False)

Export a GRG to a phased IGD file, which is a sparse matrix representation of the same data. An IGD will almost always be larger than a GRG, but it can be useful because: 1. The rows are variants, giving fast access to specific variants and their list of samples. Instead of having traverse many graph edges to get the sample list for a variant, you can just read the row from the IGD. 2. Conversion to other standard formats is very fast, for example .vcf.gz

Parameters:
  • grg_or_filename – The GRG to convert, either as a pygrgl.GRG or the filename of a GRG.

  • out_filename (str) – The IGD file to create. The path up to the filename must already exist, and the file itself must not exist.

  • jobs (int) – The number of parallel processes to use to do the conversion. The speed-up is essentially linear. Default: 1.

  • batch_size (Union[str, int]) – The number of Mutations to process simultaneously, or the string “auto” if you want a reasonable value to be chosen for you.

  • temp_dir (Optional[str]) – The directory to use for intermediate IGD files. The GRG is split into multiple pieces and placed in this directory, and then each piece gets converted to an IGD file, and then those IGD files are merged into the final result. If temp_dir is None, these files are placed in a temporary directory which is then deleted upon completion.

  • no_merge (bool) – Set to True to get all the intermediate files, but not merge them into a final IGD. In this case, out_filename will not be created.

  • split_threshold (int) – Basepair threshold for splitting the GRG into chunks for processing. A split GRG is much faster to operate on than a full sized GRG, plus this is how we parallelize the conversion. Default: 5MB.

grapp.util.igd.export_vcf(grg_or_filename: GRG | str, out_file_obj: TextIO, contig: str = 'unknown', jobs: int = 1, batch_size: str | int = 'auto', temp_dir: str | None = None, split_threshold: int = 5000000, verbose: bool = False)

WARNING: Incredibly slow for huge datasets!

Export a GRG to a phased VCF file. Usage should to either use a Gzip file object for the output, or stdout and then pipe the results to bgzip.

Parameters:
  • grg_or_filename – The GRG to convert, either as a pygrgl.GRG or the filename of a GRG.

  • out_file_obj – The file handle to write VCF data to.

  • contig (str) – The contig name to use in the VCF. Default: “unknown”.

  • jobs (int) – The number of parallel processes to use to do the conversion. The speed-up is essentially linear. Default: 1.

  • batch_size (Union[str, int]) – The number of Mutations to process simultaneously, or the string “auto” if you want a reasonable value to be chosen for you.

  • temp_dir (Optional[str]) – The directory to use for intermediate IGD files. The GRG is split into multiple pieces and placed in this directory, and then each piece gets converted to an IGD file, and then those IGD files are merged into the final result. If temp_dir is None, these files are placed in a temporary directory which is then deleted upon completion.

  • split_threshold (int) – Basepair threshold for splitting the GRG into chunks for processing. A split GRG is much faster to operate on than a full sized GRG, plus this is how we parallelize the conversion. Default: 5MB.

grapp.util.igd.igd_to_vcf(igd_filename: str, out_file_obj: TextIO, contig: str, buffer_lines: int = 1000)

WARNING: Incredibly slow for huge datasets!

Convert and IGD file to VCF. General usage should to either use a Gzip file object for the output, or stdout and then pipe the results to bgzip.

This method produces a VCF file with the variants expanded just like the IGD file. To “unexpand” the VCF file, use bcftools norm -m +any input.vcf -o output.vcf.

Parameters:
  • igd_filename (str) – The input IGD filename.

  • out_file_obj – The file handle to write VCF data to.

  • contig (str) – The contig name to use in the VCF.

  • buffer_lines (int) – The number of lines to buffer before flushing to disk. Default: 1000.

Simple Calculations

Simple utility functions.

exception grapp.util.simple.UserInputError

Bases: Exception

grapp.util.simple.allele_counts(grg: GRG, return_missing: bool = False, mask_samples: List[int] | ndarray[tuple[Any, ...], dtype[_ScalarT]] = []) ndarray[tuple[Any, ...], dtype[_ScalarT]] | Tuple[ndarray[tuple[Any, ...], dtype[_ScalarT]], ndarray[tuple[Any, ...], dtype[_ScalarT]]]

Get the allele counts for the mutations in the given GRG.

Parameters:
  • grg (pygrgl.GRG) – The GRG.

  • return_missing (bool) – Return two arrays: the allele counts, and the missingness counts.

  • sample_filter (List[int]) – Optional. Restrict the counts to a subset of samples, listed by sample node ID.

  • mask_samples (Union[List[int], numpy.typing.NDArray]) – Ignore any contribution from the samples listed in this array.

Returns:

A vector of length grg.num_mutations, containing allele counts indexed by MutationID.

Return type:

numpy.ndarray

grapp.util.simple.allele_frequencies(grg: GRG, adjust_missing: bool = False, mask_samples: List[int] | ndarray[tuple[Any, ...], dtype[_ScalarT]] = []) ndarray[tuple[Any, ...], dtype[_ScalarT]]

Get the allele frequencies for the mutations in the given GRG.

Parameters:
  • grg (pygrgl.GRG) – The GRG.

  • adjust_missing (bool) – Optional. Set to true to adjust each allele frequncies to be \(\frac{count_i}{total - missing_i}\) instead of \(\frac{count_i}{total}\).

  • mask_samples (Union[List[int], numpy.typing.NDArray]) – Ignore any contribution from the samples listed in this array.

Returns:

A vector of length grg.num_mutations, containing allele frequencies indexed by MutationID.

Return type:

numpy.ndarray