
Normalises column names from any GWAS summary statistics file to the canonical schema expected by run_mr() and run_coloc(). Handles datasets where rsIDs are absent by looking them up from a PLINK bim file, and supports extracting chromosome and position from compound marker ID columns (e.g. SCALLOP MarkerName format: "CHR:POS:A1_A2").

Usage

format_gwas(
  path,
  phenotype_id,
  type = c("outcome", "exposure"),
  col_map = NULL,
  bim_path = NULL,
  marker_col = NULL,
  marker_sep = ":",
  log10_pval = FALSE,
  flip_beta = FALSE,
  n = NULL
)

Arguments

path

Character file path (.tsv, .tsv.gz, .txt.gz, etc. – data.table::fread() auto-detects compression) or a pre-loaded data frame.

phenotype_id

Character. Trait / phenotype identifier (e.g. "IL-18", "CAD"). Stored in the phenotype output column.

type

"outcome" (default) or "exposure". "outcome" returns the normalised data frame ready for run_mr()'s outcome argument. "exposure" additionally calls TwoSampleMR::format_data() and returns TwoSampleMR-formatted data whose columns carry the .exposure suffix (e.g. beta.exposure).

col_map

Named list of extra column-name aliases, e.g. list(pval = "PVALUE", n = "SampleSize"). Only needed for column names not already covered by the built-in alias table (see Column normalisation section). User entries are checked before the built-in list.

bim_path

Character. Path to a PLINK bfile prefix (without the .bim extension) used to recover rsIDs when the data lacks them. Required whenever the rsids column is absent or entirely NA.

marker_col

Character. Name of a compound marker ID column in "CHR<sep>POS<sep>..." format (e.g. "MarkerName" for SCALLOP files). When supplied, chr and pos are parsed from this column.

marker_sep

Character. Field separator used in marker_col. Default ":".

log10_pval

Logical. If TRUE, the p-value column is treated as being on the -log10 scale and is back-transformed to plain p-values via 10^(-x). Default FALSE.

flip_beta

Logical. If TRUE, multiplies beta by -1 – use when the source file encodes the inverse direction of the intended exposure (e.g. modelling NLRP3 activation rather than suppression). Default FALSE.

n

Integer. Explicit sample size. Added as the n column only when no sample-size column is already present in the data.
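
The value-level arguments (log10_pval, flip_beta, n) can be pictured with a minimal base-R sketch. This is not the package's internal code: the helper name apply_value_options and the toy data frame are hypothetical, and the sketch assumes the columns have already been renamed to the canonical pval, beta and n.

```r
# Hypothetical helper (not part of the package): mirrors what the
# log10_pval, flip_beta and n arguments are documented to do.
apply_value_options <- function(gwas, log10_pval = FALSE, flip_beta = FALSE, n = NULL) {
  if (log10_pval) gwas$pval <- 10^(-gwas$pval)  # -log10(p) back to p
  if (flip_beta)  gwas$beta <- -gwas$beta       # reverse the effect direction
  # n is only added when the data has no sample-size column of its own
  if (!is.null(n) && !"n" %in% names(gwas)) gwas$n <- as.integer(n)
  gwas
}

# Toy input: p-values stored on the -log10 scale, no sample-size column
gwas <- data.frame(beta = c(0.10, -0.25), pval = c(2, 8))
out  <- apply_value_options(gwas, log10_pval = TRUE, flip_beta = TRUE, n = 5000)
```

In this toy case the p-values become 0.01 and 1e-08, the betas change sign, and every row receives n = 5000.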

Value

  • type = "outcome": a data frame with columns rsids, chr, pos, beta, se, eaf, pval, n, effect_allele, other_allele, phenotype (plus any extra columns from the source file).

  • type = "exposure": a TwoSampleMR-formatted data frame (output of TwoSampleMR::format_data()) with .exposure-suffixed columns (e.g. beta.exposure), suitable for run_mr()'s exposure argument.

Column normalisation

The function renames source columns to a fixed canonical schema by checking a built-in table of known aliases for each target column:

Canonical       Built-in aliases recognised automatically
rsids           rsid, rs_id, rsID, SNP
chr             chromosome, Chr, CHROM, #CHROM, CHR
pos             base_pair_location, PosB37, PosB38, BP, POS, position, GENPOS
beta            Beta, Effect, BETA
se              standard_error, StdErr, SE, sebeta
eaf             effect_allele_frequency, Freq1, EAFrq, A1FREQ, af_alt, EAF
pval            p_value, P-value, P, Pval, p.value
n               N, TotalSampleSize, n_total
effect_allele   Allele1, EA, A1, ALLELE1, effectAllele, ALT
other_allele    Allele2, OA, A2, ALLELE0, otherAllele, REF

Supply col_map only when your dataset uses a column name that does not appear in the table above – for example, if your p-value column is called "PVALUE", add col_map = list(pval = "PVALUE"). Inspect names() of your loaded data to check. User-supplied aliases are checked before the built-in list, so they take precedence in the event of ambiguity.
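
The precedence rule can be illustrated with a small base-R sketch. It is an assumption about how such a lookup might work, not the package's actual implementation: normalise_names and builtin_aliases are hypothetical names, and only three rows of the alias table are reproduced.

```r
# Hypothetical sketch of alias-based renaming (subset of the alias table only).
builtin_aliases <- list(
  rsids = c("rsid", "rs_id", "rsID", "SNP"),
  pval  = c("p_value", "P-value", "P", "Pval", "p.value"),
  beta  = c("Beta", "Effect", "BETA")
)

normalise_names <- function(df, col_map = NULL) {
  for (target in union(names(col_map), names(builtin_aliases))) {
    if (target %in% names(df)) next  # column is already canonical
    # User-supplied aliases are placed first, so they win over built-in ones
    candidates <- c(col_map[[target]], builtin_aliases[[target]])
    hit <- candidates[candidates %in% names(df)]
    if (length(hit) > 0) names(df)[names(df) == hit[1]] <- target
  }
  df
}

# Ambiguous input: both "PVALUE" (user alias) and "P" (built-in alias) are present
raw     <- data.frame(SNP = "rs1", Effect = 0.2, PVALUE = 1e-6, P = 0.5)
renamed <- normalise_names(raw, col_map = list(pval = "PVALUE"))
names(renamed)
```

Here "PVALUE" is renamed to pval because the user entry is tried first, while the competing "P" column is left untouched.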

rsID lookup from bim file

When rsids is absent (or all NA) after column normalisation, and bim_path is supplied, the function inner-joins the data to the PLINK bim file by chromosome and position to recover rsIDs. Rows without a bim match are dropped – they are absent from the reference panel and cannot be used in LD-based analyses. A message reports how many SNPs were retained.
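
The join described above can be sketched in base R. This is an illustration under stated assumptions rather than the function's own code: the bim data frame is a toy stand-in for a real PLINK .bim file, and the column names given to it are chosen here for convenience.

```r
# Toy stand-in for a PLINK .bim file, which has six header-less columns:
# chromosome, variant ID, genetic distance, bp position, allele 1, allele 2.
bim <- data.frame(
  chr   = c(1, 1, 2),
  rsids = c("rs111", "rs222", "rs333"),
  cm    = 0,
  pos   = c(1000, 2000, 3000),
  a1    = c("A", "C", "G"),
  a2    = c("G", "T", "A")
)

# Summary statistics without rsIDs; the chr 3 variant is absent from the panel
gwas <- data.frame(chr = c(1, 2, 3), pos = c(1000, 3000, 500), beta = c(0.1, 0.2, 0.3))

# merge() defaults to an inner join, so rows without a bim match are dropped
joined <- merge(gwas, bim[, c("chr", "pos", "rsids")], by = c("chr", "pos"))
message(nrow(joined), " of ", nrow(gwas), " SNPs retained after bim lookup")
```

Comparing nrow(joined) with nrow(gwas) before continuing is a quick way to spot a genome-build mismatch between the summary statistics and the reference panel, since mismatched coordinates would discard most rows.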

Marker column parsing

Set marker_col to the name of a compound marker ID column whose values have the form "CHR<sep>POS<sep>..." (e.g. SCALLOP "MarkerName"). chr and pos are extracted from the first two fields. This step runs before the rsID lookup so that the extracted coordinates are available for the bim join.
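
A minimal base-R sketch of the parsing step, assuming SCALLOP-style values such as "1:12345:A_G". The example markers are invented, and whether the function stores chr as character or integer is not stated here, so the sketch keeps it as character.

```r
# Invented example markers in "CHR:POS:A1_A2" form
marker     <- c("1:12345:A_G", "22:67890:C_T")
marker_sep <- ":"

# fixed = TRUE treats the separator literally rather than as a regular expression
parts <- strsplit(marker, marker_sep, fixed = TRUE)

chr <- vapply(parts, `[`, character(1), 1)              # first field
pos <- as.integer(vapply(parts, `[`, character(1), 2))  # second field
```

Splitting with fixed = TRUE matters when the separator is a regex metacharacter such as "." or "|".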

Examples

if (FALSE) { # \dontrun{
# Outcome GWAS whose columns are already in the built-in alias table
cad <- format_gwas(
  path         = "genomics_data/outcome_GWAS/CAD/cad_gwas.tsv.gz",
  phenotype_id = "CAD"
)

# SCALLOP outcome: no rsIDs, chr+pos embedded in MarkerName column
scallop_il6 <- format_gwas(
  path         = "genomics_data/outcome_GWAS/SCALLOP/CVD1_IL6.tsv.gz",
  phenotype_id = "IL6",
  marker_col   = "MarkerName",
  bim_path     = "LD_ref/g1000_eur"
)

# Dataset with a non-standard p-value column not in the alias table
ebi_il18 <- format_gwas(
  path         = "genomics_data/outcome_GWAS/EBI/GCST90428399.tsv.gz",
  phenotype_id = "IL-18",
  col_map      = list(pval = "PVALUE")
)

# Exposure GWAS -- flip beta to model NLRP3 activation not suppression
exposure <- format_gwas(
  path         = "NLRP3/Output/NLRP3_CRP_IVs_300kb.tsv",
  phenotype_id = "NLRP3",
  type         = "exposure",
  flip_beta    = TRUE
)
} # }