Introduction

Previously, we looked at summary statistics for LDL using a large cohort. Here, we’ll see how those summary statistics are actually generated. We’ll look at a case-control outcome for IBD for chromosome 10 using simulated data.

Problem 1

A) Measure the association of each SNP on the phenotype without adjusting for any other covariates. B) Show a qq-plot and manhattan plot as we did in homework 1 using the qqman package. C) Compute \(\lambda_{GC}\) as we did in homework 1. What does the computed value of \(\lambda_{GC}\) say about the degree of inflation for this chromosome?

Here in one code chunk, we can compute all three regression models for 1A, 2A, and 2B. The final data frame “result” will have the SNP coordinates, alleles, and then 3 p-values for each of the associated models.

library(data.table)
library(irlba)

# Import Data
geno <- data.matrix(data.frame(fread(paste0("zcat <", "../../Assignments/Homework3/data/genotype.txt.gz"), header = FALSE)))
## 
Read 0.0% of 6081 rows
Read 6081 rows and 34405 (of 34405) columns from 0.394 GB file in 00:00:15
pheno <- read.table("../../Assignments/Homework3/data/phenotype.txt", header = TRUE)
snpPos <- read.table("../../Assignments/Homework3/data/legend.txt", header = TRUE)

# QC
snpMissingRate <- colMeans(is.na(geno))
personMissingRate <- rowMeans(is.na(geno))

geno <- geno[personMissingRate <= 0.05, snpMissingRate <= 0.05]
snpPos <- snpPos[snpMissingRate <= 0.05,]
pheno <- pheno[personMissingRate <= 0.05,]

# Handle Missingness
for (j in 1:dim(geno)[2])  {
  temp <- geno[,j]
  geno[,j][is.na(temp)] <- median(temp,na.rm = TRUE)  #replace missing with median
  geno[,j] <- (geno[,j]-mean(geno[,j]))/sd(geno[,j])   #normalization / standarization
}

# Perform PCA
PC <- irlba(geno)
allDF <- data.frame(ancestry = pheno$ancestry, gender = pheno$gender,
                    Y = pheno$Y, PC$u)

# Initialize the data frame of results
result <- snpPos
result$noadjp <- 1; result$ancestryp <- 1; result$PCp <- 1

# Loop over each yariant
for (j in 1:dim(geno)[2]) {

  # Fit models
  noadj.mod   <- summary(glm(allDF$Y~geno[,j],family=binomial()))$coeff
  ancestry.mod<- summary(glm(allDF$Y~geno[,j]+allDF$gender+allDF$ancestry,
                             family=binomial()))$coeff
  PC4.mod      <- summary(glm(allDF$Y~geno[,j]+allDF$gender+allDF$X1+allDF$X2+allDF$X3+allDF$X4,
                             family=binomial()))$coeff
  
  # Overwrite the results data frame
  result[j,c("noadjp","ancestryp","PCp")] <- c(noadj.mod[2,4],ancestry.mod[2,4],PC4.mod[2,4])
}

The summary statistics can be visualized from the model here:

qqman::manhattan(result, chr = "chr", bp = "position", p = "noadjp")