Example Analysis
Table of Contents
1 Prepare the data
1.1 Reformat phenotypic data
Check inut pheno-file
head -n 5 ../data/phenotypes.csv
id,sex,age,coolTrait prs001,0,74.39,-0.56 prs002,0,67.76,1.18 prs003,1,50.57,0.27 prs004,0,44.35,0.22
Let us also look into fam-file:
head -n 5 ../data/plink.fam
1 1 0 0 2 -9 2 2 0 0 2 -9 3 3 0 0 2 -9 4 4 0 0 2 -9 5 5 0 0 2 -9
Ok, in the phe-file we need to remove leading 'prs00..' and replace ',' with space. Let us do that using perl
open iFILE, "<", $infile or die $!; open oFILE, ">", $outfile or die $!; while (<iFILE>) { s/prs[0]*//g; s/,/ /g; print oFILE $_; }
check results:
head -n 5 phenotypes.txt
id sex age coolTrait 1 0 74.39 -0.56 2 0 67.76 1.18 3 1 50.57 0.27 4 0 44.35 0.22
Looks good!
1.2 Reformat genetic data
Let us now generate TPED files from binary files
plink --bfile ../data/plink --recode --transpose --out plink
@----------------------------------------------------------@ | PLINK! | v1.07 | 10/Aug/2009 | |----------------------------------------------------------| | (C) 2009 Shaun Purcell, GNU General Public License, v2 | |----------------------------------------------------------| | For documentation, citation & bug-report instructions: | | http://pngu.mgh.harvard.edu/purcell/plink/ | @----------------------------------------------------------@ Web-based version check ( --noweb to skip ) Recent cached web-check found... OK, v1.07 is current Writing this text to log file [ plink.log ] Analysis started: Wed May 9 17:24:16 2012 Options in effect: --bfile ../data/plink --recode --transpose --out plink Reading map (extended format) from [ ../data/plink.bim ] 500 markers to be included from [ ../data/plink.bim ] Reading pedigree information from [ ../data/plink.fam ] 120 individuals read from [ ../data/plink.fam ] 0 individuals with nonmissing phenotypes Assuming a disease phenotype (1=unaff, 2=aff, 0=miss) Missing phenotype value is also -9 0 cases, 0 controls and 120 missing 0 males, 120 females, and 0 of unspecified sex Reading genotype bitfile from [ ../data/plink.bed ] Detected that binary PED file is v1.00 SNP-major mode Before frequency and genotyping pruning, there are 500 SNPs 120 founders and 0 non-founders found Total genotyping rate in remaining individuals is 1 0 SNPs failed missingness test ( GENO > 1 ) 0 SNPs failed frequency test ( MAF < 0 ) After frequency and genotyping pruning, there are 500 SNPs After filtering, 0 cases, 0 controls and 120 missing After filtering, 0 males, 120 females, and 0 of unspecified sex Writing transposed ped file to [ plink.tped ] Writing family information to [ plink.tfam ] Analysis finished: Wed May 9 17:24:16 2012
Now we all set to import data to GenABEL
2 Import data to GenABEL
Load library
library(GenABEL)
Convert genotypic data to internal GA format
convert.snp.tped(tped="plink.tped",tfam="plink.tfam",out="plink.raw")
done. Read 120 individual ids from file 'plink.tfam' Reading genotypes from file 'plink.tped' ... ...done. Read 500 SNPs from file 'plink.tped' Writing to file 'plink.raw' ... ... done.
Make GA object
dta <- load.gwaa.data(pheno="phenotypes.txt",geno="plink.raw")
ids loaded... marker names loaded... chromosome data loaded... map data loaded... allele coding data loaded... strand data loaded... genotype data loaded... snp.data object created... assignment of gwaa.data object FORCED; X-errors were not checked!
3 Quality control
Run QC procedure
qc <- check.marker(dta,maf=0.01,p.lev=1e-6)
Excluding people/markers with extremely low call rate... 500 markers and 120 people in total 0 people excluded because of call rate < 0.1 0 markers excluded because of call rate < 0.1 Passed: 500 markers and 120 people RUN 1 500 markers and 120 people in total 90 (18%) markers excluded as having low (<1%) minor allele frequency 0 (0%) markers excluded because of low (<95%) call rate 7 (1.4%) markers excluded because they are out of HWE (P <1e-06) 0 (0%) people excluded because of low (<95%) call rate Mean autosomal HET is 0.2928246 (s.e. 0.04848604) 0 people excluded because too high autosomal heterozygosity (FDR <1%) Mean IBS is 0.7365075 (s.e. 0.03945568), as based on 403 autosomal markers 0 (0%) people excluded because of too high IBS (>=0.95) In total, 403 (80.6%) markers passed all criteria In total, 120 (100%) people passed all criteria RUN 2 403 markers and 120 people in total 0 (0%) markers excluded as having low (<1%) minor allele frequency 0 (0%) markers excluded because of low (<95%) call rate 0 (0%) markers excluded because they are out of HWE (P <1e-06) 0 (0%) people excluded because of low (<95%) call rate Mean autosomal HET is 0.2928246 (s.e. 0.04848604) 0 people excluded because too high autosomal heterozygosity (FDR <1%) Mean IBS is 0.7365075 (s.e. 0.03945568), as based on 403 autosomal markers 0 (0%) people excluded because of too high IBS (>=0.95) In total, 403 (100%) markers passed all criteria In total, 120 (100%) people passed all criteria
Summary for people and SNPs
summary(qc)
$`Per-SNP fails statistics` NoCall NoMAF NoHWE Redundant Xsnpfail NoCall 0 0 0 0 0 NoMAF NA 90 0 0 0 NoHWE NA NA 7 0 0 Redundant NA NA NA 0 0 Xsnpfail NA NA NA NA 0 $`Per-person fails statistics` IDnoCall HetFail IBSFail isfemale ismale isXXY otherSexErr IDnoCall 0 0 0 0 0 0 0 HetFail NA 0 0 0 0 0 0 IBSFail NA NA 0 0 0 0 0 isfemale NA NA NA 0 0 0 0 ismale NA NA NA NA 0 0 0 isXXY NA NA NA NA NA 0 0 otherSexErr NA NA NA NA NA NA 0
Clean the data
dtaQCed <- dta[qc$idok,qc$snpok] nids(dtaQCed) nsnps(dtaQCed)
[1] 120 [1] 403
Compute summaries for clean data
smr <- summary(gtdata(dtaQCed)) smr[1:3,]
Chromosome Position Strand A1 A2 NoMeasured CallRate Q.2 P.11 rs6139074 20 11244 u A C 120 1 0.34166667 54 rs13043000 20 13288 u G T 120 1 0.05416667 108 rs6037629 20 43093 u T G 120 1 0.06666667 105 P.12 P.22 Pexact Fmax Plrt rs6139074 50 16 0.4205635 0.07378821 0.4210522 rs13043000 11 1 0.2903816 0.10538800 0.3329276 rs6037629 14 1 0.4142976 0.06250000 0.5346466
Generate histogram ofr EAF
hist(smr$Q.2)
4 Run association analysis
What traits are there?
phdata(dtaQCed)[1:3,]
id sex age coolTrait 1 1 0 74.39 -0.56 2 2 0 67.76 1.18 3 3 1 50.57 0.27
Let us analyze 'coolTrait'!
qts <- qtscore(coolTrait ~ sex, data=dtaQCed)
And the top results are …
summary(qts)
Summary for top 10 results, sorted by P1df Chromosome Position Strand A1 A2 N effB se_effB chi2.1df rs6039167 20 846271 u G A 120 -5.8718229 0.9383084 39.16109 rs511582 20 868752 u A G 120 -0.9951493 0.2071729 23.07335 rs967789 20 2876445 u A G 120 -0.9169810 0.2108509 18.91340 rs642758 20 2831146 u A G 120 -0.9031358 0.2145081 17.72632 rs676749 20 2974069 u A T 120 -0.8587089 0.2042989 17.66689 rs577116 20 2818285 u G A 120 -0.9109926 0.2183962 17.39961 rs570673 20 2819015 u C G 120 -0.8558624 0.2110940 16.43827 rs6116745 20 531587 u G A 120 -0.8570554 0.2319487 13.65318 rs857246 20 3039527 u T C 120 -0.8047982 0.2189885 13.50614 rs6138952 20 2849540 u A G 120 0.7276168 0.2035988 12.77188 P1df effAB effBB chi2.2df P2df Pc1df rs6039167 3.902411e-10 -5.8718229 NA 39.16109 3.902411e-10 0.0002461192 rs511582 1.559370e-06 -1.4406424 -1.693303 25.64223 2.703087e-06 0.0048901914 rs967789 1.367883e-05 -1.1036460 -1.683681 19.44626 5.988238e-05 0.0108375531 rs642758 2.550731e-05 -1.0256233 -1.695747 17.96912 1.253300e-04 0.0136387515 rs676749 2.631684e-05 -1.0802561 -1.571311 18.36022 1.030693e-04 0.0137971704 rs577116 3.028887e-05 -1.0479276 -1.682819 17.72193 1.418181e-04 0.0145333688 rs570673 5.026025e-05 -0.9592921 -1.625815 16.60981 2.473009e-04 0.0175331294 rs6116745 2.198682e-04 -1.1379859 -1.301536 15.25578 4.866872e-04 0.0304046737 rs857246 2.377842e-04 -0.7444676 -1.663115 13.55748 1.137710e-03 0.0313116520 rs6138952 3.518693e-04 0.7348530 1.450474 12.77271 1.684384e-03 0.0362826813
And let us also have manhattan plot
plot(qts) abline(h=-log10(5e-8))