Example Analysis

1 Prepare the data
- 1.1 Reformat phenotypic data
- 1.2 Reformat genetic data
2 Import data to GenABEL
3 Quality control
4 Run association analysis

1 Prepare the data

1.1 Reformat phenotypic data

Check inut pheno-file

head -n 5 ../data/phenotypes.csv

id,sex,age,coolTrait
prs001,0,74.39,-0.56
prs002,0,67.76,1.18
prs003,1,50.57,0.27
prs004,0,44.35,0.22

Let us also look into fam-file:

head -n 5 ../data/plink.fam

1 1 0 0 2 -9
2 2 0 0 2 -9
3 3 0 0 2 -9
4 4 0 0 2 -9
5 5 0 0 2 -9

Ok, in the phe-file we need to remove leading 'prs00..' and replace ',' with space. Let us do that using perl

open iFILE, "<", $infile or die $!;
open oFILE, ">", $outfile or die $!;
while (<iFILE>) {
  s/prs[0]*//g;
  s/,/ /g;
  print oFILE $_;
}

check results:

head -n 5 phenotypes.txt

id sex age coolTrait
1 0 74.39 -0.56
2 0 67.76 1.18
3 1 50.57 0.27
4 0 44.35 0.22

Looks good!

1.2 Reformat genetic data

Let us now generate TPED files from binary files

plink --bfile ../data/plink --recode --transpose --out plink

@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Recent cached web-check found... OK, v1.07 is current

Writing this text to log file [ plink.log ]
Analysis started: Wed May  9 17:24:16 2012

Options in effect:
        --bfile ../data/plink
        --recode
        --transpose
        --out plink

Reading map (extended format) from [ ../data/plink.bim ] 
500 markers to be included from [ ../data/plink.bim ]
Reading pedigree information from [ ../data/plink.fam ] 
120 individuals read from [ ../data/plink.fam ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 120 missing
0 males, 120 females, and 0 of unspecified sex
Reading genotype bitfile from [ ../data/plink.bed ] 
Detected that binary PED file is v1.00 SNP-major mode
Before frequency and genotyping pruning, there are 500 SNPs
120 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 1
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 500 SNPs
After filtering, 0 cases, 0 controls and 120 missing
After filtering, 0 males, 120 females, and 0 of unspecified sex
Writing transposed ped file to [ plink.tped ] 
Writing family information to [ plink.tfam ] 

Analysis finished: Wed May  9 17:24:16 2012

Now we all set to import data to GenABEL

2 Import data to GenABEL

Load library

library(GenABEL)

Convert genotypic data to internal GA format

convert.snp.tped(tped="plink.tped",tfam="plink.tfam",out="plink.raw")

 done.  Read 120 individual ids from file 'plink.tfam'
Reading genotypes from file 'plink.tped' ...
...done.  Read 500 SNPs from file 'plink.tped'
Writing to file 'plink.raw' ...
... done.

Make GA object

dta <- load.gwaa.data(pheno="phenotypes.txt",geno="plink.raw")

ids loaded...
marker names loaded...
chromosome data loaded...
map data loaded...
allele coding data loaded...
strand data loaded...
genotype data loaded...
snp.data object created...
assignment of gwaa.data object FORCED; X-errors were not checked!

3 Quality control

Run QC procedure

qc <- check.marker(dta,maf=0.01,p.lev=1e-6)

Excluding people/markers with extremely low call rate...
500 markers and 120 people in total
0 people excluded because of call rate < 0.1 
0 markers excluded because of call rate < 0.1 
Passed: 500 markers and 120 people

RUN 1 
500 markers and 120 people in total
90 (18%) markers excluded as having low (<1%) minor allele frequency
0 (0%) markers excluded because of low (<95%) call rate
7 (1.4%) markers excluded because they are out of HWE (P <1e-06)
0 (0%) people excluded because of low (<95%) call rate
Mean autosomal HET is 0.2928246 (s.e. 0.04848604)
0 people excluded because too high autosomal heterozygosity (FDR <1%)
Mean IBS is 0.7365075 (s.e. 0.03945568), as based on 403 autosomal markers
0 (0%) people excluded because of too high IBS (>=0.95)
In total, 403 (80.6%) markers passed all criteria
In total, 120 (100%) people passed all criteria

RUN 2 
403 markers and 120 people in total
0 (0%) markers excluded as having low (<1%) minor allele frequency
0 (0%) markers excluded because of low (<95%) call rate
0 (0%) markers excluded because they are out of HWE (P <1e-06)
0 (0%) people excluded because of low (<95%) call rate
Mean autosomal HET is 0.2928246 (s.e. 0.04848604)
0 people excluded because too high autosomal heterozygosity (FDR <1%)
Mean IBS is 0.7365075 (s.e. 0.03945568), as based on 403 autosomal markers
0 (0%) people excluded because of too high IBS (>=0.95)
In total, 403 (100%) markers passed all criteria
In total, 120 (100%) people passed all criteria

Summary for people and SNPs

summary(qc)

$`Per-SNP fails statistics`
          NoCall NoMAF NoHWE Redundant Xsnpfail
NoCall         0     0     0         0        0
NoMAF         NA    90     0         0        0
NoHWE         NA    NA     7         0        0
Redundant     NA    NA    NA         0        0
Xsnpfail      NA    NA    NA        NA        0

$`Per-person fails statistics`
            IDnoCall HetFail IBSFail isfemale ismale isXXY otherSexErr
IDnoCall           0       0       0        0      0     0           0
HetFail           NA       0       0        0      0     0           0
IBSFail           NA      NA       0        0      0     0           0
isfemale          NA      NA      NA        0      0     0           0
ismale            NA      NA      NA       NA      0     0           0
isXXY             NA      NA      NA       NA     NA     0           0
otherSexErr       NA      NA      NA       NA     NA    NA           0

Clean the data

dtaQCed <- dta[qc$idok,qc$snpok]
nids(dtaQCed)
nsnps(dtaQCed)

[1] 120
[1] 403

Compute summaries for clean data

smr <- summary(gtdata(dtaQCed))
smr[1:3,]

           Chromosome Position Strand A1 A2 NoMeasured CallRate        Q.2 P.11
rs6139074          20    11244      u  A  C        120        1 0.34166667   54
rs13043000         20    13288      u  G  T        120        1 0.05416667  108
rs6037629          20    43093      u  T  G        120        1 0.06666667  105
           P.12 P.22    Pexact       Fmax      Plrt
rs6139074    50   16 0.4205635 0.07378821 0.4210522
rs13043000   11    1 0.2903816 0.10538800 0.3329276
rs6037629    14    1 0.4142976 0.06250000 0.5346466

Generate histogram ofr EAF

hist(smr$Q.2)

4 Run association analysis

What traits are there?

phdata(dtaQCed)[1:3,]

  id sex   age coolTrait
1  1   0 74.39     -0.56
2  2   0 67.76      1.18
3  3   1 50.57      0.27

Let us analyze 'coolTrait'!

qts <- qtscore(coolTrait ~ sex, data=dtaQCed)

And the top results are …

summary(qts)

Summary for top 10 results, sorted by P1df 
          Chromosome Position Strand A1 A2   N       effB   se_effB chi2.1df
rs6039167         20   846271      u  G  A 120 -5.8718229 0.9383084 39.16109
rs511582          20   868752      u  A  G 120 -0.9951493 0.2071729 23.07335
rs967789          20  2876445      u  A  G 120 -0.9169810 0.2108509 18.91340
rs642758          20  2831146      u  A  G 120 -0.9031358 0.2145081 17.72632
rs676749          20  2974069      u  A  T 120 -0.8587089 0.2042989 17.66689
rs577116          20  2818285      u  G  A 120 -0.9109926 0.2183962 17.39961
rs570673          20  2819015      u  C  G 120 -0.8558624 0.2110940 16.43827
rs6116745         20   531587      u  G  A 120 -0.8570554 0.2319487 13.65318
rs857246          20  3039527      u  T  C 120 -0.8047982 0.2189885 13.50614
rs6138952         20  2849540      u  A  G 120  0.7276168 0.2035988 12.77188
                  P1df      effAB     effBB chi2.2df         P2df        Pc1df
rs6039167 3.902411e-10 -5.8718229        NA 39.16109 3.902411e-10 0.0002461192
rs511582  1.559370e-06 -1.4406424 -1.693303 25.64223 2.703087e-06 0.0048901914
rs967789  1.367883e-05 -1.1036460 -1.683681 19.44626 5.988238e-05 0.0108375531
rs642758  2.550731e-05 -1.0256233 -1.695747 17.96912 1.253300e-04 0.0136387515
rs676749  2.631684e-05 -1.0802561 -1.571311 18.36022 1.030693e-04 0.0137971704
rs577116  3.028887e-05 -1.0479276 -1.682819 17.72193 1.418181e-04 0.0145333688
rs570673  5.026025e-05 -0.9592921 -1.625815 16.60981 2.473009e-04 0.0175331294
rs6116745 2.198682e-04 -1.1379859 -1.301536 15.25578 4.866872e-04 0.0304046737
rs857246  2.377842e-04 -0.7444676 -1.663115 13.55748 1.137710e-03 0.0313116520
rs6138952 3.518693e-04  0.7348530  1.450474 12.77271 1.684384e-03 0.0362826813

And let us also have manhattan plot

plot(qts)
abline(h=-log10(5e-8))