Introduction

Affymetrix has never explicitly supported the use of its SNP microarrays for measuring allele frequencies in pooled DNAs, but the calling method for the early 10K 2.0 arrays was based on a Relative Allele Score (RAS), which can also serve as an allele frequency estimate. To use this approach with subsequent array products, which use different allele calling routines, we wrote R scripts that calculate RAS in various ways.

1) Using GDAS output

This script reads the intensity value tables that can be exported as text files from the Affymetrix GDAS program. It works for 10K, 100K and 500K (2-array set) arrays. The current version reads and processes the files in chunks, so it works even on PCs with modest amounts of RAM. It does not have a manual yet, but it works very much like the previous one; an older version, complete with a manual, is also available.
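To illustrate what reading a file in chunks looks like in R, here is a minimal sketch. It is not the code of the script itself: the helper name process_in_chunks, the chunk size and the three-column demo table (snp, a, b) are all assumptions made for the example, and a real GDAS export will have a different layout.

```r
# Hypothetical helper (not part of the published script): apply `fun` to
# successive chunks of a tab-delimited text file, so that only `chunk_size`
# rows are held in memory at any one time.
process_in_chunks <- function(path, fun, chunk_size = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # The first line is assumed to hold the column names.
  header <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]
  results <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break          # end of file reached
    chunk <- read.table(text = lines, sep = "\t", col.names = header,
                        stringsAsFactors = FALSE)
    results[[length(results) + 1]] <- fun(chunk)
  }
  results
}

# Made-up demo table standing in for an exported intensity file.
tmp <- tempfile(fileext = ".txt")
demo <- data.frame(snp = paste0("SNP_", 1:5),
                   a = c(10, 20, 30, 40, 50),
                   b = c(50, 40, 30, 20, 10))
write.table(demo, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Count the rows seen in each chunk of two lines.
chunk_rows <- process_in_chunks(tmp, nrow, chunk_size = 2)
```

Because each chunk is discarded once `fun` has returned, peak memory depends on the chunk size rather than on the size of the file.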

described in

Meaburn E, Butcher LM, Schalkwyk LC, Plomin R. (2006) Genotyping pooled DNA using 100K SNP microarrays: a step towards genomewide association scans. Nucleic Acids Res. 34(4):e27.

some publications that cite the above

2) Using CEL files

The latest Affymetrix genotyping arrays (genome-wide V5 and V6) differ from the earlier versions in that they do not include mismatch probes. Dropping these probes makes room for 0.5 × 10^6 and 1 × 10^6 SNPs, respectively, on a single array.

Judging by the experience of gene expression studies, it is probably just as well to get rid of the mismatch probes, but it does leave us with a short-term problem: we cannot calculate the same RAS that has been validated for pooling in previous work.

What we can calculate is the ratio of allele signal intensities, A/(A+B), which we call RAS*. Calculated from the raw intensities, RAS* has a different distribution from conventional RAS: it is compressed towards 0.5, whereas RAS ranges from 0 to 1. These values may nonetheless be useful for finding frequency differences between pools, but it remains to be investigated whether some kind of background subtraction and normalisation will improve performance.
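The ratio A/(A+B) can be sketched in a few lines of R. This is an illustration only, not the code in snpmap.R: the function name ras_star, the optional bg argument (a constant background subtracted from both alleles) and the intensity values are all made up for the example.

```r
# Hypothetical helper: ratio of allele A signal to total signal per SNP.
# `bg` is an assumed constant background, subtracted from both alleles.
ras_star <- function(a, b, bg = 0) {
  a <- pmax(a - bg, 0)
  b <- pmax(b - bg, 0)
  total <- a + b
  # SNPs with no remaining signal get NA rather than a division by zero.
  ifelse(total > 0, a / total, NA_real_)
}

# Made-up intensities for three SNPs in two pools (not real data).
pool1_a <- c(1200, 800, 300); pool1_b <- c(400, 800, 900)
pool2_a <- c(1000, 820, 600); pool2_b <- c(600, 780, 600)

r1 <- ras_star(pool1_a, pool1_b)
r2 <- ras_star(pool2_a, pool2_b)

# Per-SNP difference between the pools.
diff_ras <- r1 - r2
```

With these toy numbers, subtracting a background of 200 from the first SNP of pool 1 moves its score from 0.75 to about 0.83, that is, away from 0.5. Whether such a correction helps with real arrays is the open question raised above.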

Our objective was to calculate RAS* with a minimum of fuss and also to provide a platform for investigating refinements. snpmap.R achieves this for V5 and V6 arrays using the affxparser package. We expect to generalise it to other arrays and wrap it into a package. Download snpmap.R and the required array data file cdx56.RData into the same directory, then start R and run:
install.packages('affxparser', repos='http://www.bioconductor.org')
# on current versions of R, BiocManager::install('affxparser') is the supported route
source('snpmap.R')
attach('cdx56.RData')
setwd('my_cel_file_dir') # use the path to your real cel file folder here!
ras_scores <- celtorasm() # process all files with names ending in .cel

warnings

you need a lot of RAM:

Genome-wide arrays V5 and V6 contain a lot of data. I keep memory requirements down by not storing multiple copies of the data, but the limits are still tight: on a PC with 1.25 GB of RAM and a 795 MB paging file, I can process 3 CEL files at once, and only if no other applications (not even a shell) are running. On a PC with 2 GB of memory the number is unlikely to be more than about 10. There is no guarantee that running out of memory will be handled gracefully.

It's not hard to break the data into chunks for processing (this can be done by using sections of the tables in cdx56.RData) but this is inconvenient for background correction and normalisation, so I have no immediate plans to build it in as an option.

your R session may die

The Fusion library responds to many errors by killing R. I do some basic checks, so this shouldn't happen often, but save any work you want to keep before running these functions.
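As an extra precaution, you can screen the CEL files yourself and save your workspace before processing. The sketch below is an assumption about what such a screen might look like, not a description of the checks built into snpmap.R; the helper name usable_cel_files and the file name in the save.image() call are made up.

```r
# Hypothetical helper: list the CEL files in `dir` and keep only those that
# are readable and not empty. It does not validate the CEL format itself.
usable_cel_files <- function(dir = ".") {
  files <- list.files(dir, pattern = "\\.cel$", ignore.case = TRUE,
                      full.names = TRUE)
  ok <- file.access(files, mode = 4) == 0 & file.size(files) > 0
  dropped <- files[!ok]
  if (length(dropped) > 0) {
    warning("Skipping unreadable or empty files: ",
            paste(basename(dropped), collapse = ", "))
  }
  files[ok]
}

# A crash in compiled code cannot be caught with tryCatch(), so save first:
# save.image("before_cel_processing.RData")
# cel_files <- usable_cel_files("my_cel_file_dir")
```

A screen like this only catches the most obvious problems, such as a truncated download that left an empty file; it cannot protect against a file that is corrupt internally.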

complaints and suggestions to

Leo Schalkwyk spjglcs@iop.kcl.ac.uk