Raw data analysis
We analyze the raw data generated from GeneChip probe arrays using Affymetrix Microarray Analysis Suite (MAS) 4.0 and MAS 5.0 software. MAS calculates a variety of metrics from the probe array’s hybridization intensities as measured by the scanner. Due to space limitations, we give only a brief overview of the two Affymetrix analysis algorithms. For a more detailed description, please refer to the Affymetrix Technical Manual.
Affymetrix Technology PowerPoint Tutorial
The intensity from the entire probe array is used for Background and Noise calculations. Other metrics compare the intensities of the sequence-specific Perfect Match (PM) probe cells with their control Mismatch (MM) probe cells for each probe set to determine if a transcript is Present (P), Marginal (M), or Absent (A). Next, the numbers of Positive and Negative probe pairs are determined for every probe set. These counts are then used to derive two metrics that describe the performance of each probe set: the Positive Fraction and the Pos/Neg Ratio.
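As a rough illustration of how the Positive Fraction and the Pos/Neg Ratio follow from the probe pair counts, here is a minimal Python sketch. The classification rule (a difference threshold `sdt` and a ratio threshold `srt`) and all numeric values are assumptions made for the example; the actual criteria are those given in the Affymetrix Technical Manual.

```python
def probe_pair_metrics(pm, mm, sdt, srt):
    """Count Positive/Negative probe pairs and derive the two summary metrics.

    sdt and srt are hypothetical difference and ratio thresholds
    (assumptions for this sketch, not values taken from MAS).
    """
    positive = negative = 0
    for p, m in zip(pm, mm):
        if p - m >= sdt and p >= srt * m:
            positive += 1          # PM clearly above MM
        elif m - p >= sdt and m >= srt * p:
            negative += 1          # MM clearly above PM
    positive_fraction = positive / len(pm)
    pos_neg_ratio = positive / negative if negative else float("inf")
    return positive, negative, positive_fraction, pos_neg_ratio


# Made-up intensities for one probe set with five probe pairs.
pm = [900, 850, 400, 120, 700]
mm = [300, 200, 380, 360, 250]
print(probe_pair_metrics(pm, mm, sdt=100, srt=1.5))
```

Pairs that satisfy neither condition count as neither Positive nor Negative, which is why the two counts need not add up to the number of probe pairs.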
Affymetrix MAS 4.0
Versions of MAS prior to 5.0 calculate a numerical value called the average difference (AvgDiff) as the expression intensity of each transcript. This metric uses probe cell intensities directly rather than relying on the numbers of Positive and Negative probe pairs. The AvgDiff for each probe set is the average of the differences between each PM probe cell and its control MM probe cell, and is directly related to the level of expression of the transcript. Probe pair outliers are not allowed to contribute. Because the metric is a difference, the AvgDiff is negative for probe sets in which the mismatch probes have, on average, a higher intensity than the perfect match probes.
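The AvgDiff calculation can be sketched in a few lines of Python. The outlier rule used here (dropping probe pairs whose difference lies more than `k` standard deviations from the mean difference) is an assumption for illustration only, and the intensities are made up; the sketch is not the MAS implementation.

```python
from statistics import mean, pstdev


def avg_diff(pm, mm, k=3.0):
    """Average of PM - MM differences over the probe pairs of one probe set.

    k is a hypothetical outlier cutoff in standard deviations.
    """
    diffs = [p - m for p, m in zip(pm, mm)]
    mu, sd = mean(diffs), pstdev(diffs)
    # Keep every pair when there is no spread; otherwise drop outliers.
    kept = [d for d in diffs if sd == 0 or abs(d - mu) <= k * sd]
    return mean(kept)


print(avg_diff([100, 120, 110], [40, 50, 60]))   # PM above MM: positive value
print(avg_diff([10, 20], [30, 50]))              # MM above PM: negative value
```

The second call shows how a probe set whose mismatch probes are brighter than its perfect match probes ends up with a negative AvgDiff.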
A “decision matrix” is employed to determine the presence or absence of each transcript (the Absolute Call). The Absolute Call is displayed in the analysis output file (*.chp exported as *.txt file, see below under “File types available for downloads”) together with all the Analysis Metrics for every transcript.
Affymetrix MAS 5.0
Affymetrix has changed the algorithm from empirical to statistical and has adapted its terminology to fit more standard terms. The AvgDiff that had been used for empirical expression analysis was replaced by the "Signal". The Signal is calculated using the One-Step Tukey’s Biweight Estimate, which yields a robust weighted mean that is relatively insensitive to outliers. The Tukey’s Biweight method also gives an estimate of the amount of variation in the data, much as the standard deviation measures the amount of variation around an average. MAS 5.0 subtracts from the PM signal a “stray signal” estimate that is based on the intensity of the MM signal. In cases where the MM signal exceeds the PM signal, an adjusted value is used instead. These adjustments eliminate negative values.
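To show why the biweight estimate is insensitive to outliers, the following Python sketch implements a generic one-step Tukey biweight mean. The tuning constant `c` and the small `eps` term are assumptions chosen for the example, and the sketch covers only the weighted-mean step, not the background or stray signal adjustments that MAS 5.0 applies before it.

```python
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight mean of a list of numbers.

    c (tuning constant) and eps (guards against division by zero)
    are assumed values for this sketch.
    """
    m = median(values)
    mad = median(abs(v - m) for v in values)   # median absolute deviation
    weighted_sum = weight_total = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)          # scaled distance from the median
        w = (1 - u * u) ** 2 if abs(u) <= 1 else 0.0
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total


print(tukey_biweight([1, 2, 3, 4, 100]))   # the value 100 receives zero weight
```

Values far from the median get a weight of zero, so a single aberrant probe pair barely moves the estimate, whereas it would dominate an ordinary mean.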
The AbsCall is replaced by the Detection call, which is associated with a Detection p-value. The p-value is derived from a rank-based test of the probe pairs against a user-defined threshold.
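A rank-based Detection call of this kind can be sketched as follows. This is an illustration under stated assumptions, not the MAS 5.0 code: it assumes a per-pair discrimination score (PM - MM) / (PM + MM), a one-sided Wilcoxon signed-rank test of those scores against a threshold `tau`, and two cutoffs `alpha1` and `alpha2` separating Present, Marginal, and Absent. The default values shown for `tau`, `alpha1`, and `alpha2` are assumptions and are user-adjustable.

```python
def signed_rank_p_greater(diffs):
    """Exact one-sided Wilcoxon signed-rank p value (alternative: median > 0)."""
    d = [x for x in diffs if x != 0]           # zero differences are dropped
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks2 = [0] * n                           # ranks doubled so ties stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2       # twice the average rank of the tie
        i = j + 1
    observed = sum(r for r, x in zip(ranks2, d) if x > 0)
    total = sum(ranks2)
    counts = [0] * (total + 1)                 # counts[s]: sign patterns with sum s
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[observed:]) / 2 ** n


def detection_call(pm, mm, tau=0.015, alpha1=0.04, alpha2=0.06):
    """Return (call, p_value); tau, alpha1, alpha2 are assumed defaults."""
    scores = [(p - m) / (p + m) for p, m in zip(pm, mm)]
    p_value = signed_rank_p_greater([r - tau for r in scores])
    if p_value < alpha1:
        return "P", p_value
    if p_value < alpha2:
        return "M", p_value
    return "A", p_value


pm_example = [800 + 10 * i for i in range(16)]   # made-up intensities
mm_example = [200] * 16
print(detection_call(pm_example, mm_example))
```

Raising `tau` makes the test stricter, so fewer transcripts are called Present; widening the gap between `alpha1` and `alpha2` enlarges the Marginal zone.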
Downstream data analysis
The Affymetrix average difference (AvgDiff) score is taken as the measurement of gene expression in all experiments. Additional normalization is performed on each array by setting the mean of all expression levels within an individual array equal to zero, and the variance equal to one. For experiments that use multiple arrays per sample, such as the MU11KsubA and subB set, the A and B arrays for a single sample are individually normalized, as are the combined expression levels across both arrays.
The expression data are filtered prior to clustering in order to increase the signal-to-noise ratio. Since replicated samples are used, statistical tests can be performed to identify genes that show greater variability across conditions or time points than across replicates.
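The per-array normalization amounts to standardizing the AvgDiff values to zero mean and unit variance. A minimal Python sketch follows; the use of the population variance is an assumption (the text does not say which estimator is used), and the AvgDiff values are made up.

```python
from statistics import mean, pstdev, pvariance


def standardize(values):
    """Rescale values to mean 0 and (population) variance 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical AvgDiff values from the A and B arrays of one sample.
sub_a = [120.0, 450.0, -30.0, 900.0]
sub_b = [60.0, 15.0, 300.0, 75.0]

norm_a = standardize(sub_a)                 # A array normalized on its own
norm_b = standardize(sub_b)                 # B array normalized on its own
norm_combined = standardize(sub_a + sub_b)  # both arrays normalized together

print(round(mean(norm_combined), 12), round(pvariance(norm_combined), 12))
```

Normalizing the arrays separately removes array-to-array intensity differences, while the combined normalization keeps A and B values on one common scale.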
Non-parametric tests are used when possible so as not to assume a normal distribution of replicated expression levels. The Kruskal-Wallis test is a non-parametric analogue of the one-way analysis of variance (ANOVA). It is used to compare the medians between more than two samples when the underlying distribution is not normal, and generates an H statistic from which a p value can be calculated.
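For a single gene, the H statistic and its p value can be computed as in the Python sketch below. It uses the textbook Kruskal-Wallis formula with the usual tie correction and the chi-square approximation with (number of groups - 1) degrees of freedom; the expression values are made up, and this is an illustration rather than the code used for the analyses described here.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df."""
    z = x / 2
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(z)) if df % 2 else math.exp(-z)
    while a < df / 2:                       # recurrence on the incomplete gamma
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q


def kruskal_wallis(groups):
    """Return (H, p) for a list of groups of replicate measurements."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of, tie_term, i = {}, 0, 0
    while i < n:                            # average ranks for tied values
        j = i
        while j + 1 < n and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + j + 2) / 2
        tie_term += (j - i + 1) ** 3 - (j - i + 1)
        i = j + 1
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:                     # every value identical
        return 0.0, 1.0
    rank_sums = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = (12 / (n * (n + 1)) * rank_sums - 3 * (n + 1)) / correction
    return h, chi2_sf(h, len(groups) - 1)


# Hypothetical expression values for one gene in three groups.
h_stat, p_val = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h_stat, p_val)
```

With only a few replicates per group the chi-square approximation is coarse, so the resulting p values are better treated as a ranking score than as exact probabilities.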
For some models, the Kruskal-Wallis test is used to score genes by their likelihood to vary between the different experimental groups (for example, wild-type, heterozygous, and homozygous mutant mice). For each gene, separate H statistics are calculated using the raw AvgDiff, the values normalized within an individual array, and the values normalized across a pair of arrays, such as the MU11KsubA and subB arrays. P values are computed from the H statistics, and those genes with p values below an arbitrary level for all three of these measures are included in subsequent clustering. For example, a p value less than 0.05 was chosen for some experiments because it resulted in a manageable number of genes to explore, not because of any special significance of this threshold.
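The selection rule (a gene must pass the threshold under all three representations of the data) can be sketched in Python as follows. The gene names and p values are hypothetical, and the function name is introduced only for this example.

```python
def genes_passing_all(p_value_sets, threshold=0.05):
    """Return the genes whose p value is below threshold in every set."""
    common = set.intersection(*(set(s) for s in p_value_sets))
    return sorted(
        gene for gene in common
        if all(s[gene] < threshold for s in p_value_sets)
    )


# Hypothetical p values from the three representations of the data.
p_raw = {"geneA": 0.01, "geneB": 0.20, "geneC": 0.03}
p_single_array = {"geneA": 0.02, "geneB": 0.01, "geneC": 0.04}
p_array_pair = {"geneA": 0.04, "geneB": 0.03, "geneC": 0.08}

selected = genes_passing_all([p_raw, p_single_array, p_array_pair])
print(selected)
```

Requiring agreement across all three representations is conservative: a gene that looks variable only because of one normalization choice is excluded from clustering.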
Currently, Affymetrix MGU74A arrays are used for experiments related to normal heart development and models of cardiac hypertrophy. Due to manufacturing irregularities on a subset of arrays, approximately twenty percent of the probes were unusable and were masked from further analysis. A single MGU74A array is used for each sample, so statistical tests are performed using the raw AvgDiff and values normalized within each single array.
In several experiments, we are attempting to cluster genes that are likely to be differentially expressed both over time and between different experimental and control conditions. In these cases, a two-way ANOVA test is used to calculate, for a given gene, a p value for whether at least one time point in one condition differs significantly from the overall expression level. The two-way ANOVA also provides a measurement of the "row effect", indicating a significant variation between time points, the "column effect", indicating a significant variation between conditions, and the "interaction effect", indicating a combination of row and column effects. After selecting those genes with p values below the given threshold, clustering is performed using the matrix incision tree (Kim et al., 2001).
Kim JH, Ohno-Machado L, Kohane IS (2001). Unsupervised learning from complex data: the matrix incision tree algorithm. Pac Symp Biocomput. 30-41. PMID: 11262950
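The row, column, and interaction effects of a two-way ANOVA can be illustrated for one gene with the Python sketch below. It assumes a balanced design (the same number of replicates, at least two, in every time point and condition) and returns only the F statistics; converting them to p values requires the F distribution from a statistics library. The layout of `data` and the example values are assumptions for illustration.

```python
from statistics import mean


def two_way_anova_f(data):
    """F statistics for a balanced two-way design with replicates.

    data[i][j] is the list of replicate values for time point i (row)
    and condition j (column).
    """
    a, b, n = len(data), len(data[0]), len(data[0][0])
    cell = [[mean(reps) for reps in row] for row in data]
    row_mean = [mean(cell[i]) for i in range(a)]
    col_mean = [mean(cell[i][j] for i in range(a)) for j in range(b)]
    grand = mean(row_mean)

    ss_row = b * n * sum((r - grand) ** 2 for r in row_mean)
    ss_col = a * n * sum((c - grand) ** 2 for c in col_mean)
    ss_int = n * sum(
        (cell[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    ss_err = sum(
        (x - cell[i][j]) ** 2
        for i in range(a) for j in range(b) for x in data[i][j]
    )
    ms_err = ss_err / (a * b * (n - 1))     # within-cell (replicate) variance
    return {
        "row": ss_row / (a - 1) / ms_err,
        "column": ss_col / (b - 1) / ms_err,
        "interaction": ss_int / ((a - 1) * (b - 1)) / ms_err,
    }


# Hypothetical gene: 2 time points x 2 conditions, 2 replicates each.
example = [[[1, 3], [5, 7]],
           [[2, 4], [6, 8]]]
f_stats = two_way_anova_f(example)
print(f_stats)
```

In this made-up example the condition (column) effect dominates, the time (row) effect is weak, and there is no interaction, because the difference between conditions is the same at both time points.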
Page last modified: 07-Aug-2003