A Probe-Density Based Analysis Method for Array CGH data: Simulation, Normalization and Centralization
Hung-I Harry Chen1,2, Fang-Han Hsu1,2, Yuan Jiang3, Mong-Hsun Tsai2,4, Pan-Chyr Yang5, Paul S. Meltzer3, Eric Y Chuang1,2,4,6,7,8,* and Yidong Chen3,*
1Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan 106, 2Research Center for Medical Excellence, National Taiwan University, Taipei, Taiwan 100, 3Genetics Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA, 4Institute of Biotechnology, Center for Systems Biology and Bioinformatics, National Taiwan University, Taipei, Taiwan 106, 5College of Medicine, National Taiwan University, Taipei, Taiwan 100, 6Department of Life Science, National Taiwan University, Taipei, Taiwan 106, 7Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan 106, 8Graduate Institute of Epidemiology, National Taiwan University, Taipei, Taiwan 106Figures / Supplementary Figures / Results of Published Data
Abstract:
Motivation:
Genomic instability is one of the fundamental factors in tumorigenesis and tumor progression. Many studies have shown that copy-number abnormalities at the DNA level are important in the pathogenesis of cancer. Array Comparative Genomic Hybridization (array CGH), developed based on expression microarray technology, can reveal the chromosomal aberrations in segmental copies at a high-resolution. However, due to the nature of array CGH, many standard expression data processing tools, such as data normalization, often fail to yield satisfactory results.Results:
We demonstrated a novel array CGH normalization algorithm, which provides an accurate array CGH data normalization by utilizing the dependency of neighboring probe measurements in array CGH experiments. To facilitate the study, we have developed a Hidden Markov Model (HMM) to simulate a series of array CGH experiments with random DNA copy number alterations that are used to validate the performance of our normalization. In addition, we applied the proposed normalization algorithm to an array CGH study of lung cancer cell lines. By using the proposed algorithm, data quality and the reliability of experimental results are significantly improved, and the distinct patterns of DNA copy number alternations are observed among those lung cancer cell lines.Matlab Code of the algorithm: Click Here
Matlab tool needed for the algorithm: STPR tool
Figure 1. aCGH data and visualization, where blue dots are raw ratios, and red bars are moving-average ratios (log2 transformed).
Figure 2. Block diagram of the normalization algorithm.
Figure 3. 2D density distribution plot and ridgeline.Figure 4 . Simulation result of array CGH ratio, where red color indicates copy number gain, and green for copy number loss. X-axis is the genomic order of the probes of each chromosome.
(a) normal condition (b) dye-bias condition
Figure 5 . The error bar of rooted mean square error of the simulated data.
(a) 40k data sets (b) 300k data sets
Figure 6.The illustration of the simulated data. (Centralization factor = -0.0521 in (d).)(a)
(b)
(c)
(d)
Figure 7. The results of chromosome 16 after implementing aCGH normalization. X-axis is the probe's index on this chromosome. Y-axis is the signal ratio of the probe. (a) Result of linear regression. (b) Result of quadratic regression. (c) Result of logistic regression. (d) Result of spline curve fitting.Table:
Table 1. Transition matrix trained from GSM 75166, chr20.
Figure S1. Logistic function (with b = [0 1 -1 1]). The shaded section and line in red color is the part we commonly observed in microarray's M-A plot. The mirror (up-down) of this figure is b = [0 -1 -1 1].
Figure S2. State space represents gain, loss, and no-change region. Transition probabilities are derived from Table 1.Figure S3 . Simulation result of array CGH ratio, where red color indicates copy number gain, and green for copy number loss. X-axis is the genomic order of the probes of each chromosome.[A]
[B] Centralization factor = 0.2350
[C] Centralization factor = -0.1599
[D] Centralization factor = -0.0650
[E] Centralization factor = -0.0521
Figure S4. The illustration of the simulated data. [A] Histogram of raw data ratio. [B] Results of linear regression. [C] Results of quadratic regression. [D] Results of logistic regression. [E] Results of spline curve fitting. (a) is the histogram of normalized ratio (without moving-median filter). (b) is the histogram of normalized ratio with moving-median filter. (c) is the density plot by using EM algorithm, in which kernel-smoothing density function derived from moving-median filtered and normalized ratio was plotted in red-color, blue line was the estimated density function from EM algorithm, and three dashed-lines in blue were three Gaussian-densities in the mixture model for EM algorithm.
(a) Result of linear regression. (b) Result of second-order regression. (c) Result of logistic regression. (d) Result of spline curve fitting.Figure S5. The results of aCGH ratio after implementing aCGH normalization. Red color indicates copy number gain, and green for copy number loss. X-axis represents the total numbers of probes on each chromosome. Y-axis represents which chromosome it is.Figure S6. 2D density distribution plot of real array CGH data which shows the signals distribution.Figure S7. Normalized Data of HEEBO array CGH. Derivative Log-Ratio Spread (DLRS)** = 1.2886. The absolute value of signal ratio that is larger than the DLRS was coded red (gain) or green (loss).
** Dlrs (derivative log-ratio spread) is the standard deviation of the diff(normalized ratio) divided by
. The ratio threshold is set based on the dlrs we calculate. The parameter 'k' is set depends on the quality of the microarray. For HEEBO microarray which the probes are designed for gene expression, the dlrs is large so k=1 for HEEBO data. For Agilent microarray with high quality which causes dlrs to be small, k is set to be 6. The smallest dlrs of the data CL1-0, CL1-1 and CL1-5 will be the common threshold of these three cases.
dlrs = std( diff(Normalized Ratio) )/
Ratio threshold = +-(2^(dlrs*k))
The function diff() is implemented in MATLAB (Natik, MA). This function diff(X) calculates differences between adjacent elements of X. If X is a vector, then diff(X) returns a vector, one element shorter than X, of differences between adjacent elements: [X(2)-X(1) X(3)-X(2) ... X(n)-X(n-1)]Figure S8. Normalized Data of Agilent array CGH.Figure S9.The result of HEEBO aCGH data - CL1-5. (CL1-0, CL1-1)
Figure S10.The result of Agilent aCGH data - CL1-5. (CL1-0, CL1-1)
Figure 11. The scatter plot of Agilent CL1-5 aCGH data. X axis represents the sample log2 intensity, y axis represents the reference log2 intensity.Supplementary Tables:
Table S1.The rooted mean square error of each regression method. Values in dark italic are lowest error in the category.
Table S2.The parameter settings of aCGH Simulation for each condition. HMM was training for tumor copy number values, which were not listed here. Details of these parameters can be found in [Attor, 2004].
GSE7822: Melanoma cell line DNA copy number analysis (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7822)
GSM188311~ GSM188328
GSM188312, 188315, 188317, 188319, and 188323 are shown here. (Other results)
Platform: Agilent Human Genome CGH Microarray 244A (G4411B)
Cy3 Channel: WM793B Cell Line
Cy5 Channel: Lymphocyte derived pooled male (Promega, Inc.)GSM188312
Figure S12(a) 2D density distribution plot. X-axis is the log2 intensity of sample channel, Y-axis is the log2 intensity of reference channel.
Figure S12(b).The illustration of the simulated data. (a) is the histogram of raw data ratio. (b) is the histogram of normalized ratio using spline curve fitting as the regression method (without moving-median filter). (c) is the histogram of normalized ratio with moving-median filter. (d) is the density plot by using EM algorithm.
Figure S12(c).The results of aCGH ratio after implementing aCGH normalization.
GSM188315
Figure S13(a) 2D density distribution plot. X-axis is the log2 intensity of sample channel, Y-axis is the log2 intensity of reference channel.
Figure S13(b).The illustration of the simulated data. (a) is the histogram of raw data ratio. (b) is the histogram of normalized ratio using spline curve fitting as the regression method (without moving-median filter). (c) is the histogram of normalized ratio with moving-median filter. (d) is the density plot by using EM algorithm.
Figure S13(c).The results of aCGH ratio after implementing aCGH normalization.
GSM188317
Figure S14(a) 2D density distribution plot. X-axis is the log2 intensity of sample channel, Y-axis is the log2 intensity of reference channel.
Figure S14(b).The illustration of the simulated data. (a) is the histogram of raw data ratio. (b) is the histogram of normalized ratio using spline curve fitting as the regression method (without moving-median filter). (c) is the histogram of normalized ratio with moving-median filter. (d) is the density plot by using EM algorithm.
Figure S14(c).The results of aCGH ratio after implementing aCGH normalization.
GSM188319
Figure S15(a) 2D density distribution plot. X-axis is the log2 intensity of sample channel, Y-axis is the log2 intensity of reference channel.
Figure S15(b).The illustration of the simulated data. (a) is the histogram of raw data ratio. (b) is the histogram of normalized ratio using spline curve fitting as the regression method (without moving-median filter). (c) is the histogram of normalized ratio with moving-median filter. (d) is the density plot by using EM algorithm.
Figure S15(c).The results of aCGH ratio after implementing aCGH normalization.
GSM188323
Figure S16(a) 2D density distribution plot. X-axis is the log2 intensity of sample channel, Y-axis is the log2 intensity of reference channel.
Figure S16(b).The illustration of the simulated data. (a) is the histogram of raw data ratio. (b) is the histogram of normalized ratio using spline curve fitting as the regression method (without moving-median filter). (c) is the histogram of normalized ratio with moving-median filter. (d) is the density plot by using EM algorithm.
Figure S16(c).The results of aCGH ratio after implementing aCGH normalization.
GSE7603: Human TALL tumor vs Normal Human DNA (http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE7603)
GSM183878~ GSM183885
Except GSM183881, other samples all have dye-swap replicates.
GSM183878 is shown here. (Other results)
Platform: Agilent-014693 Human Genome CGH Microarray 244A (G4411B)
Cy3 Channel: TALL tumor / Normal Human DNA
Cy5 Channel: Normal Human DNA / TALL tumorGSM183878 - 1 Cy3 TALL tumor, Cy5 Normal Human DNA
Figure S17(a) 2D density distribution plot. X-axis is the log2 intensity of sample channel, Y-axis is the log2 intensity of reference channel.
Figure S17(b).The illustration of the simulated data. (a) is the histogram of raw data ratio. (b) is the histogram of normalized ratio using spline curve fitting as the regression method (without moving-median filter). (c) is the histogram of normalized ratio with moving-median filter. (d) is the density plot by using EM algorithm.
Figure S17(c).The results of aCGH ratio after implementing aCGH normalization.
GSM183878 - 2 Cy3 Normal Human DNA, Cy5 TALL tumor
Figure S18(a) 2D density distribution plot. X-axis is the log2 intensity of sample channel, Y-axis is the log2 intensity of reference channel.
Figure S18(b).The illustration of the simulated data. (a) is the histogram of raw data ratio. (b) is the histogram of normalized ratio using spline curve fitting as the regression method (without moving-median filter). (c) is the histogram of normalized ratio with moving-median filter. (d) is the density plot by using EM algorithm.
Figure S18(c).The results of aCGH ratio after implementing aCGH normalization.