VIMIDA TUTORIAL

VIMIDA is a Java applet for VIsualization of MIcroarray DAta using linear and non-linear principal components analysis (PCA).

Required materials:

For this tutorial you will need

Java Virtual Machine installed and enabled (usually you should already have it)
VDAOEngine.jar archive
sample data file
a settings file for the elmap algorithm (only needed if you are going to test and use the applet without an Internet connection)

Put all these materials in a folder on your disk.

Step 1: Preparing data table

VIMIDA accepts tab-, space- and comma-delimited text tables. The table must not start with a line of column names. Instead, add a header that defines the types, classes and descriptions of the table columns. The header has the following format:


<number_of_columns> <number_of_rows>
<Field1_name> <Field1_type> [<Field1_class>] [<Field1_description>]
<Field2_name> <Field2_type> [<Field2_class>] [<Field2_description>]
...
<Fieldn_name> <Fieldn_type> [<Fieldn_class>] [<Fieldn_description>]

Field names must not contain spaces. Field_type can take only two values: STRING (text) and FLOAT (numerical). Field_class and Field_description are optional and must not contain spaces either. For this tutorial, we prepared a file containing data about gain/loss in the minimal regions of CGH profiles obtained on a collection of tumor samples. Below you can see the beginning of the sample.dat file:


319 34
NAME STRING
CLASS STRING
DESCRIPTION STRING
REG0 FLOAT CHR1 1p36.3
REG1 FLOAT CHR1 1p36.2
....
REG313 FLOAT CHR21 21q22
REG314 FLOAT CHR22 22q13
REG315 FLOAT CHR22 22q13
Smpl1 "C2" "" 1 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 -1 0 0 0 -1 0 0 0 -1 0 -1 0 -1 -1 0 -1 0 -1 0 -1 0 0 0 -1 0 -1 0 0 0 0 0 0 -1 0 -1 -1 0 -1 0 -1 -1 -1 -1 0 -1 -1 -1 0 -1 0 -1 0 0 -1 -1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 0 -1 0 1 1 0 1 0 -1 -1 0 -1 0 -1 -1 -1 0 -1 0 0 0 0 -1 0 0 0 0 -1 0 0 -1 -1 0 0 -1 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 -1 0 -1 0 0 0 -1 0 0 -1 0 -1 0 0 -1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 -1 0 0 0 -1 0 -1 0 0 -1 -1 -1 0 0 0 0 0 0 0 0 -1 0 -1 0 -1 0 -1 -1 0 -1 -1 0 0 -1 0 -1 0 0 1 0 0 0 0 -1 -1 0 -1 0 -1 0 -1 0 0 0 0 -1 0 1 0 1 0 0 0 0 0 0 
Smpl2 "C4" "" 0 -1 -1 0 -1 0 -1 0 0 0 0 0 0 -1 0 0 -1 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 -1 0 0 -1 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 -1 0 -1 0 0 -1 0 0 0 -1 0 -1 -1 -1 0 -1 0 -1 -1 -1 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 -1 0 0 -1 0 -1 0 0 -1 -1 -1 0 0 -1 0 0 0 -1 0 0 0 0 0 -1 0 0 0 0 0 0 -1 0 -1 0 0 -1 -1 -1 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 -1 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 0 0 0 0 0 -1 
....

This means that the data table consists of 319 columns (316 minimal regions plus 3 columns with the name, clinical class and description of a sample) and 34 rows (samples). The only numbers in this table are -1, 0 and 1, which mean loss, no change and gain of a minimal region in the corresponding CGH profile. Every minimal region is characterized by the chromosome on which it is located and, more precisely, by a standard genetic map index (we could also include information about the genes located in this region). The three string fields NAME, CLASS and DESCRIPTION are desirable but not required: the applet tries to take the information from exactly these fields to annotate the points. If no NAME field is provided, the applet takes the value from the first field of STRING type. It is preferable to save the data table you prepared in a file with the .dat extension.
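The header format described above can be sketched in a few lines of Python. This is only an illustration written for this tutorial, not part of VIMIDA: the function names build_header and parse_header are hypothetical, and the sketch assumes space-separated header lines as in sample.dat.

```python
from typing import List, Optional, Tuple

# Each field is (name, type, class, description); class and description are optional.
Field = Tuple[str, str, Optional[str], Optional[str]]


def build_header(fields: List[Field], n_rows: int) -> str:
    """Return the header: a '<columns> <rows>' line, then one line per field."""
    lines = [f"{len(fields)} {n_rows}"]
    for name, ftype, fclass, fdesc in fields:
        if ftype not in ("STRING", "FLOAT"):
            raise ValueError(f"unsupported field type: {ftype}")
        if fdesc is not None and fclass is None:
            # The optional items are positional, so a description needs a class.
            raise ValueError(f"field {name}: description given without a class")
        parts = [p for p in (name, ftype, fclass, fdesc) if p is not None]
        if any(ch.isspace() for p in parts for ch in p):
            raise ValueError(f"whitespace is not allowed in field definitions: {parts}")
        lines.append(" ".join(parts))
    return "\n".join(lines)


def parse_header(text: str) -> Tuple[int, List[Field]]:
    """Read a header back; returns (number_of_rows, fields)."""
    lines = text.splitlines()
    n_cols, n_rows = (int(x) for x in lines[0].split())
    fields = []
    for line in lines[1:1 + n_cols]:
        parts = line.split()
        parts += [None] * (4 - len(parts))  # pad the missing optional items
        fields.append(tuple(parts[:4]))
    return n_rows, fields


# A shortened, made-up variant of the sample header (2 regions instead of 316).
fields = [
    ("NAME", "STRING", None, None),
    ("CLASS", "STRING", None, None),
    ("DESCRIPTION", "STRING", None, None),
    ("REG0", "FLOAT", "CHR1", "1p36.3"),
    ("REG1", "FLOAT", "CHR1", "1p36.2"),
]
header = build_header(fields, n_rows=2)
print(header)
```

The data rows would follow the header unchanged, one sample per line.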

Step 2: Running the applet

The applet can be run with the appletviewer program from the Java Development Kit, or from an HTML page opened in your favorite browser. A template HTML page that runs the applet looks as follows:


<html>
<head>
<title>
HTML Test Page
</title>
</head>
<body>
<br>
<center>
<applet
  codebase = "."
  code     = "vdaoengine.TestApplet.class"
  name     = "TestApplet"
  archive  = "VDAOEngine.jar"
  width    = "500"
  height   = "500"
  hspace   = "0"
  vspace   = "0"
  align    = "middle"
>
<param name = "datfile" value = "http://www.ihes.fr/~zinovyev/test/sample.dat">
<param name = "settings_file" value = "http://www.ihes.fr/~zinovyev/test/elmap.ini">
<param name = "microarray_type" value = "CGH_minimal_regions_gain_loss">
</applet>
</center>
</body>
</html>

There are several parameters: datfile is the URL of the .dat file you prepared in the previous step, and settings_file is the URL of a settings file for the elmap algorithm. If you have an Internet connection, you do not need to change the settings_file parameter; the file will be downloaded from the specified remote location. If you plan to work without an Internet connection, download the elmap.ini file and specify its location with a "local" URL, as below:


<param name = "datfile" value = "file:///c:/datas/myfile.dat">
<param name = "settings_file" value = "file:///c:/datas/settings/elmap.ini">

The microarray_type parameter specifies the type of the microarray and modifies the applet behaviour accordingly. At the moment the following types are supported:

CGH_minimal_regions_gain_loss - set of CGH profiles from which the minimal regions have been extracted and assigned a value of -1, 0 or 1 to denote loss, no change or gain
CGH_bacs_real_values - set of CGH profiles with some BACs selected and assigned a value corresponding to the copy number
expression_real_values - set of expression profiles characterized by real-valued numbers
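If you maintain several pages for different data files, the three param lines can be generated rather than edited by hand. The following Python sketch is an assumption-laden illustration: the helper name applet_params is hypothetical, and the check against the three type names simply mirrors the list above.

```python
from html import escape

# The three microarray_type values listed in this tutorial.
SUPPORTED_TYPES = (
    "CGH_minimal_regions_gain_loss",
    "CGH_bacs_real_values",
    "expression_real_values",
)


def applet_params(datfile: str, settings_file: str, microarray_type: str) -> str:
    """Return the three <param> lines to place inside the <applet> element."""
    if microarray_type not in SUPPORTED_TYPES:
        raise ValueError(f"unknown microarray_type: {microarray_type}")
    params = {
        "datfile": datfile,
        "settings_file": settings_file,
        "microarray_type": microarray_type,
    }
    return "\n".join(
        f'<param name = "{name}" value = "{escape(value, quote=True)}">'
        for name, value in params.items()
    )


# Local ("offline") configuration, using the example paths from the text.
print(applet_params(
    "file:///c:/datas/myfile.dat",
    "file:///c:/datas/settings/elmap.ini",
    "CGH_minimal_regions_gain_loss",
))
```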

Step 3: Visualizing data with PCA and principal manifolds

After you have created the HTML page, load it into your favorite browser and answer "YES" to the question "Do you want to trust the signed applet?" You will then see a PCA visualization image with density estimation:

Points corresponding to different values of the CLASS field are automatically drawn in different colors. The grey shading in the background corresponds to a simple continuous estimate of the point density. You can explore the image with the mouse cursor; the status line shows the NAME, CLASS and DESCRIPTION of the object you point at:

If you click on a point, its NAME field is drawn on the image and its DESCRIPTION field appears in the text box. You can also click the "Label all" button to show the NAME labels of all points, or the "RESET" button to clear them. To apply the elmap algorithm, select the Elastic Map radio button. Constructing the map can take some time: several seconds for a table with several hundred rows and several tens of columns, and longer for a big table (the time is roughly proportional to the number of elements in the table). For our sample the result should appear in a couple of seconds:

Step 4: Visualizing densities

On the previous screen you saw the estimate of the overall density. Using the combobox below the text area, you can visualize the density estimate for a particular class, for example for class "C1" or for class "C2":

Another possibility is to visualize the "relative density" of a class, i.e. the ratio of the class density to the overall density. To do this, tick the "RD" checkbox. For example, the relative densities of class "C1" and class "C2" look as follows:

The interpretation of the relative density is that the points of some class are overrepresented (or underrepresented) in a region of the map.
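The idea of a relative density can be sketched in Python. The tutorial does not say which density estimator the applet uses, so a plain Gaussian kernel with a hand-picked bandwidth is assumed here; the function names and the toy points are likewise made up for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def density(at: Point, points: Sequence[Point], bandwidth: float = 1.0) -> float:
    """Average of Gaussian bumps centred on the points (unnormalised kernel estimate)."""
    if not points:
        return 0.0
    total = 0.0
    for x, y in points:
        d2 = (at[0] - x) ** 2 + (at[1] - y) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return total / len(points)


def relative_density(at: Point, class_points: Sequence[Point],
                     all_points: Sequence[Point], bandwidth: float = 1.0,
                     eps: float = 1e-12) -> float:
    """Class density divided by overall density; eps avoids division by zero."""
    return density(at, class_points, bandwidth) / (density(at, all_points, bandwidth) + eps)


# Toy 2D map positions for two classes (hypothetical values).
c1 = [(0.0, 0.0), (0.2, 0.1)]
c2 = [(3.0, 3.0), (3.1, 2.9)]
everything = c1 + c2

near_c1 = relative_density((0.0, 0.0), c1, everything)
near_c2 = relative_density((3.0, 3.0), c1, everything)
print(near_c1 > 1.0, near_c2 < 1.0)
```

With per-class averaging as above, a value above 1 marks a region where the class is overrepresented and a value below 1 a region where it is underrepresented.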

Step 5: Visualizing smoothed trends

Principal components analysis can be regarded as a projection of a multidimensional data set onto a two-dimensional linear manifold embedded in the multidimensional space. The elastic map algorithm constructs a non-linear 2D embedded manifold, which also approximates the training set of vectors.

The applet shows these linear (PCA) and non-linear (elastic map) manifolds "unfolded", in their internal coordinates. But every point of a manifold is also a point in the multidimensional space (since the manifold is embedded and constructed there). Thus we can visualize, at the points of the manifold, the values of a function defined in the multidimensional space. Since the manifold we construct is smooth, the values of a smooth function remain smooth at the points of the manifold.

The simplest function we want to visualize is simply a coordinate value. We will have m such visualizations where m is the dimension of the data space. For example, below you see linear and non-linear visualization of the REG3 (minimal region, number 3) value:

We call this type of visualization a "smoothed trend". To understand this notion, note that you can show the value of REG3 for the data points by enabling the "Show by size" checkbox. Points with a gained value (+1) then become big and points with a lost value (-1) become small. The background color also corresponds to the gained (black), normal (grey) or lost (white) value, but taken at the points of the manifold (after some thresholding, of course). In effect, what you see is the smooth tendency of the REG3 value to change across the map.
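For the linear (PCA) case, a smoothed trend can be sketched as follows: a point of the plane with internal coordinates (u, v) is mapped back into data space, and one coordinate of that point is read off. The Python below is a sketch under stated assumptions. The mean, the two axes and the threshold of 0.5 are all hypothetical; the tutorial does not specify the thresholding the applet applies.

```python
from typing import List, Sequence


def manifold_point(mean: Sequence[float], axis1: Sequence[float],
                   axis2: Sequence[float], u: float, v: float) -> List[float]:
    """Data-space position of the plane point with internal coordinates (u, v)."""
    return [m + u * a + v * b for m, a, b in zip(mean, axis1, axis2)]


def trend_value(mean: Sequence[float], axis1: Sequence[float],
                axis2: Sequence[float], u: float, v: float, k: int) -> float:
    """Value of coordinate k (e.g. one minimal region) at the plane point (u, v)."""
    return manifold_point(mean, axis1, axis2, u, v)[k]


def threshold(value: float, cut: float = 0.5) -> str:
    """Map a smoothed value to a label; the cut of 0.5 is an assumption."""
    if value > cut:
        return "gained"
    if value < -cut:
        return "lost"
    return "normal"


# Hypothetical 3-dimensional data space with a plane spanned by two unit axes.
mean = [0.0, 0.1, -0.2]
axis1 = [1.0, 0.0, 0.0]
axis2 = [0.0, 1.0, 0.0]

labels = [threshold(trend_value(mean, axis1, axis2, u, 0.0, k=0))
          for u in (-0.9, 0.0, 0.8)]
print(labels)
```

For an elastic map the same idea applies, except that the data-space position of a map point comes from the fitted non-linear manifold instead of from the mean and two axes.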

It is useful to understand that PCA and non-linear PCA can be regarded as methods of averaging. When you calculate a mean point, you minimize the mean-square distance from the data points to a point. When you calculate the principal plane, you minimize the mean-square distance to a plane. When you calculate the principal manifold, you likewise minimize the sum of the squared distances between the points and their orthogonal projections onto the manifold. Thus every coordinate is also averaged, and this average is what you see in the visualization of the smoothed trend.
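The "averaging" view can be checked numerically on a toy data set. The Python sketch below computes the mean point and the first principal axis and compares squared distances. Power iteration is used here only because it is short; it is not claimed to be what the applet does, and the data points are made up.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_point(points: Sequence[Vector]) -> List[float]:
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sum_sq_dist_to_point(points: Sequence[Vector], c: Vector) -> float:
    return sum(sum((x - ci) ** 2 for x, ci in zip(p, c)) for p in points)


def first_principal_axis(points: Sequence[Vector], iterations: int = 200) -> List[float]:
    """Unit direction of largest variance, found by power iteration."""
    c = mean_point(points)
    centred = [[x - ci for x, ci in zip(p, c)] for p in points]
    dim = len(c)
    axis = [1.0] * dim  # start vector; must not be orthogonal to the true axis
    for _ in range(iterations):
        # Apply X^T X to axis without forming the matrix.
        new = [0.0] * dim
        for p in centred:
            dot = sum(a * b for a, b in zip(p, axis))
            for i in range(dim):
                new[i] += dot * p[i]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break
        axis = [x / norm for x in new]
    return axis


def sum_sq_dist_to_line(points: Sequence[Vector], c: Vector, axis: Vector) -> float:
    """Sum of squared distances to the line through c with unit direction axis."""
    total = 0.0
    for p in points:
        d = [x - ci for x, ci in zip(p, c)]
        along = sum(a * b for a, b in zip(d, axis))
        total += sum(x * x for x in d) - along ** 2
    return total


points = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]
centre = mean_point(points)
axis = first_principal_axis(points)

# The mean beats any other candidate point; the principal axis beats the x axis.
print(sum_sq_dist_to_point(points, centre) < sum_sq_dist_to_point(points, [2.5, 2.0]))
print(sum_sq_dist_to_line(points, centre, axis) < sum_sq_dist_to_line(points, centre, [1.0, 0.0]))
```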

The second simple function one can visualize with the applet is the mean value over one class of coordinates. In this example the coordinates (minimal regions) have been divided into classes according to their position on different chromosomes. Thus we have 22 classes, one per chromosome. Below, for example, you can see the non-linear visualization of the CHR2 class:

Again, if you tick the "Show by size" checkbox, the average gain-loss value for every point will be shown by size. For example, one can see that sample 28 has considerable duplications in the second chromosome whereas sample 21 has some deletions. Most of the points of class C1 (the green ones), with one exception, do not have significant changes in the second chromosome.
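The per-class mean used for this coloring is a simple average over the columns that share a Field_class. A minimal Python sketch follows, with a hypothetical helper name and made-up values for five regions:

```python
from typing import Sequence


def class_mean(values: Sequence[float], field_classes: Sequence[str], wanted: str) -> float:
    """Average of one sample's values over the fields whose class equals `wanted`."""
    selected = [v for v, c in zip(values, field_classes) if c == wanted]
    if not selected:
        raise ValueError(f"no field belongs to class {wanted}")
    return sum(selected) / len(selected)


# Field classes as they would appear in the header (made-up, 5 regions only).
field_classes = ["CHR1", "CHR1", "CHR2", "CHR2", "CHR2"]

sample_a = [1, 0, 1, 1, 0]     # mostly gains on chromosome 2
sample_b = [-1, 0, -1, 0, 0]   # one loss on chromosome 2

print(class_mean(sample_a, field_classes, "CHR2"))
print(class_mean(sample_b, field_classes, "CHR2"))
```

A clearly positive mean corresponds to a large point and a clearly negative mean to a small one when "Show by size" is enabled.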

Step 6: Transposing the data table

For our example, in the "normal" problem, every row of the data table corresponds to a sample and every column corresponds to a minimal region. Thus every sample becomes a point on the PCA plot, and every minimal region becomes a coordinate in the data space and hence a coloring (a smoothed trend) on the PCA plot.

In the "transposed" problem, we turn the table by 90 degrees such that every row now corresponds to a minimal region and every column corresponds to a sample. Thus on the PCA plot every point is now a minimal region and the samples (or classes of samples) are visualized as smoothed trends.
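The applet performs this transposition itself, but the operation is easy to picture in Python. The sketch below uses a hypothetical helper name and a made-up table with two samples and three regions:

```python
from typing import List, Sequence, Tuple


def transpose_table(row_names: Sequence[str], column_names: Sequence[str],
                    rows: Sequence[Sequence[float]]
                    ) -> Tuple[List[str], List[str], List[List[float]]]:
    """Swap rows and columns; returns (new_row_names, new_column_names, new_rows)."""
    if any(len(r) != len(column_names) for r in rows) or len(rows) != len(row_names):
        raise ValueError("table shape does not match the given names")
    new_rows = [list(col) for col in zip(*rows)]
    return list(column_names), list(row_names), new_rows


# "Normal" problem: rows are samples, columns are minimal regions.
samples = ["Smpl1", "Smpl2"]
regions = ["REG0", "REG1", "REG2"]
table = [
    [1, 0, -1],
    [0, -1, -1],
]

new_rows_names, new_col_names, transposed = transpose_table(samples, regions, table)
print(new_rows_names)
print(transposed)
```

After the swap each minimal region is a row (a point on the plot) and each sample is a column (a coordinate, and hence a possible smoothed trend).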

Change the problem type from "Normal" to "Transposed" in the combobox. For linear PCA, this gives the following image:

The image can be interpreted as follows. Points (minimal regions) that are close to the "kernel" of the distribution are changed less on average. In the image above you can see, for example, that REG57 (locus 6q27) and REG132-134 (all three from locus 14q32) are frequently altered in this set of samples.

Now let us switch to the non-linear visualization of the transposed problem:

This non-linear visualization gives us a more detailed picture of the alterations.

Now let us visualize all four classes of samples (C1, C2, C3 and C4) by coloring. Here the size of a point denotes the gain/loss value of the minimal region, averaged over one class of samples. You can observe different patterns of genome alterations:

CLASS C1 CLASS C2

CLASS C3 CLASS C4

Contacts

Bioinformatics service of Institut Curie
Andrei Zinovyev