MICSA: ChIP-Seq data Analysis by de novo motif identification

MICSA is a package for the identification of transcription factor binding sites in ChIP-Seq data, developed by the Computational Systems Biology of Cancer group at the Bioinformatics Laboratory of Institut Curie (Paris).

Cite: Boeva V, Surdez D, Guillon N, Tirode F, Fejes AP, Delattre O, Barillot E. De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis. Nucleic Acids Res. 2010 Jun 1;38(11):e126. Epub 2010 Apr 7.

Download the article: HTML, PDF.

 MICSA HOWTOs:

1. Installing the MICSA package

2. Running MICSA on NRSF test data

3. Running MICSA on EWS-FLI1 dataset

4. Reading MICSA's output

 Contact:

Valentina Boeva: micsa@curie.fr

phone: +33.1.56.24.69.31

 Distributions:

MICSA is distributed as an open source application. However, it calls MEME, which is free only for non-commercial use. Please check the MEME license before using it.

- current release of MICSA without MEME and FindPeaks [ download ]

- precompiled versions for Linux64, Linux32 and Windows OS

 Test data:

- NRSF human ChIP-Seq data (Neuron-restrictive silencer factor, hg18) and control data (RX_noIP, hg18) [ download ] (alignment by Eland)

- EWS/FLI1 human ChIP-Seq data (Oncoprotein of Ewing sarcoma, hg18) and control data (MON, hg18) (alignment by MAQ)

 Installation:

  1. Download and install the latest version of MEME from the MEME website. MEME is free only for non-commercial use. Please check the MEME license before using.
  2. Add the directory containing meme.bin (probably "meme_your_version\src") to your PATH.

    - type: export PATH=$PATH:/YOUR_PATH_TO_MEME_DIRECTORY

    On Linux, make meme.bin executable: chmod 755 meme.bin
  3. Download the MICSA program. The archive includes a version of FindPeaks.jar.
  4. If you want to run the latest version of FindPeaks, download and install it from the FindPeaks website.
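
The PATH setup in step 2 can be sketched as the following shell session. The directory `$HOME/meme/src` is a placeholder chosen for illustration, not a path taken from the MEME distribution; substitute the directory that actually contains meme.bin on your system.

```shell
# Placeholder location: replace with the directory that contains meme.bin.
MEME_DIR="$HOME/meme/src"

# Append the directory to PATH for the current shell session.
export PATH="$PATH:$MEME_DIR"

# On Linux, make the binary executable (skipped if the file is not there yet).
if [ -f "$MEME_DIR/meme.bin" ]; then
    chmod 755 "$MEME_DIR/meme.bin"
fi

# Check whether the shell can now find meme.bin.
if command -v meme.bin >/dev/null 2>&1; then
    echo "meme.bin found: $(command -v meme.bin)"
else
    echo "meme.bin not found on PATH; check MEME_DIR"
fi
```

The export only affects the current session. To make it permanent, add the export line to your shell startup file (for example ~/.bashrc).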

 Running MICSA on NRSF test data

 MICSA by graphical interface

Launch the GUI and fill in the form as shown in the example:

Click "Check input information"

Click "Start"



Files and parameters in this example:

 MICSA by command line

  1. Run FindPeaks on your ChIP and control data.

    Read the FindPeaks Manual for more options.

    FindPeaks can output either the entire data track or the data track split by chromosome. If you have a large dataset, it is best to upload only the chromosomes you are interested in: the UCSC Genome Browser does not accept large files, so attempting to load an excessively large wig file may cause the upload to fail altogether. We suggest using the “by-chromosome” option, triggered by the “-one_per” flag on the FindPeaks command line.

    UCSC Genome Browser sessions expire after about 3 days of inactivity. You can keep a session alive longer by refreshing it (by clicking the "refresh" button or by zooming in or out).

  2. Filter out peri-centromeric or satellite regions. The following masking data sets are available for use in MICSA:
    • human centromeric repeats (hg18, RepeatMasker dataset), download
    • human satellite repeats (hg18, RepeatMasker dataset), download
    • human whole peri-centromeric and peri-heterochromatic regions (hg18, RepeatMasker dataset), download
      type:

      java DeleteRegions -f chip_triangle_standard.peaks -r hg18_masked_Centr

      java DeleteRegions -f control_triangle_standard.peaks -r hg18_masked_Centr

  3. Create a summary of the peak distribution in the ChIP and control data:

    java Summary -f chip_triangle_standard.peaks -c control_triangle_standard.peaks -r 0.735371287

    Mandatory parameters:
    -f input file
    -c control file
    Optional parameters:
    -r (default 1) ratio of the number of tags in the ChIP data to the number of tags in the control data

    The tag counts can be found in the "meta_info.txt" file produced by FindPeaks.

  4. Filter out peaks occurring both in ChIP and Control data:

    java FilterPeaks -f chip_triangle_standard.peaks -c control_triangle_standard.peaks -t 3.5

    Mandatory parameters:
    -f input file
    -c control file
    Optional parameters:
    -w (default 0) flanks to add to the window length in the control data
    -v (default 2) enrichment fold over control data

  5. Run MICSA.jar

    java -jar -Xmx2G PATH_TO_MICSA/micsa.jar -name micsaTest -f wig/chip_triangle_standard.peaks -n 50 -o "outputDir" -l "wig/FindPeaksSummary.txt" -g "PATH/humanGenome" -w "wig/chip_triangle_standard.wig.gz"

    Mandatory parameters:
    -f input file
    -n maximal number of expected false positives (omit if "-fdr" is used)
    -fdr maximal value of FDR (use instead of "-n")
    -o path to the output directory
    -l file with summary
    -g path to the genome directory
    -w wig file produced by FindPeaks
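
Steps 3 and 5 above can be chained in a small shell script. The following is a minimal sketch, not part of the MICSA distribution. The tag counts are made-up numbers (read the real ones from the FindPeaks "meta_info.txt" file), the FDR value 0.05 is only an example, and the helper names `run` and `micsa_cmd` and the `DRY_RUN` switch are hypothetical. Only flags that appear in the commands above are used, and PATH_TO_MICSA and PATH/humanGenome remain placeholders. By default the script only prints the commands it would run.

```shell
# Hypothetical tag counts, for illustration only.
chip_tags=3000000
control_tags=4000000

# -r for Summary: number of ChIP tags divided by number of control tags.
ratio=$(LC_ALL=C awk -v a="$chip_tags" -v b="$control_tags" \
    'BEGIN { printf "%.6f", a / b }')

# Print commands instead of executing them unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build the MICSA call with either -n or -fdr, never both.
# $1: "n" or "fdr"; $2: the threshold value.
micsa_cmd() {
    case "$1" in
        n|fdr) ;;
        *) echo "first argument must be n or fdr" >&2; return 1 ;;
    esac
    run java -jar -Xmx2G PATH_TO_MICSA/micsa.jar -name micsaTest \
        -f wig/chip_triangle_standard.peaks "-$1" "$2" -o outputDir \
        -l wig/FindPeaksSummary.txt -g PATH/humanGenome \
        -w wig/chip_triangle_standard.wig.gz
}

run java Summary -f chip_triangle_standard.peaks \
    -c control_triangle_standard.peaks -r "$ratio"
micsa_cmd n 50      # at most 50 expected false positives
micsa_cmd fdr 0.05  # alternatively, cap the FDR instead of using -n
```

With the made-up counts above, the ratio passed to Summary is 3000000 / 4000000 = 0.750000. Setting DRY_RUN=0 executes the commands instead of printing them.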

 Running MICSA on EWS-FLI1 dataset

 By graphical interface

  • Download ChIP and control data for EWS-FLI1.

  • Download and unpack the genome sequences in FASTA format: hg18 for EWS-FLI1, or mm9 if you have mouse data.
  • Run the GUI for the MICSA pipeline:

    type: java -Xmx1500m -jar micsaGUI.jar

    or run the MICSA pipeline from the command line.

    Fill in the form as in the example:

    Click "Check input information"

    Click "Start"



    Files and parameters in this example:

    • File with ChIP data: 080221_pgu56.map. ChIP-Seq data for EWS-FLI1 (download), aligned by Maq to the Human Genome assembly 18 (Mar. 2006).

    • File with control data: 080221_PGU-3.txt. Control data (download), aligned by Maq to the Human Genome assembly 18 (Mar. 2006).

    • Black list of genomic regions: hg18_masked_Centr (download hg18_masked_Centr). File with positions of pericentromeric repeats (e.g., alpha-satellites) in hg18, downloaded from the UCSC Genome Browser.
      You can use other files for masking, for example hg18 satellites or hg18 whole pericentromeric and heterochromatin regions.
    • Output directory: select your output directory. If the specified directory does not exist, it will be created.
    • Genome directory: hg18. Directory with human genome files. The human genome (assembly Mar. 2006) can be downloaded here.

    • Alignment format: Maq 64. Specify the alignment format of your data; the EWS-FLI1 data are in the Maq 64 format.
    • Minimal, maximal and average length of DNA fragment: optional parameters. Set them if you know the length distribution of the DNA fragments you sequenced.
    • Minimal coverage of potential binding site by overlapping DNA fragments: how many overlapping DNA fragments are needed to keep a peak. This value is roughly equal to the minimal number of mapped DNA reads in a window (the minimal value is 3).
    • Ratio for filtering peaks using the control data: when a peak in the ChIP data overlaps a peak in the control data, the ChIP peak is discarded if its height divided by the height of the control peak is smaller than the specified ratio.
    • Number of peaks to search motif within: maximal number of peaks that MEME will use to call motifs.
    • Number of false positive peaks: leave blank if you want to use FDR instead.
    • FDR: false discovery rate you allow in the output.
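
The control-based filtering rule described in the list above can be illustrated with a few lines of awk. This is only a sketch of the rule as stated, not MICSA's implementation: the three-column input (peak name, ChIP peak height, height of the overlapping control peak, with 0 meaning no overlapping control peak), the peak names, and the function name `filter_by_control` are all invented for the example.

```shell
# Keep a peak if it has no overlapping control peak, or if
# ChIP height / control height is at least the given ratio.
# $1: ratio threshold; stdin: name, ChIP height, control height.
filter_by_control() {
    LC_ALL=C awk -v ratio="$1" '
        $3 == 0 || $2 / $3 >= ratio { print $1 }
    '
}

# Made-up peaks for illustration.
kept=$(printf '%s\n' \
    'peakA 12 0' \
    'peakB 10 2' \
    'peakC 6 4' | filter_by_control 2)
echo "$kept"
```

With a ratio of 2, peakC is dropped because 6 / 4 = 1.5 is below the threshold, while peakA (no control peak) and peakB (10 / 2 = 5) are kept.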

 MICSA's output

 MICSA outputs information about motifs ("motifs.txt" file) and a file with peak coordinates. Information for each peak:
  • chromosome
  • start
  • end
  • maximal depth of coverage by overlapping DNA fragments
  • length
  • score
 The score is small for high-confidence peaks.
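
Because smaller scores mean higher confidence, the most reliable peaks can be listed by sorting on the score column. The sketch below assumes that the peak file is tab-separated with the six fields in the order listed above (score in column 6); check this against your own output before relying on it. The function name `top_peaks` and the example rows are invented for illustration.

```shell
# Print the n most confident peaks (smallest score first).
# $1: peak file (assumed tab-separated, score in column 6); $2: number of peaks.
# -g is the general numeric sort of GNU/BSD sort; it also orders values
# written in exponent notation such as 1e-12.
top_peaks() {
    LC_ALL=C sort -t "$(printf '\t')" -k6,6g "$1" | head -n "$2"
}

# Made-up example file; real coordinates and scores will differ.
peaks_file=$(mktemp)
printf 'chr1\t1000\t1200\t15\t200\t1e-3\n'  >  "$peaks_file"
printf 'chr2\t5000\t5300\t40\t300\t1e-12\n' >> "$peaks_file"
printf 'chr3\t7000\t7150\t8\t150\t0.2\n'    >> "$peaks_file"

# Chromosomes of the two best-scoring peaks.
best=$(top_peaks "$peaks_file" 2 | cut -f1)
echo "$best"
rm -f "$peaks_file"
```

For the three made-up rows, the two peaks with the smallest scores are the ones on chr2 and chr1, in that order.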

In the GUI version, your results appear in a window like this:


Last modified: June 24 2009