计算机
This tool calculates scores per genome regions and prepares an intermediate file that can be used with plotHeatmap
and plotProfiles
.
Typically, the genome regions are genes, but any other regions defined in a BED file can be used.
computeMatrix accepts multiple score files (bigWig format) and multiple regions files (BED format).
This tool can also be used to filter and sort regions according
to their score.
To learn more about the specific parameters, type:
$ computeMatrix reference-point --help or
$ computeMatrix scale-regions --help
usage: computeMatrix [-h] [--version] ...
Named Arguments
- --version
show program's version number and exit
Commands
Possible choices: scale-regions, reference-point
Sub-commands
scale-regions
In the scale-regions mode, all regions in the BED file are stretched or shrunken to the length (in bases) indicated by the user.
An example usage is:
computeMatrix scale-regions -S <biwig file(s)> -R <bed file> -b 1000
Required arguments
- --regionsFileName, -R
File name or names, in BED or GTF format, containing the regions to plot. If multiple bed files are given, each one is considered a group that can be plotted separately. Also, adding a "#" symbol in the bed file causes all the regions until the previous "#" to be considered one group.
- --scoreFileName, -S
bigWig file(s) containing the scores to be plotted. Multiple files should be separated by spaced. BigWig files can be obtained by using the bamCoverage or bamCompare tools. More information about the bigWig file format can be found at http://genome.ucsc.edu/goldenPath/help/bigWig.html
Output options
- --outFileName, -out, -o
File name to save the gzipped matrix file needed by the "plotHeatmap" and "plotProfile" tools.
- --outFileNameMatrix
If this option is given, then the matrix of values underlying the heatmap will be saved using the indicated name, e.g. IndividualValues.tab.This matrix can easily be loaded into R or other programs.
- --outFileSortedRegions
File name in which the regions are saved after skiping zeros or min/max threshold values. The order of the regions in the file follows the sorting order selected. This is useful, for example, to generate other heatmaps keeping the sorting of the first heatmap. Example: Heatmap1sortedRegions.bed
Optional arguments
- --version
show program's version number and exit
- --regionBodyLength, -m
Distance in bases to which all regions will be fit. (Default: 1000)
- --startLabel
Label shown in the plot for the start of the region. Default is TSS (transcription start site), but could be changed to anything, e.g. "peak start". Note that this is only useful if you plan to plot the results yourself and not, for example, with plotHeatmap, which will override this. (Default: "TSS")
- --endLabel
Label shown in the plot for the region end. Default is TES (transcription end site). See the --startLabel option for more information. (Default: "TES")
- --beforeRegionStartLength, -b, --upstream
Distance upstream of the start site of the regions defined in the region file. If the regions are genes, this would be the distance upstream of the transcription start site. (Default: 0)
- --afterRegionStartLength, -a, --downstream
Distance downstream of the end site of the given regions. If the regions are genes, this would be the distance downstream of the transcription end site. (Default: 0)
- --unscaled5prime
Number of bases at the 5-prime end of the region to exclude from scaling. By default, each region is scaled to a given length (see the --regionBodyLength option). In some cases it is useful to look at unscaled signals around region boundaries, so this setting specifies the number of unscaled bases on the 5-prime end of each boundary. (Default: 0)
- --unscaled3prime
Like --unscaled5prime, but for the 3-prime end. (Default: 0)
- --binSize, -bs
Length, in bases, of the non-overlapping bins for averaging the score over the regions length. (Default: 10)
- --sortRegions
Possible choices: descend, ascend, no, keep
Whether the output file should present the regions sorted. The default is to not sort the regions. Note that this is only useful if you plan to plot the results yourself and not, for example, with plotHeatmap, which will override this. Note also that unsorted output will be in whatever order the regions happen to be processed in and not match the order in the input files. If you require the output order to match that of the input regions, then either specify "keep" or use computeMatrixOperations to resort the results file. (Default: "keep")
- --sortUsing
Possible choices: mean, median, max, min, sum, region_length
Indicate which method should be used for sorting. The value is computed for each row.Note that the region_length option will lead to a dotted line within the heatmap that indicates the end of the regions. (Default: "mean")
- --sortUsingSamples
List of sample numbers (order as in matrix), that are used for sorting by --sortUsing, no value uses all samples, example: --sortUsingSamples 1 3
- --averageTypeBins
Possible choices: mean, median, min, max, std, sum
Define the type of statistic that should be used over the bin size range. The options are: "mean", "median", "min", "max", "sum" and "std". The default is "mean". (Default: "mean")
- --missingDataAsZero
If set, missing data (NAs) will be treated as zeros. The default is to ignore such cases, which will be depicted as black areas in a heatmap. (see the --missingDataColor argument of the plotHeatmap command for additional options).
- --skipZeros
Whether regions with only scores of zero should be included or not. Default is to include them.
- --minThreshold
Numeric value. Any region containing a value that is less than or equal to this will be skipped. This is useful to skip, for example, genes where the read count is zero for any of the bins. This could be the result of unmappable areas and can bias the overall results. (Default: None)
- --maxThreshold
Numeric value. Any region containing a value greater than or equal to this will be skipped. The maxThreshold is useful to skip those few regions with very high read counts (e.g. micro satellites) that may bias the average values. (Default: None)
- --blackListFileName, -bl
A BED file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered.
- --samplesLabel
Labels for the samples. This will then be passed to plotHeatmap and plotProfile. The default is to use the file name of the sample. The sample labels should be separated by spaces and quoted if a label itselfcontains a space E.g. --samplesLabel label-1 "label 2"
- --smartLabels
Instead of manually specifying labels for the input bigWig and BED/GTF files, this causes deepTools to use the file name after removing the path and extension.
- --quiet, -q
Set to remove any warning or processing messages.
- --verbose
Being VERY verbose in the status messages. --quiet will disable this.
- --scale
If set, all values are multiplied by this number. (Default: 1)
- --numberOfProcessors, -p
Number of processors to use. Type "max/2" to use half the maximum number of processors or "max" to use all available processors. (Default: 1)
GTF/BED12 options
- --metagene
When either a BED12 or GTF file are used to provide regions, perform the computation on the merged exons, rather than using the genomic interval defined by the 5-prime and 3-prime most transcript bound (i.e., columns 2 and 3 of a BED file). If a BED3 or BED6 file is used as input, then columns 2 and 3 are used as an exon. (Default: False)
- --transcriptID
When a GTF file is used to provide regions, only entries with this value as their feature (column 3) will be processed as transcripts. (Default: "transcript")
- --exonID
When a GTF file is used to provide regions, only entries with this value as their feature (column 3) will be processed as exons. CDS would be another common value for this. (Default: "exon")
- --transcript_id_designator
Each region has an ID (e.g., ACTB) assigned to it, which for BED files is either column 4 (if it exists) or the interval bounds. For GTF files this is instead stored in the last column as a key:value pair (e.g., as 'transcript_id "ACTB"', for a key of transcript_id and a value of ACTB). In some cases it can be convenient to use a different identifier. To do so, set this to the desired key. (Default: "transcript_id")
reference-point
Reference-point refers to a position within a BED region (e.g., the starting point). In this mode, only those genomicpositions before (upstream) and/or after (downstream) of the reference point will be plotted.
An example usage is:
computeMatrix reference-point -S <biwig file(s)> -R <bed file> -a 3000 -b 3000
Required arguments
- --regionsFileName, -R
File name or names, in BED or GTF format, containing the regions to plot. If multiple bed files are given, each one is considered a group that can be plotted separately. Also, adding a "#" symbol in the bed file causes all the regions until the previous "#" to be considered one group.
- --scoreFileName, -S
bigWig file(s) containing the scores to be plotted. Multiple files should be separated by spaced. BigWig files can be obtained by using the bamCoverage or bamCompare tools. More information about the bigWig file format can be found at http://genome.ucsc.edu/goldenPath/help/bigWig.html
Output options
- --outFileName, -out, -o
File name to save the gzipped matrix file needed by the "plotHeatmap" and "plotProfile" tools.
- --outFileNameMatrix
If this option is given, then the matrix of values underlying the heatmap will be saved using the indicated name, e.g. IndividualValues.tab.This matrix can easily be loaded into R or other programs.
- --outFileSortedRegions
File name in which the regions are saved after skiping zeros or min/max threshold values. The order of the regions in the file follows the sorting order selected. This is useful, for example, to generate other heatmaps keeping the sorting of the first heatmap. Example: Heatmap1sortedRegions.bed
Optional arguments
- --version
show program's version number and exit
- --referencePoint
Possible choices: TSS, TES, center
The reference point for the plotting could be either the region start (TSS), the region end (TES) or the center of the region. Note that regardless of what you specify, plotHeatmap/plotProfile will default to using "TSS" as the label. (Default: "TSS")
- --beforeRegionStartLength, -b, --upstream
Distance upstream of the reference-point selected. (Default: 500)
- --afterRegionStartLength, -a, --downstream
Distance downstream of the reference-point selected. (Default: 1500)
- --nanAfterEnd
If set, any values after the region end are discarded. This is useful to visualize the region end when not using the scale-regions mode and when the reference-point is set to the TSS.
- --binSize, -bs
Length, in bases, of the non-overlapping bins for averaging the score over the regions length. (Default: 10)
- --sortRegions
Possible choices: descend, ascend, no, keep
Whether the output file should present the regions sorted. The default is to not sort the regions. Note that this is only useful if you plan to plot the results yourself and not, for example, with plotHeatmap, which will override this. Note also that unsorted output will be in whatever order the regions happen to be processed in and not match the order in the input files. If you require the output order to match that of the input regions, then either specify "keep" or use computeMatrixOperations to resort the results file. (Default: "keep")
- --sortUsing
Possible choices: mean, median, max, min, sum, region_length
Indicate which method should be used for sorting. The value is computed for each row.Note that the region_length option will lead to a dotted line within the heatmap that indicates the end of the regions. (Default: "mean")
- --sortUsingSamples
List of sample numbers (order as in matrix), that are used for sorting by --sortUsing, no value uses all samples, example: --sortUsingSamples 1 3
- --averageTypeBins
Possible choices: mean, median, min, max, std, sum
Define the type of statistic that should be used over the bin size range. The options are: "mean", "median", "min", "max", "sum" and "std". The default is "mean". (Default: "mean")
- --missingDataAsZero
If set, missing data (NAs) will be treated as zeros. The default is to ignore such cases, which will be depicted as black areas in a heatmap. (see the --missingDataColor argument of the plotHeatmap command for additional options).
- --skipZeros
Whether regions with only scores of zero should be included or not. Default is to include them.
- --minThreshold
Numeric value. Any region containing a value that is less than or equal to this will be skipped. This is useful to skip, for example, genes where the read count is zero for any of the bins. This could be the result of unmappable areas and can bias the overall results. (Default: None)
- --maxThreshold
Numeric value. Any region containing a value greater than or equal to this will be skipped. The maxThreshold is useful to skip those few regions with very high read counts (e.g. micro satellites) that may bias the average values. (Default: None)
- --blackListFileName, -bl
A BED file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered.
- --samplesLabel
Labels for the samples. This will then be passed to plotHeatmap and plotProfile. The default is to use the file name of the sample. The sample labels should be separated by spaces and quoted if a label itselfcontains a space E.g. --samplesLabel label-1 "label 2"
- --smartLabels
Instead of manually specifying labels for the input bigWig and BED/GTF files, this causes deepTools to use the file name after removing the path and extension.
- --quiet, -q
Set to remove any warning or processing messages.
- --verbose
Being VERY verbose in the status messages. --quiet will disable this.
- --scale
If set, all values are multiplied by this number. (Default: 1)
- --numberOfProcessors, -p
Number of processors to use. Type "max/2" to use half the maximum number of processors or "max" to use all available processors. (Default: 1)
GTF/BED12 options
- --metagene
When either a BED12 or GTF file are used to provide regions, perform the computation on the merged exons, rather than using the genomic interval defined by the 5-prime and 3-prime most transcript bound (i.e., columns 2 and 3 of a BED file). If a BED3 or BED6 file is used as input, then columns 2 and 3 are used as an exon. (Default: False)
- --transcriptID
When a GTF file is used to provide regions, only entries with this value as their feature (column 3) will be processed as transcripts. (Default: "transcript")
- --exonID
When a GTF file is used to provide regions, only entries with this value as their feature (column 3) will be processed as exons. CDS would be another common value for this. (Default: "exon")
- --transcript_id_designator
Each region has an ID (e.g., ACTB) assigned to it, which for BED files is either column 4 (if it exists) or the interval bounds. For GTF files this is instead stored in the last column as a key:value pair (e.g., as 'transcript_id "ACTB"', for a key of transcript_id and a value of ACTB). In some cases it can be convenient to use a different identifier. To do so, set this to the desired key. (Default: "transcript_id")
- An example usage is:
computeMatrix reference-point -S <bigwig file(s)> -R <bed file(s)> -b 1000
细节
computeMatrix
有两种主要的使用模式:
for computing the signal distribution relative to a point (
reference-point
), e.g., the beginning or end of each genomic regionfor computing the signal over a set of regions (
scale-regions
) where all regions are scaled to the same size

computeMatrix
紧密连接到 plotHeatmap
和 plotProfile
:它获取您想要绘制和计算相应数据矩阵的所有信号文件和所有基因组区域的值。

除了生成中间文件外,还为 plotHeatmap
和 plotProfile
, computeMatrix
也可用于简单地输出热量的基础值或 筛选和排序床位文件 例如,使用 --skipZeros
以及 --sortUsing
参数。
下表总结了三种工具可用的可选输出类型。
可选输出类型 |
command |
computeMatrix |
plotHeatmap |
plotProfile |
热量的基础值 |
|
对 |
对 |
不 |
配置文件的基础值 |
|
不 |
对 |
对 |
排序和/或筛选区域 |
|
对 |
对 |
对 |
小技巧
computeMatrix
可以使用多个线程 (-p
选项),这显著缩短了计算值的时间。
注意
从3.0版开始,ComputeMatrix生成输出,每个样本都有标签。用该版本或更高版本生成的矩阵不能用于旧版本的 plotHeatmap
或任何其他Deeptools程序。
备注
computeMatrix
如果您的BED文件包含该列(GTF文件始终包含strand),则将正确处理strand信息。对于 --metagene
要想工作,您需要一个BED12(包括第11和12列)或GTF文件作为输入。GFF与GTF格式不相同!
实例
下面的示例将为您提供一些最常用的设置 computeMatrix
. 如你所见, computeMatrix
提供了无数的调整,可能对你比“仅仅”计算热图矩阵更有用。
示例1:单输入文件(参考点模式)
在这里,我们从一个 大人物 一个单人间 BED 文件,即, computeMatrix
威尔:
从bed文件中指定的区域开始
add the values indicated with
--beforeRegionStartLength
(-b
) and--afterRegionStartLength
(-a
)将结果区域分成50个BP箱(可通过 (
--binSize
)根据Bigwig文件中给出的分数计算平均分数(分数类型可以通过
--averageTypeBins
)写出每行对应于bed文件中一个区域的值(注意,您可以跳过覆盖率为零的区域;也可以进行排序)
$ computeMatrix reference-point \ # choose the mode
--referencePoint TSS \ # alternatives: TES, center
-b 3000 -a 10000 \ # define the region you are interested in
-R testFiles/genes.bed \
-S testFiles/log2ratio_H3K4Me3_chr19.bw \
--skipZeros \
-o matrix1_H3K4me3_l2r_TSS.gz \ # to be used with plotHeatmap and plotProfile
--outFileSortedRegions regions1_H3K4me3_l2r_genes.bed
让我们更仔细地看看各地区的产出:
$ wc -l testFiles/genes.bed # original file
18257 testFiles/genes.bed
$ wc -l regions1_H3K4me3_l2r_genes.bed # file generated by computeMatrix
12423 regions1_H3K4me3_l2r_genes.bed
如您所见,区域的数量急剧减少。其余的基因恰好是19号染色体上的基因,至少有一个重叠的读码。这是有意义的,因为上面使用的bigwig文件只包含对19号染色体的读取。
# the original file contained genes for chr.19 and chr.X
$ cut -f 1 testFiles/genes.bed | sort | uniq -c
12439 19
5818 X
# the regions used for the computation of the matrix for the heatmap are all located on chr.19 due to the --skipZeros setting (see above)
$ cut -f 1 regions1_H3K4me3_l2r_genes.bed | sort | uniq -c
1 #genes
12422 19
示例2:多个输入文件(缩放区域模式)
$ deepTools2.0/bin/computeMatrix scale-regions \
-R genes_chr19_firstHalf.bed genes_chr19_secondHalf.bed \ # separate multiple files with spaces
-S testFiles/log2ratio_*.bw \ or use the wild card approach
-b 3000 -a 3000 \
--regionBodyLength 5000 \
--skipZeros -o matrix2_multipleBW_l2r_twoGroups_scaled.gz \
--outFileNameMatrix matrix2_multipleBW_l2r_twoGroups_scaled.tab \
--outFileSortedRegions regions2_multipleBW_l2r_twoGroups_genes.bed
请注意,报告区域的坐标将与最初提供的文件中的坐标相同,而不是用于heatmap矩阵的区域。
由两个单独文件提供的区域组将合并为一个:
$ head -n 2 regions2_multipleBW_l2r_twoGroups_genes.bed
19 60104 70951 ENST00000592209 0.0 - genes_chr19_firstHalf
19 60950 70966 ENST00000606728 0.0 - genes_chr19_firstHalf
$ tail -n 3 regions2_multipleBW_l2r_twoGroups_genes.bed
19 59108549 59110722 ENST00000596427 0.0 - genes_chr19_secondHalf
19 59110333 59110802 ENST00000464061 0.0 + genes_chr19_secondHalf
#genes_chr19_secondHalf
小技巧
更多例子 在我们的 Gallery .
deepTools Galaxy <http://deeptools.ie-freiburg.mpg.de> _. |
code @ github <https://github.com/deeptools/deepTools/> _. |