estimateReadFiltering

This tool estimates the number of reads that would be filtered under a given set of settings and prints the result to the terminal. It also tracks the number of singleton reads. The following metrics are always tracked, regardless of the options you specify (the output columns follow the same order):

  • Total reads (including unmapped)

  • Mapped reads

  • Reads in blacklisted regions (--blackListFileName)

The following metrics are estimated according to the --binSize and --distanceBetweenBins parameters:

  • Estimated mapped reads filtered (the total number of mapped reads filtered for any reason)

  • Alignments with a below threshold MAPQ (--minMappingQuality)

  • Alignments with at least one missing flag (--samFlagInclude)

  • Alignments with undesirable flags (--samFlagExclude)

  • Duplicates determined by deepTools (--ignoreDuplicates)

  • Duplicates marked externally (e.g., by picard)

  • Singletons (paired-end reads with only one mate aligning)

  • Wrong strand (due to --filterRNAstrand)

The sum of these may exceed the total number of reads, because a single alignment can be counted under more than one category. Note that alignments are sampled from bins of size --binSize spaced --distanceBetweenBins apart.
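The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the deepTools implementation: the function names are hypothetical, and it assumes that the distance is measured from the end of one bin to the start of the next.

```python
def sampled_bins(chrom_length, bin_size=1_000_000, distance_between_bins=10_000):
    """Return (start, end) pairs for the bins sampled on one chromosome.

    Assumption: consecutive bins are separated by a gap of
    distance_between_bins bases, so starts advance by bin_size + gap.
    """
    step = bin_size + distance_between_bins
    return [
        (start, min(start + bin_size, chrom_length))
        for start in range(0, chrom_length, step)
    ]


def sampled_fraction(chrom_length, bin_size=1_000_000, distance_between_bins=10_000):
    """Fraction of the chromosome covered by the sampled bins."""
    bins = sampled_bins(chrom_length, bin_size, distance_between_bins)
    return sum(end - start for start, end in bins) / chrom_length


if __name__ == "__main__":
    for start, end in sampled_bins(2_500_000):
        print(start, end)
    print(sampled_fraction(2_500_000))
```

Under this assumption, the default values sample most of each chromosome; increasing --distanceBetweenBins lowers the sampled fraction and the run time.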

usage: estimateReadFiltering -b sample1.bam sample2.bam
help: estimateReadFiltering -h / estimateReadFiltering --help

Required arguments

--bamfiles, -b

List of indexed bam files separated by spaces.

General arguments

--outFile, -o

The file to write results to. By default, results are printed to the console.

--sampleLabels

Labels for the samples. The default is to use the file name of the sample. Labels should be separated by spaces, and quoted if a label itself contains a space, e.g. --sampleLabels label-1 "label 2"

--smartLabels

Instead of manually specifying labels for the input BAM files, this causes deepTools to use the file name after removing the path and extension.
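As a rough illustration of the labelling rule, the sketch below strips the directory and the final extension from a path. The helper name smart_label is hypothetical, and removing only the last extension is an assumption about how multi-dot file names are handled.

```python
import os


def smart_label(path):
    """Derive a label from a file path by dropping the directory and extension.

    Assumption: only the final extension is removed, so
    "sample.sorted.bam" becomes "sample.sorted".
    """
    return os.path.splitext(os.path.basename(path))[0]


if __name__ == "__main__":
    print(smart_label("/data/run1/sample1.bam"))
    print(smart_label("sample.sorted.bam"))
```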

--binSize, -bs

Length in bases of the window used to sample the genome. (Default: 1000000)

--distanceBetweenBins, -n

To reduce the computation time, not every possible genomic bin is sampled. This option allows you to set the distance between bins actually sampled from. Larger numbers are sufficient for high coverage samples, while smaller values are useful for lower coverage samples. Note that if you specify a value that results in too few (<1000) reads sampled, the value will be decreased. (Default: 10000)

--numberOfProcessors, -p

Number of processors to use. Type "max/2" to use half the maximum number of processors or "max" to use all available processors. (Default: 1)

--verbose, -v

Set to see processing messages.

--version

Show the program's version number and exit.

Optional arguments

--filterRNAstrand

Possible choices: forward, reverse

Selects RNA-seq reads (single-end or paired-end) on the given strand. (Default: None)

--ignoreDuplicates

If set, reads that have the same orientation and start position will be considered only once. If reads are paired, the mate's position also has to coincide to ignore a read.
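The duplicate rule can be sketched as a key comparison. This is a simplified illustration under stated assumptions rather than the deepTools code: the Alignment record and count_duplicates function are hypothetical, and a set of previously seen keys stands in for whatever bookkeeping the tool actually uses.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record used only for this sketch."""
    chrom: str
    start: int
    is_reverse: bool
    mate_start: Optional[int] = None  # None for single-end reads


def count_duplicates(alignments):
    """Count alignments whose orientation, start and mate start were seen before."""
    seen = set()
    duplicates = 0
    for aln in alignments:
        # For paired reads the mate position is part of the key, so two reads
        # with the same start but different mate starts are not duplicates.
        key = (aln.chrom, aln.start, aln.is_reverse, aln.mate_start)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


if __name__ == "__main__":
    reads = [
        Alignment("chr2L", 100, False, 300),
        Alignment("chr2L", 100, False, 300),  # same start, strand and mate
        Alignment("chr2L", 100, False, 350),  # mate position differs
        Alignment("chr2L", 100, True, 300),   # orientation differs
    ]
    print(count_duplicates(reads))
```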

--minMappingQuality

If set, only reads that have a mapping quality score of at least this are considered.

--samFlagInclude

Include reads based on the SAM flag. For example, to get only reads that are the first mate, use a flag of 64. This is useful to count properly paired reads only once, as otherwise the second mate will also be considered for the coverage. (Default: None)

--samFlagExclude

Exclude reads based on the SAM flag. For example, to get only reads that map to the forward strand, use --samFlagExclude 16, where 16 is the SAM flag for reads that map to the reverse strand. (Default: None)
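SAM flags are bit fields, so the two options above amount to bitwise tests. The sketch below shows one plausible reading: a read passes --samFlagInclude only if every requested bit is set, and fails --samFlagExclude if any listed bit is set. The function name passes_flag_filters is hypothetical.

```python
def passes_flag_filters(flag, include=None, exclude=None):
    """Return True if a SAM flag satisfies the include and exclude masks."""
    if include is not None and (flag & include) != include:
        return False  # at least one required bit is missing
    if exclude is not None and (flag & exclude) != 0:
        return False  # at least one undesirable bit is set
    return True


if __name__ == "__main__":
    # 83 = paired (1) + proper pair (2) + reverse strand (16) + first mate (64)
    print(passes_flag_filters(83, include=64))
    # 147 = paired (1) + proper pair (2) + reverse strand (16) + second mate (128)
    print(passes_flag_filters(147, include=64))
    # Reverse-strand read rejected by --samFlagExclude 16
    print(passes_flag_filters(16, exclude=16))
```

Masks can be combined by adding the individual flag values, for example 64 + 16 = 80 to require first mates on the reverse strand.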

--blackListFileName, -bl

A BED or GTF file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered. Please note that you should adjust the effective genome size, if relevant.

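Background

Many tools in deepTools allow BAM files to be filtered by alignment mapping quality or other criteria. It is difficult to know in advance how these different settings will affect the number of filtered reads. estimateReadFiltering can therefore be used to estimate the number of reads in one or more BAM files that would be filtered according to one or more criteria. It can also be used to quickly estimate the duplication level in a BAM file.

Usage example

estimateReadFiltering needs one or more sorted and indexed BAM files and the desired filtering criteria.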

$ estimateReadFiltering -b paired_chr2L.bam \
--minMappingQuality 5 --samFlagInclude 16 \
--samFlagExclude 256 --ignoreDuplicates

By default, the output is printed to the screen. You can write it to a file instead with the -o option. The output is a tab-separated file:

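Sample	Total Reads	Mapped Reads	Alignments in blacklisted regions	Estimated mapped reads filtered	Below MAPQ	Missing Flags	Excluded Flags	Internally-determined Duplicates	Marked Duplicates	Singletons	Wrong strand
paired_chr2L.bam	12644	12589	0	6313.2	4114.0	6340.0	0.0	1163.0	0.0	55.0	0.0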

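The columns are as follows:

  • Total reads (including unmapped)

  • Mapped reads

  • Reads in blacklisted regions (--blackListFileName)

The following metrics are estimated according to the --binSize and --distanceBetweenBins parameters:

  • Estimated mapped reads filtered (the total number of mapped reads filtered for any reason)

  • Alignments with a below threshold MAPQ (--minMappingQuality)

  • Alignments with at least one missing flag (--samFlagInclude)

  • Alignments with undesirable flags (--samFlagExclude)

  • Duplicates determined by deepTools (--ignoreDuplicates)

  • Duplicates marked externally (e.g., by picard)

  • Singletons (paired-end reads with only one mate aligning)

  • Wrong strand (due to --filterRNAstrand)

The sum of these may be more than the total number of reads. Note that alignments are sampled from bins of size --binSize spaced --distanceBetweenBins apart.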
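Because the output is tab-separated, it can be loaded with a standard CSV reader. The sketch below parses the example output shown above and compares the sum of the per-reason columns with the overall estimate. The header spellings and the function name parse_filtering_table are assumptions, so adjust the keys to match your own file.

```python
import csv
import io

# Header names and values mirror the example output; the exact header
# spelling is an assumption and may differ in your file.
HEADER = [
    "Sample", "Total Reads", "Mapped Reads",
    "Alignments in blacklisted regions", "Estimated mapped reads filtered",
    "Below MAPQ", "Missing Flags", "Excluded Flags",
    "Internally-determined Duplicates", "Marked Duplicates",
    "Singletons", "Wrong strand",
]
ROW = ["paired_chr2L.bam", "12644", "12589", "0", "6313.2", "4114.0",
       "6340.0", "0.0", "1163.0", "0.0", "55.0", "0.0"]
EXAMPLE_OUTPUT = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"

# Per-reason columns: everything after the overall estimate.
REASON_COLUMNS = HEADER[5:]


def parse_filtering_table(text):
    """Return {sample: {column: float}} from tab-separated output text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    table = {}
    for row in reader:
        sample = row.pop("Sample")
        table[sample] = {name: float(value) for name, value in row.items()}
    return table


if __name__ == "__main__":
    metrics = parse_filtering_table(EXAMPLE_OUTPUT)["paired_chr2L.bam"]
    per_reason_total = sum(metrics[name] for name in REASON_COLUMNS)
    # The per-reason sum can exceed the overall estimate because one
    # alignment may be counted under several categories.
    print(per_reason_total, metrics["Estimated mapped reads filtered"])
```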

deepTools Galaxy: http://deeptools.ie-freiburg.mpg.de

Code on GitHub: https://github.com/deeptools/deepTools/