TrainImagesClassifier¶
Train a classifier from multiple pairs of images and training vector data.
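Description¶
Train a classifier from multiple pairs of images and training vector data. Samples are composed of pixel values in each band, optionally centered and reduced using an XML statistics file produced by the ComputeImagesStatistics application.
The training vector data must contain polygons with a positive integer field representing the class label. The name of this field can be set using the Class label field parameter.
Training and validation sample lists are built such that each class is equally represented in both lists. One parameter controls the ratio between the number of samples in the training set and the number of samples in the validation set. Two parameters manage the size of the training and validation sets per class and per image.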
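In the validation process, the confusion matrix is organized the following way:
- Rows: reference labels,
- Columns: produced labels.
In the header of the optional confusion matrix output file, the validation (reference) and predicted (produced) class labels are ordered according to the rows/columns of the confusion matrix.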
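As a rough illustration of this layout, the sketch below builds a confusion matrix with reference labels on rows and produced labels on columns. The function name `confusion_matrix` and the sample labels are hypothetical, and ordering labels in ascending order is an assumption of the sketch; this is not the application's own code.

```python
def confusion_matrix(reference, produced):
    """Count (reference, produced) label pairs.

    Rows follow the reference labels and columns the produced labels,
    both in ascending label order (an assumption made for this sketch).
    """
    labels = sorted(set(reference) | set(produced))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for ref, prod in zip(reference, produced):
        matrix[index[ref]][index[prod]] += 1
    return labels, matrix


# Hypothetical validation samples: one sample of class 1 is labeled as 2.
reference = [1, 1, 2, 2, 3]
produced = [1, 2, 2, 2, 3]
labels, matrix = confusion_matrix(reference, produced)

# Overall accuracy is the share of samples on the diagonal.
overall_accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(reference)
```

Reading along a row shows how the samples of one reference class were distributed among the produced labels.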
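This application is based on LibSVM, OpenCV Machine Learning, and Shark ML. The output of this application is a text model file, whose format corresponds to the ML model type chosen. There is no image or vector data output.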
Parameters¶
Input and output data¶
This group of parameters allows setting input and output data.
Input Image List -io.il image1 image2...
Mandatory
A list of input images.
Input Vector Data List -io.vd vectorfile1 vectorfile2...
Mandatory
A list of vector data to select the training samples.
Validation Vector Data List -io.valid vectorfile1 vectorfile2...
A list of vector data to select the validation samples.
Input XML image statistics file -io.imstat filename [dtype]
XML file containing mean and variance of each feature.
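To make "centered and reduced" concrete, here is a minimal sketch of the usual per-band normalization, assuming the per-band mean and variance have already been read from the statistics file. The helper names and the numeric values are hypothetical, and parsing the XML file is deliberately left out.

```python
import math


def stddevs_from_variances(variances):
    """Convert per-band variances to standard deviations."""
    return [math.sqrt(v) for v in variances]


def center_and_reduce(sample, means, stddevs):
    """Subtract the band mean and divide by the band standard deviation."""
    return [(value - mean) / std for value, mean, std in zip(sample, means, stddevs)]


# Hypothetical two-band statistics and one pixel sample.
means = [8.0, 20.0]
stddevs = stddevs_from_variances([4.0, 25.0])
normalized = center_and_reduce([10.0, 20.0], means, stddevs)
```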
Output model -io.out filename [dtype]
Mandatory
Output file containing the model estimated (.txt format).
Output confusion matrix or contingency table -io.confmatout filename [dtype]
Output file containing the confusion matrix or contingency table (.csv format). The contingency table is output when an unsupervised algorithm is used; otherwise the confusion matrix is output.
Temporary files cleaning -cleanup bool
Default value: true
If activated, the application will try to clean all temporary files it created.
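Training and validation samples parameters¶
This group of parameters allows you to set training and validation sample lists parameters.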
Maximum training sample size per class -sample.mt int
Default value: 1000
Maximum size per class (in pixels) of the training sample list (default = 1000) (no limit = -1). If equal to -1, then the maximal size of the available training sample list per class will be equal to the surface area of the smallest class multiplied by the training sample ratio.
Maximum validation sample size per class -sample.mv int
Default value: 1000
Maximum size per class (in pixels) of the validation sample list (default = 1000) (no limit = -1). If equal to -1, then the maximal size of the available validation sample list per class will be equal to the surface area of the smallest class multiplied by the validation sample ratio.
Bound sample number by minimum -sample.bm int
Default value: 1
Bound the number of samples for each class by the number of samples available in the smallest class. Proportions between training and validation are respected. Default is true (=1).
Training and validation sample ratio -sample.vtr float
Default value: 0.5
Ratio between training and validation samples (0.0 = all training, 1.0 = all validation) (default = 0.5).
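The interaction between the ratio, the per-class maximums, and the bound-by-minimum option can be sketched as follows. This is one possible reading of the parameter descriptions above, not the application's actual sampling code; the function `split_sizes` and the class counts are hypothetical.

```python
def split_sizes(available, vtr=0.5, mt=1000, mv=1000, bound_by_min=True):
    """Return {label: (n_training, n_validation)} for each class.

    available    -- {label: number of available samples} (hypothetical input)
    vtr          -- share of samples sent to validation (0.0 = all training)
    mt, mv       -- per-class caps for training/validation, -1 for no limit
    bound_by_min -- cap every class by the size of the smallest class
    """
    smallest = min(available.values())
    sizes = {}
    for label, count in available.items():
        pool = smallest if bound_by_min else count
        n_valid = int(pool * vtr)
        n_train = pool - n_valid
        if mt >= 0:
            n_train = min(n_train, mt)
        if mv >= 0:
            n_valid = min(n_valid, mv)
        sizes[label] = (n_train, n_valid)
    return sizes


# Hypothetical class sizes: class 2 is much smaller than class 1.
counts = {1: 5000, 2: 300}
bounded = split_sizes(counts)
unbounded = split_sizes(counts, bound_by_min=False)
```

With the default bound, both classes end up with the same number of samples, which matches the statement that each class is equally represented in both lists.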
Field containing the class integer label for supervision -sample.vfn string
Field containing the class id for supervision. The values in this field shall be cast into integers.
Available RAM (MB) -ram int
Default value: 256
Available memory for processing (in MB).
Elevation management¶
This group of parameters allows managing elevation values.
DEM directory -elev.dem directory
This parameter allows selecting a directory containing Digital Elevation Model files. Note that this directory should contain only DEM files. Unexpected behaviour might occur if other images are found in this directory. Input DEM tiles should be in a raster format supported by GDAL.
Geoid File -elev.geoid filename [dtype]
Use a geoid grid to get the height above the ellipsoid in case there is no DEM available, no coverage for some points or pixels with no_data in the DEM tiles. A version of the geoid can be found on the OTB website (egm96.grd and egm96.grd.hdr at https://gitlab.orfeo-toolbox.org/orfeotoolbox/otb/-/tree/master/Data/Input/DEM).
Default elevation -elev.default float
Default value: 0
This parameter allows setting the default height above ellipsoid when there is no DEM available, no coverage for some points or pixels with no_data in the DEM tiles, and no geoid file has been set. This is also used by some applications as an average elevation value.
Classifier to use for the training -classifier [libsvm|boost|dt|ann|bayes|rf|knn|sharkrf|sharkkm]
Default value: libsvm
Choice of the classifier to use for the training.
- LibSVM classifier
This group of parameters allows setting SVM classifier parameters.
- Boost classifier
http://docs.opencv.org/modules/ml/doc/boosting.html
- Decision Tree classifier
http://docs.opencv.org/modules/ml/doc/decision_trees.html
- Artificial Neural Network classifier
http://docs.opencv.org/modules/ml/doc/neural_networks.html
- Normal Bayes classifier
http://docs.opencv.org/modules/ml/doc/normal_bayes_classifier.html
- Random forests classifier
http://docs.opencv.org/modules/ml/doc/random_trees.html
- KNN classifier
http://docs.opencv.org/modules/ml/doc/k_nearest_neighbors.html
- Shark Random forests classifier
http://image.diku.dk/shark/doxygen_pages/html/classshark_1_1_r_f_trainer.html.
It is noteworthy that training is parallel.
- Shark kmeans classifier
http://image.diku.dk/shark/sphinx_pages/build/html/rest_sources/tutorials/algorithms/kmeans.html
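Classifier-specific options all share the `classifier.<name>.<option>` key pattern used on this page. The sketch below assembles a command line for a chosen classifier from that pattern; `build_train_command` and the model file name `rfModelQB1.txt` are hypothetical, and only parameter keys documented here are used.

```python
def build_train_command(images, vectors, model, classifier="libsvm", options=None):
    """Assemble an otbcli_TrainImagesClassifier argument list.

    options maps short option names (e.g. "nbtrees") to values; each one is
    expanded to the documented "-classifier.<name>.<option>" key.
    """
    command = ["otbcli_TrainImagesClassifier"]
    command += ["-io.il", *images]
    command += ["-io.vd", *vectors]
    command += ["-io.out", model]
    command += ["-classifier", classifier]
    for name, value in (options or {}).items():
        command += [f"-classifier.{classifier}.{name}", str(value)]
    return command


# Hypothetical run with the random forests classifier.
command = build_train_command(
    ["QB_1_ortho.tif"],
    ["VectorData_QB1.shp"],
    "rfModelQB1.txt",
    classifier="rf",
    options={"max": 5, "nbtrees": 100},
)
```

On a machine where OTB is installed, the resulting list could be passed to `subprocess.run(command, check=True)`.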
LibSVM classifier options¶
SVM Kernel Type -classifier.libsvm.k [linear|rbf|poly|sigmoid]
Default value: linear
SVM Kernel Type.
- Linear
Linear kernel; no mapping is done. This is the fastest option.
- Gaussian radial basis function
This kernel is a good choice in most cases. It is an exponential function of the Euclidean distance between the vectors.
- Polynomial
Polynomial kernel; the mapping is a polynomial function.
- Sigmoid
The kernel is a hyperbolic tangent function of the vectors.
SVM Model Type -classifier.libsvm.m [csvc|nusvc|oneclass]
Default value: csvc
Type of SVM formulation.
- C support vector classification
This formulation allows imperfect separation of classes. The penalty is set through the cost parameter C.
- Nu support vector classification
This formulation allows imperfect separation of classes. The penalty is set through the cost parameter Nu. As compared to C, Nu is harder to optimize, and may not be as fast.
- Distribution estimation (One Class SVM)
All the training data are from the same class; the SVM builds a boundary that separates the class from the rest of the feature space.
Cost parameter C -classifier.libsvm.c float
Default value: 1
SVM models have a cost parameter C (1 by default) to control the trade-off between training errors and forcing rigid margins.
Gamma parameter -classifier.libsvm.gamma float
Default value: 1
Set the gamma parameter of the poly/rbf/sigmoid kernel functions.
Coefficient parameter -classifier.libsvm.coef0 float
Default value: 0
Set the coef0 parameter of the poly/sigmoid kernel functions.
Degree parameter -classifier.libsvm.degree int
Default value: 3
Set the polynomial degree of the poly kernel function.
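For reference, the four kernels and the roles of gamma, coef0, and degree can be sketched as below, following the usual LibSVM definitions (note that the RBF kernel uses the squared Euclidean distance). The function names are hypothetical and the defaults simply mirror the default values listed above.

```python
import math


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def rbf_kernel(u, v, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)
```

The linear kernel has no parameters, which is consistent with gamma, coef0, and degree being described only for the other kernels.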
Cost parameter Nu -classifier.libsvm.nu float
Default value: 0.5
Cost parameter Nu, in the range 0..1, the larger the value, the smoother the decision.
Parameters optimization -classifier.libsvm.opt bool
Default value: false
SVM parameters optimization flag.
Probability estimation -classifier.libsvm.prob bool
Default value: false
Probability estimation flag.
Boost classifier options¶
Boost Type -classifier.boost.t [discrete|real|logit|gentle]
Default value: real
Type of Boosting algorithm.
- Discrete AdaBoost
This procedure trains the classifiers on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and then the final classifier is defined as a linear combination of the classifiers from each stage.
- Real AdaBoost (technique using confidence-rated predictions and working well with categorical data)
Adaptation of the Discrete AdaBoost algorithm with real values.
- LogitBoost (technique producing good regression fits)
This procedure is an adaptive Newton algorithm for fitting an additive logistic regression model. Beware: it can produce numeric instability.
- Gentle AdaBoost (technique setting less weight on outlier data points and, for that reason, being often good with regression data)
A modified version of the Real AdaBoost algorithm, using Newton stepping rather than exact optimization at each step.
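The reweighting idea behind Discrete AdaBoost can be illustrated with a single boosting round. This is a textbook-style sketch, not OpenCV's implementation; `adaboost_round` is a hypothetical helper, and it assumes the weighted error is strictly between 0 and 1.

```python
import math


def adaboost_round(weights, misclassified):
    """Perform one Discrete AdaBoost reweighting step.

    weights       -- current sample weights
    misclassified -- True where the current weak classifier is wrong
    Returns the weak classifier's coefficient and the normalized new weights.
    Assumes 0 < weighted error < 1.
    """
    total = sum(weights)
    error = sum(w for w, bad in zip(weights, misclassified) if bad) / total
    alpha = math.log((1.0 - error) / error)
    updated = [w * math.exp(alpha) if bad else w
               for w, bad in zip(weights, misclassified)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


# Four equally weighted samples, the first one misclassified.
alpha, new_weights = adaboost_round([0.25] * 4, [True, False, False, False])
```

After the round, the misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.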
Weak count -classifier.boost.w int
Default value: 100
The number of weak classifiers.
Weight Trim Rate -classifier.boost.r float
Default value: 0.95
A threshold between 0 and 1 used to save computational time. Samples with summary weight <= (1 - weight_trim_rate) do not participate in the next iteration of training. Set this parameter to 0 to turn off this functionality.
Maximum depth of the tree -classifier.boost.m int
Default value: 1
Maximum depth of the tree.
Decision Tree classifier options¶
Maximum depth of the tree -classifier.dt.max int
Default value: 10
The training algorithm attempts to split each node while its depth is smaller than the maximum possible depth of the tree. The actual depth may be smaller if the other termination criteria are met, and/or if the tree is pruned.
Minimum number of samples in each node -classifier.dt.min int
Default value: 10
If the number of samples in a node is smaller than this parameter, then this node will not be split.
Termination criteria for regression tree -classifier.dt.ra float
Default value: 0.01
If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split further.
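The three stopping rules described so far (maximum depth, minimum sample count, regression accuracy) can be combined into a single check. The sketch below is an illustration of those descriptions only; `can_split` is a hypothetical helper and does not reproduce OpenCV's tree builder.

```python
def can_split(depth, node_values, node_estimate,
              max_depth=10, min_samples=10, reg_accuracy=0.01):
    """Return True if a node may still be split under the three rules."""
    # Rule 1: nodes are split only while their depth is below the maximum.
    if depth >= max_depth:
        return False
    # Rule 2: nodes with fewer samples than the minimum are not split.
    if len(node_values) < min_samples:
        return False
    # Rule 3: stop when every sample is already within the accuracy bound.
    if all(abs(node_estimate - value) < reg_accuracy for value in node_values):
        return False
    return True
```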
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split -classifier.dt.cat int
Default value: 10
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.
Set Use1seRule flag to false -classifier.dt.r bool
Default value: false
If true, then pruning will be harsher. This will make the tree more compact and more resistant to training data noise, but a bit less accurate.
Set TruncatePrunedTree flag to false -classifier.dt.t bool
Default value: false
If true, then pruned branches are physically removed from the tree.
Artificial Neural Network classifier options¶
Train Method Type -classifier.ann.t [back|reg]
Default value: reg
Type of training method for the multilayer perceptron (MLP) neural network.
- Back-propagation algorithm
Method to compute the gradient of the loss function and adjust weights in the network to optimize the result.
- Resilient Back-propagation algorithm
Almost the same as the Back-prop algorithm except that it does not take into account the magnitude of the partial derivative (coordinate of the gradient) but only its sign.
Number of neurons in each intermediate layer -classifier.ann.sizes string1 string2...
Mandatory
The number of neurons in each intermediate layer (excluding input and output layers).
Neuron activation function type -classifier.ann.f [ident|sig|gau]
Default value: sig
This function determines whether the output of the node is positive or not depending on the output of the transfer function.
- Identity function
- Symmetrical Sigmoid function
- Gaussian function (Not completely supported)
Alpha parameter of the activation function -classifier.ann.a float
Default value: 1
Alpha parameter of the activation function (used only with sigmoid and gaussian functions).
Beta parameter of the activation function -classifier.ann.b float
Default value: 1
Beta parameter of the activation function (used only with sigmoid and gaussian functions).
Strength of the weight gradient term in the BACKPROP method -classifier.ann.bpdw float
Default value: 0.1
Strength of the weight gradient term in the BACKPROP method. The recommended value is about 0.1.
Strength of the momentum term (the difference between weights on the 2 previous iterations) -classifier.ann.bpms float
Default value: 0.1
Strength of the momentum term (the difference between weights on the 2 previous iterations). This parameter provides some inertia to smooth the random fluctuations of the weights. It can vary from 0 (the feature is disabled) to 1 and beyond. The value 0.1 or so is good enough.
Initial value Delta_0 of update-values Delta_{ij} in RPROP method -classifier.ann.rdw float
Default value: 0.1
Initial value Delta_0 of update-values Delta_{ij} in RPROP method (default = 0.1).
Update-values lower limit Delta_{min} in RPROP method -classifier.ann.rdwm float
Default value: 1e-07
Update-values lower limit Delta_{min} in RPROP method. It must be positive (default = 1e-7).
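The difference between the two training methods shows up in a single weight update. The sketch below is a simplified illustration, not OpenCV's code: `backprop_step` combines the gradient term and the momentum term, while `rprop_step` uses only the sign of the gradient. The upper bound `step_max` and the factors `grow` and `shrink` are assumptions chosen from commonly quoted RPROP settings; they are not parameters of this application.

```python
def backprop_step(weight, gradient, previous_delta, bpdw=0.1, bpms=0.1):
    """One BACKPROP update: gradient term plus momentum term."""
    delta = -bpdw * gradient + bpms * previous_delta
    return weight + delta, delta


def rprop_step(weight, gradient, previous_gradient, step,
               rdwm=1e-7, step_max=50.0, grow=1.2, shrink=0.5):
    """One simplified RPROP update: only the gradient sign is used.

    step is the current update-value (Delta_ij); it grows while the gradient
    keeps its sign and shrinks, never below rdwm, when the sign flips.
    """
    agreement = gradient * previous_gradient
    if agreement > 0:
        step = min(step * grow, step_max)
    elif agreement < 0:
        step = max(step * shrink, rdwm)
    sign = (gradient > 0) - (gradient < 0)
    return weight - sign * step, step


# The same sign with very different magnitudes gives the same RPROP update.
small = rprop_step(1.0, 5.0, 3.0, 0.1)
large = rprop_step(1.0, 500.0, 3.0, 0.1)
```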
Termination criteria -classifier.ann.term [iter|eps|all]
Default value: all
Termination criteria.
- Maximum number of iterations
Set the number of iterations allowed to the network for its training. Training will stop regardless of the result when this number is reached.
- Epsilon
Training will focus on the result and will stop once the precision is at most epsilon.
- Max. iterations + Epsilon
Both termination criteria are used. Training stops at the first one reached.
Epsilon value used in the Termination criteria -classifier.ann.eps float
Default value: 0.01
Epsilon value used in the Termination criteria.
Maximum number of iterations used in the Termination criteria -classifier.ann.iter int
Default value: 1000
Maximum number of iterations used in the Termination criteria.
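How the three termination modes combine the two thresholds can be sketched as a small predicate. The function `should_stop` is hypothetical, and treating "precision" as an error value compared against epsilon is an assumption of this sketch.

```python
def should_stop(iteration, error, term="all", max_iter=1000, eps=0.01):
    """Decide whether training stops under the chosen termination mode."""
    reached_iterations = iteration >= max_iter
    reached_epsilon = error <= eps
    if term == "iter":
        return reached_iterations
    if term == "eps":
        return reached_epsilon
    if term == "all":
        # Both criteria are active; the first one reached stops training.
        return reached_iterations or reached_epsilon
    raise ValueError(f"unknown termination mode: {term}")
```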
Random forests classifier options¶
Maximum depth of the tree -classifier.rf.max int
Default value: 5
The depth of the tree. A low value will likely underfit and conversely a high value will likely overfit. The optimal value can be obtained using cross validation or other suitable methods.
Minimum number of samples in each node -classifier.rf.min int
Default value: 10
If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.
Termination Criteria for regression tree -classifier.rf.ra float
Default value: 0
If all absolute differences between an estimated value in a node and the values of the train samples in this node are smaller than this regression accuracy parameter, then the node will not be split.
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split -classifier.rf.cat int
Default value: 10
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split.
Size of the randomly selected subset of features at each tree node -classifier.rf.var int
Default value: 0
The size of the subset of features, randomly selected at each tree node, that are used to find the best split(s). If you set it to 0, then the size will be set to the square root of the total number of features.
Maximum number of trees in the forest -classifier.rf.nbtrees int
Default value: 100
The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Also keep in mind that increasing the number of trees increases the prediction time linearly.
Sufficient accuracy (OOB error) -classifier.rf.acc float
Default value: 0.01
Sufficient accuracy (OOB error).
KNN classifier options¶
Number of Neighbors -classifier.knn.k int
Default value: 32
The number of neighbors to use.
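The role of the number of neighbors can be illustrated with a plain majority vote among the k closest training samples. This is a generic sketch of the technique, not OpenCV's implementation; `knn_predict` and the sample data are hypothetical, and ties are resolved in favor of the label encountered first.

```python
from collections import Counter


def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def knn_predict(training, query, k=32):
    """Return the majority label among the k training samples nearest to query.

    training is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda item: squared_distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-band samples forming two well separated classes.
training = [([0, 0], 1), ([0, 1], 1), ([5, 5], 2), ([5, 6], 2), ([6, 5], 2)]
near_class_1 = knn_predict(training, [0.2, 0.1], k=3)
near_class_2 = knn_predict(training, [5, 5], k=3)
```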
Shark Random forests classifier options¶
Maximum number of trees in the forest -classifier.sharkrf.nbtrees int
Default value: 100
The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and reaches an asymptote for a certain number of trees. Also keep in mind that increasing the number of trees increases the prediction time linearly.
Min size of the node for a split -classifier.sharkrf.nodesize int
Default value: 25
If the number of samples in a node is smaller than this parameter, then the node will not be split. A reasonable value is a small percentage of the total data e.g. 1 percent.
Number of features tested at each node -classifier.sharkrf.mtry int
Default value: 0
The number of features (variables) which will be tested at each node in order to compute the split. If set to zero, the square root of the number of features is used.
Out of bound ratio -classifier.sharkrf.oobr float
Default value: 0.66
Set the fraction of the original training dataset to use as the out of bag sample. A good default value is 0.66.
Shark kmeans classifier options¶
Maximum number of iterations for the kmeans algorithm -classifier.sharkkm.maxiter int
Default value: 10
The maximum number of iterations for the kmeans algorithm. 0=unlimited
Number of classes for the kmeans algorithm -classifier.sharkkm.k int
Default value: 2
The number of classes used for the kmeans algorithm. The default is 2 classes.
User defined input centroids -classifier.sharkkm.incentroids filename [dtype]
Input text file containing centroid positions used to initialize the algorithm. Each centroid must be described by p parameters, p being the number of features in the input vector data, and the number of centroids must be equal to the number of classes (one centroid per line with values separated by spaces).
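A centroid file in the layout described above (one centroid per line, values separated by spaces) can be produced and checked with a few lines of Python. The helpers `format_centroids` and `parse_centroids` and the numeric values are hypothetical; the sketch only covers the text layout, not how the application reads the file.

```python
def format_centroids(centroids):
    """Serialize centroids as one line per centroid, space-separated values."""
    return "".join(" ".join(str(v) for v in centroid) + "\n"
                   for centroid in centroids)


def parse_centroids(text, n_features, n_classes):
    """Parse centroid text and check it against the expected dimensions."""
    centroids = [[float(v) for v in line.split()]
                 for line in text.splitlines() if line.strip()]
    if len(centroids) != n_classes:
        raise ValueError("the number of centroids must equal the number of classes")
    if any(len(centroid) != n_features for centroid in centroids):
        raise ValueError("each centroid needs one value per feature")
    return centroids


# Two classes (k = 2) described by three features each.
text = format_centroids([[1.0, 2.0, 3.0], [4.5, 5.0, 6.0]])
centroids = parse_centroids(text, n_features=3, n_classes=2)
```

Writing `text` to a file gives something that could be passed to `-classifier.sharkkm.incentroids`, with `-classifier.sharkkm.k` set to the number of lines.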
Statistics file -classifier.sharkkm.cstats filename [dtype]
An XML file containing mean and standard deviation to center and reduce the input centroids before the KMeans algorithm, produced by the ComputeImagesStatistics application.
Output centroids text file -classifier.sharkkm.outcentroids filename [dtype]
Output text file containing centroids after the kmeans algorithm.
Random seed -rand int
Set a specific random seed with integer value.
Examples¶
From the command-line:
otbcli_TrainImagesClassifier -io.il QB_1_ortho.tif -io.vd VectorData_QB1.shp -io.imstat EstimateImageStatisticsQB1.xml -sample.mv 100 -sample.mt 100 -sample.vtr 0.5 -sample.vfn Class -classifier libsvm -classifier.libsvm.k linear -classifier.libsvm.c 1 -classifier.libsvm.opt false -io.out svmModelQB1.txt -io.confmatout svmConfusionMatrixQB1.csv
From Python:
import otbApplication

# Create the application instance by name.
app = otbApplication.Registry.CreateApplication("TrainImagesClassifier")

# Input image, training vector data, and image statistics.
app.SetParameterStringList("io.il", ['QB_1_ortho.tif'])
app.SetParameterStringList("io.vd", ['VectorData_QB1.shp'])
app.SetParameterString("io.imstat", "EstimateImageStatisticsQB1.xml")

# Sample list sizes, training/validation ratio, and class label field.
app.SetParameterInt("sample.mv", 100)
app.SetParameterInt("sample.mt", 100)
app.SetParameterFloat("sample.vtr", 0.5)
app.SetParameterString("sample.vfn", "Class")

# LibSVM classifier with a linear kernel and no parameter optimization.
app.SetParameterString("classifier", "libsvm")
app.SetParameterString("classifier.libsvm.k", "linear")
app.SetParameterFloat("classifier.libsvm.c", 1)
app.SetParameterString("classifier.libsvm.opt", "false")

# Output model and confusion matrix files.
app.SetParameterString("io.out", "svmModelQB1.txt")
app.SetParameterString("io.confmatout", "svmConfusionMatrixQB1.csv")

# Run the training and write the output files.
app.ExecuteAndWriteOutput()