10.2. OCR of Hand-written Data using kNN

10.2.1. Goal

In this chapter:

- We will use our knowledge of kNN to build a basic OCR application.
- We will try it with the digits and alphabet data that comes with OpenCV.

10.2.2. OCR of Hand-written Digits

Our goal is to build an application which can read handwritten digits. For this we need some training data and some test data. OpenCV comes with an image digits.png (in the folder opencv/samples/python2/data/) which has 5000 handwritten digits (500 for each digit). Each digit is a 20x20 image. So our first step is to split this image into 5000 different digits. For each digit, we flatten it into a single row of 400 pixels. That is our feature set: the intensity values of all pixels. It is the simplest feature set we can create. We use the first 250 samples of each digit as training data, and the next 250 samples as test data. So let's prepare them first.

>>> import numpy as np
>>> import cv2 as cv
>>> from matplotlib import pyplot as plt
>>>
>>> img = cv.imread('/cvdata/digits.png')
>>> gray = cv.cvtColor(img,cv.COLOR_BGR2GRAY)
>>> # Now we split the image into 5000 cells, each of size 20x20
>>> cells = [np.hsplit(row,100) for row in np.vsplit(gray,50)]
>>> # Make it into a Numpy array: its size will be (50,100,20,20)
>>> x = np.array(cells)
>>> # Now we prepare train_data and test_data.
>>> train = x[:,:50].reshape(-1,400).astype(np.float32) # Size = (2500,400)
>>> test = x[:,50:100].reshape(-1,400).astype(np.float32) # Size = (2500,400)
>>> # Create labels for train and test data
>>> k = np.arange(10)
>>> train_labels = np.repeat(k,250)[:,np.newaxis]
>>> test_labels = train_labels.copy()
>>> # Initiate kNN, train it on the training data, then test it with the test data with k=5
>>> knn = cv.ml.KNearest_create()
>>> knn.train(train, cv.ml.ROW_SAMPLE, train_labels)
>>> ret,result,neighbours,dist = knn.findNearest(test,k=5)
>>> # Now we check the accuracy of classification
>>> # For that, compare the result with test_labels and check which are wrong
>>> matches = result==test_labels
>>> correct = np.count_nonzero(matches)
>>> accuracy = correct*100.0/result.size
>>> print( accuracy )
91.76

So our basic OCR application is ready. This particular example gave me an accuracy of about 91%. One option to improve accuracy is to add more data for training, especially the digits that were classified wrongly. Also, instead of building this training data every time we start the application, we had better save it, so that next time we can read it directly from a file and start classifying. This can be done with the help of some Numpy functions like np.savetxt, np.savez, np.load, etc. Please check their documentation for more details.

>>> # save the data
>>> np.savez('knn_data.npz',train=train, train_labels=train_labels)
>>> # Now load the data
>>> with np.load('knn_data.npz') as data:
...     print( data.files )
...     train = data['train']
...     train_labels = data['train_labels']
['train', 'train_labels']
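
Once the arrays are loaded, classification can resume right away: kNN in OpenCV keeps the whole training set rather than a fitted model, so we simply train again on the loaded arrays. Below is a minimal sketch, assuming a new 20x20 grayscale digit is stored in a hypothetical file new_digit.png:

>>> # Rebuild the classifier from the loaded arrays
>>> knn = cv.ml.KNearest_create()
>>> knn.train(train, cv.ml.ROW_SAMPLE, train_labels)
>>> # Read a new 20x20 digit (hypothetical file name) and flatten it to one row of 400 values
>>> sample = cv.imread('new_digit.png', cv.IMREAD_GRAYSCALE)
>>> sample = sample.reshape(-1,400).astype(np.float32)
>>> ret,result,neighbours,dist = knn.findNearest(sample, k=5)
>>> # result[0][0] holds the predicted digit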

On my system, it takes around 4.4 MB of memory. Since we are using intensity values (uint8 data) as features, it would be better to convert the data to np.uint8 first and then save it. It takes only 1.1 MB in that case. Then, while loading, you can convert back into float32.
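
A minimal sketch of that idea, saving a uint8 copy and converting back to float32 after loading (the file name knn_data_uint8.npz is just an example):

>>> # Save a compact uint8 copy of the training data (pixel values and labels both fit in uint8)
>>> np.savez('knn_data_uint8.npz',
...          train=train.astype(np.uint8),
...          train_labels=train_labels.astype(np.uint8))
>>> # Later, load it and convert back to float32 before training kNN
>>> with np.load('knn_data_uint8.npz') as data:
...     train = data['train'].astype(np.float32)
...     train_labels = data['train_labels'].astype(np.float32)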

10.2.3. OCR of the English Alphabet

Next we will do the same for the English alphabet, but with a slight change in the data and feature set. Here, instead of images, OpenCV comes with a data file, letter-recognition.data, in the opencv/samples/cpp/ folder. If you open it, you will see 20000 lines which may, on first sight, look like garbage. Actually, in each row the first column is a letter, which is our label. The next 16 numbers following it are its different features. These features are obtained from the UCI Machine Learning Repository. You can find the details of these features on this page.

There are 20000 samples available, so we take the first 10000 as training samples and the remaining 10000 as test samples. We should also convert the letters to ASCII characters, because we can't work with the letters directly.

>>> import cv2 as cv
>>> import numpy as np
>>> # Load the data, converters convert the letter to a number
>>> data = np.loadtxt('/cvdata/letter-recognition.data',
...                   dtype= 'float32',
...                   delimiter = ',',
...                   converters= {0: lambda ch: ord(ch)-ord('A')})
>>> # split the data to two, 10000 each for train and test
>>> train, test = np.vsplit(data,2)
>>> # split trainData and testData to features and responses
>>> responses, trainData = np.hsplit(train,[1])
>>> labels, testData = np.hsplit(test,[1])
>>> # Initiate the kNN, classify, measure accuracy.
>>> knn = cv.ml.KNearest_create()
>>> knn.train(trainData, cv.ml.ROW_SAMPLE, responses)
>>> ret, result, neighbours, dist = knn.findNearest(testData, k=5)
>>> correct = np.count_nonzero(result == labels)
>>> accuracy = correct*100.0/10000
>>> print( accuracy )
93.06

It gives an accuracy of about 93%. Again, if you want to increase accuracy, you can iteratively add the data that is wrongly classified at each level.
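
One way to try that with the alphabet example above: take the test rows that were misclassified, append them with their true labels to the training set, and train again. This is only a sketch of the idea, reusing the variables from the code above:

>>> # Find the test rows that were misclassified
>>> wrong = (result != labels).ravel()
>>> # Append them, with their correct labels, to the training set
>>> trainData = np.vstack((trainData, testData[wrong]))
>>> responses = np.vstack((responses, labels[wrong]))
>>> # Retrain kNN on the enlarged training set
>>> knn = cv.ml.KNearest_create()
>>> knn.train(trainData, cv.ml.ROW_SAMPLE, responses)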

10.2.4. Additional Resources

10.2.5. Exercises