key: cord-0058645-njlvic1o authors: Rakhmanov, Ochilbek title: A Novel Algorithm to Classify Hand Drawn Sketches with Respect to Content Quality date: 2020-08-19 journal: Computational Science and Its Applications - ICCSA 2020 DOI: 10.1007/978-3-030-58811-3_13 sha: 40caebe073791d2a257c25eb9d6a761e80954c54 doc_id: 58645 cord_uid: njlvic1o

In this paper, the methodology of a novel algorithm called the Counting Key-Points (CKP) algorithm is presented. The algorithm can be used for the classification of hand-drawn sketches of the same type, where content quality is important. In brief, the algorithm uses a set of reference pictures to form a vocabulary of key points (with descriptors) and counts how many times those key points appear in other images to decide the image's content quality. CKP was tested on Draw-a-Person test images drawn by primary school students and reached 65% classification accuracy. The results of the experiment show that the method is applicable and can be improved with further research. The classification accuracy of CKP was compared to other state-of-the-art hand-drawn image classification methods to show the superiority of the algorithm. As the dataset needs further study to improve prediction accuracy, it will be released to the community.

A sketch can be defined as a freehand drawing, which can be made by professionals, amateurs, adults and even children. With the development of touch-screen digital devices such as tablets and phones, sketch research has become an active and popular field in computer vision. Many users sketch shapes on electronic device screens, which can create a tremendous amount of data for further research. For instance, one such recently released dataset contains more than 50 million free-hand sketches [1], collected through the mobile game Quick, Draw! by Google. The front-runner in this field was the research and dataset released by Eitz et al. in 2012, with 20,000 sketches [2].
The aforementioned publicly accessible datasets prompted many researchers to conduct machine learning classification experiments on hand-drawn sketches [2] [3] [4]. Unlike ordinary photographic images, hand-drawn sketches differ in three main aspects: (i) they can be very abstract and deformed and still represent the same object, (ii) some people may draw all features of an object while others may omit them, and (iii) sketches lack color, as they are usually black and white only [5]. Children's drawings bear strongly on their mental development and educational performance and can reflect the inner world of the child [6]. For this purpose, researchers have developed many different types of cognitive tests, which require a child to draw a particular object and can later be used to evaluate various mental and physical conditions [6] [7] [8] [9]. Thus, hand-drawn sketches have high importance in the fields of education and healthcare. Cognitive drawing tests have also proved to be an effective tool for adults with mental disorders [9] [10] [11]. Many different types of drawing tests have been developed, and several studies have examined their reliability and validity. As this research addresses the classification of hand-drawn sketches with respect to content quality, one particular cognitive test was chosen in this context: the Draw-a-Person test (DAPT) [7]. All of the datasets released to the community [1, 2, 12] have one thing in common: they are useful for classifying different objects, a cup from a plate, a lion from a horse, a cat from a dog, and so on. A computational model must be trained to find the differences between objects in order to classify them. But this is not the case when training on images collected from the DAPT, because here the important features for classification are not the differences, but the common features and how frequently they appear in the image.
In other words, the content quality of the image is important in the classification procedure. At this point, a new field should be addressed: the content quality of hand-drawn images. Content quality might be defined as content that delivers value. In this context, if an image contains more of the main features of the object than another image of the same object, it should be assigned a higher class. For instance, Fig. 1 shows three quality levels of two objects, a face and a bicycle. If they are to be classified with respect to image quality, the content of the image clearly carries high importance. Type C is better drawn than Type B, while Type B is better than Type A; thus, they need to be classified accordingly.

Fig. 1. Different representations of the face and bicycle [4]

Just as in Fig. 1, content quality becomes the main factor in the classification of images collected during the DAPT procedure. Figure 2 presents some sample images from the DAPT, drawn by children belonging to three different age categories. From Fig. 2, it is clearly evident that the content quality of the images was the main factor in the manual classification of the pictures into different classes (Fig. 3).

Sketch Classification

K-means, KNN, SVM [14] and CNN [15] are the most commonly used machine learning algorithms for the classification of hand-drawn sketches, while computer vision algorithms such as SIFT (Scale-Invariant Feature Transform) [16], HOG (Histogram of Oriented Gradients) [17] and BoVW (Bag of Visual Words) [18] are extremely helpful for feature extraction, alongside commonly used image processing operations such as contour detection and convolution of the image [19]. In 2012, Eitz et al. collected 20,000 original sketches of 250 different objects [2] and released this dataset to the community, testing classification with HOG and SVM to reach 56% accuracy.
A year later, Li et al. used several techniques, such as ensemble matching and a star graph with KNN, to improve classification accuracy to 62% [20]. The next promising research was conducted by Schneider et al. in 2014, who employed SIFT and SVM alongside Fisher vectors [3] to reach an accuracy of 67%. Several progressive studies were conducted by the same group, Yu et al., in 2015 and 2017 [4, 21], where CNNs were successfully tested on the Eitz et al. dataset to reach 74% and 77% classification accuracy. CNNs seemed to be an acceptable tool, but in 2016, Ballester and Araujo trained two state-of-the-art CNN architectures, AlexNet and GoogLeNet, and observed that these architectures yield unacceptable accuracies, below 50% [22], when trained on hand-drawn sketches from the Eitz et al. dataset. This issue needs further clarification in future research. In a recent study, Rakhmanov tested the aforementioned classification algorithms (SVM, ANN) without computer vision algorithms (SIFT, HOG, BoVW), using just pixel values, on the dataset provided by Eitz et al. [5], and stated that the existing methods' results are reasonable to accept. In 2017, a research group from Google [1] released a dataset of hand-drawn sketches, which presently appears to be the world's largest doodling dataset, with more than 50 million samples. They used an RNN to produce software that completes a sketch from its initial strokes. This triggered more research: Xu et al. [23] and Bensalah et al. [24] used GNNs (Graph Neural Networks) in combination with CNNs to classify the images. Still, the highest classification accuracy remains in the region of 77%. BoVW, HOG and SIFT can also be employed as reliable tools to classify hand-drawn sketches, and the results can be promising.

First conceived by Dr. Florence Goodenough in 1926 [7], the Draw-a-Person test is a skill test designed to measure a child's mental age through a figure drawing task.
It estimates the progress of visual, cognitive, and motor learning by having the candidate draw a human figure, scoring the drawing for the presence and quality of figure features, and comparing the score to children's typical rate of acquisition of figure features [7]. Over the years, this test has undergone many discussions in which researchers tested its validity and reliability, usually with supportive conclusions. The instrument is among the top 10 tools used by practitioners, according to Yama et al. [25]. It is widely used in early childhood education; primary school counselors can use it to monitor children's mental development. In psychology, for instance, it has been used to compare healthy patients with those suffering from mental disorders. There are many supportive studies and case studies in these fields, summarized by Naglieri et al. [26]. Due to the wide adoption and use of the DAPT, there was a need to conduct experiments on DAPT sketches to see whether a model can be developed to classify the images automatically. Evaluation of the DAPT can be done only by a professional psychologist or practitioner. But very few schools in low-income countries (both private and public) employ a psychologist, because it is too expensive for them. Thus, the development of such an automatic tool could help millions of schools, guidance counselors and teachers around the world.

This section provides a brief explanation of the DAPT picture (as proposed by Goodenough [7]). In DAPT image evaluation, we count the number of features, such as eyes, nose, mouth, hair, hands, shoulders, fingers, etc. Additional scores come from the geometrical positioning of the features. For instance, a child may draw one leg short and the other long, or sketch eyes whose proportion on the face is abnormal.
The sum of these scores results in a total score (minimum 0, maximum 51 points). We then look up the corresponding mental age on the score-to-mental-age scale proposed by Goodenough [7]. This mental age is compared to the biological age, and if the difference is large, the counselor or practitioner advises that the child needs special assistance or training in recognizing the functions of the objects surrounding him or her in daily life [7]. Figure 4 shows some pictures from our dataset and their corresponding mental ages. Figure 5 demonstrates some human figures which belong to the same class, even though they are drawn in very different ways. Thus, the biggest challenge in this dataset, as can be observed from Fig. 5, is that sketches may appear in very different shapes. Unlike in most common image classification tasks, we are not actually looking for the differences between the images, but for the common features and how frequently they appear. The exploration of possible solutions to this challenge is the main goal of this research. Another important difference is that the Eitz et al. dataset was collected through touch-screen devices and most participants were adults, so all pictures are clear and sharp. It was impossible to do the same in our case, as our participants were children aged 4-7 years who were unfamiliar with touch screens or other electronic devices. Thus, our dataset consists of sketches drawn on plain sheets of paper. Each student tried to draw his or her best picture, so there are many pictures where students tried to erase previous strokes and replace them with new ones. They also drew extra elements, and the lines in their drawings were crooked or clashed with each other. All this created significant noise in the images, which affected prediction accuracy to some degree. From a computer science perspective, it is very important to mention that a child's picture can belong to only one highest class.
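The scoring-and-lookup procedure above can be sketched as a small routine. This is a hedged illustration only: the `toy_scale` thresholds below are invented placeholders, not Goodenough's actual score-to-mental-age table, and the two-year gap used for flagging is an assumption.

```python
def mental_age(total_score, scale):
    """Return the mental age whose score threshold the total reaches.

    `scale` maps a minimum total score to a mental age (years).
    """
    age = None
    for threshold, years in sorted(scale.items()):
        if total_score >= threshold:
            age = years
    return age

def needs_assistance(total_score, biological_age, scale, gap=2):
    """Flag a child whose estimated mental age lags the biological age
    by at least `gap` years (the gap value is an assumption here)."""
    est = mental_age(total_score, scale)
    return est is not None and biological_age - est >= gap

# Invented placeholder thresholds, NOT the real Goodenough scale.
toy_scale = {0: 4, 8: 5, 16: 6, 24: 7, 32: 8}
```

For example, a total score of 20 against `toy_scale` maps to a mental age of 6, so a 9-year-old with that score would be flagged, while a 7-year-old would not.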
In other words, a picture can only have one highest class, such as the picture of a 5-year-old child or the picture of a 9-year-old child, with respect to content quality. This is an advantage of the DAPT scoring system: a picture cannot be misclassified between 4 and 8 years old, or between 5 and 10. We only need to identify the highest class the image belongs to, which makes this challenging task easier to approach. In other words, we do not have to predict the exact class of the picture; we need to predict the highest class it can belong to.

The main objective of this study is to present the methodology for developing and applying the CKP algorithm for the classification of hand-drawn sketches with respect to their content quality, and to present comparative results against other existing methodologies to demonstrate the superiority of the CKP algorithm, as this type of sketch requires an out-of-the-ordinary approach. A unique dataset of DAPT images (with more than 1000 images) was collected in order to conduct this experiment. An early preliminary version of this work was published by Rakhmanov et al. [27], where initial results of the experiments on DAPT images were reported. This paper, however, presents a complete guide to how the CKP methodology works, together with a complete comparative study of the results. We used two important concepts from machine learning and computer vision to establish the CKP algorithm: K-means and BoVW. We also compared the prediction accuracy of CKP to other widely used algorithms. In this context, we compared CKP results with the SVM+HOG sketch classification method [2] and with a CNN [4]. We describe the complete development process of CKP in Sect. 5, while SVM+HOG and the CNN are only briefly described in the methodology section, as they are already explained in the literature.

• K-means.
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all data points belonging to that cluster) is minimized [28].
• Bag of Visual Words. Another important algorithm for image feature detection is BoVW [18]. The general idea of BoVW is to represent an image as a set of features, consisting of key points and descriptors. Key points are salient points in an image that stay the same no matter how much the image is rotated, shrunk, or expanded. The descriptor is the description of a key point. We used key points and descriptors to construct vocabularies and represented each image as a frequency histogram of the features present in it. From the frequency histogram, we can find other similar images or predict the category of an image. To calculate key points and descriptors, we used ORB [29].
• Programming instruments. We used the Python programming language, as it is one of the most popular programming languages for image classification. Image processing and computer vision algorithms were implemented using the open-source library OpenCV [30]. Scikit-learn was our main library for machine learning training [31], while Keras and TensorFlow were used for deep learning training [15].
• Ethics and regulations. We followed all ethics regulations during data collection. All parents were properly informed of the test procedure and their consent was obtained. Students were guided by their teachers during the process. It was an extra-curricular activity for the students, with a minimal level of stress.
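The K-means and BoVW ideas above can be sketched as follows. This is a minimal numpy-only illustration under stated assumptions: the real pipeline would use `cv2.ORB_create(...).detectAndCompute(...)` for key points and descriptors and an off-the-shelf K-means (e.g. scikit-learn's), and the random arrays below merely stand in for ORB descriptors.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means (a stand-in for scikit-learn's KMeans): returns (k, d) centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every descriptor to its nearest centroid ...
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ... then move each centroid to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def bovw_histogram(descriptors, vocabulary):
    """Represent one image as a frequency histogram of visual words."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary))

# Toy stand-ins for ORB descriptors of the reference images.
rng = np.random.default_rng(1)
reference_descriptors = rng.normal(size=(300, 32))  # 300 descriptors, 32-dim
vocabulary = kmeans(reference_descriptors, k=8)     # 8 visual words

image_descriptors = rng.normal(size=(50, 32))
hist = bovw_histogram(image_descriptors, vocabulary)
```

The histogram always sums to the number of descriptors in the image, since every descriptor is assigned to exactly one visual word.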
• Participants. Students from a private educational institution were selected, aged from 4 to 11 years, from Nursery 1-2 and Primary 1-2-3-4 grades. More than 1000 pictures were collected, and some meaningless sketches were eliminated, leaving almost 1000 pictures.
• Process. Students were given a plain sheet and told to sketch a man. No further assistance was given, as the sketch had to be the authentic work of the child. The children were given about 10-15 min to finish. After the sketching process was completed, all pictures were collected and passed on for further steps.
• Manual classification. A consensus of three trained persons (one school counselor, one PhD student and one Physiology Professor) followed Goodenough's 51-point scoring criteria [7] to calculate the score of every picture. Once scoring was done, we classified the pictures according to mental age, from 4 to 11 years, resulting in a total of 8 classes. Table 2 presents the total number of images for every class.

To make our work diverse and to find the best possible option, we created three versions of the dataset. a) All_8: the primary dataset, with all sketches divided into 8 classes according to age, from 4 to 11 years. Pictures were cropped and resized to 120 × 240 pixels. b) Double: in this dataset, we merged adjacent age groups (4 with 5, 6 with 7, 8 with 9 and 10 with 11), resulting in a total of 4 classes. c) Reference: a small dataset in which we joined classes 10 and 11. This dataset was used during the CKP experiment to propose the novel method. The formation of different datasets is not a desirable condition, and our expectation is to get the best possible result on the All_8 dataset. However, during the experiment we discovered that classifying All_8 is a very difficult task, which forced us to look for different methods of classification.
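The three dataset variants amount to simple label mappings. A minimal sketch, assuming labels are the mental ages 4-11 and following the merging described above:

```python
AGES = [4, 5, 6, 7, 8, 9, 10, 11]  # the 8 classes of All_8

def to_double(age):
    """Merge adjacent age groups: (4,5)->0, (6,7)->1, (8,9)->2, (10,11)->3."""
    return (age - 4) // 2

double_labels = [to_double(a) for a in AGES]
reference_ages = [a for a in AGES if a >= 10]  # Reference: classes 10 and 11 joined
```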
As a rule of thumb, we used 75% of the data for training and the remaining 25% for testing. This partition was used for the other classification methods; CKP does not need such a partition. For CKP, only the Reference dataset is needed to form the vocabulary, and the rest of the data was used for testing. First, we followed the process described in Fig. 6. During each experiment we chose different values for X and Y to see how good the prediction accuracy would be. For instance, assume that (X, Y) = (1000, 400). We proceed with the following steps:

Step 1. Use the Reference dataset to calculate 1000 key points and their descriptors from each picture, using ORB. Figure 7 displays some of the selected images and their respective key points (1000) marked with red circles.
Step 2. Use K-means to find the 400 descriptors most likely to appear in every picture. These 400 descriptors form a vocabulary of visual words.
Step 3. Use a matching method to determine whether these vocabulary elements appear in the other classes of All_8, and how frequently. If a descriptor appeared in a picture (probability of appearance greater than zero), we counted it as 1 point.
Step 4. Calculate the total points for each picture.
Step 5. Lastly, produce a summary of the mean and standard deviation for every class (rounded to integers).

Table 3 summarizes the operation presented in Fig. 6 (after Step 5) for every set of (X, Y). The experiment was done on the All_8 dataset. Regardless of the selected set, we can observe that there is a difference between the means of the classes, and it gradually increases with every class. Hypothetically, we can assume that we can classify the sketches with respect to their means. But the biggest challenge in this case is the standard deviation, which is very high. Next, we applied the same operation to the Double dataset; Table 4 summarizes that experiment.
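Steps 3-5 above can be sketched as follows. This is a hedged illustration: the distance threshold that decides whether a vocabulary word "appears" in a picture is an assumption (the paper matches ORB descriptors with a matcher), and the toy arrays below stand in for real descriptor sets.

```python
import numpy as np

def ckp_score(image_descriptors, vocabulary, threshold=50.0):
    """Steps 3-4: one point for every vocabulary word that matches at
    least one descriptor in the image (match = distance < threshold;
    the threshold value is an assumption, not from the paper)."""
    dists = np.linalg.norm(
        vocabulary[:, None, :] - image_descriptors[None, :, :], axis=2)
    return int((dists.min(axis=1) < threshold).sum())

def class_statistics(scores_by_class):
    """Step 5: rounded mean and standard deviation for every class."""
    return {c: (round(float(np.mean(s))), round(float(np.std(s))))
            for c, s in scores_by_class.items()}

# Toy example: three vocabulary words, one image with two descriptors.
vocab = np.zeros((3, 2))
image = np.array([[0.0, 0.0], [100.0, 100.0]])
points = ckp_score(image, vocab, threshold=1.0)  # every word matches (0, 0)

# Per-class summary over made-up per-picture totals.
stats = class_statistics({4: [10, 12, 14], 5: [20, 22, 24]})
```

Here all three zero vectors in `vocab` match the image's `(0, 0)` descriptor, so the picture earns 3 points, and the summary reports a rounded mean and deviation per class.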
Comparing Tables 3 and 4, we can observe that the average standard deviation for every set of (X, Y) does not change and remains very high. This will certainly pose a serious challenge during the prediction operation. At this point, we strongly encourage further research aimed at reducing the standard deviation to improve classification accuracy. However, the margins between the means are wider in Table 4, which directly affects classification accuracy in a positive manner. Next, we used the information from Tables 3 and 4 to develop an image classification methodology. Figure 8 shows the process chart for this methodology. As the standard deviation is very high for every class, we were careful with the selection of the upper limit for every class. The limits should be reasonable with respect to the means and deviations of the two neighboring classes. Table 5 summarizes the limits we chose for every class in the All_8 and Double datasets. The rest of the operation in Fig. 8 is described by the algorithm presented in Fig. 9. This operation should be done for every class, and the accuracy should be recorded. Finally, we present the prediction accuracies in Tables 6 and 7: the accuracy for each class and the total prediction accuracy, which is the weighted accuracy of the set. We can observe that Set 4 (2000; 800) gives the best result on both the All_8 and Double datasets. As expected, prediction accuracy on the Double dataset is better than on All_8, and we reached at best 65% prediction accuracy. We could have continued with a Set 5 (4000; 1200) or Set 6 (6000; 2000), but this did not increase the accuracy; instead, it led to overfitting the model. Larger sets may be applicable to very large datasets, but in our case Set 4 seems to be optimal.

Fig. 9. Counting Key-Points pseudocode

We developed one-to-one models of two milestone research papers on the classification of hand-drawn sketches: Eitz et al. [2] and Yu et al. [4].
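The thresholding step in Figs. 8-9 can be sketched as a pass over per-class upper limits. The limit values below are illustrative placeholders, not the values chosen in Table 5:

```python
def predict_class(points, upper_limits):
    """Assign the picture to the lowest class whose upper limit still
    contains its CKP score; scores above every limit go to the top class."""
    for cls in sorted(upper_limits, key=upper_limits.get):
        if points <= upper_limits[cls]:
            return cls
    return max(upper_limits)

def total_accuracy(samples, upper_limits):
    """Overall accuracy over (points, true_class) pairs."""
    correct = sum(predict_class(p, upper_limits) == c for p, c in samples)
    return correct / len(samples)

# Illustrative limits for four classes (Double-style merged labels).
limits = {4: 15, 5: 30, 6: 45, 7: 60}
```

With these placeholder limits, a score of 20 falls into class 5, and any score above 60 is assigned to the highest class, 7, reflecting the fact that only the highest attainable class matters.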
In brief, Eitz et al. first calculated a Histogram of Oriented Gradients for each sketch and then fed this data to an SVM, while Yu et al. developed a Convolutional Neural Network model to classify hand-drawn sketches. Table 8 summarizes the classification accuracies of these two milestone models and the CKP algorithm. We can clearly observe that the other two methods struggled with this dataset, as it needs a special approach, while CKP outperformed both.

In this paper, we conducted classification experiments on DAPT images and presented the methodology of the CKP algorithm. Throughout the experiments, we discovered that this kind of image requires a specific approach rather than common classification methods. We observed that the All_8 dataset is difficult to classify with high accuracy, while joining pairs of classes, as in Double, can ease this burden. However, our primary goal should be to develop a method to classify the All_8 dataset. We introduced a novel method, the CKP algorithm, which, although simple, achieved good results on this type of dataset. As we tested a unique dataset that had not previously been classified by a computational device, we believe that our resulting prediction accuracy of 65% is a worthwhile finding that can be improved in further studies. We also presented comparative results in Table 8, showing that CKP is superior to existing state-of-the-art sketch classification methods. The main shortcoming of the algorithm, as stated several times above, is that the standard deviation of the classes is very high. Prediction accuracy might improve if future studies can lower this standard deviation.

References

1. A neural representation of sketch drawings
2. How do humans sketch objects?
3. Sketch classification and classification-driven analysis using Fisher vectors
4. Sketch-a-Net: a deep neural network that beats humans
5. Testing strength of the state-of-art image classification methods for hand drawn sketches
6. Children and Pictures: Drawing and Understanding
7. Measurement of intelligence by drawings
8. L'examen psychologique dans les cas d'encephalopathie traumatique
9. Screening for Alzheimer's disease by clock drawing
10. Adult norms for the Rey-Osterrieth Complex Figure Test and for supplemental recognition and matching trials from the Extended Complex Figure Test
11. Draw a person: screening procedure for emotional disturbance
12. Query-adaptive shape topic mining for hand-drawn sketch recognition
13. Correlation between Vanderbilt ADHD diagnostic scale and the draw-a-man test in school children
14. The Elements of Statistical Learning
15. Deep Learning with Keras
16. Distinctive image features from scale-invariant keypoints
17. Histograms of oriented gradients for human detection
18. Evaluating bag-of-visual-words representations in scene classification
19. Digital Image Processing: Principles and Applications
20. Free-hand sketch recognition by multikernel feature learning
21. Sketch-a-Net that Beats Humans
22. On the performance of GoogLeNet and AlexNet applied to sketches
23. Multi-graph transformer for free-hand sketch recognition
24. Shoot less and sketch more: an efficient sketch classification via joining graph neural networks and few-shot learning
25. The usefulness of human figure drawings as an index of overall adjustment
26. Draw a person test
27. Experimentation on hand drawn sketches by children to classify Draw-a-Person test images in psychology
28. Introduction to Data Mining
29. ORB: An efficient alternative to SIFT or SURF
30. Learning OpenCV: Computer Vision with the OpenCV Library
31. Scikit-learn: machine learning in Python