Detecting People in Cubist Art Shiry Ginosar, Daniel Haas, Timothy Brown, and Jitendra Malik University of California Berkeley Abstract. Most evaluations of object detection methods focus on ro- bustness to natural form deformations such as people’s pose changes. However, the human visual system is surprisingly robust to artificial distortions as well. For example, Cubist paintings contain forms of man- made deformation to which human vision is tolerant such as perspective manipulation and, in particular, part-reorganization. We ask how would current object detection methods perform on these distorted images, and present an evaluation comparing human annotators to four state- of-the-art object detectors on Cubist paintings. Our results demonstrate that while human perception significantly outperforms current methods in object detection, human perception and part-based models exhibit a similarly graceful degradation in performance as objects become in- creasingly deformed, thus corroborating the theory of part-based object representation in the brain. Keywords: recognition, deformations, paintings, cubism, perception 1 Introduction In visual fine arts, such as painting and sculpture, artists often distort reality. In abstract art these distortions push the envelope of human perception while (usually) still containing figures that are recognizable by the viewer. Since these art forms characterize perception at its limit, we can use them to test computer vision methods for whether they achieve human-like performance and to design methods that mimic human vision. This stands in stark contrast to the common practice of evaluating computer vision models on datasets consisting only of camera snapshots. More abstract, artistic images often contain elements that the human vision system recognizes and interprets despite the lack of realism. This suggests that a corresponding computer vision method should have certain characteristics that enable it to mimic the recognition of these elements. As an example, we ask in this paper whether the characteristics of certain methods align well with the properties of human vision that enable the recognition of distortions found in Cubist art. Cubist paintings depict objects as if they were seen from many viewpoints at once, breaking them into medium sized parts that appear out of their natural ordering and do not conform to the rules of perspective [1]. Because the deformed objects present in Cubist art are not normally seen in nature, the human visual 2 Shiry Ginosar, Daniel Haas, Timothy Brown, and Jitendra Malik system must strain to recognize them. Since humans are usually still able to identify the depicted object, Cubism shows us that human perception does not rely on exact geometry and is tolerant to a rearrangement of mid-level object parts. However, findings from neuroscience show that this ability degrades as images become more scrambled or abstract [2][3]. If a method mimics human perceptual performance well, then it should behave similarly in the extreme conditions present in Cubist art. We aim to test whether there are computational systems that behave like human vision in the face of extreme form deformation. We choose to focus on part-based methods that permit rearrangements of medium-complexity parts as they have been proven to do well at representing naturally occurring form deformations [4]. Here, we ask how they would perform in the face of abstract man-made distortions, and whether they would mimic human object detection performance better than object-level methods. We note that this is in itself a contribution, as there is little research into how current systems would react in novel circumstances. To this end, we compare human an- notations of person figures in Picasso paintings to the detections of several object detection methods. Moreover, in order to chart the performance of the methods as human vision approaches its limit, we ask participants to divide the paint- ings into subsets according to the level of their abstractness and compare the performance of the humans and detection methods on each subset. Our results show that (1) existing part-based methods are relatively successful at detecting people even in distorted images, (2) that there is a natural correspondence be- tween user ratings of image distortion and part-based method performance, and (3) that these properties are not nearly as evident in non-part-based methods. By demonstrating that part-based methods mimic human performance, we both show that these methods are valuable for object recognition in non-traditional settings, and corroborate the theory of part-based object representation in the brain. 2 Related Work Since in most tasks the human visual system serves as an upper bound bench- mark for computer vision, some studies focus on characterizing its capabilities at the limit. For instance, Sinha and Torralba examined face detection capabilities in low resolution, contrast negated and inverted images [5]. In other cases, com- putational models are used to test the validity of theories from neuroscience [6]. We take inspiration from these studies and evaluate human object detection in man-made art in order to provide a less restrictive benchmark of robustness to deformation than natural images. By using this benchmark we hope to discover parallels between the characteristics of human and algorithmic object detection. From research in neuroscience, we know that the human visual system can detect and recognize objects even when they are deformed in various ways [7]. For instance, humans are able to recognize inverted objects, although their perfor- mance is degraded, especially when the objects are faces or words [8][5]. Similar results were obtained when comparing scrambled images to non-scrambled ones, leading to a theory of object-fragments rather than whole-object representations Detecting People in Cubist Art 3 in the brain [9][10]. This theory is strengthened by recordings from neurons in the macaque middle face patch that indicate both part-based and holistic face detection strategies [11]. Thus, although humans are capable of recognizing im- ages distorted by scrambling, they are less adept at doing so. By analogy, we might expect methods trained on natural images to suffer a similar degradation in the face of the reorganization of object parts. Object detection is one of the prominent unsolved problems in computer vi- sion. Traditionally, object detection methods were holistic and template-based [12], but recent successful detection methods such as Poselets [13][14] and deformable part-based models [4] have focused on identifying mid-level parts in an appropri- ate configuration to indicate the presence of objects. Other part-based methods discover mid-level discriminative patches in an unsupervised way [15][16], use visual features of intermediate complexity for classification [6], or rely on distinc- tive sparse fragments for recognition [17]. Finally, a model inspired by Cubism itself that assembles intermediate-level parts even more loosely has shown success in detecting objects [18]. Another approach to detection that has recently shown remarkable detection results is based on convolutional neural networks [19][20]. We discuss the methods that we have chosen to benchmark in more detail in Section 4. 3 Cubist ‘fragments of perception’ While there are many kinds of visual deformation to which human vision is robust, we choose to focus on the part-reorganization exhibited by Cubist paint- ings as it has an appealing correlation with the strengths of part-based detection methods. Cubism is an art movement that peaked in the early 20th century with the work of artists such as Picasso and Braque. Cubist painters moved away from the two-dimensional representation of perspective characteristic of realism [1]. Instead, they strove to capture the perceptual experience of viewing a three di- mensional object from a close distance where in order to perceive an object, the viewer is forced to observe it part by part from different directions. Cu- bist painters collapsed these ‘fragments of perception’ onto one two-dimensional plane, distorting perspective so that the whole object can be viewed simulta- neously. Despite the deformation of form, the original object is often readily detectable by the viewer as the parts chosen to represent it are naturalistic enough and discriminative enough to allow for the recognition of the object as a whole. However, this becomes harder with the degree of deformation and abstractness [3]. The fact that humans can detect distorted objects in Cubist paintings with- out prior training makes these paintings well-suited to benchmark robustness to deformation in detection methods trained on natural images. In order to provide intuition that part-based models will be able to perform well on this task, we pro- vide some initial evidence that computer vision methods can successfully identify key parts in the Cubist depictions of objects. We train an unsupervised discov- ery method of mid-level discriminative patches on the PASCAL 2010 “person” class images [15][16][21], and compare the part-detector activations on natural 4 Shiry Ginosar, Daniel Haas, Timothy Brown, and Jitendra Malik Fig. 1: Heat maps showing the discriminative patches activations on a natural training image (Left) and “Girl Before a Mirror 1932”, a Picasso Cubist painting (Right). The color palette correlates with confidence score and ranges from blue (lowest) to red (highest). In both cases the most discriminative patches for class person are parts of faces and upper bodies, suggesting that computer vision methods are able to identify the key parts of human figures even when they are split into ‘fragments of perception’ in Cubist paintings. training images and Cubist paintings by Picasso in Figures 1 and 2. Despite the difference in low-level image statistics, the detectors are able to discover the patches that discriminate people from non-people in both image domains. In the rest of the paper, we build on these results to test whether part-based object models can use the detected parts in order to recognize the depicted objects as a whole. 4 Object Detection Methods in Comparison We compare four state-of-the-art person detectors on Cubist paintings, presented in the time ordering in which they were proposed: one holistic template-based method, one part-based model where the parts are learned automatically, one part-based model where the parts are learned from human annotations and the most recent deep learning method. Here we discuss the details of each one. Dalal and Triggs: Object appearance in images can be characterized by histograms of orientations of local edge gradients binned over a dense image grid (HOG) [12]. The Dalal and Triggs (D&T) method trains an object-level HOG template for detection using bounding box annotations. Since the features are binned, the detector is robust only to image deformation within the bins. Deformable Part Models: A holistic HOG template cannot recognize ob- jects in the face of non-rigid deformations such as varied pose, which result in a rearrangement of the limbs versus the torso. Therefore, the deformable part- based models detection method (DPM) represents objects using collections of part models [4]. These can be arranged with respect to the root part, repre- senting the full object, in deformable configurations that can be characterized by spring-like connections between certain pairs of parts. In practice, the model trained on natural images often learns sub-part models (such as half a face). 1 For each detector, the 10 most confident activations are taken from the test set of paintings and presented by decreasing confidence, excluding lower confidence dupli- cates of 50% overlap or more. Detectors are sorted by the average score of their 20 top activations, excluding detectors where over 1/4 of activations are duplicates of activations from higher rated detectors. Detecting People in Cubist Art 5 Fig. 2: Computer vision methods trained on natural images are able to detect discriminative parts of person figures in Cubist paintings. (Left) Each row dis- plays the top ten discriminative patches activations1. The leftmost column shows the top activation on the training data. Most of the detectors find corresponding face parts in the natural and painting images, although many are false positives. The fourth patch-detector from the top detects patches with little visual con- sistency on the paintings as well as the training data (a). Some false positive activations (b)(Bottom) seem more similar to true positives (b)(Top) in Hog- gles space [22] (b)(Middle) than in HOG (b)(Right) or the original RGB (Left, marked in green) spaces. Poselets: Poselets is a similar part-based model that considers extra human supervision during training[13][14]. Here, parts are not discovered but learned from body-part annotations. Poselets do not necessarily correspond to anatomi- cal body parts like limbs or torsos as these are often not the most salient features for visual recognition. In fact, a highly discriminative Poselet for person detection corresponds to “half of a frontal face and a left shoulder”. R-CNN: R-CNN replaces the earlier rigid HOG features with features learned by a deep convolutional neural network [19][23]. While R-CNN does not have an explicit representation of parts, it is trained under a detection objective to be invariant to deformations of objects by using a large amount of data. Deep meth- ods outperform previous algorithms by a large margin on natural data. Here we compare this state-of-the-art method to other methods and human perception on abstract paintings. 5 Experimental Setup We compare the above methods to human perception in two ways. First, we study the human and algorithm performance on person detection over a cor- pus of Cubist paintings. Second, we examine the degradation in performance of human perception and detectors as the distortion of person figures in the paint- ings increases. We conduct all comparisons using the PASCAL VOC evaluation 6 Shiry Ginosar, Daniel Haas, Timothy Brown, and Jitendra Malik mechanism, in which true positives are selected based on a 50% overlap between detection and ground truth bounding box [21]. 5.1 Picasso Dataset In the experiments described below we used a set of 218 Picasso paintings, which have a title indicating that they depict people, as our test data. These ranged from naturalistic portraits to abstract depictions of people as a collection of distorted parts. The set of paintings we used is highly biased in comparison to PASCAL person class images [21]. Given the nature of the art form, Cu- bist paintings usually depict people in full frontal or portrait views. Since in a PASCAL-style evaluation true positives are selected based on a 50% overlap, this results in higher average precision scores. For example, portrait paintings devote most of the canvas area to the torso of a person. In such paintings, a random detection that contains over 50% of the image would count as a true positive. This issue exists in any PASCAL VOC evaluation, but it is especially pronounced in this case. 5.2 Human Perception Study Setup We conducted two experiments as part of our perception study. First, we recorded human detections of person figures in Cubist paintings. Second, we asked partic- ipants to bucket the paintings by the degree to which the figures in them were deformed compared to photorealistic depictions of people. For each painting, raters were asked to pick a classification on a 5-point Likert scale, where 1 cor- responded to “The figures in this painting are very lifelike” and 5 corresponded to “The figures in this painting are not at all lifelike”. Participants We recruited eighteen participants to partake in our perception study. Sixteen participants were undergraduate students at our institution, one was a graduate student and one a software engineering professional. Seventeen participants were male and one was female. Mechanism Participants completed the study on their personal laptops using an online graphical annotation tool we wrote for the purpose. Each participant spent an hour on the study and received a compensation of $15. Each participant annotated 146 randomly chosen paintings out of the total 218, so that every painting was annotated by 14 - 15 unique participants. 5.3 Detector Study Setup We compare the human recognition performance we measured during the per- ception study to four object detection methods. We train all methods using the PASCAL 2010 “person” class training and validation images [21]. We train the Detecting People in Cubist Art 7 methods using natural images so that they do not enjoy an advantage over hu- mans by training on the paintings. However, some research suggests that human recognition in Cubist paintings does improve with repeated exposures [2]. We set all parameters in all four methods to the same settings used in the original papers, except for the Poselets detection score which we set to 0.2 based on cross validation on the training data. 5.4 Ground Truth Because Picasso did not explicitly label the human figures in his paintings, there is no clear cut gold standard for human figure annotations in our image corpus. As a result, we rely on our human participants to form a ground truth annotation set. We do so by capturing the average rater annotation as follows. Since each painting might have more than one human figure, we use k-means clustering to group annotations by the human figure they correspond to. For each cluster, we obtain a ground truth bounding box by taking the median of each corner of the bounding boxes in that cluster along every dimension. This yields one ground truth bounding box per human figure per image, which we can now use to evaluate both human and detector annotations. There is one subtlety omitted in the above description: since it would be unfair to allow a human rater’s annotations to influence ground truth when evaluating that rater, for each human rater we construct a modified leave-one-out ground truth from the annotations of all other raters, withholding the annotations of the rater we intend to evaluate. When evaluating detectors, however, we include annotations from all human raters in the ground truth. It is worth noting that since humans themselves are error-prone (especially when recognizing objects in more abstract paintings), our ground truth cannot be a perfect oracle. Rather, all evaluation is comparing performance to the average, imperfect human. From this perspective, our evaluation of human raters can be seen as a measure of their agreement, and our evaluation of detectors can be seen as a measure of similarity to the average human. 5.5 Evaluating Humans and Detectors Unlike detectors, humans provide only one annotation per figure without confi- dence scores, so we cannot compute average precision. Our primary metric for humans is the F-measure (F1 score), which is the harmonic mean of precision and recall. In order to combine the F-measures of all participants, we consider both (qualitatively) the shape of the distribution of the scores, and (quantitatively) the mean F-measure. To compare detectors with humans, we pick the point on the methods’ precision-recall curve that optimizes F-measure, and return the precision, re- call, and F-measure computed at this point. This is generous to the detectors but captures their performance if they were well tuned for this task. 8 Shiry Ginosar, Daniel Haas, Timothy Brown, and Jitendra Malik Fig. 3: Frequency distribution of human F-measure scores for recognizing per- son figures in all 218 Cubist paintings against the leave-one-out ground truth bounding boxes. Due to the small number of participants, the curve has been smoothed for clarity. 6 Detection Performance on the Picasso Dataset In the first part of the comparison we evaluate the performance of the humans and the four methods at detecting human figures in Picasso paintings. 6.1 Human Performance on the Picasso Dataset First, we evaluate our human participants against the leave-one-out ground truth to determine how effective people are at recognizing human figures in Cubist art. Figure 3 displays the distribution of human F-measures for this task. Qualita- tively, we see that humans perform quite well, as the distribution has its peak around 0.9, and there is little variance among the scores. It is worth noting the bump in the distribution around 0.6—there were a few raters whose annotations were significantly different from the ground truth annotation. This may have been due to either failure to recognize the images, or a misunderstanding of the annotation interface and instructions. Quantitatively, the first row of the ta- ble in Figure 7(Right) shows the mean human precision, recall, and F-measure. These numbers confirm our impressions of the distribution–humans score high on average when detecting human figures in the paintings. 6.2 Detector Performance on the Picasso Dataset Qualitative Results Part-based models trained purely on photographs per- formed surprisingly well when tested against the Picasso data. As can be seen in Figures 4 and Figure 5, Poselets and DPMs successfully produce bounding boxes for figures in the paintings, though they have their fair share of false positives and misses. We additionally note semantically significant Poselet activations as well as false positives in tricky cases where non-human image elements take on human shapes–an expected consequence of the complexity of the paintings (Fig- ure 5). The non-part-based methods do not perform as well, as is evident in Figure 6 where each row displays the top ten detections of a method over the entire painting dataset, sorted by each method’s confidence score. Detecting People in Cubist Art 9 Fig. 4: (Left) DPM detects one of the two viewpoints of the split face, resulting in a shifted localization of the person as a whole. (Right) The DPM model trained on natural images learns sub-face parts in three different scales. This provides insight as to why DPM is able to detect a split-face patch resulting in the bounding box detection on the left. Fig. 5: The Poselets method is able to find person-parts in Cubist paintings and use them to detect person figures as a whole. (Left) Poselet bounding box detections in Picasso’s “Girl Before a Mirror 1932” with a false positive detection in the center. (Middle) A true positive Poselet activation on the painting together with the corresponding Poselet activation on the training data. In both image domains the Poselet detects a downward-angled face in profile. (Right) The false positive activation that results in the incorrect bounding box detection on the left. Here the Poselet falsely detects an away-facing back, shoulders, and head in the painting. Quantitative Results Figure 7(Right) displays the precision-recall curves for each method when evaluated on all paintings. There is a clear ordering in accu- racy: DPM performs the best by far, Poselets and RCNN come next, and Dalal and Triggs, though characterized by high recall, has extremely low precision, leading to terrible performance. In general, all of the methods achieve recall of up to 0.8, which implies that they are capable of recognizing most human figures in the image. However, DPM is the only method which can maintain a reasonable precision as recall increases, which explains its significantly greater AP. The table in Figure 7(Right) confirms these insights. DPM’s performance on this task is quite encouraging, as its AP (0.38) is not too far off from its performance on undistorted PASCAL 2010 photographs (0.41 without context rescoring) although the actual number may be higher in practice as the PASCAL test dataset, unlike ours, includes many images that do not contain people [23]. 10 Shiry Ginosar, Daniel Haas, Timothy Brown, and Jitendra Malik Fig. 6: Top ten detections for each method according to confidence from left to right. First row: DPM. Second row: Poselets. Third row: R-CNN. Fourth row: D&T. False positives are marked in blue. Qualitatively, it is evident that part- based models outperform the other approaches. 6.3 Comparing Human and Detector Performance As is clear from the graphical and tabular data in Figure 7, humans are far better than detectors at recognizing person figures in Picasso paintings. The green dot in the upper right corner of the graph shows human precision and recall to be far higher than the orange dot of DPM, the highest-performing method. 6.4 Discussion of Performance on the Picasso Dataset The comparison between the human participants and four methods on the task of detecting person figures in Picasso paintings demonstrates clearly that hu- mans are highly skilled at this task, and that detectors are much less effective but can still achieve results within an order of magnitude of their detection per- formance on undistorted images. Among algorithms, part-based object detection methods perform better than object-level methods on distorted images. DPM and Poselets, our two part-based methods, demonstrate the best performance on the object detection task. This is likely due to the fact that part-based meth- ods are able to recognize medium-level parts that remain intact even in Cubist paintings where standard human body parts are highly distorted. We emphasize the part-based approach here, as we have no reason to believe that the HOG features used by both DPM and Poselets carry any advantage over other image features organized in a part-based model. Given its success in object detection on natural images, it is interesting that R-CNN does not perform well on this task. One reason for this could be that R-CNN is not a part-based model, however this is only partly true because the Detecting People in Cubist Art 11 Annotator Precision Recall F-measure AP Human 0.804 0.860 0.829 N/A DPM 0.444 0.464 0.458 0.378 Poselets 0.311 0.240 0.271 0.178 RCNN 0.315 0.177 0.226 0.104 D&T 0.027 0.486 0.051 0.019 Fig. 7: Performance comparison via precision-recall curves (Left) and tabular data (Right). In both the plot and the table: for detectors, precision, recall, and F-measure are the maxima over the entire precision-recall curve. For humans, these numbers are averages across the raters. While DPM outperforms other methods, none of the methods reach human performance in person detection in Cubist paintings. convolutional filters can be thought of as parts, and the max pooling as perform- ing deformations. A second factor might be the fact that R-CNN over fits to the natural visual world and fails at adapting to the domain of paintings. There has been little research into how CNN-based networks perform on deformed images, but an initial investigation suggests that tiny changes to an image may cause drastic changes to their output [24]. 7 Performance Degradation with Increased Deformation In the second part of the comparison we study the degradation of performance of humans and methods with increased painting abstractness. 7.1 Classifying Images by Degree of Deformation In our user study (described in Section 5.2), each rater labeled images on a scale from 1 (The figures in this painting are very lifelike) to 5 (The figures in this painting are not at all lifelike). By taking the rounded average of user labels for each image, we divide the images into five ‘deformation buckets’ in order to evaluate object detection performance as image distortion increases. An example of a painting with an average rating of 1 is Picasso’s “Seated Woman 1921”‘, and an example of a painting with an average rating of 5 is Picasso’s “Nude and Still Life 1931” (copyrighted paintings not reproduced in this text). The histogram of the number of images in each bucket is shown in Figure 8(Top Left). 7.2 Human Performance Degradation Figure 8(Top Right) demonstrates the impact of deformation on human object detection performance. As the images become more distorted, the distribution 12 Shiry Ginosar, Daniel Haas, Timothy Brown, and Jitendra Malik F-measure N or m al iz ed fr eq ue nc y Fig. 8: Impact of increasing image distortion on object detection performance. Bucket 1 contains images that appear most natural, and bucket 5 contains images that appear most distorted. (Top Left) A histogram showing the number of images per bucket. (Top Right) Human F-measure distributions by distortion bucket. Because there were a small number of human raters, the curves have been smoothed for clarity. (Bottom Left) DPM precision-recall curves by distortion bucket. (Bottom Right) Comparing humans and methods by bucket. Here, part- based models show a similar degradation behavior to human performance as the distortion of the image increases. of human F-measures shifts clearly to the left. This indicates that human per- formance worsens on more deformed human figures, which is consistent with previous results [8][5]. As noted in Section 6, there are a few annotators with low F-measures in each of the curves, which implies that these raters’ errors are independent of image distortion. This supports the theory that they are due to a failure to follow the instructions rather than an inability to recognize objects. 7.3 Detector Performance Degradation Qualitative Results In Figure 9 we compare the top detections per method on paintings from bucket 2 versus bucket 5. The top discriminative-patches ac- tivations on these two buckets (Right) help visualize the difficulty in detecting meaningful mid-level parts as the paintings become more abstract. Detecting People in Cubist Art 13 Fig. 9: The degradation in performance with image difficulty. Top five detections per method (rows correspond to: DPM, Poselets, R-CNN, D&T) on images from bucket 2 (Left) and bucket 5 (Middle). False positives are marked in blue. This comparison shows that all detection methods perform worse on more distorted images. (Right) The degradation in performance is also evident in the detection of the parts in isolation. Top 10 discriminative patches activations for images of bucket 2 and bucket 5, with corresponding activations on PASCAL images. All top activations on bucket 2 are true positives compared to only one from bucket 5 (in green). Quantitative Results Figure 8(Bottom Left) shows the precision recall curves for the DPM method with varying distortion buckets. As with human perfor- mance, increasing the image distortion causes a pronounced decrease in method performance. The overlap between the curves for buckets 1 and 2 may be due to variance as a consequence of the low number of images in those two buckets (see Figure 8(Top Left)). 7.4 Comparing Human and Detector Degradation Figure 8(Bottom Right) shows the F-measures for humans and each method with varying distortion buckets. This figure shows that detector performance degrades in a similar pattern to human performance as images become increasingly dis- torted. A closer look at the degradation patterns reveals that the part-based methods follow the human pattern most closely, which matches our intuition about the similarities between part-based object detection and the human vi- sual system. In contrast, the template-based Dalal and Triggs method abruptly breaks down after bucket 1. 7.5 Discussion of Degradation with Increased Deformation As we have demonstrated, part-based models for object detection show a smooth degradation in precision and recall as the distortion of the images increases. This is consistent with results from neuroscience, which indicate that humans are capable of detecting objects cut into parts, but that their ability degrades 14 Shiry Ginosar, Daniel Haas, Timothy Brown, and Jitendra Malik significantly when the parts are scrambled. The correspondence between human and computational method performance on this task suggests that a part-based object representation might be a good approximation for the mechanisms of ob- ject detection in the human brain. The ability to model these mechanisms com- putationally further corroborates the neuroscience theory of part-based object detection strategies. This is encouraging, even though current methods cannot yet perform at the level of human vision. We note that the correspondence between part-based models and human perception is a somewhat surprising one. At their core, the methods we used are based on HOG features that we expected to be highly dependent on image statistics. It was pleasantly surprising to observe the correlation between these methods, trained on natural images, and human perception on Cubist paintings with completely different statistics. We believe that better performance could be achieved using part-based models that rely on higher level features than HOG. 8 Conclusion Since human perception acts as a natural benchmark for computer vision, we should strive to understand its performance at its limits. In this paper, we have argued that object detection under form deformation is an example of a challeng- ing perception task that existing image benchmarks do not properly evaluate. We have proposed the use of Cubist paintings as a more appropriate benchmark, as they contain rearranged object parts that are nevertheless recognizable as whole objects. Using this benchmark, our evaluation comparing human performance to that of various object detection methods demonstrates that part-based mod- els are a step in the right direction for modeling human robustness to object deformation, since their performance degrades comparably to humans as the ab- stractness in images increases. By showing that these models can be trained on photographic data yet still perform on abstract data we demonstrate that they are less over-fit to the natural world than template-based and deep models. Part-reorganization in Cubism is one example that pushes the envelope of human perception, but there are other artistic movements with characteristic deformations, such as the use of blurring in Impressionism, that would provide rich grounds for study. Future work in the design of computer vision methods should be cognizant of the limitations of traditional camera snapshot datasets and look for complementary resources when evaluating computational methods’ ability to mimic the wide range of human perception. 9 Acknowledgments The authors would like to thank Mark Lescroart for his guidance and advice throughout this project and Bharath Hariharan, Carl Doersch and Katerina Fragkiadaki for their insightful comments. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE 1106400. Detecting People in Cubist Art 15 References 1. Laporte, P.M.: Cubism and science. The Journal of Aesthetics and Art Criticism 7(3) (1949) pp. 243–256 2. Wiesmann, M., Ishai, A.: Training facilitates object recognition in cubist paintings. Frontiers in Human Neuroscience 4 (2010) 11 3. Ishai, A., Fairhall, S.L., Pepperell, R.: Perception, memory and aesthetics of inde- terminate art. Brain Research Bulletin 73(46) (2007) 319 – 324 4. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part based models. Pattern Analysis and Machine Intelligence (PAMI) 32(9) (2010) 5. Sinha, P., Torralba, A.: Detecting faces in impoverished images. Journal of Vision 2(7) (November 2002) 6. Ullman, S., Vidal-Naquet, M., Sali, E.: Visual features of intermediate complexity and their use in classification. Nature Neuroscience 5(7) (July 2002) 682–687 7. Lewis, M.B., Edmonds, A.J.: Face detection: Mapping human performance. Per- ception 32(8) (2003) 903–920 8. Tsao, D.Y., Livingstone, M.S.: Mechanisms of face perception. Annual Review of Neuroscience 31 (2008) 411–437 9. Grill-Spector, K., Kushnir, T., Hendler, T.: A sequence of object-processing stages revealed by fMRI in the human occipital lobe. Human Brain Mapping 6(4) (1998) 316–328 10. Vogels, R.: Effect of image scrambling on inferior temporal cortical responses. Neuroreport 10(9) (June 1999) 1811–1816 11. Freiwald, W.A., Tsao, D.Y., Livingstone, M.S.: A Face Feature Space in the Macaque Temporal Lobe. Nature Neuroscience 12(9) (September 2009) 1187–1196 12. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Volume 2. (2005) 886–893 13. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose annotations. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2009) 1365–1372 14. Bourdev, L.D., Maji, S., Brox, T., Malik, J.: Detecting People Using Mutually Consistent Poselet Activations. In: Proceedings of the European Conference on Computer Vision (ECCV). (2010) 168–181 15. Singh, S., Gupta, A., Efros, A.A.: Unsupervised discovery of mid-level discrimi- native patches. In: Proceedings of the European Conference on Computer Vision (ECCV). (2012) 16. Doersch, C., Singh, S., Gupta, A., Sivic, J., Efros, A.A.: What makes paris look like paris? ACM Transactions on Graphics (SIGGRAPH) 31(4) (2012) 17. Akselrod-Ballin, A., Ullman, S.: Distinctive and compact features. Image and Vision Computing 26(9) (September 2008) 1269–1276 18. Nelson, R.C., Selinger, A.: A Cubist Approach to Object Recognition. In: Pro- ceedings of the IEEE International Conference on Computer Vision (ICCV). (1998) 614–621 19. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for ac- curate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014) 20. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks. In: International Conference on Learning Representations (ICLR 2014), CBLS (2014) 16 Shiry Ginosar, Daniel Haas, Timothy Brown, and Jitendra Malik 21. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html 22. Vondrick, C., Khosla, A., Malisiewicz, T., Torralba, A.: HOGgles: Visualizing Object Detection Features. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2013) 23. Girshick, R.B., Felzenszwalb, P.F., McAllester, D.: Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/ rbg/latent- release5/ 24. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.J., Fergus, R.: Intriguing properties of neural networks. CoRR abs/1312.6199 (2013)