SOME WELL-KNOWN MENTAL TESTS EVALUATED AND COMPARED

BY DOROTHY RUTH MORGENTHAU

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, in the Faculty of Philosophy, Columbia University

REPRINTED FROM ARCHIVES OF PSYCHOLOGY
R. S. WOODWORTH, Editor
No. 52
NEW YORK
May, 1922

TABLE OF CONTENTS

Introduction
Subjects
Tests Briefly Described and Reasons for Their Selection
Method: Applying Tests to Subjects
    General Considerations
    Specific Observations on the Application of the Tests Selected
Results
Conclusion

ACKNOWLEDGMENT.

For the advice of Professor Edward Lee Thorndike of Teachers College, Columbia University, and of Dr. William Healy and Dr. Augusta Bronner of Judge Baker Foundation, Boston, the writer wishes to express appreciation. Special thanks are due for the painstaking assistance given by Professor Robert S. Woodworth of Columbia University, New York City.

Some Well-Known Mental Tests Evaluated and Compared

One who approaches the subject of the measuring of children's mentality will find that the mind of the normal child has received attention in what we may call vertical and parallel respects. There have been a considerable number of tests developed by students of psychology in the endeavor to secure mental measurements independent of the experience and judgment of the clinician. The development has been in a vertical manner, that is, the best recognized psychologists who have undertaken this work have each developed tests, have each put them into extensive practice and have published the results of that experience.
But each of these psychologists has developed his test on his own suppositions, and, basing the nature of his test on his own experience, has tried to evolve a plan of testing which is supposed to be useful in determining mental conditions of such general extent that they may roughly be called intelligence. Thus we have the Stanford-Binet scale, the most generally used of all the mental tests. Then there are the Porteus tests, the Pintner-Paterson performance scale, and a dozen or more others which are known to every clinical psychologist.

The development of mental tests has been parallel in that none of these psychologists, in developing his own ideas, has carried them to the point of thoroughly comparing the results obtained by his test with the results obtained by the simultaneous use of a number of the other tests, all with respect to normal children. There has been some comparison of the results of the various tests when applied to abnormal children, but this has not been thoroughgoing, and it has been done not by giving the tests with the idea of eventually combining the results and placing valuations upon them, but merely in the course of clinical work with abnormal children. It is questionable whether such results are sufficiently thorough to be considered the basis for a convincing answer as to the relative value of the respective tests, and inasmuch as they were obtained on abnormal minds, one would not dare to trust even those comparative results with respect to what the tests will show as to normal minds.

Those who have developed their respective tests have compared them with some other mental test, most frequently one of the Binet revisions.
But no considerable number of the tests which have been so developed in parallel fashion have been applied purposely to obtain comparative results and to ascertain which, if any, of them can be shown to be untrustworthy and what group of them can be relied upon as furnishing a satisfactory schedule for testing and comparing the common elements of mentality in normal children.

Upon perceiving that there was a lack of any purposely made comparative study of mental tests, it was proposed to set forth herein the results of such a study of about a dozen of the most commonly used mental tests. The tests were applied to a large number of unselected normal children, in general each child receiving the full schedule of tests. By means of the results to be obtained from this comparative study it was anticipated:

1. That the degree of reliability of each test would be indicated.

2. That the value of each test would likewise be indicated.

3. That the information obtained under the first and second headings would make it possible to select a schedule of tests of indicated reliability for application to normal minds, or, further, to decide whether the Stanford-Binet alone would suffice.

4. That by restricting the ages of the children tested in general to from ten to sixteen years, the period in which individual capacities first assume importance for vocational determination, it would be possible to guide vocational training with some degree of success.

A brief statement of the results can now be given, reserving for later pages the more detailed statement of the basis and methods for the results. The first aim, to secure an estimate of the reliability of the tests used, was largely successful. Of the thirteen tests the reliability of which was investigated, one class, the four construction tests, Healy A and B,
and the Knox Moron and diamond-shaped frame tests, were found to be unreliable; five other tests were found to be reliable, namely the Stanford-Binet, the Pintner Non-Language group test, Thorndike Reading Scale Alpha 2, the Porteus Maze test, and the Tapping test; while the reliability of four tests, the Myers Mental Measure, Healy Pictorial Completion test II, the Healy-Bronner learning tests, and the Crossline test, was undetermined for various reasons.

The results obtained as to the value of the tests were as follows: Stanford-Binet, Pintner, Alpha 2, and Porteus are valuable tests and should be included in individual case studies. In spite of their unmeasured reliability, Myers and Pictorial Completion II are also valuable tests and should likewise be included. Judgment should be suspended with regard to the learning tests. The Tapping test is of doubtful value and its use should be left to the discretion of the examiner. The Construction tests, because of their unreliable character, do not give valuable results.

As to the schedule of tests to be used in testing normal minds, it was found best not to use the Stanford-Binet alone but to have the schedule composed of that test and the five others which were found valuable.

From the tests used and results obtained it cannot be stated here whether this schedule is of value as to vocational guidance, for the reason that the factors involved in each test are not known with certainty, and until they are known, definite valid conclusions about the abilities of the individuals concerned cannot be reached.

SUBJECTS.

It was desired to test one hundred normal but otherwise unselected children. In order to obtain an unselected group it proved necessary to select the subjects very carefully, for, if all the children tested had been from a Children's Home, or from a Settlement, or from any one school, the result would have been a highly selected group.
To avoid this a few were taken from many different sources, and in this respect the distribution proved to be reasonably satisfactory.

As to age, originally the plan was to have about ten children at each of the ten periods of one year each, from seven to sixteen inclusive. But this plan was given up because our interest is not with the six or seven year old, who has to go to school and learn fundamentals no matter wherein he is gifted, and who rarely shows talents or handicaps at such an early age. Our chief concern is with children in the sixth, seventh or eighth grades and in high school, because they are the adjustment problems, and because it is important to aid them if possible in deciding whether they should remain in school or go to work. If the latter, what should they do; if the former, what sort of training do they need? So the attempt was made to lay all the emphasis here and reduce the number of children under eleven to a minimum. Another objection to the original plan is that ten in a group is too small a number for any kind of generalization.

The total number tested was 128, of which 116 usable records were retained. For various reasons many of these records are incomplete, so that this number was necessary in order to have a minimum of 100 scores on each test. There still remain some tests which were given to fewer than 100 children, but the number is in each case sufficiently large to give valuable results.

All defectives were excluded, for in mixing their records with those of normal children many confusions would have arisen, and the issues would have been less clear. Much intensive work has been done in testing defectives, so that we know a great deal about their reactions to a group of tests such as we have chosen.
To be sure, they vary considerably in their results, but we know in general the points where they are weakest, as in abstract reasoning and formal generalization, and also the points in which proportionately they excel. By narrowing the field to normals the significance of the conclusions can be made more pertinent.

This was an arbitrary procedure dependent largely upon the judgment of the writer, and subject to criticism on this basis. It is quite possible that some very dull normals were also excluded, this being justified on the grounds that their normality might reasonably have been called in question by more severe examiners. With reference to the three cases whose I. Q.'s fall below 80, there seems to be no doubt that they are to be considered as dull normals. The grade they attained in school for their age, their response on the other tests and their behavior in the community all argue for including them in our study. The boy receiving the lowest I. Q., 73, was born in the United States but taken to Italy at the age of five, and remained there six years. In spite of this he was in the eighth grade. He did very well with all the construction tests.

As no limitations were set at the other end, the grade and I. Q. distributions are higher than one would otherwise expect in a general sampling of the population.

I. Thirty-seven children, twelve girls and twenty-five boys, were tested at the Home for Jewish Children in Dorchester, Massachusetts. Many of these children were half orphans, some had lost both parents; most of them were in the Home temporarily. They were chosen from the total number entirely by chance. They all attended public school in the vicinity and all but two or three had come to the Home within two years. All were able to speak and understand English, this being the only language used at the institution, although in many of their homes no English was spoken. Their ages ranged from 7-0 to 15-1.

II.
Twenty-four girls came from Frances Willard Settlement in Boston, Massachusetts. These were divided into three clubs, one consisting of one seventh grade and ten eighth grade girls, the youngest being 12-7 and the oldest 14-2. They came one evening a week for the express purpose of taking the tests. They were the first ones to volunteer from a large group. The other two groups, of seven and six respectively, were younger girls who happened to meet on afternoons which were convenient for the examiner.

III. Six high school girls in New York volunteered to take the tests.

IV. The ninth grade, consisting of six boys and five girls, in the Woodmere School (private) at Woodmere, Long Island, was tested. The ages ranged from 13-1 to 15-2.

V. The poorer section of the 8B class of Public School 11, New York, was tested. There were thirty boys in the class, ranging in age from 13-2 to 16-11.

VI. Finally eight miscellaneous children were tested.

The subjects selected appeared to give a satisfactory difference in quality so as to bring out the capacities of the tests to meet a variety of normal mental conditions.

TABLE I
AGE DISTRIBUTION, 116 CASES

[Table: frequency of subjects at each age in years and months, from 7-0 to 16-11.]

Distribution of the subjects by age. It will be noted that only 19 of the 116 subjects are under eleven years old.

TABLE II
GRADE DISTRIBUTION, 114 CASES. 2 HAD LEFT SCHOOL.

Grade              Frequency
I                       1
II                      2
III                     2
IV                     11
V                       7
VI                     16
VII                    10
VIII                   44
IX or I H. S.          12
X or II H. S.           4
XI or III H. S.         5
XII or IV H. S.         -

Left School: VIII, 1; II H. S., 1.

The vast majority of subjects were in the VIth to IXth grades inclusive.

[Figure: Intelligence quotient distribution, 112 cases. Scale: 1 square to 1 child; 70 means 70.000 to 79.999, etc. The curve of distribution is skewed positively.]

TABLE III
DISTRIBUTION OF INTELLIGENCE QUOTIENTS, 112 CASES

I. Q.  Frequency     I. Q.  Frequency     I. Q.  Frequency
  70       -           95       2          120       -
  71       -           96       3          121       1
  72       -           97       4          122       2
  73       1           98       2          123       -
  74       1           99       3          124       -
  75       -          100       7          125       1
  76       1          101       7          126       2
  77       -          102       4          127       2
  78       -          103       2          128       1
  79       -          104       -          129       -
  80       1          105       6          130       3
  81       -          106       4          131       1
  82       -          107       3          132       1
  83       -          108       2          133       -
  84       1          109       2          134       1
  85       2          110       1          135       -
  86       2          111       1          136       1
  87       1          112       1          137       -
  88       2          113       3          138       -
  89       3          114       2          139       -
  90       1          115       1          140       -
  91       2          116       4          141       1
  92       1          117       1          142       -
  93       4          118       2          143       -
  94       4          119       2          144       -

4 were not given the Stanford-Binet test.
Average 104.5
Mental age in months: Average 154.8; Mean Square Deviation 34.54

The table shows that very few of the children tested had I. Q.'s below normal.

Age-Grade Distribution

                         CHRONOLOGICAL AGE
Grade   6.5   7.5   8.5   9.5  10.5  11.5  12.5  13.5  14.5  15.5  16.5   Total cases
I         -     1     -     -     -     -     -     -     -     -     -        1
II        -     2     -     -     -     -     -     -     -     -     -        2
III       -     -     1     -     -     -     -     -     -     -     -        1
IV        -     -     1     6     3     2     -     -     -     -     -       12
V         -     -     -     2     3     1     1     -     -     -     -        7
VI        -     -     -     -     -     6    10     -     -     -     -       16
VII       -     -     -     -     -     1     5     2     2     -     -       10
VIII      -     -     -     -     -     -     5     9    15    11     6       46
IX        -     -     -     -     -     -     -     7     3     1     -       11
X         -     -     -     -     -     -     -     1     2     2     -        5
XI        -     -     -     -     -     -     -     -     -     1     4        5
Total     -     3     2     8     6    10    21    19    22    15    10      116

Chronological age and mental age distribution.

[Table: rows are mental ages 6.5 to 19.5; columns are chronological ages 6.5 to 16.5. The column alignment of the entries is lost; each row's entries are listed in column order, followed by the row total.]

Mental age   Entries                         Total cases
 6.5         -                                    -
 7.5         1                                    1
 8.5         2 1 1 1                              5
 9.5         3 1 1 1                              6
10.5         1 2 3 2 2                           10
11.5         1 1 3 1                              6
12.5         1 1 3 4 2 3 1                       15
13.5         4 1 4 1                             10
14.5         2 3 2 6 6 3                         22
15.5         2 4 8                               14
16.5         3 4 2 1 1                           11
17.5         2 1 1                                4
18.5         3 1 1                                5
19.5         1 2                                  3
Total        3 2 7 6 10 20 17 20 16 10          112

Mental age and grade distribution.

[Table: rows are mental ages 6.5 to 19.5; columns are school grades I to XI, with total cases.]

TESTS BRIEFLY DESCRIBED AND REASONS FOR THEIR SELECTION

The large number of tests available had to be classified so as to find which tests covered identical ground; only one of these was then selected. Time was an element particularly to be considered, since preferably less than three hours should be devoted to each child for the completion of all tests. This allotment of time is considered by most authorities to be generous, particularly since the Stanford-Binet takes nearly three quarters of an hour, thus leaving only two hours for all the other tests. Consequently, between alternate tests apparently serving the same purpose the briefer one was chosen. The same limitation on the amount of time to be spent on any one individual caused the necessary omission of some tests which were highly desirable except as to their length. In the last mentioned class are group tests requiring an hour or more to be applied. Where the results that are sought can be reached by group tests doubtless much time can be saved in using them, but the inquiries involved herein were such as to necessitate largely individual testing.
In selecting the tests another danger that was realized, and that it was attempted to avoid, although, as the results show, not with entire success, was that a great many tests involve many sides of mental activity, so that the final result expressed numerically would not indicate which mental abilities had tested favorably and which unfavorably. For instance, ability to deal with abstract and with concrete material may be extreme opposites, giving a perfect negative correlation of minus 1.00. If both kinds of material are combined in one test, the child who succeeds in one may fail in the other and vice versa. In computing the final scores, compensation will give the same net result to two children of exactly opposite capabilities. If general intelligence is what we want we may find it in this way, but if we are interested in special abilities or disabilities, tests which hide them must not be used. We have found this confusion to exist in many tests, of course never in such an extreme form as in the illustration above, and undoubtedly introduced on purpose, but we feel that its value is at all times questionable. This error is extremely difficult to eliminate completely; in fact we cannot be sure even now, as will appear later in the results, that we have successfully done so.

Another source of error too often overlooked was borne in mind in the selection of the tests, namely the variability of the test that is being considered. Where the same test is applied to a person at intervals and it is found that the resulting scores are not identical, the question arises whether the varying scores can be combined so as to give a reliable standard for use and comparison with the results obtained when the test is applied to other children, or whether the variation indicates an unreliability in the test itself sufficiently serious to warrant the test being discarded.
As an example of variations of such minor character that their existence does not indicate unreliability, and which can be compensated, we can take the tapping test, where there may be a variation of about five taps in each direction from the average, which would be entirely satisfactory. Such variations are due to unessential and insignificant details of the conditions under which the test is repeated, such as posture of the child being tested, kind of pencil or stylus being used, etc. Taking ten or more measures of tapping ability would increase the reliability, but the final results would show such slight difference from the result of one or two trials that the frequent repetition is entirely uncalled for to secure reasonable reliability.

On the other hand, if the variations in result obtained by repeated use of a test on the same individual are not of a minor character, and if the day-to-day variability is so erratic that the variation is all the way from good performance to poor performance, then the situation is either that the child tested is shown to be subject to mental disturbance, or that the test itself shows a high and dangerous variability. If it is the test that is variable, it is obviously essential to weed it out ab initio. Such variability has been found to exist in the Knox cube test, in the application of which a uniformly normal child may make the record of an imbecile one day and of a super-normal child the next day. Of course, such a test, if not eliminated, would lead to results that are valueless for comparative purposes and dangerous for diagnostic ones.

As to variability, the reliability of a number of tests was established and recorded before the study was undertaken.
As to the remaining tests, in order to overcome the possible existence of variations indicating unreliability it was necessary to retest each child with the same or with a similar test after an interval of a week: no less, or practice effect would be met; no more, so as to avoid the effect of any mental growth in the interval.

The necessity of retesting, caused by possible variability in the test itself, led to the subordinate but difficult problem of determining what methods of retesting would avoid errors due to the process itself. Thus, as has been mentioned, retesting must be done in such a manner as to avoid practice effect. It has been shown by various workers that certain types of tests, such as most puzzles, once solved are no longer tests at all, whereas others, such as auditory memory for digits and psychomotor control, show a minimum effect, which, after the week between tests, is negligible. Those of our tests which come within the last-mentioned class were simply repeated. Those which were of the former type had similar tests substituted for them in the second trial, while still others falling between these classes were altered in details so that the same test could be repeated, avoiding the memory aspect.

The tests finally selected were:

1. The Stanford revision of the Binet-Simon scale. This test is so widely known that it does not seem to be necessary to describe it here.

2. Pintner's mental survey non-language group test, with Myers Mental Measure as an alternate for repeating. These tests involve a minimum use of language. In the Pintner test no language is used in the performance, and in fact it is possible to give this test to foreigners or deaf children through the medium of signs, while in giving the Myers Mental Measure it is necessary for the subject to understand simple language, but none is used in executing the test.
The Pintner test has six parts, the first resembling the Knox cube test, the second and third being substitution tests, the fourth a drawing completion, while the fifth is a reversed drawing test, and the sixth a picture reconstruction. Following Directions, Pictorial Completion, and two tests of picking out objects with common elements compose the Myers test.

3. Thorndike's reading scale Alpha 2. This is a test in which language plays a prominent part. The subject reads a paragraph and then reads certain questions based upon the paragraph, to which he writes his answers. To succeed he must understand the context of the paragraph, he must understand the question and know what it calls for, and he must be able to find the answer in the context and write it down. This is a graded test which is applicable from the second grade through high school. Since the practical work of this research was undertaken, Dr. McCall of Teachers College has considerably increased the usefulness of this test by devising ten sets identical in method but with different contents, of which the test here used is one. It is now known as the Thorndike-McCall reading scale and its reliability has been thoroughly established.

4. Healy's Pictorial Completion Test II is an apperception test with the language element omitted. The ten pictures (plus one sample) present a day's activities of a young school boy, in which each picture contains a situation known to every child, such as eating breakfast, the school cloak room, a street accident, etc. In each picture one important element is lacking; pieces which complete the picture, plus fifty more of the same size, are arranged in a definite order in a box from which the subject is at liberty to choose those which he desires. A clue to the missing piece is furnished by the pictures.

5. Porteus Maze Tests, Vineland Revision, 1919. These tests are supposed to measure social fitness and common sense.
Among the capacities which they were devised to measure are forethought and planning capacity, prudence, and mental alertness in meeting a situation new to experience. There are eleven mazes, graded in difficulty from year three to fourteen. Beginning with year five, avoidance of blind alleys is the main requirement for a successful performance. The more complex the maze, the further ahead must one look in order to be certain that one is choosing the correct path. There is no time limit; in fact no mention of speed is made, and if the child asks he is told to do it as well as possible, taking as long as he likes. Porteus says that children fail mainly because of impulsiveness in action, overconfidence and carelessness, lack of pre-consideration, lack of planning capacity, irresolution and mental confusion, inability to sustain attention, or to profit by past mistakes.

6. Tapping Tests, Healy's Form. This consists of a sheet containing one hundred and fifty half inch squares, arranged ten in a row in fifteen rows. The subject taps once in each square, without touching the lines, and covers as much ground as he can in thirty seconds. This is a simple test of psychomotor control which was repeated without alteration.

This test in a slightly different form was first introduced by Cattell in 1896, for testing freshmen at college. He had one hundred 1 cm. squares, into each of which the student must put a dot, completing the task as quickly as possible. Time was recorded; evidently there were no errors. This test was supposed to measure rate of movement. Clark Wissler used it with many of Cattell's other tests in his "Correlation of Mental and Physical Tests" on college freshmen in 1901. He found that the average time for men was 34 seconds, for women 30.8 seconds. In 1911 Whitley (M. T.
Whitley, An Empirical Study of Certain Tests for Individual Differences) reports results on Cattell's test, in which she kept the time constant (30 seconds) but computed the length of time which it would take to complete the blank. We have found the additional fifty squares useful in that some of our cases marked over one hundred squares in the thirty second time limit.

7. Healy's Construction Tests A and B. The Knox Moron test and the Knox modification of Healy A, a diamond-shaped frame, were used as alternates. We have called these A and B respectively to correspond with the Healy tests and for convenience. The equipment for these tests consists of a board containing one or more openings into which the child tested is supposed to fit pieces of wood so shaped that when properly arranged they will just close up the apertures. An advantage of these tests is the convenient size of the materials required. As all materials had to be carried from place to place, the use of clumsy form boards or the tapping board with its dry batteries, metal plate and stylus was practically out of the question. Where other things were equal, tests having the least paraphernalia were to be preferred.

8. The Crossline Tests shown in the figure were also given. The crossline tests were included because they are a modification of the famous Code test, which is generally considered one of the best in the whole Stanford-Binet series. They take very little time to give and can easily be modified for repetition.

[Figure: I. Crossline Test. II. Crossline Test, with the digit arrangements 1 4 7 / 2 5 8 / 3 6 9 and 1 2 3 / 4 5 6 / 7 8 9. Forms (a) and (c) are the forms used generally. Forms (b) and (d) were used for retesting.]

9. Healy and Bronner Learning Tests. These tests were devised to test learning ability, not as in the skill experiment, but as it is found essential in the elementary school subjects. Learning test A, the association of two symbols, a figure and a number, resembles other substitution tests such as those of Woodworth and Wells, Pintner, and especially Woolley. The difference lies in the fact that three trials were given and speed of learning determined success. Learning test B is the association of a symbol with a sound, as in learning a language. The symbols are from the Phoenician alphabet, and the sounds consist of one or two consonants and a vowel, simple enough to pronounce but without meaning. This prevents older children from forming associations which would be impossible for those who did not know the meaning of the syllables. Test C is the association of a symbol and a value presented audibly, and test D is the association of ideas with a picture. The first three test a sort of rote ability, whereas the last tests learning of ideas.

It seems reasonable that success in school work may depend as largely upon learning ability as upon mental capacity, especially in the early grades, where the chief requirement in most of our schools is a good rote memory, as in learning multiplication tables, and these two do not necessarily go together. Certain clinical cases bear out this suggestion, and these tests were included to ascertain the reactions of normal unselected children in this respect.

National Intelligence tests were not yet published in November and December, 1919, when this study was begun, or they would surely have been considered and very likely used.

METHOD: APPLYING TESTS TO SUBJECTS

General Considerations

All of the tests except the non-language group tests and the Thorndike Alpha 2 were given to the subjects individually. The non-language tests were given sometimes individually and sometimes in groups of about ten, with one exception where thirty eighth-grade boys were tested in a group.

The time of day at which the tests were given varied considerably.
About fifty of the subjects, from the Children's Home and from the Settlement, were tested in the evening. All others were tested in the daytime.

Care was taken to avoid giving any tests while the subject might be fatigued. Each child was questioned regarding the matter, and whenever there were indications of fatigue the testing was always postponed. Usually a subject was tested for only an hour and a half at one time; frequently the duration of the testing was shorter and only occasionally was it longer.

The tests were all scored according to the directions laid down by their respective authors. They were all scored personally by the examiner twice. In all of the tests selected for use the scoring is objective and requires no technique. Where possible, score cards or keys were used. Where the time taken by a subject to complete a test was to be recorded, the timing was done by means of a stop watch.

Much effort was expended in persuading the subjects to give an equal amount of attention and concentration to all of the tests, so that the results would not be affected by individual preferences. For a large proportion of the subjects the incentive of vocational guidance was offered, and some general vocational advice, based partly on the experience of the examiner as well as on the tests, was given at the conclusion of the testing. Younger children needed no incentive, and their enthusiasm was so pronounced that they continually applied to take more than the regular number of tests.

Supplementary information concerning the subjects was gathered and recorded, especially age in months, school grade, success in school work, marks, standing in class, whether a repeater and how often, whether the subject skipped any grades, etc. The vocational plans and interests of the older children were obtained whenever they had any. Results of physical examination were obtainable for a large per cent of the cases.
Several subjects had also been given neurological examinations. Occasionally some result can be explained by reference to these findings, as for instance an unaccountably poor performance on the Healy Pictorial Completion test which was probably due to uncorrected vision. One case where peculiar results were obtained from the tests was explained by the physical examination, which showed a history of epilepsy, and thereupon the case was no longer considered.

Specific Observations on the Application of the Tests Selected

Stanford-Binet. In the United States there have been several revisions of the Binet-Simon test, the most recent, and also the best, of these being that by Professor Lewis M. Terman of Leland Stanford University, California, published in its final form in 1916. This revision, called the Stanford-Binet, was the one used in this study. The score obtained in the Stanford-Binet test is expressed in years and months, mental age. This mental age, when divided by the life age, results in the intelligence quotient, which is expressed as a decimal. There have been some wrongful uses of the intelligence quotient. It is an attractive but erroneous idea that a certain intelligence quotient can be found below which all can be considered feeble minded while all above are normal or supernormal. The error in this idea has been pointed out by Fernald, Mateer, Kohs, and others, who demonstrate the degree of overlapping, and show how valueless the I. Q. is when reported without reference to life age.

The Stanford-Binet results can be analyzed, as well as summed up in the I. Q., and it is possible that a detailed analysis of the data would yield all the information required. The plea that general intelligence scales have a right to be so called is largely based upon the supposition that the functions which are tested are manifold.
Auditory memory for rote material and for ideas, visual memory, language ability, reasoning ability, apperceptions, general information and many other abilities: all are found within the total range of tests. Unfortunately, in the Stanford scale no child gets tested in all these fields, and further, since they are not standardized separately, the significance of success or failure in one part is difficult to determine.

The Stanford-Binet tests were all given by the writer in the manner described by Terman. It is unnecessary to repeat this test in order to establish its reliability, as the reliability has been independently reported upon by Terman.

The vocabulary and memory span for digits of the Stanford-Binet were given with the Porteus tests, the remainder of the Stanford-Binet taking only one session.

The Alpha 2 Reading Scale was scored by the method worked out by Kelley and his tables were used.

The tapping test was scored for number of taps and errors. In the construction tests number of moves and time were taken, and when the test was not completed within the limit of five minutes it was scored as a failure and the number of moves up to that time was noted.

A construction test, once solved, is much easier to solve a second time unless the first solution was due to chance. Healy A was repeated in order to check the first performance. In Healy's construction test B a second trial generally brings a result as near perfect as possible (that is, dependent only on skill and speed in motor performances), even if the first solution was hit upon by chance. It is impossible to do away with the chance element in performance tests, but in order to guard against it as much as possible, two tests were used each time, and the selection was made after a study of many types.
There are several difficulties in making this choice, and we were impressed by the fact that most performance tests have not been standardized and that there are very few tests of this kind which are sufficiently difficult for older subjects. The Healy and Knox tests satisfied both of these conditions.

The scoring for the learning tests is rather complicated. A perfect score on all four tests is four hundred, one hundred being the perfect score for each test. Learning test A has twelve elements, and if these were all correct on the three trials, thirty-six elements would receive a mark of one hundred, or each would get 2.8. Thus the score equals the number correct multiplied by 2.8. When a perfect score is made on the first or second trial, it is assumed that further trials would give a perfect score also. In learning test B there are five symbols in each of the three trials; consequently each receives a value of 6.7. In test C there are seven symbols and three trials. Dividing one hundred by three times seven there results a value of 4.7 for each, while in test D, which has ten items, the total number is thirty, with a value of 3.3 each. The total for all the tests is the sum of the score on each of the four.

RESULTS

Where a clinician is generally satisfied to take the score obtained by applying a test as a final goal, if in fact he goes so far as to work out a score, it is obvious that to attain the purposes here in mind the scores of the various tests used must be compared to gather statistics reflecting their qualities. That is, when the one hundred and sixteen children had been given the tests that were selected and when the scores were recorded, the field work was completed, but there remained to investigate in a laboratory manner what a combination of the results would show with reference to the purposes of this study.
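The learning-test scoring rule described earlier in this section (each correct element is worth one hundred divided by the number of elements times the three trials) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the example counts are hypothetical, and the code uses the exact quotient rather than the rounded per-element values quoted in the text (2.8, 6.7, 4.7, 3.3). Note that 100/21 is about 4.76, so the 4.7 given for test C appears to be truncated rather than rounded.

```python
# Elements (symbols, items) per trial for each learning test, as stated in the text.
ELEMENTS_PER_TRIAL = {"A": 12, "B": 5, "C": 7, "D": 10}
TRIALS = 3
PERFECT = 100  # perfect score on each single test


def element_value(test):
    """Credit for one correct element: 100 / (elements x trials)."""
    return PERFECT / (ELEMENTS_PER_TRIAL[test] * TRIALS)


def learning_score(test, correct_per_trial):
    """Score one learning test from the number correct on each trial given.

    If a perfect trial occurs on the first or second trial, the remaining
    trials are credited as perfect, following the rule quoted in the text.
    """
    n = ELEMENTS_PER_TRIAL[test]
    counts = list(correct_per_trial)
    if not counts or len(counts) > TRIALS or any(c < 0 or c > n for c in counts):
        raise ValueError("need 1 to 3 trial counts, each between 0 and n")
    if len(counts) < TRIALS:
        if counts[-1] != n:
            raise ValueError("trials may be omitted only after a perfect trial")
        counts += [n] * (TRIALS - len(counts))
    return PERFECT * sum(counts) / (n * TRIALS)


def total_learning_score(results):
    """Sum of the four test scores; a perfect total is 400."""
    return sum(learning_score(test, counts) for test, counts in results.items())


if __name__ == "__main__":
    # Hypothetical record for one child (not data from the study).
    example = {"A": [9, 12], "B": [3, 4, 5], "C": [7], "D": [6, 8, 10]}
    for test, counts in example.items():
        print(test, round(learning_score(test, counts), 1))
    print("total", round(total_learning_score(example), 1))
```

Computing from the exact quotient avoids the small drift that would accumulate if the rounded multipliers were applied to thirty-six elements.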
This comparison of results was made by correlation, that is, by measuring the mutual implications (see Thorndike, Mental and Social Measurements, pp. 156-185). A test is to be evaluated in three ways: its correlation with criteria other than results of tests; its self-correlation; and its correlation with other tests. In the present inquiry we obtained no outside criteria with which to correlate our tests, because no outside criteria available could be relied upon. In the field of mental abilities, the only criteria which have been widely used are teachers' opinions, school marks, etc. These are unsatisfactory at best. Although we possess all these data for our cases we consider them useless, since the children attended eight different schools in four places, with the marking systems varying for each. We compared judgments as to intelligence made by the teacher of the ninth grade of the Woodmere school with those made by the eighth grade teacher of the New York Public School. In the former the I. Q.'s varied from 95 to 141; in the latter from 73 to 116. In the former all but two children tested as supernormal and the class average was 121, whereas in the latter only one tested above 110, with a class average of 96. But to read the teachers' judgments one would think that the pupils of the latter school were considerably more intelligent than those of the former. Even the comparative ratings within one group were markedly unreliable. They showed all the errors of judgment pointed out by Terman. No account was taken of age; the best behaved, most conscientious pupil was invariably considered the most intelligent, etc. What is the use of making correlations with this kind of material, when one knows in advance that all the fault of a low correlation will be attributed to the criterion, and the tests will stand as before, unknown quantities!
Moreover, these criteria could only be used to represent a measure of general intelligence. The teachers admittedly knew practically nothing about the special abilities of their pupils; the parents, where consulted, knew very little more. A rating on general intelligence has been frequently correlated with general intelligence tests, and the results published. Our data would present no new factors.

Consequently we have not evaluated the tests by means of correlation with outside criteria, but we do have the data for self-correlations and for inter-correlations. Where various tests which we used intercorrelate extremely highly, we may feel that they are measuring the same thing. On the other hand, if the intercorrelations approach zero or are negative, the results indicate that we have no evidence that aspects of intelligence are being measured at all. Only if the correlations are sufficiently high to indicate that intelligence is being measured, and low enough to show that different factors are entering into the different tests, can we consider the tests worthy of being included in mental examination. In judging our correlations we must remember that we are testing normal children only (therefore our coefficients are lowered) and that our ages do not cover a large area, which also lowers the coefficients of correlation.

Our conclusions are limited to the tests we used, but the general method of dealing with the scores has a wide applicability.

TABLE IV
DISTRIBUTION OF PINTNER SCORES (100 CASES)

Score        Frequency    Score        Frequency
200-209.9    -            370-379.9    3
210-219.9    -            380-389.9    5
220-229.9    1            390-399.9    2
230-239.9    -            400-409.9    4
240-249.9    -            410-419.9    5
250-259.9    1            420-429.9    5
260-269.9    -            430-439.9    7
270-279.9    1            440-449.9    4
280-289.9    1            450-459.9    5
290-299.9    2            460-469.9    7
300-309.9    1            470-479.9    5
310-319.9    1            480-489.9    1
320-329.9    2            490-499.9    4
330-339.9    4            500-509.9    6
340-349.9    2            510-519.9    4
350-359.9    6            520-529.9    1
360-369.9    8            530-539.9    2

16 were not given the Pintner test. The evenness of distribution of scores is noticeable.
Average = 420.964. Unreliability 6.9.
Mean Square Deviation = 68.95. Unreliability 4.9.

TABLE V
DISTRIBUTION OF MYERS SCORES (90 CASES)

Score  Frequency    Score  Frequency    Score  Frequency
16     -            46     -            76     1
17     1            47     1            77     1
18     -            48     3            78     -
19     -            49     1            79     -
20     1            50     2            80     -
21     -            51     1            81     2
22     1            52     4            82     1
23     1            53     4            83     -
24     -            54     2            84     1
25     1            55     2            85     -
26     1            56     -            86     -
27     1            57     3            87     2
28     2            58     2            88     -
29     2            59     2            89     -
30     -            60     2            90     -
31     1            61     1            91     1
32     -            62     3            92     -
33     2            63     3            93     -
34     2            64     3            94     1
35     -            65     2            95     -
36     3            66     -            96     -
37     1            67     -            97     -
38     1            68     2            98     -
39     -            69     3            99     -
40     1            70     1            100    -
41     1            71     1            101    -
42     -            72     1            102    -
43     4            73     1            103    1
44     -            74     -            104    -
45     2            75     -            105    -

26 were not given the test.
Average = 53.325. Unreliability 1.8.
Mean Square Deviation = 17.88. Unreliability 1.3.

TABLE VI
DISTRIBUTION OF ALPHA SCORES (107 CASES)

Score  Frequency    Score  Frequency    Score  Frequency
3.6    2            5.4    1            7.3    7
3.7    -            5.5    1            7.4    6
3.8    -            5.6    1            7.5    12
3.9    -            5.7    1            7.6    2
4.0    -            5.8    -            7.7    6
4.1    3            5.9    2            7.8    1
4.2    1            6.0    -            7.9    2
4.3    -            6.15   1            8.0    2
4.4    -            6.2    3            8.1    1
4.55   1            6.3    -            8.2    2
4.6    -            6.4    2            8.3    2
4.7    3            6.5    1            8.4    1
4.8    1            6.6    3            8.5    2
4.9    2            6.7    5            8.6    -
5.0    1            6.8    4            8.7    -
5.1    3            6.9    3            8.8    1
5.2    6            7.0    1            8.9    -
5.3    -            7.1    -            9.0    1
                    7.2    7

9 were not given the Alpha test.
Average = 6.834. Unreliability .116.
Mean Square Deviation = 1.20. Unreliability .082.

TABLE VII
DISTRIBUTION OF PICTORIAL COMPLETION TEST SCORES (110 CASES)

Score         Frequency    Score         Frequency    Score         Frequency
-15 to 0      2            30 to 34.99   6            65 to 69.99   12
0 to +5       2            35 to 39.99   6            70 to 74.99   7
5 to 9.99     2            40 to 44.99   6            75 to 79.99   4
10 to 14.99   1            45 to 49.99   10           80 to 84.99   9
15 to 19.99   4            50 to 54.99   8            85 to 89.99   4
20 to 24.99   4            55 to 59.99   12           90 to 94.99   2
25 to 29.99   -            60 to 64.99   8            95 to 99.99   1

6 were not given the test.
Average = 54.527. Unreliability 2.16.
Mean Square Deviation = 22.69. Unreliability 1.5.
TABLE VIII
LEARNING TESTS DISTRIBUTION (106 CASES)

Score        Frequency    Score        Frequency    Score        Frequency
170-179.9    2            250-259.9    4            330-339.9    [missing]
180-189.9    -            260-269.9    7            340-349.9    [missing]
190-199.9    1            270-279.9    6            350-359.9    [missing]
200-209.9    -            280-289.9    5            360-369.9    [missing]
210-219.9    2            290-299.9    8            370-379.9    [missing]
220-229.9    1            300-309.9    10           380-389.9    [missing]
230-239.9    4            310-319.9    8            390-399.9    [missing]
240-249.9    1            320-329.9    7            400-409.9    [missing]

10 were not given the tests: 3 none at all; 7 not all four.
Average = 300.66. Unreliability 5.08.
Mean Square Deviation = 52.31. Unreliability 3.6.

TABLE IX
PORTEUS SCORES DISTRIBUTION (113 CASES)

Score  Frequency    Score  Frequency    Score  Frequency
5      2            8.5    4            11.5   14
5.5    1            9      2            12     7
6      -            9.5    6            12.5   15
6.5    1            10     6            13     15
7      3            10.5   8            13.5   4
7.5    4            11     11           14     7
8      3

3 were not given this test.
Average = 11.09. Unreliability .19.
Mean Square Deviation = 2.02. Unreliability .13.

TABLE X
DISTRIBUTION OF CROSSLINE TEST SCORES (114 CASES)

Score (I-II)  Frequency    Score (I-II)  Frequency    Score (I-II)  Frequency
Both OK       70           OK-OK         1            OK-F          3
OK-OK         13           OK-OK         2            OK-F          1
OK-OK         5            OK-OK         1            OK-F          2
OK-OK         5            OK-OK         2            OK-F          1
OK-OK         2            OK-OK         1            F-F           5

In the original each OK carries a superscript giving the trial on which the test was solved (superscript 1 = correct on first trial; superscript 2 = correct on second trial); these superscripts are illegible in the source text. F = failure on fourth trial.

TABLE XI
DISTRIBUTION OF TAPPING SCORES, AVERAGE OF 2 TRIALS (113 CASES)

Score         Frequency    Score         Frequency    Score           Frequency
40 to 44.99   1            65 to 69.99   11           90 to 94.99     5
45 to 49.99   3            70 to 74.99   13           95 to 99.99     6
50 to 54.99   5            75 to 79.99   23           100 to 104.99   2
55 to 59.99   9            80 to 84.99   20           105 to 109.99   -
60 to 64.99   9            85 to 89.99   6            110 to 114.99   1

Average = 73.43. Unreliability 1.26.
Mean Square Deviation = 13.39. Unreliability .89.
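The "unreliability" figures printed beneath the distribution tables can be checked numerically. The sketch below assumes, rather than takes from any legible statement in the text, that the unreliability of an average is the mean square deviation divided by the square root of the number of cases, and that the unreliability of a mean square deviation is the same quantity divided by the square root of twice the number of cases. The function names are hypothetical; the figures are those reported with the tables.

```python
import math


def unreliability_of_average(msd, n):
    """Assumed formula: mean square deviation / sqrt(number of cases)."""
    return msd / math.sqrt(n)


def unreliability_of_msd(msd, n):
    """Assumed formula: mean square deviation / sqrt(2 * number of cases)."""
    return msd / math.sqrt(2 * n)


# (mean square deviation, cases, reported unreliability of average,
#  reported unreliability of mean square deviation), from the tables.
REPORTED = {
    "Pintner": (68.95, 100, 6.9, 4.9),
    "Alpha 2": (1.20, 107, 0.116, 0.082),
    "Pictorial Completion": (22.69, 110, 2.16, 1.5),
    "Learning tests": (52.31, 106, 5.08, 3.6),
    "Porteus": (2.02, 113, 0.19, 0.13),
    "Tapping": (13.39, 113, 1.26, 0.89),
}

if __name__ == "__main__":
    for name, (msd, n, rep_avg, rep_msd) in REPORTED.items():
        print(
            f"{name}: average {unreliability_of_average(msd, n):.3f} "
            f"(reported {rep_avg}), "
            f"deviation {unreliability_of_msd(msd, n):.3f} (reported {rep_msd})"
        )
```

Under these assumed formulas the computed values agree with the printed ones to the number of figures given, which supports reading "unreliability" as what would now be called a standard error.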
TABLE XII
DISTRIBUTION OF CONSTRUCTION AND KNOX, TIME (108 CASES)

Score           Frequency    Score           Frequency    Score           Frequency
50 to 99.99     1            350 to 399.99   12           650 to 699.99   6
100 to 149.99   7            400 to 449.99   8            700 to 749.99   4
150 to 199.99   5            450 to 499.99   8            750 to 799.99   2
200 to 249.99   10           500 to 549.99   9            800 to 849.99   3
250 to 299.99   10           550 to 599.99   5            850 to 899.99   2
300 to 349.99   8            600 to 649.99   7            900 to 949.99   -
                                                          950 to 999.99   1

Average = 420 to 480, or 7.685.
Mean Square Deviation = 3.39.

Tables IV to XII inclusive show the distribution of scores on the various tests. The average, or more properly speaking the arithmetic mean, and the mean square deviation are also given for each. That we have sufficient cases is shown by the relation of the variability to the average. In only a few instances is it large enough to raise a doubt as to whether enough cases were used. These are the Pictorial Completion test, the Construction tests, and the Myers Mental Measure. The formula for the unreliability of an average is $\sigma_{\text{T-obt. av.}} = \sigma_{\text{dis.}} / \sqrt{n}$; for the unreliability of a mean square deviation it is $\sigma_{\text{T-obt. } \sigma} = \sigma_{\text{dis.}} / \sqrt{2n}$, where $\sigma_{\text{dis.}}$ is the mean square deviation of the distribution and $n$ the number of cases.

[...] memory was tested. It was found that a good memory for logical material did not follow from a good memory for nonsense; that being able to remember visually presented facts did not necessarily indicate ability to remember what was heard. The result of these and similar observations has been the development of tests dealing with specific types of material, or, giving up the specific side entirely, tests of general intelligence. Our data seem to indicate that real, underlying differences do exist, if we only know how to get at them.
In order to prove this, it is necessary to have a test with omnibus material, all of which is designed to measure a certain type of thing. We shall now proceed to do this.

A COMBINATION TEST FOR PLANFULNESS

The correlations in Table XVI, particularly those obtained between Porteus, Alpha 2, and P. C. II, seem to indicate the possibility of a factor, common to all and largely determining the score on each, which has nothing to do with the material employed, that is, whether a language or non-language test, or the like. We have suggested above several names for this factor: good judgment, common sense, deliberation, carefulness, foresight, good apperceptions, planfulness, persistence, prudence and mental alertness in meeting a new situation, ability to see the whole of a situation instead of reacting to the most obvious part of it. An attempt was made to investigate it more thoroughly by combining the elements of each test which seemed most specifically to measure it. The selection was made from the Porteus, Myers, Alpha 2, P. C. II, Pintner, and Stanford-Binet tests.

All the tests selected would require about twenty-five minutes to perform, this being a liberal estimate based upon the time limit for each test. Alpha 2 has no definite time limit, but from the writer's experience, ten minutes would seem ample to allow for the parts of the test included in this selection. When all the individual tests had been chosen, they were divided into two sections, and a self-correlation of .763 was obtained with 80 cases. The tests in each group were:

I.
Porteus: year 11 (scored 0, 1, 2); year 12 (scored 0, 1, 2, 3, 4).
Myers: page 4, numbers 3 and 7 (scored each 0, 1).
P. C. II: pictures 2 and 6 (scored 1 each if OK; otherwise 0).
Alpha 2, Part II: difficulty 8, number 4 (scored 0, 1).
Pintner: test 5, numbers 5 and 7 (scored each 0, 1).
Pintner: test 6, picture 2, pieces 2 and 1 (scored each 0, 1).
Pintner: test 6, picture 3, pieces 4 and 1 (scored each 0, 1).

II.
Porteus: year 10 (scored 0, 1, 2); year 14 (scored 0, 1, 2, 3, 4).
Myers: page 4, numbers 5 and 10 (scored each 0, 1).
P. C. II: pictures 7 and 8 (scored 1 each if OK; otherwise 0).
Alpha 2, Part II: difficulty 8, number 1 (scored 0, 1).
Pintner: test 5, number 6 (scored 0, 1).
Pintner: test 6, picture 2, pieces 4 and 3 (scored each 0, 1).
Pintner: test 6, picture 3, pieces 2 and 3 (scored each 0, 1).
Stanford-Binet: XIV years, number 6 (scored 0, 1).

The Porteus tests were chosen because they were devised to measure this very thing. The fact that only one type of material, mazes, was included was considered by Porteus one of the outstanding advantages of his test. We feel that this is a disadvantage, since some children might have a disability for working with this kind of material although possessed of common sense, foresight, etc. With omnibus material this special factor is overcome. The choice of the four most difficult tests was largely a matter of the distribution of the subjects. Too many would have made perfect records on the easier tests.

The selection from Myers Mental Measure was based largely upon resistance to suggestion. In each case four pictures with some element in common must be chosen from eight possible ones and underlined. These four could not be too difficult or our subjects would all score 0; if they were too easy we would have no reason to believe that this characteristic pertained to them. Number 3 is the selection of four toys, a tricycle, top, kite and rocking horse, with a soldier as the confusing picture. In number 5, four items made of iron must be chosen: a stove, a dagger or sword, a train, and a lock. This has several confusing suggestions. There is a broom which might be associated with the stove, and two animals which might be connected with the train as they all are capable of locomotion.
Number 7 consists of an insect, a broom, a bird, a table, a butterfly, an aeroplane, a goat and a cow. The four things which can travel in air are to be underlined. The two animals prove confusing to many children. In number 10 the subject is to select four articles of wood, two trees, a barrel and a table, with a snake, a camel, a cannon, and a bird to be omitted. Here also the three animals receive considerable attention, the hasty child not noticing that the fourth is lacking; or the snake is overlooked, the two remaining animals and two trees being classed together as objects possessing life. There seems to be some suggestion in each of these pictures, and it is certainly true that a careful, deliberate performance by a subject who takes in the whole situation and responds to it will give far better results than a hasty, careless one.

The pictures from P. C. II are those in which there are several obvious possibilities. A hasty, careless selection will hit upon the first possible one, rather than searching further for the exactly correct one. All correct pieces were checked by asking the subject why that particular one had been chosen, and if it was put in by chance, no credit was given. The partial credits given by Healy were omitted, the picture being scored either as perfect or as a failure. This was necessary in order to eliminate the other possible factors which enter into solving the test partly. For instance, in the second picture, where a book is missing, it is not sufficient to put in any book, pencil case or lunch box, but by following up persistently all the clues, the one and only correct red book can be placed in the space with certainty.

From the Thorndike Alpha 2 reading scale, questions were selected which had been answered by a large number of children. Question 1 requires a fairly careful study of the paragraph in order to find just what it is that seems true at first but is really false.
The question is a little clumsily put, certainly not direct and to the point, which is an advantage for our purposes. Question 4 is not a reading scale problem proper, but necessitates close attention to several directions. In two rows of digits the subject must underline every five that comes just after a two, unless the two comes just after a nine. If that is the case, he must draw a line under the next figure after the five. The last few lines of the first page of the Myers Mental Measure are similar to this, but the Alpha 2 was given to a larger number of cases, there was no time limit, and there was less possibility of copying, so it was given the preference as being more accurate.

Numbers 5, 6, and 7 from Pintner test 5 are all similar in nature. Given a drawing, the problem is to draw it in a reversed position, with two lines of the second position given on which to construct the rest. This seems like a rather special ability, but Pintner gives each drawing considerable weight in his total score, and persistence and planfulness are certainly essential for a good performance.

Pintner test 6 consists of parts of pictures presented in a disarranged order. Each part is numbered and blank spaces are provided in which the subject is to place the numbers of the parts in the order which would give a perfect ensemble. Here again planfulness, patience, and foresight are needed, and on the whole the subject who possesses them to the greatest degree will be the most successful.

Finally one test was selected from the Stanford-Binet scale, namely the reversed clock hands of year XIV. If two out of three were correct a score of one was given; if less, no credit at all. This test seemed to require the same kind of ability as many of the other tests included, and was therefore added.
Some of the other Stanford-Binet series might have been used also, but those which seemed desirable came too high or too low in the scale, so that the distribution for our subjects would not be satisfactory.

The correlation of .763 obtained between the two parts is fairly high when it is remembered that the highest score on each section can only be 17; also that the whole series of both parts would only take half an hour to give. As to reliability, it is a noteworthy conclusion that this self-correlation is the highest one obtained with any non-identical material. A correlation of the composite tests with any of the tests which are included would probably give a high coefficient difficult to interpret because of the varying amount of each included in the composites, and a low correlation with learning tests, construction tests, or tapping could hardly be considered strong evidence in favor of our new grouping. But the correlation with Stanford-Binet seemed worth finding, and when worked out yielded a coefficient of .537. This indicates that our combination test is comparable with the whole series of tests from which it was compiled. We have, however, no criterion to prove that it actually measures the trait which we presuppose it does. But this same criticism applies to all the tests which are supposed to measure specific factors. Our new test combination of old material is certainly as good as the tests from which it originated; we think it is better, because it gives evidence of measuring one trait, or group of traits, with a variety of materials, whereas all the others measure many kinds of traits with identical or similar material.
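The construction of the two half-tests and their self-correlation can be illustrated with a short sketch. The maximum credits are transcribed from the item lists for sections I and II; the `pearson` helper is an ordinary product-moment correlation, which is assumed here to be the coefficient intended, and the half-scores in the usage example are invented purely for illustration and are not data from the study.

```python
import math

# Maximum credit for each item group, transcribed from the lists for I and II.
HALF_I = {
    "Porteus year 11": 2,
    "Porteus year 12": 4,
    "Myers page 4, numbers 3 and 7": 2,
    "P. C. II pictures 2 and 6": 2,
    "Alpha 2 Part II, difficulty 8, number 4": 1,
    "Pintner test 5, numbers 5 and 7": 2,
    "Pintner test 6, picture 2, pieces 2 and 1": 2,
    "Pintner test 6, picture 3, pieces 4 and 1": 2,
}
HALF_II = {
    "Porteus year 10": 2,
    "Porteus year 14": 4,
    "Myers page 4, numbers 5 and 10": 2,
    "P. C. II pictures 7 and 8": 2,
    "Alpha 2 Part II, difficulty 8, number 1": 1,
    "Pintner test 5, number 6": 1,
    "Pintner test 6, picture 2, pieces 4 and 3": 2,
    "Pintner test 6, picture 3, pieces 2 and 3": 2,
    "Stanford-Binet XIV years, number 6": 1,
}


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 cases")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    print("maximum on half I:", sum(HALF_I.values()))
    print("maximum on half II:", sum(HALF_II.values()))
    # Hypothetical half-scores for six children (illustration only).
    half_one = [12, 9, 15, 7, 11, 14]
    half_two = [11, 8, 16, 9, 10, 13]
    print("self-correlation:", round(pearson(half_one, half_two), 3))
```

Tallying the credits confirms the statement that the highest possible score on each section is 17, so that the two halves are balanced in range even though they draw on slightly different item groups.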
Heretofore the classification and material preparatory to the formation of a test has generally been along the lines of the material employed, such as form boards, etc., whereas the combination test being discussed presents the results obtained from forming a test directed toward planfulness, or other ability.

CONCLUSION

It is proposed to set forth the practical results of this study, to show the positive information that has been ascertained, and also to show, from the experience gathered in the course of obtaining such information, what further investigations should be made, with what purpose, and what methods may lead to success. This study has reached some positive results and has disclosed other, perhaps more valuable, ones in the same field.

In entering upon this study it was believed that the results of the method that has been pursued would justify the conclusion that the Stanford-Binet series can be used as a test of general intelligence and that certain other tests used as auxiliaries would make apparent and give a measure of special abilities not individually measured by the Stanford-Binet. It was expected that the various tests would give reasonably high correlations with the Stanford-Binet and rather low correlations with each other, thus on the one hand establishing the reliability of the tests used, and on the other hand, the diversity of the abilities that were subjected to measurement.

These results were anticipated because care was used in selecting the tests to take those which had an approved authorship, an extended use, a definite purpose, and a general reputation of success in the field they purported to cover. That is, the various units had each been shown apparently to be satisfactory, and on these a priori grounds it was thought that properly selected units used in conjunction would result in a reliable schedule.
Had the results of the correlations been in harmony with this anticipated situation, we might properly have pointed to this study as a demonstration of the process by which schedules of tests for children should be composed.

Looking upon our results as they have been reported, the fact is obvious that there is no such easy manner in which to arrive at reliable schedules of tests. Unexpectedly low correlations were obtained in some situations where the indicated results should have been high, and vice versa, and while our positive purpose therefore met with disappointing obstacles, a study of the figures as we have them led to other worth-while conclusions.

Drawing upon the results of the correlations, it can be stated with assurance that it will not be well to take tests upon which a high face value has been placed when they were used without being effectively valued by comparison, and to combine a number of them in the expectation of using the combination to get reliable information as to the general intelligence and the special abilities of normal children.

One of the best examples which we can show, as a result of this study, of the impropriety of such procedure is that the type of material used does not govern the abilities tested. We obtained a higher correlation between a language and a non-language test than between two language tests or two non-language tests. Similar examples can be drawn from the correlations listed above respecting other characteristics of various tests. Insofar therefore as authors of tests have relied upon the material as a quality that would single out and measure a certain one of many abilities, it seems clear that individual tests miss their purpose.
However, the correlations did seem to show that something definite was being tested, so that if our purpose of finding a schedule of tests at once sufficient to measure both general and special abilities was disappointed, at least the schedule we used can be relied upon for general abilities, and such a schedule is more reliable than the Stanford-Binet alone.

The components of this schedule have been previously listed, and it only remains to state what individual matters of interest relating to each were made clear in the course of the study, which was directed to larger purposes.

It was a matter of actual demonstration herein that all of the construction tests used are unreliable, this conclusion disproving the previously held opinion, based upon empirical considerations, to the effect that they reliably measure ability to handle concrete material.

Persons having occasion to apply mental tests have too frequently overlooked the matter of how far the test can be relied upon. This is an important matter, and consequently it should be of some interest to note that the reliability of the Stanford-Binet, Pintner non-language group test, Thorndike reading scale Alpha 2, Porteus Maze Test, and tapping test has been established, whereas the Myers Mental Measure, the Healy Pictorial Completion test II, the Healy-Bronner learning tests and the crossline tests are not yet definitely shown to be reliable.

Care should also be observed in interpreting the results of correlations, for the mere fact of high correlation is only generally and not conclusively proof of reliability. There is the possibility that factors causing unreliability have been hidden; thus, in the tapping test, the high correlation with Stanford-Binet was deceptive owing to the fact that the scores on both increased with the age of the subjects.

Other specific remarks relating to individual tests are contained in the results.
There remains to state what considerations we have found to have a probable value as to future work in this field.

If we found on the one hand that the type of material used in a test does not govern the ability tested, on the other hand there are some indications that to test individual abilities the test should have a variety of material. So far the elements of a desired test can be stated, but the further necessity of finding just what material is suitable can only be determined by practical work consisting of correlation with outside criteria and with any other measures of claimed effectiveness in the field in question.

As an experimental example, for the confines of this study would allow no more extended investigation, various parts of a number of the tests were united in a combination test intended to secure a measure of planfulness. The resulting correlations indicated success in this attempt. A similar or even greater measure of success may follow further combinations aimed at the measurement of other abilities.

It may also be stated as having been illustrated in the course of this study that the supposed merit of various mental tests based upon various insufficient or unscientific criteria, such as mere hypothesis, or even practical results, if relied upon, may lead to misleading or dangerous conclusions, and that before one takes the responsibility of giving advice or of taking action with respect to information gained from the application of mental tests, there should be available the assurance that proper comparative tests and correlations have verified the supposed propriety of relying upon the results.

VITA

The author of this dissertation was born in New York, August 25, 1898. Secondary education was at Far Rockaway High School, taking highest honors and receiving a Regents Scholarship for College. Vassar College, 1915-1917; Barnard College, Columbia University, 1917-1919; B. A. Degree, Columbia University, 1919, Honors in Psychology; 1918, research work for New Jersey State Institution for Feeble Minded; 1919-1920, Fellowship at Judge Baker Foundation, Boston, Assistant Psychologist; Columbia University, 1920-1922, Post Graduate Work in Psychology.