BULLETIN NO. 32

BUREAU OF EDUCATIONAL RESEARCH
COLLEGE OF EDUCATION

THE INTERPRETATION OF THE PROBABLE ERROR AND THE COEFFICIENT OF CORRELATION

By Charles W. Odell
Assistant Director, Bureau of Educational Research

PRICE 50 CENTS

PUBLISHED BY THE UNIVERSITY OF ILLINOIS, URBANA
1926

Digitized by the Internet Archive in 2011 with funding from University of Illinois Urbana-Champaign
http://www.archive.org/details/interpretationof32odel

TABLE OF CONTENTS

Preface
Chapter I. Introduction
Chapter II. The Probable Error
Chapter III. The Coefficient of Correlation

PREFACE

Graduate students and other persons contemplating educational research frequently ask concerning the need for training in statistical procedures. They usually have in mind training in the technique of making tabulations and calculations. This, as Doctor Odell points out, is only one phase, and probably not the most important phase, of needed training in statistical methods. The interpretation of the results of calculation has not received sufficient attention by the authors of texts in this field. The following discussion of two derived measures, the probable error and the coefficient of correlation, is offered as a contribution to the technique of educational research. It deals with the problems of the reader of reports of research, as well as those of original investigators. The tabulating of objective data and the making of calculations from the tabulations may be and frequently is a tedious task, but it is primarily one of routine. The interpreting of the results of calculation is not a routine task. Many conditions affect their meaning and the research worker constantly encounters new problems of interpretation.
It is, however, possible to state certain general principles that will serve as a guide in this phase of educational research.

Walter S. Monroe, Director.
July 7, 1926

THE INTERPRETATION OF THE PROBABLE ERROR AND THE COEFFICIENT OF CORRELATION

CHAPTER I
INTRODUCTION

Purpose of this bulletin. One of the most noticeable recent developments in the field of education has been the extensive application of statistical methods to the description of educational conditions and the solution of educational problems. Only a comparatively few years ago conditions were portrayed chiefly in terms of adjectives and other expressions of quality or degree, but now these have been superseded to a considerable extent by definite quantitative terms. There are at least two reasons why everyone engaged in educational work, from the classroom teacher to the research expert, should become acquainted with certain commonly used formulae, methods of computation, and other statistical procedures. In the first place, situations frequently are encountered in which it is desirable to make use of statistical procedures for the purpose of collecting and analyzing data which have a bearing upon practical educational problems. For the great majority of educational workers, however, it is probably more important to be able to interpret correctly the numerous statistical expressions and discussions which are encountered in professional reading and other work. It is almost impossible to peruse a single issue of an educational periodical or a recent book in the field of education or to attend an educational meeting without seeing or hearing many statistical terms employed. Most of the commonly used methods of computation can be mastered by practically any person of average intelligence and arithmetical ability within a rather short time, but the power of interpreting correctly the various measures derived by statistical methods is not so easily acquired.
The acquisition of this power demands considerable familiarity with the concepts involved and this in turn requires clear and critical thinking.

It is with the second of the two purposes mentioned in the preceding paragraph chiefly in mind that the writer has attempted in this bulletin to throw some light on the use and interpretation of two of the most frequently used statistical measures,¹ the probable error² (commonly abbreviated P.E.), and the coefficient of correlation (commonly abbreviated r), in the hope that readers will be helped in their understanding of the significance of these terms. Since the methods of computing them can be found in many places,³ their actual calculation will not be explained in detail, although the formulae for them will be given.

¹ The term "statistical measure," sometimes shortened to "measure," is used in this bulletin to refer to a measure or quantitative expression which has been derived from a number of data such as scores or other measurements and which summarizes or expresses in a single numerical index some tendency of the original data. All such expressions as means, medians, modes, measures of deviation, measures of relationship, and so on are statistical measures. In order to avoid confusion the term "measure," which is often used to refer to the result obtained by applying a measuring instrument to an individual case, will not be used in this bulletin in that sense, but "score" or "measurement" will be used instead.

² The term "probable error" (P.E.) has come to be generally used to include both the probable error proper and the median deviation (abbreviated Md.D.), although, as will be shown later in the discussion, the latter is in no real sense an error. For this reason and also to avoid confusion the term probable error will sometimes be used when median deviation would be preferable from the standpoint of strict accuracy of use.
The reader should not obtain the idea, however, that the writer believes it is desirable to use probable error instead of median deviation; he distinctly does not believe so.

³ See: Odell, C. W. Educational Statistics. New York: The Century Company, 1925, p. 138-39, 150-8-, 221-41, or any other standard text on statistics.

CHAPTER II
THE PROBABLE ERROR¹

Formula for the probable error. Since the probable error, as shown by the substitute term median deviation, is the median of the deviations or differences of the individual scores or measurements from their average,² it may be computed simply by determining the median of these deviations or differences. However, the customary method is to determine the standard deviation³ first and then to multiply it by .6745⁴ to obtain the probable error. In other words the usual formula for the probable error is:

$$\text{P.E.} = .6745\,\sigma$$

The relationship existing between the probable error and the standard deviation is therefore of the same sort as that existing between a foot and a yard or a pint and a quart: one always equals the other multiplied by a constant factor. Thus just as .5 quart equals a pint and 2 pints a quart, so .6745σ equals 1 P.E. and 1.4826⁵ P.E. equals 1σ.

¹ The probable error, often more properly called the median deviation, is only one of several commonly used measures of the same sort. Among the other similar ones are the standard deviation (abbreviated S.D. or σ (sigma)), which in certain uses becomes the standard error, the mean deviation (M.D. or A.D.), the quartile deviation or semi-interquartile range (Q), and the 10-90 percentile range (D). All of these except the last are rather frequently encountered. In general, whatever is said about the probable error may be applied to these other measures also. The one important exception to this statement is that since these measures, except Q, differ from the probable error in magnitude, their interpretations in numerical statements will, of course, differ. For a discussion of these other measures see: Odell, op. cit., p. 120-38.

² The term "average" is used here in a general sense, that is, it includes the arithmetic mean, commonly called the average, the median, the mode, and all other measures of central tendency. Deviations or differences are usually computed from the arithmetic mean but may be taken from any other measure of central tendency.

³ $\sigma = \sqrt{\dfrac{\Sigma x^{2}}{N}}$, in which x denotes the deviation or difference of a particular score from the average, N stands for the total number of cases or scores, and Σ (sigma) is the symbol for summation.

⁴ It is only in the case of a normal distribution or by chance that the probable error is very nearly equal to .6745 times the standard deviation. However, most educational data form distributions which approximate normality closely enough that no serious error is involved in using the given decimal as the multiplier.

⁵ This number is of course the reciprocal of .6745.

Different uses of the probable error. There are several more or less different uses or meanings of the probable error, at least five of which are fairly distinct from one another, and will be dealt with in this discussion.

1. A measure of the spread or variability of a distribution of data about the average. When used in this way it should properly be called the median deviation (Md.D.). If the term probable error is employed it should be followed by the words "of the distribution" and abbreviated P.E._Dis.

2. A unit of measurement. This also is a use for which the term median deviation is really the correct one to employ, since it involves merely a particular use of the median deviation of a distribution. Since no subscript has been agreed upon to denote this use, the writer suggests "U" for "unit." Thus, when designating the median deviation used as a unit of measurement one should write Md.D._U, or if one follows the general practice rather than the best, P.E._U.

3. A measure of the reliability of sampling. The accepted abbreviation for this use is P.E. with a subscript denoting the measure to which it applies. Thus P.E._M denotes the probable error of the mean, P.E._Mid. that of the median, P.E._r that of the coefficient of correlation and so on.

4. A measure of the reliability or accuracy of any one of a number of scores or measurements of the same thing. As will be explained later this is from one standpoint a variety of the immediately preceding use. It has no conventional abbreviation, hence P.E. with a subscript [illegible in this copy] is suggested as a suitable one.

5. A measure of the reliability of a measuring instrument. This may be divided into two sub-heads as follows:

A. A measure of the reliability or accuracy of scores obtained from a measuring instrument when compared with those obtained from another application of the same or of a supposedly equivalent measuring instrument to the same individuals. This is called the probable error of estimate and is abbreviated P.E._Est.

B. A measure of the reliability or accuracy of scores obtained from a measuring instrument when compared with the theoretically true scores. This is called the probable error of measurement and is best abbreviated P.E._Meas.

The probable error as a measure of the spread or variability of a distribution of data around its average. As was stated above the term probable error is a misnomer in connection with this use and median deviation should be used instead. Therefore, the writer will use the latter expression in the discussion immediately following. The use of the median deviation as a measure of the spread or variability of a distribution of data around its average is the fundamental one and from it all the others are derived.
When a number of scores or measurements yielded by a test or other measuring instrument are tabulated in a distribution it is frequently desirable and useful to describe in some concise way their spread or variability about the average. In other words, one often desires to indicate or summarize by a single numerical expression the extent to which the individual scores tend to cluster about or depart from their average. For example, if the marks assigned the pupils in two classes have been tabulated and the averages of both classes are computed and found to be 85 percent, one knows that the average rating of the classes is the same but he does not know whether or not the classes are equally homogeneous in regard to the ratings given. In other words, he does not know whether all the pupils in both classes received marks closely grouped around the average, whether their marks ranged from decidedly below to considerably above the average, or whether the first condition held in one class and the second in the other.

One of the measures most commonly used as an index of the amount of spread or variability is the median deviation.⁶ This is exactly what its name implies, the median of the deviations or differences of the individual scores from their average. Since the median is a point on each side of which there are half of the measures in the whole distribution, the median deviation is always of such a magnitude that half of the individual scores differ from their average by less than this amount and half by more. For example, if one of the classes referred to above had a median deviation of 3 percent it would mean that half of the pupils' marks were within 3 percent of 85, that is, from 82 to 88, and the other half either below 82 or above 88. Similarly a median deviation of 5 percent for the other class would mean that the marks of half of its members were between 80 and 90 and those of the other half either below 80 or above 90.
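As a hedged illustration of the definition just given, the following Python sketch computes the median deviation directly, as the median of the absolute deviations of the scores from their mean. The two lists of marks are invented for illustration and are not data from this bulletin; they are chosen so that both classes average 85 while their median deviations are 3 and 5. The function names are assumptions of this sketch only.

```python
from statistics import mean, median, pstdev


def median_deviation(scores):
    """Median of the absolute deviations of the scores from their mean."""
    center = mean(scores)
    return median([abs(s - center) for s in scores])


def probable_error_from_sd(sd):
    """Customary shortcut P.E. = .6745 * sigma (close only for near-normal data)."""
    return 0.6745 * sd


# Hypothetical marks, invented for illustration; both classes average 85.
class_a = [76, 81, 82, 82, 85, 88, 88, 89, 94]
class_b = [70, 78, 80, 80, 85, 90, 90, 92, 100]

for name, marks in (("first class", class_a), ("second class", class_b)):
    m = mean(marks)
    md_d = median_deviation(marks)
    shortcut = probable_error_from_sd(pstdev(marks))
    print(f"{name}: mean {m}, median deviation {md_d}, "
          f"middle half of marks roughly from {m - md_d} to {m + md_d}; "
          f".6745 * S.D. = {shortcut:.2f}")
```

With samples this small and far from normal, the direct median deviation and the .6745σ shortcut need not agree closely; the sketch prints both so that the difference can be seen.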
From these values of the median deviation, 3 and 5, one would know that the first class was more homogeneous than the second in respect to the ratings given.

⁶ It cannot be said in any real sense that the differences between the individual scores or measures of a number of individuals and their average are errors. Despite this fact, however, the term probable error is frequently used in this connection.

The actual use of the median deviation in this way is shown by the following table taken from a magazine article.⁷ This table shows the distributions of scores on the Courtis Subtraction Card No. 33 made by five groups of pupils who had used different methods of subtraction. Below each column in the table are given the number of pupils, the mean score, the standard deviation and the median deviation of the distribution in that column.

TABLE I.⁸ A SUMMARY OF TABLE I OF JOHNSON'S STUDY GIVING THE MEAN ACCURACY SCORES EARNED ON THE COURTIS SUBTRACTION CARD NO. 33 BY THE GROUPS USING THE SEVERAL METHODS OF SUBTRACTION

                          Method
Score        I      II     III     IV    Mixed
 17         75      13      2       8      3
 16         74       5      4       3      6
 15         35       2      1       1ᵃ     1
 14         22       3              1      2
 13          5              1
 12          6
 11          1                             1
 10          1
  9          1
 N         220      23      8      13     13
 M        15.7    16.2   15.8    16.4   15.5
 Md.D.     0.9     0.7    0.8     0.6    1.1

ᵃ Printed as 5 but here taken as 15, since the use of the latter value checks with the mean reported.

⁷ Ruch, G. M., Knight, F. B., and Lutes, O. S. "On the relative merits of subtraction methods: another view," Journal of Educational Research, 11:154-55, February, 1925. For other examples of the use of the probable error or median deviation see the following references:
Courtis, S. A. The Gary Public Schools: Measurement of Classroom Products. New York: General Education Board, 1919, p. 213.
Stoddard, G. D. "Iowa Placement Examinations." University of Iowa Studies in Education, Vol. 3, No. 2. Iowa City: University of Iowa, 1925, p. 62-64.
Kallom, A. W. "Times of writing each of the Arabic numerals determined by the reaction time method," Journal of Educational Psychology, 7:226-28, April, 1916.
Childs, H. G. "Measurement of the drawing ability of two thousand one hundred and seventy-seven children in Indiana city school systems by a supplemented Thorndike Scale," Journal of Educational Psychology, 6:391-408, September, 1915.

⁸ For purposes of convenience the tables in this bulletin are numbered consecutively instead of as in the sources from which they are quoted. Also some of them have been modified slightly in order to be consistent or to follow the best form, parts of some have been omitted, and occasional errors have been corrected.

For example, 220 pupils used the first method, their mean score was 15.7, and the median deviation of their scores .9. This statement is merely a way of expressing the fact that half of the scores probably fell within .9 of the mean, or between 14.8 and 16.6, and half outside of these limits. Similarly, for the pupils who used the second method the mean was 16.2 and the median deviation .7, which indicates that half of the pupils probably made scores between 15.5 and 16.9 and half lower or higher than these limits.

[Figure 1. Graphical Representation of the Data in the Last Two Columns of Table I. Left panel: data from Column IV. Right panel: data from Column "Mixed." Horizontal axis: Score. Vertical axis: No. of Cases.]

It will perhaps be helpful to illustrate the significance of the median deviation by a graphical representation. With this in mind Figure 1 has been prepared. The portion of the figure at the left represents graphically the distribution of scores contained in Column IV of Table I, the portion at the right the scores in the column headed "Mixed."
The distributions in these two columns were chosen for graphical representation because the total number of scores in each is the same and therefore the areas of the surfaces representing them are equal. Inspection of the figure makes it evident that the scores represented at the right spread out considerably more than do the others. The height of the graph at the right is less and the length of its base greater than of the one at the left, which indicates a wider spread of scores. This agrees with the fact that the median deviation of the distribution represented by it is 1.1, whereas that of the other one is only .6. It might be noted also that neither of the graphs approaches normality very closely, the one at the right, however, doing so more nearly than the one at the left.

The interpretation of the median deviation, when used to measure how closely individual scores or measurements cluster about their average or how far they spread out from it, may be extended further than has been suggested in the preceding paragraphs by stating what fraction of the scores will not differ from the average by more than a given multiple of the median deviation. For the few smallest integral multiples we may state as follows:⁹

50.00 percent of scores differ from the average by less than 1 Md.D.
82.26 percent of scores differ from the average by less than 2 Md.D.
95.70 percent of scores differ from the average by less than 3 Md.D.
99.30 percent of scores differ from the average by less than 4 Md.D.
99.92 percent of scores differ from the average by less than 5 Md.D.

We may also change the form of statement and say that the chances are:

1 to 1 that a score differs from the average by less than 1 Md.D.
4.6 to 1 that a score differs from the average by less than 2 Md.D.
22 to 1 that a score differs from the average by less than 3 Md.D.
142 to 1 that a score differs from the average by less than 4 Md.D.
1,340 to 1 that a score differs from the average by less than 5 Md.D.¹⁰

⁹ Although the numerical interpretations given in the text hold exactly only in the case of normal frequency distributions they may be used without serious error in dealing with the large majority of tabulations of such educational facts as pupils' heights, weights, school marks and test scores, teachers' salaries, numbers of pupils to the room, and so forth. For example, 109 of the scores in the first column of Table I fall within 1 Md.D. of the mean, whereas 110 would be expected to do so.

¹⁰ It has been previously stated that the chief difference between the interpretation of the standard deviation (σ), the mean deviation (M.D.), and the 10-90 percentile range (D), and of the median deviation has to do with numerical interpretation. For example, it is to be expected that:

68.27 percent of scores differ from the average by less than 1σ
95.44 percent of scores differ from the average by less than 2σ
99.74 percent of scores differ from the average by less than 3σ
99.99 percent of scores differ from the average by less than 4σ

Using the other form of statement, the chances are:

2.15 to 1 that a score differs from the average by less than 1σ
21 to 1 that a score differs from the average by less than 2σ
369 to 1 that a score differs from the average by less than 3σ
15,772 to 1 that a score differs from the average by less than 4σ

Also it is probable that:

57.51 percent of scores differ from the average by less than 1 M.D.
88.94 percent of scores differ from the average by less than 2 M.D.
98.33 percent of scores differ from the average by less than 3 M.D.
99.86 percent of scores differ from the average by less than 4 M.D.

Or the chances are:

1.55 to 1 that a score differs from the average by less than 1 M.D.
8 to 1 that a score differs from the average by less than 2 M.D.
59 to 1 that a score differs from the average by less than 3 M.D.
706 to 1 that a score differs from the average by less than 4 M.D.

For the 10-90 percentile range the corresponding statements are:

99 percent of scores differ from their average by less than 1 D.
99.99997 percent of scores differ from their average by less than 2 D.

And the chances are:

95 to 1 that a score differs from the average by less than 1 D.
3,380,614 to 1 that a score differs from the average by less than 2 D.

As was suggested previously the quartile deviation may be interpreted in the same way numerically as the median deviation.

These more extended interpretations may be illustrated by referring back to the examples used earlier. For the first of the two classes referred to, which had a mean score of 85 and a median deviation of 3, it is not only probable that half of its members have scores between 82 and 88 but also that about 82 percent of them have scores between 79 and 91 (85 ± 6), almost 96 percent between 76 and 94 (85 ± 9), over 99 percent between 73 and 97 (85 ± 12), and very nearly 100 percent between 70 and 100 (85 ± 15). Using the other form of statement for the first column of Table I, the chances are 1 to 1, or even, that a particular score chosen at random falls between 14.8 and 16.6 (15.7 ± .9), 4.6 to 1 that it falls between 13.9 and 17.5 (15.7 ± 1.8), 22 to 1 that it is between 13.0 and 18.4 (15.7 ± 2.7), 142 to 1 that it is between 12.1 and 19.3 (15.7 ± 3.6), and 1,340 to 1 that it is between 11.2 and 20.2 (15.7 ± 4.5).¹¹

¹¹ The fact that one, or sometimes even both, of the limits within which a certain fraction of the scores may be expected to fall comes outside the range of actually obtained scores is due to the fact that the scores do not form a normal distribution. That they do not is often caused by the number of scores being small, as well as by causes inherent in the nature of the data themselves.

The probable error as a unit of measurement.¹² In dealing with data of various sorts one encounters many different units. The unit usually used for school marks is the percent, for ages the year, the month, or the day, for salaries the dollar, for heights the foot or the inch, for weights the pound, for spelling the word, for arithmetic the example, and so on. In the case of such characteristics as height, weight, age, salary, and so forth, even though there are commonly used units of measurement, it is difficult if not impossible to compare one trait with another. For example, one cannot readily determine whether a pupil's height of four feet, eleven inches, his weight of 102 pounds, or his age of 12 years and 8 months is the highest or lowest ranking when compared with other similar pupils.

¹² This use is derived directly from the one discussed in the preceding paragraph and also is one to which the name median deviation should properly be applied. The writer will therefore employ the latter term, abbreviated Md.D._U, throughout his treatment of this use.

There are also many situations in which there is no commonly used unit or indeed any conventional unit closely connected with the type of thing being measured. Probably most of such cases in the field of education have to do with the measurement of difficulty, such as difficulty of examples in arithmetic, of questions in history or geography, of passages in reading, of words in spelling, and so forth. To meet the need for a common unit in which all scores and measurements, including those for which no conventional units are available, may be expressed and thereby easily compared, the median deviation has been adopted and come into rather common use. Irrespective of the units in terms of which scores or measurements have been expressed originally, by applying certain statistical procedures they may be expressed in terms of median deviations.
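A minimal sketch of such a conversion follows, offered only as an illustration. It takes the pupil's height, weight, and age mentioned above and expresses each as a number of median deviations from a group average. The group averages and median deviations are hypothetical values invented for this sketch, as are the function names. The second function reproduces, on the assumption of a normal distribution, the percentages quoted earlier for multiples of the median deviation.

```python
from statistics import NormalDist

PE_FACTOR = 0.6745  # in a normal distribution 1 Md.D. = .6745 sigma


def in_md_d_units(value, average, md_d):
    """Distance of a measurement from the group average, in Md.D. units."""
    return (value - average) / md_d


def fraction_within(k):
    """Expected fraction of a normal distribution lying within k Md.D. of the average."""
    return 2 * NormalDist().cdf(PE_FACTOR * k) - 1


# Pupil from the text: 4 ft. 11 in. = 59 in., 102 lb., 12 yr. 8 mo. = 152 mo.
pupil = {"height (in.)": 59, "weight (lb.)": 102, "age (mo.)": 152}

# Hypothetical group averages and median deviations, invented for illustration.
group = {
    "height (in.)": (57.0, 2.0),
    "weight (lb.)": (90.0, 8.0),
    "age (mo.)": (150.0, 4.0),
}

for trait, value in pupil.items():
    average, md_d = group[trait]
    units = in_md_d_units(value, average, md_d)
    share = fraction_within(abs(units))
    print(f"{trait}: {units:+.2f} Md.D. from the average; "
          f"about {share:.0%} of a normal group would lie nearer the average")
```

Once all three traits are on the same scale the comparison the text calls difficult becomes direct: under these invented group values the pupil stands furthest above the average in weight and nearest to it in age.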
The most frequent use of the median deviation as a unit has probably been in connection with the construction of standardized educational measuring instruments. The values or difficulties assigned the different items or steps on the scale or the distances between the steps are very frequently expressed in such units. An example of this may be found in connection with Woody's Arithmetic Scales,¹³ given as part of his account of the derivation of these scales. Woody describes how the difficulty values of the exercises composing each scale were determined. The essential steps in this determination consisted of finding the median deviation of the distribution of scores¹⁴ for each scale and then measuring the distance of each exercise from the average of the distribution in terms of Md.D._U units.¹⁵ Different results were obtained in the different school grades so that it was necessary to combine these into average results. Finally, Woody located zero¹⁶ points, that is, points of absolute lack of ability to solve exercises in the four fundamental operations, and transformed the Md.D._U values of the various exercises from distances from the averages of the distributions into distances from the zero points. To illustrate this simply, we may express John's height by saying that he is six inches taller than Paul. If, however, we know that Paul is five feet and three inches tall we can express John's height much more satisfactorily for most purposes by saying that he is five feet and nine inches above the zero point which is, of course, zero inches or no height at all.

¹³ Woody, Clifford. "Measurements of Some Achievements in Arithmetic." Teachers College Contributions to Education, No. 80. New York: Teachers College, Columbia University, 1916, p. 29-54.

¹⁴ It is assumed that the distribution of pupils' scores represents the distribution of their abilities.

¹⁵ It does not seem necessary for the purpose of the present discussion to explain in complete detail just how this was done. Briefly, Woody found the percent of pupils obtaining the correct answer to each exercise and, on the assumption that the distribution of pupils' abilities was normal, calculated the degree of difficulty of an exercise in terms of the number of Md.D._U units that the ability required to do each exercise differed from the average ability of the group. For a fuller explanation of the method of procedure, see: Odell, op. cit., p. 313-15.

¹⁶ The determination of such zero points is not a necessary part of the process of employing Md.D._U but merely renders the values so expressed more usable. The actual determination of zero points usually rests, at least in part, upon opinion as to just what constitutes absolute lack of ability in a given field. Sometimes it is possible to determine rather accurately just what is the least difficult task of a certain sort and to locate that degree of ability just barely insufficient to accomplish this task, but in many cases this can not or at least has not been done.

To show the final result of the process, that is, the difficulty values determined for the exercises, a portion of one of Woody's tables¹⁷ is given as Table II. Exercise 2 was found to be the easiest, having a difficulty value of 1.23 Md.D._U, exercise 3 was next with a value of 1.40 Md.D._U, and so on up to exercise 38, the most difficult, which had a value of 9.19 Md.D._U. After the difficulty values have been so expressed we can not only say, for example, that exercise 3 is .17 Md.D._U and exercise 5 1.27 Md.D._U more difficult than No. 2, but also, if the zero point has been located accurately, that exercise 5 is about twice as hard as No. 2, but only half as difficult as No. 15.

TABLE II. FINAL VALUES OF ADDITION EXERCISES

No. of            No. of            No. of            No. of
Exercise  Value   Exercise  Value   Exercise  Value   Exercise  Value
    2     1.23       14     3.92       22     6.44       35     7.97
    3     1.40        9     4.18       19     6.79       29     8.04
    5     2.50       12     4.19       23     7.11       31     8.18
    7     2.61       13     4.85       34     7.43       24     8.22
    6     2.83       15     4.97       26     7.47       36     8.58
    8     3.21       17     5.52       30     7.61       37     8.67
    1     3.26       16     5.59       27     7.62       33     8.67
    4     3.35       18     5.73       25     7.67       38     9.19
   10     3.63       20     5.75       28     7.71
   11     3.78       21     6.10       32     7.71

¹⁷ Woody, op. cit., p. 54. Other examples of the use of Md.D._U as a unit may be found in:
Buckingham, B. R. "Spelling Ability: Its Measurement and Distribution." Teachers College Contributions to Education, No. 59. New York: Teachers College, Columbia University, 1913, p. 40-65.
Monroe, W. S. An Introduction to the Theory of Educational Measurements. Boston: Houghton Mifflin, 1923, p. 61-62, 94-103, 138-41, 150-52.
Trabue, M. R. "Completion-Test Language Scales." Teachers College Contributions to Education, No. 77. New York: Teachers College, Columbia University, 1916, p. 29-73.
Hughes, J. M. "The use of tests in the evaluation of factors which condition the achievement of pupils in high school physics," Journal of Educational Psychology, 16:217-31, April, 1925.

The preceding discussion has used the term Md.D._U but perhaps not made clear just what it really means. Since 50 percent of the scores in a normal distribution fall within 1 Md.D. of the average and since a normal distribution is symmetrical it follows that half of these 50 percent, or 25 percent, of the scores will fall within 1 Md.D. of the average on each side. That is, 25 percent will fall between the average and 1 Md.D. below the average and another 25 percent between the average and 1 Md.D. above it. Furthermore the average of a symmetrical distribution falls at the middle of the distribution, so that 50 percent of the scores lie below it and 50 percent above it. Therefore, it is easily seen that 75 percent of the scores lie below 1 Md.D. above the average, as this is simply the sum of the 50 percent below the average and the 25 percent between the average and 1 Md.D. above it. To make this clearer the accompanying figure is given.
[Figure 2. Representation of a Normal Distribution of Scores Showing the Meaning of the Median Deviation as a Unit of Difficulty. Vertical lines are erected at M. and at +1 Md.D.]

The portion of the normal frequency surface to the left of the vertical line at its center, marked M., is the 50 percent of the area below the average. That part between this vertical line and the one erected at +1 Md.D. is the 25 percent between the average and one median deviation above the average. Thus, all the area to the left of the shorter vertical line is 75 percent of the whole area. With this in mind we can now explain the meaning of the median deviation as a unit of difficulty by saying that it is the difference in difficulty between an exercise answered correctly by 50 percent of the pupils tested and another answered correctly by 75 percent of the pupils.¹⁸ Looking at Table II we see that exercise 25 has a value of 7.67 and exercise 37 of 8.67, a difference of 1.00 Md.D._U. We know, therefore, that if the two exercises were given to the same group of pupils and 50 percent of them answered exercise 37 correctly, 75 percent might be expected to answer exercise 25 correctly, since it is 1 Md.D._U easier than the former.

There is also another somewhat different meaning which is often attached to the median deviation when used as a unit of measurement. In the construction of such measuring instruments as handwriting and drawing scales, one method of determining the value or merit of the specimens being rated for a scale is to have them compared with one another by a number of supposedly competent judges. For example, judges compare specimen A with B, also A with C, B with C, and so on. Record is made of how many or what percent of the judges rate A as better than B and of course how many rate B as better than A, and so on. When 75 percent of the judges rate one specimen as better than another¹⁹ the difference in merit between the two is assumed to be 1 Md.D._U.
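Both meanings of the unit described above rest on the same normal-curve conversion from a percentage to a distance in median deviations. The Python sketch below is a hedged illustration of that conversion only. It assumes a normal distribution, follows the simplified description given in the text rather than the full procedures of the scale makers, and uses function names invented for this sketch.

```python
from statistics import NormalDist

PE_FACTOR = 0.6745  # in a normal distribution 1 Md.D. = .6745 sigma


def md_d_above_average(proportion_below):
    """Distance above the average, in Md.D. units, of the point below
    which the given proportion of a normal distribution lies."""
    return NormalDist().inv_cdf(proportion_below) / PE_FACTOR


def difficulty_from_percent_correct(percent_correct):
    """Difficulty of an exercise relative to the average ability.

    An exercise answered correctly by p percent of pupils is placed at the
    point which (100 - p) percent of the abilities fall below.
    """
    return md_d_above_average(1 - percent_correct / 100)


def merit_difference_from_judges(percent_preferring):
    """Difference in merit implied when the given percent of judges
    rate one specimen as better than the other."""
    return md_d_above_average(percent_preferring / 100)


for pct in (25, 50, 75):
    print(f"{pct} percent correct: "
          f"{difficulty_from_percent_correct(pct):+.2f} Md.D. from the average")

print(f"75 percent of judges prefer B: B exceeds A by "
      f"{merit_difference_from_judges(75):.2f} Md.D.")
```

A negative difficulty means an exercise easier than one at the average ability, so the step from 50 percent correct to 75 percent correct comes out as very nearly 1 Md.D., in agreement with the text.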
The assumption that a 75 percent preference corresponds to a difference of 1 Md.D. is illustrated by Figure 3, in which the surfaces under the two curves are assumed to represent distributions of judges' ratings of two specimens, A and B. It is assumed that the opinions of judges concerning the merit or value of a specimen will form a normal distribution, the center or average of which is the true value. Therefore, the surface at the left, under curve A, is taken as representing the distribution of judges' opinions concerning specimen A and the point A on the base line where the solid vertical line meets it as the true value of A. Similarly, point B at the foot of the broken vertical line is assumed to represent the true value of B.

Figure 3. Illustration of Method of Determining Difference in Merit of Two Specimens by Judges' Ratings of One as Better or Worse Than the Other

¹⁸ It is also possible to say that 1 Md.D. is the difference in difficulty between an exercise answered correctly by 25 percent of the pupils and one answered correctly by 50 percent of them, but the form of statement given above is more usual.
¹⁹ In rating specimens for the purpose being discussed, judges are expected to rate each as better or worse than each of those with which it is compared. If they rate two as equal, the rating must be thrown out or divided between the two. Therefore if 75 percent of judges rate one specimen as better than another, 25 percent must rate it as worse.

If 75 percent of the judges rate B as superior to A, 75 percent of the area of the surface to the right, representing B, will lie above or to the right of the vertical line assumed to show the true value of A and of course 25 percent below or to the left of that line. Since 50 percent of the judges' ratings of B lie above its average merit, that is, to the right of the broken vertical line above point B, the portion of the surface representing B which is included between the two vertical lines must be 75 percent minus 50 percent, or 25 percent.
To make clear which this is, it has been shaded in the figure. We have already seen that a distance of 1 median deviation in one direction from the average of the distribution includes 25 percent of the total number of cases. Therefore, the distance between the two vertical lines must be 1 Md.D. in order that 25 percent of the area be included.

This method of determining the value or merit of specimens has been made use of in the case of a number of our standardized scales. Probably the best known example of its use is in connection with Thorndike's Handwriting Scale.²⁰ In his account of its construction he describes two methods, one of which is that just mentioned. He had samples of handwriting rated by a number of judges as to whether they were better or worse than the other samples and, according to the method outlined above, determined the differences in merit between the samples in terms of the median deviation. A sample considered to possess no merit as handwriting, though obviously an attempt to write, was used as the zero point and the distance of each sample above this point determined.

²⁰ Thorndike, E. L. "Handwriting," Teachers College Record, 11:1-41, March, 1910. For further examples, see:
Hoke, E. R. "The Measurement of Achievement in Shorthand." The Johns Hopkins University Studies in Education, No. 6. Baltimore: Johns Hopkins Press, 1922, p. 33-34.
Hillegas, M. B. "Scale for the measurement of quality in English composition by young people," Teachers College Record, 13:1-54, September, 1912.
Murdoch, Katherine. "The Measurement of Certain Elements of Hand Sewing." Teachers College Contributions to Education, No. 103. New York: Teachers College, Columbia University, 1919, p. 22-26.

Probable errors of sampling. The third use of the probable error is one to which that term is properly applied.
In this case it is employed directly as a measure of the size of the errors involved in sampling, that is to say, as a measure of the reliability of sampling. The probable error of sampling cannot be used alone, but must always be connected with some other measure such as an average, a standard or quartile deviation, a difference, a coefficient or ratio of correlation, a regression coefficient, or some other similar measure. Assuming that the sample has been selected in a random manner, in other words that it is not biased, the probable error of sampling gives an indication of how reliable such derived measures are when the cases upon which they are based are considered as a sample of a larger number of similar ones. For example, if the average score of five hundred eighth-grade children upon an intelligence test has been determined and it is assumed that no errors are present in the test scores or computations leading to the average, this average is the true one for the children actually tested. If the five hundred children have been selected from a much larger number in a city school system, the average obtained from their scores is not, except by chance, the true average of all the eighth-grade children in the system. However, if we assume that the five hundred children constitute a random sample, we can determine the reliability of the average actually obtained when considered as the average of all of the eighth-grade children in the system.

When the probable error of sampling is used, it is both customary and convenient to place a plus and minus sign, followed by the probable error, immediately after the measure to which it applies. Thus if the average intelligence quotient of the five hundred pupils had been found to be 102 and its probable error 3, it would frequently be written 102 ± 3, when considered as an average I.Q. of all of the eighth-grade pupils in the system. The same practice is also followed in the case of measures other than the average.
A second fairly common way of referring to the probable error of sampling is to use the abbreviation P.E. with a subscript indicating the measure to which it applies. Thus P.E._M denotes the probable error of the mean, P.E._Mdn that of the median, P.E._r that of the coefficient of correlation, and so on.

The interpretation of the probable error of sampling from the standpoint of chance is the same as that of the median deviation when used as a measure of variability or scatter. That is, the chances are even that the true measure of the whole group does not differ from the measure obtained from the sample by more than the value of the probable error; they are 4.6 to 1 that it does not differ by more than 2 P.E., 22 to 1 that it does not differ by more than 3 P.E., and so on. Another way of stating the same thing is that if a number of samples of the same size as the one already taken and similar to it were selected and corresponding measures computed from them, half of these measures would probably fall within 1 P.E. of the first one computed, 82 percent within 2 P.E., 96 percent within 3 P.E., and so on. Thus, in the case of the group of eighth-grade pupils referred to, it is probable that, if a number of similar samples were chosen and their means determined, half of them would fall within 3 points of 102, that is between 99 and 105, 82 percent between 96 and 108 (102 ± 6), 96 percent between 93 and 111 (102 ± 9), and so forth.

A good example of this use of the probable error is to be found in a recent issue of the Journal of Educational Psychology.²¹ In the article referred to the following table is given. It contains a number of means, standard deviations, and coefficients of correlation, each followed by its probable error. For example, the mean English grade of the first high-school group is given as 84.6 ± .3.
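The odds and percentages quoted above (even chances for 1 P.E., 4.6 to 1 and 82 percent for 2 P.E., 22 to 1 and 96 percent for 3 P.E.) follow from the normal curve, and the following Python sketch recomputes them. It assumes, as the text does, that the sampling errors are normally distributed; the function names are hypothetical and chosen only for this illustration.

```python
from statistics import NormalDist

_STANDARD = NormalDist()

# One probable error in standard-deviation units (about 0.6745).
SIGMA_PER_PE = _STANDARD.inv_cdf(0.75)


def fraction_within(k: float) -> float:
    """Fraction of a normal distribution lying within k probable errors of its center."""
    return 2.0 * _STANDARD.cdf(k * SIGMA_PER_PE) - 1.0


def odds_within(k: float) -> float:
    """Odds (to 1) that a normally distributed error does not exceed k probable errors."""
    inside = fraction_within(k)
    return inside / (1.0 - inside)


def limits(measure: float, probable_error: float, k: float) -> tuple:
    """Lower and upper limits: measure minus and plus k times the probable error."""
    return (measure - k * probable_error, measure + k * probable_error)


if __name__ == "__main__":
    # The example of the text: an average I.Q. of 102 with a probable error of 3.
    for k in (1, 2, 3):
        low, high = limits(102, 3, k)
        print(f"{k} P.E.: {low} to {high}, "
              f"about {fraction_within(k):.0%} of similar samples, "
              f"odds about {odds_within(k):.1f} to 1")
```

Run as a script, this prints the limits 99 to 105, 96 to 108, and 93 to 111 together with the corresponding percentages and odds.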
The entry 84.6 ± .3 indicates that if similar samples were taken it is probable that half of the obtained means would be between 84.3 and 84.9 (84.6 ± .3), 82 percent of them between 84.0 and 85.2 (84.6 ± .6), 96 percent between 83.7 and 85.5 (84.6 ± .9), and so on.

[Table reproduced from Gowen and Gooch, giving means, standard deviations, and coefficients of correlation, each followed by its probable error; the entries are illegible in this copy.]

²¹ Gowen, John W., and Gooch, Marjorie. "The mental attainments of college students in relation to previous training," Journal of Educational Psychology, 16:547-68, November, 1925. Other examples may be found in the following references:
Rich, S. G., and Skinner, C. E. "Intelligence among normal school students," Educational Administration and Supervision, 11:639-44, December, 1925.
Ellis, R. S. "A comparison of the scores of college freshmen and seniors on psychological tests," School and Society, 23:310-12, March 6, 1926.
Remmers, H. H., and Edna M. "The negative suggestion effect of true-false examination questions," Journal of Educational Psychology, 17:52-56, January, 1926.
Monroe, W. S. An Introduction to the Theory of Educational Measurements. Boston: Houghton Mifflin Company, 1923, p. 204.

The formulae by which to compute a probable error of sampling differ according to the measure for which it is being found. The following are the formulae for the probable errors of the mean, the median, the standard deviation, and the coefficient of correlation:

$$\mathrm{P.E.}_{M} = \frac{\mathrm{Md.D.}}{\sqrt{N}} \qquad \mathrm{P.E.}_{Mdn} = \frac{1.2533\,\mathrm{Md.D.}}{\sqrt{N}}$$

$$\mathrm{P.E.}_{\sigma} = \frac{.7071\,\mathrm{Md.D.}}{\sqrt{N}} \qquad \mathrm{P.E.}_{r} = \frac{.6745\,(1 - r^{2})}{\sqrt{N}}$$

In these typical formulae it will be noticed that there is one common element, $\sqrt{N}$, appearing in their denominators. The same is true of the formulae for the probable errors of sampling of almost all commonly used statistical measures. Even in those cases in which $\sqrt{N}$ does not appear directly in the denominator of the formulae, it or some similar expression is usually in some way contained therein. Since N stands for the number of cases in the sample, it can easily be seen that the larger the sample the larger is the denominator of the fraction and therefore the smaller the value of the probable error. In other words, increasing the size of the sample decreases the size of the probable errors present and hence increases the reliability or accuracy of the derived measures.

The probable error of a number of measurements of the same thing. Another situation in which the term probable error is appropriate is in measuring the size of the variable errors present in a number of measurements of the same thing. In most, if not all, situations it is impossible to measure a trait with such a high degree of precision and reliability that all similar measurements will agree exactly with the original one. For example, let us suppose that ten different persons determine a child's height or that the same person does so ten times. If height is being found only to the nearest inch and the persons doing the measuring are fairly competent it is likely that all the results will agree.
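Returning for a moment to the sampling formulae given earlier in this section, their dependence on the size of the sample can be sketched in Python. The sketch assumes the usual forms of these formulae, namely Md.D./√N for the mean, 1.2533 Md.D./√N for the median, .7071 Md.D./√N for the standard deviation, and .6745(1 − r²)/√N for the coefficient of correlation; the function names and the numerical values used in the demonstration are hypothetical.

```python
import math


def _root_n(n: int) -> float:
    """Square root of the number of cases, the element common to all four formulae."""
    if n <= 0:
        raise ValueError("the number of cases must be positive")
    return math.sqrt(n)


def pe_mean(median_deviation: float, n: int) -> float:
    """Probable error of the mean: Md.D. / sqrt(N)."""
    return median_deviation / _root_n(n)


def pe_median(median_deviation: float, n: int) -> float:
    """Probable error of the median: 1.2533 Md.D. / sqrt(N)."""
    return 1.2533 * median_deviation / _root_n(n)


def pe_standard_deviation(median_deviation: float, n: int) -> float:
    """Probable error of the standard deviation: .7071 Md.D. / sqrt(N)."""
    return 0.7071 * median_deviation / _root_n(n)


def pe_correlation(r: float, n: int) -> float:
    """Probable error of the coefficient of correlation: .6745 (1 - r^2) / sqrt(N)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    return 0.6745 * (1.0 - r * r) / _root_n(n)


if __name__ == "__main__":
    # Hypothetical distribution with a median deviation of 10 points.
    for n in (100, 400, 1600):
        print(n, pe_mean(10, n), round(pe_correlation(0.50, n), 4))
```

Because N enters only through its square root, a sample four times as large is needed to cut any of these probable errors in half.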
If, however, the attempt is made to secure a rather high degree of accuracy and results are given, let us say, to the nearest sixteenth of an inch, it is extremely improbable that the results obtained, whether by ten different persons or by the same person at ten different times, will be the same. There are generally two causes for this and frequently a third one. In the first place, even though the persons making the measurements are reasonably competent it is unlikely that all have just the qualities, such as keenness of eyesight, steadiness of hand, ability to time accurately, and so forth, necessary for accurate measurement or that all exercise exactly the same degree of care. Secondly, it is improbable that the child being measured will assume exactly the same posture when all ten measurements are being made. In addition, if different measuring instruments are used it is very unlikely that they are absolutely identical. The errors due to all these and any other chance causes are called variable errors²² and are often measured by the probable error. In a sense they may be thought of as errors of sampling, for just as a group too large to have all its members measured is sampled by measuring a part of them, so a characteristic which cannot be measured with absolute accuracy and therefore theoretically requires that an infinite number of measurements be made and averaged to secure a perfect one, is sampled by making a limited number of measurements.

A common example of the occurrence of variable errors is in connection with the giving of written examinations and tests.
At one testing period a pupil may happen to be feeling unusually well, whereas at another his health may be below par; at one time he may have reviewed the material covered by the questions recently, but at another it may happen that the questions touch material about which he knows little although he remembers most of what he has studied; at one time he may make a better score than he deserves by cheating, whereas at another his score may not indicate his true ability because his pencil broke or something outside the window attracted his attention, and so on. Similarly, when weight is being measured the result will vary according to whether or not the individual has eaten a meal recently, whether he is wearing heavier or lighter clothing than usual, has more or less in his pockets, and so on.

Since, because of all these variable errors, we can rarely, if ever, establish that any one obtained score is a true or even the best obtainable measurement of the characteristic being dealt with, the best that we can do is to supplement the scores obtained by a statement of their reliability. As in the case of the probable error of sampling so here the P.E. is commonly affixed to the obtained measure with a plus or minus sign connecting the two. Thus, a pupil's height may be stated as

²² For a fuller discussion of variable errors see: Monroe, W. S. "The constant and variable errors of educational measurements." University of Illinois Bulletin, Vol. 21, No. 10, Bureau of Educational Research Bulletin No. 15. Urbana: University of Illinois, 1923. 30 p.

[Tabular material on the intervening pages is illegible in this copy.]

to VIII, the probable errors of measurement of the three tests which make up the Illinois Examination, also their ratios to the averages. Taking the entries in the first line of the table as an illustration, half of the differences between the intelligence scores actually obtained in Grade III with forms 1 and 2 and the theoretically true scores were found to be less than 3.5 points and half of them greater than this amount. For the arithmetic test half of these differences were less than 2.6 points; for the comprehension scores of the silent reading test half of the differences were less than 1.2, and for the rate scores half of the differences less than 13.7 words per minute. The probable errors of estimate, which are not given in Table VI, would of course be larger, that corresponding to the probable error of measurement of 13.7 just mentioned being about 18.5 words per minute.
It will be noticed that in Tables V and VI the columns containing the probable errors of estimate and of measurement, respectively, are followed by columns showing the ratios of these measures to the corresponding averages.²⁹ This is done because the mere statement of the size of a probable error of estimate or of measurement usually conveys little definite meaning unless one knows the size of the individual measures themselves. Just as an error of an inch is of slight significance in measuring the distance between two cities or even the length of a lot but is relatively significant in measuring a person's height and very important in fitting a piston to its cylinder, so an error of a given number of points on a test becomes more significant the smaller the score. It will be seen that whereas Tables V and VI show either a slight tendency for the probable errors to be greater in the higher grades or else no regular tendency at all, they reveal that relative to the average scores, which increase from grade to grade, the errors become smaller, the ratios being considerably less in the eighth grade than in the third.

²⁹ There are certain objections raised to the use of these ratios which will not be discussed here, further than to admit that sometimes their use may be misleading. The writer believes, however, that in general their use is desirable both because the probable errors alone frequently convey little helpful information and because no better relative measure has been suggested.

CHAPTER III
THE COEFFICIENT OF CORRELATION

Definition of correlation. Before proceeding to discuss the use and interpretation of the coefficient of correlation it seems in order to define, first, what is meant by correlation in general, and, second, what is meant by the coefficient of correlation.
Two characteristics or traits are said to be correlated when there is a tendency for changes in the value of one to be associated or occur concurrently with changes in the value of the other. If most of the changes in one of the things being dealt with are in the same direction as the corresponding changes in the other, the correlation is said to be positive or direct; if in opposite directions, it is said to be negative or inverse. For example, if pupils' marks in algebra and English are being correlated, and in most cases pupils who are relatively high in one are also relatively high in the other and likewise those who are low in one are generally low in the other, the correlation is positive; whereas, if pupils who stand high in algebra tend to rank low in English, and vice versa, it is negative. The greater the proportion of associated changes which are in the same direction, the greater is the amount of positive correlation; the greater the proportion in opposite directions, the greater the negative correlation. It is also true that the greater the agreement in relative magnitude of the concurrent changes, the greater the degree of correlation, whether positive or negative. For example, if a pupil who is 10 percent above the average in English is also 10 percent above the average in algebra, if one who is 5 percent above in English is 5 percent above in algebra, and so on for most of the cases, the correlation is higher than if this condition does not obtain.

Examples of both positive and negative correlation are very numerous and easily found. For example, it is usually found that the greater a person's height, the greater his weight; and that the older a child, the greater his strength. Therefore, height and weight, and children's age and strength are positively correlated. On the other hand, after an adult passes a certain age strength tends to decrease with advancing years so that the correlation is negative.
This is also true when the two things compared are size of class and cost of instruction per pupil, since, on the whole, the larger the class the smaller is the cost for each member thereof.

The fact should be emphasized that the existence of correlation does not prove that there is any dependence or causal relationship between the two things correlated. It may be that such dependence exists, but it may also be that neither trait in any sense causes the other. Instead, the existing correlation may be due to the action of one or more outside factors which affect both the characteristics being dealt with. Sometimes the causal factor or factors may be even more remote than this, that is, some common cause may affect two characteristics or factors, each of these two may affect another, and so on, with the result that the final pair of characteristics considered, though relatively remote from the common cause, show correlation with each other. On the other hand, if the correlation between two traits is fairly high, the likelihood that one of them affects the other or that both are affected by a relatively proximate common cause, is great enough to be investigated as a probable hypothesis.

Definition of the coefficient of correlation. Although "coefficient of correlation" is sometimes used in a broad sense to include any one or all of a number of numerical expressions which summarize the degree of relationship between two variables, it is best to reserve this term for the product-moment coefficient of correlation, sometimes called the Pearson coefficient because its present extensive use is chiefly due to the English statistician, Karl Pearson. This expression, which is abbreviated by "r", is given by the formula: 2xy Na