UNIVERSITY OF CALIFORNIA PUBLICATIONS IN AGRICULTURAL SCIENCES
Vol. 4, No. 7, pp. 159-181, 12 text figures
September 10, 1920

A NEW AND SIMPLIFIED METHOD FOR THE STATISTICAL INTERPRETATION OF BIOMETRICAL DATA¹

BY GEORGE A. LINHART

¹ From the Division of Soil Chemistry and Bacteriology, College of Agriculture, University of California, Berkeley.

SECTION I

The derivation of the Law of Probability may be found in any text on the subject. Here we shall assume its validity and use it to obtain the several quantities which serve as criteria in statistical calculations. In the fundamental equation

$$y = k\,e^{-h^2 x^2} \qquad [1]$$

there are two characteristic constants, k and h, whose numerical values must be known for a given set of data before we can proceed with any calculations. A simple and at the same time exact method of obtaining the numerical values for those constants forms the subject of this paper.

Since y equals k when x equals zero, k is the probability of an error zero and will therefore be defined here as the largest number of measurements of a given set having the same numerical value; while y will denote any number of measurements whose group value ranges from zero to the group value of the number of measurements denoted by k or $y_0$. Equation (1) then becomes

$$\frac{y}{y_0} = e^{-h^2 x^2} \qquad [2]$$

which by means of logarithms we have transformed into a linear equation. Since the natural logarithm is 2.303 times the common logarithm, equation (2) gives $2.303\,\mathrm{Log}\,(y_0/y) = h^2 x^2$, whence

$$\mathrm{Log}\left(2.303\,\mathrm{Log}\,\frac{y_0}{y}\right) = 2\,\mathrm{Log}\,x + 2\,\mathrm{Log}\,h \qquad [3]$$

or

$$\mathrm{Log}\left(\mathrm{Log}\,\frac{y_0}{y}\right) = 2\,\mathrm{Log}\,x + 2\,\mathrm{Log}\,h - 0.3623 \qquad [4]$$

Collecting $2\,\mathrm{Log}\,h$ and $-0.3623$ into one constant, we have

$$\mathrm{Log}\left(\mathrm{Log}\,\frac{y_0}{y}\right) = 2\,\mathrm{Log}\,x + K \qquad [5]$$

If we now plot $\mathrm{Log}\,(\mathrm{Log}\,y_0 - \mathrm{Log}\,y)$ as ordinate against $\mathrm{Log}\,x$ as abscissa, all the measurements should theoretically fall on a straight line with a slope of 2, provided the data are susceptible to statistical interpretation; that is, provided they are truly chance data. Practically, however, even such data fall on either side of the straight line.
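The linearity asserted by equation (5) may be checked numerically. What follows is a minimal sketch in Python and is not part of the original method; the function names `loglog_ordinate` and `ideal_count`, and the constants h = 0.7 and y0 = 40, are assumptions chosen only for illustration. Counts are generated from equation (2), and the ordinate of equation (5) is compared with 2 Log x + K.

```python
import math

def loglog_ordinate(y, y0):
    """Return Log(Log(y0 / y)), using common (base-10) logarithms as in equation (5)."""
    if not 0 < y < y0:
        raise ValueError("y must lie strictly between 0 and y0")
    return math.log10(math.log10(y0 / y))

def ideal_count(x, h, y0):
    """Equation (2): the count y at residual x for a perfect chance distribution."""
    return y0 * math.exp(-(h * x) ** 2)

# Hypothetical constants, chosen only for illustration.
h, y0 = 0.7, 40.0

# K = 2 Log h - Log(2.303...), the constant of equation (5).
K = 2 * math.log10(h) - math.log10(math.log(10))

for x in (0.5, 1.0, 2.0):
    y = ideal_count(x, h, y0)
    left = loglog_ordinate(y, y0)
    right = 2 * math.log10(x) + K
    print(f"x={x:3.1f}  y={y:7.3f}  Log(Log(y0/y))={left:+.4f}  2 Log x + K={right:+.4f}")
```

For ideal data the two printed columns agree to rounding error; with experimental counts the points scatter about the line, as stated above.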
Drawing now the "best" straight line with a slope of 2 through these points, we can read off the values on the line as accurately as we choose, depending upon the scale of the plot, and construct a "theoretical" frequency curve for comparison with the experimental frequency curve obtained in the usual way; that is, by plotting the number of experiments in groups or classes against the measured values. Frequently $y_0$ does not fall directly over the arithmetical mean. In such a case the theoretical polygon may be shifted to the left or to the right, and this corresponds to the parallel shifting of the straight line from which the values for the construction of the theoretical frequency polygon have been obtained. Often this theoretical polygon reveals that the arithmetical mean calculated from the raw data is not in all cases the "best" mean, for, as frequently happens, one or two abnormal values will vitiate the mean considerably, especially if the number of experiments is not sufficiently large. We must, therefore, so superpose the two polygons as to make their areas approximately equivalent, since, as will be shown later, the areas play an important part in the calculation of the probable error.

A concrete example will best illustrate the method of procedure. In a recent paper by Waynick and Sharp (1919) are given the nitrogen contents of a hundred samples of a local soil. The results are recorded to 0.001%, based upon ten-gram samples, and therefore to 0.1 mg. In figure I these one hundred results are mapped in groups or classes 0.1 mg. apart, the circles indicating the number of determinations falling into each class.² Plotting these classes vertically to a scale of one-half inch per determination, we obtain the multimodal curve drawn immediately above the circles. Evidently the analyses are too fine as compared with the variability of nitrogen in those samples of soil.
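The effect of the class interval upon the shape of the experimental polygon may be tried on any set of numbers. The sketch below is in Python and is offered only as an illustration: the function `group_into_classes` is a hypothetical helper, and the one hundred values are synthetic (drawn at random about 10.0 mg), not the determinations of Waynick and Sharp. Values are held as whole tenths of a milligram so that the class limits are exact, and a value lying on a dividing line is simply counted in the upper class, which is a cruder rule than the one given in the footnote.

```python
import random
from collections import Counter

def group_into_classes(values_tenths, width_tenths):
    """Count determinations per class.

    Values and class width are integers in units of 0.1 mg.
    Returns {lower class limit in mg: number of determinations}.
    """
    counts = Counter((v // width_tenths) * width_tenths for v in values_tenths)
    return {lower / 10: counts[lower] for lower in sorted(counts)}

# Synthetic stand-in for one hundred determinations (NOT the published data):
# mean 10.0 mg, spread about 1.0 mg, rounded to the nearest 0.1 mg.
rng = random.Random(0)
sample = [round(rng.gauss(100, 10)) for _ in range(100)]

# Class intervals of 0.1, 0.5, 1.0 and 2.5 mg.
for width in (1, 5, 10, 25):
    classes = group_into_classes(sample, width)
    largest = max(classes.values())
    print(f"interval {width / 10:.1f} mg: {len(classes):2d} classes, largest class holds {largest}")
```

With the finest interval most classes hold only a few determinations and the outline is ragged; with the coarsest only a handful of classes remain.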
The determinations were then grouped in classes 0.5 mg. apart, resulting in the next curve above. This curve bears some resemblance to a "frequency" curve, but is still unsatisfactory. However, such a curve is quite sufficient for the construction of a theoretical frequency polygon by our straight line method. In this case $y_0$ would equal 25 and could be made to fall directly over the arithmetical mean, 10.0 mg.

² When the circles fall on a line dividing two classes, then if the number of circles is even they are equally divided between the two classes; if odd, the extra one is put into that class which helps to make the experimental polygon most symmetrical.

Indeed, if we attempt to plot these data in classes very much farther apart than 0.5 mg., say 2.0 mg., we obtain a so-called skew curve, and, finally, we may obtain a line sloping in one direction only when we plot these data in classes 2.5 mg. apart. It is evident, therefore, that such skew curves are meaningless.³ In the present case, when we plot the data in classes 1.0 mg. apart, the curve "skews" but slightly. Here $y_0$ falls directly over the arithmetical mean, and the one hundred determinations fall into four classes. With these four points on the curve, two on each side of the mean and approximately equidistant from it, we may construct the straight line as shown in figures VII, VIII and IX, where the values for $\mathrm{Log}\,(\mathrm{Log}\,y_0 - \mathrm{Log}\,y)$ are plotted as ordinates and the values for $\mathrm{Log}\,x$ as abscissae, x denoting the residuals on either side of the mean without regard to algebraic sign. It should be noted that in drawing the "best" straight line with the theoretical slope of 2 through such points, proportionately less weight must be given to points taken from the experimental polygon near the base than to those taken from the upper portion of the curve.
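The drawing of the "best" line by eye may be imitated by a weighted fit in which the slope is held at 2 and only the intercept K is sought. The Python sketch below rests on an assumption of ours, not of the text: each point is weighted in proportion to its class count y, so that points near the base count for less. The helper `weighted_intercept` is hypothetical; the four (x obs., y) pairs are those entered in table I, with y0 = 40.

```python
import math

def weighted_intercept(points, y0, slope=2.0):
    """Weighted least-squares intercept K of a line of fixed slope through the
    points (Log x, Log(Log(y0/y))). With the slope fixed, K is the weighted
    mean of (ordinate - slope * Log x)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for x, y in points:
        ordinate = math.log10(math.log10(y0 / y))
        weight = float(y)  # assumption: weight proportional to the class count
        weighted_sum += weight * (ordinate - slope * math.log10(x))
        total_weight += weight
    return weighted_sum / total_weight

# (x obs., y) pairs as entered in table I; y0 = 40.
table_one_points = [(2.0, 3), (2.0, 9), (1.0, 22), (1.0, 26)]
K_estimate = weighted_intercept(table_one_points, y0=40)
print(f"K = {K_estimate:+.3f}")
```

With this weighting the intercept comes out near -0.68, against the value K = -0.670 read from the plot for table I; a different choice of weights would shift it slightly.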
A little practice will soon enable one to judge at a glance which points are most significant.

Having now obtained the "best" straight line, we may calculate any number of values for x by means of equation (5), namely,

$$\mathrm{Log}\left(\mathrm{Log}\,\frac{y_0}{y}\right) = 2\,\mathrm{Log}\,x + K,$$

K denoting the distance on the $\mathrm{Log}\,(\mathrm{Log}\,y_0 - \mathrm{Log}\,y)$ axis, or ordinate, from the origin to the point of its intersection by the "best" straight line. In the present example $y_0$ equals 40 and y may be taken anywhere from one to thirty-nine, but for the construction of the theoretical polygon six to ten values for y will suffice. These are shown in table I.

³ A discussion of truly abnormal curves and their susceptibility to statistical interpretation will be given in another paper. See also section II of this paper.

Discussion of the Figures

Figure I has been fully discussed. Figure II is but another example of how to construct a theoretical polygon approximately equivalent in area to the experimental polygon. An interesting set of data is that mapped in figure III. Here the total nitrogen in each sample is so small that a few samples might have contained no measurable amount of nitrogen at all. The values for the construction of these two figures, II and III, were taken from a paper by Waynick (1918). The data mapped in figure IV are recorded in a paper by Batchelor and Reed (1918). Here, as in figure III, the theoretical polygon indicates that among the one thousand orange trees about three might have borne no fruit at all had they been left wholly to chance. In fact one tree yielded but five pounds of fruit, which is practically zero, while another yielded 341 pounds, the mean of all the thousand trees being 137.6 pounds of fruit. Two more interesting sets of data are those of Wood (1910) on the dry weights of mangel roots, and of Collins (1912) on butter fat. These results are mapped in figures V and VI.
In figures VII, VIII and IX is shown the construction of the straight lines from the experimental data as previously described. Finally, in figure X are mapped the results of bacterial counts taken from a recent article in Science (1920).

Calculation of the Index of Precision

Turning once more to the straight line plots in figures VII, VIII and IX, we see that we may read off the values for K of equation (5) to any degree of accuracy, depending upon the size of the scale of the plot. On the above plots, 20 x 20 inches, the values for K can be read off accurately to three places of decimals, which is quite sufficient for most cases. With this value for K of a given set of measurements we can calculate the value for h, the Index of Precision, since in equation (5) K was put in place of $2\,\mathrm{Log}\,h - 0.3623$; hence,

$$h = (10)^{\frac{K + 0.3623}{2}} \qquad [6]$$

Calculation of the Probable Error

The simplest way of calculating the probable error is to take from a probability integral table the value for hx corresponding to the integral value 1/2. This value for hx is 0.4769; hence,

$$x = 0.4769\,(10)^{-\frac{K + 0.3623}{2}} \qquad [7]$$

We might of course draw a straight line through every "class" point parallel to the "best" straight line and so obtain a probable error for each class which, when averaged, would give an average probable error. However, in most cases the probable error obtained from the "best" straight line is more accurate.

A more instructive method of calculating the probable error is to make a tracing of the theoretical polygon, which is constructed from the values read off on the straight line plot, on reasonably uniform tracing cloth, and then to cut out carefully the area under this curve, roll it up and finally weigh it on an accurate balance.
The polygon is then unrolled, folded exactly in two along the mode, and trimmed along the sides parallel to the fold by means of a photographer's print trimmer until it weighs exactly one-half of the original weight. Replacing now this trimmed tracing upon the original theoretical polygon, we may read off the probable error on the base of the polygon at the limit of the tracing.

Calculation of the Probable Error of the Arithmetical Mean

By means of the Principle of Least Squares it can be shown that the probable error of the arithmetical mean, $x_0$, is equal to the probable error (obtained from h) of one determination divided by the square root of the number of determinations, or,

$$x_0 = \frac{x}{\sqrt{n}} = \frac{0.4769}{\sqrt{n}}\,(10)^{-\frac{K + 0.3623}{2}}$$

TABLES OF RESULTS

In the tables below are given in the first columns the number of determinations falling into each class, while in the last columns are given the values calculated by means of the straight lines for the construction of the theoretical polygons. The headings are self-explanatory. The Roman numerals of each table correspond to the Roman numerals on the figures constructed from these tables. Empty entries are marked "..".

I. Last two columns calculated from K = -0.670

y    Log(y0/y)  Log(Log(y0/y))  x obs.  Log x obs.  Log x calc.  x calc.
0    +∞         +∞              ..      ..          +∞           ±∞
1    1.602      +0.205          ..      ..          +0.4375      2.739
3    1.125      +0.051          2.0     +0.301      +0.3605      2.293
5    0.903      -0.044          ..      ..          +0.3130      2.056
9    0.648      -0.188          2.0     +0.301      +0.2410      1.742
15   0.426      -0.371          ..      ..          +0.1495      1.411
20   0.301      -0.521          ..      ..          +0.0745      1.187
22   0.260      -0.586          1.0     0.000       +0.0420      1.102
26   0.187      -0.728          1.0     0.000       -0.0290      0.935
30   0.125      -0.903          ..      ..          -0.1165      0.765
35   0.058      -1.237          ..      ..          -0.2835      0.521
40   0.000      -∞              ..      ..          -∞           0.000

II. Last two columns calculated from K = -0.350

y    Log(y0/y)  Log(Log(y0/y))  x obs.      Log x obs.        Log x calc.  x calc.
0    +∞         +∞              ..          ..                +∞           ±∞
1    1.342      +0.128          1.7         +0.230            +0.239       1.734
2    1.041      +0.018          ..          ..                +0.184       1.528
3    0.865      -0.063          1.8         +0.255            +0.144       1.392
6    0.564      -0.249          ..          ..                +0.051       1.124
8    0.439      -0.357          ..          ..                -0.004       0.991
10   0.342      -0.466          0.8         -0.097            -0.058       0.875
15   0.166      -0.780          0.7         -0.155            -0.215       0.610
19   0.064      -1.194          0.2 or 0.3  -0.699 or -0.523  -0.422       0.378
22   0.000      -∞              ..          ..                -∞           0.000

III. Last two columns calculated from K = +0.250

y    Log(y0/y)  Log(Log(y0/y))  x obs.  Log x obs.  Log x calc.  x calc.
0    +∞         +∞              ..      ..          +∞           ±∞
1    1.447      +0.161          1.15    +0.061      -0.045       0.902
2    1.146      +0.057          ..      ..          -0.097       0.801
4    0.845      -0.073          0.85    -0.071      -0.162       0.690
7    0.602      -0.220          ..      ..          -0.235       0.582
8    0.544      -0.264          0.55    -0.260      -0.257       0.553
10   0.447      -0.350          ..      ..          -0.300       0.501
15   0.271      -0.567          ..      ..          -0.409       0.390
18   0.192      -0.717          0.35    -0.456      -0.484       0.328
21   0.125      -0.903          0.25    -0.602      -0.577       0.265
25   0.049      -1.310          ..      ..          -0.779       0.166
28   0.000      -∞              ..      ..          -∞           0.000

IIIa. Observed values, and values calculated from K = 5.625

                Observed                       Calculated from K = 5.625
y    Log(y0/y)  (Log(m/m0))^2  Log m           (Log(m/m0))^2  Log(m/m0)  Log m
0    +∞         ..             ..              +∞             ±∞         ±∞
1    1.591      ..             ..              0.2828         0.5318     +0.3318, -0.7318
5    0.892      0.2304         +0.28, -0.68    0.1585         0.3981     +0.1981, -0.5981
10   0.591      ..             ..              0.1051         0.3241     +0.1241, -0.5241
18   0.336      0.0576         +0.04, -0.44    0.0597         0.2443     +0.0443, -0.4443
19   0.312      0.0576         +0.04, -0.44    0.0555         0.2356     +0.0356, -0.4356
30   0.114      ..             ..              0.0203         0.1425     -0.0575, -0.3425
35   0.047      ..             ..              0.0084         0.0917     -0.1083, -0.2917
39   0.000      ..             ..              0.0000         0.0000     -0.2000

IV. Last two columns calculated from K = -4.100

y    Log(y0/y)  Log(Log(y0/y))  x obs.          Log x obs.        Log x calc.  x calc.
0    +∞         +∞              ..              ..                +∞           ±∞
1    2.182      +0.339          137.6 or 202.4  +2.139 or +2.306  +2.220       165.8
2    1.881      +0.274          162.4           +2.211            +2.187       153.8
3    1.705      +0.232          182.4           +2.261            +2.166       146.6
7    1.337      +0.126          142.4           +2.154            +2.113       129.7
8    1.279      +0.107          117.6           +2.070            +2.103       126.9
17   0.952      -0.021          122.4           +2.088            +2.039       109.5
20   0.881      -0.055          102.4           +2.010            +2.022       105.3
25   0.784      -0.106          97.6            +1.989            +1.997       99.3
51   0.474      -0.324          82.4            +1.916            +1.886       76.9
58   0.419      -0.378          77.6            +1.890            +1.861       72.6
62   0.390      -0.409          62.4            +1.795            +1.846       70.1
91   0.223      -0.652          42.4            +1.627            +1.724       53.0
116  0.118      -0.928          57.6            +1.760            +1.585       38.5
120  0.103      -0.987          37.6            +1.575            +1.556       36.0
124  0.089      -1.051          17.6            +1.246            +1.525       33.5
142  0.030      -1.523          22.4            +1.350            +1.288       19.4
152  0.000      -∞              ..              ..                -∞           0.0

V. Last two columns calculated from K = -0.980

y    Log(y0/y)  Log(Log(y0/y))  x obs.      Log x obs.        Log x calc.  x calc.
0    +∞         +∞              ..          ..                +∞           ±∞
1    1.623      +0.210          4.0 or 5.0  +0.602 or +0.699  +0.595       3.94
2    1.322      +0.121          4.0         +0.602            +0.551       3.55
7    0.778      -0.109          3.0         +0.477            +0.436       2.73
9    0.669      -0.175          3.0         +0.477            +0.403       2.53
16   0.419      -0.378          2.0         +0.301            +0.301       2.00
17   0.393      -0.406          2.0         +0.301            +0.287       1.94
24   0.243      -0.614          ..          ..                +0.183       1.52
32   0.118      -0.928          1.0         0.000             +0.026       1.06
33   0.104      -0.983          1.0         0.000             +0.002       1.00
38   0.043      -1.367          ..          ..                -0.194       0.64
42   0.000      -∞              ..          ..                -∞           0.00

VI. Last two columns calculated from K = +0.700

y    Log(y0/y)  Log(Log(y0/y))  x obs.        Log x obs.        Log x calc.  x calc.
0    +∞         +∞              ..            ..                +∞           ±∞
1    2.415      +0.383          0.85          -0.071            -0.159       0.694
3    1.938      +0.287          0.95          -0.022            ..           ..
4    1.813      +0.258          0.65 or 0.75  -0.187 or -0.125  ..           ..
5    1.716      +0.235          0.85          -0.071            ..           ..
7    1.570      +0.196          0.75          -0.125            ..           ..
8    1.512      +0.180          0.65          -0.187            ..           ..
11   1.374      +0.138          0.55          -0.260            -0.281       0.524
34   0.884      -0.054          0.45          -0.347            ..           ..
39   0.824      -0.084          0.55          -0.260            -0.392       0.406
45   0.762      -0.118          0.45          -0.347            ..           ..
58   0.652      -0.186          0.35          -0.456            ..           ..
63   0.616      -0.210          0.35          -0.456            -0.455       0.351
97   0.428      -0.369          0.25          -0.602            ..           ..
137  0.278      -0.556          0.25          -0.602            -0.628       0.236
200  0.114      -0.943          0.15          -0.824            -0.822       0.151
205  0.103      -0.987          0.15          -0.824            ..           ..
241  0.033      -1.481          0.05          -1.301            -1.091       0.081
260  0.000      -∞              ..            ..                -∞           0.000

X. Last two columns calculated from K = -2.690

y    Log(y0/y)  Log(Log(y0/y))  x obs.    Log x obs.        Log x calc.  x calc.
0    +∞         +∞              ..        ..                +∞           ±∞
1    0.903      -0.044          20 or 30  +1.301 or +1.477  +1.323       21
3    0.426      -0.371          ..        ..                +1.160       14
5    0.204      -0.690          10        +1.000            +1.000       10
7    0.058      -1.237          ..        ..                +0.727       5
8    0.000      -∞              ..        ..                -∞           0

Xa. Observed values, and values calculated from K = 5.?25 [table body not recoverable]
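The calculated columns of table I, and the quantities defined in equations (6) and (7) and in the formula for the probable error of the mean, may be reproduced with a few lines of Python. This is a sketch only: the function names are our own, n = 100 is assumed from the hundred determinations behind figure I, and `half_area_abscissa` is a numerical stand-in for the cut-and-weigh procedure, finding by bisection the x at which the area under the curve between -x and +x is one-half of the whole (that is, erf(hx) = 1/2).

```python
import math

LOG_2303 = 0.3623  # Log 2.303, the constant appearing in equation (4)

def calculated_x(y, y0, K):
    """Invert equation (5): Log x = (Log(Log(y0/y)) - K) / 2. Returns (Log x, x)."""
    log_x = (math.log10(math.log10(y0 / y)) - K) / 2
    return log_x, 10 ** log_x

def index_of_precision(K):
    """Equation (6): h = 10 ** ((K + 0.3623) / 2)."""
    return 10 ** ((K + LOG_2303) / 2)

def probable_error(K):
    """Equation (7): the residual x at which h * x = 0.4769."""
    return 0.4769 / index_of_precision(K)

def probable_error_of_mean(K, n):
    """Probable error of one determination divided by the square root of n."""
    return probable_error(K) / math.sqrt(n)

def half_area_abscissa(h, tol=1e-10):
    """Bisect for the x at which erf(h * x) = 1/2, i.e. half the area lies within +/- x."""
    lo, hi = 0.0, 10.0 / h
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if math.erf(h * mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

K, y0, n = -0.670, 40, 100  # table I; n = 100 is an assumption (see above)
for y in (1, 3, 5, 9, 15, 20, 22, 26, 30, 35):
    log_x, x = calculated_x(y, y0, K)
    print(f"y={y:2d}  Log x={log_x:+.4f}  x={x:.3f}")

h = index_of_precision(K)
print(f"h = {h:.4f}")
print(f"probable error = {probable_error(K):.4f}")
print(f"probable error of the mean = {probable_error_of_mean(K, n):.4f}")
print(f"half-area abscissa = {half_area_abscissa(h):.4f}")
```

The printed values of x differ from those of table I only in the last figure, because the table was computed from logarithms rounded to three places. For K = -0.670 the index of precision is about 0.70 and the probable error about 0.68 mg., and the half-area abscissa agrees with the probable error, which is the numerical counterpart of weighing the trimmed tracing.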
[Text figures, pp. 172-181: not reproduced. Legible labels: "Percent of Butter Fat in Mil..." (butter fat figure); "Number of Bacterial Counts" (figure X).]