title: Geographic Spines in the 2020 Census Disclosure Avoidance System TopDown Algorithm
authors: Abowd, John M.; Ashmead, Robert; Cumings-Menon, Ryan; Kifer, Daniel; Leclerc, Philip; Ocker, Jeffrey; Ratcliffe, Michael; Zhuravlev, Pavel
date: 2022-03-30

Abstract: The TopDown Algorithm (TDA) first produces differentially private counts at the nation level and then produces counts at lower geolevels (e.g., state, county) subject to the constraint that all query answers in lower geolevels are consistent with those at previously estimated geolevels. This paper describes the three sets of definitions of these geolevels, or geographic spines, that are implemented within TDA. These include the standard Census geographic spine and two other spines that improve accuracy in geographic areas that are far from the standard Census spine, such as cities, towns, and AIAN areas. The third such spine, called the optimized spine, also modifies the privacy-loss budget allocated to the entities within the geolevels, or geounits, to ensure the privacy-loss budget is used efficiently within TDA.

The Census TopDown Algorithm (TDA) outputs microdata that satisfy either pure or zero-concentrated differential privacy (DP). TDA begins by estimating the differentially private data histogram at the coarsest (highest) geographic level, or geolevel, in the hierarchy, which is the U.S. (or the Commonwealth of Puerto Rico). Next, TDA estimates differentially private data histograms at progressively finer (lower) geographic granularity, subject to the constraint that these histograms are consistent with the estimates of the next-higher (parent) geolevels. TDA processes the geolevels in the hierarchy given by U.S., state, county, tract group, tract, block group, and block. Geounits are defined as the geographic entities in each of these geolevels.
The definition of the geounits in this hierarchy is called the geographic spine. TDA originally used the conventional geographic spine based on the Census Bureau Geography Division definitions of state, county, tract, block group, and block. Block groups, as defined by the Census Bureau's Geography Division, are the smallest geographic tabulation areas for the American Community Survey. Tracts, as defined by the Geography Division, are statistical tabulation areas used by decennial censuses and the American Community Survey. For both of these definitions, the tabulation geographies are designed to produce statistics on meaningful areas that can be usefully compared over time. Note that the geographic spine is a construct that is used internally in the TDA, and, even when the geographic spine used within the TDA does not correspond to this conventional spine, the statistical tabulations based on the TDA output define all geounits, including tracts and block groups, by the Geography Division's standard definitions. In order to ensure computational feasibility and enhance the accuracy of statistical tabulations, TDA created tract groups and redefined block groups in a manner consistent with the hierarchical structure of all geographic spines and the Geography Division's definitions of counties, tracts, and blocks. In particular, TDA used tract groups and redefined block groups to reduce the number of child geounits of each parent geounit, called the fan-out values. The conventional spine, as defined by the Geography Division, had two primary shortcomings when used within TDA. First, legally defined American Indian/Alaska Native/Native Hawaiian (AIAN) tribal areas are usually far from this spine, meaning that many on-spine geounits must be added to or subtracted from one another to compose these geographic areas. Being distant from the spine often results in query count estimates with greater mean squared error than those of equivalent geounits that are on-spine.
Second, the conventional spine also places many other important legal, political, or Census-designated entities far from the spine. These include minor civil divisions (MCDs) in regions that are both within strong-MCD states and outside of AIAN tribal areas, and incorporated places in regions that are either outside of the strong-MCD states or within AIAN areas. Collectively, we refer to these geographic entities as the Place/MCD entities (PMEs) in the remainder of the paper. To ameliorate the off-spine distance of AIAN tribal areas, TDA placed the state-level aggregate of all AIAN tribal areas on the spine. Then, TDA redefined each geounit at the state geolevel and below by the intersection of the geounit's original definition and either the on-spine AIAN areas or their complement. This geographic hierarchy is called the AIAN spine. It was audited by the Geography Division for correctness in defining the AIAN tribal areas and their complements according to the final 2020 Census tabulation geographies. This document describes several optimization heuristics that update the AIAN spine to enhance a few aspects of the spine; we call the resulting geographic spine the optimized spine. First, the spine optimization methods redefine block groups and tract groups to bring a user-specified set of off-spine entities (OSEs) closer to the spine. The current definition of OSEs used within the TDA includes the PMEs and, to decrease the correlation between blocks with group quarters (GQs) and their nearby counterparts, one additional OSE for each tract and each combination of the seven major GQ types, namely the union of all blocks within the tract that contain GQs of each type in the combination. These redefinitions also ensure that fan-out values are not too high.
After this is done, the second stage of the spine optimization algorithm reduces the variance of the final estimate by ensuring that fan-out values of the spine are also not too low, using a heuristic based on the variance matrix of the OLS estimator. For example, this decision rule results in parent geounits that have only one child being bypassed: the consistency-with-parent constraints require that the count estimates of the child be equal to those of the parent, so bypassing the parent geounit, i.e., reallocating all of the privacy-loss budget (PLB) from the parent to the child and then removing the parent, reduces the variance of the DP answers used to derive the final histogram estimate. The heuristics used in these optimization routines are based on existing results from the DP literature, which are provided in Section 2. Section 3 then describes how geographic spines are represented here using matrices and also provides a summary of the remainder of the paper. The method used in the first stage is described in Section 4. Section 5 describes the method used in the second stage and provides a result on the privacy guarantees of the TDA when an optimized spine is used. The next subsection outlines the notation used throughout this paper.

1.1. Notation. Throughout the paper we will denote matrices using capitalized font and vectors using lower-case bold font. Let the Kronecker product of the real matrices A, B be denoted by A ⊗ B, and elementwise division, when these matrices are conformable, by A ⊘ B. We will denote the elementwise absolute value of the real matrix A by |A|. A length-N column vector with each element equal to c ∈ R will be denoted by c_N, and, when there is little risk of confusion, we will omit the subscript N. We will make use of several probability distributions, but two are worth pointing out explicitly, as their definitions are less standard.
First, we will define the discrete Gaussian random variable so that its probability mass function, f : Z → R, is given by f(x) ∝ exp(−(x − µ)²/(2σ²)), and denote the distribution of this random variable by N_Z(µ, σ²); see, for example, [3]. Second, we will also define the discrete Laplace random variable so that its probability mass function, f : Z → R, is given by f(x) ∝ exp(−|x − µ|/b), and denote the distribution of this random variable by Laplace_Z(µ, b); see, for example, [8]. We also conform with the usual notation for the (continuous) Gaussian distribution, and denote the distributions of length-n column vectors with elements that are independently distributed as N_Z(µ_i, c_i), Laplace(µ_i, c_i), and Laplace_Z(µ_i, c_i), where µ ∈ R^n and c ∈ R^n_+, by N_Z(µ, diag(c)), Laplace(µ, diag(c)), and Laplace_Z(µ, diag(c)), respectively.

2.1. Privacy Definitions. The definitions of pure and approximate differential privacy are provided by [6, 7], and these definitions are also restated below. Both definitions can be viewed as a limit on the difference between the distribution of a mechanism's output when the input is a given database and the distribution of the output when the input is a neighboring database. The specific notion of difference used in these definitions, along with the required upper bounds, is generally self-evident, but more care is required for the notion of a neighboring database. This is because there are multiple definitions of this term in common use, and the privacy semantic guarantees that these definitions imply differ from one another. For example, two databases are said to be unbounded neighbors if one database can be derived by adding or removing a record from the other database; likewise, two databases are said to be bounded neighbors if one database can be derived from the other by adding one record and removing one record [10].
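To make the two nonstandard distributions concrete, the following sketch samples from N_Z(µ, σ²) and Laplace_Z(µ, b) by normalizing the stated pmfs over a wide integer support. This direct-normalization approach is an illustration only, not the production sampler used for the census; exact samplers (e.g., the rejection sampler of [3]) avoid truncating the support. The helper name `sample_discrete` is hypothetical.

```python
import numpy as np

def sample_discrete(mu, scale, kind="gaussian", support=200, size=1, rng=None):
    """Sample from the discrete Gaussian N_Z(mu, sigma^2) (scale = sigma^2)
    or the discrete Laplace Laplace_Z(mu, b) (scale = b) by normalizing the
    pmf over the integers mu-support, ..., mu+support.

    Illustrative only: truncating the support is fine for a demo, but exact
    samplers are preferred in production privacy systems.
    """
    rng = np.random.default_rng(rng)
    ks = np.arange(mu - support, mu + support + 1)
    if kind == "gaussian":
        logp = -((ks - mu) ** 2) / (2.0 * scale)
    else:
        logp = -np.abs(ks - mu) / scale
    p = np.exp(logp - logp.max())  # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(ks, p=p, size=size)
```

With σ² = 1 the samples are integers whose empirical mean and standard deviation are close to 0 and 1, mirroring the continuous Gaussian.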
In other words, the set of bounded neighbors can be written as {x, x′ ∈ X_d | d_H(x, x′) = 2}, where X_d is the set of databases with d records and d_H denotes the Hamming distance, and the set of unbounded neighbors as {x, x′ | d_H(x, x′) = 1}. The definition of neighbor has important implications because the upper bounds provided in the privacy guarantees below can be translated into an upper bound on the power of any statistical test of the null hypothesis that the input database is x versus the alternative that it is a neighbor of x [5, 13]. For example, using bounded neighbors does not provide a theoretical bound on the accuracy of inferences on the number of records in the input database. However, in the case of tests with a null hypothesis that the database is x and an alternative hypothesis that it is a given bounded neighbor of x, formal privacy definitions that use bounded neighbors provide stronger limitations on the accuracy of privacy attackers' inferences. This can be shown for the privacy definitions below using the fact that a bounded neighbor of a given database is also an "unbounded neighbor of an unbounded neighbor" of the database. The privacy guarantees provided by the TDA use a unique definition of neighbor that does not protect against certain inferences. Specifically, TDA does not protect each state's total population or the locations of each group quarters type. In other words, TDA outputs a database with state total populations that are identical to those of the input database, and, since at least one person must reside in each group quarters facility by the Census definition of group quarters, the population residing in each group quarters type in a given geounit is constrained to be at least the number of group quarters facilities of that type within the geounit. For more detail on these two categories of data-dependent constraints imposed by TDA, as well as those that are data-independent, see [1].
For this reason, in the context of the TDA, two databases are said to be neighbors if one database can be derived from the other by adding one record to a given state and then removing one record from that same state, such that both databases also satisfy the group quarters invariant constraints. To emphasize this definition of neighbors in the presence of invariant constraints in our privacy definitions below, we will define the set of pairs of neighboring databases as {x, x′ ∈ U | d_H(x, x′) = 2}, where U is a given universal set of databases. For example, this notation is a generalization of the definition of bounded neighbors, which used U = X_d. In the results throughout this paper that use these privacy definitions, we will denote the set of databases that satisfy the group quarters invariant constraints by G and the set of databases with d_i records in each state i by X_d. Using this notation, the definition of neighbors used by the TDA can be written as {x, x′ ∈ X_d ∩ G | d_H(x, x′) = 2}.

Definition 1. (Pure/Approximate Differential Privacy [6, 7]) A randomized algorithm M : U → Y satisfies approximate (ε, δ)-differential privacy if, for all (x, x′) ∈ {x, x′ ∈ U | d_H(x, x′) = 2} and all E ⊆ Y, we have P(M(x) ∈ E) ≤ exp(ε) P(M(x′) ∈ E) + δ. Also, we say a randomized algorithm satisfies pure ε-differential privacy if it satisfies (ε, 0)-differential privacy.

The definition of ρ-zero-concentrated DP (ρ-zCDP) [4, 3, 2] is provided below. We will primarily use ρ-zCDP as an intermediate privacy-accounting tool when deriving an approximate DP mechanism; a result from [3], which is restated in the next subsection, provides a way to translate ρ-zCDP privacy guarantees into (ε, δ)-DP privacy guarantees.

Definition 2. (Zero-Concentrated Differential Privacy [4]) A randomized algorithm M : U → Y satisfies ρ-zCDP if, for all (x, x′) ∈ {x, x′ ∈ U | d_H(x, x′) = 2} and all α ∈ (1, ∞), D_α(M(x) ∥ M(x′)) ≤ ρα, where D_α(P ∥ Q) is the Rényi divergence of order α between the distributions P and Q.

2.2. Privacy-Loss Accounting Results.
A general pattern in our proofs that the TDA is DP will be to first establish that the DP primitive mechanism for each query i is either ε_i-DP or ρ_i-zCDP, and then use parallel and sequential composition to show that the combination of all of the DP primitive query answers used in the TDA is ∑_i ε_i-DP or ∑_i ρ_i-zCDP. In the case of approximate DP, we will also require the final step of establishing the (ε, δ)-DP guarantee implied by ∑_i ρ_i-zCDP. This section contains the intermediate results that will be used in these proofs. The first two results will be used in the next subsection to show that each DP primitive mechanism satisfies (ε, 0)-DP or ρ-zCDP. While providing a comprehensive compilation of the properties implied by a mechanism satisfying either (ε, δ)-DP or ρ-zCDP is beyond the scope of this paper, a few of these properties that we make use of below are worth explicitly pointing out. First, these privacy guarantees share the property that they are invariant to post-processing. In other words, if M : U → Y satisfies either of these privacy guarantees, then so does the mechanism f ∘ M, where f : Y → Z. Second, if M_1(x) is ε_1-DP (respectively, ρ_1-zCDP) and M_2(x) is ε_2-DP (ρ_2-zCDP), then releasing the output of M_1(x) and M_2(x) simultaneously is itself (ε_1 + ε_2)-DP ((ρ_1 + ρ_2)-zCDP), which is called sequential composition. In cases in which M_1(x) and M_2(x) depend only on disjoint subsets of the arbitrary dataset x ∈ U, releasing the outputs of both of these mechanisms simultaneously is (max{ε_1, ε_2})-DP (respectively, (max{ρ_1, ρ_2})-zCDP), which is known as parallel composition. The following lemma uses sequential and parallel composition to provide privacy guarantees for a mechanism that is defined by multiple univariate mechanisms of the kind described in Lemmas 1 and 2.

Lemma 3.
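The two composition rules above reduce to a sum or a max over the per-mechanism parameters, and both ε (pure DP) and ρ (zCDP) compose the same way. A minimal sketch (the helper name `composed_privacy_loss` is hypothetical, not from the TDA codebase):

```python
def composed_privacy_loss(losses, disjoint):
    """Combine per-mechanism privacy parameters (epsilons for pure DP,
    or rhos for zCDP, which follow the same two rules).

    disjoint=True  -> parallel composition (mechanisms read disjoint
                      subsets of the data): the max of the parameters.
    disjoint=False -> sequential composition (same data touched by all
                      mechanisms): the sum of the parameters.
    """
    losses = list(losses)
    return max(losses) if disjoint else sum(losses)
```

For example, three mechanisms with parameters 0.1, 0.2, and 0.3 compose sequentially to 0.6, but only cost 0.3 under parallel composition.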
(Theorem 14 [3], Proposition 1 [6], Theorem 3.2 [8]) Suppose x, x′ ∈ X_d differ on a single entry, and let a randomized algorithm M : X_d → Y^m be defined by M(x) = q(x) + y, where q : X_d → Y^m and y is an m-dimensional column vector of independent random variables. Then we have the following.

(1) Let b_1, . . . , b_m > 0, and suppose y_j ∼ Laplace(0, b_j) or y_j ∼ Laplace_Z(0, b_j) for each j. If ∑_{j=1}^m |q_j(x) − q_j(x′)|/b_j ≤ ε for all such pairs x, x′, then M satisfies ε-DP.

(2) Let σ_1, . . . , σ_m > 0 and ρ > 0, and suppose y_j ∼ N(0, σ_j²) or y_j ∼ N_Z(0, σ_j²) for each j. If ∑_{j=1}^m (q_j(x) − q_j(x′))²/(2σ_j²) ≤ ρ for all such pairs x, x′, then M satisfies ρ-zCDP.

The next result provides a way to translate between an (ε, δ)-DP guarantee and a ρ-zCDP guarantee. Note that this result implicitly defines ρ in terms of ε and δ. After ρ is computed numerically using this result, the TDA scales the parameters denoted by σ² that are used within the DP primitive mechanisms to ensure that releasing a combination of all of these primitive mechanisms simultaneously satisfies (ε, δ)-DP.

Lemma 4. ([3, 4]) If a randomized algorithm M satisfies ρ-zCDP, then, for any ε ≥ 0, M satisfies (ε, δ)-DP with δ = inf_{α ∈ (1, ∞)} (exp((α − 1)(αρ − ε))/(α − 1)) (1 − 1/α)^α.

We refer the interested reader to [4, 3] for more detail on how the optimization problem above can be solved efficiently within the numerical method used to compute ρ.

2.3. Marginal Query Groups. The notation for each query q(·) in the preceding subsections is actually more general than required to describe the TDA. Specifically, the TDA makes use only of linear queries. We will define a linear query as a length-n vector, and a linear query answer as the inner product of a linear query and the vector of histogram cell counts. Since we only consider linear queries from this point forward, we will sometimes refer to a linear query simply as a query below. A query matrix refers to linear queries that are vertically stacked on top of one another. The linear queries used within the TDA have elements that are either zero or one, and they are defined so that each linear query answer provides an individual count for a marginal of the full histogram. It will be convenient to group the queries providing counts for the same marginal together, so we will define a query group as the linear queries that provide counts for the same marginal, vertically stacked on top of one another.
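The TDA inverts a tight zCDP-to-(ε, δ) bound numerically. As a simpler sketch, the classic closed-form bound from [4] states that ρ-zCDP implies (ρ + 2√(ρ ln(1/δ)), δ)-DP; inverting it by bisection gives a quick (looser) way to compute a ρ from a target (ε, δ). Both helper names here are hypothetical.

```python
import math

def zcdp_to_eps(rho, delta):
    """Closed-form bound: rho-zCDP implies
    (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP.
    The TDA uses a tighter numerically optimized bound; this is a
    simpler, more conservative alternative for illustration."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

def eps_to_rho(eps, delta, tol=1e-12):
    """Bisection: largest rho whose implied epsilon under the bound
    above does not exceed the target eps (the bound is increasing in
    rho, so bisection on [0, eps] suffices)."""
    lo, hi = 0.0, eps
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if zcdp_to_eps(mid, delta) <= eps:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, with a target of ε = 1 and δ = 1e-10, the recovered ρ is roughly 0.01, i.e., far smaller than ε itself because of the √(ln(1/δ)) penalty.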
For example, suppose the schema of the original database is CENRACE × HISP × VOTINGAGE, where CENRACE and HISP indicate which of the 63 Census race combinations, and which of the two hispanicity levels, the respondent identified as, respectively, and VOTINGAGE indicates whether the respondent is at least 18 years of age (see footnote 1). In this case, the CENRACE × VOTINGAGE query group is defined as the matrix Q = I_63 ⊗ 1′_2 ⊗ I_2, where 1′_2 denotes the transpose of 1_2, and multiplying the vector of histogram cell counts of a database by Q provides the CENRACE × VOTINGAGE query group answers. More generally, the query matrix of a given query group is defined as a matrix that can be written as a series of Kronecker products of identity matrices and row vectors of ones. Note that this definition encompasses the cases in which these Kronecker factors are all row vectors of ones or are all identity matrices. The inputs of the TDA include either the proportions of PLB, when pure DP primitive mechanisms are used, or the proportions of ρ, when approximate DP primitive mechanisms are used, to allocate to each query group for a given geolevel. The following lemma uses this notational convention, while considering only a single geounit, to provide the ε-DP and ρ-zCDP privacy guarantees of releasing the output of the primitive mechanisms of this geounit (see footnote 2). This will be used in the next section to provide the privacy guarantees of the TDA when either the conventional spine or the AIAN spine is used.

Lemma 5. Let M(x) := stack({Q[i]x + y[i]}_{i=1}^q), where x is a vector of the histogram counts for a database with d records, each Q[i] is a query group matrix, and y[i] is a length-m[i] column vector of independent random variables. Then we have the following.

(1) If y[i] ∼ Laplace(0, (2/(εα[i])) I_{m[i]}) for each i and ∑_{i=1}^q α[i] ≤ 1, then M satisfies ε-DP.

(2) If y[i] ∼ N(0, (1/(ρα[i])) I_{m[i]}) for each i and ∑_{i=1}^q α[i] ≤ 1, then M satisfies ρ-zCDP.

Proof. Since, for each i ∈ {1, . . . , q}, only one element of each column of Q[i] is nonzero, we will use Lemma 3 to leverage parallel composition to prove both cases.
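The Kronecker structure of the CENRACE × VOTINGAGE query group from the example above can be checked directly in NumPy (toy cell counts; the 63 × 2 × 2 schema and lexicographic cell ordering are as described in the text):

```python
import numpy as np

# Schema CENRACE x HISP x VOTINGAGE: 63 * 2 * 2 = 252 histogram cells.
# The CENRACE x VOTINGAGE query group marginalizes out HISP, so the HISP
# Kronecker factor is a row vector of ones.
Q = np.kron(np.kron(np.eye(63), np.ones((1, 2))), np.eye(2))
assert Q.shape == (126, 252)  # one row per CENRACE x VOTINGAGE cell

x = np.arange(252.0)  # a toy histogram cell-count vector
answers = Q @ x

# Each answer sums the two HISP cells sharing a (CENRACE, VOTINGAGE) pair;
# with cells ordered (cenrace, hisp, votingage), answer 0 = x[0] + x[2].
assert answers[0] == x[0] + x[2]
# Every cell contributes to exactly one query in a marginal query group,
# so the group answers sum to the total population.
assert answers.sum() == x.sum()
```

This also illustrates why each column of a query group matrix has exactly one nonzero entry, the fact used in the proof of Lemma 5.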
This reduces the problem to carrying out sequential composition of the set of mechanisms {Q[i]x + y[i]}_{i=1}^q. In both cases, we will use the fact that, whenever x, x′ are vectors of histogram cell counts of databases differing on a single entry, Q[i](x − x′) is a vector with at most two elements that are equal to one in absolute value, with the remaining elements equal to zero. In the first case, Lemma 3 implies that releasing Q[i]x + y[i] is εα[i]-DP, so, using sequential composition, releasing the output of all q of these mechanisms is ε ∑_i α[i]-DP. In the second case, Lemma 3 implies that releasing Q[i]x + y[i] is ρα[i]-zCDP, so, using sequential composition, releasing the output of all q of these mechanisms is ρ ∑_i α[i]-zCDP. □

2.4. The Matrix Mechanism. This section will describe a class of random mechanisms known as matrix mechanisms in the context of linear queries composed of query groups [11]. Using the notation introduced in Lemma 5, we will assume that the errors of the i-th query group are random variables distributed as either y[i] ∼ Laplace(0, (2/(εα[i])) I) or y[i] ∼ N(0, (1/(ρα[i])) I); we define the stacked error vector y := stack({y[i]}_{i=1}^q) and the response variable z := Wx + y, which is also the vertically stacked output of the mechanism described in Lemma 5. We will assume that W has full column rank, which can be ensured by including the detailed cell query group in W (i.e., the identity matrix). For both of the distributions that we consider in this section, we can define an alternative response variable that is observationally equivalent to z, but with homoskedastic errors. For example, in the case of ε-DP, we have y ∼ Laplace(0, diag(2 ⊘ (εα))), so we can define the rescaled errors as ỹ := diag(εα/2)y, which are distributed as ỹ ∼ Laplace(0, I), the rescaled workload as W̃ := diag(εα/2)W, and the rescaled response variable as z̃ := diag(εα/2)z = W̃x + ỹ. In the case of ρ-zCDP, we have y ∼ N(0, diag(1 ⊘ (ρα))), so we can define the rescaled errors as ỹ := diag(√(ρα))y, which are distributed as ỹ ∼ N(0, I), the rescaled workload as W̃ := diag(√(ρα))W, and the rescaled response variable as z̃ := diag(√(ρα))z = W̃x + ỹ. One simple example of a matrix mechanism is a mechanism that releases the ordinary least squares estimates of Wx. Specifically, this can be done by first estimating x as x̂ := argmin_x ‖W̃x − z̃‖²₂ = (W̃′W̃)⁻¹W̃′z̃, and then defining the output of the mechanism as Wx̂. Note that, in either the case of ε-DP or ρ-zCDP, the privacy guarantee of the final mechanism follows from the fact that a mechanism that released z̃ would satisfy the same privacy guarantee, along with the invariance-to-post-processing property. This simple mechanism can be generalized by making a distinction between the linear query answers provided as output and the linear queries used to estimate x̂. Specifically, we will define the strategy matrix A ∈ R^{p×n} by A := stack({R[i]}_{i=1}^r), where each R[i] is a query group matrix, and the query group PLB or precision proportions by {γ[i]}_{i=1}^r.

Footnote 1: While we are using attributes that are part of the census, note that this example, and the queries in this section, are simply described in the context of a hypothetical survey that does not use a geographic spine. The linear query notation introduced here will be extended in the next section to account for the presence of the geographic spine.

Footnote 2: Alternatively, for datasets that contain an attribute identifying each respondent's geounit for a given geolevel, this lemma can also be used to provide the privacy guarantees of releasing the output of all of the primitive DP mechanisms for the geolevel. However, depending on how the operation of bypassing geounits is defined, this can be less straightforward in the case of the optimized spine because there may be respondents in the dataset that are not included in any geounit in the geolevel. This alternative interpretation of the lemma will be discussed in more detail in Section 3, and the operation of bypassing geounits will be defined in Section 5.
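The OLS matrix mechanism described above can be sketched end to end in NumPy for the ρ-zCDP (Gaussian) case. The strategy, precision shares, and toy data below are assumptions for illustration, not the TDA's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(7)

n = 4                                    # histogram cells
x = np.array([10.0, 3.0, 5.0, 2.0])      # toy true cell counts

# Strategy: a total-sum query group plus the detailed cell query group,
# with precision proportions gamma summing to 1 and total precision rho.
A = np.vstack([np.ones((1, n)), np.eye(n)])
gamma = np.array([0.5] + [0.5 / n] * n)
rho = 1.0

# Gaussian primitive mechanism: per-query variance 1/(rho * gamma_i).
y = rng.normal(0.0, np.sqrt(1.0 / (rho * gamma)))
z = A @ x + y

# Rescale so the errors are homoskedastic, then solve least squares.
r = np.sqrt(rho * gamma)
A_tilde = r[:, None] * A                 # diag(r) @ A
z_tilde = r * z
x_hat, *_ = np.linalg.lstsq(A_tilde, z_tilde, rcond=None)

# Variance matrix of W x_hat for a workload W (here: detailed cells).
W = np.eye(n)
V = W @ np.linalg.inv(A_tilde.T @ A_tilde) @ W.T
```

Note that the diagonal of V is below 1/(ρ · 0.125) = 8, the variance the detailed cells would have if only their own precision share were used: the total-sum query's precision is pooled into every cell estimate by OLS.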
In the same manner as described above for W, we can define the rescaled strategy matrix as Ã := diag(εγ/2)A, when deriving an ε-DP mechanism, or Ã := diag(√(ργ))A, when deriving a ρ-zCDP mechanism. Likewise, the rescaled error vector and response variable can be defined as above, so that z̃ := Ãx + ỹ. The final output of this matrix mechanism is then given by the alternative estimate of the linear queries in the workload, Wx̂, where x̂ := (Ã′Ã)⁻¹Ã′z̃. Note that the variance matrix of this output vector is given by Var(Wx̂) = W(Ã′Ã)⁻¹W′. Prior works have focused on using this variance matrix to find strategies that provide a low expected sum of squared errors, which is given by Trace(Var(Wx̂)) = Trace(W′W(Ã′Ã)⁻¹); see, for example, [11, 12]. We use an alternative approach to motivate the heuristic used to bypass geounits in Section 5, after introducing how we represent the spine using matrices in the next section.

In this section we will describe how we represent the workload and the strategy matrices that include the linear queries for each geounit on the spine. To do so, we will first consider the case in which the same query groups are used in all such geounits, and generalize this notation afterward to the case in which the query matrix depends on the geolevel. Let the workload of the linear queries for each geounit be denoted by W ∈ R^{m×n}, where W = stack({Q[i]}_{i=1}^q) and each Q[i] is a query group matrix. We will also assign an integer to each block geounit for the purpose of ordering the blocks. Specifically, we will suppose that the blocks are ordered lexicographically, so that blocks in the same state are adjacent to one another; within each state, blocks in the same county are adjacent; and so on. For example, in the case of the conventional geographic spine, this can be achieved by sorting the blocks by their 15-digit Census GEOID, as the format of the GEOID is [2-digit state FIPS code][3-digit county FIPS code][6-digit Census tract code][4-digit Census block code].
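The 15-digit GEOID layout above, and the fact that plain string sorting yields the lexicographic block order, can be sketched as follows (the GEOID values are made up for illustration; the helper name `parse_geoid` is hypothetical):

```python
def parse_geoid(geoid15):
    """Split a 15-digit Census block GEOID into its spine components:
    [2-digit state][3-digit county][6-digit tract][4-digit block]."""
    assert len(geoid15) == 15 and geoid15.isdigit()
    return {
        "state":  geoid15[0:2],
        "county": geoid15[2:5],
        "tract":  geoid15[5:11],
        "block":  geoid15[11:15],
    }

# Sorting GEOID strings gives the lexicographic order described in the
# text: blocks sharing a state are adjacent, within that blocks sharing
# a county are adjacent, and so on down to the block code.
blocks = ["360610001001000", "010010201001000", "360610001001001"]
ordered = sorted(blocks)
```

Because the fields are fixed-width and most-significant-first, no numeric parsing is needed for the ordering itself.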
We will also periodically refer to geounit u ∈ {1, . . . , U[l]} in geolevel l ∈ {1, . . . , L} as geounit (l, u). We will also let b[l, u] denote the number of block-level descendants of geounit (l, u). For example, since all blocks are descendants of the US geounit, there are a total of b[1, 1] blocks. Let x ∈ Z₊^{n·b[1,1]} be a vector of histogram cell counts over all blocks in the US, ordered in the manner described in the preceding paragraph, which we will refer to as the true histogram cell counts below. Using this notation, we can express the query answers to the query matrix W for the US geounit in terms of these block-level cell counts as (1′_{b[1,1]} ⊗ W)x. Likewise, the query answers for all of the block geounits are given by (I_{b[1,1]} ⊗ W)x. More generally, this notation allows us to express the workload over all geolevels as sums of block-level cell counts. Specifically, a possible workload is B ⊗ W (see footnote 3), where B := stack({B[l]}_{l=1}^L), each B[l] ∈ {0, 1}^{U[l]×b[1,1]} has rows indicating which blocks descend from each geounit of geolevel l, and this workload has M := m ∑_l U[l] rows and N := n·b[1, 1] columns. As a summary of this notation, an example of B ⊗ W is shown in Figure 1. While we will use this definition for the workload over all geolevels when describing a heuristic used in the spine optimization routines below, it is not actually quite general enough to capture all possible workloads that a user may specify in the TDA.

Footnote 3: All of the results and discussion in this paper are also straightforward to extend to the more general case in which the workloads and strategy matrices are not required to be vertically stacked query group matrices. The most obvious generalization would be to cases in which one also assigned a weight to each query in the workload matrix. Allowing for this generalization, or removing the requirement that these matrices are vertically stacked query groups entirely, is also straightforward, and this would not alter the theorem and observation provided in Section 5.
This is because the TDA also allows users to specify distinct strategy matrices in each geolevel, whereas the notation above assumes that the strategy matrix in each geolevel is W. For this reason, we will also use an alternative definition for the strategy matrix that is general enough to encompass all possible workloads used by TDA in our results that establish privacy guarantees. To do so, let W[l] denote the workload in geolevel l. Using this notation, the workload used within the TDA can be defined as stack({B[l] ⊗ W[l]}_{l=1}^L). The following theorem uses the notation established so far to show that the TDA is either ε-DP or ρ-zCDP, depending on whether a pure or approximate DP primitive mechanism is used, when either the conventional spine or the AIAN spine is used. The fact that geounits can be bypassed in the case of the optimized spine requires some modifications to this proof, which are described in Section 5. Note that, as in the proof below, the TDA takes the positive values in the set {β[l]}_{l=1}^L as input, which define either the PLB proportions, in the case of pure DP, or the precision proportions, in the case of zCDP, for each geolevel. Note also that, in the case of the conventional spine and the AIAN spine, there is no distinction between the workload matrix and the strategy matrix, as the TDA does not currently alter the per-geounit workloads {W[l]}_l or the matrix representation of the spine B in these cases prior to calling the primitive mechanisms.

Figure 1 depicts an example of B ⊗ W, which corresponds to the case in which there is one US geounit with two block-geolevel child geounits. Each geounit contains a 2 × 2 histogram, and the linear queries in W are given by a total sum query (1′_2 ⊗ 1′_2), the marginal query groups for each attribute (I_2 ⊗ 1′_2 and 1′_2 ⊗ I_2), and the detailed cell query group (I_2 ⊗ I_2).
For each of the geounits, the total sum query is highlighted in green, the marginal query groups are highlighted in light and dark red, and the detailed cell query group is unhighlighted.

In order to describe the spine optimization routines, we will require a distinction between the workload and the strategy matrix. Specifically, the strategy will maintain the same per-geounit query groups {Q[i]}_{i=1}^q and the same per-geounit proportions {α[i, l]}_{i,l} as the initial inputs. However, the strategy matrix will alter the matrix representation of the spine as well as the PLB or precision proportions allocated to each geounit. We will denote the matrix representation of this alternative spine by A = stack({A[l]}_{l=1}^L), which we require to satisfy the following properties. First, we will constrain our attention to spines that include all blocks, so A must have the same number of columns as B, which is b[1, 1], and we require the lowest geolevel to be defined as these blocks, so A[L] := I_{b[1,1]}. Second, the spine must include a US geounit, so we have A[1] = 1′_{b[1,1]}. Third, the spine must include the same number of geolevels and be hierarchical, so, for each geolevel l ∈ {1, . . . , L − 1}, we require each row of A[l] to be equal to the sum of a subset of the rows of A[l + 1]. Since bypassing geounits will result in geounits within the same geolevel having different PLB or precision proportions, we will also use γ[l, u] to denote the geolevel PLB or precision proportion for each geounit (l, u), and γ to denote a vector composed of these values, using the same lexicographic ordering of the geounits described above. We will continue to use {β[l]}_{l=1}^L to denote the per-geolevel PLB or precision proportions of the initial spine, and use β to denote the vector stack({β[l] · 1_{U[l]}}_{l=1}^L).
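The three properties required of an alternative spine matrix A can be checked mechanically. A minimal validator sketch (the function name and the 0/1-matrix representation of each geolevel are assumptions for illustration):

```python
import numpy as np

def is_valid_spine(levels):
    """Check the three spine properties for levels = [A[1], ..., A[L]]:
    (1) top level is a single all-ones row (the US geounit),
    (2) bottom level is the identity (the blocks),
    (3) each row of A[l] is the sum of a subset of rows of A[l+1],
        with every child nesting inside exactly one parent."""
    nblocks = levels[0].shape[1]
    if not np.array_equal(levels[0], np.ones((1, nblocks))):
        return False
    if not np.array_equal(levels[-1], np.eye(nblocks)):
        return False
    for upper, lower in zip(levels, levels[1:]):
        recon = np.zeros_like(upper, dtype=float)
        for child in lower:
            # a child nests inside a parent iff the parent's 0/1 row
            # dominates the child's row elementwise
            parents = [i for i, p in enumerate(upper) if np.all(p >= child)]
            if len(parents) != 1:
                return False
            recon[parents[0]] += child
        if not np.array_equal(recon, upper):
            return False
    return True
```

A three-geolevel spine over four blocks with two "block groups" passes; breaking the hierarchy by letting the two block groups overlap on a block fails.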
Using the terminology from Section 2.4, we will introduce the rescaling vector r ∈ R₊^{m ∑_l U[l]} and the strategy matrix S, for later use in Section 5. In the case in which a pure ε-DP primitive Laplace mechanism is used, the elements of r corresponding to the queries of geounit (l, u) are equal to εγ[l, u]/2, and, in the case in which a ρ-zCDP primitive Gaussian mechanism is used, these elements are equal to √(ργ[l, u]); in both cases, the strategy matrix is S := diag(r) stack({A[l] ⊗ W}_{l=1}^L). Also, let the stacked DP primitive mechanism answers be defined as z̃ := Sx + ỹ, where ỹ := diag(r)y, y := stack({y[l]}_{l=1}^L), and {y[l]}_{l=1}^L is defined as in the previous theorem. Then the OLS estimate for this strategy matrix can be expressed as x̂ := (S′S)⁻¹S′z̃, and the variance matrix of the output of the matrix mechanism is given by Var((B ⊗ W)x̂) = (B ⊗ W)(S′S)⁻¹(B ⊗ W)′.

3.1. Summary. There are two stages of the spine optimization routines, and each is described in one of the next two sections. The first stage takes the AIAN spine as input and then applies a method that is described in Section 4. That section introduces the OSE distance (OSED), which, for each OSE, is defined as the number of on-spine geounits that must be added to or subtracted from one another in order to derive the OSE. We believe that defining geounits to minimize a function (such as the maximum or the mean) of the OSED values is a non-convex optimization problem, so Section 4 also provides a computationally efficient heuristic that reduces the OSED by redefining block groups and, when they are included on the spine, tract groups. The output of the first stage is then updated again using a method that is described in Section 5. Specifically, that section provides a result that describes when bypassing a geounit on the spine results in every expected squared error of the matrix mechanism either decreasing or remaining unchanged. Afterward, an algorithm is provided that uses this result to motivate a decision rule for when to bypass a geounit. The use of this decision rule within the TDA should also be regarded as a heuristic, for two reasons.
First, the output of the TDA is not defined using ordinary least squares, so this decision rule is not based on the actual variance matrix of the output of the TDA. Second, the TDA uses random noise drawn from the discrete distributions Laplace_Z(0, b) or N_Z(0, σ²), rather than their continuous counterparts, Laplace(0, b) and N(0, σ²), which are the distributions that motivate this decision rule. The TDA uses the discrete distributions in part because they provide lower variance, and this same fact implies that basing the decision rule on the discrete distributions instead would alter it. However, this difference is primarily relevant when the PLB or precision allocated to a particular query of a geolevel is very large. For example, even when σ² = 1, using N_Z(0, σ²) rather than N(0, σ²) provides a decrease in the variance of less than 3 × 10⁻⁷, and when σ² = 2, this decrease in variance is less than 3 × 10⁻¹⁵ (see footnote 4). For this reason, in most practical cases, this second departure from the TDA is unlikely to be as important as the first.

As described in Section 2.4, the norm in the literature on the matrix mechanism is to minimize the expected sum of squared errors [11, 12]. In our independent experiments, this alternative decision rule resulted in many more geounits being bypassed, which resulted in a low expected sum of squared errors objective function, but with high values for some individual expected squared errors. This problem persisted even after accounting for the disproportionate number of terms in this sum from the block-level geounits by assigning each geolevel a PLB or precision proportion that was proportional to the inverse of the number of geounits in the geolevel.
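The claim that the discrete Gaussian's variance falls only negligibly below σ² can be checked numerically by normalizing the pmf over a wide integer support (a direct-summation sketch; the helper name is hypothetical):

```python
import math

def discrete_gaussian_variance(sigma2, support=60):
    """Numerically compute Var[N_Z(0, sigma^2)] by direct normalization
    over the integers -support, ..., support (tails are negligible for
    the sigma^2 values considered here)."""
    ks = range(-support, support + 1)
    w = [math.exp(-k * k / (2.0 * sigma2)) for k in ks]
    z = sum(w)
    return sum(k * k * wk for k, wk in zip(ks, w)) / z

# The discrete Gaussian has slightly lower variance than its continuous
# counterpart, and the gap shrinks extremely fast as sigma^2 grows.
gap1 = 1.0 - discrete_gaussian_variance(1.0)   # sigma^2 = 1
gap2 = 2.0 - discrete_gaussian_variance(2.0)   # sigma^2 = 2
```

For σ² = 1 the computed gap is positive and below 3 × 10⁻⁷, matching the figure quoted in the text; for σ² = 2 the gap is already at the edge of double-precision resolution.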
In contrast, the approach that we use is more conservative, in the sense that the input spine is only changed when doing so would either decrease or leave unchanged the expected squared error of each query of the OLS estimate. This objective function aligns more closely with our goal when setting geolevel PLB or precision proportions, which is to ensure that the final query estimates for each geolevel satisfy a certain accuracy criterion. The only paper that we are aware of that takes a similar approach is [14]. This section describes how to redefine block groups and tract groups to bring OSEs closer to the spine. This is done using a heuristic based on an objective function called the off-spine entity distance (OSED), which returns the minimum number of geounits that must be added to or subtracted from one another to derive an OSE. This heuristic generally results in a reduction in the variance of estimates for these geographic regions, particularly when sufficient PLB or precision proportion is allocated to the geolevels that are being redefined. Before moving on to a systematic way in which the OSEDs can be found for each OSE, we will define some additional notation. Specifically, let the set of OSEs be denoted by K, an arbitrary encoding of the geographic spine by g, and the set of geounits that are children of geounit u of geolevel l by J[l, u]. Also, let {C_k(·)}_{k∈K}, where C_k : N₊ → {0, 1}, denote a set of functions such that C_k(u) = 1 when the OSE k ∈ K contains the block geounit u and C_k(u) = 0 otherwise. We will begin by choosing k ∈ K and finding the OSEDs for both k and its complement, which we will denote by k′, under the assumption that the only geolevel is the block geolevel. In this case, each block would contribute c[k, L, u] := C_k(u) to the OSED of OSE k. Likewise, each block would contribute c[k′, L, u] := 1 − C_k(u) to the OSED of the complement of k.
If the block-group geolevel is added at this point, we could compose the intersection of a single block group and k in one of two ways. First, we could add to one another all of the blocks that are both inside of entity k and inside of the block group. This would result in the block group u contributing Σ_{v ∈ J[L−1, u]} c[k, L, v] to the OSED of k. On the other hand, we could also take all of the geographic extent that the block group occupies and then subtract off the geographic region of the blocks in the complement of k. This would result in this block group contributing 1 + Σ_{v ∈ J[L−1, u]} c[k′, L, v] to the OSED of k. Note that an additional one is added in this case because of the additional step of subtracting the complement of k in the block group from the block group itself. Since the OSED is defined as the minimum number of geounits that must be added to or subtracted from one another to define an entity, we will choose the option that results in the smaller value. Since similar derivations can be carried out for the complement of k, by symmetry, we have the recursions

c[k, l, u] = min( Σ_{v ∈ J[l, u]} c[k, l + 1, v], 1 + Σ_{v ∈ J[l, u]} c[k′, l + 1, v] ),
c[k′, l, u] = min( Σ_{v ∈ J[l, u]} c[k′, l + 1, v], 1 + Σ_{v ∈ J[l, u]} c[k, l + 1, v] ).

Since all entities are assumed to be contained within the US, the final OSED for entity k can be found by applying these recursions up to the root geounit and defining this OSED as c[k, 1, 1]. (Our use of the term "off-spine entities" refers to geographic entities that may be off of the spine, but we do not assume that each OSE has a geographic extent that differs from each on-spine geounit. An implication of this definition of OSED is that, when the geographic extent of an OSE is identical to that of a geounit on the spine, its OSED is equal to one.) Afterward, our final objective function is defined by a reduce operation, such as a function that returns the mean or maximum OSED. This is summarized in Algorithm 1.
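The recursion above can be sketched as a small dynamic program over a toy spine. This is an illustrative sketch with hypothetical geounit names and data layout, not the TDA's code: each node carries its children, leaves are blocks, and `contains` records membership of each block in the OSE k.

```python
# Sketch of the OSED recursion on a toy spine (all names hypothetical).

def osed(node, contains):
    """Return (c_k, c_comp): minimum counts of on-spine geounits that must be
    added/subtracted to compose k (resp. its complement) within `node`."""
    if not node["children"]:                  # block geolevel: base case
        inside = contains[node["name"]]
        return (1 if inside else 0, 0 if inside else 1)
    child_vals = [osed(c, contains) for c in node["children"]]
    sum_k = sum(v[0] for v in child_vals)
    sum_comp = sum(v[1] for v in child_vals)
    # either assemble k from the children, or take the whole geounit and
    # subtract the complement (the extra "+ 1" is that subtraction step)
    c_k = min(sum_k, 1 + sum_comp)
    c_comp = min(sum_comp, 1 + sum_k)
    return (c_k, c_comp)

# Toy example: a "tract" whose two "block groups" each hold two blocks; the
# OSE k covers three of the four blocks.
spine = {"name": "tract", "children": [
    {"name": "bg1", "children": [
        {"name": "b1", "children": []}, {"name": "b2", "children": []}]},
    {"name": "bg2", "children": [
        {"name": "b3", "children": []}, {"name": "b4", "children": []}]}]}
contains = {"b1": True, "b2": True, "b3": True, "b4": False}
print(osed(spine, contains))  # -> (2, 1): k is "tract minus b4", two geounits
```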
Algorithm 1: OSEDs Reduced(g, K, {C_k(·)}_k, h(·), L)

In the most general setting, in which the OSEs are not necessarily disjoint, formulating a polynomial-time algorithm that redefines block groups and tract groups in order to minimize the OSEDs appears to be a difficult problem because of the similarity of this optimization problem to a covering problem. For this reason, Algorithm 2 describes a greedy approach that approximates the optimal redefinition of tract groups in a computationally tractable manner. In contrast, block groups are redefined by combining blocks within a given tract that fall within the same intersection of the OSEs. As described in the code below, each redefined block group is composed of up to ⌈√n⌉ + fanout_cutoff blocks, where n is the number of blocks in the redefined block group's parent tract and fanout_cutoff is a user-chosen parameter. This choice is motivated by the fact that the smallest maximum fanout value among a given tract and its child geounits is ⌈√n⌉, where ⌈·⌉ denotes the ceiling function. For example, for a tract with 100 block geounit descendants, redefining its children as 10 redefined block-group child geounits, each with 10 block child geounits, would result in the lowest possible maximum fanout of 10 among these 11 geounits, so the redefined block groups within this tract would contain no more than 10 + fanout_cutoff blocks in this case. A similar approach is used to define the maximum number of tracts within a tract group in Algorithm 2. This algorithm also uses x ≤_lexicographic y, where x, y ∈ Rⁿ, to denote the lexicographic less-than-or-equal-to partial ordering, which is defined as ⊥ when x ≠ y and the first index i for which x_i and y_i differ satisfies x_i > y_i, and as ⊤ otherwise. Note that Algorithm 2 does not alter the PLB or precision proportions of the input spine, which is the AIAN spine.
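The ⌈√n⌉ claim above is easy to verify by brute force: splitting a tract's n blocks into groups of at most s blocks gives the tract a fanout of ⌈n/s⌉ and each group a fanout of at most s, and the maximum of the two is minimized at s = ⌈√n⌉. A minimal sketch (function names are ours, for illustration only):

```python
import math

def max_fanout(n, group_size):
    """Max fanout when a tract's n blocks are split into block groups of at
    most `group_size` blocks: the tract's fanout is the number of groups, and
    each group's fanout is at most group_size."""
    n_groups = math.ceil(n / group_size)
    return max(n_groups, group_size)

n = 100
best = min(range(1, n + 1), key=lambda s: max_fanout(n, s))
print(best, max_fanout(n, best))  # -> 10 10, i.e. ceil(sqrt(100)) = 10
```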
The next section describes the algorithm used to update these proportions, along with the spine that is output from Algorithm 2, in the second spine optimization stage. This section uses a matrix mechanism to provide a decision rule for when it is best to bypass a geounit. To do so, we will restrict our attention to DP primitive mechanisms that use noise drawn from the continuous distributions N(·) and Laplace(·). We will also suppose that the matrix B is the matrix representation of the spine that is used as the input of this method. In the case of the TDA, this spine is formed using the output of Algorithm 2. It is also worth explicitly stating how we define the operation of bypassing a parent geounit. We define the operation of bypassing a parent geounit with c children, each with equal PLB or precision proportions, by: 1) creating c geounits in the geolevel of the parent, each with a geolevel proportion given by the sum of the proportion allocated to the parent and that of one of the children; 2) defining the single child of each of these c new geounits as one of the children; 3) removing the old parent geounit; and 4) redefining the geolevel proportion of each child to be zero. Thus, even though we call this operation "bypassing a parent," it actually moves the geolevel PLB or precision proportion to a higher geolevel. This ensures that the operation does not change the total number of geolevels, and, since the TDA fixes the final estimates in a top-down manner so that the consistency-with-parent constraints are satisfied, it also ensures that the entire share of the geolevel PLB or precision allocations is used in cases in which a parent geounit has only one child geounit.
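The four-step bypass operation defined above can be sketched on a toy tree. This is an illustrative sketch under an assumed data layout (a geounit is a dict with a PLB proportion and a child list), not the TDA's representation:

```python
# Sketch of the bypass operation on a toy spine (data layout is hypothetical).

def bypass(parent):
    """Bypass `parent`, returning the geounits that replace it in its geolevel.
    Steps 1-4 from the text: each child is promoted behind a new geounit in the
    parent's geolevel carrying plb_parent + plb_child (steps 1-2), the old
    parent disappears (step 3), and the child's own proportion becomes zero
    (step 4)."""
    promoted = []
    for child in parent["children"]:
        new_unit = {"plb": parent["plb"] + child["plb"], "children": [child]}
        child["plb"] = 0.0
        promoted.append(new_unit)
    return promoted

parent = {"plb": 0.2, "children": [{"plb": 0.3, "children": []},
                                   {"plb": 0.3, "children": []}]}
for unit in bypass(parent):
    print(unit["plb"], [c["plb"] for c in unit["children"]])
# each promoted geounit gets 0.5; its single child keeps 0.0
```

Note that the number of geolevels is unchanged: each promoted geounit still has exactly one child one level down, matching the text's observation that the operation moves the proportion to a higher geolevel rather than shortening the hierarchy.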
Algorithm 2: Redefine Block Groups and Tract Groups(g, K, {C_k(·)}_k)
    // Within each tract, redefine block groups by combining groups of up to ⌈√n⌉ + fanout_cutoff blocks in the intersections of the same OSEs, where n is the number of blocks in the tract.
    // Initialize tract groups so that they each have one child tract.
    current_OSEDs ← OSEDs Reduced(g, K, {C_k(·)}_k, Sort Descending(·))
    for i ∈ {1, . . . } do
        g_changed ← ⊥
        for county ∈ Counties(g) do
            n ← Number of tracts in county(county)
            for u, v ∈ Children(county) do
                if Card(Children(u) ∪ Children(v)) > ⌈√n⌉ + fanout_cutoff then continue
                g′ ← Combine Siblings(u, v, g)
                test_OSEDs ← OSEDs Reduced(g′, K, {C_k(·)}_k, Sort Descending(·))
                if test_OSEDs ≤_lexicographic current_OSEDs then
                    g ← g′
                    current_OSEDs ← test_OSEDs
                    g_changed ← ⊤
        if not g_changed then break
    return g

While this is the definition of this operation used within the TDA, the decision rules developed in this section only depend on properties of SᵀS, and using this definition of the bypass operation impacts this matrix in the same way as simply reallocating the geolevel PLB or precision proportion of the parent geounit to the children. For this reason, we motivate the decision rules described in this section using this simpler definition of the bypass operation. Using the notation from Section 3, recall that the expected squared errors of the output of the matrix mechanism with the strategy matrix S are given by the diagonal of Var(Vx̂), as defined in (4). The next result describes a case in which each of these expected squared errors can be decreased, or remain unchanged, by bypassing a parent. In cases in which the initial PLB proportions of a parent and its children are equal, this result simply implies that accuracy can be improved by bypassing the parent geounit when it has three or fewer child geounits. The decision rule used for the case in which ρ-zCDP primitive Gaussian mechanisms are used will be discussed after the proof. Proof.
Let {γ₀[l, u]}_{l,u} and {γ₁[l, u]}_{l,u} denote the geounit PLB proportions before and after reallocation, respectively. Likewise, let x̂₀ and x̂₁ denote the OLS estimates before and after reallocation, respectively. Also, for symmetric matrices C, D ∈ Rⁿˣⁿ we will use C ≤ D to denote the condition that C − D is negative semidefinite and C ≤ 0 to denote the condition that C is negative semidefinite. The variance matrix of the OLS estimate before (respectively, after) bypassing geounit (l, 1) is proportional to the same expression with i = 0 (respectively, i = 1). We will prove the sufficient condition that Var(x̂₁) ≤ Var(x̂₀). Given the variance matrix of the OLS estimator above, and since we assume that WᵀW is positive definite, this condition holds if and only if the corresponding difference matrix is positive semidefinite. Note that the only elements of γ₁ that differ from γ₀ can be defined in terms of γ₀. The resulting matrix has only c unique columns, and, since γ₀[l, u] ≥ 0 for all l, u, the eigenvector corresponding to the smallest eigenvalue is 1_r, so this matrix is positive semidefinite under the stated condition. Since (l, 1) was not bypassed in our initial PLB allocation, γ₀[l, 1] ≠ 0, so we obtain the condition in the statement of the theorem. Remark 1. Consider two cases in which the only query in W is a total-population query. First, if the parent is not bypassed, we could construct unbiased estimates of the total population of the parent either by observing the DP answer for the parent directly, which has a variance of 2/γ[l, 1]², or by summing the DP answers of the children, which has a variance of 2c/γ[l + 1, 1]². The mean with inverse-variance weighting provides the linear combination of these estimates with the lowest possible variance, namely 2c/(cγ[l, 1]² + γ[l + 1, 1]²) in this case. Second, if we did bypass the parent, we could estimate the total-population query for the parent by summing the total populations of the children, which would have a variance of 2c/(γ[l + 1, 1] + γ[l, 1])² in this case.
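The arithmetic in Remark 1 can be checked numerically. The sketch below uses toy PLB proportions of our own choosing and confirms both the inverse-variance-weighting formula for the not-bypassed case and, for proportions satisfying the theorem's condition, that the bypassed estimator is at least as accurate:

```python
# Numeric check of the two estimators in Remark 1 (toy PLB proportions).
gamma_p, gamma_c, c = 0.2, 0.4, 4   # parent/child proportions, c children

var_parent = 2.0 / gamma_p**2               # direct DP answer for the parent
var_sum_children = 2.0 * c / gamma_c**2     # sum of the c child answers

# inverse-variance weighting of the two unbiased estimates
var_combined = 1.0 / (1.0 / var_parent + 1.0 / var_sum_children)
assert abs(var_combined - 2.0 * c / (c * gamma_p**2 + gamma_c**2)) < 1e-12

# after bypassing: each child carries gamma_c + gamma_p
var_bypassed = 2.0 * c / (gamma_c + gamma_p)**2

# here gamma_c = 0.4 >= (c - 1) * gamma_p / 2 = 0.3, so bypassing helps
print(var_combined, var_bypassed)
```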
The theorem above can be viewed as a statement that limiting our attention in this way, to a single total-population query and to only the parent and its children, is without loss of generality, at least for the purpose of finding the cases in which bypassing the parent does not increase the expected error of any query in any geounit. This is because 2c/(cγ[l, 1]² + γ[l + 1, 1]²) ≥ 2c/(γ[l + 1, 1] + γ[l, 1])² if and only if γ[l + 1, 1] ≥ (c − 1)γ[l, 1]/2, which is the same requirement given in the statement of the theorem. Remark 2. This theorem may be of independent interest in the DP literature because other authors have already considered strategy matrices with a hierarchical structure analogous to the matrix representation of the spine A above, and this result can be used to narrow the search space of hierarchical strategy matrices considered when choosing a strategy matrix with this property [9, 11]. For example, when using the OLS approach described by [9] with PLB proportions that are the same for each level of the hierarchy and with the children of each non-leaf node defined by either two or three sub-intervals of the range of the parent node, this result implies that the expected squared error of an arbitrary linear query can be reduced by increasing the number of sub-intervals used to define these child nodes. One can use a similar technique to show that, in the case in which ρ-zCDP primitive Gaussian mechanisms are used to define y, bypassing a parent will increase at least one diagonal element of Var(Vx̂) whenever the parent has two or more children, and the diagonal elements of Var(Vx̂) will remain unchanged whenever the parent has only one child. In part for this reason, our decision rule for the case in which ρ-zCDP primitive Gaussian mechanisms are used is simply to bypass parent geounits with only one child.
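The "if and only if" equivalence stated above can be verified by brute force over a grid of proportions. The sketch below uses exact rational arithmetic (via the standard library's `fractions`) so that boundary cases like γ[l + 1, 1] = (c − 1)γ[l, 1]/2 are not spoiled by floating-point rounding; the grid itself is an arbitrary choice for illustration.

```python
from fractions import Fraction as F

def bypass_helps(gp, gc, c):
    """True when 2c/(c*gp^2 + gc^2) >= 2c/(gc + gp)^2, i.e. the bypassed
    estimator's variance is no larger than the inverse-variance-weighted one."""
    return F(2 * c) / (c * gp**2 + gc**2) >= F(2 * c) / (gc + gp)**2

# Check the claimed equivalence  bypass_helps  <=>  gc >= (c - 1) * gp / 2
for c in range(1, 6):
    for i in range(1, 11):
        for j in range(0, 21):
            gp, gc = F(i, 10), F(j, 10)
            assert bypass_helps(gp, gc, c) == (gc >= F(c - 1) * gp / 2)
print("equivalence holds on the grid")
```

Note that for c = 1 the threshold is 0, so bypassing a single-child parent never hurts, matching the Laplace-case decision rule discussed later.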
However, given the top-down manner in which the TDA fixes estimates to ensure consistency with the estimates of the parent geounits, the estimates of a child geounit are fixed by those of the parent geounit whenever the parent has only one child geounit. Thus, we need not rely solely on the change in the diagonal elements of Var(Vx̂) to motivate the decision rule for this case; an alternative motivation is provided in the following observation. Note that the decision rule provided in Theorem 2 results in at least as many geounits being bypassed, since, whenever c = 1, the condition in the statement of the theorem becomes γ[l + 1, 1] ≥ 0, and we require each γ[l, u] to be non-negative. Observation 1. The DP primitive mechanism answers for each child geounit that does not have a sibling geounit are not used. Thus, the variance of the DP primitive mechanism outputs that are actually used to construct the final estimates in the TDA can be decreased by bypassing all parent geounits with only one child. Algorithm 3 summarizes how both of these decision rules are used within the spine optimization routines of the TDA. Note that this algorithm starts at the block-group geolevel rather than the US geolevel, which ensures that no remaining geounits can be bypassed after only one pass over the spine. Note also that using the TDA with the spine that is output from Algorithm 2 does not alter the privacy guarantees of the TDA, because the same arguments used in the proof of Theorem 1 apply in this case as well. However, the same cannot be said for the spine that is output from Algorithm 3 when at least one geounit is bypassed. The following theorem generalizes the argument used in Theorem 1 to show that using the spine output from Algorithm 3 within the TDA does not impact the privacy guarantees.
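A single bottom-up pass applying the two decision rules can be sketched as follows. This is an illustrative sketch in the spirit of Algorithm 3, not its actual code: the tree layout is hypothetical, child proportions are assumed equal within a family, and (matching the simpler definition motivated earlier) a bypassed parent's proportion is folded directly into its children.

```python
# Sketch of a one-pass, bottom-up bypass routine (hypothetical data layout).
# Laplace case: bypass a parent with c children when
#   gamma_child >= (c - 1) * gamma_parent / 2;
# Gaussian case: bypass only single-child parents.

def should_bypass(parent, mechanism):
    c = len(parent["children"])
    if c == 0:
        return False
    if mechanism == "gaussian":
        return c == 1
    g_child = parent["children"][0]["plb"]   # assume equal child proportions
    return g_child >= (c - 1) * parent["plb"] / 2

def optimize(node, mechanism):
    """Recurse from the leaves up; folding a bypassed parent's proportion into
    its children means one pass over the spine suffices."""
    for child in node["children"]:
        optimize(child, mechanism)
    kept = []
    for child in node["children"]:
        if should_bypass(child, mechanism):
            for grandchild in child["children"]:
                grandchild["plb"] += child["plb"]
                kept.append(grandchild)
        else:
            kept.append(child)
    node["children"] = kept
    return node

root = {"plb": 0.2, "children": [
    {"plb": 0.2, "children": [{"plb": 0.6, "children": []}]},  # single child
    {"plb": 0.2, "children": [{"plb": 0.1, "children": []},
                              {"plb": 0.1, "children": []}]}]}
optimize(root, "gaussian")
# the single-child parent is bypassed; its child is promoted with the combined
# proportion, while the two-child parent is kept under the Gaussian rule
print([round(c["plb"], 6) for c in root["children"]])
```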
In the case of the first result of the theorem, the noise for this mechanism is distributed as either ŷ ∼ Laplace(0, diag(b̃)) or ŷ ∼ Laplace_Z(0, diag(b̃)), where b̃ := stack({b[l]}_l) and b[l] := 2 ⊘ (γ[l] ⊗ α[l]), with ⊘ denoting elementwise division. Thus, we will show the following condition, which, by the first result in Lemma 3, is a sufficient condition. Starting from the left-hand side of this inequality, we obtain inequality (9). Similar logic also implies the second result of the theorem. Specifically, in this case we can use the second result of Lemma 3, and the fact that for each x ∈ X^d ∩ G and x′ ∈ N(x) we have (D_{i,·}(x − x′))² = |D_{i,·}(x − x′)|, to derive the sufficient condition. Since this last value is equal to the value of (10) multiplied by 2ρ/ε, and the logic above implies the value of (10) is equal to ε, the result follows.

This document describes both stages of the spine optimization routines used in the TDA. The first stage brings OSEs closer to the spine by redefining block groups and tract groups, and the second stage bypasses geounits with low fanouts using a decision rule that is motivated by the expected squared errors of the ordinary least squares estimator. This final section also provides a result showing that spine optimization does not alter the privacy guarantees of the TDA.

References
- Census TopDown: Differentially private data, incremental schemas, and consistency with public knowledge
- Concentrated differential privacy: Simplifications, extensions, and lower bounds
- The discrete Gaussian for differential privacy
- Our data, ourselves: Privacy via distributed noise generation
- Calibrating noise to sensitivity in private data analysis
- Universally utility-maximizing privacy mechanisms
- Boosting the accuracy of differentially private histograms through consistency
- No free lunch in data privacy
- Optimizing linear counting queries under differential privacy
- Optimizing error of high-dimensional statistical queries under differential privacy
- A statistical framework for differential privacy
- Optimizing fitness-for-use of differentially private linear queries