Beautiful differentiation

Conal Elliott, LambdaPix
1 September, 2009, ICFP

Differentiation

Derivatives have many uses. For instance,
- optimization
- root-finding
- surface normals
- curve and surface tessellation

There are three common differentiation techniques.
- Numeric
- Symbolic
- “Automatic” (forward & reverse modes)

What’s a derivative?

For scalar domain:

    d :: Scalar s ⇒ (s → s) → (s → s)
    d f x = lim_{ε→0} (f (x + ε) − f x) / ε

What about non-scalar domains? Return to this question later.

Aside: We can treat functions like numbers.

    instance Num β ⇒ Num (α → β) where
      u + v = λx → u x + v x
      u ∗ v = λx → u x ∗ v x
      ...

    instance Floating β ⇒ Floating (α → β) where
      sin u = λx → sin (u x)
      cos u = λx → cos (u x)
      ...

We can treat applicatives like numbers.

    instance Num β ⇒ Num (α → β) where
      (+) = liftA2 (+)
      (∗) = liftA2 (∗)
      ...

    instance Floating β ⇒ Floating (α → β) where
      sin = fmap sin
      cos = fmap cos
      ...

What is automatic differentiation?
- Computes function & derivative values in tandem
- “Exact” method
- Numeric, not symbolic

Scalar, first-order AD

Overload functions to work on function/derivative value pairs:

    data D α = D α α

For instance,

    D a a′ + D b b′ = D (a + b) (a′ + b′)
    D a a′ ∗ D b b′ = D (a ∗ b) (b′ ∗ a + a′ ∗ b)
    sin (D a a′)    = D (sin a) (a′ ∗ cos a)
    sqrt (D a a′)   = D (sqrt a) (a′ / (2 ∗ sqrt a))
    ...

Are these definitions correct?

What is automatic differentiation – really?

- What does AD mean?
- How does a correct implementation arise?
- Where else might these answers take us?

What does AD mean?

    data D α = D α α

    toD :: (α → α) → (α → D α)
    toD f = λx → D (f x) (d f x)

Spec: toD combinations correspond to function combinations, e.g.,

    toD u + toD v ≡ toD (u + v)
    toD u ∗ toD v ≡ toD (u ∗ v)
    recip (toD u) ≡ toD (recip u)
    sin (toD u)   ≡ toD (sin u)
    cos (toD u)   ≡ toD (cos u)

I.e., toD preserves structure.
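The value/derivative pairs of scalar, first-order AD can be sketched as runnable ASCII Haskell. This is only a sketch under stated assumptions: the helper `diff` (which seeds the derivative slot with 1) is a hypothetical name that does not appear in the slides, and only a few `Floating` methods are filled in.

```haskell
-- A value paired with its first derivative.
data D a = D a a deriving (Eq, Show)

instance Num a => Num (D a) where
  fromInteger n     = D (fromInteger n) 0          -- constants have derivative 0
  D a a' + D b b'   = D (a + b) (a' + b')
  D a a' * D b b'   = D (a * b) (b' * a + a' * b)  -- product rule
  negate (D a a')   = D (negate a) (negate a')
  abs (D a a')      = D (abs a) (a' * signum a)
  signum (D a _)    = D (signum a) 0

instance Fractional a => Fractional (D a) where
  fromRational r    = D (fromRational r) 0
  recip (D a a')    = D (recip a) (negate a' / (a * a))

-- Only some methods are defined; the rest are left out of this sketch.
instance Floating a => Floating (D a) where
  pi                = D pi 0
  exp (D a a')      = D (exp a) (a' * exp a)
  log (D a a')      = D (log a) (a' / a)
  sqrt (D a a')     = D (sqrt a) (a' / (2 * sqrt a))
  sin (D a a')      = D (sin a) (a' * cos a)
  cos (D a a')      = D (cos a) (negate a' * sin a)

-- Hypothetical helper (not from the slides): evaluate f and its
-- derivative at x by feeding in the pair for the identity function.
diff :: Num a => (D a -> D a) -> a -> D a
diff f x = f (D x 1)
```

For example, `diff (\x -> x * x + 3 * x) 2` evaluates to `D 10 7`: the value 10 and the derivative 2·2 + 3 = 7.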
How does a correct implementation arise?

Goal: ∀u. sin (toD u) ≡ toD (sin u)

Simplify each side:

    sin (toD u) ≡ λx → sin (toD u x)
                ≡ λx → sin (D (u x) (d u x))

    toD (sin u) ≡ λx → D (sin u x) (d (sin u) x)
                ≡ λx → D ((sin ◦ u) x) ((d u ∗ cos u) x)
                ≡ λx → D (sin (u x)) (d u x ∗ cos (u x))

The middle step uses sin u = sin ◦ u (functions as numbers) and the scalar chain rule.

Sufficient:

    sin (D ux dux) = D (sin ux) (dux ∗ cos ux)

Where else might these answers take us?

In this talk
- Prettier definitions
- Higher-order derivatives
- Higher-dimensional functions
Prettier definitions

Digging deeper: the scalar chain rule

    d (g ◦ u) x ≡ d g (u x) ∗ d u x

For scalar domain & range. Variations for other dimensions.

Define and reuse:

    (g ⋈ dg) (D ux dux) = D (g ux) (dg ux ∗ dux)

For instance,

    sin  = sin  ⋈ cos
    cos  = cos  ⋈ λx → −sin x
    sqrt = sqrt ⋈ λx → recip (2 ∗ sqrt x)

Function overloadings make for prettier definitions.

    instance Floating α ⇒ Floating (D α) where
      exp  = exp  ⋈ exp
      log  = log  ⋈ recip
      sqrt = sqrt ⋈ recip (2 ∗ sqrt)
      sin  = sin  ⋈ cos
      cos  = cos  ⋈ −sin
      acos = acos ⋈ recip (−sqrt (1 − sqr))
      atan = atan ⋈ recip (1 + sqr)
      sinh = sinh ⋈ cosh
      cosh = cosh ⋈ sinh

    sqr x = x ∗ x

Higher-order derivatives

Scalar, higher-order AD

Generate infinite towers of derivatives (Karczmarczuk 1998):

    data D α = D α (D α)

Suffices to tweak the chain rule:

    (g ⋈ dg) (D ux0 dux)    = D (g ux0) (dg ux0 ∗ dux)  -- old
    (g ⋈ dg) ux@(D ux0 dux) = D (g ux0) (dg ux ∗ dux)   -- new

Most other definitions can then go through unchanged. The derivations adapt.
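The tower version of the chain-rule combinator can be sketched in ASCII Haskell as follows. Several things here are assumptions made for illustration: the operator is spelled `>-<`, the helpers `constD`, `idD` and `derivs` are hypothetical names, derivative functions are written as explicit lambdas rather than via functions-as-numbers, and only a handful of `Floating` methods are defined.

```haskell
-- A value together with the (lazy, infinite) tower of its derivatives.
data D a = D a (D a)

-- Constant: every derivative is zero.
constD :: Num a => a -> D a
constD x = D x zeros where zeros = D 0 zeros

-- Identity function at x: first derivative 1, the rest 0.
idD :: Num a => a -> D a
idD x = D x (constD 1)

-- Chain rule ("new" form): dg is applied to the whole tower ux,
-- so its result is itself a tower.
(>-<) :: Num a => (a -> a) -> (D a -> D a) -> D a -> D a
(g >-< dg) ux@(D ux0 dux) = D (g ux0) (dg ux * dux)

instance Num a => Num (D a) where
  fromInteger           = constD . fromInteger
  D a a' + D b b'       = D (a + b) (a' + b')
  u@(D a a') * v@(D b b') = D (a * b) (a' * v + u * b')  -- product rule on towers
  negate (D a a')       = D (negate a) (negate a')
  abs                   = abs >-< signum
  signum                = signum >-< const 0

instance Fractional a => Fractional (D a) where
  fromRational = constD . fromRational
  recip        = recip >-< \u -> negate (recip (u * u))

-- Remaining Floating methods are omitted from this sketch.
instance Floating a => Floating (D a) where
  pi   = constD pi
  exp  = exp  >-< exp
  log  = log  >-< recip
  sqrt = sqrt >-< \u -> recip (2 * sqrt u)
  sin  = sin  >-< cos
  cos  = cos  >-< \u -> negate (sin u)

-- Read off the value and successive derivatives as a lazy list.
derivs :: D a -> [a]
derivs (D x x') = x : derivs x'
```

With these definitions, `take 5 (derivs (let x = idD 2 in x * x * x))` gives `[8,12,12,6,0]`, the value of x³ at 2 followed by its first four derivatives. Laziness is what keeps the infinite tower usable: only the requested prefix is ever computed.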
Higher-dimensional functions

What’s a derivative – really?

For scalar domain:

    d f x = lim_{ε→0} (f (x + ε) − f x) / ε

Redefine: unique scalar s such that

    lim_{ε→0} (f (x + ε) − f x) / ε − s ≡ 0

Equivalently,

    lim_{ε→0} (f (x + ε) − f x − s·ε) / ε ≡ 0

or

    lim_{ε→0} (f (x + ε) − (f x + s·ε)) / ε ≡ 0

Now generalize: unique linear map T such that:

    lim_{ε→0} |f (x + ε) − (f x + T ε)| / |ε| ≡ 0

Derivatives are linear maps. Captures all “partial derivatives” for all dimensions.

See Calculus on Manifolds by Michael Spivak.
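The linear-map definition can be checked numerically on a small example. The sketch below is an illustration only: the function `f (x, y) = x * y`, its candidate derivative `dF`, and the helper names `norm2` and `residual` are all assumptions chosen for the example, not taken from the talk.

```haskell
-- Example function from R^2 to R.
f :: (Double, Double) -> Double
f (x, y) = x * y

-- Candidate derivative of f at (x, y): the linear map
-- T (dx, dy) = y * dx + x * dy.
dF :: (Double, Double) -> ((Double, Double) -> Double)
dF (x, y) (dx, dy) = y * dx + x * dy

-- Euclidean norm on R^2.
norm2 :: (Double, Double) -> Double
norm2 (a, b) = sqrt (a * a + b * b)

-- The quotient from the definition: |f (p + e) - (f p + T e)| / |e|.
residual :: (Double, Double) -> (Double, Double) -> Double
residual p@(x, y) e@(dx, dy) =
  abs (f (x + dx, y + dy) - (f p + dF p e)) / norm2 e
```

For this `f` the numerator is exactly `|dx * dy|`, so `residual p (h, h)` behaves like `h / sqrt 2` and shrinks toward 0 as `h` does, which is what the definition requires of T.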
The chain rules all unify into one.

Generalize from

    d (g ◦ u) x ≡ d g (u x) ∗ d u x

etc. to

    d (g ◦ u) x ≡ d g (u x) ◦ d u x

Generalized derivatives

Derivative values are linear maps: α ⊸ β.

    d :: (Vector s α, Vector s β) ⇒ (α → β) → (α → (α ⊸ β))

First-order AD:

    data α ▷ β = D β (α ⊸ β)

Higher-order AD:

    data α ▷∗ β = D β (α ▷∗ (α ⊸ β))
                ≈ β × (α ⊸ β) × (α ⊸ (α ⊸ β)) × ...

What’s a linear map?

Preserves linear combinations:

    h (s1·u1 + ... + sn·un) ≡ s1·h u1 + ... + sn·h un

Fully determined by behavior on basis of α, so

    type α ⊸ β = Basis α →ᴹ β

Memoized for efficiency. Vectors, matrices, etc re-emerge as memo-tries. Statically dimension-typed!
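To make “fully determined by behavior on basis” concrete, here is a minimal sketch for the special case of linear maps from pairs of `Double`s to `Double`. The type `Lin2` and the functions `toLin2` and `applyLin2` are hypothetical names; a two-field record stands in for the memo trie, which is an assumption of this example rather than the representation used in the talk.

```haskell
-- A linear map out of R^2, stored as its values on the two basis
-- vectors (1, 0) and (0, 1).
data Lin2 b = Lin2 b b

-- Tabulate a function on the basis. The table is only a faithful
-- representation when the function is linear.
toLin2 :: ((Double, Double) -> b) -> Lin2 b
toLin2 h = Lin2 (h (1, 0)) (h (0, 1))

-- Apply the stored map to s1·e1 + s2·e2 using linearity:
-- h (s1·e1 + s2·e2) = s1·h e1 + s2·h e2.
applyLin2 :: Lin2 Double -> (Double, Double) -> Double
applyLin2 (Lin2 h1 h2) (s1, s2) = s1 * h1 + s2 * h2
```

For a linear `h` such as `\(a, b) -> 2 * a - 3 * b`, `applyLin2 (toLin2 h)` agrees with `h` on every input, and the stored pair is just the familiar row vector of coefficients.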
What’s a basis?

    class Vector s v ⇒ HasBasis s v where
      type Basis v :: ∗
      coord      :: v → (Basis v → s)
      basisValue :: Basis v → v

    instance HasBasis Double Double where
      type Basis Double = ()
      coord s       = λ() → s
      basisValue () = 1

    instance (HasBasis s u, HasBasis s v) ⇒ HasBasis s (u, v) where
      type Basis (u, v) = Basis u `Either` Basis v
      coord (u, v)         = coord u `either` coord v
      basisValue (Left a)  = (basisValue a, 0)
      basisValue (Right b) = (0, basisValue b)

Automatic differentiation – naturally

Can we make AD even simpler?

Recall our function overloadings:

    instance Num β ⇒ Num (α → β) where
      (+) = liftA2 (+)
      (∗) = liftA2 (∗)
      ...

    instance Floating β ⇒ Floating (α → β) where
      sin = fmap sin
      cos = fmap cos
      ...

These definitions are standard for applicative functors. Could they work for D?

Could we simply define AD via the standard sin = fmap sin etc? What is fmap?
Require that toD_x be a natural transformation:

    fmap g ◦ toD_x ≡ toD_x ◦ fmap g

where

    toD_x u = D (u x) (d u x)

Define fmap from this naturality condition.

Derive AD naturally

    toD_x (fmap g u) ≡ toD_x (g ◦ u)
                     ≡ D ((g ◦ u) x) (d (g ◦ u) x)
                     ≡ D (g (u x)) (d g (u x) ◦ d u x)

    fmap g (toD_x u) ≡ fmap g (D (u x) (d u x))

Sufficient definition:

    fmap g (D ux dux) = D (g ux) (d g ux ◦ dux)

Similar derivation for liftA2 (for (+), (∗), etc).

Oops. d doesn’t have an implementation.

Solution A: Inline fmap for each fmap g and rewrite d g to known derivative.

Solution B: Generalize Functor to allow non-function arrows, and replace functions by differentiable functions.

Conclusions

- Specification as a structure-preserving semantic function.
- Implementation derived systematically from specification.
- Prettier implementation via functions-as-numbers.
- Infinite derivative towers with nearly no extra code.
- Generalize to differentiation over vector spaces.
- Even simpler specification/derivation via naturality.