Beautiful differentiation



Conal Elliott

LambdaPix

1 September, 2009 ICFP

Conal Elliott (LambdaPix) Beautiful differentiation 1 September, 2009 ICFP 1 / 32


Differentiation


Derivatives have many uses.

For instance,

- optimization
- root-finding
- surface normals
- curve and surface tessellation


There are three common differentiation techniques.

- Numeric
- Symbolic
- “Automatic” (forward and reverse modes)


What’s a derivative?

For scalar domain:

d :: Scalar s ⇒ (s → s) → (s → s)

d f x = \lim_{\varepsilon \to 0} \frac{f\,(x + \varepsilon) - f\,x}{\varepsilon}

What about non-scalar domains?
Return to this question later.
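
The limit itself cannot be computed directly, but fixing a small ε gives the numeric technique from the list of differentiation techniques. A minimal sketch, assuming Double scalars; the name `numericD` and the step size are illustrative choices, not from the talk:

```haskell
-- Forward-difference approximation of the limit above, for a fixed step eps.
-- `numericD` and `eps` are hypothetical names used only for this sketch.
numericD :: Double -> (Double -> Double) -> (Double -> Double)
numericD eps f x = (f (x + eps) - f x) / eps

-- Approximate derivative of sin at 0; the exact value is cos 0 = 1.
approxCosAtZero :: Double
approxCosAtZero = numericD 1e-5 sin 0
```

The result is only approximate, and its accuracy depends on the choice of step size, which is one motivation for the exact methods that follow.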


Aside: We can treat functions like numbers.

instance Num β ⇒ Num (α → β) where
  u + v = λx → u x + v x
  u ∗ v = λx → u x ∗ v x
  . . .

instance Floating β ⇒ Floating (α → β) where
  sin u = λx → sin (u x)
  cos u = λx → cos (u x)
  . . .


We can treat applicatives like numbers.

instance Num β ⇒ Num (α → β) where
  (+) = liftA2 (+)
  (∗) = liftA2 (∗)
  . . .

instance Floating β ⇒ Floating (α → β) where
  sin = fmap sin
  cos = fmap cos
  . . .
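
A compilable sketch of these instances in plain Haskell, with the elided methods filled in the same pointwise style. The `Fractional` instance and the example `pythagoras` are additions for illustration, not taken from the slides:

```haskell
import Control.Applicative (liftA2)

-- Pointwise arithmetic on functions, via the Applicative instance of ((->) a).
instance Num b => Num (a -> b) where
  (+)         = liftA2 (+)
  (*)         = liftA2 (*)
  (-)         = liftA2 (-)
  negate      = fmap negate
  abs         = fmap abs
  signum      = fmap signum
  fromInteger = pure . fromInteger

instance Fractional b => Fractional (a -> b) where
  (/)          = liftA2 (/)
  recip        = fmap recip
  fromRational = pure . fromRational

instance Floating b => Floating (a -> b) where
  pi    = pure pi
  exp   = fmap exp
  log   = fmap log
  sin   = fmap sin
  cos   = fmap cos
  asin  = fmap asin
  acos  = fmap acos
  atan  = fmap atan
  sinh  = fmap sinh
  cosh  = fmap cosh
  asinh = fmap asinh
  acosh = fmap acosh
  atanh = fmap atanh

-- Functions can now be combined like numbers:
-- pythagoras x is sin x * sin x + cos x * cos x.
pythagoras :: Double -> Double
pythagoras = sin * sin + cos * cos
```

With these instances in scope, a literal such as `1` can also be read as a constant function, so `(sin + 1) 0` evaluates `sin 0 + 1`.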


What is automatic differentiation?

- Computes function and derivative values in tandem
- “Exact” method
- Numeric, not symbolic


Scalar, first-order AD

Overload functions to work on function/derivative value pairs:

data D α = D α α

For instance,

D a a′ + D b b′ = D (a + b) (a′ + b′)
D a a′ ∗ D b b′ = D (a ∗ b) (b′ ∗ a + a′ ∗ b)
sin (D a a′) = D (sin a) (a′ ∗ cos a)
sqrt (D a a′) = D (sqrt a) (a′ / (2 ∗ sqrt a))

. . .

Are these definitions correct?
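
Before answering that question properly, the rules can at least be transcribed and tried on an example. The sketch below is a spot check, not a proof; the instances for `negate`, `abs`, `signum`, `recip`, `exp`, `log` and `cos`, and the helper `deriv`, are filled in here by the usual calculus rules and are not shown on the slide:

```haskell
-- A function value paired with a derivative value, as on the slide.
data D a = D a a deriving (Eq, Show)

instance Num a => Num (D a) where
  D a a' + D b b' = D (a + b) (a' + b')
  D a a' * D b b' = D (a * b) (b' * a + a' * b)
  negate (D a a') = D (negate a) (negate a')
  abs    (D a a') = D (abs a) (a' * signum a)
  signum (D a _)  = D (signum a) 0
  fromInteger n   = D (fromInteger n) 0

instance Fractional a => Fractional (D a) where
  recip (D a a') = D (recip a) (negate a' / (a * a))
  fromRational r = D (fromRational r) 0

instance Floating a => Floating (D a) where
  pi            = D pi 0
  exp  (D a a') = D (exp a) (a' * exp a)
  log  (D a a') = D (log a) (a' / a)
  sqrt (D a a') = D (sqrt a) (a' / (2 * sqrt a))
  sin  (D a a') = D (sin a) (a' * cos a)
  cos  (D a a') = D (cos a) (negate a' * sin a)
  -- Remaining Floating methods are omitted in this sketch (GHC warns).

-- Hypothetical helper: seed the input with derivative 1 and read off
-- the derivative component of the result.
deriv :: Num a => (D a -> D a) -> a -> a
deriv f x = let D _ fx' = f (D x 1) in fx'

-- Derivative of x * sin x at 2; by hand this is sin 2 + 2 * cos 2.
example :: Double
example = deriv (\x -> x * sin x) 2
```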


What is automatic differentiation — really?

- What does AD mean?
- How does a correct implementation arise?
- Where else might these answers take us?


What does AD mean?

data D α = D α α

toD :: (α → α) → (α → D α)
toD f = λx → D (f x) (d f x)

Spec: toD combinations correspond to function combinations, e.g.,

toD u + toD v ≡ toD (u + v)
toD u ∗ toD v ≡ toD (u ∗ v)
recip (toD u) ≡ toD (recip u)
sin (toD u)   ≡ toD (sin u)
cos (toD u)   ≡ toD (cos u)

I.e., toD preserves structure.


How does a correct implementation arise?

Goal: ∀u. sin (toD u) ≡ toD (sin u)

Simplify each side:

sin (toD u) ≡ λx → sin (toD u x)
            ≡ λx → sin (D (u x) (d u x))

toD (sin u) ≡ λx → D (sin u x) (d (sin u) x)
            ≡ λx → D ((sin ◦ u) x) ((d u ∗ cos u) x)
            ≡ λx → D (sin (u x)) (d u x ∗ cos (u x))

Sufficient:

sin (D ux dux) = D (sin ux) (dux ∗ cos ux)
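
The same recipe applies to the other operations. As a worked instance (reconstructed here using the product rule, not shown on the slide), the multiplication case of the spec goes through as follows:

```latex
\begin{aligned}
\mathit{toD}\,u * \mathit{toD}\,v
  &\equiv \lambda x \to \mathit{toD}\,u\,x * \mathit{toD}\,v\,x \\
  &\equiv \lambda x \to D\,(u\,x)\,(d\,u\,x) * D\,(v\,x)\,(d\,v\,x) \\[4pt]
\mathit{toD}\,(u * v)
  &\equiv \lambda x \to D\,((u * v)\,x)\,(d\,(u * v)\,x) \\
  &\equiv \lambda x \to D\,(u\,x * v\,x)\,((d\,u * v + u * d\,v)\,x) \\
  &\equiv \lambda x \to D\,(u\,x * v\,x)\,(d\,u\,x * v\,x + u\,x * d\,v\,x)
\end{aligned}
```

So it suffices to define `D a a′ ∗ D b b′ = D (a ∗ b) (a′ ∗ b + a ∗ b′)`, which agrees with the multiplication rule given earlier.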


Where else might these answers take us?

In this talk

- Prettier definitions
- Higher-order derivatives
- Higher-dimensional functions

Prettier definitions

Digging deeper — the scalar chain rule

d (g ◦ u) x ≡ d g (u x ) ∗ d u x

For scalar domain & range. Variations for other dimensions.

Define and reuse:

(g ⋈ dg) (D ux dux) = D (g ux) (dg ux ∗ dux)

For instance,

sin  = sin  ⋈ cos
cos  = cos  ⋈ λx → −sin x
sqrt = sqrt ⋈ λx → recip (2 ∗ sqrt x)
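
A runnable sketch of this combinator. The ASCII operator name `>-<` and the names `sinD`, `cosD` and `sqrtD` are stand-ins chosen for this example so that it does not clash with the Prelude; they are not the talk's notation:

```haskell
-- A function value paired with a derivative value.
data D a = D a a deriving (Eq, Show)

infix 0 >-<

-- Scalar chain rule, packaged once: given g and its derivative dg,
-- produce the corresponding function on value/derivative pairs.
(>-<) :: Num a => (a -> a) -> (a -> a) -> (D a -> D a)
(g >-< dg) (D ux dux) = D (g ux) (dg ux * dux)

-- Each lifted function is now a one-liner.
sinD, cosD, sqrtD :: Floating a => D a -> D a
sinD  = sin  >-< cos
cosD  = cos  >-< \x -> negate (sin x)
sqrtD = sqrt >-< \x -> recip (2 * sqrt x)
```

For example, `sqrtD (D 4 1)` pairs the square root of 4 with the derivative of sqrt at 4.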


Function overloadings make for prettier definitions.

instance Floating α ⇒ Floating (D α) where
  exp  = exp  ⋈ exp
  log  = log  ⋈ recip
  sqrt = sqrt ⋈ recip (2 ∗ sqrt)
  sin  = sin  ⋈ cos
  cos  = cos  ⋈ −sin
  acos = acos ⋈ recip (−sqrt (1 − sqr))
  atan = atan ⋈ recip (1 + sqr)
  sinh = sinh ⋈ cosh
  cosh = cosh ⋈ sinh

sqr x = x ∗ x

Higher-order derivatives

Scalar, higher-order AD

Generate infinite towers of derivatives (Karczmarczuk 1998):

data D α = D α (D α)

Suffices to tweak the chain rule:

(g ⋈ dg)    (D ux0 dux) = D (g ux0) (dg ux0 ∗ dux)  -- old
(g ⋈ dg) ux@(D ux0 dux) = D (g ux0) (dg ux  ∗ dux)  -- new

Most other definitions can then go through unchanged.
The derivations adapt.
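
A sketch of the tower version, relying on laziness to keep the infinite structure finite in practice. The helpers `constD`, `idD` and `derivs`, and the ASCII operator `>-<`, are names assumed for this example rather than taken from the talk:

```haskell
-- Lazy infinite tower: a value followed by the tower of its derivatives.
data D a = D a (D a)

-- A constant has an all-zero derivative tower.
constD :: Num a => a -> D a
constD x = D x (constD 0)

-- The variable itself: derivative 1, then zeros.
idD :: Num a => a -> D a
idD x = D x (constD 1)

infix 0 >-<

-- Chain rule in the "new" form: dg is applied to the whole tower ux,
-- so the derivative part is itself a tower.
(>-<) :: Num a => (a -> a) -> (D a -> D a) -> (D a -> D a)
(g >-< dg) ux@(D ux0 dux) = D (g ux0) (dg ux * dux)

instance Num a => Num (D a) where
  fromInteger             = constD . fromInteger
  D a a' + D b b'         = D (a + b) (a' + b')
  x@(D a a') * y@(D b b') = D (a * b) (a' * y + x * b')
  negate (D a a')         = D (negate a) (negate a')
  abs                     = abs    >-< signum
  signum                  = signum >-< const 0

instance Fractional a => Fractional (D a) where
  fromRational = constD . fromRational
  recip        = recip >-< \x -> negate (recip (x * x))

instance Floating a => Floating (D a) where
  pi   = constD pi
  exp  = exp  >-< exp
  log  = log  >-< recip
  sqrt = sqrt >-< \x -> recip (2 * sqrt x)
  sin  = sin  >-< cos
  cos  = cos  >-< negate . sin
  -- Remaining Floating methods are omitted in this sketch (GHC warns).

-- The first n entries of a tower: value, first derivative, second, ...
derivs :: Int -> D a -> [a]
derivs n (D x x')
  | n <= 0    = []
  | otherwise = x : derivs (n - 1) x'
```

For instance, `derivs 5 ((\x -> x * x * x) (idD 2))` lists the value of x³ at 2 followed by its first four derivatives there.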

Higher-dimensional functions

What’s a derivative – really?

For scalar domain:

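d f x = \lim_{\varepsilon \to 0} \frac{f\,(x + \varepsilon) - f\,x}{\varepsilon}

Redefine: unique scalar s such that

\lim_{\varepsilon \to 0} \left( \frac{f\,(x + \varepsilon) - f\,x}{\varepsilon} - s \right) \equiv 0

Equivalently,

\lim_{\varepsilon \to 0} \frac{f\,(x + \varepsilon) - f\,x - s \cdot \varepsilon}{\varepsilon} \equiv 0

or

\lim_{\varepsilon \to 0} \frac{f\,(x + \varepsilon) - (f\,x + s \cdot \varepsilon)}{\varepsilon} \equiv 0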


Now generalize: unique linear map T such that:

\lim_{\varepsilon \to 0} \frac{|f\,(x + \varepsilon) - (f\,x + T\,\varepsilon)|}{|\varepsilon|} \equiv 0

Derivatives are linear maps.

Captures all “partial derivatives” for all dimensions.
See Calculus on Manifolds by Michael Spivak.
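
As a small worked example (added here for illustration, not from the slides), take multiplication of two scalars. Its derivative at the point (x, y) is the linear map T below, and the remainder vanishes faster than the perturbation:

```latex
\begin{aligned}
f\,(x, y) &= x \cdot y \\
T\,(\delta x, \delta y) &= y \cdot \delta x + x \cdot \delta y \\
f\,(x + \delta x,\; y + \delta y) - \bigl(f\,(x, y) + T\,(\delta x, \delta y)\bigr)
  &= \delta x \cdot \delta y \\
\frac{|\delta x \cdot \delta y|}{|(\delta x, \delta y)|}
  &\le \tfrac{1}{2}\,|(\delta x, \delta y)| \;\to\; 0
\end{aligned}
```

The two partial derivatives, y and x, appear as the coefficients of the single linear map T.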


The chain rules all unify into one.

Generalize from

d (g ◦ u) x ≡ d g (u x) ∗ d u x

etc. to

d (g ◦ u) x ≡ d g (u x) ◦ d u x


Generalized derivatives

Derivative values are linear maps: α ⊸ β.

d :: (Vector s α, Vector s β)
  ⇒ (α → β) → (α → (α ⊸ β))

First-order AD:

data α ⊳ β = D β (α ⊸ β)

Higher-order AD:

data α ⊳∗ β = D β (α ⊳∗ (α ⊸ β))
            ≈ β × (α ⊸ β) × (α ⊸ (α ⊸ β)) × . . .
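
A first-order sketch of this idea, under a strong simplifying assumption: a linear map is represented by an ordinary Haskell function, with linearity left unchecked. The names `composeD`, `mulD`, `sinD` and `sinOfProduct` are hypothetical and chosen for this example:

```haskell
-- A value of type b together with a derivative, here an (assumed linear)
-- function from a to b.
data D a b = D b (a -> b)

-- The unified chain rule: the derivative of a composition is the
-- composition of the derivatives.
composeD :: (b -> D b c) -> (a -> D a b) -> (a -> D a c)
composeD g u x =
  let D y du = u x
      D z dg = g y
  in  D z (dg . du)

-- Multiplication of two scalars; its derivative at (x, y) maps
-- (dx, dy) to y * dx + x * dy.
mulD :: (Double, Double) -> D (Double, Double) Double
mulD (x, y) = D (x * y) (\(dx, dy) -> y * dx + x * dy)

-- Sine; its derivative at x scales a perturbation by cos x.
sinD :: Double -> D Double Double
sinD x = D (sin x) (\dx -> cos x * dx)

-- sin (x * y), differentiated with respect to both arguments at once.
sinOfProduct :: (Double, Double) -> D (Double, Double) Double
sinOfProduct = sinD `composeD` mulD
```

Applying the resulting derivative to `(1, 0)` or `(0, 1)` recovers the two partial derivatives.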


What’s a linear map?

Preserves linear combinations:

h (s₁ · u₁ + . . . + sₙ · uₙ) ≡ s₁ · h u₁ + . . . + sₙ · h uₙ

Fully determined by behavior on basis of α, so

type α ⊸ β = Basis α →ᴹ β

Memoized for efficiency.

Vectors, matrices, etc. re-emerge as memo-tries.
Statically dimension-typed!


What’s a basis?

class Vector s v ⇒ HasBasis s v where
  type Basis v :: ∗
  coord      :: v → (Basis v → s)
  basisValue :: Basis v → v


instance HasBasis Double Double where
  type Basis Double = ()
  coord s       = λ() → s
  basisValue () = 1

instance (HasBasis s u, HasBasis s v)
    ⇒ HasBasis s (u, v) where
  type Basis (u, v)    = Basis u `Either` Basis v
  coord (u, v)         = coord u `either` coord v
  basisValue (Left a)  = (basisValue a, 0)
  basisValue (Right b) = (0, basisValue b)
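
A compilable variant of these declarations, simplified so that it stands alone. It assumes scalars are fixed to Double (dropping the `s` parameter and the `Vector` superclass) and adds a `zeroV` method to stand in for the `0` used above; both changes are made for this sketch only:

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)

-- Single-parameter variant of the class: scalars are always Double.
class HasBasis v where
  type Basis v :: Type
  coord      :: v -> (Basis v -> Double)  -- coordinate along a basis element
  basisValue :: Basis v -> v              -- the basis vector itself
  zeroV      :: v                         -- zero vector (assumed helper)

-- Double is one-dimensional: a single basis element.
instance HasBasis Double where
  type Basis Double = ()
  coord s       = \() -> s
  basisValue () = 1
  zeroV         = 0

-- A pair's basis is the disjoint union of its components' bases.
instance (HasBasis u, HasBasis v) => HasBasis (u, v) where
  type Basis (u, v)    = Either (Basis u) (Basis v)
  coord (u, v)         = either (coord u) (coord v)
  basisValue (Left a)  = (basisValue a, zeroV)
  basisValue (Right b) = (zeroV, basisValue b)
  zeroV                = (zeroV, zeroV)

-- Coordinates of (3, 4) along the two basis elements of (Double, Double).
sampleCoords :: (Double, Double)
sampleCoords = (coord p (Left ()), coord p (Right ()))
  where p = (3, 4) :: (Double, Double)
```

Nesting pairs gives higher dimensions, with the basis type growing as a nested `Either`.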

Automatic differentiation – naturally

Can we make AD even simpler?

Recall our function overloadings:

instance Num β ⇒ Num (α → β) where
  (+) = liftA2 (+)
  (∗) = liftA2 (∗)
  . . .

instance Floating β ⇒ Floating (α → β) where
  sin = fmap sin
  cos = fmap cos
  . . .

These definitions are standard for applicative functors.
Could they work for D ?


Could we simply define AD via the standard

sin = fmap sin

etc.? What is fmap? Require toDₓ to be a natural transformation:

fmap g ◦ toDₓ ≡ toDₓ ◦ fmap g

where

toDₓ u = D (u x) (d u x)

Define fmap from this naturality condition.


Derive AD naturally

toDₓ (fmap g u) ≡ toDₓ (g ◦ u)
                ≡ D ((g ◦ u) x) (d (g ◦ u) x)
                ≡ D (g (u x)) (d g (u x) ◦ d u x)

fmap g (toDₓ u) ≡ fmap g (D (u x) (d u x))

Sufficient definition:

fmap g (D ux dux) = D (g ux) (d g ux ◦ dux)

Similar derivation for liftA2 (for (+), (∗), etc).


Oops. d doesn’t have an implementation.

Solution A: Inline fmap at each use of fmap g, and rewrite d g to the known derivative of g.

Solution B: Generalize Functor to allow non-function arrows, and replace functions with differentiable functions.


Conclusions

- Specification as a structure-preserving semantic function.
- Implementation derived systematically from specification.
- Prettier implementation via functions-as-numbers.
- Infinite derivative towers with almost no extra code.
- Generalize to differentiation over vector spaces.
- Even simpler specification/derivation via naturality.
