Rademacher complexity

In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures the richness of a class of real-valued functions with respect to a probability distribution.

Definitions

Rademacher complexity of a set

Given a set $A\subseteq \mathbb {R} ^{m}$, the Rademacher complexity of $A$ is defined as follows:^[1]^[2]^:326

\operatorname {Rad} (A):={\frac {1}{m}}\operatorname {E} \left[\sup _{a\in A}\sum _{i=1}^{m}\sigma _{i}a_{i}\right]

where $\sigma _{1},\sigma _{2},\dots ,\sigma _{m}$ are independent random variables drawn from the Rademacher distribution, i.e. $\Pr(\sigma _{i}=+1)=\Pr(\sigma _{i}=-1)=1/2$ for $i=1,2,\dots ,m$, and $a=(a_{1},\ldots ,a_{m})$. Some authors take the absolute value of the sum before taking the supremum, but if $A$ is symmetric this makes no difference.
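For a finite set, the expectation in the definition above can be computed exactly by averaging over all $2^{m}$ sign vectors. The following is a minimal sketch under that assumption; the function names `rad_exact` and `rad_exact_abs` and the example sets are hypothetical and chosen only for illustration. It also compares the two conventions (with and without the absolute value) on a symmetric and a non-symmetric set.

```python
from itertools import product


def rad_exact(vectors):
    """Exact Rad(A) for a finite set A of equal-length vectors."""
    m = len(vectors[0])
    total = 0.0
    # Average the supremum over all 2^m equally likely sign vectors.
    for signs in product((1, -1), repeat=m):
        total += max(sum(s * a for s, a in zip(signs, vec)) for vec in vectors)
    return total / (2 ** m) / m


def rad_exact_abs(vectors):
    """Variant that takes the absolute value of the sum before the supremum."""
    m = len(vectors[0])
    total = 0.0
    for signs in product((1, -1), repeat=m):
        total += max(abs(sum(s * a for s, a in zip(signs, vec))) for vec in vectors)
    return total / (2 ** m) / m


# A symmetric set: it contains -a whenever it contains a.
symmetric = [(1, 2, 0), (-1, -2, 0), (0, 1, 1), (0, -1, -1)]
# A non-symmetric set, where the two conventions can disagree.
single = [(1, 2, 0)]

print(rad_exact(symmetric), rad_exact_abs(symmetric))
print(rad_exact(single), rad_exact_abs(single))
```

The enumeration costs $2^{m}$ evaluations, so this approach is only practical for small $m$.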

Rademacher complexity of a function class

Given a sample $S=(z_{1},z_{2},\dots ,z_{m})\in Z^{m}$ and a class $F$ of real-valued functions defined on a domain space $Z$, where $f$ is the loss function $f(z)=\ell (h(x),y)$ of a classifier $h$, the empirical Rademacher complexity of $F$ given $S$ is defined as:

\operatorname {Rad} _{S}(F)={\frac {1}{m}}\operatorname {E} \left[\sup _{f\in F}\sum _{i=1}^{m}\sigma _{i}f(z_{i})\right]

This can also be written using the previous definition:^[2]^:326

\operatorname {Rad} _{S}(F)=\operatorname {Rad} (F\circ S)

where $F\circ S$ denotes function composition, i.e.:

F\circ S:=\{(f(z_{1}),\ldots ,f(z_{m}))\mid f\in F\}
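The reduction to a set of vectors can be sketched in code: first build $F\circ S$, then estimate its Rademacher complexity. The sketch below uses Monte Carlo sampling of sign vectors rather than exact enumeration. The threshold class, the sample, and the names `project` and `empirical_rad` are assumptions made for illustration, not part of the definition.

```python
import random


def project(functions, sample):
    """Build F∘S: one vector (f(z_1), ..., f(z_m)) per function f."""
    return [tuple(f(z) for z in sample) for f in functions]


def empirical_rad(functions, sample, draws=20000, seed=0):
    """Monte Carlo estimate of Rad_S(F) = Rad(F∘S)."""
    rng = random.Random(seed)
    vectors = project(functions, sample)
    m = len(sample)
    total = 0.0
    for _ in range(draws):
        # One draw of m independent Rademacher signs.
        signs = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * v for s, v in zip(signs, vec)) for vec in vectors)
    return total / draws / m


# Hypothetical class: threshold functions f_t(z) = 1 if z <= t else 0.
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
F = [lambda z, t=t: 1.0 if z <= t else 0.0 for t in thresholds]
S = [0.1, 0.3, 0.6, 0.9]

print(project(F, S))
print(empirical_rad(F, S))
```

The estimate depends on `F` only through the distinct vectors in `project(F, S)`, which is why an infinite class can still have a finite, computable empirical complexity.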

Let $P$ be a probability distribution over $Z$. The Rademacher complexity of the function class $F$ with respect to $P$ for sample size $m$ is:

\operatorname {Rad} _{P,m}(F):=\operatorname {E} _{S\sim P^{m}}\left[\operatorname {Rad} _{S}(F)\right]

where the above expectation is taken over an independent and identically distributed (i.i.d.) sample $S=(z_{1},z_{2},\dots ,z_{m})$ generated according to $P$.

Examples

1. $A$ contains a single vector, e.g., $A=\{(a,b)\}\subset \mathbb {R} ^{2}$ . Then:

\operatorname {Rad} (A)={1 \over 2}\cdot \left({1 \over 4}\cdot (a+b)+{1 \over 4}\cdot (a-b)+{1 \over 4}\cdot (-a+b)+{1 \over 4}\cdot (-a-b)\right)=0

The same is true for every singleton hypothesis class.^[3]^:56

2. $A$ contains two vectors, e.g., $A=\{(1,1),(1,2)\}\subset \mathbb {R} ^{2}$ . Then:

{\begin{aligned}\operatorname {Rad} (A)&={1 \over 2}\cdot \left({1 \over 4}\cdot \max(1+1,1+2)+{1 \over 4}\cdot \max(1-1,1-2)+{1 \over 4}\cdot \max(-1+1,-1+2)+{1 \over 4}\cdot \max(-1-1,-1-2)\right)\\[5pt]&={1 \over 8}(3+0+1-2)={1 \over 4}\end{aligned}}
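Both examples can be checked with exact rational arithmetic. This is a small sketch; the helper name `rad` and the particular singleton $(3,5)$ are illustrative choices.

```python
from fractions import Fraction
from itertools import product


def rad(vectors):
    """Exact Rademacher complexity of a finite set, as a Fraction."""
    m = len(vectors[0])
    sign_vectors = list(product((1, -1), repeat=m))
    total = sum(
        max(sum(s * a for s, a in zip(signs, vec)) for vec in vectors)
        for signs in sign_vectors
    )
    # Divide by the number of sign vectors (the expectation) and by m.
    return Fraction(total, len(sign_vectors) * m)


print(rad([(3, 5)]))          # singleton: prints 0
print(rad([(1, 1), (1, 2)]))  # two-vector example: prints 1/4
```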

Using the Rademacher complexity

The Rademacher complexity can be used to derive data-dependent upper-bounds on the learnability of function classes. Intuitively, a function-class with smaller Rademacher complexity is easier to learn.

Bounding the representativeness

In machine learning, it is desirable to have a training set that represents the true distribution of the sample data $S$. This can be quantified using the notion of representativeness. Denote by $P$ the probability distribution from which the samples are drawn. Denote by $H$ the set of hypotheses (potential classifiers) and by $F$ the corresponding set of error functions, i.e., for every hypothesis $h\in H$, there is a function $f_{h}\in F$ that maps each training sample (features, label) to the error of the classifier $h$ (here hypothesis and classifier are used interchangeably). For example, when $h$ is a binary classifier, the error function is the 0–1 loss, i.e. $f_{h}$ returns 0 if $h$ correctly classifies a sample and 1 otherwise. We omit the index and write $f$ instead of $f_{h}$ when the underlying hypothesis is irrelevant. Define:

$L_{P}(f):=\operatorname {E} _{z\sim P}[f(z)]$: the expected error of some error function $f\in F$ on the real distribution $P$;

$L_{S}(f):={1 \over m}\sum _{i=1}^{m}f(z_{i})$: the estimated error of some error function $f\in F$ on the sample $S$.

The representativeness of the sample $S$ , with respect to $P$ and $F$ , is defined as:

\operatorname {Rep} _{P}(F,S):=\sup _{f\in F}(L_{P}(f)-L_{S}(f))

Smaller representativeness is better, since it provides a way to avoid overfitting: it means that the true error of a classifier is not much higher than its estimated error, and so selecting a classifier that has low estimated error will ensure that the true error is also low. Note however that the concept of representativeness is relative and hence can not be compared across distinct samples.

The expected representativeness of a sample can be bounded above by the Rademacher complexity of the function class:^[2]^:326

\operatorname {E} _{S\sim P^{m}}[\operatorname {Rep} _{P}(F,S)]\leq 2\cdot \operatorname {E} _{S\sim P^{m}}[\operatorname {Rad} (F\circ S)]
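On a small finite domain both sides of this inequality can be computed exactly by enumerating every sample. The sketch below assumes a hypothetical three-point domain with uniform $P$, three error functions given by their value tables, and sample size $m=3$; none of these come from the cited source.

```python
from fractions import Fraction
from itertools import product

Z = (0, 1, 2)                          # hypothetical finite domain, P uniform on Z
F = [(0, 0, 1), (1, 0, 0), (0, 1, 1)]  # each f as its value table (f(0), f(1), f(2))
m = 3                                  # sample size


def rad(vectors):
    """Exact Rad of a finite set of vectors."""
    n = len(vectors[0])
    signs_list = list(product((1, -1), repeat=n))
    total = sum(
        max(sum(s * a for s, a in zip(sg, v)) for v in vectors) for sg in signs_list
    )
    return Fraction(total, len(signs_list) * n)


def true_error(f):
    """L_P(f) under the uniform distribution on Z."""
    return Fraction(sum(f), len(Z))


def rep(sample):
    """Rep_P(F, S) = sup_f (L_P(f) - L_S(f))."""
    return max(true_error(f) - Fraction(sum(f[z] for z in sample), m) for f in F)


samples = list(product(Z, repeat=m))   # all 27 equally likely samples
expected_rep = sum(rep(S) for S in samples) / len(samples)
expected_rad = sum(
    rad([tuple(f[z] for z in S) for f in F]) for S in samples
) / len(samples)

print(expected_rep, 2 * expected_rad)
```

The first printed value is the left-hand side of the bound and the second is the right-hand side.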

Bounding the generalization error

When the Rademacher complexity is small, it is possible to learn the hypothesis class H using empirical risk minimization.

For example, (with binary error function),^[2]^:328 for every $\delta >0$ , with probability at least $1-\delta$ , for every hypothesis $h\in H$ :

L_{P}(h)-L_{S}(h)\leq 2\operatorname {Rad} (F\circ S)+4{\sqrt {2\ln(4/\delta ) \over m}}
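The right-hand side of this bound is straightforward to evaluate once an estimate of $\operatorname {Rad} (F\circ S)$ is available. The following sketch assumes such a value is supplied by the caller; the function name and the value 0.05 are hypothetical.

```python
import math


def generalization_gap_bound(rad_fs, m, delta):
    """Evaluate 2*Rad(F∘S) + 4*sqrt(2*ln(4/delta)/m)."""
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return 2 * rad_fs + 4 * math.sqrt(2 * math.log(4 / delta) / m)


# Assumed value of Rad(F∘S); in practice it would be estimated from the sample.
for m in (100, 10_000, 1_000_000):
    print(m, generalization_gap_bound(rad_fs=0.05, m=m, delta=0.05))
```

The confidence term shrinks like $1/{\sqrt {m}}$, so for small samples it can exceed 1, in which case the bound says nothing about a loss that takes values in $[0,1]$.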

Bounding the Rademacher complexity

Since smaller Rademacher complexity is better, it is useful to have upper bounds on the Rademacher complexity of various function sets. The following rules can be used to upper-bound the Rademacher complexity of a set $A\subset \mathbb {R} ^{m}$ .^[2]^:329–330

1. If all vectors in $A$ are translated by a constant vector $a_{0}\in \mathbb {R} ^{m}$ , then Rad(A) does not change.

2. If all vectors in $A$ are multiplied by a scalar $c\in \mathbb {R}$ , then Rad(A) is multiplied by $|c|$ .

3. Rad(A + B) = Rad(A) + Rad(B).^[3]^:56

4. (Kakade & Tewari Lemma) If all vectors in $A$ are operated by a Lipschitz function, then Rad(A) is (at most) multiplied by the Lipschitz constant of the function. In particular, if all vectors in $A$ are operated by a contraction mapping, then Rad(A) strictly decreases.

5. The Rademacher complexity of the convex hull of $A$ equals Rad(A).

6. (Massart Lemma) The Rademacher complexity of a finite set grows logarithmically with the set size. Formally, let $A$ be a set of $N$ vectors in $\mathbb {R} ^{m}$ , and let ${\bar {a}}$ be the mean of the vectors in $A$ . Then:

\operatorname {Rad} (A)\leq \max _{a\in A}\|a-{\bar {a}}\|\cdot {{\sqrt {2\log N}} \over m}

In particular, if $A$ is a set of binary vectors, the norm is at most ${\sqrt {m}}$ , so:

\operatorname {Rad} (A)\leq {\sqrt {2\log N \over m}}
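Rules 1 to 3 and Massart's lemma can be checked numerically on small finite sets with exact enumeration. In the sketch below the sets `A` and `B`, the shift `a0`, and the scalar `c` are arbitrary illustrative values.

```python
import math
from itertools import product


def rad(vectors):
    """Exact Rad of a finite set by enumerating all sign vectors."""
    m = len(vectors[0])
    total = 0.0
    for signs in product((1, -1), repeat=m):
        total += max(sum(s * a for s, a in zip(signs, v)) for v in vectors)
    return total / 2 ** m / m


def add(u, v):
    return tuple(x + y for x, y in zip(u, v))


A = [(1.0, 0.0, 2.0), (0.0, 1.0, -1.0), (2.0, 2.0, 0.0)]
B = [(0.0, 1.0, 1.0), (1.0, -1.0, 0.0)]
a0 = (5.0, -3.0, 1.0)
c = -2.5

translated = [add(a, a0) for a in A]           # rule 1
scaled = [tuple(c * x for x in a) for a in A]  # rule 2
minkowski = [add(a, b) for a in A for b in B]  # rule 3: A + B

# Massart's bound uses the Euclidean distance to the mean vector.
mean = tuple(sum(col) / len(A) for col in zip(*A))
radius = max(math.dist(a, mean) for a in A)
massart = radius * math.sqrt(2 * math.log(len(A))) / len(A[0])

print(rad(A), rad(translated), rad(scaled), rad(minkowski), massart)
```

With these values `rad(A)` is 0.625, which stays below the Massart bound of roughly 0.96.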

Bounds related to the VC dimension

Let $H$ be a set family whose VC dimension is $d$. It is known that the growth function of $H$ is bounded as follows: for all $m>d+1$:

\operatorname {Growth} (H,m)\leq (em/d)^{d}

This means that, for every set $h$ with at most $m$ elements, $|H\cap h|\leq (em/d)^{d}$ . The set-family $H\cap h$ can be considered as a set of binary vectors over $\mathbb {R} ^{m}$ . Substituting this in Massart's lemma gives:

\operatorname {Rad} (H\cap h)\leq {\sqrt {2d\log(em/d) \over m}}
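The substitution step can be made explicit: plugging $N=(em/d)^{d}$ into Massart's bound for binary vectors gives the stated expression. The sketch below only evaluates the formulas; the function names and the values of $d$ and $m$ are illustrative.

```python
import math


def massart_binary(n_vectors, m):
    """Massart's bound for N binary vectors in R^m: sqrt(2 log N / m)."""
    return math.sqrt(2 * math.log(n_vectors) / m)


def growth_bound(d, m):
    """(em/d)^d, stated in the text for m > d + 1."""
    if m <= d + 1:
        raise ValueError("the bound is stated for m > d + 1")
    return (math.e * m / d) ** d


def vc_rad_bound(d, m):
    """sqrt(2 d log(em/d) / m), i.e. Massart's bound with N = (em/d)^d."""
    return math.sqrt(2 * d * math.log(math.e * m / d) / m)


d = 10
for m in (100, 1_000, 10_000):
    # The two columns agree up to floating-point rounding.
    print(m, massart_binary(growth_bound(d, m), m), vc_rad_bound(d, m))
```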

With more advanced techniques (Dudley's entropy bound and Haussler's upper bound^[4]) one can show, for example, that there exists a constant $C$ , such that any class of $\{0,1\}$ -indicator functions with Vapnik–Chervonenkis dimension $d$ has Rademacher complexity upper-bounded by $C{\sqrt {\frac {d}{m}}}$ .

Bounds related to linear classes

The following bounds are related to linear operations on $S$ – a constant set of $m$ vectors in $\mathbb {R} ^{n}$ .^[2]^:332–333

1. Define $A_{2}=\{(w\cdot x_{1},\ldots ,w\cdot x_{m})\mid \|w\|_{2}\leq 1\}=$ the set of dot-products of the vectors in $S$ with vectors in the unit ball. Then:

\operatorname {Rad} (A_{2})\leq {\max _{i}\|x_{i}\|_{2} \over {\sqrt {m}}}

2. Define $A_{1}=\{(w\cdot x_{1},\ldots ,w\cdot x_{m})\mid \|w\|_{1}\leq 1\}=$ the set of dot-products of the vectors in $S$ with vectors in the unit ball of the 1-norm. Then:

\operatorname {Rad} (A_{1})\leq \max _{i}\|x_{i}\|_{\infty }\cdot {\sqrt {2\log(2n) \over m}}
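For a small sample both complexities can be computed exactly, because the supremum over the unit ball has a closed form: over $\|w\|_{2}\leq 1$ the supremum of $w\cdot v$ is $\|v\|_{2}$, and over $\|w\|_{1}\leq 1$ it is $\|v\|_{\infty }$. The sketch below uses a hypothetical sample of four vectors in $\mathbb {R} ^{3}$.

```python
import math
from itertools import product

# Hypothetical sample S: m = 4 vectors in R^3.
X = [(1.0, 0.0, 2.0), (0.5, -1.0, 0.0), (-2.0, 1.0, 1.0), (0.0, 0.0, -1.5)]
m, n = len(X), len(X[0])


def signed_sums():
    """Yield sum_i sigma_i x_i for every sign vector sigma."""
    for signs in product((1, -1), repeat=m):
        yield tuple(sum(s * x[j] for s, x in zip(signs, X)) for j in range(n))


# Exact complexities via the dual-norm identities described above.
rad_a2 = sum(math.hypot(*v) for v in signed_sums()) / 2 ** m / m
rad_a1 = sum(max(abs(c) for c in v) for v in signed_sums()) / 2 ** m / m

# The two upper bounds from the text.
bound_a2 = max(math.hypot(*x) for x in X) / math.sqrt(m)
bound_a1 = max(max(abs(c) for c in x) for x in X) * math.sqrt(2 * math.log(2 * n) / m)

print(rad_a2, bound_a2)
print(rad_a1, bound_a1)
```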

Bounds related to covering numbers

The following bound relates the Rademacher complexity of a set $A$ to its external covering number – the number of balls of a given radius $r$ whose union contains $A$ . The bound is attributed to Dudley.^[2]^:338

Suppose $A\subset \mathbb {R} ^{m}$ is a set of vectors whose length (norm) is at most $c$ . Then, for every integer $M>0$ :

\operatorname {Rad} (A)\leq {c\cdot 2^{-M} \over {\sqrt {m}}}+{6c \over m}\cdot \sum _{i=1}^{M}2^{-i}{\sqrt {\log \left(N_{c\cdot 2^{-i}}^{\text{ext}}(A)\right)}}

In particular, if $A$ lies in a d-dimensional subspace of $\mathbb {R} ^{m}$ , then:

\forall r>0:N_{r}^{\text{ext}}(A)\leq (2c{\sqrt {d}}/r)^{d}

Substituting this in the previous bound gives the following bound on the Rademacher complexity:

\operatorname {Rad} (A)\leq {6c \over m}\cdot {\bigg (}{\sqrt {d\log(2{\sqrt {d}})}}+2{\sqrt {d}}{\bigg )}=O{\bigg (}{c{\sqrt {d\log(d)}} \over m}{\bigg )}
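The chaining bound can be evaluated numerically for a given upper bound on the covering numbers. The sketch below substitutes the subspace covering bound $(2c{\sqrt {d}}/r)^{d}$ in place of $N_{r}^{\text{ext}}(A)$ and compares the result with the closed-form expression; the parameter values and function names are assumptions made for illustration.

```python
import math


def dudley_bound(c, m, M, log_cover):
    """Right-hand side of the bound; log_cover(r) must upper-bound log N_r^ext(A)."""
    tail = c * 2.0 ** (-M) / math.sqrt(m)
    chain = sum(
        2.0 ** (-i) * math.sqrt(log_cover(c * 2.0 ** (-i))) for i in range(1, M + 1)
    )
    return tail + 6 * c / m * chain


def subspace_log_cover(c, d):
    """log of (2 c sqrt(d) / r)^d, the covering bound for a d-dimensional subspace."""
    return lambda r: d * math.log(2 * c * math.sqrt(d) / r)


def closed_form(c, m, d):
    """(6c/m) * (sqrt(d log(2 sqrt(d))) + 2 sqrt(d))."""
    return 6 * c / m * (math.sqrt(d * math.log(2 * math.sqrt(d))) + 2 * math.sqrt(d))


c, m, d = 1.0, 100, 5
for M in (1, 5, 10, 30):
    print(M, dudley_bound(c, m, M, subspace_log_cover(c, d)), closed_form(c, m, d))
```

As $M$ grows the first term vanishes and the sum converges, which is how the closed form (independent of $M$) arises.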

Gaussian complexity

Gaussian complexity is a closely related complexity measure with a similar interpretation. It is obtained from the Rademacher complexity by using the random variables $g_{i}$ instead of $\sigma _{i}$, where the $g_{i}$ are i.i.d. Gaussian random variables with zero mean and variance 1, i.e. $g_{i}\sim {\mathcal {N}}(0,1)$. Gaussian and Rademacher complexities are known to be equivalent up to logarithmic factors.

Equivalence of Rademacher and Gaussian complexity

Given a set $A\subseteq \mathbb {R} ^{n}$, it holds that:

{\frac {G(A)}{2{\sqrt {\log {n}}}}}\leq \operatorname {Rad} (A)\leq {\sqrt {\frac {\pi }{2}}}G(A)

where $G(A)$ is the Gaussian complexity of $A$.
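The two-sided inequality can be illustrated on a small set. The sketch below assumes $G(A)$ uses the same $1/n$ normalization as $\operatorname {Rad} (A)$, computes $\operatorname {Rad} (A)$ exactly, and estimates $G(A)$ by Monte Carlo; the set `A` and the function names are hypothetical.

```python
import math
import random
from itertools import product

A = [(1, 1, 1, 1), (-1, -1, -1, -1)]   # hypothetical two-point set in R^4
n = len(A[0])


def rad_exact(vectors):
    """Exact Rad(A) by enumerating all 2^n sign vectors."""
    total = 0.0
    for signs in product((1, -1), repeat=n):
        total += max(sum(s * a for s, a in zip(signs, v)) for v in vectors)
    return total / 2 ** n / n


def gaussian_mc(vectors, draws=20000, seed=0):
    """Monte Carlo estimate of G(A) = (1/n) E[sup_a sum_i g_i a_i], g_i ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += max(sum(gi * a for gi, a in zip(g, v)) for v in vectors)
    return total / draws / n


rad_a = rad_exact(A)
g_a = gaussian_mc(A)
lower = g_a / (2 * math.sqrt(math.log(n)))
upper = math.sqrt(math.pi / 2) * g_a
print(lower, rad_a, upper)
```

Because `g_a` is an estimate, the check is only approximate, although for this set the margins on both sides are large compared with the sampling error.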

References

[1] Balcan, Maria-Florina (November 15–17, 2011). "Machine Learning Theory – Rademacher Complexity" (PDF). Retrieved 10 December 2016.
[2] Chapter 26 in Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning – from Theory to Algorithms. Cambridge University Press. ISBN 9781107057135.
[3] Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine Learning. USA, Massachusetts: MIT Press. ISBN 9780262018258.
[4] Bousquet, O. (2004). Introduction to Statistical Learning Theory. Biological Cybernetics, 3176(1), 169–207. http://doi.org/10.1007/978-3-540-28650-9_8

Peter L. Bartlett, Shahar Mendelson (2002) Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research 3 463–482
Giorgio Gnecco, Marcello Sanguineti (2008) Approximation Error Bounds via Rademacher's Complexity. Applied Mathematical Sciences, Vol. 2, 2008, no. 4, 153–176
