Special Session 66: Advances in discrete-time dynamical systems with applications

The simplest neural network does solve the simplest classification problem

Victor Jimenez Lopez
Universidad de Murcia
Spain
Co-Author(s):    V\'{\i}ctor Jim\'{e}nez L\'{o}pez (coauthor: Jes\'{u}s Molina Rodr\'{\i}guez de Vera)
Abstract:
A single-layer neural network, usually referred to as a \emph{perceptron}, is the most elementary of all machine learning tools for dealing with the \emph{classification problem}. Roughly speaking, given disjoint finite sets $A,B\subset \mathbb{R}^h$, we look for a (computationally efficient) way to tell points $\mathbf{x}=(x_1,\ldots,x_h)$ of $A$ apart from those of $B$. More precisely, we look for a \emph{weight vector} $\mathbf{w}=(w_1,\ldots,w_h)$ and a \emph{bias} $b$ so that, when applying the \emph{sigmoid activation function} $\sigma(z)=1/(1+e^{-z})$ to the scalar product of $\overline{\mathbf{x}}=(1,\mathbf{x})$ and $\mathbf{u}=(b,\mathbf{w})$, the resulting function $Z(\mathbf{x})=Z(\mathbf{x},\mathbf{u})=\sigma(\overline{\mathbf{x}}\cdot\mathbf{u})$ takes values close to $1$ (respectively, to $0$) when $\mathbf{x}\in A$ (respectively, $\mathbf{x}\in B$). To do this, after enumerating $A\cup B=\{\mathbf{x}_k\}_{k=1}^m$, the natural approach is to minimize the \emph{squared error function} $E(\mathbf{u})=\sum_{k=1}^m (Z(\mathbf{x}_k,\mathbf{u})-y_k)^2$, where $y_k=1-\epsilon$ or $y_k=\epsilon$ according to whether $\mathbf{x}_k\in A$ or $\mathbf{x}_k\in B$, respectively, and $\epsilon$ is a fixed small positive number (the natural choice $\epsilon=0$ must be avoided for technical reasons; typically, $\epsilon=0.1$ is used). Hopefully, the error function will have exactly one minimum $\mathbf{u}^*=(b^*, \mathbf{w}^*)$ with the prescribed properties. Moreover, it should be found by the \emph{gradient descent algorithm} with a sufficiently small \emph{learning rate} $\alpha>0$. In dynamical systems terms, this means that the discrete system $\mathbf{u}_{n+1}=\mathbf{f}(\mathbf{u}_n)$, with $\mathbf{f}(\mathbf{u})=\mathbf{u} - \alpha \nabla E(\mathbf{u})$, has $\mathbf{u}^*$ as a stable global attractor. While it is well known that this dream scenario need not hold (hence the need to resort to multilayer networks), there are, to the best of our knowledge, no examples in the literature where global convergence of a perceptron, in the above sense, has been rigorously proved. In the present work we succeed in doing so in the simplest of all cases: when $A=\{1\}$ and $B=\{0\}$.
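
For concreteness, a sketch of the objects involved under the conventions above: in the case $A=\{1\}$, $B=\{0\}$ one has $h=1$, $m=2$ and $\mathbf{u}=(b,w)$, so the error function and the gradient descent map reduce to
\[
E(b,w)=\bigl(\sigma(b+w)-(1-\epsilon)\bigr)^2+\bigl(\sigma(b)-\epsilon\bigr)^2,\qquad
\mathbf{f}(b,w)=(b,w)-\alpha\nabla E(b,w),
\]
where, using $\sigma'(z)=\sigma(z)(1-\sigma(z))$,
\[
\frac{\partial E}{\partial w}=2\bigl(\sigma(b+w)-(1-\epsilon)\bigr)\sigma'(b+w),\qquad
\frac{\partial E}{\partial b}=\frac{\partial E}{\partial w}+2\bigl(\sigma(b)-\epsilon\bigr)\sigma'(b).
\]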