# Cross Validation

## 2022-01-15

Comparing the results of using different data subsets, for supervised learning training and testing, is referred to as *cross validation*.
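One common way to produce the different data subsets is k-fold splitting, sketched below. This is only an illustration under assumptions: the function name `k_fold_indices`, the fold count and the seed are hypothetical and not part of the original text. Each fold serves once as testing data while the remaining folds serve as training data, so the results can be compared across folds.

```python
import numpy

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yields (train_indices, test_indices) pairs, one pair per fold.

    Assumes n_folds is at least 2 and at most n_samples.
    """
    rng = numpy.random.default_rng(seed)
    indices = rng.permutation(n_samples)          # shuffle before splitting
    folds = numpy.array_split(indices, n_folds)   # nearly equal sized folds
    for i in range(n_folds):
        test = folds[i]
        train = numpy.concatenate(folds[:i] + folds[i + 1:])
        yield train, test

# Hypothetical example: 10 samples split into 5 folds.
splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(f"train: {sorted(train.tolist())}, test: {sorted(test.tolist())}")
```

A model would be trained and scored once per pair, and the scores compared or averaged.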

Machine Learning Resource

AUTOMATIC PROGRAMMING

DR. CHRISTIAN SEBERINO

Machine Learning,
Deep Learning,
Artificial Intelligence,
Python & More

"Machine learning is automatic programming!" — Me


To estimate the importance of words in a document, determine term frequency-inverse document frequency (tf-idf) values. These values grow with the word frequencies in the document but decrease with the word frequencies in *other* documents.
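As a rough sketch, one of several tf-idf variants can be computed as follows. The variant chosen here (relative term frequency multiplied by the logarithm of the inverse fraction of documents containing the word) and the name `tf_idf` are assumptions for illustration, and the three short documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Returns one {word: tf-idf value} dictionary per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    n_docs = len(documents)
    # Number of documents that contain each word.
    doc_freq = Counter(word for c in counts for word in c)
    results = []
    for c in counts:
        total = sum(c.values())
        results.append({word: (n / total) * math.log(n_docs / doc_freq[word])
                        for word, n in c.items()})
    return results

# Made-up example documents.
docs = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(docs)
for doc, score in zip(docs, scores):
    print(doc, score)
```

In this example "the" occurs in every document, so its value is zero, while "dog" occurs in only one document and gets the largest value there.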

The number of standard deviations an element of a set differs from the average is referred to as its *z-score*. An alternative to min max normalization is to replace numbers with z-scores.
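A minimal NumPy sketch of replacing numbers with z-scores, assuming the population standard deviation (NumPy's default) is the one intended. The name `z_scores` and the sample values are illustrative only.

```python
import numpy

def z_scores(data):
    """Replaces every element with its z-score.

    Assumes the elements are not all equal (non-zero standard deviation).
    """
    data = numpy.asarray(data, dtype=float)
    return (data - data.mean()) / data.std()

# Illustrative values with mean 5 and standard deviation 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scores = z_scores(values)
print(scores)
```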

The Cartesian product of sets *A* and *B* is {(*x*, *y*) : *x* ∈ *A* and *y* ∈ *B*}.
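In Python, the definition above corresponds to `itertools.product` from the standard library. The two small sets below are made up for illustration.

```python
from itertools import product

# Made-up example sets.
A = {1, 2}
B = {"x", "y"}

# Every pair (x, y) with x in A and y in B.
pairs = set(product(A, B))
print(sorted(pairs))
```

The number of pairs is the product of the sizes of the two sets.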

Lists of the words and word frequencies in text data are referred to as *bags of words*.
Lists of the phrases, up to some maximum word count, in text data are referred to as *n-grams*.
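Both ideas can be sketched with the standard library, assuming words are simply separated by whitespace and case is ignored. The function names and the sample sentence are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Returns the words of text together with their frequencies."""
    return Counter(text.lower().split())

def n_grams(text, max_words):
    """Returns every phrase of 1 to max_words consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_words + 1)
            for i in range(len(words) - n + 1)]

# Made-up example text.
text = "the cat saw the dog"
bag = bag_of_words(text)
grams = n_grams(text, 2)
print(bag)
print(grams)
```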

Converting a set of strings to a set of integers establishes an ordering. Converting a set of strings to a set of perpendicular unit vectors, which is referred to as *one hot encoding*, does *not* establish an ordering.
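A small sketch of the conversion to perpendicular unit vectors, using rows of an identity matrix. The helper name `one_hot` and the color strings are assumptions for illustration.

```python
import numpy

def one_hot(strings):
    """Maps each string to a unit vector, one dimension per distinct string."""
    categories = sorted(set(strings))
    index = {s: i for i, s in enumerate(categories)}
    # Row i of the identity matrix is the unit vector for category i.
    vectors = numpy.identity(len(categories))[[index[s] for s in strings]]
    return categories, vectors

# Made-up example strings.
colors = ["red", "green", "blue", "green"]
categories, vectors = one_hot(colors)
print(categories)
print(vectors)
```

Distinct categories map to vectors whose dot product is zero, so no category is "between" two others.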

Inference is the process of *using* supervised learning models.

Validation data are used to evaluate machine learning method variations. Testing data are used to evaluate machine learning programs. Often training, validation and testing data are 80%, 10% and 10% respectively of the total data.
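The 80%, 10% and 10% split could be sketched as follows. The function name `split_data`, its parameters and the made-up array are assumptions and not part of the original text.

```python
import numpy

def split_data(data, train_pct=80, valid_pct=10, seed=0):
    """Shuffles the rows and splits them into training, validation and testing data."""
    data = numpy.asarray(data)
    rng = numpy.random.default_rng(seed)
    shuffled = data[rng.permutation(data.shape[0])]
    n_train = int(train_pct / 100 * data.shape[0])
    n_valid = int(valid_pct / 100 * data.shape[0])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])      # the remainder is testing data

# Made-up data: 100 rows of 2 values.
data = numpy.arange(200).reshape(100, 2)
train, valid, test = split_data(data)
print(train.shape, valid.shape, test.shape)
```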

AutoML is a Google service which creates programs using machine learning methods. For example, it can create an image classifier given a large set of categorized images.

Regularization involves methods to avoid overfitting in supervised learning. A common regularization technique for the backpropagation method is to adjust the error function to penalize large weights and biases. L1 (Lasso) regularization adds a term containing the sum of the *absolute values* of all the weights and biases. L2 (Ridge) regularization adds a term containing the sum of the *squares* of all the weights and biases.
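The two penalty terms can be sketched as below, using the usual convention that the sum of absolute values is the L1 (Lasso) penalty and the sum of squares is the L2 (Ridge) penalty. The function names, the `strength` factor and the sample parameters are assumptions for illustration; the penalty would be added to the error function before computing gradients.

```python
import numpy

def l1_penalty(weights, biases, strength):
    """Sum of the absolute values of all weights and biases, scaled by strength."""
    return strength * sum(numpy.abs(p).sum() for p in weights + biases)

def l2_penalty(weights, biases, strength):
    """Sum of the squares of all weights and biases, scaled by strength."""
    return strength * sum((p ** 2).sum() for p in weights + biases)

# Made-up parameters for a single layer.
weights = [numpy.array([[1.0, -2.0], [0.5, 0.0]])]
biases = [numpy.array([3.0, -1.0])]

l1 = l1_penalty(weights, biases, 0.1)
l2 = l2_penalty(weights, biases, 0.1)
print(l1, l2)
```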

The backpropagation method is an extension of the perceptron method for acyclic artificial neural networks. Acyclic artificial neural networks are defined in terms of the following:

functions $f_1, f_2, f_3, \ldots, f_N$

weight matrices $W_1, W_2, W_3, \ldots, W_N$

bias vectors $b_1, b_2, b_3, \ldots, b_N$

such that the result for an input vector $\mathbf{i}$ involves:

$o_0 = \mathbf{i}$

$o_j = (f_j(a_{j1}), f_j(a_{j2}), f_j(a_{j3}), \ldots, f_j(a_{j n_j}))$ for $j = 1, 2, 3, \ldots, N$

$a_j = W_j o_{j-1} + b_j$ for $j = 1, 2, 3, \ldots, N$

where $a_{jk}$ denotes the $k$-th of the $n_j$ components of $a_j$ and $o_N$ is the result.
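The definition above can be sketched as a short forward pass. The function name `forward`, the choice of a rectified linear unit followed by an identity function, and the small weight matrices are assumptions picked only to make the example concrete.

```python
import numpy

def relu(a):
    """Rectified linear unit applied to every component."""
    return numpy.maximum(a, 0)

def identity(a):
    """Identity function applied to every component."""
    return a

def forward(i, functions, weights, biases):
    """Computes o_N by repeating a_j = W_j o_(j-1) + b_j and o_j = f_j(a_j)."""
    o = numpy.asarray(i, dtype=float)          # o_0 is the input vector
    for f, W, b in zip(functions, weights, biases):
        a = W @ o + b
        o = f(a)
    return o

# Made-up two layer network: 2 inputs, 2 hidden functions, 1 output.
functions = [relu, identity]
weights = [numpy.array([[1.0, -1.0], [0.5, 0.5]]), numpy.array([[1.0, 2.0]])]
biases = [numpy.array([0.0, -1.0]), numpy.array([0.5])]

result = forward([2.0, 1.0], functions, weights, biases)
print(result)
```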

In the backpropagation method, each weight matrix and bias vector is updated for each input output vector pair (**i**,

Here is sample Python backpropagation method code:

```python
#!/usr/bin/env python3

"""
Implements the backpropagation method.

Usage:
    ./backprop <data file> \
               <data split> \
               <number of hidden layers> \
               <number of hidden layer functions> \
               <number of categories> \
               <learning rate> \
               <number of epochs>

Data files must be space delimited with one input output pair per line.

Every hidden layer has the same number of functions.
The hidden layer functions are rectified linear unit functions.
The outer layer functions are identity functions.

initialization steps:
    The input output pairs are shuffled and the inputs min max normalized.
    The weights and biases are set to random values.

Requires NumPy.
"""

import numpy
import sys

def minmax(data):
    """
    Finds the min max normalizations of data.
    """
    return (data - numpy.min(data)) / (numpy.max(data) - numpy.min(data))

def init_data(data_file, data_split, n_cat):
    """
    Creates the training and testing data.
    """
    data = numpy.loadtxt(data_file)
    numpy.random.shuffle(data)
    data[:, :-1] = minmax(data[:, :-1])
    # One hot encode the category in the last column.
    outputs = numpy.identity(n_cat)[data[:, -1].astype("int")]
    data = numpy.hstack((data[:, :-1], outputs))
    data_split = int((data_split / 100) * data.shape[0])
    return data[:data_split, :], data[data_split:, :]

def accuracy(data, weights, biases, n_cat):
    """
    Calculates the accuracies of models.
    """
    results = model(data[:, :-n_cat], weights, biases)
    answers = numpy.argmax(data[:, -n_cat:], 1)
    return 100 * (results == answers).astype(int).mean()

def model_(inputs, weights, biases, relu = True):
    """
    model helper function
    """
    outputs = numpy.matmul(weights, inputs.T).T + biases
    if relu:
        outputs = numpy.maximum(outputs, 0)
    return outputs

def model(inputs, weights, biases):
    """
    Finds the model outputs.
    """
    output = model_(inputs, weights[0], biases[0])
    for e in zip(weights[1:-1], biases[1:-1]):
        output = model_(output, e[0], e[1])
    output = model_(output, weights[-1], biases[-1], False)
    output = numpy.argmax(output, 1)
    return output

def adjust(weights, biases, input_, output, func_inps, func_outs, learn_rate):
    """
    Adjusts the weights and biases.
    """
    # Error derivatives for the outer layer.
    d_e_f_i = [func_outs[-1] - output]
    d_e_w = [numpy.outer(d_e_f_i[-1], func_outs[-2])]
    # Error derivatives for the hidden layers, from last to first.
    for i in reversed(range(len(weights) - 1)):
        func_deriv = numpy.clip(numpy.sign(func_inps[i]), 0, 1)
        vector = numpy.matmul(weights[i + 1].T, d_e_f_i[-1])
        func_out = func_outs[i - 1] if i else input_
        d_e_f_i.append(numpy.multiply(vector, func_deriv))
        d_e_w.append(numpy.outer(d_e_f_i[-1], func_out))
    for i, e in enumerate(reversed(list(zip(d_e_w, d_e_f_i)))):
        weights[i] -= learn_rate * e[0]
        biases[i] -= learn_rate * e[1]

def learn(train_data, n_hls, n_hl_funcs, n_cat, learn_rate, n_epochs):
    """
    Learns the weights and biases from the training data.
    """
    weights = [numpy.random.randn(n_hl_funcs, train_data.shape[1] - n_cat)]
    for i in range(n_hls - 1):
        weights.append(numpy.random.randn(n_hl_funcs, n_hl_funcs))
    weights.append(numpy.random.randn(n_cat, n_hl_funcs))
    weights = [e / numpy.sqrt(e.shape[0]) for e in weights]
    biases = [numpy.random.randn(n_hl_funcs) for i in range(n_hls)]
    biases.append(numpy.random.randn(n_cat))
    biases = [e / numpy.sqrt(e.shape[0]) for e in biases]
    for i in range(n_epochs):
        for e in train_data:
            input_ = e[:-n_cat]
            func_inps = []
            func_outs = []
            # Record every layer's function inputs and outputs.
            for l in range(n_hls + 1):
                input__ = func_outs[l - 1] if l else input_
                func_inp = numpy.matmul(weights[l], input__)
                func_inp += biases[l]
                relu = numpy.maximum(func_inp, 0)
                func_out = relu if l != n_hls else func_inp
                func_inps.append(func_inp)
                func_outs.append(func_out)
            adjust(weights, biases, e[:-n_cat], e[-n_cat:], func_inps,
                   func_outs, learn_rate)
    return weights, biases

n_cat = int(sys.argv[5])
train_data, test_data = init_data(sys.argv[1], float(sys.argv[2]), n_cat)
weights, biases = learn(train_data, int(sys.argv[3]), int(sys.argv[4]), n_cat,
                        float(sys.argv[6]), int(sys.argv[7]))
print(f"weights and biases: {weights}, {biases}")
accuracy_ = accuracy(train_data, weights, biases, n_cat)
print(f"training data accuracy: {accuracy_:.2f}%")
accuracy_ = accuracy(test_data, weights, biases, n_cat)
print(f"testing data accuracy: {accuracy_:.2f}%")
```

Here are sample results for the MNIST dataset (Modified National Institute Of Standards And Technology dataset) available from many sources such as Kaggle:

```
% ./backprop MNIST_dataset 80 2 64 10 0.001 100
weights and biases: [array([[ 0.10866304, 0.0041912 , -0.23560872, ..., 0.03364987, -0.19519161, -0.00068468],
[ 0.12745399, 0.12268858, -0.13698254, ..., 0.19508343, 0.20920324, 0.1970561 ],
...
-0.24605896, 0.02329749, -0.16363297, -0.24085487, -0.14819292, -0.19237153,
-0.21772553, -0.19817858, 0.50966376, 0.14384857, 0.10621777, 0.64537735,
0.77337279, 0.01737619]), array([0.03938714, 0.0574965 , 0.16544762, 0.13164358, 0.04927753,
0.12365563, 0.0401857 , 0.18105514, 0.10016533, 0.11111991])]
training data accuracy: 97.96%
testing data accuracy: 96.59%
```

Here is a plot of the accuracy versus the number of epochs for a data split of 80 / 20, two hidden layers, 64 functions per hidden layer, 10 categories and a learning rate of 0.001. Blue denotes the training data accuracy and orange denotes the testing data accuracy:

Feedforward artificial neural networks are acyclic artificial neural networks.

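The rectified linear unit is a popular artificial neural network function. It is widely used in deep learning and is also referred to as the ramp function. It replaces every negative input with zero and leaves every other input unchanged, that is, it maps *x* to max(0, *x*).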
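A one line NumPy sketch of the rectified linear unit, mirroring the `numpy.maximum` call used in the backpropagation code. The sample inputs are made up.

```python
import numpy

def relu(x):
    """Rectified linear unit: zero for negative inputs, the input otherwise."""
    return numpy.maximum(x, 0)

# Made-up sample inputs.
values = relu(numpy.array([-2.0, -0.5, 0.0, 1.5]))
print(values)
```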

Deep learning methods are artificial neural network supervised learning methods involving large numbers of compositions of functions.

Supervised learning inputs are referred to as *features* and *samples*.
Supervised learning outputs are referred to as *labels*, *targets*, *classes* and *categories*.

"The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." — New York Times, July 8, 1958

The perceptron method is one of the earliest and simplest artificial neural network supervised learning methods. It involves the single function *H*(**w** ·

Here is sample Python perceptron method code:

```python
#!/usr/bin/env python3

"""
Implements the perceptron method.

Usage:
    ./perceptron <data file> <data split> <learning rate> <number of epochs>

Data files must be space delimited with one input output pair per line.

initialization steps:
    Input output pairs are shuffled.
    Inputs are min max normalized.
    Weights are set to random values.

Requires NumPy.
"""

import numpy
import sys

def minmax(data):
    """
    Finds the min max normalizations of data.
    """
    return (data - numpy.min(data)) / (numpy.max(data) - numpy.min(data))

def init_data(data_file, data_split):
    """
    Creates the training and testing data.
    """
    data = numpy.loadtxt(data_file)
    numpy.random.shuffle(data)
    data[:, :-1] = minmax(data[:, :-1])
    ones = numpy.ones(data.shape[0])[None].T
    data = numpy.hstack((data[:, :-1], ones, data[:, -1][None].T))
    data_split = int((data_split / 100) * data.shape[0])
    return data[:data_split, :], data[data_split:, :]

def accuracy(data, weights):
    """
    Calculates the accuracies of models.
    """
    model_ = model(data[:, :-1], weights)
    return 100 * (model_ == data[:, -1]).astype(int).mean()

def model(inputs, weights):
    """
    Finds the model outputs.
    """
    return (numpy.matmul(inputs, weights) > 0).astype(int)

def learn(data, learn_rate, n_epochs):
    """
    Learns the weights from data.
    """
    weights = numpy.random.rand(data.shape[1] - 1) / (data.shape[1] - 1)
    for i in range(n_epochs):
        for e in data:
            model_ = model(e[:-1], weights)
            weights += learn_rate * (e[-1] - model_) * e[:-1]
    return weights

train_data, test_data = init_data(sys.argv[1], float(sys.argv[2]))
weights = learn(train_data, float(sys.argv[3]), int(sys.argv[4]))
print(f"weights and bias: {weights}")
print(f"training data accuracy: {accuracy(train_data, weights):.2f}%")
print(f"testing data accuracy: {accuracy(test_data, weights):.2f}%")
```

Here are sample results for a subset of the popular MNIST dataset (Modified National Institute Of Standards And Technology dataset) available from many sources such as Kaggle. Outputs denote whether the inputs correspond to the number eight or not:

```
% ./perceptron MNIST_subset_dataset 80 0.000001 100
weights and bias: [ 1.26322608e-03 1.08497202e-03 1.03465701e-03 6.20197534e-05
8.92840895e-04 3.13696893e-04 9.32305752e-04 5.30571491e-04
9.57601044e-04 9.92650699e-04 4.41735355e-04 9.50010528e-04
7.11471738e-04 1.26831615e-03 7.15789174e-04 1.59426438e-04
...
9.04247841e-04 7.11406621e-04 2.85485411e-04 -3.17756922e-05
6.38906024e-04 9.42761704e-04 1.01108588e-03 3.51662937e-04
8.18848025e-04 5.85304004e-04 1.77400185e-05 1.27172550e-03
-1.72279550e-03]
training data accuracy: 90.04%
testing data accuracy: 85.89%
```

Here is a plot of the accuracy versus the number of epochs for a data split of 80 / 20 and a learning rate of 0.000001. Blue denotes the training data accuracy and orange denotes the testing data accuracy:

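Hyperparameters are settings that alter machine learning methods, such as the learning rate and the number of epochs. They are chosen before learning rather than learned from the data.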

Artificial neural networks (ANNs) are built from functions which correspond to idealized neurons. These functions are referred to as *activation functions* and are organized into sets referred to as *layers* based on their compositions.

Ensemble methods combine the results of multiple machine learning methods.

Min max normalizations transform sets of numbers into ones with the extrema zero and one. Let *m* and *M* denote the minimum and maximum of a set of numbers. The min max normalization of that set replaces every element *x* with (*x* - *m*) / (*M* - *m*).
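The formula above can be sketched in a few lines of NumPy, along the lines of the `minmax` helper in the code samples on this page. The name `min_max` and the sample values are illustrative, and the sketch assumes the maximum differs from the minimum.

```python
import numpy

def min_max(data):
    """Replaces every element x with (x - m) / (M - m).

    Assumes M != m; otherwise the division is undefined.
    """
    data = numpy.asarray(data, dtype=float)
    m, M = data.min(), data.max()
    return (data - m) / (M - m)

# Illustrative values: m = 2 and M = 10.
normalized = min_max([2.0, 4.0, 10.0])
print(normalized)
```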

Models are function approximations created with supervised learning methods.

Epochs are supervised learning steps which process *all* of the input output pairs.

Using supervised learning methods to approximate piecewise continuous functions is referred to as *classification*.
Using supervised learning methods to approximate continuous functions is referred to as *regression*.

Training data are *supervised learning input output pair sets*.

Modes are the most frequent elements of sets.

The k nearest neighbors method is one of the simplest supervised learning methods. It involves finding the most similar inputs in the set of input output pairs.

Here is sample Python code to determine the accuracy of the k nearest neighbors method on data:

```python
#!/usr/bin/env python3

"""
Determines the accuracy of the k nearest neighbors method on data.

Usage:
    ./k_nn <data file> <data split> <number of nearest neighbors>

Data files must be space delimited with one input output pair per line.

initialization steps:
    Input output pairs are shuffled.
    Inputs are min max normalized.

Requires SciPy and NumPy.
"""

import scipy.stats
import numpy
import sys

def minmax(data):
    """
    Finds the min max normalizations of data.
    """
    return (data - numpy.min(data)) / (numpy.max(data) - numpy.min(data))

def init_data(data_file, data_split):
    """
    Creates the model and testing data.
    """
    data = numpy.loadtxt(data_file)
    numpy.random.shuffle(data)
    data[:, :-1] = minmax(data[:, :-1])
    data_split = int((data_split / 100) * data.shape[0])
    return data[:data_split, :], data[data_split:, :]

def accuracy(model_data, test_data, n_nn):
    """
    Calculates the accuracies of models.
    """
    model_ = model(test_data[:, :-1], model_data, n_nn)
    return 100 * (model_ == test_data[:, -1]).astype(int).mean()

def model_(input_, model_data, n_nn):
    """
    model helper function
    """
    # Squared Euclidean distances to every stored input.
    squares = (input_ - model_data[:, :-1]) ** 2
    indices = numpy.sum(squares, 1).argsort()[:n_nn]
    mode = scipy.stats.mode(numpy.take(model_data[:, -1], indices))[0]
    # Newer SciPy versions return a scalar here, older ones an array.
    return numpy.atleast_1d(mode)[0]

def model(inputs, model_data, n_nn):
    """
    Finds the model outputs.
    """
    return numpy.apply_along_axis(lambda e: model_(e, model_data, n_nn), 1,
                                  inputs)

model_data, test_data = init_data(sys.argv[1], float(sys.argv[2]))
n_nn = int(sys.argv[3])
print(f"testing data accuracy: {accuracy(model_data, test_data, n_nn):.2f}%")
```

Here are sample results for the popular Iris flower dataset available from many sources such as Sklearn:

```
% ./k_nn Iris_flower_dataset 80 1
testing data accuracy: 96.67%
% ./k_nn Iris_flower_dataset 80 2
testing data accuracy: 93.33%
```

Symbols are objects that represent *ideas*. Symbol manipulation can correspond to thinking. Therefore, computers can correspond to minds and replace humans at some mental tasks.

Underfitting is the creation of supervised learning continuous function approximations that are too *simple*. Overfitting is the creation of supervised learning continuous function approximations that are too *complex*. Both decrease accuracy.

NumPy arrays are a fundamental data structure of machine learning with Python. Pandas, SciPy, Sklearn and many other libraries use NumPy arrays.

Labeled means *categorized*. Supervised learning methods require labeled inputs.

Statistics is a superset of what is often referred to as data science. The field of statistics predates computers.

Supervised learning methods automate the creation of programs that approximate *functions* of any number of variables. They require many input output pairs.

Intelligence does *not* have a rigorous definition. Rather than trying to define intelligence, Alan Turing suggested trying to create devices that *act* as if they have intelligence. Quality can be measured by their ability to fool people.

Computers are symbol manipulation machines.

Machine learning methods *automatically* create programs. They are useful when creating programs is inconvenient, impractical or even *impossible* for humans.

Many inventions surpass humans in limited ways. Cars move faster than humans. Cranes lift more than humans.
Calculators calculate better than humans. Now an invention surpasses humans at *programming*!