Importance of dot product in machine learning

In the most basic type of machine learning model, the output is calculated by taking a weighted sum of the input features. Each input is multiplied by a corresponding weight that represents its importance in the model. Once this weighted sum is obtained, a bias term is added to the result. The bias allows the model to adjust the output independently of the input values, helping to improve accuracy and fit the model better to the data. This fundamental approach serves as the foundation for more complex machine learning algorithms.
For a given input instance defined as \(x = [x_0, x_1]\), where \(x_0\) and \(x_1\) are features in the dataset, the output of the model is defined as
\(y = w_0x_0 + w_1x_1 + b\), where \(w_0, w_1\) are weights and \(b\) is the bias.
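As a minimal sketch of this computation (the feature values, weights, and bias below are made-up illustrative numbers, not from any real model):

```python
# Weighted sum of inputs plus a bias: y = w0*x0 + w1*x1 + b
x = [2.0, 3.0]   # input instance [x0, x1] (made-up values)
w = [0.5, -1.0]  # weights [w0, w1] (made-up values)
b = 0.1          # bias (made-up value)

y = w[0] * x[0] + w[1] * x[1] + b
print(y)  # 0.5*2.0 + (-1.0)*3.0 + 0.1 = -1.9
```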
In situations where there are multiple features and weights, we use dot product notation for a compact representation. The dot product of two vectors is the sum of the products of their corresponding elements, and it provides a convenient way to analyze the relationship between them.
Let's say we have two vectors \(x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}\) and \(w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix}\). Their dot product is
\(w \cdot x = w_0x_0 + w_1x_1 + \cdots + w_nx_n\)
In other words, the dot product of two vectors is the sum of the products of their corresponding elements.
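To make the equivalence concrete, here is a small sketch (with arbitrary example vectors) showing the written-out sum next to NumPy's `np.dot`:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])  # example weights (made-up values)
x = np.array([2.0, 3.0, 1.0])   # example inputs (made-up values)

# The dot product written out as a sum of element-wise products...
manual = sum(w_i * x_i for w_i, x_i in zip(w, x))  # w0*x0 + w1*x1 + w2*x2
# ...and the same computation vectorized with NumPy.
vectorized = np.dot(w, x)

print(manual, vectorized)  # both print 0.0 for these values
```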
Let's assume a machine learning model designed to predict a specific target value, represented as \(y\). Instead of reproducing this target exactly, the model produces an output, denoted \(\hat{y}\), which may differ from what we expect. To evaluate the model's performance and understand how accurately it is predicting, we need to calculate the error, defined as the difference between the desired target value \(y\) and the actual output \(\hat{y}\). To measure this discrepancy we use the squared error, which quantifies the squared difference between the predicted and target values and gives us insight into the model's accuracy and where it can improve.
Squared error: \(e^2 = (y - \hat{y})^2\)
The total error across the entire training dataset is obtained by taking the difference between the output vector and the ground-truth vector, squaring each element of this difference, and summing the squared values. This procedure is equivalent to computing the dot product of the difference vector with itself, which gives the squared magnitude, or squared length, of that vector: the dot product of a vector with itself is the square of its L2 norm.
\(E^2 = (Y-\hat{Y}) \cdot (Y-\hat{Y}) = (Y-\hat{Y})^T(Y-\hat{Y}) = \begin{bmatrix} y_0-\hat{y}_0 \\ y_1-\hat{y}_1 \\ \vdots \\ y_n-\hat{y}_n \end{bmatrix} \cdot \begin{bmatrix} y_0-\hat{y}_0 \\ y_1-\hat{y}_1 \\ \vdots \\ y_n-\hat{y}_n \end{bmatrix}\)
\(= (y_0-\hat{y}_0)^2 + (y_1-\hat{y}_1)^2 + \cdots + (y_n-\hat{y}_n)^2\)
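A short sketch of this computation (the target and output vectors are made-up examples):

```python
import numpy as np

Y = np.array([1.0, 2.0, 3.0])      # ground-truth targets (made-up values)
Y_hat = np.array([1.5, 1.5, 2.0])  # model outputs (made-up values)

# Total squared error as the dot product of the difference with itself.
diff = Y - Y_hat
E_squared = np.dot(diff, diff)  # (-0.5)**2 + 0.5**2 + 1.0**2
print(E_squared)                # 1.5
```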
The L2 norm of a vector, often referred to as the Euclidean norm, is a mathematical concept that measures the length or magnitude of a vector in a multi-dimensional space. It is calculated as the square root of the sum of the squares of its components.
The L2 norm of a vector \(V\), denoted \(\|V\|\), is defined as \(\|V\| = \sqrt{V^TV} = \sqrt{v_0^2 + v_1^2 + \cdots + v_n^2}\)
In a machine learning model with an output vector \(\hat{Y}\) and a target vector \(Y\), the error is defined as the magnitude, or L2 norm, of the difference between these vectors.
\(E = \sqrt{(Y-\hat{Y}) \cdot (Y-\hat{Y})} = \sqrt{(Y-\hat{Y})^T(Y-\hat{Y})}\)
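Continuing the sketch above, the error is the square root of that dot product, which is exactly what `np.linalg.norm` computes:

```python
import numpy as np

Y = np.array([1.0, 2.0, 3.0])      # same made-up targets as above
Y_hat = np.array([1.5, 1.5, 2.0])  # same made-up outputs as above
diff = Y - Y_hat

E = np.sqrt(np.dot(diff, diff))  # sqrt of the dot product with itself
print(E, np.linalg.norm(diff))   # both ~1.2247; norm computes the same thing
```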
Feature similarity using dot product
Let's take the example shown below, where each document is a sentence and the words counted for the feature vector are highlighted in bold. The first element of the feature vector counts the occurrences of the word home, and the second counts the occurrences of office.
| id | document | feature vector |
|---|---|---|
| \(d_0\) | I can't wait to go **home** after a long vacation. | \([1,0]\) |
| \(d_1\) | I have the flexibility to work from my **home** **office** three days a week, but I still prefer going into the main **office** for meetings. | \([1,2]\) |
| \(d_2\) | In his new remote setup, his **home** had to function simultaneously as both a quiet **home** environment and a fully operational **home** **office**, blending the comfort of **home** with the structure of the **office** until he couldn't tell where the **home** ended and the **office** began. | \([5,3]\) |
| \(d_3\) | I need to stop by the main **office** to pick up my new employee badge before the meeting starts. | \([0,1]\) |
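As a quick check on these counts, here is a minimal sketch of how such feature vectors could be computed (simple whitespace tokenization with punctuation stripped; a real pipeline would do more):

```python
import string

def feature_vector(text):
    # Lowercase, split on whitespace, strip punctuation, then count the
    # occurrences of "home" and "office".
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    return [words.count("home"), words.count("office")]

documents = {
    "d0": "I can't wait to go home after a long vacation.",
    "d1": ("I have the flexibility to work from my home office three days "
           "a week, but I still prefer going into the main office for meetings."),
    "d2": ("In his new remote setup, his home had to function simultaneously "
           "as both a quiet home environment and a fully operational home "
           "office, blending the comfort of home with the structure of the "
           "office until he couldn't tell where the home ended and the office "
           "began."),
    "d3": ("I need to stop by the main office to pick up my new employee "
           "badge before the meeting starts."),
}

for doc_id, text in documents.items():
    print(doc_id, feature_vector(text))
# d0 [1, 0] / d1 [1, 2] / d2 [5, 3] / d3 [0, 1]
```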
We have a collection of documents, each represented by its own feature vector. To evaluate the similarity between any two documents, we need to assess the similarity between their corresponding feature vectors. In this section, we will explore how the dot product of a pair of vectors can serve as a measure of their similarity.
Feature vectors corresponding to \(d_0\) and \(d_3\) are \(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and \(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\). Their dot product is \(\begin{bmatrix} 1 \\ 0 \end{bmatrix} \cdot \begin{bmatrix} 0 \\ 1 \end{bmatrix} = 1 \cdot 0 + 0 \cdot 1 = 0\). This low score aligns with our intuition that there is no common word of interest between the documents, indicating they are very dissimilar.
Feature vectors corresponding to \(d_1\) and \(d_2\) are \(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\) and \(\begin{bmatrix} 5 \\ 3 \end{bmatrix}\). Their dot product is
\(\begin{bmatrix} 1 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} 5 \\ 3 \end{bmatrix} = 1 \cdot 5 + 2 \cdot 3 = 11\).
This high score aligns with our intuition that the documents share many common words of interest and are similar. Therefore, we can conclude that vectors representing similar content produce larger dot products, indicating a stronger relationship, while vectors representing dissimilar content yield dot products close to zero, reflecting a lack of connection between them.
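Putting the two comparisons together, a short sketch using the feature vectors from the table above:

```python
import numpy as np

# Feature vectors from the table: [count of "home", count of "office"].
d0, d1, d2, d3 = (np.array(v) for v in ([1, 0], [1, 2], [5, 3], [0, 1]))

print(np.dot(d0, d3))  # 0  -> no overlapping words of interest, dissimilar
print(np.dot(d1, d2))  # 11 -> many shared words of interest, similar
```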




