Learn-Software.com

Vectors, Scalars and Vector Spaces

In 2017, researchers at Google Brain published Attention Is All You Need, introducing the Transformer architecture that now underlies GPT-4, Gemini, and virtually every state-of-the-art language model. At the heart of that architecture (and of every neural network, recommendation system, and computer vision model) is a deceptively simple object: the vector.

When a language model reads the word “bank”, it doesn’t see a string. It sees a vector in a 4096-dimensional space where “bank (financial)” and “bank (riverbank)” occupy measurably different regions. When a search engine decides that your query matches a document, it is computing an angle between two vectors. When a neural network learns, it is moving vectors through space in response to a gradient, itself a vector.

This article builds your working foundation for all of that. By the end, you will be able to:

No fluff, let’s start.

Prerequisites

Before reading this article, you should be comfortable with:

Intuition first

The programmer’s analogy: vectors as typed arrays with geometric soul

As a developer, you’ve used arrays your whole career. A Python list [3.0, -1.5, 7.2] stores three numbers. A vector is superficially the same thing, but with a crucial additional structure: position in space and the geometry that connects positions.

Think of it this way. If you have two dictionaries in Python:

user_A = {"age": 28, "purchase_freq": 5, "avg_spend": 120.0}
user_B = {"age": 29, "purchase_freq": 6, "avg_spend": 115.0}

As dictionaries, they’re just data blobs. You can read values, but “how similar are these users?” is not a question the dictionary can answer natively. Now convert them to vectors:

A = [28, 5, 120.0]
B = [29, 6, 115.0]

Suddenly you have geometry. You can measure the distance between them, the angle they form relative to the origin, and whether one is a scaled version of another. This is the leap vectors make over plain arrays: they live in a space equipped with rules for measuring, comparing, and transforming.

Geometric picture: vectors as arrows

Picture a standard 2D coordinate system. The vector (\mathbf{v} = [3, 2]) is an arrow starting at the origin ((0, 0)) and ending at the point ((3, 2)). Two things define it completely: its magnitude (how long the arrow is) and its direction (which way it points).

This geometric interpretation is not just visual sugar. In Machine Learning, a data point (a row in your dataset) is a vector, it’s an arrow in feature space. Two similar data points are arrows pointing in roughly the same direction. An outlier is an arrow pointing somewhere unexpected. Dimensionality reduction (PCA, UMAP) is the art of finding a lower-dimensional space where those arrows still tell roughly the same story.

Scalars: the simplest case

A scalar is just a single number, no direction, no components. Temperature, loss value, learning rate: all scalars. When you multiply a vector by a scalar, you stretch or shrink the arrow without rotating it:

Same direction vector, twice as long. \(2 \cdot \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 2 \cdot 3 \\ 2 \cdot 2 \end{pmatrix} = \begin{pmatrix} 6 \\ 4 \end{pmatrix}\)

2 x [3, 2] = [6, 4]

Same direction vector, flipped (180°). \(-1 \cdot \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} -1 \cdot 3 \\ -1 \cdot 2 \end{pmatrix} = \begin{pmatrix} -3 \\ -2 \end{pmatrix}\)

-1 x [3, 2] = [-3, -2]

This operation, scalar multiplication, is one of the two foundational operations that define a vector space.

When debugging a neural network and the loss explodes, it often means vectors (activations or gradients) are being scaled up by factors much greater than 1.0 each layer. Understanding scalar multiplication geometrically helps you see why gradient clipping or batch normalisation restores stability: they’re renormalizing the length of those arrows.

Mathematical derivation

Formal definitions

Scalar

An element of a field (\mathbb{F}), for our purposes, a real number (\mathbb{R}) or complex number (\mathbb{C}). Denoted with standard italics and usually greek characters: (\alpha), (\beta), (\lambda).

Vector

An ordered tuple of scalars from (\mathbb{F}). An (n)-dimensional real vector is an element of the space (\mathbb{R}^n):

\[\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \in \mathbb{R}^n\]

In plain English: (\mathbf{v}) is a column of (n) real numbers. The subscript tells you which “slot” you’re in.

Vectorial space

A set (V) of vectors over a field (\mathbb{F}), equipped with two operations, for which it is closed.

Vector addition: (\mathbf{u} + \mathbf{v} \in V) for all (\mathbf{u}, \mathbf{v} \in V). Internal operation with the following properties:

Scalar multiplication: (\alpha \mathbf{v} \in V) for all (\alpha \in \mathbb{F}, \mathbf{v} \in V). External operation with the following properties:

What does it mean that the vector space is closed for those operations?

It means that they produce results that are in the same vector space.

Vector addition

Given (\mathbf{u} = [u_1, u_2, \ldots, u_n]^T) and (\mathbf{v} = [v_1, v_2, \ldots, v_n]^T):

\[\mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix}\]

In plain English: add element-by-element. Geometrically, place the tail of (\mathbf{v}) at the head of (\mathbf{u}), the result is the arrow from start to finish following the parallelogram law.

Vector norms

The norm of a vector measures its length. The most common is the Euclidean norm ((L^2) norm):

\[\|\mathbf{v}\|_2 = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} = \sqrt{\sum_{i=1}^{n} v_i^2}\]

In plain English: square each component, sum them, take the square root. This is precisely the Pythagorean theorem generalized to (n) dimensions.

The general family is the (L^p) norm:

\[\|\mathbf{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}\]

Two special cases appear constantly in Machine Learning (ML):

To dig more about Vector norms, check this Wikipedia article.

The dot product

The dot product (inner product) of two vectors is:

\[\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n\]

In plain English: multiply corresponding components and sum the results. The output is a scalar, a single number that encodes how much the two vectors “align”.

The dot product of a vector and itself results in its magnitude squared. \(\mathbf{v} \cdot \mathbf{v} = \|\mathbf{v}\|^2\)

The angle between vectors

Here is where geometry and algebra merge beautifully. From Euclidean geometry, the Law of Cosines states that for a triangle with sides (a), (b), (c) and angle (\theta) opposite side (c):

\[c^2 = a^2 + b^2 - 2 a b \cos(\theta)\]

Now apply this to vectors. Let (\mathbf{u}) and (\mathbf{v}) be two vectors. The third side of the triangle they form is (\mathbf{u} - \mathbf{v}).

Substituting (a = |\mathbf{u}|), (b = |\mathbf{v}|), (c = |\mathbf{u} - \mathbf{v}|):

\[\|\mathbf{u} - \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2 \|\mathbf{u}\| \|\mathbf{v}\|\ cos(\theta)\]

Expanding the left side:

\[\|\mathbf{u} - \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2 \|\mathbf{u}\| \|\mathbf{v}\| \cos(\theta)\] \[(\mathbf{u} - \mathbf{v}) \cdot (\mathbf{u} - \mathbf{v}) = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2 \|\mathbf{u}\| \|\mathbf{v}\| \cos(\theta)\] \[\|\mathbf{u}\|^2 - 2 (\mathbf{u} \cdot \mathbf{v}) + \|\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2 \|\mathbf{u}\| \|\mathbf{v}\| \cos(\theta)\]

Cancelling (|\mathbf{u}|^2) and (|\mathbf{v}|^2) from both sides: \(-2 (\mathbf{u} \cdot \mathbf{v}) = -2 \|\mathbf{u}\| \|\mathbf{v}\| \cos(\theta)\)

Dividing both sides by (-2 |\mathbf{u}| |\mathbf{v}|) (assuming neither vector is zero): \(\frac{-2 (\mathbf{u} \cdot \mathbf{v})}{-2 \|\mathbf{u}\| \|\mathbf{v}\|} = \frac{-2 \|\mathbf{u}\| \|\mathbf{v}\| \cos(\theta)}{-2 \|\mathbf{u}\| \|\mathbf{v}\|}\)

\[\boxed{\frac{(\mathbf{u} \cdot \mathbf{v})}{\|\mathbf{u}\| \|\mathbf{v}\|} = \cos(\theta)}\]

In plain English: the cosine of the angle between two vectors equals their dot product divided by the product of their lengths. This formula is foundational, it gives us cosine similarity, one of the most ubiquitous distance metrics in Machine Learning.

Key interpretations:

In the original Word2Vec paper (Mikolov et al., 2013), the famous analogy: \(king − man + woman ≈ queen\) works precisely because of this geometry. Semantic relationships are encoded as directions in vector space, and finding queen means finding the vector whose cosine similarity to the query vector is maximized. Every modern embedding model (BERT, GPT, sentence-transformers) inherits this geometric philosophy. Next time you read something about word representation in vector spaces, remember they are talking about the same geometry we just derived.

Python implementation

Let’s implement everything from scratch, first in pure Python, then verify with NumPy.

  ```bash
  >  python vector_pure_python.en.py
  ==================================
  Vector Operations with pure Python
  ==================================
  u = [1.0, 2.0, 3.0]
  v = [4.0, 0.0, -1.0]
  Addition (u + v)        -> [5.0, 2.0, 2.0]
  Scalar multiply (2 * u) -> [2.0, 4.0, 6.0]
  L2 norm of u (||u||₂)   -> 3.7417
  L1 norm of u (||u||₁)   -> 6.0000
  Dot product (u · v)     -> 1.0000
  Cosine similarity       -> 0.0648
  Angle between u, v.     -> 86.28°
  Unit vector of u (û)    -> [0.2673, 0.5345, 0.8018]
  Verify `||û||₂ = 1`     -> 1.000000
  =============================
  Semantic similarity mini-demo
  =============================
  In real NLP, these would be word embeddings. Here we illustrate
  the principle with handcrafted feature vectors.
  Features: [royalty_score, masculinity, age, power]
  king  = [0.9, 0.9, 0.8, 0.9]
  queen = [0.9, 0.1, 0.8, 0.9]
  man   = [0.0, 0.9, 0.5, 0.4]
  woman = [0.0, 0.1, 0.5, 0.4]
  The famous analogy: king - man + woman ≈ queen
  king − man + woman -> [0.9, 0.1, 0.8, 0.9]
  Cosine similarity to 'queen': 1.0000
  Cosine similarity to 'king':  0.8902
  ==> The analogy vector is closer to 'queen' than 'king'.
  ```

  
  

  


  ```bash
  >  python vector_numpy.en.py
  ============================
  Vector Operations with Numpy
  ============================
  u = [1. 2. 3.]
  v = [ 4.  0. -1.]
  Addition (u + v)        -> [5. 2. 2.]
  Scalar multiply (2 * u) -> [2. 4. 6.]
  L2 norm of u (||u||₂)   -> 3.7417
  L1 norm of u (||u||₁)   -> 6.0000
  Dot product (u · v)     -> 1.0000
  Cosine similarity       -> 0.0648
  Angle between u, v.     -> 86.28°
  Unit vector of u (û)    -> [np.float64(0.2673), np.float64(0.5345), np.float64(0.8018)]
  Verify `||û||₂ = 1`     -> 1.000000
  =============================
  Semantic similarity mini-demo
  =============================
  In real NLP, these would be word embeddings. Here we illustrate
  the principle with handcrafted feature vectors.
  Features: [royalty_score, masculinity, age, power]
  king  = [0.9 0.9 0.8 0.9]
  queen = [0.9 0.1 0.8 0.9]
  man   = [0.  0.9 0.5 0.4]
  woman = [0.  0.1 0.5 0.4]
  The famous analogy: king - man + woman ≈ queen
  king − man + woman -> [np.float64(0.9), np.float64(0.1), np.float64(0.8), np.float64(0.9)]
  Cosine similarity to 'queen': 1.0000
  Cosine similarity to 'king':  0.8902
  ==> The analogy vector is closer to 'queen' than 'king'.
  ```

Machine Learning and AI perspective

Vectors are not a preliminary concept you’ll graduate from, they are the lingua franca of modern AI research. Here are three ways they appear in the context of Artificial Intelligence and Machine Learning.

Embedding spaces and representation learning. Every modern deep learning model learns to represent inputs as vectors. The Transformer’s token embeddings described in the Attention Is All You Need paper are vectors in (\mathbb{R}^{d_{model}}) (typically between 512 and 4096 dimensions). The entire training process can be viewed as optimizing the geometry of these vectors so that semantically similar inputs cluster together. Research on contrastive learning explicitly frames the learning objective in terms of pushing similar sample vectors together and dissimilar sample vectors apart in embedding space.

Retrieval-Augmented Generation (RAG) and vector databases. With the explosion of LLMs, a major applied-research direction is efficient nearest-neighbour search over billions of vectors. In the paper, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Lewis et al. showed that augmenting generation with retrieved document vectors dramatically improves factuality. The entire retrieval step is cosine similarity search, the formula we saw before.

Dimensionality. The geometry of high-dimensional spaces is deeply counter-intuitive, a phenomenon called the curse of dimensionality. In very high dimensions, almost all pairs of vectors become nearly orthogonal ((\cos\theta \approx 0)), which can degrade cosine similarity as a meaningful metric. Understanding when cosine similarity breaks down and what geometric alternatives exist (hyperbolic spaces, Riemannian manifolds) is an active research area. If this interests you, look up Poincaré Embeddings.

Common pitfalls and debugging

  1. Forgetting to normalize before computing cosine similarity, and when NOT to. Cosine similarity measures direction only, discarding magnitude. If two documents are similar but one is ten times longer, cosine similarity ignores the length difference. Sometimes that’s a bug (when magnitude matters), sometimes a feature (sentiment classification where you want topic, not verbosity). Always ask: should magnitude matter here? If yes, use Euclidean distance instead.

  2. Dimension mismatch silently producing wrong results. In NumPy, np.dot(u, v) will raise an error if dimensions don’t match for 1D arrays, but with 2D arrays (matrices), NumPy may broadcast in ways that give a result with the wrong shape. Always .assert u.shape == v.shape before dot products in research code, or use np.einsum with explicit index notation for clarity.

  3. The zero-vector edge case. cosine_similarity([0,0,0], [1,2,3]) is mathematically undefined (you’re dividing by zero). In production systems that compute embeddings, a zero vector usually signals a bug upstream: an empty input, a bad tokenisation, or a collapsed network layer. If you see NaN losses, check your embedding norms first.

  4. Floating-point precision in arccos. Due to floating-point arithmetic, dot products can occasionally yield cosine values slightly outside ([-1, 1]) (e.g., 1.0000000002). Passing this directly to math.acos() raises a ValueError. Always clamp: cos_theta = max(-1.0, min(1.0, cos_theta)) before calling arccos.

  5. Confusing (L^1) and (L^2) regularisation effects. (L^2) regularisation (weight decay) shrinks all weights proportionally, it never drives weights exactly to zero. (L^1) regularisation does drive weights to zero, creating sparse models. This is a direct consequence of the geometry of (L^1) vs (L^2) norm balls. Choosing the wrong regulariser is a common source of models that are either too dense (wasted computation) or not sparse enough (poor interpretability).

Summary and what’s next

Key takeaways from this article:

Coming up in (Matrices & Linear Transformations): We generalize from single vectors to collections of vectors, introduce matrices as functions that transform vector spaces, and derive the rules of matrix multiplication from first principles. You’ll see exactly why a fully-connected neural network layer is nothing more than a matrix multiplication followed by a nonlinearity.


Cheers for making it this far! I hope this journey through the AI universe has been as fascinating for you as it was for me to write down.

We’re keen to hear your thoughts, so don’t be shy, drop your comments, suggestions, and those bright ideas you’re bound to have.

You’ll find all the code and projects in our GitHub repository learn-software-engineering/examples.

Thanks for being part of this learning community. Keep coding and exploring new territories in this captivating world of computing!