
Vectors, Scalars and Vector Spaces

In 2017, researchers at Google Brain published Attention Is All You Need, introducing the Transformer architecture that now underlies GPT-4, Gemini, and virtually every state-of-the-art language model. At the heart of that architecture (and of every neural network, recommendation system, and computer vision model) is a deceptively simple object: the vector.

When a language model reads the word “bank”, it doesn’t see a string. It sees a vector in a 4096-dimensional space where “bank (financial)” and “bank (riverbank)” occupy measurably different regions. When a search engine decides that your query matches a document, it is computing an angle between two vectors. When a neural network learns, it is moving vectors through space in response to a gradient, itself a vector.

This article builds your working foundation for all of that. By the end, you will be able to:

  • Formally define vectors, scalars, and vector spaces, and explain why the axioms matter.
  • Compute norms, dot products, and inter-vector angles both by hand and in NumPy.
  • Reason geometrically about high-dimensional data, a non-negotiable skill for Machine Learning research.
  • Read a research paper that uses vector notation without losing the thread.

No fluff, let’s start.

Prerequisites

Before reading this article, you should be comfortable with:

  • High school algebra: variables, functions, the coordinate plane.
  • Python basics: lists, loops, functions, importing libraries.
  • Basic calculus intuition (helpful but not required): the idea that a derivative points in the direction of steepest increase.

Intuition first

The programmer’s analogy: vectors as typed arrays with geometric soul

As a developer, you’ve used arrays your whole career. A Python list [3.0, -1.5, 7.2] stores three numbers. A vector is superficially the same thing, but with a crucial additional structure: position in space and the geometry that connects positions.

Think of it this way. If you have two dictionaries in Python:

user_A = {"age": 28, "purchase_freq": 5, "avg_spend": 120.0}
user_B = {"age": 29, "purchase_freq": 6, "avg_spend": 115.0}

As dictionaries, they’re just data blobs. You can read values, but “how similar are these users?” is not a question the dictionary can answer natively. Now convert them to vectors:

A = [28, 5, 120.0]
B = [29, 6, 115.0]

Suddenly you have geometry. You can measure the distance between them, the angle they form relative to the origin, and whether one is a scaled version of another. This is the leap vectors make over plain arrays: they live in a space equipped with rules for measuring, comparing, and transforming.
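A quick sketch of what that geometry buys you, in pure standard-library Python, using the two hypothetical users above:

```python
import math

# Hypothetical user feature vectors: [age, purchase_freq, avg_spend]
A = [28, 5, 120.0]
B = [29, 6, 115.0]

# Euclidean distance: how far apart the two users sit in feature space
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))

# Cosine of the angle they form at the origin: directional similarity
dot = sum(a * b for a, b in zip(A, B))
norm_A = math.sqrt(sum(a * a for a in A))
norm_B = math.sqrt(sum(b * b for b in B))
cos_angle = dot / (norm_A * norm_B)

print(f"distance   = {distance:.3f}")   # sqrt(27) ≈ 5.196
print(f"cos(angle) = {cos_angle:.4f}")  # very close to 1: similar users
```

Both quantities come straight from the numbers; no extra labels or rules are needed. We will define each of these operations formally below.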

Geometric picture: vectors as arrows

Picture a standard 2D coordinate system. The vector $\mathbf{v} = [3, 2]$ is an arrow starting at the origin $(0, 0)$ and ending at the point $(3, 2)$. Two things define it completely: its magnitude (how long the arrow is) and its direction (which way it points).

Visual representation of a two-dimensional vector

This geometric interpretation is not just visual sugar. In Machine Learning, a data point (a row in your dataset) is a vector: an arrow in feature space. Two similar data points are arrows pointing in roughly the same direction. An outlier is an arrow pointing somewhere unexpected. Dimensionality reduction (PCA, UMAP) is the art of finding a lower-dimensional space where those arrows still tell roughly the same story.

Scalars: the simplest case

A scalar is just a single number, no direction, no components. Temperature, loss value, learning rate: all scalars. When you multiply a vector by a scalar, you stretch or shrink the arrow without rotating it:

Same direction, twice as long.

$$ 2 \cdot \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 2 \cdot 3 \\ 2 \cdot 2 \end{pmatrix} = \begin{pmatrix} 6 \\ 4 \end{pmatrix} $$

Same line, direction flipped (180°).

$$ -1 \cdot \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} -1 \cdot 3 \\ -1 \cdot 2 \end{pmatrix} = \begin{pmatrix} -3 \\ -2 \end{pmatrix} $$

This operation, scalar multiplication, is one of the two foundational operations that define a vector space.

When debugging a neural network and the loss explodes, it often means vectors (activations or gradients) are being scaled by factors greater than 1.0 at each layer, compounding exponentially with depth. Understanding scalar multiplication geometrically helps you see why gradient clipping or batch normalisation restores stability: they renormalise the length of those arrows.
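As a toy illustration (the 1.3 scale factor and 50-layer depth are made up for this sketch, not taken from any real network), repeated scalar multiplication compounds exponentially, and clipping undoes it by renormalising:

```python
import math

def l2(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

v = [1.0, 1.0, 1.0]
for _ in range(50):
    v = [1.3 * x for x in v]  # each "layer" scales the vector by 1.3

print(f"norm after 50 layers: {l2(v):.3e}")  # grows like 1.3**50: explosion

# Gradient clipping: rescale so the norm never exceeds a threshold
max_norm = 1.0
norm = l2(v)
if norm > max_norm:
    v = [x * max_norm / norm for x in v]

print(f"norm after clipping:  {l2(v):.3f}")  # back to 1.000
```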

Mathematical derivation

Formal definitions

Scalar

An element of a field $\mathbb{F}$; for our purposes, a real number ($\mathbb{R}$) or complex number ($\mathbb{C}$). Denoted with standard italics, usually Greek letters: $\alpha$, $\beta$, $\lambda$.

Vector

An ordered tuple of scalars from $\mathbb{F}$. An $n$-dimensional real vector is an element of the space $\mathbb{R}^n$:

$$ \mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \in \mathbb{R}^n $$

In plain English: $\mathbf{v}$ is a column of $n$ real numbers. The subscript tells you which "slot" you're in.

Vector space

A set $V$ of vectors over a field $\mathbb{F}$, equipped with two operations under which it is closed.

Vector addition: $\mathbf{u} + \mathbf{v} \in V$ for all $\mathbf{u}, \mathbf{v} \in V$. An internal operation with the following properties:

  • Associativity: $\mathbf{u} + (\mathbf{v} + \mathbf{w}) = (\mathbf{u} + \mathbf{v}) + \mathbf{w}$.
  • Commutativity: $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$.
  • Identity element: there exists a vector $\mathbf{0} \in V$, called the zero vector, such that $\mathbf{v} + \mathbf{0} = \mathbf{v}$ for all $\mathbf{v} \in V$.
  • Inverse element: for every $\mathbf{v} \in V$, there exists an element $-\mathbf{v} \in V$ such that $\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$.

Scalar multiplication: $\alpha \mathbf{v} \in V$ for all $\alpha \in \mathbb{F}$, $\mathbf{v} \in V$. An external operation with the following properties:

  • Associativity: $\alpha (\beta \mathbf{v}) = (\alpha \beta) \mathbf{v}$.
  • Identity element: the multiplicative identity $1 \in \mathbb{F}$ satisfies $1 \mathbf{v} = \mathbf{v}$ for all $\mathbf{v} \in V$.
  • Distributivity of scalar multiplication with respect to vector addition: $\alpha (\mathbf{u} + \mathbf{v}) = \alpha \mathbf{u} + \alpha \mathbf{v}$ for every scalar $\alpha$ and all $\mathbf{u}, \mathbf{v} \in V$.
  • Distributivity of scalar multiplication with respect to scalar addition: $(\alpha + \beta) \mathbf{v} = \alpha \mathbf{v} + \beta \mathbf{v}$ for any scalars $\alpha$, $\beta$ and every $\mathbf{v} \in V$.

What does it mean that the vector space is closed under those operations?

It means that both operations produce results that remain in the same vector space.

  • If $\mathbf{u} \in V$ and $\mathbf{v} \in V$, then $(\mathbf{u} + \mathbf{v}) \in V$.
  • For any scalar $\alpha \in \mathbb{F}$ and vector $\mathbf{v} \in V$, $\alpha \mathbf{v} \in V$.
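These axioms are easy to spot-check numerically for vectors in $\mathbb{R}^n$. A minimal sketch using random samples (the seed, dimension, and tolerance are arbitrary choices for this illustration):

```python
import random

# Sample random vectors in R^5 and two scalars
n = 5
rng = random.Random(42)
u = [rng.uniform(-10, 10) for _ in range(n)]
v = [rng.uniform(-10, 10) for _ in range(n)]
alpha, beta = 2.5, -1.5

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def smul(s, a):
    return [s * x for x in a]

# Commutativity of addition: u + v == v + u
assert add(u, v) == add(v, u)

# Distributivity over scalar addition: (alpha + beta) v == alpha v + beta v
lhs = smul(alpha + beta, v)
rhs = add(smul(alpha, v), smul(beta, v))
assert all(abs(x - y) < 1e-9 for x, y in zip(lhs, rhs))

print("sampled axioms hold numerically")
```

Note the tolerance in the second check: floating-point arithmetic only approximates the field axioms, which is why exact equality is the wrong test for real-valued code.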

Vector addition

Given $\mathbf{u} = [u_1, u_2, \ldots, u_n]^T$ and $\mathbf{v} = [v_1, v_2, \ldots, v_n]^T$:

$$ \mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix} $$

In plain English: add element by element. Geometrically, place the tail of $\mathbf{v}$ at the head of $\mathbf{u}$; the result is the arrow from start to finish, following the parallelogram law.

Vector norms

The norm of a vector measures its length. The most common is the Euclidean norm ($L^2$ norm):

$$ \|\mathbf{v}\|_2 = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} = \sqrt{\sum_{i=1}^{n} v_i^2} $$

In plain English: square each component, sum them, take the square root. This is precisely the Pythagorean theorem generalized to $n$ dimensions.

The general family is the $L^p$ norm:

$$ \|\mathbf{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p} $$

Two special cases appear constantly in Machine Learning (ML):

  • $L^1$ norm (Manhattan):

    Used in LASSO regularisation because it induces sparsity: its gradient penalises every non-zero component equally, which keeps pushing small weights all the way to zero.

    $$ \|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i| $$

  • $L^\infty$ norm (max norm):

    Useful when you care about the single largest activation.

    $$ \|\mathbf{v}\|_\infty = \max_i |v_i| $$

To dig deeper into vector norms, check this Wikipedia article.
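The whole $L^p$ family fits in one function. A minimal sketch (`lp_norm` is a name introduced here, not part of the article's later listing):

```python
import math

def lp_norm(v, p=2):
    """General L^p norm; p = math.inf gives the max (L-infinity) norm."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

v = [3.0, -4.0, 0.0]
print(lp_norm(v, 1))         # 7.0 (Manhattan)
print(lp_norm(v, 2))         # 5.0 (Euclidean: the 3-4-5 triangle)
print(lp_norm(v, math.inf))  # 4.0 (largest absolute component)
```

The three printed values show how the same vector gets a different "length" depending on which norm you choose, which is exactly why the choice of regulariser changes model behaviour.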

The dot product

The dot product (inner product) of two vectors is:

$$ \mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n $$

In plain English: multiply corresponding components and sum the results. The output is a scalar, a single number that encodes how much the two vectors "align".

The dot product of a vector with itself equals its magnitude squared:

$$ \mathbf{v} \cdot \mathbf{v} = \|\mathbf{v}\|^2 $$

The angle between vectors

Here is where geometry and algebra merge beautifully. From Euclidean geometry, the Law of Cosines states that for a triangle with sides $a$, $b$, $c$ and angle $\theta$ opposite side $c$:

$$ c^2 = a^2 + b^2 - 2ab\cos(\theta) $$

Now apply this to vectors. Let $\mathbf{u}$ and $\mathbf{v}$ be two vectors. The third side of the triangle they form is $\mathbf{u} - \mathbf{v}$.

Substituting $a = \|\mathbf{u}\|$, $b = \|\mathbf{v}\|$, $c = \|\mathbf{u} - \mathbf{v}\|$:

$$ \|\mathbf{u} - \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2\|\mathbf{u}\|\|\mathbf{v}\|\cos(\theta) $$

Expanding the left side:

$$ \begin{aligned} \|\mathbf{u} - \mathbf{v}\|^2 &= \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2\|\mathbf{u}\|\|\mathbf{v}\|\cos(\theta) \\ (\mathbf{u} - \mathbf{v}) \cdot (\mathbf{u} - \mathbf{v}) &= \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2\|\mathbf{u}\|\|\mathbf{v}\|\cos(\theta) \\ \|\mathbf{u}\|^2 - 2(\mathbf{u} \cdot \mathbf{v}) + \|\mathbf{v}\|^2 &= \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2\|\mathbf{u}\|\|\mathbf{v}\|\cos(\theta) \end{aligned} $$

Cancelling $\|\mathbf{u}\|^2$ and $\|\mathbf{v}\|^2$ from both sides:

$$ -2(\mathbf{u} \cdot \mathbf{v}) = -2\|\mathbf{u}\|\|\mathbf{v}\|\cos(\theta) $$

Dividing both sides by $-2\|\mathbf{u}\|\|\mathbf{v}\|$ (assuming neither vector is zero):

$$ \frac{-2(\mathbf{u} \cdot \mathbf{v})}{-2\|\mathbf{u}\|\|\mathbf{v}\|} = \frac{-2\|\mathbf{u}\|\|\mathbf{v}\|\cos(\theta)}{-2\|\mathbf{u}\|\|\mathbf{v}\|} $$

$$ \boxed{\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\|\mathbf{v}\|} = \cos(\theta)} $$

In plain English: the cosine of the angle between two vectors equals their dot product divided by the product of their lengths. This formula is foundational: it gives us cosine similarity, one of the most ubiquitous similarity measures in Machine Learning.

Key interpretations:

  • $\cos\theta = 1$ ($\theta = 0°$): vectors point in the same direction (identical topics in a document embedding).
  • $\cos\theta = 0$ ($\theta = 90°$): vectors are orthogonal, completely unrelated.
  • $\cos\theta = -1$ ($\theta = 180°$): vectors point in opposite directions (antonyms in a well-trained embedding space).

In the original Word2Vec paper (Mikolov et al., 2013), the famous analogy:

$$ \text{king} - \text{man} + \text{woman} \approx \text{queen} $$

works precisely because of this geometry. Semantic relationships are encoded as directions in vector space, and finding queen means finding the vector whose cosine similarity to the query vector is maximized. Every modern embedding model (BERT, GPT, sentence-transformers) inherits this geometric philosophy. Next time you read something about word representation in vector spaces, remember they are talking about the same geometry we just derived.

The cross product

The dot product takes two vectors and returns a scalar. The cross product takes two vectors in R3\mathbb{R}^3 and returns a vector, one that is perpendicular to both inputs. It is defined only in three (and seven) dimensions, which makes it more geometrically specialised than the dot product.

Given $\mathbf{u} = [u_1, u_2, u_3]^T$ and $\mathbf{v} = [v_1, v_2, v_3]^T$, the cross product $\mathbf{u} \times \mathbf{v}$ is computed by expanding the following symbolic determinant:

$$ \mathbf{u} \times \mathbf{v} = \begin{vmatrix} \mathbf{e}_1 & \mathbf{e}_2 & \mathbf{e}_3 \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix} $$

Expanding along the first row:

$$ \mathbf{u} \times \mathbf{v} = \mathbf{e}_1(u_2 v_3 - u_3 v_2) - \mathbf{e}_2(u_1 v_3 - u_3 v_1) + \mathbf{e}_3(u_1 v_2 - u_2 v_1) $$

$$ \boxed{\mathbf{u} \times \mathbf{v} = \begin{bmatrix} u_2 v_3 - u_3 v_2 \\ u_3 v_1 - u_1 v_3 \\ u_1 v_2 - u_2 v_1 \end{bmatrix}} $$

In plain English: each component of the result is a $2 \times 2$ determinant built from the other two components of the inputs. The pattern is cyclic: $(2,3)$, $(3,1)$, $(1,2)$.

Two geometric facts define the cross product completely:

Direction: $\mathbf{u} \times \mathbf{v}$ is always orthogonal to both $\mathbf{u}$ and $\mathbf{v}$. You can verify this directly: $(\mathbf{u} \times \mathbf{v}) \cdot \mathbf{u} = 0$ and $(\mathbf{u} \times \mathbf{v}) \cdot \mathbf{v} = 0$. The orientation follows the right-hand rule: curl the fingers of your right hand from $\mathbf{u}$ toward $\mathbf{v}$, and your thumb points in the direction of $\mathbf{u} \times \mathbf{v}$.

Magnitude: the length of the result equals the area of the parallelogram spanned by $\mathbf{u}$ and $\mathbf{v}$:

$$ \|\mathbf{u} \times \mathbf{v}\| = \|\mathbf{u}\|\|\mathbf{v}\|\sin\theta $$

When $\mathbf{u}$ and $\mathbf{v}$ are parallel ($\theta = 0°$), the parallelogram is flat and the cross product is the zero vector. When they are perpendicular ($\theta = 90°$), the parallelogram has maximum area and $\|\mathbf{u} \times \mathbf{v}\|$ is maximised. This is the exact opposite behaviour to the dot product, which is maximised when vectors are parallel and zero when perpendicular.

Key algebraic property, anticommutativity:

$$ \mathbf{u} \times \mathbf{v} = -(\mathbf{v} \times \mathbf{u}) $$

Swapping the order flips the sign and the direction. This means the cross product is not commutative, unlike the dot product.
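The derivation above translates directly into code. A pure-Python sketch in the same spirit as the listing below (`cross_product` is a helper introduced here; it is not part of that listing):

```python
def cross_product(u, v):
    """Cross product of two 3-dimensional vectors."""
    assert len(u) == len(v) == 3, "Cross product requires 3D vectors"
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]

def dot_product(u, v):
    return sum(x * y for x, y in zip(u, v))

u = [1.0, 0.0, 0.0]  # x-axis
v = [0.0, 1.0, 0.0]  # y-axis
w = cross_product(u, v)

print(w)                                     # [0.0, 0.0, 1.0]: z-axis, right-hand rule
print(dot_product(w, u), dot_product(w, v))  # 0.0 0.0: orthogonal to both inputs
print(cross_product(v, u))                   # [0.0, 0.0, -1.0]: anticommutativity
```

The three prints confirm the three properties we just derived: direction by the right-hand rule, orthogonality to both inputs, and the sign flip when the order is swapped.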

Python implementation

Let’s implement everything from scratch, first in pure Python, then verify with NumPy.

Code
"""
Course:  Artificial Intelligence
Module:  Linear Algebra
Article: Vectors, Scalars & Vector Spaces

Pure Python implementation
Python 3.10+
"""
import math
from typing import List


###################
# Vector operations
###################

def vector_add(u: List[float], v: List[float]) -> List[float]:
    """
    Add two vectors element-wise.
    Requires: u and v have the same dimension.
    """
    assert len(u) == len(v), f"Dimension mismatch: {len(u)} vs {len(v)}"
    return [u_i + v_i for u_i, v_i in zip(u, v)]


def scalar_multiply(alpha: float, v: List[float]) -> List[float]:
    """
    Multiply a vector by a scalar.
    Stretches/shrinks the arrow; negative alpha flips its direction.
    """
    return [alpha * v_i for v_i in v]


def l2_norm(v: List[float]) -> float:
    """
    Euclidean (L2) norm: the geometric length of vector v.
    """
    return math.sqrt(sum(v_i ** 2 for v_i in v))


def l1_norm(v: List[float]) -> float:
    """
    Manhattan (L1) norm: sum of absolute values.
    Sparsity-inducing, used in LASSO regularization.
    """
    return sum(abs(v_i) for v_i in v)


def dot_product(u: List[float], v: List[float]) -> float:
    """
    Dot product: sum of element-wise products.
    Returns a scalar measuring how much u and v 'align'.
    Requires: u and v have the same dimension.
    """
    assert len(u) == len(v), f"Dimension mismatch: {len(u)} vs {len(v)}"
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """
    Cosine similarity: dot(u,v) / (||u|| * ||v||).
    Returns a value in [-1, 1].
      1.0  -> identical direction (maximally similar)
      0.0  -> orthogonal (unrelated)
     -1.0  -> opposite direction (maximally dissimilar)
    """
    norm_u = l2_norm(u)
    norm_v = l2_norm(v)
    # Guard against division by zero (zero vector has no direction)
    if norm_u == 0 or norm_v == 0:
        raise ValueError("Cosine similarity undefined for zero vectors.")
    return dot_product(u, v) / (norm_u * norm_v)


def angle_between(u: List[float], v: List[float], degrees: bool = True) -> float:
    """
    Angle between two vectors in radians or degrees.
    Uses: theta = arccos(cosine_similarity(u, v))
    Clamp to [-1, 1] first to guard against floating-point errors
    that push cos slightly outside the valid domain of arccos.
    """
    cos_theta = cosine_similarity(u, v)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # numerical clamp
    theta_rad = math.acos(cos_theta)
    return math.degrees(theta_rad) if degrees else theta_rad


def normalize(v: List[float]) -> List[float]:
    """
    Return the unit vector (length 1) pointing in the same direction as v.
    Formula: v_hat = v / ||v||
    Unit vectors are useful when you care about DIRECTION, not magnitude.
    """
    norm = l2_norm(v)
    if norm == 0:
        raise ValueError("Cannot normalize the zero vector.")
    return [v_i / norm for v_i in v]


#########
# Example
#########

if __name__ == "__main__":
    # Two 3-dimensional vectors, imagine these as two user embeddings
    u = [1.0, 2.0, 3.0]
    v = [4.0, 0.0, -1.0]

    print("==================================")
    print("Vector Operations with pure Python")
    print("==================================")
    print(f"u = {u}")
    print(f"v = {v}")

    print(f"Addition (u + v)        -> {vector_add(u, v)}")
    print(f"Scalar multiply (2 * u) -> {scalar_multiply(2.0, u)}")
    print(f"L2 norm of u (||u||₂)   -> {l2_norm(u):.4f}")
    print(f"L1 norm of u (||u||₁)   -> {l1_norm(u):.4f}")
    print(f"Dot product (u · v)     -> {dot_product(u, v):.4f}")
    print(f"Cosine similarity       -> {cosine_similarity(u, v):.4f}")
    print(f"Angle between u, v.     -> {angle_between(u, v):.2f}°")
    print(f"Unit vector of u (û)    -> {[round(x, 4) for x in normalize(u)]}")
    u_hat = normalize(u)
    print(f"Verify `||û||₂ = 1`     -> {l2_norm(u_hat):.6f}")

    print("=============================")
    print("Semantic similarity mini-demo")
    print("=============================")
    print("In real NLP, these would be word embeddings. Here we illustrate")
    print("the principle with handcrafted feature vectors.")
    print("Features: [royalty_score, masculinity, age, power]")

    king = [0.9, 0.9, 0.8, 0.9]
    queen = [0.9, 0.1, 0.8, 0.9]
    man = [0.0, 0.9, 0.5, 0.4]
    woman = [0.0, 0.1, 0.5, 0.4]
    print(f"king  = {king}")
    print(f"queen = {queen}")
    print(f"man   = {man}")
    print(f"woman = {woman}")

    print("The famous analogy: king - man + woman ≈ queen")
    analogy_vec = vector_add(
        vector_add(king, scalar_multiply(-1.0, man)),
        woman
    )
    sim_to_queen = cosine_similarity(analogy_vec, queen)
    sim_to_king = cosine_similarity(analogy_vec, king)

    print(f"king − man + woman -> {[round(x,2) for x in analogy_vec]}")
    print(f"Cosine similarity to 'queen': {sim_to_queen:.4f}")
    print(f"Cosine similarity to 'king':  {sim_to_king:.4f}")
    print("==> The analogy vector is closer to 'queen' than 'king'.")
Output
>  python vector_pure_python.en.py
==================================
Vector Operations with pure Python
==================================
u = [1.0, 2.0, 3.0]
v = [4.0, 0.0, -1.0]
Addition (u + v)        -> [5.0, 2.0, 2.0]
Scalar multiply (2 * u) -> [2.0, 4.0, 6.0]
L2 norm of u (||u||₂)   -> 3.7417
L1 norm of u (||u||₁)   -> 6.0000
Dot product (u · v)     -> 1.0000
Cosine similarity       -> 0.0648
Angle between u, v.     -> 86.28°
Unit vector of u (û)    -> [0.2673, 0.5345, 0.8018]
Verify `||û||₂ = 1`     -> 1.000000
=============================
Semantic similarity mini-demo
=============================
In real NLP, these would be word embeddings. Here we illustrate
the principle with handcrafted feature vectors.
Features: [royalty_score, masculinity, age, power]
king  = [0.9, 0.9, 0.8, 0.9]
queen = [0.9, 0.1, 0.8, 0.9]
man   = [0.0, 0.9, 0.5, 0.4]
woman = [0.0, 0.1, 0.5, 0.4]
The famous analogy: king - man + woman ≈ queen
king − man + woman -> [0.9, 0.1, 0.8, 0.9]
Cosine similarity to 'queen': 1.0000
Cosine similarity to 'king':  0.8902
==> The analogy vector is closer to 'queen' than 'king'.
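The same numbers can be cross-checked with NumPy (assuming `numpy` is installed), where vectorised calls replace the hand-written loops:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, -1.0])

print(u + v)                     # vector addition -> [5. 2. 2.]
print(2.0 * u)                   # scalar multiplication -> [2. 4. 6.]
print(np.linalg.norm(u))         # L2 norm, ~3.7417
print(np.linalg.norm(u, ord=1))  # L1 norm, 6.0
print(np.dot(u, v))              # dot product, 1.0

cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_theta)                 # ~0.0648

# Same numerical clamp as the pure-Python version, via np.clip
angle = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
print(angle)                     # ~86.28 degrees
```

Matching outputs from two independent implementations is a cheap but effective sanity check, a habit worth keeping for any numerical code you write.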

Machine Learning and AI perspective

Vectors are not a preliminary concept you’ll graduate from, they are the lingua franca of modern AI research. Here are three ways they appear in the context of Artificial Intelligence and Machine Learning.

Embedding spaces and representation learning. Every modern deep learning model learns to represent inputs as vectors. The Transformer’s token embeddings described in the Attention Is All You Need paper are vectors in $\mathbb{R}^{d_{\text{model}}}$ (typically between 512 and 4096 dimensions). The entire training process can be viewed as optimizing the geometry of these vectors so that semantically similar inputs cluster together. Research on contrastive learning explicitly frames the learning objective in terms of pulling similar sample vectors together and pushing dissimilar sample vectors apart in embedding space.

Retrieval-Augmented Generation (RAG) and vector databases. With the explosion of LLMs, a major applied-research direction is efficient nearest-neighbour search over billions of vectors. In the paper, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Lewis et al. showed that augmenting generation with retrieved document vectors dramatically improves factuality. The entire retrieval step is cosine similarity search, the formula we saw before.

Dimensionality. The geometry of high-dimensional spaces is deeply counter-intuitive, a phenomenon called the curse of dimensionality. In very high dimensions, almost all pairs of random vectors become nearly orthogonal ($\cos\theta \approx 0$), which can degrade cosine similarity as a meaningful metric. Understanding when cosine similarity breaks down and what geometric alternatives exist (hyperbolic spaces, Riemannian manifolds) is an active research area. If this interests you, look up Poincaré Embeddings.
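This concentration effect is easy to see in a quick simulation (assuming NumPy is available; the dimensions, seed, and sample count are arbitrary choices for this sketch): cosines of random vector pairs spread widely in $\mathbb{R}^2$ but crowd around zero in $\mathbb{R}^{10000}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, pairs=1000):
    """Average |cos(theta)| over random pairs of Gaussian vectors in R^dim."""
    u = rng.standard_normal((pairs, dim))
    v = rng.standard_normal((pairs, dim))
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    return float(np.mean(np.abs(cos)))

# As the dimension grows, random pairs concentrate around orthogonality
for dim in [2, 10, 100, 10_000]:
    print(f"dim={dim:>6}: mean |cos| = {mean_abs_cosine(dim):.4f}")
```

The printed means shrink steadily with dimension: in 10,000 dimensions, a "random" cosine similarity is essentially zero, so any measurably non-zero similarity between learned embeddings is a strong signal.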

Common pitfalls and debugging

  1. Forgetting to normalize before computing cosine similarity, and when NOT to. Cosine similarity measures direction only, discarding magnitude. If two documents are similar but one is ten times longer, cosine similarity ignores the length difference. Sometimes that’s a bug (when magnitude matters), sometimes a feature (sentiment classification where you want topic, not verbosity). Always ask: should magnitude matter here? If yes, use Euclidean distance instead.

  2. Dimension mismatch silently producing wrong results. In NumPy, np.dot(u, v) raises an error if dimensions don’t match for 1D arrays, but with 2D arrays it performs matrix multiplication, which can silently produce a result with an unexpected shape. Always assert u.shape == v.shape before dot products in research code, or use np.einsum with explicit index notation for clarity.

  3. The zero-vector edge case. cosine_similarity([0,0,0], [1,2,3]) is mathematically undefined (you’re dividing by zero). In production systems that compute embeddings, a zero vector usually signals a bug upstream: an empty input, a bad tokenisation, or a collapsed network layer. If you see NaN losses, check your embedding norms first.

  4. Floating-point precision in arccos. Due to floating-point arithmetic, dot products can occasionally yield cosine values slightly outside $[-1, 1]$ (e.g., 1.0000000002). Passing this directly to math.acos() raises a ValueError. Always clamp: cos_theta = max(-1.0, min(1.0, cos_theta)) before calling arccos.

  5. Confusing $L^1$ and $L^2$ regularisation effects. $L^2$ regularisation (weight decay) shrinks all weights proportionally; it never drives weights exactly to zero. $L^1$ regularisation does drive weights to zero, creating sparse models. This is a direct consequence of the geometry of $L^1$ vs $L^2$ norm balls. Choosing the wrong regulariser is a common source of models that are either too dense (wasted computation) or not sparse enough (poor interpretability).

Summary and what’s next

Key takeaways from this article:

  • A vector is an ordered tuple of scalars that lives in a geometric space; it has both magnitude and direction.
  • A vector space is defined by closure under addition and scalar multiplication; this is why linear operations in neural networks are so well-behaved.
  • The $L^2$ norm measures Euclidean length. The $L^1$ norm measures Manhattan distance and promotes sparsity.
  • The dot product $\mathbf{u} \cdot \mathbf{v}$ measures alignment. Dividing by both norms gives cosine similarity, the angle-based similarity metric at the core of retrieval, embeddings, and attention mechanisms.
  • The formula $\cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\|\mathbf{v}\|}$ is derived from the Law of Cosines; it’s not arbitrary, it’s geometry.
  • High-dimensional vectors are the language in which all of modern Machine Learning is written. Fluency here compounds across every future topic.

Coming up next (Matrices & Linear Transformations): we generalize from single vectors to collections of vectors, introduce matrices as functions that transform vector spaces, and derive the rules of matrix multiplication from first principles. You’ll see exactly why a fully-connected neural network layer is nothing more than a matrix multiplication followed by a nonlinearity.


Cheers for making it this far! I hope this journey through the AI universe has been as fascinating for you as it was for me to write down.

We’re keen to hear your thoughts, so don’t be shy, drop your comments, suggestions, and those bright ideas you’re bound to have.

You’ll find all the code and projects in our GitHub repository learn-software-engineering/examples.

Thanks for being part of this learning community. Keep coding and exploring new territories in this captivating world of computing!
