Neural Tech Daily

What are embeddings? Vector representations of text, images, and code, in 2026

An embedding is a learned vector that places similar items close together in a high-dimensional space. The 2026 landscape of models, dimensions, and pitfalls.


The short answer

An embedding is a learned vector of numbers that represents a piece of data, most often a chunk of text but increasingly also images, code, or audio. The vector lives in a high-dimensional space and is trained so that semantically similar inputs end up close together and unrelated inputs end up far apart. Distance in the space is a proxy for meaning.

The aggregated source consensus, across the original word2vec paper, 1 the GloVe paper, 2 the Sentence-BERT paper, 3 and the documentation from current commercial providers, 4 5 6 characterises embeddings as the workhorse representation behind semantic search, retrieval-augmented generation, clustering, recommendation, classification, and anomaly detection. The shift from 2013-era word-level embeddings to 2026-era sentence and multimodal embeddings is a shift in granularity and training objective, not in the underlying idea.

For a developer choosing an embedding model today, the source consensus weights three axes: the task (general English retrieval, multilingual, code, multimodal), the quality as measured by the MTEB benchmark, 7 8 and the cost and deployment model (closed-API per-token billing versus self-hosted open weights).

What “vector representation” actually means

A vector is just an ordered list of numbers. An embedding vector for the sentence “the cat sat on the mat” might be a list of 1,024 floating-point numbers; each individual number carries no human-readable meaning, but the pattern across all 1,024 dimensions, taken together, encodes the sentence’s meaning in a way the embedding model has learned during training.

Two embeddings can be compared. If the angle between the two vectors is small (or equivalently, their cosine similarity is close to 1), the model considers the two inputs semantically similar. If the angle is large, the model considers them unrelated. In that sense, the geometry of the space stands in for meaning.

The dimension of the vector is a model-specific choice. Word2vec used 300 dimensions in its widely-cited 2013 configuration. 1 Sentence-BERT models typically use 384 or 768 dimensions. OpenAI’s text-embedding-3-large produces 3,072 dimensions by default and can be truncated to a smaller size at API request time. 4 Cohere’s embed-english-v3.0 produces 1,024 dimensions. 5 Higher dimensions usually mean more capacity to distinguish fine-grained meaning but also more storage and slower distance computation.

[Image: thumbnail from the Hugging Face MTEB launch blog post, a stylised graphic representing the Massive Text Embedding Benchmark suite, which evaluates embedding models across 58 datasets and 8 task types. Used for editorial coverage of the benchmark.]

Distance metrics: cosine, dot product, Euclidean

Three distance measures dominate practical work.

Cosine similarity measures the angle between two vectors and ignores their magnitude. For vectors $\mathbf{a}$ and $\mathbf{b}$:

$$\mathrm{cos\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \lVert \mathbf{b} \rVert}$$

Cosine similarity returns a value between $-1$ and $1$. Identical vectors give $1$; orthogonal vectors give $0$; opposite vectors give $-1$. Sentence-BERT was explicitly designed to produce embeddings that work well under cosine similarity, and the paper trains the siamese network so that semantically equivalent sentences sit close in cosine space. 3
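The formula translates directly into a few lines of code. The following is a minimal sketch in plain Python; the function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of any vendor API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-D and 3-D vectors (hypothetical, chosen to hit the three landmark values).
same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
```

Real embedding vectors have hundreds or thousands of dimensions, and production systems would normally use a vectorised library rather than Python loops, but the arithmetic is the same.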

Dot product is the numerator of the cosine formula, $\mathbf{a} \cdot \mathbf{b}$, with no normalisation. When the model is trained to produce unit-norm vectors (length 1), dot product and cosine similarity are identical. Most modern embedding APIs return unit-norm vectors precisely so dot product becomes a fast drop-in for cosine.
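The equivalence is easy to check numerically. The sketch below normalises two toy vectors to length 1 and compares their dot product with the cosine similarity of the originals; the helper names `normalise` and `dot` are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Scale a vector to unit (L2) norm."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


# Hypothetical, non-normalised vectors (both have length 5).
a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity computed the long way, on the raw vectors.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Dot product computed on the unit-norm versions.
unit_dot = dot(normalise(a), normalise(b))
```

Both values come out at about 0.96, which is why a vector database that only offers dot-product search can still serve cosine ranking when the stored vectors are unit-norm.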

Euclidean (L2) distance is the straight-line distance between two points in the space:

$$d_{L2}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

For unit-norm vectors, L2 distance and cosine similarity are monotonically related; ranking results by either gives the same order. The choice between the three is largely a matter of which the vector database is optimised for, not which encodes “real” similarity.
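The monotonic relationship follows from expanding the squared distance. A short derivation, assuming both vectors have unit norm ($\lVert \mathbf{a} \rVert = \lVert \mathbf{b} \rVert = 1$):

$$
\begin{aligned}
d_{L2}(\mathbf{a}, \mathbf{b})^2 &= \lVert \mathbf{a} - \mathbf{b} \rVert^2 \\
&= \lVert \mathbf{a} \rVert^2 + \lVert \mathbf{b} \rVert^2 - 2\,\mathbf{a} \cdot \mathbf{b} \\
&= 2 - 2\,\mathrm{cos\_sim}(\mathbf{a}, \mathbf{b})
\end{aligned}
$$

Distance therefore falls as similarity rises: sorting by ascending L2 distance gives the same order as sorting by descending cosine similarity. The identity does not hold for vectors that are not unit-norm.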

How embeddings are trained

The training objective shapes what “similar” ends up meaning to the model. Three families dominate.

Predictive objectives, the word2vec family. Word2vec trains a shallow neural network to predict either the surrounding words from a target word (Skip-gram) or the target word from the surrounding words (Continuous Bag of Words), using a 1.6-billion-word corpus. 1 The vectors that fall out as a byproduct turn out to encode striking analogical structure (the famous king - man + woman ≈ queen result).
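To make the Skip-gram setup concrete, the sketch below generates the (target, context) training pairs that such a model would learn to predict. It covers only the pair-generation step, not the network or its training; the function name `skipgram_pairs` and the window size are assumptions chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours.

    Each token is paired with the tokens up to `window` positions
    to its left and right.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
# The first pairs are ("the", "cat"), ("cat", "the"), ("cat", "sat"), ...
```

Continuous Bag of Words uses the same windows in the opposite direction: the context words are the input and the target word is the prediction.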

Co-occurrence factorisation, the GloVe family. GloVe trains a model directly on the matrix of word co-occurrence counts, factorising it so that the dot product of two word vectors approximates the logarithm of their co-occurrence probability. 2
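For reference, the weighted least-squares objective from the GloVe paper can be written as follows, where $X_{ij}$ is the number of times word $j$ appears in the context of word $i$, $\mathbf{w}_i$ and $\tilde{\mathbf{w}}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^{\top} \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

The weighting function $f$ down-weights rare co-occurrences and caps the influence of very frequent ones:

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

In this formulation the fit is to the logarithm of the raw co-occurrence count, with the bias terms absorbing the per-word normalisation that turns counts into probabilities.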

Contrastive learning, the modern dominant approach. Sentence-BERT and its successors train on pairs (or triplets) of sentences where one pair is labelled “semantically similar” and another labelled “dissimilar”. The objective pushes similar pairs together and dissimilar pairs apart in the vector space. 3 Modern commercial embeddings (OpenAI, Cohere, Voyage) and modern open-source models (BGE, Nomic, E5) are all contrastive-trained at scale on large pair datasets mined from the web. The contrastive objective is what lets a single model serve retrieval, clustering, and classification — the geometry it learns generalises across all three.
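One simple member of the contrastive family is a margin-based triplet loss, the shape of objective the Sentence-BERT paper describes for triplet data. The sketch below is an illustration only: the margin value, the function names, and the toy vectors are assumptions, and large-scale models typically use batched variants rather than one triplet at a time.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the positive is closer to the anchor than the
    negative by at least `margin`; positive otherwise."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2-D embeddings.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

loss_good = triplet_loss(anchor, positive, negative)  # already well separated
loss_bad = triplet_loss(anchor, negative, positive)   # roles swapped: penalised
```

Minimising this quantity over many triplets is what pushes similar pairs together and dissimilar pairs apart.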

The 2026 commercial and open-source landscape

The publicly documented models cluster into a small set of practical choices. The figures below cite vendor pricing pages; prices fluctuate, so verify before purchase on the day of integration.

| Model | Dimensions | Max input tokens | Pricing per 1M tokens | Source |
| --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-large | 3,072 (truncatable) | 8,191 | $0.13 | 4 |
| OpenAI text-embedding-3-small | 1,536 (truncatable) | 8,191 | $0.02 | 4 |
| Cohere embed-english-v3.0 | 1,024 | 512 | $0.10 | 5 |
| Voyage AI voyage-3-large | 1,024–2,048 | 32,000 | $0.18 | 6 |
| Voyage AI voyage-3 | 1,024 | 32,000 | $0.06 | 6 |
| BAAI bge-m3 (open-weights) | 1,024 | 8,192 | self-host | 10 |
| Nomic nomic-embed-text-v1.5 (open-weights) | 768 (truncatable to 64 via Matryoshka) | 8,192 | self-host | 9 |

Pricing as of 2026-05-19; verify on the vendor’s pricing page before integration.
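The cost and storage trade-off between two models can be estimated with back-of-envelope arithmetic. The sketch below uses the two OpenAI rows from the table; the corpus size (one million chunks of 500 tokens each) and the 4-byte float32 storage format are assumptions chosen for illustration, and the prices should be re-checked before use.

```python
def embedding_cost_usd(total_tokens, price_per_million_usd):
    """One-off API cost of embedding a corpus."""
    return total_tokens / 1_000_000 * price_per_million_usd


def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage, ignoring index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value


# Hypothetical corpus: 1,000,000 chunks of 500 tokens each.
num_chunks = 1_000_000
tokens_per_chunk = 500
total_tokens = num_chunks * tokens_per_chunk

# text-embedding-3-large: 3,072 dims at $0.13 per 1M tokens.
cost_large = embedding_cost_usd(total_tokens, 0.13)        # about 65 USD
size_large_gb = index_size_bytes(num_chunks, 3072) / 1e9   # about 12.3 GB

# text-embedding-3-small: 1,536 dims at $0.02 per 1M tokens.
cost_small = embedding_cost_usd(total_tokens, 0.02)        # about 10 USD
size_small_gb = index_size_bytes(num_chunks, 1536) / 1e9   # about 6.1 GB
```

Under these assumptions the one-off embedding bill is small in both cases; the vector storage, which doubles with the dimension count and has to be held by the vector database, is the figure that differs more in practice.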

The Massive Text Embedding Benchmark (MTEB) evaluates models across a published taxonomy of eight task types (classification, clustering, retrieval, semantic textual similarity, reranking, summarisation, bitext mining, pair classification) and across 112 languages. 7 8 The leaderboard is the most-cited cross-vendor comparison surface. Per current rankings reported by aggregator sources tracking the leaderboard, Cohere embed-v4, OpenAI text-embedding-3-large, and Voyage’s voyage-3-large cluster at the top of the English-retrieval table, with open-weight bge-m3 and nomic-embed-text-v1.5 a few benchmark points behind. Sources flag that MTEB scores have known limitations: a model can be over-optimised against the benchmark’s specific tasks and still underperform on a downstream production workload, so the leaderboard is a starting filter rather than the final word.

What embeddings are used for

Five workflows account for most of the production activity.

Semantic search. Index a corpus by embedding every document, embed an incoming query the same way, return the documents whose vectors are closest to the query vector. The advantage over keyword search is that the query “how do I cancel my subscription” matches a document titled “ending your paid plan” even with zero word overlap.
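The index-then-query loop can be sketched as a brute-force scan. In the example below the three-dimensional vectors are hand-written stand-ins for real model output, and the document identifiers and the `top_k` helper are hypothetical; a real system would call an embedding model and usually an approximate nearest-neighbour index instead.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus, k=2):
    """Score every document against the query and return the k best
    as (doc_id, score) pairs, highest score first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical pre-computed document embeddings.
corpus = {
    "ending-your-paid-plan": [0.9, 0.1, 0.0],
    "resetting-your-password": [0.1, 0.9, 0.1],
    "shipping-times": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of "how do I cancel my subscription".
query_vec = [0.8, 0.2, 0.1]

results = top_k(query_vec, corpus, k=2)
best_doc = results[0][0]
```

With these toy vectors the plan-cancellation document ranks first even though the query shares no words with its title, which is the behaviour the paragraph describes.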

Retrieval-augmented generation. The retrieval step inside a RAG pipeline is usually a vector search over a chunked corpus, using an embedding model. The top-$k$ chunks get inserted into the prompt of a generation model. Embedding quality directly affects RAG answer quality; a weak embedding model surfaces irrelevant chunks and the generator confabulates over them.

Clustering and topic discovery. Embed a large corpus, run a clustering algorithm (k-means, HDBSCAN) over the vectors, inspect the clusters. The clusters correspond approximately to topics the embedding model considered coherent.

Classification on small data. Embed a small labelled dataset, train a lightweight classifier (logistic regression, k-nearest-neighbours) on top of the embeddings. This works well when fine-tuning the whole model is overkill and few-shot prompting is too brittle.
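A k-nearest-neighbours classifier over embeddings needs very little code. The sketch below is illustrative: the two-dimensional vectors stand in for real embeddings, and the labels, the `knn_predict` helper, and the choice of `k=3` are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query_vec, labelled, k=3):
    """Majority label among the k labelled vectors most similar to the query."""
    ranked = sorted(labelled,
                    key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled support tickets, already embedded.
labelled = [
    ([0.9, 0.1], "billing"),
    ([0.8, 0.3], "billing"),
    ([0.1, 0.9], "technical"),
    ([0.2, 0.8], "technical"),
    ([0.3, 0.9], "technical"),
]

prediction = knn_predict([0.85, 0.2], labelled, k=3)
```

Because the embedding model is frozen, adding a new class only requires a handful of labelled examples, not a retraining run.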

Anomaly and duplicate detection. Items whose embedding is unusually far from any cluster center are anomalies; items whose embeddings are unusually close to another item are near-duplicates.
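Both checks reduce to thresholding a similarity score. The sketch below is a simplification: it compares every pair of items (quadratic in corpus size) and measures anomalies against a single centroid rather than several cluster centres. The thresholds, item identifiers, and helper names are assumptions that would need tuning on real data.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def near_duplicates(items, threshold=0.95):
    """Pairs of item ids whose vectors are at least `threshold` similar."""
    return [(id_a, id_b)
            for (id_a, vec_a), (id_b, vec_b) in combinations(items.items(), 2)
            if cosine_similarity(vec_a, vec_b) >= threshold]


def anomalies(items, centroid, threshold=0.5):
    """Item ids whose similarity to the centroid falls below `threshold`."""
    return [item_id for item_id, vec in items.items()
            if cosine_similarity(vec, centroid) < threshold]


# Hypothetical embedded items and a hypothetical cluster centre.
items = {
    "a": [1.0, 0.0],
    "b": [0.99, 0.05],
    "c": [0.9, 0.3],
    "d": [-0.2, 1.0],
}
centroid = [1.0, 0.1]

dupes = near_duplicates(items, threshold=0.99)
outliers = anomalies(items, centroid, threshold=0.5)
```

Here items "a" and "b" are flagged as near-duplicates and "d" as the outlier.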

Multimodal and code embeddings

Text is the most-cited modality but the same idea extends. Image embeddings from models like OpenCLIP and SigLIP map an image to a vector in the same space as text embeddings, so a text query can retrieve images. Code embeddings from Voyage’s voyage-code-3 and OpenAI’s embedding models trained on code map source files to vectors so a natural-language description can retrieve relevant code, and a code snippet can find similar snippets in a repository. 6 4 The training objective differs in detail (typically a contrastive image-text or code-text pairing) but the downstream usage pattern is identical to text embeddings.

Common pitfalls

Five things trip up first-time users.

First, chunk size matters more than model choice for most RAG deployments. A high-end embedding model on naively-chunked data underperforms a mid-tier embedding model on well-chunked data. Most embedding models have a maximum input length; exceeding it silently truncates or fails. Cohere’s embed-english-v3.0 caps at 512 tokens, which is short enough that careless chunking discards meaningful context. 5
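A common way to stay under an input cap without cutting context off at hard boundaries is to split into overlapping windows. The sketch below works on a list of tokens; it assumes the tokens come from the target model's own tokenizer (the stand-in list here is synthetic), and the 64-token overlap and the `chunk_tokens` name are illustrative choices, not recommendations from any vendor.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=64):
    """Split a token list into windows of at most `max_tokens`,
    each sharing `overlap` tokens with the previous window."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Synthetic stand-in for a tokenised 1,200-token document.
tokens = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, max_tokens=512, overlap=64)
chunk_lengths = [len(c) for c in chunks]
```

Fixed windows are only a baseline. Splitting on headings, paragraphs, or sentences first and applying the token cap second usually keeps related context together better.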

Second, distance is not meaning. The space the model learned encodes whatever pattern the training data optimised for. A model trained on web text will rank “the cat sat on the mat” and “a feline rested on the rug” as similar because the training data correlates them; on a domain the model didn’t see in training, the geometry can collapse.

Third, embeddings are not interchangeable. A query embedded by OpenAI cannot be searched against a corpus embedded by Cohere; the vector spaces are unrelated. Switching models means re-embedding the entire corpus.

Fourth, dimension truncation is not free. Modern models like text-embedding-3-large and nomic-embed-text-v1.5 support truncating the output vector via Matryoshka representation learning, which preserves most quality at smaller dimensions. 4 9 Older models do not have this property and cannot be safely truncated.
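When truncation is done client-side rather than through an API parameter, the shortened vector is no longer unit-norm and needs rescaling before dot-product search. The sketch below assumes the model was trained with a Matryoshka-style objective, so that the leading dimensions carry most of the signal; the function name and the four-dimensional toy vector are illustrative.

```python
import math


def truncate_and_renormalise(vec, dims):
    """Keep the first `dims` values and rescale the result to unit norm."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]


# Toy unit-norm vector standing in for a full-size embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_renormalise(full, 2)
short_norm = math.sqrt(sum(x * x for x in short))
```

The same truncation must be applied to both the corpus vectors and the query vectors, and the quality loss should be measured on the target data before it is relied on.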

Fifth, MTEB is not a guarantee. A model that tops the English retrieval task on MTEB can underperform on a domain-specific corpus (medical records, legal contracts, internal documentation). Always evaluate on a held-out sample of the actual target data before committing.
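A held-out evaluation can be as simple as a hit rate: the fraction of test queries for which at least one known-relevant document appears in the top $k$ results. The sketch below assumes a small hand-labelled set of query-to-relevant-document judgements and a table of ranked results from the candidate model; all identifiers are hypothetical.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant doc in the top k.

    retrieved: query id -> ranked list of doc ids from the candidate model
    relevant:  query id -> set of doc ids judged relevant
    """
    if not relevant:
        return 0.0
    hits = 0
    for query_id, relevant_ids in relevant.items():
        top = retrieved.get(query_id, [])[:k]
        if any(doc_id in relevant_ids for doc_id in top):
            hits += 1
    return hits / len(relevant)


# Hypothetical relevance judgements and ranked results.
relevant = {"q1": {"d1"}, "q2": {"d7"}, "q3": {"d4", "d5"}}
retrieved = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d3", "d9", "d7"],
    "q3": ["d8", "d2", "d6"],
}

hit_at_1 = hit_rate_at_k(retrieved, relevant, k=1)
hit_at_3 = hit_rate_at_k(retrieved, relevant, k=3)
```

Running the same judgements against two or three candidate models gives a like-for-like comparison on the data that actually matters, which the public leaderboard cannot provide.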

Honest caveats

This explainer covers the embedding families that dominate the 2026 production landscape. It does not cover every variant: ColBERT-style late-interaction embeddings, instruction-tuned embeddings, hyperbolic embeddings, and graph embeddings each have their own literature and use cases.

The pricing figures cite each vendor’s pricing surface as of 2026-05-19, surfaced via the OpenAI embeddings guide, the Cohere model documentation, and the Voyage AI pricing page. The OpenAI and Cohere pages were not directly fetchable in the research session; the figures reflect the vendor pages as reported by aggregator coverage of those surfaces, and the cited vendor URLs in the footnotes are the canonical references readers should verify against on the day of integration. Vendor pricing for embeddings drifts on a roughly quarterly cadence.

MTEB scores cited here come from the leaderboard rather than from an independent reproduction. The leaderboard is community-curated and individual scores can move when a model card is updated.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

  1. Mikolov, Chen, Corrado, Dean. Efficient Estimation of Word Representations in Vector Space (arXiv:1301.3781, 2013). The original word2vec paper, introducing Skip-gram and CBOW. Reports learning high-quality 300-dimensional word vectors from a 1.6-billion-word corpus.
  2. Pennington, Socher, Manning. GloVe: Global Vectors for Word Representation (EMNLP 2014). The co-occurrence-factorisation embedding model and its weighted least-squares training objective.
  3. Reimers, Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (arXiv:1908.10084, EMNLP-IJCNLP 2019). Siamese / triplet training for sentence embeddings designed to be compared by cosine similarity; reduces 65-hour BERT pairwise comparison to about 5 seconds.
  4. OpenAI embeddings guide: `text-embedding-3-large` (3,072 dims, truncatable via dimensions parameter) and `text-embedding-3-small` (1,536 dims). Pricing reported by aggregator coverage of the OpenAI pricing page at \$0.13 / 1M tokens for `3-large` and \$0.02 / 1M tokens for `3-small`; verify on the OpenAI pricing surface before integration.
  5. Cohere Embed model documentation: `embed-english-v3.0` produces 1,024-dimensional embeddings with a 512-token input window; `$0.10 / 1M tokens` per aggregator coverage of the Cohere pricing surface.
  6. Voyage AI pricing: `voyage-3-large` at \$0.18 / 1M tokens with 1,024-dim int8 or 2,048-dim float output; `voyage-3` at \$0.06 / 1M tokens with 1,024 dims; `voyage-code-3` for code embedding.
  7. Muennighoff, Tazi, Magne, Reimers. MTEB: Massive Text Embedding Benchmark (arXiv:2210.07316). Per the abstract verbatim: "MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages."
  8. MTEB leaderboard on Hugging Face Spaces: community-curated ranking of embedding models across the MTEB task suite.
  9. Nomic AI: `nomic-embed-text-v1.5` model card. 768-dim default output, Matryoshka representation supporting truncation down to 64 dims; 8,192-token context window; Apache 2.0 licence.
  10. BAAI: `bge-m3` model card. 1,024-dim multilingual embedding model with 8,192-token context window covering 100+ languages.
