

You can say that the whole system is deterministic and finite, so in principle you could record every input-output pair. But you could do that for any program, and it wouldn't make every deterministic function a Markov process; it would only mean the function is finitely representable. The question is not whether the function can be stored. The question is whether its behavior matches the structure and assumptions of a Markov model. In the case of LLMs, it does not.
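A minimal sketch of that distinction in Python (the function `f` below is a made-up stand-in, not an LLM): any deterministic function over a finite domain can be tabulated, yet the existence of the table says nothing about whether the computation has Markov structure.

```python
from itertools import product

# Hypothetical stand-in function: it depends on the *entire* history in a
# structured way, not on any single "current state".
def f(history):
    return sum(tok * (i + 1) for i, tok in enumerate(history)) % 7

# Enumerate every input up to length 2 over a tiny alphabet and tabulate it.
alphabet = [0, 1, 2]
table = {hist: f(hist) for n in (1, 2) for hist in product(alphabet, repeat=n)}

# The table is finite and complete, but nothing about its existence makes f
# Markovian: finite representability and Markov structure are independent.
print(len(table), table[(1, 2)])   # 12 entries; f((1, 2)) == 5
```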
Inference does not become a Markov chain simply because it returns a distribution computed from the current input. It is a sequence of deep functional computations in which attention mechanisms model hierarchical, relational, and positional structure in language. That does not align with the definition or behavior of a Markov model, even if both map a state to a probability distribution. The structure of the computation, not just the input-output determinism, is what matters.
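To make the contrast concrete, here is a schematic sketch (toy vocabulary, random weights standing in for learned parameters, the averaging step standing in for many real layers): a Markov chain's next-token distribution is a row lookup in a fixed matrix, while an LLM-style step computes the distribution from the whole context.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5                                   # toy vocabulary size

# Markov chain: the next-token distribution is one row of a fixed matrix.
P = rng.dirichlet(np.ones(V), size=V)   # V x V stochastic transition matrix
current_token = 2
markov_dist = P[current_token]          # a lookup, nothing else

# LLM-style step (schematic): the distribution is *computed* from the whole
# context by a deep function, not looked up.
d = 8
E = rng.normal(size=(V, d))             # embedding table (placeholder weights)
W = rng.normal(size=(d, V))             # output projection (placeholder weights)

def next_token_dist(context):
    h = E[context].mean(axis=0)         # crude stand-in for many deep layers
    logits = h @ W
    z = np.exp(logits - logits.max())
    return z / z.sum()                  # softmax over the vocabulary

llm_dist = next_token_dist([0, 3, 2])   # depends on the full window
```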
Yes, LLM inference consists of deterministic matrix multiplications and non-linear operations applied to the current context. But that simplicity of operations does not make it equivalent to a Markov chain. The definition of a Markov process requires that the next output depend only on the current state. You’re assuming that the LLM’s “state” is its current context window. But in an LLM, this “state” is not discrete. It is a structured, deeply encoded set of vectors shaped by non-linear transformations across layers. The state is not just the visible tokens; it is the full set of learned representations computed from them.
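A toy illustration of that point, assuming nothing beyond NumPy and using random placeholders for learned weights: even before attention enters the picture, the “state” the model actually operates on is a continuous tensor of encodings, not a discrete symbol.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 5, 8
E = rng.normal(size=(V, d))        # placeholder embedding table
W1 = rng.normal(size=(d, d))       # placeholder layer weights
W2 = rng.normal(size=(d, d))

tokens = np.array([0, 3, 2])       # the visible "state" people point at
h = E[tokens]                      # shape (3, d): already continuous, not discrete
h = np.tanh(h @ W1)                # non-linear transformation, layer 1
h = np.tanh(h @ W2)                # non-linear transformation, layer 2
# h lives in R^{3 x 8}; nothing here enumerates a discrete state set.
```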
A Markov chain transitions between discrete, enumerable states with fixed transition probabilities. LLMs instead apply a learned function over a high-dimensional, continuous input space, producing outputs through context-sensitive interactions between token representations. These interactions allow generalization and compositionality, not just selection among known paths.
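The difference shows up immediately in code. In this illustrative sketch, a transition table has no entry for a state it never enumerated, while a learned function (here a random linear map standing in for a trained network) assigns an output to any point in its continuous input space:

```python
import numpy as np

# Transition-table view: unseen states simply have no entry.
table = {("the", "cat"): {"sat": 0.9, "ran": 0.1}}
state = ("the", "dog")
# table[state]  -> KeyError: a table cannot generalize beyond enumerated states.

# Learned-function view: every point in the input space gets an output.
rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))        # placeholder for trained parameters

def f(x):                          # stand-in for a learned network
    logits = x @ W
    z = np.exp(logits - logits.max())
    return z / z.sum()

print(f(rng.normal(size=4)))       # a never-before-seen input still works
```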
The fact that inference uses fixed weights does not mean it reduces to a transition table. The output is computed by composing learned projections, attention mechanisms, and feedforward layers, none of which behave like a Markov chain’s transitions. You can’t describe an attention head with a transition matrix. You can’t reduce positional encoding or attention-weighted context mixing to state transitions. These are structured transformations, not symbolic transitions.
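For concreteness, here is a minimal single-head scaled dot-product attention in NumPy (random matrices standing in for learned projections). The mixing weights are computed from the input itself, which is exactly what a fixed transition matrix cannot express:

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single attention head: the weights over the context are *computed*
    from the input, so no fixed transition matrix reproduces this mapping."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # input-dependent scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)              # softmax mixing weights
    return A @ V                                       # context-weighted values

rng = np.random.default_rng(3)
n, d = 4, 8
X = rng.normal(size=(n, d))                            # toy token encodings
out = attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
```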
You can describe any deterministic process as a function, but not all deterministic functions are Markovian. What makes a process Markov is not just that it forgets prior history. It is having a fixed, memoryless probabilistic structure in which transitions depend only on a defined, discrete state. LLMs don’t transition between states in that sense. They recompute the probability distribution from scratch at each step, based on context-rich, continuous-valued encodings. That is not a Markov process. It is a stateless function approximator conditioned on a window, built to generalize across unseen input patterns.
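A final sketch of that stateless, windowed recomputation (toy sizes, random placeholder weights, greedy decoding for determinism): each step re-runs the full function over the visible window, and no transition object or hidden chain state persists between steps.

```python
import numpy as np

rng = np.random.default_rng(4)
V, d, WINDOW = 5, 8, 4
E = rng.normal(size=(V, d))               # placeholder embeddings
W = rng.normal(size=(d, V))               # placeholder output projection

def step(window):
    """Stateless: everything is recomputed from the visible window per call."""
    h = np.tanh(E[window].mean(axis=0))   # stand-in for the full forward pass
    logits = h @ W
    z = np.exp(logits - logits.max())
    return z / z.sum()

tokens = [0]
for _ in range(6):
    dist = step(np.array(tokens[-WINDOW:]))  # no state carried across steps
    tokens.append(int(dist.argmax()))        # greedy pick for determinism
print(tokens)
```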