Andrey Markov and Stochastic Processes

By Eric Cruet

Many years ago mathematician Andrey Markov introduced us to a branch of probability theory by applying mathematics to poetry.  By analyzing the text of Alexander Pushkin’s novel in verse Eugene Onegin, Markov spent hours crawling through patterns of vowels and consonants. In 1913, he summarized his findings in an address to the Imperial Academy of Sciences in St. Petersburg. His analysis explained the models he developed—now known as Hidden Markov Chains—and extended the theory of probability in a new direction. His approach broke new ground because it could be applied to chains of linked events by knowing their probability.  Many times the goal is to predict the antecedent events based on probabilities of future events that might seem unrelated.

Specifically, a hidden Markov model is a hybrid technique consisting of:

  • a machine learning model
  • a discrete hill climbing technique

In computer science, hill climbing is a mathematical optimization technique which belongs to the family of local search. It is an iterative algorithm that starts with an arbitrary solution to a problem, then attempts to find a better solution by incrementally changing a single element of the solution.

A hidden Markov chain (composed of Markov processes) is characterized by:

  • A number of states s1,s2, . . . ,sN
  • Time proceeding in discrete steps: t = 1,2,3, . . .
  • At each time step, a Markov process is in exactly one state
  • Transitions depend on current state (Markov chain order 0,1,2) and transition probability matrices 
  • Incorporates the concept of “memoryless Random Process”: a property of certain probability distributions in which the exponential distributions and the geometric distributions, and their derived probabilities from a set of random samples is distinct and has no information (i.e. “memory”) of earlier samples.


Problem 1: Given a model λ = (A,B,π) and observation sequence , find P(O|λ)

That is, we can score an observation sequence to see how well it fits a given model

Problem 2: Given λ = (A,B,π) and O, find an optimal state sequence or uncover hidden part

Problem 3: Given O, N, and M, find the model λ that maximizes probability of O

That is, train a model to fit observations

Hidden Markov Models in Practice:

Problem 1: Problem 1: Score an observation sequence versus a given model


  • Forward Algorithm – Instead of brute force: forward algorithm – Or “alpha pass”
  •  For t = 0,1,…,T-1 and i=0,1,…,N-1, letαt(i) = P(O0,O1,…,Ot,xt=qi|λ)
  • Probability of “partial sum” to t, and Markov process is in state qi at step t
  • What the? Complicated but….
  • Can be computed recursively, efficiently

Problem 2: Given a model, “uncover” hidden part

  • Solution 2: Given a model, find “most likely” hidden states: Given λ = (A,B,π) and O, find an optimal state sequence
  • Recall that optimal means “maximize expected number of correct states”
  • A better way: backward algorithm – Or “beta pass”

Problem 3 (find N by trial and error)

  • Given an observation sequence;
  • Assume a (hidden) Markov process exists;
  • Train a model based on observations;
  • Then given a sequence of observations, score it versus the model
  • Programmable by an algorithm-train the learner then use!


Next week: An example!

If you’re really interested, Google uses HMM(Hidden Markov Models) to fine tune its PageRank Algorithm):



A revealing introduction to hidden Markov models, by M. Stamp /HMM.pdf

A tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner ner.pdf

Hunting for metamorphic engines, W. Wong and M. Stamp Journal in Computer Virology, Vol. 2, No. 3, December 2006, pp. 211-229

Hunting for undetectable metamorphic viruses, D. Lin and M. Stamp Journal in Computer Virology, Vol. 7, No. 3, August 2011, pp. 201-214 


Wow, really fascinating, and also behind Kurzweil’s NLP and text-to-speech processing.

Now, can Markov models be used to model semiosis and semantics? Hard problem with too many variables? Variabilities in observers’ knowledge, competence, encyclopedic access? Yet the vast majority of meanings that people assert or use for any culturally meaningful expression will be similar and overlapping, not random (following principles of intersubjectivity in symbolic cognition). Meaning generation is not random, but also not specifically predictable; there are rules and constraints (which make it generative) but also unbounded (infinite productivity from finite means). Can our cognitive processes that produce meaning over some time dimension be modelled on Markov chains?