n-grams, Markov chains and text generators

posted by on 2016.11.03, under Processing, Supercollider
03:

An n-gram is a contiguous sequence of n elements of a text, where each element can be a phoneme, a syllable, a character, a word, etc. Given a text, the collection of its n-grams allows to infer some statistical correlation, and moreover to assemble the n-grams collected into a Markov chain . Right, what’s a Markov chain, then? A Markov chain describes a process in which the next step depends probabilistically only on the current step. For instance, a random walk is an example of a Markovian process. One way to assign a Markov chain to a text is to collect all its n-grams, and for each n-gram we keep track of the next n-gram. We then go through the collection of all the n-grams, and for each of them we choose randomly among the list of the subsequent n-grams. We form then the new n-gram, and proceed. Confusing? Let’s see an example. Suppose we have the text “this is a book and this is my pen”, and suppose we are interested in 2-grams, where a single gram is a word. We have then then the pair (This, is), the pair (is, a), etc. Then, we keep track of all the 1-grams which follow a single 2-gram: for instance, after (this, is) we can have (a) or (my), and we assign to each of them an equal probability. Suppose we start from (This, is): if we happen to choose (a), we form the pair (is, a), to which it must follow (a, book), and so on, until we reach the end of the text. In this way, we can generate a text which has similar statistical distribution of n-grams, in this case pairs of words. The greater n, the closer your generated text will be to the original one.
Inspired by this, I have written a code in p5.js, a set of libraries for Javascript, that generates text starting from n-grams in words. Here “word” only means “groups of characters separated by a whitespace”. Punctuation is not considered from the grammatical point of view, nor the role of articles, nouns, etc. are analysed; neverthless, the results are still quite interesting. The choice of Javascript is dictated by the fact that Processing/Java is not very friendly with dictionaries, and in this case they are really useful.
Here’s the code

var words;
var ngrams = {}
var order = 2;
var txt = "";
var a = 0.1;
var target = 255;

function setup() {
  createCanvas(600, 600);

  words = source.split(" ");

  for (var i = 0; i < words.length - order; i++){
    gram_temp = [];
    for (var j = 0; j < order; j++){
    gram_temp.push(words[i + j]);
      }
      gram = join(gram_temp," ");
      if (!ngrams[gram]){
     ngrams[gram] = [];
    }
      if (i < words.length - order){
     ngrams[gram].push(words[i + order])   
  }
}
  markovIt(ngrams);
  txt = spText(txt);
}

function draw() {
    background(255);
    a += (target - a) * 0.1;
    textSize(12);
        fill(0, a);
    textDisplay(txt);
    if (a  < 0.099 ){
     restart();
     }
}

function restart(){
    markovIt(ngrams);
    txt = spText(txt);
    a = 0;
    target = 255;
}


function textDisplay(ttext){
    textAlign(CENTER);
    text(ttext, 100, 60, width - 100, height - 60);
}

function spText(txt){
    return txt.split(".");
}

function mousePressed(){
   target = 0;
}

function markovIt(ngrams) {
    var index = int(random(0, words.length - order + 1));
    curr_temp = [];
    for (var j = 0; j < order; j++){
     curr_temp.push(words[index + j]);
      }
      current = join(curr_temp, " ");
      result = current;
      if (!ngrams[current]){
        return null;
    }
      var range = int(random(30, 500));
      for (var i = 0; i < range; i++){
        if (!ngrams[current]){
        break;
          }
        possibilities = ngrams[current];
        if (possibilities.length == 0){
          break;
        }
        next = possibilities[int(random(0, possibilities.length))];
        result = result + " " + next;
        tokens = result.split(" ");
        curr_temp = [];
        for (var j = order; j > 0; j--){
        curr_temp.push(tokens[tokens.length - j]);
    }
         current = join(curr_temp, " ");
         }
    txt = result;
   
}

Notice that you need to declare a variable source, which should contain the input text.
As a little homage to Noam Chomsky, a pioneer of grammar studies (and much more), here you can find a working version of the code above using 2-grams in words, and based on this. Click on the canvas to generate some new text.

pagetop