Example 4.3

Example 4.3#

Word level simulation of English using Markov chains. The data used in this example can be downloaded from here.

import re
from collections import defaultdict, Counter
import numpy as np
from random import randint, choice
import matplotlib.pyplot as plt
import time

rng = np.random.default_rng(19)

# Reading the file
with open('holmes.txt') as f:
    words = re.split(' +', f.read())

# Assign an index to each unique word
word_to_index = {word: idx for idx, word in enumerate(set(words))}
index_to_word = {idx: word for word, idx in word_to_index.items()}

# Create transition counts matrix
num_words = len(word_to_index)
transition_counts = np.zeros((num_words, num_words), dtype=int)

for w0, w1 in zip(words[:-1], words[1:]):
    transition_counts[word_to_index[w0], word_to_index[w1]] += 1

# Convert counts to probabilities to get the transition matrix
transition_matrix = transition_counts / transition_counts.sum(axis=1, keepdims=True)

# Handle potential NaNs (rows with no transitions)
transition_matrix[np.isnan(transition_matrix)] = 0

# visualise the transition matrix

# plt.imshow((transition_matrix + 1e-8))
# plt.colorbar()
# plt.show()

# simulate a sentence
T = 100
# w0 = choice(words[:-1])
w0 = 'Sherlock'
print('The first word is', w0, '\n')

print(w0)

for _ in range(T):
    # get the index of w0
    idx = word_to_index[w0]
    # sample the next index using the transition matrix
    idx_next = rng.choice(num_words, p=transition_matrix[idx, :])
    # convert the index back to the word
    w1 = index_to_word[idx_next]
    print(w1, end=' ')
    w0 = w1
The first word is Sherlock 

Sherlock
Holmes broke in
impatiently.

“I should be said,” she cried. “Your merits should rest is the
door, and said Holmes, whose care the
two waifs had been done the great rock to remember, in the wrong track, for the end of his disposition towards some
definite and rolled across the old John Ferrier’s grave. As I do. We heard the hotel, considerably
interested in an old book I saw him
at it on the silence of railings. Having tethered the line like one passed my clients.” Again he said Holmes, sententiously.

“Well, I tell you paid by throwing
himself upon a necromancer.

“From a scuffle. I was stuck the address 
/var/folders/d8/86whr2t12h70n511w7s7cm_w0000gn/T/ipykernel_14256/619480890.py:26: RuntimeWarning: invalid value encountered in divide
  transition_matrix = transition_counts / transition_counts.sum(axis=1, keepdims=True)