
Used for inference, prediction, and parameter learning for tree-structured Hidden Markov Models.


TreeHMM


Concepts Background

What is a Hidden Markov model (HMM)?

  • Hidden Markov models (HMMs) are a formal foundation for making probabilistic models of linear sequences. They provide a conceptual toolkit for building complex models just by drawing an intuitive picture. They are at the heart of a diverse range of programs, including gene finding, profile searches, multiple sequence alignment, and regulatory site identification. HMMs are the Legos of computational sequence analysis.

Why a Hidden Markov model (HMM)?

  • In many cases you cannot observe the states you are interested in directly, but you can see the effects of being in a state. These observed effects are called "emissions" or "observations".

What is a tree?

In graph theory, a tree is an undirected graph in which any two vertices are connected by exactly one path, or equivalently, a connected acyclic undirected graph. A tree consists of nodes connected by edges and is a non-linear data structure with the following properties (a minimal sketch follows the list):

  • One node is marked as the root node.
  • Every node other than the root is associated with exactly one parent node.
  • Each node can have an arbitrary number of child nodes.
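
To make this concrete, here is a tiny illustrative sketch (plain Python, not part of the package) of a rooted tree and its parent/child relationships:

tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}  # node -> its children

# Every node except the root (node 0) has exactly one parent
parent = {child: node for node, children in tree.items() for child in children}
assert 0 not in parent and all(node in parent for node in [1, 2, 3, 4])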

What is a Poly-tree?

  • A poly-tree is simply a directed acyclic graph whose underlying undirected graph is a tree.

What is a Tree Hidden Markov model (TreeHMM)?

  • It is the same as an HMM, but the states are linked via a (poly)tree structure rather than a simple path.

What is the forward-backward algorithm?

  • The Forward–Backward algorithm is the conventional, recursive, efficient way to evaluate a Hidden Markov Model, that is, to compute the probability of an observation sequence given the model. This probability can be used to classify observation sequences in recognition applications. The goal of the forward-backward algorithm is to find the conditional distribution over hidden states given the data.
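
As a minimal sketch of these recursions, here is the classical forward-backward algorithm on a 2-state chain HMM (the plain-path case; the package generalizes this to (poly)tree structures). All parameter values below are made up for illustration:

import numpy as np

# A 2-state chain HMM (illustrative parameters, not from the package)
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # state transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission matrix: P(emission | state)
pi = np.array([0.5, 0.5])                # initial state distribution
obs = [0, 1, 1, 0]                       # indices of the observed emissions

n, T = len(pi), len(obs)
f = np.zeros((T, n))                     # f[t, x] = P(e_1..e_t, X_t = x)
b = np.zeros((T, n))                     # b[t, x] = P(e_(t+1)..e_T | X_t = x)

f[0] = pi * B[:, obs[0]]                 # forward pass: start to end
for t in range(1, T):
    f[t] = (f[t - 1] @ A) * B[:, obs[t]]

b[T - 1] = 1.0                           # backward pass: end to start
for t in range(T - 2, -1, -1):
    b[t] = A @ (B[:, obs[t + 1]] * b[t + 1])

posterior = f * b                        # combine both passes, then normalize
posterior /= posterior.sum(axis=1, keepdims=True)
print(posterior)                         # P(X_t = x | all observations), per node t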

What is the Baum-Welch algorithm?

  • The Baum-Welch algorithm is a dynamic programming approach and a special case of the expectation-maximization (EM) algorithm. Its purpose is to tune the parameters of the HMM, namely the state transition matrix, the emission matrix, and the initial state distribution, so that the model maximizes the likelihood of the observed data. A sketch of one such update follows.
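
To make the E- and M-steps concrete, here is one Baum-Welch update for the same 2-state chain HMM used in the sketch above. This illustrates the re-estimation idea only; it is not the package's tree implementation:

import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition matrix to be re-estimated
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission matrix to be re-estimated
pi = np.array([0.5, 0.5])
obs = [0, 1, 1, 0, 0]
n, T = len(pi), len(obs)

# E-step: forward and backward passes, exactly as in the sketch above
f = np.zeros((T, n)); b = np.zeros((T, n))
f[0] = pi * B[:, obs[0]]
for t in range(1, T):
    f[t] = (f[t - 1] @ A) * B[:, obs[t]]
b[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    b[t] = A @ (B[:, obs[t + 1]] * b[t + 1])

likelihood = f[-1].sum()
gamma = f * b / likelihood               # gamma[t, x] = P(X_t = x | obs)
xi = np.zeros((T - 1, n, n))             # xi[t, x, y] = P(X_t = x, X_(t+1) = y | obs)
for t in range(T - 1):
    xi[t] = f[t][:, None] * A * (B[:, obs[t + 1]] * b[t + 1])[None, :] / likelihood

# M-step: re-estimate the parameters from the expected counts
new_pi = gamma[0]
new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
new_B = np.stack([gamma[np.array(obs) == e].sum(axis=0) / gamma.sum(axis=0)
                  for e in range(B.shape[1])], axis=1)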

Getting Started

Prerequisites

Create a virtual environment

  • At its core, the main purpose of Python virtual environments is to create an isolated environment for Python projects. This means that each project can have its own dependencies, regardless of what dependencies every other project has.

  • On macOS and Linux:

      python3 -m venv envname
    
  • On Windows:

      py -m venv envname        
    

Activate the virtual environment

  • On macOS and Linux:

      source envname/bin/activate
    
  • On Windows:

      .\envname\Scripts\activate
    

Installation

    pip install treehmm

Package Documentation

This is an extension of normal HMMs where the states are connected via a poly-tree structure rather than a simple path. Moreover, when data is available, some of the hidden states can be observed, and hence estimated more accurately. This package is based on an R package and is roughly 40% faster than it.

There are six Python files of interest.

  • initialize_HMM.py
  • backward_sequence_generator.py
  • forward_sequence_generator.py
  • backward.py
  • forward.py
  • baumWelch.py

initialize_HMM.py

Description

  • Initializes a treeHMM with the given parameters

Usage

  • def initialize_HMM(states, emissions, adjacent_symmetry_matrix, initial_probabilities=None, state_transition_probabilities=None, emission_probabilities=None):

Arguments

  • states: It is a numpy array whose first element is the discrete state value for the cases (or positives) and whose second element is the discrete state value for the controls (or negatives) of the given treeHMM.
  • emissions: It is a list of numpy arrays consisting of the discrete emission values (where "M" is the possible number of emissions) for each feature variable.
  • adjacent_symmetry_matrix: It is the adjacency (adjacent symmetry) matrix that describes the topology of the tree.
  • initial_probabilities: (Optional) It is a numpy array of shape (N * 1) containing the initial probabilities of the states, where "N" is the possible number of states. Default is equally probable states.
  • state_transition_probabilities: (Optional) It is a numpy array of shape (N * N) containing the transition probabilities between the states, where "N" is the possible number of states.
  • emission_probabilities: (Optional) It is a list of numpy arrays of shape (N * M) containing the emission probabilities of the states, for each feature variable. Default is equally probable emissions.

Returns

  • hmm: A dictionary consisting of the above initialized parameters of the treeHMM

Examples

The imports below are shared by all the examples in this document; the import path is assumed from the package layout, so adjust it to your installation.

import copy
import numpy as np
import pandas as pd
from treehmm import initialize_HMM, backward_sequence_generator, forward_sequence_generator, backward, forward, baumWelch

sample_tree = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]).reshape(5, 5)
states = ['P', 'N']
emissions = [['L', 'R']]
hmm = initialize_HMM.initialize_HMM(states, emissions, sample_tree)
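
Continuing the example, the topology encoded by sample_tree can be listed edge by edge. The row-as-parent, column-as-child reading below is my interpretation, chosen because it reproduces the "X"-shaped tree described in the examples:

# Under this reading, a 1 in row i, column j of sample_tree marks a directed
# edge from node i to node j, giving the "X" shape used throughout: nodes
# 0 and 1 both feed into node 2, which feeds the leaves 3 and 4.
for parent, child in zip(*np.nonzero(sample_tree)):
    print(f"edge {parent} -> {child}")   # prints 0->2, 1->2, 2->3, 2->4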

backward_sequence_generator.py

Description

  • Calculates the order in which the nodes of the tree should be traversed during the backward pass (leaves to roots).
  • A tree is a complex graphical model in which a node can have multiple parents and multiple children, so the order in which the tree is traversed becomes significant. The backward algorithm is a dynamic programming problem: to calculate the values at a node, we first need the values at its children, which must therefore be traversed before it. This function outputs one possible (not unique) traversal order that ensures children are traversed before their parents (see the topological-sort sketch after the example below).

Usage

  • def backward_sequence_generator(hmm, number_of_levels=100):

Arguments

  • hmm: It is the dictionary returned by initialize_HMM
  • number_of_levels: The number of levels in the tree, if known. Default is 100

Returns

  • backward_tree_sequence: A list of size "D", where "D" is the number of nodes in the tree

Examples

sample_tree = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]).reshape(5, 5)
states = ['P', 'N']
emissions = [['L', 'R']]
hmm = initialize_HMM.initialize_HMM(states, emissions, sample_tree)
backward_tree_sequence = backward_sequence_generator.backward_sequence_generator(hmm)
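
One standard way to produce such orders is a topological sort (Kahn's algorithm) over the adjacency matrix. The sketch below is illustrative, not the package's actual routine; it reuses sample_tree and the imports from the examples above. Parents-before-children gives a forward order, and reversing it gives a backward order:

def topological_order(adj):
    # Kahn's algorithm on a 0/1 adjacency matrix: repeatedly emit a node
    # whose remaining in-degree is zero, i.e. all of its parents are done.
    indegree = adj.sum(axis=0).astype(int)
    ready = [node for node in range(adj.shape[0]) if indegree[node] == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for child in np.nonzero(adj[node])[0]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(int(child))
    return order

forward_order = topological_order(sample_tree)   # parents before children
backward_order = forward_order[::-1]             # children before parents
print(backward_order)                            # one valid backward traversal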

forward_sequence_generator.py

Description

  • Calculates the order in which the nodes of the tree should be traversed during the forward pass (roots to leaves).
  • A tree is a complex graphical model in which a node can have multiple parents and multiple children, so the order in which the tree is traversed becomes significant. The forward algorithm is a dynamic programming problem: to calculate the values at a node, we first need the values at its parents, which must therefore be traversed before it. This function outputs one possible (not unique) traversal order that ensures parents are traversed before their children (the sketch in the previous section produces exactly this kind of order).

Usage

  • def forward_sequence_generator(hmm, number_of_levels=100):

Arguments

  • hmm: It is the dictionary returned by initialize_HMM
  • number_of_levels: The number of levels in the tree, if known. Default is 100

Returns

  • forward_tree_sequence: A list of size "D", where "D" is the number of nodes in the tree

Examples

sample_tree = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]).reshape(5, 5)
states = ['P', 'N']
emissions = [['L', 'R']]
hmm = initialize_HMM.initialize_HMM(states, emissions, sample_tree)
forward_tree_sequence = forward_sequence_generator.forward_sequence_generator(hmm)

backward.py

Description

  • Infer the backward probabilities for all the nodes of the treeHMM

Usage

  • def backward(hmm, observation, backward_tree_sequence, observed_states_training_nodes=None):

Arguments

  • hmm: It is the dictionary returned by initialize_HMM
  • observation: It is a list of lists consisting of "k" lists for "k" features, each being a series of discrete emission values at the different nodes, serially sorted by node number
  • backward_tree_sequence: It is a list denoting the order in which the nodes of the tree should be traversed in the backward direction (from leaves to roots). It is the output of the backward_sequence_generator function.
  • observed_states_training_nodes: (Optional) It is an (L * 2) dataframe, where L is the number of training nodes whose state values are known. The first column should be the node number and the second column the corresponding known state value of that node.

Returns

  • backward_probs: A dataframe of size (N * D) denoting the backward probabilities at each node of the tree, where "N" is the possible number of states and "D" is the total number of nodes in the tree

Details

  • The backward probability for state X at node k is defined as the probability of observing the remaining emissions e_(k+1), ..., e_n given that the state at node k is X. That is: b[X, k] := P(E_(k+1) = e_(k+1), ..., E_n = e_n | X_k = X), where E_1, ..., E_n = e_1, ..., e_n is the sequence of observed emissions and X_k is a random variable representing the state at node k.

Examples

sample_tree = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]).reshape(5, 5)  # for "X" (5 nodes) shaped tree
states = ['P', 'N']  # "P" represents cases (or positives) and "N" represents controls (or negatives)
emissions = [['L', 'R']]  # one feature with two discrete levels, "L" and "R"
hmm = initialize_HMM.initialize_HMM(states, emissions, sample_tree)
observation = [["L", "L", "R", "R", "L"]]
backward_tree_sequence = backward_sequence_generator.backward_sequence_generator(hmm)
data = {'node': [1], 'state': ['P']}
observed_states_training_nodes = pd.DataFrame(data=data, columns=["node", "state"])
backward_probs = backward.backward(hmm, observation, backward_tree_sequence, observed_states_training_nodes)

forward.py

It consists of two functions

  • noisy_or
  • forward

noisy_or : Calculates the probability of transition from multiple nodes to a given node in the tree

Description

  • Calculates the probability of transition from multiple nodes to a given node in the tree

Usage

  • def noisy_or(hmm, prev_state, cur_state):

Arguments

  • hmm: It is the dictionary returned by initialize_HMM
  • prev_state: It is a numpy array containing the state variable values of the previous nodes
  • cur_state: It is a string denoting the state variable value of the current node

Returns

  • transition_prob: The noisy-OR probability for the transition

Examples

sample_tree = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]).reshape(5, 5)  # for "X" (5 nodes) shaped tree
states = ['P', 'N']  # "P" represents cases (or positives) and "N" represents controls (or negatives)
emissions = [['L', 'R']]  # one feature with two discrete levels, "L" and "R"
hmm = initialize_HMM.initialize_HMM(states, emissions, sample_tree)
transition_prob = forward.noisy_or(hmm, states, "P")  # for transition from P & N simultaneously to P
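
For background, the classical noisy-OR rule that gives this function its name combines per-parent activation probabilities as below. This is an illustrative sketch of the general idea; the package's exact parametrization may differ:

def noisy_or_prob(parent_activation_probs):
    # The child fails to turn "on" only if every active parent
    # independently fails to activate it.
    prob_all_fail = 1.0
    for p in parent_activation_probs:
        prob_all_fail *= 1.0 - p
    return 1.0 - prob_all_fail

print(noisy_or_prob([0.7, 0.5]))  # 1 - (0.3 * 0.5) = 0.85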

forward : Infer the forward probabilities for all the nodes of the treeHMM

Description

  • forward calculates the forward probabilities for all the nodes

Usage

  • def forward(hmm, observation, forward_tree_sequence, observed_states_training_nodes=None):

Arguments

  • hmm: It is the dictionary returned by initialize_HMM
  • observation: It is a list of lists consisting of "k" lists for "k" features, each being a series of discrete emission values at the different nodes, serially sorted by node number
  • forward_tree_sequence: It is a list denoting the order in which the nodes of the tree should be traversed in the forward direction (from roots to leaves). It is the output of the forward_sequence_generator function.
  • observed_states_training_nodes: (Optional) It is an (L * 2) dataframe, where L is the number of training nodes whose state values are known. The first column should be the node number and the second column the corresponding known state value of that node.

Returns

  • forward_probs: A dataframe of size (N * D) denoting the forward probabilities at each node of the tree, where "N" is the possible number of states and "D" is the total number of nodes in the tree

Details

  • The forward probability for state X up to the observation at node k is defined as the probability of the state at node k being X given the sequence of observations e_1, ..., e_k. That is: f[X, k] := P(X_k = X | E_1 = e_1, ..., E_k = e_k), where E_1, ..., E_n = e_1, ..., e_n is the sequence of observed emissions and X_k is a random variable representing the state at node k.

Examples

sample_tree = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]).reshape(5, 5)  # for "X" (5 nodes) shaped tree
states = ['P', 'N']  # "P" represents cases (or positives) and "N" represents controls (or negatives)
emissions = [['L', 'R']]  # one feature with two discrete levels, "L" and "R"
hmm = initialize_HMM.initialize_HMM(states, emissions, sample_tree)
observation = [["L", "L", "R", "R", "L"]]
forward_tree_sequence = forward_sequence_generator.forward_sequence_generator(hmm)
data = {'node': [1], 'state': ['P']}
observed_states_training_nodes = pd.DataFrame(data=data, columns=["node", "state"])
forward_probs = forward.forward(hmm, observation, forward_tree_sequence, observed_states_training_nodes)

baumWelch.py

It consists of two functions

  • baumWelchRecursion
  • hmm_train_and_test

baumWelchRecursion : Implementation of the Baum-Welch algorithm as a special case of the EM algorithm

Description

  • The hmm_train_and_test function calls this function recursively to produce a final estimate of the parameters of the tree HMM. It should not be used directly.

Usage

  • def baumWelchRecursion(hmm, observation, observed_states_training_nodes=None, observed_states_validation_nodes=None):

Arguments

  • hmm: It is the dictionary returned by initialize_HMM
  • observation: It is a list of lists consisting of "k" lists for "k" features, each being a series of discrete emission values at the different nodes, serially sorted by node number
  • observed_states_training_nodes: (Optional) It is an (L * 2) dataframe, where L is the number of training nodes whose state values are known. The first column should be the node number and the second column the corresponding known state value of that node
  • observed_states_validation_nodes: (Optional) It is an (L * 2) dataframe, where L is the number of validation nodes whose state values are known. The first column should be the node number and the second column the corresponding known state value of that node

Returns

  • newparam: A dictionary containing the estimated transition and emission probability matrices

Examples

sample_tree = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]).reshape(5, 5)  # for "X" (5 nodes) shaped tree
states = ['P', 'N']  # "P" represents cases (or positives) and "N" represents controls (or negatives)
emissions = [['L', 'R']]  # one feature with two discrete levels, "L" and "R"
hmm = initialize_HMM.initialize_HMM(states, emissions, sample_tree)
observation = [["L", "L", "R", "R", "L"]]
data = {'node': [1], 'state': ['P']}
observed_states_training_nodes = pd.DataFrame(data=data, columns=["node", "state"])
data1 = {'node': [2, 3, 4], 'state': ['P', 'N', 'P']}
observed_states_validation_nodes = pd.DataFrame(data=data1, columns=["node", "state"])
newparam = baumWelch.baumWelchRecursion(copy.deepcopy(hmm), observation, observed_states_training_nodes, observed_states_validation_nodes)

hmm_train_and_test : Inferring the parameters of a tree Hidden Markov Model via the Baum-Welch algorithm

Description

  • For an initial Hidden Markov Model (HMM) with some assumed initial parameters and a given set of observations at all the nodes of the tree, the Baum-Welch algorithm infers optimal parameters for the HMM. Since the Baum-Welch algorithm is a variant of the Expectation-Maximization algorithm, it converges to a local solution which might not be the global optimum. Note that if you supply both training and validation data, the function will print AUC and AUPR values after every iteration (a sketch of computing these metrics yourself appears after the example below). Also, the validation data must contain more than one instance of either of the possible states.

Usage

  • def hmm_train_and_test(hmm, observation, observed_states_training_nodes=None, observed_states_validation_nodes=None, maxIterations=50, delta=1e-5, pseudoCount=0):

Arguments

  • hmm: It is the dictionary returned by initialize_HMM
  • observation: It is a list of lists consisting of "k" lists for "k" features, each being a series of discrete emission values at the different nodes, serially sorted by node number
  • observed_states_training_nodes: (Optional) It is an (L * 2) dataframe, where L is the number of training nodes whose state values are known. The first column should be the node number and the second column the corresponding known state value of that node
  • observed_states_validation_nodes: (Optional) It is an (L * 2) dataframe, where L is the number of validation nodes whose state values are known. The first column should be the node number and the second column the corresponding known state value of that node
  • maxIterations: It is the maximum number of iterations of the Baum-Welch algorithm. Default is 50
  • delta: An additional termination condition applied if the transition and emission matrices converge before maxIterations is reached: the difference between the transition and emission parameters in consecutive iterations must be smaller than delta for the algorithm to terminate. Default is 1e-5
  • pseudoCount: The amount of pseudo counts added in the estimation step of the Baum-Welch algorithm. Default is zero

Returns

  • learntHMM: A dictionary of three elements: the inferred HMM, whose representation is equivalent to the representation in initialize_HMM; a list of statistics of the algorithm; and the final state probability distribution at all nodes.

Examples

sample_tree = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]).reshape(5, 5)  # for "X" (5 nodes) shaped tree
states = ['P', 'N']  # "P" represents cases (or positives) and "N" represents controls (or negatives)
emissions = [['L', 'R']]  # one feature with two discrete levels, "L" and "R"
hmm = initialize_HMM.initialize_HMM(states, emissions, sample_tree)
observation = [["L", "L", "R", "R", "L"]]
data = {'node': [1], 'state': ['P']}
observed_states_training_nodes = pd.DataFrame(data=data, columns=["node", "state"])
data1 = {'node': [2, 3, 4], 'state': ['P', 'N', 'P']}
observed_states_validation_nodes = pd.DataFrame(data=data1, columns=["node", "state"])
learntHMM = baumWelch.hmm_train_and_test(copy.deepcopy(hmm), observation, observed_states_training_nodes, observed_states_validation_nodes)
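
If you want to compute the same AUC and AUPR metrics yourself on held-out nodes, scikit-learn provides them (scikit-learn is not a stated dependency of this package, so it is a separate install). A minimal sketch with made-up labels and scores:

from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical held-out labels (1 = "P", 0 = "N") and predicted
# probabilities of state "P" at the same nodes, for illustration only
y_true = [1, 0, 1]
y_score = [0.8, 0.3, 0.6]
print("AUC :", roc_auc_score(y_true, y_score))
print("AUPR:", average_precision_score(y_true, y_score))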

License

Distributed under the GNU General Public License v3.0. See LICENSE for more information.

Contributors

Acknowledgments

I want to acknowledge the treeHMM R package by Prajwal Bende, which inspired this package.
