The Bayesian approach analyzes data and updates beliefs in light of that data. Unique features of this approach include the ability to incorporate prior information into the analysis, the interpretation of credible intervals as fixed ranges to which a parameter belongs with a predetermined probability, and the ability to assign a probability to any hypothesis of interest. In this article, to understand this concept, we will use the ParaMonte Python package to perform Bayesian data analysis and visualization, which relies on parallel Markov chain Monte Carlo. Markov chain Monte Carlo is a method that simulates high-dimensional probability distributions for Bayesian inference. Here are the main points covered in this article.

Contents

  1. Bayesian inference
  2. Markov chain Monte Carlo
  3. About ParaMonte
  4. Bayesian data analysis with ParaMonte

To understand the functionality of ParaMonte's extensive post-processing and visualizations, we must first understand Bayesian inference and Markov chain Monte Carlo (MCMC).

Bayesian inference

To make inferences about a population from a sample of data, we feed the sample into an underlying model that can predict the data we expect to see given particular values of the model's parameters. We then need to calculate the probability of observing this sample conditional on a specific choice of parameters. In other words, assuming the model is correct, how likely are the parameters given the observed data D? This way of reasoning is known as Bayesian probability, and the inference we draw from it is Bayesian inference.

Ultimately, Bayesian inference multiplies the prior information by the likelihood of the parameters and then normalizes the result by the marginal likelihood of the data over all parameter values in the model, yielding the posterior distribution.
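As a minimal sketch of this idea (not part of the original walkthrough), the posterior can be computed over a small discrete grid of parameter values; the coin-toss numbers and variable names below are purely illustrative.

import numpy as np

# A minimal sketch of Bayes' rule on a discrete grid of parameter values:
# posterior is proportional to prior times likelihood, normalized by the marginal likelihood.
theta = np.linspace(0, 1, 101)             # candidate values of a coin's bias
prior = np.ones_like(theta) / len(theta)   # flat prior over the grid
heads, tosses = 7, 10                      # hypothetical observed data
likelihood = theta**heads * (1 - theta)**(tosses - heads)
posterior = prior * likelihood
posterior /= posterior.sum()               # divide by the marginal likelihood
print(theta[np.argmax(posterior)])         # parameter value with the highest posterior probability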


Markov chain Monte Carlo

The name itself combines two different ideas: Monte Carlo sampling and Markov chains.

Monte Carlo

Monte Carlo is a technique for sampling from a probability distribution and estimating a desired quantity from random samples. From the samples we draw, we can approximate a sum or integral, such as the mean or variance of the distribution. However, this method generally assumes that samples can be drawn easily from the target distribution.

Consider a complex 2D shape, such as a spiral, as a way to visualize the Monte Carlo sampling process. We cannot write down an easily understood function that describes the spiral, but for any sample drawn from the domain we can check whether it is part of the spiral. With a large number of such samples, we can summarize the spiral's shape as a probability density over the domain.
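The same idea can be sketched with a simpler shape than a spiral; the unit circle below is only an illustrative stand-in, and the snippet is a hedged sketch rather than anything from the ParaMonte workflow.

import numpy as np

# A sketch of Monte Carlo estimation: approximate the area of a unit circle
# (and hence pi) by drawing random points from the square [-1, 1] x [-1, 1].
rng = np.random.default_rng(0)
samples = rng.uniform(-1, 1, size=(100_000, 2))
inside = (samples**2).sum(axis=1) <= 1.0   # is each point part of the shape?
area_estimate = 4.0 * inside.mean()        # fraction inside times the area of the square
print(area_estimate)                       # close to pi ~ 3.1416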

However, it is not always easy to sample directly from the target distribution. In such cases, Markov chains provide a way to sample efficiently from an otherwise intractable probability distribution.

Markov chain

A Markov chain is a special type of stochastic process that characterizes a sequence of random variables in which each variable depends probabilistically on the value of the previous one; specifically, the next variable is determined only by the most recent variable. Particular attention is paid to the dynamics and the limiting behaviour of the sequence.

Consider a board game involving rolling dice, such as Ludo. There is a uniform distribution of probability over the six outcomes of a die roll, the whole numbers from 1 to 6. Given a position on the board, your next position depends solely on the current position and the die roll, so the sequence of positions you occupy on the board forms a Markov chain.
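A rough sketch of such a chain is shown below; the board size and wrap-around rule are simplified for illustration and are not meant to model real Ludo.

import numpy as np

# A simplified sketch of the board-game Markov chain: the next position depends
# only on the current position and an independent roll of a fair six-sided die.
rng = np.random.default_rng(0)
board_size = 52          # illustrative board length
position = 0
trajectory = [position]
for _ in range(20):
    roll = rng.integers(1, 7)              # uniform over 1..6
    position = (position + roll) % board_size
    trajectory.append(position)
print(trajectory)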

Put together, MCMC methods attempt to generate samples whose importance weights are constant; that is, MCMC seeks to generate samples in proportion to the posterior probability in order to arrive at a good estimate of the expected value. One could say that MCMC methods approximate the posterior distribution of a parameter by random sampling.
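As a hedged sketch of the idea, independent of ParaMonte, a basic Metropolis sampler draws from a target density known only up to a constant; the standard-normal target below is just an example.

import numpy as np

# A minimal Metropolis sampler: accept or reject random-walk proposals so that,
# in the long run, samples are drawn in proportion to the target density.
def log_target(x):
    return -0.5 * x**2                     # standard normal density, up to a constant

rng = np.random.default_rng(0)
x, chain = 0.0, []
for _ in range(5_000):
    proposal = x + rng.normal(scale=1.0)   # random-walk proposal
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal                       # accept the move
    chain.append(x)
print(np.mean(chain), np.std(chain))       # roughly 0 and 1 for this target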

Let’s understand how this concept is used with parallel programming for analysis.

About ParaMonte

Several Python packages already provide probabilistic programming environments for Markov chain Monte Carlo simulations in which users implement their inference problems. Other MCMC packages either require users to provide the objective function as a black box following a predefined procedure or leave it to users to define the objective function themselves.

In particular, the ParaMonte library enables scalable parallel Monte Carlo simulations on distributed and shared memory architectures. ParaMonte was developed with the following objectives.

  • Automate all Monte Carlo simulations wherever possible to ensure the highest degree of library usability and minimum time investment required to create, run, and post-process simulation models.
  • Low level implementation of the library for high performance Monte Carlo simulations.
  • Parallelization of all simulations via two-way and one-way MPI communications with zero parallel coding on the part of the user.
  • Reporting and post-processing of each simulation and its results, as well as their automatic storage in external files so that they are easily understandable and reproducible in the future.

Let’s implement ParaMonte for Bayesian data analysis and visualization in Python.

Bayesian data analysis with ParaMonte

In this section, we will analyze the linear relationship between a dependent variable and an explanatory variable using ParaMonte, through parallel Markov chain Monte Carlo simulations.

Installing the ParaMonte library

! pip install paramonte

Import the necessary libraries

import paramonte as pm
import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

Data loading and analysis

df = pd.read_csv('Datasets/test.csv')

# keep the raw explanatory and dependent variables for later use
X = df['x'].values
y = df['y'].values

fig = plt.figure( figsize = (4,5)
                , dpi = 100
                )
ax = fig.add_subplot(1,1,1)
ax.scatter( X
          , y
          , color = "red"
          , s = 5
          )
plt.show()

The data contains a total of 300 records, and the scatter plot above shows how they are distributed. Now let’s see how the distribution looks when the log is applied to these data points.

We will only use a subset of the data points for the model, because as the number of data points and the chain size increase, the execution time also increases. Before that, we need to create a function that applies the log to the data and computes the corresponding log-likelihood. Taking the log of both variables turns the power-law relationship y = exp(intercept) · x^slope into a linear one: log y = intercept + slope · log x.

from scipy.stats import norm

# work on the log scale of both variables
logX = np.log(X)
logY = np.log(y)

def getLogLike(param):
    # param = [intercept, slope, log of the standard deviation]
    mean = param[0] + param[1] * logX
    logProbDensities = norm.logpdf(logY, loc = mean, scale = np.exp(param[2]))
    # log-likelihood of the observed data under this choice of parameters
    return np.sum(logProbDensities)

Build the ParaMonte model

para = pm.ParaDRAM()
para.spec.overwriteRequested = True 
para.spec.outputFileName = "./regression_powerlaw" 
para.spec.randomSeed = 100 
para.spec.variableNameList = ["intercept", "slope", "logSigma"]
para.spec.chainSize = 1000 
para.runSampler( ndim = 3 
               , getLogFunc = getLogLike 
               )
  • overwriteRequested overwrites previously saved results so that the chain creation process is not interrupted.
  • The output files for the samples, the generated chain and other reports are saved under the prefix given by ‘outputFileName’.
  • ‘chainSize’ specifies the number of states in the chain to be generated.

After setting all the parameters above, we run the sampler, specifying the number of parameters in ‘ndim’.

Let’s visualize the chain and the samples created by the sampler.

# read the generated chain back from the output files
chain = para.readChain(renabled = True)[0]
chain.plot.scatter( ycolumns = "AdaptationMeasure"
                  , ccolumns = []
                  )
chain.plot.scatter.currentFig.axes.set_ylim([1.e-5,1])
chain.plot.scatter.currentFig.axes.set_yscale("log")

A total of 1000 chain points are generated, as visualized in the graph above along with their adaptation measure.

# read the refined sample back from the output files
sample = para.readSample(renabled = True)[0]

# trace plot of each sampled variable against the MCMC count
for colname in sample.df.columns:
    sample.plot.line.ycolumns = colname
    sample.plot.line()
    sample.plot.line.currentFig.axes.set_xlabel("MCMC Count")
    sample.plot.line.currentFig.axes.set_ylabel(colname)
    sample.plot.line.savefig( fname = "./traceplot_" + colname )

# histogram of each sampled variable
for colname in sample.df.columns:
    sample.plot.histplot(xcolumns = colname)
    sample.plot.histplot.currentFig.axes.set_xlabel(colname)
    sample.plot.histplot.currentFig.axes.set_ylabel("MCMC Count")
    sample.plot.histplot.savefig( fname = "./histogram_" + colname )

Since we used three variables, intercept, slope and logSigma, there are three sets of sampled values. Let’s visualize the distribution of these samples.

As we can see, the distribution is roughly normal for all variables, although the distribution of the sampled log-likelihood values is skewed to the right. Let’s see how this affects the linear relationship between the dependent variable and the independent variable.
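To put numbers on this, the refined sample’s DataFrame can be summarised directly with pandas; this is a small additional check, assuming the column names match the variableNameList set earlier.

# summary statistics of the refined sample for each fitted parameter
print(sample.df[["intercept", "slope", "logSigma"]].describe())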

# best-fit power law from the posterior means of the sampled parameters
xvalues = np.linspace(0,100,101)
yvalues = np.exp(sample.df["intercept"].mean()) * xvalues ** sample.df["slope"].mean()

fig = plt.figure(figsize = (4.5,4), dpi = 100)
ax = fig.add_subplot(1,1,1)

ax.plot(xvalues, yvalues, "b")
ax.scatter(X, y, color = "red", s = 5)
plt.show()

The expected line of best fit, with the slope and intercept learned by the ParaMonte sampler, is shown in the graph above, and the relationship is almost linear; it would be closer to perfectly linear if the noise in the data were smoothed out. In fact, every parameter set stored in the output file corresponds to a possible regression line, each with a different probability of being correct. We can plot them together to understand the level of uncertainty in our best-fit regression.

xvalues = np.linspace(0,100,101)
 
fig = plt.figure(figsize = (4.5,4), dpi = 100)
ax = fig.add_subplot(1,1,1)
 
 
first = 0
last = 300
slopes = sample.df["slope"].values[first:last]
intercepts = sample.df["intercept"].values[first:last]
 
for slope, intercept in zip(slopes, intercepts):
    yvalues = np.exp(intercept) * xvalues ** slope
    ax.plot( xvalues
           , yvalues
           , "black" 
           , alpha = 0.04 
           )
 
ax.scatter( X, y
          , color = "red"
          , s = 5
          , zorder = 100000
          )

These are only the first 300 samples; the remaining samples can be visualized in the same way.
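For instance, a later window of the refined sample can be overlaid with the same code, a hedged sketch that assumes sample.df holds more than 300 rows after refinement.

# repeat the overlay for a later window of the refined sample
first = 300
last = min(600, len(sample.df))
slopes = sample.df["slope"].values[first:last]
intercepts = sample.df["intercept"].values[first:last]

fig = plt.figure(figsize = (4.5,4), dpi = 100)
ax = fig.add_subplot(1,1,1)
for slope, intercept in zip(slopes, intercepts):
    ax.plot(xvalues, np.exp(intercept) * xvalues ** slope, "black", alpha = 0.04)
ax.scatter(X, y, color = "red", s = 5, zorder = 100000)
plt.show()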

Final verdict

ParaMonte is a great package with a parallel implementation of Markov chain Monte Carlo, which reduces processing time and helps analyze data more thoroughly. Through this hands-on implementation, we performed Bayesian data analysis and visualization with ParaMonte.
