Large calculations such as optimisations and simulations should exploit parallel computing on large-scale systems, reducing the time it takes to run them and letting teams deliver results to clients faster.
Parallel computation is the simultaneous execution of different pieces of a larger computation across multiple processors or cores. The basic idea is that if a process can execute a computation in X seconds on a single processor, it should be possible to run the same process in X/n seconds on n processors. In practice, such a perfect speed-up is rarely achievable owing to overhead and various barriers to splitting a problem into n pieces, but substantial speed-ups are often possible.
The intuition behind splitting work resembles the pigeonhole principle from discrete mathematics: if n objects are placed in m boxes and n > m, then at least one box must contain more than one object, because the objects cannot be spread one per box. Historically, parallel computation sat squarely in the "high-performance computing" domain, where expensive machines were linked via high-speed networking to create large clusters of computers. In that scenario, sophisticated software was needed to manage data communication between the computers in the cluster; parallel computing was a highly tuned and carefully customised operation, not something one could saunter into casually. Today, almost all computers contain multiple processors or cores, so accessing a "cluster" of CPUs is much easier than it used to be, which has opened the door to parallel computing for many people.
This article discusses the functionality in R and Python for executing parallel computations, with a particular focus on functions that exploit multi-core computers. More traditional parallel computing in the network-of-workstations style is possible but is not covered here.
What is memory pressure in parallel computing?
Out-of-memory (OOM) errors are a common issue in data-intensive parallel programs, causing poor performance and execution failures. A recent study of memory pressure in parallel computing proposed a new programming model to address memory pressure in data-parallel programs. One obvious fix is to increase the machine's RAM or restructure the parallel program to use less memory; the latter, although effective, requires extensive changes to the program. Lightweight virtualisation, such as OS containers (Docker, orchestrated by Kubernetes), can address the memory pressure in data-parallel programs with much less effort.
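One way containers relieve memory pressure is by bounding each workload explicitly. A minimal sketch of a Kubernetes pod spec with memory requests and limits follows; the pod name, container name and image are illustrative placeholders, not taken from the article.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: r-worker                # illustrative name
spec:
  containers:
  - name: monte-carlo
    image: registry.example.com/monte-carlo:latest   # placeholder image
    resources:
      requests:
        memory: "2Gi"           # the scheduler reserves this much
      limits:
        memory: "4Gi"           # the container is OOM-killed above this
```

Setting `requests` lets the scheduler pack pods onto nodes without overcommitting, while `limits` ensures one runaway data-parallel worker cannot starve the others.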
Unless there is an infinite loop, programs in R and Python run sequentially, executing one step at a time until they complete. Sequential execution is the standard mode of operation on all computers; however, it can take a long time when a process must work through large amounts of data. In such instances, the process should be implemented with parallel computing: portions of the computation are defined to run simultaneously, using multiple processors on a single machine or multiple machines.
As the amount of data generated by enterprises and organisations has grown exponentially, big data analytics has become increasingly popular in data mining for business intelligence, machine learning for scientific discovery, and data warehousing. Consequently, data-parallel programming for big data analytics requires harnessing the power of many computers in a cluster to process large amounts of data simultaneously.
Programs should be designed to run concurrently, each worker operating on a portion of the data; in doing so, the parallel program will be faster than a corresponding sequential version that depends on one computer to process all of the data. This category of program is said to exhibit data parallelism because the data is split and processed in parallel.
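A minimal sketch of data parallelism in Python using only the standard library: the data is split into chunks, each chunk is processed by a separate worker process, and the partial results are combined. The chunking scheme and the worker function are illustrative, not from the article.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Illustrative worker: each process sums the squares of its chunk.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=4):
    # Split the data into one chunk per worker (data parallelism).
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(processes=n_workers) as pool:
        partial_results = pool.map(process_chunk, chunks)
    # Combine the partial results from each worker.
    return sum(partial_results)

if __name__ == "__main__":
    data = list(range(1_000))
    result = parallel_sum_of_squares(data)
```

Because the chunks are independent, the workers never need to communicate, which is what makes this category of program easy to scale.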
Several data-parallel architectures have been developed for handling Big Data; Kubernetes is among the best known for orchestrating such workloads. Data parallelism is also available in R and Python, popular open-source languages adopted by many huge IT companies, such as Yahoo, Facebook, eBay, Amazon and more.
Use case: a Monte Carlo simulation for finance
Many computational libraries have built-in parallelism for behind-the-scenes use; this kind of "backend parallelism" does not usually affect the code and will improve computational efficiency. However, monitoring what is happening in the background is beneficial because data parallelism may affect other work on the machine. Also, specific shared computing environments may have rules about how many cores/CPUs an application is allowed to use; if a function fires off multiple parallel jobs, it may cause problems for other applications sharing the system.
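A small Python sketch of respecting such a core limit: the available core count is checked and the worker pool is capped. The cap value is an illustrative assumption; a real shared environment would publish its own limit.

```python
import os
from multiprocessing import Pool

# Illustrative cap imposed by a shared environment's usage rules.
MAX_ALLOWED_CORES = 2

# Never request more workers than the machine has, or than the rules allow.
n_workers = min(os.cpu_count() or 1, MAX_ALLOWED_CORES)

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=n_workers) as pool:
        results = pool.map(square, range(10))
```

Capping explicitly beats relying on library defaults, which often claim every core on the machine.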
A Monte Carlo simulation is an experiment, usually conducted on a computer, that involves the use of random numbers. A random number stream is a sequence of statistically independent random variables uniformly distributed in the interval (0, 1). Examples of situations where simulation has proved valuable include:
1. Modelling the flow of patients through a hospital
2. Modelling the evolution of an epidemic over space and time
3. Testing a statistical hypothesis
4. Pricing an option (derivative) on a financial asset
A challenging feature common to all these examples is the difficulty of applying purely analytical methods, whether to model the real-life situation [examples (1) and (2)] or to solve the underlying mathematical problem [examples (3) and (4)]. In examples (1) and (2) the systems are stochastic, and there may be complex interactions between resources and parts of the system, a difficulty compounded by the requirement to find an optimal strategy. In example (3), having obtained data from a statistical investigation, the numerical value of some test statistic is calculated, but the distribution of that statistic under the null hypothesis may be too complex to derive. In example (4), the problem often reduces to evaluating a multiple integral that is impossible to solve by analytical or conventional numerical methods. However, such integrals can be estimated by Monte Carlo methods, which have been used since the 1940s to evaluate definite multiple integrals in mathematical physics. There is a resurgence of interest in these methods, particularly in finance and statistical inference.
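The core idea behind example (4) fits in a few lines: an integral is estimated by averaging the integrand at uniformly distributed random points and scaling by the width of the interval. This is a minimal one-dimensional sketch; the helper name and the test integrand are illustrative.

```python
import random

def mc_integrate(f, a, b, n=100_000, seed=0):
    # Estimate the definite integral of f over [a, b] by averaging f at
    # uniformly distributed random points and scaling by the interval width.
    rng = random.Random(seed)
    total = sum(f(a + (b - a) * rng.random()) for _ in range(n))
    return (b - a) * total / n

# The integral of x^2 over [0, 1] is exactly 1/3; the Monte Carlo
# estimate converges to it at a rate of O(1/sqrt(n)).
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0)
```

The same averaging idea extends directly to multiple integrals, where conventional numerical quadrature becomes impractical.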
The steps to get started with a Monte Carlo simulation are:
Step 1: Identify the transfer equation
The mathematical expression of your process is called the "transfer equation". It can be a complex equation, even one with multiple responses that may depend on each other.
Step 2: Define the input parameters
By defining the input parameters, the distribution of data is determined. Some inputs may follow the normal distribution, while others follow a triangular or uniform distribution. It is essential to determine distribution parameters for each input. For example, it is necessary to specify the mean and standard deviation for inputs that follow a normal distribution.
Step 3: Set up simulation
For an accurate simulation, create an extensive random data set for each input; for example, something in the range of 100,000 instances. These random data points simulate the values that would be seen over a long period for each input.
Step 4: Analyse process output
With the simulated data in place, the transfer equation calculates simulated outcomes.
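The four steps above can be sketched in a few lines of Python using only the standard library. The transfer equation (a simple profit model) and the input distributions chosen here are illustrative assumptions, not values from the article.

```python
import random
import statistics

random.seed(42)
N = 100_000  # Step 3: an extensive random data set for each input

# Step 1: the transfer equation (illustrative): profit = units * (price - cost)
def transfer(units, price, cost):
    return units * (price - cost)

# Step 2: input parameters and their assumed distributions
units = [random.gauss(1000, 100) for _ in range(N)]           # normal
price = [random.uniform(9.0, 11.0) for _ in range(N)]         # uniform
cost = [random.triangular(5.0, 8.0, 6.0) for _ in range(N)]   # triangular

# Step 4: apply the transfer equation and analyse the simulated output
profit = [transfer(u, p, c) for u, p, c in zip(units, price, cost)]
mean_profit = statistics.mean(profit)
```

Each simulated draw feeds the transfer equation once, so the 100,000 outputs approximate the distribution of outcomes the process would show over a long period.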
Data parallelism in R
Given R's fast-changing open-source foundation, R is a good match for Docker, which provides a way to stabilise production environments in a shifting landscape of dependencies. Kubernetes pods execute the deployed R scripts inside Docker containers. Kubernetes deployments can include applications such as R-powered APIs using OpenCPU or plumber, Shiny apps, batch R jobs that scale horizontally over many CPUs, and scheduled analyses.
Applying data parallelism in R code and deploying it through a Docker image on Kubernetes pods makes the R application reliable, flexible, and production-ready. It also scales, runs on many cloud providers, including Google, AWS, and Azure, and, once set up, is easy to deploy. In most cases, a push to GitLab can be the trigger that serves the R application.
The simplest use case of parallelisation is running the same script repeatedly, in parallel rather than sequentially. A classic example is a Monte Carlo simulation, i.e. the random generation of numbers given a fixed set of parameters. The following model simulates stock prices after a year (365 days), given fixed values for the standard deviation and the average stock price movement per day.
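A minimal sketch of what such a Monte-Carlo.R script might look like; the starting price, daily drift and daily standard deviation are illustrative assumptions, not values from the article.

```r
# Monte-Carlo.R -- one simulated path of a stock price over 365 days.
# The starting price, daily drift and daily standard deviation below are
# illustrative assumptions.
set.seed(as.integer(Sys.time()))

start_price <- 100      # initial stock price
mu          <- 0.0005   # average relative price movement per day
sigma       <- 0.01     # standard deviation of the daily movement
days        <- 365

# Each day's price is the previous price scaled by a random daily return.
daily_returns <- 1 + rnorm(days, mean = mu, sd = sigma)
final_price   <- start_price * prod(daily_returns)

# Print the result so that each Kubernetes pod's log carries one price.
cat(final_price, "\n")
```

Writing the result to standard output is what later lets the pod logs serve as the simulation's output channel.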
A Dockerfile encapsulates the R script Monte-Carlo.R. Remember, Kubernetes works with containers and can pull them directly from Docker Hub.
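A minimal sketch of such a Dockerfile; the r-base base image and the script path are assumptions, not taken from the article.

```dockerfile
# Illustrative Dockerfile; the base image and paths are assumptions.
FROM r-base:latest
COPY Monte-Carlo.R /usr/local/src/Monte-Carlo.R
CMD ["Rscript", "/usr/local/src/Monte-Carlo.R"]
```

Once built and pushed to Docker Hub, the image can be referenced by name from the Kubernetes job manifest.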
Now, the Kubernetes element comes in the form of a job.yaml file that contains the instructions for the controller. Note that under spec: we specify the number of pods that run the job in parallel (Kubernetes handles the distribution of pods over nodes) and the number of completions. Each pod picks up a single run and exits after the script has finished. By the end of this workload, 100 pods have been created, run and terminated.
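A minimal sketch of such a job.yaml. The 100 completions match the workload described above; the parallelism value and the image name are illustrative assumptions.

```yaml
# Illustrative job.yaml; the image name is a placeholder on Docker Hub.
apiVersion: batch/v1
kind: Job
metadata:
  name: monte-carlo
spec:
  parallelism: 10       # pods running at the same time (assumed value)
  completions: 100      # total runs; 100 pods over the whole workload
  template:
    spec:
      containers:
      - name: monte-carlo
        image: mydockerhubuser/monte-carlo:latest
      restartPolicy: Never
```

The controller keeps `parallelism` pods running until `completions` successful runs have accumulated, which is exactly the repeated-script pattern of the simulation.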
With everything in place (R script, Dockerfile, .yaml file), GitLab is ready to deploy the job to Kubernetes. Assuming the relevant services are enabled in the cloud console (the cloud SDK downloaded and kubectl installed), a Kubernetes cluster can be created and the workload deployed there.
Once the Kubernetes cluster is up, the kubectl command that creates a job parallelising the execution of the R script Monte-Carlo.R is:
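A sketch of the commands, assuming the manifest above is saved as job.yaml:

```shell
# Create the job from the manifest; Kubernetes distributes the pods.
kubectl apply -f job.yaml

# Watch the pods start, run and terminate.
kubectl get pods --watch
```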
Upon completion, the next step is to collect the Monte Carlo simulation's output from the logs of each Kubernetes pod and write it to a .txt file:
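A sketch of collecting the logs, assuming the job is named monte-carlo as in the manifest above (job pods carry a job-name label, which makes them easy to select):

```shell
# Append each pod's log line to a single output file.
for pod in $(kubectl get pods --selector=job-name=monte-carlo \
             --output=jsonpath='{.items[*].metadata.name}'); do
  kubectl logs "$pod" >> monte-carlo-output.txt
done
```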
Reading the output into R, an R plot displays the results:
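A minimal sketch of the plotting step in R; the output filename is an assumption carried over from the log-collection step.

```r
# Read the collected simulation output and plot a histogram of the prices.
prices <- scan("monte-carlo-output.txt")
hist(prices,
     breaks = 20,
     main   = "Histogram of stock prices",
     xlab   = "Simulated price after 365 days")
```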
Histogram of stock prices
Data parallelism in Python
The same files drive data parallelism in Python as in R, including the Monte Carlo simulation; only the script itself changes, with monte-carlo.py (.py being the Python extension) replacing Monte-Carlo.R.
A Monte Carlo simulation is a valuable tool for predicting future results by calculating a formula multiple times with different random inputs. Excel can execute such a process, but it is complicated without some VBA or potentially expensive third-party plugins. Using NumPy and pandas to build a model, generate multiple potential results, and analyse them is straightforward. An added benefit is that analysts can run many scenarios by changing the inputs, and can move on to much more sophisticated models in the future if the need arises. Finally, the results can be shared with non-technical users, facilitating discussions around the uncertainty of the final outcome.
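A minimal NumPy/pandas sketch of what such a monte-carlo.py might look like, vectorising many scenarios at once; the price parameters and scenario count are illustrative assumptions, not values from the article.

```python
# monte-carlo.py -- illustrative NumPy/pandas version of the simulation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

n_scenarios = 10_000
days = 365
start_price = 100.0
mu, sigma = 0.0005, 0.01   # assumed daily drift and volatility

# Each row is one scenario: 365 random daily returns compounded together.
daily_returns = 1 + rng.normal(mu, sigma, size=(n_scenarios, days))
final_prices = start_price * daily_returns.prod(axis=1)

# pandas makes it easy to analyse and share the distribution of outcomes.
summary = pd.Series(final_prices).describe()
```

Changing the inputs (drift, volatility, scenario count) and re-running is all it takes to explore new scenarios, and `describe()` gives a summary that is easy to discuss with non-technical users.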
Big Data analytics and associated technologies can add immense value to businesses. Still, using these technologies makes it difficult for an organisation to keep firm control of a vast and heterogeneous data collection so that it can be further analysed and investigated. Big Data analytics offers enormous potential for facing competitors and supporting the growth of individual companies.
Certain factors must be respected to produce timely and productive results from Big Data analytics; used precisely, it can increase output, modernisation, and effectiveness for entire divisions and economies. Other benefits include managing and reusing data sources to build practical applications and services. Evaluating the best approach to cleaning, filtering and analysing the data is vital, and data parallelism can be used for optimised analytic processing.
Data parallelism in R and Python, deployed with the open-source Kubernetes and its lightweight containers, can accelerate the processing of massive amounts of data through distributed processes, creating fast responses. It can be adopted and customised to meet various development requirements and scales by increasing the number of nodes available for processing.
The extensibility and simplicity of data parallelism in R and Python, when deployed by Kubernetes, are key differentiators that make it a promising tool for data processing.
About the authors
Lead author Patricia Gómez Bello is a Data Engineer Manager based in Madrid. Fabrizio Invernizzi is a Data Engineering Director also from the Madrid office.