Social and societal developments have their real-world manifestations in urban space, and social and economic developments in urban areas are reflected in the structural characteristics of urban sub-areas. Urban geography is well suited to examining these attributes and their development by means of analytical techniques that analyze the areal structure of urban communities. Empirical urban research is both regional research specifically in urban areas and social or socio-spatial research. The methods correspond to those of regional geography insofar as they are used to delineate and observe structural change in agglomerations, city centers/cores, urban expansions, and suburban areas. Within the city itself, the research units are districts and neighborhoods as well as other ‘official’ spatial units of division defined for planning, political, or statistical purposes (e.g., planning units, school and electoral districts, street rows, and blocks). Urban sub-areas may be of any scale: census tracts are commonly used as statistical reference areas, and micro-scale urban social geography also makes use of block-level data to characterize the increasing differentiation of urban social milieus.

The methods of empirical urban sub-area analysis allow for urban social monitoring. This refers to the inventory, documentation and analysis of detailed socioeconomic structural patterns and processes of change. Complex spatial processes are broken down into individual components.

Statistical techniques aim to characterize and analyze urban space, urban sub-areas, and urban structural developments comprehensively. Three approaches are important:

(a)

For descriptive purposes, methods include computer-assisted cartography and the refined cartographic and analytic methods enabled by Geographic Information Systems (GIS). These may produce unidimensional or multidimensional maps of social, demographic, or other phenomena as they are differentiated in urban space. GIS is also well suited to establishing a long-term statistical cartographic database that can be updated periodically. Such a database simplifies thematic longitudinal analysis of the target urban region with regard to social, economic, and demographic processes and forecasts.

(b)

In a more exploratory sense, factorial ecological investigations use multivariate descriptive statistical techniques (the methods of factor analysis) to identify the essential dimensions that characterize and differentiate urban sub-areas in terms of social science variables. Underlying the concept of urban social areas is the assumption that societal processes mirror natural processes in that they have a competitive dimension that can lead to processes of selection. Social structures and social change in space are seen as the result of the mutual adaptation of competing species. According to R. E. Park (1936), socio-ecological studies deal with processes that either uphold an existing social balance or disturb the existing order in order to reach a new, relatively stable state. One specific type of factorial ecology is social area analysis, which is based on the theory of Shevky and Bell, who understood urban social space as being primarily characterized by social rank, urbanism, and ethnicity; as such, it works with only a limited set of input variables. Cluster analyses subsequently performed on the factor-analyzed urban sub-areas can help identify groups of sub-areas with common patterns of variability (a minimal computational sketch follows item (c) below).

(c)

In order to understand the determinants of and processes responsible for such patterns, one may combine descriptive and analytical statistical techniques. Factor scores from factorial analyses may, for example, be used as input data in multiple regression analyses that relate these aggregate characteristics to explanatory variables.
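To make approaches (b) and (c) concrete, the sketch below runs a factor analysis on a small set of hypothetical census-tract indicators, clusters the resulting factor scores, and then regresses a hypothetical outcome on those scores. All variable names and data are invented for illustration; they are not drawn from any particular study.

```python
# Minimal sketch of approaches (b) and (c): factor analysis of urban sub-area
# indicators, clustering of the factor scores, and a regression that relates
# the scores to an outcome. All data and variable names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_tracts = 200

# Hypothetical census-tract indicators (one row per urban sub-area).
tracts = pd.DataFrame({
    "median_income":      rng.normal(45000, 12000, n_tracts),
    "pct_university":     rng.uniform(5, 60, n_tracts),
    "pct_renters":        rng.uniform(10, 90, n_tracts),
    "avg_household_size": rng.normal(2.4, 0.5, n_tracts),
    "pct_foreign_born":   rng.uniform(2, 40, n_tracts),
})

# (b) Factor analysis: reduce the indicators to a few latent dimensions
#     (e.g., "social rank", "urbanism", "ethnicity" in Shevky-Bell terms).
X = StandardScaler().fit_transform(tracts)
fa = FactorAnalysis(n_components=3, random_state=0)
scores = fa.fit_transform(X)                       # factor scores per tract
loadings = pd.DataFrame(fa.components_.T,
                        index=tracts.columns,
                        columns=["F1", "F2", "F3"])
print(loadings.round(2))

# Cluster the factor-analyzed sub-areas into groups with similar profiles.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)
print("sub-areas per cluster:", np.bincount(clusters))

# (c) Use the factor scores as explanatory variables in a multiple regression,
#     here against a hypothetical outcome such as residential mobility.
mobility = 0.3 * scores[:, 0] - 0.2 * scores[:, 1] + rng.normal(0, 0.5, n_tracts)
reg = LinearRegression().fit(scores, mobility)
print("R^2:", round(reg.score(scores, mobility), 3),
      "coefficients:", reg.coef_.round(3))
```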

Social monitoring of urban sub-area characteristics over time enables a scientifically sound evaluation of current structural change. For this, urban geography falls back on existing statistical data collected by public and private institutions or public welfare organizations. Because these data reflect institutional norms and goals, urban geography has no influence on the exact questions, the survey method, or the aggregation and systematization of the indicators. Consequently, theoretically informed urban research is limited by the quality of these (secondary) data sources. However, the quality of official databases and the methodology of secondary research in the spatial and thematic aggregation of data are improving continually.

URL: https://www.sciencedirect.com/science/article/pii/B0080430767025894

Analytics Defined

Mark Ryan M. Talabis, ... D. Kaye, in Information Security Analytics, 2015

General Statistics

Even simple statistical techniques are helpful in providing insights about data. For example, measures such as extreme values, the mean, the median, standard deviations, interquartile ranges, and distance formulas are useful in exploring, summarizing, and visualizing data. These techniques, though relatively simple, are a good starting point for exploratory data analysis; they are useful in uncovering interesting trends, outliers, and patterns in the data. After identifying areas of interest, you can explore the data further using more advanced techniques.
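As a minimal illustration of such an exploratory pass, the snippet below computes a few of these summary measures on a synthetic series and applies the common 1.5 × IQR rule to flag extreme values; the data and thresholds are illustrative only.

```python
# A short exploratory pass over a numeric series: location, spread, and a
# simple interquartile-range rule for flagging outliers. The data is synthetic;
# in practice it might be, e.g., connection counts or packet sizes.
import numpy as np

rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(100, 15, 500), [310.0, 5.0]])  # two planted extremes

print("min/max:", values.min(), values.max())
print("mean:", values.mean().round(2), "median:", np.median(values))
print("std dev:", values.std(ddof=1).round(2))

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print("IQR:", iqr.round(2), "-> flagged outliers:", outliers)
```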

We wrote this book with the assumption that the reader has a solid understanding of general statistics. A search on the Internet for “statistical techniques” or “statistics analysis” will provide you with many resources to refresh your skills. In Chapter 4, we will use some of these general statistical techniques.

URL: https://www.sciencedirect.com/science/article/pii/B9780128002070000010

Applications of data-driven model-based methods for process state estimation

Ch. Venkateswarlu, Rama Rao Karri, in Optimal State Estimation for Process Monitoring, Fault Diagnosis and Control, 2022

13.3.1 State estimator development

The description of the batch distillation considered for the design and implementation of the ANN-based state estimator is the same as that given in Section 13.2.1.1 and can be found elsewhere [5]. The details of the measurement configuration for state estimation and the development of the state estimator are given as follows.

13.3.1.1 Measurement sensors selection

A multivariate statistical technique, principal component analysis (PCA), is used to select the temperature measurements that serve as inputs to the composition estimator. The configuration of temperature measurements for state estimation in multicomponent batch distillation using PCA is described in Section 7.5 of Chapter 7: Application of Mechanistic Model-Based Nonlinear Filtering and Observation Techniques for Optimal State Estimation in Multicomponent Batch Distillation.
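The short sketch below illustrates the general idea of using PCA loadings to shortlist temperature measurements as estimator inputs. The tray-temperature data is synthetic, and the ranking heuristic is a simplified stand-in for the selection procedure described in Section 7.5 of Chapter 7.

```python
# Sketch of using PCA loadings to shortlist temperature measurements as
# estimator inputs. The tray-temperature matrix is synthetic and the ranking
# heuristic is illustrative, not the book's actual selection procedure.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n_samples, n_trays = 300, 10

# Hypothetical tray temperatures driven by two latent composition "modes".
latent = rng.normal(size=(n_samples, 2))
mixing = rng.normal(size=(2, n_trays))
temperatures = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_trays))

pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(temperatures))
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# Rank trays by their contribution to the retained principal components and
# keep the top few as candidate sensor locations for the composition estimator.
contribution = np.abs(pca.components_).sum(axis=0)
selected_trays = np.argsort(contribution)[::-1][:3]
print("candidate measurement trays (0-indexed):", selected_trays)
```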

13.3.1.2 Development of artificial neural network composition estimator

The most widely used ANN paradigm is the multilayered feed-forward network (MFFN), or multilayer perceptron, mostly comprising three sequentially arranged layers of processing units. An MFFN provides a mapping between an input (x) and an output (y) through a nonlinear function f, that is, y = f(x). The three-layered MFFN has input, hidden, and output layers, each comprising nodes. All the nodes in the input layer are connected by weighted links to the hidden layer nodes; similar links exist between the hidden and output layer nodes. Usually, the input and hidden layers also contain a bias node with a constant output of one. The nodes in the input layer do not perform any numerical processing; the hidden and output layer nodes perform all numerical processing and are termed active nodes. The details concerning the structure, processing functions, training, and information processing of feed-forward neural networks are elaborated in Section 5.4 of Chapter 5. In this study, two individual neural networks are configured for composition estimation in the distillate stream and the reboiler. The schematic of the artificial neural composition soft sensor for batch distillation is shown in Fig. 13.5. The notation for the symbols in batch distillation can be found in Section 7.2 of Chapter 7.

Figure 13.5. Artificial neural composition estimation scheme for multicomponent batch distillation.

The problem of neural network modeling is to obtain a set of weights such that the sum of the squared prediction errors, defined by the differences between the network-predicted outputs and the desired outputs, is minimized. Iterative training makes the network recognize patterns in the data and creates an internal model, which provides predictions for new input conditions. The inputs for both the distillate stream and reboiler networks are the temperature measurements at different operating conditions and a unit bias. The distillate section network outputs the product compositions of cyclohexane and heptane, and the reboiler network outputs the product compositions of toluene and heptane. The data sets representing the inputs and the corresponding outputs for training the neural networks can be obtained through simulation of the batch distillation model. The details of an MFFN with back-propagation training by the generalized delta rule are reported in the literature [7].

Network training is an iterative procedure that begins with random initialization of the weight matrices. The learning process involves two types of passes: a forward pass and a reverse pass. In the forward pass, an input pattern from the example data set is applied to the input nodes; the weighted sum of the inputs to each active node is calculated and then transformed into an output using a nonlinear activation function such as the sigmoid function. The outputs of the hidden nodes computed in this manner form the inputs to the output layer nodes, whose outputs are evaluated similarly. In the reverse pass, the pattern-specific squared error is computed and used to update the network weights according to the gradient strategy. Repeating the weight-updating procedure for all the patterns in the training set completes one training iteration. The distillate and bottom ANN estimators are trained iteratively until convergence of the objective is achieved. The trained networks are then used to directly infer the product compositions from the temperature measurements of the batch distillation column.
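The sketch below implements this training loop from scratch for a small three-layer network with sigmoid hidden units and a sum-of-squared-errors objective, trained by gradient descent (generalized delta rule). The temperature/composition data is synthetic and merely stands in for the simulated column data; it is not the model of [5].

```python
# From-scratch sketch of the training loop described above: a three-layer
# feed-forward network with sigmoid units, trained by gradient descent
# (generalized delta rule) on a sum-of-squared-errors objective.
# The temperature/composition data is synthetic.
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical training data: 4 temperature inputs -> 2 product compositions.
X = rng.uniform(340.0, 380.0, size=(500, 4))
Xn = (X - X.mean(axis=0)) / X.std(axis=0)             # scale inputs
Y = np.column_stack([
    0.5 + 0.3 * np.tanh(Xn[:, 0] - Xn[:, 2]),
    0.5 - 0.3 * np.tanh(Xn[:, 1] + Xn[:, 3]),
])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden, n_out = 4, 8, 2
W1 = rng.normal(0, 0.5, (n_in + 1, n_hidden))          # +1 row for the bias node
W2 = rng.normal(0, 0.5, (n_hidden + 1, n_out))
lr = 0.05

Xb = np.hstack([Xn, np.ones((len(Xn), 1))])            # append unit bias input
for epoch in range(2000):
    # Forward pass: weighted sums -> sigmoid hidden outputs -> network outputs.
    H = sigmoid(Xb @ W1)
    Hb = np.hstack([H, np.ones((len(H), 1))])
    Yhat = sigmoid(Hb @ W2)

    # Reverse pass: propagate the squared-error gradient back through the net.
    err = Yhat - Y
    delta_out = err * Yhat * (1.0 - Yhat)
    delta_hid = (delta_out @ W2[:-1].T) * H * (1.0 - H)
    W2 -= lr * Hb.T @ delta_out / len(Xb)
    W1 -= lr * Xb.T @ delta_hid / len(Xb)

print("final SSE:", round(float(np.sum((Yhat - Y) ** 2)), 3))
```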

URL: https://www.sciencedirect.com/science/article/pii/B9780323858786000087

Introduction to the Agent Approach

Fabrice Bouquet, ... Patrick Taillandier, in Agent-based Spatial Simulation with Netlogo, 2015

1.3.2.1 Statistical and econometric models

The application of statistical techniques in order to derive the mathematical relationships between dependent variables (factors whose value is influenced by other factors) and independent variables is widespread in the modeling of socioeconomic systems and in other fields [ANS 98].

The most commonly used statistical technique is multiple regression analysis (and its variants, such as stepwise regression or two-stage least squares regression), although other multivariate techniques, such as factor analysis or canonical analysis, are also widely used [KLE 07].
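As a minimal sketch of the workhorse technique mentioned above, the snippet below fits an ordinary least squares multiple regression of a hypothetical dependent variable on two hypothetical independent variables; the data set and variable names are invented.

```python
# Minimal multiple-regression sketch: a dependent variable modelled as a linear
# function of two independent variables. The data and names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 250
df = pd.DataFrame({
    "income":        rng.normal(30000, 8000, n),
    "accessibility": rng.uniform(0, 1, n),
})
# Hypothetical dependent variable, e.g., household car ownership rate.
df["car_ownership"] = (0.4
                       + 0.00001 * df["income"]
                       - 0.2 * df["accessibility"]
                       + rng.normal(0, 0.05, n))

model = smf.ols("car_ownership ~ income + accessibility", data=df).fit()
print(model.params.round(5))        # estimated coefficients
print(round(model.rsquared, 3))     # share of variance explained
```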

Econometric models are applications of multiple regression techniques used to analyze economic questions. They are systems of equations that express the relationships between demand and/or supply and their root causes, as well as the relationship between demand and supply themselves (economic/market equilibrium) [BAT 76, WIL 74]. A set of specialized statistical techniques, generally known as econometric analysis, was developed to estimate the coefficients of these equations [JUD 88].

The work of Irwin and Bockstael [IRW 02] should be mentioned at this point: they use an economic model to describe to what extent it is worthwhile for the owner of an undeveloped plot of land to transform it into a site for building habitation, depending on the sale value of the land once it has been transformed into a usable site and the cost of achieving this.

URL: https://www.sciencedirect.com/science/article/pii/B9781785480553500010

Pearson, Karl (1857–1936)

J. Aldrich, in International Encyclopedia of the Social & Behavioral Sciences, 2001

7 Summary

Of the many statistical techniques Pearson devised, only a few remain in use today and though his ideas sometimes find re-expression in more sophisticated form, such as the correlation curve or the generalized method of moments, there is little to suggest that Pearson continues to directly inspire work in statistics. Pearson broke with the theory of errors but in the next generation through the analysis of variance and regression the theory was restored to stand beside, even to overshadow, Pearsonian statistics. From a modern perspective Pearson's theory seems desperately superficial. Yet the problems he posed have retained their importance and the ground he claimed for the discipline of statistics has not been given up. Pearson was an extraordinarily prolific author and there is also a considerable secondary literature. There is a brief guide to this literature on the website http://www.soton.ac.uk/∼jcol

URL: https://www.sciencedirect.com/science/article/pii/B0080430767003181

13th International Symposium on Process Systems Engineering (PSE 2018)

Francesco Rossi, ... Gintaras Reklaitis, in Computer Aided Chemical Engineering, 2018

1 Introduction

The application of statistical techniques to the quantification of model uncertainty is a new paradigm, which has recently emerged due to the growing interest of industry and of the PSE community in stochastic optimization frameworks, robust design strategies and quantitative risk assessment. Specifically, strategies for uncertainty quantification are commonly applied in areas such as robust process/product design (especially within the pharmaceutical sector) (Mockus et al., 2011), drug delivery (Lainez et al., 2011) and robust optimization/control of industrial processes (Rossi et al., 2016).

Typically, model uncertainty quantification comes down to the estimation of the joint probability distribution (PDF) of some key uncertain parameters of the model, which is often a system of ordinary differential equations (ODE) or differential-algebraic equations (DAE). Currently, there exist two principal types of strategies applicable to the estimation of the PDF of the uncertain parameters of an ODE/DAE system: rigorous Bayesian inference coupled with random sampling approaches, e.g., Markov-chain Monte Carlo (Green and Worden, 2015), and approximate Bayesian inference exploiting optimization techniques, e.g., Variational Bayes (Beal, 2003) and frameworks based on the Laplace approximation (Dass et al., 2017). Although all of these techniques are well-established and commonly applied, they are usually very computationally demanding. Moreover, no systematic analyses have been conducted to assess their accuracy and computational efficiency, especially when ODE/DAE models must be dealt with.
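To give a feel for the first class of strategies, the toy sketch below applies random-walk Metropolis-Hastings sampling to the posterior PDF of a single uncertain parameter of a deliberately simple ODE (dy/dt = -ky) with synthetic measurements. It is a minimal stand-in for, not a reproduction of, the benchmark problems used in this work.

```python
# Toy random-walk Metropolis-Hastings sampler for the posterior PDF of one
# uncertain ODE parameter (dy/dt = -k*y), using synthetic "measurements".
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(5)
t_meas = np.linspace(0.2, 3.0, 12)
k_true, y0, sigma = 0.8, 1.0, 0.02

def simulate(k):
    sol = solve_ivp(lambda t, y: -k * y, (0.0, t_meas[-1]), [y0],
                    t_eval=t_meas, rtol=1e-8)
    return sol.y[0]

y_meas = simulate(k_true) + rng.normal(0, sigma, t_meas.size)

def log_posterior(k):
    if k <= 0.0 or k > 5.0:           # uniform prior on (0, 5]
        return -np.inf
    resid = y_meas - simulate(k)
    return -0.5 * np.sum((resid / sigma) ** 2)

samples, k_cur = [], 1.5
lp_cur = log_posterior(k_cur)
for it in range(5000):
    k_prop = k_cur + rng.normal(0, 0.05)           # random-walk proposal
    lp_prop = log_posterior(k_prop)
    if np.log(rng.uniform()) < lp_prop - lp_cur:   # Metropolis acceptance test
        k_cur, lp_cur = k_prop, lp_prop
    samples.append(k_cur)

samples = np.array(samples[1000:])                 # discard burn-in
print("posterior mean of k:", samples.mean().round(3),
      "std:", samples.std().round(3))
```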

Therefore, this contribution proposes a simulation-based comparison of two different PDF estimation strategies applied to ODE/DAE systems, namely Bayesian Markov-chain Monte Carlo (BMCMC) and a new approach, named PDFE&U, which relies on a combination of fitting, back-projection techniques, and maximum likelihood estimation. The levels of accuracy and computational efficiency attainable by these two methodologies are evaluated by analyzing their outputs, especially their PDFs, and by comparing their computational times. The analysis of the PDFs is performed using both contour plots and well-known statistical indicators, i.e., expectation, variance, covariance, and quantiles. The specific ODE/DAE models used as benchmark systems for this analysis include a batch adaptation of the Tennessee Eastman Challenge problem and a pharmacokinetic/pharmacodynamic (PB/PK) model.

The rest of the paper is organized as follows: first, we introduce the principal ideas on which PDFE&U relies; then, we report the most interesting results of our analysis of the accuracy and computational performance ensured by PDFE&U and BMCMC; finally, we discuss the most relevant consequences of these analyses.

URL: https://www.sciencedirect.com/science/article/pii/B9780444642417502524

29th European Symposium on Computer Aided Process Engineering

Francesco Rossi, ... Gintaras Reklaitis, in Computer Aided Chemical Engineering, 2019

1 Introduction

The application of statistical techniques to the quantification of model uncertainty is a new paradigm, which has recently emerged due to the growing interest of industry and of the PSE community in stochastic optimization, robust design, real-time quality control and quantitative risk assessment. As an example, strategies for uncertainty quantification have been applied in areas such as robust process and product design (Mockus et al., 2011), drug delivery (Laínez et al., 2011) and stochastic dynamic optimization (Rossi et al., 2016).

Typically, model uncertainty quantification comes down to the estimation of the joint probability distribution (PDF) of some key uncertain parameters of the model, which often consists of a system of differential-algebraic equations (DAEs). To solve this type of PDF estimation problem, we usually rely on Bayesian inference methods such as Bayesian Markov-chain Monte Carlo (Green and Worden, 2015), which are well-established but also extremely computationally demanding. Therefore, it is important to investigate and develop new approximate PDF estimation strategies, which offer a good trade-off between accuracy and computational efficiency, and to validate them against state-of-the-art Bayesian inference approaches.

To that end, this contribution considers two approximate PDF estimation strategies plus a conventional one, and compares them to identify the most suitable method for solving PDF estimation problems, in which the underlying model is a DAE system. The approximate PDF estimation methods, analysed in this manuscript, include: (I) a novel Bayesian Markov-chain Monte Carlo algorithm, where sampling is performed by optimization (ODMCMC); and (II) a likelihood-free approach, recently proposed by Rossi et al. (2018), which relies on a combination of parameter estimation, projection techniques and maximum likelihood estimation (PDFE&U). On the other hand, the conventional Bayesian inference strategy, included in this analysis, is standard Bayesian Markov-chain Monte Carlo (BMCMC). Note that we do not include Variational Inference (Beal, 2003) in our study because Yao et al. (2018) recently showed that this type of technique performs satisfactorily only for 28 % of all the problems they considered (over 200).

The comparison of approximate and conventional PDF estimation algorithms is performed by analysing both their computational efficiency and their outputs, i.e. their PDFs, using well-known statistical indicators (expectation, variance and quantiles) and the concept of confidence/credible region. The DAE model, selected for this study, is a pharmacokinetic (PB/PK) model for the administration of Gabapentin.
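The snippet below sketches how such outputs can be summarized once posterior samples are available, using the statistical indicators mentioned above (expectation, variance/covariance, quantiles) and an equal-tailed 95% credible interval. The samples are drawn synthetically and merely stand in for the output of BMCMC, ODMCMC, or PDFE&U.

```python
# Summarizing a set of posterior samples (here drawn synthetically for two
# parameters) with expectation, covariance, quantiles, and a 95% credible
# interval, as a stand-in for comparing BMCMC/ODMCMC/PDFE&U outputs.
import numpy as np

rng = np.random.default_rng(6)
samples = rng.multivariate_normal(mean=[0.8, 1.9],
                                  cov=[[0.010, 0.004], [0.004, 0.025]],
                                  size=20000)

print("expectation:", samples.mean(axis=0).round(3))
print("covariance:\n", np.cov(samples, rowvar=False).round(4))
print("5%/50%/95% quantiles:\n",
      np.percentile(samples, [5, 50, 95], axis=0).round(3))

# Marginal equal-tailed 95% credible interval for each parameter.
lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
for i, (a, b) in enumerate(zip(lo, hi)):
    print(f"parameter {i}: 95% credible interval [{a:.3f}, {b:.3f}]")
```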

The rest of the paper is organized as follows: first, we introduce the rationale of PDFE&U and ODMCMC, with particular emphasis on the latter; then, we report the most significant results of our analysis on the accuracy and computational performance of PDFE&U, ODMCMC and BMCMC; finally, we discuss the most relevant consequences of these analyses.

URL: https://www.sciencedirect.com/science/article/pii/B9780128186343501569

Anomaly detection

Patrick Schneider, Fatos Xhafa, in Anomaly Detection and Complex Event Processing over IoT Data Streams, 2022

Statistical techniques

Statistical techniques use measurements to approximate a model of the data. Whenever a new measurement is registered, it is compared to the model and, if it is statistically incompatible with it, it is marked as an anomaly [27]. Statistical methods can be applied to single elements or to window segments, and approximating over a window improves the estimate. However, a priori knowledge of the data distribution is required, which is often unavailable when data evolve over time.
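A minimal sketch of this idea, assuming a simple Gaussian model fitted over a sliding window and a |z| > 3 incompatibility test, is given below; the stream values and thresholds are illustrative only.

```python
# Minimal sketch: fit a Gaussian to a sliding window of past measurements and
# flag a new value as anomalous when it is statistically incompatible with
# that model (|z| > 3). The stream values are synthetic.
from collections import deque
import numpy as np

WINDOW, Z_THRESHOLD = 100, 3.0
window = deque(maxlen=WINDOW)

def is_anomaly(x):
    """Return True if x is an outlier w.r.t. the current window model."""
    if len(window) >= 30:                        # need enough history to fit
        mu, sd = np.mean(window), np.std(window) + 1e-9
        if abs(x - mu) / sd > Z_THRESHOLD:
            return True                          # do not absorb anomalies
    window.append(x)                             # update model with normal data
    return False

rng = np.random.default_rng(7)
stream = list(rng.normal(20.0, 2.0, 500))
stream[400] = 45.0                               # inject an anomaly
flags = [i for i, x in enumerate(stream) if is_anomaly(x)]
print("anomalous indices:", flags)
```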

Among the main advantages of statistical techniques we could distinguish:

1. If the assumptions of the data distribution hold, statistical techniques provide a statistically justifiable solution for anomaly detection.

2. The provided anomaly score is linked to a confidence interval that can be used as additional information while a decision is being made regarding a test instance.

3. If the estimation of the distribution is robust to anomalies, statistical techniques can operate in an unsupervised environment without the need for labeled training data.

There are, however, several limitations of statistical techniques:

1. The main drawback of statistical techniques is the assumption that data is generated from a specific distribution. This assumption often does not hold, particularly for high-dimensional real data.

2. Even if the statistical assumption can be reasonably justified, several hypothesis-testing statistics can be used to detect anomalies, and choosing the best statistic is often not an easy task [42]. Constructing hypothesis tests for complex distributions that fit high-dimensional data sets is in general a complicated task.

3. Histogram-based techniques are relatively easy to implement, but a major disadvantage for multivariate data is that they cannot capture the interactions between different attributes. An anomaly can have attribute values that are very common individually, while their combination is unique, i.e., anomalous. A histogram technique applied per attribute might not be able to detect such anomalies.

URL: https://www.sciencedirect.com/science/article/pii/B9780128238189000134

European Symposium on Computer Aided Process Engineering-12

David Stockill, in Computer Aided Chemical Engineering, 2002

Statistics and Process Data

The use of specialised statistical techniques to analyse process data has not been common practice, certainly in the large-scale petrochemical and energy businesses. Its role in product development, the laboratory, and the discrete parts manufacturing environment (e.g., SPC) is well known. However, statistical toolboxes and modelling packages are becoming available which allow the application of techniques such as principal component analysis, rank correlation and so forth without the need to code up programs in specialised maths packages. Increasingly, the basic tools available in packages such as Excel offer possibilities that until a few years ago were out of reach of non-specialists. The issue is now the sensible and educated use of these techniques.
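As a small example of the kind of analysis that no longer requires coding in a specialised maths package, the snippet below computes a Spearman rank correlation between two invented process variables using an off-the-shelf scientific Python library.

```python
# Spearman rank correlation between two process variables.
# The process data is invented for illustration.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(8)
reactor_temp = rng.normal(350.0, 5.0, 200)
# Hypothetical yield that rises monotonically (but nonlinearly) with temperature.
product_yield = (0.6 + 0.05 * np.tanh((reactor_temp - 350.0) / 5.0)
                 + rng.normal(0, 0.01, 200))

rho, p_value = spearmanr(reactor_temp, product_yield)
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.3g}")
```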

Unlike (say) advanced control, where “standard” solutions have evolved for typical applications (e.g., Cat Cracker Reactor-Regenerator control), the use of statistics tends to be more of a consultancy approach to problem solving. The problems tend to be unique – the approach is to bring together a skilled practitioner, with a personal toolbox of skills and techniques, and the business problem. This, of course, does present its own difficulties in introducing the technology to new users – there isn’t such a simple “off the shelf” mentality.

URL: https://www.sciencedirect.com/science/article/pii/S1570794602800402

26th European Symposium on Computer Aided Process Engineering

Filippo Dal-Pastro, ... Massimiliano Barolo, in Computer Aided Chemical Engineering, 2016

2.2 Mathematical methods

PLS is a multivariate statistical technique that is used to relate an X [I × N] matrix of input variables with a Y [I × M] matrix of responses. The relationship is based on the projection onto a common space of uncorrelated variables called latent variables (LVs). The LVs explain the major sources of systematic variability of the inputs that are mostly correlated to the variability of the outputs. PLS represents X and Y as follows:

(1) X = TPᵀ + EX,

(2) Y = TQᵀ + EY,

(3) T = XW*,

where T [I × A] is the score matrix, P [N × A] and Q [M × A] are the loading matrices and W* [N × A] is the weight matrix. EX [I × N] and EY [I × M] are the residual matrices accounting for the model mismatch. A represents the number of significant LVs chosen to build the model.

The score matrix T of a PLS model represents the coordinates of the observations on the space identified by the latent variables. Stated differently, it describes how the samples relate to each other. This kind of model can be exploited to identify the main sources of variability that are related to the system outputs; for the system under investigation, the model can be used to explain the variability in the process settings and wheat properties that are related to the variability of the product PSD. Loading plots can be used to quantify the correlation between input variables and responses. Namely, the effect of the input variables (i.e., process settings and wheat properties) on the output PSD can be assessed.
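A minimal sketch of a PLS model with the structure given in Eqs. (1)-(3), built with scikit-learn's PLSRegression on random stand-in data for the process settings/wheat properties (X) and the product PSD responses (Y), is shown below; the data and dimensions are illustrative only.

```python
# Sketch of a PLS model of the form X = TP' + EX, Y = TQ' + EY, T = XW*,
# using scikit-learn. X and Y here are random stand-ins for the process
# settings/wheat properties and the product PSD responses.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(9)
I, N, M, A = 120, 8, 3, 2                # observations, inputs, responses, LVs

latent = rng.normal(size=(I, A))          # hidden common structure
X = latent @ rng.normal(size=(A, N)) + 0.2 * rng.normal(size=(I, N))
Y = latent @ rng.normal(size=(A, M)) + 0.2 * rng.normal(size=(I, M))

pls = PLSRegression(n_components=A).fit(X, Y)
T = pls.transform(X)        # score matrix T [I x A] for the calibration data
P = pls.x_loadings_         # loading matrix P [N x A]
Q = pls.y_loadings_         # loading matrix Q [M x A]
W_star = pls.x_rotations_   # weight matrix W* [N x A], so T = X(centred, scaled) W*
print("T:", T.shape, "P:", P.shape, "Q:", Q.shape, "W*:", W_star.shape)

# Loadings indicate which inputs drive which responses; predictions come from
# the fitted latent structure.
Y_hat = pls.predict(X)
print("R^2 on the calibration data:", round(pls.score(X, Y), 3))
```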
