BEST Viewpoints: Exploratory Viewpoints

Exploratory Viewpoints

This module provides several exploratory data analysis methods for visualizing basic patterns and information in data. Line plots (control charts, date list plots, etc.), histograms, box plots, scatter plots, bubble charts, confidence intervals, and hypothesis tests are easily set up for data exploration.
This module is similar in appearance to Hierarchical Viewpoints. Studying the documentation for Hierarchical Viewpoints first is recommended, because many options for Exploratory Viewpoints are exactly the same as those for Hierarchical Viewpoints and are therefore explained only once.

Introduction

Once data is loaded, analysis fields are ready for selection, as shown below. The options in the Explorer menu select the type of analysis or chart to apply to the selected data fields. In the next example, the field Sales was selected as the target analysis field, and a Combo of charts was generated because of the selection in the Explorer menu.
The selection made in the Explorer menu, as well as other options selected in the Build menu, may change the menu to its right. For example, this menu is initially set to Plot Type, which is the type of plot used for the line plots in the Combo (explained in the Combo section). When the Build menu is set to Histogram, the second menu becomes the Preset menu (see image below), which facilitates the selection of options for the Histogram.
If Lines is selected in the Build menu, Plot Type becomes available again and the type of Line Plot to use can be selected, as shown below.
This dynamic behavior of menus and options is intended to make the application as user friendly as possible by exposing only the options that are useful for the selections made by the user. Finally, the options in the menus are designed to be self-explanatory once the Mathematica options for the selected type of chart are known and understood. This document therefore does not describe the effect of each option on the selected type of chart; the user should read the pertinent Mathematica documentation to understand them.

Total Groups Limit

BEST Viewpoints can process virtually any combination of analysis fields (categorical and non-categorical). This is a very powerful and convenient feature of the program; however, it may allow the user to request more information than needed. To avoid this situation and to use processing time efficiently, the program asks the user to confirm when the current setup results in too many groups of data for analysis. The image below shows the warning message generated when the current setup exceeds the Total Groups Limit. The user may continue and complete the calculations, or stop the program and make the necessary changes to get more meaningful results.
When the user decides to stop the evaluation of the current setup, the program provides hints on how to reduce its complexity (see image below). In general, a simpler setup can be defined by reducing the number of categorical fields or the number of category values selected for each category. The user may also increase the Total Groups Limit in the Build menu to avoid the warning message.
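The effect of adding fields and category values can be illustrated with a short Python sketch. It is not part of BEST Viewpoints: the counting rule (number of analysis fields times the number of category-value combinations), the function names, and the limit value are all assumptions made for illustration, since this document does not state how the program counts groups.

```python
from math import prod


def total_groups(analysis_fields, selected_categories):
    """Assumed rule: fields x product of selected values per categorical field."""
    combos = prod(len(values) for values in selected_categories.values())
    return len(analysis_fields) * combos


def exceeds_limit(analysis_fields, selected_categories, limit):
    """True when the setup would trigger the (hypothetical) confirmation."""
    return total_groups(analysis_fields, selected_categories) > limit


# Hypothetical setup: two analysis fields, two categorical fields.
fields = ["Sales", "Price"]
categories = {"Country": ["China", "Germany"], "Product": ["A", "B", "C"]}
n_groups = total_groups(fields, categories)
print(n_groups, exceeds_limit(fields, categories, limit=10))
```

Under this assumed rule, removing one categorical field or deselecting category values reduces the count multiplicatively, which is why those are the most effective ways to simplify a setup.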
The following sections are dedicated to the options available on the Explorer menu.

Combo

Initially the application is ready to create a combo of charts; the user may select which charts to combine and can modify each one independently. When Sales is analyzed, the two default charts (Histogram and Line) are created, as shown below. These charts can be modified with the options in the Build menu.
Note that the Build menu has several other tabs. The names and contents of these tabs change as options are selected, but the Setup tab is always present. This tab contains general options for all the charts. For example, note that in the image below the Lines and Histogram charts are selected (by default). The Combine sub-menu has options to modify the way the plots are arranged and displayed. These options become more important when many charts are created simultaneously.
Note that the menu on top of the charts also changes dynamically to follow the user's navigation. For example, the Plot Type menu is originally available there to change the type of plot used to create the list plot (ListLinePlot, ListPlot, DateListPlot, ProbabilityPlot, etc.).
When the Histogram menu is selected, the options change to a set of Presets for creating a histogram (see image below).
Each of these presets automatically sets some options in the Build menu for the histogram (and, in some cases, the line plot). For example, the result of the Smooth PDF preset is summarized in the next image. In this case the Smooth Histogram and Distribution options are checked, and the Function used for plotting the histogram is the PDF. Note that the distribution is not part of the preset; the LogNormal distribution shown in the example was selected manually after applying the preset.
The next image shows what the PDF preset creates. Note that the Distribution checkbox is now selected, and thus the menu for distributions is open. By default, the Normal distribution is used for testing and its parameters are estimated. Other continuous distributions included in this menu are LogNormal, Gamma, Weibull, Beta, and Exponential.
Discrete distributions can be used for the same purposes: Bernoulli, Binomial, Negative Binomial, Geometric, Hypergeometric, Beta Binomial, Beta Negative Binomial, and Poisson. The image below shows a goodness of fit test for the Binomial distribution with estimated parameter p and user-provided parameter n.
Adding the Goodness of Fit Test option to perform a Kolmogorov-Smirnov test and selecting the Normal distribution results in the image below. The P-value for the test, along with the test statistic and sample size, is displayed as the plot label.
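For readers who want to see what a Kolmogorov-Smirnov test against a fitted Normal distribution computes, the following Python sketch may help. It is not the code used by BEST Viewpoints; the sample values and function names are hypothetical, and the p-value uses the standard asymptotic Kolmogorov series, which tends to be conservative when the distribution parameters are estimated from the same data.

```python
from math import exp
from statistics import NormalDist, mean, stdev


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and the reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue_asymptotic(d, n, terms=100):
    """Asymptotic p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 n d^2)."""
    t = n * d * d
    if t < 0.0729:  # sqrt(n) * d < 0.27: the p-value is essentially 1
        return 1.0
    total = sum((-1) ** (k - 1) * exp(-2.0 * k * k * t)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical sample standing in for a field such as Sales.
sales = [1210.0, 2340.5, 1875.2, 3020.1, 2650.7,
         1990.3, 2210.9, 2785.4, 1560.8, 2430.6]
fitted = NormalDist(mean(sales), stdev(sales))  # parameters estimated from data
d = ks_statistic(sales, fitted.cdf)
p_value = ks_pvalue_asymptotic(d, len(sales))
print(f"D = {d:.4f}, p = {p_value:.4f}, n = {len(sales)}")
```

The three printed quantities correspond to the statistic, P-value, and sample size that the plot label reports.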
Grouping by Country (with only China and Germany selected) can be used to compare the tests for two different category values. Note that the options for Box Whisker Charts and Distribution Charts are displayed in the Charts menu because these two plots are now selected as part of the Combo.
When the parameters of the tested distribution are displayed in the plot label, it is because there is no mix of distributions in the plot. As an example, consider the probability plot below. Although two variables are present in the plot, both are compared against the same parameters, which result from combining both variables in the same dataset. This may not seem useful in this example; however, if the two variables are, for example, diameter1 and diameter2, and they need to be compared independently against a given set of parameters, then such a plot is more meaningful. In this case it is also possible to test against a user-defined set of parameters, which can be entered in the Histogram menu.
Note that data can be combined in plots in different ways using the Combine menu in the Setup tab. In the image above the data is combined by Categories, but in the next example the data is combined by Fields. When combining by Fields (example below), every field has an independent set of plots; when combining by Categories (example above), every combination of category values (e.g. China) has an independent chart in which the fields (Sales and Price) are combined.

Histogram

When Histograms is selected in the Explorer menu, the Histogram tab in the Options menu provides ways to get more information from the data being analyzed. Several options for histograms were already presented in the previous section.
As a complement to the histogram options already discussed, note that in the image below the number of bins is defined manually by deselecting the Binning Method checkbox. Additionally, the histogram is compared to the Weibull distribution with a user-defined parameter of 1.1 and an automatically estimated Beta parameter of ~2404.26. Finally, all applicable goodness of fit tests are displayed with the corresponding statistic and P-value.
Additionally, the Explore option opens a window for graphically and dynamically calculating upper or lower tail probabilities for the currently displayed distribution. The image below shows an example stating that P[Sales > 4740.057] = 0.1212, assuming that Sales follows a Weibull distribution with the provided or estimated parameters.
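The tail probability quoted above can be checked by hand. The Python sketch below assumes the usual two-parameter Weibull form with shape Alpha and scale Beta, for which P[X > x] = exp(-(x/Beta)^Alpha), and assumes that the user-defined value 1.1 is the shape parameter; the function name is hypothetical and the code is not part of BEST Viewpoints.

```python
from math import exp


def weibull_upper_tail(x, alpha, beta):
    """P[X > x] for a Weibull distribution with shape alpha and scale beta."""
    if x <= 0:
        return 1.0
    return exp(-((x / beta) ** alpha))


# Values taken from the example: shape 1.1 (assumed), scale ~2404.26.
p_upper = weibull_upper_tail(4740.057, alpha=1.1, beta=2404.26)
p_lower = 1.0 - p_upper  # lower tail, P[Sales <= 4740.057]
print(f"upper = {p_upper:.4f}, lower = {p_lower:.4f}")
```

With these inputs the upper tail comes out at roughly 0.121, which agrees with the value shown in the Explore window to within the rounding of the estimated Beta.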
Note that the values shown in the Parameter input field remain fixed at the user-defined value for all populations evaluated. On the other hand, the estimated parameters shown in the input field are the last set of parameters estimated. Thus, for example, if more than one variable is being analyzed, the parameters in the field are those estimated for the last variable selected (i.e. the last variable analyzed).

Line Plots

As presented in the Combo section, several types of Line Plots can be created in the Exploratory Viewpoints module. Some complementary examples are presented in this section.
Default options for Line Plots are presented in the menu below. To avoid plotting a large number of points, which could slow down rendering, Max Points is set to 500 by default.
The default Plot Type is ListLinePlot. When DateListPlot is selected, the user must provide the Date Field, which is the data field that contains the dates (see example below). Once the date field is provided, the date format can be selected from a menu. Data is expected to be sorted by the selected Date Field for this plot to produce meaningful results. Sorting can be done in the basic or advanced spreadsheet; this task is left to the analyst, rather than being part of the automatic data preparation in Exploratory Viewpoints, to ensure that line plots always use the original ordering of the imported data.
By selecting ProbabilityPlot in Plot Type, along with the desired Goodness of Fit Test and Confidence level, a quick distribution assessment can be made. The example below tests Sales and Price against the estimated Weibull distribution when the parameter Alpha is set to 1.1 for both Sales and Price. Setting Goodness of Fit Output to Test Conclusion provides a written conclusion for both tests: Sales is not rejected but Price is rejected. The Estimate Parameters option causes parameters to be estimated for both fields. If the analyst provides distribution parameters, the same parameters are used for all variables tested.
The Combine sub-menu in the Setup tab is used to decide how data is combined in plots. In the example below, line plots for Sales and Price for China and Germany (Group-By field: Country) are arranged with the option Combine - Categories. That is, the different category values are combined in the same chart, and a new chart is created for each analysis field (Sales, Price).
By selecting Combine - Fields the analysis fields (Sales and Price) are combined in the same chart, and a new chart is created for each of the category values of the Group By field (Country).
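The two Combine settings described in the last two paragraphs can be pictured as two ways of grouping the same set of series. The Python sketch below is only an illustration of that description; the function name and data structures are hypothetical and do not reflect how BEST Viewpoints is implemented.

```python
def layout_charts(fields, category_values, combine):
    """Map each chart title to the (field, category) series it contains."""
    if combine == "Categories":
        # One chart per analysis field; category values share the chart.
        return {f: [(f, c) for c in category_values] for f in fields}
    if combine == "Fields":
        # One chart per category value; analysis fields share the chart.
        return {c: [(f, c) for f in fields] for c in category_values}
    raise ValueError(f"unknown Combine option: {combine!r}")


fields = ["Sales", "Price"]
countries = ["China", "Germany"]
by_categories = layout_charts(fields, countries, "Categories")
by_fields = layout_charts(fields, countries, "Fields")
for title, series in by_fields.items():
    print(title, series)
```

In both cases the same four series are drawn; only the way they are distributed across charts changes.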

Statistics

Selecting Statistics in the Explorer menu allows the creation of distribution charts and box plots in a different format from that shown in the Combo menu. Additionally, mean confidence intervals (Mean CI), standard deviation confidence intervals (Std Dev CI), and hypothesis tests (equal means and variance ratio tests) can be created.
The mean confidence intervals (Mean CI) in the image below are displayed graphically and as tooltips for each plot. Note that the confidence level can be changed in the Statistics menu.
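As a rough illustration of what a mean confidence interval represents, the Python sketch below computes one from a small hypothetical sample. It uses a Normal quantile, which is a large-sample approximation; the program may well use a Student t quantile instead (this document does not say), in which case its intervals would be somewhat wider for small samples. The function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci_normal(sample, confidence=0.95):
    """Approximate two-sided CI for the mean using a Normal quantile."""
    n = len(sample)
    center = mean(sample)
    std_err = stdev(sample) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return center - z * std_err, center + z * std_err


# Hypothetical data; changing the confidence level widens or narrows the interval.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
low95, high95 = mean_ci_normal(sample, 0.95)
low99, high99 = mean_ci_normal(sample, 0.99)
print((low95, high95), (low99, high99))
```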
The Hypotheses option in the Output menu shows the results of hypothesis tests on the difference of means (lower left side of the matrix) and of variance ratio tests (upper right side of the matrix). The corresponding two-sided P-values are displayed for each test. Red is used to highlight significance when the P-value is smaller than 0.05.
For example, the variances of Sales in France and China are statistically different: the red P-value of 0.028 is smaller than the 0.05 significance level used for the test.
Additional documentation for these tests can be found in the documentation for the HypothesisTesting package.

Bubble Charts

The Bubble Charts can display up to five fields in a single plot. Consider the example below where Sales, Price and Quantity are analyzed by the levels of the categorical fields Product and Country. Note that the order in which fields and categories are selected determines their position in the plots.
In the menu for Bubbles there are some parameters that can be controlled like the min and max bubble size, and whether a tooltip is displayed in each bubble.

Markers and Tips

Markers&Tips is a line plot with a legend that identifies the categorical source of each data point. Some descriptive information for each point is displayed in the legend, and other information is displayed as a tip when the mouse is near the marker in the plot.

Scatter Plots

Scatter Plots can also be created in several formats. The Combine menu can be used to create separate scatter plots by combining by Categories or by Fields.

Control Charts

This perspective is for creating statistical Control Charts as another Exploratory Data Analysis tool. The image below shows an example of X-Bar and EWMA for X-Bar control charts with a Sample Size of 5.
In the example below the XBar and R charts are displayed together so that the two charts can be compared. Other types of control charts available include EWMA for XBar, P, Np, C, nC, U, R, and S. Control chart parameters are automatically and continuously estimated. Once the parameters are estimated, the Parameters checkbox can be deselected to avoid re-estimating them.
The Control Condition option is used to establish whether control is assessed against two limits or one, and in which direction. For example, x<UCL states that a point is considered to be in statistical control when its value is less than UCL.
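For orientation, the Python sketch below shows the textbook calculation of X-Bar and R chart limits for subgroups of size 5, followed by a simple control-condition check. It uses the standard Shewhart constants for n = 5 (A2 = 0.577, D3 = 0, D4 = 2.114). It is not the estimation algorithm used by BEST Viewpoints; the function names are hypothetical, and the condition labels other than x<UCL are assumptions.

```python
from statistics import mean

# Standard Shewhart control chart constants for subgroups of size 5.
A2, D3, D4 = 0.577, 0.0, 2.114


def subgroups(values, n=5):
    """Split consecutive observations into complete subgroups of size n."""
    return [values[i:i + n] for i in range(0, len(values) - n + 1, n)]


def xbar_r_limits(groups):
    """Return (LCL, center line, UCL) for the X-Bar chart and the R chart."""
    xbars = [mean(g) for g in groups]
    ranges = [max(g) - min(g) for g in groups]
    center, rbar = mean(xbars), mean(ranges)
    return {
        "xbar": (center - A2 * rbar, center, center + A2 * rbar),
        "r": (D3 * rbar, rbar, D4 * rbar),
    }


def in_control(x, lcl, ucl, condition="LCL<x<UCL"):
    """Evaluate a point against one or both limits."""
    if condition == "x<UCL":
        return x < ucl
    if condition == "x>LCL":
        return x > lcl
    return lcl < x < ucl


# Hypothetical measurements: ten observations form two subgroups of five.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
limits = xbar_r_limits(subgroups(data, 5))
lcl, center, ucl = limits["xbar"]
print(limits, in_control(8, lcl, ucl, "x<UCL"))
```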
Data Format refers to the way input data is interpreted. 'Subgroups' of sample size n are formed automatically for control charts other than P, Np, and U. For those three charts, the data must be entered as two columns: the statistic column and the sample size column. In this case the Data Format is called 'Statistic & n'.
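The two-column layout can be illustrated with a P chart. The Python sketch below assumes that the statistic column holds the count of nonconforming items in each sample and that the second column holds the sample size; whether the program expects counts or proportions in the statistic column is not stated here, so this is an assumption, as is the function name. The limits are the usual three-sigma P chart limits.

```python
from math import sqrt


def p_chart_limits(counts, sizes):
    """Three-sigma P chart limits, one (LCL, center, UCL) tuple per sample."""
    pbar = sum(counts) / sum(sizes)
    limits = []
    for n in sizes:
        half_width = 3.0 * sqrt(pbar * (1.0 - pbar) / n)
        limits.append((max(0.0, pbar - half_width), pbar,
                       min(1.0, pbar + half_width)))
    return limits


# Hypothetical 'Statistic & n' data: nonconforming count and sample size.
counts = [5, 10]
sizes = [100, 100]
p_limits = p_chart_limits(counts, sizes)
print(p_limits)
```

Because the limits depend on each sample size, samples of different sizes get different limits, which is why the sample size must be supplied as its own column.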
Max Proportion is used to set the maximum proportion of samples that the parameter estimation algorithm can discard to estimate parameters. Additional information about the control charts can be found in the SPC section of the MXLPlus Guide.

Capability Analysis

Capability Analysis is a method used to understand how capable a process is of meeting a given set of design specifications. In the image below, values have been assigned to the Lower Specification Limit (LSL), Target, and Upper Specification Limit (USL). Additionally, the Mean and Standard Deviation parameters are assigned a value for testing purposes. By default, the analysis assumes that the real parameters are those estimated from data (dashed distribution). This can be seen in the Capability menu, where the Analyze option is set to Statistical. In this case the estimated process capability is Cp=0.43. The conclusion would be that the process is not capable of meeting design specifications if the parameters are those estimated from data.
However, as shown below, the Parametric distribution can be used to estimate the capability of the process which then becomes Cp=1.41. The conclusion is that the process is capable of meeting design specifications if the user-provided parameters are used for the analysis (i.e. Analyze option set to Parametric). Note that the P. Factor (P stands for probability) is a factor used to change the scale of the calculated probability. In the example below probabilities are presented as parts per million (P.Factor = 1M) such that small probabilities are easier to understand.
In the examples above both distributions are shown in the output (Display Both). In the next example only the parametric distribution is presented (Display Parametric).
Finally, the Histogram of the data being analyzed can be shown in the output as shown below. Note that the P. Factor used below is 100 such that probabilities are displayed as percentages.
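The quantities discussed in this section can be reproduced with a few lines of Python. The sketch below uses the standard definition Cp = (USL - LSL) / (6 * sigma) and assumes a Normal process distribution for the out-of-specification probability, which is then scaled by a factor playing the role of the P. Factor. The specification limits and parameters are hypothetical (the values behind Cp=0.43 and Cp=1.41 are not given in this document), and the function names are assumptions.

```python
from statistics import NormalDist


def cp(lsl, usl, sigma):
    """Process capability index: specification width over six sigma."""
    return (usl - lsl) / (6.0 * sigma)


def out_of_spec(lsl, usl, mu, sigma, p_factor=1_000_000):
    """Probability of falling outside [LSL, USL], scaled by p_factor."""
    dist = NormalDist(mu, sigma)
    return (dist.cdf(lsl) + (1.0 - dist.cdf(usl))) * p_factor


# Hypothetical specification limits and two sets of parameters.
lsl, usl = 9.0, 11.0
estimated = {"mu": 10.1, "sigma": 0.5}    # stands in for the Statistical case
provided = {"mu": 10.0, "sigma": 0.25}    # stands in for the Parametric case

for label, params in (("Statistical", estimated), ("Parametric", provided)):
    ppm = out_of_spec(lsl, usl, params["mu"], params["sigma"], 1_000_000)
    pct = out_of_spec(lsl, usl, params["mu"], params["sigma"], 100)
    print(label, round(cp(lsl, usl, params["sigma"]), 2), ppm, pct)
```

A factor of 1,000,000 expresses the probability in parts per million and a factor of 100 expresses it as a percentage, matching the two P. Factor settings used in the examples above.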