
Dataset Analysis

Upload a CSV or Excel file, select a numeric column, and get a full Minitab-style analysis: descriptive statistics, normality test (Anderson-Darling), histogram, box plot, Q-Q plot — with specific guidance if your data is non-normal.

No data leaves your browser; all analysis runs locally.
Supported file types: .CSV, .XLSX, .XLS, .TXT (tab- or comma-delimited)

Download one of the sample files below, then drag it into the upload zone above — or click the zone to browse for it. Upload one file at a time and watch the full analysis unfold. Start with the skewed dataset to see how non-normal data looks, then try the normal one to compare the difference.

Summary values: Anderson-Darling A², p-value, n
Histogram + normal curve: bars = frequency, curve = fitted normal distribution
Box plot: median, IQR, whiskers (1.5×IQR), and outliers
Normal Q-Q plot: points on the reference line = normal distribution
If the data is non-normal: the data did not pass the Anderson-Darling normality test (p < 0.05). This affects which statistical tests and capability indices are valid.
If the data is normal: the data passed the Anderson-Darling normality test (p ≥ 0.05), so standard statistical methods can be used.
Complete guide

Dataset Analysis Calculator Guide

Use the calculator above to upload a CSV or Excel file, select a numeric column and get a full Minitab-style analysis: descriptive statistics, an Anderson-Darling normality test, histogram, box plot and Q-Q plot, with specific guidance if your data is non-normal. It serves as a first-pass diagnostic before any deeper Six Sigma analysis.

What it is

What is dataset analysis?

A Dataset Analysis is the structured exploratory phase of any statistical study. It combines descriptive statistics (mean, median, σ, skewness, kurtosis), a formal normality test, and visualisations (histogram, box plot, Q-Q plot) to characterise the data before any modelling, capability calculation or hypothesis test is run.
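The descriptive statistics listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the calculator's own code: the `describe` helper and the sample values are hypothetical, and skewness and kurtosis use the simple moment formulas (divisor n). Minitab and other packages apply small-sample corrections, so their figures may differ slightly.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)  # sample standard deviation (divisor n - 1)

    # Central moments (divisor n) used for the shape statistics.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n

    return {
        "n": n,
        "mean": mean,
        "median": median,
        "stdev": stdev,
        "skewness": m3 / m2 ** 1.5,            # 0 for a symmetric sample
        "excess_kurtosis": m4 / m2 ** 2 - 3,   # 0 for a normal distribution
    }


# Hypothetical cycle times in seconds (illustrative values only).
cycle_times = [9.8, 10.4, 11.1, 11.5, 11.8, 12.0, 12.9, 14.2, 16.7, 21.3]
summary = describe(cycle_times)
for name, value in summary.items():
    print(f"{name:>16}: {value:.3f}")
```

A mean noticeably above the median together with positive skewness is the numeric signature of a right tail, which the plots should then confirm.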

Calculation logic

How the calculation works

The tool computes the standard descriptive statistics, runs the Anderson-Darling test for normality (which weights the tails of the distribution more heavily than the Kolmogorov-Smirnov test), and generates three plots. If the data fails the normality test, it provides specific guidance: try a Box-Cox transformation, switch to non-parametric tests, or use non-normal capability methods.
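For readers who want to see what the normality test involves, here is a rough sketch of the Anderson-Darling statistic for a normal distribution whose mean and standard deviation are estimated from the sample. The p-value uses the piecewise approximation commonly attributed to D'Agostino and Stephens. Whether the calculator uses exactly this approximation is an assumption, and `anderson_darling_normal` is a hypothetical name.

```python
import math
from statistics import NormalDist, fmean, stdev


def anderson_darling_normal(values):
    """Return (A2, adjusted A2, approximate p-value) for a normality check."""
    n = len(values)
    if n < 8:
        raise ValueError("need at least 8 observations for this approximation")

    mu, sigma = fmean(values), stdev(values)  # estimated from the sample
    std_normal = NormalDist()
    eps = 1e-12  # keeps log() finite for extreme observations
    cdf = [
        min(max(std_normal.cdf((x - mu) / sigma), eps), 1 - eps)
        for x in sorted(values)
    ]

    # A2 = -n - (1/n) * sum (2i - 1) [ln F(z_i) + ln(1 - F(z_{n+1-i}))]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    a2_adj = a2 * (1 + 0.75 / n + 2.25 / n ** 2)  # small-sample adjustment

    # Piecewise p-value approximation (assumed; intended for moderate A2).
    if a2_adj >= 0.6:
        p = math.exp(1.2937 - 5.709 * a2_adj + 0.0186 * a2_adj ** 2)
    elif a2_adj > 0.34:
        p = math.exp(0.9177 - 4.279 * a2_adj - 1.38 * a2_adj ** 2)
    elif a2_adj > 0.2:
        p = 1 - math.exp(-8.318 + 42.796 * a2_adj - 59.938 * a2_adj ** 2)
    else:
        p = 1 - math.exp(-13.436 + 101.14 * a2_adj - 223.73 * a2_adj ** 2)
    return a2, a2_adj, min(max(p, 0.0), 1.0)


# Synthetic inputs: evenly spaced normal quantiles, and their exponentials.
ideal = NormalDist(10, 2)
normal_like = [ideal.inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [math.exp(NormalDist().inv_cdf((i + 0.5) / 50)) for i in range(50)]

for label, data in (("normal-like", normal_like), ("skewed", skewed)):
    a2, a2_adj, p = anderson_darling_normal(data)
    verdict = "non-normal" if p < 0.05 else "no evidence against normality"
    print(f"{label}: A2={a2:.3f}, p~{p:.4f} -> {verdict}")
```

The sketch sorts the data, standardizes it, and compares the fitted normal CDF with the empirical ordering. The logarithms give extra weight to disagreement in the tails.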

Worked example

Worked example: a quick first pass

You upload 500 cycle-time measurements. The tool reports a mean of 12.3 s, a median of 11.8 s, σ of 3.5 s, skewness of +1.4 (right-skewed) and an Anderson-Darling p-value below 0.005 (clearly non-normal). The histogram confirms a right tail, and on the Q-Q plot the upper end pulls away from the reference line.

The tool flags that running Cpk or a t-test on this data without transformation would mislead. After a log transform the data is approximately normal, so normal-based analyses become valid on the transformed scale, provided any specification limits are transformed the same way. That single diagnostic step prevents weeks of wrong conclusions in DMAIC Analyse.
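The effect of the log transform in this example can be illustrated with simulated data. The sketch below does not reproduce the 500 measurements above. It draws hypothetical right-skewed cycle times from a lognormal distribution (an assumption chosen purely for illustration) and compares skewness before and after taking logs.

```python
import math
import random
from statistics import fmean


def skewness(values):
    """Moment-based sample skewness (divisor n)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated right-skewed cycle times in seconds (illustrative only).
cycle_times = [rng.lognormvariate(math.log(11.8), 0.5) for _ in range(500)]
log_times = [math.log(t) for t in cycle_times]

print(f"skewness before transform: {skewness(cycle_times):+.2f}")
print(f"skewness after log:        {skewness(log_times):+.2f}")

# A specification limit must be moved to the same scale before it is used.
upper_spec = 30.0                    # hypothetical limit in seconds
log_upper_spec = math.log(upper_spec)
print(f"upper spec on the log scale: {log_upper_spec:.3f}")
```

The transformed values, not the raw ones, are what any subsequent normal-based calculation should use.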

Why it matters

Operational impact

Dataset Analysis catches data-quality and distribution issues before they corrupt downstream analysis. It is the difference between a Six Sigma project built on a defensible foundation and one quietly built on invalid assumptions.

Decision making

When to use it

Use it on every dataset before computing capability indices, running hypothesis tests, building regression models or designing experiments. It is the first step in DMAIC Measure and Analyse.

Lean Six Sigma

Link to Six Sigma

Dataset Analysis sits between data collection and inferential analysis. It pairs with Measurement System Analysis (MSA) to confirm data is both trustworthy and statistically tractable before any decisions are made.

Industry examples

Where dataset analysis is useful

Manufacturing: Diagnose process data before computing Cpk or running capability studies.
Clinical research: Check normality before parametric tests on trial outcomes.
Financial analysis: Expose skew and outliers in returns data before applying mean-variance models.
Software performance: Profile latency distributions before SLO modelling, as they are almost always non-normal.
Common mistakes

Watch-outs before using dataset analysis

  • Skipping the diagnostic step and running parametric tests on non-normal data.
  • Stripping outliers without first checking they are real values rather than data-entry errors.
  • Reporting mean and standard deviation on heavily skewed data — median and IQR are more honest.
  • Treating a normality test pass as proof of normality — it just fails to find evidence of non-normality.
  • Using normal-based capability indices (Cpk, Ppk) on visibly non-normal data without transformation.
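Two of the watch-outs above (outlier handling, and median/IQR versus mean/σ) can be sketched in a few lines. The 1.5×IQR fences match the box-plot whisker rule mentioned on this page. The data, the function names and the choice of the `inclusive` quartile method are assumptions, and quartile conventions differ between packages.

```python
from statistics import fmean, median, quantiles


def tukey_fences(values, k=1.5):
    """Lower and upper fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return points outside the fences so they can be reviewed, not deleted."""
    low, high = tukey_fences(values, k)
    return [x for x in values if x < low or x > high]


# Hypothetical cycle times; 41.3 might be a real delay or a typo for 14.3.
data = [9.8, 10.4, 11.1, 11.5, 11.8, 12.0, 12.9, 14.2, 16.7, 41.3]

print("mean:   ", round(fmean(data), 2))
print("median: ", median(data))
print("fences: ", tukey_fences(data))
print("to review:", flag_outliers(data))
```

The function only flags values. Deciding whether a flagged point is a genuine observation or a data-entry error still requires going back to the source.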
What to do next

Turn the result into action

If the data is normal, proceed with standard capability, hypothesis tests and DOE. If non-normal, transform (Box-Cox or log), switch to non-parametric methods, or use non-normal capability techniques. Either way, document the choice in the project file.
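As an illustration of the Box-Cox option, the sketch below picks the transformation parameter λ by a coarse grid search over the profile log-likelihood. This is one common way to choose λ, not necessarily what the calculator does. The function names and the grid are hypothetical.

```python
import math
from statistics import NormalDist, pvariance


def box_cox(values, lam):
    """Box-Cox transform: ln(x) when lam is 0, else (x**lam - 1) / lam."""
    if any(x <= 0 for x in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if abs(lam) < 1e-12:
        return [math.log(x) for x in values]
    return [(x ** lam - 1) / lam for x in values]


def box_cox_log_likelihood(values, lam):
    """Profile log-likelihood of lam (up to an additive constant)."""
    n = len(values)
    transformed = box_cox(values, lam)
    return (-0.5 * n * math.log(pvariance(transformed))
            + (lam - 1) * sum(math.log(x) for x in values))


def best_lambda(values, grid=None):
    """Grid value of lam with the highest profile log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]  # -2.0 to 2.0 in steps of 0.1
    return max(grid, key=lambda lam: box_cox_log_likelihood(values, lam))


# Synthetic lognormal-shaped data: the log transform (lam = 0) should fit best.
base = NormalDist(2.5, 0.4)
skewed = [math.exp(base.inv_cdf((i + 0.5) / 200)) for i in range(200)]

lam = best_lambda(skewed)
print("chosen lambda:", lam)
```

In practice a rounded, interpretable λ (such as 0 for a log or 0.5 for a square root) is usually preferred, and the chosen value belongs in the project file along with the reason for it.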

Resources

Templates, videos and learning

Pair Dataset Analysis with Measurement System Analysis, control charts and capability indices for a complete pre-analysis workflow.

Frequently asked questions

What is descriptive statistics?

A summary of a data set using measures of central tendency (mean, median), spread (σ, IQR, range) and shape (skewness, kurtosis). It is the starting point of any analysis.

What is the Anderson-Darling test?

A formal hypothesis test for normality that gives extra weight to the tails of the distribution, which makes it more sensitive there than the Kolmogorov-Smirnov test. A small p-value (typically < 0.05) is evidence that the data is not normally distributed.

What if my data is non-normal?

Three options: (1) transform the data (Box-Cox, log), (2) switch to non-parametric tests (Mann-Whitney, Kruskal-Wallis), or (3) use non-normal capability methods. The tool flags the recommended path.
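To make option (2) concrete, here is a minimal sketch of a two-sided Mann-Whitney U test using average ranks for ties and the normal approximation without a continuity correction. The approximation is only reasonable for moderately large samples. The function name and the example data are hypothetical, and a statistics package would normally be used instead.

```python
import math
from statistics import NormalDist


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value via normal approximation)."""
    combined = sorted(
        [(x, "a") for x in sample_a] + [(x, "b") for x in sample_b],
        key=lambda pair: pair[0],
    )
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b

    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i                   # size of this tie group
        tie_term += t ** 3 - t
        count_a = sum(1 for k in range(i, j) if combined[k][1] == "a")
        rank_sum_a += avg_rank * count_a
        i = j

    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_a, 1.0  # all observations identical
    z = (u_a - mean_u) / math.sqrt(var_u)
    return u_a, 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical cycle times (seconds) before and after a process change.
before = [14.2, 15.1, 13.8, 16.4, 19.7, 15.5, 22.3, 14.9]
after = [11.2, 12.0, 10.8, 11.9, 13.1, 12.4, 11.5, 12.8]
u, p = mann_whitney_u(before, after)
print(f"U = {u}, p ~ {p:.4f}")
```

Because the test works on ranks rather than raw values, a long right tail does not distort it the way it distorts a t-test.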

Why use a Q-Q plot?

A Q-Q plot reveals where data deviates from a reference distribution. Points that follow the straight reference line are consistent with normality; curvature reveals skew, and an S-shape reveals heavy or light tails. It shows where the departure occurs, which a single test statistic cannot.
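The coordinates behind a normal Q-Q plot are simple to compute, which may help when reading one. The sketch below pairs each sorted observation with a standard normal quantile at plotting position (i − 0.5)/n. Other plotting-position conventions exist, and the calculator's exact choice is not stated here. The reference line through the sample mean with slope equal to the sample standard deviation is likewise just one common option.

```python
from statistics import NormalDist, fmean, stdev


def normal_qq_points(values):
    """Return (theoretical quantile, observed value) pairs, smallest first."""
    n = len(values)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(values)))


def qq_deviations(values):
    """Observed value minus the reference line y = mean + stdev * z."""
    mean, sd = fmean(values), stdev(values)
    return [obs - (mean + sd * z) for z, obs in normal_qq_points(values)]


# Hypothetical right-skewed cycle times in seconds.
cycle_times = [9.8, 10.4, 11.1, 11.5, 11.8, 12.0, 12.9, 14.2, 16.7, 21.3]

for (z, obs), dev in zip(normal_qq_points(cycle_times), qq_deviations(cycle_times)):
    print(f"z = {z:+.2f}  observed = {obs:5.1f}  deviation = {dev:+.2f}")
```

Deviations that grow at one end of the list correspond to points peeling away from the line on that side of the plot.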

Should I always transform non-normal data?

Not always. If the analysis you plan is robust to non-normality (e.g. ANOVA on large samples), transformation may be unnecessary. The right answer is to match the method to the data, not force the data to fit the method.
