Data Bytes: Basics Statistics

Population: This is the totality, the complete list of observations, or all the data points about the subject under study.

Sample: A sample is a subset of a population, usually a small portion of the population that is being analyzed.

Parameter versus statistic: Any measure that is calculated on the population is a parameter, whereas on a sample it is called a statistic.

Mean: This is a simple arithmetic average, which is computed by dividing the sum of the values by the count of those values. The mean is sensitive to outliers in the data. An outlier is a value in a set or column that deviates strongly from most other values in the same data; it is usually very high or very low.
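To make the sensitivity to outliers concrete, here is a small sketch using Python's standard `statistics` module. The numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical values, chosen only for illustration.
values = [10, 12, 11, 13, 12]
with_outlier = values + [100]  # the same data plus one extreme value

mean_clean = mean(values)          # 58 / 5
mean_outlier = mean(with_outlier)  # 158 / 6

print(mean_clean)    # 11.6
print(mean_outlier)  # roughly 26.33
```

A single extreme value more than doubles the mean here, even though five of the six points are close to 12.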

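Median: This is the midpoint of the data, and is calculated by first arranging the data in either ascending or descending order. If there are N observations, the median is the ((N + 1) / 2)-th value when N is odd, and the average of the (N / 2)-th and (N / 2 + 1)-th values when N is even.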
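A minimal sketch of the median calculation, assuming the usual convention of averaging the two middle values when the count is even. The function name `median_by_sorting` and the sample lists are hypothetical; the result is compared against `statistics.median` from the standard library.

```python
from statistics import median

def median_by_sorting(data):
    """Return the middle value of the sorted data (hypothetical helper)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [7, 1, 5, 3, 9]   # sorted: 1, 3, 5, 7, 9
even_data = [7, 1, 5, 3]     # sorted: 1, 3, 5, 7

print(median_by_sorting(odd_data), median(odd_data))    # 5 5
print(median_by_sorting(even_data), median(even_data))  # 4.0 4.0
```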

Mode: This is the most frequently occurring data point in the data.
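As a quick illustration, the standard `statistics` module provides `mode` and, for data with several equally frequent values, `multimode` (Python 3.8 or later). The lists below are hypothetical.

```python
from statistics import mode, multimode

# Hypothetical data: 3 appears three times, more than any other value.
single_peak = [2, 3, 3, 5, 7, 3, 5]
most_common = mode(single_peak)

# Hypothetical data with a tie: 1 and 2 each appear twice.
tied = [1, 1, 2, 2, 3]
all_modes = multimode(tied)

print(most_common)  # 3
print(all_modes)    # [1, 2]
```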

Measure of variation: Dispersion is the variation in the data, and measures the inconsistencies in the value of variables in the data. Dispersion actually provides an idea about the spread rather than central values.

Range: This is the difference between the maximum and minimum values in the data.

Variance: This is the mean of squared deviations from the mean (xi = data points, µ = mean of the data, N = number of data points). For a population, σ² = Σ(xi − µ)² / N; for a sample, s² = Σ(xi − x̄)² / (N − 1), where x̄ is the sample mean. The dimension of variance is the square of the actual values. The denominator is N − 1 for a sample, instead of N for the population, because of degrees of freedom: one degree of freedom is lost in a sample because the sample mean is substituted for the unknown population mean when the deviations are calculated.
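The difference between the two denominators can be sketched as follows. The data are hypothetical, and the hand calculation is checked against `pvariance` (divides by N) and `variance` (divides by N − 1) from the standard `statistics` module.

```python
from statistics import pvariance, variance

# Hypothetical data, chosen so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mu = sum(data) / n                                  # 40 / 8 = 5.0
squared_deviations = [(x - mu) ** 2 for x in data]  # these sum to 32.0

population_var = sum(squared_deviations) / n        # divide by N
sample_var = sum(squared_deviations) / (n - 1)      # divide by N - 1

print(population_var, pvariance(data))  # both equal 4
print(sample_var, variance(data))       # both roughly 4.571
```

The sample variance is always a little larger than the population formula applied to the same numbers, and the gap shrinks as N grows.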

Standard deviation: This is the square root of variance. By taking the square root of the variance, we measure dispersion in the units of the original variable rather than in squared units.
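A short sketch of the relationship between variance and standard deviation, using hypothetical data and the standard `statistics` module (`pstdev` for a population, `stdev` for a sample).

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical data; think of the values as measurements in metres.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Variance is in squared units (m^2); its square root is back in metres.
population_sd = sqrt(pvariance(data))
sample_sd = sqrt(variance(data))

print(population_sd, isclose(population_sd, pstdev(data)))
print(sample_sd, isclose(sample_sd, stdev(data)))
```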

Quantiles: These are cut points that divide the ordered data into equal-sized parts. Quantiles cover percentiles, deciles, quartiles, and so on. These measures are calculated after arranging the data in ascending order.

Percentile: The p-th percentile is the value below which about p percent of the data points fall. The median is the 50th percentile, as the number of data points below the median is about 50 percent of the data.

Decile: Deciles divide the ordered data into ten equal parts. The first decile is the 10th percentile, which means the number of data points below it is about 10 percent of the whole data.

Quartile: Quartiles divide the ordered data into four equal parts. The first quartile is the 25th percentile (about 25 percent of the data lies below it), the second quartile is the 50th percentile, and the third quartile is the 75th percentile. The second quartile is also known as the median or 50th percentile or 5th decile.
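Quartiles, deciles, and percentiles can all be obtained from one call with a different number of parts. The sketch below uses `statistics.quantiles` (Python 3.8 or later) with its default "exclusive" method on hypothetical data, the integers 1 to 99. Other tools interpolate differently, so exact values can vary slightly between libraries.

```python
from statistics import median, quantiles

# Hypothetical data: the integers 1..99, already in ascending order.
data = list(range(1, 100))

quartiles = quantiles(data, n=4)      # 3 cut points
deciles = quantiles(data, n=10)       # 9 cut points
percentiles = quantiles(data, n=100)  # 99 cut points

print(quartiles)        # [25.0, 50.0, 75.0]
print(deciles[0])       # first decile, 10.0
print(percentiles[49])  # 50th percentile, 50.0

# The second quartile, 5th decile, and 50th percentile all match the median.
print(quartiles[1] == deciles[4] == percentiles[49] == median(data))
```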

Interquartile range: This is the difference between the third quartile and first quartile. It is effective in identifying outliers in data. The interquartile range describes the middle 50 percent of the data points.
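One common way to use the interquartile range for outlier detection is to flag points lying more than 1.5 times the IQR outside the quartiles. The 1.5 multiplier is a widely used convention assumed here, not something the text above specifies, and the data are hypothetical.

```python
from statistics import quantiles

# Hypothetical data with one extreme value.
data = [10, 12, 11, 13, 12, 14, 11, 100]

q1, q2, q3 = quantiles(data, n=4)  # default "exclusive" method
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)  # 11.0 13.75 2.75
print(outliers)     # [100]
```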

Saturday, June 5, 2021
