True Error vs. Sample Error

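True Error

The true error is the probability that a hypothesis misclassifies a single instance drawn at random from the population. Here the population means the set of all possible data for the problem.

Consider a hypothesis h(x) and a true (target) function f(x) over a population P. The true error of h is the probability that h misclassifies a randomly drawn instance: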

$T.E. = Prob[f(x) \neq h(x)]$
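Because the true error is a probability over the whole population, it can only be computed exactly when the population distribution is known. The sketch below is a toy illustration under stated assumptions: the target `f`, the hypothesis `h`, and the uniform population on [0, 1) are all hypothetical choices made for this example, not part of the definition. It approximates Prob[f(x) ≠ h(x)] by drawing many instances.

```python
import random


def f(x):
    """Hypothetical target function: the label is 1 when x exceeds 0.5."""
    return int(x > 0.5)


def h(x):
    """Hypothetical hypothesis: uses a slightly wrong threshold of 0.6."""
    return int(x > 0.6)


def approximate_true_error(num_draws=200_000, seed=0):
    """Approximate Prob[f(x) != h(x)] for x drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(num_draws):
        x = rng.random()
        mistakes += f(x) != h(x)  # True counts as 1
    return mistakes / num_draws


estimate = approximate_true_error()
# f and h disagree only for 0.5 < x <= 0.6, so the exact true error is 0.1.
print(f"approximate true error: {estimate:.3f}")
```

In this toy setup the exact answer is available by inspection, which makes it easy to check the approximation. For real data the population is unknown, which is why the true error has to be estimated from a sample.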

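Sample Error

The sample error of h with respect to the target function f and a data sample S is the fraction of instances in S that h misclassifies.

$S.E. = \frac{1}{n} \sum_{x \in S}\delta(f(x) \neq h(x))$

where n is the number of instances in S and $\delta(\cdot)$ is 1 when its argument is true and 0 otherwise. In words:

$Sample \, Error = \frac{Number\, of\, misclassified \, instances}{Total \, Number \, of \, Instances}$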

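Equivalently, the sample error can be written in terms of the confusion-matrix counts, where TP, FP, FN and TN are the numbers of true positives, false positives, false negatives and true negatives:

$S.E. = \frac{FP + FN}{TP + FP + FN + TN}$

$S.E. = 1 - \frac{TP + TN}{TP + FP + FN + TN}$

$S.E. = 1 - Accuracy$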
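The two forms of the sample error can be checked against each other on a small example. The following is a minimal sketch; the function names `sample_error` and `confusion_counts` and the label lists are made up for illustration.

```python
import math


def sample_error(y_true, y_pred):
    """Fraction of instances whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Illustrative labels: f(x) values in y_true, h(x) values in y_pred.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
total = tp + fp + fn + tn
accuracy = (tp + tn) / total

direct = sample_error(y_true, y_pred)
from_counts = (fp + fn) / total

# All three expressions describe the same quantity.
assert math.isclose(direct, from_counts)
assert math.isclose(direct, 1 - accuracy)
print(f"sample error = {direct}")
```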

Suppose a hypothesis h misclassifies 7 of 33 examples. The sample error is then:

$S.E. = \frac{7}{33} \approx 0.21$

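Bias and Variance

Bias: Bias is the difference between the hypothesis's average prediction and the correct value it is trying to predict. A high-bias hypothesis oversimplifies the problem (it is too simple to capture complex structure in the data), so it tends to have both high training error and high test error.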

$Bias = E[h(x)]- f(x)$

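Variance: A high-variance hypothesis produces predictions that vary widely from one training set to another. Such hypotheses make the model overly complex, fit the training data too closely, and generalize poorly to new data.

$Var(X) = E[(X - E[X])^2]$

Here X stands for the prediction h(x), and the expectation is taken over training sets.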
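Bias and variance can be estimated empirically by training the same kind of hypothesis on many independently drawn training sets and looking at its predictions at a fixed point. The sketch below is an illustration under assumptions chosen for this example only: a hypothetical target f(x) = x², Gaussian label noise, a constant predictor standing in for a high-bias hypothesis, and a 1-nearest-neighbour predictor standing in for a high-variance one.

```python
import random
from statistics import fmean, pvariance


def f(x):
    """Hypothetical target function used only for this illustration."""
    return x * x


def draw_training_set(rng, n=20, noise=0.2):
    """Draw n points uniformly from [0, 1) with noisy labels f(x) + noise."""
    xs = [rng.random() for _ in range(n)]
    return [(x, f(x) + rng.gauss(0.0, noise)) for x in xs]


def fit_constant(train):
    """High-bias hypothesis: always predict the mean training label."""
    mean_y = fmean(y for _, y in train)
    return lambda x: mean_y


def fit_nearest_neighbour(train):
    """High-variance hypothesis: predict the label of the closest training point."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def bias_and_variance(fit, x0=0.9, repeats=2000, seed=0):
    """Estimate Bias = E[h(x0)] - f(x0) and Var = E[(h(x0) - E[h(x0)])^2]."""
    rng = random.Random(seed)
    predictions = [fit(draw_training_set(rng))(x0) for _ in range(repeats)]
    bias = fmean(predictions) - f(x0)
    variance = pvariance(predictions)
    return bias, variance


results = {
    "constant": bias_and_variance(fit_constant),
    "1-nearest-neighbour": bias_and_variance(fit_nearest_neighbour),
}
for name, (bias, variance) in results.items():
    print(f"{name}: bias = {bias:+.3f}, variance = {variance:.4f}")
```

In this setup the constant predictor should show the larger bias in magnitude and the nearest-neighbour predictor the larger variance, matching the descriptions above.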

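Confidence Interval

In general, the true error is hard to compute exactly because the whole population is not available. It can instead be estimated with a confidence interval, which is computed from the sample error.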

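The steps for constructing the confidence interval are:

1. Randomly draw a sample S of n instances (independently of one another) from the population P, where n should be greater than 30.
2. Compute the sample error of h on S.

Here we assume that the sample error is an unbiased estimator of the true error, which holds when S is drawn independently of the data used to choose h. With the chosen level of confidence, the true error then lies in the interval: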

$T.E. = S.E. \pm z_{s} \sqrt{\frac{S.E. (1- S.E.)}{n}}$

where $z_{s}$ is the z-score corresponding to an s% confidence level:

Confidence level (%)	50	80	90	95	99	99.5
z-score	0.67	1.28	1.64	1.96	2.58	2.81
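The interval formula can be evaluated directly. The sketch below applies it to the case of 7 misclassified examples out of 33; the function name `true_error_interval` is hypothetical, and the z-score is taken from the standard library's `statistics.NormalDist` rather than looked up in a table, so it can differ from the rounded table values in the third decimal place.

```python
import math
from statistics import NormalDist


def true_error_interval(sample_error, n, confidence=0.95):
    """Return (low, high) = S.E. -/+ z * sqrt(S.E. * (1 - S.E.) / n)."""
    if not 0.0 <= sample_error <= 1.0:
        raise ValueError("sample_error must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    # Two-sided interval: put (1 - confidence) / 2 of the mass in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    margin = z * math.sqrt(sample_error * (1.0 - sample_error) / n)
    return sample_error - margin, sample_error + margin


# 7 misclassified instances out of n = 33, at 95% confidence.
low, high = true_error_interval(7 / 33, 33, confidence=0.95)
print(f"95% interval for the true error: ({low:.3f}, {high:.3f})")  # roughly (0.07, 0.35)
```

The interval is wide because n = 33 is small; the margin shrinks in proportion to 1/√n as more examples are sampled.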

True Error vs. Sample Error

True Error	Sample Error
The true error represents the probability that a random sample from the population is misclassified.	Sample Error represents the fraction of the sample which is misclassified.
True error is used to estimate the error of the population.	Sample Error is used to estimate the errors of the sample.
True error is difficult to calculate. It is estimated by the confidence interval range on the basis of Sample error.	Sample Error is easy to calculate. You just have to calculate the fraction of the sample that is misclassified.
The true error can be caused by poor data collection methods, selection bias, or non-response bias.	In survey settings, sampling error can take the form of population-specific error (surveying the wrong people), selection error, sample-frame error (choosing the wrong frame for the sample), and non-response error (respondents failing to respond).

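Implementation:

In this implementation, we use confidence intervals to estimate a population quantity from a sample. The code draws random integers and computes confidence intervals for their mean at several confidence levels. The true error is estimated in the same way, with the sample error in place of the sample mean.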

Python3

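# imports
import numpy as np
import scipy.stats as st

# define sample data: 10000 random integers in [10, 30)
np.random.seed(0)
data = np.random.randint(10, 30, 10000)

# confidence levels for the intervals
confidence_levels = [0.90, 0.95, 0.99, 0.995]
for level in confidence_levels:
    # Pass the level positionally: newer SciPy releases name this
    # parameter `confidence`, older ones name it `alpha`.
    low, high = st.norm.interval(level, loc=np.mean(data), scale=st.sem(data))
    print(f"{level * 100:g}%: ({low}, {high})")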

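Output. The interval widths shown correspond to a sample of roughly 100 values; with 10000 values each interval would be about ten times narrower.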
90%: (17.868667310403545, 19.891332689596453)
95%: (17.67492277275104, 20.08507722724896)
99%: (17.29626006422982, 20.463739935770178)
99.5%: (17.154104780989755, 20.60589521901025)
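To connect the interval back to classification error, one can simulate a setting where the true error is known and count how often the interval contains it. The sketch below is a simulation under assumptions made for this example: each prediction is wrong independently with a fixed probability `true_error`, and the function name `coverage` and all parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def coverage(true_error=0.2, n=200, confidence=0.95, trials=5000, seed=0):
    """Fraction of simulated samples whose interval contains the true error."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    hits = 0
    for _ in range(trials):
        # Simulate n independent predictions, each wrong with prob. true_error.
        mistakes = sum(rng.random() < true_error for _ in range(n))
        sample_error = mistakes / n
        margin = z * math.sqrt(sample_error * (1.0 - sample_error) / n)
        if sample_error - margin <= true_error <= sample_error + margin:
            hits += 1
    return hits / trials


observed = coverage()
print(f"95% intervals contained the true error in {observed:.1%} of trials")
```

The observed fraction should be close to the nominal confidence level, though not exactly equal to it, since the formula relies on a normal approximation to the binomial distribution.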
