1 Introduction
The main motivation of our studies is the analysis of learning algorithms that are uniformly stable (we recall the definition introduced in [2] below). In this context we are given an i.i.d. sample of points distributed according to some unknown measure on , a learning algorithm , which maps , that is, given the learning sample it outputs the function mapping the instance space into the space of labels . The output of the learning algorithm based on the sample will be denoted by
. We also use the loss function
. Given the random sample the risk of the algorithm is defined as
and the empirical risk as
By the generalization bounds we mean the high probability bounds on which is the difference between the actual risk of the algorithm and its empirical performance on the learning sample. The standard way to prove the generalization bounds is based on the sensitivity of the algorithm to changes in the learning sample, such as leaving one of the data points out or replacing it with a different one. To the best of our knowledge, this idea was first used by Vapnik and Chervonenkis to prove the in-expectation generalization bound for what is now known as the hard-margin SVM [13]. Later works by Devroye and Wagner used the notions of stability to prove the high probability generalization bounds for nearest neighbors [3]. The paper [2] provides an extensive analysis of different notions of stability and the corresponding (sometimes) high probability generalization bounds. Among recent contributions on high probability upper bounds based on the notions of stability are the paper of Maurer [9], which studies generalization bounds for a particular case of linear regression with a strongly convex regularizer, and the recent work [15], which provides sharp exponential upper bounds for the SVM in the realizable case. In order not to repeat an extensive survey on the topic, we refer to [4] and [5] and the references therein.
We return to the problem of generalization bounds. For the sake of simplicity, we denote . The learning algorithm is uniformly stable with parameter if given any samples
for any we have
(1.1) 
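To make the replace-one form of uniform stability concrete, here is a minimal sketch on a toy problem that is not taken from the paper. The algorithm `mean_algorithm`, the absolute loss, and the restriction of all points to [0, 1] are assumptions made only for illustration. For this toy algorithm, replacing one of n points moves the sample mean by at most 1/n, so the loss at any test point changes by at most 1/n.

```python
import random

def mean_algorithm(sample):
    """Toy learning algorithm (hypothetical): always predict the sample mean."""
    return sum(sample) / len(sample)

def loss(prediction, z):
    """Absolute loss; bounded by 1 when all values lie in [0, 1]."""
    return abs(prediction - z)

def worst_replace_one_gap(n, trials=2000, seed=0):
    """Largest observed change in the loss when one sample point is replaced."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        i = rng.randrange(n)
        perturbed = list(sample)
        perturbed[i] = rng.random()  # replace the i-th point by a fresh one
        z = rng.random()             # test point at which the loss is evaluated
        gap = abs(loss(mean_algorithm(sample), z)
                  - loss(mean_algorithm(perturbed), z))
        worst = max(worst, gap)
    return worst

n = 50
print(worst_replace_one_gap(n), "<=", 1.0 / n)
```

The simulation only probes random replacements, so it illustrates the definition rather than proving it; the 1/n bound itself follows from the triangle inequality.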
Since it is only a matter of normalization, in order to simplify the notation in what follows, we analyze the generalization error multiplied by the sample size , which is the quantity
The basic, and until very recently the best known, result is the high probability upper bound in [2], which states that for any uniformly stable algorithm with parameter and provided that the loss is bounded by , we have with probability at least
(1.2) 
It is easy to observe that this generalization bound is tight only when , which means that only under this assumption, the generalization error converges to zero with the optimal rate . However, in some applications the regime is of interest, and the bound (1.2) cannot guarantee any convergence. In order to consider the values of close to , Feldman and Vondrák provided sharper generalization bounds. In a series of breakthrough papers [4, 5], they first showed a generalization bound of the form,
(1.3) 
where as before, the parameter corresponds to the stability, and the parameter bounds the loss function uniformly. In their second paper [5], Feldman and Vondrák showed a stronger generalization bound,
(1.4) 
Up to the logarithmic factors, the bound (1.4) shows that with high probability in the regime , the generalization error converges to zero with the optimal rate . However, as claimed by Feldman and Vondrák, the bound (1.3) should not be wholly discarded since it does not contain additional logarithmic factors and . More importantly, the bound (1.3) is subgaussian, which means that the dependence on comes only in the form . At the same time, the bound (1.4) shows both subgaussian and the subexponential regimes since it contains two types of terms: and . We will discuss the notions of subgaussian and subexponential high probability upper bounds below.
In [5], the authors ask if their high-probability upper bounds (1.3) and (1.4) can be strengthened and if they can be matched by a high probability lower bound. In this paper, we make some progress in answering both questions. We briefly summarize our findings:

Our main probabilistic result is Theorem 3.1, presented in Section 3. As one of the immediate corollaries, it implies the risk bound of the form,
(1.5)
which removes the parasitic term from (1.4). We emphasize that our analysis is inspired by the original sample-splitting argument of Feldman and Vondrák [5], although our proof is significantly more straightforward. In particular, we avoid several involved technical steps, which ultimately leads us to better generalization bounds.

In Section 4, we show that the bound of our Theorem 3.1 is tight unless some additional properties of the corresponding random variables are used. Our lower bounds are presented by some specific functions satisfying the assumptions of Theorem 3.1. We remark that our lower bound does not entirely answer the question of the optimality of (1.5) for uniformly stable algorithms, as it only shows the tightness of the bound implying (1.5). We discuss it in more detail in Section 4.
Notation
We provide some notation that will be used throughout the paper. The symbol will denote an indicator of the event . For a pair of nonnegative functions the notation or will mean that for some universal constant it holds that . Similarly, we introduce to be equivalent to . For we define and . The norm of a random variable will be denoted as . Let denote the set . To avoid some technical problems for by we usually mean
In what follows we work with the functions of independent variables . For we will write . In addition, for and we write . In particular, if we have an a.s. bound for any realisation of , then by a simple integration argument we have
(1.6) 
i.e., in this sense a conditional bound is stronger than the unconditional one. Finally, for slightly abusing the notation we set and .
Several facts from Probability Theory
When dealing with high probability bounds in Learning Theory, one usually derives a bound of the form,
(1.7) 
with probability at least for any and some . Here, is a random variable of interest, e.g., the excess risk. The term with is referred to as a subgaussian tail, as it matches the deviations of a Gaussian random variable. The term with is called a subexponential tail. In general, the bound above represents a mixture of subgaussian and subexponential tails. In particular, all the known generalization bounds (1.2), (1.3), (1.4) are of the form (1.7).
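The interplay of the two regimes can be sketched numerically. The symbols `a` (weight of the subgaussian term) and `b` (weight of the subexponential term) are placeholder names chosen here, since the constants of (1.7) are not reproduced in this sketch. The width of the bound is a·sqrt(log(1/δ)) + b·log(1/δ), and the two terms balance exactly when log(1/δ) = (a/b)².

```python
import math

def mixed_tail_width(a, b, delta):
    """Width a*sqrt(log(1/delta)) + b*log(1/delta) for 0 < delta < 1.

    a and b are hypothetical names for the subgaussian and
    subexponential scales, respectively.
    """
    t = math.log(1.0 / delta)
    return a * math.sqrt(t) + b * t

def crossover_delta(a, b):
    """Confidence level at which both terms are equal: log(1/delta) = (a/b)**2."""
    return math.exp(-((a / b) ** 2))

a, b = 2.0, 1.0
d = crossover_delta(a, b)
print(d, mixed_tail_width(a, b, d))
```

For δ larger than the crossover level the subgaussian term dominates; for smaller δ the subexponential term takes over.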
An alternative way of studying tail bounds is via the moment norms. Recall that the norm of a random variable is . It is well-known that a subgaussian random variable has moments
where does not depend on , see e.g., Proposition 2.5.2 in [14]. In addition, the moment norms of a subexponential r.v. grow no faster than , i.e.
for some not depending on , see e.g., Proposition 2.7.1 in [14]. In what follows, we will consider the random variables with two levels of moments, that is for some that do not depend on
In fact, the above bound and the bound (1.7) are equivalent up to a constant, as the following simple result suggests.
Lemma 1.1 (Equivalence of tails and moments).
Suppose a random variable has a mixture of subgaussian and subexponential tails, in the sense that it satisfies for any with probability at least ,
for some . Then, for any it holds that
And vice versa, if for any then for any we have with probability at least ,
The proof is a simple adaptation of Theorem 2.3 from [1]. For the sake of completeness, we present it in Section A. We conclude that moment bounds appear naturally when one deals with deviation inequalities. In addition, the moments are easier to work with when we are interested in lower bounds, as we will see in Section 4. Below we also state several well-known moment inequalities for sums and functions of independent random variables. One of them is the moment version of the bounded differences inequality, which follows immediately from Theorem 15.7 in [1].
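The two growth rates recalled before Lemma 1.1 can be checked exactly on the two model examples. The sketch below is only an illustration and uses the standard closed forms E|g|^p = 2^{p/2} Γ((p+1)/2)/√π for a standard Gaussian g and E X^p = Γ(p+1) for an Exp(1) variable X; the function names are ours.

```python
import math

def gaussian_lp_norm(p):
    """Exact L_p norm of a standard Gaussian, computed in log-space."""
    log_moment = (0.5 * p * math.log(2.0)
                  + math.lgamma((p + 1) / 2.0)
                  - 0.5 * math.log(math.pi))
    return math.exp(log_moment / p)

def exponential_lp_norm(p):
    """Exact L_p norm of an Exp(1) random variable: Gamma(p + 1) ** (1 / p)."""
    return math.exp(math.lgamma(p + 1.0) / p)

# The first ratio stays of order one after dividing by sqrt(p),
# the second after dividing by p.
for p in (2, 4, 8, 16, 32, 64):
    print(p, gaussian_lp_norm(p) / math.sqrt(p), exponential_lp_norm(p) / p)
```

The passage from moments back to tails rests on Markov's inequality: for any p, the event |Z| ≥ e‖Z‖_p has probability at most e^{-p}, so choosing p = log(1/δ) turns a moment bound with √p and p terms into a deviation bound with √log(1/δ) and log(1/δ) terms.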
Lemma 1.2 (Bounded differences/McDiarmid’s inequality).
Consider a function of independent random variables that take their values in . Suppose that satisfies the bounded differences property, namely, for any and any it holds that
(1.8) 
Then, we have for any ,
Next, we use the following version of the classical Marcinkiewicz-Zygmund inequality (we also refer to Chapter 15 of [1], which contains similar inequalities).
Lemma 1.3 (The Marcinkiewicz-Zygmund inequality [11]).
Let be independent centered random variables with finite th moment for . Then,
Notice that it is easy to apply the above lemma in the case when a.s. and . Since , we have
(1.9) 
We will refer to it as the moment version of Hoeffding’s inequality.
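As a sanity check of the √(pn) scaling in the simplest case, the moments of a sum of n i.i.d. Rademacher signs can be computed exactly from the binomial distribution. This is a sketch for this special case only: the constant in (1.9) is not reproduced here, and the comparison with √(pn) (constant one) relies on Khintchine's inequality for Rademacher signs, not on the lemma above.

```python
import math

def rademacher_sum_lp_norm(n, p):
    """Exact L_p norm of eps_1 + ... + eps_n for i.i.d. Rademacher signs."""
    total = 0.0
    for k in range(n + 1):  # k = number of +1 signs, so the sum equals 2k - n
        prob = math.comb(n, k) / 2.0 ** n
        total += prob * abs(2 * k - n) ** p
    return total ** (1.0 / p)

for n in (10, 100):
    for p in (2, 4, 8, 16):
        print(n, p, rademacher_sum_lp_norm(n, p), math.sqrt(p * n))
```

For p = 2 the norm equals √n exactly, and for larger p the ratio to √(pn) stays below one.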
2 From generalization to concentration of the sum of dependent random variables
In this section, we modify the generalization bound in order to get an equivalent statement about the concentration of the sum of nonindependent random variables. Slightly abusing the notation, we denote
where is an independent copy of . Using the uniform stability, we can write down the following leave-one-out decomposition (see e.g., [2])
Denote
(2.1) 
Our computations lead to the following simple lemma.
Lemma 2.1.
Under the uniform stability condition with parameter (1.1) and uniform boundedness of the loss function we have for defined by , that
Moreover, we have almost surely for , and .
Finally, as a deterministic function satisfies the bounded difference condition (1.8) with for all except the variable.
Proof.
Similarly to the computations above we have,
The remaining properties can be immediately verified. ∎
Therefore, up to a constant term bounded by , which corresponds to in the original generalization bound, obtaining the high probability bounds for is equivalent to obtaining high probability upper bounds for .
In the next example, we provide some intuition on why the naive application of the bounded differences inequality fails to prove sharp generalization bounds. Surprisingly, it appears that the proof of the bound (1.2) is essentially equivalent to applying the triangle inequality to the sum of weakly dependent random variables.
On the suboptimality of the bound (1.2)
First, we prove an exact moment analog of (1.2) for , for defined by (2.1). We have in mind the illustrative regime of and ; this is exactly when the bound (1.4) balances the two terms. By the triangle inequality we have
where we used that conditionally on the random variable is centered and combined this fact with Lemma 1.2. Since is a sum of independent centered bounded random variables, Hoeffding’s inequality (1.9) is applicable to .
Observe that in the proof above we lose a lot by replacing with . Indeed, it is easy to see that random variables , as well as , are weakly correlated. In order to see that, set
where is an independent copy of . For using and together with the bounded difference property, we have
(2.2) 
This suggests that for , the random variables and have small correlation. However, the original argument in [2] does not take this into account and would give the same bound even if all were replaced by the same random variable .
3 The general moment bound
Here we present an upper bound that relies solely on the properties of the functions (2.1) described in Lemma 2.1. In this section, we slightly abuse the notation: the random variables do not have to be related to any learning algorithm. For the sake of brevity we sometimes denote by . First, we prove our strongest moment bound, which is the main contribution of the paper.
Theorem 3.1.
Proof.
Without loss of generality, we suppose that . Otherwise, we can add extra functions equal to zero, increasing the number of terms by at most two times.
Consider a sequence of partitions with , , and to get from we split each subset in into two equal parts. We have
By construction, we have and for each . For each and , denote by the only set from that contains . In particular, and .
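The nested partition scheme can be sketched in a few lines. The sketch assumes that the number of indices is a power of two, that the finest partition consists of singletons, and that the coarsest one is the whole index set; the function names `nested_partitions` and `block_of`, the level numbering, and the 0-based indices are choices made for this illustration.

```python
def nested_partitions(k):
    """Return levels [P_0, ..., P_k] over indices 0..2**k - 1.

    Level l consists of consecutive blocks of size 2**l, so each block of
    level l + 1 splits into two equal blocks of level l.
    """
    n = 2 ** k
    levels = []
    for l in range(k + 1):
        size = 2 ** l
        levels.append([list(range(start, start + size))
                       for start in range(0, n, size)])
    return levels

def block_of(levels, l, i):
    """The unique block of level l that contains index i."""
    size = len(levels[l][0])
    return levels[l][i // size]

levels = nested_partitions(3)
for l, partition in enumerate(levels):
    print(l, partition)
```

Each index belongs to exactly one block per level, and these blocks grow by doubling from a singleton to the full index set, which is what makes the telescoping over levels possible.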
For each and every consider the random variables
i.e. conditioned on and all the variables that are not in the same set as in the partition . In particular, and . We can write a telescopic sum for each ,
and the total sum of interest satisfies by the triangle inequality
(3.1) 
Since and , by applying (1.9) we have
(3.2) 
The only nontrivial part is the second term of the r.h.s. of (3.1). Observe that
that is, the expectation is taken w.r.t. the variables . It is also not hard to see that the function preserves the bounded differences property, just like the function . Therefore, if we apply Lemma 1.2 conditionally on , we obtain a uniform bound
as there are indices in . We have as well by (1.6).
Let us take a look at the sum for . Since for depends only on , the terms are independent and zero mean conditionally on . Applying Lemma 1.3, we have for any ,
Integrating with respect to and using , we have
It is left to use the triangle inequality over all sets . We have,
Recall that due to the possible extension of the sample. Then,
Plugging the above bound together with (3.2) into (3.1), we get the required bound. ∎
Before we start the discussion of the details of the proof, let us obtain the following simple corollary.
Corollary 3.2.
Under the uniform stability condition with parameter (1.1) and the uniform boundedness of the loss function , we have that for any with probability at least ,
The last bound is an improvement of the recent upper bound for uniformly stable algorithms by Feldman and Vondrák (1.4). To be precise, we removed the parasitic term.
Proof.
Remark 3.3.
The strategy of the proof of Theorem 3.1 is inspired by the original approach of Feldman and Vondrák [5]. Their clamping can be related to the analysis of the terms . It is important to notice that the truncation part of their analysis creates some technical difficulties since it introduces some bias and changes the stability parameter. In particular, the truncation brings an unnecessary logarithmic factor. We entirely avoid these steps by a simple application of the Marcinkiewicz-Zygmund inequality. The analog of the dataset reduction step of Feldman and Vondrák is our nested partition scheme. However, the recursive structure of their approach is replaced by an application of telescopic sums, whereas the union bound, which also brings a logarithmic factor, is replaced by the triangle inequality for norms. Besides being more elementary, our analysis leads to a better result: we eliminate the parasitic term.
Another interesting direction is the analysis of the first bound of Feldman and Vondrák (1.3), which was originally proved by the techniques taking their roots in Differential Privacy (see the discussions on three various ways to prove this bound in [5]). As already noticed in [5], the bound (1.3) should not be discarded due to the fact that it does not contain additional factors and, from our point of view, more importantly, it has the subgaussian form, since it depends only on . Recall that the second bound (1.4) is a mixture of subgaussian and subexponential tails. Although we can adapt the moment technique to prove (1.3), we instead come to the following more general observation:
The bound of Theorem 3.1 is strong enough to almost recover the subgaussian bound (1.3).
In order to show this, we have by Theorem 3.1, provided that almost surely
(3.3) 
Since and for , (which is rather crude) we have for ,
Similarly to the proof of Corollary 3.2, it immediately implies
(3.4) 
which is (1.3) up to an unnecessary factor. The latter is clearly an artifact of the proof in our case.
4 Lower bounds
Since the bound of Theorem 3.1 implies the best known risk bound, it is natural to ask if it can be improved in general. By Lemma 2.1, we know that the analysis of the generalization bounds is closely related to the analysis of the functions satisfying the assumptions of Theorem 3.1. Therefore, it is interesting to know how sharp the general bound (3.3) is. Recall that
where as before, is a uniform bound on . In this section, we prove that one cannot improve the bound of Theorem 3.1, apart from the factor, and the bound is tight with respect to the parameters in some regimes. We notice, however, that this does not completely answer the question of the optimality of the risk bound of Corollary 3.2 for uniformly stable algorithms, but shows that this is the best we can hope for as long as our upper bound is based only on the parameters . In particular, Theorem 3.1 disregards the condition . We discuss this in more detail in what follows.
As before, we need two well-known facts from probability theory. The first lemma is a moment version of the Montgomery-Smith bound [10] which is due to Hitczenko [7]. It characterizes the moments of Rademacher sums up to a multiplicative constant factor.
Lemma 4.1 (Moments of weighted Rademacher sums [7]).
Let be a nonincreasing sequence and let be i.i.d. Rademacher signs. Then,
The next lemma is Chebyshev’s association inequality, see e.g., Theorem 2.14 in [1].
Lemma 4.2.
Let and be nondecreasing real-valued functions defined on the real line. If is a real-valued random variable, then
Proposition 4.3 (The lower bound, ).
Let be i.i.d. Rademacher signs. There is an absolute constant and functions that satisfy the conditions of Theorem 3.1 with the parameters , , such that we have for any ,
(4.1) 
Proof.
Consider the functions,
(4.2) 
It is easy to check that and a.s. Moreover, each satisfies the bounded difference property with parameter w.r.t. all except the th variable. Denoting we have,
(4.3) 
By the triangle inequality we have,
For we obviously have . By Lemma 4.2 and since both functions and are nondecreasing for nonnegative , we have
Next, due to the symmetry of and Lemma 4.1, we have for ,
Finally, for some , our construction implies the following lower bound
∎
Remark 4.4.
The fact that the lower bound (4.1) contains the subexponential term may be alternatively understood as follows. In the case when , the sum (4.2), which is
(4.4) 
corresponds to a special case of Rademacher chaos. The behaviour of (4.4) is well understood and the desired lower bound of order for will immediately follow from Corollary 1, Example 2 by Latała [8]. We present the corresponding bound in the proof of the inequality (4.5) below. This approach, in the case , removes the assumption from Proposition 4.3.
The lower bound of Proposition 4.3 matches the result of Theorem 3.1 up to the logarithmic factor in the regime . In particular, it means that in this regime, the bound has to be subexponential unless we use some properties of the functions , other than those mentioned in Theorem 3.1. We additionally note that our moment lower bounds imply the deviation lower bounds. We can show that there are absolute constants such that the functions defined in (4.2) satisfy for every ,
(4.5) 
This bound can be derived through the Paley-Zygmund inequality, e.g., using the standard arguments as in [6]. For the sake of completeness, we derive this inequality in Section B.
Besides, in the case , a trivial bound is the best one can have. As in (4.3), consider the functions , where
are i.i.d. Rademacher signs (it corresponds to a learning algorithm that always outputs the same classifier). By Lemma 4.1 we have for ,