Quantifying the Complexity and Learnability of Strategic Classification Problems | by Jonathan Yahav | Apr, 2024

Quantifying the Complexity and Learnability of Strategic Classification Problems | by Jonathan Yahav | Apr, 2024

Counting Achievable Labelings: Canonical Shattering Coefficients

Verbally defining shattering coefficients seems straightforward at first glance:

Given a hypothesis class H, its nᵗʰ shattering coefficient, denoted Sₙ(H), represents the largest number of labelings achievable by classifiers in H on a sample of n vectors.

But what is a “labeling”? And what makes it “achievable”? Answering those questions will help us lay some groundwork in pursuit of a more formal definition.

In the of binary classification, a labeling of a sample of feature vectors is simply any one of the ways we can assign values from the set { -1, 1 } to those vectors. As a very simple example, consider two one-dimensional feature vectors (i.e., points on a number line), x₁ = 1 and x₂ = 2.

A visualization of the four possible labelings of the sample x₁ = 1, x₂ = 2. Red points are negatively classified, blue ones are positively classified. Image by the author.

The possible labelings are any combination of the classification values we can assign the feature vectors of one another. We can represent each labeling as a vector, where the first and second coordinate represent the values assigned to x₁ and x₂, respectively. The set of possible labelings is thus { (-1, -1), (-1, 1), (1, -1), (1, 1) }. Note that a sample of size 2 yields 2² = 4 possible labelings — we’ll see how this generalizes to arbitrarily-sized samples soon.

We say that a labeling is achievable by a hypothesis class H if there exists a classifier hH from which that labeling can result. Continuing with our simple example, suppose we are to classifiers of the form xk, k ∈ ℝ, that is, one-dimensional thresholds such that anything to the right of the threshold is positively classified. The labeling (1, -1) is not achievable by this hypothesis class. x₂ being greater than x₁ implies that any threshold that classifies x₁ positively must do the same for x₂. The set of achievable labelings is therefore { (-1, -1), (-1, 1), (1, 1) }.

Examples of one-dimensional threshold classifiers that can be used to achieve all but one of the possible labelings of the sample x₁ = 1, x₂ = 2. Image by the author.

Having understood the basic terminology, we can start to develop some notation to formally express elements of the verbal definition with which we started.

We stick to representing labelings as vectors as we did in our simple example, with each coordinate representing the classification value assigned to the corresponding feature vector. There are 2 possible labelings in total: there are two possible choices for each feature vector, and we can think of a labeling as a collection of n such choices, each made independently of the rest. If a hypothesis class H can achieve all possible labelings of a sample 𝒞, i.e., if the number of achievable labelings of 𝒞 equals 2, we say that H shatters 𝒞ₙ.

Finally, using the notation from above, we converge on a more rigorous definition of Sₙ(H):

In keeping with our explanation of shattering, Sₙ(H) equalling 2 implies that there exists a sample of size n that is shattered by H.

Estimating Hypothesis Class Expressiveness: Canonical VC Dimension

The Vapnik–Chervonenkis (VC) dimension is a way to gauge the expressive power of a hypothesis class. It’s based on the idea of shattering we just defined, and it plays an important role in helping us determine which hypothesis are PAC learnable and which aren’t.

Let’s begin by attempting to intuitively define the canonical VC dimension:

Given a hypothesis class H, its VC dimension, denoted VCdim(H), is defined to be the greatest number n for which there exists a sample of size n that is shattered by H.

Using Sₙ(H) enables us to express this much more cleanly and succinctly:

VCdim(H) = max{ n ∈ ℕ : Sₙ(H) = 2 }

However, this definition isn’t precise. Note that the set of numbers for which the shattering coefficient equals 2may be infinite. (Consequently, it is possible that VCdim(H) = ∞.) If that’s the case, the set has no well-defined maximum. We address this by taking the supremum instead:

VCdim(H) = sup{ n ∈ ℕ : Sₙ(H) = 2 }

This rigorous and concise definition is the one we’ll use moving forward.

Adding Preferences to the Mix: Strategic Shattering Coefficients

Generalizing the canonical notions we just went over so that they work in a strategic setup is fairly straightforward. Redefining shattering coefficients in terms of the point best response we defined in the previous article is practically all we’ll have to do.

Given a hypothesis class H, a preference set R, and a cost function c, the nᵗʰ shattering coefficient of Sᴛʀᴀᴄ⟨H, R, c⟩, denoted σ(H, R, c), represents the largest number of labelings achievable by classifiers in H on a set of n potentially-manipulated feature vectors, i.e., n data point best responses.

As a reminder, this is how we defined the data point best response:

We can tweak the notation we used in our discussion of canonical shattering coefficients to further formalize this:

The main difference is that each x in the sample has to have a corresponding r. Other than that, putting the data point best response where we had x in the canonical case works smoothly.

As a quick sanity check, let’s consider what happens if R = { 0 }. The realized reward term 𝕀(h(z) = 1) ⋅ r will be 0 across all the data points. Maximizing utility thus becomes synonymous with minimizing cost. The best way to minimize the cost incurred by a data point is trivial: never manipulating its feature vector.

Δ(x, r; h) ends up always just being x, placing us firmly within the territory of canonical classification. It follows that σ(H, { 0 }, c) = Sₙ(H) for all H, c. This is consistent with our observation that the impartial preference class represented by R = { 0 } is equivalent to canonical binary classification.

Expressiveness with Preferences: Strategic VC Dimension (SVC)

Having defined the nᵗʰ strategic shattering coefficient, we can simply swap out the Sₙ(H) in the canonical definition of the VC dimension for σ(H, R, c).

SVC(H, R, c) = sup{ n ∈ ℕ : σ(H, R, c) = 2 }

Based on the example we considered above, we find that SVC(H, { 0 }, c) = VCdim(H) for any H, c. Indeed, SVC is to VCdim as the strategic shattering coefficient is to its canonical equivalent: both are elegant generalizations of non-strategic concepts.

From SVC to Strategic PAC Learnability: The Fundamental Theorem of Strategic Learning

We can now use SVC to state the Fundamental Theorem of Strategic Learning, which relates the of a strategic classification problem to its (agnostic) PAC learnability.

A strategic classification instance Sᴛʀᴀᴄ⟨H, R, c⟩ is agnostic PAC learnable if and only if SVC(H, R, c) is finite. The sample complexity for strategic agnostic PAC learning is m(δ, ε) ≤ ⁻² ⋅ (SVC(H, R, c) + log⁡(1/δ)), with C being a constant.

We won’t elaborate too much on how this can be proven. Suffice it to say that it boils down to a clever reduction to the (well-documented) Fundamental Theorem of Statistical Learning, which is essentially the non-strategic version of the theorem. If you’re mathematically inclined and interested in the nuts and bolts of the proof, you can find it in Appendix B of the paper.

This theorem essentially completes our generalization of classic PAC learning to a strategic classification setting. It that the way we defined SVC actually doesn’t just make sense in our heads; it actually works as a generalization of VCdim where it matters most. Armed with the Fundamental Theorem, we are well-equipped to analyze strategic classification problems much as we would any old binary classification problem. In my opinion, having the ability to determine whether a strategic problem is theoretically learnable or not is pretty incredible!

Source link