Start Simple and then Refine: Bias-Variance Decomposition as a Diagnosis
Tool for Leakage Profiling
Liran Lerman, Nikita Veshchikov, Olivier Markowitch, François-Xavier Standaert
Abstract—Evaluating the resistance of cryptosystems to side-channel attacks is an important research challenge. Profiled attacks reveal the degree of resilience of a cryptographic device when an adversary examines its physical characteristics. So far, evaluation laboratories launch several physical attacks (based on engineering intuition) in order to find one strategy that eventually extracts secret information (such as a secret cryptographic key). The certification step represents a complex task because, in practice, the evaluators have tight memory and time constraints. In this paper, we propose a principled way of guiding the design of the most successful evaluation strategies thanks to the (bias-variance) decomposition of a security metric of profiled attacks. Our results show that we can successfully apply our framework to unprotected and protected algorithms implemented in software and hardware.
Index Terms—Side-channel attacks, profiled attacks, bias-variance decomposition, diagnosis tool, evaluation tool.

1 INTRODUCTION
In 1996, Kocher introduced side-channel cryptanalysis, which analyzes physical characteristics (called leakages or traces) of cryptographic devices in order to extract secret information [21]. The rationale is that the leakages of (hardware and software) implementations depend on the manipulated data and the executed operations. As a result, cryptographic algorithms that are secure from the point of view of classical cryptanalysis can become insecure when implemented in actual devices. Kocher exploited the execution time in order to recover the secret key from several cryptosystems. In 1999, Kocher extended this proposition with (differential) power analysis, which compares the outputs of a leakage model (parameterized by secret key hypotheses) with the actual power leakages [22]. Nowadays, side-channel attacks also employ other sources of emanation such as electromagnetic radiation [14] and sound [15]. In this paper, we focus on side-channel attacks based on the power consumption and the electromagnetic radiation, as both leakages can be addressed using the same techniques.
The evaluators of the robustness of cryptographic devices usually analyze several (evaluation) settings by varying, for example, the leakage dimension, the number of leakages, as well as the distinguishers (that compare actual leakages with the modeled leakages). The large number of possible evaluation settings requires considering a significant number of attack strategies, which also depend on the a priori information on the target device. Non-profiled attacks (introduced by Kocher [22]) work under the assumption that the adversary has a priori knowledge of the physical behavior
• L. Lerman, N. Veshchikov and O. Markowitch are affiliated with the QualSec group of the Université libre de Bruxelles in Belgium.
• F.-X. Standaert is affiliated with the ICTEAM/ELEN/Crypto Group of the Université catholique de Louvain in Belgium.
• Corresponding author e-mail: [email protected]

of the cryptographic device (e.g., the power consumption of a device linearly correlates to the manipulated data). By contrast, profiled attacks exploit an offline learning step (also known as profiling step) executed on a device (similar to the target device) in order to automatically extract knowledge about the leakage function [12]. We focus on profiled attacks introduced as the strongest leakage analysis in an information theoretic sense [4].
In practice, profiled attack strategies still require (1) minimizing the number of estimated parameters of the leakage model due to the limited number of measurements available during the learning phase (which can lead to estimation errors), and (2) assuming some knowledge of the leakage distribution (which can lead to assumption errors) when the adversary considers parametric profiled attacks. These constraints led the cryptographic community to propose a plethora of profiled attacks, e.g., [4], [23], [25], [31].
1.1 Our contributions
At Eurocrypt 2014, Durvaux et al. proposed a certification tool to determine whether the estimated profiled attacks suffer from an assumption error or from an estimation error [11] (it was then simplified/specialized in 2016 [10]). Here we complement the certification tool by providing a diagnosis tool (based on the bias-variance decomposition [8], [9], [24]) that guides the design of the best profiled attack optimizing one or several constraints decided by the evaluation laboratory (e.g., maximizing the success probability of profiled attacks with 5 attack traces measured on the target device, or minimizing the number of leakages required to reach a success probability of 0.7) or imposed by the implemented scheme (e.g., maximizing the success probability of profiled attacks from a single leakage measured on a fresh re-keying scheme [28]).
As far as we know, the certification process of cryptographic devices consists of testing several popular profiled attacks with several settings in order to find the best strategy. Nevertheless, this approach requires significant computational power. Here, we promote a technique that starts with a specific evaluation setting (e.g., a low number of leakages with a simple attack) and then refines the attack according to our diagnosis tool. The rationale for considering simple settings at the beginning of the evaluation lies in a variance issue: the success probability of profiled attacks with excessive complexity could be lower than the success probability of simpler approaches. Note also that complex approaches suffer from high time/memory complexity: the time/memory complexity increases as a function of (1) the number of points per leakage, (2) the number of leakages exploited during the learning and the attack steps, and (3) (especially against protected devices) the complexity of the attack.
Our diagnosis tool works in a systematic way (1) by extracting each term impacting the error rate, and (2) by reducing the contribution of those terms in order to reach a lower error rate, optionally with respect to constraints fixed by the evaluator (such as the number of leakages used during the learning step). As a last step, an evaluator can apply the certification tool of Durvaux et al. in order to gauge the quality of the best attack found with the diagnosis tool, provided the framework advises a profiled attack providing probabilities for each target value (also known as a probability-based profiled attack)¹. It is worth noting that our diagnosis tool works on probability-based and score-based profiled attacks. We apply our framework to test eight popular profiled attacks on actual datasets as well as on simulated leakages. We demonstrate the usefulness of our diagnosis tool by applying our framework on unprotected and protected (software and hardware) implementations.
1.2 Related works
Our framework represents a counterpart, in the profiled attack approach, of the recently proposed non-profiled stepwise linear regression attacks [36]. More precisely, the stepwise linear regression techniques tune the parameters of specific probability-based non-profiled attacks while our approach applies to all types of profiled attacks. Similarly, our diagnosis tool extends the papers of Choudary et al. that report several guidelines in the context of template attacks and stochastic attacks (representing well-known probability-based profiled attacks) [5], [6]. All the hints proposed by Choudary et al. can be used in our diagnosis tool to avoid several numerical and estimation issues as well as to reduce the running time of the attacks (e.g., by exploiting dimensionality reduction techniques and efficient profiled attacks).
The main advantages of our approach are: (1) the framework can be applied to any profiled attack (including probability-based and score-based profiled attacks) thanks to the decomposition of a security metric (for example the success rate or success probability, put forward by Standaert et al. [33]), (2) the framework operates without any knowledge of the true leakage distribution, and (3) the framework works on unprotected and protected implementations.
1. The certification tool of Durvaux et al. operates only on probability-based profiled attacks.

We base our approach on the bias-variance decomposition [8], [9], which was recently introduced to the field of side-channel analysis by Lerman et al. [24]. Our paper differs from the paper of Lerman et al. in three main points:
1) We use the bias-variance decomposition for different goals. We not only use this tool to find out what impacts the error rate of template attacks and stochastic attacks, but we also exploit it to extract the best profiled attack (i.e., to fine-tune the meta-parameters). Thus, we provide a framework that can be efficiently applied during the evaluation of cryptographic systems, taking into account real-world constraints.
2) We study additional profiled attacks in order to cover all types of profiled attacks. We not only analyze probability-based profiled attacks (i.e., template attacks and stochastic attacks), but we additionally study score-based profiled attacks and demonstrate that our diagnosis tool can be used on any profiled attack.
3) We conducted the experiments in different settings. We not only test attacks on simulated leakages (that highlight the usefulness of this approach from a theoretical point of view), but we also consider experiments on real power traces collected on unprotected and protected implementations (in hardware and in software), which allows us to evaluate our diagnosis tool in practice.

1.3 Organization of this paper
The rest of the paper is organized as follows. Section 2 contains preliminary notions on physical attacks. Section 3 provides an introduction to the bias-variance decomposition. Section 4 details our diagnosis tool, and the results of its application on profiled attacks against unprotected software as well as hardware implementations. Section 5 extends the results of Section 4 to a protected environment. Section 6 analyses the impact of an assumption made during the experiments. Finally, Section 7 concludes the paper.

2 BACKGROUND ON SIDE-CHANNEL ATTACKS
2.1 Physical attacks

We assume that the adversary wants to retrieve the secret key used when the cryptographic device (that executes a known encryption algorithm) encrypts known plaintexts. In order to find the secret key, the adversary targets a set of key-related information (called the target intermediate values) with a divide-and-conquer approach. The divide-and-conquer strategy extracts information on separate parts of the secret key (e.g., the adversary extracts each byte of the key independently) and then combines the results in order to get the full secret key. In the following, we systematically use the term key to denote the target of our attacks, though, in fact, we address one byte at a time.
During the execution of the encryption algorithm, the cryptographic device processes a function f (e.g., the SBox of the block cipher AES)

$$f : \mathcal{P} \times \mathcal{K} \to \mathcal{Y}, \qquad y = f_k(p), \tag{1}$$
that outputs the target intermediate value $y$, and where $k \in \mathcal{K}$ is a piece of key-related information (e.g., one byte of the secret key) and $p \in \mathcal{P}$ represents information known by the adversary (e.g., one byte of the plaintext). For the sake of simplicity, we assume that (1) the function $f$ is bijective, (2) the output of the function $f$ is encoded using 8 bits, and (3) the adversary targets one byte of the key.

2.1.1 Side-channel attacks
Let ${}^jT_y$ be the $j$-th leakage measured when the device manipulates the value $y$. In the following, we represent each leakage with a vector of real values measured at different instants on the analyzed device. The number of samples (i.e., the length of the vector ${}^jT_y$) equals $n$ in unprotected contexts. We denote ${}^j_tT_y$ the $j$-th leakage (associated to the target value $y$) measured at time $t$, such that:

$${}^j_tT_y = L_t(y) + \epsilon^j_{t,y} \tag{2}$$

$$\phantom{{}^j_tT_y} = L_t(f_k(p)) + \epsilon^j_{t,y}, \tag{3}$$

where $\epsilon^j_{t,y} \in \mathbb{R}$ is the noise of the trace ${}^j_tT_y$ (i.e., a standard additive noise assumption defined by Mangard et al. [26]) following, for example, a Gaussian distribution, and $L_t$ is the (deterministic) leakage function at time $t$. Let us assume that the adversary targets the output of the function $f_k(p)$. The function $L_t$ can be linear or nonlinear in the output bits of $f_k(p)$ [1]. More precisely, we say that the leakage functions are linear if they provide leakages ${}^j_tT_y$ depending on a weighted sum of the bits of the output of $f_k(p)$, while nonlinear leakage functions provide leakages ${}^j_tT_y$ depending on a weighted sum of products of bits of the output of $f_k(p)$. Formally, we consider polynomial (leakage) functions $L_t(y)$, i.e.:

$$L_t(y) = c_t + \sum_{u} \alpha_{t,u}\, g_u(y), \tag{4}$$

where $c_t$ and $\alpha_{t,u}$ are real numbers while $g_u(y)$ is a monomial of the form $\prod_{b \in B} \mathrm{Bit}_b(y)$ in which $\mathrm{Bit}_b(y)$ returns the $b$-th bit of $y$ and $B \subset \{1, 2, ..., 8\}$. This represents a usual assumption in side-channel attacks [31]. Linear leakage functions consider that the cardinality of $B$ in each monomial equals 1, while nonlinear leakage functions contain monomials (with a non-zero coefficient) of the form $\prod_{b \in B} \mathrm{Bit}_b(y)$ in which the cardinality of $B$ is strictly greater than 1. Evaluators often model leakage functions as (1) the Hamming weight of the manipulated value $y$ (denoted HW($y$)) for software implementations (representing a linear leakage function), and (2) the Hamming distance (HD) between two manipulated values for hardware implementations (representing a nonlinear leakage function, since if $a$ and $b$ are in $\{0, 1\}$, then the Hamming distance between $a$ and $b$ equals $a + b - 2ab$)².
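To make Equation 4 concrete, the following sketch simulates one noisy sample of a linear (Hamming-weight) leakage of the AES SBox output. It is a minimal illustration under assumed parameters: the truncated SBOX table, the unit coefficients and the noise level sigma are ours, not values measured on a real device.

```python
# Minimal simulation of Equation 4 with a linear leakage function
# (Hamming weight) on the AES SBox output; all constants below are
# illustrative assumptions, not values from a real acquisition.
import numpy as np

SBOX = [0x63, 0x7C, 0x77, 0x7B]  # first AES SBox entries, truncated for brevity

def bit(y, b):
    """Return Bit_b(y), the b-th bit of y (b in 1..8)."""
    return (y >> (b - 1)) & 1

def leakage(y, coeffs, c=0.0, sigma=1.0, rng=np.random.default_rng(0)):
    """One sample of Equation 4 plus Gaussian noise:
    L(y) = c + sum_u alpha_u * g_u(y) + eps."""
    deterministic = c + sum(alpha * bit(y, b) for b, alpha in coeffs.items())
    return deterministic + rng.normal(0.0, sigma)

# Linear leakage function: every monomial is a single bit (|B| = 1);
# with unit weights this is exactly the Hamming weight HW(y).
hw_coeffs = {b: 1.0 for b in range(1, 9)}
p, k = 0x01, 0x02                 # plaintext byte and key byte (arbitrary)
y = SBOX[p ^ k]                   # target intermediate value y = f_k(p)
print(leakage(y, hw_coeffs))
```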
A side-channel attack is a process during which an attacker analyzes leakages measured on a target device in order to extract information on the secret value. In order to protect implementations against physical attacks, designers employ (among others) $d$-order masking techniques that split each sensitive value $y$ (that depends on the secret key) into $d + 1$ uniformly distributed variables (called shares), denoted $\{x_0, x_1, ..., x_d\}$, such that $y = \circ_{i=0}^{d}\, x_i$ (where $\circ$ represents the method combining the shares) [3], [16]. Typically, the shares $\{x_1, ..., x_d\}$ represent $d$ uniformly distributed random values (called the masks) while $x_0$ equals $y + \circ_{i=1}^{d}\, x_i$ (where $\circ$ and $+$ represent the methods combining shares; for Boolean masking, both reduce to the exclusive-or). Masking schemes of order $d$ leak no information on the sensitive variable $y$ whenever the adversary combines strictly fewer than $d + 1$ different instants in order to recover the sensitive information (when each sample depends on a different share). In the following, we assume that masking techniques of order $d$ use $d + 1$ shares, leading to $(d + 1, d + 1)$ secret sharing schemes. Furthermore, in protected contexts, we denote $n$ the number of samples in the vector ${}^jT_y$ associated to one share (i.e., ${}^jT_y$ contains $n \times (d + 1)$ samples in total).

2. Note that the HD is linear if viewed as a function of the bit flips.
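As an illustration of the $(d + 1, d + 1)$ sharing described above, the sketch below splits a sensitive byte into $d + 1$ shares, assuming Boolean (exclusive-or) masking as the combining method; other schemes would use other group operations.

```python
# Sketch of d-order Boolean masking: split y into d+1 shares whose
# exclusive-or recombines to y (Boolean masking is an assumption here).
import secrets

def share(y: int, d: int) -> list[int]:
    masks = [secrets.randbelow(256) for _ in range(d)]   # x_1, ..., x_d
    x0 = y
    for m in masks:
        x0 ^= m                                          # x_0 = y XOR x_1 ... XOR x_d
    return [x0] + masks

def recombine(shares: list[int]) -> int:
    y = 0
    for x in shares:
        y ^= x
    return y

shares = share(0xAB, d=1)          # first-order masking: 2 shares
assert recombine(shares) == 0xAB
```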

Profiled attacks (belonging to side-channel attacks) represent efficient attacks thanks to a learning step (also known as a profiling step). More precisely, these approaches build a distinguisher $A(T_{PS}, T_{AS})$ that:

1) during the profiling step, it estimates a parameter θ with a set of leakages (called profiling set and denoted TPS) containing Np (profiling) traces per target value, and
2) during the attack step, it returns the extracted secret key k from a set of attack leakages TAS (called attacking set) measured on the target device using a constant secret key.

The quality of the distinguisher can be analyzed by estimating the success rate, i.e., the probability that the distinguisher returns the right key based on a set of attack traces. The error rate equals the probability that the distinguisher does not return the right key based on the attacking set.
The Bayes classifier (also known as the Bayes) denotes the best possible theoretical attack, which practical attacks can reach when there is no assumption error and no estimation error. More formally, let $A_b(\cdot)$ be the Bayes classifier that takes as input a set of attack traces $T_{AS}$; the function $A_b(\cdot)$ represents a classifier that minimizes the error rate, i.e.:

$$A_b(T_{AS}) \in \operatorname*{argmax}_{k \in \mathcal{K}} \Pr\left[k \mid T_{AS}\right] \tag{5}$$

$$\phantom{A_b(T_{AS})} = \operatorname*{argmax}_{k \in \mathcal{K}} \Pr\left[T_{AS} \mid k\right] \times \Pr\left[k\right]. \tag{6}$$

Note that several values $k$ could maximize the value $\Pr[T_{AS} \mid k] \times \Pr[k]$, leading to a set of possible keys (which explains the symbol $\in$ in Equation 5). Note also that, in the side-channel attacks literature, the Bayes classifier usually denotes the model estimating Equation 6 while, in this paper, the Bayes classifier refers to the optimal classifier with the known probability density functions used in Equation 6.

2.2 Concrete distinguishers
In practice, an adversary estimates the Bayes classifier (i.e., Equation 6) with the concrete distinguishers detailed in this section. In the following, we define the complexity of an attack as the number of parameters to estimate (i.e., for a given distinguisher, an increase in the number of parameters to estimate leads to an increase in the complexity of the attack).

2.2.1 Template attacks
(Gaussian) template attacks (TA) [4] estimate Equation 6 by assuming that $\Pr[{}^jT_y \mid y]$ follows a Gaussian distribution $\mathcal{N}(\hat{\mu}_y, \hat{\Sigma}_y)$ for each value $y$, where $\hat{\mu}_y$ and $\hat{\Sigma}_y$ are respectively the sample mean and the sample covariance matrix of the traces associated to $y$. In what follows, we assume that the noise is independent of $y$ in unprotected contexts [6]. This property allows estimating the same physical noise (represented by $\Sigma$) for all the target values. By contrast, in protected settings exploiting first-order masking schemes, we assume that the adversary has no information on the shares (except the instants depending on the shares) during the profiling step, leading to a dependency between the second-order moment $\Sigma_y$ and the sensitive information $y$. Note that the adversary can know the mask values during the profiling step and still not use them, in order to speed up the attack phase by exploiting fewer templates. In the following we will relax this assumption, which leads to another attack (that we call stochastic-based mixture attacks). Note also that template attacks using no information on the mask values (during the profiling step) allow us to highlight the effect on the outputs given by the diagnosis tool when the evaluators exploit biased models. Indeed, in protected contexts, template attacks (unlike template-based mixture attacks, which take into account the mixture structure of the probability density function) represent biased models, since these attacks assume that the leakages associated to a sensitive value $y$ (split in several shares $\{x_0, ..., x_d\}$) follow a Gaussian distribution (i.e., they estimate $\Pr[{}^jT_y \mid y]$ with one Gaussian distribution) while the leakages actually follow a multimodal distribution, i.e.:

$$\Pr\left[{}^jT_y \mid y\right] = \sum_{x_1, ..., x_d} \Pr\left[{}^jT_y \;\middle|\; x_1, ..., x_d,\, x_0 = y + \circ_{i=1}^{d}\, x_i\right] \times \Pr\left[x_0, ..., x_d\right], \tag{7}$$

where $\Pr[{}^jT_y \mid x_1, ..., x_d, x_0 = y + \circ_{i=1}^{d}\, x_i]$ (in which the symbols $\circ$ and $+$ represent the methods combining shares) follows a Gaussian distribution if we consider Gaussian noise. In other words, although the unbiased/optimal profiled attacks are the Gaussian mixture template attacks (in which the adversary estimates the multimodal Gaussian distribution with a mixture of Gaussian distributions), we exploit (unimodal) template attacks, which work if the first- or the second-order moments contain information on the target value (e.g., unimodal template attacks applied on first-order masking schemes). Note that unimodal template attacks extract the same quantity of information from the leakages as template-based mixture attacks if the leakages contain a high physical Gaussian noise and if the target implementation executes a first-order masking scheme (as reported by Grosso et al. [17]).

During the attack step, the adversary classifies $T_{AS} = \{{}^1T, ..., {}^{N_a}T\}$ (where ${}^jT$ denotes the $j$-th measurement on the target device, and $N_a$ is the number of attack traces) by using the equation:

$$\hat{k} \in \operatorname*{argmax}_{k \in \mathcal{K}} \prod_{j=1}^{N_a} \Pr\left[{}^jT \mid y = f_k(p_j)\right] \times \Pr\left[y = f_k(p_j)\right] \tag{8}$$

$$\phantom{\hat{k}} \approx \operatorname*{argmax}_{k \in \mathcal{K}} \prod_{j=1}^{N_a} \hat{\Pr}\left[{}^jT \mid y = f_k(p_j); \hat{\theta}_y\right] \times \hat{\Pr}\left[y = f_k(p_j)\right], \tag{9}$$

where $\hat{\theta}_y = \{\hat{\mu}_y, \hat{\Sigma}_y\}$, and $p_j$ is the $j$-th plaintext used by the device when the adversary measured the $j$-th attack trace.
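The following sketch gives a rough illustration of Equations 8 and 9 for template attacks in an unprotected context: Gaussian templates are profiled with per-value means and one pooled covariance, and key candidates are scored by accumulated log-likelihoods (with uniform priors). The data layout and the sbox argument are assumptions made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def profile(traces_per_y):
    """traces_per_y[y] is an (N_p, n) array of profiling traces for value y.
    Returns per-value means and one pooled covariance (noise independent of y)."""
    mus = {y: t.mean(axis=0) for y, t in traces_per_y.items()}
    residuals = np.vstack([t - mus[y] for y, t in traces_per_y.items()])
    sigma = np.cov(residuals, rowvar=False)
    return mus, sigma

def attack(attack_traces, plaintexts, mus, sigma, sbox):
    """Equation 9 with uniform priors: sum log-likelihoods over the N_a traces."""
    scores = np.zeros(256)
    for k in range(256):
        for trace, p in zip(attack_traces, plaintexts):
            y = sbox[p ^ k]
            scores[k] += multivariate_normal.logpdf(trace, mus[y], sigma)
    return int(np.argmax(scores))   # the estimated key byte k-hat
```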

2.2.2 Stochastic attacks and stochastic-based mixture attacks

Stochastic attacks (SA) [31] model the leakage information at time $t$ as a function of the target value $y$ with a regression model $h$ spanned by $U$ functions $g_u$ (where $u \in [1; U]$), i.e.:

$${}^j_tT_y = h(y, \theta_t) + \epsilon^j_{t,y} \tag{10}$$

$$\phantom{{}^j_tT_y} = c_t + \sum_{u=1}^{U} \alpha_{t,u}\, g_u(y) + \epsilon^j_{t,y}, \tag{11}$$

where $\theta_t = \{c_t, \alpha_{t,1}, ..., \alpha_{t,U}\} \in \mathbb{R}^{U+1}$ represents the parameter of the regression model $h$ at time $t$. Stochastic attacks assume that $g_u$ is a monomial of the form $\prod_{b \in B} \mathrm{Bit}_b(y)$ where $\mathrm{Bit}_b(y)$ returns the $b$-th bit of $y$ and $B \subset \{1, 2, ..., 8\}$. We define the degree of a stochastic attack as the maximum number of variables in a monomial of $h$ with a non-zero coefficient. More formally, stochastic attacks of degree $i$ contain all the monomials of the form $\prod_{b \in B} \mathrm{Bit}_b(y)$ in which the cardinality of $B$ is in $\{1, 2, ..., i\}$. We denote by SA$i$ stochastic attacks of degree $i$ (e.g., SA1 denotes stochastic attacks of degree 1). Note that, in unprotected contexts, template attacks are equivalent to stochastic attacks of degree 8 when the adversary estimates only one $\Sigma$ for all the target values (see Section 2.2 "Profiled attacks" in [24]). As a result, in unprotected contexts, stochastic attacks of degree strictly less than 8 represent distinguishers with lower complexity than template attacks.

In protected environments, each function $h$ takes as input the value of one share. For example, if the device manipulates the share $x_i$ at time $t$, then stochastic attacks model the deterministic part of the leakage at time $t$ by $h(x_i, \theta_t)$. As a result, unlike with template attacks, we assume that the stochastic attacks (that we call stochastic-based mixture attacks in protected contexts) know the value of the manipulated masks during the profiling step.

Regarding the attack step against an unprotected implementation, the adversary uses Equation 8 by assuming that $\Pr[{}^jT_y \mid y]$ follows the Gaussian distribution $\mathcal{N}(h(y, \theta), \Sigma)$ where $h(y, \theta)$ represents the vector $[h(y, \theta_1), h(y, \theta_2), ..., h(y, \theta_n)]$, $n$ represents the number of samples (i.e., the length of ${}^jT_y$), and $\Sigma$ is the covariance matrix of the residual term³. In a protected environment, due to the fact that the adversary has no a priori knowledge of the value of the manipulated masks during the attack step, we consider stochastic-based mixture attacks iterating over all the possible mask values, leading to the selection of the $k$ maximizing the following equation:

$$\prod_{j=1}^{N_a} \sum_{x_1, ..., x_d} \Pr\left[{}^jT \;\middle|\; x_1, ..., x_d,\, y = f_k(p_j),\, x_0 = y + \circ_{i=1}^{d}\, x_i\right] \times \Pr\left[x_0, ..., x_d\right]. \tag{12}$$

In other words, in protected contexts, we define the stochastic-based mixture attacks as a (finite) sum (also known as a mixture) of Gaussian distributions, given that $\Pr[{}^jT \mid x_1, ..., x_d, y = f_k(p_j), x_0 = y + \circ_{i=1}^{d}\, x_i]$ follows a Gaussian distribution if we consider Gaussian noise. As a result, unlike profiled attacks which ignore the mixture structure, attacks (whether classical template or stochastic-based) which model the mixture structure represent unbiased models (if the leakages contain Gaussian noise and if the stochastic-based model contains the right degree), since such adversaries take into account the multimodal structure of the probability density function associated to each sensitive value $y$ split in several shares.

3. The residual term represents the deviation of the actual leakages (associated to known keys and known plaintexts) from the output provided by the estimated leakage model (parametrized with the same keys and plaintexts).
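A minimal sketch of the profiling step of Equation 11: build the monomial basis up to a chosen degree and estimate the per-instant coefficients by ordinary least squares. The basis construction via itertools.combinations is one possible implementation, not necessarily the one used in [31].

```python
import itertools
import numpy as np

def basis(y: int, degree: int) -> list[float]:
    """Monomial basis of Equation 11: products of bits of y with |B| <= degree.
    Bits are zero-indexed here, unlike the 1..8 convention of the text."""
    bits = [(y >> b) & 1 for b in range(8)]
    feats = [1.0]                                 # the constant term c_t
    for size in range(1, degree + 1):
        for B in itertools.combinations(range(8), size):
            prod = 1
            for b in B:
                prod *= bits[b]
            feats.append(float(prod))
    return feats

def fit_stochastic(ys, samples, degree=1):
    """Least-squares estimate of theta_t from profiling pairs
    (target value y_j, leakage sample at one instant t)."""
    X = np.array([basis(y, degree) for y in ys])
    theta, *_ = np.linalg.lstsq(X, np.asarray(samples), rcond=None)
    return theta                                  # [c_t, alpha_1, ..., alpha_U]
```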


2.2.3 Profiled correlation power analysis
In 2000, Coron et al. [7] and Mayer-Sommer [27] proposed Correlation Power Analysis (CPA), which recovers the key from the target device by selecting the key that maximizes the absolute value of the (Pearson) correlation between the actual leakages and the estimated leakages based on the assumed secret key. Profiled Correlation Power Analysis (PCPA) models the leakage function of each instant with a model estimated from a profiling set. More precisely, PCPA based on template attacks (denoted PCPA-TA) considers that the leakage model associated to the target value $y$ equals $\hat{\mu}_y$, while PCPA based on stochastic attacks of degree $i$ (denoted PCPA-SA$i$) assumes that the leakage model equals $h(y, \hat{\theta})$. As a result, PCPA skips the estimation of the $\Sigma$ parameter. In the following, we will show that the diagnosis tool also works on score-based profiled attacks by exploiting PCPA.
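The sketch below illustrates the score-based distinguisher of PCPA-TA: for each key hypothesis, the measured samples at one instant are correlated against the profiled means, and the key with the largest absolute Pearson correlation wins. The helper names and data layout are assumptions.

```python
import numpy as np

def pcpa_ta(attack_samples, plaintexts, mus, sbox):
    """attack_samples: (N_a,) measured samples at one instant.
    mus[y]: profiled leakage model for target value y (here: template means).
    Selects the key maximizing |Pearson correlation| between model and traces;
    assumes the model values are not constant for any key hypothesis."""
    best_k, best_rho = 0, -1.0
    for k in range(256):
        model = np.array([mus[sbox[p ^ k]] for p in plaintexts])
        rho = abs(np.corrcoef(model, attack_samples)[0, 1])
        if rho > best_rho:
            best_k, best_rho = k, rho
    return best_k
```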

3 BIAS-VARIANCE DECOMPOSITION OF THE ERROR RATE
In this section, we detail the bias-variance decomposition (of the error rate) used in our decision tool and recently introduced in the side-channel literature by Lerman et al. [24].

3.1 Preliminary notions
Domingos showed that the error rate of a classifier can be decomposed into three components [8], [9]: the error rate of the Bayes classifier $R_b(\cdot)$, the bias $B(\cdot)$ and the variance $V(\cdot)$, generally leading to the equality:

$$\text{Error rate} = \mathbb{E}_{T_{AS}}\left[c_1 \times R_b(T_{AS})\right] + \underbrace{\mathbb{E}_{T_{AS}}\left[B(T_{AS})\right]}_{\text{Bias}} + \underbrace{\mathbb{E}_{T_{AS}}\left[c_2 \times V(T_{AS})\right]}_{\text{Variance}}, \tag{13}$$

where $\mathbb{E}$ denotes the mean operator, the two values $c_1$ and $c_2$ are in $\mathbb{R}$, and $R_b(T_{AS})$, $B(T_{AS})$ and $V(T_{AS})$ are three functions providing values in $\mathbb{R}$.

The loss function (denoted $\mathrm{LOSS}(k, k')$) represents the cost of predicting $k'$ when the true target value is $k$. In this paper we consider the zero-one loss function: the cost is zero when $k'$ equals $k$ and one otherwise.

The main prediction (denoted $A_m(T_{AS})$) represents the most frequent prediction on the set of attack traces $T_{AS}$ given by the estimated classifier when varying the profiling sets⁴. The bias term represents the difference (according to the loss function) between the main prediction and the prediction provided by the Bayes classifier, i.e.:

$$B(T_{AS}) = \mathrm{LOSS}(A_m(T_{AS}), A_b(T_{AS})). \tag{14}$$

The variance measures the variation of a prediction on a set of attack traces as a function of different profiling sets, i.e.:

$$V(T_{AS}) = \mathbb{E}_{T_{PS}}\left[\mathrm{LOSS}(A_m(T_{AS}), A(T_{AS}, T_{PS}))\right], \tag{15}$$

where $A(T_{AS}, T_{PS})$ represents a profiled attack using the profiling set $T_{PS}$ and the attacking set $T_{AS}$.

Based on these notations, Domingos demonstrated that the multiplicative factors $c_1$ and $c_2$ equal:

$$c_1 = \Pr\left[A = A_b\right] - \Pr\left[A \neq A_b\right] \times \Pr\left[A = k \mid A_b \neq k\right], \tag{16}$$

$$c_2 = \begin{cases} 1 & \text{if } A_m = A_b \\ -\Pr\left[A = A_b \mid A \neq A_m\right] & \text{if } A_m \neq A_b, \end{cases} \tag{17}$$

where $A = A(T_{AS}, T_{PS})$, $A_b = A_b(T_{AS})$ and $A_m = A_m(T_{AS})$.

Figure 1 illustrates the bias-variance decomposition for two profiled attacks. In this figure, attack 1 has a high bias (since the most frequent prediction given by attack 1 differs from the output of the Bayes classifier) but a smaller variance compared to attack 2.

In the following, we use the terms bias and variance to denote respectively the average of the bias (i.e., $\mathbb{E}_{T_{AS}}[B(T_{AS})]$) and the weighted average of the variance (i.e., $\mathbb{E}_{T_{AS}}[c_2 \times V(T_{AS})]$). As a result, this weighted variance term can be negative according to the value of $c_2$.

3.2 Estimation of the bias and of the variance terms
In practice, the evaluator has one data source which he randomly splits into two disjoint subsets: a profiling source and an attack source. The first source provides leakages used to build the classifiers while the second source generates leakages to estimate the error rate, the bias and the variance.
Following the proposition of Valentini et al. [34], in order to decompose the error rate of an attack $A(\cdot, \cdot)$ in our experiments, we build a set of distinguishers (of the same complexity⁵ and evaluation settings) created with different profiling sets generated by sampling (with replacement) the profiling source. This resampling approach provides several distinguishers with different estimated parameters $\hat{\theta}$. We estimate the bias and the variance terms by sampling (without replacement) the attack source. More precisely, we average the estimated bias and the estimated variance obtained on each attacking set sampled from the attack source.
4. The generated profiling sets have the same cardinality (e.g., in unprotected contexts, each set contains Np leakages per target value and n samples per leakage) and they are sampled from the space of sets of leakages.
5. The complexity of an attack is defined by the number of parameters to estimate (as presented in Section 2.2).
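The estimation procedure of this subsection can be summarized by the following sketch, assuming the Oracle model of Section 3.3 (the reference prediction is the true key), the zero-one loss, and pre-built distinguishers passed as callables; note that it reports the raw variance of Equation 15 rather than the c2-weighted variance.

```python
import numpy as np
from collections import Counter

def bias_variance(attack_sets, true_key, trained_attacks):
    """attack_sets: attacking sets sampled (without replacement) from the
    attack source; trained_attacks: distinguishers of identical complexity
    built from profiling sets resampled (with replacement) from the
    profiling source. Returns estimated error rate, bias and variance
    under the Oracle assumption A_b(T_AS) = true_key."""
    errors, biases, variances = [], [], []
    for tas in attack_sets:
        preds = [attack(tas) for attack in trained_attacks]
        main = Counter(preds).most_common(1)[0][0]           # main prediction A_m
        biases.append(0 if main == true_key else 1)          # Equation 14 (Oracle)
        variances.append(np.mean([p != main for p in preds]))      # Equation 15
        errors.append(np.mean([p != true_key for p in preds]))
    return np.mean(errors), np.mean(biases), np.mean(variances)
```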


Fig. 1: Illustration of the bias-variance decomposition for two profiled attacks by reporting the probability to output a key value with a fixed attacking set when the actual (target) key value equals 42. The probability to output a key value depends on the profiling set. For the sake of simplicity, we assume that the error rate of the Bayes classifier equals zero.

It is worth noting that, in practice, the evaluator knows neither which profiling set nor which attacking set will be used by the adversary. As a result, independently of the use of our diagnosis tool, an evaluator needs to build several profiled attacks (with the same complexity and evaluation settings but with different profiling sets) and to use several attacking sets in order to estimate the error rate. In this paper, we demonstrate that the evaluator can exploit all the constructed profiled attacks in order to obtain a diagnosis tool with a small overhead. Note also that by building more attacks and by increasing the number of attacking sets, we increase the accuracy of the estimations of the bias, of the variance and of the error rate.
3.3 Oracle model
In a realistic evaluation scenario, an evaluator has no access to the Bayes classifier (representing the best physical attack minimizing the error rate) during the estimation of the bias and the variance terms. More precisely, the evaluators of cryptosystems lack knowledge (1) of the structure of the leakage distributions (leading to assumption errors), and (2) of the best values (i.e., maximizing the success probability) of the estimated parameters used by the classifier (leading to estimation errors). In practice, we counteract this issue by replacing the Bayes classifier with the Oracle model, which represents an idealization of the Bayes classifier. The Oracle model (also known as the Oracle) denotes an adversary that extracts and outputs the correct key from a set of attack traces regardless of the quantity of information present in the attacking set. In other words, the error rate of the Oracle model equals zero.

We demonstrate in Section 4 and Section 5 that the Oracle model provides useful information on the bias and variance terms in practice. Furthermore, we rationalize the substitution of the Bayes classifier by the Oracle model in Section 6. Informally, the accuracy of this approach increases with the decrease of the overlap between the true probability density functions of the leakages (related to different keys).

3.4 Diagnosis tool and guidelines
The bias-variance decomposition of the error rate allows us to figure out how to tune and improve an attack. Several ways exist to reduce the bias or the variance. In the following, we consider three parameters that can be optimized: (1) the error rate, (2) the number of attack leakages, and (3) the number of profiling leakages. For example, an evaluator may vary the number of profiling leakages in order to reach an error rate of 0.2 with 5 attack leakages.
Based on the considered constraints, the evaluator can reduce the variance by increasing the size of the profiling set [2]. Note however that machine learning theory indicates that increasing the profiling set does not impact the bias term (see for example the paper of Domingos [9]). Classifiers having a small complexity/variance should be used (e.g., SA1 in unprotected contexts) if the evaluator needs to fix the profiling set to a small size [24]. Furthermore, the bias contribution to the error rate can be reduced by increasing the complexity of the model (potentially at the cost of increasing the variance contribution to the error rate) or by increasing the size of the attacking set [19]. Finally, if the number of points related to the target value increases, the bias term reduces, but the variance term can increase (because we have more parameters to estimate) or reduce (because we increase the quantity of information related to the target value in each leakage) [35]. Table 1 summarizes the impact of each parameter on the bias and on the variance terms. Note that, in this paper, we consider a DPA scenario (where we target an operation that involves known plaintexts and a fixed key) while other outcomes could be obtained in other settings.
It is worth noting that we reduce the variance of the estimation of the metrics (1) by increasing the number of attacking sets provided by the attack source, (2) by increasing the number of estimated distinguishers, and (3) by reducing the noise in the leakages (in order to reduce the error of the assumption that the Bayes classifier equals the Oracle model).
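These guidelines can be read as a simple decision rule. The sketch below is one illustrative encoding of the "start simple and then refine" loop of Table 1; the threshold and the wording of the suggestions are our own assumptions, not part of the framework.

```python
def next_refinement(bias, variance, constraints):
    """Suggest the next evaluation setting from an estimated decomposition.
    The 0.05 threshold and the ordering of the suggestions are illustrative
    choices, not values prescribed by the diagnosis tool."""
    if variance > bias and constraints.get("can_grow_profiling_set", True):
        return "increase the profiling set (variance down, bias unchanged)"
    if bias > 0.05:
        # e.g., move from SA1 to SA2/TA, or add points of interest
        return "increase model complexity or points per trace (bias down, variance may rise)"
    return "setting acceptable: certify with the tool of Durvaux et al."
```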
4 EXPERIMENTS ON UNPROTECTED DEVICE
In this section, we show how a simple approach (i.e., the bias-variance decomposition) guides the evaluators to find the best profiled attack against an unprotected device. More precisely, the evaluators reduce the error rate of a strategy by reducing the term dominating the error rate.
4.1 Acquisition setup
A set of 80 000 power traces was collected on an 8-bit Atmel (ATMega 328) microcontroller at a 16 MHz clock frequency.

Parameter (increase)            Bias   Variance
size of the profiling set        −        ↘
complexity of the model          ↘        ↗
size of the attacking set        ↘        ?
number of points of interest     ↘        ?

TABLE 1: Impact of each parameter on the bias and on the variance terms. The symbols ↗ and ↘ represent respectively an increase and a decrease of the term/parameter. The symbol − is used when the term does not change, while the symbol ? represents an unknown effect on the term.

The power consumption of the device was measured using an Agilent Infiniium 9000 Series oscilloscope set up to acquire 200 MSamples/s. In order to measure the device's power consumption, we inserted a 10 Ω resistor between the ground pin of the microcontroller and the ground of the power supply. In order to reduce the noise in the power traces, we used averaging (done by the oscilloscope); thus, each power trace represents an average of 64 single acquisitions. We separate the dataset of 80 000 power traces into two parts: the profiling source (containing 56 000 traces) and the attack source (containing 24 000 traces).
Our target device executes AES using a constant 128-bit key and random plaintexts. We target the first round of the cipher (manipulating the AES SBox) and focus on the first byte of the key.
4.2 Case studies
We vary five parameters in our experiments: (1) the number of points per trace, (2) the number of profiling traces per target value, (3) the number of attack traces, (4) the type of attack and (5) the leakage degree/complexity. The main purpose is to provide a large quantity of information allowing us to explore a large number of stories (each representing a sequence of steps leading from an initial setting to a final one), although we exhibit three stories that evaluators may meet against unprotected devices, leaving the other stories to be discovered by the interested reader.
Our experiments test eight popular families of profiled attacks: template attacks, three stochastic attacks (of degrees 1, 2 and 3), as well as profiled correlation power attacks based on template attacks and based on three different stochastic attacks (of degrees 1, 2 and 3).
We consider several scenarios by using three different sizes of the attacking set (Na ∈ {5, 10, 20}), two different sizes of the profiling set (Np ∈ {3, 10}), five different leakage dimensions (n ∈ {1, 2, 5, 10, 20}) and two different leakage functions.
We consider a linear feature selection⁶ in order to reduce the dimensionality of the problem by selecting points that correlate (linearly) with the Hamming weight of the output of the SBox⁷. This approach allows simulating two leakage
6. We estimate the Pearson correlation with 1 000 leakages from the profiling source.
7. Note that, during our experiments, the dimensionality reduction algorithm selected the points most (linearly) correlated with the output of the SBox. Nevertheless, this provides points highly correlated with the Hamming weight of the output of the SBox thanks to the high correlation between {0, ..., 2m−1} and HW({0, ..., 2m−1}) for small m.

functions: a linear leakage function (when the adversary targets the output of the SBox), and a nonlinear leakage function (when the adversary targets the input of the SBox). It is also worth noting that linear feature selection speeds up the execution of the diagnosis tool. However, we advise using a nonlinear feature selection algorithm (such as the minimum redundancy maximum relevance algorithm proposed by Peng et al. [29]) in an evaluation in order to extract all the information available in the leakages.
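A sketch of the linear feature selection used here: rank the time samples by their absolute Pearson correlation with the Hamming weight of the target value and keep the n best instants, estimating the correlation on a subset of the profiling source as in footnote 6.

```python
import numpy as np

def select_points(traces, ys, n):
    """traces: (N, T) profiling leakages; ys: (N,) target values.
    Keeps the n samples most (linearly) correlated with HW(y);
    assumes no sample column is constant over the subset."""
    hw = np.array([bin(y).count("1") for y in ys], dtype=float)
    scores = [abs(np.corrcoef(traces[:, t], hw)[0, 1])
              for t in range(traces.shape[1])]
    return np.argsort(scores)[::-1][:n]      # indices of the selected instants
```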
For each case study, we estimate the parameters of each profiled attack 100 times (with different profiling sets). This number of estimated distinguishers already provides interesting insights on the quantity of bias and variance. An interesting direction for future work would be to find strategies that select the best number of profiled attacks to build.
Section 2.2.2 reports that, in unprotected contexts, stochastic attacks of degree strictly less than 8 represent distinguishers with lower complexity than template attacks. As a result, template attacks exhibit a lower bias (but a higher variance) than stochastic attacks of degree strictly less than 8. This is why we assume in the following that the best profiled attack that we can mount against unprotected devices is a template attack with a large profiling set (which leads to a small variance term). Figure 2 plots the error rate of template attacks as a function of the number of attack traces, the number of points per trace, as well as the number of profiling traces. This figure highlights (1) the lower bound of the error rate of the considered attacks, and (2) the difficulty of estimating the actual bias and the actual variance terms due to the assumption that the Bayes classifier equals the Oracle model. More precisely, Section 6 demonstrates that the accuracy of the bias-variance decomposition increases with the decrease of the error rate of the Bayes classifier. In the case study using 5 attack traces and 1 point per leakage, the best attack that we can mount converges to an estimated error rate of at least 0.89 (which represents the estimated error rate of the Bayes classifier). As a result, the bias-variance decomposition based on the Bayes classifier could significantly differ from the bias-variance decomposition based on the Oracle model (leading to a high estimation error of the decomposition of the error rate). Although small attacking sets can provide inaccurate estimations of the bias and variance terms, the rationale for evaluators to consider such small sets relies on several factors such as the running time of the evaluation (i.e., the smaller the attacking set, the lower the running time) and constraints imposed by the executed cryptographic scheme (e.g., a fresh re-keying scheme constrains the size of the attacking set to small values). Nevertheless, we demonstrate in the following that an evaluator can still extract information on the contribution of the bias and the variance terms.
4.3 Bias-variance decomposition of probability-based profiled attacks
Tables 2 and 3 provide the bias-variance error rate decomposition of respectively stochastic attacks and template attacks when the adversary targets the output of the SBox (that simulates a linear leakage function). Due to space constraints, we describe only a couple of scenarios. Let us assume that the

[Figure: three panels, (a) Na = 5, (b) Na = 10, (c) Na = 20; each plots the error rate against the number of traces per target value (profiling), for 1, 2, 5, 10 and 20 points per trace.]
Fig. 2: Error rate of template attacks as a function of the number of profiling traces per target value, the number of points per trace and the number of attack traces (Na) for unprotected software implementation.

adversary applies template attacks with 10 attacking traces, 3 profiling traces per target value and 1 point per leakage. The bias reaches a high value (0.66) compared to the variance (0.16) leading to a high error rate (0.82). In order to reduce the bias, the adversary can use template attacks using more points per leakage. The adversary reaches the error rate of 0.25 by using 5 points per trace leading to a lower bias (0.04) but a higher variance (0.21) due to the increase of the number of estimated parameters. In order to reach a lower error rate of 0.10, the attacker can select stochastic attacks of degree 1 leading to a lower variance (0.01) at the cost of a small increase of the bias (0.09) due to the decrease of the complexity of the distinguisher.
Table 4 shows the bias-variance error rate decomposition of template attacks and stochastic attacks when the adversary targets the input of the SBox while the leakage depends on the output of the SBox (i.e., a simulation of a nonlinear leakage function). Let us assume that the adversary first applies stochastic attacks of degree 1 using 5 attack traces, 3 profiling traces and 1 point per leakage. The experiment leads to a high bias of 0.99, a very low variance of 0.00 and an error rate of 0.99. In this case, the adversary can reduce the bias term by using template attacks, which provide a bias of 0.89, a variance of 0.05, and an error rate of 0.94. The high bias of stochastic attacks and template attacks highlights that we should increase the number of points per leakage to, for example, 5 in order to reach a lower bias term of 0.39, leading to a decrease of the error rate to 0.69 (although the variance term increases to 0.29 due to the increase of the complexity of the distinguisher).

Attack  Na  Np   n    Bias  Variance  Total
SA1      5   3    1   0.93    0.00    0.94
SA1      5   3    2   0.66    0.01    0.67
SA1      5   3    5   0.51    0.03    0.54
SA1      5   3   10   0.15    0.02    0.17
SA1      5   3   20   0.02    0.01    0.02
SA1      5  10    1   0.93    0.00    0.93
SA1      5  10    2   0.66    0.00    0.67
SA1      5  10    5   0.52    0.01    0.52
SA1      5  10   10   0.15    0.01    0.16
SA1      5  10   20   0.02    0.00    0.02
SA1     10   3    1   0.66    0.01    0.67
SA1     10   3    2   0.21    0.00    0.21
SA1     10   3    5   0.09    0.01    0.10
SA1     10   3   10   0.00    0.00    0.00
SA1     10   3   20   0.00    0.00    0.00
SA1     10  10    1   0.66    0.00    0.67
SA1     10  10    2   0.21    0.00    0.21
SA1     10  10    5   0.09    0.00    0.09
SA1     10  10   10   0.00    0.00    0.00
SA1     10  10   20   0.00    0.00    0.00
SA2      5   3    1   0.93    0.01    0.94
SA2      5   3    2   0.64    0.04    0.68
SA2      5   3    5   0.50    0.09    0.58
SA2      5   3   10   0.12    0.06    0.18
SA2      5   3   20   0.01    0.01    0.02
SA2      5  10    1   0.93    0.00    0.93
SA2      5  10    2   0.65    0.01    0.66
SA2      5  10    5   0.50    0.03    0.53
SA2      5  10   10   0.12    0.02    0.14
SA2      5  10   20   0.01    0.00    0.01
SA2     10   3    1   0.78    0.03    0.81
SA2     10   3    2   0.19    0.04    0.23
SA2     10   3    5   0.08    0.05    0.12
SA2     10   3   10   0.00    0.00    0.00
SA2     10   3   20   0.00    0.00    0.00
SA2     10  10    1   0.78    0.02    0.80
SA2     10  10    2   0.19    0.01    0.20
SA2     10  10    5   0.07    0.01    0.09
SA2     10  10   10   0.00    0.00    0.00
SA2     10  10   20   0.00    0.00    0.00

TABLE 2: Error rate decomposition of stochastic attacks of degree 1 and 2 (denoted respectively SA1 and SA2) targeting the output of the SBox of an unprotected software implementation. Each distinguisher uses Na attack traces, Np profiling traces per target value, and n points per trace.

4.4 Bias-variance decomposition of score-based profiled attacks
Table 5 shows the bias-variance decomposition of the error rate of score-based profiled attacks based on profiled correlation power analysis exploiting one point per leakage and targeting the output of the SBox. Let us assume that the attacker first tests a PCPA-SA1 (i.e., profiled correlation power analysis using stochastic attacks of degree 1) with 10 attacking traces and 3 profiling traces per target value. The strategy exhibits a high bias of 0.82 and a low variance of 0.01, leading to an error rate of 0.83. In order to reduce the error rate, the attacker can change PCPA-SA1 to more complex distinguishers such as PCPA-TA (i.e., profiled correlation power analysis using template attacks). This change decreases the bias term to 0.67 at the cost of an increase of the variance term to 0.17, leading to an increase of the error rate to 0.84. Note that although we increase the error rate, we know that this change can be helpful if we eventually increase the size of the profiling

Attack  Na  Np   n    Bias  Variance  Total
TA       5   3    1   0.89    0.05    0.94
TA       5   3    2   0.54    0.19    0.73
TA       5   3    5   0.40    0.29    0.69
TA       5   3   10   0.06    0.24    0.30
TA       5   3   20   0.00    0.01    0.01
TA       5  10    1   0.89    0.02    0.91
TA       5  10    2   0.55    0.07    0.62
TA       5  10    5   0.39    0.14    0.53
TA       5  10   10   0.06    0.07    0.13
TA       5  10   20   0.00    0.00    0.00
TA      10   3    1   0.66    0.16    0.82
TA      10   3    2   0.11    0.22    0.32
TA      10   3    5   0.04    0.21    0.25
TA      10   3   10   0.00    0.02    0.02
TA      10   3   20   0.00    0.00    0.00
TA      10  10    1   0.67    0.07    0.74
TA      10  10    2   0.11    0.07    0.18
TA      10  10    5   0.04    0.05    0.10
TA      10  10   10   0.00    0.00    0.00
TA      10  10   20   0.00    0.00    0.00

TABLE 3: Error rate decomposition of template attacks (denoted TA) targeting the output of the SBox of an unprotected software implementation. Each distinguisher uses Na attack traces, Np profiling traces per target value, and n points per trace.

Attack     Na  Np    Bias  Variance  Total
PCPA-SA1   10   3    0.82    0.01    0.83
PCPA-SA1   10  10    0.82    0.00    0.82
PCPA-SA1   20   3    0.41    0.01    0.42
PCPA-SA1   20  10    0.41    0.01    0.42
PCPA-SA2   10   3    0.80    0.03    0.83
PCPA-SA2   10  10    0.81    0.01    0.82
PCPA-SA2   20   3    0.37    0.07    0.44
PCPA-SA2   20  10    0.38    0.02    0.40
PCPA-SA3   10   3    0.77    0.07    0.84
PCPA-SA3   10  10    0.78    0.03    0.81
PCPA-SA3   20   3    0.33    0.13    0.46
PCPA-SA3   20  10    0.32    0.05    0.37
PCPA-TA    10   3    0.67    0.17    0.84
PCPA-TA    10  10    0.66    0.08    0.74
PCPA-TA    20   3    0.17    0.26    0.43
PCPA-TA    20  10    0.17    0.08    0.25

TABLE 5: Error rate decomposition of profiled correlation power analysis using template attacks (denoted PCPA-TA) and stochastic attacks of degrees 1, 2 and 3 (denoted PCPA-SA1, PCPA-SA2 and PCPA-SA3) targeting the output of the SBox of an unprotected software implementation. Each distinguisher uses Na attack traces, Np profiling traces per target value, and 1 point per trace.

Attack  Na   n    Bias  Variance  Total
SA1      5   1    0.99    0.00    0.99
SA1      5   2    0.99    0.00    0.99
SA1      5   5    0.99    0.00    0.99
SA1     10   1    0.99    0.00    0.99
SA1     10   2    0.99    0.00    0.99
SA1     10   5    0.99    0.00    0.99
SA2      5   1    0.99    0.00    0.99
SA2      5   2    0.97    0.01    0.98
SA2      5   5    0.97    0.01    0.98
SA2     10   1    0.97    0.01    0.98
SA2     10   2    0.95    0.01    0.96
SA2     10   5    0.94    0.03    0.97
SA3      5   1    0.96    0.01    0.97
SA3      5   2    0.92    0.02    0.94
SA3      5   5    0.90    0.05    0.95
SA3     10   1    0.92    0.02    0.94
SA3     10   2    0.80    0.06    0.86
SA3     10   5    0.74    0.15    0.89
TA       5   1    0.89    0.05    0.94
TA       5   2    0.54    0.19    0.73
TA       5   5    0.39    0.29    0.69
TA      10   1    0.65    0.17    0.82
TA      10   2    0.11    0.21    0.32
TA      10   5    0.04    0.21    0.25

TABLE 4: Error rate decomposition of template attacks (denoted TA) and stochastic attacks of degrees 1, 2 and 3 (denoted respectively SA1, SA2 and SA3) targeting the input of the SBox of an unprotected software implementation. Each distinguisher uses Na attack traces, 3 profiling traces per target value, and n points per trace.

set to 10 profiling traces per target value. Indeed, this last change decreases the variance to 0.08 (whereas the bias term remains constant), leading to a decrease of the error rate to 0.74. This example highlights that the security metric without its decomposition does not provide enough information on how to improve the efficiency of attacks. Note that an adversary could also reduce the bias term by increasing the number of points per leakage, which allows reducing the error rate of the profiled attacks.
4.5 Noisy hardware contexts
The previous sections show that an evaluator can exploit our diagnosis tool on unprotected software implementations. This section attests that an adversary can also apply the same diagnosis tool in a more challenging noisy hardware context. More precisely, we employ the public dataset of the second version of the DPAContest⁸. The public dataset contains 1 640 000 traces collected on an FPGA implementing the unprotected AES-128. We consider 70% of the dataset for the profiling source and the remainder for the attack source. We target the Hamming distance between the first byte of the ciphertext and the first byte of the input of the SBox executed during the last round. Due to space constraints, Table 6 reports the results only for template attacks and profiled correlation power analysis based on template attacks. Let us assume that the evaluator starts with template attacks using 1 000 attack traces, 200 profiling traces per target value and 2 points per leakage. The bias term equals 0.89 while the variance term reaches 0.02, leading to an error rate of 0.91. In order to reduce the error rate, we can reduce the bias term by increasing the number of instants per trace to 50, which provides a smaller bias term of 0.58 but a higher variance term of 0.19 (with a smaller error rate of 0.77). We can then reduce the error rate by reducing the variance with a larger profiling set of 2 000 profiling traces per target value. This last setting provides an error rate of 0.59 (with a bias of 0.55 and a small variance of 0.03). It is worth noting that the evaluator can reduce the bias term (as well as the error rate) by increasing the attacking set (as shown in Table 6).
8. http://www.dpacontest.org/v2/

Furthermore, other techniques could be exploited to reduce the error rate, such as targeting other (or several) sensitive values and adopting a key enumeration strategy (which was probably applied by the best attacks in the DPAContest v2).

Attack     Na     Np     n    Bias  Variance  Total
TA        1000    200    2    0.89    0.02    0.91
TA        1000    200   25    0.69    0.12    0.80
TA        1000    200   50    0.58    0.19    0.77
TA        1000    200   75    0.56    0.23    0.79
TA        1000   2000    2    0.90    0.00    0.90
TA        1000   2000   25    0.68    0.02    0.70
TA        1000   2000   50    0.55    0.03    0.59
TA        1000   2000   75    0.53    0.04    0.58
TA        4000    200    2    0.72    0.03    0.75
TA        4000    200   25    0.41    0.16    0.57
TA        4000    200   50    0.34    0.16    0.51
TA        4000    200   75    0.31    0.22    0.53
TA        4000   2000    2    0.72    0.01    0.73
TA        4000   2000   25    0.39    0.03    0.41
TA        4000   2000   50    0.32    0.00    0.32
TA        4000   2000   75    0.29    0.02    0.31
PCPA-TA   1000     10    1    0.83    0.09    0.92
PCPA-TA   1000     50    1    0.82    0.04    0.86
PCPA-TA   1000    200    1    0.82    0.01    0.83
PCPA-TA   1000   2000    1    0.82    0.00    0.82
PCPA-TA   4000     10    1    0.29    0.33    0.62
PCPA-TA   4000     50    1    0.30    0.08    0.38
PCPA-TA   4000    200    1    0.30    0.01    0.32
PCPA-TA   4000   2000    1    0.29    0.00    0.30

TABLE 6: Error rate decomposition of template attacks (denoted TA) and profiled correlation power analysis using template attacks (denoted PCPA-TA) targeting the Hamming distance between the ciphertext and the input of the last round of an unprotected hardware implementation of AES. Each distinguisher uses Na attack traces, Np profiling traces per target value, and n points per trace.

5 EXPERIMENTS ON PROTECTED DEVICE
This section extends the previous results to a more challenging protected (software) implementation. More precisely, we analyze a masking countermeasure, a well-known technique (due to its theoretical soundness) for increasing the degree of resilience of an implementation against side-channel attacks. Our diagnosis tool naturally extends to protected devices because our framework decomposes a security metric that can be estimated on any device.
5.1 Acquisition setup
The dataset (of 80 000 power traces) measured on the protected device was collected with the same acquisition board used for the unprotected software case studies. The profiling source and the attack source contain respectively 56 000 and 24 000 leakages. We also applied the same filtering technique (i.e., an average of 64 single acquisitions). The main difference with the previous (unprotected) setting lies in the target implementation. More precisely, this section analyses the implementation of a first-order masked AES SBox (with 2 shares) based on table lookups [30], [32]. This strategy pre-computes a new masked AES SBox in memory (denoted SBox*) for each execution of the cryptographic algorithm, such that:

$$\mathrm{SBox}^*(x \oplus m_{in}) = \mathrm{SBox}(x) \oplus m_{out} \quad \forall x \in \{0, 1\}^8, \tag{18}$$
for a given pair of input and output mask bytes (denoted respectively $m_{in}$ and $m_{out}$) that are drawn independently and uniformly at random. Our implementation avoids a first-order leakage by providing the masked input byte (i.e., $p \oplus k \oplus m_{in}$, where $p$ and $k$ represent the plaintext and the key) to the executed implementation.
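The table recomputation of Equation 18 can be sketched as follows, assuming an AES SBox table (here the aes_sbox argument) is available; a fresh pair of masks is drawn for each execution.

```python
import secrets

def masked_sbox(aes_sbox):
    """Precompute SBox* such that SBox*(x XOR m_in) = SBox(x) XOR m_out
    for fresh, uniformly random mask bytes m_in and m_out (Equation 18)."""
    m_in, m_out = secrets.randbelow(256), secrets.randbelow(256)
    table = [0] * 256
    for x in range(256):
        table[x ^ m_in] = aes_sbox[x] ^ m_out
    return table, m_in, m_out
```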
As previously, our target device uses a fixed random 128-bit key and random plaintexts. We target the first byte of the key used during the first round of the masked AES.
5.2 Case studies
We vary four parameters in our experiments: (1) the number of points per trace, (2) the number of profiling traces, (3) the number of attack traces and (4) the type of attack. Unlike the unprotected context, we focus here only on two popular profiled attacks in order to highlight the usefulness of our diagnosis tool in protected environments. More precisely, we test template attacks and stochastic-based mixture attacks of degree 1, leaving the other attacks as future work.
Due to the increase of the complexity of the case studies (related to the implemented protection), we increase the size of the attacking set to Na ∈ {25, 50, 100} and the size of the profiling set to Np ∈ {20, 40, 60, 80, 100}. In our experiment, we also consider two leakage dimensions (n ∈ {2, 5} per share, leading to {4, 10} points per leakage) but only one (linear) leakage function. Similarly to the unprotected setting, for each case study, we estimate the parameters of each profiled attack 100 times (with different profiling sets sampled with replacement from the profiling source) in order to estimate the bias, the variance and the error rate.
Section 2.2.2 reports that, in protected contexts, stochastic-based mixture attacks are unbiased models (if the leakages contain Gaussian noise and if the stochastic-based models contain the right degree) since stochastic-based mixture attacks (like any attack exploiting the mixture structure of the density function) take into account the multimodal structure of the density function associated to each value y split in several shares. In order to estimate the complexity of the scenarios based on the collected dataset, Table 7 reports the estimated lower bound of the error rate of stochastic-based mixture attacks (and of template attacks) obtained by considering a large profiling set (of 400 leakages per target value). The purpose of this table is to show in the following that, even with a high estimation error of the decomposition of the error rate (due to a large difference between the Bayes classifier, represented by the stochastic-based mixture attacks, and the Oracle model), we can extract information on the bias and on the variance.
5.3 Bias-variance decomposition of profiled attacks
Tables 8 and 9 provide the bias-variance error rate decomposition of respectively template attacks and stochastic-based mixture attacks (of degree 1) targeting the output of the SBox implemented with the masking scheme. Let us assume that the adversary applies template attacks with 25 attacking traces, 20 profiling traces per target value and 2