Elsevier

Journal of Theoretical Biology

Volume 433, 21 November 2017, Pages 21-38
Journal of Theoretical Biology

Selection originating from protein stability/foldability: Relationships between protein folding free energy, sequence ensemble, and fitness

Highlights

A Boltzmann distribution with protein fitness is derived.

Relationships between folding free energy, inverse statistical potential and fitness.

Selective temperature, glass transition temperature and folding free energy are estimated.

Relationship between selective temperature and substitution rate (Ka/Ks).

Protein stability/foldability is kept in a balance of positive selection and random drift.

Abstract

Assuming that mutation and fixation processes are reversible Markov processes, we prove that the equilibrium ensemble of sequences obeys a Boltzmann distribution with exp(4Nem(11/(2N))), where m is Malthusian fitness and Ne and N are effective and actual population sizes. On the other hand, the probability distribution of sequences with maximum entropy that satisfies a given amino acid composition at each site and a given pairwise amino acid frequency at each site pair is a Boltzmann distribution with exp(ψN), where the evolutionary statistical energy ψN is represented as the sum of one body (h) (compositional) and pairwise (J) (covariational) interactions over all sites and site pairs. A protein folding theory based on the random energy model (REM) indicates that the equilibrium ensemble of natural protein sequences is well represented by a canonical ensemble characterized by exp(ΔGND/kBTs) or by exp(GN/kBTs) if an amino acid composition is kept constant, where ΔGNDGNGD, GN and GD are the native and denatured free energies, and Ts is the effective temperature representing the strength of selection pressure. Thus, 4Nem(11/(2N)),ΔψND(ψN+ψD), and ΔGND/kBTs must be equivalent to each other. With h and J estimated by the DCA program, the changes (ΔψN) of ψN due to single nucleotide nonsynonymous substitutions are analyzed. The results indicate that the standard deviation of ΔGN(=kBTsΔψN) is approximately constant irrespective of protein families, and therefore can be used to estimate the relative value of Ts. Glass transition temperature Tg and ΔGND are estimated from estimated Ts and experimental melting temperature (Tm) for 14 protein domains. The estimates of ΔGND agree with their experimental values for 5 proteins, and those of Ts and Tg are all within a reasonable range. In addition, approximating the probability density function (PDF) of ΔψN by a log-normal distribution, PDFs of ΔψN and Ka/Ks, which is the ratio of nonsynonymous to synonymous substitution rate per site, in all and in fixed mutants are estimated. The equilibrium values of ψN, at which the average of Δψ in fixed mutants is equal to zero, well match ψN averaged over homologous sequences, confirming that the present methods for a fixation process of mutations and for the equilibrium ensemble of ψN give a consistent result with each other. The PDFs of Ka/Ks at equilibrium confirm that Ts negatively correlates with the amino acid substitution rate (the mean of Ka/Ks) of protein. Interestingly, stabilizing mutations are significantly fixed by positive selection, and balance with destabilizing mutations fixed by random drift, although most of them are removed from population. Supporting the nearly neutral theory, neutral selection is not significant even in fixed mutants.

Keywords

Folding free energy change
Inverse statistical potential
Boltzmann distribution
Selective temperature
Positive selection

1. Introduction

Natural proteins can fold their sequences into unique structures. Protein’s stability and foldability result from natural selection and are not typical characteristics of random polymers (Bryngelson and Wolynes, 1987; Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b). Natural selection maintains protein’s stability and foldability over evolutionary timescales. On the basis of the random energy model (REM) for protein folding, it was discussed (Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b) that the equilibrium ensemble of natural protein sequences in sequence space is well represented by a canonical ensemble characterized by a Boltzmann factor exp(ΔGND(σ)/kBTs), where ΔGND(σ)(GN(σ)GD(σ)) is the folding free energy of sequence σ, GN and GD are the free energies of the native and denatured states, kB is the Boltzmann constant, and Ts is the effective temperature representing the strength of selection pressure and must satisfy Ts < Tg < Tm for natural proteins to fold into unique native structures; Tg is glass transition temperature and Tm is melting temperature. The REM also indicates that the free energy of denatured conformations (GD) is a function of amino acid frequencies only and does not depend on amino acid order, and therefore the Boltzmann factor will be taken as exp(GN(σ)/kBTs), if amino acid frequencies are kept constant. It was shown by lattice Monte Carlo simulations (Shakhnovich, 1994) that lattice protein sequences selected with this Boltzmann factor were not trapped by competing structures but could fold into unique native structures. Selective temperatures were also estimated (Dokholyan and Shakhnovich, 2001) for actual proteins to yield good correlations of sequence entropy between actual protein families and sequences designed with this type of Boltzmann factor.

On the other hand, the maximum entropy principle insists that the probability distribution of sequences in sequence space, which satisfies constraints on amino acid compositions at all sites and on amino acid pairwise frequencies for all site pairs, is a Boltzmann distribution with the Boltzmann factor exp(ψN(σ)), where the total interaction ψN(σ) of a sequence σ is represented as the sum of one-body (h) (compositional) and pairwise (J) (covariational) interactions between residues in the sequence; ψN(σ) is called the evolutionary statistical energy by Hopf et al. (2017). The inverse statistical potentials, the one-body (h) and pairwise (J) interactions, that satisfy those constraints for homologous sequences have been estimated (Ekeberg et al., 2014; 2013; Marks et al., 2011; Morcos et al., 2011) as one of inverse Potts problems and successfully employed to predict contacting residue pairs in protein structures (Ekeberg et al., 2014; 2013; Hopf et al., 2012; Marks et al., 2011; Miyazawa, 2013; Morcos et al., 2011; Sułkowska et al., 2012). Morcos et al. (2014) noticed that the ψN in the Boltzmann factor is the dimensionless energy corresponding to GN/kBTs, and estimated selective temperatures (Ts) for several protein families by comparing the difference (ΔψND) of ψ between the native and the molten globule states with folding free energies (ΔGND) estimated with associative-memory, water-mediated, structure, and energy model (AWSEM) (Davtyan et al., 2012).

A purpose of the present study is to establish relationships between protein foldability/stability, sequence distribution, and protein fitness. First, we prove that if mutation and fixation processes in protein evolution are reversible Markov processes, the equilibrium ensemble of genes will obey a Boltzmann distribution with the Boltzmann factor exp(4Nem(11/(2N))), where Ne and N are effective and actual population sizes, and m is the Malthusian fitness of a gene. In other words, correspondences between ΔGND/kBTs,ΔψND(ψNψD) and 4Nem(11/(2N)) are obtained by equating these three Boltzmann distributions with each other; ψDGD/kBTs+constant.

The second purpose is to analyze the effects (ΔψN) of single amino acid substitutions on the evolutionary statistical energy of a protein, and to estimate from the distribution of ΔψN the effective temperature of natural selection (Ts) and then glass transition temperature (Tg) and folding free energy (ΔGND) of protein. We estimate the one-body (h) and pairwise (J) interactions with the DCA program, which is available at “http://dca.rice.edu/portal/dca/home”, and then analyze the changes (ΔψN) of the evolutionary statistical energy (ψN) of a natural sequence due to single amino acid substitutions caused by single nucleotide changes. The data of ΔψN due to single nucleotide nonsynonymous substitutions for 14 protein domains show that the standard deviation of ΔψN over all the substitutions at all sites hardly depends on the evolutionary statistical energy (ψN) of each homologous sequence and is nearly constant for each protein family, indicating that the standard deviation of ΔGN ≃ kBTsΔψN is nearly constant irrespective of protein families. From this finding, Ts for each protein family has been estimated in relative to Ts for the PDZ family, which is determined by directly comparing ΔΔψND(Δ(ψNψD)ΔψN) with the experimental values of folding free energy changes, ΔΔGND, due to single amino acid substitutions. Also Tg and ΔGND for each protein family are estimated on the basis of the REM from the estimated Ts and an experimental melting temperature Tm. The estimates of Ts and Tg are all within a reasonable range, and those of ΔGND are well compared with experimental ΔGND for 5 protein families. The present method for estimating Ts is simpler than the method (Morcos et al., 2014) using AWSEM, and also is useful for the prediction of ΔGND, because the experimental data of ΔGND are limited in comparison with Tm, and also experimental conditions such as temperature and pH tend to be different among them. In addition, it has been revealed that ΔψN averaged over all single nucleotide nonsynonymous substitutions is a linear function of ψN/L of each homologous sequence, where L is sequence length; the average of ΔψN decreases as ψN/L increases. This characteristic is required for homologous proteins to stay at the equilibrium state of the native conformational energy GN ≃ kBTsψN, and indicates a weak dependency (Miyazawa, 2016; Serohijos et al., 2012) of ΔΔGND on ΔGND/L of protein across protein families.

The third purpose is to study an amino acid substitution process in protein evolution, which is characterized by the fitness, m=ΔψND/(4Ne(11/(2N))). We employ a monoclonal approximation for mutation and fixation processes of genes, in which protein evolution proceeds with single amino acid substitutions fixed at a time in a population. In this approximation, ψN of a protein gene attains the equilibrium, ψN=ψNeq, when the average of ΔψN( ≃ ΔΔψND) over singe nucleotide nonsynonymous mutations fixed in a population is equal to zero. Approximating the distribution of ΔψN due to singe nucleotide nonsynonymous mutations by a log-normal distribution, their distribution for fixed mutants is numerically calculated and used to calculate the averages of various quantities and also the probability density functions (PDF) of Ka/Ks in all arising mutants and also in fixed mutants only; Ka/Ks is defined as the ratio of nonsynonymous to synonymous substitution rate per site. There is a good agreement between the time average (ψNeq) and ensemble average (⟨ψNσ), which is equal to the sample average, ψN¯, of ψN over homologous sequences, supporting the constancy of the standard deviation of ΔψN assumed in the monoclonal approximation.

We also study protein evolution at equilibrium, ψN=ψNeq. The common understanding of protein evolution has been that amino acid substitutions observed in homologous proteins are neutral (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) or slightly deleterious (Ohta, 1973; 1992), and random drift is a primary force to fix amino acid substitutions in population. The PDFs of Ka/Ks in all arising mutations and in their fixed mutations are examined to see how significant each of positive, neutral, slightly negative,and negative selections is. Interestingly, stabilizing mutations are significantly fixed in population by positive selection, and balance with destabilizing mutations that are also significantly fixed by random drift, although most negative mutations are removed from population. Contrary to the neutral theory (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) and supporting the nearly neutral theory (Ohta, 1973; 1992; 2002), the proportion of neutral selection is not large even in fixed mutants. It is also confirmed that the effective temperature (Ts) of selection negatively correlates with the amino acid substitution rate (Ka/Ks) of protein at equilibrium.

2. Methods

2.1. Knowledge of protein folding

A protein folding theory (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b), which is based on a random energy model (REM), indicates that the equilibrium ensemble of amino acid sequences, σ(σ1,,σL) where σi is the type of amino acid at site i and L is sequence length, can be well approximated by a canonical ensemble with a Boltzmann factor consisting of the folding free energy, ΔGND(σ, T) and an effective temperature Ts representing the strength of selection pressure.

(1)P(σ)Pmut(σ)exp(ΔGND(σ,T)kBTs)
(2)exp(GN(σ)kBTs)iff(σ)=constant
(3)ΔGND(σ,T)GN(σ)GD(f(σ),T)
where pmut(σ) is the probability of a sequence (σ) randomly occurring in a mutational process and depends only on the amino acid frequencies f(σ), kB is the Boltzmann constant, T is a growth temperature, and GN and GD are the free energies of the native conformation and denatured state, respectively. Selective temperature Ts quantifies how strong the folding constraints are in protein evolution, and is specific to the protein structure and function. The free energy GD of the denatured state does not depend on the amino acid order but the amino acid composition, f(σ), in a sequence (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b). It is reasonable to assume that mutations independently occur between sites, and therefore the equilibrium frequency of a sequence in the mutational process is equal to the product of the equilibrium frequencies over sites; Pmut(σ)=ipmut(σi), where pmut(σi) is the equilibrium frequency of σi at site i in the mutational process.

The distribution of conformational energies in the denatured state (molten globule state), which consists of conformations as compact as the native conformation, is approximated in the random energy model (REM), particularly the independent interaction model (IIM) (Pande et al., 1997), to be equal to the energy distribution of randomized sequences, which is then approximated by a Gaussian distribution, in the native conformation. That is, the partition function Z for the denatured state is written as follows with the energy density n(E) of conformations that is approximated by a product of a Gaussian probability density and the total number of conformations whose logarithm is proportional to the chain length.

(4)Z=exp(EkBT)n(E)dE
(5)n(E)exp(ωL)N(E¯(f(σ)),δE2(f(σ)))
where ω is the conformational entropy per residue in the compact denatured state, and N(E¯(f(σ)),δE2(f(σ))) is the Gaussian probability density with mean E¯ and variance δE2, which depend only on the amino acid composition of the protein sequence. The free energy of the denatured state is approximated as follows.
(6)GD(f(σ),T)E¯(f(σ))δE2(f(σ))2kBTkBTωL
(7)=E¯(f(σ))δE2(f(σ))ϑ(T/Tg)kBT
(8)ϑ(T/Tg){12(1+T2Tg2)forT>TgTTgforTTg
where E¯ and δE2 are estimated as the mean and variance of interaction energies of randomized sequences in the native conformation. Tg is the glass transition temperature of the protein at which entropy becomes zero (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b); GD/T|T=Tg=0. The conformational entropy per residue ω in the compact denatured state can be represented with Tg; ωL=δE2/(2(kBTg)2). Thus, unless Tg < Tm, a protein will be trapped at local minima on a rugged free energy landscape before it can fold into a unique native structure.

2.2. Probability distribution of homologous sequences with the same native fold in sequence space

The probability distribution P(σ) of homologous sequences with the same native fold, σ=(σ1,,σL) where σi ∈ {amino  acids, deletion}, in sequence space with maximum entropy, which satisfies a given amino acid frequency at each site and a given pairwise amino acid frequency at each site pair, is a Boltzmann distribution (Marks et al., 2011; Morcos et al., 2011).

(9)P(σ)exp(ψN(σ))
(10)ψN(σ)(iL(hi(σi)+j>iJij(σi,σj)))
where hi and Jij are one-body (compositional) and two-body (covariational) interactions and must satisfy the following constraints.
(11)σP(σ)δσiak=Pi(ak)
(12)σP(σ)δσiakδσjal=Pij(ak,al)
where δσiak is the Kronecker delta, Pi(ak) is the frequency of amino acid ak at site i, and Pij(ak, al) is the frequency of amino acid pair, ak at i and al at j; ak ∈ {amino acids, deletion}. The pairwise interaction matrix J satisfies Jij(ak,al)=Jji(al,ak) and Jii(ak,al)=0. Interactions hi and Jij can be well estimated from a multiple sequence alignment (MSA) in the mean field approximation (Marks et al., 2011; Morcos et al., 2011), or by maximizing a pseudo-likelihood (Ekeberg et al., 2014; 2013). Because ψN(σ) has been estimated under the constraints on amino acid compositions at all sites, only sequences with a given amino acid composition contribute significantly to the partition function, and other sequences may be ignored.

Hence, from Eqs. (2) and (9),

(13)ψN(σ)GN(σ)/(kBTs)+functionoff(σ)
(14)ψD(f(σ),T)GD(f(σ),T)/(kBTs)+functionoff(σ)
(15)ΔψND(σ,T)ΔGND(σ,T)/(kBTs)
(16)ΔψND(σ,T)ψN(σ)ψD(f(σ),T)
(17)ψD(f(σ),T)ψ¯(f(σ))δψ2(f(σ))ϑ(T/Tg)Ts/T
(18)ω=(Ts/Tg)2δψ2/(2L)
where the ψ¯ and δψ2 are estimated as the mean and variance of ψN over randomized sequences; E¯kBTsψ¯ and δE2 ≃ (kBTs)2δψ2.

2.3. The equilibrium distribution of sequences in a mutation-fixation process

Here we assume that the mutational process is a reversible Markov process. That is, the mutation rate per gene, Mμν, from sequence μ(μ1,,μL) to ν satisfies the detailed balance condition

(19)Pmut(μ)Mμν=Pmut(ν)Mνμ
where Pmut(ν) is the equilibrium frequency of sequence ν in a mutational process, Mμν. The mutation rate per population is equal to 2NMμν for a diploid population, where N is the population size. The substitution rate Rμν from μ to ν is equal to the product of the mutation rate and the fixation probability with which a single mutant gene becomes to fully occupy the population (Crow and Kimura, 1970).
(20)Rμν=2NMμνu(s(μν))
where u(s(μ → ν)) is the fixation probability of mutants from μ to ν the selective advantage of which is equal to s.

For genic selection (no dominance) or gametic selection in a Wright-Fisher population of diploid, the fixation probability, u, of a single mutant gene, the selective advantage of which is equal to s and the frequency of which in a population is equal to qm=1/(2N), was estimated (Crow and Kimura, 1970) as

(21)2Nu(s)=2N1e4Nesqm1e4Nes
(22)=u(s)u(0)withqm=12N
where Ne is effective population size. Eq. (21) will be also valid for haploid population if 2Ne and 2N are replaced by Ne and N, respectively. Also, for Moran population of haploid, 4Ne and 2N should be replaced by Ne and N, respectively. Fixation probabilities for various selection models, which are compiled from p. 192 and p. 424–427 of Crow and Kimura (1970) and from Moran (1958) and Ewens (1979), are listed in Table S.7. The selective advantage of a mutant sequence ν to a wildtype μ is equal to
(23)s(μν)=m(ν)m(μ)
where m(ν) is the Malthusian fitness of a mutant sequence, and m(μ) is for the wildtype.

This Markov process of substitutions in sequence is reversible, and the equilibrium frequency of sequence μ, Peq(μ), in the total process consisting of mutation and fixation processes is represented by

(24)Peq(μ)=Pmut(μ)exp(4Nem(μ)(1qm))νPmut(ν)exp(4Nem(ν)(1qm))
because both the mutation and fixation processes satisfy the detailed balance conditions, Eq. (19) and the following equation, respectively.
(25)exp(4Nem(μ)(1qm))u(s(μν))=exp(4Nem(μ)qm)exp(4Nem(ν)qm)exp(4Nem(μ))exp(4Nem(ν))
(26)=exp(4Nem(ν)(1qm))u(s(νμ))
As a result, the ensemble of homologous sequences in molecular evolution obeys a Boltzmann distribution.

2.4. Relationships between m(σ), ψN(σ), and ΔGND(σ) of protein sequence

From Eqs. (1), (9), and (24), we can get the following relationships among the Malthusian fitness m, the folding free energy change ΔGND and ΔψND of protein sequence.

(27)Peq(μ)=Pmut(μ)exp(4Nem(μ)(1qm))νPmut(ν)exp(4Nem(ν)(1qm)))
(28)=Pmut(μ¯)exp((ψN(μ)ψD(f(μ)¯,T)))νPmut(ν¯)exp((ψN(ν)ψD(f(ν)¯,T)))
(29)Pmut(μ)exp(ΔGND(μ,T)/(kBTs))νPmut(ν)exp(ΔGND(ν,T)/(kBTs))
where f(σ)¯σf(σ)P(σ) and logPmut(σ¯)σP(σ)log(iPmut(σi)). Then, the following relationships are derived for sequences for which f(μ)=f(μ)¯.
(30)4Nem(μ)(1qm)=ΔψND(μ,T)+constant
(31)ΔGND(μ,T)kBTs+constant
The selective advantage of ν to μ is represented as follows for f(μ)=f(ν)=f(σ)¯.
(32)4Nes(μν)(1qm)=(4Nem(ν)4Nem(μ))(1qm)
(33)=(ΔψND(ν,T)ΔψND(μ,T))=(ψN(ν)ψN(μ))
(34)(ΔGND(ν,T)ΔGND(μ,T))/(kBTs)=(GN(ν)GN(μ))/(kBTs)
It should be noted here that only sequences for which f(σ)=f(σ)¯ contribute significantly to the partition functions in Eq. (28), and other sequences may be ignored.

Eq. (33) indicates that evolutionary statistical energy ψ should be proportional to effective population size Ne, and therefore it is ideal to estimate one-body (h) and two-body (J) interactions from homologous sequences of species that do not significantly differ in effective population size. Also, Eq. (34) indicates that selective temperature Ts is inversely proportional to the effective population size Ne; Ts∝1/Ne, because free energy is a physical quantity and should not depend on effective population size.

2.5. The ensemble average of folding free energy, ΔGND(σ, T), over sequences

The ensemble average of ΔGND(σ, T) over sequences with Eq. (1) is

(35)ΔGND(σ,T)σ
(36)[σΔGND(σ,T)Pmut(σ)exp(ΔGND(σ,T)kBTs)]/[σPmut(σ)exp(ΔGND(σ,T)kBTs)]
(37)[σ|f(σ)=f(σN)¯GN(σ)exp(GN(σ)kBTs)]/[σ|f(σ)=f(σN)¯exp(GN(σ)kBTs)]GD(f(σN)¯,T)
(38)=GN(σ)σGD(f(σN)¯,T)
where σN denotes a natural sequence, and f(σN)¯ denotes the average of amino acid frequencies f(σN) over homologous sequences. In Eq. (37), the sum over all sequences is approximated by the sum over sequences the amino acid composition of which is the same as that over the natural sequences.

The ensemble averages of GN and ψN(σ) are estimated in the Gaussian approximation (Pande et al., 1997).

(39)GN(σ)σEexp(E/(kBTs))n(E)dEexp(E/(kBTs))n(E)dE
(40)=E¯(f(σN)¯)δE2(f(σN)¯)/(kBTs)
(41)ψN(σ)σ[σψND(σ)exp(ψN(σ))]/[σexp(ψN(σ))]
(42)ψ¯(f(σN)¯)δψ2(f(σN)¯)
The ensemble averages of ΔGND(σ, T) and ψN(σ) over sequences are observable as the sample averages of ΔGND(σN, T) and ψN(σN) over homologous sequences fixed in protein evolution, respectively.
(43)ΔGND(σN,T)¯/(kBTs)=ΔGND(σ,T)σ/(kBTs)
(44)δψ2(f(σN)¯)[ϑ(T/Tg)Ts/T1]
(45)ψN(σN)¯σNwσNψN(σN)σNwσN
(46)=ψN(σ)σ
where the overline denotes a sample average with a sample weight wσN for each homologous sequence, which is used to reduce phylogenetic biases in the set of homologous sequences.

The folding free energy becomes equal to zero at the melting temperature Tm; ΔGND(σN,Tm)σ=0. Thus, the following relationship must be satisfied (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b).

(47)ϑ(Tm/Tg)TsTm=Ts2Tm(1+Tm2Tg2)=1withTsTgTm

2.6. Probability distributions of selective advantage, fixation rate and Ka/Ks

Let us consider the probability distributions of characteristic quantities that describe the evolution of genes. First of all, the probability density function (PDF) of selective advantage s, p(s), of mutant genes can be calculated from the PDF of the change of ΔψND due to a mutation from μ to ν, ΔΔψND(ΔψND(ν,T)ΔψND(μ,T)). The PDF of 4Nes, p(4Nes)=p(s)/(4Ne), may be more useful than p(s).

(48)p(4Nes)=p(ΔΔψND)|dΔΔψNDd4Nes|=p(ΔΔψND)(1qm)
where ΔΔψND must be regarded as a function of 4Nes, that is, ΔΔψND=4Nes(1qm); see Eq. (33).

The PDF of fixation probability u can be represented by

(49)p(u)=p(4Nes)d4Nesdu=p(4Nes)(e4Nes1)2e4Nes(qm1)qm(e4Nes1)(e4Nesqm1)
where 4Nes must be regarded as a function of u.

The ratio of the substitution rate per nonsynonymous site (Ka) for nonsynonymous substitutions with selective advantage s to the substitution rate per synonymous site (Ks) for synonymous substitutions with s = 0 is

(50)KaKs=u(s)u(0)=u(s)qm
assuming that synonymous substitutions are completely neutral and mutation rates at both types of sites are the same. The PDF of Ka/Ks is
(51)p(Ka/Ks)=p(u)dud(Ka/Ks)=p(u)qm

2.7. Probability distributions of ΔΔψND, 4Nes, u, and Ka/Ks in fixed mutant genes

The PDF of ΔΔψND in fixed mutants is proportional to that multiplied by the fixation probability.

(52)p(ΔΔψND,fixed)=p(ΔΔψND)u(s(ΔΔψND))u(s(ΔΔψND))
(53)uu(s)p(ΔΔψND)dΔΔψND
Likewise, the PDF of selective advantage in fixed mutants is
(54)p(4Nesfixed)=p(4Nes)u(s)u(s)
and those of the u and Ka/Ks in fixed mutants are
(55)p(ufixed)=p(u)uu
(56)p((KaKs)fixed)=p(KaKs)uu=p(KaKs)KaKsKaKs
The average of Ka/Ks in fixed mutants is equal to the ratio of the second moment to the first moment of Ka/Ks in all arising mutants; Ka/Ksfixed=(Ka/Ks)2/Ka/Ks.

3. Materials

3.1. Sequence data

We study the single domains of 8 Pfam (Finn et al., 2016) families and both the single domains and multi-domains from 3 Pfam families. In Table 1, their Pfam ID for a multiple sequence alignment, and UniProt ID and PDB ID with the starting- and ending-residue positions of the domains are listed. The full alignments for their families at the Pfam are used to estimate one-body interactions h and pairwise interactions J with the DCA program from “http://dca.rice.edu/portal/dca/home ” (Marks et al., 2011; Morcos et al., 2011). To estimate the sample (ψN¯) and ensemble (⟨ψNσ) averages of the evolutionary statistical energy, M unique sequences with no deletions are used. In order to reduce phylogenetic biases in the set of homologous sequences, we employ a sample weight (wσN) for each sequence, which is equal to the inverse of the number of sequences that are less than 20% different from a given sequence in a given set of homologous sequences. Only representatives of unique sequences with no deletions, which are at least 20% different from each other, are used to calculate the changes of the evolutionary statistical energy (ΔψN) due to single nucleotide nonsynonymous substitutions; the number of the representatives is almost equal to the effective number of sequences (Meff) in Table 1.

Table 1. Protein families, and structures studied.

Pfam familyUniProt IDNaNeffb,cMdMeffc,eLfPDB ID
HTH_3RPC1_BP434/7-5915315(15917)11691.2162864893.73531R69-A:6-58
NitroreductaseQ97IT9_CLOAB/4-766008(6084)4912.961057854.71733E10-A/B:4-76g
SBP_bac_3hGLNH_ECOLI/27-2449874(9972)7374.9614099.702181WDN-A:5-222
SBP_bac_3GLNH_ECOLI/111-2049712(9898)7442.85829689.64941WDN-A:89-182
OmpAPAL_ECOLI/73-1676035(6070)4920.4422071761.24951OAP-A:52-146
DnaBDNAB_ECOLI/31-1281929(1957)1284.941187697.30981JWE-A:30-127
LysR_substratehBENM_ACIAD/90-28025138(25226)20707.0685(1)67.001912F6G-A/B:90-280g
LysR_substrateBENM_ACIAD/163-26525032(25164)21144.74121(1)99.271032F6G-A/B:163-265g
Methyltransf_5hRSMH_THEMA/8-2921942(1953)1286.67578(2)357.972851N2X-A:8-292
Methyltransf_5RSMH_THEMA/137-2161877(1911)1033.35975(2)465.53801N2X-A:137-216
SH3_1SRC_HUMAN:90-1379716(16621)3842.471191458.31481FMK-A:87-134
ACBPACBP_BOVIN/3-822130(2526)1039.0616170.72802ABD-A:2-81
PDZPTN13_MOUSE/1358-143813814(23726)4748.761255339.99811GM1-A:16-96
Copper-bindAZUR_PSEAE:24-1481136(1169)841.5667(1)45.231255AZU-B/C:4-128g
a

The number of unique sequences and the total number of sequences in parentheses; the full alignments in the Pfam (Finn et al., 2016) are used.

b

The effective number of sequences.

c

A sample weight (wσN) for a given sequence is equal to the inverse of the number of sequences that are less than 20% different from the given sequence.

d

The number of unique sequences that include no deletion unless specified. The number in parentheses indicates the maximum number of deletions allowed.

e

The effective number of unique sequences that include no deletion or at most the specified number of deletions.

f

The number of residues.

g

Contacts are calculated in the homodimeric state for these protein.

h

These proteins consist of two domains, and other ones are single domains.

4. Results

First, We describe how one-body and pairwise interactions, h and J, are estimated. Then, the changes of evolutionary statistical energy (ΔψN) due to single nucleotide nonsynonymous changes on natural sequences are analyzed with respect to dependences on the ψN of the wildtype sequences. The results indicate that the standard deviation of ΔGN ≃ kBTsΔψN is almost constant over protein families. Hence, the selective temperatures, Ts, of various protein families can be estimated in a relative scale from the standard deviation of ΔψN. The Ts of a reference protein is estimated by comparing the expected values of ΔΔGND with their experimental values. Folding free energies ΔGND are estimated from estimated Ts and experimental melting temperature Tm, and compared with their experimental values for 5 protein families. Glass transition temperatures Tg are also estimated from Ts and Tm.

Secondly, based on the distribution of ΔψN, protein evolution is studied. Evolutionary statistical energy (ψN) attains the equilibrium when the average of ΔψN over fixed mutations is equal to zero. The PDF of ΔψN is approximated by log-normal distributions. The basic relationships are that 1) the standard deviation of ΔψN is constant specific to a protein family, and 2) the mean of ΔψN linearly depends on ψN. The equilibrium value of ψN is shown to agree with the mean of ψN over homologous proteins in each protein family. In the present approximation, the standard deviation of ΔψN and selective temperature Ts at the equilibrium are simple functions of the equilibrium value of mean ΔψN, ΔψN¯eq. Lastly, the probability distribution of Ka/Ks, which is the ratio of nonsynonymous to synonymous substitution rate per site, is analyzed as a function of ΔψN¯eq, in order to examine how significant neutral selection is in the selection maintaining protein stability and foldability. Also, it is confirmed that selective temperature Ts negatively correlates with the mean of Ka/Ks, which represents the evolutionary rate of protein.

4.1. Important parameters in the estimations of one-body and pairwise interactions, h and J, and of the evolutionary statistical energy, ψN(σ)

The one-body (h) and pairwise (J) interactions for amino acid order in a protein sequence are estimated here by the DCA method (Marks et al., 2011; Morcos et al., 2011), although there are multiple methods for estimating them (Ekeberg et al., 2014; 2013). In the case of the DCA method, the ratio of pseudocount (0 ≤ pc ≤ 1) defined in Eqs. (S.70) and (S.71) is a parameter and controls the values of the ensemble and sample averages of ψN in sequence space, ⟨ψN(σ)⟩σ in Eq. (42) and ψN(σN)¯ in Eq. (45); a weight for observed counts is defined to be equal to (1pc). Sample average means the average over all homologous sequences with a weight for each sequence to reduce phylogenetic biases. An appropriate value must be chosen for the ratio of pseudocount in a reasonable manner.

Another problem is that the estimates of h and J(Marks et al., 2011; Morcos et al., 2011) may be noisy as a result of estimating many interaction parameters from a relatively small number of sequences. Therefore, only pairwise interactions within a certain distance are taken into account; the estimate of J is modified as follows, according to Morcos et al. (2014).

(57)J^ijq(ak,al)=Jijq(ak,al)H(rcutoffrij)
where Jq is the statistical estimate of J in the mean field approximation in which the amino acid aq is the reference state, H is the Heaviside step function, and rij is the distance between the centers of amino acid side chains at sites i and j in a protein structure, and rcutoff is a distance threshold for residue pairwise interactions. The one-body interactions hi(ak) are estimated in the isolated two-state model (Morcos et al., 2011) rather than the mean field approximation; see the Method section in the Supplement for details. The zero-sum gauge is employed to represent h and J; kh^is(ak)=klJ^ijs(ak,al)=0 in the zero-sum gauge.

Candidates for the cutoff distance may be about 8 Å for the first interaction shell and 15–16 Å for the second interaction shell between residues; distance between the centers of side chain atoms is employed for residue distance. Here both the distances are tested for the cutoff distance. Pseudocount in the Bayesian statistics is determined usually as a function of the number of samples (sequences), although the ratio of pseudocount pc=0.5 was used for all proteins in the contact prediction (Morcos et al., 2011). Here, an appropriate value for the ratio of pseudocount for the certain cutoff distance, either about 8 Å or 15–16 Å, is chosen for each protein family in such a way that the sample average of the evolutionary statistical energies must be equal to the ensemble average, ψN¯=ψNσ; see Eqs. (42) and (46). As shown in Fig. S.1, the value of rcutoff, where ψN¯=ψNσ is satisfied, monotonously changes as a function of the ratio of pseudocount pc. The values of pc, where ψN¯=ψNσ is satisfied near the specified values of rcutoff, 8 Å and 15.5 Å,  are employed for rcutoff ≃ 8 Å and 15.5 Å, respectively. In the present multiple sequence alignment for the PDZ domain, with the ratios of pseudocount pc=0.205 and pc=0.33, the sample and ensemble averages agree with each other at the cutoff distances rcutoff ∼ 8 Å and rcutoff ∼ 15.5 Å, respectively; see Fig. S.1. In Fig. S.2, the reflective correlation and regression coefficients between the experimental ΔΔGND (Gianni et al., 2007) and ΔψN due to single amino acid substitutions are plotted against the cutoff distance for pairwise interactions in the PDZ domain. The reflective correlation coefficient has the maximum at the rcutoff ∼ 8 Å for pc=0.205 and at rcutoff ∼ 15.5 Å for