Elsevier

Journal of Theoretical Biology

Volume 433, 21 November 2017, Pages 21-38
Journal of Theoretical Biology

Selection originating from protein stability/foldability: Relationships between protein folding free energy, sequence ensemble, and fitness

Highlights

A Boltzmann distribution with protein fitness is derived.

Relationships between folding free energy, inverse statistical potential and fitness.

Selective temperature, glass transition temperature and folding free energy are estimated.

Relationship between selective temperature and substitution rate (Ka/Ks).

Protein stability/foldability is kept in a balance of positive selection and random drift.

Abstract

Assuming that mutation and fixation processes are reversible Markov processes, we prove that the equilibrium ensemble of sequences obeys a Boltzmann distribution with exp(4Nem(11/(2N))), where m is Malthusian fitness and Ne and N are effective and actual population sizes. On the other hand, the probability distribution of sequences with maximum entropy that satisfies a given amino acid composition at each site and a given pairwise amino acid frequency at each site pair is a Boltzmann distribution with exp(ψN), where the evolutionary statistical energy ψN is represented as the sum of one body (h) (compositional) and pairwise (J) (covariational) interactions over all sites and site pairs. A protein folding theory based on the random energy model (REM) indicates that the equilibrium ensemble of natural protein sequences is well represented by a canonical ensemble characterized by exp(ΔGND/kBTs) or by exp(GN/kBTs) if an amino acid composition is kept constant, where ΔGNDGNGD, GN and GD are the native and denatured free energies, and Ts is the effective temperature representing the strength of selection pressure. Thus, 4Nem(11/(2N)),ΔψND(ψN+ψD), and ΔGND/kBTs must be equivalent to each other. With h and J estimated by the DCA program, the changes (ΔψN) of ψN due to single nucleotide nonsynonymous substitutions are analyzed. The results indicate that the standard deviation of ΔGN(=kBTsΔψN) is approximately constant irrespective of protein families, and therefore can be used to estimate the relative value of Ts. Glass transition temperature Tg and ΔGND are estimated from estimated Ts and experimental melting temperature (Tm) for 14 protein domains. The estimates of ΔGND agree with their experimental values for 5 proteins, and those of Ts and Tg are all within a reasonable range. In addition, approximating the probability density function (PDF) of ΔψN by a log-normal distribution, PDFs of ΔψN and Ka/Ks, which is the ratio of nonsynonymous to synonymous substitution rate per site, in all and in fixed mutants are estimated. The equilibrium values of ψN, at which the average of Δψ in fixed mutants is equal to zero, well match ψN averaged over homologous sequences, confirming that the present methods for a fixation process of mutations and for the equilibrium ensemble of ψN give a consistent result with each other. The PDFs of Ka/Ks at equilibrium confirm that Ts negatively correlates with the amino acid substitution rate (the mean of Ka/Ks) of protein. Interestingly, stabilizing mutations are significantly fixed by positive selection, and balance with destabilizing mutations fixed by random drift, although most of them are removed from population. Supporting the nearly neutral theory, neutral selection is not significant even in fixed mutants.

Keywords

Folding free energy change
Inverse statistical potential
Boltzmann distribution
Selective temperature
Positive selection

1. Introduction

Natural proteins can fold their sequences into unique structures. Protein’s stability and foldability result from natural selection and are not typical characteristics of random polymers (Bryngelson and Wolynes, 1987; Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b). Natural selection maintains protein’s stability and foldability over evolutionary timescales. On the basis of the random energy model (REM) for protein folding, it was discussed (Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b) that the equilibrium ensemble of natural protein sequences in sequence space is well represented by a canonical ensemble characterized by a Boltzmann factor exp(ΔGND(σ)/kBTs), where ΔGND(σ)(GN(σ)GD(σ)) is the folding free energy of sequence σ, GN and GD are the free energies of the native and denatured states, kB is the Boltzmann constant, and Ts is the effective temperature representing the strength of selection pressure and must satisfy Ts < Tg < Tm for natural proteins to fold into unique native structures; Tg is glass transition temperature and Tm is melting temperature. The REM also indicates that the free energy of denatured conformations (GD) is a function of amino acid frequencies only and does not depend on amino acid order, and therefore the Boltzmann factor will be taken as exp(GN(σ)/kBTs), if amino acid frequencies are kept constant. It was shown by lattice Monte Carlo simulations (Shakhnovich, 1994) that lattice protein sequences selected with this Boltzmann factor were not trapped by competing structures but could fold into unique native structures. Selective temperatures were also estimated (Dokholyan and Shakhnovich, 2001) for actual proteins to yield good correlations of sequence entropy between actual protein families and sequences designed with this type of Boltzmann factor.

On the other hand, the maximum entropy principle insists that the probability distribution of sequences in sequence space, which satisfies constraints on amino acid compositions at all sites and on amino acid pairwise frequencies for all site pairs, is a Boltzmann distribution with the Boltzmann factor exp(ψN(σ)), where the total interaction ψN(σ) of a sequence σ is represented as the sum of one-body (h) (compositional) and pairwise (J) (covariational) interactions between residues in the sequence; ψN(σ) is called the evolutionary statistical energy by Hopf et al. (2017). The inverse statistical potentials, the one-body (h) and pairwise (J) interactions, that satisfy those constraints for homologous sequences have been estimated (Ekeberg et al., 2014; 2013; Marks et al., 2011; Morcos et al., 2011) as one of inverse Potts problems and successfully employed to predict contacting residue pairs in protein structures (Ekeberg et al., 2014; 2013; Hopf et al., 2012; Marks et al., 2011; Miyazawa, 2013; Morcos et al., 2011; Sułkowska et al., 2012). Morcos et al. (2014) noticed that the ψN in the Boltzmann factor is the dimensionless energy corresponding to GN/kBTs, and estimated selective temperatures (Ts) for several protein families by comparing the difference (ΔψND) of ψ between the native and the molten globule states with folding free energies (ΔGND) estimated with associative-memory, water-mediated, structure, and energy model (AWSEM) (Davtyan et al., 2012).

A purpose of the present study is to establish relationships between protein foldability/stability, sequence distribution, and protein fitness. First, we prove that if mutation and fixation processes in protein evolution are reversible Markov processes, the equilibrium ensemble of genes will obey a Boltzmann distribution with the Boltzmann factor exp(4Nem(11/(2N))), where Ne and N are effective and actual population sizes, and m is the Malthusian fitness of a gene. In other words, correspondences between ΔGND/kBTs,ΔψND(ψNψD) and 4Nem(11/(2N)) are obtained by equating these three Boltzmann distributions with each other; ψDGD/kBTs+constant.

The second purpose is to analyze the effects (ΔψN) of single amino acid substitutions on the evolutionary statistical energy of a protein, and to estimate from the distribution of ΔψN the effective temperature of natural selection (Ts) and then glass transition temperature (Tg) and folding free energy (ΔGND) of protein. We estimate the one-body (h) and pairwise (J) interactions with the DCA program, which is available at “http://dca.rice.edu/portal/dca/home”, and then analyze the changes (ΔψN) of the evolutionary statistical energy (ψN) of a natural sequence due to single amino acid substitutions caused by single nucleotide changes. The data of ΔψN due to single nucleotide nonsynonymous substitutions for 14 protein domains show that the standard deviation of ΔψN over all the substitutions at all sites hardly depends on the evolutionary statistical energy (ψN) of each homologous sequence and is nearly constant for each protein family, indicating that the standard deviation of ΔGN ≃ kBTsΔψN is nearly constant irrespective of protein families. From this finding, Ts for each protein family has been estimated in relative to Ts for the PDZ family, which is determined by directly comparing ΔΔψND(Δ(ψNψD)ΔψN) with the experimental values of folding free energy changes, ΔΔGND, due to single amino acid substitutions. Also Tg and ΔGND for each protein family are estimated on the basis of the REM from the estimated Ts and an experimental melting temperature Tm. The estimates of Ts and Tg are all within a reasonable range, and those of ΔGND are well compared with experimental ΔGND for 5 protein families. The present method for estimating Ts is simpler than the method (Morcos et al., 2014) using AWSEM, and also is useful for the prediction of ΔGND, because the experimental data of ΔGND are limited in comparison with Tm, and also experimental conditions such as temperature and pH tend to be different among them. In addition, it has been revealed that ΔψN averaged over all single nucleotide nonsynonymous substitutions is a linear function of ψN/L of each homologous sequence, where L is sequence length; the average of ΔψN decreases as ψN/L increases. This characteristic is required for homologous proteins to stay at the equilibrium state of the native conformational energy GN ≃ kBTsψN, and indicates a weak dependency (Miyazawa, 2016; Serohijos et al., 2012) of ΔΔGND on ΔGND/L of protein across protein families.

The third purpose is to study an amino acid substitution process in protein evolution, which is characterized by the fitness, m=ΔψND/(4Ne(11/(2N))). We employ a monoclonal approximation for mutation and fixation processes of genes, in which protein evolution proceeds with single amino acid substitutions fixed at a time in a population. In this approximation, ψN of a protein gene attains the equilibrium, ψN=ψNeq, when the average of ΔψN( ≃ ΔΔψND) over singe nucleotide nonsynonymous mutations fixed in a population is equal to zero. Approximating the distribution of ΔψN due to singe nucleotide nonsynonymous mutations by a log-normal distribution, their distribution for fixed mutants is numerically calculated and used to calculate the averages of various quantities and also the probability density functions (PDF) of Ka/Ks in all arising mutants and also in fixed mutants only; Ka/Ks is defined as the ratio of nonsynonymous to synonymous substitution rate per site. There is a good agreement between the time average (ψNeq) and ensemble average (⟨ψNσ), which is equal to the sample average, ψN¯, of ψN over homologous sequences, supporting the constancy of the standard deviation of ΔψN assumed in the monoclonal approximation.

We also study protein evolution at equilibrium, ψN=ψNeq. The common understanding of protein evolution has been that amino acid substitutions observed in homologous proteins are neutral (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) or slightly deleterious (Ohta, 1973; 1992), and random drift is a primary force to fix amino acid substitutions in population. The PDFs of Ka/Ks in all arising mutations and in their fixed mutations are examined to see how significant each of positive, neutral, slightly negative,and negative selections is. Interestingly, stabilizing mutations are significantly fixed in population by positive selection, and balance with destabilizing mutations that are also significantly fixed by random drift, although most negative mutations are removed from population. Contrary to the neutral theory (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) and supporting the nearly neutral theory (Ohta, 1973; 1992; 2002), the proportion of neutral selection is not large even in fixed mutants. It is also confirmed that the effective temperature (Ts) of selection negatively correlates with the amino acid substitution rate (Ka/Ks) of protein at equilibrium.

2. Methods

2.1. Knowledge of protein folding

A protein folding theory (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b), which is based on a random energy model (REM), indicates that the equilibrium ensemble of amino acid sequences, σ(σ1,,σL) where σi is the type of amino acid at site i and L is sequence length, can be well approximated by a canonical ensemble with a Boltzmann factor consisting of the folding free energy, ΔGND(σ, T) and an effective temperature Ts representing the strength of selection pressure.

(1)P(σ)Pmut(σ)exp(ΔGND(σ,T)kBTs)
(2)exp(GN(σ)kBTs)iff(σ)=constant
(3)ΔGND(σ,T)GN(σ)GD(f(σ),T)
where pmut(σ) is the probability of a sequence (σ) randomly occurring in a mutational process and depends only on the amino acid frequencies f(σ), kB is the Boltzmann constant, T is a growth temperature, and GN and GD are the free energies of the native conformation and denatured state, respectively. Selective temperature Ts quantifies how strong the folding constraints are in protein evolution, and is specific to the protein structure and function. The free energy GD of the denatured state does not depend on the amino acid order but the amino acid composition, f(σ), in a sequence (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b). It is reasonable to assume that mutations independently occur between sites, and therefore the equilibrium frequency of a sequence in the mutational process is equal to the product of the equilibrium frequencies over sites; Pmut(σ)=ipmut(σi), where pmut(σi) is the equilibrium frequency of σi at site i in the mutational process.

The distribution of conformational energies in the denatured state (molten globule state), which consists of conformations as compact as the native conformation, is approximated in the random energy model (REM), particularly the independent interaction model (IIM) (Pande et al., 1997), to be equal to the energy distribution of randomized sequences, which is then approximated by a Gaussian distribution, in the native conformation. That is, the partition function Z for the denatured state is written as follows with the energy density n(E) of conformations that is approximated by a product of a Gaussian probability density and the total number of conformations whose logarithm is proportional to the chain length.

(4)Z=exp(EkBT)n(E)dE
(5)n(E)exp(ωL)N(E¯(f(σ)),δE2(f(σ)))
where ω is the conformational entropy per residue in the compact denatured state, and N(E¯(f(σ)),δE2(f(σ))) is the Gaussian probability density with mean E¯ and variance δE2, which depend only on the amino acid composition of the protein sequence. The free energy of the denatured state is approximated as follows.
(6)GD(f(σ),T)E¯(f(σ))δE2(f(σ))2kBTkBTωL
(7)=E¯(f(σ))δE2(f(σ))ϑ(T/Tg)kBT
(8)ϑ(T/Tg){12(1+T2Tg2)forT>TgTTgforTTg
where E¯ and δE2 are estimated as the mean and variance of interaction energies of randomized sequences in the native conformation. Tg is the glass transition temperature of the protein at which entropy becomes zero (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b); GD/T|T=Tg=0. The conformational entropy per residue ω in the compact denatured state can be represented with Tg; ωL=δE2/(2(kBTg)2). Thus, unless Tg < Tm, a protein will be trapped at local minima on a rugged free energy landscape before it can fold into a unique native structure.

2.2. Probability distribution of homologous sequences with the same native fold in sequence space

The probability distribution P(σ) of homologous sequences with the same native fold, σ=(σ1,,σL) where σi ∈ {amino  acids, deletion}, in sequence space with maximum entropy, which satisfies a given amino acid frequency at each site and a given pairwise amino acid frequency at each site pair, is a Boltzmann distribution (Marks et al., 2011; Morcos et al., 2011).

(9)P(σ)exp(ψN(σ))
(10)ψN(σ)(iL(hi(σi)+j>iJij(σi,σj)))
where hi and Jij are one-body (compositional) and two-body (covariational) interactions and must satisfy the following constraints.
(11)σP(σ)δσiak=Pi(ak)
(12)σP(σ)δσiakδσjal=Pij(ak,al)
where δσiak is the Kronecker delta, Pi(ak) is the frequency of amino acid ak at site i, and Pij(ak, al) is the frequency of amino acid pair, ak at i and al at j; ak ∈ {amino acids, deletion}. The pairwise interaction matrix J satisfies Jij(ak,al)=Jji(al,ak) and Jii(ak,al)=0. Interactions hi and Jij can be well estimated from a multiple sequence alignment (MSA) in the mean field approximation (Marks et al., 2011; Morcos et al., 2011), or by maximizing a pseudo-likelihood (Ekeberg et al., 2014; 2013). Because ψN(σ) has been estimated under the constraints on amino acid compositions at all sites, only sequences with a given amino acid composition contribute significantly to the partition function, and other sequences may be ignored.

Hence, from Eqs. (2) and (9),

(13)ψN(σ)GN(σ)/(kBTs)+functionoff(σ)
(14)ψD(f(σ),T)GD(f(σ),T)/(kBTs)+functionoff(σ)
(15)ΔψND(σ,T)ΔGND(σ,T)/(kBTs)
(16)ΔψND(σ,T)ψN(σ)ψD(f(σ),T)
(17)ψD(f(σ),T)ψ¯(f(σ))δψ2(f(σ))ϑ(T/Tg)Ts/T
(18)ω=(Ts/Tg)2δψ2/(2L)
where the ψ¯ and δψ2 are estimated as the mean and variance of ψN over randomized sequences; E¯kBTsψ¯ and δE2 ≃ (kBTs)2δψ2.

2.3. The equilibrium distribution of sequences in a mutation-fixation process

Here we assume that the mutational process is a reversible Markov process. That is, the mutation rate per gene, Mμν, from sequence μ(μ1,,μL) to ν satisfies the detailed balance condition

(19)Pmut(μ)Mμν=Pmut(ν)Mνμ
where Pmut(ν) is the equilibrium frequency of sequence ν in a mutational process, Mμν. The mutation rate per population is equal to 2NMμν for a diploid population, where N is the population size. The substitution rate Rμν from μ to ν is equal to the product of the mutation rate and the fixation probability with which a single mutant gene becomes to fully occupy the population (Crow and Kimura, 1970).
(20)Rμν=2NMμνu(s(μν))
where u(s(μ → ν)) is the fixation probability of mutants from μ to ν the selective advantage of which is equal to s.

For genic selection (no dominance) or gametic selection in a Wright-Fisher population of diploid, the fixation probability, u, of a single mutant gene, the selective advantage of which is equal to s and the frequency of which in a population is equal to qm=1/(2N), was estimated (Crow and Kimura, 1970) as

(21)2Nu(s)=2N1e4Nesqm1e4Nes
(22)=u(s)u(0)withqm=12N
where Ne is effective population size. Eq. (21) will be also valid for haploid population if 2Ne and 2N are replaced by Ne and N, respectively. Also, for Moran population of haploid, 4Ne and 2N should be replaced by Ne and N, respectively. Fixation probabilities for various selection models, which are compiled from p. 192 and p. 424–427 of Crow and Kimura (1970) and from Moran (1958) and Ewens (1979), are listed in Table S.7. The selective advantage of a mutant sequence ν to a wildtype μ is equal to
(23)s(μν)=m(ν)m(μ)
where m(ν) is the Malthusian fitness of a mutant sequence, and m(μ) is for the wildtype.

This Markov process of substitutions in sequence is reversible, and the equilibrium frequency of sequence μ, Peq(μ), in the total process consisting of mutation and fixation processes is represented by

(24)Peq(μ)=Pmut(μ)exp(4Nem(μ)(1qm))νPmut(ν)exp(4Nem(ν)(1qm))
because both the mutation and fixation processes satisfy the detailed balance conditions, Eq. (19) and the following equation, respectively.
(25)exp(4Nem(μ)(1qm))u(s(μν))=exp(4Nem(μ)qm)exp(4Nem(ν)qm)exp(4Nem(μ))exp(4Nem(ν))
(26)=exp(4Nem(ν)(1qm))u(s(νμ))
As a result, the ensemble of homologous sequences in molecular evolution obeys a Boltzmann distribution.

2.4. Relationships between m(σ), ψN(σ), and ΔGND(σ) of protein sequence

From Eqs. (1), (9), and (24), we can get the following relationships among the Malthusian fitness m, the folding free energy change ΔGND and ΔψND of protein sequence.

(27)Peq(μ)=Pmut(μ)exp(4Nem(μ)(1qm))νPmut(ν)exp(4Nem(ν)(1qm)))
(28)=Pmut(μ¯)exp((ψN(μ)ψD(f(μ)¯,T)))νPmut(ν¯)exp((ψN(ν)ψD(f(ν)¯,T)))
(29)Pmut(μ)exp(ΔGND(μ,T)/(kBTs))νPmut(ν)exp(ΔGND(ν,T)/(kBTs))
where f(σ)¯σf(σ)P(σ) and logPmut(σ¯)σP(σ)log(iPmut(σi)). Then, the following relationships are derived for sequences for which f(μ)=f(μ)¯.
(30)4Nem(μ)(1qm)=ΔψND(μ,T)+constant
(31)ΔGND(μ,T)kBTs+constant
The selective advantage of ν to μ is represented as follows for f(μ)=f(ν)=f(σ)¯.
(32)4Nes(μν)(1qm)=(4Nem(ν)4Nem(μ))(1qm)
(33)=(ΔψND(ν,T)ΔψND(μ,T))=(ψN(ν)ψN(μ))
(34)(ΔGND(ν,T)ΔGND(μ,T))/(kBTs)=(GN(ν)GN(μ))/(kBTs)
It should be noted here that only sequences for which f(σ)=f(σ)¯ contribute significantly to the partition functions in Eq. (28), and other sequences may be ignored.

Eq. (33) indicates that evolutionary statistical energy ψ should be proportional to effective population size Ne, and therefore it is ideal to estimate one-body (h) and two-body (J) interactions from homologous sequences of species that do not significantly differ in effective population size. Also, Eq. (34) indicates that selective temperature Ts is inversely proportional to the effective population size Ne; Ts∝1/Ne, because free energy is a physical quantity and should not depend on effective population size.

2.5. The ensemble average of folding free energy, ΔGND(σ, T), over sequences

The ensemble average of ΔGND(σ, T) over sequences with Eq. (1) is

(35)ΔGND(σ,T)σ
(36)[σΔGND(σ,T)Pmut(σ)exp(ΔGND(σ,T)kBTs)]/[σPmut(σ)exp(ΔGND(σ,T)kBTs)]
(37)[σ|f(σ)=f(σN)¯GN(σ)exp(GN(σ)kBTs)]/[σ|f(σ)=f(σN)¯exp(GN(σ)kBTs)]GD(f(σN)¯,T)
(38)=GN(σ)σGD(f(σN)¯,T)
where σN denotes a natural sequence, and f(σN)¯ denotes the average of amino acid frequencies f(σN) over homologous sequences. In Eq. (37), the sum over all sequences is approximated by the sum over sequences the amino acid composition of which is the same as that over the natural sequences.

The ensemble averages of GN and ψN(σ) are estimated in the Gaussian approximation (Pande et al., 1997).

(39)GN(σ)σEexp(E/(kBTs))n(E)dEexp(E/(kBTs))n(E)dE
(40)=E¯(f(σN)¯)δE2(f(σN)¯)/(kBTs)
(41)ψN(σ)σ[σψND(σ)exp(ψN(σ))]/[σexp(ψN(σ))]
(42)ψ¯(f(σN)¯)δψ2(f(σN)¯)
The ensemble averages of ΔGND(σ, T) and ψN(σ) over sequences are observable as the sample averages of ΔGND(σN, T) and ψN(σN) over homologous sequences fixed in protein evolution, respectively.
(43)ΔGND(σN,T)¯/(kBTs)=ΔGND(σ,T)σ/(kBTs)
(44)δψ2(f(σN)¯)[ϑ(T/Tg)Ts/T1]
(45)ψN(σN)¯σNwσNψN(σN)σNwσN
(46)=ψN(σ)σ
where the overline denotes a sample average with a sample weight wσN for each homologous sequence, which is used to reduce phylogenetic biases in the set of homologous sequences.

The folding free energy becomes equal to zero at the melting temperature Tm; ΔGND(σN,Tm)σ=0. Thus, the following relationship must be satisfied (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b).

(47)ϑ(Tm/Tg)TsTm=Ts2Tm(1+Tm2Tg2)=1withTsTgTm

2.6. Probability distributions of selective advantage, fixation rate and Ka/Ks

Let us consider the probability distributions of characteristic quantities that describe the evolution of genes. First of all, the probability density function (PDF) of selective advantage s, p(s), of mutant genes can be calculated from the PDF of the change of ΔψND due to a mutation from μ to ν, ΔΔψND(ΔψND(ν,T)ΔψND(μ,T)). The PDF of 4Nes, p(4Nes)=p(s)/(4Ne), may be more useful than p(s).

(48)p(4Nes)=p(ΔΔψND)|dΔΔψNDd4Nes|=p(ΔΔψND)(1qm)
where ΔΔψND must be regarded as a function of 4Nes, that is, ΔΔψND=4Nes(1qm); see Eq. (33).

The PDF of fixation probability u can be represented by

(49)p(u)=p(4Nes)d4Nesdu=p(4Nes)(e4Nes1)2e4Nes(qm1)qm(e4Nes1)(e4Nesqm1)
where 4Nes must be regarded as a function of u.

The ratio of the substitution rate per nonsynonymous site (Ka) for nonsynonymous substitutions with selective advantage s to the substitution rate per synonymous site (Ks) for synonymous substitutions with s = 0 is

(50)KaKs=u(s)u(0)=u(s)qm
assuming that synonymous substitutions are completely neutral and mutation rates at both types of sites are the same. The PDF of Ka/Ks is
(51)p(Ka/Ks)=p(u)dud(Ka/Ks)=p(u)qm

2.7. Probability distributions of ΔΔψND, 4Nes, u, and Ka/Ks in fixed mutant genes

The PDF of ΔΔψND in fixed mutants is proportional to that multiplied by the fixation probability.

(52)p(ΔΔψND,fixed)=p(ΔΔψND)u(s(ΔΔψND))u(s(ΔΔψND))
(53)uu(s)p(ΔΔψND)dΔΔψND
Likewise, the PDF of selective advantage in fixed mutants is
(54)p(4Nesfixed)=p(4Nes)u(s)u(s)
and those of the u and Ka/Ks in fixed mutants are
(55)p(ufixed)=p(u)uu
(56)p((KaKs)fixed)=p(KaKs)uu=p(KaKs)KaKsKaKs
The average of Ka/Ks in fixed mutants is equal to the ratio of the second moment to the first moment of Ka/Ks in all arising mutants; Ka/Ksfixed=(Ka/Ks)2/Ka/Ks.

3. Materials

3.1. Sequence data

We study the single domains of 8 Pfam (Finn et al., 2016) families and both the single domains and multi-domains from 3 Pfam families. In Table 1, their Pfam ID for a multiple sequence alignment, and UniProt ID and PDB ID with the starting- and ending-residue positions of the domains are listed. The full alignments for their families at the Pfam are used to estimate one-body interactions h and pairwise interactions J with the DCA program from “http://dca.rice.edu/portal/dca/home ” (Marks et al., 2011; Morcos et al., 2011). To estimate the sample (ψN¯) and ensemble (⟨ψNσ) averages of the evolutionary statistical energy, M unique sequences with no deletions are used. In order to reduce phylogenetic biases in the set of homologous sequences, we employ a sample weight (wσN) for each sequence, which is equal to the inverse of the number of sequences that are less than 20% different from a given sequence in a given set of homologous sequences. Only representatives of unique sequences with no deletions, which are at least 20% different from each other, are used to calculate the changes of the evolutionary statistical energy (ΔψN) due to single nucleotide nonsynonymous substitutions; the number of the representatives is almost equal to the effective number of sequences (Meff) in Table 1.

Table 1. Protein families, and structures studied.

Pfam familyUniProt IDNaNeffb,cMdMeffc,eLfPDB ID
HTH_3RPC1_BP434/7-5915315(15917)11691.2162864893.73531R69-A:6-58
NitroreductaseQ97IT9_CLOAB/4-766008(6084)4912.961057854.71733E10-A/B:4-76g
SBP_bac_3hGLNH_ECOLI/27-2449874(9972)7374.9614099.702181WDN-A:5-222
SBP_bac_3GLNH_ECOLI/111-2049712(9898)7442.85829689.64941WDN-A:89-182
OmpAPAL_ECOLI/73-1676035(6070)4920.4422071761.24951OAP-A:52-146
DnaBDNAB_ECOLI/31-1281929(1957)1284.941187697.30981JWE-A:30-127
LysR_substratehBENM_ACIAD/90-28025138(25226)20707.0685(1)67.001912F6G-A/B:90-280g
LysR_substrateBENM_ACIAD/163-26525032(25164)21144.74121(1)99.271032F6G-A/B:163-265g
Methyltransf_5hRSMH_THEMA/8-2921942(1953)1286.67578(2)357.972851N2X-A:8-292
Methyltransf_5RSMH_THEMA/137-2161877(1911)1033.35975(2)465.53801N2X-A:137-216
SH3_1SRC_HUMAN:90-1379716(16621)3842.471191458.31481FMK-A:87-134
ACBPACBP_BOVIN/3-822130(2526)1039.0616170.72802ABD-A:2-81
PDZPTN13_MOUSE/1358-143813814(23726)4748.761255339.99811GM1-A:16-96
Copper-bindAZUR_PSEAE:24-1481136(1169)841.5667(1)45.231255AZU-B/C:4-128g
a

The number of unique sequences and the total number of sequences in parentheses; the full alignments in the Pfam (Finn et al., 2016) are used.

b

The effective number of sequences.

c

A sample weight (wσN) for a given sequence is equal to the inverse of the number of sequences that are less than 20% different from the given sequence.

d

The number of unique sequences that include no deletion unless specified. The number in parentheses indicates the maximum number of deletions allowed.

e

The effective number of unique sequences that include no deletion or at most the specified number of deletions.

f

The number of residues.

g

Contacts are calculated in the homodimeric state for these protein.

h

These proteins consist of two domains, and other ones are single domains.

4. Results

First, We describe how one-body and pairwise interactions, h and J, are estimated. Then, the changes of evolutionary statistical energy (ΔψN) due to single nucleotide nonsynonymous changes on natural sequences are analyzed with respect to dependences on the ψN of the wildtype sequences. The results indicate that the standard deviation of ΔGN ≃ kBTsΔψN is almost constant over protein families. Hence, the selective temperatures, Ts, of various protein families can be estimated in a relative scale from the standard deviation of ΔψN. The Ts of a reference protein is estimated by comparing the expected values of ΔΔGND with their experimental values. Folding free energies ΔGND are estimated from estimated Ts and experimental melting temperature Tm, and compared with their experimental values for 5 protein families. Glass transition temperatures Tg are also estimated from Ts and Tm.

Secondly, based on the distribution of ΔψN, protein evolution is studied. Evolutionary statistical energy (ψN) attains the equilibrium when the average of ΔψN over fixed mutations is equal to zero. The PDF of ΔψN is approximated by log-normal distributions. The basic relationships are that 1) the standard deviation of ΔψN is constant specific to a protein family, and 2) the mean of ΔψN linearly depends on ψN. The equilibrium value of ψN is shown to agree with the mean of ψN over homologous proteins in each protein family. In the present approximation, the standard deviation of ΔψN and selective temperature Ts at the equilibrium are simple functions of the equilibrium value of mean ΔψN, ΔψN¯eq. Lastly, the probability distribution of Ka/Ks, which is the ratio of nonsynonymous to synonymous substitution rate per site, is analyzed as a function of ΔψN¯eq, in order to examine how significant neutral selection is in the selection maintaining protein stability and foldability. Also, it is confirmed that selective temperature Ts negatively correlates with the mean of Ka/Ks, which represents the evolutionary rate of protein.

4.1. Important parameters in the estimations of one-body and pairwise interactions, h and J, and of the evolutionary statistical energy, ψN(σ)

The one-body (h) and pairwise (J) interactions for amino acid order in a protein sequence are estimated here by the DCA method (Marks et al., 2011; Morcos et al., 2011), although there are multiple methods for estimating them (Ekeberg et al., 2014; 2013). In the case of the DCA method, the ratio of pseudocount (0 ≤ pc ≤ 1) defined in Eqs. (S.70) and (S.71) is a parameter and controls the values of the ensemble and sample averages of ψN in sequence space, ⟨ψN(σ)⟩σ in Eq. (42) and ψN(σN)¯ in Eq. (45); a weight for observed counts is defined to be equal to (1pc). Sample average means the average over all homologous sequences with a weight for each sequence to reduce phylogenetic biases. An appropriate value must be chosen for the ratio of pseudocount in a reasonable manner.

Another problem is that the estimates of h and J(Marks et al., 2011; Morcos et al., 2011) may be noisy as a result of estimating many interaction parameters from a relatively small number of sequences. Therefore, only pairwise interactions within a certain distance are taken into account; the estimate of J is modified as follows, according to Morcos et al. (2014).

(57)J^ijq(ak,al)=Jijq(ak,al)H(rcutoffrij)
where Jq is the statistical estimate of J in the mean field approximation in which the amino acid aq is the reference state, H is the Heaviside step function, and rij is the distance between the centers of amino acid side chains at sites i and j in a protein structure, and rcutoff is a distance threshold for residue pairwise interactions. The one-body interactions hi(ak) are estimated in the isolated two-state model (Morcos et al., 2011) rather than the mean field approximation; see the Method section in the Supplement for details. The zero-sum gauge is employed to represent h and J; kh^is(ak)=klJ^ijs(ak,al)=0 in the zero-sum gauge.

Candidates for the cutoff distance may be about 8 Å for the first interaction shell and 15–16 Å for the second interaction shell between residues; distance between the centers of side chain atoms is employed for residue distance. Here both the distances are tested for the cutoff distance. Pseudocount in the Bayesian statistics is determined usually as a function of the number of samples (sequences), although the ratio of pseudocount pc=0.5 was used for all proteins in the contact prediction (Morcos et al., 2011). Here, an appropriate value for the ratio of pseudocount for the certain cutoff distance, either about 8 Å or 15–16 Å, is chosen for each protein family in such a way that the sample average of the evolutionary statistical energies must be equal to the ensemble average, ψN¯=ψNσ; see Eqs. (42) and (46). As shown in Fig. S.1, the value of rcutoff, where ψN¯=ψNσ is satisfied, monotonously changes as a function of the ratio of pseudocount pc. The values of pc, where ψN¯=ψNσ is satisfied near the specified values of rcutoff, 8 Å and 15.5 Å,  are employed for rcutoff ≃ 8 Å and 15.5 Å, respectively. In the present multiple sequence alignment for the PDZ domain, with the ratios of pseudocount pc=0.205 and pc=0.33, the sample and ensemble averages agree with each other at the cutoff distances rcutoff ∼ 8 Å and rcutoff ∼ 15.5 Å, respectively; see Fig. S.1. In Fig. S.2, the reflective correlation and regression coefficients between the experimental ΔΔGND (Gianni et al., 2007) and ΔψN due to single amino acid substitutions are plotted against the cutoff distance for pairwise interactions in the PDZ domain. The reflective correlation coefficient has the maximum at the rcutoff ∼ 8 Å for pc=0.205 and at rcutoff ∼ 15.5 Å for pc=0.33, indicating that these cutoff distances are appropriate for these ratios of pseudocount. The ratio of pseudocount and a cutoff distance employed are listed for each protein family in Table 2 and S.5 for rcutoff ∼ 8 and 15.5 Å, respectively. The ratios of pseudocount employed here are all smaller than 0.5, which was reported to be appropriate for contact prediction; by using strong regularization, contact prediction is improved but the generative power of the inferred model is degraded (Barton et al., 2016). In the text, only results with rcutoff ∼ 8 Å are shown. In a supplement, results with rcutoff ∼ 15.5 Å are provided and discussed in comparison with the results of rcutoff ∼ 8 Å.

Table 2. Parameter values for rcutoff ∼ 8 Å employed for each protein family, and the averages of the evolutionary statistical energies (ψN¯) over all homologous sequences and of the means and the standard deviations of interaction changes (ΔψN¯¯ and Sd(ΔψN)¯) due to single nucleotide nonsynonymous mutations at all sites over all homologous sequences in each protein family.

Pfam familyLpcncarcutoffψ¯/Lbδψ2/LbψN¯/LbΔψN¯¯cSd(ΔψN)¯±crψNαψNrψNαψN
(Å)Sd(Sd(ΔψN))for ΔψN¯dfor Sd(ΔψN)e
HTH_3530.187.438.220.19972.79262.98614.25725.3503 ± 0.56270.9611.51050.5980.9888
Nitroreductase730.236.388.250.11842.15972.27883.31153.6278 ± 0.28040.9391.33710.4260.3721
SBP_bac_32180.259.238.100.10002.16242.26183.29553.4496 ± 0.27420.9801.52860.8410.7876
SBP_bac_3940.378.007.900.16341.24951.40541.92912.3436 ± 0.19010.9591.39380.6340.4815
OmpA950.1698.008.200.24573.90934.15426.57577.6916 ± 0.30780.9571.56940.4100.3804
DnaB980.2359.658.170.22843.99764.22916.35026.1244 ± 0.32450.9651.45090.4950.4198
LysR_substrate1910.2358.597.980.22411.48881.71732.27842.6519 ± 0.14450.9641.33470.5410.5664
LysR_substrate1030.2658.848.250.22441.41441.63792.21102.7371 ± 0.20550.9821.41590.7270.5307
Methyltransf_52850.137.997.780.14627.24357.388712.468910.9352 ± 0.30300.9811.91400.1220.0783
Methyltransf_5800.186.787.850.17635.51625.68968.98497.6133 ± 0.43820.9441.48240.1250.1141
SH3_1480.146.428.010.13483.91094.04345.57926.1426 ± 0.29350.9191.40610.1960.1718
ACBP800.229.178.240.05254.64114.70847.76127.1383 ± 0.29700.9721.58840.3350.2235
PDZ810.2059.068.160.23983.11403.35724.75894.6605 ± 0.22550.9541.52820.3690.3042
Copper-bind1250.239.508.270.09404.24504.32727.26506.9283 ± 0.23160.9801.89150.2820.2352
a

The average number of contact residues per site within the cutoff distance; the center of side chain is used to represent a residue.

b

M unique sequences with no deletions are used with a sample weight (wσN) for each sequence; wσN is equal to the inverse of the number of sequences that are less than 20% different from a given sequence. The M and the effective number Meff of the sequences are listed for each protein family in Table 1.

c

The averages of ΔψN¯ and Sd(ΔψN), which are the mean and the standard deviation of ΔψN for a sequence, and the standard deviation of Sd(ΔψN) over homologous sequences. Representatives of unique sequences with no deletions, which are at least 20% different from each other, are used; the number of the representatives used is almost equal to Meff.

d

The correlation and regression coefficients of ΔψN¯ on ψN/L; see Eq. (62).

e

The correlation and regression coefficients of Sd(ΔψN) on ψN/L.

4.2. Changes of the evolutionary statistical energy, ΔψN, by single nucleotide nonsynonymous substitutions

The changes of the evolutionary statistical energy, ΔψN and ΔψD, due to a single amino acid substitution from σiN to σi at site i in a natural sequence σN are defined as

(58)ΔψN(σjiN,σiNσi)ψN(σjiN,σi)ψN(σN)
(59)ΔψD(σjiN,σiNσi,T)ψD(σjiN,σi,T)ψD(σN,T)
(60)ΔΔψND(σjiN,σiNσi)ΔψN(σjiN,σiNσi)ΔψD(σjiN,σiNσi)
(61)ΔψN(σjiN,σiNσi)becausef(σN)f(σjiN,σi)
Here, single amino acid substitutions caused by single nucleotide nonsynonymous mutations are taken into account, unless specified. Let us use a single overline to denote the average of the changes of interaction over all types of single nucleotide nonsynonymous mutations at all sites in a specific native sequence, and a double overline to denote their averages over all homologous sequences in a protein family.

We calculated the ψN of the wildtype and ΔψN due to all types of single nucleotide nonsynonymous substitutions for all homologous sequences, and their means and variances. We have examined the dependence of ΔΔψND¯ΔψN¯ on the ψN of each homologous sequence in each protein family. Fig. 1 for the PDZ family and Figs. S.3 to S.13 for all proteins show that ΔψN¯ is negatively proportional to the ψN/L of the wildtype, that is,

(62)ΔΔψND(σjiN,σiNσi)¯ΔψN(σjiN,σiNσi)¯αψNψN(σN)ψN(σN)¯L+ΔψN(σjiN,σiNσi)¯¯withαψN<0
where L is sequence length. This relationship is found in all of the protein families examined here; the correlation and regression coefficients for rcutoff ∼ 8 and 15.5 Å are listed in Table 2 and S.5, respectively. Most of the correlation coefficients are larger than 0.95, and all are greater than 0.9. It is reasonable that the change of the evolutionary statistical energy (ΔψN) depends on interaction per residue (ψN/L) rather than the evolutionary statistical energy (ψN), because interactions change only for one residue substituted in the sequence. Note that the average interactions including a single residue will be equal to 2ψN/L if all interactions are two-body. The important fact is that the linear dependence of ΔψN on ψN/L shown in Fig. 1 and Table 2 and S.5 is equivalent to the linear dependence of free energy changes caused by single amino acid substitutions on the native conformational energy of the wildtype protein, because the selective temperatures TS of homologous sequences in a protein family are approximated to be equal.

Fig. 1

Fig. 1. Correlation between ΔψN due to single nucleotide nonsynonymous substitutions and ψN of homologous sequences in the PDZ domain family. This figure corresponds to the cutoff distance rcutoff ∼ 8 Å; see Fig. S.12 for rcutoff ∼ 15.5 Å. Each of the black plus or red cross marks corresponds to the mean or the standard deviation of ΔψN due to all types of single nucleotide nonsynonymous substitutions over all sites in each of the homologous sequences of the PDZ domain family. Only 335 representatives of unique sequences with no deletions, which are at least 20% different from each other, are employed; the number of the representatives is almost equal to Meff in Table 1. The solid lines show the regression lines for the mean and the standard deviation of ΔψN.

Is the same type of dependence on ψN/L found for the standard deviation of ΔψN over single nucleotide nonsynonymous substitutions at all sites? Fig. 1, Figs. S.3 to S.13 and Table 2 and S.5 show that the correlation between the standard deviation of ΔψN and ψN of the wildtype is very weak except for Nitroreductase, SBP_bac_3 and LysR_substrate families. Even for these protein families, the standard deviations of Sd(ΔψN) are less than 10% of the mean, Sd(ΔψN)¯; see Table 2 and S.5. Thus, it is indicated that in general the variance/standard deviation of ΔψN due to single amino acid substitutions is almost constant irrespectively of the ψN across homologous sequences. The standard deviations of Sd(ΔψN) is relatively large for the HTH_3, because in Fig. S.3 there is a minor sequence group that has a distinguishable value of Sd(ΔψN) from the major sequence group.

4.3. Effective temperature Ts of selection estimated from the changes of interaction, ΔψN, by single nucleotide nonsynonymous substitutions

In the previous section, it has been shown that the standard deviation of ΔψN hardly depends on ψN of the wildtype and is nearly constant across homologous sequences in every protein family that has its own characteristic temperature (Ts) for selection pressure, indicating that Sd(ΔψN) must be approximated by a function of only kBTs. On the other hand, the free energy of the native structure, ΔGN, must not explicitly depend on kBTs, although it may be approximated by a function of GN. In other words, the following relationships are derived.

(63)Sd(ΔψN(σjiN,σiNσi))independentofψNandconstantacrosshomologoussequencesineveryproteinfamily=functionofkBTs
(64)Sd(ΔGN(σjiN,σiNσi))=functionthatmustnotexplicitlydependonkBTsbutGN
From the equations above, we obtain the important relation that the standard deviation of ΔGN(=kBTsΔψN) does not depend on GN and is nearly constant irrespective of protein families.
(65)Sd(ΔGN(σjiN,σiNσi))kBTsSd(ΔψN(σjiN,σiNσi))constant
This relationship is consistent with the observation that the standard deviation of ΔΔGND( ≃ ΔGN) is nearly constant irrespectively of protein families (Tokuriki et al., 2007). This relationship allows us to estimate a selective temperature (Ts) for a protein family in a scale relative to that of a reference protein from the ratio of the standard deviation of ΔψN. The PDZ family is employed here as a reference protein, and its Ts is estimated by a direct comparison of ΔψN and experimental ΔΔGND; the amino acid pair types and site locations of single amino acid substitutions are the most various, and also the correlation between the experimental ΔΔGND and ΔψN is the best for the PDZ family in the present set of protein families, SH3_1 (Grantcharova et al., 1998), ACBP (Kragelund et al., 1999), PDZ (Gianni et al., 2005; 2007), and Copper-bind (Wilson and Wittung-Stafshede, 2005); see Table 3 and S.6.
(66)kBT^s=kBT^s,PDZ[Sd(ΔψPDZ(σjiN,σiNσi))¯/Sd(ΔψN(σjiN,σiNσi))¯]
where the overline denotes the average over all homologous sequences. Here, the averages of standard deviations over all homologous sequences are employed, because Ts for all homologous sequences are approximated to be equal. It will be confirmed in the later section, “the equilibrium value of ψN in protein evolution”, that the assumption of the constant value specific to each protein family for Sd(ΔψN) is appropriate.

Table 3. Thermodynamic quantities estimated with rcutoff ∼ 8 Å.

Experimental
Pfam familyrakBT^saT^sTmT^gω^bTcΔGNDd
(kcal/mol)(°K)(°K)(°K)(kB)(°K)(kcal/mol)
HTH_3122.6343.7160.10.81822982.95
Nitroreductase180.7337204.00.84772982.81
SBP_bac_3190.1336.1211.00.87712988.03
SBP_bac_3279.8336.1283.80.6072298.85
OmpA85.2320125.40.90272983.13
DnaB107.1312.8142.11.13412982.56
LysR_substrate247.3338256.70.69082983.63
LysR_substrate239.6338250.40.64722982.00
Methyltransf_560.0375110.51.065629841.36
Methyltransf_586.1375135.11.121429811.48
SH3_10.8650.1583106.7344147.41.02532953.76
ACBP0.8250.116991.9324.4131.71.12812786.72
PDZ0.9310.2794140.7312.88168.51.08542981.81
Copper-bind0.8280.178194.6359.3139.90.970929812.07
a

Reflective correlation (r) and regression (kBT^s) coefficients for least-squares regression lines of experimental ΔΔGND on ΔψN through the origin.

b

Conformational entropy per residue, in kB units, in the denatured molten-globule state; see Eq. (18).

c

Temperatures are set up for comparison to be equal to the experimental temperatures for ΔGND or to 298°K if unavailable; see Table S.4 for the experimental data.

d

Folding free energy in kcal/mol units; see Eq. (44).

4.4. A direct comparison of the changes of interaction, ΔψN( ≃ ΔΔψND), with the experimental ΔΔGND due to single amino acid substitutions

In order to determine the Ts for a reference protein, the experimental values (Gianni et al., 2007) of ΔΔGND due to single amino acid substitutions in the PDZ domain are plotted against the changes of interaction, ΔψN, for the same types of substitutions in Figs. 2 and S.14. The slope of the least-squares regression line through the origin, which is an estimate of kBTs, is equal to kBT^s=0.279 kcal/mol, and the reflective correlation coefficient is equal to 0.93. This estimate of kBTs for the PDZ yield Sd(ΔΔGND)¯kBT^sSd(ΔψN)¯=1.30 kcal/mol, which corresponds to 76% of 1.7 kcal/mol (Serohijos et al., 2012) estimated from ProTherm database or 80% of 1.63 kcal/mol (Tokuriki et al., 2007) computationally predicted for single nucleotide mutations by using the FoldX. Using Sd(ΔΔGND)¯=1.30 estimated from the Ts for PDZ, the absolute values of Ts for other proteins are calculated by Eq. (66) and listed in Table 3; see Table S.6 for rcutoff ∼ 15.5 Å. The Ts estimated with rcutoff ∼ 8 and 15.5 Å are compared with each other in Fig. S.15. Morcos et al. (2014) estimated Ts by comparing ΔψND with ΔGND estimated by the associative-memory, water-mediated, structure, and energy model (AWSEM). They estimated ψN with rcutoff=16 Å and probably pc=0.5. In Fig. S.16, the present estimates of Ts are compared with those by Morcos et al. (2014). The Morcos’s estimates of Ts with some exceptions tend to be located between the present estimates with rcutoff ∼ 8 Å and 15.5 Å  which correspond to upper and lower limits for Ts as discussed in the Discussion and the supplement.

Fig. 2

Fig. 2. Regression of the experimental values(Gianni et al., 2007) of folding free energy changes (ΔΔGND) due to single amino acid substitutions on ΔψN( ≃ ΔΔψND) for the same types of substitutions in the PDZ domain. This figure corresponds to the cutoff distance rcutoff ∼ 8 Å; see Fig. S.14 for rcutoff ∼ 15.5 Å. The solid line shows the least-squares regression line through the origin with the slope, 0.279 kcal/mol, which is the estimates of kBTs. The reflective correlation coefficient is equal to 0.93. The free energies are in kcal/mol units.

4.5. Relationship among ΔψN¯¯ of protein families; weak dependency of ΔΔGND on ΔGND/L

The weak dependence of ΔΔGND on ΔGND was found (Miyazawa, 2016; Serohijos et al., 2012) from the analysis of stability changes due to single amino acid substitutions in proteins, which are collected in the ProTherm database (Kumar et al., 2006). To understand this weak dependence, let us consider the average of ΔψN¯ over homologous sequences in each protein family. The following regression line with αψN¯=1.74 is shown in Fig. 3.

(67)ΔψN(σjiN,σiNσi)¯¯αψN¯ψN(σN)¯ψ¯(f(σN)¯)L+βψN¯
(68)=αψN¯δψ2(f(σN)¯)L+βψN¯
(69)αψN¯<0,βψN¯0
Here, ψN(σN)¯ is reduced by ψ¯ because the origin of the ψN scale is not unique. The correlation between ΔψN¯¯ and δψ2/L is significant; the correlation coefficient is larger than 0.99. The intercept βψN¯ should be equal to 0, because if Ts → ∞ then δψ2 → 0 and ΔψN → 0. Actually, Fig. 3 shows that βψN¯ is nearly equal to 0.

Fig. 3

Fig. 3. Dependence of the average of ΔψN¯ due to single nucleotide nonsynonymous substitutions over homologous sequences on δψ2/L across protein families. Plus marks indicate the values for each protein family in the case of rcutoff ∼ 8 Å. The correlation coefficient is equal to 0.995, and the regression line is ΔψN(σjiN,σiNσi)¯¯=1.74(δψ2/L)0.445. See Fig. S.18 for rcutoff ∼ 15.5 Å.

Finally, the regression of ΔΔGND on ΔGND would be derived if Tg, Ts, and Twere constant.

(70)ΔΔGND(σjiN,σiNσi)¯¯αψN¯kBTsδψ2(f(σN)¯)L+kBTsβψN¯
(71)=αΔGNDkBTsδψ2(f(σN))¯L(ϑ(T/Tg)TsT1)+βΔGND
(72)=αΔGNDΔGND(σN,T)L+βΔGND
In general, Ts and Tg are different among protein families, so that the correlation between ΔΔGND¯¯ and ⟨ΔGND⟩/L cannot be strong. In Fig. 4, ΔΔGND¯¯ for the present proteins are plotted against ⟨ΔGND⟩/L. However, it should be noted that the correlation is not expected for ΔΔGND¯¯ and ⟨ΔGND⟩ but for ΔΔGND¯¯ and ⟨ΔGND⟩/L .

Fig. 4

Fig. 4. The sample average of folding free energy change, ΔΔGND¯¯kBTsΔΔψND¯¯, is plotted against the ensemble average of folding free energy per residue, ⟨ΔGNDσ/L ≃ kBTsΔψNDσ/L, for each protein family. The correlation coefficient is r=0.75, and the regression line is ΔΔGND(σjiN,σiNσi)¯¯=2.74ΔGNDσ/L+1.09. See Fig. S.19 for rcutoff ∼ 15.5 Å. The free energies are in kcal/mol units.

4.6. Estimation of Tg, ω, and ⟨ΔGND(σ)⟩σ from Ts and Tm

To estimate glass transition temperature Tg, the conformational entropy per residue ω in the compact denatured state, and the ensemble average of folding free energy in sequence space ⟨ΔGNDσ, melting temperature Tm must be known for each protein; see Eqs. (47), (18), and (44) for Tg, ω and ⟨ΔGNDσ, respectively. The experimental value of Tm (Armengaud et al., 2004; D’Auria et al., 2005; Ganguly et al., 2009; Guelorget et al., 2010; Knapp et al., 1998; Onwukwe et al., 2014; Parsons et al., 2006; Rosa et al., 1995; Sainsbury et al., 2008; Stupák et al., 2006; Torchio et al., 2012; Williams et al., 2002) employed for each protein is listed in Tables 3 and S.6. For comparison, temperature T is set up to be equal to the experimental temperature for ΔGND or to 298°K if unknown.

An estimate of glass transition temperature, T^g, has been calculated with T^s and Tm by Eq. (47), and is listed in Tables 3 and S.6 for each protein. In Fig. 5, T^s/T^g is plotted against Tm/T^g for each protein family. Unless Tg < Tm, a protein will be trapped at local minima on a rugged free energy landscape before it folds into a unique native structure. Protein foldability increases as Tm/Tg increases. A condition, ΔGND=0 at T=Tm, for the first order transition requires that Eq. (47), which is indicated by a dotted curve in Fig. 5, must be satisfied. As a result, Ts/Tg must be lowered to increase Tm/Tg; in other words, proteins must be selected at lower Ts. The present estimates of Ts and Tg would be within a reasonable range (Morcos et al., 2014; Onuchic et al., 1995; Pande et al., 2000) of values required for protein foldability.

Fig. 5

Fig. 5. T^s/T^g is plotted against Tm/T^g for each protein domain. A dotted curve corresponds to Eq. (47), T^s/T^g=2(Tm/T^g)/((Tm/T^g)2+1). Plus marks indicate the values estimated with rcutoff ∼ 8 Å. See Fig. S.20 for rcutoff ∼ 15.5 Å. The effective temperature Ts for selection and glass transition temperature Tg must satisfy Ts < Tg < Tm for proteins to be able to fold into unique native structures.

In Tables 3 and S.6, the ensemble average of ΔGND(σ) over sequences calculated by Eq. 44, and the conformational entropy per residue ω in the compact denatured state by Eq. (18) are also listed for each protein. Fig. 6 shows the comparison of their ensemble averages, ⟨ΔGND(σ)⟩σ, and the experimental values of ΔGND(σN) (Gianni et al., 2005; 2007; Grantcharova et al., 1998; Kragelund et al., 1999; Ruiz-Sanz et al., 1999; Wilson and Wittung-Stafshede, 2005) listed in Table S.4. The correlation in the case of rcutoff ∼ 8 Å is quite good, indicating that the constancy approximation (Eq. (65)) for the variance of ΔGN is appropriate. The conformational entropy per residue in the compact denatured state, ω^ in Eq. (18), estimated from the condition for the first order transition falls into the range of 0.60–1.13kB for rcutoff ∼ 8 Å, which agrees well with the range estimated by Morcos et al. (2014).

Fig. 6

Fig. 6. Folding free energies, ⟨ΔGNDσ ≃ kBTsΔψNDσ, predicted by the present method are plotted against their experimental values, ΔGND(σN). Plus marks indicate the values estimated with rcutoff ∼ 8 Å. See Fig. S.21 for rcutoff ∼ 15.5 Å. The free energies are in kcal/mol units.

4.7. The equilibrium value of evolutionary statistical energy ψN in the mutation–fixation process of amino acid substitutions

Let us consider the fixation process of amino acid substitutions in a monoclonal approximation, in which protein evolution is assumed to proceed with single amino acid substitutions fixed at a time in a population. In this approximation, ΔψND and ψN are at equilibrium and the ensemble of protein sequences attains to the equilibrium state, when the average of ΔΔψND ≃ ΔψN over singe nucleotide nonsynonymous mutations fixed in a population is equal to zero; an amino acid composition is assumed to be constant in protein evolution.

(73)ΔΔψNDfixedΔψNfixed=0ΔψNDandψNareatequilibrium.
The average of ΔψN over fixed mutations, ⟨ΔψNfixed, is calculated numerically with the probability density function (PDF) of ΔΔψND( ≃ ΔψN) for single nucleotide nonsynonymous mutations; see Eqs. (52) and (53). N=106 is employed.

The PDF of ΔΔGND were approximated with a normal distribution(Serohijos et al., 2012) or a bi-normal distribution (Tokuriki et al., 2007). Figs. 7, S.22, and S.23, however, show that a single normal distribution with the observed mean and standard deviation cannot well reproduce the observed distribution of ΔψN due to single nucleotide nonsynonymous mutations. For simplicity, a log-normal distribution, lnN(x;μ,σ), for which x, μ and σ defined as follows, is arbitrarily used here to better reproduce observed distributions of ΔψN, particularly in the domain of ΔψN<ΔψN¯, although other distributions such as inverse Γ distributions can equally well reproduce the observed ones, too.

(74)p(ΔψN)lnN(x;μ,σ)1xN(lnx;μ,σ)
(75)xmax(ΔψNΔψNo,0)
(76)exp(μ+σ2/2)=ΔψN¯ΔψNo
(77)exp(2μ+σ2)(exp(σ2)1)=(ΔψNΔψN¯)2¯)
(78)ΔψNomin(ΔψN¯nshift(ΔψNΔψN¯)2¯)1/2,0)
where ΔψNo is the origin for the log-normal distribution and the shifting factor nshift is taken to be equal to 2, unless specified. It is shown in Figs. 7, S.22, and S.23 that log-normal distributions can better reproduce the observed distribution of ΔψN due to single nucleotide nonsynonymous mutations except in the tails. Disagreements between the log-normal and observed distributions in the domain of ΔψN>ΔψN¯ do not much affect the PDF of ΔψN in fixed mutants, because fixation probabilities for ΔψN(>ΔψN¯) are too low.

Fig. 7

Fig. 7. The observed frequency distribution and the fitted distributions of ΔψN in the PDZ protein family. A black solid line indicates the observed frequency distribution of ΔψN per equal interval in homologous sequences of the PDZ protein family, and red dotted and blue dotted lines indicate the total frequencies of log-normal distributions with nshift=2 or 2.5 and parameters estimated with the mean and variance of the observed distribution for each protein; see Eqs. 74) to (78). A black dotted line indicates the total frequencies of normal distributions the mean and variance of which are equal to those of the observed distribution for each protein. Only representatives of unique sequences with no deletions, which are at least 20% different from each other, are employed; the total count is equal to 222,466 over 335 homologous sequences, which is almost equal to Meff in Table 1. See Fig. S.22 for rcutoff ∼ 15.5 Å.

The average of ΔψN over fixed mutants is uniquely determined by the distribution of ΔΔψN( ≃ ΔψN), which is approximated here by a log-normal distribution estimated from the mean and variance of ΔψN; it depends also on qm, which is assumed to be constant, through fixation probability, because 2NesΔψN/(1qm). In other words, ⟨ΔψNfixed is uniquely determined by the mean and variance of ΔψN. Therefore, under the equilibrium condition ΔψNfixed=0, only one of the mean and variance can be freely specified, and the other is uniquely determined. We employ ΔψN¯ or ψN as a parameter, because ΔψN¯ depends on ψN, and only one of them can be specified. We define Δψ¯Neq as Δψ¯N at which ΔψNfixed=0.

Suppose that the regression equation, Eq. (62), of ΔψN on ψN is exact, and the standard deviation of ΔψN is constant irrespective of ψN; the slope (αψN), ΔψN¯¯,Sd(ΔψN)¯, and ψN¯ that are estimated with rcutoff ∼ 8 Å for the PDZ and listed in Table 2 are employed here. In Fig. 8, the average of ΔψN over single nucleotide nonsynonymous substitutions fixed in a population, ⟨ΔψNfixed, is plotted against ψN/L of a wildtype for the PDZ protein family. This figure shows that ⟨ΔψNfixed changes its value from positive to negative as ψN increases, that is, the value of ψN at which ΔψNfixed=0,ψNeq, is the stable equilibrium value for ψN. In order for protein to have such a stable equilibrium value for folding free energy (ΔGND ≃ kBTsΔψND), the regression coefficient of ΔψN¯ on ψN must be more negative than that of the standard deviation, Sd(ΔψN), because otherwise stabilizing mutations increase as ψN decreases. This condition is, of course, satisfied for all protein families studied here, because the mean of ΔψN over all substitutions at all sites is negatively proportional to ψN of a wildtype, but its standard deviation is nearly constant irrespective of ψN across homologous sequences; see Tables 2 and S.5.

Fig. 8

Fig. 8. The average of ΔψN( ≃ ΔΔψND) over fixed single nucleotide nonsynonymous mutations versus ψN/L of a wildtype for the PDZ protein family. The averages of ΔψN( ≃ ΔΔψND) and 4Nes over the fixed mutants, and the average of Ka/Ks( ≡ u(s)/u(0)) over all the mutants are plotted against ψN/L of a wildtype by solid, broken, and dash-dot lines, respectively; qm=1/(2×106) is assumed. Dotted lines show the values of ⟨ΔψNfixed ± sd, where the sd is the standard deviation of ΔψN over fixed mutants. Fixation probability has been calculated with ΔΔψND ≃ ΔψN; see Eqs. (21) and (33). Here the empirical relationships of Eqs. (62) and (63) are assumed; that is, the mean of ΔψN linearly decreases as ψN increases, but the standard deviation of ΔψN is constant irrespective of ψN. The slope (αψN) and intercept (αψNψN¯/L+ΔψN¯¯) and the average of Sd(ΔψN) over homologous sequences that are estimated with rcutoff ∼ 8Å for the PDZ and listed in Table 2 are employed here. The distribution of ΔψN due to single nucleotide nonsynonymous mutations is approximated by a log-normal distribution with nshift=2.0; see Eqs. (74) to (78). The ψNeq, where ΔΔψNDfixedΔψNfixed=0, is the stable equilibrium value of ψN in the protein evolution of the PDZ protein family. The ψNeq is close to the average of ψN over homologous sequences (ψN¯), indicating that the present approximations for ψNeq and for ψN¯=ψNσ are consistent to each other.

The equilibrium value of ψN for each protein domain is calculated with the estimated values of αψN,ψN¯,ΔψN¯¯, and Sd(ΔψN)¯ listed in Tables 2 and S.5; it should be noticed here that Sd(ΔψN)¯ is assumed to be constant. In Figs. 9 and S.26, the equilibrium values of ψN/L estimated with nshift=1.5,2, and 2.5 in the monoclonal approximation are plotted against the average of ψN/L over homologous sequences for each protein family. The agreement between the time average (ψNeq) and ensemble average (ψNσ(=ψN¯)) is better for rcutoff ∼ 8 Å than for rcutoff ∼ 15.5 Å and is not bad in the case of rcutoff ∼ 8 Å, indicating that the present methods for the fixation process of amino acid substitutions and for the equilibrium ensemble of ψN give a consistent result with each other, and also that it is a good approximation to assume the standard deviation of ΔψN not to depend on ψN in each protein family.

Fig. 9

Fig. 9. The equilibrium value of ψN/L, where ΔψNfixed=0, is plotted against the average of ψN/L over homologous sequences for each protein family. The cutoff distance rcutoff=8Å is employed to estimate ψN of each protein family; see Fig. S.26 for rcutoff=15.5Å. The equilibrium values ψNeq, where ΔψNfixed=0, are calculated by using the linear dependency of ΔψN¯ on ψN (Eq. (62)) and estimated values with rcutoff ∼ 8 in Table 2. The standard deviation of ΔψN is approximated to be constant and equal to Sd(ΔψN)¯; see Eq. (63). Plus, upper triangle, and lower triangle marks indicate the cases of log-normal distributions with nshift=1.5,2.0, and 2.5 employed to approximate the distribution of ΔψN, respectively; see Eqs. (74) to (78).

4.8. Relationships between ΔψN¯(=Δψ¯Neq) and the standard deviation of ΔψN, T^s, and ΔΔG^ND at equilibrium

In the present model, the equilibrium values, ψNeq and the corresponding Δψ¯Neq, are functions of the mean and standard deviation of ΔψN only, because the distribution of ΔΔψND( ≃ ΔψN) is approximately estimated with the mean and standard deviation of ΔψN. On the other hand, ψNeq and Δψ¯Neq should be equal to ψN¯=ψN and ΔψN¯¯, respectively; the time average and ensemble average should be consistent. Actually ψNeq almost agrees with ψN¯ as shown in Fig. 9. Therefore the standard deviation of ΔψN is uniquely determined from its mean as long as ψN and ΔψN¯ are at equilibrium; conversely the equilibrium value of ΔψN¯ is determined by Sd(ΔψN). In Fig. 10, the standard deviation of ΔψN is plotted against ΔψN¯(=Δψ¯Neq). Likewise the estimate of effective temperature of selection, T^s(=(T^sSd¯(ΔψN))PDZ/Sd(ΔψN)), and that of folding free energy change, ΔΔG^ND(=kB(T^sSd¯(ΔψN))PDZ/Sd(ΔψN)·ΔψN¯), are plotted as a function of ΔψN¯(=Δψ¯Neq) in Fig. 11. These figures show that the averages, ΔψN¯¯ and Sd(ψN)¯, over homologous sequences scatter along the expected curves.

Fig. 10

Fig. 10. Relationship between the mean and the standard deviation of ΔψN due to single nucleotide nonsynonymous mutations at equilibrium, ΔψNfixed=0. The standard deviation of ΔψN that satisfies ΔψNfixed=0 is plotted against its mean, ΔψN¯. Broken, solid, and dotted lines indicate the cases of log-normal distributions with nshift=1.5,2.0 and 2.5 employed to approximate the distribution of ΔψN, respectively; see Eqs. (74) to (78). Plus marks indicate the averages, ΔψN¯¯ and Sd(ΔψN)¯, over homologous sequences in each protein family for rcutoff ∼ 8 Å, which are listed in Table 2. See Fig. S.27 for rcutoff ∼ 15.5 Å.

4.9. Protein evolution at equilibrium, ΔψNfixed=0

The common understanding of protein evolution has been that amino acid substitutions observed in homologous proteins are neutral (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) or slightly deleterious (Ohta, 1973; 1992), and random drift is a primary force to fix amino acid substitutions in population. In order to see how significant neutral/slightly deleterious substitutions are in protein evolution, the PDFs of Ka/Ks in all single nucleotide nonsynonymous mutations and in their fixed mutations are calculated; Ka/Ks is the ratio of nonsynonymous to synonymous substitution rate per site (Miyata and Yasunaga, 1980) and defined here as Ka/Ks ≡ u(s)/u(0), where u(s) is a fixation probability for selective advantage s; see Eq. (50).

First let us see the distributions of ΔψN at equilibrium, ΔψNfixed=0. Fig. 12 shows the PDFs of ΔψN in all single nucleotide nonsynonymous mutations and in their fixed mutations as a function of ΔψN¯(=Δψ¯Neq), respectively. Because 4Nes(1qm)=ΔΔψNDΔψN, the PDFs of ΔψN can be regarded as the PDFs of 4Nes(1qm). At equilibrium, the distribution of ΔψN in all single nucleotide nonsynonymous mutants becomes wider as the mean of ΔψN increases, however, that in fixed mutants remains to be narrow with a peak near zero.

Fig. 11

Fig. 11. Relationships between T^s and ΔψN¯ and between kBT^sΔψN¯(ΔΔGND¯) and ΔψN¯ at equilibrium, ΔψNfixed=0. The estimate T^s(=(T^sSd¯(ΔψN))PDZ/Sd(ΔψN)) of effective temperature for selection and the estimate of mean folding free energy change, kBT^sΔψN¯(ΔΔGND¯), are plotted against ΔψN¯ under the condition of ΔψNfixed=0. The Ts is estimated in relative to the Ts of the PDZ family in the approximation that the standard deviation of ΔGN due to single nucleotide nonsynonymous mutations is constant irrespective of protein families; see Eq. (65). Broken, solid, and dotted lines indicate the cases of log-normal distributions with nshift=1.5,2.0 and 2.5 employed to approximate the distribution of ΔψN, respectively; see Eqs. (74) to (78). Plus marks indicate those estimates against the average of ΔψN¯ over homologous sequences for each protein family with rcutoff ∼ 8Å, which are listed in Tables 2 and 3. See Fig. S.28 for rcutoff ∼ 15.5Å.

Fig. 12

Fig. 12. PDFs of ΔψN (left) and Ka/Ks (right) in all singe nucleotide nonsynonymous mutants (upper) and in their fixed mutants (lower) as a function of ΔψN¯ at equilibrium, ΔψNfixed=0. Fixation probability has been calculated with ΔΔψND ≃ ΔψN; see Eqs. (21) and (33). The distribution of ΔψN due to single nucleotide nonsynonymous mutations is approximated by a log-normal distribution with nshift=2.0; see Eqs. (74) to (78). The standard deviation of ΔψN is determined to satisfy ΔψNfixed=0 at ΔψN¯=Δψ¯Neq.

The PDFs of Ka/Ks in all single nucleotide nonsynonymous mutations and in their fixed mutations are shown in Fig. 12. The blue line on the landscape of the PDF shows the averages of Ka/Ks. The averages of Ka/Ks in all single nucleotide nonsynonymous mutations and in their fixed mutations are also shown in Fig. 13. The average of Ka/Ks in all the arising mutants is less than 1 and decreases as ΔψN¯Δψ¯Neq increases, indicating that negative mutants significantly occur and increase as ΔψN¯ increases. On the other hand, ⟨Ka/Ksfixed in fixed mutants is larger than 1 and increases as ΔψNeq increases, indicating that positive mutants fix significantly in population and increase as equilibrium folding free energy change increases, that is, equilibrium protein stability decreases. To see each contribution of positive, neutral, slightly negative and negative selections, the value of Ka/Ks is divided arbitrarily into four categories, Ka/Ks > 1.05, 1.05 > Ka/Ks > 0.95, 0.95 > Ka/Ks > 0.5, and 0.5 > Ka/Ks for their selection categories, respectively. The probabilities of each selection category in all single nucleotide nonsynonymous mutations and in their fixed mutations are shown in Fig. 14. The almost 50% of fixed mutations are stabilizing mutations fixed by positive selection (1.05 < Ka/Ks), and another 50% are destabilizing mutations fixed by random drift. They are balanced with each other, and the stability of protein is maintained. Contrary to the neutral theory (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974), the proportion of neutral selection is not large even in fixed mutations, and slightly negative mutations are significantly fixed. Neutral mutations fixed with 0.95 < Ka/Ks < 1.05 are only less than 10%, and slightly negative mutations fixed with 0.5 < Ka/Ks < 0.95 and negative mutations fixed with Ka/Ks < 0.5 are both from 10 to 30%. The nearly neutral theory (Ohta, 1973; 1992; 2002) insists that most fixed mutations satisfy |Nes| ≤ 2. This condition corresponds to 0.003Ka/Ks(=u(s)/u(0))8; see Eqs. (21) and (50). The PDF of Ka/Ks shown in Fig. 14 indicates that this condition is satisfied, supporting the nearly neutral theory.

Fig. 13

Fig. 13. The averages of Ka/Ks over all single nucleotide nonsynonymous mutations and over their fixed mutations as a function of ΔψN¯ at equilibrium, ΔψNfixed=0. Black and red lines indicate ⟨Ka/Ks⟩ and ⟨Ka/Ksfixed, respectively. Fixation probability has been calculated with ΔΔψND ≃ ΔψN; see Eqs. (21) and (33). Broken, solid, and dotted lines indicate the cases of log-normal distributions with nshift=1.5,2.0 and 2.5 employed to approximate the distribution of ΔψN, respectively; see Eqs. (74) to (78). The standard deviation of ΔψN is determined to satisfy ΔψNfixed=0 at ΔψN¯=Δψ¯Neq.

4.10. Relationship between Ts and Ka/Ks

The effective temperature (Ts) of protein for selection, which is defined in Eq. (1), represents the strength of selection originating from protein stability and foldability. Thus, it must be related with the evolutionary rate (amino acid substitution rate) of protein. As the effective temperature of selection (Ts) decreases, the mean change of evolutionary statistical energy (Δψ¯Neq) due to single amino acid substitutions increases; see Fig. 11. Therefore, destabilizing mutations increase, and an amino acid substitution rate is expected to decrease. Fig. 13 shows that the average of Ka/Ks decreases as Δψ¯Neq increases. The direct relationship between substitution rate and Ts(=(TsSd¯(ΔψN))PDZ/Sd(ΔψN)) is shown in Fig. 15; the average of Ka/Ks decreases as Ts increases. In the selection maintaining protein foldability/stability, the effective temperature of selection is directly reflected in the average amino acid substitution rate.

5. Discussion

A main purpose of the present study is to formulate protein fitness originating from protein foldability and stability. From a phenomenological viewpoint, Drummond and Wilke (2008) took notice of toxicity of misfolded proteins as well as diversion of protein synthesis resources, and formulated a Malthusian fitness of a genome to be negatively proportional to the total amount of misfolded proteins, which must be produced to obtain the necessary amount of folded proteins (Serohijos et al., 2012). They also formulated a Malthusian fitness based on protein dispensability to be negatively proportional to the ratio of unfolded proteins. These formulas of protein fitness can be well approximated by a generic form, m=κexp(ΔGND/(kBT)), where T is growth temperature, and κ( ≥ 0) is a parameter that depends on protein disability and cellular abundance of protein (Miyazawa, 2016).

In the comparison of this generic formula of protein fitness with the present one, it may be interpreted that 4Ne(1qm)κ/T1/Ts, if |ΔGND/(kBT)| ≪ 1, however, the growth temperature T and folding free energy do not always satisfy this condition. These two types of selection should be considered to be the different types of selection, although both are related with protein stability (ΔGND). Selective advantage of mutant is not upper-bounded in the present scheme of a Malthusian fitness but in the case of m=κexp(ΔGND/(kBT)). As a result, PDFs of Ka/Ks in all arising mutations and in fixed mutations have very different shapes between these two formulas of fitness (Miyazawa, 2016). Selection modeled here is one that yields the distribution of homologous sequences in protein evolution. In other word, the present formula for protein fitness models natural selection maintaining protein’s stability, foldability, and function over the evolutionary time scale, which is much longer than the time scale for the selection originating from toxicity of misfolded proteins.

The present formulas for protein fitness, Eqs. (31) and (30), have been derived on the basis of a protein folding theory, particularly the random energy model, and the maximum entropy principle for the distribution of homologous sequences with the same fold in sequence space, respectively. The former indicates that the equilibrium ensemble of sequences can be well approximated by a canonical ensemble with the Boltzmann factor exp(ΔGND/kBTs), and the latter insists that the probability distribution of homologous sequences, which satisfies a given amino acid composition at each site and a given pairwise amino acid frequency at each site pair, can be represented as a Boltzmann distribution with exp(ψN), in which the evolutionary statistical energy (ψN) is represented as the sum of one-body (compositional) and pairwise (covariational) interactions between sites. On the other hand, assuming mutation and fixation processes to be reversible Markov processes leads us to a formulation that the equilibrium ensemble of sequences also obeys a Boltzmann distribution with exp(4Nem(1qm)). As a result, we obtain the correspondences between folding free energy (ΔGND/kBTs), and ΔψND and protein fitness (4Nem(1qm)): the equality between the latter two variables (Eq. (33)), which indicates that ΔψN is proportional to fitness (s), and the approximate equality between the former two variables (Eq. (34)) since a canonical ensemble with ΔGND/(kBTs) is an approximate for the sequence ensemble under natural selection. A discrepancy between evolutionary statistical energies Jij and actual interaction energies was pointed out for non-contacting residue pairs in Monte Carlo simulations of lattice proteins (Jacquin et al., 2016). Also, the ratio of Jij(ak,al) to the corresponding actual contact energy was shown to differ among contact site pairs. On the other hand, Hopf et al. (2017) successfully predicted mutation effects with evolutionary statistical energy and showed that the change of evolutionary statistical energy (ΔψN) due to amino acid substitutions can capture experimental fitness landscapes and identify deleterious human variants.

Fig. 14

Fig. 14. The probabilities of each selection category in all single nucleotide nonsynonymous mutations and in their fixed mutations as a function of ΔψN¯ at equilibrium, ΔψNfixed=0. The left and right figures are for single nucleotide nonsynonymous mutations and for their fixed mutations, respectively. Red solid, red dotted, black broken, and black solid lines indicate positive, neutral, slightly negative and negative selection categories, respectively; the values of Ka/Ks are divided arbitrarily into four categories, Ka/Ks > 1.05, 1.05 > Ka/Ks > 0.95, 0.95 > Ka/Ks > 0.5, and 0.5 > Ka/Ks, which correspond to their selection categories, respectively. Fixation probability has been calculated with ΔΔψND ≃ ΔψN; see Eqs. (21) and (33). The distribution of ΔψN due to single nucleotide nonsynonymous mutations is approximated by a log-normal distribution with nshift=2.0; see Eqs. (74) to (78). The standard deviation of ΔψN is determined to satisfy ΔψNfixed=0 at ΔψN¯(=Δψ¯Neq).

Fig. 15

Fig. 15. The averages of Ka/Ks over all single nucleotide nonsynonymous mutations and over their fixed mutations as a function of the effective temperature of selection, Ts(=(TsSd¯(ΔψN))PDZ/Sd(ΔψN)), at equilibrium, ΔψNfixed=0. Black and red lines indicate ⟨Ka/Ks⟩ and ⟨Ka/Ksfixed, respectively. Fixation probability has been calculated with ΔΔψND ≃ ΔψN; see Eqs. (21) and (33). The distribution of ΔψN due to single nucleotide nonsynonymous mutations is approximated by a log-normal distribution with nshift=2.0; see Eqs. (74) to (78). The standard deviation of ΔψN is determined to satisfy ΔψNfixed=0 at ΔψN¯(=Δψ¯Neq). The Ts is estimated in the scale relative to the Ts of the PDZ family in the approximation that the standard deviation of ΔGN due to single nucleotide nonsynonymous mutations is constant irrespective of protein families; see Eq. (65). Broken, solid, and dotted lines indicate the cases of log-normal distributions with nshift=1.5,2.0 and 2.5 employed to approximate the distribution of ΔψN, respectively; see Eqs. (74) to (78). The curves for rcutoff ∼ 8 and 15.5 Å almost overlap with each other, because the estimates of (T^sSd¯(ΔψN))PDZ for the PDZ with rcutoff ∼ 8 and 15.5 Å are almost equal to each other.

In the analysis of the interaction changes (ΔψN) due to single nucleotide nonsynonymous substitutions, we have employed the cutoff distances for pairwise interactions, rcutoff ∼ 8 and 15.5 Å, which correspond to the first and second interaction shells between residues, respectively. Both the cutoff distances yield similar values for Ts/Ts,PDZ=Sd¯(ΔψN,PDZ)/Sd¯(ΔψN); see Fig. S.15. Thus, the differences in the estimation of Ts between these two cutoff distances principally originate in the estimation of Ts for the reference protein, PDZ. The absolute value of kBT^s,PDZ for the PDZ has been estimated to be equal to the slope of the reflective regression line of ΔΔGND on ΔψN. Therefore, as long as the correlation between ΔΔGND and ΔψN is good enough as shown in Figs. 2 and S.14, kB(T^sSd(ψN)¯)PDZ takes a similar value irrespective of rcutoff, and the estimate T^s,PDZ differs depending on Sd(ψN,PDZ)¯. Thus, ΔψN must correlate with experimental ΔΔGND, but on the basis of the correlation coefficient one cannot determine which estimation of ΔψN is better. Larger the standard deviation of ΔψN is, the smaller the estimate of Ts from a direct comparison between ΔΔGND and ΔψN is. Including the longer range of pairwise interactions tend to increase the variance of ΔψN. The range of interactions must be limited to a realistic value, either the first interaction shell or the second interaction shell. Thus, the estimates of Ts with rcutoff ∼ 8 Å and 15.5Å would be upper and lower limits, respectively. Unfortunately Ts is not directly observable. Comparison of the estimates of folding free energies with their experimental values may be appropriate to judge which value is more appropriate for the cutoff distance, although the number of experimental data is limited. Actual values of Ts may be closer to the estimates with rcutoff ∼ 8Å, because contact predictions based on the estimate of pairwise interactions J succeed for close contacts within the first interaction shell. Also, the estimation of ΔGND and the correlation between ψNeq and ψN¯ are slightly better with rcutoff ∼ 8 Å than 15.5 Å; see Figs. 6, 9, S.21, and S.26.

On the basis of the random energy model(REM) (Pande et al., 1997; Shakhnovich and Gutin, 1993a; 1993b), glass transition temperatures (Tg) and folding free energies (ΔGND) for 14 protein domains are estimated under the condition of ψN¯=ψNσ. The first order transition for protein folding is assumed to estimate the folding free energies by Eq. (44). Selective temperature, Ts, is estimated in the empirical approximation that the standard deviation of ΔψN is constant across homologous sequences with different ψN, so that their estimates may be more coarse-grained, however, this method is easier and faster than the method (Morcos et al., 2014) using the AWSEM (Davtyan et al., 2012). Experimental data for ΔGND are very limited, and also experimental conditions such as temperature and pH tend to be different among them. A prediction method for folding free energy would be useful in such a situation, although the present method requires the knowledge of melting temperature (Tm) besides sequence data, however, experimental data of Tm are more available than for ΔGND.

For proteins to have a stable equilibrium value of ψN in protein evolution, the regression coefficient of mean interaction change (ΔψN¯) on ψN must be more negative than that of their standard deviation (Sd(ΔψN)), otherwise stabilizing mutations increase as ψN decreases. Actually Tables 2 and S.5 show that their mean over all the substitutions at all sites is negatively proportional to ψN of a wildtype, but their standard deviation is nearly constant irrespective of ψN across homologous sequences. The equilibrium value ψNeq, where the average of ΔψN over fixed mutants is equal to zero, is calculated with the approximation of the distribution of ΔψN by a log-normal distribution and the empirical rules of Eqs. (62) and (63). In the monoclonal approximation, it has been confirmed that the time average (ψNeq) and ensemble average (ψN¯=ψNσ) of evolutionary statistical energy (ψN) almost agree with each other. Therefore, this result also supports these approximations and empirical rules, particularly Eq. (63), that is, the constancy of the standard deviation of ΔψN across homologous sequences. In the log-normal distribution approximation, Δψ¯Neq, Sd(ΔψN)eq, T^s, and ΔΔG^ND can be determined as a function of any one of them. Here they have been shown as a function of Δψ¯Neq.

We have also studied the evolution of protein at equilibrium, at which the ensemble of homologous sequences obeys a Boltzmann distribution with exp(ψN)(exp(ΔψND)), and the ensemble averages of evolutionary statistical energy (ψN ≃ GN/(kBTs)) and its change due to a mutation (ΔψN) ≃ ΔΔψND ≃ ΔΔGND/(kBTs)) agree with their steady values; ψNσ=ψN¯=ψNeq and ΔψN¯σ=ΔψN¯¯=Δψ¯Neq. The PDFs of ΔψN and Ka/Ks in all the mutants and in their fixed mutants have been estimated. It is confirmed that the effective temperature (Ts) of selection negatively correlates with the amino acid substitution rate (Ka/Ks) of protein.

New alleles can become fixed owing to random drift or to positive selection of substantially advantageous mutations (Gillespie, 1991; Kimura, 1983; Ohta, 2002). The present study indicates that the stability of protein is maintained in such a way that stabilizing mutations are significantly fixed by positive selection, and balance with destabilizing mutations fixed by random drift. As shown in Fig. 14, the almost 50% of fixed mutations are stabilizing mutations fixed by positive selection (1.05 < Ka/Ks), and another 50% are destabilizing mutations fixed by random drift. An interesting fact is that contrary to the neutral theory (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974), the proportion of neutral selection is not large even in fixed mutants. In the selection to maintain protein stability/foldability, neutral mutations fixed with 0.95 < Ka/Ks < 1.05 are only less than 10%, and slightly negative mutations fixed with 0.5 < Ka/Ks < 0.95 and negative mutations fixed with Ka/Ks < 0.5 are both from 10 to 30%. As a result, at equilibrium the average of Ka/Ks in all the mutants is less than 1, but that in their fixed mutants is larger than 1. The PDF of Ka/Ks shown in Fig. 14 supports the nearly neutral theory (Ohta, 1973; 1992; 2002), which insists that most fixed mutations satisfy |Nes| ≤ 2 corresponding to 0.003Ka/Ks(=u(s)/u(0))8. It should be noted that these conclusions based on the PDFs of ΔψN and Ks/Ks require only an equilibrium condition of ΔψN¯ = Δψ¯Neq, but does not require the approximation of constancy for the variance of ΔψN across homologous sequences, which is used only to estimate Ts and ψNeq and other relations based on Ts.

In the present study, we have analyzed the mutation-fixation process in equilibrium. The equilibrium state will vary if an environmental condition varies. The evolutionary statistical energy ψN and the inverse of selective temperature 1/Ts are linearly proportional to the effective population size Ne, as indicated by Eq. (33). Thus, the equilibrium values, ψNeq,ΔψN¯eq and Sd(ΔψN)eq, are all linearly proportional to the effective population size Ne. On the other hand, Sd(ΔψN)eq is not linearly proportional to ΔψN¯eq but downward-concave, as shown in Fig. 10. As a result, as Ne decreases, kBTsΔψN¯eqkBTsΔΔψND¯eq(ΔΔGND¯eq) decreases. In other words, the equilibrium value of the mean folding free energy change becomes less positive and therefore that of folding free energy ΔGND¯eqkBTsΔψND¯eq is expected to be higher (less stable) for a smaller number of effective population size Ne; see Eq. (72).

Supplementary document

File 1 — Supplementary methods, tables, and figures

A PDF file in which the details of the methods are described and additional tables and figures are provided; methods, tables, and figures provided in the text are also included as part of their full descriptions for reader’s convenience.

Funding This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Acknowledgement

I would like to thank a reviewer for his excellent comments and suggestions that have helped me improve the paper considerably.

Appendix A. Supplementary materials

Download Acrobat PDF file (2MB)Help with pdf files

Supplementary Data S1. Supplementary Raw Research Data. This is open data under the CC BY license http://creativecommons.org/licenses/by/4.0/

Research data for this article

Open Data

for download under the CC BY licence
 
Supplementary Data S1
(PDF, 2MB)

Supplementary Raw Research Data. This is open data under the CC BY license http://creativecommons.org/licenses/by/4.0/

Download fileDownload data
About research data

References

Armengaud, Urbonavicius, Fernandez, Chaussinand, Bujnicki, Grosjean, 2004
J. Armengaud, J. Urbonavicius, B. Fernandez, G. Chaussinand, J.M. Bujnicki, H. Grosjeann2-Methylation of guanosine at position 10 in tRNA is catalyzed by a THUMP domain-containing, s-adenosylmethionine-dependent methyltransferase, conserved in archaea and eukaryota
J. Biol. Chem., 279 (2004), pp. 37142-37152
Barton, Leonardis, Coucke, Cocco, 2016
J.P. Barton, E.D. Leonardis, A. Coucke, S. CoccoACE: adaptive cluster expansion for maximum entropy graphical model inference
Bioinformatics, 32 (2016), pp. 3089-3097
Bryngelson, Wolynes, 1987
J.D. Bryngelson, P.G. WolynesSpin glasses and the statistical mechanics of protein folding
Proc. Natl. Acad. Sci. USA, 84 (1987), pp. 7524-7528
Crow, Kimura, 1970
J.F. Crow, M. KimuraAn Introduction to Population Genetics Theory
Harper & Row Publishers, New York (1970)
D’Auria, Scirè, Varriale, Scognamiglio, Staiano, Ausili, Marabotti, MosèRossi, Tanfani, 2005
S. D’Auria, A. Scirè, A. Varriale, V. Scognamiglio, M. Staiano, A. Ausili, A. Marabotti, MosèRossi, F. TanfaniBinding of glutamine to glutamine-binding protein from escherichia coli induces changes in protein structure and increases protein stability
Proteins, 58 (2005), pp. 80-87
Davtyan, Schafer, Zheng, Clementi, Wolynes, Papoian, 2012
A. Davtyan, N.P. Schafer, W. Zheng, C. Clementi, P.G. Wolynes, G.A. PapoianAWSEM-MD: protein structure prediction using coarse-grained physical potentials and bioinformatically based local structure biasing
J. Phys. Chem. B, 116 (2012), pp. 8494-8503
Dokholyan, Shakhnovich, 2001
N.V. Dokholyan, E.I. ShakhnovichUnderstanding hierarchical protein evolution from first principles
J. Mol. Biol., 312 (2001), pp. 289-307
Drummond, Wilke, 2008
D.A. Drummond, C.O. WilkeMistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution
Cell, 134 (2) (2008), pp. 341-352, 10.1016/j.cell.2008.05.042
Ekeberg, Hartonen, Aurell, 2014
M. Ekeberg, T. Hartonen, E. AurellFast pseudolikelihood maximization for direct-coupling analysis of protein structure from many homologous amino-acid sequences
J. Comput. Phys., 276 (2014), pp. 341-356
Ekeberg, Lövkvist, Lan, Weigt, Aurell, 2013
M. Ekeberg, C. Lövkvist, Y. Lan, M. Weigt, E. AurellImproved contact prediction in proteins: using pseudolikelihoods to infer potts models
Phys. Rev. E, 87 (2013)

012707–1–16. doi:10.1103/PhysRevE.87.012707.

Ewens, 1979
W.J. EwensMathematical Population Genetics
Springer, New York (1979)
Finn, Coggill, Eberhardt, Eddy, Mistry, Mitchell, Potter, Punta, Qureshi, Sangrador-Vegas, Salazar, Tate, Bateman, 2016
R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, S.C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, G.A. Salazar, J. Tate, A. BatemanThe pfam protein families database: towards a more sustainable future
Nucl. Acid Res., 44 (2016), pp. D279-D285
Ganguly, Das, Bandhu, Chanda, Jana, Mondal, Sau, 2009
T. Ganguly, M. Das, A. Bandhu, P.K. Chanda, B. Jana, R. Mondal, S. SauPhysicochemical properties and distinct DNA binding capacity of the repressor of temperate staphylococcus aureusphage ϕ/11
FEBS J., 276 (2009), pp. 1975-1985
Gianni, Calosci1, Aelen, Vuister, Brunori, Travaglini-Allocatelli, 2005
S. Gianni, N. Calosci1, J.M.A. Aelen, G.W. Vuister, M. Brunori, C. Travaglini-AllocatelliKinetic folding mechanism of PDZ2 from PTP-BL
Protein Eng. Des. Select., 18 (2005), pp. 389-395
Gianni, Geierhaas, Calosci, Jemth, Vuister, Travaglini-Allocatelli, Vendruscolo, Brunori, 2007
S. Gianni, C.D. Geierhaas, N. Calosci, P. Jemth, G.W. Vuister, C. Travaglini-Allocatelli, M. Vendruscolo, M. BrunoriA PDZ domain recapitulates a unifying mechanism for protein folding
Proc. Natl. Acad. Sci. USA, 104 (2007), pp. 128-133
Gillespie, 1991
J.H. GillespieThe Causes of Molecular Evolution
Oxford Univ. Press, Oxford (1991)
Grantcharova, Riddle, Santiago, Baker, 1998
V.P. Grantcharova, D.S. Riddle, J.V. Santiago, D. BakerImportant role of hydrogen bonds in the structurally polarized transition state for folding of the src SH3 domain
Nat. Struct. Biol (1998), pp. 714-720
Guelorget, Roovers, Guérineau, Barbey, Li, Golinelli-Pimpaneau, 2010
A. Guelorget, M. Roovers, V. Guérineau, C. Barbey, X. Li, B. Golinelli-PimpaneauInsights into the hyperthermostability and unusual region-specificity of archaeal pyrococcus abyssi tRNA m1A57/58 methyltransferase
Nucl. Acid Res., 38 (2010), pp. 6206-6218
Hopf, Colwell, Sheridan, Rost, Sander, Marks, 2012
T.A. Hopf, L.J. Colwell, R. Sheridan, B. Rost, C. Sander, D.S. MarksThree-dimensional structures of membrane proteins from genomic sequencing
Cell, 149 (2012), pp. 1607-1621
Hopf, Ingraham, Poelwijk, Schärfe, Springer, Sander, Marks, 2017
T.A. Hopf, J.B. Ingraham, F.J. Poelwijk, C.P.I. Schärfe, M. Springer, C. Sander, D.S. MarksMutation effects predicted from sequence co-variation
Nat. Biotech., 35 (2017), pp. 128-135
Jacquin, Gilson, Shakhnovich, Cocco, Monasson, 2016
H. Jacquin, A. Gilson, E. Shakhnovich, S. Cocco, R. MonassonBenchmarking inverse statistical approaches for protein structure and design with exactly solvable models
PLoS Comput. Biol., 12 (2016)

E1004889

Kimura, 1968
M. KimuraEvolutionary rate at the molecular level
Nature, 217 (1968), pp. 624-626
Kimura, 1969
M. KimuraThe rate of molecular evolution considered from the standpoint of population genetics
Proc. Natl. Acad. Sci. USA, 63 (1969), pp. 1181-1188
Kimura, 1983
M. KimuraThe Neutral Theory of Molecular Evolution
Cambridge Univ. Press, Cambridge (1983)
Kimura, Ohta, 1971
M. Kimura, T. OhtaProtein polymorphism as a phase of molecular evolution
Nature, 229 (1971), pp. 467-469
Kimura, Ohta, 1974
M. Kimura, T. OhtaOn some principles governing molecular evolution
Proc. Natl. Acad. Sci. USA, 71 (1974), pp. 2848-2852
Knapp, Mattson, Christova, Berndt, Karshikoff, Vihinen, Smith, Ladenstein, 1998
S. Knapp, P.T. Mattson, P. Christova, K.D. Berndt, A. Karshikoff, M. Vihinen, C.E. Smith, R. LadensteinThermal unfolding of small proteins with sh3 domain folding pattern
Proteins, 31 (1998), pp. 309-319
Kragelund, Osmark, Neergaard, Schiødt, Kristiansen, Knudsen, Poulsen, 1999
B.B. Kragelund, P. Osmark, T.B. Neergaard, J. Schiødt, K. Kristiansen, J. Knudsen, F.M. PoulsenThe formation of a native-like structure containing eight conserved hydrophobic residues is rate limiting in two-state protein folding of ACBP
Nat. Struct. Biol., 6 (1999), pp. 594-601
Kumar, Bava, Gromiha, Prabakaran, Kitajima, Uedaira, Sarai, 2006
M. Kumar, K. Bava, M. Gromiha, P. Prabakaran, K. Kitajima, H. Uedaira, A. SaraiProtherm and proNIT: thermodynamic databases for proteins and protein-nucleic acid interactions
Nucl. Acid Res., 34 (2006), pp. D204-D206
Marks, Colwell, Sheridan, Hopf, Pagnani, Zecchina, Sander, 2011
D.S. Marks, L.J. Colwell, R. Sheridan, T.A. Hopf, A. Pagnani, R. Zecchina, C. SanderProtein 3D structure computed from evolutionary sequence variation
PLoS ONE, 6 (12) (2011)

E28766. doi:10.1371/journal.pone.0028766.

Miyata, Yasunaga, 1980
T. Miyata, T. YasunagaMolecular evolution of mRNA: a method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its applications
J. Mol. Evol., 16 (1980), pp. 23-36
Miyazawa, 2013
S. MiyazawaPrediction of contact residue pairs based on co-substitution between sites in protein structures
PLoS ONE, 8 (1) (2013)

E54252. doi:10.1371/journal.pone.0054252.

Miyazawa, 2016
S. MiyazawaSelection maintaining protein stability at equilibrium
J. Theor. Biol., 391 (2016), pp. 21-34
Moran, 1958
P.A.P. MoranRandom processes in genetics
Proc. Cambridge Phil. Soc., 54 (1958), pp. 60-71
Morcos, Pagnani, Lunt, Bertolino, Marks, Sander, Zecchina, Onuchic, Hwa, Weigt, 2011
F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D.S. Marks, C. Sander, R. Zecchina, J.N. Onuchic, T. Hwa, M. WeigtDirect-coupling analysis of residue coevolution captures native contacts across many protein families
Proc. Natl. Acad. Sci. USA, 108 (2011), pp. E1293-E1301
Morcos, Schafer, Cheng, Onuchic, Wolynes, 2014
F. Morcos, N.P. Schafer, R.R. Cheng, J.N. Onuchic, P.G. WolynesCoevolutionary information, protein folding landscapes, and the thermodynamics of natural selection
Proc. Natl. Acad. Sci. USA, 111 (2014), pp. 12408-12413
Ohta, 1973
T. OhtaSlightly deleterious mutant substitutions in evolution
Nature, 246 (1973), pp. 96-98
Ohta, 1992
T. OhtaThe nearly neutral theory of molecular evolution
Annu. Rev. Ecol. Syst., 23 (1992), pp. 263-286
Ohta, 2002
T. OhtaNear-neutrality in evolution of genes and gene regulation
Proc. Natl. Acad. Sci. USA, 99 (2002), pp. 16134-16137
Onuchic, Wolynes, Lutheyschulten, Socci, 1995
J.N. Onuchic, P.G. Wolynes, Z. Lutheyschulten, N.D. SocciToward an outline of the topography of a realistic protein-folding funnel
Proc. Natl. Acad. Sci. USA, 92 (1995), pp. 3626-3630
Onwukwe, Kursula1, Koski, Schmitz, Wierenga, 2014
G.U. Onwukwe, P. Kursula1, M.K. Koski, W. Schmitz, R.K. WierengaHuman δ3, δ2-enoyl-CoA isomerase, type 2: a structural enzymology study on the catalytic role of its ACBP domain and helix-10
FEBS J., 282 (2014), pp. 746-768
Pande, Grosberg, Tanaka, 1997
V.S. Pande, A.Y. Grosberg, T. TanakaStatistical mechanics of simple models of protein folding and design
Biophys. J., 73 (1997), pp. 3192-3210
Pande, Grosberg, Tanaka, 2000
V.S. Pande, A.Y. Grosberg, T. TanakaHeteropolymer freezing and design: towards physical models of protein folding
Rev. Mod. Phys., 72 (2000), pp. 259-314
Parsons, Lin, Orban, 2006
L.M. Parsons, F. Lin, J. OrbanPeptidoglycan recognition by pal, an outer membrane lipoprotein
Biochemistry, 45 (2006), pp. 2122-2128
Ramanathan, Shakhnovich, 1994
S. Ramanathan, E. ShakhnovichStatistical mechanics of proteins with evolutionary selected sequences
Phys. Rev. E, 50 (1994), pp. 1303-1312
Rosa, Milardi, Grasso, Guzzi, Sportelli, 1995
C.L. Rosa, D. Milardi, D. Grasso, R. Guzzi, L. SportelliThermodynamics of the thermal unfolding of azurin
J. Phys. Chem., 99 (1995), pp. 14864-14870
Ruiz-Sanz, Simoncsits, Tőrő, Pongor, Mateo, Filimonov, 1999
J. Ruiz-Sanz, A. Simoncsits, I. Tőrő, S. Pongor, P.L. Mateo, V.V. FilimonovA thermodynamic study of the 434-repressor n-terminal domain and of its covalently linked dimers
Eur. J. Biochem., 263 (1999), pp. 246-253
Sainsbury, Ren, Saunders, Stuarta, Owens, 2008
S. Sainsbury, J. Ren, N.J. Saunders, D.I. Stuarta, R.J. OwensCrystallization and preliminary x-ray analysis of crga, a lysr-type transcriptional regulator from pathogenic neisseria meningitidis MC58
Acta Cryst., F64 (2008), pp. 797-801
Serohijos, Rimas, Shakhnovich, 2012
A. Serohijos, Z. Rimas, E. ShakhnovichProtein biophysics explains why highly abundant proteins evolve slowly
Cell Rep., 2 (2) (2012), pp. 249-256, 10.1016/j.celrep.2012.06.022
Shakhnovich, 1994
E.I. ShakhnovichProteins with selected sequences fold into unique native conformation
Phys. Rev. Lett., 72 (1994), pp. 3907-3911
Shakhnovich, Gutin, 1993a
E.I. Shakhnovich, A.M. GutinEngineering of stable and fast-folding sequences of model proteins
Proc. Natl. Acad. Sci. USA, 90 (1993), pp. 7195-7199
Shakhnovich, Gutin, 1993b
E.I. Shakhnovich, A.M. GutinA new approach to the design of stable proteins
Protein Eng., 6 (1993), pp. 793-800
Stupák, Zǒldák, Musatov, Sprinzl, Sedlák, 2006
M. Stupák, G. Zǒldák, A. Musatov, M. Sprinzl, E. SedlákUnusual effect of salts on the homodimeric structure of NADH oxidase from thermus thermophilus in acidic ph
Biochim. Biophys. Acta, 1764 (2006), pp. 129-137
Sułkowska, Morcos, Weigt, Hwa, Onuchic, 2012
J.I. Sułkowska, F. Morcos, M. Weigt, T. Hwa, J.N. OnuchicGenomics-aided structure prediction
Proc. Natl. Acad. Sci. USA, 109 (2012), pp. 10340-10345
Tokuriki, Stricher, Schymkowitz, Serrano, Tawfik, 2007
N. Tokuriki, F. Stricher, J. Schymkowitz, L. Serrano, D.S. TawfikThe stability effects of protein mutations appear to be universally distributed
J. Mol. Biol., 369 (2007), pp. 1318-1332
Torchio, Ermácora, Sica, 2012
G.M. Torchio, M.R. Ermácora, M.P. SicaEquilibrium unfolding of the PDZ domain of β2-syntrophin
Biophys. J., 102 (2012), pp. 2835-2844
Williams, Prosselkov, Liepinsh, Line, Sharipo, Littler, Curmi, Otting, Dixon, 2002
N.K. Williams, P. Prosselkov, E. Liepinsh, I. Line, A. Sharipo, D.R. Littler, P.M.G. Curmi, G. Otting, N.E. DixonIn vivo protein cyclization promoted by a circularly permuted synechocystis sp. PCC6803 dnab mini-intein
J. Biol. Chem., 277 (2002), pp. 7790-7798
Wilson, Wittung-Stafshede, 2005
C.J. Wilson, P. Wittung-StafshedeSnapshots of a dynamic folding nucleus in zinc-substituted pseudomonas aeruginosa azurin
Biochemistry, 44 (2005), pp. 10054-10062