Selection originating from protein stability/foldability: Relationships between protein folding free energy, sequence ensemble, and fitness

doi:10.1016/j.jtbi.2017.08.018

Journal of Theoretical Biology

Volume 433, 21 November 2017, Pages 21-38


Highlights

• A Boltzmann distribution with protein fitness is derived.
• Relationships between folding free energy, inverse statistical potential and fitness.
• Selective temperature, glass transition temperature and folding free energy are estimated.
• Relationship between selective temperature and substitution rate (K_a/K_s).
• Protein stability/foldability is kept in a balance of positive selection and random drift.

Abstract

Assuming that mutation and fixation processes are reversible Markov processes, we prove that the equilibrium ensemble of sequences obeys a Boltzmann distribution with $\exp (4 N_{e} m (1 - 1 / (2 N))),$ where m is Malthusian fitness and N_e and N are effective and actual population sizes. On the other hand, the probability distribution of sequences with maximum entropy that satisfies a given amino acid composition at each site and a given pairwise amino acid frequency at each site pair is a Boltzmann distribution with $\exp (- ψ_{N}),$ where the evolutionary statistical energy ψ_N is represented as the sum of one body (h) (compositional) and pairwise (J) (covariational) interactions over all sites and site pairs. A protein folding theory based on the random energy model (REM) indicates that the equilibrium ensemble of natural protein sequences is well represented by a canonical ensemble characterized by $\exp (- ΔG_{ND} / k_{B} T_{s})$ or by $\exp (- G_{N} / k_{B} T_{s})$ if an amino acid composition is kept constant, where $ΔG_{ND} \equiv G_{N} - G_{D},$ G_N and G_D are the native and denatured free energies, and T_s is the effective temperature representing the strength of selection pressure. Thus, $4 N_{e} m (1 - 1 / (2 N)),$ $- Δψ_{ND} (\equiv - ψ_{N} + ψ_{D}),$ and $- ΔG_{ND} / k_{B} T_{s}$ must be equivalent to each other. With h and J estimated by the DCA program, the changes (Δψ_N) of ψ_N due to single nucleotide nonsynonymous substitutions are analyzed. The results indicate that the standard deviation of $ΔG_{N} (= k_{B} T_{s} Δψ_{N})$ is approximately constant irrespective of protein families, and therefore can be used to estimate the relative value of T_s. 
Glass transition temperature T_g and ΔG_ND are estimated from estimated T_s and experimental melting temperature (T_m) for 14 protein domains. The estimates of ΔG_ND agree with their experimental values for 5 proteins, and those of T_s and T_g are all within a reasonable range. In addition, approximating the probability density function (PDF) of Δψ_N by a log-normal distribution, PDFs of Δψ_N and K_a/K_s, which is the ratio of nonsynonymous to synonymous substitution rate per site, in all and in fixed mutants are estimated. The equilibrium values of ψ_N, at which the average of Δψ in fixed mutants is equal to zero, well match ψ_N averaged over homologous sequences, confirming that the present methods for a fixation process of mutations and for the equilibrium ensemble of ψ_N give a consistent result with each other. The PDFs of K_a/K_s at equilibrium confirm that T_s negatively correlates with the amino acid substitution rate (the mean of K_a/K_s) of protein. Interestingly, stabilizing mutations are significantly fixed by positive selection, and balance with destabilizing mutations fixed by random drift, although most of them are removed from population. Supporting the nearly neutral theory, neutral selection is not significant even in fixed mutants.

Keywords

Folding free energy change

Inverse statistical potential

Boltzmann distribution

Selective temperature

Positive selection

1. Introduction

Natural proteins can fold their sequences into unique structures. Protein stability and foldability result from natural selection and are not typical characteristics of random polymers (Bryngelson and Wolynes, 1987; Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b). Natural selection maintains protein stability and foldability over evolutionary timescales. On the basis of the random energy model (REM) for protein folding, it was argued (Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b) that the equilibrium ensemble of natural protein sequences in sequence space is well represented by a canonical ensemble characterized by a Boltzmann factor $\exp (- ΔG_{ND} (σ) / k_{B} T_{s}),$ where $ΔG_{ND} (σ) (\equiv G_{N} (σ) - G_{D} (σ))$ is the folding free energy of sequence σ, G_N and G_D are the free energies of the native and denatured states, k_B is the Boltzmann constant, and T_s is the effective temperature representing the strength of selection pressure and must satisfy T_s < T_g < T_m for natural proteins to fold into unique native structures; T_g is the glass transition temperature and T_m is the melting temperature. The REM also indicates that the free energy of denatured conformations (G_D) is a function of amino acid frequencies only and does not depend on amino acid order, and therefore the Boltzmann factor will be taken as $\exp (- G_{N} (σ) / k_{B} T_{s}),$ if amino acid frequencies are kept constant. It was shown by lattice Monte Carlo simulations (Shakhnovich, 1994) that lattice protein sequences selected with this Boltzmann factor were not trapped by competing structures but could fold into unique native structures. 
Selective temperatures were also estimated (Dokholyan and Shakhnovich, 2001) for actual proteins to yield good correlations of sequence entropy between actual protein families and sequences designed with this type of Boltzmann factor.

On the other hand, the maximum entropy principle implies that the probability distribution of sequences in sequence space, which satisfies constraints on amino acid compositions at all sites and on amino acid pairwise frequencies for all site pairs, is a Boltzmann distribution with the Boltzmann factor $\exp (- ψ_{N} (σ)),$ where the total interaction ψ_N(σ) of a sequence σ is represented as the sum of one-body (h) (compositional) and pairwise (J) (covariational) interactions between residues in the sequence; ψ_N(σ) is called the evolutionary statistical energy by Hopf et al. (2017). The inverse statistical potentials, the one-body (h) and pairwise (J) interactions, that satisfy those constraints for homologous sequences have been estimated (Ekeberg et al., 2014; 2013; Marks et al., 2011; Morcos et al., 2011) as an inverse Potts problem and successfully employed to predict contacting residue pairs in protein structures (Ekeberg et al., 2014; 2013; Hopf et al., 2012; Marks et al., 2011; Miyazawa, 2013; Morcos et al., 2011; Sułkowska et al., 2012). Morcos et al. (2014) noticed that the ψ_N in the Boltzmann factor is the dimensionless energy corresponding to G_N/k_BT_s, and estimated selective temperatures (T_s) for several protein families by comparing the difference (Δψ_ND) of ψ between the native and the molten globule states with folding free energies (ΔG_ND) estimated with the associative-memory, water-mediated, structure, and energy model (AWSEM) (Davtyan et al., 2012).

The first purpose of the present study is to establish relationships between protein foldability/stability, sequence distribution, and protein fitness. We prove that if mutation and fixation processes in protein evolution are reversible Markov processes, the equilibrium ensemble of genes will obey a Boltzmann distribution with the Boltzmann factor $\exp (4 N_{e} m (1 - 1 / (2 N))),$ where N_e and N are effective and actual population sizes, and m is the Malthusian fitness of a gene. Correspondences between $- ΔG_{ND} / k_{B} T_{s},$ $- Δψ_{ND}$ (with $Δψ_{ND} \equiv ψ_{N} - ψ_{D}$) and $4 N_{e} m (1 - 1 / (2 N))$ are then obtained by equating these three Boltzmann distributions with each other; $ψ_{D} ≃ G_{D} / k_{B} T_{s} + \text{constant}$.

The second purpose is to analyze the effects (Δψ_N) of single amino acid substitutions on the evolutionary statistical energy of a protein, and to estimate from the distribution of Δψ_N the effective temperature of natural selection (T_s) and then the glass transition temperature (T_g) and folding free energy (ΔG_ND) of the protein. We estimate the one-body (h) and pairwise (J) interactions with the DCA program, which is available at “http://dca.rice.edu/portal/dca/home”, and then analyze the changes (Δψ_N) of the evolutionary statistical energy (ψ_N) of a natural sequence due to single amino acid substitutions caused by single nucleotide changes. The data of Δψ_N due to single nucleotide nonsynonymous substitutions for 14 protein domains show that the standard deviation of Δψ_N over all the substitutions at all sites hardly depends on the evolutionary statistical energy (ψ_N) of each homologous sequence and is nearly constant for each protein family, indicating that the standard deviation of ΔG_N ≃ k_BT_sΔψ_N is nearly constant irrespective of protein families. From this finding, T_s for each protein family has been estimated relative to T_s for the PDZ family, which is determined by directly comparing $ΔΔψ_{ND} (\equiv Δ(ψ_{N} - ψ_{D}) ≃ Δψ_{N})$ with the experimental values of folding free energy changes, ΔΔG_ND, due to single amino acid substitutions. Also T_g and ΔG_ND for each protein family are estimated on the basis of the REM from the estimated T_s and an experimental melting temperature T_m. The estimates of T_s and T_g are all within a reasonable range, and those of ΔG_ND agree well with experimental ΔG_ND for 5 protein families. 
The present method for estimating T_s is simpler than the method (Morcos et al., 2014) using AWSEM, and also is useful for the prediction of ΔG_ND, because the experimental data of ΔG_ND are limited in comparison with T_m, and also experimental conditions such as temperature and pH tend to be different among them. In addition, it has been revealed that Δψ_N averaged over all single nucleotide nonsynonymous substitutions is a linear function of ψ_N/L of each homologous sequence, where L is sequence length; the average of Δψ_N decreases as ψ_N/L increases. This characteristic is required for homologous proteins to stay at the equilibrium state of the native conformational energy G_N ≃ k_BT_sψ_N, and indicates a weak dependency (Miyazawa, 2016; Serohijos et al., 2012) of ΔΔG_ND on ΔG_ND/L of protein across protein families.

The third purpose is to study an amino acid substitution process in protein evolution, which is characterized by the fitness, $m = - Δψ_{ND} / (4 N_{e} (1 - 1 / (2 N)))$. We employ a monoclonal approximation for mutation and fixation processes of genes, in which protein evolution proceeds with single amino acid substitutions fixed one at a time in a population. In this approximation, ψ_N of a protein gene attains the equilibrium, $ψ_{N} = ψ_{N}^{eq},$ when the average of Δψ_N( ≃ ΔΔψ_ND) over single nucleotide nonsynonymous mutations fixed in a population is equal to zero. Approximating the distribution of Δψ_N due to single nucleotide nonsynonymous mutations by a log-normal distribution, their distribution for fixed mutants is numerically calculated and used to calculate the averages of various quantities and also the probability density functions (PDF) of K_a/K_s in all arising mutants and in fixed mutants only; K_a/K_s is defined as the ratio of nonsynonymous to synonymous substitution rate per site. There is a good agreement between the time average ($ψ_{N}^{eq}$) and ensemble average (⟨ψ_N⟩_σ), which is equal to the sample average, $\bar{ψ_{N}},$ of ψ_N over homologous sequences, supporting the constancy of the standard deviation of Δψ_N assumed in the monoclonal approximation.

We also study protein evolution at equilibrium, $ψ_{N} = ψ_{N}^{eq}$. The common understanding of protein evolution has been that amino acid substitutions observed in homologous proteins are neutral (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) or slightly deleterious (Ohta, 1973; 1992), and that random drift is a primary force to fix amino acid substitutions in a population. The PDFs of K_a/K_s in all arising mutations and in their fixed mutations are examined to see how significant each of positive, neutral, slightly negative, and negative selection is. Interestingly, stabilizing mutations are significantly fixed in a population by positive selection, and balance with destabilizing mutations that are also significantly fixed by random drift, although most negative mutations are removed from the population. Contrary to the neutral theory (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) and supporting the nearly neutral theory (Ohta, 1973; 1992; 2002), the proportion of neutral selection is not large even in fixed mutants. It is also confirmed that the effective temperature (T_s) of selection negatively correlates with the amino acid substitution rate (K_a/K_s) of a protein at equilibrium.

2. Methods

2.1. Knowledge of protein folding

A protein folding theory (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b), which is based on a random energy model (REM), indicates that the equilibrium ensemble of amino acid sequences, $σ \equiv (σ_{1}, \dots, σ_{L})$ where σ_i is the type of amino acid at site i and L is sequence length, can be well approximated by a canonical ensemble with a Boltzmann factor consisting of the folding free energy, ΔG_ND(σ, T) and an effective temperature T_s representing the strength of selection pressure.

(1)

$P(σ) \propto P^{\mathrm{mut}}(σ) \exp\left(\frac{-ΔG_{ND}(σ, T)}{k_{B} T_{s}}\right)$

(2)

$\propto \exp\left(\frac{-G_{N}(σ)}{k_{B} T_{s}}\right) \quad \text{if } f(σ) = \text{constant}$

(3)

$ΔG_{ND}(σ, T) \equiv G_{N}(σ) - G_{D}(f(σ), T)$

where P^mut(σ) is the probability of a sequence (σ) randomly occurring in a mutational process and depends only on the amino acid frequencies f(σ), k_B is the Boltzmann constant, T is a growth temperature, and G_N and G_D are the free energies of the native conformation and denatured state, respectively. Selective temperature T_s quantifies how strong the folding constraints are in protein evolution, and is specific to the protein structure and function. The free energy G_D of the denatured state does not depend on the amino acid order but only on the amino acid composition, f(σ), in a sequence (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b). It is reasonable to assume that mutations occur independently between sites, and therefore the equilibrium frequency of a sequence in the mutational process is equal to the product of the equilibrium frequencies over sites;

$P^{\mathrm{mut}}(σ) = \prod_{i} p^{\mathrm{mut}}(σ_{i}),$ where p^mut(σ_i) is the equilibrium frequency of σ_i at site i in the mutational process.

The distribution of conformational energies in the denatured state (molten globule state), which consists of conformations as compact as the native conformation, is approximated in the random energy model (REM), particularly the independent interaction model (IIM) (Pande et al., 1997), to be equal to the energy distribution of randomized sequences, which is then approximated by a Gaussian distribution, in the native conformation. That is, the partition function Z for the denatured state is written as follows with the energy density n(E) of conformations that is approximated by a product of a Gaussian probability density and the total number of conformations whose logarithm is proportional to the chain length.

(4)

$Z = \int \exp\left(\frac{-E}{k_{B} T}\right) n(E) \, dE$

(5)

$n(E) \approx \exp(ωL) \, N(\bar{E}(f(σ)), δE^{2}(f(σ)))$

where ω is the conformational entropy per residue in the compact denatured state, and $N(\bar{E}(f(σ)), δE^{2}(f(σ)))$ is the Gaussian probability density with mean $\bar{E}$ and variance δE², which depend only on the amino acid composition of the protein sequence. The free energy of the denatured state is approximated as follows.

(6)

$G_{D}(f(σ), T) \approx \bar{E}(f(σ)) - \frac{δE^{2}(f(σ))}{2 k_{B} T} - k_{B} T ω L$

(7)

$= \bar{E}(f(σ)) - δE^{2}(f(σ)) \frac{ϑ(T / T_{g})}{k_{B} T}$

(8)

$ϑ(T / T_{g}) \equiv \begin{cases} \frac{1}{2}\left(1 + \frac{T^{2}}{T_{g}^{2}}\right) & \text{for } T > T_{g} \\ \frac{T}{T_{g}} & \text{for } T \leq T_{g} \end{cases}$

where $\bar{E}$ and δE² are estimated as the mean and variance of interaction energies of randomized sequences in the native conformation. T_g is the glass transition temperature of the protein at which entropy becomes zero (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b); $- \partial G_{D} / \partial T |_{T = T_{g}} = 0$. The conformational entropy per residue ω in the compact denatured state can be represented with T_g; $ωL = δE^{2} / (2 (k_{B} T_{g})^{2})$. Thus, unless T_g < T_m, a protein will be trapped at local minima on a rugged free energy landscape before it can fold into a unique native structure.
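The REM expressions of Eqs. (6)-(8) can be checked numerically. The following is a minimal sketch, assuming energies in kcal/mol and temperatures in kelvin; the function names and the numerical inputs are hypothetical and are not taken from the paper.

```python
# Boltzmann constant in kcal/(mol K); the choice of units is an assumption.
K_B = 0.0019872


def theta(t_ratio):
    """Return ϑ(T/T_g) as defined in Eq. (8)."""
    if t_ratio > 1.0:
        return 0.5 * (1.0 + t_ratio ** 2)
    return t_ratio


def chain_entropy(e_var, t_glass):
    """Return ωL = δE² / (2 (k_B T_g)²), the total conformational entropy."""
    return e_var / (2.0 * (K_B * t_glass) ** 2)


def denatured_free_energy(e_mean, e_var, temp, t_glass):
    """Return G_D of Eq. (7) from the mean and variance of the energy."""
    return e_mean - e_var * theta(temp / t_glass) / (K_B * temp)


# Hypothetical inputs: mean energy, energy variance, T_g and T.
e_mean, e_var, t_glass, temp = -50.0, 20.0, 200.0, 300.0

# Above T_g, Eq. (6) with ωL expressed through T_g reproduces Eq. (7).
eq6 = (e_mean - e_var / (2.0 * K_B * temp)
       - K_B * temp * chain_entropy(e_var, t_glass))
eq7 = denatured_free_energy(e_mean, e_var, temp, t_glass)
print(eq6, eq7)
```

Below T_g the second branch of ϑ makes G_D independent of temperature, which corresponds to the frozen (zero entropy) state described above.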

2.2. Probability distribution of homologous sequences with the same native fold in sequence space

The probability distribution P(σ) of homologous sequences with the same native fold, $σ = (σ_{1}, \dots, σ_{L})$ where σ_i ∈ {amino acids, deletion}, in sequence space with maximum entropy, which satisfies a given amino acid frequency at each site and a given pairwise amino acid frequency at each site pair, is a Boltzmann distribution (Marks et al., 2011; Morcos et al., 2011).

(9)

$P(σ) \propto \exp(-ψ_{N}(σ))$

(10)

$ψ_{N}(σ) \equiv -\left(\sum_{i}^{L} \left(h_{i}(σ_{i}) + \sum_{j > i} J_{ij}(σ_{i}, σ_{j})\right)\right)$

where h_i and J_ij are one-body (compositional) and two-body (covariational) interactions and must satisfy the following constraints.

(11)

$\sum_{σ} P(σ) δ_{σ_{i} a_{k}} = P_{i}(a_{k})$

(12)

$\sum_{σ} P(σ) δ_{σ_{i} a_{k}} δ_{σ_{j} a_{l}} = P_{ij}(a_{k}, a_{l})$

where $δ_{σ_{i} a_{k}}$ is the Kronecker delta, P_i(a_k) is the frequency of amino acid a_k at site i, and P_ij(a_k, a_l) is the frequency of amino acid pair, a_k at i and a_l at j; a_k ∈ {amino acids, deletion}. The pairwise interaction matrix J satisfies $J_{ij}(a_{k}, a_{l}) = J_{ji}(a_{l}, a_{k})$ and $J_{ii}(a_{k}, a_{l}) = 0$. Interactions h_i and J_ij can be well estimated from a multiple sequence alignment (MSA) in the mean field approximation (Marks et al., 2011; Morcos et al., 2011), or by maximizing a pseudo-likelihood (Ekeberg et al., 2014; 2013). Because ψ_N(σ) has been estimated under the constraints on amino acid compositions at all sites, only sequences with a given amino acid composition contribute significantly to the partition function, and other sequences may be ignored.
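As an illustration of Eq. (10), the sketch below evaluates ψ_N(σ) for an integer-encoded sequence, and the change Δψ_N caused by a single substitution. The array layout (h of shape (L, q), J of shape (L, L, q, q)) and the function names are assumptions made for this example; they do not describe the output format of the DCA program.

```python
import numpy as np


def evolutionary_statistical_energy(seq, h, J):
    """Return ψ_N(σ) of Eq. (10) for a sequence of integer state codes.

    h has shape (L, q) and J has shape (L, L, q, q); only pairs with
    j > i are summed, so each site pair is counted once.
    """
    seq = np.asarray(seq)
    sites = np.arange(len(seq))
    one_body = h[sites, seq].sum()
    # pair[i, j] = J[i, j, seq[i], seq[j]]
    pair = J[sites[:, None], sites[None, :], seq[:, None], seq[None, :]]
    pairwise = np.triu(pair, k=1).sum()
    return -(one_body + pairwise)


def delta_psi(seq, site, new_state, h, J):
    """Return Δψ_N = ψ_N(mutant) - ψ_N(wild type) for one substitution."""
    mutant = np.array(seq, copy=True)
    mutant[site] = new_state
    return (evolutionary_statistical_energy(mutant, h, J)
            - evolutionary_statistical_energy(seq, h, J))


# Random toy parameters (L = 4 sites, q = 3 states); not real couplings.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
J = rng.normal(size=(4, 4, 3, 3))
wild_type = [0, 2, 1, 1]
print(evolutionary_statistical_energy(wild_type, h, J))
print(delta_psi(wild_type, site=1, new_state=0, h=h, J=J))
```

Recomputing the full energy for each mutant is the simplest choice; for long alignments one would normally update only the terms that involve the mutated site.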

Hence, from Eqs. (2) and (9),

(13)

$ψ_{N}(σ) ≃ G_{N}(σ) / (k_{B} T_{s}) + \text{function of } f(σ)$

(14)

$ψ_{D}(f(σ), T) ≃ G_{D}(f(σ), T) / (k_{B} T_{s}) + \text{function of } f(σ)$

(15)

$Δψ_{ND}(σ, T) ≃ ΔG_{ND}(σ, T) / (k_{B} T_{s})$

(16)

$Δψ_{ND}(σ, T) \equiv ψ_{N}(σ) - ψ_{D}(f(σ), T)$

(17)

$ψ_{D}(f(σ), T) \approx \bar{ψ}(f(σ)) - δψ^{2}(f(σ)) \, ϑ(T / T_{g}) \, T_{s} / T$

(18)

$ω = (T_{s} / T_{g})^{2} \, δψ^{2} / (2 L)$

where $\bar{ψ}$ and δψ² are estimated as the mean and variance of ψ_N over randomized sequences; $\bar{E} ≃ k_{B} T_{s} \bar{ψ}$ and δE² ≃ (k_BT_s)²δψ².

2.3. The equilibrium distribution of sequences in a mutation-fixation process

Here we assume that the mutational process is a reversible Markov process. That is, the mutation rate per gene, M_μν, from sequence $μ \equiv (μ_{1}, \dots, μ_{L})$ to ν satisfies the detailed balance condition

(19)

$P^{\mathrm{mut}}(μ) M_{μν} = P^{\mathrm{mut}}(ν) M_{νμ}$

where P^mut(ν) is the equilibrium frequency of sequence ν in a mutational process, M_μν. The mutation rate per population is equal to 2NM_μν for a diploid population, where N is the population size. The substitution rate R_μν from μ to ν is equal to the product of the mutation rate and the fixation probability with which a single mutant gene comes to fully occupy the population (Crow and Kimura, 1970).

(20)

$R_{μν} = 2 N M_{μν} \, u(s(μ \to ν))$

where u(s(μ → ν)) is the fixation probability of mutants from μ to ν the selective advantage of which is equal to s.

For genic selection (no dominance) or gametic selection in a diploid Wright-Fisher population, the fixation probability, u, of a single mutant gene, the selective advantage of which is equal to s and the frequency of which in a population is equal to $q_{m} = 1 / (2 N),$ was estimated (Crow and Kimura, 1970) as

(21)

$2 N u(s) = 2 N \frac{1 - e^{-4 N_{e} s q_{m}}}{1 - e^{-4 N_{e} s}}$

(22)

$= \frac{u(s)}{u(0)} \quad \text{with } q_{m} = \frac{1}{2 N}$

where N_e is the effective population size. Eq. (21) is also valid for a haploid population if 2N_e and 2N are replaced by N_e and N, respectively. Likewise, for a haploid Moran population, 4N_e and 2N should be replaced by N_e and N, respectively. Fixation probabilities for various selection models, which are compiled from p. 192 and pp. 424–427 of Crow and Kimura (1970) and from Moran (1958) and Ewens (1979), are listed in Table S.7. The selective advantage of a mutant sequence ν to a wildtype μ is equal to

(23)

$s(μ \to ν) = m(ν) - m(μ)$

where m(ν) is the Malthusian fitness of a mutant sequence, and m(μ) is for the wildtype.
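The fixation probability of Eq. (21) is easy to evaluate, but the naive formula gives 0/0 at s = 0 and can overflow for strongly deleterious mutants. The sketch below is one hedged way to compute it for the diploid Wright-Fisher case only; the function name and the algebraic rearrangement used for s < 0 are choices made for this example.

```python
import math


def fixation_probability(s, n_e, n):
    """Return u(s) of Eq. (21) for a single new mutant, q_m = 1/(2N)."""
    q_m = 1.0 / (2.0 * n)
    x = 4.0 * n_e * s
    if x == 0.0:
        return q_m  # neutral limit, u(0) = q_m
    if x > 0.0:
        # (1 - e^{-x q_m}) / (1 - e^{-x}); both expm1 values are negative.
        return math.expm1(-x * q_m) / math.expm1(-x)
    # For x < 0 multiply numerator and denominator by e^{x} so that
    # every exponential stays below 1 and cannot overflow.
    return math.exp(x * (1.0 - q_m)) * math.expm1(x * q_m) / math.expm1(x)


# Hypothetical population: N_e = N = 1000.
for s in (-0.001, 0.0, 0.001, 0.01):
    u = fixation_probability(s, 1000.0, 1000.0)
    print(s, u, 2.0 * 1000.0 * u)  # last column is 2N u(s) = u(s)/u(0)
```

The last printed column is the quantity on the left-hand side of Eq. (21), that is, the fixation probability relative to that of a neutral mutant.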

This Markov process of substitutions in sequence is reversible, and the equilibrium frequency of sequence μ, P^eq(μ), in the total process consisting of mutation and fixation processes is represented by

(24)

$P^{\mathrm{eq}}(μ) = \frac{P^{\mathrm{mut}}(μ) \exp(4 N_{e} m(μ) (1 - q_{m}))}{\sum_{ν} P^{\mathrm{mut}}(ν) \exp(4 N_{e} m(ν) (1 - q_{m}))}$

because both the mutation and fixation processes satisfy the detailed balance conditions, Eq. (19) and the following equation, respectively.

(25)

$\exp(4 N_{e} m(μ) (1 - q_{m})) \, u(s(μ \to ν)) = \frac{\exp(-4 N_{e} m(μ) q_{m}) - \exp(-4 N_{e} m(ν) q_{m})}{\exp(-4 N_{e} m(μ)) - \exp(-4 N_{e} m(ν))}$

(26)

$= \exp(4 N_{e} m(ν) (1 - q_{m})) \, u(s(ν \to μ))$

As a result, the ensemble of homologous sequences in molecular evolution obeys a Boltzmann distribution.
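The detailed balance argument of Eqs. (19)-(26) can be verified numerically on a toy state space. The sketch below is an illustration under stated assumptions: five abstract "sequences" with made-up fitnesses, and a reversible mutation matrix built as M_μν = S_μν P^mut(ν) with a symmetric S, which is one convenient way (not the paper's) to satisfy Eq. (19).

```python
import numpy as np


def fixation_probability(s, n_e, n):
    """Return u(s) of Eq. (21) for a single new mutant, q_m = 1/(2N)."""
    q_m = 1.0 / (2.0 * n)
    x = 4.0 * n_e * s
    if x == 0.0:
        return q_m  # neutral limit
    return np.expm1(-x * q_m) / np.expm1(-x)


def equilibrium_distribution(p_mut, m, n_e, n):
    """Return P^eq of Eq. (24) for fitness vector m."""
    q_m = 1.0 / (2.0 * n)
    weights = p_mut * np.exp(4.0 * n_e * m * (1.0 - q_m))
    return weights / weights.sum()


def substitution_rates(p_mut, exchange, m, n_e, n):
    """Return R of Eq. (20) with M[mu, nu] = exchange[mu, nu] * p_mut[nu]."""
    k = len(p_mut)
    rates = np.zeros((k, k))
    for mu in range(k):
        for nu in range(k):
            if mu != nu:
                mutation = exchange[mu, nu] * p_mut[nu]
                u = fixation_probability(m[nu] - m[mu], n_e, n)
                rates[mu, nu] = 2.0 * n * mutation * u
    return rates


# Toy example with hypothetical values; fitnesses are kept small so
# that 4 N_e m stays of order one.
rng = np.random.default_rng(1)
k, n_e, n = 5, 100.0, 100.0
p_mut = rng.dirichlet(np.ones(k))
sym = rng.uniform(0.1, 1.0, size=(k, k))
exchange = (sym + sym.T) / 2.0          # symmetric, so Eq. (19) holds
m = rng.uniform(-0.01, 0.01, size=k)

p_eq = equilibrium_distribution(p_mut, m, n_e, n)
rates = substitution_rates(p_mut, exchange, m, n_e, n)
flux = p_eq[:, None] * rates            # flux[mu, nu] = P^eq(mu) R[mu, nu]
print(np.allclose(flux, flux.T))        # symmetric flux means detailed balance
```

A symmetric flux matrix means that P^eq(μ) R_μν = P^eq(ν) R_νμ for every pair, which is the reversibility claimed for the combined mutation and fixation process.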

2.4. Relationships between m(σ), ψ_N(σ), and ΔG_ND(σ) of protein sequence

From Eqs. (1), (9), and (24), we can get the following relationships among the Malthusian fitness m, the folding free energy change ΔG_ND and Δψ_ND of protein sequence.

(27)

$P^{\mathrm{eq}}(μ) = \frac{P^{\mathrm{mut}}(μ) \exp(4 N_{e} m(μ) (1 - q_{m}))}{\sum_{ν} P^{\mathrm{mut}}(ν) \exp(4 N_{e} m(ν) (1 - q_{m}))}$

(28)

$= \frac{P^{\mathrm{mut}}(\bar{μ}) \exp(-(ψ_{N}(μ) - ψ_{D}(\overline{f(μ)}, T)))}{\sum_{ν} P^{\mathrm{mut}}(\bar{ν}) \exp(-(ψ_{N}(ν) - ψ_{D}(\overline{f(ν)}, T)))}$

(29)

$≃ \frac{P^{\mathrm{mut}}(μ) \exp(-ΔG_{ND}(μ, T) / (k_{B} T_{s}))}{\sum_{ν} P^{\mathrm{mut}}(ν) \exp(-ΔG_{ND}(ν, T) / (k_{B} T_{s}))}$

where $\overline{f(σ)} \equiv \sum_{σ} f(σ) P(σ)$ and $\log P^{\mathrm{mut}}(\bar{σ}) \equiv \sum_{σ} P(σ) \log(\prod_{i} P^{\mathrm{mut}}(σ_{i}))$. Then, the following relationships are derived for sequences for which $f(μ) = \overline{f(μ)}$.

(30)

$4 N_{e} m(μ) (1 - q_{m}) = -Δψ_{ND}(μ, T) + \text{constant}$

(31)

$≃ \frac{-ΔG_{ND}(μ, T)}{k_{B} T_{s}} + \text{constant}$

The selective advantage of ν to μ is represented as follows for $f(μ) = f(ν) = \overline{f(σ)}$.

(32)

$4 N_{e} s(μ \to ν) (1 - q_{m}) = (4 N_{e} m(ν) - 4 N_{e} m(μ)) (1 - q_{m})$

(33)

$= -(Δψ_{ND}(ν, T) - Δψ_{ND}(μ, T)) = -(ψ_{N}(ν) - ψ_{N}(μ))$

(34)

$≃ -(ΔG_{ND}(ν, T) - ΔG_{ND}(μ, T)) / (k_{B} T_{s}) = -(G_{N}(ν) - G_{N}(μ)) / (k_{B} T_{s})$

It should be noted here that only sequences for which $f(σ) = \overline{f(σ)}$ contribute significantly to the partition functions in Eq. (28), and other sequences may be ignored.

Eq. (33) indicates that evolutionary statistical energy ψ should be proportional to effective population size N_e, and therefore it is ideal to estimate one-body (h) and two-body (J) interactions from homologous sequences of species that do not significantly differ in effective population size. Also, Eq. (34) indicates that selective temperature T_s is inversely proportional to the effective population size N_e; T_s∝1/N_e, because free energy is a physical quantity and should not depend on effective population size.
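Eqs. (32)-(34) turn a change in evolutionary statistical energy directly into a selective advantage. The short sketch below spells this out, assuming that wild type and mutant share the same amino acid composition so that Δψ_ND differences reduce to Δψ_N; the function names and numbers are hypothetical.

```python
def selective_advantage(delta_psi, n_e, n):
    """Return s from Eqs. (32)-(33): 4 N_e s (1 - q_m) = -Δψ_N."""
    q_m = 1.0 / (2.0 * n)
    return -delta_psi / (4.0 * n_e * (1.0 - q_m))


def free_energy_change(delta_psi, k_b_t_s):
    """Return ΔG_N ≃ k_B T_s Δψ_N, following Eq. (34)."""
    return k_b_t_s * delta_psi


# A destabilizing substitution (Δψ_N > 0) is deleterious (s < 0).
delta = 2.0
for n_e in (1.0e3, 1.0e4):
    print(n_e, selective_advantage(delta, n_e, n_e))

# With k_B T_s = 0.3 kcal/mol (illustrative value only):
print(free_energy_change(delta, 0.3))
```

For a fixed Δψ_N the implied s shrinks as N_e grows, which is the same statement as ψ scaling with N_e when s is held fixed.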

2.5. The ensemble average of folding free energy, ΔG_ND(σ, T), over sequences

The ensemble average of ΔG_ND(σ, T) over sequences with Eq. (1) is

(35)

$\langle ΔG_{ND}(σ, T) \rangle_{σ}$

(36)

$\equiv \frac{\sum_{σ} ΔG_{ND}(σ, T) P^{\mathrm{mut}}(σ) \exp\left(-\frac{ΔG_{ND}(σ, T)}{k_{B} T_{s}}\right)}{\sum_{σ} P^{\mathrm{mut}}(σ) \exp\left(-\frac{ΔG_{ND}(σ, T)}{k_{B} T_{s}}\right)}$

(37)

$\approx \frac{\sum_{σ | f(σ) = \overline{f(σ_{N})}} G_{N}(σ) \exp\left(-\frac{G_{N}(σ)}{k_{B} T_{s}}\right)}{\sum_{σ | f(σ) = \overline{f(σ_{N})}} \exp\left(-\frac{G_{N}(σ)}{k_{B} T_{s}}\right)} - G_{D}(\overline{f(σ_{N})}, T)$

(38)

$= \langle G_{N}(σ) \rangle_{σ} - G_{D}(\overline{f(σ_{N})}, T)$

where σ_N denotes a natural sequence, and $\overline{f(σ_{N})}$ denotes the average of amino acid frequencies f(σ_N) over homologous sequences. In Eq. (37), the sum over all sequences is approximated by the sum over sequences the amino acid composition of which is the same as that over the natural sequences.

The ensemble averages of G_N and ψ_N(σ) are estimated in the Gaussian approximation (Pande et al., 1997).

(39)

$\begin{matrix} {\langle G_{N} (σ) \rangle}_{σ} & \approx & \frac{\int E \exp (- E / (k_{B} T_{s})) n (E) d E}{\int \exp (- E / (k_{B} T_{s})) n (E) d E} \end{matrix}$

(40)

$\begin{matrix} = & \bar{E} (\bar{f (σ_{N})}) - {δ E}^{2} (\bar{f (σ_{N})}) / (k_{B} T_{s}) \end{matrix}$

(41)

$\begin{matrix} {\langle ψ_{N} (σ) \rangle}_{σ} & \equiv & \dfrac{\sum_{σ} ψ_{N} (σ) \exp (- ψ_{N} (σ))}{\sum_{σ} \exp (- ψ_{N} (σ))} \end{matrix}$

(42)

$\begin{matrix} \approx & \bar{ψ} (\bar{f (σ_{N})}) - {δ ψ}^{2} (\bar{f (σ_{N})}) \end{matrix}$
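The Gaussian results in Eqs. (40) and (42) can be checked numerically. The following is a minimal sketch, assuming that the density of states n(E) is a normal density with mean Ē and variance δE²; the function name, the grid settings and the parameter values are illustrative choices and are not taken from the paper.

```python
import math

def boltzmann_mean_gaussian(e_mean, e_var, kT, n_grid=20001, width=12.0):
    """Boltzmann-weighted mean of E for a Gaussian density of states n(E).

    Evaluates the ratio of integrals in Eq. (39) on a uniform grid
    (rectangle rule). Assumes n(E) is normal with mean e_mean and
    variance e_var.
    """
    sd = math.sqrt(e_var)
    # The integrand peaks near e_mean - e_var / kT, so the grid is centred there.
    centre = e_mean - e_var / kT
    lo = centre - width * sd
    step = 2.0 * width * sd / (n_grid - 1)
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        e = lo + i * step
        # log of n(E) * exp(-E / kT), up to an additive constant that
        # cancels in the ratio; the shift by `centre` limits overflow.
        log_w = -(e - e_mean) ** 2 / (2.0 * e_var) - (e - centre) / kT
        w = math.exp(log_w)
        num += e * w
        den += w
    return num / den

if __name__ == "__main__":
    # Illustrative values only.
    e_mean, e_var, kT = -10.0, 4.0, 0.5
    print(boltzmann_mean_gaussian(e_mean, e_var, kT))  # numerical average
    print(e_mean - e_var / kT)                         # closed form of Eq. (40)
```

Setting kT = 1 and reading E as ψ_N gives the corresponding check for Eq. (42).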

The ensemble averages of ΔG_ND(σ, T) and ψ_N(σ) over sequences are observable as the sample averages of ΔG_ND(σ_N, T) and ψ_N(σ_N) over homologous sequences fixed in protein evolution, respectively.

(43)

$\begin{matrix} \bar{Δ G_{N D} (σ_{N}, T)} / (k_{B} T_{s}) & = & {\langle Δ G_{N D} (σ, T) \rangle}_{σ} / (k_{B} T_{s}) \end{matrix}$

(44)

$\begin{matrix} \approx & {δ ψ}^{2} (\bar{f (σ_{N})}) [ϑ (T / T_{g}) T_{s} / T - 1] \end{matrix}$

(45)

$\begin{matrix} \bar{ψ_{N} (σ_{N})} & \equiv & \frac{\sum_{σ_{N}} w_{σ_{N}} ψ_{N} (σ_{N})}{\sum_{σ_{N}} w_{σ_{N}}} \end{matrix}$

(46)

$\begin{matrix} = & {\langle ψ_{N} (σ) \rangle}_{σ} \end{matrix}$

where the overline denotes a sample average with a sample weight $w_{σ_{N}}$ for each homologous sequence, which is used to reduce phylogenetic biases in the set of homologous sequences.

The folding free energy becomes equal to zero at the melting temperature T_m; ${\langle Δ G_{N D} (σ_{N}, T_{m}) \rangle}_{σ} = 0$. Thus, the following relationship must be satisfied (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b).

(47)

$\begin{matrix} ϑ (T_{m} / T_{g}) \frac{T_{s}}{T_{m}} & = & \frac{T_{s}}{2 T_{m}} (1 + \frac{T_{m}^{2}}{T_{g}^{2}}) = 1 \quad \text{with } T_{s} \leq T_{g} \leq T_{m} \end{matrix}$
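As a worked illustration of Eqs. (44) and (47), the sketch below solves Eq. (47) for T_g and then evaluates the ensemble-averaged folding free energy. It assumes that ϑ(T/T_g) = (1 + T²/T_g²)/2 also holds at the working temperature T whenever T ≥ T_g, which is the form read off from Eq. (47); the function names are hypothetical. The input numbers are the PDZ values reported in Tables 2 and 3.

```python
import math

K_B = 0.0019872  # Boltzmann constant per mole in kcal/(mol K)

def glass_transition_temperature(t_s, t_m):
    """Solve Eq. (47), (T_s / (2 T_m)) (1 + T_m^2 / T_g^2) = 1, for T_g."""
    ratio = 2.0 * t_m / t_s - 1.0
    if ratio <= 0.0:
        raise ValueError("Eq. (47) has no real solution unless T_s < 2 T_m")
    return t_m / math.sqrt(ratio)

def theta(x):
    """Assumed form of theta(T / T_g) for T >= T_g, taken from Eq. (47)."""
    return 0.5 * (1.0 + x * x)

def mean_folding_free_energy(delta_psi2, t_s, t_g, t, kb_ts=None):
    """Eq. (44) multiplied by k_B T_s, in kcal/mol.

    delta_psi2 is the variance term of Eq. (42) for the whole sequence.
    kb_ts defaults to K_B * t_s when no regression estimate is supplied.
    """
    if kb_ts is None:
        kb_ts = K_B * t_s
    return kb_ts * delta_psi2 * (theta(t / t_g) * t_s / t - 1.0)

if __name__ == "__main__":
    # PDZ: T_s = 140.7 K, T_m = 312.88 K, k_B T_s = 0.2794 kcal/mol,
    # variance per residue 3.1140 with L = 81 residues, T = 298 K.
    t_g = glass_transition_temperature(140.7, 312.88)
    dg = mean_folding_free_energy(3.1140 * 81, 140.7, t_g, 298.0, kb_ts=0.2794)
    print(round(t_g, 1), round(dg, 2))
```

With these inputs the sketch gives values close to the T_g and ⟨ΔG_ND⟩ entries listed for PDZ in Table 3.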

2.6. Probability distributions of selective advantage, fixation rate and K_a/K_s

Let us consider the probability distributions of characteristic quantities that describe the evolution of genes. First of all, the probability density function (PDF) of selective advantage s, p(s), of mutant genes can be calculated from the PDF of the change of Δψ_ND due to a mutation from μ to ν, $Δ Δ ψ_{N D} (\equiv Δ ψ_{N D} (ν, T) - Δ ψ_{N D} (μ, T))$. The PDF of 4N_es, $p (4 N_{e} s) = p (s) / (4 N_{e})$, may be more useful than p(s).

(48)

$\begin{matrix} p (4 N_{e} s) & = & p (Δ Δ ψ_{N D}) | \frac{d Δ Δ ψ_{N D}}{d 4 N_{e} s} | = p (Δ Δ ψ_{N D}) (1 - q_{m}) \end{matrix}$

where ΔΔψ_ND must be regarded as a function of 4N_es, that is, $Δ Δ ψ_{N D} = - 4 N_{e} s (1 - q_{m})$; see Eq. (33).

The PDF of fixation probability u can be represented by

(49)

$\begin{matrix} p (u) = p (4 N_{e} s) \frac{d 4 N_{e} s}{d u} = p (4 N_{e} s) \frac{{(e^{4 N_{e} s} - 1)}^{2} e^{4 N_{e} s (q_{m} - 1)}}{q_{m} (e^{4 N_{e} s} - 1) - (e^{4 N_{e} s q_{m}} - 1)} \end{matrix}$

where 4N_es must be regarded as a function of u.
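The Jacobian in Eq. (49) can be checked against a finite difference. The sketch below assumes the Kimura-type fixation probability u = (1 − e^{−4N_e s q_m}) / (1 − e^{−4N_e s}), whose neutral limit u(0) = q_m agrees with Eq. (50); this form is not written out in this section, and the function names are hypothetical.

```python
import math

def fixation_probability(x, q_m):
    """Fixation probability u as a function of x = 4 * N_e * s.

    Assumed form: u = (1 - exp(-x * q_m)) / (1 - exp(-x)), with the
    neutral limit u(0) = q_m.
    """
    if abs(x) < 1e-9:
        return q_m
    return math.expm1(-x * q_m) / math.expm1(-x)

def d4Nes_du(x, q_m):
    """Jacobian d(4 N_e s)/du as written in Eq. (49); singular at x = 0."""
    num = math.expm1(x) ** 2 * math.exp(x * (q_m - 1.0))
    den = q_m * math.expm1(x) - math.expm1(x * q_m)
    return num / den

def ka_over_ks(x, q_m):
    """Eq. (50): K_a/K_s = u(s) / u(0) = u(s) / q_m."""
    return fixation_probability(x, q_m) / q_m

if __name__ == "__main__":
    x, q_m, h = 2.0, 0.01, 1e-5  # illustrative values
    du_dx = (fixation_probability(x + h, q_m)
             - fixation_probability(x - h, q_m)) / (2.0 * h)
    # The product of du/dx and dx/du should be close to 1.
    print(du_dx * d4Nes_du(x, q_m))
```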

The ratio of the substitution rate per nonsynonymous site (K_a) for nonsynonymous substitutions with selective advantage s to the substitution rate per synonymous site (K_s) for synonymous substitutions with s = 0 is

(50)

$\begin{matrix} \frac{K_{a}}{K_{s}} & = & \frac{u (s)}{u (0)} = \frac{u (s)}{q_{m}} \end{matrix}$

assuming that synonymous substitutions are completely neutral and mutation rates at both types of sites are the same. The PDF of K_a/K_s is

(51)

$\begin{matrix} p (K_{a} / K_{s}) & = & p (u) \frac{d u}{d (K_{a} / K_{s})} = p (u) q_{m} \end{matrix}$

2.7. Probability distributions of ΔΔψ_ND, 4N_es, u, and K_a/K_s in fixed mutant genes

The PDF of ΔΔψ_ND in fixed mutants is proportional to the PDF of ΔΔψ_ND in all arising mutants multiplied by the fixation probability.

(52)

$\begin{matrix} p (Δ Δ ψ_{N D, \mathrm{fixed}}) & = & p (Δ Δ ψ_{N D}) \frac{u (s (Δ Δ ψ_{N D}))}{\langle u (s (Δ Δ ψ_{N D})) \rangle} \end{matrix}$

(53)

$\begin{matrix} \langle u \rangle & \equiv & \int_{- \infty}^{\infty} u (s) p (Δ Δ ψ_{N D}) d Δ Δ ψ_{N D} \end{matrix}$

Likewise, the PDF of selective advantage in fixed mutants is

(54)

$\begin{matrix} p (4 N_{e} s_{\mathrm{fixed}}) & = & p (4 N_{e} s) \frac{u (s)}{\langle u (s) \rangle} \end{matrix}$

and those of u and K_a/K_s in fixed mutants are

(55)

$\begin{matrix} p (u_{\mathrm{fixed}}) & = & p (u) \frac{u}{\langle u \rangle} \end{matrix}$

(56)

$\begin{matrix} p ({(\frac{K_{a}}{K_{s}})}_{\mathrm{fixed}}) & = & p (\frac{K_{a}}{K_{s}}) \frac{u}{\langle u \rangle} = p (\frac{K_{a}}{K_{s}}) \frac{\frac{K_{a}}{K_{s}}}{\langle \frac{K_{a}}{K_{s}} \rangle} \end{matrix}$

The average of K_a/K_s in fixed mutants is equal to the ratio of the second moment to the first moment of K_a/K_s in all arising mutants; ${\langle K_{a} / K_{s} \rangle}_{\mathrm{fixed}} = \langle {(K_{a} / K_{s})}^{2} \rangle / \langle K_{a} / K_{s} \rangle$.
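The moment identity above follows from weighting each arising mutant by u/⟨u⟩, as in Eq. (56). The sketch below illustrates it on a small, made-up set of ΔΔψ_ND values; it assumes the same Kimura-type fixation probability as in the neutral limit of Eq. (50) and the mapping ΔΔψ_ND = −4N_e s (1 − q_m). All names and input values are hypothetical.

```python
import math

def fixation_probability(x, q_m):
    """Assumed fixation probability as a function of x = 4 * N_e * s."""
    if abs(x) < 1e-9:
        return q_m
    return math.expm1(-x * q_m) / math.expm1(-x)

def ka_over_ks_from_ddpsi(ddpsi, q_m):
    """K_a/K_s of one mutant, using 4 N_e s = -ddpsi / (1 - q_m)."""
    x = -ddpsi / (1.0 - q_m)
    return fixation_probability(x, q_m) / q_m

def fixed_mutant_average(ddpsi_values, q_m):
    """Return (<Ka/Ks>_fixed, <(Ka/Ks)^2> / <Ka/Ks>) for equally likely mutants."""
    ratios = [ka_over_ks_from_ddpsi(d, q_m) for d in ddpsi_values]
    n = len(ratios)
    mean_all = sum(ratios) / n
    # Weight of each arising mutant among fixed mutants: u / <u>, Eq. (56).
    weights = [r / mean_all for r in ratios]
    mean_fixed = sum(w * r for w, r in zip(weights, ratios)) / n
    second_moment = sum(r * r for r in ratios) / n
    return mean_fixed, second_moment / mean_all

if __name__ == "__main__":
    ddpsi = [-2.0, -0.5, 0.0, 1.0, 3.0, 6.0]  # illustrative values only
    print(fixed_mutant_average(ddpsi, q_m=1e-3))
```

Because the weights are proportional to K_a/K_s itself, the average over fixed mutants is never smaller than the average over all arising mutants.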

3. Materials

3.1. Sequence data

We study the single domains of 8 Pfam (Finn et al., 2016) families and both the single domains and multi-domains from 3 Pfam families. Table 1 lists their Pfam ID for a multiple sequence alignment, and their UniProt ID and PDB ID with the starting- and ending-residue positions of the domains. The full alignments for their families at the Pfam are used to estimate one-body interactions h and pairwise interactions J with the DCA program from “http://dca.rice.edu/portal/dca/home” (Marks et al., 2011; Morcos et al., 2011). To estimate the sample ($\bar{ψ_{N}}$) and ensemble (⟨ψ_N⟩_σ) averages of the evolutionary statistical energy, M unique sequences with no deletions are used. In order to reduce phylogenetic biases in the set of homologous sequences, we employ a sample weight ($w_{σ_{N}}$) for each sequence, which is equal to the inverse of the number of sequences that are less than 20% different from a given sequence in a given set of homologous sequences. Only representatives of unique sequences with no deletions, which are at least 20% different from each other, are used to calculate the changes of the evolutionary statistical energy (Δψ_N) due to single nucleotide nonsynonymous substitutions; the number of the representatives is almost equal to the effective number of sequences (M_eff) in Table 1.
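The sample weighting described above can be sketched as follows. This is a minimal illustration, assuming that the difference between two aligned sequences is the fraction of mismatching alignment positions, that a sequence counts itself among its neighbours, and that the effective number of sequences is the sum of the weights; the function names are hypothetical and the quadratic loop is only suitable for small alignments.

```python
def fraction_different(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def sample_weights(seqs, threshold=0.2):
    """Weight = 1 / (number of sequences less than `threshold` different)."""
    weights = []
    for s in seqs:
        # The sequence itself is 0% different, so n_close is at least 1.
        n_close = sum(1 for t in seqs if fraction_different(s, t) < threshold)
        weights.append(1.0 / n_close)
    return weights

def effective_number(seqs, threshold=0.2):
    """Effective number of sequences, taken here as the sum of the weights."""
    return sum(sample_weights(seqs, threshold))

if __name__ == "__main__":
    toy_alignment = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]  # made-up data
    print(sample_weights(toy_alignment))
    print(effective_number(toy_alignment))
```

In this toy alignment the first two sequences differ at one position in ten, so they share their weight, while the third sequence keeps a weight of one.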

Table 1. Protein families, and structures studied.

Pfam family	UniProt ID	N^a	N_eff^b,c	M^d	M_eff^c,e	L^f	PDB ID
HTH_3	RPC1_BP434/7-59	15315(15917)	11691.21	6286	4893.73	53	1R69-A:6-58
Nitroreductase	Q97IT9_CLOAB/4-76	6008(6084)	4912.96	1057	854.71	73	3E10-A/B:4-76^g
SBP_bac_3^h	GLNH_ECOLI/27-244	9874(9972)	7374.96	140	99.70	218	1WDN-A:5-222
SBP_bac_3	GLNH_ECOLI/111-204	9712(9898)	7442.85	829	689.64	94	1WDN-A:89-182
OmpA	PAL_ECOLI/73-167	6035(6070)	4920.44	2207	1761.24	95	1OAP-A:52-146
DnaB	DNAB_ECOLI/31-128	1929(1957)	1284.94	1187	697.30	98	1JWE-A:30-127
LysR_substrate^h	BENM_ACIAD/90-280	25138(25226)	20707.06	85(1)	67.00	191	2F6G-A/B:90-280^g
LysR_substrate	BENM_ACIAD/163-265	25032(25164)	21144.74	121(1)	99.27	103	2F6G-A/B:163-265^g
Methyltransf_5^h	RSMH_THEMA/8-292	1942(1953)	1286.67	578(2)	357.97	285	1N2X-A:8-292
Methyltransf_5	RSMH_THEMA/137-216	1877(1911)	1033.35	975(2)	465.53	80	1N2X-A:137-216
SH3_1	SRC_HUMAN:90-137	9716(16621)	3842.47	1191	458.31	48	1FMK-A:87-134
ACBP	ACBP_BOVIN/3-82	2130(2526)	1039.06	161	70.72	80	2ABD-A:2-81
PDZ	PTN13_MOUSE/1358-1438	13814(23726)	4748.76	1255	339.99	81	1GM1-A:16-96
Copper-bind	AZUR_PSEAE:24-148	1136(1169)	841.56	67(1)	45.23	125	5AZU-B/C:4-128^g

a: The number of unique sequences and the total number of sequences in parentheses; the full alignments in the Pfam (Finn et al., 2016) are used.
b: The effective number of sequences.
c: A sample weight ($w_{σ_{N}}$) for a given sequence is equal to the inverse of the number of sequences that are less than 20% different from the given sequence.
d: The number of unique sequences that include no deletion unless specified. The number in parentheses indicates the maximum number of deletions allowed.
e: The effective number of unique sequences that include no deletion or at most the specified number of deletions.
f: The number of residues.
g: Contacts are calculated in the homodimeric state for these proteins.
h: These proteins consist of two domains; the others are single domains.

4. Results

First, we describe how one-body and pairwise interactions, h and J, are estimated. Then, the changes of evolutionary statistical energy (Δψ_N) due to single nucleotide nonsynonymous changes on natural sequences are analyzed with respect to dependences on the ψ_N of the wildtype sequences. The results indicate that the standard deviation of ΔG_N ≃ k_BT_sΔψ_N is almost constant over protein families. Hence, the selective temperatures, T_s, of various protein families can be estimated in a relative scale from the standard deviation of Δψ_N. The T_s of a reference protein is estimated by comparing the expected values of ΔΔG_ND with their experimental values. Folding free energies ΔG_ND are estimated from the estimated T_s and the experimental melting temperature T_m, and compared with their experimental values for 5 protein families. Glass transition temperatures T_g are also estimated from T_s and T_m.

Secondly, based on the distribution of Δψ_N, protein evolution is studied. Evolutionary statistical energy (ψ_N) attains the equilibrium when the average of Δψ_N over fixed mutations is equal to zero. The PDF of Δψ_N is approximated by log-normal distributions. The basic relationships are that 1) the standard deviation of Δψ_N is constant specific to a protein family, and 2) the mean of Δψ_N linearly depends on ψ_N. The equilibrium value of ψ_N is shown to agree with the mean of ψ_N over homologous proteins in each protein family. In the present approximation, the standard deviation of Δψ_N and selective temperature T_s at the equilibrium are simple functions of the equilibrium value of mean Δψ_N, ${\bar{Δ ψ_{N}}}^{\mathrm{eq}}$. Lastly, the probability distribution of K_a/K_s, which is the ratio of nonsynonymous to synonymous substitution rate per site, is analyzed as a function of ${\bar{Δ ψ_{N}}}^{\mathrm{eq}}$, in order to examine how significant neutral selection is in the selection maintaining protein stability and foldability. Also, it is confirmed that selective temperature T_s negatively correlates with the mean of K_a/K_s, which represents the evolutionary rate of a protein.

4.1. Important parameters in the estimations of one-body and pairwise interactions, h and J, and of the evolutionary statistical energy, ψ_N(σ)

The one-body (h) and pairwise (J) interactions for amino acid order in a protein sequence are estimated here by the DCA method (Marks et al., 2011; Morcos et al., 2011), although there are multiple methods for estimating them (Ekeberg et al., 2014; 2013). In the case of the DCA method, the ratio of pseudocount (0 ≤ p_c ≤ 1) defined in Eqs. (S.70) and (S.71) is a parameter and controls the values of the ensemble and sample averages of ψ_N in sequence space, ⟨ψ_N(σ)⟩_σ in Eq. (42) and $\bar{ψ_{N} (σ_{N})}$ in Eq. (45); a weight for observed counts is defined to be equal to $(1 - p_{c})$. Sample average means the average over all homologous sequences with a weight for each sequence to reduce phylogenetic biases. An appropriate value must be chosen for the ratio of pseudocount in a reasonable manner.

Another problem is that the estimates of h and J (Marks et al., 2011; Morcos et al., 2011) may be noisy as a result of estimating many interaction parameters from a relatively small number of sequences. Therefore, only pairwise interactions within a certain distance are taken into account; the estimate of J is modified as follows, according to Morcos et al. (2014).

(57)

$\begin{matrix} {\hat{J}}_{i j}^{q} (a_{k}, a_{l}) & = & J_{i j}^{q} (a_{k}, a_{l}) H (r_{\mathrm{cutoff}} - r_{i j}) \end{matrix}$

where J^q is the statistical estimate of J in the mean field approximation in which the amino acid a_q is the reference state, H is the Heaviside step function, r_ij is the distance between the centers of amino acid side chains at sites i and j in a protein structure, and r_cutoff is a distance threshold for residue pairwise interactions. The one-body interactions h_i(a_k) are estimated in the isolated two-state model (Morcos et al., 2011) rather than the mean field approximation; see the Method section in the Supplement for details. The zero-sum gauge is employed to represent h and J, that is, $\sum_{k} {\hat{h}}_{i}^{s} (a_{k}) = \sum_{k} \sum_{l} {\hat{J}}_{i j}^{s} (a_{k}, a_{l}) = 0$.
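The distance cutoff of Eq. (57) and the evaluation of the evolutionary statistical energy can be sketched on a toy model. The code below assumes the sign convention ψ_N(σ) = −(Σ_i h_i(σ_i) + Σ_{i<j} Ĵ_ij(σ_i, σ_j)), which matches the Boltzmann factor exp(−ψ_N) used for the sequence ensemble, and takes H(0) = 1; the data layout (dictionaries keyed by site pairs) and all function names are hypothetical, and the numbers are made up.

```python
def apply_distance_cutoff(J, dist, r_cutoff):
    """Eq. (57): keep J_ij only for site pairs with r_ij <= r_cutoff.

    J maps (i, j) with i < j to a dict {(a, b): coupling};
    dist maps (i, j) to the side-chain centre distance in angstroms.
    """
    return {pair: c for pair, c in J.items() if dist[pair] <= r_cutoff}

def psi_native(seq, h, J_hat):
    """Evolutionary statistical energy under the assumed sign convention."""
    one_body = sum(h[i][a] for i, a in enumerate(seq))
    pairwise = sum(c[(seq[i], seq[j])] for (i, j), c in J_hat.items())
    return -(one_body + pairwise)

if __name__ == "__main__":
    # Toy two-letter model with three sites, written in the zero-sum gauge.
    h = [{"A": 0.5, "B": -0.5} for _ in range(3)]
    coupling = {("A", "A"): 1.0, ("A", "B"): -1.0,
                ("B", "A"): -1.0, ("B", "B"): 1.0}
    J = {(0, 1): coupling, (0, 2): coupling, (1, 2): coupling}
    dist = {(0, 1): 5.0, (0, 2): 12.0, (1, 2): 7.0}

    J_hat = apply_distance_cutoff(J, dist, r_cutoff=8.0)
    print(sorted(J_hat))                 # site pairs retained by the cutoff
    print(psi_native("AAB", h, J_hat))
```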

Candidates for the cutoff distance may be about 8 Å for the first interaction shell and 15–16 Å for the second interaction shell between residues; the distance between the centers of side chain atoms is employed as the residue distance. Here both distances are tested as the cutoff distance. A pseudocount in Bayesian statistics is usually determined as a function of the number of samples (sequences), although the ratio of pseudocount $p_{c} = 0.5$ was used for all proteins in the contact prediction (Morcos et al., 2011). Here, an appropriate value for the ratio of pseudocount for a given cutoff distance, either about 8 Å or 15–16 Å, is chosen for each protein family in such a way that the sample average of the evolutionary statistical energies is equal to the ensemble average, $\bar{ψ_{N}} = {\langle ψ_{N} \rangle}_{σ}$; see Eqs. (42) and (46). As shown in Fig. S.1, the value of r_cutoff at which $\bar{ψ_{N}} = {\langle ψ_{N} \rangle}_{σ}$ is satisfied changes monotonically as a function of the ratio of pseudocount p_c. The values of p_c at which $\bar{ψ_{N}} = {\langle ψ_{N} \rangle}_{σ}$ is satisfied near the specified values of r_cutoff, 8 Å and 15.5 Å, are employed for r_cutoff ≃ 8 Å and 15.5 Å, respectively. In the present multiple sequence alignment for the PDZ domain, with the ratios of pseudocount $p_{c} = 0.205$ and $p_{c} = 0.33$, the sample and ensemble averages agree with each other at the cutoff distances r_cutoff ∼ 8 Å and r_cutoff ∼ 15.5 Å, respectively; see Fig. S.1. In Fig. S.2, the reflective correlation and regression coefficients between the experimental ΔΔG_ND (Gianni et al., 2007) and Δψ_N due to single amino acid substitutions are plotted against the cutoff distance for pairwise interactions in the PDZ domain. The reflective correlation coefficient has its maximum at r_cutoff ∼ 8 Å for $p_{c} = 0.205$ and at r_cutoff ∼ 15.5 Å for $p_{c} = 0.33$, indicating that these cutoff distances are appropriate for these ratios of pseudocount. The ratio of pseudocount and the cutoff distance employed are listed for each protein family in Tables 2 and S.5 for r_cutoff ∼ 8 and 15.5 Å, respectively. The ratios of pseudocount employed here are all smaller than 0.5, which was reported to be appropriate for contact prediction; by using strong regularization, contact prediction is improved but the generative power of the inferred model is degraded (Barton et al., 2016). In the text, only results with r_cutoff ∼ 8 Å are shown. In a supplement, results with r_cutoff ∼ 15.5 Å are provided and discussed in comparison with the results of r_cutoff ∼ 8 Å.

Table 2. Parameter values for r_cutoff ∼ 8 Å employed for each protein family, and the averages of the evolutionary statistical energies ($\bar{ψ_{N}}$) over all homologous sequences and of the means and the standard deviations of interaction changes ($\bar{\bar{Δ ψ_{N}}}$ and $\bar{Sd (Δ ψ_{N})}$) due to single nucleotide nonsynonymous mutations at all sites over all homologous sequences in each protein family.

Pfam family	L	p_c	n_c^a	r_cutoff	$\bar{ψ} / L$^b	δψ²/L^b	$\bar{ψ_{N}} / L$^b	$\bar{\bar{Δ ψ_{N}}}$^c	$\bar{Sd (Δ ψ_{N})} \pm$^c	$r_{ψ_{N}}$	$α_{ψ_{N}}$	$r_{ψ_{N}}$	$α_{ψ_{N}}$
				(Å)					Sd(Sd(Δψ_N))	for $\bar{Δ ψ_{N}}$^d		for Sd(Δψ_N)^e
HTH_3	53	0.18	7.43	8.22	$-0.1997$	2.7926	$-2.9861$	4.2572	5.3503 ± 0.5627	$-0.961$	$-1.5105$	$-0.598$	$-0.9888$
Nitroreductase	73	0.23	6.38	8.25	$-0.1184$	2.1597	$-2.2788$	3.3115	3.6278 ± 0.2804	$-0.939$	$-1.3371$	$-0.426$	$-0.3721$
SBP_bac_3	218	0.25	9.23	8.10	$-0.1000$	2.1624	$-2.2618$	3.2955	3.4496 ± 0.2742	$-0.980$	$-1.5286$	$-0.841$	$-0.7876$
SBP_bac_3	94	0.37	8.00	7.90	$-0.1634$	1.2495	$-1.4054$	1.9291	2.3436 ± 0.1901	$-0.959$	$-1.3938$	$-0.634$	$-0.4815$
OmpA	95	0.169	8.00	8.20	$-0.2457$	3.9093	$-4.1542$	6.5757	7.6916 ± 0.3078	$-0.957$	$-1.5694$	$-0.410$	$-0.3804$
DnaB	98	0.235	9.65	8.17	$-0.2284$	3.9976	$-4.2291$	6.3502	6.1244 ± 0.3245	$-0.965$	$-1.4509$	$-0.495$	$-0.4198$
LysR_substrate	191	0.235	8.59	7.98	$-0.2241$	1.4888	$-1.7173$	2.2784	2.6519 ± 0.1445	$-0.964$	$-1.3347$	$-0.541$	$-0.5664$
LysR_substrate	103	0.265	8.84	8.25	$-0.2244$	1.4144	$-1.6379$	2.2110	2.7371 ± 0.2055	$-0.982$	$-1.4159$	$-0.727$	$-0.5307$
Methyltransf_5	285	0.13	7.99	7.78	$-0.1462$	7.2435	$-7.3887$	12.4689	10.9352 ± 0.3030	$-0.981$	$-1.9140$	$-0.122$	$-0.0783$
Methyltransf_5	80	0.18	6.78	7.85	$-0.1763$	5.5162	$-5.6896$	8.9849	7.6133 ± 0.4382	$-0.944$	$-1.4824$	0.125	0.1141
SH3_1	48	0.14	6.42	8.01	$-0.1348$	3.9109	$-4.0434$	5.5792	6.1426 ± 0.2935	$-0.919$	$-1.4061$	$-0.196$	$-0.1718$
ACBP	80	0.22	9.17	8.24	$-0.0525$	4.6411	$-4.7084$	7.7612	7.1383 ± 0.2970	$-0.972$	$-1.5884$	$-0.335$	$-0.2235$
PDZ	81	0.205	9.06	8.16	$-0.2398$	3.1140	$-3.3572$	4.7589	4.6605 ± 0.2255	$-0.954$	$-1.5282$	$-0.369$	$-0.3042$
Copper-bind	125	0.23	9.50	8.27	$-0.0940$	4.2450	$-4.3272$	7.2650	6.9283 ± 0.2316	$-0.980$	$-1.8915$	$-0.282$	$-0.2352$

a: The average number of contact residues per site within the cutoff distance; the center of side chain is used to represent a residue.
b: M unique sequences with no deletions are used with a sample weight ($w_{σ_{N}}$) for each sequence; $w_{σ_{N}}$ is equal to the inverse of the number of sequences that are less than 20% different from a given sequence. The M and the effective number M_eff of the sequences are listed for each protein family in Table 1.
c: The averages of $\bar{Δ ψ_{N}}$ and Sd(Δψ_N), which are the mean and the standard deviation of Δψ_N for a sequence, and the standard deviation of Sd(Δψ_N) over homologous sequences. Representatives of unique sequences with no deletions, which are at least 20% different from each other, are used; the number of the representatives used is almost equal to M_eff.
d: The correlation and regression coefficients of $\bar{Δ ψ_{N}}$ on ψ_N/L; see Eq. (62).
e: The correlation and regression coefficients of Sd(Δψ_N) on ψ_N/L.

4.2. Changes of the evolutionary statistical energy, Δψ_N, by single nucleotide nonsynonymous substitutions

The changes of the evolutionary statistical energy, Δψ_N and Δψ_D, due to a single amino acid substitution from $σ_{i}^{N}$ to σ_i at site i in a natural sequence σ_N are defined as

(58)

$\begin{matrix} Δ ψ_{N} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i}) & \equiv & ψ_{N} (σ_{j \neq i}^{N}, σ_{i}) - ψ_{N} (σ_{N}) \end{matrix}$

(59)

$\begin{matrix} Δ ψ_{D} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i}, T) & \equiv & ψ_{D} (σ_{j \neq i}^{N}, σ_{i}, T) - ψ_{D} (σ_{N}, T) \end{matrix}$

(60)

$\begin{matrix} Δ Δ ψ_{N D} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i}) & \equiv & Δ ψ_{N} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i}) - Δ ψ_{D} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i}) \end{matrix}$

(61)

$\begin{matrix} ≃ & Δ ψ_{N} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i}) \quad \text{because } f (σ_{N}) \approx f (σ_{j \neq i}^{N}, σ_{i}) \end{matrix}$

Here, single amino acid substitutions caused by single nucleotide nonsynonymous mutations are taken into account, unless specified. Let us use a single overline to denote the average of the changes of interaction over all types of single nucleotide nonsynonymous mutations at all sites in a specific native sequence, and a double overline to denote their averages over all homologous sequences in a protein family.
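The single-overline statistics defined above can be sketched for one sequence as follows. This is a simplified illustration, assuming that every amino acid replacement at every site is enumerated; the analysis in the text is restricted to replacements reachable by single nucleotide nonsynonymous mutations, which would require the codon sequence and the genetic code. The energy function is passed in as a callable, the names are hypothetical, and the use of the population standard deviation is also an assumption.

```python
import statistics

def delta_psi_all_single_substitutions(seq, alphabet, psi):
    """Changes of psi, Eq. (58), for all single amino acid replacements.

    `psi` is any callable returning the evolutionary statistical energy
    of a sequence given as a string.
    """
    psi_wildtype = psi(seq)
    changes = []
    for i, wildtype_residue in enumerate(seq):
        for a in alphabet:
            if a == wildtype_residue:
                continue
            mutant = seq[:i] + a + seq[i + 1:]
            changes.append(psi(mutant) - psi_wildtype)
    return changes

def mean_and_sd(values):
    """Mean and (population) standard deviation over substitutions."""
    return statistics.fmean(values), statistics.pstdev(values)

if __name__ == "__main__":
    # Made-up energy: each 'A' lowers psi by one unit.
    toy_psi = lambda s: -float(s.count("A"))
    changes = delta_psi_all_single_substitutions("AB", "ABC", toy_psi)
    print(changes)
    print(mean_and_sd(changes))
```

Averaging the per-sequence mean and standard deviation over the representative homologous sequences would then give the double-overline quantities.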

We calculated the ψ_N of the wildtype and Δψ_N due to all types of single nucleotide nonsynonymous substitutions for all homologous sequences, and their means and variances. We have examined the dependence of $\bar{Δ Δ ψ_{N D}} ≃ \bar{Δ ψ_{N}}$ on the ψ_N of each homologous sequence in each protein family. Fig. 1 for the PDZ family and Figs. S.3 to S.13 for all proteins show that $\bar{Δ ψ_{N}}$ decreases linearly with the ψ_N/L of the wildtype, that is,

(62)

$\begin{matrix} \bar{Δ Δ ψ_{N D} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i})} ≃ \bar{Δ ψ_{N} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i})} \\ \approx α_{ψ_{N}} \frac{ψ_{N} (σ_{N}) - \bar{ψ_{N} (σ_{N})}}{L} + \bar{\bar{Δ ψ_{N} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i})}} \\ \text{with } α_{ψ_{N}} < 0 \end{matrix}$

where L is sequence length. This relationship is found in all of the protein families examined here; the correlation and regression coefficients for r_cutoff ∼ 8 and 15.5 Å are listed in Tables 2 and S.5, respectively. Most of the correlation coefficients are larger than 0.95 in absolute value, and all are larger than 0.9 in absolute value. It is reasonable that the change of the evolutionary statistical energy (Δψ_N) depends on interaction per residue (ψ_N/L) rather than the evolutionary statistical energy (ψ_N), because interactions change only for one residue substituted in the sequence. Note that the average interactions including a single residue will be equal to 2ψ_N/L if all interactions are two-body. The important fact is that the linear dependence of Δψ_N on ψ_N/L shown in Fig. 1 and Tables 2 and S.5 is equivalent to the linear dependence of free energy changes caused by single amino acid substitutions on the native conformational energy of the wildtype protein, because the selective temperatures T_s of homologous sequences in a protein family are approximated to be equal.

Is the same type of dependence on ψ_N/L found for the standard deviation of Δψ_N over single nucleotide nonsynonymous substitutions at all sites? Fig. 1, Figs. S.3 to S.13 and Tables 2 and S.5 show that the correlation between the standard deviation of Δψ_N and ψ_N of the wildtype is very weak except for the Nitroreductase, SBP_bac_3 and LysR_substrate families. Even for these protein families, the standard deviations of Sd(Δψ_N) are less than 10% of the mean, $\bar{Sd (Δ ψ_{N})}$; see Tables 2 and S.5. Thus, it is indicated that in general the variance/standard deviation of Δψ_N due to single amino acid substitutions is almost constant irrespective of the ψ_N across homologous sequences. The standard deviation of Sd(Δψ_N) is relatively large for HTH_3, because in Fig. S.3 there is a minor sequence group whose value of Sd(Δψ_N) is distinguishable from that of the major sequence group.

4.3. Effective temperature T_s of selection estimated from the changes of interaction, Δψ_N, by single nucleotide nonsynonymous substitutions

In the previous section, it has been shown that the standard deviation of Δψ_N hardly depends on ψ_N of the wildtype and is nearly constant across homologous sequences in every protein family that has its own characteristic temperature (T_s) for selection pressure, indicating that Sd(Δψ_N) must be approximated by a function of only k_BT_s. On the other hand, the free energy of the native structure, ΔG_N, must not explicitly depend on k_BT_s, although it may be approximated by a function of G_N. In other words, the following relationships are derived.

(63)

$\begin{matrix} Sd (Δ ψ_{N} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i})) & \approx & \text{independent of } ψ_{N} \text{ and constant across homologous sequences in every protein family} \\ & = & \text{function of } k_{B} T_{s} \end{matrix}$

(64)

$\begin{matrix} Sd (Δ G_{N} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i})) & = & \text{function that must not explicitly depend on } k_{B} T_{s} \text{ but on } G_{N} \end{matrix}$

From the equations above, we obtain the important relation that the standard deviation of $Δ G_{N} (= k_{B} T_{s} Δ ψ_{N})$ does not depend on G_N and is nearly constant irrespective of protein families.

(65)

$\begin{matrix} Sd (Δ G_{N} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i})) & ≃ & k_{B} T_{s} Sd (Δ ψ_{N} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i})) \\ & \approx & \text{constant} \end{matrix}$

This relationship is consistent with the observation that the standard deviation of ΔΔG_ND( ≃ ΔG_N) is nearly constant irrespective of protein families (Tokuriki et al., 2007). This relationship allows us to estimate a selective temperature (T_s) for a protein family in a scale relative to that of a reference protein from the ratio of the standard deviations of Δψ_N. The PDZ family is employed here as a reference protein, and its T_s is estimated by a direct comparison of Δψ_N and experimental ΔΔG_ND. It is chosen because, among the protein families with experimental data in the present set, SH3_1 (Grantcharova et al., 1998), ACBP (Kragelund et al., 1999), PDZ (Gianni et al., 2005; 2007), and Copper-bind (Wilson and Wittung-Stafshede, 2005), the amino acid pair types and site locations of single amino acid substitutions are the most diverse and the correlation between the experimental ΔΔG_ND and Δψ_N is the best for the PDZ family; see Tables 3 and S.6.

(66)

$\begin{matrix} k_{B} {\hat{T}}_{s} & = & k_{B} {\hat{T}}_{s, \mathrm{PDZ}} [\bar{Sd (Δ ψ_{\mathrm{PDZ}} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i}))} / \bar{Sd (Δ ψ_{N} (σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i}))}] \end{matrix}$

where the overline denotes the average over all homologous sequences. Here, the averages of standard deviations over all homologous sequences are employed, because T_s for all homologous sequences are approximated to be equal. It will be confirmed in the later section, “the equilibrium value of ψ_N in protein evolution”, that the assumption of the constant value specific to each protein family for Sd(Δψ_N) is appropriate.
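Eq. (66) amounts to a simple rescaling of the reference temperature. The sketch below applies it to three families using the averaged standard deviations listed in Table 2 and the PDZ selective temperature listed in Table 3; the function name is hypothetical.

```python
def relative_selective_temperature(t_s_ref, sd_ref, sd_family):
    """Eq. (66): T_s = T_s,ref * Sd(dpsi)_ref / Sd(dpsi)_family."""
    if sd_family <= 0.0:
        raise ValueError("standard deviation must be positive")
    return t_s_ref * sd_ref / sd_family

if __name__ == "__main__":
    T_S_PDZ = 140.7    # K, reference value for PDZ (Table 3)
    SD_PDZ = 4.6605    # averaged Sd of the energy changes for PDZ (Table 2)
    averaged_sd = {    # averaged Sd of the energy changes (Table 2)
        "HTH_3": 5.3503,
        "Nitroreductase": 3.6278,
        "OmpA": 7.6916,
    }
    for family, sd in averaged_sd.items():
        t_s = relative_selective_temperature(T_S_PDZ, SD_PDZ, sd)
        print(family, round(t_s, 1))
```

The resulting values agree with the corresponding T_s entries of Table 3 to within rounding of the tabulated inputs.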

Table 3. Thermodynamic quantities estimated with r_cutoff ∼ 8 Å.

				Experimental
Pfam family	r^a	$k_{B} {\hat{T}}_{s}$^a	${\hat{T}}_{s}$	T_m	${\hat{T}}_{g}$	$\hat{ω}$^b	T^c	⟨ΔG_ND⟩^d
		(kcal/mol)	(K)	(K)	(K)	(k_B)	(K)	(kcal/mol)
HTH_3	–	–	122.6	343.7	160.1	0.8182	298	$-2.95$
Nitroreductase	–	–	180.7	337	204.0	0.8477	298	$-2.81$
SBP_bac_3	–	–	190.1	336.1	211.0	0.8771	298	$-8.03$
SBP_bac_3	–	–	279.8	336.1	283.8	0.6072	298	$-.85$
OmpA	–	–	85.2	320	125.4	0.9027	298	$-3.13$
DnaB	–	–	107.1	312.8	142.1	1.1341	298	$-2.56$
LysR_substrate	–	–	247.3	338	256.7	0.6908	298	$-3.63$
LysR_substrate	–	–	239.6	338	250.4	0.6472	298	$-2.00$
Methyltransf_5	–	–	60.0	375	110.5	1.0656	298	$-41.36$
Methyltransf_5	–	–	86.1	375	135.1	1.1214	298	$-11.48$
SH3_1	0.865	0.1583	106.7	344	147.4	1.0253	295	$-3.76$
ACBP	0.825	0.1169	91.9	324.4	131.7	1.1281	278	$-6.72$
PDZ	0.931	0.2794	140.7	312.88	168.5	1.0854	298	$-1.81$
Copper-bind	0.828	0.1781	94.6	359.3	139.9	0.9709	298	$-12.07$

a: Reflective correlation (r) and regression ($k_{B} \hat{T}_{s}$) coefficients for least-squares regression lines of experimental ΔΔG_ND on Δψ_N through the origin.
b: Conformational entropy per residue, in k_B units, in the denatured molten-globule state; see Eq. (18).
c: For comparison, temperatures are set equal to the experimental temperatures for ΔG_ND, or to 298 K if unavailable; see Table S.4 for the experimental data.
d: Folding free energy in kcal/mol units; see Eq. (44).

4.4. A direct comparison of the changes of interaction, Δψ_N( ≃ ΔΔψ_ND), with the experimental ΔΔG_ND due to single amino acid substitutions

In order to determine T_s for the reference protein, the experimental values (Gianni et al., 2007) of ΔΔG_ND due to single amino acid substitutions in the PDZ domain are plotted against the changes of interaction, Δψ_N, for the same types of substitutions in Figs. 2 and S.14. The slope of the least-squares regression line through the origin, which is an estimate of k_BT_s, is $k_{B} \hat{T}_{s} = 0.279$ kcal/mol, and the reflective correlation coefficient is 0.93. This estimate of k_BT_s for the PDZ yields $\bar{Sd(ΔΔG_{ND})} ≃ k_{B} \hat{T}_{s} \bar{Sd(Δψ_{N})} = 1.30$ kcal/mol, which corresponds to 76% of the 1.7 kcal/mol (Serohijos et al., 2012) estimated from the ProTherm database, or 80% of the 1.63 kcal/mol (Tokuriki et al., 2007) computationally predicted for single nucleotide mutations by using FoldX. Using $\bar{Sd(ΔΔG_{ND})} = 1.30$ kcal/mol estimated from the T_s for PDZ, the absolute values of T_s for other proteins are calculated by Eq. (66) and listed in Table 3; see Table S.6 for r_cutoff ∼ 15.5 Å. The T_s estimated with r_cutoff ∼ 8 and 15.5 Å are compared with each other in Fig. S.15. Morcos et al. (2014) estimated T_s by comparing Δψ_ND with ΔG_ND estimated by the associative-memory, water-mediated, structure, and energy model (AWSEM). They estimated ψ_N with $r_{cutoff} = 16$ Å and probably $p_{c} = 0.5$. In Fig. S.16, the present estimates of T_s are compared with those by Morcos et al. (2014). With some exceptions, the estimates of T_s by Morcos et al. tend to be located between the present estimates with r_cutoff ∼ 8 Å and 15.5 Å, which correspond to upper and lower limits for T_s as discussed in the Discussion and the supplement.
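The regression used above can be sketched in a few lines of Python. The sketch assumes that the "reflective" correlation coefficient is the uncentered one, computed from raw second moments rather than from deviations about the means, which is the natural companion of a regression line forced through the origin. The function name and the numbers are illustrative only and are not data from the study.

```python
import math


def regress_through_origin(x, y):
    """Least-squares line y = slope * x and reflective correlation.

    The reflective (uncentered) correlation uses raw second moments
    instead of deviations from the means.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain a non-zero value")
    slope = sxy / sxx
    r_reflective = sxy / math.sqrt(sxx * syy)
    return slope, r_reflective


# Made-up numbers for illustration only (not data from the paper).
dpsi = [1.0, 2.0, 4.0, 6.0]   # delta-psi_N, dimensionless
ddg = [0.3, 0.5, 1.2, 1.6]    # delta-delta-G_ND, kcal/mol
slope, r = regress_through_origin(dpsi, ddg)  # slope estimates k_B*T_s
```

With Δψ_N on the horizontal axis and ΔΔG_ND in kcal/mol on the vertical axis, the slope has units of kcal/mol and plays the role of k_BT_s.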

4.5. Relationship among $\bar{\bar{Δψ_{N}}}$ of protein families; weak dependency of ΔΔG_ND on ΔG_ND/L

A weak dependence of ΔΔG_ND on ΔG_ND was found (Miyazawa, 2016; Serohijos et al., 2012) from the analysis of stability changes due to single amino acid substitutions in proteins, which are collected in the ProTherm database (Kumar et al., 2006). To understand this weak dependence, let us consider the average of $\bar{Δψ_{N}}$ over homologous sequences in each protein family. The following regression line with $α_{\bar{ψ_{N}}} = -1.74$ is shown in Fig. 3.

(67)

$\bar{\bar{Δψ_{N}(σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i})}} \approx α_{\bar{ψ_{N}}} \frac{\bar{ψ_{N}(σ_{N})} - \bar{ψ}(\bar{f(σ_{N})})}{L} + β_{\bar{ψ_{N}}}$

(68)

$= α_{\bar{ψ_{N}}} \frac{- δψ^{2}(\bar{f(σ_{N})})}{L} + β_{\bar{ψ_{N}}}$

(69)

$α_{\bar{ψ_{N}}} < 0, \quad β_{\bar{ψ_{N}}} \approx 0$

Here, $\bar{ψ_{N}(σ_{N})}$ is reduced by $\bar{ψ}$ because the origin of the ψ_N scale is not unique. The correlation between $\bar{\bar{Δψ_{N}}}$ and δψ²/L is significant; the correlation coefficient is larger than 0.99. The intercept $β_{\bar{ψ_{N}}}$ should be equal to 0, because if T_s → ∞ then δψ² → 0 and Δψ_N → 0. Indeed, Fig. 3 shows that $β_{\bar{ψ_{N}}}$ is nearly equal to 0.

Finally, the regression of ΔΔG_ND on ΔG_ND would be derived if T_g, T_s, and T were constant.

(70)

$\bar{\bar{ΔΔG_{ND}(σ_{j \neq i}^{N}, σ_{i}^{N} \to σ_{i})}} \approx - α_{\bar{ψ_{N}}} k_{B} T_{s} \frac{δψ^{2}(\bar{f(σ_{N})})}{L} + k_{B} T_{s} β_{\bar{ψ_{N}}}$

(71)

$= α_{ΔG_{ND}} k_{B} T_{s} \frac{δψ^{2}(\bar{f(σ_{N})})}{L} \left( ϑ(T / T_{g}) \frac{T_{s}}{T} - 1 \right) + β_{ΔG_{ND}}$

(72)

$= α_{ΔG_{ND}} \frac{\langle ΔG_{ND}(σ_{N}, T) \rangle}{L} + β_{ΔG_{ND}}$

In general, T_s and T_g differ among protein families, so that the correlation between $\bar{\bar{ΔΔG_{ND}}}$ and ⟨ΔG_ND⟩/L cannot be strong. In Fig. 4, $\bar{\bar{ΔΔG_{ND}}}$ for the present proteins is plotted against ⟨ΔG_ND⟩/L. It should be noted, however, that the correlation is expected not between $\bar{\bar{ΔΔG_{ND}}}$ and ⟨ΔG_ND⟩ but between $\bar{\bar{ΔΔG_{ND}}}$ and ⟨ΔG_ND⟩/L.

4.6. Estimation of T_g, ω, and ⟨ΔG_ND(σ)⟩_σ from T_s and T_m

To estimate the glass transition temperature T_g, the conformational entropy per residue ω in the compact denatured state, and the ensemble average of folding free energy in sequence space ⟨ΔG_ND⟩_σ, the melting temperature T_m must be known for each protein; see Eqs. (47), (18), and (44) for T_g, ω and ⟨ΔG_ND⟩_σ, respectively. The experimental value of T_m (Armengaud et al., 2004; D’Auria et al., 2005; Ganguly et al., 2009; Guelorget et al., 2010; Knapp et al., 1998; Onwukwe et al., 2014; Parsons et al., 2006; Rosa et al., 1995; Sainsbury et al., 2008; Stupák et al., 2006; Torchio et al., 2012; Williams et al., 2002) employed for each protein is listed in Tables 3 and S.6. For comparison, the temperature T is set equal to the experimental temperature for ΔG_ND, or to 298 K if unknown.

An estimate of the glass transition temperature, $\hat{T}_{g}$, has been calculated from $\hat{T}_{s}$ and T_m by Eq. (47), and is listed in Tables 3 and S.6 for each protein. In Fig. 5, $\hat{T}_{s} / \hat{T}_{g}$ is plotted against $T_{m} / \hat{T}_{g}$ for each protein family. Unless T_g < T_m, a protein will be trapped at local minima on a rugged free energy landscape before it folds into a unique native structure. Protein foldability increases as T_m/T_g increases. The condition $ΔG_{ND} = 0$ at $T = T_{m}$ for the first order transition requires that Eq. (47), which is indicated by a dotted curve in Fig. 5, be satisfied. As a result, T_s/T_g must be lowered to increase T_m/T_g; in other words, proteins must be selected at lower T_s. The present estimates of T_s and T_g would be within a reasonable range (Morcos et al., 2014; Onuchic et al., 1995; Pande et al., 2000) of values required for protein foldability.
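Eq. (47) is not reproduced in this section. The sketch below therefore assumes that it takes the form T_s/T_g = 2(T_m/T_g)/((T_m/T_g)² + 1), which is equivalent to ϑ(T_m/T_g) T_s/T_m = 1 with ϑ(x) = (1 + x²)/2 in Eq. (71); this assumed form is adopted here because it reproduces the T_g entries of Table 3 from the listed T_s and T_m. The function name is hypothetical.

```python
import math


def glass_transition_temperature(t_s, t_m):
    """Solve T_s/T_g = 2x/(x^2 + 1), x = T_m/T_g, for T_g (assumed Eq. 47).

    Rearranging gives T_g = T_m * sqrt(T_s / (2*T_m - T_s)).
    All temperatures are in K.
    """
    if not 0 < t_s < 2 * t_m:
        raise ValueError("need 0 < T_s < 2*T_m")
    return t_m * math.sqrt(t_s / (2.0 * t_m - t_s))


# T_s and T_m (in K) copied from Table 3.
t_g_pdz = glass_transition_temperature(140.7, 312.88)
t_g_hth3 = glass_transition_temperature(122.6, 343.7)
```

Under this assumed form, T_g < T_m holds exactly when T_s < T_m, and lowering T_s at fixed T_m lowers T_g and hence raises T_m/T_g, in line with the statement that proteins must be selected at lower T_s to become more foldable.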

In Tables 3 and S.6, the ensemble average of ΔG_ND(σ) over sequences calculated by Eq. (44), and the conformational entropy per residue ω in the compact denatured state calculated by Eq. (18), are also listed for each protein. Fig. 6 compares these ensemble averages, ⟨ΔG_ND(σ)⟩_σ, with the experimental values of ΔG_ND(σ_N) (Gianni et al., 2005; 2007; Grantcharova et al., 1998; Kragelund et al., 1999; Ruiz-Sanz et al., 1999; Wilson and Wittung-Stafshede, 2005) listed in Table S.4. The correlation in the case of r_cutoff ∼ 8 Å is quite good, indicating that the constancy approximation (Eq. (65)) for the variance of ΔG_N is appropriate. The conformational entropy per residue in the compact denatured state, $\hat{ω}$ in Eq. (18), estimated from the condition for the first order transition falls into the range of 0.60–1.13 k_B for r_cutoff ∼ 8 Å, which agrees well with the range estimated by Morcos et al. (2014).

4.7. The equilibrium value of evolutionary statistical energy ψ_N in the mutation–fixation process of amino acid substitutions

Let us consider the fixation process of amino acid substitutions in a monoclonal approximation, in which protein evolution is assumed to proceed with single amino acid substitutions fixed one at a time in a population. In this approximation, Δψ_ND and ψ_N are at equilibrium, and the ensemble of protein sequences attains the equilibrium state, when the average of ΔΔψ_ND ≃ Δψ_N over single nucleotide nonsynonymous mutations fixed in a population is equal to zero; the amino acid composition is assumed to be constant in protein evolution.

(73)

$\langle ΔΔψ_{ND} \rangle_{fixed} ≃ \langle Δψ_{N} \rangle_{fixed} = 0 \iff Δψ_{ND} \text{ and } ψ_{N} \text{ are at equilibrium.}$

The average of Δψ_N over fixed mutations, ⟨Δψ_N⟩_fixed, is calculated numerically with the probability density function (PDF) of ΔΔψ_ND( ≃ Δψ_N) for single nucleotide nonsynonymous mutations; see Eqs. (52) and (53). $N = 10^{6}$ is employed.

The PDF of ΔΔG_ND has been approximated by a normal distribution (Serohijos et al., 2012) or a bi-normal distribution (Tokuriki et al., 2007). Figs. 7, S.22, and S.23, however, show that a single normal distribution with the observed mean and standard deviation cannot reproduce well the observed distribution of Δψ_N due to single nucleotide nonsynonymous mutations. For simplicity, a log-normal distribution, $\ln N(x; μ, σ)$, for which x, μ and σ are defined as follows, is arbitrarily used here to better reproduce the observed distributions of Δψ_N, particularly in the domain of $Δψ_{N} < \bar{Δψ_{N}}$, although other distributions such as inverse Γ distributions can reproduce the observed ones equally well.

(74)

$p(Δψ_{N}) \approx \ln N(x; μ, σ) \equiv \frac{1}{x} N(\ln x; μ, σ)$

(75)

$x \equiv \max(Δψ_{N} - Δψ_{N}^{o}, 0)$

(76)

$\exp(μ + σ^{2}/2) = \bar{Δψ_{N}} - Δψ_{N}^{o}$

(77)

$\exp(2μ + σ^{2}) (\exp(σ^{2}) - 1) = \bar{(Δψ_{N} - \bar{Δψ_{N}})^{2}}$

(78)

$Δψ_{N}^{o} \equiv \min\left( \bar{Δψ_{N}} - n_{shift} \left( \bar{(Δψ_{N} - \bar{Δψ_{N}})^{2}} \right)^{1/2}, 0 \right)$

where $Δψ_{N}^{o}$ is the origin for the log-normal distribution and the shifting factor n_shift is taken to be equal to 2, unless otherwise specified. Figs. 7, S.22, and S.23 show that log-normal distributions reproduce the observed distribution of Δψ_N due to single nucleotide nonsynonymous mutations better than a single normal distribution, except in the tails. Disagreements between the log-normal and observed distributions in the domain of $Δψ_{N} > \bar{Δψ_{N}}$ do not much affect the PDF of Δψ_N in fixed mutants, because fixation probabilities for $Δψ_{N} (> \bar{Δψ_{N}})$ are too low.
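The moment matching of Eqs. (76) to (78) has a closed-form solution, which the following Python sketch spells out. The function names and the example mean and standard deviation are hypothetical; the formulas are the standard relations between the mean and variance of a log-normal variable and its parameters μ and σ.

```python
import math


def fit_shifted_lognormal(mean, sd, n_shift=2.0):
    """Return (origin, mu, sigma) of the shifted log-normal, Eqs. (74)-(78).

    mean, sd : observed mean and standard deviation of delta-psi_N
    n_shift  : shifting factor (2 unless otherwise specified in the text)
    """
    if sd <= 0 or n_shift <= 0:
        raise ValueError("sd and n_shift must be positive")
    origin = min(mean - n_shift * sd, 0.0)      # Eq. (78)
    m = mean - origin                           # mean of x, Eq. (76)
    sigma2 = math.log(1.0 + (sd / m) ** 2)      # from Eqs. (76) and (77)
    mu = math.log(m) - 0.5 * sigma2             # Eq. (76) solved for mu
    return origin, mu, math.sqrt(sigma2)


def lognormal_pdf(dpsi, origin, mu, sigma):
    """Density p(delta-psi_N) of Eq. (74); zero at or below the origin."""
    x = dpsi - origin                           # Eq. (75), positive part
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


# Hypothetical moments, for illustration only.
origin, mu, sigma = fit_shifted_lognormal(mean=3.0, sd=4.0)
```

Substituting the returned μ and σ back into Eqs. (76) and (77) recovers the input mean and variance exactly, so the fitted distribution shares its first two moments with the observed one by construction.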

The average of Δψ_N over fixed mutants is uniquely determined by the distribution of ΔΔψ_ND( ≃ Δψ_N), which is approximated here by a log-normal distribution estimated from the mean and variance of Δψ_N; it depends also on q_m, which is assumed to be constant, through the fixation probability, because $4 N_{e} s ≃ - Δψ_{N} / (1 - q_{m})$. In other words, ⟨Δψ_N⟩_fixed is uniquely determined by the mean and variance of Δψ_N. Therefore, under the equilibrium condition $\langle Δψ_{N} \rangle_{fixed} = 0$, only one of the mean and variance can be freely specified, and the other is uniquely determined. We employ $\bar{Δψ_{N}}$ or ψ_N as a parameter, because $\bar{Δψ_{N}}$ depends on ψ_N, and only one of them can be specified. We define $\bar{Δψ}_{N}^{eq}$ as the value of $\bar{Δψ}_{N}$ at which $\langle Δψ_{N} \rangle_{fixed} = 0$.

Suppose that the regression equation, Eq. (62), of Δψ_N on ψ_N is exact, and that the standard deviation of Δψ_N is constant irrespective of ψ_N; the slope ($α_{ψ_{N}}$), $\bar{\bar{Δψ_{N}}}$, $\bar{Sd(Δψ_{N})}$, and $\bar{ψ_{N}}$ that are estimated with r_cutoff ∼ 8 Å for the PDZ and listed in Table 2 are employed here. In Fig. 8, the average of Δψ_N over single nucleotide nonsynonymous substitutions fixed in a population, ⟨Δψ_N⟩_fixed, is plotted against ψ_N/L of a wildtype for the PDZ protein family. This figure shows that ⟨Δψ_N⟩_fixed changes its value from positive to negative as ψ_N increases; that is, the value of ψ_N at which $\langle Δψ_{N} \rangle_{fixed} = 0$, $ψ_{N}^{eq}$, is the stable equilibrium value for ψ_N. In order for a protein to have such a stable equilibrium value for folding free energy (ΔG_ND ≃ k_BT_sΔψ_ND), the regression coefficient of $\bar{Δψ_{N}}$ on ψ_N must be more negative than that of the standard deviation, Sd(Δψ_N), because otherwise stabilizing mutations increase as ψ_N decreases. This condition is satisfied for all protein families studied here, because the mean of Δψ_N over all substitutions at all sites is negatively proportional to ψ_N of a wildtype, whereas its standard deviation is nearly constant irrespective of ψ_N across homologous sequences; see Tables 2 and S.5.
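The equilibrium condition can be explored numerically with a short Python sketch. Eq. (50) is not reproduced in this section, so the sketch assumes the Kimura-type fixation probability u(s) = (1 − exp(−4N_e s q_m))/(1 − exp(−4N_e s)) with q_m = 1/(2N), together with 4N_e s(1 − q_m) = −Δψ_N. It also scans the mean of Δψ_N directly at a fixed standard deviation instead of tying the mean to ψ_N through Eq. (62). The function names, the quadrature, and the example value Sd = 2 are illustrative choices, not the procedure of the original study.

```python
import math


def fixation_probability(big_s, q_m):
    """Kimura-type fixation probability for S = 4*N_e*s (assumed Eq. 50)."""
    if big_s == 0.0:
        return q_m
    if big_s > 0.0:
        return math.expm1(-big_s * q_m) / math.expm1(-big_s)
    # For S < 0, rescale by exp(S) so that no exponential overflows.
    return (math.exp(big_s * (1.0 - q_m)) * math.expm1(big_s * q_m)
            / math.expm1(big_s))


def mean_dpsi_fixed(mean, sd, n_shift=2.0, pop_size=1e6, n_grid=2000):
    """Average of delta-psi_N over fixed mutants for a shifted log-normal."""
    q_m = 1.0 / (2.0 * pop_size)
    origin = min(mean - n_shift * sd, 0.0)          # Eq. (78)
    m = mean - origin
    sigma = math.sqrt(math.log(1.0 + (sd / m) ** 2))
    mu = math.log(m) - 0.5 * sigma * sigma
    z_max = 8.0
    step = 2.0 * z_max / n_grid
    num = den = 0.0
    for k in range(n_grid):                         # midpoint rule in z
        z = -z_max + (k + 0.5) * step               # z = (ln x - mu) / sigma
        weight = math.exp(-0.5 * z * z)             # unnormalized Gaussian
        dpsi = origin + math.exp(mu + sigma * z)
        u = fixation_probability(-dpsi / (1.0 - q_m), q_m)
        num += weight * u * dpsi
        den += weight * u
    return num / den


def equilibrium_mean(sd, n_shift=2.0, n_iter=40):
    """Bisection for the mean at which <delta-psi_N>_fixed vanishes."""
    lo, hi = 0.0, (n_shift + 0.5) * sd   # <.>_fixed < 0 at lo, > 0 at hi
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mean_dpsi_fixed(mid, sd, n_shift) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


eq_mean = equilibrium_mean(sd=2.0)   # hypothetical Sd(delta-psi_N)
```

The bracket works because fixation always favors lower Δψ_N: with a zero mean the fixed average is negative, whereas once the mean exceeds n_shift standard deviations the origin is clamped at zero, every mutation is destabilizing, and the fixed average is positive.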

The equilibrium value of ψ_N for each protein domain is calculated with the estimated values of $α_{ψ_{N}}$, $\bar{ψ_{N}}$, $\bar{\bar{Δψ_{N}}}$, and $\bar{Sd(Δψ_{N})}$ listed in Tables 2 and S.5; it should be noted here that $\bar{Sd(Δψ_{N})}$ is assumed to be constant. In Figs. 9 and S.26, the equilibrium values of ψ_N/L estimated with $n_{shift} = 1.5, 2,$ and 2.5 in the monoclonal approximation are plotted against the average of ψ_N/L over homologous sequences for each protein family. The agreement between the time average ($ψ_{N}^{eq}$) and the ensemble average ($\langle ψ_{N} \rangle_{σ} (= \bar{ψ_{N}})$) is better for r_cutoff ∼ 8 Å than for r_cutoff ∼ 15.5 Å and is not bad in the case of r_cutoff ∼ 8 Å, indicating that the present methods for the fixation process of amino acid substitutions and for the equilibrium ensemble of ψ_N give results consistent with each other, and also that it is a good approximation to assume that the standard deviation of Δψ_N does not depend on ψ_N in each protein family.

4.8. Relationships between $\bar{Δψ_{N}} (= \bar{Δψ}_{N}^{eq})$ and the standard deviation of Δψ_N, $\hat{T}_{s}$, and $ΔΔ\hat{G}_{ND}$ at equilibrium

In the present model, the equilibrium values, $ψ_{N}^{eq}$ and the corresponding $\bar{Δψ}_{N}^{eq}$, are functions of the mean and standard deviation of Δψ_N only, because the distribution of ΔΔψ_ND( ≃ Δψ_N) is approximately estimated from the mean and standard deviation of Δψ_N. On the other hand, $ψ_{N}^{eq}$ and $\bar{Δψ}_{N}^{eq}$ should be equal to $\bar{ψ_{N}} = \langle ψ_{N} \rangle$ and $\bar{\bar{Δψ_{N}}}$, respectively; the time average and the ensemble average should be consistent. Indeed, $ψ_{N}^{eq}$ almost agrees with $\bar{ψ_{N}}$, as shown in Fig. 9. Therefore, the standard deviation of Δψ_N is uniquely determined from its mean as long as ψ_N and $\bar{Δψ_{N}}$ are at equilibrium; conversely, the equilibrium value of $\bar{Δψ_{N}}$ is determined by Sd(Δψ_N). In Fig. 10, the standard deviation of Δψ_N is plotted against $\bar{Δψ_{N}} (= \bar{Δψ}_{N}^{eq})$. Likewise, the estimate of the effective temperature of selection, $\hat{T}_{s} (= (\hat{T}_{s} \bar{Sd}(Δψ_{N}))_{PDZ} / Sd(Δψ_{N}))$, and that of the folding free energy change, $ΔΔ\hat{G}_{ND} (= k_{B} (\hat{T}_{s} \bar{Sd}(Δψ_{N}))_{PDZ} / Sd(Δψ_{N}) \cdot \bar{Δψ_{N}})$, are plotted as functions of $\bar{Δψ_{N}} (= \bar{Δψ}_{N}^{eq})$ in Fig. 11. These figures show that the averages, $\bar{\bar{Δψ_{N}}}$ and $\bar{Sd(Δψ_{N})}$, over homologous sequences scatter along the expected curves.

4.9. Protein evolution at equilibrium, $\langle Δψ_{N} \rangle_{fixed} = 0$

The common understanding of protein evolution has been that amino acid substitutions observed in homologous proteins are neutral (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) or slightly deleterious (Ohta, 1973; 1992), and that random drift is the primary force that fixes amino acid substitutions in a population. In order to see how significant neutral and slightly deleterious substitutions are in protein evolution, the PDFs of K_a/K_s in all single nucleotide nonsynonymous mutations and in their fixed mutations are calculated; K_a/K_s is the ratio of nonsynonymous to synonymous substitution rate per site (Miyata and Yasunaga, 1980) and is defined here as K_a/K_s ≡ u(s)/u(0), where u(s) is the fixation probability for selective advantage s; see Eq. (50).

First let us see the distributions of Δψ_N at equilibrium, $\langle Δψ_{N} \rangle_{fixed} = 0$. Fig. 12 shows the PDFs of Δψ_N in all single nucleotide nonsynonymous mutations and in their fixed mutations as a function of $\bar{Δψ_{N}} (= \bar{Δψ}_{N}^{eq})$, respectively. Because $4 N_{e} s (1 - q_{m}) = - ΔΔψ_{ND} ≃ - Δψ_{N}$, the PDFs of Δψ_N can be regarded as the PDFs of $- 4 N_{e} s (1 - q_{m})$. At equilibrium, the distribution of Δψ_N in all single nucleotide nonsynonymous mutants becomes wider as the mean of Δψ_N increases; however, that in fixed mutants remains narrow with a peak near zero.

The PDFs of K_a/K_s in all single nucleotide nonsynonymous mutations and in their fixed mutations are shown in Fig. 12. The blue line on the landscape of the PDF shows the averages of K_a/K_s. The averages of K_a/K_s in all single nucleotide nonsynonymous mutations and in their fixed mutations are also shown in Fig. 13. The average of K_a/K_s in all the arising mutants is less than 1 and decreases as $\bar{Δψ_{N}} ≃ \bar{Δψ}_{N}^{eq}$ increases, indicating that negative mutants occur significantly and increase as $\bar{Δψ_{N}}$ increases. On the other hand, ⟨K_a/K_s⟩_fixed in fixed mutants is larger than 1 and increases as $\bar{Δψ}_{N}^{eq}$ increases, indicating that positive mutants are fixed significantly in a population and increase as the equilibrium folding free energy change increases, that is, as equilibrium protein stability decreases. To see the contributions of positive, neutral, slightly negative and negative selection, the value of K_a/K_s is divided arbitrarily into four categories, K_a/K_s > 1.05, 1.05 > K_a/K_s > 0.95, 0.95 > K_a/K_s > 0.5, and 0.5 > K_a/K_s, respectively. The probabilities of each selection category in all single nucleotide nonsynonymous mutations and in their fixed mutations are shown in Fig. 14. Almost 50% of fixed mutations are stabilizing mutations fixed by positive selection (1.05 < K_a/K_s), and the other 50% are destabilizing mutations fixed by random drift. They balance each other, and the stability of the protein is maintained. Contrary to the neutral theory (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974), the proportion of neutral selection is not large even in fixed mutations, and slightly negative mutations are significantly fixed. Neutral mutations fixed with 0.95 < K_a/K_s < 1.05 amount to less than 10%, and slightly negative mutations fixed with 0.5 < K_a/K_s < 0.95 and negative mutations fixed with K_a/K_s < 0.5 each amount to 10 to 30%. The nearly neutral theory (Ohta, 1973; 1992; 2002) claims that most fixed mutations satisfy |N_es| ≤ 2. This condition corresponds to $0.003 \leq K_{a} / K_{s} (= u(s) / u(0)) \leq 8$; see Eqs. (21) and (50). The PDF of K_a/K_s shown in Fig. 14 indicates that this condition is satisfied, supporting the nearly neutral theory.
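The correspondence between |N_es| ≤ 2 and the quoted range of K_a/K_s can be checked with a few lines of Python. Eq. (50) is not reproduced in this section; the sketch assumes the Kimura-type fixation probability u(s) = (1 − exp(−4N_e s q_m))/(1 − exp(−4N_e s)) with u(0) = q_m = 1/(2N) and N = 10^6. The function names and the category labels are illustrative.

```python
import math


def ka_ks(big_s, q_m=0.5e-6):
    """K_a/K_s = u(s)/u(0) for S = 4*N_e*s (assumed form of Eq. 50).

    q_m = 1/(2N) with N = 10**6, so u(0) = q_m.
    """
    if big_s == 0.0:
        return 1.0
    if big_s > 0.0:
        u = math.expm1(-big_s * q_m) / math.expm1(-big_s)
    else:
        # Rescaled by exp(S) so that large negative S cannot overflow.
        u = (math.exp(big_s * (1.0 - q_m)) * math.expm1(big_s * q_m)
             / math.expm1(big_s))
    return u / q_m


def selection_category(ratio):
    """Assign K_a/K_s to one of the four arbitrary categories of the text."""
    if ratio > 1.05:
        return "positive"
    if ratio > 0.95:
        return "neutral"
    if ratio > 0.5:
        return "slightly negative"
    return "negative"


upper = ka_ks(8.0)    # N_e*s = +2, i.e. 4*N_e*s = 8
lower = ka_ks(-8.0)   # N_e*s = -2, i.e. 4*N_e*s = -8
```

For q_m ≪ 1 the ratio reduces to S/(1 − exp(−S)) with S = 4N_e s, which gives about 8 at N_es = 2 and about 0.003 at N_es = −2, matching the bounds stated above.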

4.10. Relationship between T_s and K_a/K_s

The effective temperature (T_s) of a protein for selection, which is defined in Eq. (1), represents the strength of selection originating from protein stability and foldability. Thus, it must be related to the evolutionary rate (amino acid substitution rate) of the protein. As the effective temperature of selection (T_s) decreases, the mean change of evolutionary statistical energy ($\bar{Δψ}_{N}^{eq}$) due to single amino acid substitutions increases; see Fig. 11. Therefore, destabilizing mutations increase, and the amino acid substitution rate is expected to decrease. Fig. 13 shows that the average of K_a/K_s decreases as $\bar{Δψ}_{N}^{eq}$ increases. The direct relationship between substitution rate and $T_{s} (= (T_{s} \bar{Sd}(Δψ_{N}))_{PDZ} / Sd(Δψ_{N}))$ is shown in Fig. 15; the average of K_a/K_s decreases as T_s increases. In the selection maintaining protein foldability/stability, the effective temperature of selection is directly reflected in the average amino acid substitution rate.

5. Discussion

A main purpose of the present study is to formulate protein fitness originating from protein foldability and stability. From a phenomenological viewpoint, Drummond and Wilke (2008) took notice of toxicity of misfolded proteins as well as diversion of protein synthesis resources, and formulated a Malthusian fitness of a genome to be negatively proportional to the total amount of misfolded proteins, which must be produced to obtain the necessary amount of folded proteins (Serohijos et al., 2012). They also formulated a Malthusian fitness based on protein dispensability to be negatively proportional to the ratio of unfolded proteins. These formulas of protein fitness can be well approximated by a generic form, $m = - κ \exp (ΔG_{ND} / (k_{B} T))$, where T is growth temperature, and κ( ≥ 0) is a parameter that depends on protein disability and cellular abundance of protein (Miyazawa, 2016).

In the comparison of this generic formula of protein fitness with the present one, it may be interpreted that $4 N_{e} (1 - q_{m}) κ / T \sim 1 / T_{s}$ if |ΔG_ND/(k_BT)| ≪ 1; however, the growth temperature T and folding free energy do not always satisfy this condition. These two types of selection should be considered to be different, although both are related to protein stability (ΔG_ND). The selective advantage of a mutant is not upper-bounded in the present scheme of Malthusian fitness, but it is upper-bounded in the case of $m = - κ \exp (ΔG_{ND} / (k_{B} T))$. As a result, the PDFs of K_a/K_s in all arising mutations and in fixed mutations have very different shapes between these two formulas of fitness (Miyazawa, 2016). The selection modeled here is one that yields the distribution of homologous sequences in protein evolution. In other words, the present formula for protein fitness models natural selection maintaining a protein’s stability, foldability, and function over the evolutionary time scale, which is much longer than the time scale for the selection originating from toxicity of misfolded proteins.

The present formulas for protein fitness, Eqs. (31) and (30), have been derived on the basis of a protein folding theory, particularly the random energy model, and the maximum entropy principle for the distribution of homologous sequences with the same fold in sequence space, respectively. The former indicates that the equilibrium ensemble of sequences can be well approximated by a canonical ensemble with the Boltzmann factor $\exp (- ΔG_{ND} / k_{B} T_{s})$, and the latter states that the probability distribution of homologous sequences, which satisfies a given amino acid composition at each site and a given pairwise amino acid frequency at each site pair, can be represented as a Boltzmann distribution with $\exp (- ψ_{N})$, in which the evolutionary statistical energy (ψ_N) is represented as the sum of one-body (compositional) and pairwise (covariational) interactions between sites. On the other hand, assuming mutation and fixation processes to be reversible Markov processes leads us to a formulation in which the equilibrium ensemble of sequences also obeys a Boltzmann distribution with $\exp (4 N_{e} m (1 - q_{m}))$. As a result, we obtain the correspondences among folding free energy ($- ΔG_{ND} / k_{B} T_{s}$), $- Δψ_{ND}$, and protein fitness ($4 N_{e} m (1 - q_{m})$): the equality between the latter two variables (Eq. (33)), which indicates that Δψ_N is proportional to fitness (s), and the approximate equality between the former two variables (Eq. (34)), since a canonical ensemble with ΔG_ND/(k_BT_s) is an approximation to the sequence ensemble under natural selection. A discrepancy between evolutionary statistical energies J_ij and actual interaction energies was pointed out for non-contacting residue pairs in Monte Carlo simulations of lattice proteins (Jacquin et al., 2016). Also, the ratio of $- J_{ij} (a_{k}, a_{l})$ to the corresponding actual contact energy was shown to differ among contact site pairs. On the other hand, Hopf et al. (2017) successfully predicted mutation effects with evolutionary statistical energy and showed that the change of evolutionary statistical energy (Δψ_N) due to amino acid substitutions can capture experimental fitness landscapes and identify deleterious human variants.

In the analysis of the interaction changes (Δψ_N) due to single nucleotide nonsynonymous substitutions, we have employed the cutoff distances for pairwise interactions, r_cutoff ∼ 8 and 15.5 Å, which correspond to the first and second interaction shells between residues, respectively. Both cutoff distances yield similar values for $T_{s} / T_{s, PDZ} = \bar{Sd}(Δψ_{N, PDZ}) / \bar{Sd}(Δψ_{N})$; see Fig. S.15. Thus, the differences in the estimation of T_s between these two cutoff distances principally originate in the estimation of T_s for the reference protein, PDZ. The absolute value of $k_{B} \hat{T}_{s, PDZ}$ for the PDZ has been estimated to be equal to the slope of the reflective regression line of ΔΔG_ND on Δψ_N. Therefore, as long as the correlation between ΔΔG_ND and Δψ_N is good enough, as shown in Figs. 2 and S.14, $k_{B} (\hat{T}_{s} \bar{Sd(Δψ_{N})})_{PDZ}$ takes a similar value irrespective of r_cutoff, and the estimate $\hat{T}_{s, PDZ}$ differs depending on $\bar{Sd(Δψ_{N, PDZ})}$. Thus, Δψ_N must correlate with experimental ΔΔG_ND, but one cannot determine on the basis of the correlation coefficient which estimation of Δψ_N is better. The larger the standard deviation of Δψ_N is, the smaller the estimate of T_s from a direct comparison between ΔΔG_ND and Δψ_N is. Including a longer range of pairwise interactions tends to increase the variance of Δψ_N. The range of interactions must be limited to a realistic value, either the first interaction shell or the second interaction shell. Thus, the estimates of T_s with r_cutoff ∼ 8 Å and 15.5 Å would be upper and lower limits, respectively. Unfortunately, T_s is not directly observable. Comparison of the estimates of folding free energies with their experimental values may be appropriate to judge which value is more appropriate for the cutoff distance, although the number of experimental data is limited. Actual values of T_s may be closer to the estimates with r_cutoff ∼ 8 Å, because contact predictions based on the estimate of pairwise interactions J succeed for close contacts within the first interaction shell. Also, the estimation of ΔG_ND and the correlation between $ψ_{N}^{eq}$ and $\bar{ψ_{N}}$ are slightly better with r_cutoff ∼ 8 Å than with 15.5 Å; see Figs. 6, 9, S.21, and S.26.

On the basis of the random energy model (REM) (Pande et al., 1997; Shakhnovich and Gutin, 1993a; 1993b), glass transition temperatures (T_g) and folding free energies (ΔG_ND) for 14 protein domains are estimated under the condition of $\bar{ψ_{N}} = \langle ψ_{N} \rangle_{σ}$. The first order transition for protein folding is assumed in order to estimate the folding free energies by Eq. (44). The selective temperature, T_s, is estimated in the empirical approximation that the standard deviation of Δψ_N is constant across homologous sequences with different ψ_N, so that its estimates may be more coarse-grained; however, this method is easier and faster than the method (Morcos et al., 2014) using the AWSEM (Davtyan et al., 2012). Experimental data for ΔG_ND are very limited, and experimental conditions such as temperature and pH also tend to differ among them. A prediction method for folding free energy would be useful in such a situation. Although the present method requires knowledge of the melting temperature (T_m) in addition to sequence data, experimental data for T_m are more readily available than those for ΔG_ND.

For proteins to have a stable equilibrium value of ψ_N in protein evolution, the regression coefficient of the mean interaction change ($\overline{\Delta \psi_{N}}$) on ψ_N must be more negative than that of its standard deviation (Sd(Δψ_N)); otherwise, stabilizing mutations would increase as ψ_N decreases. Indeed, Tables 2 and S.5 show that the mean of Δψ_N over all the substitutions at all sites is negatively proportional to ψ_N of the wild type, whereas its standard deviation is nearly constant irrespective of ψ_N across homologous sequences. The equilibrium value $\psi_{N}^{eq}$, at which the average of Δψ_N over fixed mutants is equal to zero, is calculated by approximating the distribution of Δψ_N with a log-normal distribution and using the empirical rules of Eqs. (62) and (63). In the monoclonal approximation, it has been confirmed that the time average ($\psi_{N}^{eq}$) and the ensemble average ($\overline{\psi_{N}} = \langle \psi_{N} \rangle_{\sigma}$) of the evolutionary statistical energy (ψ_N) almost agree with each other. This result therefore also supports these approximations and empirical rules, particularly Eq. (63), that is, the constancy of the standard deviation of Δψ_N across homologous sequences. In the log-normal distribution approximation, $\overline{\Delta \psi}_{N}^{eq}$, Sd(Δψ_N)^eq, $\hat{T}_{s}$, and $\Delta \Delta \hat{G}_{ND}$ can each be determined as a function of any one of the others. Here they have been shown as functions of $\overline{\Delta \psi}_{N}^{eq}$.
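The equilibrium condition described above, a zero average of Δψ_N over fixed mutants, can be illustrated numerically. The sketch below is not the paper's procedure: it swaps the log-normal distribution for a normal distribution with a fixed standard deviation, uses the large-population diffusion approximation u(s)/u(0) ≈ 4N_e s / (1 − exp(−4N_e s)) for the relative fixation rate, and takes 4N_e s ≈ −Δψ_N from the equivalence of the Boltzmann factors stated in the abstract. All function names are hypothetical.

```python
import math

def rel_fixation_rate(x):
    """u(s)/u(0) ~ x / (1 - exp(-x)) with x = 4*N_e*s (assumed diffusion limit)."""
    if abs(x) < 1e-12:
        return 1.0  # neutral limit
    return x / (-math.expm1(-x))

def mean_dpsi_fixed(mu, sd, n=4001, width=10.0):
    """Fixation-weighted mean of dpsi for dpsi ~ Normal(mu, sd).

    Assumptions: normal (not log-normal) distribution, and 4*N_e*s = -dpsi.
    The integral is a plain Riemann sum over mu +/- width*sd.
    """
    lo = mu - width * sd
    step = 2.0 * width * sd / (n - 1)
    num = den = 0.0
    for i in range(n):
        x = lo + i * step
        density = math.exp(-0.5 * ((x - mu) / sd) ** 2)  # unnormalised
        weight = density * rel_fixation_rate(-x)
        num += x * weight
        den += weight
    return num / den

def equilibrium_mean(sd, tol=1e-10):
    """Bisection for the mean of dpsi at which fixed mutants average to zero."""
    lo, hi = 0.0, sd * sd  # fixed-mutant mean is negative at lo, positive at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_dpsi_fixed(mid, sd) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Under this normal stand-in the balance point works out analytically to sd²/2, which the bisection reproduces; the log-normal form and the empirical rules used in the paper would give a different value, so the sketch only shows the balance between stabilizing fixations and drift-fixed destabilizing ones.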

We have also studied the evolution of proteins at equilibrium, at which the ensemble of homologous sequences obeys a Boltzmann distribution with $\exp (- \psi_{N})\ (\simeq \exp (- \Delta \psi_{ND}))$, and the ensemble averages of the evolutionary statistical energy (ψ_N ≃ G_N/(k_BT_s)) and of its change due to a mutation (Δψ_N ≃ ΔΔψ_ND ≃ ΔΔG_ND/(k_BT_s)) agree with their steady values: $\langle \psi_{N} \rangle_{\sigma} = \overline{\psi_{N}} = \psi_{N}^{eq}$ and $\langle \overline{\Delta \psi_{N}} \rangle_{\sigma} = \overline{\overline{\Delta \psi_{N}}} = \overline{\Delta \psi}_{N}^{eq}$. The PDFs of Δψ_N and K_a/K_s in all the mutants and in their fixed mutants have been estimated. It is confirmed that the effective temperature (T_s) of selection negatively correlates with the amino acid substitution rate (K_a/K_s) of proteins.

New alleles can become fixed owing to random drift or to positive selection of substantially advantageous mutations (Gillespie, 1991; Kimura, 1983; Ohta, 2002). The present study indicates that protein stability is maintained in such a way that stabilizing mutations, fixed to a significant extent by positive selection, balance destabilizing mutations fixed by random drift. As shown in Fig. 14, almost 50% of fixed mutations are stabilizing mutations fixed by positive selection (1.05 < K_a/K_s), and the other 50% are destabilizing mutations fixed by random drift. Interestingly, contrary to the neutral theory (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974), the proportion of neutral mutations is not large even among fixed mutants. In the selection to maintain protein stability/foldability, neutral mutations fixed with 0.95 < K_a/K_s < 1.05 account for less than 10%, and slightly negative mutations fixed with 0.5 < K_a/K_s < 0.95 and negative mutations fixed with K_a/K_s < 0.5 each account for 10 to 30%. As a result, at equilibrium the average of K_a/K_s over all the mutants is less than 1, but that over their fixed mutants is larger than 1. The PDF of K_a/K_s shown in Fig. 14 supports the nearly neutral theory (Ohta, 1973; 1992; 2002), which asserts that most fixed mutations satisfy |N_es| ≤ 2, corresponding to $0.003 \leq K_{a} / K_{s}\ (= u(s) / u(0)) \leq 8$. It should be noted that these conclusions, based on the PDFs of Δψ_N and K_a/K_s, require only the equilibrium condition $\overline{\Delta \psi_{N}} = \overline{\Delta \psi}_{N}^{eq}$ and do not require the approximation that the variance of Δψ_N is constant across homologous sequences, which is used only to estimate T_s, $\psi_{N}^{eq}$, and other relations based on T_s.
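The correspondence between |N_es| ≤ 2 and the quoted range of K_a/K_s can be checked with a short Python sketch. It assumes the standard large-population diffusion approximation u(s)/u(0) ≈ 4N_e s / (1 − exp(−4N_e s)), which may differ in detail from the fixation formula used in the paper; the function names are hypothetical, and the class boundaries are the K_a/K_s thresholds quoted in the text.

```python
import math

def rel_fixation_rate(four_ne_s):
    """K_a/K_s = u(s)/u(0), approximated as x / (1 - exp(-x)) with x = 4*N_e*s."""
    if abs(four_ne_s) < 1e-12:
        return 1.0  # neutral limit
    return four_ne_s / (-math.expm1(-four_ne_s))

def selection_class(ka_ks):
    """Label a fixed mutation by the K_a/K_s ranges quoted in the text."""
    if ka_ks > 1.05:
        return "positive"
    if ka_ks >= 0.95:
        return "neutral"
    if ka_ks >= 0.5:
        return "slightly negative"
    return "negative"

# N_e*s = +2 and -2 correspond to 4*N_e*s = +8 and -8.
upper = rel_fixation_rate(8.0)   # about 8.003
lower = rel_fixation_rate(-8.0)  # about 0.0027
print(upper, lower, selection_class(upper), selection_class(lower))
```

Under this approximation the two limits come out close to 8 and 0.003, matching the bounds stated for the nearly neutral range.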

In the present study, we have analyzed the mutation-fixation process at equilibrium. The equilibrium state will vary if environmental conditions vary. The evolutionary statistical energy ψ_N and the inverse of the selective temperature, 1/T_s, are linearly proportional to the effective population size N_e, as indicated by Eq. (33). Thus, the equilibrium values $\psi_{N}^{eq}$, $\overline{\Delta \psi_{N}}^{eq}$, and Sd(Δψ_N)^eq are all linearly proportional to N_e. On the other hand, Sd(Δψ_N)^eq is not linearly proportional to $\overline{\Delta \psi_{N}}^{eq}$ but is a downward-concave function of it, as shown in Fig. 10. As a result, as N_e decreases, $k_{B} T_{s} \overline{\Delta \psi_{N}}^{eq} \simeq k_{B} T_{s} \overline{\Delta \Delta \psi_{ND}}^{eq}\ (\simeq \overline{\Delta \Delta G_{ND}}^{eq})$ decreases. In other words, the equilibrium value of the mean folding free energy change becomes less positive, and therefore the equilibrium folding free energy $\overline{\Delta G_{ND}}^{eq} \simeq k_{B} T_{s} \overline{\Delta \psi_{ND}}^{eq}$ is expected to be higher (less stable) for a smaller effective population size N_e; see Eq. (72).

Supplementary document

File 1 — Supplementary methods, tables, and figures

A PDF file in which the details of the methods are described and additional tables and figures are provided; methods, tables, and figures provided in the text are also included as part of their full descriptions for the reader's convenience.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Acknowledgement

I would like to thank a reviewer for his excellent comments and suggestions that have helped me improve the paper considerably.

Appendix A. Supplementary materials

Supplementary Data S1. Supplementary Raw Research Data. This is open data under the CC BY license http://creativecommons.org/licenses/by/4.0/

References

Armengaud, Urbonavicius, Fernandez, Chaussinand, Bujnicki, Grosjean, 2004: J. Armengaud, J. Urbonavicius, B. Fernandez, G. Chaussinand, J.M. Bujnicki, H. Grosjean. N²-Methylation of guanosine at position 10 in tRNA is catalyzed by a THUMP domain-containing, S-adenosylmethionine-dependent methyltransferase, conserved in Archaea and Eukaryota
J. Biol. Chem., 279 (2004), pp. 37142-37152
Barton, Leonardis, Coucke, Cocco, 2016: J.P. Barton, E.D. Leonardis, A. Coucke, S. Cocco. ACE: adaptive cluster expansion for maximum entropy graphical model inference
Bioinformatics, 32 (2016), pp. 3089-3097
Bryngelson, Wolynes, 1987: J.D. Bryngelson, P.G. Wolynes. Spin glasses and the statistical mechanics of protein folding
Proc. Natl. Acad. Sci. USA, 84 (1987), pp. 7524-7528
Crow, Kimura, 1970: J.F. Crow, M. Kimura. An Introduction to Population Genetics Theory
Harper & Row Publishers, New York (1970)
D’Auria, Scirè, Varriale, Scognamiglio, Staiano, Ausili, Marabotti, Rossi, Tanfani, 2005: S. D’Auria, A. Scirè, A. Varriale, V. Scognamiglio, M. Staiano, A. Ausili, A. Marabotti, Mosè Rossi, F. Tanfani. Binding of glutamine to glutamine-binding protein from Escherichia coli induces changes in protein structure and increases protein stability
Proteins, 58 (2005), pp. 80-87
Davtyan, Schafer, Zheng, Clementi, Wolynes, Papoian, 2012: A. Davtyan, N.P. Schafer, W. Zheng, C. Clementi, P.G. Wolynes, G.A. Papoian. AWSEM-MD: protein structure prediction using coarse-grained physical potentials and bioinformatically based local structure biasing
J. Phys. Chem. B, 116 (2012), pp. 8494-8503
Dokholyan, Shakhnovich, 2001: N.V. Dokholyan, E.I. Shakhnovich. Understanding hierarchical protein evolution from first principles
J. Mol. Biol., 312 (2001), pp. 289-307
Drummond, Wilke, 2008: D.A. Drummond, C.O. Wilke. Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution
Cell, 134 (2) (2008), pp. 341-352, doi:10.1016/j.cell.2008.05.042
Ekeberg, Hartonen, Aurell, 2014: M. Ekeberg, T. Hartonen, E. Aurell. Fast pseudolikelihood maximization for direct-coupling analysis of protein structure from many homologous amino-acid sequences
J. Comput. Phys., 276 (2014), pp. 341-356
Ekeberg, Lövkvist, Lan, Weigt, Aurell, 2013: M. Ekeberg, C. Lövkvist, Y. Lan, M. Weigt, E. Aurell. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models
Phys. Rev. E, 87 (2013), 012707-1-16, doi:10.1103/PhysRevE.87.012707
Ewens, 1979: W.J. Ewens. Mathematical Population Genetics
Springer, New York (1979)
Finn, Coggill, Eberhardt, Eddy, Mistry, Mitchell, Potter, Punta, Qureshi, Sangrador-Vegas, Salazar, Tate, Bateman, 2016: R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, S.C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, G.A. Salazar, J. Tate, A. Bateman. The Pfam protein families database: towards a more sustainable future
Nucl. Acid Res., 44 (2016), pp. D279-D285
Ganguly, Das, Bandhu, Chanda, Jana, Mondal, Sau, 2009: T. Ganguly, M. Das, A. Bandhu, P.K. Chanda, B. Jana, R. Mondal, S. Sau. Physicochemical properties and distinct DNA binding capacity of the repressor of temperate Staphylococcus aureus phage ϕ11
FEBS J., 276 (2009), pp. 1975-1985
Gianni, Calosci, Aelen, Vuister, Brunori, Travaglini-Allocatelli, 2005: S. Gianni, N. Calosci, J.M.A. Aelen, G.W. Vuister, M. Brunori, C. Travaglini-Allocatelli. Kinetic folding mechanism of PDZ2 from PTP-BL
Protein Eng. Des. Select., 18 (2005), pp. 389-395
Gianni, Geierhaas, Calosci, Jemth, Vuister, Travaglini-Allocatelli, Vendruscolo, Brunori, 2007: S. Gianni, C.D. Geierhaas, N. Calosci, P. Jemth, G.W. Vuister, C. Travaglini-Allocatelli, M. Vendruscolo, M. Brunori. A PDZ domain recapitulates a unifying mechanism for protein folding
Proc. Natl. Acad. Sci. USA, 104 (2007), pp. 128-133
Gillespie, 1991: J.H. Gillespie. The Causes of Molecular Evolution
Oxford Univ. Press, Oxford (1991)
Grantcharova, Riddle, Santiago, Baker, 1998: V.P. Grantcharova, D.S. Riddle, J.V. Santiago, D. Baker. Important role of hydrogen bonds in the structurally polarized transition state for folding of the src SH3 domain
Nat. Struct. Biol. (1998), pp. 714-720
Guelorget, Roovers, Guérineau, Barbey, Li, Golinelli-Pimpaneau, 2010: A. Guelorget, M. Roovers, V. Guérineau, C. Barbey, X. Li, B. Golinelli-Pimpaneau. Insights into the hyperthermostability and unusual region-specificity of archaeal Pyrococcus abyssi tRNA m¹A57/58 methyltransferase
Nucl. Acid Res., 38 (2010), pp. 6206-6218
Hopf, Colwell, Sheridan, Rost, Sander, Marks, 2012: T.A. Hopf, L.J. Colwell, R. Sheridan, B. Rost, C. Sander, D.S. Marks. Three-dimensional structures of membrane proteins from genomic sequencing
Cell, 149 (2012), pp. 1607-1621
Hopf, Ingraham, Poelwijk, Schärfe, Springer, Sander, Marks, 2017: T.A. Hopf, J.B. Ingraham, F.J. Poelwijk, C.P.I. Schärfe, M. Springer, C. Sander, D.S. Marks. Mutation effects predicted from sequence co-variation
Nat. Biotech., 35 (2017), pp. 128-135
Jacquin, Gilson, Shakhnovich, Cocco, Monasson, 2016: H. Jacquin, A. Gilson, E. Shakhnovich, S. Cocco, R. Monasson. Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models
PLoS Comput. Biol., 12 (2016), e1004889
Kimura, 1968: M. Kimura. Evolutionary rate at the molecular level
Nature, 217 (1968), pp. 624-626
Kimura, 1969: M. Kimura. The rate of molecular evolution considered from the standpoint of population genetics
Proc. Natl. Acad. Sci. USA, 63 (1969), pp. 1181-1188
Kimura, 1983: M. Kimura. The Neutral Theory of Molecular Evolution
Cambridge Univ. Press, Cambridge (1983)
Kimura, Ohta, 1971: M. Kimura, T. Ohta. Protein polymorphism as a phase of molecular evolution
Nature, 229 (1971), pp. 467-469
Kimura, Ohta, 1974: M. Kimura, T. Ohta. On some principles governing molecular evolution
Proc. Natl. Acad. Sci. USA, 71 (1974), pp. 2848-2852
Knapp, Mattson, Christova, Berndt, Karshikoff, Vihinen, Smith, Ladenstein, 1998: S. Knapp, P.T. Mattson, P. Christova, K.D. Berndt, A. Karshikoff, M. Vihinen, C.E. Smith, R. Ladenstein. Thermal unfolding of small proteins with SH3 domain folding pattern
Proteins, 31 (1998), pp. 309-319
Kragelund, Osmark, Neergaard, Schiødt, Kristiansen, Knudsen, Poulsen, 1999: B.B. Kragelund, P. Osmark, T.B. Neergaard, J. Schiødt, K. Kristiansen, J. Knudsen, F.M. Poulsen. The formation of a native-like structure containing eight conserved hydrophobic residues is rate limiting in two-state protein folding of ACBP
Nat. Struct. Biol., 6 (1999), pp. 594-601
Kumar, Bava, Gromiha, Prabakaran, Kitajima, Uedaira, Sarai, 2006: M. Kumar, K. Bava, M. Gromiha, P. Prabakaran, K. Kitajima, H. Uedaira, A. Sarai. ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions
Nucl. Acid Res., 34 (2006), pp. D204-D206
Marks, Colwell, Sheridan, Hopf, Pagnani, Zecchina, Sander, 2011: D.S. Marks, L.J. Colwell, R. Sheridan, T.A. Hopf, A. Pagnani, R. Zecchina, C. Sander. Protein 3D structure computed from evolutionary sequence variation
PLoS ONE, 6 (12) (2011), e28766, doi:10.1371/journal.pone.0028766
Miyata, Yasunaga, 1980: T. Miyata, T. Yasunaga. Molecular evolution of mRNA: a method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its applications
J. Mol. Evol., 16 (1980), pp. 23-36
Miyazawa, 2013: S. Miyazawa. Prediction of contact residue pairs based on co-substitution between sites in protein structures
PLoS ONE, 8 (1) (2013), e54252, doi:10.1371/journal.pone.0054252
Miyazawa, 2016: S. Miyazawa. Selection maintaining protein stability at equilibrium
J. Theor. Biol., 391 (2016), pp. 21-34
Moran, 1958: P.A.P. Moran. Random processes in genetics
Proc. Cambridge Phil. Soc., 54 (1958), pp. 60-71
Morcos, Pagnani, Lunt, Bertolino, Marks, Sander, Zecchina, Onuchic, Hwa, Weigt, 2011: F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D.S. Marks, C. Sander, R. Zecchina, J.N. Onuchic, T. Hwa, M. Weigt. Direct-coupling analysis of residue coevolution captures native contacts across many protein families
Proc. Natl. Acad. Sci. USA, 108 (2011), pp. E1293-E1301
Morcos, Schafer, Cheng, Onuchic, Wolynes, 2014: F. Morcos, N.P. Schafer, R.R. Cheng, J.N. Onuchic, P.G. Wolynes. Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection
Proc. Natl. Acad. Sci. USA, 111 (2014), pp. 12408-12413
Ohta, 1973: T. Ohta. Slightly deleterious mutant substitutions in evolution
Nature, 246 (1973), pp. 96-98
Ohta, 1992: T. Ohta. The nearly neutral theory of molecular evolution
Annu. Rev. Ecol. Syst., 23 (1992), pp. 263-286
Ohta, 2002: T. Ohta. Near-neutrality in evolution of genes and gene regulation
Proc. Natl. Acad. Sci. USA, 99 (2002), pp. 16134-16137
Onuchic, Wolynes, Luthey-Schulten, Socci, 1995: J.N. Onuchic, P.G. Wolynes, Z. Luthey-Schulten, N.D. Socci. Toward an outline of the topography of a realistic protein-folding funnel
Proc. Natl. Acad. Sci. USA, 92 (1995), pp. 3626-3630
Onwukwe, Kursula, Koski, Schmitz, Wierenga, 2014: G.U. Onwukwe, P. Kursula, M.K. Koski, W. Schmitz, R.K. Wierenga. Human Δ³,Δ²-enoyl-CoA isomerase, type 2: a structural enzymology study on the catalytic role of its ACBP domain and helix-10
FEBS J., 282 (2014), pp. 746-768
Pande, Grosberg, Tanaka, 1997: V.S. Pande, A.Y. Grosberg, T. Tanaka. Statistical mechanics of simple models of protein folding and design
Biophys. J., 73 (1997), pp. 3192-3210
Pande, Grosberg, Tanaka, 2000: V.S. Pande, A.Y. Grosberg, T. Tanaka. Heteropolymer freezing and design: towards physical models of protein folding
Rev. Mod. Phys., 72 (2000), pp. 259-314
Parsons, Lin, Orban, 2006: L.M. Parsons, F. Lin, J. Orban. Peptidoglycan recognition by Pal, an outer membrane lipoprotein
Biochemistry, 45 (2006), pp. 2122-2128
Ramanathan, Shakhnovich, 1994: S. Ramanathan, E. Shakhnovich. Statistical mechanics of proteins with evolutionary selected sequences
Phys. Rev. E, 50 (1994), pp. 1303-1312
Rosa, Milardi, Grasso, Guzzi, Sportelli, 1995: C.L. Rosa, D. Milardi, D. Grasso, R. Guzzi, L. Sportelli. Thermodynamics of the thermal unfolding of azurin
J. Phys. Chem., 99 (1995), pp. 14864-14870
Ruiz-Sanz, Simoncsits, Tőrő, Pongor, Mateo, Filimonov, 1999: J. Ruiz-Sanz, A. Simoncsits, I. Tőrő, S. Pongor, P.L. Mateo, V.V. Filimonov. A thermodynamic study of the 434-repressor N-terminal domain and of its covalently linked dimers
Eur. J. Biochem., 263 (1999), pp. 246-253
Sainsbury, Ren, Saunders, Stuart, Owens, 2008: S. Sainsbury, J. Ren, N.J. Saunders, D.I. Stuart, R.J. Owens. Crystallization and preliminary X-ray analysis of CrgA, a LysR-type transcriptional regulator from pathogenic Neisseria meningitidis MC58
Acta Cryst., F64 (2008), pp. 797-801
Serohijos, Rimas, Shakhnovich, 2012: A. Serohijos, Z. Rimas, E. Shakhnovich. Protein biophysics explains why highly abundant proteins evolve slowly
Cell Rep., 2 (2) (2012), pp. 249-256, doi:10.1016/j.celrep.2012.06.022
Shakhnovich, 1994: E.I. Shakhnovich. Proteins with selected sequences fold into unique native conformation
Phys. Rev. Lett., 72 (1994), pp. 3907-3911
Shakhnovich, Gutin, 1993a: E.I. Shakhnovich, A.M. Gutin. Engineering of stable and fast-folding sequences of model proteins
Proc. Natl. Acad. Sci. USA, 90 (1993), pp. 7195-7199
Shakhnovich, Gutin, 1993b: E.I. Shakhnovich, A.M. Gutin. A new approach to the design of stable proteins
Protein Eng., 6 (1993), pp. 793-800
Stupák, Zǒldák, Musatov, Sprinzl, Sedlák, 2006: M. Stupák, G. Zǒldák, A. Musatov, M. Sprinzl, E. Sedlák. Unusual effect of salts on the homodimeric structure of NADH oxidase from Thermus thermophilus in acidic pH
Biochim. Biophys. Acta, 1764 (2006), pp. 129-137
Sułkowska, Morcos, Weigt, Hwa, Onuchic, 2012: J.I. Sułkowska, F. Morcos, M. Weigt, T. Hwa, J.N. Onuchic. Genomics-aided structure prediction
Proc. Natl. Acad. Sci. USA, 109 (2012), pp. 10340-10345
Tokuriki, Stricher, Schymkowitz, Serrano, Tawfik, 2007: N. Tokuriki, F. Stricher, J. Schymkowitz, L. Serrano, D.S. Tawfik. The stability effects of protein mutations appear to be universally distributed
J. Mol. Biol., 369 (2007), pp. 1318-1332
Torchio, Ermácora, Sica, 2012: G.M. Torchio, M.R. Ermácora, M.P. Sica. Equilibrium unfolding of the PDZ domain of β2-syntrophin
Biophys. J., 102 (2012), pp. 2835-2844
Williams, Prosselkov, Liepinsh, Line, Sharipo, Littler, Curmi, Otting, Dixon, 2002: N.K. Williams, P. Prosselkov, E. Liepinsh, I. Line, A. Sharipo, D.R. Littler, P.M.G. Curmi, G. Otting, N.E. Dixon. In vivo protein cyclization promoted by a circularly permuted Synechocystis sp. PCC6803 DnaB mini-intein
J. Biol. Chem., 277 (2002), pp. 7790-7798
Wilson, Wittung-Stafshede, 2005: C.J. Wilson, P. Wittung-Stafshede. Snapshots of a dynamic folding nucleus in zinc-substituted Pseudomonas aeruginosa azurin
Biochemistry, 44 (2005), pp. 10054-10062
