## 1. Introduction

Natural
proteins can fold their sequences into unique structures. Protein’s
stability and foldability result from natural selection and are not
typical characteristics of random polymers (Bryngelson and Wolynes, 1987; Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b).
Natural selection maintains protein’s stability and foldability over
evolutionary timescales. On the basis of the random energy model (REM)
for protein folding, it was discussed (Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b)
that the equilibrium ensemble of natural protein sequences in sequence
space is well represented by a canonical ensemble characterized by a
Boltzmann factor $\mathrm{exp}(-\Delta {G}_{ND}\left(\mathit{\sigma}\right)/{k}_{B}{T}_{s}),$ where $\Delta {G}_{ND}\left(\mathit{\sigma}\right)(\equiv {G}_{N}\left(\mathit{\sigma}\right)-{G}_{D}\left(\mathit{\sigma}\right))$ is the folding free energy of sequence ** σ**,

*G*and

_{N}*G*are the free energies of the native and denatured states,

_{D}*k*is the Boltzmann constant, and

_{B}*T*is the effective temperature representing the strength of selection pressure and must satisfy

_{s}*T*<

_{s}*T*<

_{g}*T*for natural proteins to fold into unique native structures;

_{m}*T*is glass transition temperature and

_{g}*T*is melting temperature. The REM also indicates that the free energy of denatured conformations (

_{m}*G*) is a function of amino acid frequencies only and does not depend on amino acid order, and therefore the Boltzmann factor will be taken as $\mathrm{exp}(-{G}_{N}\left(\mathit{\sigma}\right)/{k}_{B}{T}_{s}),$ if amino acid frequencies are kept constant. It was shown by lattice Monte Carlo simulations (Shakhnovich, 1994) that lattice protein sequences selected with this Boltzmann factor were not trapped by competing structures but could fold into unique native structures. Selective temperatures were also estimated (Dokholyan and Shakhnovich, 2001) for actual proteins to yield good correlations of sequence entropy between actual protein families and sequences designed with this type of Boltzmann factor.

_{D}On
the other hand, the maximum entropy principle insists that the
probability distribution of sequences in sequence space, which satisfies
constraints on amino acid compositions at all sites and on amino acid
pairwise frequencies for all site pairs, is a Boltzmann distribution with the Boltzmann factor $\mathrm{exp}(-{\psi}_{N}\left(\mathit{\sigma}\right)),$ where the total interaction *ψ _{N}*(

**) of a sequence**

*σ***is represented as the sum of one-body (**

*σ**h*) (compositional) and pairwise (

*J*) (covariational) interactions between residues in the sequence;

*ψ*(

_{N}**) is called the evolutionary statistical energy by Hopf et al. (2017). The inverse statistical potentials, the one-body (**

*σ**h*) and pairwise (

*J*) interactions, that satisfy those constraints for homologous sequences have been estimated (Ekeberg et al., 2014; 2013; Marks et al., 2011; Morcos et al., 2011) as one of inverse Potts problems and successfully employed to predict contacting residue pairs in protein structures (Ekeberg et al., 2014; 2013; Hopf et al., 2012; Marks et al., 2011; Miyazawa, 2013; Morcos et al., 2011; Sułkowska et al., 2012). Morcos et al. (2014) noticed that the

*ψ*in the Boltzmann factor is the dimensionless energy corresponding to

_{N}*G*/

_{N}*k*, and estimated selective temperatures (

_{B}T_{s}*T*) for several protein families by comparing the difference (

_{s}*Δψ*) of

_{ND}*ψ*between the native and the molten globule states with folding free energies (

*ΔG*) estimated with associative-memory, water-mediated, structure, and energy model (AWSEM) (Davtyan et al., 2012).

_{ND}A
purpose of the present study is to establish relationships between
protein foldability/stability, sequence distribution, and protein
fitness. First, we prove that if mutation and fixation processes in
protein evolution are reversible Markov processes, the equilibrium
ensemble of genes will obey a Boltzmann distribution with the Boltzmann
factor $\mathrm{exp}\left(4{N}_{e}m(1-1/\left(2N\right))\right),$ where *N _{e}* and

*N*are effective and actual population sizes, and

*m*is the Malthusian fitness of a gene. In other words, correspondences between $-\Delta {G}_{ND}/{k}_{B}{T}_{s},$$-\Delta {\psi}_{ND}(\equiv {\psi}_{N}-{\psi}_{D})$ and $4{N}_{e}m(1-1/\left(2N\right))$ are obtained by equating these three Boltzmann distributions with each other; ${\psi}_{D}\simeq {G}_{D}/{k}_{B}{T}_{s}+\text{constant}$.

The second purpose is to analyze the effects (*Δψ _{N}*)
of single amino acid substitutions on the evolutionary statistical
energy of a protein, and to estimate from the distribution of

*Δψ*the effective temperature of natural selection (

_{N}*T*) and then glass transition temperature (

_{s}*T*) and folding free energy (

_{g}*ΔG*) of protein. We estimate the one-body (

_{ND}*h*) and pairwise (

*J*) interactions with the DCA program, which is available at “http://dca.rice.edu/portal/dca/home”, and then analyze the changes (

*Δψ*) of the evolutionary statistical energy (

_{N}*ψ*) of a natural sequence due to single amino acid substitutions caused by single nucleotide changes. The data of

_{N}*Δψ*due to single nucleotide nonsynonymous substitutions for 14 protein domains show that the standard deviation of

_{N}*Δψ*over all the substitutions at all sites hardly depends on the evolutionary statistical energy (

_{N}*ψ*) of each homologous sequence and is nearly constant for each protein family, indicating that the standard deviation of

_{N}*ΔG*≃

_{N}*k*is nearly constant irrespective of protein families. From this finding,

_{B}T_{s}Δψ_{N}*T*for each protein family has been estimated in relative to

_{s}*T*for the PDZ family, which is determined by directly comparing $\Delta \Delta {\psi}_{ND}(\equiv \Delta ({\psi}_{N}-{\psi}_{D})\simeq \Delta {\psi}_{N})$ with the experimental values of folding free energy changes,

_{s}*ΔΔG*, due to single amino acid substitutions. Also

_{ND}*T*and

_{g}*ΔG*for each protein family are estimated on the basis of the REM from the estimated

_{ND}*T*and an experimental melting temperature

_{s}*T*. The estimates of

_{m}*T*and

_{s}*T*are all within a reasonable range, and those of

_{g}*ΔG*are well compared with experimental

_{ND}*ΔG*for 5 protein families. The present method for estimating

_{ND}*T*is simpler than the method (Morcos et al., 2014) using AWSEM, and also is useful for the prediction of

_{s}*ΔG*, because the experimental data of

_{ND}*ΔG*are limited in comparison with

_{ND}*T*, and also experimental conditions such as temperature and pH tend to be different among them. In addition, it has been revealed that

_{m}*Δψ*averaged over all single nucleotide nonsynonymous substitutions is a linear function of

_{N}*ψ*/

_{N}*L*of each homologous sequence, where

*L*is sequence length; the average of

*Δψ*decreases as

_{N}*ψ*/

_{N}*L*increases. This characteristic is required for homologous proteins to stay at the equilibrium state of the native conformational energy

*G*≃

_{N}*k*, and indicates a weak dependency (Miyazawa, 2016; Serohijos et al., 2012) of

_{B}T_{s}ψ_{N}*ΔΔG*on

_{ND}*ΔG*/

_{ND}*L*of protein across protein families.

The third purpose is to study an amino acid substitution process in protein evolution, which is characterized by the fitness, $m=-\Delta {\psi}_{ND}/\left(4{N}_{e}(1-1/\left(2N\right))\right)$. We employ a monoclonal
approximation for mutation and fixation processes of genes, in which
protein evolution proceeds with single amino acid substitutions fixed at
a time in a population. In this approximation, *ψ _{N}* of a protein gene attains the equilibrium, ${\psi}_{N}={\psi}_{N}^{eq},$ when the average of

*Δψ*( ≃

_{N}*ΔΔψ*) over singe nucleotide nonsynonymous mutations fixed in a population is equal to zero. Approximating the distribution of

_{ND}*Δψ*due to singe nucleotide nonsynonymous mutations by a log-normal distribution, their distribution for fixed mutants is numerically calculated and used to calculate the averages of various quantities and also the probability density functions (PDF) of

_{N}*K*/

_{a}*K*in all arising mutants and also in fixed mutants only;

_{s}*K*/

_{a}*K*is defined as the ratio of nonsynonymous to synonymous substitution rate per site. There is a good agreement between the time average (${\psi}_{N}^{eq}$) and ensemble average (⟨

_{s}*ψ*⟩

_{N}_{σ}), which is equal to the sample average, $\overline{{\psi}_{N}},$ of

*ψ*over homologous sequences, supporting the constancy of the standard deviation of

_{N}*Δψ*assumed in the monoclonal approximation.

_{N}We also study protein evolution at equilibrium, ${\psi}_{N}={\psi}_{N}^{eq}$.
The common understanding of protein evolution has been that amino acid
substitutions observed in homologous proteins are neutral (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) or slightly deleterious (Ohta, 1973; 1992), and random drift is a primary force to fix amino acid substitutions in population. The PDFs of *K _{a}*/

*K*in all arising mutations and in their fixed mutations are examined to see how significant each of positive, neutral, slightly negative,and negative selections is. Interestingly, stabilizing mutations are significantly fixed in population by positive selection, and balance with destabilizing mutations that are also significantly fixed by random drift, although most negative mutations are removed from population. Contrary to the neutral theory (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) and supporting the nearly neutral theory (Ohta, 1973; 1992; 2002), the proportion of neutral selection is not large even in fixed mutants. It is also confirmed that the effective temperature (

_{s}*T*) of selection negatively correlates with the amino acid substitution rate (

_{s}*K*/

_{a}*K*) of protein at equilibrium.

_{s}## 2. Methods

### 2.1. Knowledge of protein folding

A protein folding theory (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b), which is based on a random energy model (REM), indicates that the equilibrium ensemble of amino acid sequences, $\mathit{\sigma}\equiv ({\sigma}_{1},\dots ,{\sigma}_{L})$ where *σ _{i}* is the type of amino acid at site

*i*and

*L*is sequence length, can be well approximated by a canonical ensemble with a Boltzmann factor consisting of the folding free energy,

*ΔG*(

_{ND}**,**

*σ**T*) and an effective temperature

*T*representing the strength of selection pressure.

_{s}*p*(

^{mut}**) is the probability of a sequence (**

*σ***) randomly occurring in a mutational process and depends only on the amino acid frequencies**

*σ***(**

*f***),**

*σ**k*is the Boltzmann constant,

_{B}*T*is a growth temperature, and

*G*and

_{N}*G*are the free energies of the native conformation and denatured state, respectively. Selective temperature

_{D}*T*quantifies how strong the folding constraints are in protein evolution, and is specific to the protein structure and function. The free energy

_{s}*G*of the denatured state does not depend on the amino acid order but the amino acid composition,

_{D}**(**

*f***), in a sequence (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b). It is reasonable to assume that mutations independently occur between sites, and therefore the equilibrium frequency of a sequence in the mutational process is equal to the product of the equilibrium frequencies over sites; ${P}^{mut}\left(\mathit{\sigma}\right)={\prod}_{i}{p}^{mut}\left({\sigma}_{i}\right),$ where**

*σ**p*(

^{mut}*σ*) is the equilibrium frequency of

_{i}*σ*at site

_{i}*i*in the mutational process.

The
distribution of conformational energies in the denatured state (molten
globule state), which consists of conformations as compact as the native
conformation, is approximated in the random energy model (REM),
particularly the independent interaction model (IIM) (Pande et al., 1997),
to be equal to the energy distribution of randomized sequences, which
is then approximated by a Gaussian distribution, in the native
conformation. That is, the partition function *Z* for the denatured state is written as follows with the energy density *n*(*E*)
of conformations that is approximated by a product of a Gaussian
probability density and the total number of conformations whose
logarithm is proportional to the chain length.

*ω*is the conformational entropy per residue in the compact denatured state, and $\mathcal{N}(\overline{E}\left(\mathit{f}\left(\mathit{\sigma}\right)\right),\delta {E}^{2}\left(\mathit{f}\left(\mathit{\sigma}\right)\right))$ is the Gaussian probability density with mean $\overline{E}$ and variance

*δE*

^{2}, which depend only on the amino acid composition of the protein sequence. The free energy of the denatured state is approximated as follows.

*δE*

^{2}are estimated as the mean and variance of interaction energies of randomized sequences in the native conformation.

*T*is the glass transition temperature of the protein at which entropy becomes zero (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b); $-\partial {G}_{D}{/\partial T|}_{T={T}_{g}}=0$. The conformational entropy per residue

_{g}*ω*in the compact denatured state can be represented with

*T*; $\omega L=\delta {E}^{2}/\left(2{\left({k}_{B}{T}_{g}\right)}^{2}\right)$. Thus, unless

_{g}*T*<

_{g}*T*, a protein will be trapped at local minima on a rugged free energy landscape before it can fold into a unique native structure.

_{m}### 2.2. Probability distribution of homologous sequences with the same native fold in sequence space

The probability distribution *P*(** σ**) of homologous sequences with the same native fold, $\mathit{\sigma}=({\sigma}_{1},\dots ,{\sigma}_{L})$ where

*σ*∈ {amino acids, deletion}, in sequence space with maximum entropy, which satisfies a given amino acid frequency at each site and a given pairwise amino acid frequency at each site pair, is a Boltzmann distribution (Marks et al., 2011; Morcos et al., 2011).

_{i}*h*and

_{i}*J*are one-body (compositional) and two-body (covariational) interactions and must satisfy the following constraints.

_{ij}*P*(

_{i}*a*) is the frequency of amino acid

_{k}*a*at site

_{k}*i*, and

*P*(

_{ij}*a*) is the frequency of amino acid pair,

_{k}, a_{l}*a*at

_{k}*i*and

*a*at

_{l}*j; a*∈ {amino acids, deletion}. The pairwise interaction matrix

_{k}*J*satisfies ${J}_{ij}({a}_{k},{a}_{l})={J}_{ji}({a}_{l},{a}_{k})$ and ${J}_{ii}({a}_{k},{a}_{l})=0$. Interactions

*h*and

_{i}*J*can be well estimated from a multiple sequence alignment (MSA) in the mean field approximation (Marks et al., 2011; Morcos et al., 2011), or by maximizing a pseudo-likelihood (Ekeberg et al., 2014; 2013). Because

_{ij}*ψ*(

_{N}**) has been estimated under the constraints on amino acid compositions at all sites, only sequences with a given amino acid composition contribute significantly to the partition function, and other sequences may be ignored.**

*σ**δψ*

^{2}are estimated as the mean and variance of

*ψ*over randomized sequences; $\overline{E}\simeq {k}_{B}{T}_{s}\overline{\psi}$ and

_{N}*δE*

^{2}≃ (

*k*)

_{B}T_{s}^{2}

*δψ*

^{2}.

### 2.3. The equilibrium distribution of sequences in a mutation-fixation process

Here we assume that the mutational process is a reversible Markov process. That is, the mutation rate per gene, *M*_{μν}, from sequence $\mathit{\mu}\equiv ({\mu}_{1},\dots ,{\mu}_{L})$ to ** ν** satisfies the detailed balance condition

*P*(

^{mut}**) is the equilibrium frequency of sequence**

*ν***in a mutational process,**

*ν**M*

_{μν}. The mutation rate per population is equal to 2

*NM*

_{μν}for a diploid population, where

*N*is the population size. The substitution rate

*R*

_{μν}from

**to**

*μ***is equal to the product of the mutation rate and the fixation probability with which a single mutant gene becomes to fully occupy the population (Crow and Kimura, 1970).**

*ν**u*(

*s*(

**→**

*μ***)) is the fixation probability of mutants from**

*ν***to**

*μ***the selective advantage of which is equal to**

*ν**s*.

For genic selection (no dominance) or gametic selection in a Wright-Fisher population of diploid, the fixation probability, *u*, of a single mutant gene, the selective advantage of which is equal to *s* and the frequency of which in a population is equal to ${q}_{m}=1/\left(2N\right),$ was estimated (Crow and Kimura, 1970) as

*N*is effective population size. Eq. (21) will be also valid for haploid population if 2

_{e}*N*and 2

_{e}*N*are replaced by

*N*and

_{e}*N*, respectively. Also, for Moran population of haploid, 4

*N*and 2

_{e}*N*should be replaced by

*N*and

_{e}*N*, respectively. Fixation probabilities for various selection models, which are compiled from p. 192 and p. 424–427 of Crow and Kimura (1970) and from Moran (1958) and Ewens (1979), are listed in Table S.7. The selective advantage of a mutant sequence

**to a wildtype**

*ν***is equal to**

*μ**m*(

**) is the Malthusian fitness of a mutant sequence, and**

*ν**m*(

**) is for the wildtype.**

*μ*This Markov process of substitutions in sequence is reversible, and the equilibrium frequency of sequence ** μ**,

*P*(

^{eq}**), in the total process consisting of mutation and fixation processes is represented by**

*μ*### 2.4. Relationships between *m*(*σ*), *ψ*_{N}(*σ*), and *ΔG*_{ND}(*σ*) of protein sequence

*σ*

_{N}

*σ*

_{ND}

*σ*

From Eqs. (1), (9), and (24), we can get the following relationships among the Malthusian fitness *m*, the folding free energy change *ΔG _{ND}* and

*Δψ*of protein sequence.

_{ND}**to**

*ν***is represented as follows for $f\left(\mathit{\mu}\right)=f\left(\mathit{\nu}\right)=\overline{f\left(\mathit{\sigma}\right)}$.**

*μ*Eq. (33) indicates that evolutionary statistical energy *ψ* should be proportional to effective population size *N _{e}*, and therefore it is ideal to estimate one-body (

*h*) and two-body (

*J*) interactions from homologous sequences of species that do not significantly differ in effective population size. Also, Eq. (34) indicates that selective temperature

*T*is inversely proportional to the effective population size

_{s}*N*∝1/

_{e}; T_{s}*N*, because free energy is a physical quantity and should not depend on effective population size.

_{e}### 2.5. The ensemble average of folding free energy, *ΔG*_{ND}(*σ*, *T*), over sequences

_{ND}

*σ*

The ensemble average of *ΔG _{ND}*(

**,**

*σ**T*) over sequences with Eq. (1) is

*σ*_{N}denotes a natural sequence, and $\overline{\mathit{f}\left({\mathit{\sigma}}_{N}\right)}$ denotes the average of amino acid frequencies

**(**

*f***) over homologous sequences. In Eq. (37), the sum over all sequences is approximated by the sum over sequences the amino acid composition of which is the same as that over the natural sequences.**

*σ*_{N}The ensemble averages of *G _{N}* and

*ψ*(

_{N}**) are estimated in the Gaussian approximation (Pande et al., 1997).**

*σ**ΔG*(

_{ND}**,**

*σ**T*) and

*ψ*(

_{N}**) over sequences are observable as the sample averages of**

*σ**ΔG*(

_{ND}**,**

*σ*_{N}*T*) and

*ψ*(

_{N}**) over homologous sequences fixed in protein evolution, respectively.**

*σ*_{N}The folding free energy becomes equal to zero at the melting temperature *T _{m}*; ${\u3008\Delta {G}_{ND}({\mathit{\sigma}}_{N},{T}_{m})\u3009}_{\mathit{\sigma}}=0$. Thus, the following relationship must be satisfied (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b).

### 2.6. Probability distributions of selective advantage, fixation rate and *K*_{a}/*K*_{s}

_{a}

_{s}

Let
us consider the probability distributions of characteristic quantities
that describe the evolution of genes. First of all, the probability
density function (PDF) of selective advantage *s, p*(*s*), of mutant genes can be calculated from the PDF of the change of *Δψ _{ND}* due to a mutation from

**to**

*μ***, $\Delta \Delta {\psi}_{ND}(\equiv \Delta {\psi}_{ND}(\mathit{\nu},T)-\Delta {\psi}_{ND}(\mathit{\mu},T))$. The PDF of 4**

*ν**N*, $p\left(4{N}_{e}s\right)=p\left(s\right)/\left(4Ne\right),$ may be more useful than

_{e}s*p*(

*s*).

*ΔΔψ*must be regarded as a function of 4

_{ND}*N*, that is, $\Delta \Delta {\psi}_{ND}=-4{N}_{e}s(1-{q}_{m})$; see Eq. (33).

_{e}sThe PDF of fixation probability *u* can be represented by

*N*must be regarded as a function of

_{e}s*u*.

The ratio of the substitution rate per nonsynonymous site (*K _{a}*) for nonsynonymous substitutions with selective advantage s to the substitution rate per synonymous site (

*K*) for synonymous substitutions with s = 0 is

_{s}*K*/

_{a}*K*is

_{s}### 2.7. Probability distributions of *ΔΔψ*_{ND}, 4*N*_{e}s, u, and *K*_{a}/*K*_{s} in fixed mutant genes

_{ND}

_{e}s, u

_{a}

_{s}

The PDF of *ΔΔψ _{ND}* in fixed mutants is proportional to that multiplied by the fixation probability.

*u*and

*K*/

_{a}*K*in fixed mutants are

_{s}*K*/

_{a}*K*in fixed mutants is equal to the ratio of the second moment to the first moment of

_{s}*K*/

_{a}*K*in all arising mutants; ${\u3008{K}_{a}/{K}_{s}\u3009}_{fixed}=\u3008{({K}_{a}/{K}_{s})}^{2}\u3009/\u3008{K}_{a}/{K}_{s}\u3009$.

_{s}## 3. Materials

### 3.1. Sequence data

We study the single domains of 8 Pfam (Finn et al., 2016) families and both the single domains and multi-domains from 3 Pfam families. In Table 1,
their Pfam ID for a multiple sequence alignment, and UniProt ID and PDB
ID with the starting- and ending-residue positions of the domains are
listed. The full alignments for their families at the Pfam are used to
estimate one-body interactions *h* and pairwise interactions *J* with the DCA program from “http://dca.rice.edu/portal/dca/home ” (Marks et al., 2011; Morcos et al., 2011). To estimate the sample ($\overline{{\psi}_{N}}$) and ensemble (⟨*ψ _{N}*⟩

_{σ}) averages of the evolutionary statistical energy,

*M*unique sequences with no deletions are used. In order to reduce phylogenetic biases in the set of homologous sequences, we employ a sample weight (${w}_{{\mathit{\sigma}}_{N}}$) for each sequence, which is equal to the inverse of the number of sequences that are less than 20% different from a given sequence in a given set of homologous sequences. Only representatives of unique sequences with no deletions, which are at least 20% different from each other, are used to calculate the changes of the evolutionary statistical energy (

*Δψ*) due to single nucleotide nonsynonymous substitutions; the number of the representatives is almost equal to the effective number of sequences (

_{N}*M*) in Table 1.

_{eff}Pfam family | UniProt ID | N^{a} | N_{eff}^{b,c} | M^{d} | M_{eff}^{c,e} | L^{f} | PDB ID |
---|---|---|---|---|---|---|---|

HTH_3 | RPC1_BP434/7-59 | 15315(15917) | 11691.21 | 6286 | 4893.73 | 53 | 1R69-A:6-58 |

Nitroreductase | Q97IT9_CLOAB/4-76 | 6008(6084) | 4912.96 | 1057 | 854.71 | 73 | 3E10-A/B:4-76^{g} |

SBP_bac_3^{h} | GLNH_ECOLI/27-244 | 9874(9972) | 7374.96 | 140 | 99.70 | 218 | 1WDN-A:5-222 |

SBP_bac_3 | GLNH_ECOLI/111-204 | 9712(9898) | 7442.85 | 829 | 689.64 | 94 | 1WDN-A:89-182 |

OmpA | PAL_ECOLI/73-167 | 6035(6070) | 4920.44 | 2207 | 1761.24 | 95 | 1OAP-A:52-146 |

DnaB | DNAB_ECOLI/31-128 | 1929(1957) | 1284.94 | 1187 | 697.30 | 98 | 1JWE-A:30-127 |

LysR_substrate^{h} | BENM_ACIAD/90-280 | 25138(25226) | 20707.06 | 85(1) | 67.00 | 191 | 2F6G-A/B:90-280^{g} |

LysR_substrate | BENM_ACIAD/163-265 | 25032(25164) | 21144.74 | 121(1) | 99.27 | 103 | 2F6G-A/B:163-265^{g} |

Methyltransf_5^{h} | RSMH_THEMA/8-292 | 1942(1953) | 1286.67 | 578(2) | 357.97 | 285 | 1N2X-A:8-292 |

Methyltransf_5 | RSMH_THEMA/137-216 | 1877(1911) | 1033.35 | 975(2) | 465.53 | 80 | 1N2X-A:137-216 |

SH3_1 | SRC_HUMAN:90-137 | 9716(16621) | 3842.47 | 1191 | 458.31 | 48 | 1FMK-A:87-134 |

ACBP | ACBP_BOVIN/3-82 | 2130(2526) | 1039.06 | 161 | 70.72 | 80 | 2ABD-A:2-81 |

PDZ | PTN13_MOUSE/1358-1438 | 13814(23726) | 4748.76 | 1255 | 339.99 | 81 | 1GM1-A:16-96 |

Copper-bind | AZUR_PSEAE:24-148 | 1136(1169) | 841.56 | 67(1) | 45.23 | 125 | 5AZU-B/C:4-128^{g} |

- a
The number of unique sequences and the total number of sequences in parentheses; the full alignments in the Pfam (Finn et al., 2016) are used.

- b
The effective number of sequences.

- c
A sample weight (${w}_{{\mathbf{\sigma}}_{N}}$) for a given sequence is equal to the inverse of the number of sequences that are less than 20% different from the given sequence.

- d
The number of unique sequences that include no deletion unless specified. The number in parentheses indicates the maximum number of deletions allowed.

- e
The effective number of unique sequences that include no deletion or at most the specified number of deletions.

- f
The number of residues.

- g
Contacts are calculated in the homodimeric state for these protein.

- h
These proteins consist of two domains, and other ones are single domains.

## 4. Results

First, We describe how one-body and pairwise interactions, *h* and *J*, are estimated. Then, the changes of evolutionary statistical energy (*Δψ _{N}*) due to single nucleotide nonsynonymous changes on natural sequences are analyzed with respect to dependences on the

*ψ*of the wildtype sequences. The results indicate that the standard deviation of

_{N}*ΔG*≃

_{N}*k*is almost constant over protein families. Hence, the selective temperatures,

_{B}T_{s}Δψ_{N}*T*, of various protein families can be estimated in a relative scale from the standard deviation of

_{s}*Δψ*. The

_{N}*T*of a reference protein is estimated by comparing the expected values of

_{s}*ΔΔG*with their experimental values. Folding free energies

_{ND}*ΔG*are estimated from estimated

_{ND}*T*and experimental melting temperature

_{s}*T*, and compared with their experimental values for 5 protein families. Glass transition temperatures

_{m}*T*are also estimated from

_{g}*T*and

_{s}*T*.

_{m}Secondly, based on the distribution of *Δψ _{N}*, protein evolution is studied. Evolutionary statistical energy (

*ψ*) attains the equilibrium when the average of

_{N}*Δψ*over fixed mutations is equal to zero. The PDF of

_{N}*Δψ*is approximated by log-normal distributions. The basic relationships are that 1) the standard deviation of

_{N}*Δψ*is constant specific to a protein family, and 2) the mean of

_{N}*Δψ*linearly depends on

_{N}*ψ*. The equilibrium value of

_{N}*ψ*is shown to agree with the mean of

_{N}*ψ*over homologous proteins in each protein family. In the present approximation, the standard deviation of

_{N}*Δψ*and selective temperature

_{N}*T*at the equilibrium are simple functions of the equilibrium value of mean

_{s}*Δψ*, ${\overline{\Delta {\psi}_{N}}}^{eq}$. Lastly, the probability distribution of

_{N}*K*/

_{a}*K*, which is the ratio of nonsynonymous to synonymous substitution rate per site, is analyzed as a function of ${\overline{\Delta {\psi}_{N}}}^{eq},$ in order to examine how significant neutral selection is in the selection maintaining protein stability and foldability. Also, it is confirmed that selective temperature

_{s}*T*negatively correlates with the mean of

_{s}*K*/

_{a}*K*, which represents the evolutionary rate of protein.

_{s}### 4.1. Important parameters in the estimations of one-body and pairwise interactions, *h* and *J*, and of the evolutionary statistical energy, *ψ*_{N}(*σ*)

_{N}

*σ*

The one-body (*h*) and pairwise (*J*) interactions for amino acid order in a protein sequence are estimated here by the DCA method (Marks et al., 2011; Morcos et al., 2011), although there are multiple methods for estimating them (Ekeberg et al., 2014; 2013). In the case of the DCA method, the ratio of pseudocount (0 ≤ *p _{c}* ≤ 1) defined in Eqs. (S.70) and (S.71) is a parameter and controls the values of the ensemble and sample averages of

*ψ*in sequence space, ⟨

_{N}*ψ*(

_{N}**)⟩**

*σ*_{σ}in Eq. (42) and $\overline{{\psi}_{N}\left({\mathit{\sigma}}_{N}\right)}$ in Eq. (45); a weight for observed counts is defined to be equal to $(1-{p}_{c})$. Sample average means the average over all homologous sequences with a weight for each sequence to reduce phylogenetic biases. An appropriate value must be chosen for the ratio of pseudocount in a reasonable manner.

Another problem is that the estimates of *h* and *J*(Marks et al., 2011; Morcos et al., 2011)
may be noisy as a result of estimating many interaction parameters from
a relatively small number of sequences. Therefore, only pairwise
interactions within a certain distance are taken into account; the
estimate of *J* is modified as follows, according to Morcos et al. (2014).

*J*is the statistical estimate of

^{q}*J*in the mean field approximation in which the amino acid

*a*is the reference state,

_{q}*H*is the Heaviside step function, and

*r*is the distance between the centers of amino acid side chains at sites

_{ij}*i*and

*j*in a protein structure, and

*r*is a distance threshold for residue pairwise interactions. The one-body interactions

_{cutoff}*h*(

_{i}*a*) are estimated in the isolated two-state model (Morcos et al., 2011) rather than the mean field approximation; see the Method section in the Supplement for details. The zero-sum gauge is employed to represent

_{k}*h*and

*J*; ${\sum}_{k}{\widehat{h}}_{i}^{s}\left({a}_{k}\right)={\sum}_{k}{\sum}_{l}{\widehat{J}}_{ij}^{s}({a}_{k},{a}_{l})=0$ in the zero-sum gauge.

Candidates
for the cutoff distance may be about 8 Å for the first interaction
shell and 15–16 Å for the second interaction shell between
residues; distance between the centers of side chain atoms is employed
for residue distance. Here both the distances are tested for the cutoff
distance. Pseudocount in the Bayesian statistics is determined usually
as a function of the number of samples (sequences), although the ratio
of pseudocount ${p}_{c}=0.5$ was used for all proteins in the contact prediction (Morcos et al., 2011).
Here, an appropriate value for the ratio of pseudocount for the certain
cutoff distance, either about 8 Å or 15–16 Å, is chosen for
each protein family in such a way that the sample average of the
evolutionary statistical energies must be equal to the ensemble average,
$\overline{{\psi}_{N}}={\u3008{\psi}_{N}\u3009}_{\mathit{\sigma}}$; see Eqs. (42) and (46). As shown in Fig. S.1, the value of *r _{cutoff}*, where $\overline{{\psi}_{N}}={\u3008{\psi}_{N}\u3009}_{\mathit{\sigma}}$ is satisfied, monotonously changes as a function of the ratio of pseudocount

*p*. The values of

_{c}*p*, where $\overline{{\psi}_{N}}={\u3008{\psi}_{N}\u3009}_{\mathit{\sigma}}$ is satisfied near the specified values of

_{c}*r*, 8 Å and 15.5 Å, are employed for

_{cutoff}*r*≃ 8 Å and 15.5 Å, respectively. In the present multiple sequence alignment for the PDZ domain, with the ratios of pseudocount ${p}_{c}=0.205$ and ${p}_{c}=0.33,$ the sample and ensemble averages agree with each other at the cutoff distances

_{cutoff}*r*∼ 8 Å and

_{cutoff}*r*∼ 15.5 Å, respectively; see Fig. S.1. In Fig. S.2, the reflective correlation and regression coefficients between the experimental

_{cutoff}*ΔΔG*(Gianni et al., 2007) and

_{ND}*Δψ*due to single amino acid substitutions are plotted against the cutoff distance for pairwise interactions in the PDZ domain. The reflective correlation coefficient has the maximum at the

_{N}*r*∼ 8 Å for ${p}_{c}=0.205$ and at

_{cutoff}*r*∼ 15.5 Å for

_{cutoff}