1. Introduction
Natural proteins can fold their sequences into unique structures. Protein’s stability and foldability result from natural selection and are not typical characteristics of random polymers (Bryngelson and Wolynes, 1987; Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b). Natural selection maintains protein’s stability and foldability over evolutionary timescales. On the basis of the random energy model (REM) for protein folding, it was discussed (Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b) that the equilibrium ensemble of natural protein sequences in sequence space is well represented by a canonical ensemble characterized by a Boltzmann factor where is the folding free energy of sequence σ, GN and GD are the free energies of the native and denatured states, kB is the Boltzmann constant, and Ts is the effective temperature representing the strength of selection pressure and must satisfy Ts < Tg < Tm for natural proteins to fold into unique native structures; Tg is glass transition temperature and Tm is melting temperature. The REM also indicates that the free energy of denatured conformations (GD) is a function of amino acid frequencies only and does not depend on amino acid order, and therefore the Boltzmann factor will be taken as if amino acid frequencies are kept constant. It was shown by lattice Monte Carlo simulations (Shakhnovich, 1994) that lattice protein sequences selected with this Boltzmann factor were not trapped by competing structures but could fold into unique native structures. Selective temperatures were also estimated (Dokholyan and Shakhnovich, 2001) for actual proteins to yield good correlations of sequence entropy between actual protein families and sequences designed with this type of Boltzmann factor.
On the other hand, the maximum entropy principle insists that the probability distribution of sequences in sequence space, which satisfies constraints on amino acid compositions at all sites and on amino acid pairwise frequencies for all site pairs, is a Boltzmann distribution with the Boltzmann factor where the total interaction ψN(σ) of a sequence σ is represented as the sum of one-body (h) (compositional) and pairwise (J) (covariational) interactions between residues in the sequence; ψN(σ) is called the evolutionary statistical energy by Hopf et al. (2017). The inverse statistical potentials, the one-body (h) and pairwise (J) interactions, that satisfy those constraints for homologous sequences have been estimated (Ekeberg et al., 2014; 2013; Marks et al., 2011; Morcos et al., 2011) as one of inverse Potts problems and successfully employed to predict contacting residue pairs in protein structures (Ekeberg et al., 2014; 2013; Hopf et al., 2012; Marks et al., 2011; Miyazawa, 2013; Morcos et al., 2011; Sułkowska et al., 2012). Morcos et al. (2014) noticed that the ψN in the Boltzmann factor is the dimensionless energy corresponding to GN/kBTs, and estimated selective temperatures (Ts) for several protein families by comparing the difference (ΔψND) of ψ between the native and the molten globule states with folding free energies (ΔGND) estimated with associative-memory, water-mediated, structure, and energy model (AWSEM) (Davtyan et al., 2012).
A purpose of the present study is to establish relationships between protein foldability/stability, sequence distribution, and protein fitness. First, we prove that if mutation and fixation processes in protein evolution are reversible Markov processes, the equilibrium ensemble of genes will obey a Boltzmann distribution with the Boltzmann factor where Ne and N are effective and actual population sizes, and m is the Malthusian fitness of a gene. In other words, correspondences between and are obtained by equating these three Boltzmann distributions with each other; .
The second purpose is to analyze the effects (ΔψN) of single amino acid substitutions on the evolutionary statistical energy of a protein, and to estimate from the distribution of ΔψN the effective temperature of natural selection (Ts) and then glass transition temperature (Tg) and folding free energy (ΔGND) of protein. We estimate the one-body (h) and pairwise (J) interactions with the DCA program, which is available at “http://dca.rice.edu/portal/dca/home”, and then analyze the changes (ΔψN) of the evolutionary statistical energy (ψN) of a natural sequence due to single amino acid substitutions caused by single nucleotide changes. The data of ΔψN due to single nucleotide nonsynonymous substitutions for 14 protein domains show that the standard deviation of ΔψN over all the substitutions at all sites hardly depends on the evolutionary statistical energy (ψN) of each homologous sequence and is nearly constant for each protein family, indicating that the standard deviation of ΔGN ≃ kBTsΔψN is nearly constant irrespective of protein families. From this finding, Ts for each protein family has been estimated in relative to Ts for the PDZ family, which is determined by directly comparing with the experimental values of folding free energy changes, ΔΔGND, due to single amino acid substitutions. Also Tg and ΔGND for each protein family are estimated on the basis of the REM from the estimated Ts and an experimental melting temperature Tm. The estimates of Ts and Tg are all within a reasonable range, and those of ΔGND are well compared with experimental ΔGND for 5 protein families. The present method for estimating Ts is simpler than the method (Morcos et al., 2014) using AWSEM, and also is useful for the prediction of ΔGND, because the experimental data of ΔGND are limited in comparison with Tm, and also experimental conditions such as temperature and pH tend to be different among them. In addition, it has been revealed that ΔψN averaged over all single nucleotide nonsynonymous substitutions is a linear function of ψN/L of each homologous sequence, where L is sequence length; the average of ΔψN decreases as ψN/L increases. This characteristic is required for homologous proteins to stay at the equilibrium state of the native conformational energy GN ≃ kBTsψN, and indicates a weak dependency (Miyazawa, 2016; Serohijos et al., 2012) of ΔΔGND on ΔGND/L of protein across protein families.
The third purpose is to study an amino acid substitution process in protein evolution, which is characterized by the fitness, . We employ a monoclonal approximation for mutation and fixation processes of genes, in which protein evolution proceeds with single amino acid substitutions fixed at a time in a population. In this approximation, ψN of a protein gene attains the equilibrium, when the average of ΔψN( ≃ ΔΔψND) over singe nucleotide nonsynonymous mutations fixed in a population is equal to zero. Approximating the distribution of ΔψN due to singe nucleotide nonsynonymous mutations by a log-normal distribution, their distribution for fixed mutants is numerically calculated and used to calculate the averages of various quantities and also the probability density functions (PDF) of Ka/Ks in all arising mutants and also in fixed mutants only; Ka/Ks is defined as the ratio of nonsynonymous to synonymous substitution rate per site. There is a good agreement between the time average () and ensemble average (⟨ψN⟩σ), which is equal to the sample average, of ψN over homologous sequences, supporting the constancy of the standard deviation of ΔψN assumed in the monoclonal approximation.
We also study protein evolution at equilibrium, . The common understanding of protein evolution has been that amino acid substitutions observed in homologous proteins are neutral (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) or slightly deleterious (Ohta, 1973; 1992), and random drift is a primary force to fix amino acid substitutions in population. The PDFs of Ka/Ks in all arising mutations and in their fixed mutations are examined to see how significant each of positive, neutral, slightly negative,and negative selections is. Interestingly, stabilizing mutations are significantly fixed in population by positive selection, and balance with destabilizing mutations that are also significantly fixed by random drift, although most negative mutations are removed from population. Contrary to the neutral theory (Kimura, 1968; 1969; Kimura and Ohta, 1971; 1974) and supporting the nearly neutral theory (Ohta, 1973; 1992; 2002), the proportion of neutral selection is not large even in fixed mutants. It is also confirmed that the effective temperature (Ts) of selection negatively correlates with the amino acid substitution rate (Ka/Ks) of protein at equilibrium.
2. Methods
2.1. Knowledge of protein folding
A protein folding theory (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b), which is based on a random energy model (REM), indicates that the equilibrium ensemble of amino acid sequences, where σi is the type of amino acid at site i and L
is sequence length, can be well approximated by a canonical ensemble
with a Boltzmann factor consisting of the folding free energy, ΔGND(σ, T) and an effective temperature Ts representing the strength of selection pressure.
The
distribution of conformational energies in the denatured state (molten
globule state), which consists of conformations as compact as the native
conformation, is approximated in the random energy model (REM),
particularly the independent interaction model (IIM) (Pande et al., 1997),
to be equal to the energy distribution of randomized sequences, which
is then approximated by a Gaussian distribution, in the native
conformation. That is, the partition function Z for the denatured state is written as follows with the energy density n(E)
of conformations that is approximated by a product of a Gaussian
probability density and the total number of conformations whose
logarithm is proportional to the chain length.
2.2. Probability distribution of homologous sequences with the same native fold in sequence space
The probability distribution P(σ) of homologous sequences with the same native fold, where σi ∈ {amino
acids, deletion}, in sequence space with maximum entropy, which
satisfies a given amino acid frequency at each site and a given pairwise
amino acid frequency at each site pair, is a Boltzmann distribution (Marks et al., 2011; Morcos et al., 2011).
Hence, from Eqs. (2) and (9),
2.3. The equilibrium distribution of sequences in a mutation-fixation process
Here we assume that the mutational process is a reversible Markov process. That is, the mutation rate per gene, Mμν, from sequence to ν satisfies the detailed balance condition
For genic selection (no dominance) or gametic selection in a Wright-Fisher population of diploid, the fixation probability, u, of a single mutant gene, the selective advantage of which is equal to s and the frequency of which in a population is equal to was estimated (Crow and Kimura, 1970) as
This Markov process of substitutions in sequence is reversible, and the equilibrium frequency of sequence μ, Peq(μ), in the total process consisting of mutation and fixation processes is represented by
2.4. Relationships between m(σ), ψN(σ), and ΔGND(σ) of protein sequence
From Eqs. (1), (9), and (24), we can get the following relationships among the Malthusian fitness m, the folding free energy change ΔGND and ΔψND of protein sequence.
Eq. (33) indicates that evolutionary statistical energy ψ should be proportional to effective population size Ne, and therefore it is ideal to estimate one-body (h) and two-body (J) interactions from homologous sequences of species that do not significantly differ in effective population size. Also, Eq. (34) indicates that selective temperature Ts is inversely proportional to the effective population size Ne; Ts∝1/Ne, because free energy is a physical quantity and should not depend on effective population size.
2.5. The ensemble average of folding free energy, ΔGND(σ, T), over sequences
The ensemble average of ΔGND(σ, T) over sequences with Eq. (1) is
The ensemble averages of GN and ψN(σ) are estimated in the Gaussian approximation (Pande et al., 1997).
The folding free energy becomes equal to zero at the melting temperature Tm; . Thus, the following relationship must be satisfied (Pande et al., 1997; Ramanathan and Shakhnovich, 1994; Shakhnovich and Gutin, 1993a; 1993b).
2.6. Probability distributions of selective advantage, fixation rate and Ka/Ks
Let
us consider the probability distributions of characteristic quantities
that describe the evolution of genes. First of all, the probability
density function (PDF) of selective advantage s, p(s), of mutant genes can be calculated from the PDF of the change of ΔψND due to a mutation from μ to ν, . The PDF of 4Nes, may be more useful than p(s).
The PDF of fixation probability u can be represented by
The ratio of the substitution rate per nonsynonymous site (Ka) for nonsynonymous substitutions with selective advantage s to the substitution rate per synonymous site (Ks) for synonymous substitutions with s = 0 is
2.7. Probability distributions of ΔΔψND, 4Nes, u, and Ka/Ks in fixed mutant genes
The PDF of ΔΔψND in fixed mutants is proportional to that multiplied by the fixation probability.
3. Materials
3.1. Sequence data
We study the single domains of 8 Pfam (Finn et al., 2016) families and both the single domains and multi-domains from 3 Pfam families. In Table 1, their Pfam ID for a multiple sequence alignment, and UniProt ID and PDB ID with the starting- and ending-residue positions of the domains are listed. The full alignments for their families at the Pfam are used to estimate one-body interactions h and pairwise interactions J with the DCA program from “http://dca.rice.edu/portal/dca/home ” (Marks et al., 2011; Morcos et al., 2011). To estimate the sample () and ensemble (⟨ψN⟩σ) averages of the evolutionary statistical energy, M unique sequences with no deletions are used. In order to reduce phylogenetic biases in the set of homologous sequences, we employ a sample weight () for each sequence, which is equal to the inverse of the number of sequences that are less than 20% different from a given sequence in a given set of homologous sequences. Only representatives of unique sequences with no deletions, which are at least 20% different from each other, are used to calculate the changes of the evolutionary statistical energy (ΔψN) due to single nucleotide nonsynonymous substitutions; the number of the representatives is almost equal to the effective number of sequences (Meff) in Table 1.
Table 1. Protein families, and structures studied.
Pfam family | UniProt ID | Na | Neffb,c | Md | Meffc,e | Lf | PDB ID |
---|---|---|---|---|---|---|---|
HTH_3 | RPC1_BP434/7-59 | 15315(15917) | 11691.21 | 6286 | 4893.73 | 53 | 1R69-A:6-58 |
Nitroreductase | Q97IT9_CLOAB/4-76 | 6008(6084) | 4912.96 | 1057 | 854.71 | 73 | 3E10-A/B:4-76g |
SBP_bac_3h | GLNH_ECOLI/27-244 | 9874(9972) | 7374.96 | 140 | 99.70 | 218 | 1WDN-A:5-222 |
SBP_bac_3 | GLNH_ECOLI/111-204 | 9712(9898) | 7442.85 | 829 | 689.64 | 94 | 1WDN-A:89-182 |
OmpA | PAL_ECOLI/73-167 | 6035(6070) | 4920.44 | 2207 | 1761.24 | 95 | 1OAP-A:52-146 |
DnaB | DNAB_ECOLI/31-128 | 1929(1957) | 1284.94 | 1187 | 697.30 | 98 | 1JWE-A:30-127 |
LysR_substrateh | BENM_ACIAD/90-280 | 25138(25226) | 20707.06 | 85(1) | 67.00 | 191 | 2F6G-A/B:90-280g |
LysR_substrate | BENM_ACIAD/163-265 | 25032(25164) | 21144.74 | 121(1) | 99.27 | 103 | 2F6G-A/B:163-265g |
Methyltransf_5h | RSMH_THEMA/8-292 | 1942(1953) | 1286.67 | 578(2) | 357.97 | 285 | 1N2X-A:8-292 |
Methyltransf_5 | RSMH_THEMA/137-216 | 1877(1911) | 1033.35 | 975(2) | 465.53 | 80 | 1N2X-A:137-216 |
SH3_1 | SRC_HUMAN:90-137 | 9716(16621) | 3842.47 | 1191 | 458.31 | 48 | 1FMK-A:87-134 |
ACBP | ACBP_BOVIN/3-82 | 2130(2526) | 1039.06 | 161 | 70.72 | 80 | 2ABD-A:2-81 |
PDZ | PTN13_MOUSE/1358-1438 | 13814(23726) | 4748.76 | 1255 | 339.99 | 81 | 1GM1-A:16-96 |
Copper-bind | AZUR_PSEAE:24-148 | 1136(1169) | 841.56 | 67(1) | 45.23 | 125 | 5AZU-B/C:4-128g |
- a
The number of unique sequences and the total number of sequences in parentheses; the full alignments in the Pfam (Finn et al., 2016) are used.
- b
The effective number of sequences.
- c
A sample weight () for a given sequence is equal to the inverse of the number of sequences that are less than 20% different from the given sequence.
- d
The number of unique sequences that include no deletion unless specified. The number in parentheses indicates the maximum number of deletions allowed.
- e
The effective number of unique sequences that include no deletion or at most the specified number of deletions.
- f
The number of residues.
- g
Contacts are calculated in the homodimeric state for these protein.
- h
These proteins consist of two domains, and other ones are single domains.
4. Results
First, We describe how one-body and pairwise interactions, h and J, are estimated. Then, the changes of evolutionary statistical energy (ΔψN) due to single nucleotide nonsynonymous changes on natural sequences are analyzed with respect to dependences on the ψN of the wildtype sequences. The results indicate that the standard deviation of ΔGN ≃ kBTsΔψN is almost constant over protein families. Hence, the selective temperatures, Ts, of various protein families can be estimated in a relative scale from the standard deviation of ΔψN. The Ts of a reference protein is estimated by comparing the expected values of ΔΔGND with their experimental values. Folding free energies ΔGND are estimated from estimated Ts and experimental melting temperature Tm, and compared with their experimental values for 5 protein families. Glass transition temperatures Tg are also estimated from Ts and Tm.
Secondly, based on the distribution of ΔψN, protein evolution is studied. Evolutionary statistical energy (ψN) attains the equilibrium when the average of ΔψN over fixed mutations is equal to zero. The PDF of ΔψN is approximated by log-normal distributions. The basic relationships are that 1) the standard deviation of ΔψN is constant specific to a protein family, and 2) the mean of ΔψN linearly depends on ψN. The equilibrium value of ψN is shown to agree with the mean of ψN over homologous proteins in each protein family. In the present approximation, the standard deviation of ΔψN and selective temperature Ts at the equilibrium are simple functions of the equilibrium value of mean ΔψN, . Lastly, the probability distribution of Ka/Ks, which is the ratio of nonsynonymous to synonymous substitution rate per site, is analyzed as a function of in order to examine how significant neutral selection is in the selection maintaining protein stability and foldability. Also, it is confirmed that selective temperature Ts negatively correlates with the mean of Ka/Ks, which represents the evolutionary rate of protein.
4.1. Important parameters in the estimations of one-body and pairwise interactions, h and J, and of the evolutionary statistical energy, ψN(σ)
The one-body (h) and pairwise (J) interactions for amino acid order in a protein sequence are estimated here by the DCA method (Marks et al., 2011; Morcos et al., 2011), although there are multiple methods for estimating them (Ekeberg et al., 2014; 2013). In the case of the DCA method, the ratio of pseudocount (0 ≤ pc ≤ 1) defined in Eqs. (S.70) and (S.71) is a parameter and controls the values of the ensemble and sample averages of ψN in sequence space, ⟨ψN(σ)⟩σ in Eq. (42) and in Eq. (45); a weight for observed counts is defined to be equal to . Sample average means the average over all homologous sequences with a weight for each sequence to reduce phylogenetic biases. An appropriate value must be chosen for the ratio of pseudocount in a reasonable manner.
Another problem is that the estimates of h and J(Marks et al., 2011; Morcos et al., 2011)
may be noisy as a result of estimating many interaction parameters from
a relatively small number of sequences. Therefore, only pairwise
interactions within a certain distance are taken into account; the
estimate of J is modified as follows, according to Morcos et al. (2014).
Candidates for the cutoff distance may be about 8 Å for the first interaction shell and 15–16 Å for the second interaction shell between residues; distance between the centers of side chain atoms is employed for residue distance. Here both the distances are tested for the cutoff distance. Pseudocount in the Bayesian statistics is determined usually as a function of the number of samples (sequences), although the ratio of pseudocount was used for all proteins in the contact prediction (Morcos et al., 2011). Here, an appropriate value for the ratio of pseudocount for the certain cutoff distance, either about 8 Å or 15–16 Å, is chosen for each protein family in such a way that the sample average of the evolutionary statistical energies must be equal to the ensemble average, ; see Eqs. (42) and (46). As shown in Fig. S.1, the value of rcutoff, where is satisfied, monotonously changes as a function of the ratio of pseudocount pc. The values of pc, where is satisfied near the specified values of rcutoff, 8 Å and 15.5 Å, are employed for rcutoff ≃ 8 Å and 15.5 Å, respectively. In the present multiple sequence alignment for the PDZ domain, with the ratios of pseudocount and the sample and ensemble averages agree with each other at the cutoff distances rcutoff ∼ 8 Å and rcutoff ∼ 15.5 Å, respectively; see Fig. S.1. In Fig. S.2, the reflective correlation and regression coefficients between the experimental ΔΔGND (Gianni et al., 2007) and ΔψN due to single amino acid substitutions are plotted against the cutoff distance for pairwise interactions in the PDZ domain. The reflective correlation coefficient has the maximum at the rcutoff ∼ 8 Å for and at rcutoff ∼ 15.5 Å for