15.12.01 · orgchem / biomolecules-aa-protein

Amino acids and protein chemistry


Anchor (Master): Carey & Sundberg — Advanced Organic Chemistry Part A, peptide chemistry sections; Voet & Voet — Biochemistry Ch. 4–6; Walsh — Tetrahedron report on peptide synthesis

Intuition [Beginner]

Proteins are long chains of amino acids, and amino acids are small molecules with a distinctive feature: every one of the twenty standard amino acids has an amino group ($-\mathrm{NH_2}$) and a carboxylic acid group ($-\mathrm{COOH}$) attached to the same carbon, called the alpha carbon. What makes each amino acid different is the side chain, the one substituent on the alpha carbon that varies.

At neutral pH, amino acids exist as zwitterions: the amino group picks up a proton to become $-\mathrm{NH_3^+}$ and the carboxyl group loses a proton to become $-\mathrm{COO^-}$. The molecule carries zero net charge but has both a positive and a negative end. This dual nature makes amino acids water-soluble and lets them act as buffers at pH values near the pKa of either group.

When two amino acids join, the carboxyl group of one reacts with the amino group of the next, losing a molecule of water and forming a peptide bond ($-\mathrm{CO}-\mathrm{NH}-$). A chain of many amino acids linked by peptide bonds is a polypeptide, and when a polypeptide folds into a functional three-dimensional shape, it is called a protein.

The twenty standard amino acids sort into groups by side-chain chemistry. Some side chains are nonpolar (glycine, alanine, valine, leucine, isoleucine, proline, phenylalanine, methionine, tryptophan) and tend to be buried in the protein interior away from water. Others are polar but uncharged (serine, threonine, asparagine, glutamine, tyrosine, cysteine) and prefer the protein surface. Three are positively charged at physiological pH (lysine, arginine, and, only partially, histidine) and two are negatively charged (aspartate, glutamate). These charged side chains usually sit on the protein surface, interacting with water and with other molecules.

Proteins organise into four structural levels. Primary structure is the linear amino acid sequence. Secondary structure is the local folding pattern — alpha helices and beta sheets — held together by hydrogen bonds between backbone amide groups. Tertiary structure is the full three-dimensional fold of a single polypeptide chain, stabilised by hydrogen bonds, hydrophobic interactions, ionic interactions, and disulfide bridges (cysteine-cysteine crosslinks). Quaternary structure is the arrangement of multiple polypeptide chains (subunits) into a functional complex.

Visual [Beginner]

An amino acid at physiological pH looks like a cross: the central alpha carbon has four groups radiating out — the amino group (positive), the carboxylate (negative), a hydrogen, and the side chain (R group, unique to each amino acid).

The general structure of an amino acid zwitterion at pH 7: central alpha carbon bonded to NH3+ (top-left), COO- (top-right), H (bottom-left), and R side chain (bottom-right). Below: the peptide bond formation between two amino acids, showing loss of water and the resulting amide linkage.

Worked example [Beginner]

Draw the structure of alanine at pH 2, 7, and 12. Explain zwitterion formation.

Alanine has the side chain $-\mathrm{CH_3}$, making it the simplest chiral amino acid.

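At pH 2 (acidic). Both the amino group and the carboxyl group are protonated in the major species. The structure is $\mathrm{H_3N^+}{-}\mathrm{CH(CH_3)}{-}\mathrm{COOH}$, with a net charge of +1. The pH lies below the pKa of the carboxyl group (pKa ~2.3 for alanine), so most molecules retain the carboxyl proton; because pH 2 is only about 0.3 units below that pKa, a sizeable minority is already deprotonated.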

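At pH 7 (physiological). The carboxyl group has lost its proton ($-\mathrm{COO^-}$) and the amino group retains its proton ($-\mathrm{NH_3^+}$). The structure is the zwitterion $\mathrm{H_3N^+}{-}\mathrm{CH(CH_3)}{-}\mathrm{COO^-}$, with net charge 0. This species is at its maximum at the isoelectric point (pI = 6.01 for alanine) and still accounts for nearly all molecules at pH 7.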

At pH 12 (basic). Both groups are deprotonated. The amino group loses its proton ($-\mathrm{NH_2}$) and the carboxyl remains deprotonated ($-\mathrm{COO^-}$). The structure is $\mathrm{H_2N}{-}\mathrm{CH(CH_3)}{-}\mathrm{COO^-}$, with net charge -1. The solution is well above the pKa of the amino group (pKa ~9.9 for alanine).

The zwitterion at pH 7 exists because the two pKa values (carboxyl ~2.3, amino ~9.9) bracket the pH 7 region. Between these two values, one group is protonated and the other is deprotonated, giving the internally neutralised zwitterion.
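The three cases above can be checked numerically. The following is a minimal sketch that treats the two groups as independent Henderson-Hasselbalch sites and uses the approximate pKa values quoted in this example (2.3 and 9.9); the function names are illustrative, not from any library.

```python
def fraction_protonated(pH: float, pKa: float) -> float:
    """Henderson-Hasselbalch: fraction of a group present in its protonated form."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


def alanine_net_charge(pH: float, pKa_carboxyl: float = 2.3, pKa_amino: float = 9.9) -> float:
    """Average net charge of alanine (approximate pKa values from the text)."""
    # A protonated amino group (-NH3+) contributes +1.
    positive = fraction_protonated(pH, pKa_amino)
    # A deprotonated carboxyl group (-COO-) contributes -1.
    negative = 1.0 - fraction_protonated(pH, pKa_carboxyl)
    return positive - negative


for pH in (2.0, 7.0, 12.0):
    print(f"pH {pH:4.1f}: average net charge {alanine_net_charge(pH):+.2f}")
```

Under these assumptions the average charge at pH 2 comes out noticeably below +1, because pH 2 sits close to the carboxyl pKa; the integer charges in the worked example describe the dominant species, not the population average.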

Check your understanding [Beginner]

Formal definition [Intermediate+]

The twenty standard amino acids are alpha-amino acids with the general structure $\mathrm{H_2N}{-}\mathrm{CH(R)}{-}\mathrm{COOH}$, where R is the side chain. Ribosomal protein synthesis uses the L-enantiomer exclusively (glycine, with R = H, is achiral). The amino acids are classified by side-chain properties:

Nonpolar, aliphatic: Gly (G), Ala (A), Val (V), Leu (L), Ile (I), Pro (P), Met (M). Side chains are hydrocarbons (or a thioether for Met). Proline is unique: its side chain bonds back to the amino nitrogen, making it a secondary amine and restricting backbone flexibility.

Aromatic: Phe (F), Tyr (Y), Trp (W). Phenylalanine is purely hydrophobic; tyrosine has a phenol OH (polar); tryptophan has an indole NH (weakly polar).

Polar, uncharged: Ser (S), Thr (T), Cys (C), Asn (N), Gln (Q). Side chains contain OH, SH, or amide groups that hydrogen-bond but do not ionise at physiological pH.

Positively charged: Lys (K), Arg (R), His (H). Lysine has a primary amine (pKa ~10.5); arginine has a guanidinium group (pKa ~12.5); histidine has an imidazole (pKa ~6.0, partially protonated at pH 7).

Negatively charged: Asp (D), Glu (E). Aspartate has a beta-carboxyl (pKa ~3.9); glutamate has a gamma-carboxyl (pKa ~4.3). Both are deprotonated and negatively charged at pH 7.

Zwitterion equilibrium and isoelectric point. For a neutral amino acid (no ionisable side chain) with carboxyl $\mathrm{p}K_{a1}$ and amino $\mathrm{p}K_{a2}$, the isoelectric point is:

$$\mathrm{pI} = \frac{\mathrm{p}K_{a1} + \mathrm{p}K_{a2}}{2}$$

At pH = pI, the net charge is zero. For acidic amino acids (Asp, Glu), pI is the average of the two carboxyl pKa values; for basic amino acids (Lys, Arg, His), pI is the average of the alpha-amino pKa and the side-chain pKa.

Peptide bond. The amide bond formed between amino acids is planar because resonance between the nitrogen lone pair and the carbonyl gives the C-N bond partial double-bond character. This restricts rotation about the C-N bond and makes the trans configuration strongly preferred (~1000:1 for most residues; the preference is much weaker for the bond preceding proline).

Protein structural hierarchy.

Primary structure: the amino acid sequence, written N-terminus to C-terminus.

Secondary structure: local regular structures stabilised by backbone hydrogen bonds. The alpha helix has 3.6 residues per turn, with hydrogen bonds between the CO of residue $i$ and the NH of residue $i+4$. Beta sheets are extended strands hydrogen-bonded to neighbouring strands (parallel or antiparallel).

Tertiary structure: the full 3D fold of a single chain, stabilised by hydrophobic interactions, hydrogen bonds, ionic interactions, van der Waals forces, and disulfide bonds (Cys-Cys).

Quaternary structure: the arrangement of multiple polypeptide chains (subunits) into a functional assembly. Haemoglobin (four subunits) and DNA polymerase (multiple subunits) are examples.

Counterexamples to common slips

  • Amino acids are not zwitterions at every pH. At very low pH (below the carboxyl pKa), the amino acid has a net positive charge. At very high pH (above the amino pKa), it has a net negative charge. The zwitterion is dominant only between the two pKa values.

  • Proline disrupts alpha helices. Because proline's side chain bonds to the backbone nitrogen, it lacks an amide hydrogen for hydrogen bonding and introduces a kink. Proline is often found at helix termini or in turns, not in the middle of helices.

  • Disulfide bonds are not hydrogen bonds. Disulfide bridges ($-\mathrm{S}-\mathrm{S}-$) form by oxidation of two cysteine thiol groups. They are covalent bonds, not the weak hydrogen bonds that stabilise secondary structure. They are the strongest single type of interaction stabilising tertiary structure and are broken only by reduction.

Key theorem with proof [Intermediate+]

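Proposition (Isoelectric point calculation). For an amino acid with two ionisable groups (carboxyl pKa = $\mathrm{p}K_{a1}$, amino pKa = $\mathrm{p}K_{a2}$, with $\mathrm{p}K_{a1} < \mathrm{p}K_{a2}$), the isoelectric point is $\mathrm{pI} = (\mathrm{p}K_{a1} + \mathrm{p}K_{a2})/2$.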

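Proof. The net charge as a function of pH is determined by the fractional protonation of each group. The carboxyl group has fractional protonation $f_1 = 1/(1 + 10^{\mathrm{pH} - \mathrm{p}K_{a1}})$, and the amino group has $f_2 = 1/(1 + 10^{\mathrm{pH} - \mathrm{p}K_{a2}})$, where $\mathrm{p}K_{a1}$ and $\mathrm{p}K_{a2}$ are the carboxyl and amino pKa values. The protonated amino group contributes $+1$ and the deprotonated carboxyl contributes $-1$, so the net charge is:

$$z(\mathrm{pH}) = f_2 - (1 - f_1) = f_1 + f_2 - 1$$

Setting $z = 0$: $f_1 + f_2 = 1$. Clearing denominators reduces this condition to $10^{\mathrm{pH} - \mathrm{p}K_{a1}} \cdot 10^{\mathrm{pH} - \mathrm{p}K_{a2}} = 1$. At the midpoint pH = pI between the two pKa values, $\mathrm{pH} - \mathrm{p}K_{a1} = \mathrm{p}K_{a2} - \mathrm{pH}$, which gives $\mathrm{pI} = (\mathrm{p}K_{a1} + \mathrm{p}K_{a2})/2$. At this pH, $f_1 \approx 0$ (carboxyl deprotonated) and $f_2 \approx 1$ (amino protonated), so the net charge is indeed $z = 0$.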

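For amino acids with ionisable side chains (Asp, Glu, Lys, Arg, His, Cys, Tyr), the pI is the average of the two pKa values that bracket the net-neutral (zwitterionic) species, that is, the pKa values of the two ionisations on either side of the form with zero net charge.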
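The averaging rule can be cross-checked by solving for the pH of zero net charge directly. The sketch below sums independent Henderson-Hasselbalch sites and finds the root by bisection. The side-chain pKa values (Asp 3.9, Lys 10.5) are the ones quoted earlier in this section; the alpha-group values for Asp and Lys are rounded, illustrative assumptions, and the helper names are hypothetical.

```python
def net_charge(pH, acidic_pKas, basic_pKas):
    """Average net charge from independent titratable sites.

    acidic_pKas: groups that are neutral when protonated (COOH, SH, phenol OH).
    basic_pKas:  groups that carry +1 when protonated (NH3+, guanidinium, imidazolium).
    """
    charge = 0.0
    for pKa in acidic_pKas:
        charge -= 1.0 / (1.0 + 10.0 ** (pKa - pH))  # deprotonated fraction carries -1
    for pKa in basic_pKas:
        charge += 1.0 / (1.0 + 10.0 ** (pH - pKa))  # protonated fraction carries +1
    return charge


def isoelectric_point(acidic_pKas, basic_pKas, lo=0.0, hi=14.0, tol=1e-6):
    """Bisection on pH; valid because net charge falls monotonically as pH rises."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if net_charge(mid, acidic_pKas, basic_pKas) > 0:
            lo = mid  # still net positive: pI lies at higher pH
        else:
            hi = mid
    return 0.5 * (lo + hi)


# (acidic pKas, basic pKas); alpha-group values for Asp and Lys are assumed.
examples = {
    "Ala": ([2.3], [9.9]),
    "Asp": ([2.0, 3.9], [9.8]),
    "Lys": ([2.2], [9.0, 10.5]),
}
for name, (acidic, basic) in examples.items():
    print(f"{name}: pI = {isoelectric_point(acidic, basic):.2f}")
```

With these inputs the numerical root lands on the average of the two pKa values that bracket the neutral species (the two carboxyls for Asp, the two amines for Lys), which is the content of the rule stated above.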

Bridge. The pI calculation builds toward 15.07.01 carbonyl nucleophilic addition, where the nucleophilic amino group of one amino acid attacks the electrophilic carbonyl of another to form the peptide bond, the amide linkage at the heart of every protein. The foundational reason the peptide bond resists rotation is that nitrogen lone-pair delocalisation into the carbonyl $\pi^*$ orbital gives the C-N bond roughly 40% double-bond character. This is the structural constraint that underlies protein secondary structure: with the amide held planar, backbone flexibility is confined to the torsion angles $\phi$ and $\psi$, which are themselves limited to the allowed regions of the Ramachandran plot. The zwitterion equilibrium appears again in 15.14.01 pending enzyme mechanism, where the Ser-His-Asp catalytic triad exploits the same proton-transfer equilibria described by the pKa calculations above. The bridge is between the acid-base properties of individual amino acids treated here and the catalytic machinery of enzymes built from those same amino acids.

Exercises [Intermediate+]

Side-chain reactivity and post-translational modifications [Master]

The chemical diversity of the twenty proteinogenic amino acids derives from ten distinct categories of side-chain functionality: aliphatic hydrocarbons (Gly, Ala, Val, Leu, Ile, Pro), thioethers (Met), aromatic rings (Phe, Tyr, Trp), hydroxyl groups (Ser, Thr, Tyr), thiols (Cys), amides (Asn, Gln), carboxylates (Asp, Glu), primary amines (Lys), guanidinium (Arg), and imidazole (His). Each category presents a characteristic reactivity profile that determines how the side chain participates in protein folding, enzyme catalysis, and covalent modification.

Nucleophilic side chains (Cys, Ser, Thr, Lys, His) bear lone pairs available for covalent bond formation. Cysteine's thiol ($\mathrm{p}K_a \approx 8.3$) is the most nucleophilic under physiological conditions because sulfur's large, polarisable electron cloud stabilises the transition state for nucleophilic attack. Serine and threonine hydroxyls are weaker nucleophiles but become powerful when activated within enzyme active sites (the serine protease catalytic triad Ser-His-Asp deprotonates the serine OH, generating an alkoxide nucleophile that attacks the peptide carbonyl). Lysine's $\varepsilon$-amino group ($\mathrm{p}K_a \approx 10.5$) is mostly protonated at pH 7 but can act as a nucleophile in the unprotonated fraction; in PLP-dependent enzymes the lysine forms a Schiff base (imine) with the pyridoxal phosphate cofactor. Histidine's imidazole ($\mathrm{p}K_a \approx 6.0$ for the free side chain, often shifted towards 7 inside proteins) titrates close enough to physiological pH that both its protonated and neutral forms are populated. This makes it an effective general acid-base catalyst: it can donate or accept a proton at near-neutral pH, which is why His appears in the catalytic machinery of most enzyme classes.
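How much of each side chain is in its deprotonated, nucleophilic form at a given pH follows from the Henderson-Hasselbalch relation. The sketch below uses the Lys and His pKa values quoted in this unit and an assumed typical value of 8.3 for the Cys thiol; protein microenvironments can shift all of these by a unit or more, so the output describes free side chains only.

```python
def fraction_deprotonated(pH: float, pKa: float) -> float:
    """Fraction of a titratable group in its deprotonated (conjugate base) form."""
    return 1.0 / (1.0 + 10.0 ** (pKa - pH))


# Nominal free side-chain pKa values; the Cys value is an assumed typical figure.
SIDE_CHAIN_PKA = {
    "Cys thiol": 8.3,
    "Lys epsilon-amine": 10.5,
    "His imidazole": 6.0,
}

for pH in (7.0, 7.4):
    print(f"pH {pH}")
    for group, pKa in SIDE_CHAIN_PKA.items():
        base = fraction_deprotonated(pH, pKa)
        print(f"  {group:18s} deprotonated {base:7.2%}  protonated {1 - base:7.2%}")
```

A group is exactly half protonated only when the pH equals its pKa, so an imidazole sits at the 50% point only if its environment has shifted its pKa to the local pH.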

Post-translational modifications (PTMs) exploit these reactive side chains to diversify protein function beyond what the genome directly encodes. The most widespread PTMs include:

Phosphorylation of Ser, Thr, and Tyr residues by protein kinases attaches a dianionic phosphate monoester ($-\mathrm{OPO_3^{2-}}$) to the hydroxyl oxygen. The kinase-catalysed reaction transfers the $\gamma$-phosphate of ATP to the substrate hydroxyl through an in-line displacement at phosphorus, mechanistically analogous to an SN2 reaction at a tetrahedral phosphorus centre. Phosphorylation introduces two negative charges where there was a neutral hydroxyl, disrupting local electrostatic interactions and creating a binding surface for phospho-recognition domains (SH2 domains bind phosphotyrosine; 14-3-3 proteins bind phosphoserine/phosphothreonine). Reversible phosphorylation, with kinases adding and phosphatases removing the phosphate, is the dominant signalling switch in eukaryotic cells. The human genome encodes approximately 500 kinases and 150 phosphatases, collectively modifying an estimated one-third of all cellular proteins.

Glycosylation attaches oligosaccharide chains to Asn (N-linked, via the consensus sequon Asn-X-Ser/Thr, where X is any residue except Pro) or to Ser/Thr (O-linked). N-linked glycosylation begins in the endoplasmic reticulum with the transfer of a pre-assembled 14-sugar oligosaccharide ($\mathrm{Glc_3Man_9GlcNAc_2}$) from the lipid carrier dolichol pyrophosphate to the amide nitrogen of Asn. The glycan is then trimmed and remodelled in the ER and Golgi. The chemical consequence is a large, hydrophilic, branched oligosaccharide that affects protein solubility, stability against proteolysis, and recognition by other proteins. Glycosylation is the most complex PTM in terms of structural diversity: a single glycosylation site can carry dozens of distinct glycoforms.
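The N-linked sequon rule is simple enough to scan for directly. The sketch below uses a regular expression with a lookahead so that overlapping sequons are all reported; the test sequence is made up for illustration, and a match only marks a candidate site, since not every sequon in a real protein is glycosylated.

```python
import re

# Asn followed by any residue except Pro, then Ser or Thr.
# The lookahead keeps the match one residue wide so overlapping sequons are found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of Asn residues in N-X-S/T sequons (X != Pro)."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Hypothetical sequence: N2 (N-G-S) matches, N6 (N-P-T) is excluded by the Pro rule,
# N10 (N-N-T) and N11 (N-T-S) overlap and are both reported.
example = "MNGSANPTQNNTS"
print(find_sequons(example))
```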

Disulfide bond formation between two Cys residues generates a covalent linkage that crosslinks distant segments of the polypeptide chain. The thiol/disulfide couple has a standard reduction potential $E^{\circ\prime}$ of roughly $-0.25$ V at pH 7 (the exact value depends on the thiol), placing the equilibrium under cellular redox control. In the oxidising environment of the endoplasmic reticulum (GSH:GSSG ratio ~1:1 to 3:1), disulfide formation is thermodynamically favoured. Protein disulfide isomerase (PDI) catalyses both formation and reshuffling of disulfides, allowing the protein to sample disulfide pairings until the native (thermodynamically optimal) arrangement is found. The kinetic accessibility of the native disulfide pattern was demonstrated by Anfinsen's ribonuclease refolding experiments [Anfinsen 1973]: denatured, reduced RNase A, upon removal of denaturant and addition of oxidant, spontaneously regains full enzymatic activity, showing that the primary sequence encodes the correct disulfide pairing.

Ubiquitination attaches the 76-residue protein ubiquitin to the $\varepsilon$-amino group of lysine residues via an isopeptide bond (ubiquitin's C-terminal glycine carboxyl linked to the target lysine side-chain amine). The conjugation cascade involves three enzyme classes: E1 (ubiquitin-activating enzyme, forming a thioester with ubiquitin's C-terminus), E2 (ubiquitin-conjugating enzyme, accepting ubiquitin from E1 via transthioesterification), and E3 (ubiquitin ligase, catalysing the final isopeptide bond formation on the substrate lysine). Polyubiquitin chains linked through Lys48 of ubiquitin target the modified protein for degradation by the 26S proteasome; chains through Lys63 serve signalling roles in DNA repair and NF-$\kappa$B activation. The human genome encodes two E1 enzymes, approximately 40 E2s, and over 600 E3 ligases, providing extraordinary substrate specificity.

Additional PTMs include acetylation of lysine $\varepsilon$-amino groups (neutralising the positive charge, regulating chromatin structure through histone modification), methylation of lysine and arginine side chains (adding one to three methyl groups without changing the charge state, creating binding surfaces for chromatin-reader domains), hydroxylation of proline and lysine residues in collagen (4-hydroxyproline stabilises the collagen triple helix through stereoelectronic effects on the pyrrolidine ring pucker; the hydroxylase requires ascorbate as cofactor, explaining why vitamin C deficiency causes scurvy), and $\gamma$-carboxylation of glutamate residues in clotting factors (introducing a second carboxyl group that chelates calcium ions, essential for membrane binding of prothrombin and factors VII, IX, and X).

Solid-phase peptide synthesis [Master]

The Merrifield principle anchors the C-terminal amino acid to an insoluble polymeric resin, converting peptide synthesis from a homogeneous solution problem into a heterogeneous solid-liquid process [Merrifield 1963]. Each coupling cycle adds one amino acid to the growing chain; between cycles, excess reagents and byproducts are removed by washing the resin, and the product remains covalently attached. This eliminates the chromatographic purification required after every step of solution-phase synthesis and makes long peptide sequences practical. Merrifield's original synthesis of bradykinin (9 residues) in 1964 has been extended to routine production of 50–70 residue peptides, with the record for full SPPS exceeding 100 residues under optimised conditions.

Theorem (SPPS overall yield). In solid-phase peptide synthesis with $n$ coupling cycles, if each coupling step achieves fractional efficiency $\eta$ (defined as the fraction of resin-bound chains successfully extended), the overall yield of the full-length product is $Y = \eta^n$.

Proof. Let $\eta$ be the per-cycle coupling efficiency and $n$ the number of cycles, and follow the resin-bound chains cycle by cycle. After the first coupling cycle, a fraction $\eta$ of chains carry residue 1 attached; a fraction $1 - \eta$ are failure sequences lacking residue 1. After the second coupling, a fraction $\eta$ of the chains that had residue 1 are extended to carry both residues 1 and 2, so the correct fraction is $\eta^2$. The remaining chains are either the original failure sequences (fraction $1 - \eta$, never extended further) or newly created failures from the second cycle (fraction $\eta(1 - \eta)$). By induction: after $k$ cycles, the fraction of chains carrying the correct $k$-residue sequence is $\eta^k$. After $n$ cycles the yield is $Y = \eta^n$.

For a 50-residue peptide at 99% coupling efficiency per step: $0.99^{50} \approx 0.605$, roughly 60% yield of full-length product. At 95% per step: $0.95^{50} \approx 0.077$, only 7.7% yield, with the remainder being truncated failure sequences. This exponential dependence on coupling efficiency motivates the drive for per-cycle efficiencies above 99.5%, achieved through excess activated amino acid (2–4 equivalents), extended coupling times, and double-coupling protocols.
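The geometric dependence is easy to explore numerically. This minimal sketch computes the full-length yield for a given per-cycle efficiency and inverts the relation to ask what efficiency a target yield demands; the function names are illustrative, and the model counts one coupling cycle per residue, as in the figures above.

```python
def spps_yield(efficiency: float, cycles: int) -> float:
    """Fraction of chains that are full length after `cycles` couplings."""
    return efficiency ** cycles


def required_efficiency(target_yield: float, cycles: int) -> float:
    """Per-cycle efficiency needed to reach `target_yield` after `cycles` couplings."""
    return target_yield ** (1.0 / cycles)


for efficiency in (0.95, 0.99, 0.995):
    print(f"efficiency {efficiency:.3f}: 50-mer yield {spps_yield(efficiency, 50):.3f}")

# What per-cycle efficiency gives at least half of the chains at full length?
for length in (20, 50, 100):
    print(f"{length:3d} cycles: need {required_efficiency(0.5, length):.4f} per cycle for 50% yield")
```

Inverting the formula makes the practical point directly: the longer the target, the closer to quantitative each coupling has to be.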

Resin types and linkers. The choice of resin determines the C-terminal functional group of the cleaved product. Wang resin (4-hydroxymethylphenoxyacetic acid linked to polystyrene) gives a free C-terminal carboxylic acid upon cleavage with 95% TFA. Rink amide resin gives a C-terminal amide (the amide nitrogen is built into the linker, and TFA cleavage releases the peptide amide). 2-Chlorotrityl chloride resin allows cleavage under exceptionally mild conditions (1% TFA in dichloromethane), preserving acid-labile side-chain protecting groups — useful when the target peptide contains acid-sensitive residues.

Fmoc vs. Boc strategy. The two dominant SPPS strategies differ in how the $\alpha$-amino group is protected between coupling cycles. Fmoc (9-fluorenylmethoxycarbonyl) is removed by mild base, typically 20% piperidine in DMF, via an E1cB elimination mechanism: the base abstracts the acidic proton at the 9-position of the fluorene ring, generating a carbanion that ejects the carbamate, liberating the free amine (after loss of CO2) and dibenzofulvene. Fmoc removal is fast (2 × 5-minute treatments), quantitative, and compatible with acid-labile side-chain protecting groups. Boc (tert-butoxycarbonyl) is removed by moderately strong acid, typically 50% TFA in dichloromethane, via protonation of the carbonyl oxygen followed by loss of the tert-butyl cation (which is trapped by scavengers or loses a proton to give isobutene). Boc removal requires that side-chain protecting groups be stable to TFA, which mandates groups that survive TFA but are cleaved by a much stronger acid (benzyl-based groups removed by anhydrous hydrogen fluoride in the final cleavage step). Fmoc chemistry is now the standard for research-scale synthesis; Boc chemistry persists in large-scale industrial production because the TFA deprotection is robust and HF cleavage gives high-purity product.

Coupling reagents. Direct reaction of a free carboxyl with a free amine is thermodynamically unfavourable (amide bond formation from carboxylic acid and amine has $\Delta G^{\circ\prime} > 0$, on the order of $+10$ kJ/mol in water). Coupling reagents activate the carboxyl by converting it to a more electrophilic intermediate. Carbodiimides (DCC, DIC) react with the carboxyl to form an O-acylisourea, which is attacked by the amine. However, O-acylisoureas undergo a side reaction, rearrangement to an unreactive N-acylurea, and promote racemisation at the $\alpha$-carbon by forming an oxazolone intermediate. Adding HOBt (1-hydroxybenzotriazole) or HOAt (1-hydroxy-7-azabenzotriazole) converts the O-acylisourea to an active ester (HOBt ester or HOAt ester) that reacts with the amine rapidly and with minimal racemisation, because the active ester is less electrophilic than the O-acylisourea and is much less prone to oxazolone formation. Phosphonium-based reagents (PyBOP, PyBrOP) and aminium/uronium reagents (HBTU, HATU, TBTU) generate the same kind of active ester in situ but with faster kinetics and cleaner byproduct profiles. HATU, incorporating the HOAt leaving group, is the gold standard for difficult couplings (sterically hindered residues such as N-methylamino acids or $\alpha,\alpha$-disubstituted residues).

Protecting-group orthogonality. Fmoc SPPS requires that side-chain protecting groups be stable to piperidine (the Fmoc deprotection reagent) but labile to TFA (the final cleavage reagent). The standard side-chain protecting groups are: tBu (tert-butyl) for Asp/Glu carboxylates and Ser/Thr/Tyr hydroxyls; Boc for the Lys $\varepsilon$-amino group; Trt (trityl) for Cys thiol, His imidazole, and Asn/Gln amide; and Pmc (2,2,5,7,8-pentamethylchroman-6-sulfonyl) or Pbf (2,2,4,6,7-pentamethyldihydrobenzofuran-5-sulfonyl) for Arg guanidinium. All are removed simultaneously during the TFA cleavage step, along with the peptide-resin linkage. The orthogonality between Fmoc (base-labile, acid-stable) and side-chain groups (acid-labile, base-stable) ensures that Fmoc removal exposes the $\alpha$-amine without affecting side-chain functionality.

Native chemical ligation (NCL). For proteins exceeding ~70 residues, the overall SPPS yield becomes impractically low (even at 99% per step, $0.99^{100} \approx 0.37$, before purification losses). Native chemical ligation, introduced by Kent and coworkers [Dawson 1994], overcomes this limitation by joining two or more unprotected peptide segments through a chemoselective reaction. The N-terminal segment is synthesised with a C-terminal thioester ($-\mathrm{COSR}$); the C-terminal segment carries an N-terminal cysteine. The cysteine thiol attacks the thioester in a transthioesterification step, forming a thioester-linked intermediate that undergoes spontaneous S$\to$N acyl migration through a five-membered ring transition state, generating a native peptide bond at the ligation junction. The final product has no residual ligation scar: the junction is indistinguishable from a ribosomally synthesised peptide bond. Sequential NCL of three to five segments has enabled total chemical synthesis of proteins exceeding 200 residues, including functional enzymes (human lysozyme, HIV-1 protease) and proteins with site-specific incorporation of unnatural amino acids, isotopic labels, or post-translational modifications.

Protein chemical analysis and sequencing [Master]

Theorem (Edman degradation). Each cycle of Edman degradation removes and identifies exactly one N-terminal residue from a polypeptide chain, leaving the remainder of the chain intact and available for the next cycle [Edman 1950].

The Edman reagent, phenylisothiocyanate (PITC), reacts with the free N-terminal $\alpha$-amino group under mildly basic conditions (pH ~8, trimethylamine) to form a phenylthiocarbamoyl (PTC) derivative. Treatment with anhydrous acid (typically trifluoroacetic acid) cyclises the PTC derivative to a five-membered anilinothiazolinone (ATZ) ring, simultaneously cleaving the peptide bond between the N-terminal residue and the rest of the chain. The ATZ-amino acid is extracted into organic solvent, converted to the more stable phenylthiohydantoin (PTH) derivative under aqueous acid conditions, and identified by reverse-phase HPLC against known PTH-amino acid standards. The remainder of the polypeptide, now shortened by one residue, has a new free N-terminal amine and enters the next cycle.

The sequential nature of Edman degradation imposes two practical limits. First, the cycle efficiency $\eta$, the fraction of molecules that complete the full reaction-cyclisation-extraction pathway, is approximately 95–98% per cycle. After $k$ cycles, the fraction of molecules yielding the correct $k$-th residue signal is $\eta^k$, the same geometric decay as SPPS. At 98% efficiency, signal persists for ~50 cycles; at 95%, ~30 cycles. Second, the N-terminal amino group must be free (not acetylated, formylated, or pyroglutamylated), which is the case for most but not all cellular proteins. Despite these limits, automated Edman sequencers (Applied Biosystems gas-phase instruments) were the workhorse of protein sequencing from the 1970s through the 1990s and remain useful for confirming N-terminal identity and detecting post-translational N-terminal modifications.
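The decay of the in-phase signal, and the growth of out-of-phase background that accompanies it, can be illustrated with a toy population model. The model is an assumption made for this sketch, not a description of any instrument: in each cycle every chain loses its current N-terminal residue with probability eta, and chains that fail stay intact and fall one step behind. The chain is taken to be long enough that it never runs out of residues.

```python
def edman_signals(eta: float, cycles: int) -> list[tuple[float, float]]:
    """Return (in_phase, lagging) signal fractions for cycles 1..cycles.

    in_phase: chains releasing residue k on cycle k (the correct call).
    lagging:  chains releasing an earlier residue on cycle k (background).
    """
    # removed[j] = fraction of chains that have lost exactly j residues so far.
    removed = [1.0]
    results = []
    for k in range(1, cycles + 1):
        in_phase = eta * removed[k - 1]
        lagging = eta * sum(removed[: k - 1])
        results.append((in_phase, lagging))
        updated = [0.0] * (k + 1)
        for j, fraction in enumerate(removed):
            updated[j] += (1.0 - eta) * fraction  # cycle failed: chain unchanged
            updated[j + 1] += eta * fraction      # cycle succeeded: one more residue gone
        removed = updated
    return results


for eta in (0.98, 0.95):
    signals = edman_signals(eta, 50)
    for cycle in (1, 10, 30, 50):
        in_phase, lagging = signals[cycle - 1]
        print(f"eta={eta:.2f} cycle {cycle:2d}: in-phase {in_phase:.3f}, lagging {lagging:.3f}")
```

In this model the in-phase term reproduces the geometric decay stated above, while the lagging term grows toward eta; the useful read length ends roughly where the two become comparable.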

Mass spectrometry. The identification and quantification of proteins and peptides by mass spectrometry has largely supplanted Edman degradation for sequencing. Two ionisation methods dominate protein mass spectrometry:

MALDI-TOF (matrix-assisted laser desorption/ionisation, time of flight) [Karas 1988] co-crystallises the protein or peptide sample with a UV-absorbing organic matrix (sinapinic acid for proteins, $\alpha$-cyano-4-hydroxycinnamic acid for peptides). A pulsed UV laser ablates the matrix, carrying the analyte into the gas phase as predominantly singly charged ions ($[\mathrm{M+H}]^+$). The ions are accelerated through a fixed potential difference and drift through a field-free tube; the time of flight to the detector is proportional to the square root of the mass-to-charge ratio ($t \propto \sqrt{m/z}$). MALDI-TOF is tolerant of salts and buffers, requires minimal sample preparation, and provides accurate mass determination (0.01% or better) for proteins up to ~300 kDa.
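The square-root dependence follows from energy conservation: an ion of charge $ze$ accelerated through a potential $V$ gains kinetic energy $zeV = \tfrac{1}{2}mv^2$ and then drifts a length $L$ at constant speed. The sketch below evaluates the idealised linear-TOF flight time under those assumptions; the 20 kV accelerating voltage and 1 m flight tube are hypothetical instrument parameters chosen for illustration, and initial velocity spread and extraction delay are ignored.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms


def flight_time_us(mass_da: float, charge: int = 1,
                   accel_voltage: float = 20_000.0, tube_length_m: float = 1.0) -> float:
    """Idealised linear TOF flight time in microseconds.

    accel_voltage and tube_length_m are assumed, illustrative instrument settings.
    """
    kinetic_energy = charge * ELEMENTARY_CHARGE * accel_voltage  # joules
    velocity = math.sqrt(2.0 * kinetic_energy / (mass_da * DALTON))
    return tube_length_m / velocity * 1e6


for mass in (1_000, 4_000, 16_000, 64_000):
    print(f"{mass:6d} Da, z=1: {flight_time_us(mass):7.1f} microseconds")
```

Each fourfold increase in mass doubles the flight time, which is why resolution in mass falls off for large proteins even though they remain detectable.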

ESI (electrospray ionisation) produces a fine aerosol of charged droplets from a solution of the analyte pumped through a metal capillary at high voltage. Solvent evaporation shrinks the droplets until Coulombic repulsion exceeds surface tension (Rayleigh limit), causing fission into smaller droplets; repeated evaporation-fission cycles ultimately yield bare multiply charged protein ions ($[\mathrm{M} + n\mathrm{H}]^{n+}$). Because the protein acquires many charges (typically one proton per ~1 kDa of mass), even large proteins appear at low $m/z$ values ($m/z = (M + n\,m_{\mathrm{H}})/n$, where $n$ is the charge state and $m_{\mathrm{H}}$ the proton mass), within the range of standard quadrupole or time-of-flight mass analyzers. ESI is readily coupled to liquid chromatography (LC-MS/MS), enabling automated peptide separation and sequencing.
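Because neighbouring peaks in an ESI charge envelope differ by exactly one proton, two adjacent peaks are enough to recover both the charge state and the neutral mass. The sketch below generates a charge series for an assumed 17 000 Da protein (a made-up round number) and then deconvolutes a pair of adjacent peaks; it assumes protonation is the only charging mechanism.

```python
PROTON_MASS = 1.007276  # Da


def mz_for_charge(neutral_mass: float, n: int) -> float:
    """m/z of the [M + nH]^n+ ion."""
    return (neutral_mass + n * PROTON_MASS) / n


def deconvolute_pair(mz_lower_charge: float, mz_higher_charge: float) -> tuple[int, float]:
    """Recover (n, M) from adjacent peaks with charges n and n + 1.

    mz_lower_charge is the peak at higher m/z (charge n);
    mz_higher_charge is its neighbour at lower m/z (charge n + 1).
    """
    n = round((mz_higher_charge - PROTON_MASS) / (mz_lower_charge - mz_higher_charge))
    neutral_mass = n * (mz_lower_charge - PROTON_MASS)
    return n, neutral_mass


mass = 17_000.0  # hypothetical protein mass in Da
series = {n: mz_for_charge(mass, n) for n in range(12, 22)}
for n, mz in series.items():
    print(f"z = {n:2d}: m/z = {mz:8.2f}")

n, recovered = deconvolute_pair(series[17], series[18])
print(f"recovered charge {n}, neutral mass {recovered:.1f} Da")
```

Real spectra carry noise and adducts, so deconvolution software fits the whole envelope rather than a single pair, but the arithmetic is the same.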

Tandem mass spectrometry (MS/MS) for peptide sequencing operates as follows. A protein sample is digested with a site-specific protease (trypsin, which cleaves after Lys and Arg, is the standard). The resulting peptide mixture is separated by reverse-phase HPLC and introduced into the mass spectrometer by ESI. In the first MS stage, peptide precursor ions are selected by $m/z$. Each selected precursor is fragmented by collision-induced dissociation (CID), in which collisions with inert gas molecules break the peptide backbone preferentially at the amide bond. The resulting fragment ions (b-ions from the N-terminal side, y-ions from the C-terminal side) are mass-analysed in the second MS stage. The mass difference between consecutive b-ions (or y-ions) corresponds to the residue mass of the amino acid at that position, enabling sequence determination. Database searching (Mascot, SEQUEST) matches the observed fragmentation spectra against theoretical spectra computed from protein sequence databases, identifying the protein and characterising post-translational modifications by their characteristic mass shifts.
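The ladder logic of b- and y-ions can be made concrete with a small calculator. The sketch below computes singly charged, unmodified b- and y-ion m/z values from standard monoisotopic residue masses; the peptide "GASK" is a made-up example, and neutral losses, higher charge states, and modifications are ignored.

```python
# Monoisotopic residue masses (Da) for the twenty standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide: str) -> tuple[list[float], list[float]]:
    """Return singly charged (b, y) m/z lists for every backbone cleavage.

    b[i] is the b-ion with i + 1 residues; y[i] is the complementary y-ion
    formed by the same cleavage (so it holds the remaining residues).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide.upper()]
    b_ions, y_ions = [], []
    for cut in range(1, len(masses)):
        b_ions.append(sum(masses[:cut]) + PROTON)
        y_ions.append(sum(masses[cut:]) + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("GASK")  # hypothetical tryptic peptide
for index, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{index} = {b:9.4f}    complementary y{4 - index} = {y:9.4f}")
```

Consecutive b-ions differ by one residue mass, which is how the sequence is read off; Leu and Ile share a mass and cannot be told apart this way.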

Amino acid analysis quantifies the molar ratios of amino acid residues in a protein hydrolysate. The protein is hydrolysed in 6 M HCl at 110 °C for 24 hours under vacuum, cleaving all peptide bonds and converting the protein into its constituent free amino acids. The hydrolysate is derivatised with ninhydrin (which produces a purple chromophore, Ruhemann's purple, absorbing at 570 nm, with proline giving a yellow product at 440 nm) or with o-phthalaldehyde (OPA, which reacts with primary amines to form a fluorescent isoindole). Separation by ion-exchange chromatography (the Moore-Stein method) or reverse-phase HPLC gives quantitative compositional data: the number of each amino acid residue per molecule. The method destroys tryptophan (degraded during acid hydrolysis) and converts glutamine and asparagine to glutamate and aspartate (deamidation), but provides an independent check on protein composition and molecular weight.
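The blind spots of acid hydrolysis can be mimicked in a few lines when comparing a measured composition with the one expected from a known sequence. This is a minimal sketch of that bookkeeping only: it merges Asn into Asp and Gln into Glu and drops Trp, as described above, and does not model partial losses of other residues. The input sequence is invented for illustration.

```python
from collections import Counter


def hydrolysate_composition(sequence: str) -> dict[str, int]:
    """Expected residue counts after 6 M HCl hydrolysis of `sequence`."""
    counts = Counter(sequence.upper())
    counts["D"] += counts.pop("N", 0)  # Asn is deamidated and reported with Asp (Asx)
    counts["E"] += counts.pop("Q", 0)  # Gln is deamidated and reported with Glu (Glx)
    counts.pop("W", None)              # Trp does not survive acid hydrolysis
    return {residue: n for residue, n in sorted(counts.items()) if n > 0}


print(hydrolysate_composition("MNQWDEKW"))  # hypothetical sequence
```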

Circular dichroism (CD) probes protein secondary structure through the differential absorption of left- and right-circularly polarised UV light by the peptide backbone. The amide chromophore (the peptide bond) absorbs in the far-UV (190–250 nm), and its CD spectrum is sensitive to backbone conformation. Alpha-helical proteins show a characteristic double minimum at 208 nm and 222 nm (the 222 nm signal arises from the $n \to \pi^*$ transition of the hydrogen-bonded amide). Beta-sheet proteins show a single minimum near 215 nm. Random coil gives a strong negative band near 195 nm. Quantitative analysis of the CD spectrum by reference to libraries of proteins with known crystal structures (CONTIN, SELCON algorithms) yields estimates of the fractional alpha-helix, beta-sheet, and coil content. CD is the standard method for monitoring protein folding and unfolding as a function of temperature, pH, or denaturant concentration, because changes in secondary structure are reflected in real time in the CD signal.

Protein folding and the Levinthal paradox. The protein-folding problem asks: given a linear sequence of amino acids, what three-dimensional structure does it adopt? Anfinsen's experiments [Anfinsen 1973] showed that denatured ribonuclease refolds to its native structure spontaneously in vitro, establishing that the primary sequence encodes the tertiary structure. The Levinthal paradox (1969) highlights the combinatorial scale: if each residue can adopt just three backbone states, a protein of 100 residues has roughly $3^{100} \approx 5 \times 10^{47}$ possible backbone conformations. Even sampling conformations at $10^{13}$ per second, a random search would take far longer than the age of the universe. Yet proteins fold in seconds or less. The resolution is that folding is not a random search but follows a folding funnel, an energy landscape that biases the chain toward the native state through a combination of local structure formation (secondary structure nucleation), hydrophobic collapse (burying nonpolar residues), and cooperative stabilisation of the native fold. The Anfinsen thermodynamic hypothesis states that the native state is the global free-energy minimum under physiological conditions; this is approximately true for small single-domain proteins but breaks down for larger proteins requiring chaperone assistance.
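The scale of the Levinthal estimate is worth working through once. The sketch below uses the conventional back-of-envelope assumptions, which are illustrative rather than measured: three accessible backbone states per residue and a sampling rate of 10^13 conformations per second. The age of the universe is taken as about 13.8 billion years.

```python
SECONDS_PER_YEAR = 3.156e7
AGE_OF_UNIVERSE_YEARS = 1.38e10


def levinthal_search(residues: int, states_per_residue: int = 3,
                     rate_per_second: float = 1e13) -> tuple[int, float]:
    """Return (number of conformations, years for an exhaustive random search).

    states_per_residue and rate_per_second are assumed, order-of-magnitude inputs.
    """
    conformations = states_per_residue ** residues
    years = conformations / rate_per_second / SECONDS_PER_YEAR
    return conformations, years


for length in (50, 100, 150):
    conformations, years = levinthal_search(length)
    ratio = years / AGE_OF_UNIVERSE_YEARS
    print(f"{length:3d} residues: {conformations:.2e} conformations, "
          f"{years:.2e} years ({ratio:.1e} x age of universe)")
```

The conclusion is insensitive to the exact inputs: changing the per-residue state count or the sampling rate by an order of magnitude still leaves an exhaustive search hopeless for a 100-residue chain, which is the point of the paradox.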

Non-proteinogenic amino acids and expanding the genetic code [Master]

The twenty canonical amino acids encoded by the standard genetic code are not the only amino acids found in biological systems, nor are they the only building blocks available for peptide and protein synthesis. Three categories of non-standard amino acids extend the chemical repertoire: naturally occurring amino acids incorporated through specialised translational machinery, unnatural amino acids introduced by synthetic biology approaches, and non-$\alpha$-amino acids used in foldamer design.

D-amino acids in bacterial cell walls. Bacterial peptidoglycan contains D-alanine and D-glutamate in its stem peptides, supplied by racemase enzymes that convert the L-amino acid pool to the D-enantiomer. Alanine racemase (a pyridoxal phosphate-dependent enzyme) interconverts L-Ala and D-Ala with a rate enhancement of many orders of magnitude over the uncatalysed reaction. D-Ala is ligated to D-Ala by D-alanine-D-alanine ligase to form the dipeptide D-Ala-D-Ala, which is incorporated into the peptidoglycan precursor lipid II and serves as the terminus of the crosslink. Vancomycin, a glycopeptide antibiotic, binds D-Ala-D-Ala with high affinity through a network of five hydrogen bonds, blocking the transglycosylation and transpeptidation steps of cell wall biosynthesis. Vancomycin-resistant enterococci (VRE) evade this mechanism by replacing the terminal D-Ala with D-lactate (D-Ala-D-Lac), which removes one of the five hydrogen bonds (amide NH $\to$ ester oxygen) and reduces vancomycin binding affinity by a factor of ~1000: a single atom change that renders the antibiotic ineffective.

Selenocysteine (Sec, U), the 21st amino acid. Selenocysteine is a cysteine analogue in which sulfur is replaced by selenium. It is incorporated co-translationally at UGA codons, which normally function as stop codons, through a recoding mechanism requiring a cis-acting RNA element in the mRNA (the SECIS element, for SElenoCysteine Insertion Sequence) and a suite of trans-acting factors (SelB, a specialised elongation factor; a SECIS-binding protein; and selenocysteine synthase, which converts Ser-tRNA$^{\mathrm{Sec}}$ to Sec-tRNA$^{\mathrm{Sec}}$). The selenol ($-\mathrm{SeH}$) of selenocysteine has a $\mathrm{p}K_a$ of ~5.2, roughly 3 units lower than the thiol of cysteine ($\mathrm{p}K_a \approx 8.3$). At physiological pH, selenocysteine is predominantly deprotonated ($-\mathrm{Se}^-$), making it a far more reactive nucleophile than the mostly protonated thiol of cysteine. This enhanced nucleophilicity is critical for the catalytic mechanism of glutathione peroxidase, which reduces hydrogen peroxide and organic hydroperoxides using selenocysteine as the redox-active residue in a ping-pong mechanism.
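The protonation argument can be checked with the Henderson-Hasselbalch relation, which gives the deprotonated fraction of a monoprotic acid as $1 / (1 + 10^{\mathrm{p}K_a - \mathrm{pH}})$. The sketch below uses the selenol $\mathrm{p}K_a$ of 5.2 quoted in the text and assumes a cysteine thiol $\mathrm{p}K_a$ of about 8.3 (a typical textbook value for the free amino acid, consistent with the gap of roughly 3 units) and a physiological pH of 7.4. The function name is hypothetical.

```python
def fraction_deprotonated(pka, ph):
    """Deprotonated fraction of a monoprotic acid (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


if __name__ == "__main__":
    PH = 7.4              # assumed physiological pH
    PKA_SELENOL = 5.2     # value quoted in the text
    PKA_THIOL = 8.3       # assumed typical value for free cysteine

    sec = fraction_deprotonated(PKA_SELENOL, PH)
    cys = fraction_deprotonated(PKA_THIOL, PH)
    print(f"selenolate fraction: {sec:.3f}")
    print(f"thiolate fraction:   {cys:.3f}")
```

Under these assumptions roughly 99% of the selenol is present as selenolate, against roughly 11% of the thiol as thiolate. Side-chain $\mathrm{p}K_a$ values shift inside a folded protein, so the numbers illustrate the trend and do not predict the state of any particular active site.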

Pyrrolysine (Pyl, O), the 22nd amino acid. Pyrrolysine is a lysine derivative with a (4R,5R)-4-methylpyrroline-5-carboxylate group attached to the $\varepsilon$-amino nitrogen via an amide linkage. It is found in methanogenic archaea (Methanosarcina species), where it is incorporated at UAG amber stop codons. Unlike selenocysteine, pyrrolysine incorporation requires only a single dedicated aminoacyl-tRNA synthetase (PylRS) and its cognate tRNA (tRNA$^{\mathrm{Pyl}}$), without SECIS-like elements or additional elongation factors. The pyrrolysine biosynthetic pathway converts two molecules of lysine to one molecule of pyrrolysine in three enzymatic steps: a radical SAM mutase (PylB), an ATP-dependent ligase (PylC), and a dehydrogenase (PylD).

Genetic code expansion with unnatural amino acids. The Schultz laboratory demonstrated in 2001 that an engineered tyrosyl-tRNA synthetase/tRNA pair from Methanocaldococcus jannaschii can incorporate an unnatural amino acid site-specifically into proteins in living E. coli cells. The pyrrolysyl system and engineered leucyl systems were later repurposed in the same way, and together these systems accept a wide range of unnatural amino acids. The approach requires an orthogonal tRNA/aminoacyl-tRNA synthetase pair: the engineered tRNA is not recognised by any endogenous synthetase, and the engineered synthetase charges only that tRNA with the unnatural amino acid and does not aminoacylate any endogenous tRNA. The tRNA reads a reassigned codon, typically the amber stop codon (UAG), inserting the unnatural amino acid at the position specified by the engineered gene. Over 200 unnatural amino acids have been incorporated by this method, including: photo-crosslinking amino acids (p-benzoylphenylalanine, which inserts into nearby C-H bonds upon 365 nm irradiation and so maps protein-protein interaction surfaces); fluorescent amino acids (coumarin- and dansyl-modified amino acids for real-time conformational reporting); post-translationally modified amino acids (phosphoserine and acetyllysine, incorporated directly to study PTM function without requiring the modifying enzyme); and amino acids with bioorthogonal reactive handles (azido- and alkynyl-amino acids for click chemistry conjugation, strained cyclooctyne amino acids for copper-free labelling).
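The effect of reassigning the amber codon can be illustrated with a toy translation routine. The sketch below builds the standard codon table, translates a short made-up mRNA, and then repeats the translation with UAG reassigned to a placeholder residue. The symbol `X`, the example sequence, and the function names are all hypothetical, and the model ignores real-world effects such as competition between the suppressor tRNA and release factors.

```python
from itertools import product

# Standard genetic code in the conventional UCAG ordering ("*" = stop).
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def with_amber_suppression(code, symbol="X"):
    """Return a copy of the code with UAG reassigned to `symbol`.

    `symbol` is a hypothetical one-letter placeholder for the
    unnatural amino acid carried by the orthogonal tRNA.
    """
    expanded = dict(code)
    expanded["UAG"] = symbol
    return expanded


def translate(mrna, code):
    """Translate from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        residue = code[mrna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    mrna = "AUGGCUUAGAAAUAA"  # made-up: Met-Ala-(amber)-Lys-(ochre stop)
    print(translate(mrna, STANDARD_CODE))
    print(translate(mrna, with_amber_suppression(STANDARD_CODE)))
```

With the standard table the chain terminates at the amber codon, giving `MA`. With the reassigned table translation reads through to the UAA stop, giving `MAXK`, which is the behaviour an orthogonal pair is engineered to produce at a single chosen site.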

$\beta$-Amino acids and foldamers. $\beta$-Amino acids have an extra methylene group between the amino and carboxyl groups (backbone $\mathrm{N{-}C_\beta{-}C_\alpha{-}C(=O)}$ instead of $\mathrm{N{-}C_\alpha{-}C(=O)}$). Peptides built from $\beta$-amino acids ($\beta$-peptides) fold into defined secondary structures (the 14-helix, 12-helix, and 10/12-helix) that are analogous to but distinct from the $\alpha$-helix of natural peptides. The additional backbone carbon increases the conformational degrees of freedom (three backbone torsion angles per residue instead of two), but cyclic constraints such as trans-2-aminocyclohexanecarboxylic acid (ACHC) restrict the backbone to specific helical conformations. $\beta$-Peptides are resistant to proteolytic degradation because natural proteases do not recognise the extended backbone, making them attractive scaffolds for therapeutic peptides and antimicrobial agents. The field of foldamers (oligomers that adopt well-defined folded structures) extends beyond $\beta$-amino acids to include peptoids (N-substituted glycines), oligoureas, and aromatic oligoamides, all exploiting non-natural backbones to achieve structures and functions inaccessible to natural peptides.

Synthesis. The central insight of amino acid chemistry is that the twenty proteinogenic side chains partition into a small number of reactive categories: nucleophilic (Cys, Ser, Lys, His), acidic (Asp, Glu), basic (Arg, Lys, His), aromatic (Phe, Tyr, Trp), hydrophobic (Leu, Ile, Val, Met), and conformationally special (Pro, which is constrained, and Gly, which is unusually flexible). This classification shapes every level of protein structure, function, and chemical manipulation. Combined with the peptide bond's restricted rotation and the zwitterion's pH-dependent charge state, it gives the physical basis of protein folding: hydrophobic burial of nonpolar side chains, hydrogen bonding of polar groups, ionic pairing of charged residues, and covalent crosslinking through disulfide bridges. The same reactivity explains why post-translational modifications modulate protein function: phosphorylating a serine or ubiquitinating a lysine changes the chemical identity of the side chain without altering the genetic specification of the sequence, so the reactive groups that fold the protein also serve as switches for signalling and regulation. The pattern recurs in peptide synthesis, where protecting-group orthogonality exploits the different reactivity of each functional group. The chemical diversity of twenty (or twenty-two, or more) side chains links the organic chemistry of the monomers to the emergent properties of the polypeptide chain, and it underlies the broader principle that sequence encodes structure, which in turn encodes function.

Connections [Master]

  • Carbonyl nucleophilic addition 15.07.01. Peptide bond formation is a nucleophilic addition-elimination at a carbonyl: the amine nitrogen attacks the carboxyl carbonyl carbon, forming a tetrahedral intermediate that collapses with loss of water. The resonance stabilisation of the resulting amide — nitrogen lone-pair delocalisation into the carbonyl — is the same electronic effect treated in 15.07.01 for carboxylic acid derivatives. The peptide bond's partial double-bond character, planarity, and resistance to hydrolysis all derive from this amide resonance, and protease-catalysed peptide bond cleavage proceeds through the same tetrahedral intermediate that nucleophilic addition to a carbonyl generates.

  • Enzyme mechanism 15.14.01 pending. Enzymes are proteins whose catalytic power derives from the reactive side chains positioned in the active site by the tertiary fold. The serine protease catalytic triad (Ser-His-Asp), the cysteine protease mechanism (Cys-His), and metalloprotease zinc coordination (His, Glu, Asp) all exploit the nucleophilic and acid-base chemistry of amino acid side chains described in this unit. The enzyme mechanism unit generalises these individual amino acid reactivities into coherent catalytic strategies — covalent catalysis, general acid-base catalysis, metal ion catalysis — built from the same building blocks.

  • Biomolecules in cells 17.01.01. The amino acid chemistry developed here — zwitterion behaviour, hydrophobic effect, hydrogen bonding, disulfide formation — provides the physical basis for protein folding, membrane association, and intracellular compartmentalisation treated in the cell biology chapter. The hydrophobic burial of nonpolar side chains that drives protein tertiary folding is the same physical interaction that inserts transmembrane helices into the lipid bilayer, and the charge-charge interactions between surface-exposed side chains govern protein-protein association and enzyme-substrate recognition.

  • Nucleic acid chemistry 15.13.01 pending. The phosphodiester bonds linking nucleotides in DNA and RNA are formed by condensation chemistry analogous to peptide bond formation: a 3'-hydroxyl attacks the $\alpha$-phosphate of a nucleoside triphosphate, with elimination of pyrophosphate. The nitrogen-heterocyclic nucleobases (adenine, guanine, cytosine, thymine, uracil) parallel the aromatic heterocyclic amino acid side chains (His imidazole, Trp indole) in their hydrogen-bonding and aromatic stacking properties. Both biopolymer classes use condensation polymerisation to build information-encoding macromolecules from small-molecule monomers with distinctive chemical functionality.

  • SN1 vs SN2 substitution mechanisms 15.04.02 pending. The peptide coupling reaction in SPPS involves activation of the carboxyl group to form an electrophilic intermediate (active ester, O-acylisourea) that undergoes nucleophilic attack by the amine — mechanistically related to the nucleophilic substitution framework of 15.04.02 pending extended to carbonyl centres. The phosphorylation of serine, threonine, and tyrosine by kinases is an in-line displacement at a tetrahedral phosphorus centre, following the same stereochemical logic as SN2 at carbon: backside attack, inversion at phosphorus, concerted bond-making and bond-breaking.

  • Retrosynthetic analysis 15.10.01. Peptide synthesis applies retrosynthetic logic at every amide bond: the disconnection produces an amine and a carboxylic acid, and the forward amide coupling is the synthetic equivalent. Solid-phase peptide synthesis is a linear retrosynthetic plan executed iteratively; native chemical ligation is a convergent retrosynthetic plan for larger polypeptides. The general retrosynthetic framework of disconnection, synthon, and synthetic equivalent applies directly to polypeptide targets.

  • NMR spectroscopy of organic molecules 15.11.01. Protein NMR (HSQC, NOESY, TOCSY) extends the $^{1}$H and $^{13}$C techniques developed in the organic NMR unit to macromolecular structure determination. The same chemical-shift, J-coupling, and NOE principles that identify small-molecule functional groups are redeployed with isotopic labelling ($^{15}$N, $^{13}$C) and triple-resonance experiments to resolve the spectral overlap inherent in proteins and assign backbone and side-chain resonances.

  • Cellular organization: organelles 17.03.01 pending. The endoplasmic reticulum is the primary site of disulfide bond formation (Ero1/PDI-mediated oxidation of cysteine thiol pairs) and N-linked glycosylation of asparagine side chains. Both are amino acid side-chain reactions described here, deployed at scale in the ER lumen to fold and mature secretory and membrane proteins. The calnexin chaperone cycle that quality-controls glycoproteins in the ER depends on the glucose-mannose-GlcNAc glycan, whose initial transfer to asparagine forms an N-glycosidic bond to the side-chain amide nitrogen.

  • Translation 17.05.03 pending. The ribosome polymerises amino acids into polypeptide chains through amide (peptide) bond formation — the same condensation reaction whose chemistry is developed here. The ester linkage between each amino acid and the 3' OH of its cognate tRNA is the activated intermediate for peptide bond formation, a nucleophilic acyl substitution at the carbonyl carbon. The side-chain chemistry catalogued here (hydrophobic, charged, polar, aromatic) determines the folded protein's tertiary structure and function.

Historical & philosophical context [Master]

Emil Fischer proposed the peptide bond () as the linkage joining amino acids into proteins in 1902, independently of and simultaneously with Franz Hofmeister [Fischer 1902]. Fischer supported the proposal by synthesising the first oligopeptides — up to 18 residues — by solution-phase coupling, demonstrating that amide-linked amino acid chains have properties consistent with those of natural protein hydrolysis products. The Fischer peptide synthesis established that proteins are linear polymers of amino acids connected by amide bonds, settling a debate about whether proteins were colloidal aggregates or defined molecular species.

Frederick Sanger determined the complete amino acid sequence of insulin (51 residues, two chains) in 1955 [Sanger 1955], using partial hydrolysis, paper chromatography, and N-terminal labelling with 1-fluoro-2,4-dinitrobenzene (Sanger's reagent, which forms a stable dinitrophenyl derivative of the N-terminal amino acid). This was the first demonstration that a protein has a unique, defined primary structure — a specific sequence of amino acids — and that the sequence can be determined by chemical methods. Sanger received the Nobel Prize in Chemistry in 1958 for this work; he later received a second Nobel Prize (1980) for developing DNA sequencing methods.

Pehr Edman introduced sequential N-terminal degradation in 1950 [Edman 1950], providing a stepwise chemical method for reading the amino acid sequence of a polypeptide from the N-terminus one residue at a time. The procedure was later automated in the protein sequenator (Edman and Begg, 1967). Edman degradation became the standard sequencing method for three decades and was the immediate predecessor of modern mass-spectrometry-based protein sequencing.

Bruce Merrifield introduced solid-phase peptide synthesis in 1963 [Merrifield 1963], transforming peptide synthesis from a laborious multi-step solution-phase process into an automatable cycle of coupling, deprotection, and washing on an insoluble resin support. Merrifield's automated synthesiser, described in the mid-1960s, carried out this cycle without manual intervention; he received the Nobel Prize in Chemistry in 1984. The Fmoc protecting group, introduced by Carpino in 1970 and later applied to SPPS on Wang-type resins, replaced the original Boc chemistry for most research applications because the mild base deprotection is compatible with a wider range of side-chain functionalities.

Christian Anfinsen showed that denatured ribonuclease A refolds to its enzymatically active native state upon removal of denaturant, establishing that the primary amino acid sequence encodes the three-dimensional structure [Anfinsen 1973]. Philip Dawson, Stephen Kent, and coworkers introduced native chemical ligation in 1994 [Dawson 1994], enabling the total chemical synthesis of full-length proteins by joining unprotected peptide segments through a thioester-mediated ligation that generates a native peptide bond at the junction.

Bibliography [Master]

@article{Fischer1902,
  author = {Fischer, Emil},
  title = {{\"U}ber Polypeptide},
  journal = {Ber. dtsch. chem. Ges.},
  volume = {35},
  pages = {1095--1106},
  year = {1902}
}

@article{Sanger1955,
  author = {Sanger, Frederick},
  title = {The arrangement of amino acids in insulin},
  journal = {Biochem. J.},
  volume = {59},
  pages = {479--497},
  year = {1955}
}

@article{Edman1950,
  author = {Edman, Pehr},
  title = {Method for Determination of the Amino Acid Sequence in Peptides},
  journal = {Acta Chem. Scand.},
  volume = {4},
  pages = {283--293},
  year = {1950}
}

@article{Merrifield1963,
  author = {Merrifield, R. Bruce},
  title = {Solid Phase Peptide Synthesis. {I.} {The} Synthesis of a Tetrapeptide},
  journal = {J. Am. Chem. Soc.},
  volume = {85},
  pages = {2149--2154},
  year = {1963}
}

@article{Anfinsen1973,
  author = {Anfinsen, Christian B.},
  title = {Principles that govern the folding of protein chains},
  journal = {Science},
  volume = {181},
  pages = {223--230},
  year = {1973}
}

@article{Dawson1994,
  author = {Dawson, Philip E. and Muir, Tom W. and Clark-Lewis, Irwin and Kent, Stephen B. H.},
  title = {Synthesis of proteins by native chemical ligation},
  journal = {Science},
  volume = {266},
  pages = {776--779},
  year = {1994}
}

@article{Karas1988,
  author = {Karas, Michael and Hillenkamp, Franz},
  title = {Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons},
  journal = {Anal. Chem.},
  volume = {60},
  pages = {2299--2301},
  year = {1988}
}

@book{Lehninger2017,
  author = {Nelson, David L. and Cox, Michael M.},
  title = {Lehninger Principles of Biochemistry},
  edition = {7th},
  publisher = {W. H. Freeman},
  year = {2017}
}

@book{Voet2016,
  author = {Voet, Donald and Voet, Judith G.},
  title = {Biochemistry},
  edition = {5th},
  publisher = {Wiley},
  year = {2016}
}

@book{Clayden2012,
  author = {Clayden, Jonathan and Greeves, Nick and Warren, Stuart},
  title = {Organic Chemistry},
  edition = {2nd},
  publisher = {Oxford University Press},
  year = {2012}
}

@book{CareySundberg2007,
  author = {Carey, Francis A. and Sundberg, Richard J.},
  title = {Advanced Organic Chemistry, Part {A}: Structure and Mechanisms},
  edition = {5th},
  publisher = {Springer},
  year = {2007}
}