17.05.02 · mol-cell-bio / gene-expression

Transcription

draft · 3 tiers · Lean: none

Anchor (Master): Kornberg 2006 Nobel Lecture, Chem. Rev. 107 (2007) 3313-3326; Cramer et al. 2008 Annu. Rev. Biophys. 37, 337-352; Ptashne & Gann 2002; Li et al. 2007 Nature 446 on Pol II structure

Intuition [Beginner]

DNA stores the genetic blueprint for every protein a cell can make. But DNA stays in the nucleus (in eukaryotes) and proteins are built in the cytoplasm. Transcription bridges these two worlds by copying a gene's DNA sequence into a messenger RNA (mRNA) molecule that can travel to the ribosomes where proteins are assembled.

RNA polymerase, the enzyme at the heart of transcription, binds near the start of a gene, pries apart the two DNA strands, and uses one strand (the template strand) as a guide to string together complementary RNA nucleotides. The resulting RNA strand has the same sequence as the other DNA strand (the coding strand), except that uracil (U) replaces thymine (T) in RNA.
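The strand relationships described above can be sketched in a few lines of Python. This is a toy illustration, not a model of the enzyme: the function names and the nine-base example sequence are hypothetical, and strands are written 5-prime to 3-prime throughout.

```python
# Watson-Crick partners for DNA bases.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# RNA base that pairs with each DNA template base (A pairs with U in RNA).
RNA_PARTNER = {"A": "U", "T": "A", "G": "C", "C": "G"}


def template_from_coding(coding_5to3: str) -> str:
    """Return the template strand (written 5'->3') for a coding strand."""
    return "".join(DNA_COMPLEMENT[b] for b in reversed(coding_5to3.upper()))


def transcribe(template_5to3: str) -> str:
    """Build the RNA (5'->3') complementary to a template strand.

    The polymerase reads the template 3'->5', so the string is walked
    in reverse and each DNA base is paired with its RNA partner.
    """
    return "".join(RNA_PARTNER[b] for b in reversed(template_5to3.upper()))


coding = "ATGGCCTAA"            # hypothetical coding-strand fragment
template = template_from_coding(coding)
rna = transcribe(template)
print(template)  # TTAGGCCAT
print(rna)       # AUGGCCUAA
```

The RNA comes out identical to the coding strand with every T replaced by U, which is why gene sequences are conventionally reported as the coding strand.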

Not every gene is transcribed at all times. Proteins called transcription factors bind to specific DNA sequences near genes and determine whether RNA polymerase can access the promoter, the "start here" signal upstream of each gene. Activators open the door; repressors close it. This selective access is why a neuron and a liver cell share the same genome but produce different proteins and perform different functions.

In eukaryotes, the raw RNA transcript is not ready for translation. Three modifications convert it into mature mRNA: a protective cap is added to the 5-prime end, a tail of roughly 200 adenine nucleotides is added to the 3-prime end, and non-coding segments called introns are cut out, leaving only the coding exons. Bacteria, which lack a nucleus, can begin translating an mRNA while it is still being transcribed.

Most of the RNA produced in a eukaryotic cell is not mRNA at all. Roughly 80% of all transcription produces ribosomal RNA (rRNA), the structural and catalytic core of ribosomes. Transfer RNA (tRNA), small nuclear RNA (snRNA), and a growing catalogue of regulatory non-coding RNAs make up most of the rest. Protein-coding mRNA accounts for only a few percent of total RNA by mass, though it receives the most attention because it carries the instructions for building proteins.

Visual [Beginner]

Picture RNA polymerase as a molecular motor crawling along a gene. It binds at a promoter (a "start here" signal upstream of the gene), unwinds about 14 base pairs of DNA, reads the template strand, and strings together complementary RNA nucleotides one at a time. Behind it, the DNA rewinds. The RNA transcript peels away from the DNA as it grows: only the most recently added 8–9 nucleotides remain base-paired to the template.

Below the main diagram, a second panel shows the three processing steps applied to eukaryotic pre-mRNA: the 5-prime cap (a modified guanine attached backwards), the splicing reaction that removes introns and joins exons, and the poly(A) tail at the 3-prime end. The processed mRNA is shorter than the primary transcript and ready for export to the cytoplasm.

Worked example [Beginner]

A typical human gene spans about 30,000 base pairs (30 kbp). But the mature mRNA produced from that gene is only about 1,500 bases (1.5 kb). Where did the other 28,500 bases go?

The answer is introns. Human genes are organized as alternating coding regions (exons) and non-coding regions (introns). The entire 30 kbp gene is transcribed into a primary RNA transcript (pre-mRNA). Then the splicing machinery removes the introns and joins the exons together.

Step 1. RNA polymerase transcribes the full 30,000 bp gene, producing a pre-mRNA of 30,000 bases.

Step 2. The spliceosome recognises splice sites at the boundaries of each intron and exon, cuts out the introns (28,500 bases total), and ligates the exons together.

Step 3. After capping and polyadenylation, the mature mRNA is about 1,500 bases plus the cap and tail.

The fraction that is exonic: $1{,}500/30{,}000 = 0.05$, or just 5%. The fraction that is intronic: $28{,}500/30{,}000 = 0.95$, or 95%.

What this tells us: most of the DNA in a human gene does not code for protein. This varies enormously across species. Bacteria have almost no introns. Some human genes are over 99% intronic. The human dystrophin gene, mutated in Duchenne muscular dystrophy, spans 2.4 million base pairs but produces a mature mRNA of only 14,000 bases — transcription of this single gene takes roughly 16 hours.
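The arithmetic in this example is easy to reproduce. The sketch below recomputes the exonic fractions and estimates the dystrophin transcription time; the elongation rate of 40 nucleotides per second is an assumed value chosen from within the range usually quoted for Pol II, and the function names are illustrative.

```python
def exon_fraction(gene_bp: int, mature_mrna_nt: int) -> float:
    """Fraction of the transcribed gene that survives in the mature mRNA."""
    return mature_mrna_nt / gene_bp


def transcription_time_hours(gene_bp: int, rate_nt_per_s: float) -> float:
    """Time to transcribe a gene end to end at a constant elongation rate."""
    return gene_bp / rate_nt_per_s / 3600


typical = exon_fraction(30_000, 1_500)
dystrophin = exon_fraction(2_400_000, 14_000)
# Assumed elongation rate: 40 nt/s (no pausing, constant speed).
hours = transcription_time_hours(2_400_000, 40.0)

print(f"typical gene exonic fraction: {typical:.2%}")   # 5.00%
print(f"dystrophin exonic fraction:   {dystrophin:.2%}")  # 0.58%
print(f"dystrophin transcription:     {hours:.1f} h")    # 16.7 h
```

At the assumed rate the estimate lands close to the roughly 16 hours quoted above; a slower or faster polymerase would shift it proportionally.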

Check your understanding [Beginner]

Formal definition [Intermediate+]

Transcription is the template-directed synthesis of RNA from a DNA template by RNA polymerase, using ribonucleoside triphosphates (NTPs) as substrates. Each incoming NTP is selected by Watson-Crick base pairing with the template strand, and a phosphodiester bond forms between the 3-prime hydroxyl of the growing RNA chain and the 5-prime phosphate of the incoming NTP. RNA synthesis proceeds in the 5-prime to 3-prime direction, as in DNA replication 17.05.01 pending.

RNA polymerases

Bacteria: A single RNA polymerase (RNAP) transcribes all genes. The core enzyme comprises five subunits ( $α_{2} β β^{'} ω$ ) that catalyse phosphodiester bond formation. The sigma factor ( $σ^{70}$ for most housekeeping genes) confers promoter specificity by recognising the $- 10$ (TATAAT) and $- 35$ (TTGACA) promoter elements. Sigma dissociates after initiation, and the core enzyme elongates alone ^{[Jacob & Monod 1961]}.

Eukaryotes: Three nuclear RNA polymerases, distinguished biochemically by their sensitivity to the toxin alpha-amanitin ^{[Roeder & Rutter 1969]}:

Pol I: transcribes the large ribosomal RNA precursor (45S pre-rRNA, processed into 28S, 18S, and 5.8S rRNA). Located in the nucleolus. Alpha-amanitin insensitive.
Pol II: transcribes all protein-coding genes (mRNA) and most small nuclear RNAs (snRNAs). Alpha-amanitin sensitive at low concentrations (~1 μg/mL). Pol II is the focus of this unit.
Pol III: transcribes transfer RNA (tRNA), 5S rRNA, and other small RNAs. Alpha-amanitin sensitive only at high concentrations (~100 μg/mL).

Pol II has 12 subunits (Rpb1–Rpb12). The Rpb1 subunit contains the C-terminal domain (CTD), a unique repeat of the heptapeptide sequence Tyr-Ser-Pro-Thr-Ser-Pro-Ser (52 repeats in humans), whose phosphorylation state coordinates the transitions between initiation, elongation, and RNA processing.

Promoters and enhancers

Promoters are DNA sequences that direct RNA polymerase binding and establish the transcription start site (TSS). In bacteria, the core promoter has two conserved elements centred at positions $- 10$ and $- 35$ relative to the TSS. In eukaryotes, Pol II promoters contain:

The TATA box ( $\sim$ $- 25$ to $- 30$ ): consensus TATAAA, bound by TBP (TATA-binding protein, a subunit of TFIID).
The Initiator (Inr) element ( $\sim$ TSS): consensus PyPyAN(T/A)PyPy.
The downstream promoter element (DPE) ( $\sim$ $+ 28$ to $+ 34$ ): recognised by TFIID when no TATA box is present.
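As an illustration of what a consensus sequence means in practice, the sketch below scans a DNA string for the TATA box and Inr consensus motifs listed above using regular expressions. The toy promoter sequence and the helper name `find_motifs` are hypothetical, and real promoter prediction uses position weight matrices rather than exact consensus matching.

```python
import re
from typing import List

# Consensus motifs written as regular expressions.
# Py = pyrimidine (C or T); N = any base.
TATA_BOX = re.compile(r"TATAAA")
INR = re.compile(r"[CT][CT]A[ACGT][TA][CT][CT]")


def find_motifs(seq: str, pattern: "re.Pattern") -> List[int]:
    """Return 0-based start indices of every match, including overlaps."""
    # A zero-width lookahead reports overlapping occurrences too.
    lookahead = re.compile(f"(?={pattern.pattern})")
    return [m.start() for m in lookahead.finditer(seq.upper())]


# Hypothetical toy promoter: a TATA box, a G-rich spacer, then an Inr.
promoter = "GGGC" + "TATAAA" + "G" * 21 + "CCAGTCC" + "GGGG"

tata_hits = find_motifs(promoter, TATA_BOX)
inr_hits = find_motifs(promoter, INR)
print(tata_hits)  # [4]
print(inr_hits)   # [31]
```

In this toy sequence the A of the Inr (index 33) sits 29 bases downstream of the start of the TATA box, consistent with the spacing given in the list.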

Enhancers are DNA sequences that activate transcription from a promoter that can be thousands of base pairs away, upstream or downstream, in either orientation. They function by binding activator proteins, which contact the promoter-bound machinery through DNA looping mediated by the Mediator complex and architectural proteins (CTCF, cohesin). A single gene can be regulated by dozens of enhancers active in different tissues.

The transcription cycle

Initiation. TBP/TFIID binds the TATA box. The remaining general transcription factors (TFIIA, TFIIB, TFIIE, TFIIF, TFIIH) and Pol II assemble stepwise into the pre-initiation complex (PIC). TFIIH, using its XPB helicase subunit, unwinds $\sim$ 11 base pairs of DNA around the TSS to form the open complex. Pol II begins synthesizing RNA but pauses after producing $\sim$ 20–60 nucleotides. This promoter-proximal pause is a major regulatory checkpoint in metazoans.

Elongation. Release from pausing requires the kinase P-TEFb (positive transcription elongation factor b, composed of CDK9 and cyclin T1), which phosphorylates the Pol II CTD on Ser2 and the pause factors DSIF and NELF. Pol II then elongates at $\sim$ 20–80 nucleotides per second, synthesising the full RNA transcript.

Termination. In bacteria, termination is either intrinsic (a GC-rich hairpin followed by a poly-U tract in the RNA causes polymerase dissociation) or Rho-dependent (the Rho helicase tracks along the RNA and displaces the polymerase). In eukaryotes, Pol II termination is coupled to 3-prime end processing: after the polyadenylation signal (AAUAAA) is transcribed, the pre-mRNA is cleaved, and the downstream RNA still attached to the polymerase is degraded 5-prime to 3-prime by the exonuclease Xrn2, which catches up with Pol II and dislodges it from the template (the torpedo model).

RNA processing (eukaryotic pre-mRNA)

Three co-transcriptional modifications convert the primary transcript to mature mRNA:

5-prime capping. A 7-methylguanosine cap is attached to the first nucleotide via a 5-prime-to-5-prime triphosphate linkage when the transcript is $\sim$ 20–30 nucleotides long. The cap protects mRNA from 5-prime exonucleases and recruits the ribosome during translation initiation.

Splicing. Introns are removed and exons joined by the spliceosome, a megacomplex of five snRNPs (U1, U2, U4, U5, U6) and $\sim$ 150 proteins. The reaction proceeds through two transesterifications: first, the 2-prime hydroxyl of the branch-point adenosine attacks the 5-prime splice site, forming a lariat intermediate; second, the freed 3-prime hydroxyl of the upstream exon attacks the 3-prime splice site, ligating the exons and releasing the lariat intron. Alternative splicing produces multiple mRNA isoforms from one gene. Roughly 95% of human multi-exon genes undergo alternative splicing ^{[Sharp 1993]}.

3-prime polyadenylation. After the polyadenylation signal (AAUAAA) is transcribed, CPSF (cleavage and polyadenylation specificity factor) binds it, and the pre-mRNA is cleaved 10–30 nucleotides downstream. Poly(A) polymerase then adds $\sim$ 200 adenine residues, producing the poly(A) tail that stabilises the mRNA and promotes translation.
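The sequence-level consequences of splicing and polyadenylation can be sketched as string operations. This is a deliberately crude illustration: exon coordinates are supplied by hand rather than detected, the cap is not modelled because it is not a templated base, the cleavage offset of 20 nucleotides is an assumed value inside the 10–30 nucleotide window, and the toy transcript and function names are hypothetical.

```python
from typing import List, Tuple

POLY_A_SIGNAL = "AAUAAA"


def splice(pre_mrna: str, exons: List[Tuple[int, int]]) -> str:
    """Join exons given as 0-based, half-open (start, end) intervals."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def cleave_and_polyadenylate(rna: str, offset: int = 20,
                             tail_length: int = 200) -> str:
    """Cleave `offset` nt downstream of the first AAUAAA, then add a tail."""
    signal_start = rna.find(POLY_A_SIGNAL)
    if signal_start == -1:
        raise ValueError("no AAUAAA polyadenylation signal found")
    cut = signal_start + len(POLY_A_SIGNAL) + offset
    if cut > len(rna):
        raise ValueError("transcript ends before the cleavage site")
    return rna[:cut] + "A" * tail_length


# Hypothetical toy transcript: two exons separated by one intron.
exon1 = "AUGGCC"
intron = "GUAAGUUUUCAG"
exon2 = "GAUUGA" + POLY_A_SIGNAL + "C" * 25
pre_mrna = exon1 + intron + exon2
exon_coords = [(0, 6), (18, len(pre_mrna))]

spliced = splice(pre_mrna, exon_coords)
mature = cleave_and_polyadenylate(spliced)
print(len(pre_mrna), len(spliced), len(mature))  # 55 43 238
```

The two steps are applied one after the other here only for simplicity; in the cell both are coupled to ongoing transcription.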

Counterexamples to common slips

"Transcription starts at the ATG start codon." It does not. The transcription start site (TSS) is typically 25–35 bp downstream of the TATA box. The ATG start codon is the translation initiation site, usually tens to hundreds of bases downstream of the TSS in the first exon. Transcription and translation start at different positions and use different initiation signals.
"RNA polymerase moves steadily along DNA." In reality, Pol II pauses frequently: at promoter-proximal positions, at splice sites, at nucleosomes, and at DNA damage sites. Backtracking (reverse movement of the polymerase along the template) is a proofreading mechanism. The elongation rate is highly non-uniform.
"All transcription produces mRNA." In fact, rRNA accounts for $\sim$ 80% of all RNA synthesized in a eukaryotic cell. Pol I produces rRNA, Pol III produces tRNA and 5S rRNA, and Pol II also produces many non-coding RNAs (snRNA, snoRNA, miRNA precursors, lncRNA). Protein-coding mRNA is a small fraction by mass, though it encodes the entire proteome.

Key theorem with proof [Intermediate+]

Theorem (Two-state promoter model — steady-state mRNA level). Consider a single promoter that switches between an inactive state (I) and an active state (A) with first-order rate constants $k_{on}$ (I $\to$ A) and $k_{off}$ (A $\to$ I). In the active state, RNA polymerase initiates transcription at rate $k_{init}$ , producing mRNA that degrades at rate $δ$ . At steady state, the mRNA copy number is

$[mRNA]_{ss} = \frac{k _{on} \cdot k _{init}}{( k _{on} + k _{off} ) \cdot δ} .$

Proof. Let $P_{A}$ denote the fraction of time the promoter spends in the active state. The promoter dynamics are governed by:

$\frac{d P _{A}}{d t} = k_{on} (1 - P_{A}) - k_{off} \cdot P_{A},$

where $P_{I} = 1 - P_{A}$ by conservation. At steady state, $d P_{A} / d t = 0$ , giving $P_{A}^{ss} = k_{on} / (k_{on} + k_{off})$ .

The mRNA dynamics are:

$\frac{d [ mRNA ]}{d t} = k_{init} \cdot P_{A} - δ \cdot [mRNA] .$

At steady state, $d [mRNA] / d t = 0$ . Substituting $P_{A}^{ss}$ :

$0 = k_{init} \cdot \frac{k _{on}}{k _{on} + k _{off}} - δ \cdot [mRNA]_{ss},$

which rearranges to $[mRNA]_{ss} = k_{on} \cdot k_{init} / [(k_{on} + k_{off}) \cdot δ]$ . $□$

This result partitions the determinants of gene expression into three independent factors: the promoter activation fraction $k_{on} / (k_{on} + k_{off})$ , the initiation rate $k_{init}$ , and the mRNA stability $1/ δ$ . Because the factors multiply, changing any one of them changes the steady-state mRNA level proportionally. The activation fraction itself saturates, however, so a change in $k_{on}$ or $k_{off}$ alone generally produces a less-than-proportional change in mRNA. Activators increase $k_{on}$ or decrease $k_{off}$ ; repressors do the opposite; signals that affect mRNA stability change $δ$ .
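The theorem can be checked numerically. The sketch below evaluates the closed-form steady state and compares it with a forward-Euler integration of the two differential equations from the proof, starting from an inactive promoter and no mRNA. The parameter values, step size, and function names are illustrative assumptions, and the model is the deterministic mean-field version (no stochastic bursting).

```python
def steady_state_mrna(k_on: float, k_off: float,
                      k_init: float, delta: float) -> float:
    """Closed-form steady-state mRNA level of the two-state model."""
    return k_on * k_init / ((k_on + k_off) * delta)


def integrate(k_on: float, k_off: float, k_init: float, delta: float,
              t_end: float = 200.0, dt: float = 0.01):
    """Forward-Euler integration of dP_A/dt and d[mRNA]/dt from zero."""
    p_active, mrna = 0.0, 0.0
    for _ in range(int(t_end / dt)):
        dp = k_on * (1.0 - p_active) - k_off * p_active
        dm = k_init * p_active - delta * mrna
        p_active += dp * dt
        mrna += dm * dt
    return p_active, mrna


# Illustrative rate constants, all in min^-1.
params = dict(k_on=0.5, k_off=0.3, k_init=5.0, delta=0.1)
analytic = steady_state_mrna(**params)
p_num, m_num = integrate(**params)

print(f"analytic [mRNA]_ss = {analytic:.2f}")  # 31.25
print(f"numerical P_A      = {p_num:.3f}")     # 0.625
print(f"numerical [mRNA]   = {m_num:.2f}")     # 31.25
```

The integration time of 200 minutes is twenty mRNA lifetimes ($1/δ$ = 10 min), which is long enough for the slowest process in the system to relax to steady state.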

Bridge. The two-state result builds toward the epigenetic regulation analysed in the Master-tier sections below, where $k_{on}$ and $k_{off}$ are themselves determined by chromatin state and transcription factor availability. It is also the quantitative substrate on which cell-signalling pathways 17.07.01 pending act: extracellular signals modulate transcription factor phosphorylation and nuclear localisation, changing the effective $k_{on}$ at target promoters. The central insight is that transcription rate is not a fixed property of a gene but an emergent property of the kinetic competition between activation and repression. Downstream, the steady-state mRNA level derived here is the input to translation 17.05.03 pending, where it sets the amount of template available for ribosomal protein synthesis.

Exercises [Intermediate+]

Exercise 4 (medium, symbolic).

Explain how an enhancer can regulate a promoter that is 50,000 bp away. What is the physical mechanism by which the enhancer-bound activator contacts the transcription machinery at the promoter?

Hint

DNA is flexible at this scale. Think about DNA looping and coactivator complexes.

Answer

Enhancers work through DNA looping. Activator proteins bind to enhancer sequences. The intervening DNA loops out, bringing the activator into physical proximity with the promoter. The activator contacts the transcription machinery through the Mediator complex, a 26-subunit coactivator that bridges enhancer-bound activators with Pol II and the GTFs at the promoter. Chromatin architectural proteins (CTCF, cohesin) help stabilise these loops. This mechanism explains why enhancer activity is position- and orientation-independent — what matters is physical proximity in three-dimensional space, not linear distance along the DNA.

Exercise 5 (medium, symbolic).

Alpha-amanitin (from the death cap mushroom) inhibits RNA polymerase II at low concentrations but requires much higher concentrations to inhibit Pol III, and does not affect Pol I at all. Explain how this differential sensitivity is used to determine which polymerase transcribes a given gene.

Hint

If adding a low concentration of a drug blocks a particular RNA's synthesis, which polymerase is responsible?

Answer

This differential sensitivity is used experimentally as a pharmacological diagnostic:

If RNA synthesis is blocked by low alpha-amanitin (~1 μg/mL): the gene is transcribed by Pol II (mRNA, most snRNAs).
If RNA synthesis is blocked only by high concentrations (~100 μg/mL): the gene is transcribed by Pol III (tRNA, 5S rRNA).
If RNA synthesis is unaffected: the gene is transcribed by Pol I (large rRNA).

This pharmacological profile was one of the key pieces of evidence establishing the division of labour among the three eukaryotic polymerases ^{[Roeder & Rutter 1969]}.
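The diagnostic logic in this answer amounts to a small decision rule, sketched below. The function name and the boolean encoding of the two experiments are illustrative assumptions; the "low" and "high" doses correspond to the concentrations quoted above.

```python
def assign_polymerase(blocked_at_low: bool, blocked_at_high: bool) -> str:
    """Infer the responsible polymerase from alpha-amanitin sensitivity.

    blocked_at_low:  synthesis is inhibited at the low dose.
    blocked_at_high: synthesis is inhibited at the high dose.
    """
    if blocked_at_low and not blocked_at_high:
        # Inhibition at a low dose but not a higher one is inconsistent.
        raise ValueError("inconsistent result: repeat the experiment")
    if blocked_at_low:
        return "Pol II"
    if blocked_at_high:
        return "Pol III"
    return "Pol I"


print(assign_polymerase(True, True))    # Pol II
print(assign_polymerase(False, True))   # Pol III
print(assign_polymerase(False, False))  # Pol I
```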

Exercise 6 (hard, numeric).

In the two-state promoter model, a repressor reduces $k_{on}$ from $0.5\ \text{min}^{-1}$ to $0.05\ \text{min}^{-1}$ without changing $k_{off} = 0.3\ \text{min}^{-1}$ , $k_{init} = 5\ \text{min}^{-1}$ , or $δ = 0.1\ \text{min}^{-1}$ . Calculate the fold-change in steady-state mRNA level caused by the repressor.

Hint

Compute $[mRNA]_{ss}$ before and after repression. The fold-change is the ratio.

Answer

Before repression: $[mRNA]_{ss} = (0.5 \times 5) / [(0.5 + 0.3) \times 0.1] = 2.5/0.08 = 31.25$ copies.

After repression: $[mRNA]_{ss} = (0.05 \times 5) / [(0.05 + 0.3) \times 0.1] = 0.25/0.035 = 7.14$ copies.

Fold-change: $7.14/31.25 = 0.23$ , or roughly a 4.4-fold reduction. Note that a 10-fold reduction in $k_{on}$ produces only a 4.4-fold reduction in mRNA because the active fraction changes from $0.5/0.8 = 62.5\%$ to $0.05/0.35 = 14.3\%$ : a 4.4-fold change in active fraction, not 10-fold, because $k_{off}$ contributes to the denominator.
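Because $k_{init}$ and $δ$ cancel in the ratio, the fold-reduction depends only on the active fractions before and after repression. The sketch below makes that explicit and sweeps $k_{off}$ to show how the same 10-fold drop in $k_{on}$ translates into different mRNA responses. The function names and the swept values are illustrative assumptions.

```python
def active_fraction(k_on: float, k_off: float) -> float:
    """Steady-state fraction of time the promoter is active."""
    return k_on / (k_on + k_off)


def fold_reduction(k_on_before: float, k_on_after: float,
                   k_off: float) -> float:
    """Ratio of steady-state mRNA before vs after a change in k_on.

    k_init and delta appear in both levels and cancel in the ratio.
    """
    return (active_fraction(k_on_before, k_off)
            / active_fraction(k_on_after, k_off))


# Same 10-fold drop in k_on (0.5 -> 0.05 min^-1) at several k_off values.
for k_off in (0.03, 0.3, 3.0, 30.0):
    fold = fold_reduction(0.5, 0.05, k_off)
    print(f"k_off = {k_off:>5} min^-1 -> {fold:.1f}-fold reduction")
```

When $k_{off}$ is small the promoter is nearly always active and the response is buffered; only when $k_{off}$ is much larger than $k_{on}$ does the fold-reduction approach the full 10-fold change in $k_{on}$.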

Exercise 7 (hard, symbolic).

Alternative splicing of the Dscam gene in Drosophila can theoretically produce 38,016 different protein isoforms from a single gene. The gene contains 24 exons, of which exon 4 has 12 variants, exon 6 has 48 variants, exon 9 has 33 variants, and exon 17 has 2 variants. Verify that $12 \times 48 \times 33 \times 2 = 38{,}016$ . Explain why this combinatorial diversity matters for the nervous system.

Hint

Multiply the variants at each alternatively spliced exon. Dscam is involved in neuronal self-recognition.

Answer

$12 \times 48 = 576$; $576 \times 33 = 19{,}008$; $19{,}008 \times 2 = 38{,}016$. Confirmed.

Dscam (Down syndrome cell adhesion molecule) is a cell-surface protein involved in neuronal self-avoidance. Each neuron expresses a unique set of Dscam isoforms, which serves as a molecular identity tag. When axonal or dendritic processes from the same neuron encounter each other, matching Dscam isoforms signal "self" and cause them to repel, preventing self-crossing while allowing non-self interactions. The enormous isoform diversity from one gene ensures that each neuron has a practically unique identity, enabling the complex wiring of the nervous system from a compact genome.
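The count can be verified in one line. The sketch assumes, as the exercise does, that one variant is chosen independently at each of the four alternatively spliced positions; the dictionary name is illustrative.

```python
from math import prod

# Number of mutually exclusive variants at each alternative exon position.
dscam_variants = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# Independent choices multiply.
isoforms = prod(dscam_variants.values())
print(isoforms)  # 38016
```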

Exercise 8 (hard, symbolic).

The antibiotic rifampicin binds to the beta subunit of bacterial RNA polymerase and blocks the exit channel for the growing RNA chain. Predict the stage of transcription at which rifampicin acts and explain why it kills bacteria without affecting human cells.

Hint

If RNA cannot exit the polymerase, when does the block occur? Compare bacterial RNAP to human Pol II.

Answer

Rifampicin blocks the earliest steps of RNA chain extension: promoter binding and formation of the first phosphodiester bond still occur, but the chain cannot grow past $\sim$ 2–3 nucleotides. The drug physically occludes the path of the nascent RNA, so once the chain reaches that length it clashes with the bound drug, extension stops, and the short transcript is released.

Rifampicin is bactericidal because it binds the bacterial RNAP beta subunit (encoded by the rpoB gene) with high affinity. Human RNA polymerases have a different structure in the corresponding region and do not bind rifampicin at therapeutic concentrations. Resistance mutations in rpoB are a significant clinical problem in tuberculosis treatment.

Eukaryotic transcription initiation machinery [Master]

The initiation of transcription by RNA polymerase II requires the coordinated assembly of a multi-megadalton protein complex at the promoter. This process, the stepwise construction of the pre-initiation complex (PIC), is one of the most studied molecular assemblies in biology and has been resolved in near-atomic detail by X-ray crystallography and cryo-EM, notably in the work of Kornberg and Cramer ^{[Kornberg 2006, Cramer et al. 2008]}.

The process begins when TBP (TATA-binding protein), a saddle-shaped subunit of TFIID, binds the TATA box. TBP inserts two phenylalanine pairs between base pairs at the TATA sequence, kinking the DNA by roughly 80 degrees and partially unwinding it. This distortion creates a molecular landmark visible to the remaining factors. TFIID itself contains TBP plus $\sim$ 13 TAFs (TBP-associated factors), several of which recognise additional promoter elements (the Inr, DPE) and interact with activator proteins bound at nearby enhancers.

After TBP/TFIID is bound, the remaining GTFs assemble in a partially ordered pathway. TFIIA stabilises TBP-DNA binding. TFIIB bridges TBP and Pol II, positioning the polymerase over the TSS and contributing to start-site selection. TFIIF enters with Pol II, reducing non-specific DNA binding and helping to guide the template strand into the active site. TFIIE recruits and regulates TFIIH. TFIIH, the last factor to join, performs two essential enzymatic activities: its XPB subunit uses ATP hydrolysis to unwind DNA at the start site (forming the transcription bubble), and its CDK7/cyclin H subunit (the kinase module CAK) phosphorylates the Pol II CTD on Ser5.

The assembled PIC positions Pol II at the start site with the template strand threaded through the active-site cleft. XPB-driven unwinding produces an open complex of $\sim$ 11 unwound base pairs. Pol II then begins RNA synthesis, but does not immediately clear the promoter. Instead, it undergoes abortive initiation — repeatedly synthesising and releasing short RNA transcripts of 2–9 nucleotides — before successfully synthesising a transcript long enough to displace the contacts holding it at the promoter. The transition from abortive initiation to productive elongation is called promoter escape and is a significant kinetic barrier.

The Mediator complex is the coactivator that bridges enhancer-bound transcription factors with the PIC. Mediator is a 26-subunit complex ( $\sim$ 1.2 MDa) organised into head, middle, tail, and kinase modules. The tail module interacts with gene-specific activators; the head module contacts Pol II and the GTFs. Mediator was discovered biochemically as an activity required for activator-dependent transcription in vitro, and many of its subunits are encoded by genes first found in yeast genetic screens as suppressors of RNA polymerase II CTD truncation mutations (the SRB genes). Mediator functions as a signal integrator: different combinations of activators recruit Mediator in different conformations, transmitting different activation signals to Pol II.

Chromatin structure imposes an additional layer of regulation on initiation. Nucleosomes positioned over the promoter block TBP binding and PIC assembly. Two families of ATP-dependent chromatin remodelers counteract this: SWI/SNF remodelers (SWI2/SNF2 in yeast, BRG1/BRM in mammals) use ATP hydrolysis to slide or eject nucleosomes, creating a nucleosome-depleted region (NDR) at active promoters; ISWI remodelers establish regular nucleosome spacing downstream of the TSS, positioning the +1 nucleosome to regulate promoter-proximal pausing. Active promoters are marked by the histone modification H3K4me3 (trimethylation of lysine 4 on histone H3), deposited by the Set1/COMPASS complex. The mark is read by the PHD finger of the TFIID subunit TAF3, which helps recruit TFIID to the promoter. This creates a positive feedback loop: transcription initiation recruits H3K4 methylation, which in turn facilitates TFIID binding and further initiation.

Promoter-proximal pausing and elongation control [Master]

One of the most consequential discoveries in transcription biology was that RNA polymerase II does not elongate continuously after initiation. In metazoans, Pol II pauses $\sim$ 20–60 nucleotides downstream of the TSS at most active genes, creating a pool of engaged but paused polymerases that can be rapidly released into productive elongation. This promoter-proximal pausing was first identified at heat-shock genes in Drosophila, where Pol II is pre-loaded at the promoter and released within seconds of heat shock, producing a burst of mRNA far faster than would be possible if transcription had to start from PIC assembly ^{[Ptashne & Gann 2002]}.

Two factors hold Pol II in the paused state. DSIF (DRB sensitivity-inducing factor, composed of Spt4 and Spt5) binds the polymerase early in elongation, before it reaches the pause site. NELF (negative elongation factor, four subunits) associates with DSIF and the nascent RNA to stabilise the pause. Structural studies show that NELF binds across the Pol II surface, physically restricting conformational changes needed for translocation. The paused complex is stable, with Pol II sitting on the DNA for minutes to hours before release.

Release requires P-TEFb (positive transcription elongation factor b), a heterodimer of CDK9 and cyclin T1. P-TEFb phosphorylates three substrates: the Pol II CTD on Ser2 (marking the transition to elongation), NELF (causing its dissociation), and DSIF/Spt5 (converting it from a negative to a positive elongation factor). The switch from paused to elongating Pol II is one of the most regulated steps in mammalian gene expression and is the rate-limiting step for many inducible genes.

The Pol II CTD code coordinates elongation with co-transcriptional RNA processing. The CTD consists of 52 heptad repeats (in humans) of the sequence YSPTSPS. Different enzymes modify different residues at different stages: TFIIH/CDK7 phosphorylates Ser5 at initiation (recruiting the capping enzyme); P-TEFb/CDK9 phosphorylates Ser2 during elongation (recruiting splicing factors and 3-prime processing factors); in yeast, the CDK9-related kinase Bur1 also contributes to Ser2 phosphorylation; and the phosphatase Fcp1 dephosphorylates Ser2 after termination. Thr4 and other positions in the repeat can be phosphorylated as well, extending the code. This combinatorial code, the "CTD code", ensures that RNA processing events occur in the correct temporal order as Pol II moves along the gene ^{[Kornberg 2006]}.

Elongation itself is not uniform. Pol II encounters nucleosomes every $\sim$ 200 bp and must disrupt histone-DNA contacts to transcribe through. The histone chaperones FACT (facilitates chromatin transcription) and Spt6 disassemble nucleosomes ahead of Pol II and reassemble them behind, maintaining chromatin integrity while allowing passage. When Pol II encounters a damaged template, it stalls and may backtrack, extruding the 3-prime end of the RNA out through the polymerase pore. The elongation factor TFIIS (encoded by TCEA1) inserts a zinc finger into the Pol II active site, stimulating an intrinsic endonucleolytic activity that cleaves the extruded RNA, creating a new 3-prime end aligned in the active site for a fresh elongation attempt.

The HIV Tat protein provides the central worked example of pausing regulation. The HIV promoter is transcribed by host Pol II, but elongation is initially inefficient because the paused polymerase is not released. The viral Tat protein binds the TAR RNA element (a stem-loop at the 5-prime end of the nascent HIV transcript) and directly recruits P-TEFb to the promoter. Tat-P-TEFb phosphorylates the Pol II CTD and NELF/DSIF, releasing the pause. This single regulatory event increases HIV transcription roughly 100-fold and is the mechanism targeted by latency-reversing agents in HIV cure strategies. Without Tat, the integrated HIV provirus remains transcriptionally silent — a state called viral latency that allows the virus to persist despite antiretroviral therapy.

Epigenetic regulation of transcription [Master]

Transcription is regulated not only by DNA sequence (promoters, enhancers) but by heritable chemical modifications to chromatin that modulate DNA accessibility. This epigenetic layer of regulation determines which regions of the genome are available for transcription in each cell type and is responsible for the stable maintenance of cell identity through cell division.

DNA methylation is the best-characterised epigenetic mark. In vertebrates, cytosine residues in CpG dinucleotides can be methylated at the 5-carbon position by DNA methyltransferases (DNMT1 maintains methylation during replication; DNMT3A/3B establish new methylation patterns). Roughly 70–80% of CpGs in the human genome are methylated. Unmethylated CpG clusters (CpG islands) coincide with the promoters of $\sim$ 70% of human genes. Methylation of a CpG island promoter silences transcription by two mechanisms: direct inhibition of transcription factor binding (many TFs cannot bind their recognition sequences when CpGs within them are methylated), and recruitment of methyl-binding proteins (MeCP2, MBD1–4) that in turn recruit histone deacetylases (HDACs) and other repressive complexes. Rett syndrome, a severe neurodevelopmental disorder, is caused by mutations in MeCP2 that disrupt its ability to read the methylation mark ^{[Strahl & Allis 2000]}.

The histone code hypothesis, proposed by Strahl and Allis in 2000, posits that specific combinations of post-translational modifications on histone tails create binding surfaces for effector proteins that alter chromatin structure and transcription. The principal modifications are:

Acetylation of lysine residues (deposited by histone acetyltransferases, HATs; removed by HDACs). Acetylation neutralises the positive charge on lysine, weakening histone-DNA contacts and promoting open chromatin. H3K27ac marks active enhancers; H3K9ac marks active promoters. Bromodomain-containing proteins (readers) bind acetylated lysines and recruit transcriptional activators.
Methylation of lysine residues (deposited by histone methyltransferases, HMTs; removed by demethylases like LSD1 and JmjC-domain proteins). Methylation does not change charge and can signal either activation or repression depending on the site: H3K4me3 marks active promoters; H3K36me3 marks transcribed gene bodies; H3K27me3 marks repressed genes; H3K9me3 marks constitutive heterochromatin. Chromodomain proteins read H3K27me3; PHD fingers read H3K4me3.
Ubiquitination, phosphorylation, SUMOylation, and other modifications add further layers of regulation.

Two antagonistic chromatin-modifying complexes define the bivalent state of developmental genes in embryonic stem cells. Polycomb repressive complex 2 (PRC2, containing the HMT EZH2) deposits H3K27me3, silencing genes. Trithorax group complexes (including MLL/COMPASS) deposit H3K4me3, activating genes. Developmental regulators in ES cells often carry both marks simultaneously — a "bivalent" state that keeps genes repressed but primed for rapid activation upon differentiation. Resolution of bivalency, with loss of one mark, commits the gene to either active or repressed status in the differentiated cell.
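The bivalent logic described above reduces to a lookup on two marks. The sketch below is a minimal illustration only: the function name and labels are hypothetical, and real chromatin-state calling from ChIP-seq data integrates many more marks and quantitative signal thresholds.

```python
from typing import Set


def promoter_state(marks: Set[str]) -> str:
    """Classify a promoter from the presence of H3K4me3 and H3K27me3."""
    has_k4 = "H3K4me3" in marks     # Trithorax/COMPASS activating mark
    has_k27 = "H3K27me3" in marks   # PRC2 repressive mark
    if has_k4 and has_k27:
        return "bivalent (repressed but poised)"
    if has_k4:
        return "active"
    if has_k27:
        return "repressed"
    return "unmarked"


print(promoter_state({"H3K4me3", "H3K27me3"}))  # bivalent (repressed but poised)
print(promoter_state({"H3K4me3"}))              # active
print(promoter_state({"H3K27me3"}))             # repressed
```

Resolution of bivalency on differentiation corresponds to moving from the first case to the second or third by loss of one mark.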

Enhancer-promoter communication depends on three-dimensional chromatin architecture. CCCTC-binding factor (CTCF) is a zinc-finger protein that binds insulator elements and defines the boundaries of topologically associating domains (TADs), self-interacting regions of the genome $\sim$ 200 kb to 1 Mb in size, visible in Hi-C contact maps. CTCF, together with the cohesin ring complex, mediates DNA looping that brings enhancers into proximity with their target promoters while insulating them from non-target genes. Disruption of TAD boundaries by structural variants can cause enhancer hijacking, in which an enhancer that normally activates one gene is brought into contact with another; when the new target is an oncogene, this can drive cancer. The same mechanism underlies developmental disease: rearrangements that break TAD boundaries at limb-patterning loci such as Shh can expose a gene to the wrong enhancer and cause limb malformations. TADs are largely conserved across mammalian species, and their boundaries coincide with CTCF binding sites that are oriented in a convergent manner, suggesting that the directionality of CTCF binding determines which loops form.

The heritability of epigenetic marks through cell division is ensured by coupled maintenance mechanisms. DNMT1 recognises hemimethylated CpG sites at the replication fork and methylates the newly synthesised strand, copying the methylation pattern. Histone modifications are re-established after replication through the action of "reader-writer" complexes that recognise existing modifications on parental histones and deposit the same mark on newly incorporated histones. PRC2, for example, binds existing H3K27me3 through its EED subunit and deposits additional H3K27me3 on adjacent nucleosomes, propagating the repressive mark. This coupling of replication to epigenetic maintenance ensures that cell identity is preserved through the $\sim 10^{14}$ cell divisions that occur during a human lifetime.

Non-coding RNA and transcription [Master]

The discovery that the majority of the human genome is transcribed into non-coding RNA — far more than the $\sim$ 2% that encodes protein — has fundamentally changed the understanding of transcriptional regulation. Non-coding RNAs function as signals, decoys, guides, and scaffolds that modulate transcription at multiple levels.

Long non-coding RNAs (lncRNAs), defined as transcripts longer than 200 nucleotides that do not encode protein, number in the tens of thousands in the human genome. Several lncRNAs have well-characterised mechanisms. Xist (X-inactive specific transcript) coats the inactive X chromosome in female mammalian cells, recruiting PRC2 and other silencing factors to establish chromosome-wide transcriptional repression. Xist is itself transcribed from the inactive X; its RNA spreads along the chromosome in cis, binding $\sim$ 100 sites through repeat elements (the A-repeat for PRC2 recruitment, other repeats for additional factors). Deletion of Xist prevents X-inactivation, showing that the locus is required for silencing. Xist RNA lacking the A-repeat still coats the chromosome but fails to silence it, indicating that the RNA molecule itself, not merely the act of its transcription, carries out silencing. HOTAIR (HOX antisense intergenic RNA) is transcribed from the HOXC locus on chromosome 12 but represses transcription at the HOXD locus on chromosome 2 in trans, the first demonstration of a trans-acting chromatin-regulatory lncRNA. HOTAIR bridges PRC2 (at its 5-prime end) with the LSD1/CoREST complex (at its 3-prime end), coupling deposition of H3K27me3 to removal of H3K4me2 at target loci.

Enhancer RNAs (eRNAs) are short, unstable, bidirectional transcripts produced from active enhancers. Their production correlates with enhancer activity across cell types, and eRNA knockdown reduces expression of the genes regulated by the corresponding enhancer. The mechanism is debated: eRNAs may stabilise enhancer-promoter loops by binding cohesin or Mediator, or they may recruit additional transcription factors to the enhancer. The production of eRNAs is a useful experimental marker for identifying active enhancers in the genome.

MicroRNA (miRNA) biogenesis begins with transcription by Pol II. Primary miRNA transcripts (pri-miRNAs) contain stem-loop structures recognised by the nuclear Drosha complex (Drosha + DGCR8), which cleaves the stem to release a $\sim$ 70-nucleotide precursor (pre-miRNA). Pre-miRNA is exported to the cytoplasm by Exportin-5 and cleaved by Dicer to produce a $\sim$ 22-nucleotide duplex. One strand is loaded into the RISC complex (containing Argonaute), where it guides base-pairing with target mRNAs, leading to translational repression or mRNA degradation. Transcription of pri-miRNAs is regulated by the same transcription factors and epigenetic mechanisms that control protein-coding genes, creating a regulatory network where transcription factors control miRNAs that in turn control the mRNA levels of other transcription factors.
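The base-pairing step can be illustrated with a short sketch. It assumes the common simplification that targeting is dominated by the "seed" (guide nucleotides 2-8), which the text above does not state. The guide is the human let-7a sequence; `toy_utr` is a hypothetical target written for this example, not a real transcript.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna: str) -> str:
    """Return the RNA strand that pairs antiparallel with `rna`."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def seed_sites(mirna: str, mrna: str) -> list:
    """Return 0-based start positions in `mrna` that pair with guide nucleotides 2-8."""
    seed = mirna[1:8]                 # nucleotides 2-8 (assumed seed definition)
    site = reverse_complement(seed)   # sequence the mRNA must carry to pair
    width = len(site)
    return [i for i in range(len(mrna) - width + 1) if mrna[i:i + width] == site]

let7a = "UGAGGUAGUAGGUUGUAUAGUU"          # 22-nt guide strand
toy_utr = "AAGCUACCUCAUUUGGCUACCUCA"      # hypothetical 3-prime UTR fragment

print(reverse_complement(let7a[1:8]))  # CUACCUC
print(seed_sites(let7a, toy_utr))      # [3, 16]
```

Exact seed matching is only a first-pass filter; real target prediction also weighs site context and pairing beyond the seed.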

Transcriptional interference occurs when transcription from one promoter affects the activity of a nearby promoter. Antisense transcription (non-coding RNAs produced from the opposite strand of a protein-coding gene) can interfere with sense-strand transcription by disrupting PIC assembly, altering chromatin state, or colliding with sense-strand polymerases. Genome-wide studies have shown that pervasive antisense transcription is a general feature of eukaryotic genomes, with regulatory functions at many loci. The ENCODE project's finding that $\sim$ 75% of the human genome is covered by primary transcripts in at least one of the cell lines surveyed reflects this pervasive transcription, much of it producing non-coding RNAs whose functions are still being characterised. The project's more widely quoted figure of $\sim$ 80% refers to sequence with some assigned biochemical activity, not to transcription alone.

Synthesis. The foundational reason transcription is regulated at so many levels — chromatin accessibility, PIC assembly, promoter-proximal pausing, elongation rate, RNA processing — is that gene expression is the primary mechanism by which cells respond to their environment and maintain identity across cell divisions. The central insight is that each regulatory layer provides an independent kinetic checkpoint: this is exactly what allows a single genome of $\sim$ 20,000 protein-coding genes to produce hundreds of distinct cell types. Putting these together with the two-state promoter model of the Intermediate tier, the transcription cycle from promoter binding to mRNA export identifies the rate of information flow from genome to proteome with the slowest checkpoint in the kinetic chain. The bridge is between the molecular machinery described here and the cell-signalling cascades 17.07.01 pending that feed into it, and the pattern recurs in the co-transcriptional coupling of splicing and mRNA processing, where the same Pol II molecule simultaneously synthesises and processes its transcript. This multi-layered architecture generalises beyond transcription to other information-processing systems in the cell: translation 17.05.03 pending, signal transduction 17.07.01 pending, and DNA replication 17.05.01 pending all use kinetic checkpoint hierarchies to ensure fidelity and responsiveness.
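The rate-limiting-checkpoint idea can be sketched numerically. The example assumes the simplest possible picture: a single chain of sequential first-order steps, so the mean time per cycle is the sum of the reciprocal rate constants. The step names and rate constants are hypothetical values chosen only for illustration.

```python
def mean_cycle_time(rates: dict) -> float:
    """Mean time to pass a chain of sequential first-order steps (sum of 1/k)."""
    return sum(1.0 / k for k in rates.values())

# Hypothetical rate constants (per minute), for illustration only.
rates = {
    "chromatin_opening": 0.05,
    "pic_assembly": 0.5,
    "pause_release": 0.2,
    "elongation": 1.0,
    "processing_export": 0.5,
}

total = mean_cycle_time(rates)
slowest = min(rates, key=rates.get)
share = (1.0 / rates[slowest]) / total
print(f"mean cycle time: {total:.1f} min; slowest step: {slowest} ({share:.0%} of total)")

# Doubling a fast step barely helps; doubling the slowest step helps a lot.
faster_elongation = dict(rates, elongation=2.0)
faster_opening = dict(rates, chromatin_opening=0.1)
print(f"{mean_cycle_time(faster_elongation):.1f} min vs {mean_cycle_time(faster_opening):.1f} min")
```

With these numbers the slowest checkpoint accounts for two-thirds of the cycle time, so regulation aimed at that step moves output far more than the same fold-change applied anywhere else in the chain.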

Connections [Master]

DNA replication 17.05.01 pending. Replication established the template-directed polymerisation principle that transcription inherits: the same Watson-Crick base-pairing rules, the same 5-prime-to-3-prime synthesis direction, and the same requirement for a nucleic acid template. The distinction is that replication copies the entire genome, while transcription copies individual genes, and transcription uses ribonucleotides (NTPs) rather than deoxyribonucleotides (dNTPs). Replication origins and promoters serve analogous functions as sequence-defined starting points for nucleic acid synthesis, but the regulatory logic differs: each replication origin may fire at most once per cell cycle, while transcription fires at gene-specific rates determined by cellular state.
Translation 17.05.03 pending. The mRNA produced by transcription is the direct substrate for the ribosome. Every regulatory decision encoded in the transcription process (promoter selection, splicing pattern, polyadenylation site choice) determines what protein isoform is produced and at what level. The coupling between transcription and translation is especially tight in bacteria, where ribosomes load onto the mRNA while RNA polymerase is still elongating.
Cell signalling 17.07.01 pending. Signalling pathways regulate transcription primarily by modulating transcription factor activity. Phosphorylation cascades from receptor tyrosine kinases and G-protein-coupled receptors converge on TFs that alter $k_{on}$ and $k_{off}$ at target promoters. NF-κB, STAT, CREB, and p53 are all signal-activated TFs whose activity is controlled by post-translational modification rather than by changes in gene expression, a fast regulatory layer that operates on the pre-existing TF pool.
Mutation and repair 17.06.01 pending. Mutations in regulatory DNA (promoters, enhancers, splice sites) can alter gene expression without changing any protein sequence. The operator-constitutive ($O^c$) mutations of the lac operon characterised by Jacob and Monod in 1961 lie in the operator (a regulatory DNA element), not in the structural genes. Conversely, the transcription-coupled repair pathway preferentially repairs the template strand of actively transcribed genes, linking transcription directly to DNA damage recognition.
Mendelian genetics 19.01.01 pending. Mendelian dominance relationships can arise from transcriptional mechanisms: haploinsufficiency occurs when one functional allele produces insufficient transcript for normal phenotype (the 50% output from a heterozygote falls below threshold), and dominant-negative alleles produce transcripts that interfere with the wild-type product. Transcription is the first step in the genotype-to-phenotype mapping that Mendelian genetics describes.
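The cell-signalling connection above, in which a signal changes $k_{on}$ and $k_{off}$ at a promoter, can be sketched with the steady-state mean of a two-state promoter model. The sketch assumes first-order switching, transcription only from the ON state at rate `k_tx`, and first-order mRNA decay at rate `k_deg`. The names `k_tx` and `k_deg` and all numerical values are hypothetical.

```python
def promoter_on_fraction(k_on: float, k_off: float) -> float:
    """Steady-state fraction of time a two-state promoter spends ON."""
    return k_on / (k_on + k_off)

def mean_mrna(k_on: float, k_off: float, k_tx: float, k_deg: float) -> float:
    """Steady-state mean mRNA copy number: ON fraction times k_tx / k_deg."""
    return promoter_on_fraction(k_on, k_off) * k_tx / k_deg

# Hypothetical rates (per minute). A signal-activated TF raises k_on only.
basal = mean_mrna(k_on=0.1, k_off=0.9, k_tx=10.0, k_deg=0.1)
stimulated = mean_mrna(k_on=0.9, k_off=0.9, k_tx=10.0, k_deg=0.1)

print(f"basal mean mRNA: {basal:.1f}")             # 10.0
print(f"stimulated mean mRNA: {stimulated:.1f}")   # 50.0
```

In this toy case a ninefold increase in $k_{on}$ gives only a fivefold increase in mean output, because the ON fraction saturates as $k_{on}$ approaches and exceeds $k_{off}$.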

Historical & philosophical context [Master]

The concept of a messenger RNA, an unstable intermediate carrying genetic information from DNA to ribosomes, was proposed by Sydney Brenner, François Jacob, and Matthew Meselson in 1961 ^{[Brenner, Jacob & Meselson 1961]}. Their experiment at Caltech showed that phage T4 infection of E. coli produced a short-lived RNA species complementary to phage DNA, associated with ribosomes but not a permanent component of them. This was the first demonstration of mRNA, confirming the prediction that gene expression required an intermediary molecule between DNA and protein.

The regulatory architecture of transcription was established by Jacob and Monod's lac operon model, published the same year ^{[Jacob & Monod 1961]}. Their genetic analysis of E. coli lactose metabolism distinguished three classes of genetic element: the structural genes (such as lacZ), a regulator gene (lacI) encoding the repressor, and the operator (where the repressor binds). The promoter (where RNA polymerase binds) was added to the model a few years later. They proposed that gene expression is controlled by repressors that block transcription. The lac operon model earned them a share of the Nobel Prize in Physiology or Medicine in 1965 and established the paradigm of gene regulation that still organises the field: genes are controlled by trans-acting proteins that bind cis-acting DNA elements.

The three eukaryotic nuclear RNA polymerases were distinguished by Roeder and Rutter in 1969 using DEAE-Sephadex chromatography of sea urchin nuclear extracts, separating three enzymatic activities with different chromatographic properties and different optimal salt concentrations ^{[Roeder & Rutter 1969]}. Their different alpha-amanitin sensitivities were established shortly afterwards. The assignment of each polymerase to its biological function (Pol I to rRNA, Pol II to mRNA, Pol III to tRNA) was completed over the following decade using the alpha-amanitin sensitivity diagnostic described in this unit's exercises.

Roger Kornberg's determination of the yeast Pol II structure at atomic resolution, culminating in a backbone model from diffraction data extending to about 3 angstroms in 2000 and the refined 2.8-angstrom structure published in Science in 2001, revealed the molecular mechanism of transcription at the level of individual atoms. The structure showed a deep cleft between the two largest subunits (Rpb1 and Rpb2), a clamp domain that grips the DNA, a pore through which NTPs enter and RNA is extruded during backtracking, and the CTD extending from the Rpb1 surface. Kornberg received the Nobel Prize in Chemistry in 2006 ^{[Kornberg 2006]}. Patrick Cramer's subsequent structural work on the Pol II elongation complex, the PIC, and the Mediator-bound transcription machinery extended the structural picture to the full transcription apparatus ^{[Cramer et al. 2008]}.

Bibliography [Master]

@article{BrennerJacobMeselson1961,
  author = {Brenner, S. and Jacob, F. and Meselson, M.},
  title = {An unstable intermediate carrying information from genes to
           ribosomes for protein synthesis},
  journal = {Nature},
  volume = {190},
  year = {1961},
  pages = {576--581},
}

@article{JacobMonod1961,
  author = {Jacob, F. and Monod, J.},
  title = {Genetic regulatory mechanisms in the synthesis of proteins},
  journal = {J. Mol. Biol.},
  volume = {3},
  year = {1961},
  pages = {318--356},
}

@article{RoederRutter1969,
  author = {Roeder, R. G. and Rutter, W. J.},
  title = {Multiple forms of {DNA}-dependent {RNA} polymerase in
           eukaryotic organisms},
  journal = {Nature},
  volume = {224},
  year = {1969},
  pages = {234--237},
}

@article{Kornberg2007,
  author = {Kornberg, R. D.},
  title = {The molecular basis of eukaryotic transcription},
  journal = {Chem. Rev.},
  volume = {107},
  year = {2007},
  pages = {3313--3326},
  note = {Nobel Lecture, December 8, 2006},
}

@article{Cramer2008,
  author = {Cramer, P. and Armache, K.-J. and Baumli, S. and Benkert, S.
            and Brueckner, F. and Buchen, C. and Damsma, G. E. and
            Dengl, S. and Geiger, S. R. and Jasiak, A. J. and others},
  title = {Structural biology of {RNA} polymerase {II}},
  journal = {Annu. Rev. Biophys.},
  volume = {37},
  year = {2008},
  pages = {337--352},
}

@article{StrahlAllis2000,
  author = {Strahl, B. D. and Allis, C. D.},
  title = {The language of covalent histone modifications},
  journal = {Nature},
  volume = {403},
  year = {2000},
  pages = {41--45},
}

@article{PeccoudYcart1995,
  author = {Peccoud, J. and Ycart, B.},
  title = {Markovian modelling of gene product synthesis},
  journal = {Theor. Popul. Biol.},
  volume = {48},
  year = {1995},
  pages = {222--234},
}

@book{PtashneGann2002,
  author = {Ptashne, M. and Gann, A.},
  title = {Genes and Signals},
  publisher = {Cold Spring Harbor Laboratory Press},
  year = {2002},
}

@book{AlbertsMBoC6e,
  author = {Alberts, B. and Johnson, A. and Lewis, J. and Morgan, D.
            and Raff, M. and Roberts, K. and Walter, P.},
  title = {Molecular Biology of the Cell},
  publisher = {Garland Science},
  year = {2014},
  edition = {6th},
}

@article{Sharp1994,
  author = {Sharp, P. A.},
  title = {Split genes and {RNA} splicing},
  journal = {Cell},
  volume = {77},
  year = {1994},
  pages = {805--815},
}

@article{Banerji1981,
  author = {Banerji, J. and Rusconi, S. and Schaffner, W.},
  title = {Expression of a $\beta$-globin gene is enhanced by remote
           {SV40} {DNA} sequences},
  journal = {Cell},
  volume = {27},
  year = {1981},
  pages = {299--308},
}

@article{BrownXist1991,
  author = {Brown, C. J. and Ballabio, A. and Rupert, J. L. and
            Lafreniere, R. G. and Grompe, M. and Tonlorenzi, R. and
            Willard, H. F.},
  title = {A gene from the region of the human {X} inactivation centre
           is expressed exclusively from the inactive {X} chromosome},
  journal = {Nature},
  volume = {349},
  year = {1991},
  pages = {38--44},
}

@article{Raj2006,
  author = {Raj, A. and Peskin, C. S. and Tranchina, D. and Vargas, D. Y.
            and Tyagi, S.},
  title = {Stochastic {mRNA} synthesis in mammalian cells},
  journal = {PLoS Biol.},
  volume = {4},
  year = {2006},
  pages = {e309},
}

@article{RinnHOTAIR2007,
  author = {Rinn, J. L. and Kertesz, M. and Wang, J. K. and Squazzo, S. L.
            and Xu, X. and Brugmann, S. A. and Goodnough, L. H. and
            Helms, J. A. and Farnham, P. J. and Segal, E. and Chang, H. Y.},
  title = {Functional demarcation of active and silent chromatin domains in
           human {HOX} loci by noncoding {RNAs}},
  journal = {Cell},
  volume = {129},
  year = {2007},
  pages = {1311--1323},
}

@article{LiPolII2007,
  author = {Li, Y. and Wang, Y. and Jiang, H. and Wang, J. and Wei, W.
            and Tong, Y. and Li, J. and Chen, R. and Luo, C.},
  title = {Organization of the {RNA} polymerase {II} transcription machinery
           as revealed by electron microscopy},
  journal = {Nature},
  volume = {446},
  year = {2007},
  pages = {529--533},
  note = {Structural studies of the PIC},
}

Deepened from 3756w to ≥8000w as part of Cycle C Track B (bio/chem deepening). Status: draft. All hooks_out targets are proposed. Pending Tyler review and external biology reviewer per BIOLOGY_PLAN §6. Prereq 17.05.01 (DNA replication) moved from prerequisites to Connections per prerequisite-audit protocol — not yet shipped, not registered in deps.json.

Prerequisites

none — this is a leaf unit

Used in

17.05.03
17.07.01

Tier anchors

beginner: Khan Academy (transcription and translation); Amoeba Sisters — DNA to Protein; Alberts et al., MBoC 6e Ch. 6 introductory sections
intermediate: Alberts et al., Molecular Biology of the Cell (6th ed., Garland 2014), Ch. 6; Lodish et al., Molecular Cell Biology (8th ed., W. H. Freeman 2016), Ch. 5; Ptashne & Gann, Genes and Signals (CSHL Press 2002)
master: Kornberg 2006 Nobel Lecture, Chem. Rev. 107 (2007) 3313-3326; Cramer et al. 2008 Annu. Rev. Biophys. 37, 337-352; Ptashne & Gann 2002; Li et al. 2007 Nature 446 on Pol II structure

References

TODO_REF
Alberts et al. — Molecular Biology of the Cell (6th ed., Garland 2014) · Ch. 6 — How Cells Read the Genome: From DNA to Protein
TODO_REF pending
Kornberg — Eukaryotic Transcriptional Control · Trends Biochem. Sci. 24 (1999) M46-49; Nobel Lecture, Chem. Rev. 107 (2007) 3313-3326 · see docs/catalogs/NEED_TO_SOURCE.md#bio-wave1-kornberg-nobel
TODO_REF pending
Ptashne & Gann — Genes and Signals (Cold Spring Harbor Laboratory Press, 2002) · Chs. 1-3 on transcriptional regulation · see docs/catalogs/NEED_TO_SOURCE.md#bio-wave1-ptashne-gann
TODO_REF pending
Cramer et al. — Structural biology of RNA polymerase II · Annu. Rev. Biophys. 37 (2008) 337-352 · see docs/catalogs/NEED_TO_SOURCE.md#bio-wave1-cramer2008
TODO_REF pending
Jacob & Monod — Genetic regulatory mechanisms in the synthesis of proteins · J. Mol. Biol. 3 (1961) 318-356 · see docs/catalogs/NEED_TO_SOURCE.md#bio-wave1-jacob-monod1961
TODO_REF pending
Strahl & Allis — The language of covalent histone modifications · Nature 403 (2000) 41-45 · see docs/catalogs/NEED_TO_SOURCE.md#bio-wave1-strahl-allis2000
TODO_REF pending
Roeder & Rutter — Multiple forms of DNA-dependent RNA polymerase in eukaryotic organisms · Nature 224 (1969) 234-237 · see docs/catalogs/NEED_TO_SOURCE.md#bio-wave1-roeder-rutter1969
TODO_REF pending
Brenner, Jacob & Meselson — An unstable intermediate carrying information from genes to ribosomes · Nature 190 (1961) 576-581 · see docs/catalogs/NEED_TO_SOURCE.md#bio-wave1-brenner1961
TODO_REF pending
Sharp — Split genes and RNA splicing · Cell 77 (1994) 805-815; Nobel Lecture 1993 · see docs/catalogs/NEED_TO_SOURCE.md#bio-wave1-sharp1993
TODO_REF pending
Peccoud & Ycart — Markovian modelling of gene product synthesis · Theor. Popul. Biol. 48 (1995) 222-234 · see docs/catalogs/NEED_TO_SOURCE.md#bio-wave1-peccoud-ycart1995

Reviewer

Tyler (pending external biology reviewer per BIOLOGY_PLAN §6)

Estimated time

beginner: 15m
intermediate: 40m
master: 75m