Evolution of DNA - Gene Regulation

Cassius and the Fred/Roscoe system were pretty cool, but as the number of genes increased, they also had a problem.

In the early days of Caleb and before, there weren't that many genetic chains, so it wasn't very hard to regulate them. Fred could pretty much just diffuse around and copy polypeptides randomly from whichever chains happened to meet.

Likewise, Roscoe could just randomly replicate chains, and produce more or less what was needed to assemble more Calebs.

However, as Caleb turned to Cassius and the number of genes increased, there was more and more incentive to manage exactly when each gene was expressed. Any Cassius that could control the timing of protein transcription and gene replication would have gained an enormous selective advantage over its less organized neighbors.

For example, when a Cassius first got to a new puddle, its best strategy was to first create enzymes to produce raw materials, and only start making Fatcat, Fred and Roscoe proteins after it had plenty of amino acids and nucleotides to build with. A Cassius that could do things in that order would survive much better than a Cassius that wasted resources on enzymes for protein transcription before there were raw materials.

Of course to do that, Cassius needed a way to distinguish a specific gene from all the other genetic chains. How would that happen?

Promoters and Repressors

We have already talked about having a complementary header or 'landing site' at the beginning of each polypeptide chain, which would help Fred avoid replication of chains that coded RNA enzymes or helper chains rather than proteins.

In a sense, the header provided a permanent 'on/off' switch that helped Cassius avoid the replication of genes that were expressed in RNA form rather than as proteins. So it wouldn't have been hard to convert it to a more temporary switch so Fred would know when to transcribe each chain into proteins.

Cassius may have used any of several approaches to gene regulation in early gene headers: it may have changed a sequence, 'blocked' the header with a short stretch of complementary RNA, or inserted some sort of blocking molecule to prevent transcription.

Of course Cassius also needed a way to turn genes on when they were needed. It may have done that with a promoter protein which removed the temporary blocking from the gene header.

For example, a promoter protein for expression of the Sofia gene might have responded to concentrations of amino acid raw materials-- when they were high it would remove any repressors, and when low, it would restore them.

Gene ID

Before Cassius could turn its genes 'on' or 'off', it needed a way to identify them.

Genetic UPC

One possible way to 'mark' each gene is to install an ID string at the beginning of each genetic chain-- a unique sequence of nucleotides that an incoming Fatcat can match up with, to make sure it is making proteins from the correct gene. You might think of it as a genetic 'UPC code' for an RNA 'scanner' to read.

The ID sequence might be a fixed length, or it could have a variable length, just as long as it had some way to locate where the ID tag ended and where the actual gene began.

Fatcat and Gene ID

How would gene ID work?

With the help of complementary base pairs, it would be easy to create a complementary match sequence that would bind to the ID sequence and put a Fatcat in the right place to begin transcription.

When a regulator protein wanted to start production of an enzyme, it might do something like the following:

1. When the regulatory protein is created, it includes the regular 'landing site' sequence, plus a short RNA sequence that is complementary to the regulated gene.

2. When the regulatory protein senses that conditions require more enzymes, it links to a Fatcat, and then attaches the RNA sequence to the RNA-connecting portion of the Fatcat.

3. The Fatcat complex diffuses until it contacts the proper RNA chain. Its sequence matches, and Fatcat is positioned in just the right place to begin transcription.
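
The three steps above can be sketched as simple string matching. This is only a toy model under stated assumptions: the match sequence is taken as the base-by-base complement of the ID and read in the same direction (real nucleic acid pairing is antiparallel), and the names `complement`, `find_landing_site`, and the example sequences are made up for illustration.

```python
# Watson-Crick pairing rules for RNA.
PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def complement(rna: str) -> str:
    """Return the base-by-base complement of an RNA string."""
    return "".join(PAIRS[base] for base in rna)


def find_landing_site(chain: str, id_match: str) -> int:
    """Return the index just past the spot where id_match pairs with chain.

    Returns -1 if the match sequence finds no complementary stretch.
    """
    target = complement(id_match)  # the ID sequence the match would bind to
    start = chain.find(target)
    if start == -1:
        return -1
    return start + len(target)


# Hypothetical 17-nucleotide gene ID followed by a tiny 'gene'.
gene_id = "GAUUACAGGCUAACGUU"
chain = "CCA" + gene_id + "AUGGCUUAA"

# Step 1: the regulator carries the complement of the gene ID.
id_match = complement(gene_id)

# Steps 2 and 3: the Fatcat complex 'diffuses' until the match binds.
site = find_landing_site(chain, id_match)
print(chain[site:])  # the portion Fatcat would start working on
```

A chain without the right ID gives -1, which stands in for a Fatcat complex that bumps into the wrong chain and drifts away again.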

How Large was the ID?

It's possible to distinguish between 1,000 different genes with an ID sequence of only 5 RNA nucleotides (4^5 = 1,024). An organism with 30,000 genes would require a sequence of 8 nucleotides to give each gene a unique ID code (4^8 = 65,536).

However, those quantities assume that there is some kind of numbering system to make sure each gene has a unique ID sequence. It's hard to imagine any sort of biochemical system that could keep track of unused ID sequences before assigning one to a new gene.

If each new gene used an ID sequence that was selected randomly, then the ID sequence would need to be much longer, to reduce the odds of conflicting with another gene. Reducing the odds of duplication down to one in a million for any new gene sequence would require 10 additional nucleotides.

So a good guess is that a minimum gene ID size would probably be somewhere between 15 and 18 base pairs long.
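
The arithmetic behind that guess can be checked with a short Python sketch. The function names are illustrative, and the 'extra nucleotides' step follows the simple reading used above (enough extra positions that 4 to that power reaches about a million) rather than a full collision-probability calculation.

```python
def min_id_length(num_codes: int, alphabet: int = 4) -> int:
    """Smallest n such that alphabet**n >= num_codes (integer arithmetic only)."""
    n = 0
    capacity = 1
    while capacity < num_codes:
        capacity *= alphabet
        n += 1
    return n


def extra_for_odds(one_in: int, alphabet: int = 4) -> int:
    """Extra positions needed so a random clash is roughly a one-in-`one_in` event."""
    return min_id_length(one_in, alphabet)


extra = extra_for_odds(1_000_000)
for genes in (1_000, 30_000):
    base = min_id_length(genes)
    print(f"{genes} genes: {base} nucleotides, {base + extra} with random assignment")
```

With 1,000 genes the base length is 5 and with 30,000 genes it is 8; adding the 10 extra nucleotides gives the 15 to 18 range.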

Where is the ID sequence?

It certainly seems most logical to put the gene ID into the 'landing site' region just before the beginning of actual protein-coding base pairs.

The bacterial promoter includes a highly conserved TTGACA sequence 35 base pairs upstream of the start of each gene, with a 'spacer' of 16 to 18 base pairs, then a highly conserved TATAAT sequence that is 10 base pairs upstream of the gene.

Since the 'spacer' size is a close match to the ideal gene ID size, and since it is marked so conspicuously on either side, it seems highly likely that this portion of the bacterial promoter serves as a unique ID sequence.
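
As a rough illustration of reading such a header, the sketch below pulls the spacer out from between the two conserved markers using Python's standard `re` module. It assumes exact copies of the consensus sequences (real promoters often differ from the consensus by a base or two), and the example sequence and the name `extract_spacer` are invented for this sketch.

```python
import re

# -35 marker, a 16 to 18 base spacer, then the -10 marker.
PROMOTER = re.compile(r"TTGACA([ACGT]{16,18})TATAAT")


def extract_spacer(dna: str):
    """Return the spacer between the two markers, or None if they are absent."""
    match = PROMOTER.search(dna)
    return match.group(1) if match else None


# Hypothetical upstream region with a 17 base spacer.
upstream = "GGC" + "TTGACA" + "ACGTTGCAAGCTAGGCT" + "TATAAT" + "GCATCCA"
spacer = extract_spacer(upstream)
print(spacer, len(spacer))
```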

Eukaryotes include a seven-pair 'basal promoter' marker (usually TATAAAA) that is about 30 base pairs upstream from the start of each gene, along with an 'upstream promoter' GGCCAATCT sequence that is 50 to 130 base pairs upstream. That probably provides a similar space for an ID sequence, most likely on the upstream side of the 'TATA box', or the same relative location as the bacterial promoter's ID marker. We'll talk later about some possible uses that eukaryotes may have for that extra data in the header.

What would Gene IDs look like?

With tens of thousands of genes to manage, it seems likely that modern cells would include many gene ID chains (in RNA form) as part of their day to day metabolism. What would they look like?

The main requirement for an ID sequence would be uniqueness-- so it would probably not be repetitive, and would also not code for any 'sensible' sequence of amino acids. Its overall 'information density' would be about the same as protein-coding portions of the gene.

Transient RNA carriers of gene ID might include a base pair sequence marking them as an ID (an ID ID, so to speak). On the other hand, they might also be linked with a distinctive carrier protein so they would only need the approximately 17 base pair ID sequence.

Operons and Stop Codons

Some of Cassius's proteins worked together in groups-- for example, Fred and Fatcat worked together until Fred was replaced by tRNA, and many enzymes would have used several different proteins and some helper chains to put together a supercatalyst.

Once genes came under the control of promoters and repressors, it also would have made sense for Cassius to group related proteins together into a single backbone chain. That way one gene promoter could manage more than one protein at the same time.

The technical term for a group of genes linked with a single promoter is an 'operon'.

End of Gene Markers

For Cassius to be able to combine genes for more than one protein in an RNA chain, it needed a way to mark the boundaries between one gene and the next.

Since operons probably started to evolve before all 64 triplet permutations were snapped up by proteins, the natural solution would be to reserve one or more triplets as a 'stop marker' to indicate that Fatcat should stop creating a protein from a gene.

The modern genetic code has three 'stop codons': UAA, UAG and UGA (when written in RNA form).
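
To show how a stop marker separates the genes on one chain, here is a minimal sketch that reads an RNA chain three bases at a time and cuts it wherever one of the three stop codons appears. It assumes a single fixed reading frame from the first base and ignores start codons and the spacing found between genes in real operons; `split_operon` and the example chain are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_operon(rna: str) -> list[str]:
    """Split an RNA chain into genes at in-frame stop codons."""
    genes = []
    current = []
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial triplet
    for i in range(0, usable, 3):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            if current:
                genes.append("".join(current))
            current = []
        else:
            current.append(codon)
    if current:  # a final gene with no stop marker after it
        genes.append("".join(current))
    return genes


# Three tiny made-up genes, each closed by a different stop codon.
operon = "AUGGCU" + "UAA" + "AUGCCGGGA" + "UGA" + "AUGAAA" + "UAG"
print(split_operon(operon))
```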

Operons and Nathaniel

Combining several genes on one chain made life much easier for Nathaniel and for Roscoe, and probably added some serious survival value to the first Caleb or Cassius that started using the system.

When cells divided, Nathaniel would have had an easier time assembling a complete set of genetic chains, since there were fewer chains to find.

Meanwhile, it would have taken fewer passes by Roscoe to replicate all the genes in a Caleb or Cassius, improving its chances of creating sufficient genes to fill in complete sets of new Caleb or Cassius to send out into the world.

Gene Messengers

Once genes were marked with an ID sequence and consolidated into operons, it would have been much easier for cells to regulate their action, and start to have more of a modern metabolism.

When a cell wanted to accomplish something, it would create a 'messenger' that consisted of a Fatcat and a short RNA chain that acted as an ID match. The messenger would diffuse until it ran into the complementary sequence. At that point, it would initiate a protein synthesis.

With gene ID, cells started having the potential to coordinate genes, and live their lives in a more regulated and orderly fashion.

Helper Chain Maintenance

We've already talked about the use of backbone chains for purposes other than protein coding-- for guidance in the tertiary folding of enzymes, as an aid in placement of multiple enzymes in complexes, and as direct enzymes (ribozymes).

Unfortunately, these non-coding forms of RNA would not have fit into a multi-gene world of operons quite as tidily as the protein-coding genes.

The problem is that they had to be replicated by Roscoe to become useful, rather than being transcribed by Fred.

Back when every chain was on its own, they probably could have just attached close to a Roscoe, and been pretty sure they'd be replicated in sufficient quantity to fill in when needed as helpers or enzymes.

However, it probably would not have worked to insert them into an operon, even if they were used with the other genes included there. As Fred ran along the chain, it couldn't do anything with the helper gene. And Roscoe would have had no way to know it needed to jump in and create an RNA chain while the proteins were being made.

Presumably the helper chains stayed on their own near Roscoe, but it sure would have been convenient for them to be linked together with their protein-coding genes, so a promoter would work on them at the same time.

As it turns out, there was a solution to this problem, but first, let's look at one other problem that was also facing early cells. Nothing like building up some dramatic tension!