The paper that I want to talk about today is The CRISPR Toolkit for Genome Editing and Beyond by Mazhar Adli, published in Nature Communications in 2018. My rather elementary knowledge of biology did not make this easy, but it was fun to watch countless YouTube videos to try and get to grips with this amazing technology.
Genome-editing before CRISPR
Spiderman developed his awesome superhuman skills because a radioactive spider bite caused a mutation in his DNA. Genome-editing technologies have chased similar superhuman dreams for a long time now. What if we could edit our DNA itself to give us amazing capabilities, or remove those parts of the DNA that are responsible for our deformations? The first glimmer of hope in this direction was the discovery of restriction enzymes in bacteria, that protected them from invading agents called phages. Restriction enzymes scan DNA molecules, and if they see an “enemy” pattern that they’ve been trained to recognize, they cut the DNA molecules at an appropriate site, effectively rendering the “enemy” gene useless.
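To make the "scan and cut" idea concrete, here is a minimal Python sketch. It treats DNA as a plain string and ignores the second strand. EcoRI is a real restriction enzyme that recognizes GAATTC and cuts after the first G, but the phage sequence and the function name digest are made up for illustration.

```python
# Toy model of a restriction enzyme: scan a DNA string for a recognition
# site and cut wherever it occurs.
# EcoRI recognizes GAATTC and cuts after the first G; the example
# sequence below is invented for illustration.

def digest(dna, site="GAATTC", cut_offset=1):
    """Return the fragments produced by cutting `dna` at every `site`."""
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset          # position of the cut inside the site
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)   # keep scanning for further sites
    fragments.append(dna[start:])       # whatever is left after the last cut
    return fragments

phage_dna = "TTACGGAATTCCGATAGGAATTCTTA"
fragments = digest(phage_dna)
print(fragments)
```

A sequence without the recognition site comes back as a single uncut fragment, which is how the bacterium's own DNA would ideally be treated.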
With this discovery, scientists were able to manipulate the DNA of cells in test tubes, making similar cuts at sites of interest. However, manipulating the DNA of living cells that were part of a larger living organism (in vivo) remained an elusive dream. This was finally realized by the work of Capecchi and Smithies, who found that mammalian cells could incorporate a foreign copy of DNA into their own genome. This happens through a process called homologous recombination, and is explained here.
So could we just keep introducing desired DNA copies into mammalian cells, and hope that these get incorporated? No. Firstly, only a very small fraction of cells incorporated the foreign DNA into their existing DNA. Secondly, this foreign material could be incorporated in other parts of the genome instead of the desired locus. Hence, we needed to get better control over the process.
Cut ’em up
Researchers soon realized that if there was a break in both strands of the DNA at the desired site, called a double-strand break or DSB, the frequency of the foreign material getting attached there would increase by orders of magnitude. This led to a lot of research into large “cutting” molecules, or meganucleases. These meganucleases could recognize strands that were 14-40 base pairs long, and then cut the genes at the desired site. This was problematic however, because scientists couldn’t find meganucleases for all the sites of interest to them. New meganucleases were not easy to engineer. Moreover, the meganuclease-induced DSBs are mostly repaired by non-homologous end joining (NHEJ), which may be thought of as a rough and slipshod method of joining ends of the DNA. Hence, this method would not be suitable for introducing the desired foreign DNA in the correct manner into the genes.
This problem was partially solved when zinc finger proteins (ZFPs) were discovered. Instead of sites that were 14-40 base pairs long, these proteins could recognize sites that were only 3 base pairs long. Hence, given the (possibly long) base pair configuration of a desired site, we could attach multiple such zinc finger proteins to match the sequence at the desired site. In this manner, scientists could manipulate many more sites on the genome than before. Note that the zinc finger proteins would not be performing the actual cleavage: this would be performed by an endonuclease called Fok I, to which the zinc finger proteins would be bound. ZFPs can be thought of as the search party, while Fok I holds the actual knife for cleaving.
The situation was further improved when scientists discovered TALE proteins, which could now recognize just 1 base pair instead of 3 bp long sites. However, even with this discovery, a lot of difficult engineering and re-engineering of proteins was required to target all possible sites of interest. The CRISPR gene editing technology turned out to be just as robust as these technologies, if not more, and also much easier to use!
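The modular idea behind zinc fingers (3 base pairs per module) and TALEs (1 base pair per module) can be sketched in a few lines of Python. This is only a counting exercise: the target site and the function name recognition_modules are made up, and real protein design is far messier than splitting a string.

```python
# How many recognition modules are needed to cover a target site,
# if each module reads a fixed number of base pairs?

def recognition_modules(target, bases_per_module):
    """Split `target` into the chunks that each module would recognize."""
    if len(target) % bases_per_module != 0:
        raise ValueError("target length must be a multiple of the module size")
    return [target[i:i + bases_per_module]
            for i in range(0, len(target), bases_per_module)]

target = "GCGTGGGCG"                             # illustrative 9 bp site
zinc_fingers = recognition_modules(target, 3)    # one finger per 3 bp
tale_repeats = recognition_modules(target, 1)    # one TALE repeat per bp
print(zinc_fingers, len(tale_repeats))
```

Either way, every new target needs a new chain of protein modules, which is the engineering burden that CRISPR later avoided.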
CRISPR stands for clustered regularly interspaced short palindromic repeat DNA sequences. These highly repeating DNA sequences, interspersed with non-repeating spacer genomic sequences, were first observed in Escherichia coli, although they were later observed to be present in more than 40% of all bacteria and 90% of archaea. These CRISPR sequences form the backbone of the bacterial immune response to invasion by bacteriophages. A horrifying video of such a bacteriophage invasion is present here. In response to this invasion, the bacteria would store a part of the foreign DNA of the invaders in the form of spacers. Hence, the CRISPR may be thought of as a library in which you keep records of all the invaders that have wronged you. In the future, if the same phages attacked the bacteria again, their DNA strands would get recognized and destroyed.
How does all this happen though? After detecting and storing the DNA sequence of the invading genome, the CRISPR system transcribes this sequence into a short RNA, the mature crRNA, which pairs with a second short RNA, the trans-activating crRNA (tracrRNA). Together, these two RNAs activate and guide the Cas9 enzyme, which goes in search of this particular DNA sequence. When it does detect the sequence, it cleaves the genome, rendering it useless. However, the CRISPR system itself also contains a copy of this sequence! How does the Cas9 protein know that it should not cleave the CRISPR DNA sequence? This is because of protospacer-adjacent motifs or PAMs. PAMs are short base pair sequences that sit next to the target sequence in the invading genome, but not next to its copy in the CRISPR sequence. Hence, before cleaving, the Cas9 enzyme checks if the genome contains the relevant PAM or not, and cleaves the DNA sequence only if such a PAM exists.
Scientists soon realized that they don’t need to go through the whole shindig of first letting a foreign genome attack a cell, and only then getting the required genome sequence in order to look for DNA sites to cleave. They could just directly engineer the two RNAs (the crRNA and the tracrRNA) so that they contain information about the DNA site which they wished to have cleaved, and the Cas9 enzyme would do the rest. Better still, instead of two, they could just manufacture one RNA: the single-guide RNA or sgRNA! This idea caught on pretty quickly, and since 2012, when the field was created, there have been over 10,000 articles written on this topic to date.
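Here is a small Python sketch of the PAM check. It assumes the SpCas9 PAM, 5′-NGG-3′, sitting immediately after the target sequence. It also looks at one strand only and requires an exact match. The spacer, the phage sequence and the CRISPR array are all invented for illustration.

```python
# Toy Cas9 target search: a site is cleaved only if the spacer sequence
# is immediately followed by an NGG PAM (N = any base).

def find_cas9_targets(dna, spacer):
    """Return start positions where `spacer` is followed by an NGG PAM."""
    hits = []
    pos = dna.find(spacer)
    while pos != -1:
        pam = dna[pos + len(spacer): pos + len(spacer) + 3]
        if len(pam) == 3 and pam[1:] == "GG":   # first PAM base can be anything
            hits.append(pos)
        pos = dna.find(spacer, pos + 1)
    return hits

spacer = "GATTACAGATTACAGATTAC"                  # made-up 20 nt spacer

# Invader: the same sequence, followed by the PAM "TGG".
phage = "CC" + spacer + "TGG" + "AT"

# Bacterium's own CRISPR array: spacer flanked by repeats, no PAM after it.
repeat = "GTTTTAGAGC"                            # illustrative repeat fragment
crispr_array = repeat + spacer + repeat

phage_hits = find_cas9_targets(phage, spacer)
self_hits = find_cas9_targets(crispr_array, spacer)
print(phage_hits, self_hits)
```

The spacer is present in both sequences, but only the phage copy has a PAM next to it, so only the phage gets cut.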
Just to be clear, CRISPR sequences, Cas9 enzymes, etc are not naturally found in human cells. They would have to be extracted from bacteria and other prokaryotes, and then put inside eukaryotic cells like those of humans. Moreover, the cleavage of DNA sites by Cas9 enzymes is only half the story. If scientists wish to add sequences to the genome, they would have to ensure that these sequences have already been accepted into the cell. The cleavage just speeds up the process of modifying the genome by adding these sequences.
Different CRISPR systems
There are two CRISPR classes: Class I, which contains types I, III and IV of CRISPR, and Class II, which contains types II, V and VI. The most commonly used type is type II, whose best-known protein is the Cas9 found in Streptococcus pyogenes (SpCas9). However, researchers have also identified around 10 other CRISPR proteins. A table of some of them is given below:
As one can see, each protein recognizes a different PAM sequence in the genome before cleaving, and hence is suitable for attacking different types of invading genomes.
Because Cas9 or other cleaving proteins are not naturally found in human cells, they have to be packaged and delivered through lentiviruses or adeno-associated viruses (AAVs). This can be a problem if the proteins are big. For instance, the SpCas9 protein is 1368 amino acids (aa) long. Although some smaller cleaving proteins have been discovered, they have the disadvantage of having really complex PAM requirements. For instance, although SaCas9 is only 1053 aa, it requires a PAM sequence of 5′-NNGRRT-3′. Here, 5′ and 3′ denote the ends of a DNA sequence. Because this longer sequence occurs much less often in a genome (invading or non-invading) than a short PAM does, SaCas9 can target far fewer sites.
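To get a feel for how much more restrictive a complex PAM is, the sketch below expands the standard IUPAC codes (N = any base, R = A or G) and compares 5′-NGG-3′, the SpCas9 PAM, with 5′-NNGRRT-3′. The probability calculation assumes all four bases are equally likely and independent, which real genomes are not, so treat the numbers as rough. The function names and test sequences are made up.

```python
import re
from fractions import Fraction

# Standard IUPAC nucleotide codes needed for these two PAMs.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "N": "ACGT"}

def pam_probability(pam):
    """Chance that a random position matches `pam`, assuming uniform bases."""
    prob = Fraction(1)
    for code in pam:
        prob *= Fraction(len(IUPAC[code]), 4)
    return prob

def count_pam_sites(dna, pam):
    """Count (possibly overlapping) occurrences of `pam` on one strand."""
    pattern = "".join("[" + IUPAC[code] + "]" for code in pam)
    # A lookahead lets overlapping matches be counted as well.
    return sum(1 for _ in re.finditer("(?=" + pattern + ")", dna))

p_ngg = pam_probability("NGG")
p_nngrrt = pam_probability("NNGRRT")
n_ngg = count_pam_sites("AAGGTTGG", "NGG")
n_nngrrt = count_pam_sites("ACGAATCCGGGT", "NNGRRT")
print(p_ngg, p_nngrrt, n_ngg, n_nngrrt)
```

Under these assumptions a random position matches NGG four times as often as NNGRRT, which is the trade-off for the smaller protein.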
Re-engineering CRISPR-Cas9 tools
Scientists are curious about whether they can re-engineer the naturally found Cas proteins to change their sizes, PAM requirements, etc. They also want to improve the target specificity of these proteins, so that they don’t go cleave the wrong DNA sites. Unfortunately, Cas9 proteins have a natural propensity to not be too site-specific, as they were mainly used in bacteria to attack constantly mutating bacteriophages. In order to study the specificity of Cas9 proteins, scientists tried to map the DNA binding sites of catalytically inactive SpCas9. They saw that the protein was more likely to bind with open chromatin regions. Also, the cleavage rates at sites of incorrect binding were quite low. This was good news, as even though these proteins would bind with undesired sites, they wouldn’t do as much harm there.
Scientists have spent a lot of time thinking of ways to reduce off-site binding and improve target specificity. One method that is useful is changing the delivery method of the Cas9-sgRNA complex, from plasmid-based delivery to delivery as a ribonucleoprotein (RNP) complex. Delivered this way, the Cas9 protein stays active in the cell for a shorter time, and hence is less likely to bind to the wrong site in a flurry of activity. Another method is to have two separate sgRNAs direct a nickase Cas9 (nCas9), attached to a Fok I enzyme, to cleave a certain site of the genome. A nickase Cas9 protein or nCas9 cleaves only one strand of the DNA helix, and not both. Hence, for such a complex to cleave the wrong site of the genome, both the nCas9 proteins have to make a mistake, which has a smaller probability than just one of them making a mistake. Obviously the two nCas9 proteins are slightly separated, and are guided by different sequences. Other ways of affecting the specificity of the cleaving proteins are increasing or reducing the length of the sgRNAs, attaching self-cleaving catalytic RNAs to the sgRNAs to regulate Cas9 action, using light to regulate Cas9 activity, etc.
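The paired-nickase argument is just a product of probabilities, as the short Python sketch below shows. It assumes the two nickases make their mistakes independently of each other, and the 1% error rate is a hypothetical number chosen only to make the arithmetic easy to follow.

```python
# If each nickase independently nicks a given wrong site with some
# probability, a double-strand break there needs both to err.

def paired_offtarget_probability(p_first, p_second):
    """Probability that both nickases hit the same wrong site (independence assumed)."""
    return p_first * p_second

p_single = 0.01   # hypothetical off-target probability for one nickase
p_pair = paired_offtarget_probability(p_single, p_single)
print(p_single, p_pair)
```

With these made-up numbers, the chance of a wrong double-strand break drops from 1 in 100 to 1 in 10,000.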
CRISPR beyond genome-editing
What if we just want to identify relevant genome sites, and not cleave them? For this purpose, we can use catalytically inactive dead Cas9 proteins, or dCas9. How are dCas9 proteins formed? A regular CRISPR-Cas9 protein has two catalytic domains- HNH and RuvC, which cleave one DNA strand each. Point mutations in either of them render them ineffective. Hence, a point mutation in only one of them gives rise to a nickase Cas9, and point mutations in both gives rise to a dCas9.
In this section, we will primarily talk about the nCas9. The nickase Cas9, or nCas9, is quite useful for converting one base into another, without cleaving both strands of the DNA and hence possibly introducing harmful indels (indels or insertions/deletions are arbitrary insertions or deletions of base pairs in the DNA strand). Komor et al. discovered that nCas9, fused to an APOBEC1 deaminase enzyme and a UGI protein, can change C to T without cleaving both strands of the DNA helix. Similarly, another nCas9 complex is now able to change A to G. As a consequence, scientists can now introduce STOP codons in genes. A STOP codon is a trinucleotide (can be thought of as a sequence of three bases) present in the RNA, that halts the production of proteins when instructions are being read from the mRNA. Hence, the distance between the START and STOP codons determines the number of amino acids in a protein molecule. By changing C to T, scientists could change the trinucleotides CGA, CAG and CAA to TGA, TAG and TAA, which are the three STOP codons. Hence, scientists could effectively manipulate the production of proteins in the ribosomes. Another route that scientists have gone down is forming an nCas9-AID complex, where AID stands for the activation-induced cytidine deaminase enzyme. In the absence of UGI, this complex supports local mutations, and hence is a powerful gain-of-function screening tool. Gain-of-function screening tools are those that identify which genes are most suitable for mutation in order for the organism to develop a desired phenotype. Hence, the nCas9-AID complex can introduce mutations at multiple genes, after which the most suitable ones can be selected.
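The C to T trick for creating STOP codons is easy to check in code. The Python sketch below walks through a coding sequence codon by codon and lists every codon that a single C to T change would turn into TAA, TAG or TGA. It only looks at the coding strand in one reading frame, and the example gene is made up.

```python
# Which codons in a coding sequence become STOP codons after a single
# C -> T change? (DNA alphabet; coding strand only.)

STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_codon_edits(coding_seq):
    """Return (codon_index, original_codon, edited_codon) for each possible edit."""
    edits = []
    for i in range(0, len(coding_seq) - 2, 3):
        codon = coding_seq[i:i + 3]
        for j, base in enumerate(codon):
            if base == "C":
                edited = codon[:j] + "T" + codon[j + 1:]
                if edited in STOP_CODONS:
                    edits.append((i // 3, codon, edited))
    return edits

gene = "ATGCAGGATCGATTTCAATAA"   # invented: ATG CAG GAT CGA TTT CAA TAA
edits = stop_codon_edits(gene)
print(edits)
```

Only CAG, CGA and CAA show up in the result, matching the three codons named above.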
Gene expression regulation
In this section, we primarily deal with dCas9 or catalytically dead Cas9, because we don’t want to cleave any DNA sites.
Gene expression is the process by which a gene is converted into a final product, which may be a protein, non-coding RNA, etc. Hence, regulating gene expression is an important goal for researchers: essentially, we wish to induce “beneficial” genes to express at a higher rate, and the “bad” genes to not express at all. dCas9 was found to bind tightly to DNA sites, and prevent other proteins such as RNA Polymerase II from binding there and starting transcription. This phenomenon was exploited to form the CRISPR interference approach or CRISPRi. Notably, attaching a Krüppel-associated box or KRAB domain to dCas9 results in an even stronger gene repressor. It has been shown that KRAB-mediated gene repression is associated with deacetylation and methylation of histone proteins. Wait, what are those?
Histones are proteins around which the DNA double helix wraps itself, both at the actual targeted gene, and also at the promoter and enhancer sites of the gene. When acetyl groups are attached to histone molecules, the helix unwinds, and becomes ready for transcription. When these acetyl groups are removed (temporary change), or replaced by methyl groups (permanent change), the DNA helix wraps itself even more tightly to the histone proteins, and hence is not expressed. For the H3 histone protein, it has been noticed that the repression activities of the KRAB-dCas9 complex occurs through H3 deacetylation and increased H3 methylation, especially in the H3 proteins present in the promoter and distal (far away) enhancer regions of the targeted gene. This picture is quite complicated, however, and is explained in more detail later.
In contrast, dCas9 can also promote gene transcription (and hence expression) through fusion with VP64, which is composed of four identical repeating units of VP16, a transcriptional activator from the Herpes simplex virus. Other dCas9 complexes that promote gene expression are SunTag, VPR and SAM. SunTag has a dCas9-fused protein scaffold that contains a repeating peptide array, which is used to recruit multiple copies of an antibody-fused effector protein. These effector proteins then regulate gene expression at the target site. SAM is just a complex of gene expression-promoting proteins comprising dCas9-VP64 and MCP-fused P65-HSF1. The latter is carried to the target site in an engineered sgRNA scaffold. VPR is a complex of VP64, P65 and Rta proteins, all of which also enhance gene expression. CRISPR regulates gene expression, but the actual expression of the gene happens through the regular mechanism of the cell itself, as opposed to other approaches in which gene expression may be facilitated by foreign elements. Hence, this process is more robust and less prone to errors.
Epigenetics refers to the mechanism of differential gene expression, even though the genome might be the same. Hence, two identical twins with the same genome are different in many ways because of differences in gene expression. The “epigenome”, on the other hand, refers to the set of molecules that attach to the genome in order to regulate gene expression. All of this is explained beautifully in this video. The epigenome also includes post-translational modifications of histones. Despite recent epigenomic mapping efforts like the Encyclopedia of DNA Elements (ENCODE), the functioning of even basic epigenomic features like histone modifications and DNA methylation remains poorly understood. Scientists now hope to use dCas9 complexes to add or remove epigenetic markers at various locations on the genome, in order to study their impact on gene expression. We have already seen how dCas9-induced methylation at the promoter or enhancer sites leads to gene suppression. It is known that many disorders including some types of cancer are caused by aberrant methylation (too much or too little). Although some drugs exist to counter this, they act on the whole genome globally, and hence may affect undesired sites. Some dCas9 complexes like dCas9-DNMT3A can rectify this by promoting methylation only at the targeted sites. Note that Cas9 proteins are not known for being very target specific, and are often found bound to undesired sites. However, gene expression does not change at these undesired sites. This makes dCas9-DNMT3A a useful complex to promote methylation.
On the other hand, if we want to suppress excessive methylation, TET proteins are pretty useful. Researchers formed dCas9-TET1 complexes to promote demethylation at desired sites. The outcome was found to be robust, as there was a 90% reduction in methylation at CpG dinucleotides. The impact at off-target sites was yet to be studied.
Although methylation is seen as a way of suppressing gene expression, it can also promote gene expression in some cases. This phenomenon is beautifully explained in this video. There are 4 types of core histone proteins: H2A, H2B, H3 and H4. The H3 protein contains both the H3K4 and the H3K27 sites (the lysine residues at positions 4 and 27). Both the acetylation and methylation of H3K4 promote gene expression, while the trimethylation of H3K27 only suppresses gene expression (acetylation of H3K27 promotes gene expression though). This duo can act as a bivalent regulator of gene expression, in which one part promotes and the other represses gene expression. Researchers are curious about controlling the methylation and acetylation of histone residues via dCas9 complexes.
In order to control the methylation and acetylation of H3 residues, researchers used a dCas9 complex to recruit LSD1 at the desired sites to reduce the levels of the enhancer marks H3K4me2 and H3K27ac (remember that the acetylation of H3K27 promotes gene expression). Hence, this complex serves to repress gene expression. On the other hand, the dCas9-P300 complex results in a significant increase in H3K27ac, which promotes gene expression. Other dCas9 complexes have also been used to increase H3K4me3, which promotes gene expression, or reduce H3K27ac, which represses it. The global footprint (impact on the genome globally) of such dCas9 complexes is still unknown.
CRISPR-mediated live cell chromatin imaging
Although technology to image specific parts of the genome has existed for some time, it was mainly done in vitro (in a test tube) through Fluorescent In-Situ Hybridization (FISH) methods, and not in vivo (in a live organism). The development of CRISPR has revolutionized live cell chromatin imaging.
But “how bright does the bulb have to be”? If we imagine the dCas9 complexes as bulbs that attach themselves to desired genomic loci, these are likely to be too small and faint to register on our machines. Hence ideally, such complexes should target repeating genomic sequences that are close together, so that multiple such bulbs can go and attach themselves to each of these repeating sequences, giving out a brighter light that can be registered. For a non-repeating sequence, 26-36 sgRNAs need to attach themselves to one single sequence in order to produce a clear enough signal. So many sgRNAs attaching themselves to a single site is statistically quite unlikely. To overcome this problem, researchers came up with an sgRNA scaffold containing 16 MS2 binding molecules. All of these molecules travel together, and hence attach themselves to the binding site when the sgRNA reaches the desired loci. Put together, these now generate a strong enough signal for imaging. Using these scaffolds, repeated genomic sequences can now be imaged with just 4 sgRNAs, and non-repeating sequences can now be imaged with just 1 sgRNA, as explained above.
Manipulation of chromatin topology
Chromatin strands are strands of DNA (wrapped around histones) that we usually think of as arranged linearly. What if we could bring the promoter and enhancer for a gene closer together, or push them further apart? Would this affect gene expression? Yes! That is why researchers are interested in forming chromatin loops, or changing the topology of chromatin strands in other ways. Morgan et al. took two dimerizable proteins (proteins that had a tendency to attract each other and form a bond), and attached them to two different dCas9 complexes. These complexes now attached themselves to the promoter and enhancer regions separately, and then the dimer bond formation between the two proteins brought the promoter and enhancer closer together. This did result in an increase in gene expression.
How do we find out which gene affects a particular phenotype, say cell proliferation? Checking each gene out of the tens of thousands available, one at a time, is surely a daunting task. What if we could check thousands of genes at once? This task is accomplished using hundreds of thousands of sgRNAs in a large population of cells. The way that this works is this: we ensure that each cell receives at most one sgRNA, and each gene is targeted by 6-10 different sgRNAs. This means that at least 6-10 cells are used to study the impact of one gene on the desired phenotype, which in this case is cell proliferation. The sgRNAs which hit the correct gene will cause their cells to proliferate fast, and the other cells will die out eventually. This helps us zero in on the gene which causes cell proliferation. Of course we will have to keep track of which sgRNA goes to which cell, which will allow us to make the right deductions.
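Two parts of this screening setup can be sketched in Python. The first is the "at most one sgRNA per cell" requirement: if sgRNA delivery follows a Poisson distribution (an assumption, as is the dose of 0.3 sgRNAs per cell used here), we can estimate how many cells end up with more than one. The second is the readout: comparing how common each sgRNA is before and after the cells have grown, and summarizing per gene. All counts, names and the median-of-log-ratios scoring are illustrative, not the paper's method.

```python
import math
from collections import defaultdict
from statistics import median

def poisson_fraction(k, mean_per_cell):
    """Fraction of cells receiving exactly k sgRNAs under a Poisson model."""
    return math.exp(-mean_per_cell) * mean_per_cell ** k / math.factorial(k)

def gene_scores(before, after, sgrna_to_gene, pseudocount=1):
    """Median log2 fold change in sgRNA frequency, grouped by target gene."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    per_gene = defaultdict(list)
    for sgrna, gene in sgrna_to_gene.items():
        freq_before = (before.get(sgrna, 0) + pseudocount) / total_before
        freq_after = (after.get(sgrna, 0) + pseudocount) / total_after
        per_gene[gene].append(math.log2(freq_after / freq_before))
    return {gene: median(values) for gene, values in per_gene.items()}

# Hypothetical dose of 0.3 sgRNAs per cell on average.
multi = 1 - poisson_fraction(0, 0.3) - poisson_fraction(1, 0.3)

# Made-up sgRNA counts before and after the cells were left to grow.
sgrna_to_gene = {"sgA1": "A", "sgA2": "A", "sgB1": "B", "sgB2": "B"}
before = {"sgA1": 100, "sgA2": 100, "sgB1": 100, "sgB2": 100}
after = {"sgA1": 300, "sgA2": 250, "sgB1": 20, "sgB2": 30}
scores = gene_scores(before, after, sgrna_to_gene)
print(round(multi, 4), scores)
```

In this toy data the sgRNAs against gene A become more common and those against gene B become rarer, so gene A gets a positive score and gene B a negative one. Using several sgRNAs per gene and taking the median makes the score less sensitive to a single badly behaved sgRNA.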
One major aim for future researchers should be to reduce the size of the existing Cas proteins, so that they may be easily transported using virus vectors. Another aim should be the careful design of CRISPR procedures, so that “gene drives” that potentially impact entire populations do not cause harm in the long run.
An important obstacle to overcome is the fact that more than half of all humans experience an immune response to the introduction of Cas9 proteins in cells (this is called immunogenicity). One possible solution to this problem is the development of Cas9 proteins to which humans have not been exposed before, so that we don’t have an immune response against them.
CRISPR has great potential to benefit society and eradicate formidable diseases. I am excited to see what comes next.
- The CRISPR toolkit for genome editing and beyond