[Lecture Notes] Fundamentals of Biotechnology - L7: Genomics

The human genome project

The human genome was first sequenced by creating random segments, sequencing each piece individually and then using computers to find overlaps between the pieces and assemble them into a contig map. This was done using YACs and BACs, but computing power was limited at the time. When computing power began to increase, mapping individual fragments became obsolete. Craig Venter used shotgun sequencing to sequence many random fragments and aligned all of the reads simultaneously using a computer. The resulting draft needed adjustments afterwards, but it proved the technique viable.
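To make the "overlap the pieces into a contig" step concrete, here is a minimal sketch of greedy overlap assembly in Python. It is a toy illustration only: the reads are invented, the function names are my own, and real assemblers also have to cope with sequencing errors, repeats and both DNA strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads, made up for this example
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

If two reads share no overlap of at least min_len bases they are left as separate contigs, which is how gaps appear in a draft assembly.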

Mapping techniques

Genetic mapping: information provided through mating. This gives the relative order of markers, but you cannot determine the physical distance between markers on the genome because the result depends on how chromosomes are reshuffled during reproduction.

Physical mapping: physical markers are placed on the chromosome and can be located using fluorescence in situ hybridisation (FISH). This allows one to determine the order in which to place contigs, which is helpful where there are lots of repetitive sequences.

Linkage: the likelihood that two mapped features are shuffled together during mating. The closer two features are, the more likely they are to be shuffled together; the further apart, the less likely. In eukaryotic cells, because of recombination, the percentage of offspring in which two markers are found together over many experiments can be used to determine how close they are to one another on a chromosome.
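The arithmetic behind linkage can be sketched in a few lines of Python. The offspring counts are invented, and the conversion of 1% recombination to 1 centimorgan (cM) is the usual convention rather than something stated in these notes; it only holds as an approximation for closely linked markers.

```python
def recombination_frequency(recombinant, total):
    """Fraction of offspring in which the two markers were separated."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recombinant / total


def map_distance_cM(recombinant, total):
    """Approximate genetic distance: 1% recombination is about 1 cM."""
    return 100 * recombination_frequency(recombinant, total)


# Hypothetical cross: 12 recombinant offspring out of 200 scored
print(map_distance_cM(12, 200))
```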

Markers: these can be genes with a recognisable phenotype; however, in humans there are not enough phenotypes for the number of genes that need mapping. In some plants linkage can be tested by looking at the correlation between height and colour, or pedigree analysis can be done. Genes are typically polymorphic, though, which makes this hard to study. Restriction fragment length polymorphisms (RFLPs) are another type of marker. These are differences in where restriction enzyme recognition sites are found on the genome, and they can be identified using gel electrophoresis. Some people will have a restriction enzyme cut site in one location and others won’t, and the inheritance of these sites can be studied, so an RFLP marker can be followed down generations. Variable number tandem repeats (VNTRs) can also be used for genetic mapping. Some of these repeats are found in only one place on a genome and can therefore be used to identify individuals. Microsatellite polymorphisms are shorter tandem repeats of 2-5 bp which can also be used as markers. Single nucleotide polymorphisms (SNPs) are individual nucleotide substitutions within genes that can be used to determine which version of a gene a person has and to infer phenotypes. This is the basis of genotyping services such as 23andMe.
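A small Python sketch of the RFLP idea: two alleles that differ by a single base give different fragment lengths when digested. The sequences are invented for illustration. EcoRI (site GAATTC, cut after the first G) is used only as a familiar example enzyme, and the digest helper is my own name, not part of any library.

```python
def digest(seq, site="GAATTC", cut_offset=1):
    """Cut seq at every occurrence of site; return the fragments in order."""
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset  # EcoRI cuts between G and AATTC
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


# Hypothetical alleles: allele_b has one substitution that destroys the site
allele_a = "ATGCCGTAGAATTCGGCTAAGCTTGCA"
allele_b = "ATGCCGTAGAGTTCGGCTAAGCTTGCA"

lengths_a = [len(f) for f in digest(allele_a)]
lengths_b = [len(f) for f in digest(allele_b)]
print(lengths_a, lengths_b)
```

On a gel, allele_a would show two shorter bands and allele_b one longer band, which is the pattern that gets followed through a pedigree.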

These markers are fine, but for large genomes they are not enough: there would not be enough markers for a full map. Sequence tagged sites (STSs) are 100-500 bp long and can be detected using PCR. Expressed sequence tags (ESTs) are STSs which are expressed as mRNA and can therefore be converted to cDNA for sequencing to tell us something about individual genes; however, many ESTs make up a single gene. Entire genomes can be digested using restriction enzymes and then probed using STSs or ESTs. If the markers are found close to one another then they are on the same fragment - this allows us to predict the distance between STSs or ESTs based on restriction cut sites and the number of times that these markers turn up on the same fragment from a library of fragments (fig 8.7). This tells us the physical distance between two markers, whereas RFLP analysis tells us how often two markers are inherited together.
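The counting step in STS mapping can be sketched as follows in Python. Each clone in the library is represented by the set of markers detected on it by PCR. The clone contents and marker names are hypothetical, and a real analysis would turn these counts into distances with a statistical model rather than reading them off directly.

```python
from itertools import combinations


def co_occurrence(clones):
    """Count, for every marker pair, how many clones carry both markers."""
    counts = {}
    for markers in clones:
        for a, b in combinations(sorted(markers), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


# Hypothetical library: markers detected on each cloned fragment
clones = [
    {"STS1", "STS2"},
    {"STS1", "STS2", "STS3"},
    {"STS2", "STS3"},
    {"STS3", "STS4"},
    {"STS1", "STS2"},
]

counts = co_occurrence(clones)
for pair, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(pair, n)
```

Pairs that share a fragment most often (here STS1 and STS2) are taken to lie closest together, while pairs that never share a fragment are assumed to be far apart.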

Sometimes a cloned segment may contain two fragments of DNA from separate parts of the genome. Radiation hybrid mapping blasts cultured human cells with X-rays or gamma rays to break up the chromosomes so that the fragments generated are of more consistent length. The intensity of the rays controls the size of the fragments. Selection relies on a gene that allows the human cells to grow on a specific medium. After the cells have been irradiated they are fused with hamster cells using PEG. Only a successful fusion of a hamster cell and a human cell will grow on the specific medium, and roughly 50% of the DNA should be taken up by the hamster cell. The proportion of experiments in which two markers both show up therefore indicates how far they are from one another: the more often they are retained together, the closer they are.

Cytogenetic mapping is a low-resolution technique in which a chromosome is stained to look at the distance between markers. Most gaps in the sequence are found in the heterochromatin, which is difficult to sequence because it is highly condensed, is methylated on histone H3 and is hypoacetylated. This DNA is not transcribed.


There are still gaps in the human genome sequence, even though it is described as complete. Chromosome walking is used to close them: the chromosome is essentially sequenced sequentially rather than in fragments. One clone is sequenced and an overlapping section is found in the existing sequence data.

Human genome survey

It is difficult to say with certainty how many genes are encoded in the 3.2 Gbp of the human genome because of alternative splicing, and of the genes we do know of, only 50% have had their function characterised. Introns make it extremely difficult to find out where a gene is actually coded for. For example, the dystrophin gene is 2.4 million bp long but its exons make up only about 14,000 bp. This could lead to some genes being missed entirely or to misinterpretation of introns and exons.

Pseudogenes: these are duplicate copies of genes that have become defective and are not expressed. They can be found near or far from the expressed gene. Determining whether a sequence is a real gene or a pseudogene is difficult and must be confirmed by reading mRNA transcripts after converting them to cDNA. This calls into question the definition of a gene; for example, many small non-coding RNAs are coded for, but are the DNA sequences for these RNAs genes?

Non-coding DNA

About 25% of human DNA consists of protein-coding genes, but only about 1% of the genome is actual coding sequence (the exons that remain in the mature mRNA and so appear in cDNA). The rest of each gene is introns. Introns may contain transcription factor binding sites and play a role in regulating gene expression.

Long interspersed elements (LINEs) are repetitive parts of the genome. The degree of repetition is not clearly defined, but rRNA genes can be considered LINEs. Non-coding LINEs make up 20% of the human genome. Retrovirus-like genes have been found inside these elements. Some LINEs can act like transposons, copying themselves and inserting the copies into other parts of the genome.

Short interspersed elements (SINEs) are more repetitive than LINEs but shorter. These make up 13% of the human genome. Alu elements are a type of SINE, so called because they contain recognition sites for the AluI restriction enzyme. There are 300,000 to 500,000 Alu sites in our genome. SINEs are usually benign and cannot move without the help of LINE proteins.

Disease and genetics

About 3000 different diseases have been found using pedigree analysis. These are Mendelian diseases because they are caused by inherited mutations in a single gene. Diseases caused by more than one gene are polygenic. Polygenic diseases can be studied using genome-wide association studies (GWAS). These are variation studies that compare populations of people with and without the disease, checking for common gene variants that differ between them. Variants found more often in the diseased population are candidates for causing that specific phenotype.
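The comparison at the heart of a GWAS can be sketched for a single SNP as a chi-square test on a 2x2 table of allele counts. This is an illustration under assumptions: the counts are invented, the function name is my own, and a real study repeats the test for many SNPs and therefore needs a far stricter significance threshold than the single-test cut-off shown here.

```python
def chi_square_2x2(case_a, case_b, control_a, control_b):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table.

    case_a, case_b: counts of allele A and allele B in people with the disease
    control_a, control_b: the same counts in people without the disease
    """
    n = case_a + case_b + control_a + control_b
    numerator = n * (case_a * control_b - case_b * control_a) ** 2
    denominator = (
        (case_a + case_b)
        * (control_a + control_b)
        * (case_a + control_a)
        * (case_b + control_b)
    )
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    return numerator / denominator


# Hypothetical allele counts for one SNP
stat = chi_square_2x2(case_a=60, case_b=40, control_a=40, control_b=60)
print(stat)  # compare with 3.84, the 5% critical value for 1 degree of freedom
```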

Evolutionary genetics

Characterising an organism’s taxonomy is more objective when one uses the variations in ribosomal RNA rather than physical characteristics. Aligning sequences based on highest similarity is sometimes difficult because many organisms have viral DNA within their genomes. Gene families are groups of genes which are closely related. Gene superfamilies (such as the transporter superfamily, which codes for transport proteins) are groups of genes which have diverged so much over time that they are a lot less related. Another example is the globin superfamily, whose proteins bind oxygen. The less important a gene is for the survival of an organism, the more mutations in that gene can be tolerated and passed on to offspring, so less critical genes accumulate mutations at a higher rate than critical genes.
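As a rough sketch of how rRNA variation becomes a measure of relatedness, the Python below computes percent identity between sequences that are assumed to be already aligned and of equal length. The three short sequences are invented; real comparisons use full-length rRNA genes and a proper alignment step first.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which two sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches / len(a)


# Hypothetical aligned rRNA fragments from three organisms
org1 = "ACGUACGUAC"
org2 = "ACGUACGAAC"
org3 = "ACCUUCGAAG"

print(percent_identity(org1, org2))
print(percent_identity(org1, org3))
```

The higher the identity, the more recently the two organisms are assumed to have shared an ancestor.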


Differences between individual genomes lead to differences in how drugs are metabolised. Individuals’ responses to drugs will therefore vary. The goal of pharmacogenetics is to reduce adverse drug reactions by taking the genetics of the patient into consideration. SNPs are widely used to correlate drug side effects with different sections of a population.


DNA microarrays are often called DNA chips. They exploit DNA hybridisation between a probe fixed to a solid support and target DNA. Each point in the microarray contains sequences which correspond to genes in a specific organism. Gene expression can be monitored by extracting RNA from an organism and running it through the microarray. The mRNA binds to the DNA probes, and using fluorescence one can measure the expression of the genes corresponding to the mRNA which was tested. This can be exploited by subjecting cells to different conditions to monitor how gene expression changes.
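Comparing two conditions on a microarray usually comes down to a ratio of fluorescence intensities per spot, often reported on a log2 scale so that up- and down-regulation are symmetrical. A minimal Python sketch follows; the gene names and intensities are invented, and real data would need background correction and normalisation first.

```python
import math


def log2_ratio(treated, control):
    """log2 fold change in fluorescence between two conditions."""
    if treated <= 0 or control <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(treated / control)


# Hypothetical spot intensities: (treated, control)
spots = {
    "geneA": (800.0, 200.0),
    "geneB": (150.0, 600.0),
    "geneC": (500.0, 500.0),
}

changes = {gene: log2_ratio(t, c) for gene, (t, c) in spots.items()}
for gene, change in changes.items():
    print(gene, change)
```

A value of +2 means four times more signal in the treated cells, -2 means four times less, and 0 means no change.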

A whole genome tiling array (WGA) is an oligonucleotide microarray with probes that cover the entire genome. For humans these arrays have been created for chromosomes 21 and 22 and can identify unknown regions of DNA that are transcribed. A discovery from these chips was that 90% of transcribed RNA came from regions outside of already known exons.

Monitoring gene expression

RNA-Seq is the process of sequencing all the RNA in a cell. The more mRNA there is for a particular gene, the higher its expression. The mRNA is first captured using poly-T beads, which bind its poly-A tail, and converted to cDNA, which is then sequenced. This is also called whole transcriptome shotgun sequencing. Using computers the cDNA fragments are aligned, and the copy number of the cDNAs can be correlated to gene expression levels. RNA-Seq has many advantages over microarrays (fig 8.27).
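Turning read counts into expression levels needs a correction for gene length, because a longer transcript produces more fragments at the same expression level. One common unit is transcripts per million (TPM), sketched below in Python. TPM is not mentioned in these notes, and the gene names, counts and lengths are invented.

```python
def tpm(counts, lengths):
    """Transcripts per million from read counts and gene lengths in bp."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads were counted")
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}


# Hypothetical read counts and gene lengths
counts = {"geneA": 100, "geneB": 100, "geneC": 300}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

expression = tpm(counts, lengths)
for gene, value in expression.items():
    print(gene, round(value))
```

Note that geneC has three times the reads of geneA but the same TPM, because it is also three times as long.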

For more information on the expression of one gene in particular, a reporter gene can be used. A gene fusion is created between the gene of interest and a reporter gene, which is a gene whose product is easy to isolate and measure (assay). A common reporter gene is lacZ, which encodes beta-galactosidase. The gene fusion must keep the regulatory region of the gene of interest intact, including the promoter and all the binding sites for transcription factors, with only the coding section of the gene replaced by lacZ. The amount of lactose degraded and the rate at which it is degraded can then be correlated with gene expression. Other reporter genes are phoA, which codes for alkaline phosphatase, and the genes for luciferase or GFP.


Epigenetics has been discussed earlier as changes to the methylation of DNA, but gene expression can also be affected by chemical modifications to histone proteins. Epigenetics refers specifically to changes in the DNA structure which are HERITABLE. It does not cover changes in gene expression in cells which are not passed down to offspring. The methylome represents the sites on the genome which are methylated. In eukaryotes methylation is seen at CpG sequences.
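Since eukaryotic methylation occurs at CpG sequences, the candidate sites in a stretch of DNA can be listed with a simple scan, sketched here in Python. The sequence is invented, and finding a CpG only shows where methylation could occur, not whether that site is actually methylated in a given cell.

```python
def cpg_positions(seq):
    """0-based positions of every C that is directly followed by a G."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


# Hypothetical sequence
seq = "ATCGGCGTACGCA"
sites = cpg_positions(seq)
print(sites)
```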

References: my notes are made from, and follow the structure of, my course textbook, Biotechnology, 2nd edition, by David P. Clark.