Glossary
Active
site
The amino acid
residues at the catalytic site of an enzyme provide the binding
and activation energy needed to place the substrate into its
transition state and bridge the energy barrier of the reaction
undergoing catalysis
Adenine
A purine base
found in DNA and RNA
Agents
Independent,
autonomous, software modules that can search the Internet
for data or content pertinent to a particular application,
such as a gene, protein, or biological system
Agricultural
biotechnology (AgBio)
The application
of rDNA technology to agriculturally important plants and
organisms
Algorithm
An algorithm
consists of a series of steps defining a procedure or formula
for solving a problem that can be coded into a programming
language and executed. Bioinformatics algorithms typically
are used to process, store, analyze, visualize and make predictions
from biological data.
Alignment
It is the result
of a comparison of two or more gene or protein sequences in
order to determine their degree of base or amino acid similarity.
Sequence alignments are used to determine the similarity,
homology, function or other degree of relatedness between
two or more genes or gene products.
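As a rough illustration of "degree of similarity", the sketch below computes percent identity for two sequences that have already been aligned. The function name, the "-" gap symbol and the choice to skip gapped columns are assumptions made for this example; real tools use several different conventions.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of ungapped alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap (an assumed convention).
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGT-A", "ACCTGA"))
```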
Allele
An allele is
a given form of a gene that occupies a specific position or
locus on a chromosome. Variant forms of genes occurring at
the same locus are said to be alleles of one another.
Alternative
splicing
The process by which
the exons of a single gene are joined in alternate combinations
during mRNA splicing in higher organisms, so that one gene
can give rise to more than one protein.
Alternative
splice-form
One of the
possible alternate combinations of exons, and hence one of the
alternate protein products, that can arise by joining different
gene segments during mRNA splicing in higher organisms.
Alu
family
A common set
of dispersed DNA sequences found throughout the human genome;
each is about 300 bases long and they are repeated at least
500,000 times. Alu sequences are thought to have originated
from 7SL RNA sequences that were copied into primate DNA by
retrotransposition millions of years ago.
Amino
acid
One of the
20 chemical building blocks that are joined by amide (peptide)
linkages to form a polypeptide chain of a protein
Analogy
It can be defined
as reasoning by which the function of a novel gene or protein
sequence may be deduced from comparisons with other gene or
protein sequences of known function. Identifying analogous
or homologous genes via similarity searching and alignment
is one of the chief uses of Bioinformatics.
Annotation
A combination
of comments, notations, references, and citations, either
in free format or utilizing a controlled vocabulary, that
together describe all the experimental and inferred information
about a gene or protein. Annotations can also be applied
to the description of other biological systems. Batch,
automated annotation of bulk biological sequence is one of
the key uses of Bioinformatics tools.
Anticodon
It is the triplet
of contiguous bases on tRNA that pairs with the complementary codon
on mRNA. Example: the mRNA codon GGG (glycine) pairs with the tRNA anticodon CCC.
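Codon-anticodon pairing can be sketched as a reverse complement over the RNA alphabet. This is a minimal sketch with hypothetical names; it assumes strict Watson-Crick pairing and ignores wobble pairing and modified tRNA bases.

```python
# Watson-Crick partners in RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def anticodon_for(codon: str) -> str:
    """Return the anticodon (written 5'->3') that pairs with an mRNA codon (5'->3')."""
    # Pairing is antiparallel, so complement each base and reverse the order.
    return "".join(COMPLEMENT[base] for base in reversed(codon.upper()))


print(anticodon_for("GGG"))  # glycine codon
print(anticodon_for("AUG"))  # start codon
```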
Antigen
It can be defined
as any foreign molecule that stimulates an immune response
in a vertebrate organism. Many antigens are proteins such
as the surface proteins of foreign organisms.
Antisense
DNA or RNA
composed of the complementary sequence to the target DNA/RNA.
Also used to describe a therapeutic strategy that uses antisense
DNA or RNA sequences to bind specific gene DNA sequences
or mRNA implicated in disease and so physically block
their expression.
Assay
It is a method
for measuring a biological activity. This may be enzyme activity,
binding affinity, or protein turnover. Most assays utilize
a measurable parameter such as color, fluorescence or radioactivity
to correlate with the biological activity.
Assembly
Compilation
of overlapping sequences from one or more related genes that
have been clustered together based on their degree of sequence
identity or similarity. Sequence assembly may be used to piece
together "shotgun" sequencing fragments based upon overlapping
restriction enzyme digests, or may be used to identify and
index novel genes from "single-pass" cDNA sequencing efforts.
Autoradiography
Autoradiography
is used to locate radioisotope-labeled materials which have
been separated in gels or are present in blots.
The location
of the radiolabeled material is determined by overlaying
the test material with a photographic film that is sensitive
to the radioisotope.
Bacterial
artificial chromosome (BAC)
It is a cloning
vector that can incorporate large fragments of DNA.
Bacteriophage
It is a virus
that infects bacteria. The bacteriophage DNA has served as
a basis for cloning vectors, and is also utilized to create
phage libraries containing human or other genes.
Baculovirus
A baculovirus
is an insect virus which forms the basis of a protein expression
system.
Base
pair
A pair of nitrogenous
bases (a purine and a pyrimidine), held together by hydrogen
bonds, that form the core of DNA and RNA, i.e. the A:T, G:C
and A:U interactions.
Beta
sheet
A three dimensional
arrangement taken up by polypeptide chains that consists of
alternating strands linked by hydrogen bonds. The alternating
strands together form a sheet that is frequently twisted.
It is one of the secondary structural elements characteristic
of proteins.
Bioinformatics
The field of
endeavor that relates to the collection, organization and
analysis of large amounts of biological data using networks
of computers and databases (usually with reference to the
genome project and DNA sequence information)
Bivalent
It can be defined
as a molecule having two binding sites; having 2 free electrons
available for binding.
Blunt-end
(ligation)
Blunt-end ligation
can be described as the joining of DNA fragments that contain
no overhang at either end and consequently no DNA bases available
for hybridization.
Carboxyl
group
The -COOH functional
group, acidic in nature, found in all amino acids
cDNA
(complementary DNA)
A DNA strand
copied from mRNA using reverse transcriptase. A cDNA library
represents the entire expressed DNA in a cell.
cDNA
library
A set of DNA
fragments prepared from the total mRNA obtained from a selected
cell, tissue or organism.
Cell
It is the basic
unit of any living organism.
Cell
Cycle
The life cycle
of a cell, ending in cell division, which is separated
into four phases: G1, S, G2, and M. DNA replication is confined
to the S (synthesis) phase, and chromosomal separation to
the M (mitotic) phase.
Chimeric
clone
A cloning artifact
created by a foreign gene being inserted into a vector in
an incorrect orientation resulting in the expression of a
protein consisting of a fusion of two different gene products.
Chromat
Data file output
from most popular DNA sequencers. Chromat files consist of
the fluorescent traces generated by the sequencer for each
of the four chemical bases, A, C, G, and T, together with
the sequence and measures of the error in the traces at each
sequence position.
Chromatin
The chromosome
as it appears in its condensed state composed of DNA and associated
proteins (mainly histones).
Chromosome
A chromosome
is the structure in the cell nucleus that contains the cellular
DNA together with a number of proteins that compact and package the DNA.
Clinical
trials
A clinical trial is
a research study that involves patients. Biotechnology companies
typically use clinical trials to assess the efficacy and safety
of new therapies and to answer scientific questions. Typically,
there are 3 phases during a clinical trial. Phase I is designed
to evaluate the safety of the product in humans; phase II
analyses the effects of dose escalation, and phase III definitively
evaluates the clinical efficacy of the product.
Clone
A clone is a
population of genetically identical cells or DNA molecules.
Cloning
The formation
of clones or exact genetic replicas.
Cluster
A Cluster is
the grouping of similar objects in a multidimensional space.
Clustering is used for constructing new features which are
abstractions of the existing features of those objects. The
quality of the clustering depends crucially on the distance
metric in the space. In bioinformatics, clustering is performed
on sequences, high-throughput expression and other experimental
data. Clusters of partial or complete gene sequences can be
used to identify the complete (contiguous) sequence and to
better identify its function. Clustering expression data enables
the researcher to discern patterns of co-regulation in groups
of genes.
Coding
regions (CDS)
The portions
of a genomic sequence bounded by start and stop codons that
identify the sequence of the protein being coded for by a
particular gene.
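The idea of a region "bounded by start and stop codons" can be sketched as a simple open reading frame scan. This is an illustrative sketch only: it assumes ATG as the sole start codon, searches the forward strand only, and ignores introns, so it is not a gene finder. The function name and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) slices of ATG...stop reading frames on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length counts the start and stop codons as well.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


sequence = "CCATGAAATAGGG"
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])
```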
Codon
A codon is
a sequence of three adjacent nucleotides that designates a
specific amino acid or a start/stop site for translation.
Combinatorial
chemistry
Combinatorial
chemistry starts with a set of building blocks and uses systematic
methods to generate all the possible combinations of them. The building
blocks may be peptides, nucleic acids or small molecules.
The libraries of compounds formed by this methodology are
used to probe for new pharmaceuticals.
Complementarity
determining region (CDR)
The hypervariable
regions of an antibody molecule, consisting of three loops
from the heavy chain and three from the light chain, that
together form the antigen-binding site.
Complexity
(of gene sequence)
The term "low
complexity sequence" may be thought of as synonymous with
regions of locally biased amino acid composition. In these
regions, the sequence composition deviates from the random
model that underlies the calculation of the statistical significance
(P-value) of an alignment. Such alignments among low
complexity sequences are statistically but not biologically
significant, i.e., one cannot infer homology (common ancestry)
or functional similarity.
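One common way to quantify "locally biased composition" is the Shannon entropy of the residues in a window: a low value indicates a low-complexity stretch. The sketch below is only an illustration of that idea, not the filter used by any particular search program, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a sequence window."""
    if not seq:
        return 0.0
    total = len(seq)
    # Sum -p * log2(p) over the observed residue frequencies.
    return -sum((n / total) * math.log2(n / total) for n in Counter(seq).values())


print(shannon_entropy("AAAAAAAA"))  # a single repeated residue
print(shannon_entropy("ACGT"))      # all four bases equally frequent
```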
Configuration
Configuration
is defined as the complete ordering and description of all
parts of a software or database system. Configuration
management is the use of software to identify, inventory and
maintain the component modules that together comprise one
or more systems or products.
Conformation
The precise
three-dimensional arrangement of atoms and bonds in a molecule
describing its geometry and hence its molecular function
Consensus
sequence
A single sequence
delineated from an alignment of multiple constituent
sequences that represents a "best fit" for all those sequences.
A "voting" or other selection procedure is used to determine
which residue (nucleotide or amino acid) is placed at a given
position in the event that not all of the constituent sequences
have the identical residue at that position.
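The "voting" procedure can be sketched as a per-column majority count over an existing multiple alignment. In this minimal sketch, the function name and the use of "N" for tied columns are assumptions; other selection rules (thresholds, ambiguity codes) are equally valid.

```python
from collections import Counter


def consensus(aligned: list[str], tie: str = "N") -> str:
    """Majority-vote consensus of equal-length aligned sequences."""
    result = []
    for column in zip(*aligned):
        ranked = Counter(column).most_common()
        top, top_count = ranked[0]
        # If two residues share the top count, emit the tie symbol instead.
        if len(ranked) > 1 and ranked[1][1] == top_count:
            result.append(tie)
        else:
            result.append(top)
    return "".join(result)


print(consensus(["ACGT", "ACGA", "TCGA"]))
```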
Constitutive
synthesis (expression)
Synthesis of
mRNA and protein at an unchanging or constant rate regardless
of a cell’s requirements
Contig
A length of
contiguous sequence assembled from partial, overlapping sequences,
generated from a "shotgun" sequencing project. Contigs
are typically created computationally, by comparing the overlapping
ends of several sequencing reads generated by restriction
enzyme digestion of a segment of genomic DNA. The creation
of contigs in the presence of sequencing errors, ambiguities
and the presence of repeats is one of the most computationally
challenging aspects of the role of Bioinformatics in genome
analysis.
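The core step of comparing overlapping read ends can be sketched for two error-free reads as follows. This is a toy illustration with hypothetical function names; real assemblers must also handle sequencing errors, repeats and reverse-complemented reads, which is where the computational difficulty lies.

```python
from typing import Optional


def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Join two reads into one contig if they overlap by at least `min_len` bases."""
    size = overlap(left, right, min_len)
    return left + right[size:] if size else None


print(merge_reads("ATTAGACC", "GACCTGA"))
```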
Convergence
It is the end-point
of any algorithm that uses iteration or recursion to guide
a series of data processing steps. An algorithm is usually
said to have reached convergence when the difference between
the computed and observed values falls below a pre-defined
threshold.
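A convergence test can be sketched as a loop that stops once the change falls below a threshold. As an assumption for this example, the change is measured between successive estimates rather than against observed data, and Newton's update for a square root stands in for the data processing step. All names are hypothetical.

```python
def iterate_until_converged(update, start, tolerance=1e-9, max_steps=1000):
    """Apply `update` repeatedly until successive values differ by less than `tolerance`."""
    current = start
    for step in range(1, max_steps + 1):
        following = update(current)
        if abs(following - current) < tolerance:
            return following, step
        current = following
    raise RuntimeError("no convergence within max_steps")


# Newton's iteration for the square root of 2, used purely as a stand-in update rule.
value, steps = iterate_until_converged(lambda x: 0.5 * (x + 2.0 / x), 1.0)
print(value, steps)
```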
Cosmids
Cosmids are
the DNA vectors that allow the insertion of long fragments
of DNA (up to 50 kbases).
Crystal
structure
It is the term
used to describe the high-resolution molecular structure derived
by X-ray crystallographic analysis of protein or other biomolecular
crystals.
Cytoplasm
It is the medium
of the cell between the nucleus and the cell membrane.
Cytosine
Cytosine is a pyrimidine base
found in DNA and RNA.
Data
Cleaning
A process whereby automated or
semi-automated algorithms are used to process experimental
data, including noise, experimental errors and other artifacts,
in order to generate and store high-quality data for use in
subsequent analysis. Data cleaning is typically required in
high-throughput sequencing where compression or other experimental
artifacts limit the amount of sequence data generated from
each sequencing run or "read."
Data
Mining
The ability
to query very large databases in order to satisfy a hypothesis
("top-down" data mining); or to interrogate a database in
order to generate new hypotheses based on rigorous statistical
correlations ("bottom-up" data mining).
Data
Processing
Data processing
is defined as the systematic performance of operations upon
data such as handling, merging, sorting, and computing. The
semantic content of the original data should not be changed,
but the semantic content of the processed data may be changed.
Data
Warehouses
Vast arrays
of heterogeneous (biological) data, stored within a single
logical data repository, which are accessible to different
querying and manipulation methods.
Database
Any file system
by which data gets stored following a logical process.
Deconvolution
It is a mathematical
procedure to separate out the overlapping effects of molecules
such as mixtures of compounds in a high-throughput screen,
or mixtures of cDNAs in a high density array.
Deletion
It can be defined
as a chromosomal alteration in which a portion of the chromosome
or the underlying DNA is lost.
Deletion
mapping
Deletion mapping
is a process in which different deletions in a region of DNA
are created and used to map the functionally critical areas
of that DNA. For example the minimal region of DNA required
for a test promoter can be ascertained by systematic deletions
in the region of interest.
Dendrogram
A dendrogram is strictly defined as a binary
tree with a distinguished root. It is a
graphical procedure for representing the output of a hierarchical
clustering that has all the data items at its leaves.
Conventionally, all the leaves are shown at the same level
of the drawing. The ordering of the leaves is arbitrary,
as is their horizontal position. The heights of the internal
nodes may be arbitrary, or may be related to the metric information
used to form the clustering.
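The binary-tree structure can be made concrete with a small single-linkage clustering of one-dimensional values. This is a sketch under assumptions: the function name, the nested-tuple representation (left, right, height) and the single-linkage rule are choices made for illustration, with the node height set to the merge distance.

```python
def single_linkage(points: dict[str, float]):
    """Build a rooted binary tree by repeatedly merging the two closest clusters."""
    # Each cluster is (tree, member_values); a leaf's tree is just its label.
    clusters = [(label, [value]) for label, value in points.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                dist = min(abs(a - b) for a in clusters[i][1] for b in clusters[j][1])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        # Internal node: (left subtree, right subtree, height).
        merged = ((clusters[i][0], clusters[j][0], dist), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


print(single_linkage({"a": 0.0, "b": 1.0, "c": 5.0}))
```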
Dimer
A dimer is a
composite molecule formed by the binding of two molecules
Disulphide
bond
Covalent link
formed between the sulphur atoms of two different cysteine
residues in a protein. It is important in maintaining the
folded structure of a protein, and also for linking different
proteins in a complex.
DNA
(deoxyribonucleic acid)
DNA is the
chemical that forms the basis of the genetic material in virtually
all organisms. DNA is composed of the four nitrogenous bases
Adenine, Cytosine, Guanine, and Thymine, which are covalently
bonded to a backbone of deoxyribose-phosphate to form a DNA
strand. Two complementary strands (where all Gs pair with
Cs and As with Ts) form a double helical structure which is
held together by hydrogen bonding between the cognate bases.
DNA
fingerprinting
A technique
for identifying human individuals based on a restriction enzyme
digest of tandemly repeated DNA sequences that are scattered
throughout the human genome, but are unique to each individual.
DNA
microarrays
A DNA microarray
is the deposition of oligonucleotides or cDNAs onto an inert
substrate such as glass or silicon. Thousands of molecules
may be organized spatially into a high-density matrix. These
DNA chips may be probed to allow expression monitoring of
many thousands of genes simultaneously. Uses include study
of polymorphisms in genes, de novo sequencing or molecular
diagnosis of disease.
DNA
polymerase
It is an enzyme
that catalyzes the synthesis of DNA from a DNA template given
the deoxyribonucleotide precursors.
DNA
probes
Short single
stranded DNA molecules of specific base sequence, labeled
either radioactively or immunologically, that are used to
detect and identify the complementary base sequence in a gene
or genome by hybridizing specifically to that gene or sequence.
DNA
sequencing
It is the technique
in which the specific sequence of bases forming a particular
DNA region is deciphered.
DNase
(Deoxyribonuclease)
It is one of
a series of enzymes that can digest DNA.
Domain
(protein)
A domain is
a region of special biological interest within a single protein
sequence. However, a domain may also be defined as a region
within the three-dimensional structure of a protein, possibly
encompassing segments of several distinct protein sequences, that
accomplishes a specific function. A domain class is a group
of domains that share a common set of well-defined properties
or characteristics.
Drug
It is an agent
that affects a biological process. Specifically, it can be
defined as a molecule whose molecular structure can be correlated
with its pharmacological activity.
Drug
discovery cycle
The cycle of
events required to develop a new drug. Typically this involves
research, preclinical testing and clinical development, and
can take from 5 to 12 years.
Electronic
Northerns
The use of
an electronic database of cDNA sequences (or probes derived
from them) in order to measure the relative levels of mRNAs
expressed in different cells or tissues. An example of the
use of an electronic Northern might be to identify the differences
in the genes expressed in prostate cancer and those in benign
prostate hyperplasia, by subtracting the database of one from
the other and seeing which cDNAs remain.
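The subtraction described above can be sketched with plain collections. The cDNA identifiers below are hypothetical placeholders, and a real comparison would normalise for library size and sequencing depth before drawing conclusions.

```python
from collections import Counter


def subtract_libraries(sample: list[str], reference: list[str]) -> dict[str, int]:
    """cDNA identifiers (with counts) seen in `sample` but absent from `reference`."""
    sample_counts = Counter(sample)
    reference_ids = set(reference)
    return {gene: n for gene, n in sample_counts.items() if gene not in reference_ids}


# Hypothetical identifiers standing in for sequenced cDNAs from two tissues.
tumour = ["cdna1", "cdna2", "cdna2", "cdna3"]
benign = ["cdna1", "cdna3"]
print(subtract_libraries(tumour, benign))
```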
Electrophoresis
The use of
an external electric field to separate large biomolecules
on the basis of their charge by running them through
acrylamide or agarose gels.
Enhancers
DNA sequences
that can greatly increase the transcription rates of genes
even though they may be far upstream or downstream from the
promoter they stimulate.
Enzyme
Enzymes are
a class of proteins capable of catalyzing chemical reactions
(the making or breaking of chemical bonds). They do so by
orienting their substrates into a suitable geometry in a particular
location (the active site) where electrophilic or nucleophilic
amino acid residues can participate in the reaction. Enzymes
are protein catalysts that speed up chemical reactions that
would otherwise be prohibitively slow under physiological
conditions.
Epigenomics
It is the study
of complex expression networks or linkages both spatially
(within the body) and temporally (at different times in development).
Equilibrium
constant
It is the value
that describes the equilibrium state of the reversible reaction
between two molecular species.
Eukaryote
A cell or organism
with a distinct membrane-bound nucleus as well as specialized
membrane-based organelles
Exon
An Exon is
the region of DNA within a gene that codes for a polypeptide
chain or domain. Typically a mature protein is composed of
several domains coded by different exons within a single gene.
Expressed
Sequence Tags (ESTs)
An EST is a
short sequence from an expressed gene that can be amplified
by PCR. ESTs act as physical markers for cloning and
full-length sequencing of the cDNAs of expressed genes. Typically
they are identified by purifying mRNAs, converting to cDNAs,
and then sequencing a portion of the cDNAs.
Expression
(gene or protein)
It is a measure
of the presence, amount, and time-course of one or more gene
products in a particular cell or tissue. Expression
studies are typically performed at the RNA (mRNA) or protein
level in order to determine the number, type, and level of
genes that may be up-regulated or down-regulated during a
cellular process, in response to an external stimulus, or
in sickness or disease. Gene chips and proteomics now
allow the study of expression profiles of sets of genes or
even entire genomes.
Expression
profile
The level and
duration of expression of one or more genes, selected from
a particular cell or tissue type, generally obtained by a
variety of high-throughput methods, such as sample sequencing,
serial analysis, or microarray-based detection.
Expression
vector
It is a cloning vector that is
engineered to allow the expression of protein from a cDNA.
The expression vector provides an appropriate promoter and
restriction sites that allow insertion of cDNA.
Fingerprint
A fingerprint
is a set of motifs used to predict the occurrence of similar
motifs, in either an individual sequence or in a database.
Fingerprints are refined by iterative scanning of a composite
protein sequence database. A composite or multiple-motif
fingerprint contains a number of aligned motifs taken from
different parts of a multiple alignment. True family
members are then easy to identify by virtue of possessing
all elements of the fingerprint, while subfamily members may
be identified by possessing only part of it.
Frameshift
A frameshift
is the insertion, deletion, or duplication of a number of bases
not divisible by three, which causes the reading frame of a structural gene to
shift from the normal series of triplets.
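The effect on the series of triplets can be seen by splitting a sequence into codons before and after removing a single base. The sequence and the helper name are made up for illustration.

```python
def codons(seq: str) -> list[str]:
    """Split a sequence into consecutive triplets, dropping any incomplete tail."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


original = "ATGGCCAAATGA"
# Delete the base at index 3: every downstream triplet now reads differently.
mutated = original[:3] + original[4:]

print(codons(original))
print(codons(mutated))
```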
Functional
genomics
It is the use
of genomic information to delineate protein structure, function,
pathways and networks. Function may be determined by "knocking
out" or "knocking in" expressed genes in model organisms such
as worm, fruitfly, yeast or mouse.
Fusion
protein
It is a protein
resulting from the genetic joining and expression of 2 different
genes.
Gaps (affine
gaps)
A gap is defined
as any maximal, consecutive run of spaces in a single string
of a given alignment. Gaps help create alignments that better
conform to underlying biological models and more closely fit
patterns that one expects to find in meaningful alignment.
The idea is to take into account the number of contiguous gaps
and not only the number of spaces when calculating an alignment.
Affine gaps contain a component for gap insertion and a component
for gap extension, where the extension penalty is usually
much lower than the insertion penalty. These mimic biological
reality as multiple gaps would imply multiple mutations, but
a single mutation can lead to a long gap quite easily.
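The affine scheme can be sketched as a cost function over one row of an alignment. The parameter values and the convention used here (opening cost for the first space, extension cost for each further space) are assumptions; some programs charge the extension cost for every space, including the first.

```python
import re


def affine_gap_cost(aligned: str, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Total penalty for the gaps in one aligned sequence.

    Each maximal run of '-' costs gap_open + gap_extend * (run_length - 1).
    """
    runs = re.findall(r"-+", aligned)
    return sum(gap_open + gap_extend * (len(run) - 1) for run in runs)


print(affine_gap_cost("AC---GT"))  # one gap of three spaces
print(affine_gap_cost("A-C-G-T"))  # three gaps of one space each
```

With these example values, one long gap is much cheaper than the same number of spaces spread over separate gaps, which is the behaviour the entry describes.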
Gap
penalties
The penalty
applied to a similarity score for the introduction of an insertion
or deletion gap, the extension of a gap, or both. Gap penalties
are usually subtracted from a cumulative score being determined
for the comparison of two or more sequences via an optimization
algorithm that attempts to maximize that score.
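A minimal sketch of how a gap penalty enters an optimization algorithm is a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence) with a linear gap penalty. The scoring values are arbitrary examples, and the gap argument is a negative number that is added to the score, which is equivalent to subtracting a penalty.

```python
def global_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of `a` and `b` with a linear gap penalty."""
    # Only the previous row of the dynamic-programming table is kept.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either align the two residues, or open a space in one sequence.
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```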
Gel
electrophoresis
It is a technique
by which molecules are separated by size or charge by passing
them through a gel under the influence of an external electric
field.
Gene
Index
A listing of
the number, type, label and sequence of all the genes identified
within the genome of a given organism. Gene indices are usually
created by assembling overlapping EST sequences into clusters,
and then determining if each cluster corresponds to a unique
gene. Methods by which a cluster can be identified as representing
a unique gene include identification of long open reading
frames (ORFs), comparison to genomic sequence, and detection
of SNPs or other features in the cluster that are known to
exist in the gene.
GenBank
Data bank of
genetic sequences operated by a division of the National Institutes
of Health.
Gene
A gene is a
unit of inheritance. In practice, a gene is a segment of DNA
on a chromosome that encodes a protein and all the regulatory
sequences (promoter) required to control expression of that
protein.
Gene
chips (also Gene arrays)
They are small
glass or silicon chips onto which oligonucleotides or cDNAs are
covalently attached in organized arrays. Over 50,000
different DNA fragments can be presented on a single chip
providing a high throughput parallel method of probing gene
expression, genotype or gene function.
Gene
expression
Gene expression
is the conversion of information from gene to protein via
transcription and translation.
Gene
families
They are subsets
of genes containing homologous sequences which usually correlate
with a common function.
Gene
library
It is a collection
of cloned DNA fragments created by restriction endonuclease
digestion that represent part or all of an organism’s genome.
Gene
product
The gene product
is either RNA or protein, which results from expression of
a gene. The amount of gene product reflects the activity of
the gene.
Gene
therapy
Gene therapy
is the use of genetic material for therapeutic purposes. The
therapeutic gene is typically delivered using recombinant
virus or liposome based delivery systems.
Genetic
code
The mapping
of all possible codons into the 20 amino acids including the
start and stop codons.
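The mapping can be written out compactly as a lookup table. The sketch below encodes the standard genetic code for DNA codons, with bases ordered T, C, A, G and "*" marking stop codons; the translate helper is a hypothetical convenience that stops at the first stop codon and does not search for a start codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE), translate("ATGGGGAAATGA"))
```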
Genetic
engineering (Recombinant DNA technology)
The procedures
used to isolate, splice and manipulate DNA outside the cell.
Genetic Engineering allows a recombinantly engineered DNA
segment to be introduced into a foreign cell or organism,
and be able to replicate and function normally.
Genetic
marker
A genetic marker
is any gene that can be readily recognized by its phenotypic
effect, and which can be used as a marker for a cell, chromosome,
or individual carrying that gene. Also, any detectable polymorphism
used to identify a specific gene.
Genome
It is the complete
genetic content of an organism.
Genomic
DNA (sequence)
DNA sequence
typically obtained from mammalian or other higher-order species,
which includes both intron and exon sequence (coding sequence),
as well as non-coding regulatory sequences such as promoter,
and enhancer sequences.
Genomics
It is the analysis
of the entire genome of a chosen organism.
Genotype
A genotype
is the set of all of the genes possessed by an individual. In
practice, the particular alleles present at a specific genetic
locus.
Glycosylation
The addition
of carbohydrate groups (sugars) e.g. to polypeptide chains
Guanine
(G)
One of the nitrogenous
purine bases found in DNA and RNA
Hairpin
A double-helical
region in a single DNA or RNA strand formed by the hydrogen-bonding
between adjacent inverse complementary sequences to form a
hairpin shaped structure.
Haploid
A haploid is
a cell or organism containing only one set of chromosomes
without the homologous pairs.
Heterodimer
Protein composed
of 2 different chains or subunits.
Heteroduplex
It is a hybrid
structure formed by the annealing of two DNA strands (or an
RNA and DNA) that have sufficient complementarity in their
sequence to allow hydrogen bonding.
Hidden
Markov model (HMM)
It is a joint
statistical model for an ordered sequence of variables. It
is the result of stochastically perturbing the variables in
a Markov chain (the original variables are thus "hidden"),
where the Markov chain has discrete variables which select
the "state" of the HMM at each step. The perturbed values
can be continuous and are the "outputs" of the HMM. A Hidden
Markov Model is equivalently a coupled mixture model where
the joint distribution over states is a Markov chain. Hidden
Markov models are valuable in bioinformatics because they
allow a search or alignment algorithm to be trained using
unaligned or unweighted input sequences; and because they
allow position-dependent scoring parameters such as gap penalties,
thus more accurately modeling the consequences of evolutionary
events on sequence families.
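To make the hidden-state idea concrete, the sketch below computes the probability of an observed sequence with the forward algorithm for a discrete HMM. The two-state model, its state names and all probabilities are invented for illustration and are not taken from any real gene-finding or profile model.

```python
# Toy model: every number below is an arbitrary example value.
STATES = ("exon", "intron")
START_P = {"exon": 0.5, "intron": 0.5}
TRANS_P = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}
EMIT_P = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Probability of a non-empty observation sequence, summed over all hidden paths."""
    # alpha[s] = P(observations so far, current hidden state is s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


print(forward_probability("GATTACA", STATES, START_P, TRANS_P, EMIT_P))
```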
High-throughput
screening (HTS)
HTS is the
method by which very large numbers of compounds are screened
against a putative drug target in either cell-free or whole-cell
assays. Typically, these screenings are carried out in 96-well
plates using automated, robotic station based technologies
or in higher-density array ("chip") formats.
HLA
complex
Another name
for the MHC in humans; refers to the "Human Leukocyte Antigen"
complex located on chromosome 6.
Homeobox
A highly conserved
region in a homeotic gene composed of 180 bases (60 amino
acids) that specifies a protein domain (the homeodomain) that
serves as a master genetic regulatory element in cell differentiation
during development in species as diverse as worms, fruit flies,
and humans.
Homeodomain
A 60 amino-acid
protein domain coded for by the homeobox region of a homeotic
gene.
Homeotic
gene
A gene that
controls the activity of other genes involved in the development
of a body plan. Homeotic genes have been found in organisms
ranging from plants to humans.
Homology
Two or more
biological species, systems or molecules that share a common
evolutionary ancestor are said to be homologous.
The term is also applied loosely to two or more
gene or protein sequences that share a significant degree
of similarity, typically measured by the amount of identity
(in the case of DNA), or conservative replacements (in the
case of protein), that they register along their lengths.
Sequence "homology" searches are typically performed with
a query DNA or protein sequence to identify known genes or
gene products that share significant similarity and hence
might inform on the ancestry, heritage and possible function
of the query gene.
Housekeeping
genes
Genes that
are always expressed (i.e. they are said to be constitutively
expressed) due to their constant requirement by the cell.
Human
Anti-Murine Antibody Response (HAMA)
An immune response
generated in humans to antibodies that are raised in murine
(e.g. mouse or rat) cells.
Hybridization
The interaction
of complementary nucleic acid strands. This can occur between
two DNA strands or between DNA and RNA strands, and is the
basis of many techniques such as Southern and northern blots.
Hydrogen
bond
It is a weak
chemical interaction between an electronegative atom (e.g.
nitrogen or oxygen) and a hydrogen atom that is covalently
attached to another atom. This bond holds the two strands
of the DNA double helix together and is also the primary interaction between
water molecules.
Hydrophilicity
(Literal meaning:
water-loving) It can be defined as the degree to which a
molecule is soluble in water. Hydrophilicity depends to a
large degree on the charge and polarizability of the molecule
and its ability to form transient hydrogen-bonds with (polar)
water molecules.
Hydrophobicity
(Literal meaning:
water-hating) It can be defined as the degree to which a
molecule is insoluble in water, and hence is soluble in lipids.
If a molecule lacking polar groups is placed in water, it
will be entropically driven to finding a hydrophobic environment
(such as the interior of a protein or a membrane).
Idiotype
Antibody variants
localized to the variable portion of an immunoglobulin that
are recognised by their antigenic determinants. The determinants
are composed from the antigen-combining site or CDRs. Every
unique antigenic determinant has a specific antibody with
its own unique idiotype.
Immunoglobulin
It is a member
of the globulin protein family consisting of two light and
two heavy chains linked by disulfide bonds. All antibodies
are immunoglobulins.
In
silico (biology)
(Lit. computer
mediated). It is the use of computers to simulate, process,
or analyse a biological experiment.
In
situ hybridization
A variation
of the DNA/RNA hybridization procedure in which the denatured
DNA is in place in the cell and is then challenged with RNA
or DNA extracted from another source.
Integration
It is the physical
insertion of DNA into the host cell genome. The process is
used by retroviruses where a specific enzyme catalyses the
process or can occur at random sites with other DNA (e.g.
transposons).
Intracellular
signaling
It is the
communication of a molecular message from the surface of the
cell to the nucleus via the participation of a series of molecules,
including receptors, enzymes, proteins, and small-molecules.
The end result of the signaling process is the up- or down-regulation
of a particular series of genes that may be involved in cell
growth, division or differentiation.
Introns
Nucleotide
sequences found in the structural genes of eukaryotes that
are non-coding and interrupt the sequences containing information
that codes for polypeptide chains. Intron sequences are spliced
out of their RNA transcripts before maturation and protein
synthesis.
Isoschizomers
They are different
restriction enzymes which recognize and cut DNA at the same
recognition site. E.g. Sma I and Xma I both recognize and
cut the sequence CCCGGG.
Isozymes
Isozymes are two or more
enzymes capable of catalyzing the same reaction but differing
in structure, and hence in their specificity and in their
efficiency under different environmental conditions.
Iteration
A series of steps
in an algorithm whereby the processing of data is performed
repetitively until the result exceeds a particular threshold.
Iteration is often used in multiple sequence alignments whereby
each set of pairwise alignments is compared with every other,
starting with the most similar pairs and progressing to the
least similar, until there are no longer any sequence-pairs
remaining to be aligned.
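As a minimal sketch of the iteration described above, the Python loop below repeatedly picks the most similar remaining sequence pair until none are left. The function name and the similarity scores are hypothetical illustrations, and the alignment step itself is omitted.

```python
def order_pairs_for_alignment(similarity):
    """Return sequence pairs ordered from most to least similar.

    `similarity` maps a (name, name) pair to a score. The scores used
    below are made-up values, not real alignment results.
    """
    remaining = dict(similarity)  # copy so the caller's dict is untouched
    order = []
    while remaining:  # iterate until no sequence-pairs remain
        best = max(remaining, key=remaining.get)
        order.append(best)
        del remaining[best]
    return order


scores = {("A", "B"): 0.9, ("A", "C"): 0.4, ("B", "C"): 0.7}
print(order_pairs_for_alignment(scores))
```

The loop condition plays the role of the stopping rule: processing ends when the set of unprocessed pairs is empty.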
Junk DNA
Term used to
describe the excess DNA that is present in the genome beyond
that required to encode proteins. It is a misleading term
since these regions are likely to be involved in gene regulation,
and other as yet unidentified functions.
Karyotype
It is the constitution
(typically number and size) of chromosomes in a cell or individual.
Knockout
mice (gene targeting)
Mice which have
been engineered to lack a chosen gene are called knockout
mice. The gene is inactivated in embryonic stem
cells using the technique of homologous recombination. These
cells are then introduced into an early-stage embryo (blastocyst),
which is then transplanted into a recipient mouse. The
subsequent progeny lack the targeted gene in some cells. This
technique is used to determine the function of the chosen
gene.
"Lab on a
chip"
It is the term
which describes microdevices that allow rapid, microscale
analysis of DNA or protein in a single, fully integrated system.
Typically, these devices are miniature surfaces, made of silicon,
glass or plastic, which carry the necessary microdevices (pumps,
valves, microfluidic controllers, and detectors) that allow
sample separation and analysis. These devices are used in
drug discovery, genetic testing and separation science.
Lead
compound
A candidate
compound identified as the best "hit" (tight binder) after
screening of a combinatorial (or other) compound library,
that is then taken into further rounds of screening to determine
its suitability as a drug.
Lead
optimization
The process
of converting a putative lead compound ("hit") into a therapeutic
drug with maximal activity and minimal side effects, typically
using a combination of computer-based drug design, medicinal
chemistry and pharmacology.
Leucine
zipper
It is a DNA-binding
protein motif in which 4-5 leucines are found at intervals of 7
amino acids. This motif is typically present in transcription
factors and other proteins that bind DNA.
Lexicon
In Bioinformatics,
a lexicon refers to a pre-defined list of terms that together
completely define the contents of a particular database.
Library
It is a large
collection of compounds, peptides, cDNAs or genes which may
be screened in order to isolate cognate molecules.
Ligand
A Ligand is
a small molecule that binds to a protein or receptor; the
cognate partner of many cellular proteins, enzymes, and receptors.
Linkage
It is the association
of genes (or genetic loci) on the same chromosome. Genes that
are linked together tend to be transmitted together.
Linkage
map
It is a genetic
map of a chromosome or genome delineated by mapping the positions
of genes to their chromosomes by their linkage to readily
identifiable genetic loci.
Locus
The specific
position occupied by a gene on a chromosome. At a given locus,
any one of the variant forms of a gene may be present. The
variants are said to be alleles of that gene.
Map
unit
It is a measure
of genetic distance between two linked genes that corresponds
to a recombination frequency of 1%.
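The definition above can be turned into a one-line calculation. The sketch below assumes offspring have already been scored as recombinant or parental; the function name and the counts are hypothetical.

```python
def map_distance(recombinant_offspring, total_offspring):
    """Genetic distance in map units (1 map unit = 1% recombination)."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100 * recombinant_offspring / total_offspring


# 15 recombinants among 1000 offspring (illustrative numbers)
print(map_distance(15, 1000))
```

This simple ratio is only a reasonable estimate for closely linked genes, because multiple crossovers between distant loci go uncounted.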
Markov
chain
It is a multivariate
probability density whose independence diagram is a chain.
The variables are ordered, and each variable "depends" only
on its neighbors in the sense of being conditionally independent
of the others. Markov chains are an integral component
of hidden Markov models.
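A minimal sketch of a first-order Markov chain over DNA bases follows. Each base depends only on the base before it, so the probability of a sequence is the probability of its first base multiplied by one transition probability per following base. The initial and transition probabilities are invented for illustration and are not estimated from real data.

```python
# Hypothetical parameters: every row of TRANSITION sums to 1.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
TRANSITION = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}


def sequence_probability(seq):
    """Probability of `seq` under the first-order chain defined above."""
    if not seq:
        return 1.0
    prob = INITIAL[seq[0]]
    # Each base is conditioned only on its immediate predecessor.
    for previous, current in zip(seq, seq[1:]):
        prob *= TRANSITION[previous][current]
    return prob


print(sequence_probability("AAC"))
```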
Meiosis
Meiosis is the process
within the cell nucleus that results in the reduction of the
chromosome number from diploid (two copies of each chromosome)
to haploid (a single copy) through two successive divisions
in germ cells.
Melting
(of DNA)
It is the denaturation
of double-stranded DNA into two single strands by the application
of heat. (Denaturation breaks the hydrogen bonds holding the
double-stranded DNA together).
Messenger
RNA (mRNA)
The complementary
RNA copy of DNA formed from a single-stranded DNA template
during transcription. After processing in the nucleus it migrates
to the cytoplasm, where it carries the information to code
for a polypeptide.
Methylation
It is the addition
of -CH3 (methyl) groups to a target site. Typically such addition
occurs on the cytosine bases of DNA.
Microarray
A 2D array,
typically on glass, a filter, or a silicon wafer, upon which
genes or gene fragments are deposited or synthesized in a
predetermined spatial order allowing them to be made available
as probes in a high-throughput, parallel manner.
Microfluidics
Microfluidics is the miniaturization
of chemical reactions or pharmacological assays into microscopic
tubes or vessels in order to greatly increase their throughput
by placing many of them side-by-side in an array.
Mimetics
These are
compounds that mimic the function of other molecules via their
high degree of structural (conformational) similarity, and
hence similar physico-chemical properties.
Missense
mutation
It is a point
mutation in which one codon (triplet of bases) is changed
into another designating a different amino acid.
Mitosis
It is the nuclear
division that results in the replication of the genetic material
and its redistribution into each of the daughter cells during
cell division.
Modeling
In bioinformatics,
modeling usually refers to molecular modeling, a process whereby
the three-dimensional architecture of biological molecules
is interpreted (or predicted), visually represented, and manipulated
in order to determine their molecular properties.
More generally, a model is
a series of mathematical equations or procedures which simulate
a real-life process, given a set of assumptions, boundary
parameters, and initial conditions.
Monomer
It is a single
unit of any biological molecule or macromolecule, such as
an amino acid, nucleotide, polypeptide domain, or protein.
Monovalent
Having a single
binding site; strictly, an atom with only one free electron
available for binding in its highest energy shell.
Motif
It is a conserved
element of a protein sequence alignment that usually correlates
with a particular function. Motifs are generated from a local
multiple protein sequence alignment corresponding to a region
whose function or structure is known. It is sufficient that
it is conserved, and is hence likely to be predictive of any
subsequent occurrence of such a structural/functional region
in any other novel protein sequence.
Multigene
family
A set of genes
derived by duplication of an ancestral gene, followed by independent
mutational events resulting in a series of independent genes
either clustered together on a chromosome or dispersed throughout
the genome.
Multiplex
sequencing
Approach to
high-throughput sequencing that uses several pooled DNA samples
run through gels simultaneously and then separated and analyzed.
Mutagen
A mutagen is an agent that
can cause an increase in the rate of mutations in an organism.
Mutation
An inheritable
alteration to the genome that includes genetic (point or single
base) changes, or larger scale alterations such as chromosomal
deletions or rearrangements.
Naked DNA
Pure, isolated
DNA devoid of any proteins that may bind to it.
NCEs
(New Chemical Entities)
Compounds identified
as potential drugs that are sent from research and development
into clinical trials to determine their suitability.
Nested
PCR
It is defined
as the second round amplification of an already PCR-amplified
sequence using a new pair of primers which are internal to
the original primers. This is typically done when a single
PCR reaction generates insufficient amounts of product.
Neural
net
A neural net
is an interconnected assembly of simple processing elements,
units or nodes, whose functionality is loosely based on the
animal brain. The processing ability of the network is stored
in the inter-unit connection strengths, or weights, obtained
by a process of adaptation to, or learning from, a set of
training patterns. Neural nets are used in bioinformatics
to map data and make predictions, such as taking a multiple
alignment of a protein family as a training set in order to
identify novel members of the family from their sequence data
alone.
Nonsense
mutation
It is a point
mutation in which a codon specific for an amino acid is converted
into a nonsense (termination) codon.
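The difference between a nonsense mutation and a missense mutation can be sketched by translating the codon before and after the change. The example below assumes the standard genetic code and DNA codons written with T; the function name and the "silent" and "stop-loss" labels are illustrative additions, not terms defined in this glossary.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(codon, mutant):
    """Classify a single-base change between two DNA codons."""
    codon, mutant = codon.upper(), mutant.upper()
    differences = sum(a != b for a, b in zip(codon, mutant))
    if len(codon) != 3 or len(mutant) != 3 or differences != 1:
        raise ValueError("expected two codons differing at exactly one base")
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "silent"      # same amino acid (or still a stop)
    if after == "*":
        return "nonsense"    # amino-acid codon became a stop codon
    if before == "*":
        return "stop-loss"   # stop codon became an amino-acid codon
    return "missense"        # codon now designates a different amino acid


print(classify_point_mutation("GAG", "GTG"))  # Glu -> Val
print(classify_point_mutation("TAC", "TAA"))  # Tyr -> stop
```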
Northern
blotting
It is a technique
to identify RNA molecules by hybridization that is analogous
to Southern blotting.
Nuclease
Any enzyme
that can cleave the phosphodiester bonds of nucleic acid backbones.
Nucleoside
A five-carbon
sugar covalently attached to a nitrogenous base.
Nucleotide
A nucleic acid unit composed
of a five-carbon sugar joined to a phosphate group and a nitrogenous
base.
Object-Relational
Database
Object databases
combine the elements of object orientation and object-oriented
programming languages with database capabilities. They provide
more than persistent storage of programming language objects.
Object databases extend the functionality of object programming
languages (e.g., C++, Smalltalk, or Java) to provide full-featured
database programming capability. The result is a high level
of congruence between the data model for the application and
the data model of the database. Object-relational databases
are used in Bioinformatics to map molecular biological objects
(such as sequences, structures, maps and pathways) to their
underlying representations (typically within the rows and
columns of relational database tables.) This enables the user
to deal with the biological objects in a more intuitive manner,
as they would in the laboratory, without having to worry about
the underlying data model of their representation.
Oligonucleotide
It is a short
molecule consisting of several linked nucleotides (typically
between 10 and 60) covalently attached by phosphodiester bonds.
Open
reading frame (ORF)
Any stretch
of DNA that potentially encodes a protein. Open reading frames
start with a start codon, and end with a termination codon.
No termination codons may be present internally. The identification
of an ORF is the first indication that a segment of DNA may
be part of a functional gene.
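A minimal sketch of ORF detection under the definition above is given below. It assumes ATG as the start codon and TAA, TAG and TGA as termination codons, scans only the three forward reading frames of the given strand, and uses a hypothetical function name and test sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end, sequence) for each ORF on the forward strand.

    An ORF runs from an ATG to the first in-frame stop codon, so it can
    contain no internal stop. `end` is exclusive and includes the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGGTAACC"))
```

A fuller search would also scan the reverse complement and apply a minimum length, since short ORFs occur frequently by chance.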
Operator
An operator
is a segment of DNA that interacts with the products of regulatory
genes and thereby controls the transcription of one or more structural
genes.
Operon
An Operon is
a unit of transcription consisting of one or more structural
genes, an operator, and a promoter.
Ortholog
Orthologs are
genes in different species that evolved from a common ancestral
gene by speciation. Normally, orthologs retain the same function
in the course of evolution. Identification of orthologs is
critical for reliable prediction of gene function in newly
sequenced genomes.
Overlapping
clones
Overlapping clones
are a collection of cloned sequences made by generating randomly
overlapping DNA fragments with infrequently cutting restriction
enzymes.
Palindrome
A region of
DNA with a symmetrical arrangement of bases occurring about
a single point such that the base sequences on either side
of that point are identical (if the strands are both read
in the same direction) e.g. 5' GAATTC 3' whose
complementary sequence is 3' CTTAAG 5'.
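The symmetry described above can be checked by comparing a sequence with its reverse complement, which is the opposite strand read in the same 5' to 3' direction. This is a minimal sketch; the function names are hypothetical and only the unambiguous bases A, C, G and T are handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(dna):
    """True if both strands read the same in the 5' to 3' direction."""
    dna = dna.upper()
    return dna == reverse_complement(dna)


print(is_palindromic("GAATTC"))
print(is_palindromic("GATTAC"))
```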
Pattern
Molecular biological
patterns usually occur at the level of the characters making
up the gene or protein sequence. A pattern language must be
defined in order to apply different criteria to different
positions of a sequence. In order to have position-specific
comparison done by a computer, a pattern-matching algorithm
must allow alternative residues at a given position, repetitions
of a residue, exclusion of alternative residues, weighting,
and ideally, combinatorial representation.
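One way to sketch such a pattern language is to translate it into a regular expression. The example below assumes a PROSITE-style notation (elements separated by "-", "x" for any residue, "[...]" for alternatives, "{...}" for exclusions, "(n)" or "(n,m)" for repetition), which is an assumption of this example rather than something specified in this glossary. Weighting is not covered, and the function names and the test sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # exclusion
        element = element.replace("(", "{").replace(")", "}")   # repetition
        element = element.replace("x", ".")                     # any residue
        parts.append(element)
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (position, matched text) for non-overlapping matches."""
    regex = prosite_to_regex(pattern)
    return [(m.start(), m.group()) for m in re.finditer(regex, sequence)]


print(prosite_to_regex("C-x(2,4)-C"))
print(find_matches("N-{P}-[ST]-{P}", "MKNGSAANPTQ"))
```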
Pathways
Bioinformatics strives to define
representations of key biological data types, algorithms and
inference procedures, including sequences, structures, biological
pathways and reactions. Representing and computing with biological
pathways requires ontologies for representing pathway knowledge;
user interfaces to these databases; physico-chemical properties
of enzymes and their substrates in pathways; and pathway analysis
of whole genomes, including identifying common patterns across
species and species differences.
Paralog
Paralogs are genes related by
duplication within a genome. Orthologs retain the same function
in the course of evolution, whereas paralogs evolve new functions,
even if these are related to the original one.
Parameters
Parameters
are user-selectable values, typically experimentally determined,
that govern the boundaries of an algorithm or program. For
instance, selection of the appropriate input parameters governs
the success of a search algorithm. Some of the most common
search parameters in bioinformatics tools include the stringency
of an alignment search tool, and the weights (penalties) provided
for mismatches and gaps.
Peptide
A short stretch
of amino acids each covalently coupled by a peptide (amide)
bond.
Peptide
bond (amide bond)
A covalent
bond formed between two amino acids when the amino group of
one is linked to the carboxy group of another (resulting in
the elimination of one water molecule).
Phage
(Bacteriophage)
It is a virus
that infects bacterial cells and serves as a useful vector
for introducing genes into bacteria for a number of purposes.
Phage
display
It is a technique
in which phage are engineered to fuse a foreign peptide or
protein with their capsid (surface) proteins and hence display
it on their surfaces. The immobilized phage may then
be used as a screen to see what ligands bind to the expressed
fusion protein exhibited (displayed) on the phage surface.
Pharmacogenomics
It is the use
of (DNA-based) genotyping in order to target pharmaceutical
agents to specific patient populations. Genetic differences
are known to affect responses to many types of drug therapy,
and pharmacogenomics analysis serves to customize the use
of pharmaceuticals for specific subgroups of patients. The
rationale for this approach is that observed gene expression
differences may correlate with, and explain, the differences
in side effects and efficacy to drugs in humans.
Pharmacophore
The three dimensional
spatial arrangement of atoms, substituents, functional groups,
or chemical features that together are sufficient to describe
the pharmacologically active components of a drug molecule
or molecule series.
Phenotype
Phenotype is
the observable feature of an organism that is the result of
one or more genes.
Phylum
A phylum is one
of about 30 major groups into which the animal kingdom is
divided. The members of each phylum share the same
basic structure and organization. For instance, fish, birds,
and human beings belong to one phylum, the Chordata, because
all have a notochord at some stage of development.
Physical
map
A physical
map consists of a linearly ordered set of DNA fragments encompassing
the genome or region of interest. Physical maps are of two
types, macro-restriction maps and ordered clone maps. The
former consists of an ordered set of large DNA fragments generated
by using restriction enzymes whose recognition sequences are
infrequently represented in the genome. An ordered clone map
consists of an overlapping collection of cloned DNA fragments.
The DNA may be cloned into any one of the available vector
systems--YACs, cosmids, phage, or even plasmids. Major advantages
of ordered clone maps are that they are of high resolution
and directly provide the clones for further study.
Plasmid
A Plasmid can
be defined as any replicating DNA element that can exist in
the cell independently of the chromosomes. Synthetic plasmids
are used for DNA cloning. Plasmids are most commonly found in
bacterial cells.
Pleiotropy
Pleiotropy can
be defined as the multiple effects on an organism’s phenotype
due to a single gene or allele, e.g. the cytokines, which can
bind to multiple cellular receptors and affect growth and
multiple immune pathways.
Point
mutation
It is a mutation
in which a single nucleotide in a DNA sequence is substituted
by another nucleotide.
Poly
(A) tail
The stretch
of Adenine (A) residues at the 3' end of eukaryotic
mRNA that is added to the pre-mRNA as it is processed, before
its transport from the nucleus to the cytoplasm and subsequent
translation at the ribosome.
Polyadenylation
site
A site on the
3'-end of messenger RNA (mRNA) that signals the addition
of a series of Adenines during the RNA processing step and
before the mRNA migrates to the cytoplasm. These Poly
(A) "tails" increase mRNA stability and allow one to isolate
mRNA from cells by PCR-amplification using Poly (T) primers.
Polygenic
inheritance
It is inheritance
involving alleles at many genetic loci.
Polymerase
chain reaction (PCR)
Technique used
to amplify or generate large amounts of replica DNA of a segment
of any DNA whose "flanking" sequences are known. Oligonucleotide
primers which bind these flanking sequences are used by a
thermostable DNA polymerase (e.g. Taq polymerase) to copy the
sequence in between the primers. Cycles of heat to break apart
the DNA strands, cooling to allow the primers to bind, and
heating again to allow the enzyme to copy the intervening
sequence lead to a doubling of the target DNA at each cycle.
The reactions are typically carried out on a regulated heating
block and consist of 30-35 cycles of repeated amplification
of the target sequence. Single
molecules of "target" DNA can be amplified to microgram amounts
of DNA. The target DNA can be of any origin.
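The doubling at each cycle implies exponential growth, which the short sketch below makes concrete. It assumes ideal amplification by default; the `efficiency` parameter is a hypothetical addition for modelling less-than-perfect doubling and is not taken from the definition above.

```python
def copies_after_cycles(initial_copies, cycles, efficiency=1.0):
    """Copies of target DNA after `cycles` rounds of PCR.

    efficiency=1.0 means perfect doubling each cycle; lower values
    (an assumption of this sketch) model incomplete copying.
    """
    return initial_copies * (1 + efficiency) ** cycles


# A single target molecule after 30 ideal cycles: 2**30 copies
print(copies_after_cycles(1, 30))
```

In practice the reaction plateaus as primers and nucleotides run out, so real yields fall below this ideal figure.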
Polymorphism
The existence
of a gene in a population in at least two different forms
at a frequency far higher than that attributable to recurrent
mutation alone. Variations in a population may be measured
by determining the rate of mutation in polymorphic genes.
Polypeptide
A polypeptide
is a single chain of covalently attached amino acids joined
by peptide bonds. Polypeptide chains usually fold into a compact,
stable form (a domain) that is part (or all) of the final
protein.
Positional
cloning
Method used
to define the location of a gene on a chromosome and use this
information to identify and clone the gene. The location of
the gene is determined by linkage analysis of DNA from a large
family containing afflicted and normal members to identify
linkages between the transmission of the disease gene and
observable genetic markers. This information is then used
to screen (by chromosomal jumping and walking) the location
for putative genes. The disease gene must be compared between
the afflicted and normal family members and be shown to be
different in the two groups. The full sequencing of the gene
will then provide information regarding the characteristics
and function of the gene product, and a potential explanation
for the cause of the disease.
Post-transcriptional
modification
Alterations
made to pre-mRNA before it leaves the nucleus and becomes
mature mRNA.
Post-translational
modification
Alterations
made to a protein after its synthesis at the ribosome. These
modifications, such as the addition of carbohydrate or fatty
acid chains, may be critical to the function of the protein.
Primary
sequence (protein)
It is the linear
sequence of amino acids in a polypeptide or protein.
Primer
It is a short
oligonucleotide that provides a free 3' hydroxyl for
DNA or RNA synthesis by the appropriate polymerase (DNA polymerase
or RNA polymerase).
Probe
Any biochemical
that is labelled or tagged in some way so that it can be used
to identify or isolate a gene, RNA, or protein.
Profile
Sequence profiles
are usually derived from multiple alignments of sequences
with a known relationship, and consist of tables of position-specific
scores and gap-penalties. Each position in the profile contains
scores for all of the possible amino acids, as well as one
penalty score for opening and one for continuing a gap at
the specified position. Attempts have been made to further
improve the sensitivity of the profile by refining the procedures
to construct a profile starting from a given multiple alignment.
Other representations for sequence domains or motifs, such as
hidden Markov models, do not necessarily require the presence
of a correct and complete multiple alignment.
Prokaryote
It is an organism
or cell that lacks a membrane-bounded nucleus. Bacteria (including
the cyanobacteria, formerly called blue-green algae) and archaea
are prokaryotes.
Promoter
(site)
A promoter
site is defined by its recognition by eukaryotic RNA polymerase
II; by its activity in a higher eukaryote; by experimental
evidence, or homology and sufficient similarity to an experimentally
defined promoter; and by observed biological function.
Protein
families
Sets of proteins
that share a common evolutionary origin, reflected by their
relatedness in function and usually by similarities
in sequence, or in primary, secondary or tertiary structure.
They are the subsets of proteins with related structure and
function.
Proteome
Proteome is
the entire protein complement of a given organism.
Proteomics
Proteomics
is the study of the proteome. Typically, it involves the cataloging of
all the expressed proteins in a particular cell or tissue
type, obtained by identifying the proteins from cell extracts
using a combination of 2D gel electrophoresis and mass spectrometry.
Purine
It is a nitrogen-containing
compound with a double-ring structure. Purine is the parent
compound of Adenine and Guanine.
Pyrimidine
Pyrimidine is a nitrogen-containing
compound with a single six-membered ring structure. It is
the parent compound of Thymine, Cytosine and Uracil.
Query (sequence)
A DNA, RNA
or protein sequence used to search a sequence database in
order to identify close or remote family members (homologs)
of known function, or sequences with similar active sites
or regions (analogs), from which the function of the query
may be deduced.
Rational
drug design (Structure based drug design)
The development
of drugs based on the 3-dimensional molecular structure of
a particular target.
Reading
frame
A Reading frame
is a sequence of codons beginning with an initiation codon
and ending with a termination codon, typically of at least
150 bases (50 amino acids) coding for a polypeptide or protein
chain.
Reagents
Reagents are
the sources of biological or chemical material that can be
used as the starting blocks in laboratory experiments. Reagents
can range from chemicals needed to perform a particular chemical
reaction, constituents of a laboratory protocol, or clones
to be used in a large-scale gene expression study.
Recessive
Any trait that
is expressed phenotypically only when the responsible allele is
present in both copies of a gene (i.e. in the homozygous state).
Recombinant
DNA (rDNA)
Recombinant
DNA is the DNA molecule resulting from the fusion of DNA from
different sources. The term also covers the technology employed
for splicing DNA from different sources and for amplifying the
resultant heterogeneous DNA.
Recombination
It is a new
combination of alleles resulting from the rearrangement occurring
by crossing-over or by independent assortment.
Recursion
An algorithmic
procedure whereby an algorithm calls on itself to perform
a calculation until a terminating condition is met (for example,
the result exceeds a threshold), in which case the algorithm
exits. Recursion is a powerful procedure with which to process
data and can be computationally efficient.
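A minimal sketch of recursion with a threshold follows: the function calls itself with a doubled value until the value reaches the target, at which point it exits. The function name and the doubling scenario are hypothetical illustrations.

```python
def doublings_needed(copies, target):
    """Number of doublings required for `copies` to reach `target`."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    if copies >= target:  # threshold reached: the recursion exits here
        return 0
    # Otherwise the function calls itself on the doubled value.
    return 1 + doublings_needed(copies * 2, target)


print(doublings_needed(1, 1000))
```

Each call handles one doubling and delegates the rest, so the threshold test is what guarantees that the chain of calls ends.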
Regulatory
gene
It is a DNA
sequence that functions to control the expression of other
genes by producing a protein that modulates the synthesis
of their products (typically by binding to the gene promoter).
Relational
Database
A database
that follows E. F. Codd's 12 rules, a series of mathematical
and logical steps for the organization and systemization of
data into a software system that allows easy retrieval, updating,
and expansion. An RDBMS stores data in a database consisting
of one or more tables of rows and columns. The rows correspond
to a record (tuple); the columns correspond to attributes
(fields) in the record. In an RDBMS, a view, defined as a
subset of the database that is the result of the evaluation
of a query, is a table. RDBMSs use Structured Query Language
(SQL) for data definition, data management, and data access
and retrieval. Relational and object-relational databases
are used extensively in bioinformatics to store sequence and
other biological data.
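The rows, columns, views and SQL mentioned above can be illustrated with Python's built-in `sqlite3` module. This is only a sketch: the table layout, the accession labels and the sequence lengths are invented for the example.

```python
import sqlite3

# In-memory database; the schema and the records are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    "accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)
# Each tuple is one row (record); each position is one column (attribute).
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("SEQ001", "Homo sapiens", 1500),
        ("SEQ002", "Mus musculus", 800),
        ("SEQ003", "Homo sapiens", 2300),
    ],
)
# A view is the stored result definition of a query and behaves like a table.
conn.execute(
    "CREATE VIEW human_sequences AS "
    "SELECT accession, length FROM sequences "
    "WHERE organism = 'Homo sapiens'"
)
rows = conn.execute(
    "SELECT accession, length FROM human_sequences ORDER BY length DESC"
).fetchall()
print(rows)
```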
Relational
Database Management Systems (RDBMS)
It is a software
system that includes database architecture, query language,
and data loading and updating tools and other ancillary software
that allows the creation of a relational database application.
Repeats
(repeat sequences)
Repeat sequences
and approximate repeats occur throughout the DNA of higher
organisms (mammals). For example, the Alu sequences
of length about 300 characters appear hundreds of thousands
of times in human DNA with about 87% similarity to a consensus
Alu string. Some short substrings such as TATA-boxes,
poly-A and (TG)* also appear more often than by chance. Repeat
sequences may also occur within genes, as mutations or alterations
to those genes. Repetitive sequences, especially mobile elements,
have many applications in genetic research. DNA transposons
and retroposons are routinely used for insertional mutagenesis,
gene mapping, gene tagging, and gene transfer in several model
systems.
Repetitive
elements
Repetitive
elements provide important clues about chromosome dynamics,
evolutionary forces, and mechanisms for exchange of genetic
information between organisms. The most ubiquitous class of
repetitive elements in the DNA sequence in primate genomes
is the Alu family of interspersed repeats, which have
arisen in the last 65 million years of evolution. Alu
repeats belong to a class of sequences defined as short interspersed
elements (SINEs). Approximately 500,000 Alu SINEs exist
within the human genome, representing about 5% of the genome
by mass.
Replication
It is the synthesis
of an informationally identical macromolecule (e.g. DNA) from
a template molecule.
Repressor
It is the protein
product of a regulatory gene that combines with a specific
operator (regulatory DNA sequence) and hence blocks the transcription
of genes in an operon.
Restriction
enzyme (restriction endonuclease)
A type of enzyme
that recognizes specific DNA sequences (usually palindromic
sequences 4, 6, 8 or 16 base pairs in length) and produces
cuts on both strands of DNA containing those sequences only.
They are called the "molecular scissors" of rDNA technology.
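A simplified digestion can be sketched as a string search for the recognition sequence. The example assumes a linear molecule, looks at one strand only, and uses the EcoRI site GAATTC with the cut placed after the first G; the function name and the test sequence are hypothetical.

```python
def digest(dna, site, cut_offset):
    """Split a linear DNA string at every occurrence of `site`.

    `cut_offset` is the number of bases from the start of the site to
    the cut on the given strand (1 for G^AATTC).
    """
    dna, site = dna.upper(), site.upper()
    fragments = []
    previous_cut = 0
    position = dna.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(dna[previous_cut:cut])
        previous_cut = cut
        position = dna.find(site, position + 1)
    fragments.append(dna[previous_cut:])  # remainder after the last cut
    return fragments


print(digest("AAGAATTCTTGAATTCGG", "GAATTC", 1))
```

The sketch ignores the staggered cut on the opposite strand, so it reports fragment boundaries but not the overhangs left at the ends.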
Restriction
fragment length polymorphisms (RFLPs)
Variation within
the DNA sequences of organisms of a given species that can
be identified by fragmenting the sequences using restriction
enzymes, since the variation lies within the restriction site.
RFLPs can be used to measure the diversity of a gene in a
population.
Restriction
map
A physical
map or depiction of a gene (or genome) derived by ordering
overlapping restriction fragments produced by digestion of
the DNA with a number of restriction enzymes.
Reverse
Genetics
It is the use
of protein information to elucidate the genetic sequence encoding
that protein. It is used to describe the process of gene isolation
starting with a panel of afflicted patients.
Reverse
transcriptase
It is a DNA
polymerase that can synthesize a complementary DNA (cDNA)
strand using RNA as a template - a so-called RNA-dependent
DNA polymerase.
Reverse
transcriptase-PCR (RT-PCR)
It is the procedure
in which PCR amplification is carried out on DNA that is first
generated by the conversion of mRNA to cDNA using reverse
transcriptase.
Ribonucleic
acid (RNA)
A category of
nucleic acids in which the component sugar is ribose and which
consists of nucleotides carrying the four bases Cytosine, Uracil,
Guanine, and Adenine.
The three types of RNA are messenger RNA (mRNA), transfer
RNA (tRNA) and ribosomal RNA (rRNA).
Secondary
structure (protein)
The organization
of the peptide backbone of a protein that occurs as a result
of hydrogen bonds, e.g. alpha helix, beta-pleated sheet.
Selectivity
Selectivity
of bioinformatics similarity search algorithms is defined
as the significance threshold for reporting database sequence
matches. As an example, for BLAST searches, the parameter
E is interpreted as the upper bound on the expected frequency
of chance occurrence of a match within the context of the
entire database search. E may be thought of as the number
of matches one expects to observe by chance alone during the
database search.
Sense
strand
It is the strand
of double-stranded DNA whose sequence corresponds to that of
the RNA transcript (with T in place of U); the complementary
antisense strand acts as the template for RNA synthesis.
Typically only one gene product is produced
per gene, encoded on the sense strand only. (Some viruses
have open reading frames in both the sense and the antisense
strands).
Sensitivity
Sensitivity
of bioinformatics similarity search algorithms centers around
two areas: First, how well can the method detect biologically
meaningful relationships between two related sequences in
the presence of mutations and sequencing errors; secondly
how does the heuristic nature of the algorithm affect the
probability that a matching sequence will not be detected.
At the user's discretion, the speed of most similarity search
programs can be sacrificed in exchange for greater sensitivity
- with an emphasis on detecting lower scoring matches.
Sequence
Tagged Site (STS)
STS is a unique
sequence from a known chromosomal location that can be amplified
by PCR. STSs act as physical markers for genomic mapping and
cloning.
Sexual
PCR (Molecular Diversity)
Sexual PCR
is a form of PCR in which similar, but not identical, DNA
sequences are reassembled to obtain novel juxtapositions,
simulating the result of genetic recombination. The result
is the creation of an array of related genes which may possess
improved characteristics. By repeated rounds of recombination,
selection and PCR-based amplification vastly improved gene-products,
such as enzymes with greater activity, may be generated and
selected.
Shotgun
cloning
Shotgun cloning
is the cloning of an entire gene segment or genome by generating
a random set of fragments using restriction endonucleases
to create a gene library that can be subsequently mapped and
sequenced to reconstruct the entire genome.
Similarity
(homology) search
Given a newly
sequenced gene, there are two main approaches to the prediction
of structure and function from the amino acid sequence. Homology
methods are the most powerful and are based on the detection
of significant extended sequence similarity to a protein of
known structure, or of a sequence pattern characteristic of
a protein family. Statistical methods are less successful
but more general and are based on the derivation of structural
preference values for single residues, pairs of residues,
short oligopeptides or short sequence patterns. The transfer
of structure/function information to a potentially homologous
protein is straightforward when the sequence similarity is
high and extended in length, but the assessment of the structural
significance of sequence similarity can be difficult when
sequence similarity is weak or restricted to a short region.
Signal
sequence (leader sequence)
A short sequence
added to the amino-terminal end of a polypeptide chain that
forms an amphipathic helix allowing the nascent polypeptide
to migrate through membranes such as the endoplasmic reticulum
or the cell membrane. It is cleaved from the polypeptide after
the protein has crossed the membrane.
Single
nucleotide polymorphisms (SNPs)
Variations
of single base pairs scattered throughout the human genome
that serve as measures of the genetic diversity in humans.
About 1 million SNPs are estimated to be present in the human
genome, and SNPs are useful markers for gene mapping studies.
Single-pass
sequencing
Rapid sequencing
of large segments of the genome of an organism by isolating
as many expressed (cDNA) sequences as possible and performing
single sequencer runs on their 5' or 3' ends.
Single-pass sequencing typically results in individual, error-prone
sequencing reads of 400-700 bases, depending on the type of
sequencer used. However, if many of these are generated from
numerous clones from different tissues, they may be overlapped
and assembled to remove the errors and generate a contiguous
sequence for the entire expressed gene.
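The error-removal step can be illustrated with a minimal sketch. It assumes the reads have already been overlapped and padded with '-' to a common length (the alignment itself is not shown), and the function name is hypothetical. Each column is resolved by a simple majority vote, which is only one of several possible consensus rules.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over reads padded with '-' to equal length."""
    result = []
    for column in zip(*aligned_reads):
        # Ignore padding so that only reads covering this position vote.
        bases = [base for base in column if base != "-"]
        if bases:
            result.append(Counter(bases).most_common(1)[0][0])
    return "".join(result)

# Three overlapping reads; the first carries an error at position 4.
reads = ["ACGTTCGA--", "--GTACGATT", "ACGTACG---"]
print(consensus(reads))
```

With more reads covering each position, a single sequencing error is increasingly likely to be outvoted.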
Site
Sites in sequences
can be located either in DNA (e.g. binding sites, cleavage
sites) or in proteins. In order to identify a site in DNA,
ambiguity symbols are used to allow several different symbols
at one position. Proteins, however, need a different mechanism.
Restriction enzyme cleavage sites, for instance, have the
following properties: limited length (typically, less
than 20 base pairs); definition of the cleavage site and its
appearance (3', 5' overhang or blunt); definition of the binding
site.
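As a sketch of how ambiguity symbols work in practice, the example below expands the standard IUPAC nucleotide codes into a regular expression and scans a DNA string for a site. The function name is an assumption made for illustration; the two patterns used in the usage lines are the EcoRI site (GAATTC) and the HincII site (GTYRAC).

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_sites(pattern, dna):
    """Return 0-based start positions where an IUPAC pattern matches dna."""
    regex = "".join(IUPAC[symbol] for symbol in pattern.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    return [m.start() for m in re.finditer("(?=" + regex + ")", dna.upper())]

print(find_sites("GAATTC", "AAGAATTCTT"))    # unambiguous site
print(find_sites("GTYRAC", "GTCGACGTTAAC"))  # Y = C or T, R = A or G
```

Only the forward strand is scanned here; a fuller search would also check the reverse complement for non-palindromic sites.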
Southern
blotting
A procedure
for the identification of DNA by transferring a fragment isolated
on an agarose gel to a nitrocellulose filter, where it can
be hybridized with a complementary "probe" sequence.
Splice
site
The sequence
found at the 5' and 3' region of exon/intron
boundaries, usually defined by a consensus sequence:
            Intron
5' CAGGTAAGT---------TNCAGG 3'
   A    G            C    T
N represents
any nucleotide; the bottom line represents alternative nucleotides
at the indicated positions.
Splice
form
By using alternative
splicing, a single message precursor from DNA can generate
an entire family of mRNAs and proteins. This can be utilized
to create specificity in cell-cell or cell-ligand interactions.
A cell may produce a given protein, but it will be a different
splice-form of the protein than that produced by an adjacent
cell. In this manner, the two cells have the potential to
interact differently with other cells or molecules. Two places
where this has been extremely important are in the production
of cell-surface specificity proteins in the immune and nervous
systems.
Splicing
Splicing is
the joining together of separate DNA or RNA
component parts. For example, RNA splicing in eukaryotes involves
the removal of introns and the stitching together of the exons
from the pre-mRNA transcript before maturation.
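The removal of introns can be sketched as simple string slicing. The example assumes the exon coordinates are already known and are given as 0-based, half-open (start, end) intervals; the function name and the toy transcript are made up for illustration.

```python
def splice(pre_mrna, exons):
    """Join exon segments given as (start, end) 0-based half-open intervals."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

# Toy pre-mRNA: exon 1, an intron beginning GU and ending AG, then exon 2.
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "GGAUAA"
print(splice(pre_mrna, [(0, 6), (18, 24)]))
```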
Solvent
accessibility
The surface
area (typically measured in square angstroms) of a biological
molecule, usually a protein, which is exposed to solvent in
its native, folded form. Determining the solvent accessibility
of a protein helps define which amino acids in its molecular
sequence are on the exterior of the molecule, and thus available
to participate in interactions with other molecules.
Structural
gene
It is the gene
which encodes a structural protein.
Structure
prediction
It is done
by algorithms that predict the secondary, tertiary and sometimes
even quaternary structure of proteins from their sequences.
Determining protein structure from sequence has been dubbed
"the second half of the Genetic Code" since it is the folded
tertiary structure of a protein that governs how it functions
as a gene product. As yet most structure prediction
methods are only partially successful, and typically work
best for certain well-defined classes of proteins.
Substitution
matrix
A model of
protein evolution at the sequence level, expressed as a matrix
of scores for replacing one amino acid with another. Widely used
examples are the Dayhoff, MDM (Mutation Data Matrix) or PAM
(Percent Accepted Mutation) matrices and the BLOSUM matrices.
PAM matrices are derived from global alignments of closely
related sequences, and matrices for greater evolutionary distances
are extrapolated from those for lesser ones. BLOSUM matrices are
derived instead from ungapped local alignment blocks.
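The extrapolation to greater evolutionary distances can be sketched as repeated matrix multiplication, which is how PAM matrices for larger distances are obtained from the one-step mutation probability matrix. The two-letter alphabet and its probabilities below are invented toy values, not real amino acid data, and the conversion to log-odds scores is omitted.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def extrapolate(step_matrix, steps):
    """Raise a one-step mutation probability matrix to the given power."""
    n = len(step_matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, step_matrix)
    return result

# Hypothetical one-step matrix for a toy two-letter alphabet (rows sum to 1).
TOY_STEP = [[0.99, 0.01],
            [0.02, 0.98]]

distant = extrapolate(TOY_STEP, 250)
print([round(p, 3) for p in distant[0]])
```

As the number of steps grows, each row drifts away from the identity and approaches the background composition, which is why matrices for large distances are more tolerant of substitutions.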
Subtraction
library
A cDNA library
that only contains cDNAs uniquely expressed in a given cell
or tissue. E.g. T cells and B cells will express many common
RNAs, as well as a very small percentage which will be unique
for T cells and B cells respectively. To make a T cell subtraction
library, the cDNA from a T cell library is hybridized with
a vast excess of B cell RNA. The commonly expressed genes
will result in RNA-cDNA hybrids which can be removed (or subtracted)
to leave only T cell specific cDNAs.
Tentative
Consensus (TC)
TC is the identification
of a sequence from an EST cluster that represents part or
all of a complete gene. TCs are usually determined by
clustering ESTs allowing for sequencing errors, artifacts
such as chimeric clones, and naturally occurring biological
phenomena such as alternative splicing. Creation of
a cluster allows one to generate a consensus sequence and
then identify a long open reading frame which would suggest
the possibility of that consensus representing a bona fide
gene.
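The search for a long open reading frame in a consensus sequence can be sketched as follows. This is a simplified illustration, not the procedure used by any particular TC pipeline: it scans only the three forward-strand frames, requires an ATG start, and uses the standard stop codons. The function name is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def longest_orf(dna):
    """Longest ATG-to-stop ORF on the forward strand, or '' if none."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if i + 3 - start > len(best):  # keep the longest one seen
                    best = dna[start:i + 3]
                start = None
    return best

print(longest_orf("CCATGAAATAGGATGCCCGGGTTTTGA"))
```

A fuller version would also scan the reverse complement, since the orientation of an EST-derived consensus is not always known.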
Tentative
Human Consensus sequences (THCs)
A consensus
sequence generated from human EST fragments. THCs may be validated
by comparison against databases of known human gene sequences,
human genomic sequences, or by identification of the ORFs
or other sequence features contained within the consensus
as belonging to a known human gene product.
Tertiary
structure
It is the folding
of a protein chain via interactions of its side chains,
including the formation of disulphide bonds between cysteine residues.
Thymine
A pyrimidine
base found in DNA but not in RNA.
Tissue
It is the section
of an organ that consists of a largely homogenous population
of cell types. Since many organs are multifunctional, they
have developed highly specialized cell types to perform different
functions. Identifying the section of an organ that is homogenous
for a particular cell type ensures that the gene expression
profiles extracted from those cells will accurately resemble
the class of cells that make up the tissue.
Transcript
A transcript
is a single-stranded mRNA chain that is assembled from a gene
template.
Transcription
It is the assembly
of complementary single-stranded RNA on a DNA template.
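Base pairing during transcription can be sketched in a few lines. The example assumes the DNA template strand is written 3' to 5', so that the RNA comes out 5' to 3' without reversing; the function name is hypothetical.

```python
# Each DNA template base pairs with one RNA base (A pairs with U, not T).
PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3to5):
    """RNA (5'->3') complementary to a template strand written 3'->5'."""
    return "".join(PAIR[base] for base in template_3to5.upper())

print(transcribe("TACGGT"))
```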
Transcription
factors
A group of
regulatory proteins that is required for transcription in
eukaryotes. Transcription factors bind to the promoter region
of a gene and facilitate transcription by RNA polymerase.
Transfer
RNA (tRNA)
A small RNA
molecule that recognizes a specific amino acid, transports
it to a specific codon in the mRNA, and positions it properly
in the nascent polypeptide chain.
Transformation
A genetic alteration
to a cell as a result of the incorporation of DNA from a genetically
different cell or virus; can also refer to the introduction
of DNA into bacterial cells for genetic manipulation.
Transgene
A transgene
is a foreign gene that is introduced into a cell or whole
organism (e.g. transgenic mice) for therapeutic or experimental
purposes.
Translation
It is the process
of converting RNA to protein by the assembly of a polypeptide
chain from an mRNA molecule at the ribosome.
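The decoding step can be sketched with the standard genetic code. The example below is a simplification: it starts at the first AUG it finds, reads codons until a stop codon or the end of the sequence, and ignores everything the ribosome does beyond codon lookup. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, listed in UCAG order for each codon position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Translate from the first AUG to the first stop codon ('*')."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GGAUGGCCAAAUGGUAAGG"))
```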
Transmembrane
region
It is the region
of a transmembrane protein that actually spans the membrane.
Transmembrane regions are usually hydrophobic in order to
be thermodynamically compatible with the lipid bilayer portion
of the membrane. They may consist of either alpha-helical
or beta-strand secondary structure elements, but in either
case the external residues (the ones facing the membrane)
are invariably hydrophobic while the internal residues may
be hydrophilic (as in the case of a pore or channel) or polar.
One common transmembrane structural domain is the seven-helix
bundle seen in numerous channel proteins.
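Because transmembrane regions are hydrophobic, a common first-pass way to look for them is a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle hydropathy scale; the 19-residue window and the 1.6 threshold are commonly quoted settings that are treated here as assumptions, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of the given length."""
    scores = [KD[aa] for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to span a membrane."""
    profile = hydropathy_profile(protein, window)
    return [i for i, mean in enumerate(profile) if mean >= threshold]

# Toy sequence: a 19-residue leucine stretch flanked by charged residues.
print(candidate_tm_windows("DDDDD" + "L" * 19 + "KKKKK"))
```

A profile of this kind flags candidates only; it does not distinguish helices from strands or predict topology.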
Unidentified reading frame
(URF)
An open reading
frame encoding a protein of undefined function.
Uracil
A nitrogenous pyrimidine
base found in RNA but not in DNA.
Variable
numbers of tandem repeats (VNTRs)
DNA sequence
blocks of 2-60 base pairs which are repeated from two to more
than 20 times in different individuals. This polymorphism
makes VNTRs very useful DNA markers used in genomic mapping,
linkage analysis and also DNA fingerprinting.
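The polymorphism can be illustrated by counting how many back-to-back copies of a repeat unit each individual carries. This is a minimal sketch that assumes the repeat unit is already known; the function name and the toy sequences are made up for illustration.

```python
import re

def max_tandem_repeats(dna, unit):
    """Largest number of back-to-back copies of unit found in dna."""
    runs = re.findall("(?:" + re.escape(unit) + ")+", dna)
    return max((len(run) // len(unit) for run in runs), default=0)

# Two hypothetical individuals differing in copy number at the same locus.
print(max_tandem_repeats("TTGATAGATAGATACC", "GATA"))
print(max_tandem_repeats("TTGATAGATACC", "GATA"))
```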
Variation
(genetic)
Variation in
genetic sequences and the detection of DNA sequence variants
genome-wide allow studies relating the distribution of sequence
variation to a population history. This in turn allows one
to determine the density of SNPs or other markers needed for
gene mapping studies. Quantification of these variations
together with analytical tools for studying sequence variation
also relates genetic variations to phenotype.
Vector
It is an agent
that transfers material (typically DNA) from one host to another.
Typically, DNA vectors are either autonomous DNA elements (such as
plasmids) that can be manipulated and integrated into a host’s
DNA, or recombinant viruses.
Virtual
libraries
Vast collections
of molecular structures created and stored virtually in
an electronic database. These databases
may be queried for subsets that exhibit specific physicochemical
features, or may be "virtually screened" for their ability
to bind a drug target. This process may be performed prior
to the synthesis and testing of the molecules themselves.
Visualization
Visualization is the process
of representing abstract scientific data as images that can
aid in understanding the meaning of the data.
Weight
matrix
The frequency
with which each element occurs at each position of a set of known
sites (such as binding sites) can be used to derive a ratio of
observed to expected frequency for each element in a pattern of interest.
The individual ratios of all elements are
then combined to build a scoring profile known as
a weight matrix. This profile can be used to score candidate
sequences for the selected pattern and to test the ability
of an algorithm to discriminate them from non-pattern sequences.
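One common way to build such a scoring profile is a log-odds position weight matrix, sketched below. The uniform background of 0.25 per base, the pseudocount of 1, the function names and the example sites are all assumptions chosen for illustration.

```python
import math

def build_weight_matrix(sites, background=0.25, pseudocount=1.0):
    """Per-position log2 ratio of observed to background base frequency."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total)
                            / background)
            for base in "ACGT"
        })
    return matrix

def score(matrix, candidate):
    """Sum the per-position weights for a candidate of the same length."""
    return sum(column[base] for column, base in zip(matrix, candidate))

# Hypothetical aligned sites of equal length.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
matrix = build_weight_matrix(sites)
print(round(score(matrix, "TATAAT"), 2), round(score(matrix, "GCGCGC"), 2))
```

Sequences resembling the training sites score above zero and unrelated sequences score below it, which is the discrimination the profile is meant to provide.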
Western
blot
It is the technique
in which specific antibodies are used to identify their antigens
from a mixture of proteins. Typically, these protein mixtures
are first separated by electrophoresis and then transferred
onto nylon sheets by electrotransfer. Radiolabeled or enzyme-linked
antibodies are incubated with the sheets and unbound antibodies
are washed away, allowing the position of the bound antibody to
be revealed by autoradiography or by the color formed upon
addition of a substrate.
Wild
type
Form of a gene
or allele that is considered the "standard" or most common.
X
chromosome
In mammals, the
sex chromosome that is found in two copies in the homogametic
sex (female in humans) and one copy in the heterogametic sex
(male in humans).
Yeast
2-hybrid system
A yeast-based
method used to simultaneously identify, and clone the gene
for, proteins interacting with a known protein. The basis
of this method is a "transcriptional reporter assay" in which
reporter gene expression is dependent on two domains. The
first domain is linked to the known protein. The second domain
is genetically linked to a library. If the library is screened
against the known protein the two domains will interact only
if a protein from the library binds the known protein, resulting
in transcription activation of the reporter gene, and a blue
color. The "blue yeast clone" will contain the gene encoding
the newly identified protein.
Z-DNA
A conformation
of DNA existing as a left-handed double helix (the phosphate-sugar
backbone forms a left-handed zigzag course), which may play
a role in gene regulation.
Zinc
fingers
A protein motif formed by the interaction
of repeated cysteine and histidine residues with a zinc ion.
The spacing of the repeats results in finger-like arrangements
of the protein loops formed from the interaction, which in
turn interact with DNA. These motifs are typically found
in transcription factors.