A strategy for sampling cDNA libraries for detecting 'novel' EST sequences
by
H. Nihal De Silva
The Horticulture and Food Research Institute of New Zealand Ltd.
Coauthors: Alistair J. Hall, William Laing
Scanning EST databases for sequence homology has proved to be an extremely useful approach for discovering novel genes. Genomic data on Expressed Sequence Tags (ESTs) are generated by sequencing a sample of colonies from a cDNA library. The raw EST data are most often highly redundant. Usually, the data are processed further into datasets of singletons or contigs before BLAST searching against public databases. In our definition, a novel (or rare) EST is one that matches no other sequence in a given list of databases. Initially, the proportion of ‘novel’ ESTs in a library will drop sharply as more libraries are sequenced. Considering that some libraries are more ‘novel’ than others, the question asked is: What criterion should we use to decide whether to start a new library or continue to sequence the same library?
A natural approach to the problem is to minimise the expected cost per novel sequence found. We denote the proportion of novel ESTs in any library by the random variable \Pi, which we assume is beta distributed. For a specific value \Pi = \pi for a sampled library, we assume that the number of novel ESTs in a sample of size n is a random variable X with a binomial distribution. Given that x novel ESTs are found in a sequential sample of size n_1, we use Bayes’ theorem to calculate the posterior distribution of \Pi. We let the cost of creating a new library be C_{new} and the marginal cost of continued sequencing be C_1. Based on these costs and the expected number of novel ESTs at any given time of sampling a library, we provide a decision rule for whether to continue sequencing the old library or create a new one.
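The beta-binomial updating described above can be illustrated with a minimal sketch. This is not the authors' decision rule, which the abstract does not spell out. It assumes a Beta(a, b) prior on \Pi, uses the standard conjugate result that the posterior after x novel ESTs in n_1 sequences is Beta(a + x, b + n_1 - x), and compares two ratios of expected cost to expected number of novel ESTs. All function names, the batch size m, and the comparison itself are hypothetical choices made for illustration.

```python
def posterior_params(a, b, x, n1):
    """Conjugate update: Beta(a, b) prior plus x novel ESTs out of n1
    sequenced gives a Beta(a + x, b + n1 - x) posterior."""
    if not 0 <= x <= n1:
        raise ValueError("x must satisfy 0 <= x <= n1")
    return a + x, b + (n1 - x)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def cost_per_novel_continue(a, b, x, n1, c1):
    """Assumed criterion: marginal sequencing cost divided by the posterior
    mean proportion of novel ESTs in the current library."""
    post_a, post_b = posterior_params(a, b, x, n1)
    return c1 / beta_mean(post_a, post_b)


def cost_per_novel_new(a, b, c_new, c1, m):
    """Assumed criterion: cost of a new library plus m sequences, divided by
    the expected number of novel ESTs among them under the prior mean."""
    return (c_new + m * c1) / (m * beta_mean(a, b))


def should_start_new_library(a, b, x, n1, c_new, c1, m):
    """Hypothetical rule: switch when a fresh library looks cheaper per
    expected novel EST than continuing with the current one."""
    return cost_per_novel_new(a, b, c_new, c1, m) < cost_per_novel_continue(
        a, b, x, n1, c1
    )


if __name__ == "__main__":
    # Illustrative numbers only: prior mean 0.1, library creation cost 100,
    # cost 1 per sequence, planned batch of 100 sequences from a new library.
    print(should_start_new_library(a=1, b=9, x=1, n1=50, c_new=100, c1=1, m=100))
```

The sketch treats every further sequence from a library as novel with the same probability, which follows the binomial assumption but ignores any change in redundancy as sampling proceeds. The ratio of expected cost to expected count is also only a proxy for the expected cost per novel sequence.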
Date received: August 21, 2001
Copyright © 2001 by the author(s). The author(s) of this document and the organizers of the conference have granted their consent to include this abstract in Atlas Conferences Inc. Document # cahg-20.