Monday, November 14, 2011

How much data storage is needed for one ejaculate?

This all began with a joke.  Someone calculated that the information in one man's ejaculate amounts to more than 1 PB, and then praised the 'thoughtfulness' of men.



That calculation is inflated because it stores the same data over and over.  Below is a more efficient way to store the information.

Human DNA has around 3 × 10^9 base pairs per haplotype.  Each position holds one of only four bases (A, T, G, C), so two bits are enough to encode it, and one byte can hold 4 base pairs.  To store a man's two haplotypes, we need 3 × 10^9 (bp/haplotype) × 2 (haplotypes) / 4 (bp/byte) / 1024^3 (bytes/GB) = 1.40 GB.
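As a rough illustration of the two-bit encoding, here is a minimal Python sketch.  The mapping of bit patterns to bases and the names `CODE`, `pack_bases` and `genome_size_gb` are my own assumptions for the example; any fixed assignment would work equally well.

```python
# Hypothetical 2-bit code: which pattern goes with which base is arbitrary.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_bases(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte.

    A final chunk shorter than four bases is zero-padded on the right.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk
        out.append(byte)
    return bytes(out)

def genome_size_gb(bp_per_haplotype: float = 3e9, haplotypes: int = 2) -> float:
    """Storage for the full diploid sequence at 4 bp per byte, in GB (1024^3 bytes)."""
    return bp_per_haplotype * haplotypes / 4 / 1024**3

print(len(pack_bases("ACGTACGT")))   # 8 bases occupy 2 bytes
print(round(genome_size_gb(), 2))    # the 1.40 GB figure from the text
```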

A normal sperm is produced through meiosis: its genome is spliced together from the two haplotypes the man carries.  Given the man's full DNA sequence, one bit (0/1) indicates which haplotype the sperm starts from, and after that only the cross-over points need to be recorded.  On average, about 30 cross-over points are expected per sperm, and each point fits in a 4-byte integer.  A man's ejaculate typically contains about 280 million sperm (about 8 billion for a boar).

Thus, to store them, we need 2.8 × 10^8 × 30 × 4 / 1024^3 = 31.3 GB.

Altogether, the storage needed is 32.7 GB, which fits on a dual-layer Blu-ray disc (50 GB).  With standard compression software, e.g. gzip, this figure can be reduced to around 4 GB, which might not even fill a DVD.
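The arithmetic for the cross-over records can be checked with a short Python sketch.  The function names are hypothetical, and the little-endian unsigned 32-bit layout is just one reasonable reading of "a 4-byte integer" (positions up to 3 × 10^9 fit, since 2^32 is about 4.29 × 10^9).  The single initial-haplotype bit per sperm is ignored here, as it is in the estimate above.

```python
import struct

def encode_crossovers(points):
    """Pack cross-over positions as little-endian unsigned 32-bit integers."""
    return struct.pack(f"<{len(points)}I", *points)

def crossover_storage_gb(n_sperm=2.8e8, crossovers=30, bytes_per_point=4):
    """Storage for all cross-over points in one ejaculate, in GB (1024^3 bytes)."""
    return n_sperm * crossovers * bytes_per_point / 1024**3

# Diploid reference sequence at 2 bits per base, as estimated earlier.
GENOME_GB = 3e9 * 2 / 4 / 1024**3

# One sperm with 30 cross-over points takes 120 bytes.
print(len(encode_crossovers(list(range(30)))))
print(round(crossover_storage_gb(), 1))               # cross-over records
print(round(GENOME_GB + crossover_storage_gb(), 1))   # records plus reference
```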

###

I just ran a simulation.  If the cross-over points are stored in binary format, i.e. 4 bytes per integer, ordinary compression software cannot reduce the storage size.  If they are stored as ASCII text, at least 10 bytes are needed for each point; compression roughly halves that, which is still larger than the binary format.  Thus a DVD is too small to store the information.
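The original simulation script is not shown here, so the following is only a sketch of how such a test might look.  It assumes cross-over positions drawn uniformly at random over 3 × 10^9 bp (real cross-overs are not uniform), uses Python's `zlib` module (the same DEFLATE algorithm gzip uses) as the "ordinary compression software", and stores the ASCII variant as zero-padded 10-digit numbers.  The function name and the sample size are arbitrary choices.

```python
import random
import struct
import zlib

GENOME_BP = 3_000_000_000  # approximate haploid genome length

def simulate(n_sperm=2000, crossovers=30, seed=0):
    """Compare raw and compressed sizes of binary vs ASCII cross-over records."""
    rng = random.Random(seed)
    # Assumption: positions are uniform over the genome.
    points = [rng.randrange(GENOME_BP) for _ in range(n_sperm * crossovers)]

    binary = struct.pack(f"<{len(points)}I", *points)             # 4 bytes/point
    text = "".join(f"{p:010d}" for p in points).encode("ascii")   # 10 bytes/point

    return {
        "binary_raw": len(binary),
        "binary_compressed": len(zlib.compress(binary, 9)),
        "ascii_raw": len(text),
        "ascii_compressed": len(zlib.compress(text, 9)),
    }

sizes = simulate()
for name, size in sizes.items():
    print(f"{name:18s} {size:>10d} bytes")
```

Random 32-bit positions carry nearly 32 bits of information each, so there is almost nothing for a general-purpose compressor to remove from the binary form, while decimal digits waste more than half of every byte.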

Modern genetics has also found that the mutation rate of the human genome is around 1 per 10^8 base pairs per meiosis.  Hence the number of mutations a sperm carries is approximately equal to its number of cross-over points (3 × 10^9 / 10^8 = 30), and including mutation information will double the storage.
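A quick Python check of this estimate, under the simplifying assumption that one mutation record costs about the same 4 bytes as one cross-over point (the identity of the new base would need a little extra room, which is ignored here).  The function names are made up for the example.

```python
def mutations_per_sperm(genome_bp=3e9, bp_per_mutation=1e8):
    """Expected new mutations per sperm at 1 mutation per 10^8 bp per meiosis."""
    return genome_bp / bp_per_mutation

def total_storage_gb(n_sperm=2.8e8, crossovers=30, bytes_per_record=4):
    """Reference sequence plus cross-over and mutation records, in GB (1024^3 bytes)."""
    gb = 1024**3
    genome = 3e9 * 2 / 4 / gb
    crossover_records = n_sperm * crossovers * bytes_per_record / gb
    # Assumption: mutation records cost the same per entry as cross-over points.
    mutation_records = n_sperm * mutations_per_sperm() * bytes_per_record / gb
    return genome + crossover_records + mutation_records

print(mutations_per_sperm())          # about 30, same as the cross-over count
print(round(total_storage_gb(), 1))   # roughly double the earlier 32.7 GB
```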

Indels and other errors in the sperm can complicate the situation further.

Overall, 100 GB may be enough for one ejaculate.  This is far less than 1 PB.

PS: The four sperm derived from the same primary spermatocyte differ by at most 1 bit, i.e. their initial haplotypes might differ.  Thus the storage size can be further reduced by around 3/4 × 31.3 GB.

One also needs to consider the recombination of whole chromosomes.  But only one bit is needed to record whether a chromosome is recombined relative to the previous one, because the recombination points here are fixed.  This won't take much storage space.
