Digitized by the Internet Archive in 2015 https://archive.org/details/detectingdiffereOOIiud DETECTING DIFFERENTIAL SPECIATION RATE IN IPOMOEA BY A CONTINUOUS-TIME MARKOV MODEL BY Dongmei Liu Zoology Department Duke University 12- /og/Zcrm) A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the department of Zoology in the graduate school of Duke University 2000 L/-mt> Abstract Preliminary study on Ipomoea phylogeny shows that species selection may influence flower-color evolution in this genus. Evolutionary transitions between pigmented flower and white flower appear to be highly asymmetrical. White-flowered lineages are less speciose than their pigmented sister clade and evolutionary transitions to white flowers occurred almost exclusively at the tip of phylogeny. A hypothesis to explain this pattern is that species selection acts against white-flowered lineages and prevents them from increasing in relative abundance whiten the genus Ipomoea. A continuous-time Markov model is developed to test differential speciation rates within genus Ipomoea. Simulation result shows that the observed distribution of white lineages is not compatible with equal transition rates between pigmented flower and white flower. Only highly asymmetrical transition rates that favor transitions to white flower are compatible with the observed distribution. This result implies that there is a tendency to increase the proportion of white-flowered species in the genus. The only process that could prevent this outcome is species selection or some similar higher- level evolutionary process that acts against white-flowered lineages. The absence of any deep white-flowered clades suggests that it might be white-flowered species have greatly elevated extinction rates compared to pigmented species. in Thesis title page Abstract title page Abstract fjj Table of contents List of figures Acknowledgement I. Introduction II. System UI. Method IV. Result V. Discussion Literature cited Figures Appendix 1. cuttree.c 2. 3-branch.c 3. final-tree. c 4. sim.c 5. probRS.c 6. contour.c Table of Contents in iv v VI 1 6 7 15 18 23 27 32 32 39 44 50 57 67 iv List of Figures Figure 1. Strict consensus tree for Ipomoea based on simultaneous 6 analysis of ITS and waxy sequences. Figure 2. Relationship between R and transition rate. 27 (Simulation is based on linearized trees) Figure 3. Relationship between R and transition rate. 28 (Simulation is based on most parsimonious trees) Figure 4. Relationship between percentage of large R which is 29 bigger than the observed R (Simulation is based on linearized trees) Figure 5. Relationship between percentage of large R which is 30 bigger than the observed R (Simulation is based on most parsimonious trees) Figure 6. Contours which cover 95% of joint probability of R and S 31 (Simulation is based on first 20 most parsimonious trees) v Acknowledgement I would like to thank Dr. Maxk D. Rausher for his help during my Master research. His advice, constructive comments and encouragement helped me throughout the whole project. I would also like to thank Dr. Cliff Cunningham | and Dr. Rytas Vilgalys for their support. vi Introduction It has long been recognized that natural selection may operate at different levels of organization (Lewontin 1970, Wimsatt 1980, Brandon 1982, Sober and Wilson 1994). Evidence has accumulated indicating that besides an individual, the unit of selection may be a gene (Werren et al. 1988), a family group (McCauley 1994), and possibly a population (Wade 1977). More controversial is the notion that selection may operate at higher levels of organization, particularly among lineages or species. Recently, phylogenetic approaches have been used to detect species selection. Statistical analyses of phylogenetic reconstructions can reveal whether clades differ in extinction/speciation rates and whether particular traits are associated with these differences (Mitter et al. 1988, Farrell et al. 1991, Nee et al. 1992, Slowinski and Guyer 1993, Sanderson and Donoghue 1994, 1996). The goal of this project is to examine the role of species selection in influencing the evolution of flower color in morning glories (Ipomoea). The pantropical genus Ipomoea L. consists of more than 600 species (Austin and Huaman 1996) of vines, shrubs, and trees that exhibit a tremendous diversity of flower colors. Miller et al. (1999) published a phylogeny of 40 Ipomoea species using sequences of the internal transcribed spacer (ITS) 1 region of nuclear ribosomal DNA and sequences from three exons and two introns of the 3' end of nuclear gene waxy. For these analyses, flower colors were grouped into two categories: pigmented and white. Three interesting patterns emerge: Evolutionary transitions appear to be highly asymmetrical. In particular, parsimony character-state reconstruction of Fig. 1 reveals at least seven, possibly 9, independent transitions from pigmented to white flowers. By contrast, there are at most two, and possibly no, transitions in the reverse direction. (I. Wrightii, I. Graminea, and I. imp e rati clade may represent reversion.) This pattern suggests that loss of floral pigmentation may be a manifestation of Dollo's law that the loss of complex characters is irreversible (Dollo 1893, Bull and Charnov 1985, Sanderson 1993). There is a tendency for white-flowered lineages to be less speciose than their pigmented sister clade. In particular, three white-flowered lineages (/. Saintronanensis and the mostly white-flowered clade consisting of /. graminea, I. wrightii, and /. imperati) are substantially less speciose than their pigmented sister clades, while no white-flowered lineages are more speciose than their sister clade. The sister taxon of each remaining white-flowered species is a single pigmented species. 2 Evolutionary transitions to white flowers occurred almost exclusively at the tip of the phylogeny. All white clades are two nodes deep or less, with most represented by a single species. The strong asymmetry in transitions between pigmented and white flowers can be explained by our current knowledge of the genetic and molecular control of flower color. In Ipomoea, as in many plants, floral pigments are anthocyanins, which are produced from the precursors malonyl-CoA and p-coumarl-CoA by a biochemical pathway consisting of six core enzymes (Dooner et al. 1991, van der Meer et al. 1993, Holton and Cornish 1995). Analysis of spontaneous and induced white-flowered mutants in plants, such as Petunia and snapdragon {Antirrhinum), has revealed that in all cases, white flowers arise whenever one or more structural or regulatory genes of this pathway are deactivated (Forkmann and Dangelmayr 1980, Harker et al. 1990, Inagaki et al. 1994, 1996, Hisatomi et al. 1997, Quattrocchio et al. 1998, Mol et al. 1998). Similarly, naturally occurring white-flowered species and variants, including two in Ipomoea, (Ennos and Clegg 1983, Epperson and Clegg 1988) also appear to be due to knockout mutations (Tiffin et al. 1998, Habu et al. 1998, Quattrocchio 1994). 3 This evidence indicates that white flowers are most likely to arise genetically by loss-of-function mutations affecting specific anthocyanin pathway genes. If this pattern holds upon further investigation, it provides a ready explanation for the asymmetry in flower-color transitions: deactivating mutations are relatively common, whereas restoration of function is rare because, once deactivated, a gene will acquire additional deactivating mutations through genetic drift. These additional mutations will reduce the probability that a single mutation could restore function (but see Hanson et al. 1996). The apparent strong asymmetry in the direction of flower-color transition, if real, constitutes a process that would tend to increase the proportion of white- flowered species in the genus. If unopposed by other processes, it would eventually produce a genus in which most species have white flowers. Because the asymmetry results from the integration of all evolutionary processes acting below the species level, the only process that could prevent this outcome is species selection or some similar higher-level evolutionary process. Moreover, the observation that transitions to white flowers occur almost exclusively at the tips of our phylogeny suggests that white-flowered species, for unknown ecological reasons, have greatly elevated extinction rates, compared to pigmented species, though the concomitant existence of decreased speciation rates in white-flowered lineages can not be ruled out. The 4 overall goal of the project is to test the hypothesis that species selection acts against white-flowered lineages and prevents them from increasing in relative abundance. 5 Systems Me r re ma Operculina Iplebeia I pes-dgridis I ochracea lobscura* I graminea* I wrighdi I wperad x I dxamantwensis*- I aquatica I lobata I coccinea I ternijblia I quamoclit I lindheimen [purpurea I ml I hederacea I tricolor I alba* 1 parasitica I carnsa I polpha I costata I arborescens * I platsnsis I conzattn I saintronanenszs* I setosa I umbraticola I lacunosa * I cordatotrdoba I batatas I muellen I gracilis I argil licola lasanfolia I pes-caprae I amnicola I leptophylla I vandurata * Fig. 1 Strict consensus tree for Ipomoea based on simultaneous analysis of ITS and waxy sequences (Miller, 1999). * indicate white flower species. 6 Methods In order to test the asymmetry of Ipomoea phylogeny, we developed the following method. A set of equally most parsimonious phylogenies is generated. One phylogeny is randomly selected for analysis and linearized. Branch lengths are then estimated by maximum likelihood under the assumption for a molecular clock. Based on this linearized phylogeny, the distribution of white-flowered species is reconstructed using a continuous-time Markov model. Using this model, 1000 trees with random flower-color transitions will be generated stochastically for each phylogeny. We choose R (number of white species/total length of white lineages) and S (number of white species) to be the two statistics to test asymmetrical distribution of white-flowered and pigmented flowered species in phylogeny. This procedure is repeated for a number of randomly-chosen equally parsimonious topologies with a set of transition rates between white flowered species and pigmented flowered species. It yields a distribution of joint probabilities of R and S. The statistics are also calculated for the actual tree and compared to the frequency distribution of the statistics among the simulations. If fewer than five percent of the statistics from the stochastic simulations are as extreme as that for the actual tree, the stochastic model may be rejected as incompatible with the observed pattern. 7 (1) Phylogeny reconstruction. The phylogeny of 40 species was analyzed using the sequence of the internal transcribed spacer (ITS) region of nuclear ribosomal DNA and sequences for three exons and two introns of the 3' end of the nuclear gene waxy. Merremia tuberosa and Operculina brownii are used as outgroups. Combined ITS and waxy sequence has 1285 nucleotides. The phylogeny was reconstructed by PAUP using parsimony. Among the 1285 sites, 249 sites are parsimony- informative. Finally, we obtained 56 equally most-parsimonious phylogenies. (2) Generate linearized phylogeny and estimate branch length by maximum likelihood method. Randomly select one maximum parsimonious phylogeny for analysis. Takezaki et al. (1995) proposed a method to sequentially prune species exhibiting significant deviation from the average nucleotide substitution rate until rate variation among the remaining species is consistent with a molecular clock. Their method is called two-cluster test. The principle of the test is to examine the equality of the average substitution rate for two clusters that are created by a node in a given tree. The basic quantities they care about are the averages of observed numbers of substitutions per site from node to the tips of 8 the two clusters. Under the assumption of rate constancy, the expected difference of the two clusters' substitution rates is zero. If the test shows significant rate difference between the two clusters, they eliminate the cluster whose average branch length from the root is more different from the average of all sequences than the other cluster. This test and sequence elimination therefore proceeds from terminal nodes to the root of the tree. A linearized tree will then be constructed for a given topology under the assumption of rate constancy. Takezaki has written a program package to generate linearized tree. I used this package to linearize phylogenies in this project. Branch lengths are then estimated by maximum likelihood method under the assumption of a molecular clock. Yang (1997) wrote a package of programs, called PAML, for phylogenetic analyses of DN A sequences using the maximum likelihood method. General assumptions of the programs are that substitutions occur independently in different lineages and among sites, the process of substitution is a time-homogeneous Markov process, and the frequencies of nucleotides have remained constant over the time period covered by data. I used this program package to obtain branch lengths of trees in this project. (3) Simulate transitions between pigmented flower and white flower. 9 Pagel (1994) developed a continuous-time Markov model for character evolution. The model is designed to estimate simultaneous transition rates in pairs of binary characters arrayed on a phylogney. All species in the study could have only one of two characters, in our case, a pigmented flower or a white flower. I assign it to be either P state (pigmented flower) or W state (white flower). Along a branch beginning with state P/W, the flower color will either remain the same or change to the other state. The probability for this change or unchange over time t (t is the branch length) is P pp (t), P ww (t), P wp (0, or Pp W (t). Assume that the flower color evolves according to a Markov process such that the probability of change is independent in each branch of the phylogenetic tree, and that the probability of changing from one state to the next depends only upon the state at the beginning of the branch, not on events previous to that. The probability that flower color changes from state i to state j over time interval t+dt is given by P ij (t+JO=P ii (t)*q i j*^+P i j(t)*(l-q i j)*^ where t is an arbitrary time period, dt is a short interval of time, and q,j is transition rate parameter denoting the rate of change from state i to state j over an infinitesimally short time interval. The 1-qy term represents the probability of staying in state j. This equation states that flower color can change from state i to state j either by remaining unchanged for a period t followed by a 10 transition to state j, or by changing to state j during time period t, then remaining in state j over period dt. The set of equations describing the four possible probabilities of interest can be presented in matrix form ?(t+dt)=?{t)*(I+Qdf), | l-P pw (t+dr) P pw (t+A)| = I Ppp(t) P P w(t)| * |(l-q pw )*A q^*dt\ |P wp (t+Jr) 1-P wp (t+Jr) | | Pwp(t) Pww(t) | |q wp (t)*rff (l-qwp)*^| To solve for the P(t) in terms of the rate parameters in Q, dP(t)/ #include #include int i, j , k, l,m, ncut , cut [10 0] ,N, tp [150] ,nodeA[100] , nodeB [100] int ntp, mother , grandma; int ncutspecies , cutspecies [100] , nsite , check, newN, newtp [150] int ancestor [150] ; int nlef t , nright ; char left, right; char seq [50] [2 000] , species [50] [5 0] ; FILE * inp , * input seq, *out , * out seq , * input name ; main ( ) { input ( ) ; calculation ( ) ; output ( ) ; } input ( ) { /* input parsimonious tree */ inp=f open ( "cuttree . dat " , "r") ; f scanf (inp, "%d\n" , &N-) ; /* N, number of species k=0; 1 = 0; ntp=0 ; do{ ntp++ ; fscanf (inp, "%d\t" , &tp [ntp] ) ; if (tp [ntp] ==1000) k=k+l; if (tp [ntp] ==2000) 1=1+1; }while (tp [ntp] !=5000) ; ntp- - ; if(k!=l) printf ( "dataf ile is wrong. \n" ) ; fscanf ( inp , " \n" ) ; 32 /* input the nodes/species to cut off */ f scanf (inp, "%d\n" , &ncut) ; /* ncut , number of species/nodes to cut off */ for ( i=l ; i<=ncut ; i++) { f scanf (inp, "%d\n" , &cut [i] ) ; } f close (inp) ; /* input sequences data */ inputseq=f open (" sequence . dat " , "r") ; f scanf ( inputseq, " %d\n" , &N) ; /* N, number of species */ f scanf ( inputseq, " %d\n" , &nsite) ; /* nsite, total number of sites compared among sequences */ f or ( i = 1 ; i < =N ; i-M- ) { f scanf (inputseq, "%d\n" , &m) ; for (j=l; j<=nsite; j++) { seq[m] [j ] =getc (inputseq) ; /* seq [m] [j] , the mth sequence's jth site */ } } f close ( inputseq) ; inputname=f open ( " species_name " , " r " ) ; f scanf (inputname, "%d\n" , &N) ; for (i=l;i<=N;i++) { f scanf ( inputname ," %d " , & j ) ; f scanf ( inputname , " %s\n" , species [ j ] ) ; } f close ( inputname) ; } calculation ( ) { tree_structure ( ) ; cut_tree ( ) ; 33 } tree_structure ( ) { i = 0 ; k=N; do{ /* find the right side of the node */ do{ i=i+l ; }while (tp [i] ! =2000) ; /* 2000 indicate right edge of a cluster */ k=k+l; /* k is the node # */ tp[i]=k; /* reassign right edge to be node# */ nodeB [k] =tp [i-1] ; /* find the left. side of the node */ j=i; do{ j=j-i; }while (tp [ j ] ! =1000) ; /* 1000 indicates left edge of a cluster */ tp[j]=k; /* reassign left edge to be node# */ nodeA [k] =tp [ j +1] ; } while ( iN) && (newtp [i] ! =0) ) { check=0 ; for (j=l;j #include #include int N, tp [150] , i, j , k, l,nodeA[10 0] , nodeB [100] , nodeC [100] int ntp , check, bad_node ; int left , right , nleft ,.nright , newtp [150] , newntp; char species [50] [50] ; FILE *inputtp, *outtp; main ( ) { input ( ) ; structure ( ) ; output ( ) ; } input ( ) { /* input parsimonious tree */ input tp=f open (" topology . dat " , "r") ; f scanf ( inputtp ." %d\ n ". &N) ..• /* N. number species */ k=0 ; 1 = 0; i = 0 ; do{ i=i+l ; f scanf ( inputtp, " %d\t " , &tp [i] ) ; if (tp [i] = = 1000) k=k+l; if (tp [i] = = 2000) 1 = 1 + 1; }while (tp [i] !=5000) ; ntp=i-l ; if(k!=l) printf ( "dataf ile is wrong. \n"); fclose (inputtp) ; } structure ( ) { 39 k=N; do{ /* find the right side of the node */ do{ i=i+l; }while (tp [i] ! =2000) ; /* 2000 indicate right edge of a cluster */ k=k+l; /* k is the node # */ tp[i]=k; /* reassign right edge to be node# */ nodeB [k] =tp [i-1] ; /* find the left side of the node */ j=i; do{ j=j-i; }while (tp [ j ] ! =1000) ; /* 1000 indicates left edge of a cluster */ tp[j]=k; /* reassign left edge to be node# */ nodeA [k] =tp [ j +1] ; /*check if this tree have 3 node case */ l=j+l; if ( (tp [1] <=N) && (tp [1+1] ! =nodeB [k] ) ) { nodeC [k] =tp [1+1] ; } if (tp [1] >N) { do{ 1=1+1; }while(tp[l] !=nodeA[k]); /* find the right edge of nodeA * / if (tp [1 + 1] ! =nodeB [k] ) /* 3 node case */ { nodeC [k] =tp [1+1] ; if (nodeC [k] ! =0) { if (nodeC [k] <=N) 1=1+1 else 4() { 1=1+1; do{ 1=1+1; }while (tp [1] !=nodeC [k] ) ; } if (tp [1 + 1] ! =nodeB [k] ) printf ( "There is sth wrong for node %d\n",k); } }while ( ibad_node ; i - - ) { nodeA ( i ] =nodeA [ i - 1 ] ; if (nodeA [i] >=bad_node) nodeA [i] ++ ; nodeB [ i ] =nodeB [ i - 1 ] ; if (nodeB [i] >=bad_node) nodeB [i] ++ ; } nodeA [bad_node+l] =bad_node ; nodeB [bad_node+l] =nodeB [bad_node] ; nodeB [bad_node] =nodeC [bad_node] ; for (i=l; i<=ntp ; i++ ) { if (newtp [i] >=bad_node) newtp [i] ++ ; } 41 i = 0 ; check=0 ; do{ i=i+l; if (tp [i] ==bad_node) check=l; }while (check==0) ; left=i; f or ( j =ntp+l ; j >lef t ; j - - ) { newtp [ j ] =newtp [ j - 1 ] ; } newtp [lef t+1] =bad_node ; k=left; check=0 ; do{ k + +; if (newtp [k] ==nodeC [bad_node] ) check+- } while (check<2 ) ; for ( j =ntp + 2 ; j >k ; j - - ) { newtp [ j ] =newtp [ j - 1 ] ; } newtp [k+1] =bad_node ; newntp=ntp+2 ; } output ( ) outtp=fopen ( "n_topology . dat " , "w") ; fprintf (outtp, "%d\n" , N) ; lef t=1000 ; right=2000 ; nlef t = 0 ; nright=0 ; for (i=l ; i<=newntp ; i++) { if (newtp [i] >N) { check=0 ; for (j=l; j #include ttinclude #include int N, i, j ,k, l,m, newtp [150] , newntp, nbr, orderbr [150] , check; int nodeA [100] , nodeB [100] ; int len, nodeC [150] , bad_node ; float inpbr [150] , br [150] ; char tp [1000] , c; char species [50] [50] ; FILE *inp, *out , *inputtp; main ( ) { input ( ) ; calculation ( ) ; output ( ) ; } input ( ) { inp=fopen ( "mlb . dat " , "r" ) ; f scanf (inp, "%d\n\n" , &N) ; i = 0 ; m=0 ; do{ i + + ; tp [i] =getc (inp) ; if (tp[i]==' :') { m++ ; f scanf ( inp, " %f " , &inpbr [m] ) ; } }while (tp [i] ! = ';') ; nbr=m; /* check out input */ /* 44 printf ("%d\n\n" ,N) ; i = 0 ; m=0 ; do{ i + + ; printf ( "%c" , tp [i] ) ; if (tp[i]==' : ' ) { m++ ; printf ( " %8 . St" , inpbr [m] ) ; } }while (tp [i] !=';'); printf ("\n") ;*/ f close (inp) ; } calculat ion ( ) { change_input ( ) ; structure ( ) ; branch ( ) ; change_input ( ) { j=0; /* newtp */ i=0; /* tp */ m=0; /* species */ do{ i + + ; if (tp[i]==' (' ) { j++; newtp [ j ] =1000 ; } if (tp[i]==' ) ' ) { j + + ; newtp [ j ] =2000 ; if ((tp[i]>='A')&&(tp[i]<='Z')) 45 m++ ; k=i; do{ k++; }while(tp[k] ! = ':'); for (l = l;l<=k-i;l + + ) { len=strlen (species [m] ) ; species [m] [len] = tp [1+i-l] ; } 3++; newtp [ j ] =m; } }while(tp[i] ! = ' ; ' ) newntp=j ; } structure ( ) i = 0 ; k=N; do{ /* find the right side of the node */ do{ i=i + l ; }while (newtp [i] ! =2000) ; /* 2000 indicate right edge of a cluster */ k=k+l; /* k is the node # */ newtp [i]=k; /* reassign right edge to be node# */ nodeB [k] =newtp [i-1] ; /* find the left side of the node */ 3=1; do{ j=j-i; }while (newtp [j ]! =1000) ; /* 1000 indicates left edge of a cluster */ newtp [j]=k; /* reassign left edge to be node# */ nodeA [k] =newtp [ j +1] ; /*check if this tree have 3 node case */ 4(. l=j+l; if ( (newtp [1] <=N) && (newtp [1 + 1] ! =nodeB [k] ) ) { nodeC [k] =newtp [1+1] ; } if (newtp [1] >N) { do{ 1=1+1; }while (newtp [1] ! =nodeA [k] ) ; /* find the right edge of nodeA */ if (newtp [1 + 1] ! =nodeB [k] ) /* 3 node case */ { nodeC [k] =newtp [1+1] ; } } if (nodeC [k] ! =0) { if (nodeC [k] <=N) 1=1+1; else { 1=1+1; do{ 1=1+1; }while (newtp [1] ! =nodeC [k] ) ; } if (newtp [1+1] ! =nodeB [k] ) printf ( "There is sth wrong for node %d\n",k); } } while ( iN) { check=0 ; for (j=i-l; j> = l; j--) { if (newtp [j ] ==newtp [i] ) check=l; } if (check==l) { m++ ; orderbr [m] =i ; } } } for (i=l; i<=nbr ; i++ ) { br [newtp [orderbr [i] ] ] =inpbr [i] ; } /* for (i=l ; i<=newntp ; i + + ) { if (newtp [i] <=N) { print f("%s:%8 . 6f " , species [newtp [i] ] , br [newtp [i] ] ) ; } if (newtp [i] >N) { check=0 ; for (j=i-l; j >=1; j --) { if (newtp [j ] ==newtp [i] ) check = l; } if ( check==l ) { print f ( " %c : %8 . 6f " , ' ) ' , br [newtp [i] ] ) ; } if (check==0) { printf ( " %c" , ' ( ' ) ; } } }*/ } output ( ) { 48 out=f open ( " sim . dat " , "w" ) ; fprintf (out , " %d sequences\n" , N) ; for (i=l; i<=N; i++) { fprintf (out , " %d" , i) ; fprintf (out , " %s\n" , species [i] ) ; } check=0 ; for (i=N+l;i<=2*N-l;i + +) { if (nodeC [i] ! =0) { check=l ; bad_node=i ; } } fprintf (out, "bad_node\t%d\n" , bad_node) ; for (i=N+l ; i<=2*N-3 ; i++) { fprintf (out ," %3d and %3d" , i , nodeA [i] ) ; fprintf (out , " %lf \n" , br [nodeA [i] ] ) ; fprintf (out ," %3d and %3d" , i , nodeB [i] ) ; fprintf (out , " %lf \n" , br [nodeB [i] ] ) ; i f ( i = =bad_node ) { fprintf (out ," %3d and %3d" , i , nodeC [i] ) ; fprintf (out , " %lf \n" , br [nodeC [i] ] ) ; } } if (bad_node==0) { else i=2*N-2 ; fprintf (out , •%3d and %3d" , i , nodeA [i] ) ; fprintf (out , %lf \n" , br [nodeA [i] ] ) ; fprintf (out , •%3d and %3d" , i , nodeB [i] ) ; fprintf (out , %lf \n" , br [nodeB [i] ] ) ; fprintf (out , •%3d and l\t %lf\n" , i,br [l] i=2*N-3 ; fprintf (out , ■%3d and l\t %lf\n" , i,br [l] ) } f close (out ) ; } 49 4 . sim . c /* simulate transition between purple flower and white flower, find the final distribution of purple (-1) flower species and white (1) flower species*/ #include ^include #include #define step 40 int i , j , k, 1 , m, w, newN, nodeA [100] , nodeB [10 0] , nodeC [100] ; int ancestor [100] , color [100] ; int check , seed, nw [5 0 00 ] , newseed, runtime , nmutation, mutation [10 0] ; int bad_node, wp, pw; int order [5000] , y [10 0] , printposition, totaly, count [100] ; int biggerthanR [50] [50] ; float totalR [50] [50] ,- double branch[150] , rate , p_wrate , w_prate , t , p , sample , ttw [ 5 0 00 ] ,R[5000] ; double minimum_x, minimum_y, maximum_x, maximum_y ; double step_x, orderR [5000] ,minimum_R; double x[100] , percenty [100] ,ordery[100] ; char species [50] [50] , string [10] ; FILE *inp, *out ; main ( ) { input ( ) ; for (wp=0 ;wp< = 45 ; wp++) { w_prate= (float) (wp)/100; for (pw=l ; pw<=4 5 ; pw++ ) { p_wrate= (float) (pw)/100; totalR [wp] [pw] =0 ; printf ( " w_prate=%f \tp_wrate=%f \n" , w_prate , p_wrate) ; for (runt i me = 1 ; runt ime< = 1000 ; runt ime + + ) { calculation ( ) ; } record ( ) ; } } output ( ) ; } input ( ) 50 printf ("Seed for random number generator (samller than digits) : \n" ) ; scanf("%6d", &seed) ; srand48 (seed) ; inp=f open ( " sim . dat " , "r " ) ; f scanf ( inp , " %d sequences\n" , &newN) ; for (i=l ; i<=newN; i++) { f scanf ( inp , " %d" , &j ) ; f scanf ( inp, " %s\n" , species [j ] ) ; } f scanf (inp, "%s\t%d\n" , string, &bad_node) ; for (i=newN+l ; i<=2*newN-3 ; i++) { f scanf ( inp ," %3d and %3d" , &j , &nodeA [i] ) ; f scanf ( inp , " %lf \n" , Scbranch [nodeA [ j ] ] ) ,- f scanf ( inp ," %3d and %3d" , &j , &nodeB [i] ) ; f scanf (inp, " %lf \n" , Scbranch [nodeB [j ] ] ) ; i f ( i = =bad_node ) { f scanf (inp, "%3d and %3d" , &j , &nodeC [i] ) ; f scanf ( inp, " %lf \n" , &branch [nodeC [ j ] ] ) ; if (bad_node==0) { i=2*newN-2 ; f scanf ( inp , f scanf ( inp , f scanf ( inp , f scanf ( inp , f scanf (inp, %3d and %3d" , & j , ScnodeA [ i ] ) ; %lf \n" , Scbranch [nodeA [j ] ] ) ; %3d and %3d" , &j , &nodeB [ j ] ) ; %lf \n" , Scbranch [nodeB [ j ] ] ) ; %3d and l\t %lf\n" , &j , Scbranch [1] ) else { i=2*newN-3 ; f scanf (inp, "%3d and l\t %lf\n" , Scj , Scbranch [1] ) } f close ( inp) ; calculation ( ) 51 if (bad_node==0) { for ( i=newN+l ; i<=2*newN-2 ; i++ ) { k=nodeA [i] ; ancestor [k] =i ; k=nodeB [i] ; ancestor [k] =i ; } m=2*newN-2 ; color [m] = - 1 ; for (i=l ; i<=2*newN-3 ; i++) { color [i] =0 ; } nmutation=0 ; do{ k=nodeA [m] ; t=branch [k] ,- if (color [m] = = -1) { p=w_prate-p_wrate*exp ( - 1* ( w_prate+p_wrate ) *t ) ; p=p/ (w_prate+p_wrate) ,- sample=drand48 () ; if (sample>p) { color [k] =1 ; nmut at ion=nmutat ion+ 1 ; mutation [nmutation] =k; } else { color [k] =-1 ; } } if (color [m] ==1) { p=p_wrate+w_prate*exp ( - 1 * (w_prate+p_wrate) *t) ; p=p/ (w_prate+p_wrate) ; sample=drand48 () ; if (sample>p) { color [k] =-1 ; nmutation=nmutation+l ; mutation [nmutation] =k; } else { 52 color [k] =1 ; } } m=k; }while (m>newN) ; do{ k=ancestor [m] ; i f ( k ! =bad_node ) { if (color [nodeA [k] ] ==0 ) down ( ) ; else { if (color [nodeB [k] ] ==0) down ( ) ; } } else { if (color [nodeA [k] ] ==0) down ( ) ; else { if (color [nodeB [k] ] = = 0) down ( ) ; else { if (color [nodeC [k] ] = = 0) down ( ) ; } } } m=k ; }while (m<2*newN-4 ) ; } /* check */ check=0 ; 1 = 1; if (bad_node==0) { do{ 1=1+1; if (color [1] = = 0) check=l; jwhile (l<2*newN-2 check==0); } else { do{ 1=1+1; if (color [1] = = 0) check=l; }while (l<2*newN-3 && check==0); } if (check==l) 53 { printf (" calculation is not completed . \n" ) ; printf (" species %d is missing . \n" , 1 ) ; } statistics ( ) ; } down ( ) { do{ if (color [nodeA[k] ] ==0) l=nodeA[k] ; else { if (color [nodeB [k] ] ==0) l=nodeB [k] ; else { if ( (k==bad_node) (color [nodeC [k] ] ==0) ) l=nodeC [k] ; else printf ("sth is wrong at node %d\n" } } t=branch [1] ; if (color [k] = = -1) { p=w_prate-p_wrate*exp (-1* (w_prate+p_wrate) *t) p=p/ (w_prate+p_wrate) ; sample=drand4 8 () ; if ( sample>p) { color [1] =1 ; nmutation=nmutation+l ; mutation [nmutation] =1 ; } else { color [1] =-1 ; } } if (color [k] = = 1) { p=p_wrate+w_prate*exp ( - 1* ( w_prate+p_wrate ) *t ) p=p/ (w_prate+p_wrate) ,- sample=drand4 8 () ; if ( sample>p) { color [1] = -1 ; nmutation=nmutation+l ; mutation [nmutation] =1 ; } 54 else { color [1] =1 . k=l ; }while (k>newN) statistics ( ) { for (i=l ; i<=2 *newN-l ; i++) { count [ i ] = 0 ; } ttw [runtime] =0 ; nw [ runt ime ] = 0 ; for ( i=l ; i<=newN; i++ ) { if (color [i] = = 1) { nw [runtime] =nw [runtime] +1 ; w=i ; do{ ttw [runtime] =ttw [runtime] +branch [w] ; count [w] =1 ; w=ancestor [w] ; }while ( (color [w] ==1) && (count [w] ==0) ) ; } } if (ttw [runtime] ==0) R [runtime] =0 ; else R [runtime] =nw [runtime] /ttw [runtime] ; totalR[wp] [pw] =totalR [wp] [pw] +R [runt ime] ; } record ( ) { f or (i=l ; i<=12 ; i++) { y [i] =0; } for (i = l;i< = 1000;i + + ) { if(R[i]<30) y[l]++; if(R[i]>80) y[12]++; if ( (R [i] < = 80) && (R [i] > = 3 0) ) { 55 j = (int) ( (R[i] -30) /5) +2 ; y [j] ++; } } } output ( ) { out = f open ( " f inal . out " , " w" ) ; for (wp=l ; wp<=2 0 ;wp++) { for (pw=l ;pw<=20 ;pw++) { f print f (out , "%f \t " , totalR [wp] [pw] /100 0) ; } fprintf (out , " \n" ) ; } f close (out ) ; } 56 5 . probRS . c /* get joint probability of R and S, based on all parsimony- trees */ #include #include #include #define step 20 int i,j,k,l,m,w, newN , nodeA [100] , nodeB [100] , nodeC [100] ; int ancestor [100] , color [100] ; int check, seed, nw [6000] , newseed, runt ime , nmutat ion, mutation [100] ; int bad_node , wp , pw ; int count [15 0] , nwhite [50] [45] [45] ; int nR [110] [50] [4 0] [40] ; int len,nf ,nfile, ct [110] [50] [45] [45] ; float prob_R_given_S [110] [50] [40] [40] , prob_S [50] [40] [40]; float joint_prob_R_S [110] [50] [45] [45] ; float probRS [110] [50] [45] [45] ,del; float boarder , boarder2 , sum [40] [40] ; double branch[150] , rate , p_wrate , w_prate , t , p , sample , ttw [ 60 0 0 ] ,R[6000] ; char species [50] [50] , string [10] ; char filename [3 0] = { ' R' , '_' , ' S' , '_' , 'W , ' p' , '_' , ' p' , ' w' , '_' , ' \ 0' } ; char add [15] = { ' 0 ' , ' 1 ' , ' 2 ' , ' 3 ' , ' 4 ' , ' 5 ' , ' 6 ' , ' 7 ' , ' 8 ' , ' 9 ' , ' \0 ' } ; char newf ilename [3 0] ; char file [80] [50] ; FILE *inp,*out; main ( ) { getf ile ( ) ; set_zero ( ) ; for (nf = l;nf<=nf ile,-nf + +) { printf ( "computing tree %d\n",nf); input ( ) ; for (wp=0 ;wp<=40 ;wp++) { w_prate= (float) (wp)/100; for (pw=l ;pw<=40 ;pw++ ) { p_wrate= (float ) (pw)/100; printf ( "w_prate=%f \tp_wrate=%f \n" , w_prate , p_wrate) ; setup ( ) ; f or ( runt ime =1 ; runt ime< =1000 ; runt ime++ ) 57 calculation ( ) } prob ( ) ; output ( ) } getf ile () printf ("Seed for random number generator (samller than 6 digits) : \n" ) ; scanf ( " %6d" , &seed) ; srand48 (seed) ; inp=f open ( " tree_input_f iles " , " r " ) ; f scanf ( inp, " %d\n" , Unfile) ; f or (i=l ; i<=nf ile; i++) { f scanf ( inp , " %s\n" , f ile [ i] ) ; } f close (inp) ; set_zero ( ) { for (wp=0 ;wp<=40 ; wp++) { for (pw=l ;pw<=40 ;pw++) { for ( i=l ; i<=newN; i++) { for (j=0;j<=100;j++) { Ct [j] [i] [wp] [pw] =0; probRS[j] [i] [wp] [pw]=0, } 58 input ( ) { inp=f open (file [nf ] , "r" ) ; fscanf (inp, "%d sequences\n" , &newN) ,- for (i=l; i<=newN; { fscanf (inp, " %d" , & j ) ; fscanf ( inp , " %s\n" , species [ j ] ) ; } fscanf (inp, "%s\t%d\n" , string, &bad_node) ; for ( i=newN+l ; i<=2 *newN- 3 ; i + + ) { fscanf (inp, "%3d and %3d" , &j , SnodeA [i] ) ; fscanf ( inp , " %lf \n" , Sbranch [nodeA [ j ] ] ) ; fscanf (inp, "%3d and %3d" , &j , knodeB [i] ) ; fscanf (inp, " %lf \n" , Sbranch [nodeB [ j ] ] ) ,- if ( i==bad_node) { fscanf (inp, "%3d and %3d" , &j , SnodeC [i] ) ; fscanf ( inp, " %lf \n" , &branch [nodeC [ j ] ] ) ; } } if (bad_node==0 ) { i=2*newN-2 ; f scanf ( inp , " %3d and %3d" , &j , &nodeA [ i ] ) ; fscanf ( inp , " %1 f \n" , ^branch [nodeA [ j ] ] ) ; fscanf ( inp, " %3d and %3d" , & j , toodeB [ j ] ) ; f scanf ( inp , " %lf \n" , &branch [nodeB [ j ] ] ) ; f scanf ( inp ," %3d and l\t %lf \n" , &j , &branch [1] ) ; } else { i = 2 *newN-3 ; fscanf (inp, "%3d and l\t %lf \n" , &j , &branch [1] ) ; } f close ( inp) ; } setup ( ) { for ( i = l ,- i<=newN; i + + ) 59 { nwhite[i] [wp] [pw]=0; for ( j =0 ; j < = 100 ; j ++) { nR[j] [i] [wp] [pw] =0; } } } calculation ( ) { if (bad_node= = 0) { for ( i=newN+l ; i< = 2 *newN-2 ; i + + ) { k=nodeA [ i ] ,- ancestor [k] =i ; k=nodeB [i] ; ancestor [k] =i ; } m=2*newN-2 ; color [m] =- 1 ; f or ( i = l ; i< = 2 *newN-3 ; i + + ) { color [i] =0 ; } nmutation=0 ,- do{ k=nodeA [m] ; t=branch [k] ; if (color [m] = = -1) { p=w_prate-p_wrate*exp ( - 1 * ( w_prate+p_wrate ) * t ) ; p=p/ (w_prate+p_wrate) ,- sample=drand48 () ; if (sample>p) { color [k] =1 ; nmutation=nmutation+l ; mutation [nmutation] =k; } else { color [k] =-1; } } if (color [m] ==1) { p=p_wrate+w_prate*exp ( -1* (w_prate+p_wrate) p=p/ (w_prate+p_wrate) ; sample=drand4 8 () ; if (sample>p) { color [k] =-1 ; nmutation=nmutation+l ; mutation [nmutation] =k ; } else { color [k] =1 ; } } m=k ; }while (m>newN) ; do{ k=ancestor [m] ; i f ( k ! =bad_node ) { if (color [nodeA [k] ] ==0) down ( ) ; else { if (color [nodeB [k] ] = = 0) down ( ) ; } } else { if (color [nodeA [k] ] ==0 ) down ( ) ; else { if (color [nodeB [k] ] = = 0) down ( ) ; else { if (color [nodeC [k] ] = = 0) down ( ) ; } } } m=k ; }while (m<2 *newN-4 ) ; } /* check */ check=0 ; 1 = 1; if (bad_node==0 ) { do{ 1=1+1 ; if (color [1] = = 0) check= 1; hi }while (l<2*newN-2 && check= = 0); } else { do{ 1=1+1; if (color [1] ==0) check=l; }while (l<2*newN-3 && check==0) ; } if (check==l) { printf ( "calculation is not completed . \n" ) ; printf (" species %d is missing . \n" , 1 ) ; } statistics ( ) ; } down ( ) { do{ if (color [nodeA [k] ] ==0) l=nodeA [k] ; else { if (color [nodeB [k] ] = = 0) l=nodeB [k] ; else { if ( (k==bad_node) (color [nodeC [k] ] = = 0) ) l=nodeC [k] ; else printf ("sth is wrong at node %d\n" } } t=branch [1] ; if (color [k] = = -1) { p=w_prate-p_wrate*exp (-1* (w_prate+p_wrate) *t) p=p/ (w_prate+p_wrate) ; sample=drand4 8 () ; if (sample>p) { color [1] =1; nmutation=nmutation+l ; mutation [nmutation] =1 ; } else { color [1] =-1; } } 62 if (color [k] = = 1) { p=p_wrate+w_prate*exp ( -1* (w_prate+p_wrate) *t) ; p=p/ (w_prate+p_wrate) ; sample=drand48 () ; if (sample>p) { color [1] = -1; nmutation=nmutation+l ; mutation [nmutation] =1 ; } else { color [1] =1; } } k=l; } while (k>newN) ; } statistics ( ) { f or (i = l ; i< = 2 *newN- 1 ; i + + ) { count [i] =0 ; } ttw [runtime] =0 ; nw [runt ime] =0 ; for (i=l; i<=newN; i++) { if (color [i] = = 1) { nw [ runt ime ] =nw [ runt ime ] + 1 ; w=i ; do{ ttw [runtime] =ttw [runtime] +branch [w] ; count [w] =1 ; w=ancestor [w] ; }while ( (color [w] = = 1) && (count [w] = = 0) ) ; } } if (ttw [runtime] ! =0) R [runtime] =nw [runtime] /ttw [runtime] ; else R [runtime] =0 ; nwhite [nw [runtime] ] [wp] [pw]++; if ( (int) R [runtime] <100) nR [( int ) R [runtime] ] [nw [runtime] ] [wp] [pw]++; 63 prob ( ) for (i=l ; i<=newN; i++) { prob_S[i] [wp] [pw] = (float) nwhite [i] [wp] [pw] /runtime ; for ( j =0 ; j < = 100 ; j ++) { if (nwhite [i] [wp] [pw] !=0) { prob_R_given_S [ j ] [i] [wp] [pw] = (float ) nR [j ] [i] [wp] [pw] /nwhite [i] [ wp] [pw] ; j oint_prob_R_S [ j ] [i] [wp] [pw] =prob_R_given_S [ j ] [i] [wp] [pw] *prob_ S [i] [wp] [pw] ; ct [j] [i] [wp] [pw] ++; del = j oint_prob_R_S [ j ] [i] [wp] [pw] - probRS [j ] [i] [wp] [pw] ; probRSfj] [i] [wp] [pw] =probRS [j] [i] [wp] [pw] +del/ct [ j ] [i] [wp] [pw] ; /*test*/ /* boarder = 0 ,- for (i=l ; i<=newN; i++) { for ( j=0; j<=100; j++) { boarder=boarder + j oint_prob_R_S [ j ] [i] [wp] [pw] print f ( "boarder=%f \n" , boarder) ; * / output ( ) { for (wp=0 ;wp<=4 0 ;wp++) { for (pw=l ;pw<=40 ;pw++) { sum [wp] [pw] =0 ; for (i = l ; i<=newN; i + + ) 64 for (j=0; j<=100; j++) { sum[wp] [pw]=sum[wp] [pw] +probRS [ j ] [i] [wp] [pw] ; } } /*printf ( "total prob of wp(%d) pw(%d) = %f \n" , wp, pw, sum [wp] [pw]),-*/ } } for (wp=0 ;wp<=4 0 ;wp++) { for (pw=l ;pw<=40 ;pw++) { f or ( i=l ; i<=newN; i++) { for (j=0; j<=100; j++) { probRS[j] [i] [wp] [pw] =probRS [ j ] [i] [wp] [pw] /sum [wp] [pw] ; } } } } for (i = 0 ; i< = 8 ; i + +) { wp=i * 5 ; for (j=l; j<=8; j++) { pw=j*5; strcpy (newfilename , filename) • len=strlen (newf ilename) ; k=wp/10 ; newf ilename [len] =add [k] ; l=wp%10 ; newf ilename [len+1] =add [1] ; newfilename [len+2] ='_' ; k=pw/10 ; newf ilename [len+3 ] =add [k] ; l=pw%10 ; newf ilename [len+4] =add [1] ; newfilename [len+5] =' \0 ' ; printf ( " %s\n" , newfilename) ; write ( ) ; } write ( ) { 65 out =f open (newfilename , "w") ; for (k=l ; k<=newN; k++ ) { for (1=0;1<=100;1++) { fprintf (out, " %f \t " , probRS [1] [k] [wp] [pw] ) ■ } fprintf (out , " \n" ) ; } f close (out ) ; } 66 6 . contour . c /* find the area which covers 95% of all simulate (R,S) */ ^include #include #include int i , j , k, bottom [5 0] , up [50] , cut , R, S, smallest S , biggest S , bad_S , check; int wp, pw, len, 1 , w_p, p_w; float probRS[210] [50] , newprobRS [210] [50]; float min, total_prob ; char filename [3 0] = { 'R' , '_' , 'S' , '_' , 'W , 'p' , '_' , 'p' , 'w' , '_' , ' \ 0' } ; char add [15] ={'0','1','2','3','4','5','6','7','8','9','\0'} ; char inputf ile [30] ; char writefile[30]={'c' , ' o' , 'n' , ' \0' } ; char outputf ile [30] ; FILE *inp,*out; main ( ) { for (w_p=0 ; w_p<=9 ;w_p++) { wp=w_p*5 ; for (p_w=l ;p_w<=9 ;p_w++) { pw=p_w*5 ; strcpy ( inputf ile , filename) ; len=strlen (inputf ile) ; k=wp/10 ; inputf ile [len] =add [k] ; l=wp%10 ; inputf ile [len+1] =add [1] ; inputf ile [len+2 ] = ' _' ; k=pw/10 ; inputf ile [len+3 ] =add [k] ; l=pw%10 ; inputf ile [len+4] =add [1] ; inputf ile [len+5] = ' \0 ' ; printf ( " %s\n" , inputf ile) ; input ( ) ; s e t up ( ) ; total_prob=l ; do{ boarder ( ) ; cut_of f ( ) ; find_bad_S () ; 67 /*printf ( "total_prob=%f \n" , total_prob) ; */ }while (total_prob>0 . 95 ) ; strcpy (outputf ile , writef ile) ; len=strlen (outputf ile) ; k=wp/10 ; outputf ile [len] =add [k] ; l=wp%10 ; outputf ile [len+1] =add [1] ; k=pw/10; outputf ile [len+2] =add [k] ; l=pw%10 ; outputf ile [len+3] =add [1] ; outputf ile [len+4] =' \0 ' ; printf (" %s\n" , outputf ile) ; output ( ) ; } } } input ( ) { inp=f open ( inputf ile , "r") ; for (i = l ; i<=42 ; i + + ) { for (j=0; j<=200; j++) { f scanf (inp, " %f \t " , &probRS [ j ] [i] ) ; } f scanf ( inp , " \n" ) ; } f close ( inp) ; } setup ( ) { for (i = l ; i< = 42 ; i + + ) { for (j=0; j<=200; j++) { newprobRS[j] [i] =probRS [ j ] [i] ; } } } 68 boarder ( ) { for (i=l; i< = 42 ; i + + ) { j=0; do{ } while ( (newprobRS [j][i]= = 0)&&(j< = 200)) ; if ( j< = 200) { bottom [i] =j ; j=201; do{ j - - ; }while (newprobRS [j ] [i]==0); up [i] =j ; } else { bottom [i] =0 ; up [i] =0 ; } /*printf ( "bottom [%d] =%d\tup [%d] =%d\n" , i , bottom [i] , i , up [i] ) ; */ } } cut_of f ( ) { min=2 ; for (i = l ; i< = 42 ; i + +) { if (bottom [i] +up [i] !=0) { if (min>newprobRS [bottom [i] ] [i] ) { min=newprobRS [bottom [i] ] [i] ; cut =bot torn [i] *100+i ; } if (min>newprobRS [up [i] ] [i] ) { min=newprobRS [up [i] ] [i] ; cut=up [i] *10 0+i ; } } } R=CUt/100 ; 69 S=CUt%100 ; newprobRS [R] [S]=0; total_prob=total_prob-min; /*printf ( "cut off prob[%d] [%d] %f \ttotal_prob=%f \n" , R, S, probRS [R] [S] , total_prob) ; */ } f ind_bad_S () { do{ j=0; do{ } while (bottom [ j ] +up [ j ] ==0 ) ; smallestS=j ; j=43; do{ j--; } while (bottom [ j ] +up [ j ] ==0 ) ; biggestS=j ; check=0 ; j =smallestS ; do{ if (bottom [j ] +up [ j ] ==0) { check=l ; bad_S=j ; } } while ( (check= = 0) && ( j =bad_S ; i- - ) { for (k=bottom [i] ;k<=up[i] ;k++) { newprobRS [k] [i]=0; total_prob=total_prob-newprobRS [k] [i] ; } bottom [i] =0 ; up [i] =0; } /*printf ( "smallestS=%d\tbiggestS=%d\tbad_S=%d\n" , smallestS , bigg estS,bad_S) ;*/ }while (check==l) ; output ( ) { out =f open (output file , "w" ) ; fprintf (out, "%d\t\t%d\n" ,9,63) ; for (i = l ; i< = 42 ; i + + ) { if (bottom [i] ! =0) fprintf (out , " %d\t%d\n" , i , bottom [ i] ) ; } for (i=42 ;i>=l;i--) { if (bottom [i] ! =0) fprintf (out , " %d\t%d\n" , i , up [i] ) ; } f close (out) ; 71 453 R8\n-)y