Flyguyinthesky wrote: Thu Feb 03, 2022 10:23 am
https://patents.google.com/patent/WO2012122555A2
3. The method of claim 1 wherein the biological sequence data comprises DNA sequence data.
"Biological data networks and methods therefor"
Abstract
A system and method of transmitting and receiving packetized biological sequence data is disclosed. The method includes receiving, at a network interface of a node of a network, a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a compressed version of the biological sequence data. The method further includes extracting at least the compressed version of the biological sequence data from the data packet. In addition, the method includes storing the compressed version of the biological sequence data within a memory of the node.
Classifications
H04L45/00 Routing or path finding of packets in data switching networks
Inventor: Lawrence Ganeshalingam, Patrick Allen
Worldwide applications
2012 US WO WO US US WO WO WO US WO US WO US US
Application PCT/US2012/028652 events
2011-03-09
Priority to US201161451086P
2011-03-09
Priority to US61/451,086
2011-09-27
Priority to US61/539,942
2011-09-27
Priority to US201161539931P
2011-09-27
Priority to US201161539942P
2011-09-27
Priority to US61/539,931
2012-03-09
Application filed by Lawrence Ganeshalingam, Patrick Allen
2012-09-13
Publication of WO2012122555A2
2012-12-27
Publication of WO2012122555A3
Claims
We Claim:
1. A method for packetized transmission of biological sequence data, the method comprising: generating a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a representation of the biological sequence data relative to a reference sequence;
providing the data packet to a network interface; and
transmitting the data packet to a node of a network.
2. The method of claim 1 wherein the biological sequence data comprises polymeric data.
3. The method of claim 1 wherein the biological sequence data comprises DNA sequence data.
4. The method of claim 3 wherein the header information comprises [information relating to mutations within the DNA sequence data].
5. The method of claim 3 wherein the payload further includes embedded data relating to the DNA sequence data.
6. The method of claim 5 wherein the embedded data comprises correlative information relating to mutations within the DNA sequence data.
7. The method of claim 6 wherein the correlative information includes pharmacological information.
8. The method of claim 6 wherein the correlative information includes clinical result information.
9. The method of claim 5 wherein the embedded data is represented within the payload in a compressed form.
10. A method of receiving packetized biological sequence data, the method comprising: receiving, at a network interface of a node of a network, a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a compressed version of the biological sequence data;
extracting at least the compressed version of the biological sequence data from the data packet; and
storing the compressed version of the biological sequence data within a memory of the node.
11. The method of claim 10 wherein the biological sequence data comprises polymeric data.
12. The method of claim 10 wherein the biological sequence data comprises DNA sequence data.
13. The method of claim 12 wherein the header information comprises [information relating to mutations within the DNA sequence data].
14. The method of claim 12 wherein the payload further includes embedded data relating to the DNA sequence data.
15. The method of claim 14 wherein the embedded data comprises correlative information relating to mutations within the DNA sequence data.
16. The method of claim 15 wherein the correlative information includes pharmacological information.
17. The method of claim 15 wherein the correlative information includes clinical result information.
18. The method of claim 14 wherein the embedded data is represented within the payload in a compressed form.
19. A network node, comprising:
a network interface; a packet generator communicatively coupled to the network interface, the packet generator being configured to generate a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a representation of the biological sequence data relative to a reference sequence; and
a transmit controller configured to control transmission of the data packet from the network interface to a node of a network.
20. The network node of claim 19 wherein the biological sequence data comprises polymeric data.
21. The network node of claim 19 wherein the biological sequence data comprises DNA sequence data.
22. The network node of claim 21 wherein the header information comprises [information relating to mutations within the DNA sequence data].
23. The network node of claim 21 wherein the payload further includes embedded data relating to the DNA sequence data.
24. The network node of claim 23 wherein the embedded data comprises correlative information relating to mutations within the DNA sequence data.
25. The network node of claim 24 wherein the correlative information includes pharmacological information.
26. The network node of claim 24 wherein the correlative information includes clinical result information.
27. The network node of claim 23 wherein the embedded data is represented within the payload in a compressed form.
28. A network node, comprising: a network interface configured to receive a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a compressed version of the biological sequence data;
an input packet processor communicatively coupled to the network interface, the input packet processor being configured to extract at least the compressed version of the biological sequence data from the data packet; and
a memory in which is stored the compressed version of the biological sequence data.
29. The network node of claim 28 wherein the biological sequence data comprises polymeric data.
30. The network node of claim 28 wherein the biological sequence data comprises DNA sequence data.
31. The network node of claim 30 wherein the header information comprises information relating to mutations within the DNA sequence data.
32. The network node of claim 30 wherein the payload further includes embedded data relating to the DNA sequence data.
33. The network node of claim 32 wherein the embedded data comprises correlative information relating to mutations within the DNA sequence data.
34. The network node of claim 33 wherein the correlative information includes pharmacological information.
35. The network node of claim 33 wherein the correlative information includes clinical result information.
36. The network node of claim 32 wherein the embedded data is represented within the payload in a compressed form.
Description
BIOLOGICAL DATA NETWORKS AND METHODS THEREFOR
FIELD
[1001] This application is generally directed to processing and networking polymeric sequence information, including biopolymeric sequence information such as DNA sequence information.
BACKGROUND
[1002] Deoxyribonucleic acid ("DNA") sequencing is the process of determining the ordering of nucleotide bases (adenine (A), guanine (G), cytosine (C) and thymine (T)) in molecular DNA. Knowledge of DNA sequences is invaluable in basic biological research as well as in numerous applied fields such as, but not limited to, medicine, health, agriculture, livestock, population genetics, social networking, biotechnology, forensic science, security, and other areas of biology and life sciences.
[1003] Sequencing has been done since the 1970s, when academic researchers began using laborious methods based on two-dimensional chromatography. Due to the initial difficulties in sequencing in the early 1970s, the cost and speed could be measured in scientist years per nucleotide base as researchers set out to sequence the first restriction endonuclease site containing just a handful of bases. Thirty years later, the entire 3.2 billion bases of the human genome have been sequenced, with a first complete draft of the human genome done at a cost of about three billion dollars. Since then sequencing costs have rapidly decreased.
[1004] Today, the cost of sequencing the human genome is on the order of $5000 and is expected to hit the $1000 mark later this year with the results available in hours, much like a routine blood test. As the cost of sequencing the human genome continues to plummet, the number of individuals having their DNA sequenced for medical, as well as other purposes, will likely increase significantly. Currently, the nucleotide base sequence data collected from DNA sequencing operations are stored in multiple different formats in a number of different databases.
[1005] Such databases also contain annotations and other attribute information related to the DNA sequence data including, for example, information concerning single nucleotide polymorphisms (SNPs), gene expression, copy number variations and methylation sequence. Moreover, transcriptomic and proteomic data are also present in multiple formats in multiple databases. This renders it impractical to exchange and process the sources of genome sequence data and related information collected in various locations, thereby hampering the potential for scientific discoveries and advancements.
SUMMARY
[1006] In one aspect the disclosure is directed to a method for packetized transmission of biological sequence data. The method includes generating a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a representation of the biological sequence data relative to a reference sequence. The method also includes providing the data packet to a network interface and transmitting the data packet to a node of a network.
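As a rough illustration of what "a representation of the biological sequence data relative to a reference sequence" could look like, the sketch below records only the positions at which a sample differs from a reference. The function names `encode_relative` and `decode_relative`, and the restriction to equal-length sequences with substitutions only, are assumptions made for this example; the patent text quoted here does not specify an encoding.

```python
def encode_relative(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where the sample differs from the reference.

    Assumption for this sketch: equal-length sequences, substitutions only.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def decode_relative(reference: str, diffs: list[tuple[int, str]]) -> str:
    """Rebuild the sample by applying the recorded substitutions to the reference."""
    bases = list(reference)
    for pos, base in diffs:
        bases[pos] = base
    return "".join(bases)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diffs = encode_relative(reference, sample)      # two substitutions
restored = decode_relative(reference, diffs)    # identical to `sample`
```

When a sample is close to the reference, the list of differences is much shorter than the sequence itself, which is presumably why a reference-relative payload is attractive for transmission.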
[1007] In another aspect the disclosure is directed to a method of receiving packetized biological sequence data. The method includes receiving, at a network interface of a node of a network, a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a compressed version of the biological sequence data. The method further includes extracting at least the compressed version of the biological sequence data from the data packet. In addition, the method includes storing the compressed version of the biological sequence data within a memory of the node.
[1008] In a further aspect the disclosure pertains to a network node including a network interface and a packet generator communicatively coupled to the network interface. The packet generator is configured to generate a data packet including a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a representation of the biological sequence data relative to a reference sequence. The network node further includes a transmit controller configured to control transmission of the data packet from the network interface to a node of a network.
[1009] In yet another aspect the disclosure relates to a network node including a network interface configured to receive a data packet. In this aspect the data packet includes a first header containing network routing information, a second header containing header information pertaining to the biological sequence data, and a payload containing a compressed version of the biological sequence data. The network node further includes an input packet processor communicatively coupled to the network interface, the input packet processor being configured to extract at least the compressed version of the biological sequence data from the data packet. The network node also includes a memory in which is stored the compressed version of the biological sequence data.
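The two-header packet described in the summary can be sketched as follows. Everything concrete here is an assumption for illustration only: the fixed routing layout in `ROUTING_FORMAT`, the use of JSON for the second header, zlib for the compressed payload, and the `segment_id` field. The quoted text does not define any wire format.

```python
import json
import struct
import zlib

# Hypothetical first header: source node id, destination node id,
# length of the second header, length of the payload (network byte order).
ROUTING_FORMAT = "!IIHI"
ROUTING_SIZE = struct.calcsize(ROUTING_FORMAT)


def build_packet(src: int, dst: int, bio_header: dict, sequence: str) -> bytes:
    """Generate a packet: routing header + biological header + compressed payload."""
    header_bytes = json.dumps(bio_header, sort_keys=True).encode("utf-8")
    payload = zlib.compress(sequence.encode("ascii"))
    routing = struct.pack(ROUTING_FORMAT, src, dst, len(header_bytes), len(payload))
    return routing + header_bytes + payload


def receive_packet(packet: bytes, store: dict) -> dict:
    """Extract the compressed payload and keep it, still compressed, in `store`."""
    src, dst, header_len, payload_len = struct.unpack(
        ROUTING_FORMAT, packet[:ROUTING_SIZE]
    )
    header_end = ROUTING_SIZE + header_len
    bio_header = json.loads(packet[ROUTING_SIZE:header_end].decode("utf-8"))
    compressed = packet[header_end:header_end + payload_len]
    store[bio_header["segment_id"]] = compressed
    return bio_header


node_memory = {}
packet = build_packet(1, 2, {"segment_id": "chr1:0-12", "type": "DNA"}, "ACGTACGTACGT")
header = receive_packet(packet, node_memory)
stored_sequence = zlib.decompress(node_memory["chr1:0-12"]).decode("ascii")
```

The receiving side mirrors the claimed steps: it receives the packet, extracts the compressed payload, and stores it in the node's memory without decompressing it. Decompression happens only when the sequence is actually needed.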
BRIEF DESCRIPTION OF THE DRAWINGS
[1010] Various objects and advantages and a more complete understanding of the disclosure are apparent and more readily appreciated by reference to the following Detailed Description and to the appended claims when taken in conjunction with the accompanying Drawings wherein:
[1011] FIG. 1 provides a representation of a biological data unit comprised of a payload containing DNA sequence data and a header containing information having biological relevance to the DNA sequence data within the payload.
[1012] FIG. 2 illustratively represents a biological data model which includes a plurality of interrelated layers.
[1013] FIG. 3 depicts a biological data unit having a header and a payload containing an instruction-based representation of segmented DNA sequence data.
[1014] FIG. 4 is a logical flow diagram of a process for segmentation of biological sequence data and combining the segments with metadata attributes to form biological data units encapsulated with headers.
[1015] FIG. 5 depicts a biological data network comprised of representations of biological data linked and interrelated by an overlay network containing a plurality of network nodes.
[1016] FIG. 6 illustrates an exemplary protocol stack implemented at a network node together with corresponding layers of the OSI network model.
[1017] FIG. 7 shows a high-level view of various data types that may be processed by a group of network nodes in response to a query/request received from a client terminal.
[1018] FIG. 8 provides a block diagrammatic representation of the architecture of an exemplary network node.
[1019] FIG. 9A illustratively represents a process effected by a network node to implement a sequence variants processing procedure.
[1020] FIG. 9B is a flowchart of an exemplary variants processing procedure.
[1021] FIG. 10 illustratively represents the processing occurring at a network node configured to perform a specialized processing function.
[1022] FIG. 11 provides a representation of an exemplary processing platform capable of being configured to implement a network node.
[1023] FIG. 12 illustrates one manner in which data may be processed, managed and stored at an individual network node in an exemplary clinical environment.
[1024] FIGS. 13-18 illustratively represent the manner in which information within the layered data structure is utilized at an individual network processing node.
[1025] FIG. 19 illustrates the cooperative performance of an exemplary result-based network processing using multiple network nodes.
[1026] FIG. 20 illustrates an exemplary process flow corresponding to the result-based network processing illustrated by FIG. 19.
[1027] FIG. 21 depicts a biological data network comprised of a plurality of network nodes.
[1028] FIG. 22 is a flow chart representative of a set of exemplary processing operations performed by a biological data network in response to a user query or request.
[1029] FIG. 23 illustratively represents a separation of localized and network-based processing functions within a portion of a biological data network.
[1030] FIG. 24 provides an illustration of various functional interactions between network-based and localized applications.
[1031] FIG. 25 depicts a biological data network which includes a collaborative simulation network.
[1032] FIG. 26 is a flowchart representative of a manner in which information relating to various different layers of biologically-relevant data organized consistently with a biological data model may be processed at different network nodes.
[1033] FIG. 27 is a flowchart representative of an exemplary manner in which network nodes of a biological data network may cooperate to process a client request.
[1034] FIG. 28 is a flowchart representative of an exemplary sequence of operations involved in the identification and processing of sequence variants at a network node.
[1035] FIG. 29 is a flowchart representative of an exemplary sequence of operations carried out by network nodes of a biological data network in connection with processing of a disease-related query.
[1036] FIG. 30 is a flowchart representative of an exemplary sequence of operations involved in providing pharmacological response data in response to a user query concerning a specified disease.
[1037] FIG. 31 illustratively represents communication of DNA sequence data or other biological sequence information between a pair of devices supporting a biological data network.
[1038] FIG. 32 illustratively represents one manner in which multiple devices may support various operations within a biological data network.
[1039] FIG. 33 illustrates a biological data network configured to utilize techniques such as, for example, multiprotocol label switching ("MPLS") to facilitate the distribution of DNA sequence data and related information between client devices.
[1040] FIG. 34 illustrates a process for assigning biologically-relevant and network-related headers to segments of DNA sequence data stored within network-attached storage or received from a sequencing machine.
[1041] FIG. 35 illustratively represents a system and approach for using networking protocols otherwise employed for streaming media to facilitate the dissemination of DNA sequence data.
[1042] FIG. 36 is a block diagram of a high-speed sequence data analysis system.
DETAILED DESCRIPTION
INTRODUCTION
[1043] This disclosure relates generally to an innovative new biological data network and related methods capable of efficiently handling the massive quantities of DNA sequence data and related information expected to be produced as sequencing costs continue to decrease. The disclosed network and approaches permit such sequence data and related medical or other information to be efficiently stored in data containers provided at either a central location or distributed throughout a network, and facilitate the efficient network-based searching, transfer, processing, management and analysis of the stored information in a manner designed to meet the demands of specific applications.
[1044] The disclosed approaches permit such sequence data and any related medical, biological, referential or other information, be it computed, human-entered/directed or a combination thereof, to be efficiently transmitted and/or shared or otherwise conveyed from a centralized location or either partly or wholly distributed throughout the biological data network. These approaches also facilitate data formats and encodings used in the efficient processing, management and analysis of various "omics" (i.e., proto/onco/pharma) information. The innovative new biological data network or, equivalently, network, is configured to operate with respect to biological data units stored at various network locations.
[1045] Each biological data unit will generally be comprised of one or more headers associated with or relating to a payload containing a representation of segmented DNA sequence data or other non-sequential data of interest. The term header in this context refers to one or more pieces of information that have relevance to the payload, without regard to how or where such information is physically stored or represented within the network. As is discussed below, it will be appreciated that certain operations performed by the nodes or elements of the biological data network may be effected with respect to the entirety of the biological data units undergoing processing; that is, with respect to representations of both the segmented sequence data and headers of such biological data units.
[1046] However, the elements of the biological data network may perform other operations by, for example, comparing or correlating only the headers of the biological data units being processed. In this way network bandwidth may be conserved by obviating the need for network transport of segmented biological sequence data, or some representation thereof, in connection with various processing operations involving biological units nominally stored at different network locations.
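The bandwidth argument above can be made concrete with a small sketch in which two nodes exchange only headers and leave the sequence payloads where they are. The `BiologicalDataUnit` class, the `gene` attribute and the helper names are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class BiologicalDataUnit:
    segment_id: str
    headers: dict                       # small, cheap to exchange
    payload: bytes = field(repr=False)  # large, stays on the owning node


def header_view(units):
    """What a node would send to a peer: headers only, no payloads."""
    return {u.segment_id: u.headers for u in units}


def correlate(local_headers, remote_headers, key):
    """Pair up segments from two nodes whose headers agree on `key`."""
    matches = []
    for local_id, lh in local_headers.items():
        for remote_id, rh in remote_headers.items():
            if key in lh and lh[key] == rh.get(key):
                matches.append((local_id, remote_id))
    return sorted(matches)


node_a = [
    BiologicalDataUnit("a1", {"gene": "BRCA1"}, b"<compressed segment>"),
    BiologicalDataUnit("a2", {"gene": "TP53"}, b"<compressed segment>"),
]
node_b = [
    BiologicalDataUnit("b1", {"gene": "TP53"}, b"<compressed segment>"),
    BiologicalDataUnit("b2", {"gene": "EGFR"}, b"<compressed segment>"),
]
matches = correlate(header_view(node_a), header_view(node_b), "gene")
```

Only the dictionaries produced by `header_view` would need to cross the network in this sketch; the payload bytes never leave the node that holds them.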
[1047] The biological data network may be comprised of a plurality of network nodes configured with processing and analytical capabilities, which are individually or collectively capable of responding to machine or user queries or requests for information. As is discussed below, the functionality of the new biological data network may be integrated into the current architectural framework of the Open Systems Interconnection (OSI) seven-layer model and the Transmission Control Protocol and Internet Protocol (TCP/IP) model for network and computing communications. This will allow service providers to configure existing network infrastructure to accommodate biological sequence data to deliver optimized quality of service for medical and health professionals practicing genomics-based personalized medicine. Alternatively or in addition, the new biological data network may be realized as an Internet-based overlay network capable of providing biological, medical and health-related intelligence to applications supported by the network.
[1048] The new biological data network facilitates overcoming the daunting challenges associated with analysis of various pertinent omics data types together with, and in the context of, all relevant, available prior knowledge. In this regard the new biological data network may facilitate development of an integrated ecosystem in which distributed databases are accessible on a network and in which the data stored therein is configured to be linked by . This new biological data network may enable, for example, forming, securing, linking, searching, filtering, sorting, aggregating and connecting an individual's genome data with a layered data model of existing knowledge in order to facilitate extraction of new and meaningful information.
OVERVIEW OF BIOLOGICAL DATA UNITS AND HEADERS
[1049] As disclosed herein, the innovative new biological data network is configured to operate with respect to biological data units stored at various network locations. Biological data units can be considered as a set of information that is known or can be predicted to be associated with certain segments of genome sequences. Biological data units will generally be comprised of one or more headers associated with or relating to a payload containing a representation of segmented DNA sequence data or other non-sequential data of interest.
[1050] The biological data units may be generated by dividing source DNA sequences into segments and associating one or more headers (also referred to herein as "BI headers" or annotations or attributes) with one or more segments of genome sequence data. The various component parts of the header information contained in biological data units can be stored as XML metadata files in distributed storage containers that are accessible on a network. Furthermore, the different segments of whole genome sequence data contained in the payload of biological data units may be stored in multiple BAM files at various different locations on a network.
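A minimal sketch of the segmentation step described above: a source sequence is divided into segments, and each segment is paired with a header that can later receive annotations. Fixed-length segmentation, the dictionary layout and the `snp` attribute are assumptions for illustration; the quoted text does not say how segment boundaries are chosen, and real deployments would hold the payloads in files such as BAM rather than in Python strings.

```python
def segment_sequence(sequence: str, segment_length: int) -> list[dict]:
    """Split a sequence into fixed-length segments, each a small data unit."""
    units = []
    for start in range(0, len(sequence), segment_length):
        chunk = sequence[start:start + segment_length]
        units.append({
            "headers": {"start": start, "end": start + len(chunk)},
            "payload": chunk,
        })
    return units


def annotate(units: list[dict], position: int, name: str, value: str) -> None:
    """Attach an attribute to the header of the segment covering `position`."""
    for unit in units:
        if unit["headers"]["start"] <= position < unit["headers"]["end"]:
            unit["headers"][name] = value


units = segment_sequence("ACGTACGTACGTAC", 5)
annotate(units, 7, "snp", "hypothetical-variant-id")
```

Here position 7 falls in the second segment, so only that unit's header gains the `snp` attribute; the payloads themselves are untouched.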
[1051] Each BI header can be considered a specific piece of information or set of information that may be associated with or have biological relevance to one or more specific segments of DNA sequence data within the payload of the biological data unit. It should be appreciated that any information that is relevant to the segmented sequence data payload of a biological data unit can be placed in the one or more headers of the data unit or, as is discussed below, within headers of other biological data units. It should also be clearly understood that the information contained in any biological data unit can be highly distributed and network linked in such a manner that allows filtration and dynamic recombination of any permutation of associated attributes and sequence segments.
[1052] The headers may be arranged in any order, whether dependent upon or independent of the payload data. However, in one embodiment the headers are each respectively associated with at least one layer of a biological data model of existing knowledge that is representative of the biological sequence data which, for example, may be stored as BAM files within the payloads of the distributed biological data units with which such headers or XML metadata attributes are associated.
[1053] Although the present disclosure provides specific examples of the use of BI headers in the context of a layered data model, it should be understood that BI headers may be realized in essentially any form capable of embedding information within, or associating such information with, all or part of any biological or other polymeric sequence or plurality thereof. For example, one or more BI headers could be associated with any permutation of segments of DNA sequence or other such polymeric sequence or within any combination thereof, in any analog or digital format.
[1054] The BI headers could also be placed within a representation of associated polymeric sequence data, or could be otherwise associated with any electronic file or other electronic structure representative of molecular information. In other words, the one or more metadata attributes that are stored in multiple storage containers on a network may compose headers that are specifically associated with at least one segment of sequence contained in a file transfer session.
[1055] In the case in which data is embedded within DNA or other biological sequence information, the BI headers or tags including the data may be placed in front of, behind or in any arbitrary position within any particular segmented sequence data or multiple segmented data sequences. In other words, in one particular embodiment of the invention, information that is associated directly or indirectly may be stored within the base calls of reads that are contained in BAM files or any other sequence file format or internal memory structures, for example. This approach would involve a method for integrating at least one specific attribute of information that is associated with a genome sequence between and/or among the base calls contained within reads of sequence data files.
[1056] In addition, the data may be embedded in a contiguous or dispersed manner among and within the base calls of the segmented sequence data. When this highly structured and layered approach is applied to the storage configuration of this sequence data and associated information it will advantageously facilitate the computationally efficient, effective and rapid analysis of, for example, the massive quantities of genome sequence data being generated by next-generation, high-throughput DNA sequencing machines.
[1057] In particular, distributed biological data units containing segmented DNA sequence data and associated attributes may be stored, sorted, filtered and operated on for various scope and depth of analysis based upon the said associated information which is contained within the headers. This obviates the need to manipulate, transfer and otherwise breach the security of the segmented DNA sequence data in order to process and analyze such data.
[1058] One embodiment of the layered data model of the existing body of relevant knowledge includes not only biologically-relevant data but also other metadata which are associated with the nucleic acid sequence files. Such MetaIntelligence™ metadata may include, for example, facts, information, knowledge and prediction derived from biological, clinical, pharmacological, environmental, medical or other health-related data, including but not limited to other biological sequence data such as methylation sequence data as well as information on differential expression, alternative splicing, copy number variation and other related information.
[1059] The DNA sequence information included within the biological data units described herein may be obtained from a variety of sources. For example, DNA sequence information may be obtained "directly" from DNA sequencing apparatus, as well as from sequence data files that are stored in private and publicly accessible genome data repositories. Additionally, it may be computationally derived and/or manually gathered or inferred. In the case of the database of Genotypes and Phenotypes at the National Center for Biotechnology Information at the National Library of Medicine, the DNA sequence entries may be stored as BAM, SRF, fastq as well as in the FASTA format, which includes annotated information concerning the sequence data files. In one embodiment certain of the information contained within the one or more headers of each biological data unit would be obtained from publicly accessible databases containing genome data sequences.
[1060] Turning now to FIG. 1, a representation is provided of a biological data unit comprised of a payload containing DNA sequence data and a header containing information having biological relevance to the DNA sequence data within the payload. Furthermore, it should be appreciated that information contained in a particular header may also point to or associate with sequence data that is stored in at least one data container as the payload portion of biological data units.
[1061] In addition, it should be understood that the header information and sequence payload that is contained within biological data units relate directly to attributes in XML metadata files and BAM sequence files, respectively. Any key value can associate with one or more sequence files or segments of sequence within such files. In one particular aspect of the disclosed approach, the key value may be information of or pertaining to a drug or its effect and the sequence may be a segment of sequence contained in a GeneTorrent Object file transfer session.
[1062] The header information may associate with or relate to, for example, a microRNA sequence or the regulatory region of a gene or interaction with another gene product from at least one molecular pathway. Since the example that is presented as FIG. 1 shows that the payload contains DNA sequence data, the biological data unit of FIG. 1 may also be referred to herein as a DNA protocol data unit (DPDU). The DPDU can be considered as distributed biological data units that are encapsulated with information for transfer, control and other data that is relevant to the protocol.
[1063] In one embodiment, the exemplary biological data unit that is depicted in FIG. 1 would be associated with the DPDUs that are encapsulated and involved in a computer-implemented method for processing data units. As a further example, RNA sequence data, which may be derived from RNA-seq or deduced from the DNA sequence data, could be included within RNA protocol data units (RPDUs), each comprised of a plurality of RNA-specific headers and a payload comprised of the RNA sequence data. The header information contained in distributed components of RPDUs may include but not be limited to information on differential expression, splicing, processing and other posttranscriptional modifications of RNA.