In bioinformatics studies (e.g., rare diseases, population genetics), variation files are used to represent large volumes of genetic data. A system that can ingest high-volume genome variation information; search, filter, and prioritize variations on this data; and run complex queries based on genotype and inheritance characteristics will enable bioinformatics researchers to work efficiently on large amounts of data. For this reason, the Safir Bio platform, with a web-based user interface, was developed using scalable, distributed, in-memory computing technologies. The system provides an easy and efficient platform for analyzing and querying high-volume genomic variation data, and also provides an infrastructure for machine learning and advanced analytical studies.
The genome of an individual consists of approximately 3 billion base pairs and, depending on the quality of the sequencing, produces roughly 200 gigabytes of data after wet-laboratory procedures. Since the genomes of any two individuals are 99.9% similar, the preferred method in genome research is to align an individual's DNA sequence against the reference genome of its species and then identify its differences from that reference. The base-sequence differences found in this way are recorded in variation files in VCF (Variant Call Format). VCF files average around 125 megabytes per individual. A VCF file stores the variations, genes, individuals, and annotations of a study in a standard format.
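To make the file layout concrete, the following minimal sketch parses one VCF data line in plain Python. The fixed column names follow the VCF specification, but the example line and the `parse_vcf_line` helper are illustrative and not part of Safir Bio:

```python
# Minimal sketch of parsing one VCF data line into a dict.
# The fixed columns (CHROM .. INFO) come from the VCF specification;
# the example line below is made up for illustration.

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Split a tab-separated VCF data line; fields after FORMAT are per-sample."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])          # position is numeric
    record["SAMPLES"] = fields[9:] if len(fields) > 9 else []
    return record

example = "chr1\t10177\trs367896724\tA\tAC\t100\tPASS\tAF=0.425\tGT\t0/1"
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["SAMPLES"])
```

A real loader would also handle the `##` meta-information header lines and the `#CHROM` column header before the data lines begin.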
Safir Bio is a data management platform created for managing genomic variation files at big-data scale and for analysis on these files. Safir Bio supports loading VCF files containing high-volume genome variation information; searching, filtering, and prioritizing the variations in this data; and performing complex queries based on genotype and inheritance characteristics.
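To illustrate the kind of genotype- and inheritance-based query described above, here is a hypothetical sketch in plain Python. The variant records, sample names, and the `is_denovo_candidate` helper are illustrative assumptions, not Safir Bio's actual API:

```python
# Illustrative inheritance-based filter for a de novo candidate:
# the child carries the variant (heterozygous) while both parents
# are homozygous reference. All fields and names are hypothetical.

variants = [
    {"pos": 10177, "gt": {"child": "0/1", "mother": "0/0", "father": "0/0"}},
    {"pos": 10352, "gt": {"child": "0/1", "mother": "0/1", "father": "0/0"}},
]

def is_denovo_candidate(variant):
    gt = variant["gt"]
    return (gt["child"] == "0/1"
            and gt["mother"] == "0/0"
            and gt["father"] == "0/0")

candidates = [v["pos"] for v in variants if is_denovo_candidate(v)]
print(candidates)  # only the first variant qualifies
```

In the platform itself such predicates would be evaluated over the full variant store rather than an in-memory list, but the query shape is the same.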
Researchers have difficulty analyzing the genome data produced at a rapidly increasing rate by next-generation sequencing, due to the lack of standard applications and formats. The VCF format, which contains variation data and was popularized by the 1000 Genomes Project, is a widely used file format, so the basic components of any search engine built on DNA data will involve examining files and data in this format. Although the literature contains studies on individual VCF files or small numbers of them, studies on managing large collections of VCF files together are more limited. With Safir Bio, filtering and querying can be performed easily on genomic variation data that is large in both count and volume.
The infrastructure required for population studies on large-scale data sets containing genome data from many individuals is not a system that every researcher can easily set up. In recent years, big data technologies in particular have been preferred for working on genome data. As cheaper sequencing technologies allow large-scale genome data to be collected quickly, it is important that bioinformatics researchers can easily carry out population studies on it. With the Safir Bio platform, which we developed with flexible, simple interfaces on top of distributed in-memory computing systems, variation analysis (filtering, querying, etc.) can be performed for a single individual as well as for population studies covering the variation information of many individuals. This removes the data conversion and transfer steps caused by chaining many different software tools, and allows infrastructures that perform parallel operations on data in a distributed file system to be used directly.
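The partition-wise parallelism mentioned above can be sketched as a toy map-reduce in plain Python. In a real distributed setting each partition would live on a separate node of the distributed file system and the map phase would run in parallel across nodes; the data and helper names here are illustrative:

```python
# Toy map-reduce sketch: variants are split into partitions (as a
# distributed file system would shard them), each partition is
# filtered independently, and the partial results are merged.

from functools import reduce

partitions = [
    [{"pos": 100, "qual": 99}, {"pos": 200, "qual": 10}],
    [{"pos": 300, "qual": 87}, {"pos": 400, "qual": 5}],
]

def filter_partition(part, min_qual=30):
    # This step has no cross-partition dependency, so it parallelizes.
    return [v["pos"] for v in part if v["qual"] >= min_qual]

mapped = [filter_partition(p) for p in partitions]   # map phase (parallelizable)
result = reduce(lambda a, b: a + b, mapped, [])      # reduce phase (merge)
print(result)  # positions that pass the quality filter
```

Because only the small filtered lists are merged, the heavy per-variant work stays local to each partition, which is what makes this pattern scale.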
Different software tools have been developed and databases used in the literature for analyzing variation files. In terms of scaling, however, big data software tools are well suited to genomic data. It is important for advanced genetic research that bioinformatics researchers can run processes that are difficult or impossible on desktop computers, using computing infrastructures in modern distributed systems that operate in memory rather than on disk. The system we have developed works on variation files of different sizes. The infrastructure of our variation analysis platform, which scales well and supports in-memory distributed computing, is well suited to adding new features and to machine learning studies.