What Is Big Data, and What Are Its Features?

When classical methods, tools and infrastructures cannot be used to analyze, process and transfer data obtained from various sources, the data can be described as Big Data. For data to count as big data, it must exhibit one or more of the components known as the 5V (Volume, Velocity, Variety, Value, Veracity).

Volume: Volume is directly related to data size, but it does not refer only to the size of the data in bytes. If the available data cannot be analyzed within the desired time, volume can be considered a big data component.

Velocity: Data from social media, phones, sensors and similar sources can be produced very quickly. When the data produced needs to be processed and analyzed in real time, velocity emerges as a big data component.

Variety: In data analysis processes that require many different types of data to be used together, variety comes into play as a big data component.

Veracity: The accuracy of the data is an important component that affects the analysis result. When analyzing data containing noise, it is necessary to make sure that the data is correct.

Value: Converting data into value is the main purpose of the analysis process. It is the most important feature that defines big data.


Types of Analytics in Data Science

Institutions that want to base their decisions on large flows of data try to make sense of this data by building analytical solutions on it. These solutions can be grouped under four headings.

Descriptive Analytics: Descriptive analytics answers the question "What happened?". It gives an idea of the past using raw data but does not explain the reasons. This type of analysis is performed with tools such as business intelligence applications and dashboards.

Diagnostic Analytics: Diagnostic analytics answers the question "Why did it happen?". It uses historical data to give an idea of the root cause of the results.

Predictive Analytics: Predictive analytics answers the question "What will happen?" by estimating the probability of a future outcome.

Prescriptive Analytics: Prescriptive analytics answers the question "How do I make it happen?". It indicates how to achieve the desired result by predicting the effects of actions before they are taken.
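
The difference between the first and third headings can be illustrated with a minimal sketch. It assumes a hypothetical list of monthly sales figures (invented for illustration) and uses a simple least-squares trend line as the predictive step; real predictive analytics would typically use richer models and data.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative only, not real data).
monthly_sales = [100, 110, 120, 130, 140, 150]


def describe(values):
    """Descriptive analytics: summarize what happened."""
    return {"total": sum(values), "average": mean(values), "max": max(values)}


def fit_trend(values):
    """Fit a least-squares line through (0, v0), (1, v1), ...

    Returns (slope, intercept).
    """
    xs = range(len(values))
    x_mean = mean(xs)
    y_mean = mean(values)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_mean - slope * x_mean


def predict_next(values):
    """Predictive analytics: extrapolate the fitted trend one step ahead."""
    slope, intercept = fit_trend(values)
    return slope * len(values) + intercept


summary = describe(monthly_sales)      # what happened
forecast = predict_next(monthly_sales)  # what will probably happen next
```

Here `describe` only reports on the past, while `predict_next` makes a statement about a month that has not yet been observed.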


Big Data and Advanced Analytics Use Case Examples

360° Customer View: Big data solutions make it possible to obtain a 360-degree customer view by combining data such as past online and offline interactions, social media data and purchasing history.


Providing a Personalized User Experience: Unlike the traditional e-commerce experience, it becomes possible to track the products that customers are interested in and to offer the user personalized suggestions that include these products.

Suggestion Engines: Suggestion engines are algorithms implemented to provide appropriate offers for each customer. An example is presenting similar products that may be of interest to the customer while shopping on an e-commerce site.
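
As a rough illustration of the idea, the sketch below suggests products that were most often bought together with a given product. The baskets and product names are hypothetical, and co-occurrence counting is only one simple technique among many; production suggestion engines generally use more sophisticated models.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase baskets (illustrative only).
baskets = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"phone", "phone case"},
    {"laptop", "laptop bag"},
    {"mouse", "keyboard"},
]


def build_cooccurrence(baskets):
    """Count how often each ordered pair of distinct items shares a basket."""
    counts = {}
    for basket in baskets:
        for a, b in permutations(sorted(basket), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def suggest(item, counts, k=2):
    """Return up to k items most often bought together with `item`.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(counts.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


cooccurrence = build_cooccurrence(baskets)
laptop_suggestions = suggest("laptop", cooccurrence)
```

A product that never appears in any basket simply yields an empty list of suggestions.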

Price Optimization: In business-to-consumer and business-to-business trade, big data analysis makes it possible to monitor how competitors price their products and to determine the best pricing by looking at historical data.

Predicting Trends: With a big data strategy, you can predict trends in the market and the products that will sell next. Data from social media posts and users' internet browsing habits can be combined, and sentiment analysis can determine whether the comments about a product are positive or not.

Fraud Prevention: More sophisticated fraud prevention systems can be developed with big data analytics and machine learning. With big data analytics, changing trends such as fraud concentrating in certain geographical locations (e.g. airports) can be identified quickly.

Reducing Data Warehouse Burden: Many businesses are replacing or supplementing their data warehouses with open source big data solutions such as Hadoop. Hadoop-based solutions can provide faster performance while reducing license fees and other costs.

Log Data Analysis: The exponential growth of commercial activities and transactions creates a need to store, process and present log data in the most efficient and cost-effective way. Big data solutions meet this need.

Preventive Maintenance: Businesses in manufacturing, energy, construction, agriculture, transportation and similar sectors can take advantage of big data and industrial Internet of Things technologies to improve equipment maintenance. With big data solutions, it is possible to prevent accidents and costly line shutdowns by analyzing data in real time and predicting when a problem will occur.

Internet of Things: Internet of Things and big data technologies can be used in all sectors to collect data and gain insights to take action. Monitoring product movements, weather conditions and security camera images are examples.


SAFİR Big Data Infrastructure

With the "Big Data Analysis Solutions" of the Cloud Computing and Big Data Research Laboratory (B3LAB), valuable information can be extracted by processing and analyzing large amounts of data in different forms. SAFİR Big Data offers big data storage, data transfer and analytics solutions that are easy to install and use. The big data infrastructure can be used physically, through the installation on the servers in the B3LAB Prototype Data Center, or virtually, through the services on the SAFİR Infrastructure. Both installations allow the processing of batch and streaming data on an infrastructure that is scalable, highly available, and built on distributed and redundant hardware.

SAFİR Big Data provides solutions covering big data architecture, data transfer and processing, big data analytics, big data ecosystem training, and proof of concept (PoC) applications.

Within the scope of big data architecture solutions, the following work is carried out: Hadoop cluster installation, configuration, management and optimization; configuration and optimization of the operating system used; configuration and optimization of big data file systems; and design and installation of the big data network architecture.

Within the scope of data flow and processing solutions, the work covers streaming data management and processing; batch data transfer, management and processing; and installation, configuration and optimization of NoSQL databases.

Anomaly detection, prediction, classification and cluster analysis are performed within the scope of big data analytics solutions.

Training is provided on big data technologies, big data analytics and machine learning. The training is supported by hands-on exercises on clusters prepared in a virtual big data environment.

Using the SAFİR Big Data infrastructure, work has been carried out on data center monitoring and server load prediction, automatic category assignment for call center records, creation of a genomic variation analysis platform, root cause analysis of exam failure among students, analysis of the growth rate and structure of data, and needs analysis for big data infrastructure and tools.


SAFİR Big Data Projects

Project of Data Governance Tool Development for Turkish Customs: The Technical Assistance Project for Improving the Detection Capacity of Turkish Customs Enforcement is a data management project that uses big data and machine learning techniques. It aims to increase the administrative, technical and operational capacity of the Commerce Department's customs supervision and control functions across the whole Customs Zone of Turkey, and to improve and strengthen the structure of the Customs Administration Coordination Center (CECC). The overall aim of the project is to implement integrated border management in line with the EU Acquis and European Standards.

X-Ray Image Analysis Project for Preventing Smuggling in Customs: The project aims to collect, in a single center, the X-ray data used in anti-smuggling work across the Customs Zone of Turkey, and to analyze it by processing the images obtained from X-ray devices so that anomalies and smuggling are identified automatically.

Turkish Statistical Institute - Big Data Advanced Analytics Project: The aim is to design a system that enables the storage, processing and analysis of daily price information and job postings, tagged with category and sub-category information and collected from different stores through websites and other sources, as batch and streaming data in the big data ecosystem.

Republic of Turkey Ministry of National Education - TEOG Data Analysis: Analyses were performed on the data provided by the Ministry of National Education regarding the students who could not be placed after the TEOG exam. Spark and Spark MLlib were used to find relationships in the data.

SAFİR Bio - B3LAB Variation Analysis Platform: In bioinformatics studies (e.g. rare diseases, population genetics), variation files can be used to work with high volumes of genetic data. A system that supports transferring high-volume genome variation information, searching for variations in this data, filtering, prioritizing, and running complex queries based on genotype and inheritance characteristics will enable bioinformatics researchers to work efficiently on large amounts of data. The aim is therefore to develop a platform using scalable, distributed and in-memory computing technologies. Using gene data on the platform, needs in variation analysis, drug-active ingredient analysis and genetic engineering data science can be met.

Data Center Monitoring and Server Load Prediction: The data produced by the servers in two different data centers was transferred to the SAFİR Big Data infrastructure and analyzed there. A Lambda architecture was used to transfer the server data to the big data environment as streaming data and to analyze the transferred data both as batch and as streaming data. A machine learning model was trained on the server data stored as a batch. This model was then used to predict the next-step values of the server data arriving as a stream.
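
The batch-then-stream pattern described here can be sketched in a few lines. The model actually used in the project is not stated, so the sketch assumes a simple linear next-step model fitted by least squares; the load values and function names are hypothetical.

```python
def fit_next_step_model(history):
    """Batch step: least-squares fit of load[t+1] ~= a * load[t] + b.

    `history` is a list of stored readings in time order.
    Returns the coefficients (a, b).
    """
    xs, ys = history[:-1], history[1:]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    variance = sum((x - x_mean) ** 2 for x in xs)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    a = covariance / variance
    return a, y_mean - a * x_mean


def predict_stream(model, stream):
    """Streaming step: for each arriving reading, yield the predicted next one."""
    a, b = model
    for reading in stream:
        yield a * reading + b


# Hypothetical server load readings stored as a batch (illustrative only).
stored_batch = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
model = fit_next_step_model(stored_batch)

# Hypothetical readings arriving later as a stream.
predictions = list(predict_stream(model, [22.0, 24.0]))
```

The point of the split is that the expensive fitting runs on the stored batch, while each streaming reading only needs one cheap evaluation of the fitted model.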

Big Data Infrastructure and Tools Needs Analysis: A report was prepared on the need for big data infrastructure and tools for data that is stored in relational database management systems using traditional methods and whose metrics, such as growth rate and query time, are known. Synthetic data was generated to simulate data growth on a monthly basis, and query performance was measured with a relational database management system, a NoSQL database and big data infrastructure tools (Spark, Hive).
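
A minimal sketch of this kind of growth simulation is given below. It uses an in-memory SQLite database purely as a stand-in for the relational system; the table name, growth rate and row counts are assumptions made for illustration, and the study's actual tooling and figures are not reproduced here.

```python
import random
import sqlite3
import time


def simulate_growth(months, initial_rows, monthly_growth_rate, seed=0):
    """Insert synthetic rows month by month and time one query per month.

    Returns a list of (month, total_rows, query_seconds) tuples.
    """
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (month INTEGER, value REAL)")

    results = []
    rows_to_add = initial_rows
    for month in range(1, months + 1):
        # Generate this month's synthetic data.
        conn.executemany(
            "INSERT INTO measurements VALUES (?, ?)",
            ((month, rng.random()) for _ in range(rows_to_add)),
        )
        # Time a simple aggregate query over everything stored so far.
        start = time.perf_counter()
        total_rows, _average = conn.execute(
            "SELECT COUNT(*), AVG(value) FROM measurements"
        ).fetchone()
        elapsed = time.perf_counter() - start
        results.append((month, total_rows, elapsed))
        # Next month's volume grows by the assumed rate.
        rows_to_add = int(rows_to_add * (1 + monthly_growth_rate))

    conn.close()
    return results


report = simulate_growth(months=3, initial_rows=1000, monthly_growth_rate=0.5)
```

Running the same loop against other storage systems and comparing the recorded query times per month would give the kind of comparison the report describes.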