
The Future of Database Management in the Era of Big Data and Cloud Computing in Nigeria


CHAPTER TWO

LITERATURE REVIEW

INTRODUCTION

In today's business environment, the importance of data cannot be overemphasized, as economic and social activities have long relied on data to survive. Today, however, the increased volume, velocity, variety, and social and economic value of data signal a paradigm shift towards a data-driven socioeconomic world. In parallel with the continuous and significant growth of data have come better data access, the availability of powerful ICT systems, and ubiquitous connectivity of both systems and people. This has led to intensified activity around Big Data and Big Data value. Powerful tools have been developed to collect, store, analyze, process, and visualize huge amounts of data, and open data initiatives have been launched to provide broad access to data from the public sector, business and science [1].

In the journal Science (2008), "Big Data" is defined as "the representation of the progress of human cognitive processes, which usually includes data sets with sizes beyond the ability of current technology, method and theory to capture, manage, and process the data within a tolerable elapsed time". Similarly, Gartner defines Big Data as "high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" [2]. Big Data can also be described as a massive volume of structured and unstructured data that is so large that it is very difficult to process using traditional methods and current software technologies [3]. Furthermore, Big Data is the elusive, all-encompassing name given to enormous datasets stored on enterprise servers, for example the data held by Google (which organizes 100 trillion web pages), Facebook (with 1 million gigabytes of disk storage and data that keeps increasing on a daily basis), and YouTube (which takes in 20 petabytes of new video content per year). Big Data is also used in science, for applications such as weather forecasting, earthquake prediction, seismic processing, molecular modeling, and genetic sequencing. Many of these applications require servers with tens of petabytes of storage, such as the Sequoia (Lawrence Livermore) and Blue Waters (NCSA) supercomputers [4].

The three main terms that generally signify Big Data are:

  • Volume: the amount of data generated on a daily basis, which is very large and keeps increasing with time.
  • Variety: data today is created in different types, forms and formats, such as emails, video, audio, transactions etc.
  • Velocity: the speed at which data is produced and how fast that data needs to be processed in order to meet individual demand.

The other two properties that need to be considered critically when talking about Big Data are Variability and Complexity, as depicted by [3]:

  • Variability: this goes along with velocity, and it has to do with how inconsistent the flow of data can be with respect to time and how far it can go.
  • Complexity: the complexity of the data must be considered, especially when there are multiple sources of data; the data must be rearranged into a format that is suitable for processing.
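To make the volume and velocity concerns above concrete, the short sketch below (a minimal illustration only, not drawn from the works cited; the file name events.log and the chunk size are hypothetical) reads a large file in fixed-size chunks so the whole dataset never has to fit in memory, and reports a rough throughput figure.

```python
import time

def process_in_chunks(path, chunk_size=64 * 1024 * 1024):
    """Read a large file in fixed-size chunks so it never has to fit in memory."""
    total_bytes = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)   # read at most chunk_size bytes
            if not chunk:
                break
            total_bytes += len(chunk)    # a real job would analyze the chunk here
    elapsed = time.time() - start
    return total_bytes, total_bytes / elapsed if elapsed > 0 else 0.0

if __name__ == "__main__":
    # "events.log" is a hypothetical large log file used only for illustration.
    size, rate = process_in_chunks("events.log")
    print(f"processed {size} bytes at {rate / 1e6:.1f} MB/s")
```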
Recent trends in technology have not only supported the collection of large amounts of data but have also facilitated its effective management. Such technologies also support the daily transactions made all over the world, ranging from bank transactions and Walmart customer transactions to social media transactions such as the ones generated by Facebook, Twitter, Instagram, YouTube etc. Some common characteristics of Big Data, as highlighted by [5], include the following: a) Big Data incorporates both structured and unstructured data. b) Big Data addresses speed and scalability, mobility and security, flexibility and stability. c) In Big Data, the time it takes to retrieve information is critical, which underlines the importance of the various data sources, including mobile devices, radio frequency identification, tablets, the web and a growing list of automated sensory technologies.

There are various definitions of cloud computing; however, most researchers agree with [6], who defines Cloud Computing as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction". This definition gives Cloud Computing a fairly general description. Cloud Computing can be seen as a technology that depends on the sharing of computing resources rather than on local servers or personal devices handling users' applications. In Cloud Computing, the word "Cloud" can be used interchangeably with "Internet", so Cloud Computing means a type of computing in which services are delivered through the Internet or a network of servers. The main aim of Cloud Computing is to make use of computing power to execute a large number of instructions per second. Cloud Computing uses networks of large groups of servers with specialized connections to distribute data processing among the servers. Instead of installing a software suite on each computer, cloud computing technologies make it possible to install a single piece of software on one host computer, and users log into a web-based service that hosts all the programs they need to access. There is a significant workload shift in a cloud computing system, which relieves local computers of the burden of having to host many programs and applications.

Cloud computing simply performs a desired computation (mostly on big data) on a remote server that a subscriber has configured and controls, rather than on the subscriber's local desktop PC or tablet. The leading commercial cloud computing providers include Amazon EC2, Microsoft Azure, and Google Compute Engine (still in beta at the time of writing). The service charges levied by cloud computing providers for using their platforms can be as low as $0.10 per CPU-hour for renting MIPS, memory, disk space and other services [7]. Cloud servers can house up to a few hundred thousand processor cores plus many terabytes of disk storage, and hence offer very high computational power. Cloud computing also offers virtualization technology that gives users the ability to select operating systems, applications and network interconnects, providing additional software flexibility for a modest rental fee. Therefore, the solution to Big Data lies in cloud computing technologies.
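As a simple worked example of the rental pricing mentioned above, the sketch below estimates the cost of a batch job at the $0.10 per CPU-hour figure quoted from [7]; the core count and runtime are made-up values for illustration only, and real providers add separate charges for storage and data transfer.

```python
def cpu_rental_cost(cores, hours, rate_per_cpu_hour=0.10):
    """Estimate compute rental cost as cores x hours x price per CPU-hour."""
    return cores * hours * rate_per_cpu_hour

if __name__ == "__main__":
    # Hypothetical job: 64 cores for 12 hours at $0.10 per CPU-hour (rate from [7]).
    cost = cpu_rental_cost(cores=64, hours=12)
    print(f"Estimated compute cost: ${cost:.2f}")  # -> $76.80
```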

WHAT IS BIG DATA

There are many definitions of Big Data, but a widely accepted one is that of Gartner (2013), which defines Big Data as "…high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight, decision making and process optimization". The continuous growth of technology and the Internet in today's digital universe, and the integration of computing into virtually every facet of human life, are no doubt making the concept of Big Data a ubiquitous paradigm for deploying novel technologies and applications hitherto not practicable by conventional methods. The unending quest to conquer the challenges posed by the management and control of the Big Data revolution is currently leading the entire ICT community towards a plethora of new systems. New activities spring up, and more and more data sets are being created, faster than ever before. These new data sets, in some cases, hold the key to unlocking new streams of activity that will give both government and the public a better understanding of business and make it more efficient and effective.


Big Data is a term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. The trend towards larger data sets is due to the additional information derivable from the analysis of a single large set of related data, as compared with separate smaller sets containing the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions" [9]. Big Data is a term applied to data sets whose size is beyond the capability of commonly used software tools to capture, manage and process. The sheer size of the data, combined with the complexity of analysis and the commercial imperative to create value from it, has led to a new class of technologies and tools to tackle it. The term Big Data tends to be used in multiple ways, often referring both to the type of data being managed and to the technology used to manage it. For the most part, these technologies originated from companies such as Google, Amazon, Facebook and LinkedIn, where they were developed for each company's own use in order to analyze the massive amounts of social media data they were dealing with [11].

As of 2012, limits on the size of data sets that were feasible to process in a reasonable amount of time were on the order of exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, about 2.5 exabytes (2.5 × 10^18 bytes) of data were created every day. The challenge for large enterprises is determining who should own Big Data initiatives that straddle the entire organization [7].

Big Data is a relatively new concept, and many definitions have been given to it by researchers, organizations and individuals. As far back as 2001, industry analyst Doug Laney (currently with Gartner) articulated the mainstream definition of Big Data as the three Vs: Volume, Velocity and Variety. At SAS, two additional dimensions are considered when thinking about Big Data: Variability and Complexity [8]. Oracle defined Big Data in terms of four Vs: Volume, Velocity, Variety and Value [11]. Having gone through the literature on Big Data, in this work we would like to bring the definition of Big Data to a new state based on its genesis, sheer size and value.
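The figures above (roughly 2.5 × 10^18 bytes created per day, and per-capita storage capacity doubling about every 40 months) lend themselves to a simple back-of-the-envelope projection. The sketch below only illustrates that doubling rule under an arbitrary starting capacity; it is not a forecast taken from the cited sources.

```python
def projected_capacity(initial_capacity, months, doubling_period_months=40):
    """Project a quantity that doubles every `doubling_period_months` months."""
    return initial_capacity * 2 ** (months / doubling_period_months)

if __name__ == "__main__":
    # Hypothetical starting point: 1.0 unit of per-capita storage today.
    for years in (5, 10):
        months = years * 12
        factor = projected_capacity(1.0, months)
        print(f"after {years} years: ~{factor:.1f}x today's capacity")
        # 5 years (60 months) -> ~2.8x; 10 years (120 months) -> 8.0x
```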

We define Big Data in terms of five Vs and a C. Together, these form a reasonable test for determining whether a Big Data approach is the right one to adopt for a new era of analysis. The five Vs and the C are listed below, and a small profiling sketch follows the list:

  • Volume: the size of the data. It is often very limiting to talk about data volume in any absolute sense; as technology marches forward, absolute numbers quickly become outdated, so it is better to think about volume in a relative sense instead. If the volume of data you are looking at is an order of magnitude or more larger than anything previously encountered in your industry, then you are probably dealing with Big Data. For some companies this might be tens of terabytes; for others it might be tens of petabytes [11].
  • Velocity: data is streaming in at unprecedented speeds and must be dealt with in a timely manner [8]. The rate at which data is received and has to be acted upon is becoming much closer to real time. While it is unlikely that every analysis will have to be completed in the same time period, delays in execution will inevitably limit the effectiveness of campaigns, limit interventions, or lead to sub-optimal processes [11].
  • Variety: data today comes in all types of formats: structured numeric data in traditional databases, information created by line-of-business applications, unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with [8].
  • Variability: in addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending on social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage, even more so when unstructured data is involved.
  • Value: we need to consider what commercial value any new sources and forms of data can add to the business or to scientific research. Are existing problems that have defied solution due to the unavailability of data now being solved?
  • Complexity: today's data comes from multiple sources, and it is still an undertaking to link, match, cleanse and transform data across systems. It is nevertheless necessary to connect and correlate relationships, hierarchies and multiple data linkages, or your data can quickly spiral out of control.
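As a minimal illustration (the directory name and files below are hypothetical, and this is not drawn from the cited sources), the following sketch profiles a folder of incoming files against three of the dimensions above: total size for volume, arrival rate for velocity, and the mix of file formats for variety.

```python
import os
from collections import Counter

def profile_directory(path):
    """Summarize a directory: volume (bytes), velocity (files/day), variety (formats)."""
    total_bytes, count = 0, 0
    formats = Counter()
    oldest, newest = None, None
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue
        stat = os.stat(full)
        total_bytes += stat.st_size
        formats[os.path.splitext(name)[1].lower() or "<none>"] += 1
        oldest = stat.st_mtime if oldest is None else min(oldest, stat.st_mtime)
        newest = stat.st_mtime if newest is None else max(newest, stat.st_mtime)
        count += 1
    span_days = (newest - oldest) / 86400 if count > 1 else None
    files_per_day = count / span_days if span_days else None
    return {"volume_bytes": total_bytes,
            "velocity_files_per_day": files_per_day,
            "variety_formats": dict(formats)}

if __name__ == "__main__":
    # "incoming_data" is a hypothetical folder of files used only for illustration.
    print(profile_directory("incoming_data"))
```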

CHARACTERISTICS OF BIG DATA

Philip Russom (2011, 4th Edition) provided a definition similar to Gartner's, with emphasis on volume, variety and velocity, but with further insight into each of the three Vs. Typically, Big Data is categorized based on three characteristics:

  • Volume: How much data
  • Velocity: How fast data is processed
  • Variety: The various types of data

This is a convenient and simple categorization, and it applies equally to a relatively small amount of very disparate, complex data and to a huge volume of very simple data, whether structured or unstructured. Structured data is straightforward to analyze. Unstructured data differs from structured data in that its structure is unpredictable; data from e-mails, blogs, digital images, videos, social media and satellite imagery is unstructured in nature, and this type of data accounts for the majority of data sources in both the private and public domains. Other Vs that are important are the fourth V, veracity, and the fifth V, value. How accurate is the data in predicting business value? Do the results of a Big Data analysis actually make sense? Data must be verifiable with respect to both accuracy and context. For example, an innovative business may want to analyze massive amounts of data in real time to quickly assess the value of a customer and the potential to provide additional offers to that customer. It is necessary to identify the right amount and types of data that can be analyzed in real time to impact business outcomes.
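To illustrate the difference between structured and unstructured data described above, the short sketch below (a toy example with made-up records, not tied to any of the cited sources) parses structured CSV rows against a fixed schema, while the unstructured text can only be handled with more open-ended techniques such as keyword counting.

```python
import csv
import io
import re
from collections import Counter

# Structured data: each row follows a known schema, so fields can be read directly.
structured = io.StringIO("customer_id,amount,channel\n101,2500.00,web\n102,120.50,mobile\n")
for row in csv.DictReader(structured):
    print(row["customer_id"], float(row["amount"]), row["channel"])

# Unstructured data: free text has no fixed schema, so we fall back to
# open-ended analysis such as counting keyword occurrences.
unstructured = "Great service, but the mobile app kept crashing. Service was restored later."
words = re.findall(r"[a-z']+", unstructured.lower())
print(Counter(words).most_common(3))
```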

BIG DATA IN NIGERIA

In Nigeria, with a population of about 170 million, data collection and mining is surely a challenge, particularly for the various organs of government that are involved in the management and usage of data for different purposes. Some of the agencies that are mandated to collect and manage data in Nigeria include:

