Big Data Infrastructure

Big data is a term used to describe recent advances in technology that aid the analysis of huge amounts of data. It differs from traditional data analytics because of the nature of the data being generated in the 21st century. Big data is characterized by variety, velocity, and volume: the data can be structured, semi-structured, or unstructured, and it is produced at a fast rate, resulting in enormous volumes. A server analyzing big data is expected to handle these large quantities. Because such amounts of data are beyond the storage and processing capacity of a single server, distributed computing systems were conceived. These systems combine several machines to form a cluster. The data is split into smaller pieces and distributed among the machines in the cluster, and the processing power and storage capacity of each machine are used to process the chunk of data stored on that machine.
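The split-and-distribute idea described above can be illustrated with a minimal sketch. It is not the software used in this project: the function names (split_into_chunks, process_chunk, cluster_sum) are hypothetical, worker threads on one machine stand in for the machines of a cluster, and summing numbers stands in for a real analysis job.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_nodes):
    """Split records into num_nodes contiguous chunks of nearly equal size."""
    size, extra = divmod(len(records), num_nodes)
    chunks, start = [], 0
    for i in range(num_nodes):
        # The first `extra` chunks receive one additional record.
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Stand-in for the work one machine does on its local chunk."""
    return sum(chunk)


def cluster_sum(records, num_nodes):
    """Process each chunk independently, then combine the partial results."""
    chunks = split_into_chunks(records, num_nodes)
    # Assumption: each worker thread plays the role of one cluster node.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    data = list(range(10))
    print(split_into_chunks(data, 3))
    print(cluster_sum(data, 3))
```

In a real cluster the chunks would reside on different machines and only the small partial results would travel over the network, which is what lets storage and processing scale with the number of nodes.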

The problem with most big data servers is the high cost of purchasing and operating them, driven by rapid changes in technology. This dramatically affects the cost of big data services, making such services inaccessible to small institutions and developing countries. To address this problem, this project explores options for building big data servers from old, discarded laptops, which can be acquired at extremely low prices.

There are two methods that can be used to build big data servers. The first method uses a virtualized environment to manage all the machines. The second method uses a non-virtualized environment. Virtualized clusters are commonly used in data centers because the computing nodes are easy to manage, despite some performance overhead. A leading virtualization software company claims that this overhead is about 25%. It has also been stated that the overhead of using a hypervisor in a virtualized server environment is 5 to 7%.

We report performance testing results for the two methods in order to evaluate the scalability and feasibility of using a virtualized environment for building big data servers from recycled computers. Two clusters are built using the discarded laptops. The first cluster operates in a virtualized environment based on a hypervisor, and the second cluster operates in a non-virtualized environment. CentOS 6.5 is used as the operating system for both clusters. The performance of both clusters is benchmarked. The results show that the virtualized environment has an overhead of 66% for read operations and 88% for write operations. This suggests that, for recycled computers, a bare-metal, non-virtualized environment is recommended for building big data servers.
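For readers who want to compute a comparable figure on their own hardware, the sketch below shows one common way to express virtualization overhead. It assumes the overhead is defined as the relative loss in throughput compared with the bare-metal cluster; the text above does not state which definition was used, and the function name overhead_percent and the sample throughput values are hypothetical, not measurements from this project.

```python
def overhead_percent(bare_metal_throughput, virtualized_throughput):
    """Relative throughput loss of the virtualized cluster, in percent.

    Assumed definition: 100 * (bare - virtualized) / bare.
    Both arguments must use the same unit, for example MB/s.
    """
    if bare_metal_throughput <= 0:
        raise ValueError("bare-metal throughput must be positive")
    loss = bare_metal_throughput - virtualized_throughput
    return 100.0 * loss / bare_metal_throughput


if __name__ == "__main__":
    # Hypothetical throughput values chosen only to illustrate the arithmetic.
    print(overhead_percent(120.0, 90.0))
```

If the benchmark reports elapsed time rather than throughput, a different formula applies (relative increase in time), and the two definitions give different percentages for the same run.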