[IEEE 2013 5th International Conference on Information & Communication Technologies (ICICT) -...



Page 1: [IEEE 2013 5th International Conference on Information & Communication Technologies (ICICT) - Karachi, Pakistan (2013.12.14-2013.12.15)] 2013 5th International Conference on Information


VDrive: A cost-effective storage solution using volunteer computing, with emphasis on security/privacy, availability, and latency

Dr. Jawwad Shamsi Associate Professor

[email protected]

Aarij Siddiqui Bachelors Student

[email protected]

Rabbia Hassan Bachelors Student

[email protected]

Rida Sattar Bachelors Student

[email protected]

Safi Siddiqui Bachelors Student

[email protected]

Computer Science Department, National University of Computer and Emerging Sciences

Karachi, Pakistan

Abstract— VDrive follows the baseline model of Storage-as-a-Service (SaaS), with an infrastructure built on volunteer computing. The project presents a model for storing user data as decentralized packets on nodes in the network. The model keeps redundant copies of the data to provide efficient storage with maximum availability.

Keywords – Volunteer Computing; Storage as a service; Distributed Systems; Platform Independency; Centralized Storage;

I. INTRODUCTION

In today's world, data is everything. The information a company holds about its finances, employees, customers, trade secrets, and more is what keeps it valuable. Storage is needed not only to hold this data but also to prevent it from leaking into the market. To be valuable to a company, data storage has to be safe, secure, and efficiently accessible.

There are two major trends in storage: local and cloud. Local storage usually has no fault tolerance, no backup, no mobility, and no ubiquity by default. Its safety, security, and all related matters depend on how it is used. Cloud storage, on the other hand, provides high availability, low latency, and a (comparatively) low-cost solution to the storage problem, but it moves ultimate control over the data from the hands of the user to the hands of the service provider.

Vdrive aims to develop a system based on a volunteer architecture that offers trust relationships, nearest-node selection, and platform independence, along with fundamental storage features such as security and availability, as illustrated in Figure 1.

Consider a few scenarios:

a) A company has just been founded. With some initial investment it has bought some computers, and it now needs a centralized storage system. As a start-up, instead of pouring money into cloud services or data servers, it finds a cheaper yet still reliable option: a way to use its current infrastructure, without any additions, to obtain the same results.

b) A university needs to provide storage space to its students and faculty. Instead of going for a bundled cloud solution or buying data servers, it finds a cloud storage system that can be built over its existing infrastructure of PCs (computer lab, faculty, and management) to provide local, efficient, and reliable storage.

Figure 1

978-1-4799-2622-0/13/$31.00 ©2013 IEEE

II. RELATED WORK

Research on volunteer computing for storage falls into three kinds: evaluation-based studies, recommendations for modifying existing frameworks, and the development of new solutions.

A. Evaluation based:

The paper in [1] addresses task distribution in volunteer-based projects, with BOINC as the underlying system for the experiments. The evaluation covered the use of computation power, disk reads and writes, database transactions, and time consumption with single and multiple servers. It concluded that a BOINC task server running on inexpensive hardware can potentially dispatch tens of millions of tasks per day. The database server (and in particular its CPU) is typically the bottleneck of the system.

The paper in [2] studies how capable the volunteer architecture is. It considers measurements of over 330,000 hosts participating in a volunteer computing project. These measurements include processing power, memory, disk space, network throughput, host availability, user-specified limits on resource usage, and host churn. It shows that volunteer computing can support applications that are significantly more data intensive, or that have higher memory and storage requirements, than those in current projects.

B. Recommending modifications in the existing frameworks:

The paper in [3] discusses how identity federation based on agent technology can be used to guarantee the trust and security that users require.

The paper in [4] proposes a flexible distributed storage integrity auditing mechanism that uses homomorphic tokens and distributed erasure-coded data. The proposed design allows users to audit the cloud storage with very lightweight communication and computation cost.

C. New Solutions:

Cloud@Home [5] proposes using the resources available either in a single system or across a whole enterprise. It gives users who volunteer their resources the option to buy, sell, or donate them. Its main focus was to design an interoperable system that can cater to all types of users. For the implementation, a resource subsystem was created using virtual machines on request, similar to what Amazon Elastic Compute Cloud [6] does but on a volunteer architecture.

Trust Store [7] is idea-based research that addresses the security concerns of cloud storage services. It focuses on confidentiality, integrity, and availability, and discusses the security threats that third-party storage services pose to a user. It proposes a middleware between the user and the storage service provider that fragments and encrypts the data and manages its integrity.

So far the closest research to our idea is Storage@Home [8]. It is a distributed storage infrastructure developed to solve the problem of backing up and sharing large amounts of data (scientific results) using a distributed model of volunteer-managed hosts. Data is maintained by a mixture of replication and monitoring: each file is encrypted, and replicated copies are kept.

These works are directly related to our research idea, to which we make some major additions. In Vdrive we are developing a volunteer-based solution to the centralized storage problem. In the process we address the issues of security, privacy, latency, cost, availability, reliability, platform independence, and mobility. These issues are resolved using fragmentation, password-based encryption, replication, nearest-node algorithms, and smartphone access, as shown in Table 1.

Table 1. Attributes compared across Cloud@Home, Trust Store, Storage@Home, and Vdrive: Security, Privacy, Latency, Cost, Availability, Reliability, Platform Independency, Mobility.

III. ISSUES AND CHALLENGES

We faced many minor challenges during our research. The major issues are as follows:

A. Ensuring Availability:

Centralized storage is usually implemented with dedicated resources, which we are not using, so proposing a solution built on volunteer computing was a challenge in itself. After thorough research, we concluded that a self-learning system would be a long-term solution to this problem.


B. Communication Issues:

Sending data from a client to the server is easy, but the reverse is not, because the majority of our client machines are behind NAT and have no static IP address. To solve this problem we use the 'alive' messages that each client sends to report its availability: any data the server needs to send to a client is sent in the reply to that client's 'alive' message.
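The reply-on-'alive' approach can be sketched as a per-node outbox on the server. This is a minimal illustration only: the class name `HeartbeatServer`, its methods, and the message dictionaries are hypothetical, and the transport (sockets, serialization) is left out because the paper does not describe it.

```python
from collections import defaultdict, deque


class HeartbeatServer:
    """Queues outbound messages per node and releases them on the next 'alive'."""

    def __init__(self):
        # node_id -> messages waiting for that node's next 'alive' message
        self._pending = defaultdict(deque)

    def queue_for(self, node_id, message):
        """Record a message for a node the server cannot connect to directly."""
        self._pending[node_id].append(message)

    def on_alive(self, node_id):
        """Handle an 'alive' message; the returned list is the reply body."""
        queue = self._pending[node_id]
        reply = list(queue)
        queue.clear()
        return reply


server = HeartbeatServer()
server.queue_for("node-7", {"op": "store_chunk", "chunk_id": 41})
server.queue_for("node-7", {"op": "delete_chunk", "chunk_id": 12})
first_reply = server.on_alive("node-7")   # both queued messages, in order
second_reply = server.on_alive("node-7")  # nothing left to deliver
```

Because the client always opens the connection, this pattern needs no port forwarding on the client side; the cost is that delivery is delayed by up to one 'alive' interval.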

IV. VDRIVE

A. Security/Privacy:

The security feature of Vdrive is twofold. First, when a file is uploaded to the server, it is encrypted using password-based encryption. Second, it is broken into chunks that are distributed over the nodes. Encryption ensures that the file cannot be misused if it is intercepted during upload. Fragmentation (chunking) ensures that data residing on a node cannot be misused: encryption alone would not have been enough, because a file may reside on a system for years, which might give someone enough time to decrypt it. Encryption uses the owner's password, thus ensuring privacy.
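As a rough illustration of the first step of password-based encryption, the sketch below derives a fixed-length key from the owner's password. The paper does not say which key-derivation function or cipher Vdrive uses, so PBKDF2-HMAC-SHA256, the salt length, and the iteration count are all assumptions. The cipher step itself is omitted because Python's standard library does not include one.

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Derive a 32-byte key from a password with PBKDF2-HMAC-SHA256 (assumed KDF)."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )


salt = os.urandom(16)  # random per file; would be stored with the file metadata
key = derive_key("owner password", salt)
same_key = derive_key("owner password", salt)       # same inputs, same key
other_key = derive_key("another password", salt)    # different password, different key
```

The derived key would then be passed to a symmetric cipher before the file leaves the client. Only someone who knows the owner's password can re-derive the key, which is the privacy property the section describes.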

B. Availability/Reliability:

In Vdrive we have tried to maximize availability. Because it is a volunteer-based model, it cannot ensure 100% availability. We apply algorithms that replicate chunks so that at least one node holding each chunk remains in the network at all times. To guide replication we profile the nodes, which gives the system a clear view of when a node is likely to be active. The node profiler keeps collecting information and recalculating its results on the go.
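One way to turn node profiles into a replica placement is a greedy cover over the hours of the day, sketched below. This is not the authors' replication algorithm, which the paper does not specify. The profile format (a set of hours in which a node is expected to be online) and the function name `choose_replicas` are assumptions made for illustration.

```python
def choose_replicas(profiles, hours=range(24)):
    """Greedily pick nodes until every hour is covered by at least one chosen node.

    profiles: dict mapping node id -> set of hours (0-23) the node is expected online.
    Returns (chosen node ids, hours that no available node covers).
    """
    uncovered = set(hours)
    chosen = []
    while uncovered:
        # Pick the node that covers the most still-uncovered hours.
        best = max(profiles, key=lambda n: len(profiles[n] & uncovered), default=None)
        if best is None or not (profiles[best] & uncovered):
            break  # the remaining hours cannot be covered by any node
        chosen.append(best)
        uncovered -= profiles[best]
    return chosen, uncovered


profiles = {
    "lab-pc": set(range(8, 18)),       # online during working hours
    "faculty-pc": set(range(16, 24)),  # online in the evening
    "server-room": set(range(0, 10)),  # online at night and in the early morning
    "laptop": set(range(12, 14)),      # online over lunch only
}
replicas, gaps = choose_replicas(profiles)
```

In this example three nodes together cover all 24 hours, and the laptop is never selected because it adds no coverage. A non-empty `gaps` result would signal hours in which the chunk is expected to be unreachable.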

C. Trust Relationship:

Because volunteer computing is the underlying system, trust is an important concern. The trust mechanism we have implemented is self-learning: the longer a node remains connected to the server, the higher its trust rating becomes and the more space is granted to its user. Note that up-time is not the only factor that affects the space allowed to a user.

D. Nearest Node:

In this model a nearest-node algorithm has been implemented to decrease the latency of accessing a file or a chunk. The node with the fewest hops is selected for the download or retrieval of the file.
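The selection rule can be sketched as a minimum over the online nodes that hold the chunk. The function name, the dictionary of hop counts, and the way hops are measured are hypothetical; the paper states only that the node with the fewest hops is chosen.

```python
def nearest_online_node(replica_nodes, hop_counts, online):
    """Return the online replica holder with the fewest hops, or None if none is online."""
    candidates = [n for n in replica_nodes if n in online and n in hop_counts]
    if not candidates:
        return None
    return min(candidates, key=lambda n: hop_counts[n])


replica_nodes = ["node-a", "node-b", "node-c"]        # nodes holding the chunk
hop_counts = {"node-a": 4, "node-b": 1, "node-c": 2}  # measured before the download
online = {"node-a", "node-c"}                         # node-b is closest but offline
source = nearest_online_node(replica_nodes, hop_counts, online)
```

Here `node-c` is selected: `node-b` has the lowest hop count but is offline, so the nearest reachable holder wins.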

E. Platform Independency:

The system was required to support multiple platforms: because it is an application for general users, we cannot limit them to particular platforms. Vdrive has been developed for the Java Runtime Environment, which ensures platform independence.

F. Two-part architecture:

1. Client – When a client uploads a file through our software agent, which is installed on every workstation in the vicinity, the file is separated from its file extension, encrypted, and sent over the network to the server. The file then appears in the list of uploaded files in the client application and can be downloaded by clicking the download button. The client application/agent also sends a heartbeat to the server at regular intervals so that the server can keep a record of the available nodes in the network.

2. Server – When the server receives the file from the client, it breaks the file into n chunks, of which n-1 are of equal size. These chunks are then entered into the database and distributed among multiple nodes using our replication algorithm (explained later). A universally increasing ID number identifies every chunk uniquely. The database keeps a record of all the nodes, their MAC addresses, the volunteered amount of storage, login credentials, file name, file size, file owner, chunk ID, chunk locations, the file each chunk belongs to, etc.
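The chunking step can be sketched as follows. The fixed `chunk_size`, the in-memory counter standing in for the universally increasing chunk ID, and the function names are illustrative assumptions; the paper does not give the chunk size or say how IDs are generated.

```python
import itertools

_chunk_ids = itertools.count(1)  # stand-in for the universally increasing chunk ID


def split_into_chunks(data: bytes, chunk_size: int):
    """Split data into (chunk_id, bytes) pairs; every chunk but the last has chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (next(_chunk_ids), data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]


def reassemble(chunks):
    """Concatenate chunks in ID order to rebuild the original bytes."""
    return b"".join(payload for _, payload in sorted(chunks))


file_bytes = bytes(range(256)) * 10            # 2560 bytes of sample data
chunks = split_into_chunks(file_bytes, 1024)   # chunk sizes: 1024, 1024, 512
restored = reassemble(reversed(chunks))        # order of arrival does not matter
```

Because the IDs increase monotonically across all files, sorting by ID is enough to restore the order of one file's chunks, which is why only the last chunk may differ in size.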

Figure 2

The communication between client and server is illustrated in Figure 2.

V. IMPLEMENTATION

A. Parallelism:

The system is programmed with threads, which make the application run efficiently and respond faster to the user. The 'Pthread' library was used to create the threads.


B. GUI:

The system has a very basic GUI, built with the Java Swing library. It allows the user to perform the simple operations that our system provides.

C. Data type:

One of the initial challenges was to find a data type that could handle all types of files. After some research we decided to use bytes as our basic data type for file transfer and handling.

D. Space Allocation:

Our client-side application allocates the agreed amount of space on the machine up front, to ensure that the volunteered storage is fully available.

Figure 3

E. Database:

Our database is connected to our server and ensures that all required transactions are carried out successfully. It comprises four tables: user, node, file, and chunk. The tables are linked with appropriate keys so that operations across them execute correctly. Further details can be seen in Figure 3.
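A four-table layout consistent with this description could look like the sketch below, using an in-memory SQLite database. The actual schema is the one shown in Figure 3. Every column name here is an assumption, as is the choice to store one chunk row per replica so that chunk locations fit in the chunk table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user  (user_id INTEGER PRIMARY KEY, login TEXT UNIQUE NOT NULL,
                    password_hash TEXT NOT NULL);
CREATE TABLE node  (node_id INTEGER PRIMARY KEY, mac_address TEXT UNIQUE NOT NULL,
                    volunteered_bytes INTEGER NOT NULL,
                    user_id INTEGER NOT NULL REFERENCES user(user_id));
CREATE TABLE file  (file_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                    size_bytes INTEGER NOT NULL,
                    owner_id INTEGER NOT NULL REFERENCES user(user_id));
-- Assumed layout: one row per stored replica, so a chunk ID repeats per holding node.
CREATE TABLE chunk (chunk_id INTEGER NOT NULL,
                    file_id INTEGER NOT NULL REFERENCES file(file_id),
                    node_id INTEGER NOT NULL REFERENCES node(node_id),
                    PRIMARY KEY (chunk_id, node_id));
"""

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript(SCHEMA)

db.execute("INSERT INTO user VALUES (1, 'alice', 'hash-placeholder')")
db.executemany(
    "INSERT INTO node VALUES (?, ?, ?, 1)",
    [(1, "aa:aa:aa:aa:aa:01", 10**9), (2, "aa:aa:aa:aa:aa:02", 10**9)],
)
db.execute("INSERT INTO file VALUES (1, 'report', 2048, 1)")
db.executemany("INSERT INTO chunk VALUES (?, 1, ?)", [(1, 1), (1, 2), (2, 2)])

# Which nodes hold chunk 1?
holders = [row[0] for row in db.execute(
    "SELECT node_id FROM chunk WHERE chunk_id = ? ORDER BY node_id", (1,))]
```

With this layout, a single query on the chunk table returns the candidate nodes for a download, and the foreign keys stop a chunk from referring to a file or node that does not exist.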

VI. PERFORMANCE

A moderate amount of performance evaluation has been carried out, comparing the system to other service providers.

The graphs below illustrate experiments conducted on a network of ten nodes, of which six were stable and the other four had fluctuating connections.

When a file of size 512 KB was uploaded, as illustrated in Figure 4, each chunk was distributed to 3 nodes (using our replication algorithm). For some chunks, 1 or 2 nodes went offline after the upload, but the chunk was still retrieved from the remaining online node.

Figure 4, Test performed on a file of size 512 KB

When files of size 1 MB and 2 MB were uploaded, as illustrated in Figures 5 and 6 respectively, the chunk distribution was the same as before, but the distance (measured in hops) between nodes varied. Some nodes were connected to the same router, while others were farther away. The system downloaded each chunk from the node nearest in the network: although these nodes were often not the highest in priority, a simple hop count was performed before each download. This saved a significant amount of time.

Figure 5, Test performed on a file of size 1 MB


Figure 6, Test performed on a file of size 2 MB

VII. CONCLUSION

VDrive provides an efficient and well-organized distributed volunteer file storage system for an institution or organization. The project is based on a visible trend in organizations: most personal computers never reach their full storage capacity, so the investment in them never reaps its full benefit. The project exploits this unused capacity with a client-server architecture in which files can be uploaded to and downloaded from a server. Instead of saving the data itself or on other dedicated machines, the server stores the files on volunteer nodes within that environment.

This inexpensive solution works on a 'give-and-take' policy: every node that is allowed access to the network and the information it hosts has to contribute to the volunteer storage pool. The solution cuts major costs and implements a secure storage model on any existing architecture where a set of personal computers is available. In addition to a closed environment such as an office, the project can also be used in an open environment, since the storage mechanism provides strong security.

ACKNOWLEDGEMENT

We would like to acknowledge the support provided by the National ICT Research and Development Authority. Their support enabled us to perform evaluations on cutting-edge technology, which ensured the successful completion of this project.

REFERENCES

[1] David P. Anderson, Eric Korpela, and Rom Walton. "High-Performance Task Distribution for Volunteer Computing." IEEE (2005).
[2] David P. Anderson and Gilles Fedak. "The computation and storage potential of volunteer computing." 16th – 19th May, 2006.
[3] Khemakhem M. and Belghith A. "Identity federation based on agent technology for secure large scale data storage and processing over volunteer grids." 10th – 13th May, 2009.
[4] Cong Wang, Qian Wang, Kui Ren, Ning Cao, and Wenjing Lou. "Toward secure and dependable storage services in cloud computing." April – June, 2012.
[5] Vincenzo D. Cunsolo, Salvatore Distefano, Antonio Puliafito, and Marco Scarpa. "Volunteer computing and Desktop Cloud: the Cloud@Home Paradigm." IEEE (2009).
[6] Amazon Elastic Compute Cloud (EC2). http://www.amazon.com/ec2/
[7] Surya Nepal, Carsten Friedrich, Leakha Henry, and Shiping Chen. "A secure storage service in the hybrid cloud." 5th – 8th December, 2011.
[8] Adam L. Beberg and Vijay S. Pande. "Storage@Home: Petascale Distributed Storage." 26th – 30th March, 2007.