Final Mini Report (Recovered)


TABLE OF CONTENTS

LIST OF FIGURES
ABSTRACT
Chapter 1: Project Introduction
1.1 Introduction to Big Data
1.2 Problem Statement
1.3 Scope
1.4 Project Features
1.5 Organization
Chapter 2: Literature Survey
Chapter 3: System Analysis
3.1 Introduction
3.2 System Study
3.2.1 Existing System
3.2.2 Proposed System
3.2.3 Hadoop
3.3 Feasibility Study
3.3.1 Introduction to Java API
3.4 Objectives
3.5 Technology Used
Chapter 4: System Requirements
4.1 Introduction
4.2 System Requirements
4.2.1 Software Requirements
4.2.2 Hardware Requirements
4.3 Conclusion
Chapter 5: System Design
5.1 Introduction
5.2 HDFS Architecture
5.3 Modules
5.4 UML Diagrams
5.4.1 Sequence Diagram
5.4.2 Class Diagram
5.5 Conclusion
Chapter 6: Implementation
6.1 Introduction
6.2 Map Reduce
6.3 HDFS Shell Commands
6.4 Sample Code
6.5 Conclusion
Chapter 7: Screenshots
Chapter 8: Testing and Validation
8.1 Testing
8.2 Types of Testing
8.2.1 White Box Testing
8.2.2 Black Box Testing
8.2.3 Alpha Testing
8.2.4 Beta Testing
8.3 Path Testing
Chapter 9: Conclusion and Future Scope
9.1 Conclusion
9.1.1 Selecting a Project in Hadoop
9.1.2 Rethinking and Adapting to Existing Hadoop
9.1.3 Path Availability
9.1.4 Services Insight and Operations
9.1.5 Adopt Lean and Agile Integration Principles
9.2 Future Scope of Hadoop
Chapter 10: References

LIST OF FIGURES

Figure No.  Name of Figure
1.1   Structure of big data
1.2   Data analysis
1.3   3 Vs of big data
3.1   Evolution of big data
3.2   Objectives
5.1   Architecture of HDFS
5.2   Architecture of client-server modules
5.3   Sequence diagram
5.4   Class diagram
7.1   cat command execution
7.2   copyToLocal command execution
7.3   cp command execution
7.4   du command execution
7.5   dus command execution
7.6   expunge command execution
7.7   get command execution
7.8   getmerge command execution
7.9   ls command execution
7.10  lsr command execution
7.11  mkdir command execution
7.12  moveFromLocal command execution
7.13  mv command execution
7.14  put command execution
7.15  rm command execution
7.16  rmr command execution
7.17  stat command execution
7.18  tail command execution
7.19  test command execution
7.20  text command execution
7.21  touchz command execution

ABSTRACT

Hadoop is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware. It is an open-source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. Originally created by Doug Cutting at Yahoo!, Hadoop was inspired by MapReduce, a programming framework developed by Google in the early 2000s for indexing the Web. It was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel, and Hadoop clusters run on inexpensive commodity hardware so projects can scale out without breaking the bank. Hadoop is now a project of the Apache Software Foundation, where hundreds of contributors continuously improve the core technology. The fundamental concept is that, rather than banging away at one huge block of data with a single machine, Hadoop breaks Big Data into multiple parts so each part can be processed and analyzed at the same time. Hadoop is used for searching, log processing, recommendation systems, analytics, video and image analysis, and data retention, and it benefits from being a top-level Apache Software Foundation project with a large and active user base, mailing lists, user groups, very active development, and strong development teams.

Hadoop is a popular open-source implementation of MapReduce for the analysis of large datasets. To manage storage resources across the cluster, Hadoop uses a distributed user-level filesystem, HDFS, which is written in Java and designed for portability across heterogeneous hardware and software platforms. This report analyzes the performance of HDFS and uncovers several performance issues. First, architectural bottlenecks exist in the Hadoop implementation that result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Second, portability limitations prevent the Java implementation from exploiting features of the native platform. Third, HDFS implicitly makes portability assumptions about how the native platform manages storage resources, even though native filesystems and I/O schedulers vary widely in design and behavior. This report investigates the root causes of these performance bottlenecks in order to evaluate trade-offs between portability and performance in the Hadoop distributed filesystem.

CHAPTER 1: INTRODUCTION

1.1 INTRODUCTION TO BIG DATA

Definitions:

Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.

- Teradata Magazine article, 2011

Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.

- The McKinsey Global Institute, 2012

Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.

- Wikipedia, 2014.

Datasets that exceed the boundaries and sizes of normal processing capabilities, forcing you to take a non-traditional approach.

Fig 1.1: Structure of big data

Fig 1.2: Data analysis

For the past two decades most business analytics have been created using structured data extracted from operational systems and consolidated into a data warehouse. Big data dramatically increases both the number of data sources and the variety and volume of data that is useful for analysis. A high percentage of this data is often described as multi-structured to distinguish it from the structured operational data used to populate a data warehouse. In most organizations, multi-structured data is growing at a considerably faster rate than structured data. Two important data management trends for processing big data are relational DBMS products optimized for analytical workloads (often called analytic RDBMSs, or ADBMSs) and non-relational systems for processing multi-structured data.

In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional: increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Not only Gartner but many other organizations in industry continue to use this "3Vs" model for describing big data. Later, in 2012, Gartner updated big data's definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, some organizations add a fourth V, "Veracity", to describe it. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big data can also be defined as a large volume of unstructured data which cannot be handled by standard database management systems like DBMS, RDBMS, or ORDBMS.

The 3Vs of Big Data:

Fig 1.3: The 3Vs of big data

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information. Turn 12 terabytes of tweets created each day into improved product sentiment analysis. Convert 350 billion annual meter readings to better predict power consumption.

Velocity: Sometimes two minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value. Scrutinize 5 million trade events created each day to identify potential fraud. Analyze 500 million daily call detail records in real time to predict customer churn faster.

Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together. Monitor hundreds of live video feeds from surveillance cameras to target points of interest.

1.2 PROBLEM STATEMENT

Relational database systems and desktop statistics and visualization packages often have difficulty handling big data. The work instead requires massively parallel software running on tens, hundreds, or even thousands of servers. What is considered big data varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. It is a very difficult task for an organization to maintain such large amounts of data, which keep increasing day by day.

1.3 SCOPE

The big data concept is used mainly to manage bulk data in a way that gives the best idea of the factors that matter, such as the preferences of a customer, when launching a new product into the market and making it successful. It has a wide range of applications such as net pricing, out-of-home advertising, retail habits, and politics.

1.4 PROJECT FEATURES

This project gives an illustration of all the commands that are used on the Hadoop platform. It shows how the commands work, demonstrates Hadoop file system interaction implemented using the Java application programming interface, and shows the implementation of the default file system in Hadoop.

1.5 ORGANIZATION

CHAPTER 1: In this chapter the basic introduction regarding the project is given, along with its uses and the reason why it was introduced.

CHAPTER 2: In this chapter the literature survey is explained. The literature survey gives a detailed study of projects with similar activity or technology. From the literature survey we come to know how other similar projects were implemented and the problems faced in those projects.

CHAPTER 3: In this chapter system analysis is explained. We also give a description of the language that is used in this project and explain the advantages and disadvantages of using it.

CHAPTER 4: In this chapter system requirements and analysis are explained, and the requirements are classified into functional and non-functional requirements.

CHAPTER 5: In this chapter the system design is explained with the help of different diagrammatic representations such as ER and UML diagrams.

CHAPTER 6: In this chapter the implementation of the system is explained by explaining the basic concepts. The sample code is also explained, which describes the functionality of the system.

CHAPTER 7: In this chapter screenshots are presented. The screenshots give a clear description of how the project is implemented.

CHAPTER 8: This chapter includes the different types of testing and the validation part of the code, with a brief description, and also defines different test cases.

CHAPTER 9: This chapter includes the conclusion of the project and its future scope.

CHAPTER 10: It includes the references for all concepts that have been used in this project.

CHAPTER 2: LITERATURE SURVEY

Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer fit on a single cost-effective computer. A simple but expensive solution has been to buy specialty machines that have a lot of memory and many CPUs. This solution scales as far as what is supported by the fastest machines available, and usually the only limiting factor is the budget. An alternative solution is to build a high-availability cluster. Such a cluster typically attempts to look like a single machine, and typically requires very specialized installation and administration services. Many high-availability clusters are proprietary and expensive.

Hadoop supports the MapReduce model, which was introduced by Google as a method of solving a class of petabyte-scale problems with large clusters of inexpensive machines. The model is based on two distinct steps for an application:

Map: an initial ingestion and transformation step, in which individual input records can be processed in parallel.

Reduce: an aggregation or summarization step, in which all associated records must be processed together by a single entity.

The core concept of MapReduce in Hadoop is that input may be split into logical chunks, and each chunk may be initially processed independently by a map task. The results of these individual processing chunks can be physically partitioned into distinct sets, which are then sorted. Each sorted chunk is passed to a reduce task. Figure 1-1 illustrates how the MapReduce model works.

The Hadoop Distributed File System (HDFS) is a file system designed for MapReduce jobs that read input in large chunks, process it, and write potentially large chunks of output. HDFS does not handle random access particularly well. For reliability, file data is simply mirrored to multiple storage nodes; this is referred to as replication in the Hadoop community. As long as at least one replica of a data chunk is available, the consumer of that data will not know of storage server failures. HDFS services are provided by two processes: the NameNode handles management of the file system metadata and provides management and control services, while the DataNode provides block storage and retrieval services. There is one NameNode process in an HDFS file system, and it is a single point of failure. Hadoop Core provides recovery and automatic backup of the NameNode, but no hot failover services.
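To make the map and reduce steps concrete, the following is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce Java API, in the form used by the Hadoop 2.x tutorial; on the older single-node 1.x setup described later in this report, the job would instead be created with new Job(conf, "word count") rather than Job.getInstance. The input and output paths are placeholders supplied on the command line, and the package name simply mirrors the sample code in Chapter 6.

package cse.GNITC;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: each input line is split into words, emitting (word, 1) pairs in parallel.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: all counts for the same word arrive together and are summed by a single reduce call.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, such a job could be submitted with, for example, hadoop jar wordcount.jar cse.GNITC.WordCount /user/hadoop/wcinput /user/hadoop/wcoutput, which matches the wcinput and wcoutput directories that appear in the directory listings in Chapter 6 (the jar name itself is a placeholder).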

CHAPTER 3: SYSTEM ANALYSIS

In the previous chapter we looked at the literature survey regarding big data. In system analysis we give a description of the language that is being used, along with its advantages and disadvantages.

3.1 INTRODUCTION

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are commonplace and thus should be automatically handled in software by the framework.

The core of Apache Hadoop consists of a storage part (HDFS) and a processing part (MapReduce). Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop MapReduce transfers packaged code for nodes to process in parallel, based on the data each node needs to process. This approach takes advantage of data locality (nodes manipulating the data that they have on hand) to allow the data to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel system where computation and data are connected via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:

Hadoop Common: contains libraries and utilities needed by other Hadoop modules.

Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling users' applications.

3.2 SYSTEM STUDY

In the system study we analyse the present system and the proposed system and get to know the reason behind the need for a new technology in place of the existing one.

3.2.1 EXISTING SYSTEM

It incentivizes more collection of data and longer retention of it. If any and all data sets might turn out to prove useful for discovering some obscure but valuable correlation, you might as well collect them and hold on to them. In the long run, the more useful big data proves to be, the stronger this incentivizing effect will be; but in the short run it almost doesn't matter: the current buzz over the idea is enough to do the trick. Many (perhaps most) people are not aware of how much information is being collected (for example, that stores are tracking their purchases over time), let alone how it is being used (scrutinized for insights into their lives). Big data can further tilt the playing field toward big institutions and away from individuals. In economic terms, it accentuates the information asymmetries of big companies over other economic actors and allows people to be manipulated.

3.2.2 PROPOSED SYSTEM

Big data is an all-encompassing term for any collection of data sets so large and complex that they become difficult to process using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on."

3.2.3 HADOOP

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative. Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS) and a number of related projects such as Apache Hive, HBase and ZooKeeper.

Fig 3.1: Evolution of big data

The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.

3.3 FEASIBILITY STUDY

The Hadoop Distributed File System (HDFS), a subproject of the Apache Hadoop project, is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets.

Overview of HDFS

HDFS has many similarities with other distributed file systems, but is different in several respects. One noticeable difference is HDFS's write-once-read-many model, which relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access. Another unique attribute of HDFS is the viewpoint that it is usually better to locate processing logic near the data rather than moving the data to the application space. HDFS rigorously restricts data writing to one writer at a time. Bytes are always appended to the end of a stream, and byte streams are guaranteed to be stored in the order written.

HDFS has many goals. Here are some of the most notable:

Fault tolerance, by detecting faults and applying quick, automatic recovery.
Data access via MapReduce streaming.
A simple and robust coherency model.
Processing logic close to the data, rather than the data close to the processing logic.
Portability across heterogeneous commodity hardware and operating systems.
Scalability to reliably store and process large amounts of data.
Economy, by distributing data and processing across clusters of commodity personal computers.

3.3.1 INTRODUCTION TO JAVA API

The Java Client API is an open-source API for creating applications that use MarkLogic Server for document and search operations. Developers can easily take advantage of the advanced capabilities for persistence and search of unstructured documents that MarkLogic Server provides. The capabilities provided by the Java API include:

Insert, update, or remove documents and document metadata (see Document Operations).
Query text and lexicon values (see Searching).
Configure persistent and dynamic query options (see Query Options).
Apply transformations to new content and search results (see Content Transformations).
Extend the Java API to expose custom capabilities you install on MarkLogic Server (see Extending the Java API).

When working with the Java API, you first create a manager for the type of document or operation you want to perform on the database (for instance, a JSONDocumentManager to write and read JSON documents, or a QueryManager to search the database). To write or read the content for a database operation, you use standard Java APIs such as InputStream, DOM, SAX, JAXB, and Transformer, as well as open-source APIs such as JDOM and Jackson. The Java API provides a handle (a kind of adapter) as a uniform interface for content representation. As a result, you can use APIs as different as InputStream and DOM to provide content for one read() or write() method. In addition, you can extend the Java API so you can use the existing read() or write() methods with new APIs that provide useful representations for your content.
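As an illustrative sketch only of the manager-and-handle pattern just described: the class names below (DatabaseClientFactory, JSONDocumentManager, StringHandle) come from the MarkLogic Java Client API, but the exact client-factory signature varies between releases, and the host, port, credentials, and document URI are placeholders rather than values from this project.

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.Authentication;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.StringHandle;

public class HandleSketch {
    public static void main(String[] args) {
        // Connect to a MarkLogic REST instance (placeholder host, port, and credentials).
        DatabaseClient client = DatabaseClientFactory.newClient(
                "localhost", 8000, "rest-user", "rest-password", Authentication.DIGEST);

        // A manager for one kind of document (here JSON documents).
        JSONDocumentManager docMgr = client.newJSONDocumentManager();

        // A handle is the uniform content representation passed to write() and read().
        docMgr.write("/example/sample.json", new StringHandle("{\"greeting\":\"hello\"}"));
        String content = docMgr.read("/example/sample.json", new StringHandle()).get();
        System.out.println(content);

        client.release();
    }
}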
This chapter covers a number of basic architectural aspects of the Java API, including fundamental structures such as database clients, managers, and handles used in almost every program you will write with it. Before starting to code, you need to understand these structures and the concepts behind them. The Java API co-exists with the previously developed Java XCC, as they are intended for different use cases. A Java developer can use the Java API to quickly become productive in their existing Java environment, using the Java interfaces for search, facets, and document management. It is also possible to use its extension mechanism to invoke XQuery, both to leverage a development team's XQuery expertise and to enable MarkLogic Server functionality not implemented by the Java API. XCC provides a lower-level interface for running remote or ad hoc XQuery. While it provides significant flexibility, it also has a somewhat steeper learning curve for developers who are unfamiliar with XQuery. You may want to think of XCC as being similar to ODBC or JDBC: a low-level API for sending query language directly to the server, while the Java Client API is a higher-level API for working with database constructs in Java. In terms of performance, the Java API is very similar to Java XCC for compatible queries. The Java API is a very thin wrapper over a REST API with negligible overhead. Because it is REST-based, minimize network distance for best performance.

3.4 OBJECTIVES

Why objectives first? Well, there is a natural tendency to drop into the planning phase before you have thought out what you are trying to do in what I like to call big animal pictures. I call this descending into the weeds before you have a general idea of what you need to do. Projects, big data or otherwise, generally begin with determining your objectives and then breaking down the resources and tasks needed to complete the project.

Fig 3.2: Objectives

Now we move on to the heart of the question: how can we determine and isolate the propagation of company news from the reporting of financial news in Reuters to tweets about that information? Naturally, we also wanted to explore the different aspects of a tweet that might make it more or less influential. There are a number of tools available to measure some aspect of social authority, but for this project we focused on the following: the volume, velocity, and acceleration of tweets generated after a news article reports financial information, and the social authority (or influence) of the tweeter as indicated by his or her Klout score and number of followers. Finally, once we had all this data, how would we determine (algorithmically) the impact that both sources together, and each individually, had on a company's stock price? Based on what we just covered, here are our four objectives:

Table 3.3: Our four objectives

3.5 TECHNOLOGY USED

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report suggests suitable technologies include A/B testing, crowdsourcing, data fusion and integration, genetic algorithms, machine learning, natural language processing, signal processing, time series analysis and visualization. Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation, such as multilinear subspace learning.

Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet. Some, but not all, MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS. DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets, and in 2008 the technology went public with the launch of a company called Ayasdi. The practitioners of big data analytics processes are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms, from solid-state drives (SSD) to high-capacity SATA disks buried inside parallel processing nodes. The perception of shared storage architectures, storage area network (SAN) and network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost. Real or near-real-time information delivery is one of the defining characteristics of big data analytics, so latency is avoided whenever and wherever possible. Data in memory is good; data on a spinning disk at the other end of an FC SAN connection is not, and the cost of a SAN at the scale needed for analytics applications is very much higher than other storage techniques. There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.

CHAPTER 4: SYSTEM REQUIREMENTS

4.1 Introduction

Before you begin the Big Data Extensions deployment tasks, make sure that your system meets all of the prerequisites. Big Data Extensions requires that you install and configure vSphere, and that your environment meets minimum resource requirements. You must also make sure that you have licenses for the VMware components of your deployment.

vSphere Requirements

Before you can install Big Data Extensions, you must have set up the following VMware products.

Install vSphere 5.0 (or later) Enterprise or Enterprise Plus.

Note: The Big Data Extensions graphical user interface is only supported when using vSphere Web Client 5.1 and later. If you install Big Data Extensions on vSphere 5.0, you must perform all administrative tasks using the command-line interface.

When installing Big Data Extensions on vSphere 5.1 or later, you must use VMware vCenter Single Sign-On to provide user authentication. When logging in to vSphere 5.1 or later, you pass authentication to the vCenter Single Sign-On server, which you can configure with multiple identity sources such as Active Directory and OpenLDAP. On successful authentication, your username and password are exchanged for a security token, which is used to access vSphere components such as Big Data Extensions.

Enable the vSphere Network Time Protocol on the ESXi hosts. The Network Time Protocol (NTP) daemon ensures that time-dependent processes occur in sync across hosts.

Cluster Settings

Configure your cluster with the following settings.

Enable vSphere HA and vSphere DRS.

Enable Host Monitoring.

Enable Admission Control and set desired policy. The default policy is to tolerate one host failure.

Set the virtual machine restart priority to High

Set virtual machine monitoring to Virtual Machine and Application Monitoring.

Set the Monitoring sensitivity to High.

Enable vMotion and Fault Tolerance Logging.

All hosts in the cluster have Hardware VT enabled in the BIOS.

The Management Network VM kernel Port has vMotion and Fault Tolerance Logging enabled.

Network Settings

Big Data Extensions deploys clusters on a single network. Virtual machines are deployed with one NIC, which is attached to a specific port group. The environment determines how this port group is configured and which network backs the port group.

Either a standard vSwitch or a vSphere Distributed Switch (vDS) can be used to provide the port group backing a Serengeti cluster. A vDS acts as a single virtual switch across all attached hosts, while a vSwitch is per-host and requires the port group to be configured manually. When configuring your network for use with Big Data Extensions, the following ports must be open as listening ports.

Ports 8080 and 8443 are used by the Big Data Extensions plug-in user interface and the Serengeti Command-Line Interface Client.

Port 22 is used by SSH clients.

To avoid having to open a network firewall port to access Hadoop services, log in to the Hadoop client node and access your cluster from that node.

To connect to the Internet (for example, to create an internal Yum repository from which to install Hadoop distributions), you may use a proxy.

Direct Attached Storage

Direct attached storage should be attached and configured on the physical controller to present each disk separately to the operating system. This configuration is commonly described as Just a Bunch of Disks (JBOD). You must create VMFS datastores on direct attached storage using the following disk drive recommendations.

8-12 disk drives per host. The more disk drives per host, the better the performance.

1-1.5 disk drives per processor core.

7,200 RPM Serial ATA (SATA) disk drives.

Resource Requirements for the vSphere Management Server and Templates

Resource pool with at least 27.5GB RAM.

40GB or more (recommended) disk space for the management server and Hadoop template virtual disks.

Resource Requirements for the Hadoop Cluster

Datastore free space must be no less than the total size needed by the Hadoop cluster, plus swap disks for each Hadoop node equal to the memory size requested.

The network must be configured across all relevant ESX or ESXi hosts and must have connectivity with the network in use by the management server.

HA is enabled for the master node if HA protection is needed. You must use shared storage in order to use HA or FT to protect the Hadoop master node.

Hardware Requirements

Host hardware is listed in the VMware Compatibility Guide. To run at optimal performance, install your vSphere and Big Data Extensions environment on the following hardware.

Dual quad-core CPUs or greater that have Hyper-Threading enabled. If you can estimate your computing workload, consider a more powerful CPU.

Use High Availability (HA) and dual power supplies for the node's host machine.

4-8 GB of memory per processor core, with 6% overhead for virtualization.
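As an illustrative calculation (the host size here is an assumption, not a figure from the source): a host with two 8-core CPUs provisioned at 6 GB per core would need about 2 x 8 x 6 GB = 96 GB of RAM, plus roughly 6% (about 6 GB) of virtualization overhead, so on the order of 102 GB in total.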

Tested Host and Virtual Machine Support

The following is the maximum host and virtual machine support that has been confirmed to successfully run with Big Data Extensions.

45 physical hosts running a total of 182 virtual machines.

128 virtual ESXi hosts deployed on 45 physical hosts, running 256 virtual machines.

4.2 SYSTEM REQUIREMENTS

4.2.1 Software Requirements

When enterprise executives try to wrap their minds around the challenges of Big Data, two things quickly become evident: Big Data will require Big Infrastructure in one form or another, but it will also require new levels of management and analysis to turn that data into valuable knowledge. Too often, however, the latter part of that equation gets all the attention, resulting in situations in which all the tools are put in place to coordinate and interpret massive reams of data, only to get bogged down in endless traffic bottlenecks and resource allocation issues. However, Big Infrastructure usually requires Big Expenditures, so it makes sense to formulate a plan now for the kinds of volumes that are expected to become the common workloads of the very near future.

To some, that means the enterprise will have to adopt more of the technologies and architectures that currently populate the high-performance computing (HPC) world of scientific and educational facilities. As ZDNet's Larry Dignan pointed out this month, companies like Univa are adapting platforms like Oracle Grid Engine to enterprise environments. Company CEO Gary Tyreman notes that it's one thing to build a pilot Hadoop environment, but quite another to scale it to enterprise levels. Clustering technologies and even high-end appliances will go a long way toward getting the enterprise ready to truly tackle the challenges of Big Data.

Integrated hardware and software platforms are also making a big push for the enterprise market. Teradata just introduced the Unified Data Environment and Unified Data Architecture solutions that seek to dismantle the data silos that keep critical disparate data sets apart. Uniting key systems like the Aster and Apache Hadoop releases with new tools like Viewpoint, Connector and Vital Infrastructure, and wrapped in the new Warehouse Appliance 2700 and Aster Big Analytics appliances, the platforms aim for nothing less than complete, seamless integration and analysis of accumulated enterprise knowledge.

4.2.2 Hardware Requirements

As I mentioned, though, none of this will come on the cheap. Gartner predicts that Big Data will account for $28 billion in IT spending this year alone, rising to $34 billion next year and consuming about 10 percent of total capital outlays. Perhaps most ominously, nearly half of Big Data budgets will go toward social network analysis and content analytics, while only a small fraction will find its way to increasing data functionality. It seems, then, that the vast majority of enterprises are seeking to repurpose existing infrastructure for the needs of Big Data. It will be interesting to note whether future studies will illuminate the success or failure of that strategy.

Indeed, as application performance management (APM) firm OpTier notes in a recent analysis of Big Data trends, the primary challenge isn't simply to drill into large data volumes for relevant information, but to do it quickly enough that its value can be maximized. And on this front, the IT industry as a whole is sorely lacking. Fortunately, speeding up the process is not only a function of bigger and better hardware. Improved data preparation and contextual storage practices can go a long way toward making data easier to find, retrieve and analyze, much the same way that wide area networks can be improved through optimization rather than by adding bandwidth.

In short, then, enterprises will need to shore up infrastructure to handle increased volumes of traffic, but as long as that foundation is in place, many of the tools needed to make sense of it all are already available. However, the downside is that this will not be an optional undertaking. As in sports, success in business is usually a matter of inches, and organizations of all stripes are more than willing to invest in substantial infrastructure improvements to gain an edge, even a small one.

4.3 CONCLUSION

These are the hardware and software requirements that are needed for Hadoop to run. In the next chapter the system design will be discussed.

CHAPTER 5: SYSTEM DESIGN

5.1 INTRODUCTION

In the previous chapter we saw the requirements that are necessary. In this phase we look at the system design for a better understanding of the technology. System design is about the physical organization of the system, and it is demonstrated with the help of UML diagrams, block diagrams, and other pictorial representations.

5.2 HDFS ARCHITECTURE

HDFS comprises interconnected clusters of nodes where files and directories reside. An HDFS cluster consists of a single node, known as a NameNode, that manages the file system namespace and regulates client access to files. In addition, data nodes (DataNodes) store data as blocks within files.

Name nodes and data nodes

Within HDFS, a given name node manages file system namespace operations like opening, closing, and renaming files and directories. A name node also maps data blocks to data nodes, which handle read and write requests from HDFS clients. Data nodes also create, delete, and replicate data blocks according to instructions from the governing name node. As the figure illustrates, each cluster contains one name node. This design facilitates a simplified model for managing each namespace and arbitrating data distribution.

Relationships between name nodes and data nodes

Name nodes and data nodes are software components designed to run in a decoupled manner on commodity machines across heterogeneous operating systems. HDFS is built using the Java programming language; therefore, any machine that supports the Java programming language can run HDFS. A typical installation cluster has a dedicated machine that runs a name node and possibly one data node. Each of the other machines in the cluster runs one data node. The figure below illustrates the high-level architecture of HDFS.

Fig 5.1: Architecture of HDFS

Communications protocols

All HDFS communication protocols build on the TCP/IP protocol. HDFS clients connect to a Transmission Control Protocol (TCP) port opened on the name node, and then communicate with the name node using a proprietary Remote Procedure Call (RPC)-based protocol. Data nodes talk to the name node using a proprietary block-based protocol. Data nodes continuously loop, asking the name node for instructions. A name node can't connect directly to a data node; it simply returns values from functions invoked by a data node. Each data node maintains an open server socket so that client code or other data nodes can read or write data. The host or port for this server socket is known by the name node, which provides the information to interested clients or other data nodes. The name node maintains and administers changes to the file system namespace.

File system namespace

HDFS supports a traditional hierarchical file organization in which a user or an application can create directories and store files inside them. The file system namespace hierarchy is similar to most other existing file systems; you can create, rename, relocate, and remove files. HDFS also supports third-party file systems such as CloudStore and Amazon Simple Storage Service (S3). HDFS provides interfaces for applications to move them closer to where the data is located, as described in the following section.

Application interfaces into HDFS

You can access HDFS in many different ways. HDFS provides a native Java application programming interface (API) and a native C-language wrapper for the Java API. In addition, we can use a web browser to browse HDFS files. The applications described in Table 1 are also available to interface with HDFS; a short sketch of the native Java API follows the table.

Table 1. Applications that can interface with HDFS

FileSystem (FS) shell: A command-line interface similar to common Linux and UNIX shells (bash, csh, etc.) that allows interaction with HDFS data.

DFSAdmin: A command set that you can use to administer an HDFS cluster.

fsck: A subcommand of the Hadoop command/application. You can use the fsck command to check for inconsistencies with files, such as missing blocks, but you cannot use the fsck command to correct these inconsistencies.

Name nodes and data nodes: These have built-in web servers that let administrators check the current status of a cluster.
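As a brief illustration of the native Java API mentioned above, the following is a minimal sketch only: the NameNode URI matches the single-node setup used in Chapter 6, and the directory and file paths are placeholders echoing the listings shown there.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHDFS {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode of the single-node cluster described in Chapter 6.
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:54310"), conf);

        // List a directory, much like "hadoop fs -ls /user/hadoop".
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Open and print a file, much like "hadoop fs -cat /user/hadoop/sample.txt".
        try (FSDataInputStream in = fs.open(new Path("/user/hadoop/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}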

5.3 MODULES

Modules are nothing but the components that operate on the big data. They are:

1. Client
2. Server

The client is the end user who sends a request to the server to get a task accomplished. The server returns a response to the client for the request that was made. The figure below illustrates the architecture of the working of the client and server modules.

Fig 5.2: Architecture of client-server modules

5.4 UML DIAGRAMS

UML diagrams help us to understand the functioning of the system easily through a pictorial representation.

5.4.1 UML sequence diagram illustrating the operation of commands in Hadoop

Fig 5.3: Sequence diagram for the operation of commands in Hadoop

5.4.2 UML class diagram illustrating the Hadoop operation

Fig 5.4: Class diagram of Hadoop operation

5.5 CONCLUSION

Thus the UML diagrams help us in understanding the operation of the system in Hadoop. In the next chapter we discuss the implementation part of the project.

CHAPTER 6: IMPLEMENTATION

6.1 Introduction to HDFS commands

Previously, we saw how to install Apache Hadoop on 32-bit Ubuntu and how to run a sample program. Now that we have achieved this, let's explore further and see how to run basic HDFS commands.

6.2 Map Reduce: HDFS and GFS

Hadoop behaves like a distributed OS, and like any OS it needs its own file system. For a distributed file system, the important thing to understand is that there is an abstraction between its representation to the programmer and its actual mount point in the physical system. For our setup of a single-node cluster using only test data, this will be OK. But if you are going to install a Hadoop distribution on a large cluster using an installer that has a default mount point, please take caution early on: it is possible that your data size may increase later and your system may run out of space.

Moving on: start your Ubuntu setup and start the Hadoop system. You may need to switch user to hadoop and then navigate to $HADOOP_HOME/bin to run the start-all.sh script. Verify that Hadoop has started by running the jps command and checking that NameNode, DataNode, etc. are running.

1. Starting point

Go to the prompt and type hadoop. You will see the following response.

hadoop@sumod-hadoop:~$ hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client

We will focus on the fs command for now. We have used the jar command and fs command earlier to some extent.

2. Using the fs command

At the prompt, type hadoop fs to get a list of options for the fs command. You can see that the options for the hadoop fs command look a lot like familiar Unix commands such as ls, mv, cp etc. Remember that you should always run these in the form hadoop fs -<command>. We will now see some commands to be used with hadoop fs.

-ls: This command will list the contents of an HDFS directory. Note that / here means the root of HDFS and not of your local file system.

hadoop@sumod-hadoop:~$ hadoop fs -ls /
Found 2 items
drwxr-xr-x hadoop supergroup 0 2012-09-09 14:39 /app
drwxr-xr-x hadoop supergroup 0 2012-09-08 16:49 /user
hadoop@sumod-hadoop:~$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x hadoop supergroup 0 2012-09-08 16:49 /user/hadoop
hadoop@sumod-hadoop:~$

We have only hadoop as the user on our hadoop system. It is also the de facto superuser of the hadoop system in our setup. You can give the option -lsr to see the directory listing recursively. Of course, any HDFS directory can be given as a starting point.

-mkdir: This command can be used to create a directory in HDFS.

hadoop@sumod-hadoop:~$ hadoop fs -mkdir /user/hadoop/test
hadoop@sumod-hadoop:~$ hadoop fs -ls /user/hadoop
Found 3 items
drwxr-xr-x hadoop supergroup 0 2012-09-23 15:46 /user/hadoop/test
drwxr-xr-x hadoop supergroup 0 2012-09-08 16:36 /user/hadoop/wcinput
drwxr-xr-x hadoop supergroup 0 2012-09-08 16:49 /user/hadoop/wcoutput

mkdir is particularly useful when you have multiple users on your hadoop system. It really helps to have separate user directories on HDFS, the same way as on a UNIX system. So remember to create HDFS directories for your UNIX users who need to access Hadoop as well.

-count: This command will list the count of directories and files, and list the file size and file name.

hadoop@sumod-hadoop:~$ hadoop fs -count /user/hadoop
6 9 9349507 hdfs://localhost:54310/user/hadoop

-touchz: This command is used to create a file of 0 length. It is similar to the Unix touch command.

hadoop@sumod-hadoop:~$ hadoop fs -touchz /user/hadoop/test/temp.txt
hadoop@sumod-hadoop:~$ hadoop fs -ls /user/hadoop/test
Found 1 items
-rw-r--r-- 1 hadoop supergroup 0 2012-09-23 16:00 /user/hadoop/test/temp.txt
hadoop@sumod-hadoop:~$

-cp and -mv: These commands operate like regular Unix commands to copy and rename a file.

hadoop@sumod-hadoop:~$ hadoop fs -cp /user/hadoop/test/temp.txt /user/hadoop/test/temp1.txt
hadoop@sumod-hadoop:~$ hadoop fs -mv /user/hadoop/test/temp.txt /user/hadoop/test/temp2.txt
hadoop@sumod-hadoop:~$ hadoop fs -ls /user/hadoop/test
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2012-09-23 16:04 /user/hadoop/test/temp1.txt
-rw-r--r-- 1 hadoop supergroup 0 2012-09-23 16:03 /user/hadoop/test/temp2.txt
hadoop@sumod-hadoop:~$

-put and -copyFromLocal: These commands are used to put files from the local file system to the destination file system. The difference is that put allows reading from stdin, while copyFromLocal allows only a local file reference as a source.

hadoop@sumod-hadoop:~$ hadoop fs -put hdfs.txt hdfs://localhost:54310/user/hadoop/hdfs.txt
hadoop@sumod-hadoop:~$ hadoop fs -ls /user/hadoop
Found 7 items
-rw-r--r-- 1 hadoop supergroup 0 2012-09-24 10:02 /user/hadoop/hdfs.txt

To use stdin, see the following example. On Ubuntu, you need to press CTRL+D to stop entering the file.

hadoop@sumod-hadoop:~$ hadoop fs -put - /user/hadoop/sample.txt
this is a sample text file.
hadoop@sumod-hadoop:~$ hadoop fs -cat /user/hadoop/sample.txt
this is a sample text file.
hadoop@sumod-hadoop:~$

I have now removed the files to show an example of using copyFromLocal. Let's try copying multiple files this time.

hadoop@sumod-hadoop:~$ hadoop fs -copyFromLocal hdfs.txt temp.txt /user/hadoop/
hadoop@sumod-hadoop:~$ hadoop fs -ls /user/hadoop
Found 5 items
-rw-r--r-- 1 hadoop supergroup 0 2012-09-24 10:14 /user/hadoop/hdfs.txt
-rw-r--r-- 1 hadoop supergroup 1089803 2012-09-24 10:14 /user/hadoop/temp.txt

-get and -copyToLocal: These commands are used to copy files from HDFS to the local file system.

hadoop@sumod-hadoop:~$ hadoop fs -get /user/hadoop/hdfs.txt .
hadoop@sumod-hadoop:~$ ls
Desktop Documents Downloads hdfs.txt Music Pictures Public Templates #temp.txt# Videos
hadoop@sumod-hadoop:~$ hadoop fs -copyToLocal /user/hadoop/temp.txt
Usage: java FsShell [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
hadoop@sumod-hadoop:~$ hadoop fs -copyToLocal /user/hadoop/temp.txt .
hadoop@sumod-hadoop:~$ ls
Desktop Downloads Music Public #temp.txt# Videos
Documents hdfs.txt Pictures Templates temp.txt
hadoop@sumod-hadoop:~$

6.3 Hadoop Shell Commands:

1. DFShell   2. cat   3. chgrp   4. chmod   5. chown   6. copyFromLocal   7. copyToLocal   8. cp   9. du
10. dus   11. expunge   12. get   13. getmerge   14. ls   15. lsr   16. mkdir   17. moveFromLocal   18. mv
19. put   20. rm   21. rmr   22. setrep   23. stat   24. tail   25. test   26. text   27. touchz

1. DFShell
The HDFS shell is invoked by bin/hadoop dfs <args>. All the HDFS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local file system the scheme is file. The scheme and authority are optional; if not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenode:namenodeport/parent/child or simply as /parent/child (given that your configuration is set to point to namenode:namenodeport). Most of the commands in the HDFS shell behave like the corresponding UNIX commands; differences are described with each of the commands. Error information is sent to stderr and the output is sent to stdout.

2. cat
Usage: hadoop dfs -cat URI [URI ...]
Copies source paths to stdout.
Example:
hadoop dfs -cat hdfs://host1:port1/file1 hdfs://host2:port2/file2
hadoop dfs -cat file:///file3 /user/hadoop/file4
Exit Code: Returns 0 on success and -1 on error.

3. chgrp
Usage: hadoop dfs -chgrp [-R] GROUP URI [URI ...]
Change the group association of files. With -R, make the change recursively through the directory structure. The user must be the owner of the files, or else a super-user. Additional information is in the Permissions User Guide.

4. chmod
Usage: hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions User Guide.

5. chown
Usage: hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
Change the owner of files. With -R, make the change recursively through the directory structure. The user must be a super-user. Additional information is in the Permissions User Guide.

6. copyFromLocal
Usage: hadoop dfs -copyFromLocal <localsrc> URI
Similar to the put command, except that the source is restricted to a local file reference.

7. copyToLocal
Usage: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to the get command, except that the destination is restricted to a local file reference.

8. cp
Usage: hadoop dfs -cp URI [URI ...] <dest>
Copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
Example:
hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2
hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
Exit Code: Returns 0 on success and -1 on error.

9. du
Usage: hadoop dfs -du URI [URI ...]
Displays the aggregate length of files contained in the directory, or the length of a file in case it is just a file.
Example:
hadoop dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://host:port/user/hadoop/dir1
Exit Code: Returns 0 on success and -1 on error.

10. dus
Usage: hadoop dfs -dus <args>
Displays a summary of file lengths.

11. expunge
Usage: hadoop dfs -expunge
Empty the Trash. Refer to HDFS Design for more information on the Trash feature.

12. get
Usage: hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>
Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.
Example:
hadoop dfs -get /user/hadoop/file localfile
hadoop dfs -get hdfs://host:port/user/hadoop/file localfile
Exit Code: Returns 0 on success and -1 on error.

13. getmerge
Usage: hadoop dfs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and concatenates the files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file.

14. ls
Usage: hadoop dfs -ls <args>
For a file, returns stat on the file with the following format:
filename filesize modification_date modification_time permissions userid groupid
For a directory, it returns a list of its direct children as in Unix. A directory is listed as:
dirname modification_date modification_time permissions userid groupid
Example:
hadoop dfs -ls /user/hadoop/file1 /user/hadoop/file2 hdfs://host:port/user/hadoop/dir1 /nonexistentfile
Exit Code: Returns 0 on success and -1 on error.

15. lsr
Usage: hadoop dfs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.

16. mkdir
Usage: hadoop dfs -mkdir <paths>
Takes path URIs as arguments and creates directories. The behavior is much like unix mkdir -p, creating parent directories along the path.
Example:
hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop dfs -mkdir hdfs://host1:port1/user/hadoop/dir hdfs://host2:port2/user/hadoop/dir
Exit Code: Returns 0 on success and -1 on error.

17. moveFromLocal
Usage: dfs -moveFromLocal <src> <dst>
Displays a "not implemented" message.

18. mv
Usage: hadoop dfs -mv URI [URI ...] <dest>
Moves files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory. Moving files across filesystems is not permitted.
Example:
hadoop dfs -mv /user/hadoop/file1 /user/hadoop/file2
hadoop dfs -mv hdfs://host:port/file1 hdfs://host:port/file2 hdfs://host:port/file3 hdfs://host:port/dir1
Exit Code: Returns 0 on success and -1 on error.

19. put
Usage: hadoop dfs -put <localsrc> ... <dst>
Copy a single src, or multiple srcs, from the local file system to the destination filesystem. Also reads input from stdin and writes to the destination filesystem.
hadoop dfs -put localfile /user/hadoop/hadoopfile
hadoop dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop dfs -put localfile hdfs://host:port/hadoop/hadoopfile
hadoop dfs -put - hdfs://host:port/hadoop/hadoopfile (reads the input from stdin)
Exit Code: Returns 0 on success and -1 on error.

20. rm
Usage: hadoop dfs -rm URI [URI ...]
Delete files specified as arguments. Only deletes non-empty directories and files. Refer to rmr for recursive deletes.
Example:
hadoop dfs -rm hdfs://host:port/file /user/hadoop/emptydir
Exit Code: Returns 0 on success and -1 on error.

21. rmr
Usage: hadoop dfs -rmr URI [URI ...]
Recursive version of delete.
Example:
hadoop dfs -rmr /user/hadoop/dir
hadoop dfs -rmr hdfs://host:port/user/hadoop/dir
Exit Code: Returns 0 on success and -1 on error.

22. setrep
Usage: hadoop dfs -setrep [-R] <path>
Changes the replication factor of a file. The -R option is for recursively increasing the replication factor of files within a directory.
Example:
hadoop dfs -setrep -w 3 -R /user/hadoop/dir1
Exit Code: Returns 0 on success and -1 on error.

23. stat
Usage: hadoop dfs -stat URI [URI ...]
Returns the stat information on the path.
Example:
hadoop dfs -stat path
Exit Code: Returns 0 on success and -1 on error.

24. tail
Usage: hadoop dfs -tail [-f] URI
Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.
Example:
hadoop dfs -tail pathname
Exit Code: Returns 0 on success and -1 on error.

25. test
Usage: hadoop dfs -test -[ezd] URI
Options:
-e  check to see if the file exists. Return 0 if true.
-z  check to see if the file is zero length. Return 0 if true.
-d  check whether the path is a directory. Return 1 if it is a directory, else return 0.
Example:
hadoop dfs -test -e filename

26. text
Usage: hadoop dfs -text <src>
Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.

27. touchz
Usage: hadoop dfs -touchz URI [URI ...]
Create a file of zero length.
Example:
hadoop dfs -touchz pathname
Exit Code: Returns 0 on success and -1 on error.
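Many of the shell operations above have direct programmatic equivalents on the org.apache.hadoop.fs.FileSystem class used by the sample program in the next section. The sketch below is illustrative only; the NameNode URI mirrors the one used elsewhere in this report, while the paths and replication factor are placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShellEquivalents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:54310"), conf);

        Path dir = new Path("/user/hadoop/demo");
        fs.mkdirs(dir);                                                   // like: hadoop dfs -mkdir
        fs.copyFromLocalFile(new Path("local.txt"),
                new Path("/user/hadoop/demo/local.txt"));                 // like: hadoop dfs -put / -copyFromLocal
        fs.copyToLocalFile(new Path("/user/hadoop/demo/local.txt"),
                new Path("copy-of-local.txt"));                           // like: hadoop dfs -get / -copyToLocal
        fs.setReplication(new Path("/user/hadoop/demo/local.txt"), (short) 3); // like: hadoop dfs -setrep 3
        fs.create(new Path("/user/hadoop/demo/empty.txt")).close();       // like: hadoop dfs -touchz

        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/demo/local.txt")); // like: hadoop dfs -stat
        System.out.println(status.getLen() + " bytes, replication " + status.getReplication());

        fs.delete(dir, true);                                             // like: hadoop dfs -rmr
        fs.close();
    }
}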

6.4 Java Code

Program:

package cse.GNITC;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class ImplementCreateHDFSDir {
    public static void main(String[] args) throws IOException, URISyntaxException {
        String FolderName = args[0];
        System.out.println("Folder name is : " + args[0]);

        // 1. Get the Configuration instance
        Configuration conf = new Configuration();

        // 2. Add the configuration files to the object
        conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));
        conf.addResource(new Path("/usr/local/hadoop/conf/mapred-site.xml"));

        // 3. Get an instance of the HDFS FileSystem
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:54310"), conf);

        // 4. Get the folder name from the parameter
        String FN = FolderName.substring(FolderName.lastIndexOf('/') + 1, FolderName.length());
        System.out.println("FolderName name is : " + FN);
        Path folderpath = new Path(FolderName);

        // 5. Check if the folder already exists
        if (fs.exists(folderpath)) {
            System.out.println("Folder by name " + FolderName + " already exists");
            return;
        }

        // 6. Create the directory and release the FileSystem handle
        fs.mkdirs(folderpath);
        fs.close();
    }
}
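A note on running this program (the jar name below is illustrative, not taken from the project): once the class is compiled against the Hadoop client libraries and packaged into a jar, it can typically be launched with the hadoop launcher, for example: hadoop jar hdfs-dir-example.jar cse.GNITC.ImplementCreateHDFSDir /user/hadoop/newdir. The program prints the folder name and either creates the directory or reports that it already exists.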

OUTPUT:

6.5 CONCLUSION
In this way the code is implemented; the results are shown in the next chapter in the form of screenshots.

CHAPTER 7: SCREENSHOTS

1. CAT COMMAND

Fig 7.1 CAT Command
Usage: hadoop dfs -cat URI [URI ...]
Copies source paths to stdout.
Example:
hadoop dfs -cat hdfs://host1:port1/file1 hdfs://host2:port2/file2
hadoop dfs -cat file:///file3 /user/hadoop/file4
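For reference alongside the screenshot, the same operation can be expressed with the Java FileSystem API. This is a sketch only, assuming the Section 6.4 NameNode URI and a placeholder path:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:54310"), new Configuration());
        FSDataInputStream in = fs.open(new Path("/user/hadoop/file4"));
        // Stream the whole file to stdout, which is what "-cat" does for each URI.
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();
        fs.close();
    }
}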

2. COPYTOLOCAL COMMAND

Fig 7.2 copyToLocal Command
Usage: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to the get command, except that the destination is restricted to a local file reference.

3. CP COMMAND

Fig 7.3 cp Command
Usage: hadoop dfs -cp URI [URI ...] <dest>
Copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
Example:
hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2
hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
Exit Code: Returns 0 on success and -1 on error.

4. DU COMMAND

Fig 7.4 du Command
Usage: hadoop dfs -du URI [URI ...]
Displays the aggregate length of files contained in the directory, or the length of a file in case it is just a file.
Example:
hadoop dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://host:port/user/hadoop/dir1
Exit Code: Returns 0 on success and -1 on error.

5. DUS COMMAND

Fig 7.5 dus Command
Usage: hadoop dfs -dus <args>
Displays a summary of file lengths.

6. EXPUNGE COMMAND

Fig 7.6 expunge Command
Usage: hadoop dfs -expunge
Empty the Trash. Refer to HDFS Design for more information on the Trash feature.

7. GET COMMAND

Fig 7.7 get Command
Usage: hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>
Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.
Example:
hadoop dfs -get /user/hadoop/file localfile
hadoop dfs -get hdfs://host:port/user/hadoop/file localfile
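The download shown in the screenshot corresponds to copyToLocalFile() in the Java FileSystem API. A minimal sketch under the same assumptions as before (Section 6.4 NameNode URI, placeholder paths):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:54310"), new Configuration());
        // copyToLocalFile() downloads an HDFS file to the local filesystem, like "-get".
        fs.copyToLocalFile(new Path("/user/hadoop/file"), new Path("/home/hadoop/localfile"));
        fs.close();
    }
}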

8. GETMERGE COMMAND

Fig 7.8 getmerge Command
Usage: hadoop dfs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and concatenates the files in src into the destination local file. Optionally, addnl can be set to enable adding a newline character at the end of each file.

9. LS COMMAND

Fig 7.9 ls Command
Usage: hadoop dfs -ls <args>
For a file, returns stat on the file with the following format:
filename filesize modification_date modification_time permissions userid groupid
For a directory, it returns a list of its direct children, as in Unix. A directory is listed as:
dirname modification_date modification_time permissions userid groupid
Example:
hadoop dfs -ls /user/hadoop/file1 /user/hadoop/file2 hdfs://host:port/user/hadoop/dir1 /nonexistentfile
Exit Code: Returns 0 on success and -1 on error.
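A directory listing like the one in the screenshot can also be produced with listStatus(). This is a sketch only, reusing the Section 6.4 NameNode URI; the directory path is a placeholder.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:54310"), new Configuration());
        // listStatus() returns one FileStatus per direct child, like "-ls" on a directory.
        for (FileStatus st : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(st.getPath().getName() + "  " + st.getLen() + "  "
                    + st.getPermission() + "  " + st.getOwner() + "  " + st.getGroup());
        }
        fs.close();
    }
}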

10. LSR COMMAND

Fig 7.10 lsr Command
Usage: hadoop dfs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.

11. MKDIR COMMAND

Fig 7.11 mkdir Command
Usage: hadoop dfs -mkdir <paths>
Takes path URIs as argument and creates directories. The behaviour is much like Unix mkdir -p, creating parent directories along the path.
Example:
hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop dfs -mkdir hdfs://host1:port1/user/hadoop/dir hdfs://host2:port2/user/hadoop/dir

12. MOVEFROMLOCAL COMMAND

Fig 7.12 moveFromLocal Command
Usage: dfs -moveFromLocal <src> <dst>
Displays a "not implemented" message.

13. MV COMMAND

Fig 7.13 mv Command

Usage: hadoop dfs -mv URI [URI ...] <dest>
Moves files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory. Moving files across filesystems is not permitted.
Example:
hadoop dfs -mv /user/hadoop/file1 /user/hadoop/file2
hadoop dfs -mv hdfs://host:port/file1 hdfs://host:port/file2 hdfs://host:port/file3 hdfs://host:port/dir1
Exit Code: Returns 0 on success and -1 on error.

14. PUT COMMAND

Fig 7.14 put Command
Usage: hadoop dfs -put <localsrc> ... <dst>
Copy single src, or multiple srcs, from the local file system to the destination filesystem. Also reads input from stdin and writes to the destination filesystem.
hadoop dfs -put localfile /user/hadoop/hadoopfile
hadoop dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop dfs -put localfile hdfs://host:port/hadoop/hadoopfile
hadoop dfs -put - hdfs://host:port/hadoop/hadoopfile (reads the input from stdin)
Exit Code: Returns 0 on success and -1 on error.

15. RM COMMAND

Fig 7.15 rm Command
Usage: hadoop dfs -rm URI [URI ...]
Delete files specified as args. Only deletes files and empty directories. Refer to rmr for recursive deletes.
Example:
hadoop dfs -rm hdfs://host:port/file /user/hadoop/emptydir
Exit Code: Returns 0 on success and -1 on error.

16. RMR COMMAND

Fig 7.16 rmr Command
Usage: hadoop dfs -rmr URI [URI ...]
Recursive version of delete.
Example:
hadoop dfs -rmr /user/hadoop/dir
hadoop dfs -rmr hdfs://host:port/user/hadoop/dir
Exit Code: Returns 0 on success and -1 on error.

17. STAT COMMAND

Fig 7.17 stat Command
Usage: hadoop dfs -stat URI [URI ...]
Returns the stat information on the path.
Example:
hadoop dfs -stat path
Exit Code: Returns 0 on success and -1 on error.

18. TAIL COMMAND

Fig 7.18 tail Command
Usage: hadoop dfs -tail [-f] URI
Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.

19. TEST COMMAND

Fig 7.19 test Command
Usage: hadoop dfs -test -[ezd] URI
Options:
-e  check to see if the file exists; returns 0 if true.
-z  check to see if the file is zero length; returns 0 if true.
-d  check whether the path is a directory; returns 1 if it is a directory, else 0.
Example:
hadoop dfs -test -e filename

20. TEXT COMMAND

Fig 7.20 text Command
Usage: hadoop dfs -text <src>
Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.

21. TOUCHZ COMMAND

Fig 7.21 touchz Command
Usage: hadoop dfs -touchz URI [URI ...]
Create a file of zero length.
Example:
hadoop dfs -touchz pathname

7.2 Screen shot of Java Code Output:

Fig 7.22 Java Output

CHAPTER 8: TESTING AND VALIDATION

Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding. Testing presents an interesting anomaly for the software engineer.

Testing objectives include:
Testing is a process of executing a program with the intent of finding an error. A good test case is one that has a high probability of finding an as-yet-undiscovered error. A successful test is one that uncovers such an undiscovered error.

Testing principles:
1. All tests should be traceable to end-user requirements.
2. Tests should be planned long before testing begins.
3. Testing should begin on a small scale and progress toward testing in the large.
4. Exhaustive testing is not possible.
5. To be most effective, testing should be conducted by an independent third party.

Testing strategies:
A strategy for software testing integrates software test cases into a series of well-planned steps that result in the successful construction of software. Software testing is part of the broader topic referred to as verification and validation. Verification refers to the set of activities that ensure that the software correctly implements a specific function. Validation refers to the set of activities that ensure that the software that has been built is traceable to customer requirements.

8.1 Testing:
Testing is a process of executing a program with the intent of finding an error. Testing presents an interesting anomaly for the software engineer. The goal of software testing is to convince the system developer and customers that the software is good enough for operational use. Testing is a process intended to build confidence in the software.

8.2 Types of testing:
The various types of testing are:
1. White box testing
2. Black box testing
3. Alpha testing
4. Beta testing

8.2.1 White box testing:
It is also called glass-box testing. It is a test-case design method that uses the control structure of the procedural design to derive test cases. Using white box testing methods, the software engineer can derive test cases that
1. guarantee that all independent paths within a module have been exercised at least once, and
2. exercise all logical decisions on their true and false sides.

8.2.2 Black box testing:
It is also called behavioural testing. It focuses on the functional requirements of the software. It is a complementary approach that is likely to uncover a different class of errors than white box testing. Black box testing enables a software engineer to derive a set of input conditions that will fully exercise all functional requirements for a program.

8.2.3 Alpha testing:
Alpha testing takes place at the software prototype stage, when the software is first able to run. It will not have all the intended functionality, but it will have core functions and will be able to accept inputs and generate outputs. An alpha test usually takes place in the developer's offices on a separate system.

8.2.4 Beta testing:
The beta test is a live application of the software in an environment that cannot be controlled by the developer. The beta test is conducted at one or more customer sites by the end users of the software.

8.3 Path testing:
The established technique of the flow graph with cyclomatic complexity was used to derive test cases for all the functions.
The main steps in deriving test cases were:
1. Use the design of the code and draw the corresponding flow graph.
2. Determine the cyclomatic complexity of the resultant flow graph, using one of the formulas:
   V(G) = E - N + 2, or
   V(G) = P + 1, or
   V(G) = number of regions,
   where V(G) is the cyclomatic complexity, E is the number of edges, N is the number of flow graph nodes, and P is the number of predicate nodes.
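As a concrete illustration of the V(G) = P + 1 formula (this method is illustrative only and is not part of the project code), consider a small routine with two predicate nodes, one loop condition and one if condition:

public class PathTestingExample {
    // Two predicate nodes: the while condition and the if condition.
    // V(G) = P + 1 = 2 + 1 = 3, so basis path testing needs three test cases:
    //   1. an empty array           (loop body never runs)
    //   2. one non-negative value   (loop runs, if-branch false)
    //   3. one negative value       (loop runs, if-branch true)
    static int countNegatives(int[] values) {
        int count = 0;
        int i = 0;
        while (i < values.length) {      // predicate node 1
            if (values[i] < 0) {         // predicate node 2
                count = count + 1;
            }
            i = i + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNegatives(new int[] {}));      // path 1 -> 0
        System.out.println(countNegatives(new int[] { 5 }));   // path 2 -> 0
        System.out.println(countNegatives(new int[] { -3 }));  // path 3 -> 1
    }
}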

9: CONCLUSION AND FUTURE SCOPE

9.1 CONCLUSION

9.1.1 Select the Right Projects for Hadoop Implementation
Choose projects that fit Hadoop's strengths and minimize its disadvantages. Enterprises use Hadoop in data-science applications for log analysis, data mining, machine learning and image processing involving unstructured or raw data. Hadoop's lack of a fixed schema works particularly well for answering ad-hoc queries and exploratory "what if" scenarios. The Hadoop Distributed File System (HDFS) and MapReduce address the growth in enterprise data volumes from terabytes to petabytes and more, and the increasing variety of complex, multi-dimensional data from disparate sources.
For applications that require faster velocity for real-time or right-time data processing, Hadoop does have important speed limitations, although Apache HBase adds a distributed column-oriented database on top of HDFS and there is work in the Hadoop community to support stream processing. Likewise, compared to an enterprise data warehouse, current-generation Apache Hadoop does not offer a comparable level of feature sophistication to mandate deterministic query response times, balance mixed workloads, define role- and group-based user access, or place limits on individual queries.

9.1.2 Rethink and Adapt Existing Architectures to Hadoop
For most organizations, Hadoop is one extension or component of a broader data architecture. Hadoop can serve as a "data bag" for data aggregation and pre-processing before loading into a data warehouse. At the same time, organizations can offload data from an enterprise data warehouse into Hadoop to create virtual sandboxes for use by data analysts.
As part of your multi-year data architecture roadmap, be ready to accommodate changes from Hadoop and other technologies that impact Hadoop deployment. Devise an architecture and tools to efficiently implement the data processing pipeline and provision the data to production. Start small and grow incrementally, with a data platform and architecture that enable you to build once and deploy wherever it makes sense, using Hadoop or other systems, on premise or in the cloud.

9.1.3 Plan Availability of Skills and Resources Before You Get Started
One of the constraints of deploying Hadoop is the lack of enough trained personnel. There are many projects and sub-projects in the Apache ecosystem, making it difficult to stay abreast of all of the changes. Consider a platform approach to hide the complexity of the underlying technologies from analysts and other line-of-business users.

9.1.4 Prepare to Deliver Trusted Data for Areas That Impact Business Insight and Operations
Compared to the decades of feature development by relational and transactional systems, current-generation Hadoop offers fewer capabilities to track metadata, enforce data governance, verify data authenticity, or comply with regulations to secure customer non-public information. The Hadoop community will continue to introduce improvements and additions (for example, HCatalog is designed for metadata management), but it takes time for those features to be developed, tested, and validated for integration with third-party software. Hadoop is not a replacement for master data management (MDM): lumping data from disparate sources into a Hadoop data bag does not by itself solve broader business or compliance problems with inconsistent, incomplete or poor-quality data that may vary by business unit or by geography.
You can anticipate that data will require cleansing and matching for reporting and analysis. Consider your end-to-end data processing pipeline, and determine your needs for security, cleansing, matching, integration, delivery and archiving. Adhere to a data governance program to deliver authoritative and trustworthy data to the business, and adopt metadata-driven audits to add transparency and increase efficiency in development.

9.1.5 Adopt Lean and Agile Integration Principles
To transfer data between Hadoop and other elements of your data architecture, the HDFS API provides the core interface for loading or extracting data. Other useful tools include Chukwa, Scribe or Flume for the collection of log data, and Sqoop for data loading from or to relational databases. Hive enables ad-hoc query and analysis of data in HDFS using a SQL-like interface. Informatica PowerCenter version 9.1 includes connectivity for HDFS, to load data into Hadoop or extract data from Hadoop.

9.2 FUTURE SCOPE OF HADOOP

Advantages:

1. Distributes data and computation.
2. Tasks are independent.
3. Linear scaling in the ideal case; it is designed for cheap, commodity hardware.
4. HDFS stores large amounts of information.
5. HDFS has a simple and robust coherency model.
6. HDFS integrates well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.

REFERENCES & BIBLIOGRAPHY

1. http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
2. http://theglobaljournals.com/gra/file.php?val=February_2013_1360851170_47080_37.pdf
3. http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html
4. http://www.bigdatacompanies.com/5-big-disadvantages-of-hadoop-for-big-data/
5. http://www.havoozacademy.org
6. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6223552
