Preprocessing the Data in Apache Spark
Steps in preprocessing
• Deploy data to the cluster
• Creating (building) an RDD
• Verifying the data through sampling
• Cleaning data, for example:
– Converting the data type
– Filling in missing values
• Other steps: integration, reduction, transformation, discretization
The Data Set
Deploy the data to the cluster
• Distributed computing requires the files to be distributed across the cluster
• Transfer the local data files to HDFS
– $ hdfs dfs -mkdir linkage
– $ hdfs dfs -put block_*.csv linkage
Creating an RDD
• Create an RDD (Resilient Distributed Dataset) from a text file
– val rawblocks = sc.textFile("hdfs:///user/yxie2/linkage2")
• Load data from an external database, such as a Hive table
– val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
– val test_enc_orc = hiveContext.sql("select * from test_enc_orc")
• Spark evaluates lazily: transformations are only recorded, and nothing is computed until an action is called
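The lazy behaviour noted on this slide can be sketched in a few lines. This is a minimal illustration, not the course's own code: it assumes Spark is on the classpath, runs in local mode rather than on the cluster, and uses a made-up three-line dataset. The names LazyDemo and countComplete are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  // Counts the records that contain no "?" field.
  def countComplete(sc: SparkContext): Long = {
    // Made-up sample lines standing in for a file on HDFS.
    val lines = sc.parallelize(Seq("1,2,0.5", "3,4,?", "5,6,1.0"))
    // Transformations: Spark only records them; no job runs yet.
    val fields = lines.map(_.split(','))
    val complete = fields.filter(f => !f.contains("?"))
    // Action: this call triggers the job that evaluates map and filter.
    complete.count()
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; an assumption made for the sketch.
    val conf = new SparkConf().setAppName("lazy-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(countComplete(sc)) finally sc.stop()
  }
}
```

Until count() is called, Spark has only built a description of the work, which is why defining an RDD over a large file returns immediately.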
Sampling data
• Sample:
– val head = rawblocks.first()
– val top10 = rawblocks.take(10)
• View data:
– Printing to the client console
• top10.foreach(println)
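The calls on this slide differ in what they return, which matters when printing: first() gives a single record and take(n) gives an array of the first n records. The sketch below shows both, plus takeSample, which draws records at random rather than from the top of the data. It assumes Spark on the classpath and a made-up stand-in for rawblocks; SamplingDemo and inspect are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SamplingDemo {
  // Returns the first record, the first ten records, and five random records.
  def inspect(sc: SparkContext): (String, Array[String], Array[String]) = {
    // Made-up stand-in for rawblocks: 100 short comma-separated lines.
    val rawblocks = sc.parallelize((1 to 100).map(i => s"$i,${i + 1}"))
    val head: String = rawblocks.first()           // one record
    val top10: Array[String] = rawblocks.take(10)  // first ten records
    // Five records drawn at random without replacement; the seed makes it repeatable.
    val random5 = rawblocks.takeSample(withReplacement = false, num = 5, seed = 42L)
    (head, top10, random5)
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; an assumption made for the sketch.
    val conf = new SparkConf().setAppName("sampling-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (head, top10, random5) = inspect(sc)
      println(head)            // a String is printed directly
      top10.foreach(println)   // an Array is printed one record per line
      random5.foreach(println)
    } finally sc.stop()
  }
}
```

All three calls are actions that bring data back to the client, so they are best kept to small sample sizes.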
The Data Set
• Pre-Process the data – Part II: Structuring Data with Tuples and Case Classes
– The records returned by take(10) are all strings of comma-separated fields
– To make it a bit easier to analyze this data, we will need to parse these strings into a structured format that converts the different fields into the correct data type
def parse(line: String) = {
  val pieces = line.split(',')
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  // Fields 2 through 10 are converted to Double
  val scores = pieces.slice(2, 11).map(_.toDouble)
  val matched = pieces(11).toBoolean
  (id1, id2, scores, matched)
}
val lines = sc.textFile("hdfs://...")
val flines = lines.map(line => parse(line))
• Missing values appear as "?", which the built-in toDouble cannot parse, so we need to define our own toDouble function
def toDouble(s: String) = {
  if ("?".equals(s)) Double.NaN else s.toDouble
}
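A short check of why the custom toDouble is needed, using plain Scala with no Spark dependency. The object name ToDoubleDemo is hypothetical; the function body is the one from the slide.

```scala
import scala.util.Try

object ToDoubleDemo {
  // Same logic as the slide: map the "?" placeholder to NaN.
  def toDouble(s: String): Double =
    if ("?".equals(s)) Double.NaN else s.toDouble

  def main(args: Array[String]): Unit = {
    // The built-in conversion throws NumberFormatException on "?".
    println(Try("?".toDouble).isFailure)  // true
    // The custom version returns NaN instead of failing.
    println(toDouble("?").isNaN)          // true
    println(toDouble("0.5"))              // 0.5
    // NaN never compares equal to anything, so test it with isNaN.
    println(Double.NaN == Double.NaN)     // false
  }
}
```

Because every comparison with NaN is false, later filters and aggregates need to handle missing scores explicitly with isNaN.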
def parse(line: String) = {
  val pieces = line.split(',')
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  val scores = pieces.slice(2, 11).map(toDouble)
  val matched = pieces(11).toBoolean
  (id1, id2, scores, matched)
}
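val flines = lines.map(line => parse(line))
// Tuple fields are accessed as _1 to _4, not tup(4); this assumes the filter was meant to test the score at index 4
val filtered_data = flines.filter(tup => tup._3(4) > 0)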
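The slide title mentions case classes, but the code above returns a tuple. The sketch below shows one way the same parse could return a case class, so fields are read by name rather than by position. The class name MatchData, the object name ParseDemo, and the sample record are illustrative assumptions, not taken from the slides.

```scala
// Hypothetical case class; its fields mirror the tuple returned by parse.
case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)

object ParseDemo {
  // Map the "?" placeholder to NaN instead of throwing NumberFormatException.
  def toDouble(s: String): Double =
    if ("?".equals(s)) Double.NaN else s.toDouble

  def parse(line: String): MatchData = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val scores = pieces.slice(2, 11).map(toDouble)
    val matched = pieces(11).toBoolean
    MatchData(id1, id2, scores, matched)
  }

  def main(args: Array[String]): Unit = {
    // Made-up record in the 12-field layout that parse expects.
    val md = parse("1,2,0.5,?,1,?,1,1,1,1,0,TRUE")
    println(md.id1)
    println(md.scores.length)
    println(md.scores(1).isNaN)
    println(md.matched)
  }
}
```

With named fields, a filter such as md.scores(4) > 0 reads more clearly than the positional tuple accessor.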
Transformation
Action