Problem Scenario 81 : You have been given MySQL DB with following details. You have been given following product.csv file
product.csv
productID,productCode,name,quantity,price
1001,PEN,Pen Red,5000,1.23
1002,PEN,Pen Blue,8000,1.25
1003,PEN,Pen Black,2000,1.25
1004,PEC,Pencil 2B,10000,0.48
1005,PEC,Pencil 2H,8000,0.49
1006,PEC,Pencil HB,0,9999.99
Now accomplish following activities.
1. Create a Hive ORC table using SparkSql
2. Load this data in Hive table.
3. Create a Hive parquet table using SparkSQL and load data in it.
Problem Scenario 85 : In Continuation of previous question, please accomplish following activities.
1. Select all the columns from product table with output header as below. productID AS ID
code AS Code name AS Description price AS 'Unit Price'
2. Select code and name both separated by ' -' and header name should be Product Description'.
3. Select all distinct prices.
4. Select distinct price and name combination.
5. Select all price data sorted by both code and productID combination.
6. count number of products.
7. Count number of products for each code.
Problem Scenario 65 : You have been given below code snippet.
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.tolnt, 2)
val c = a.zip(b)
operation1
Write a correct code snippet for operationl which will produce desired output, shown below.
Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2>, (ant,5))
Problem Scenario 90 : You have been given below two files
course.txt
id,course
1,Hadoop
2,Spark
3,HBase
fee.txt
id,fee
2,3900
3,4200
4,2900
Accomplish the following activities.
1. Select all the courses and their fees , whether fee is listed or not.
2. Select all the available fees and respective course. If course does not exists still list the fee
3. Select all the courses and their fees , whether fee is listed or not. However, ignore records having fee as null.
Problem Scenario 44 : You have been given 4 files , with the content as given below:
spark11/file1.txt
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework
spark11/file2.txt
The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.
spark11/file3.txt
his approach takes advantage of data locality nodes manipulating the data they have access to to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking
spark11/file4.txt
Apache Storm is focused on stream processing or what some call complex event processing. Storm implements a fault tolerant method for performing a computation or pipelining multiple computations on an event as it flows into a system. One might use Storm to transform unstructured data as it flows into a system into a desired format
(spark11Afile1.txt)
(spark11/file2.txt)
(spark11/file3.txt)
(sparkl 1/file4.txt)
Write a Spark program, which will give you the highest occurring words in each file. With their file name and highest occurring words.
Problem Scenario 60 : You have been given below code snippet.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"}, 3}
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","woif","bear","bee"), 3)
val d = c.keyBy(_.length)
operation1
Write a correct code snippet for operationl which will produce desired output, shown below.
Array[(lnt, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)),
(6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
Problem Scenario 43 : You have been given following code snippet.
val grouped = sc.parallelize(Seq(((1,"twoM), List((3,4), (5,6)))))
val flattened = grouped.flatMap {A =>
groupValues.map { value => B }
}
You need to generate following output.
Hence replace A and B
Array((1,two,3,4),(1,two,5,6))
Problem Scenario 20 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. Write a Sqoop Job which will import "retaildb.categories" table to hdfs, in a directory name "categories_targetJob".
Problem Scenario 64 : You have been given below code snippet.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(Ust("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
operation1
Write a correct code snippet for operationl which will produce desired output, shown below.
Array[(lnt, (Option[String], String))] = Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit}}, (6,(Some(salmon),turkey)), (6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat), (3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wo!f)), (4,(None,bear)))
Problem Scenario GG : You have been given below code snippet.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
val d = c.keyBy(.length)
operation 1
Write a correct code snippet for operationl which will produce desired output, shown below. Array[(lnt, String)] = Array((4,lion))
Problem Scenario 35 : You have been given a file named spark7/EmployeeName.csv (id,name).
EmployeeName.csv
E01,Lokesh
E02,Bhupesh
E03,Amit
E04,Ratan
E05,Dinesh
E06,Pavan
E07,Tejas
E08,Sheela
E09,Kumar
E10,Venkat
1. Load this file from hdfs and sort it by name and save it back as (id,name) in results directory. However, make sure while saving it should be able to write In a single file.