Pig : Data types and Operators
Data types:
simple data types:
---------------------
int --> 32 bit integer.
long --> 64 bit integer.
float --> 32 bit float [ not available in latest version ]
double -->...
Pig : How to perform grouping by Multiple Columns
How to perform grouping by multiple columns.
-------------------------------------------
task: multiple grouping with multiple aggregations.
sql: select dno, sex , sum(sal) , count(*),...
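The idea of grouping on two columns with two aggregations can be sketched outside Pig in plain Python: use the (dno, sex) pair as the group key and keep a running sum and count per key. This is only an illustration of the result shape, not the article's Pig script. The helper name `group_by_dno_sex` is hypothetical, and the sample rows are assumed to follow the emp layout (id, name, sal, sex, dno) used elsewhere on this page.

```python
from collections import defaultdict

# Sample rows in the emp layout: id,name,sal,sex,dno
rows = [
    "101,aaaa,40000,m,11",
    "102,bbbbbb,50000,f,12",
    "103,cccc,50000,m,12",
    "104,dd,90000,f,13",
    "105,ee,10000,m,12",
    "106,dkd,40000,m,12",
    "107,sdkfj,80000,f,13",
]


def group_by_dno_sex(lines):
    """Return {(dno, sex): (sum_of_sal, row_count)} (hypothetical helper)."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        _id, _name, sal, sex, dno = line.split(",")
        key = (int(dno), sex)      # composite group key
        totals[key][0] += int(sal)  # sum(sal)
        totals[key][1] += 1         # count(*)
    return {key: tuple(value) for key, value in totals.items()}


result = group_by_dno_sex(rows)
for (dno, sex), (total, cnt) in sorted(result.items()):
    print(dno, sex, total, cnt)
```

In Pig the same key would typically be expressed as a tuple of fields in the group clause; the dictionary key above plays that role.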
Pig : Entire Column Aggregations
Entire column aggregations. select sum(sal) from emp; grunt> describe emp emp: {id: int,name: chararray,sal: int,sex: chararray,dno: int} grunt> esal = foreach emp generate sal; grunt> rsum =...
Pig : Word Count Using Pig Data Flow
Word Count Using Pig DataFlow: [cloudera@quickstart ~]$ cat comment hadoop is great spark is great hadoop and spark combination is great [cloudera@quickstart ~]$ hadoop fs -copyFromLocal comment...
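For comparison, the word-count logic itself (split each line into words, then count occurrences of each word) can be sketched in plain Python. The three-line split of the `comment` file below is an assumption, since the excerpt shows the text flattened; the counts do not depend on where the lines break. The helper name `word_count` is hypothetical.

```python
from collections import Counter

# Contents of the "comment" file; the line boundaries are assumed.
lines = [
    "hadoop is great",
    "spark is great",
    "hadoop and spark combination is great",
]


def word_count(text_lines):
    """Flatten lines into words and count each word (hypothetical helper)."""
    words = (word for line in text_lines for word in line.split())
    return Counter(words)


counts = word_count(lines)
for word, n in sorted(counts.items()):
    print(word, n)
```

The flatten-then-group shape mirrors what a Pig data flow does for this task: tokenize, flatten, group by word, count.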
Spark : Entire Column Aggregations
Entire Column Aggregations: sql: select sum(sal) from emp; scala> val emp = sc.textFile("/user/cloudera/spLab/emp") emp: org.apache.spark.rdd.RDD[String] = /user/cloudera/spLab/emp...
Spark : Handling CSV files .. Removing Headers
scala> val l = List(10,20,30,40,50,56,67) scala> val r2 = r.collect.reverse.take(3) r2: Array[Int] = Array(67, 56, 50) scala> val r2 = sc.parallelize(r.collect.reverse.take(3)) r2:...
Spark : Conditional Transformations
Conditional Transformations:
val trans = emp.map{ x =>
  val w = x.split(",")
  val sal = w(2).toInt
  val grade = if(sal>=70000) "A" else if(sal>=50000)...
Spark : Union and Distinct
Unions in spark.
val l1 = List(10,20,30,40,50)
val l2 = List(100,200,300,400,500)
val r1 = sc.parallelize(l1)
val r2 = sc.parallelize(l2)
val r = r1.union(r2)
scala> r.collect.foreach(println)
[Stage...
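The behaviour of union followed by distinct can be illustrated with ordinary Python lists: a union keeps every element from both inputs, duplicates included, and a separate distinct step removes the repeats. This is a sketch of the semantics only, not Spark code. The helper names `union` and `distinct` are hypothetical, and unlike Spark, this `distinct` preserves first-seen order.

```python
l1 = [10, 20, 30, 40, 50]
l2 = [100, 200, 300, 400, 500]


def union(a, b):
    """Concatenate both inputs, keeping duplicates (hypothetical helper)."""
    return list(a) + list(b)


def distinct(items):
    """Drop repeated values, keeping the first occurrence of each."""
    return list(dict.fromkeys(items))


r = union(l1, l2)
print(r)

# A union of a list with itself doubles every element; distinct undoes that.
doubled = union(l1, l1)
print(len(doubled), distinct(doubled))
```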
Spark : CoGroup And Handling Empty Compact Buffers
Co Grouping using Spark:
--------------------------
scala>...
Pig : load Operator
Load Operator:
--------------
to load data from file to relation.
[cloudera@quickstart ~]$ cat > samp1
10020030040050090010012023123900800
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal samp1...
Pig : Subsetting using Filter, Limit, Sample
Techniques of subsetting relations:
i) filter: used for conditional filtering.
ii) limit: takes the first n tuples.
iii) sample: takes random sample sets ("with replace" model).
filter:...
Pig : Foreach Operator
Foreach Operator:
-------------------
grunt> emp = load 'piglab/emp' using PigStorage(',')
>> as (id:int, name:chararray, sal:int,
>> sex:chararray, dno:int);
i) to copy data from one...
Spark : Joins
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp spLab/e
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal dept spLab/d
[cloudera@quickstart ~]$ hadoop fs -cat...
Spark : Joins 2
Denormalizing datasets using Joins
[cloudera@quickstart ~]$ cat > children
c101,p101,Ravi,34
c102,p101,Rani,24
c103,p102,Mani,20
c104,p103,Giri,22
c105,p102,Vani,22
[cloudera@quickstart ~]$ cat >...
Pig : Order [ Sorting ] , exec, run , pig
order :- to sort data (tuples) in ascending or descending order. emp = load 'piglab/emp' using PigStorage(',') as (id:int, name:chararray, sal:int, sex:chararray, dno:int); e1 = order emp...
Pig : Joins
[cloudera@quickstart ~]$ hadoop fs -cat spLab/e 101,aaaa,40000,m,11 102,bbbbbb,50000,f,12 103,cccc,50000,m,12 104,dd,90000,f,13 105,ee,10000,m,12 106,dkd,40000,m,12 107,sdkfj,80000,f,13...
Pig : Cross Operator to Cartesian
Cross:
-----
used for cartesian product. Each element of the left set joins with each element of the right set.
ds1 --> (a) (b) (c)
ds2 --> (1) (2)
x = cross ds1, ds2...
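The pairing rule (every element of the left set combined with every element of the right set) can be sketched in Python with `itertools.product`. The inputs mirror the single-field tuples of ds1 and ds2 from the excerpt, on the assumption that ds2 is the intended right-hand relation. The helper name `cross` is hypothetical and this is not Pig's implementation.

```python
from itertools import product

# Single-field tuples, as in the excerpt's ds1 and ds2.
ds1 = [("a",), ("b",), ("c",)]
ds2 = [(1,), (2,)]


def cross(left, right):
    """Pair every left tuple with every right tuple (hypothetical helper)."""
    return [l + r for l, r in product(left, right)]


x = cross(ds1, ds2)
for row in x:
    print(row)
```

The output size is the product of the input sizes (3 x 2 = 6 rows here), which is why a cross on large relations grows quickly.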
Pig : UDFs
Pig UDFS ---------- UDF ---> user defined functions. adv: i) custom functionalities. ii) reusability. Pig UDFs can be developed by java python ruby c++ javascript...
Spark : Spark streaming and Kafka Integration
steps:
1) start zookeeper server
2) start Kafka brokers [ one or more ]
3) create topic.
4) start console producer [ to write messages into topic ]
5) start console consumer [ to test , whether...
Python Examples 1
name = input("Enter name ")
age = input("Enter age")
print(name, " is ", age, " years old ")
-----------------------------------
# if
a = 10
b = 25
if a>b:
    print(a , " is big")
else:
    print(b , "...
Pig : Udfs using Python
We can keep multiple functions under one program (.py).
transoform.py
-------------------------
from pig_util import outputSchema
@outputSchema("name:chararray")
def firstUpper(x):
    fc = x[0].upper()...
Hive Partitioned tables [case study]
[cloudera@quickstart ~]$ cat saleshistory 01/01/2011,2000 01/01/2011,3000 01/02/2011,5000 01/02/2011,4000 01/02/2011,1000 01/03/2011,2000 01/25/2011,3000 01/25/2011,5000 01/29/2011,4000...
Pig Video Lessons
Pig class Links: PigLab1 Video: https://drive.google.com/file/d/0B6ZYkhJgGD6XTzVHbzBYUFY0a1k/view?usp=sharing PigLab Notes1:...
Hive(10AmTo1:00Pm) Lab1 notes : Hive Inner and External Tables
hive> create table samp1(line string); -- here we did not select any database. default database in hive is "default". the hdfs location of default database is /user/hive/warehouse -- when...