moocer posted on 2020-6-12 12:08

【Spark】【Notes】Spark RDD transformations: double-Value (two-RDD) interactions

1 union(otherDataset) example
1. Purpose: returns a new RDD containing the union of the source RDD and the argument RDD
2. Task: create two RDDs and compute their union
(1) Create the first RDD
scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:24

(2) Create the second RDD
scala> val rdd2 = sc.parallelize(5 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:24
(3) Compute the union of the two RDDs
scala> val rdd3 = rdd1.union(rdd2)
rdd3: org.apache.spark.rdd.RDD[Int] = UnionRDD at union at <console>:28
(4) Print the union result
scala> rdd3.collect()
res18: Array[Int] = Array(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10)
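Note that, unlike a SQL UNION, RDD.union does not deduplicate: 5 appears in both inputs and therefore shows up twice in the result above. A minimal sketch of getting set semantics by chaining distinct() afterwards (assuming the same spark-shell session, with rdd1 and rdd2 still defined); the order of the collected elements is not guaranteed because distinct shuffles the data:
scala> rdd1.union(rdd2).distinct().collect()   // each of 1 to 10 exactly once, order not guaranteed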

2 subtract(otherDataset) example
1. Purpose: computes the set difference; elements of the source RDD that also appear in the other RDD are removed, and the remaining elements are kept
2. Task: create two RDDs and compute the difference of the first RDD minus the second RDD
(1) Create the first RDD
scala> val rdd = sc.parallelize(3 to 8)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:24
(2) Create the second RDD
scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:24
(3) Compute the difference of the first RDD minus the second RDD and print it
scala> rdd.subtract(rdd1).collect()
res27: Array[Int] = Array(8, 6, 7)
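The printed order (8, 6, 7) is partition-dependent because subtract repartitions the data; only membership matters, and the result is exactly the elements of 3 to 8 that are not in 1 to 5, i.e. 6, 7 and 8. The operation is not symmetric; a sketch of the reverse difference in the same session (assuming rdd and rdd1 are still defined):
scala> rdd1.subtract(rdd).collect()   // the elements of 1 to 5 not in 3 to 8, i.e. 1 and 2, in some order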

3 intersection(otherDataset) example
1. Purpose: returns a new RDD containing the intersection of the source RDD and the argument RDD
2. Task: create two RDDs and compute their intersection
(1) Create the first RDD
scala> val rdd1 = sc.parallelize(1 to 7)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:24
(2) Create the second RDD
scala> val rdd2 = sc.parallelize(5 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:24
(3) Compute the intersection of the two RDDs
scala> val rdd3 = rdd1.intersection(rdd2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD at intersection at <console>:28
(4) Print the result
scala> rdd3.collect()
res19: Array[Int] = Array(5, 6, 7)
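intersection also removes duplicates from the result and requires a shuffle. The RDD API additionally offers an overload that takes a target partition count for the result; a minimal sketch in the same session (the value 4 is only an illustration):
scala> rdd1.intersection(rdd2, 4).collect()   // same elements 5, 6, 7; the result RDD has 4 partitions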

4 cartesian(otherDataset) example
1. Purpose: computes the Cartesian product of two RDDs (avoid if possible)
2. Task: create two RDDs and compute their Cartesian product
(1) Create the first RDD
scala> val rdd1 = sc.parallelize(1 to 3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:24
(2) Create the second RDD
scala> val rdd2 = sc.parallelize(2 to 5)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:24
(3) Compute the Cartesian product of the two RDDs and print it
scala> rdd1.cartesian(rdd2).collect()
res17: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5))
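The reason cartesian should be avoided is its output size: with m elements in one RDD and n in the other, the result holds m × n pairs (here 3 × 4 = 12), and its partition count is the product of the two inputs' partition counts. A quick check in the same session:
scala> val cart = rdd1.cartesian(rdd2)
scala> cart.count()               // 12 = 3 * 4 pairs
scala> cart.partitions.length     // product of the partition counts of rdd1 and rdd2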

5 zip(otherDataset) example
1. Purpose: combines two RDDs into an RDD of key/value pairs. The two RDDs are assumed to have the same number of partitions and the same number of elements; otherwise an exception is thrown.
2. Task: create two RDDs and zip them together into a (k, v) RDD
(1) Create the first RDD
scala> val rdd1 = sc.parallelize(Array(1,2,3),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>:24
(2) Create the second RDD (same number of partitions as the first)
scala> val rdd2 = sc.parallelize(Array("a","b","c"),3)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD at parallelize at <console>:24
(3) Zip the first RDD with the second and print the result
scala> rdd1.zip(rdd2).collect
res1: Array[(Int, String)] = Array((1,a), (2,b), (3,c))

(4) Zip the second RDD with the first and print the result
scala> rdd2.zip(rdd1).collect
res2: Array[(String, Int)] = Array((a,1), (b,2), (c,3))
(5) Create a third RDD (partition count different from the first two)
scala> val rdd3 = sc.parallelize(Array("a","b","c"),2)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD at parallelize at <console>:24
(6) Zip the first RDD with the third and print the result
scala> rdd1.zip(rdd3).collect
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: List(3, 2)
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
  ... 48 elided
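Besides equal partition counts, zip also requires each pair of corresponding partitions to contain the same number of elements, so it can still fail at job time even when the partition counts match. When the two RDDs are not aligned like this, one possible workaround (a sketch, not part of the original notes) is to key both sides with zipWithIndex and join on the index, which only pairs up indices present in both RDDs:
scala> val indexed1 = rdd1.zipWithIndex().map { case (v, i) => (i, v) }
scala> val indexed3 = rdd3.zipWithIndex().map { case (v, i) => (i, v) }
scala> indexed1.join(indexed3).values.collect()   // (Int, String) pairs for the indices present in both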