Spark 1.3.1 Basic Usage Tutorial

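Spark can be used in two ways: through an interactive shell or through a program. The shell supports Scala and Python; programs can be written in Scala, Python, or Java.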

This article is based on https://spark.apache.org/docs/latest/quick-start.html and can serve as a quick start.

For more detailed material and usage, see https://spark.apache.org/docs/latest/programming-guide.html

I. Basics

1. Every Spark operation works on an RDD (Resilient Distributed Dataset). "Resilient" refers to fault tolerance: a lost partition can be recomputed from the operations that produced it. An RDD can also be moved easily between memory and disk storage.

2. RDD operations fall into two classes, transformations and actions. A transformation builds a new RDD from an existing one (for example, filter); an action computes a result from an RDD (for example, count).

II. Interactive shell

1. Quick start

    $ ./bin/spark-shell

(1) Read a file into an RDD, then count its lines and show its first line.

    scala> val textFile = sc.textFile("/mnt/jediael/spark-1.3.1-bin-hadoop2.6/README.md")
    textFile: org.apache.spark.rdd.RDD[String] = /mnt/jediael/spark-1.3.1-bin-hadoop2.6/README.md MapPartitionsRDD[1] at textFile at <console>:21

    scala> textFile.count()
    res0: Long = 98

    scala> textFile.first()
    res1: String = # Apache Spark

(2) Count the lines that contain "Spark".

    scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23

    scala> linesWithSpark.count()
    res0: Long = 19

(3) The filter and count above can be chained into one expression.

    scala> textFile.filter(line => line.contains("Spark")).count()
    res1: Long = 19

2. Going a little deeper

(1) Use map to compute the number of words on each line, then reduce to find the largest of those word counts.

    scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
    res2: Int = 14

(2) Java libraries can be called directly from Scala.

    scala> import java.lang.Math
    import java.lang.Math

    scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
    res2: Int = 14

(3) A word count implementation: flatMap splits each line into words, map pairs every word with 1, and reduceByKey adds up the counts for each word.

    scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:24

    scala> wordCounts.collect()
    res4: Array[(String, Int)] = Array((package,1), (For,2), (processing.,1), (Programs,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), (page](),1), (Once,1), (application,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,2), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (all,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (HDFS,1), (Versions,1), (Data.,1), (>…

3. Caching: keeping an RDD in the cache can greatly speed up processing when the same RDD is used more than once. Calling cache() only marks the RDD; the data is actually stored the first time an action computes it.

    scala> linesWithSpark.cache()
    res5: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:23

    scala> linesWithSpark.count()
    res8: Long = 19

III. Writing a program

This is the Scala version. I am not familiar with it yet and will run it later.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf

    object SimpleApp {
      def main(args: Array[String]) {
        val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
        val conf = new SparkConf().setAppName("Simple Application")
        val sc = new SparkContext(conf)
        // Read the file into 2 partitions and cache it, since it is scanned twice below
        val logData = sc.textFile(logFile, 2).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }
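The shell steps from the earlier sections (filter, cache, count, then map and reduce) can also be put together in a small standalone program. What follows is only a minimal sketch under stated assumptions, not the method of the official quick start: the object name LazyEvalSketch, the helper summarize, the sample lines, and the "local[2]" master are all made up for illustration, and it assumes spark-core 1.3.1 is on the classpath. It also shows that a transformation such as filter does no work by itself; the job only runs when an action such as count or reduce is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  // Hypothetical helper: returns (number of lines containing `keyword`,
  // largest word count among those lines). It assumes at least one line
  // matches, because reduce on an empty RDD throws an exception.
  def summarize(sc: SparkContext, data: Seq[String], keyword: String): (Long, Int) = {
    val lines = sc.parallelize(data)
    // Transformation: only records what to do; no job runs here.
    // cache() marks the result to be kept in memory once it is computed.
    val matching = lines.filter(line => line.contains(keyword)).cache()
    // First action: runs the filter and fills the cache.
    val matchCount = matching.count()
    // Second action: can reuse the cached partitions instead of filtering again.
    val maxWords = matching
      .map(line => line.split(" ").length)
      .reduce((a, b) => math.max(a, b))
    (matchCount, maxWords)
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption so the sketch runs on one machine without a cluster.
    val conf = new SparkConf().setAppName("LazyEvalSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for a real file
      val sample = Seq("# Apache Spark", "hello world", "Spark is a fast engine")
      val (matchCount, maxWords) = summarize(sc, sample, "Spark")
      println(s"lines with Spark: $matchCount, max words: $maxWords")
    } finally {
      sc.stop() // always release the context, even if a job fails
    }
  }
}
```

Setting the master in code keeps the sketch easy to try locally; for a real deployment the master is usually left out of the code and supplied when the application is submitted.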
