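Spark MLlib provides a statistics method called colStats() (invoked as Statistics.colStats). Calling it returns an instance of MultivariateStatisticalSummary, from which we can read the per-column maximum, minimum, mean, and variance, as well as the total row count and other summary values.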
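Sample data (sample_stat.txt, tab-separated in the file; the 20 values are laid out here as four rows of five columns):
1 2 3 4 5
6 7 1 5 9
3 5 6 3 1
3 1 1 5 6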
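import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Path to the tab-separated sample file
val data_path = "file:///Users/walle/Documents/D3/sparkmlib/sample_stat.txt"
// Split each line on tabs and parse every field as a Double
val data = sc.textFile(data_path).map(_.split("\t")).map(f => f.map(f => f.toDouble))
// Wrap each row in a dense MLlib vector
val data1 = data.map(f => Vectors.dense(f))
// Compute the column-wise summary statistics
val stat1 = Statistics.colStats(data1)
stat1.max       // per-column maximum
stat1.min       // per-column minimum
stat1.mean      // per-column mean
stat1.variance  // per-column sample variance
stat1.normL1    // per-column L1 norm
stat1.normL2    // per-column L2 norm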
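To make the meaning of each statistic concrete, here is a minimal plain-Scala sketch (no Spark required) that computes the same per-column quantities by hand. It is an illustration rather than MLlib's implementation. It assumes the 20 sample values form four rows of five columns, and that the variance is the unbiased sample variance (divided by n - 1). The names rows, cols, colMax, and so on are hypothetical helpers introduced only for this example.

```scala
// Assumption: the 20 sample values read as 4 rows x 5 columns.
val rows: Seq[Array[Double]] = Seq(
  Array(1.0, 2.0, 3.0, 4.0, 5.0),
  Array(6.0, 7.0, 1.0, 5.0, 9.0),
  Array(3.0, 5.0, 6.0, 3.0, 1.0),
  Array(3.0, 1.0, 1.0, 5.0, 6.0)
)

// Transpose so that each inner sequence holds one column.
val cols: Seq[Seq[Double]] = rows.map(_.toSeq).transpose
val n = rows.length

val colMax = cols.map(_.max)
val colMin = cols.map(_.min)
val colMean = cols.map(c => c.sum / n)

// Unbiased sample variance: squared deviations summed, divided by n - 1.
val colVariance = cols.zip(colMean).map { case (c, m) =>
  c.map(x => (x - m) * (x - m)).sum / (n - 1)
}

// L1 norm: sum of absolute values; L2 norm: square root of the sum of squares.
val colNormL1 = cols.map(c => c.map(x => math.abs(x)).sum)
val colNormL2 = cols.map(c => math.sqrt(c.map(x => x * x).sum))

println(s"max      = $colMax")
println(s"min      = $colMin")
println(s"mean     = $colMean")
println(s"variance = $colVariance")
println(s"normL1   = $colNormL1")
println(s"normL2   = $colNormL2")
```

Under these assumptions, the first column (1, 6, 3, 3) has mean 3.25 and variance 4.25, which can be compared against the first entries of stat1.mean and stat1.variance.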