Spark Core Deep Dive Optimizer PDF Download
Download (compiled by this site):
Extraction code: 0mlg
Main content:
Spark Partitions – Types
• Input
  – Controls: size
    • spark.default.parallelism (don't use)
    • spark.sql.files.maxPartitionBytes (mutable at runtime)
  – assuming the source has sufficient partitions
• Shuffle
  – Controls: count
    • spark.sql.shuffle.partitions
• Output
  – Controls: size
    • coalesce(n) to shrink
    • repartition(n) to increase and/or balance (incurs a shuffle)
    • df.write.option("maxRecordsPerFile", n)
(a sketch of the input and output controls follows this list)
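A minimal Scala sketch of the input and output controls above; the SparkSession, the paths, and every numeric value are illustrative assumptions, not taken from the slides:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-controls") // hypothetical app name
      .getOrCreate()

    // Input: cap how many bytes are packed into each read partition
    // (the default is 128 MB, i.e. 134217728 bytes).
    spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L)

    val df = spark.read.parquet("/data/events") // hypothetical source path

    // Output: coalesce(n) shrinks the partition count without a shuffle,
    // while repartition(n) shuffles to increase and/or balance partitions.
    df.coalesce(10)
      .write
      .option("maxRecordsPerFile", 1000000) // also bound rows per output file
      .parquet("/data/events_out")          // hypothetical destination path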
Partitions – Shuffle – Default
Default = 200 Shuffle Partitions
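The 200-partition default is applied no matter how much data the stage shuffles, so it is almost always worth overriding. A quick check and override on an existing SparkSession (the new value here is illustrative):

    println(spark.conf.get("spark.sql.shuffle.partitions")) // "200" by default
    spark.conf.set("spark.sql.shuffle.partitions", 400)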
Partitions – Right Sizing – Shuffle – Master Equation
• Largest shuffle stage
  – Target size <= 200 MB/partition
• Partition count = stage input data / target size
  – Solve for partition count
EXAMPLE
Shuffle stage input = 210 GB
x = 210,000 MB / 200 MB = 1,050
spark.conf.set("spark.sql.shuffle.partitions", 1050)
BUT -> if the cluster has 2000 cores, use the core count instead; with only 1050 partitions, 950 cores would sit idle:
spark.conf.set("spark.sql.shuffle.partitions", 2000)
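The rule above folds into a small Scala snippet: take the ceiling of stage input over target size, then floor the result at the cluster's core count so no core is left idle. The stage input and core count are assumptions you would read from the Spark UI and your cluster configuration:

    // Stage input observed in the Spark UI for the largest shuffle stage.
    val stageInputMb = 210000          // 210 GB
    val targetMbPerPartition = 200     // target <= 200 MB per partition
    val clusterCores = 2000

    // partitions = ceil(input / target), floored at the core count.
    val partitions = math.max(
      math.ceil(stageInputMb.toDouble / targetMbPerPartition).toInt,
      clusterCores
    )

    spark.conf.set("spark.sql.shuffle.partitions", partitions) // 2000 here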