I have created a Spark DataFrame by reading a CSV file from an HDFS location.

emp_df = spark.read.format("com.databricks.spark.csv") \
    .option("mode", "DROPMALFORMED") \
    .option("header", "true") \
    .option("inferschema", "true") \
    .option("delimiter", ",").load(PATH_TO_FILE)

I then saved this DataFrame as a partitioned Hive table in ORC format using the partitionBy method:

emp_df.repartition(5, 'emp_id').write.format('orc').partitionBy("emp_id").saveAsTable("UDB.temptable")

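When I read this table as shown below and look at the logical and physical plans, it appears that the data is filtered correctly using the partition key column: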

emp_df_1 = spark.sql("select * from UDB.temptable where emp_id ='6'")
emp_df_1.explain(True)

***************************************************************************
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('emp_id = 6)
   +- 'UnresolvedRelation `UDB`.`temptable`

== Analyzed Logical Plan ==
emp_name: string, emp_city: string, emp_salary: int, emp_id: int
Project [emp_name#7399, emp_city#7400, emp_salary#7401, emp_id#7402]
+- Filter (emp_id#7402 = cast(6 as int))
   +- SubqueryAlias temptable
      +- Relation[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] orc

== Optimized Logical Plan ==
Filter (isnotnull(emp_id#7402) && (emp_id#7402 = 6))
+- Relation[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] orc

== Physical Plan ==
*(1) FileScan orc udb.temptable[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://pathlocation/database/udb...., PartitionCount: 1, PartitionFilters: [isnotnull(emp_id#7402), (emp_id#7402 = 6)], PushedFilters: [], ReadSchema: struct
***************************************************************************
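
The `PartitionCount: 1` and `PartitionFilters` entries in this physical plan suggest that the condition on emp_id was resolved against the partition directories rather than against rows inside the ORC files. A minimal pure-Python sketch of that idea follows. It is only an illustration, not Spark's implementation; the directory names and the `prune_partitions` helper are hypothetical.

```python
from typing import Callable, Dict, List


def parse_partition_dir(dirname: str) -> Dict[str, str]:
    """Split a Hive-style directory name such as 'emp_id=6' into {'emp_id': '6'}."""
    key, _, value = dirname.partition("=")
    return {key: value}


def prune_partitions(
    dirs: List[str], predicate: Callable[[Dict[str, str]], bool]
) -> List[str]:
    """Keep only the directories whose partition values satisfy the predicate."""
    return [d for d in dirs if predicate(parse_partition_dir(d))]


# Hypothetical partition directories under the table root (assumption).
table_dirs = ["emp_id=5", "emp_id=6", "emp_id=7"]

# Rough equivalent of `where emp_id = 6`, evaluated on directory names only,
# so no data file has to be opened to discard the other partitions.
selected = prune_partitions(table_dirs, lambda p: int(p["emp_id"]) == 6)
print(selected)
```

In this toy model, only one directory survives the predicate, which is the analogue of the single partition reported in the plan.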

However, if I read this DataFrame through the absolute HDFS path, it does not seem to filter the data using the partition key column:

emp_df_2 = spark.read.format("orc").load("hdfs://pathlocation/database/udb.db/temptable/emp_id=6")
emp_df_2.explain(True)

******************************************************************************
== Parsed Logical Plan ==
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc

== Analyzed Logical Plan ==
emp_name: string, emp_city: string, emp_salary: int
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc

== Optimized Logical Plan ==
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc

== Physical Plan ==
*(1) FileScan orc [emp_name#7411,emp_city#7412,emp_salary#7413] Batched: true, Format: ORC, Location: InMemoryFileIndex[hdfs://pathlocation/data/database/udb.db/tem..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct
********************************************************************************
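
Note that the analyzed plan for this path-based read has no emp_id column at all. One plausible reading is that partition columns are inferred from `key=value` directory segments located below the path passed to the reader, so pointing the reader straight at the `emp_id=6` directory leaves nothing to infer and nothing to filter on. The sketch below models that in plain Python. It is an assumption-laden illustration, not Spark's code; the `/warehouse/...` paths, the file name, and the `discover_partition_columns` helper are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict


def discover_partition_columns(base: str, file_path: str) -> Dict[str, str]:
    """Collect key=value directory segments found between base and the file."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(base))
    columns: Dict[str, str] = {}
    for segment in relative.parts[:-1]:  # skip the file name itself
        if "=" in segment:
            key, _, value = segment.partition("=")
            columns[key] = value
    return columns


# Hypothetical table root and data file (assumptions, not the real paths).
table_root = "/warehouse/udb.db/temptable"
data_file = table_root + "/emp_id=6/part-00000.orc"

# Reading from the table root: the emp_id segment lies below the base path.
from_root = discover_partition_columns(table_root, data_file)

# Reading from the partition directory: no key=value segment remains below it.
from_leaf = discover_partition_columns(table_root + "/emp_id=6", data_file)

print(from_root, from_leaf)
```

Spark's partition discovery documentation describes a `basePath` reader option for keeping partition columns when a subdirectory is loaded; whether it changes the `PartitionFilters` entry in this particular setup has not been verified here.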

Could you help me understand the logical and physical plans in both cases?

How do the logical and physical plans work when reading a Hive partitioned ORC table into a PySpark DataFrame?