Hi all,

For hive datasource, driver is created per split, so more splits means more parallelism, this is working without any issue if we have good file size , however , there is a case which the file size is far less than hive split size(hive.max-initial-split-size, hive.max-split-size), thus number of splits is determined by number of small files. To process a large number of small splits in parallel, it definitely introduces overhead of CPU context switch, causing performance issues when doing the data shuffling and resulting in low CPU/memory usage. Based on our testing we are seeing huge performance impact in high concurrency environment since task slot is occupied by those small split.

Find design doc in attachments.

Kindly let me know if there are any comments/suggestions.