Hi all,
For
hive datasource, driver is created per split, so more splits means more
parallelism, this is working without any issue if we have good file size ,
however , there is a case which the file size is far less than hive split
size(hive.max-initial-split-size, hive.max-split-size), thus number of splits
is determined by number of small files. To process a large number of small splits
in parallel, it definitely introduces overhead of CPU context switch, causing performance
issues when doing the data shuffling and resulting in low CPU/memory usage. Based
on our testing we are seeing huge performance impact in high concurrency
environment since task slot is occupied by those small split.
FindĀ design doc in attachments.
Kindly let me know if there are any comments/suggestions.