Merging small file size splits Design Doc - Dev

9 Oct 2020


Hi all,
For the Hive data source, a driver is created per split, so more splits
means more parallelism. This works well when files are reasonably sized.
However, when the file size is far smaller than the Hive split size
(hive.max-initial-split-size, hive.max-split-size), the number of splits
is determined by the number of small files. Processing a large number of
small splits in parallel introduces CPU context-switch overhead, which
causes performance issues during data shuffling and results in low
CPU/memory utilization. In our testing we see a large performance impact
in high-concurrency environments, because task slots are occupied by
those small splits.
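
To make the idea concrete, here is a rough sketch of what merging small splits could look like. It is not taken from the design doc: the greedy packing strategy, the function name merge_small_splits, the target_size parameter, and the (path, size) representation of a split are all assumptions for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation of a split: (file_path, size_in_bytes).
Split = Tuple[str, int]


def merge_small_splits(splits: List[Split], target_size: int) -> List[List[Split]]:
    """Greedily pack splits, in input order, into groups of at most target_size bytes.

    A split larger than target_size ends up in a group by itself.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")

    groups: List[List[Split]] = []
    current: List[Split] = []
    current_size = 0

    for split in splits:
        _, size = split
        # Close the current group if adding this split would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(split)
        current_size += size

    # Flush the last, possibly partial, group.
    if current:
        groups.append(current)
    return groups
```

With this sketch, each merged group would be handed to a single driver, so the number of drivers scales with total data size rather than with the number of files.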
Please find the design doc attached.
Kindly let me know if you have any comments or suggestions.
Thanks,
Sandeep.k