Hi all,
For the Hive data source, a driver is created per split, so more splits means
more parallelism. This works well when the files are reasonably sized.
However, when the files are far smaller than the Hive split sizes
(hive.max-initial-split-size, hive.max-split-size), the number of splits is
determined by the number of small files. Processing a large number of small
splits in parallel introduces CPU context-switch overhead, degrades
performance during data shuffling, and results in low CPU/memory utilization.
In our testing we see a large performance impact in high-concurrency
environments, since task slots are occupied by these small splits.
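To make the numbers concrete, here is a rough sketch assuming a
Presto/Trino-style Hive connector; the property values below are assumptions
for illustration only, not the defaults of any particular release:

    # Hypothetical catalog properties controlling split sizing
    hive.max-initial-split-size=32MB
    hive.max-split-size=64MB

    # With a 64MB split size, a 1GB table stored as a few large files
    # produces on the order of 1GB / 64MB = 16 splits, while the same
    # 1GB stored as 1,000 x 1MB files produces 1,000 splits (one per
    # file), each occupying its own driver/task slot.

In the small-file case the split count, and therefore the scheduling and
context-switch overhead, is dominated by the file count rather than the data
volume.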
Please find the design doc in the attachments.
Kindly let me know if there are any comments/suggestions.
Thanks,
Sandeep.k