A different approach to LLM efficiency
A team at Apple released a research paper (arXiv:2312.11514v1) titled “LLM in a Flash: Efficient Large Language Model Inference with Limited Memory,” which focuses on optimizing large language model (LLM) inference on devices with limited DRAM capacity.
LLMs exhibit exceptional performance in various natural language tasks but have high computational and memory requirements. Running them efficiently, especially on devices with limited DRAM, is challenging.
The paper proposes storing the model parameters in flash memory and dynamically loading only the parameters needed at each step into DRAM during inference. This makes it possible to run models larger than the available DRAM.
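As a rough illustration of the idea (not the paper's actual implementation), the sketch below keeps a large weight matrix in flash-backed storage via NumPy memory mapping and copies only the rows needed for the current token into DRAM; the file name and dimensions are made up.

```python
import numpy as np

# Illustrative only: keep an FFN weight matrix on flash (disk) and copy just
# the rows needed for the current token into DRAM on demand.
ROWS, COLS = 16384, 4096                        # assumed layer dimensions
PATH = "ffn_weights.npy"                        # assumed file on flash storage

# One-time setup: write the weights to flash-backed storage.
weights = np.lib.format.open_memmap(PATH, mode="w+",
                                    dtype=np.float16, shape=(ROWS, COLS))
weights[:] = 0.01
weights.flush()

# Inference time: memory-map the file so nothing is resident until touched.
flash_weights = np.load(PATH, mmap_mode="r")

def load_needed_rows(row_ids):
    """Copy only the required rows from flash into DRAM."""
    return flash_weights[row_ids]               # fancy indexing returns a copy

active_rows = np.array([3, 17, 512, 900])       # rows needed for this token
dram_chunk = load_needed_rows(active_rows)
print(dram_chunk.shape)                          # (4, 4096)
```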
Techniques like “windowing” are used to reduce data transfer. This involves reusing previously activated neurons and transferring only essential data from flash to DRAM.
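A minimal sketch of the windowing idea, with an assumed window size and invented neuron ids: only neurons that are newly active need to be fetched from flash, while neurons that drop out of the recent-token window can be evicted from DRAM.

```python
from collections import deque

WINDOW = 5                      # assumed window size (number of recent tokens)

class NeuronWindowCache:
    def __init__(self, window=WINDOW):
        self.history = deque(maxlen=window)   # active-neuron sets per token
        self.resident = set()                 # neuron ids currently in DRAM

    def step(self, active_now):
        """Return (neurons to load from flash, neurons to evict from DRAM)."""
        active_now = set(active_now)
        to_load = active_now - self.resident          # must come from flash
        self.history.append(active_now)
        needed = set().union(*self.history)           # union over the window
        to_evict = self.resident - needed             # fell out of the window
        self.resident = (self.resident - to_evict) | to_load
        return to_load, to_evict

cache = NeuronWindowCache()
print(cache.step({1, 2, 3}))      # first token: load all three neurons
print(cache.step({2, 3, 4}))      # next token: only neuron 4 is transferred
```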
Transfer throughput is also improved by reading data from flash in larger, more contiguous chunks. A technique called row-column bundling stores related rows and columns together in flash memory, increasing the size of each chunk read and thus boosting throughput.
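The sketch below shows one plausible form of such bundling (dimensions invented): the i-th column of the up projection and the i-th row of the down projection are stored next to each other, so a single contiguous read per active neuron retrieves both.

```python
import numpy as np

# Illustrative bundling: store the i-th up-projection column and the i-th
# down-projection row side by side, so one contiguous read covers both.
D_MODEL, D_FF = 1024, 4096                       # assumed layer dimensions
rng = np.random.default_rng(0)
up_proj = rng.standard_normal((D_MODEL, D_FF)).astype(np.float16)    # columns = neurons
down_proj = rng.standard_normal((D_FF, D_MODEL)).astype(np.float16)  # rows = neurons

# One record per neuron, 2 * D_MODEL values long.
bundled = np.concatenate([up_proj.T, down_proj], axis=1)   # shape (D_FF, 2*D_MODEL)
np.save("bundled_ffn.npy", bundled)

# At inference, one chunk read per active neuron recovers both halves.
flash = np.load("bundled_ffn.npy", mmap_mode="r")
neuron = 123
record = np.asarray(flash[neuron])
up_col, down_row = record[:D_MODEL], record[D_MODEL:]
assert np.array_equal(up_col, up_proj[:, neuron])
assert np.array_equal(down_row, down_proj[neuron])
```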
The paper also covers detailed implementation strategies, including managing the data loaded into DRAM efficiently, predicting sparsity in the model to avoid loading parameters that will not be used, and optimizing memory preallocation.
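As a hedged example of sparsity prediction (the predictor weights, rank, and threshold here are placeholders, not the paper's trained values), a cheap low-rank scorer can guess which FFN neurons will be non-zero so that only those are fetched from flash:

```python
import numpy as np

# Placeholder low-rank predictor: scores each FFN neuron from the hidden state
# and returns the ones expected to fire, so only those are loaded from flash.
D_MODEL, D_FF, RANK = 1024, 4096, 32             # assumed dimensions
rng = np.random.default_rng(0)
A = rng.standard_normal((D_MODEL, RANK)).astype(np.float32)  # predictor factor 1
B = rng.standard_normal((RANK, D_FF)).astype(np.float32)     # predictor factor 2

def predict_active_neurons(hidden_state, threshold=0.0):
    """Return indices of neurons the predictor expects to be non-zero."""
    scores = hidden_state @ A @ B                # cheap low-rank scoring
    return np.flatnonzero(scores > threshold)

h = rng.standard_normal(D_MODEL).astype(np.float32)
active = predict_active_neurons(h)
print(f"load {active.size} of {D_FF} neurons from flash")
```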
Overall, the proposed methods enable running models up to twice the size of the available DRAM, with a 4-5x increase in inference speed on CPU and a 20-25x increase on GPU compared to naive loading.
At illogic we prefer a different approach: we extract data generated by an LLM trained on an industrial domain and use it to build an SSM (Small Specialist Model), a surrogate model (arXiv:2107.14574) created with machine learning techniques. On a different, similarly large dataset, this gave us results 15x faster, and we are now working to apply the same approach with LLMs.
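A minimal sketch of that surrogate workflow, with a toy stand-in for the large domain model and a generic scikit-learn regressor (names and data are illustrative, not our production pipeline):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def query_large_model(x):
    """Placeholder for the expensive domain LLM / simulator call."""
    return np.sin(x[0]) + 0.1 * x[1]              # stand-in for the real output

# 1. Extract a training set from the large model (offline, done once).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 2))
y = np.array([query_large_model(x) for x in X])

# 2. Fit the small specialist (surrogate) model on the extracted data.
ssm = GradientBoostingRegressor().fit(X, y)

# 3. Serve predictions from the surrogate instead of the large model.
print(ssm.predict(np.array([[1.0, 2.0]])))        # fast approximate answer
```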
This approach will help us implement a knowledge base in our product, Visualgear.ai.