Chem. J. Chinese Universities ›› 2022, Vol. 43 ›› Issue (10): 20220397. doi: 10.7503/cjcu20220397

• Physical Chemistry •

A Dataset Representativeness Metric and A Slicing Sampling Strategy for the Kennard-Stone Algorithm

WU Qingying, ZHU Zhenyu, WU Jianming, XU Xin

  1. Department of Chemistry, Fudan University, Shanghai 200438, China
  • Received: 2022-06-05 Online: 2022-10-10 Published: 2022-07-11
  • Contact: WU Jianming E-mail: jianmingwu@fudan.edu.cn
  • Supported by:
    the National Natural Science Foundation of China (21373053)

Abstract:

In machine learning with big data, it is essential to prepare a representative dataset for training a model. The Kennard-Stone (KS) algorithm and its derivatives form a large class of excellent dataset splitting methods. However, they rely heavily on empirical selection or modeling results to determine the sampling ratio and the number of samples. In addition, the computational complexity is O(K^3) according to the original literature, which makes the algorithm difficult to apply to massive data. In this paper, we design a metric based on dataset completeness to quantify how well an extracted subset represents the whole dataset. An amendment using a dynamic programming algorithm is put forward to reduce the complexity to O(K^2). A slicing sampling strategy is then proposed that divides the whole dataset into several subsets and performs KS sampling on each of them, which further improves the efficiency to O(K). Partial least squares regression tests show that the method improves sampling efficiency while still ensuring the representativeness of the finally extracted dataset.
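
The two speed-ups described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that K is the number of samples, that distances are Euclidean, and that slices are contiguous chunks of a fixed size; the abstract does not specify how the slices are formed. The function names kennard_stone and sliced_kennard_stone are hypothetical. Caching each point's distance to its nearest selected point is one common way to reach quadratic cost, and it may differ from the dynamic programming amendment in the paper.

```python
import math


def kennard_stone(points, n_select):
    """Return indices of n_select points picked by Kennard-Stone sampling.

    Sketch only: Euclidean distance and a cached nearest-selected distance
    are assumptions, not details taken from the paper.
    """
    k = len(points)
    if n_select >= k:
        return list(range(k))
    if n_select <= 0:
        return []

    # Seed with the two mutually farthest points (one O(K^2) pair scan).
    best_d, a, b = -1.0, 0, 1
    for i in range(k):
        for j in range(i + 1, k):
            d = math.dist(points[i], points[j])
            if d > best_d:
                best_d, a, b = d, i, j
    selected = [a, b][:n_select]
    chosen = {a, b}

    # Cache: distance from every point to its nearest selected point.
    min_d = [min(math.dist(p, points[a]), math.dist(p, points[b]))
             for p in points]

    while len(selected) < n_select:
        # Pick the unselected point farthest from the selected set.
        nxt = max((i for i in range(k) if i not in chosen),
                  key=min_d.__getitem__)
        selected.append(nxt)
        chosen.add(nxt)
        # O(K) refresh of the cache instead of recomputing all distances.
        for i in range(k):
            d = math.dist(points[i], points[nxt])
            if d < min_d[i]:
                min_d[i] = d
    return selected


def sliced_kennard_stone(points, ratio, slice_size):
    """Run KS inside fixed-size contiguous slices (hypothetical slicing rule)."""
    selected = []
    for start in range(0, len(points), slice_size):
        chunk = points[start:start + slice_size]
        n = max(1, round(ratio * len(chunk)))
        # Map slice-local indices back to indices in the full dataset.
        selected.extend(start + i for i in kennard_stone(chunk, n))
    return selected


if __name__ == "__main__":
    data = [(float(i), float(i % 3)) for i in range(12)]
    print(kennard_stone(data, 4))
    print(sliced_kennard_stone(data, ratio=0.5, slice_size=4))
```

With a fixed slice size s, each slice costs on the order of s^2 distance evaluations and there are about K/s slices, so the total grows roughly as K·s, which is linear in K. The trade-off is that points are only compared within their own slice, which is why a representativeness metric is needed to check the extracted subset.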

Key words: Kennard-Stone algorithm, Dataset completeness, Dataset representativeness, Linear scaling
