Attribute reduction is a key issue in rough set research. To address the shortcomings of attribute reduction algorithms based on the discernibility matrix, an attribute reduction method based on sample extraction and priority is presented. First, equivalence classes are divided using quick sort to compute a compressed decision table. Second, important samples are extracted from the compressed decision table using the iterative self-organizing data analysis technique algorithm (ISODATA). Finally, attribute reduction of the sample decision table is conducted based on the concept of priority. Experimental results show that the proposed method significantly reduces the overall execution time and improves reduction efficiency.
Keywords: rough sets; sample extraction; priority; attribute reduction; algorithm
Rough set theory is a mathematical tool for dealing with vague and uncertain knowledge, presented by Pawlak in 1982 [1]. It can mine hidden knowledge from mass data and has been successfully applied in fields such as data mining, pattern recognition and decision support systems [2,3]. In recent years, it has also been used in big data [4], information sharing [5] and other fields. Attribute reduction is one of the core topics of rough set theory. Finding all reductions, or a minimal reduction, has been proved to be an NP-hard problem [6], so the common approach is to adopt heuristic algorithms. These normally fall into two main classes: algorithms based on discernibility matrices and algorithms based on the positive region.
Because attribute reduction algorithms based on the discernibility matrix are succinct and easy to understand, they have attracted a great deal of attention from scholars [7]. However, these algorithms must construct the discernibility matrix, which contains many repeated elements. When processing large data sets, they not only consume a great deal of memory but also waste much computing time during reduction. Because the discernibility matrix need not be constructed, attribute reduction algorithms based on the positive region have attracted increasing attention [8], but their time complexity is still far from satisfactory on large data sets [9].
If unnecessary elements are removed from the decision table and only essential elements are extracted for attribute reduction, reduction algorithms can reach high efficiency. Hart first put forward the concept of sample extraction, and typical algorithms include CNN [10] and RNN [11]. The numerous sample extraction algorithms can be broadly divided into two kinds: classification competence enhancement approaches and classification competence preservation approaches [12]. At present, many scholars have carried out extensive related research on sample extraction.
In [9], condition attributes are clustered first, the clusters are randomly sampled, and attribute reduction is then conducted on the sample data set. The literature [13] proposes an attribute reduction algorithm based on sample extraction with three steps: first, important instances are selected from the set of objects; second, a discernibility matrix is constructed using the selected instances; finally, all reductions of the decision table are computed. In [14], an attribute reduction algorithm based on granular computing and discernibility ability is presented. It uses stratified sampling from statistics to divide the original large data set into several sample subsets and conducts attribute reduction on each subset; finally, weight fusion combines the reduction results of all subsets into the final reduction of the original data set. In [15], it is stated that useless samples are filtered out to reduce memory space and running time during reduction. The condensation rule based on fuzzy rough sets (FRSC) is introduced in [16]. In rough set theory, the samples near the decision boundary are often those with a smaller lower approximation; based on this characteristic, FRSC uses the lower approximation as the basis for sample extraction.
Most sample extraction algorithms use random sampling or sampling based on a certain criterion, but why these choices work and how to achieve the optimal sampling effect have not been fully explained in theory or practice. Building on the research above, this paper presents an attribute reduction method based on sample extraction and priority. Because the samples near the classification boundary carry great uncertainty, they provide more information about classification and play an important role in it. Therefore, in this paper, the samples near the classification boundary are extracted and used for attribute reduction. Experiments illustrate the validity and efficiency of this method.
This paper improves on previous attribute reduction algorithms based on sample extraction by compressing the decision table, improving the sampling method and enhancing reduction efficiency. These improvements are primarily reflected in the following aspects.
In some references, it is believed that sample extraction can remove redundant objects from the decision table. In fact, however, the selected samples are still likely to be redundant if only sample extraction is used. Therefore, data preprocessing must be performed by other means before sample extraction in order to remove redundant objects from the decision table completely. This paper proposes dividing equivalence classes by sorting, as described below.
Algorithm 1 (equivalence class division based on quick sort).
Input: decision table S = (U, C ∪ D).
Output: the equivalence classes of U with respect to C.
Step 1: sort the objects of U by their condition attribute values using quick sort.
Step 2: scan the sorted objects in order; consecutive objects with identical condition attribute values are placed in the same equivalence class, and a new class is started whenever the values change.
Step 3: return the set of equivalence classes.
The running time of Algorithm 1 consists of two parts: quick sort and equivalence class division, of which quick sort is the main cost. Because the total cost of quick sort is O(|U|log|U|) comparisons, each over at most |C| attribute values, the time complexity of Algorithm 1 is O(|C||U|log|U|).
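To make the procedure concrete, the sort-then-scan division of Algorithm 1 can be sketched in Python as follows; the tuple layout (condition attribute values followed by the decision value) and the function name are illustrative assumptions, not the paper's notation.

```python
def divide_equivalence_classes(objects):
    """Divide objects into equivalence classes by sorting.

    Each object is a tuple of condition attribute values followed by
    the decision value; objects are equivalent when all their
    condition attribute values coincide.
    """
    # Sorting by condition attribute values plays the role of the
    # quick sort step in Algorithm 1.
    ordered = sorted(objects, key=lambda obj: obj[:-1])
    classes = []
    for obj in ordered:
        # A new equivalence class starts whenever the condition
        # attribute values differ from the previous object's.
        if classes and classes[-1][0][:-1] == obj[:-1]:
            classes[-1].append(obj)
        else:
            classes.append([obj])
    return classes

# The redundant decision table from Table 1 (last column = decision).
table = [(1, 0, 1, 0), (2, 0, 1, 1), (1, 0, 1, 0), (0, 0, 0, 0),
         (1, 2, 0, 1), (1, 2, 0, 1), (2, 2, 0, 1)]
print(len(divide_equivalence_classes(table)))  # 5 equivalence classes
```

Sorting groups equal condition-value tuples together, so one linear scan suffices to split the classes, matching the two-part cost analysis above.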
A redundant decision table (three condition attributes; the last column is the decision attribute).
c1  c2  c3  d
1   0   1   0
2   0   1   1
1   0   1   0
0   0   0   0
1   2   0   1
1   2   0   1
2   2   0   1
On the basis of sorting, non-redundant data are chosen to form the compressed decision table, which is defined as follows.
In decision table S = (U, C ∪ D), the table obtained by retaining exactly one representative object from each equivalence class of U is called the compressed decision table (Definition 1). According to Definition 1, the generation algorithm for the compressed decision table is given below.
Algorithm 2 (generation of the compressed decision table).
Input: decision table S = (U, C ∪ D).
Output: the compressed decision table S′ = (U′, C ∪ D).
Step 1: compute the equivalence classes of U using Algorithm 1.
Step 2: for each equivalence class, add one representative object to U′.
Step 3: return S′.
For Algorithm 2, the time complexity is O(|U|), since the equivalence classes produced by Algorithm 1 are scanned only once.
Compressed decision table (the last column is the decision attribute).
c1  c2  c3  d
2   0   1   1
1   0   1   0
0   0   0   0
1   2   0   1
2   2   0   1
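A minimal sketch of Algorithm 2 in Python, assuming the same tuple layout as before (condition attribute values followed by the decision value); keeping the first representative of each class via a hash set reproduces the single-scan cost.

```python
def compress(objects):
    """Keep one representative per equivalence class (Algorithm 2 sketch).

    Objects with identical condition attribute values (all fields
    except the last, which is the decision) are redundant; only the
    first occurrence of each is retained.
    """
    seen, compressed = set(), []
    for obj in objects:
        key = obj[:-1]           # condition attribute values
        if key not in seen:      # first representative of its class
            seen.add(key)
            compressed.append(obj)
    return compressed

# The redundant decision table from Table 1.
table = [(1, 0, 1, 0), (2, 0, 1, 1), (1, 0, 1, 0), (0, 0, 0, 0),
         (1, 2, 0, 1), (1, 2, 0, 1), (2, 2, 0, 1)]
print(compress(table))  # 5 rows, matching Table 2 up to ordering
```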
After removing redundant objects in the previous step, we try to further reduce the number of objects by sampling in order to improve the efficiency of attribute reduction. In some papers, instances are selected by random sampling; in this paper, the samples near the classification boundary are selected instead, because they carry the most classification information.
Algorithm 3 (sample extraction based on ISODATA).
Input: compressed decision table S′ = (U′, C ∪ D) and sampling distance d.
Output: the sample decision table.
Step 1: cluster the objects in U′ using ISODATA.
Step 2: for each cluster and each object in it, if the distance between the object and its cluster centre is not less than d, add the object to the sample set.
Step 3: return the sample decision table formed by the selected objects.
In Algorithm 3, the sampling distance d controls the number of selected samples: the larger d is, the fewer objects are extracted.
Through sample extraction, objects that have an important influence on the reduction are selected, which greatly reduces the number of objects and improves the efficiency of reduction.
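The boundary-oriented selection step can be sketched as follows. This is a simplified illustration, not the paper's implementation: the cluster centres are supplied directly rather than computed by ISODATA, and the distance rule (keep objects at least the sampling distance from every centre, i.e. far from any centre and hence near a boundary) is an assumption consistent with the behaviour reported in the experiments.

```python
import math

def extract_boundary_samples(objects, centres, d):
    """Select objects lying near the classification boundary.

    An object is kept when its distance to the nearest cluster
    centre is at least the sampling distance d: such objects are
    far from every centre and hence close to a cluster boundary.
    """
    samples = []
    for obj in objects:
        dist = min(math.dist(obj, c) for c in centres)
        if dist >= d:
            samples.append(obj)
    return samples

# Two well-separated clusters with one point drifting between them.
points  = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0), (2.1, 2.0), (1.0, 1.0)]
centres = [(0.05, 0.0), (2.05, 2.0)]
print(extract_boundary_samples(points, centres, 0.5))  # [(1.0, 1.0)]
```

Raising d shrinks the sample set, which matches the trend in Table 4: larger sampling distances yield fewer samples.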
After sample extraction, attribute reduction can be conducted. Some studies reduce attributes using the discernibility matrix. However, the discernibility matrix method has drawbacks that lower its efficiency and largely limit its wider application: it occupies a large amount of memory and has high time complexity. Attribute reduction algorithms based on the positive region have better time and space complexity because they do not need to build the discernibility matrix. Based on the concept of the positive region, an attribute reduction algorithm based on priority is put forward in this paper, and its effect is illustrated by time complexity analysis and experiments.
This paper introduces the concept of priority to measure the discernibility ability. The definition of priority is presented here.
For decision table
It can be seen from the above definition that the more equivalence classes the attribute set divides the object set into, and the more balanced those classes are, the stronger the attribute set's ability to distinguish the objects.
For decision table
Clearly, the smaller the priority is, the greater the discernibility ability of the attribute set. Theorem 1 is given on this basis.
According to the above definitions, Theorem 1 holds clearly.
According to the above theoretical foundation, the attribute reduction algorithm based on priority is given below.
Algorithm 4 (attribute reduction based on priority).
Input: the sample decision table.
Output: a reduct R.
Step 1: calculate the positive region of the sample decision table and set R = ∅.
Step 2: while R does not preserve the positive region:
  calculate the priority of each attribute in C − R;
  choose the attribute with the smallest priority and add it to R.
Step 3: return R.
For Algorithm 4, the time complexity is determined mainly by the repeated priority calculations over the remaining condition attributes; because the algorithm runs on the small sample decision table rather than the original data set, its cost is low in practice.
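The greedy loop of Algorithm 4 can be sketched as follows. Since the paper's exact priority formula is not reproduced here, a stand-in measure (the sum of squared equivalence class sizes) is assumed; it has the stated properties of priority, being smaller when the attribute subset splits the objects into more, more balanced classes. The stopping test uses positive-region-style consistency: the reduct is complete when objects indiscernible on it share one decision.

```python
from collections import Counter

def priority(objects, attrs):
    """Stand-in priority measure (an assumption, not the paper's
    formula): sum of squared equivalence class sizes, which is
    smaller for more, more balanced classes."""
    sizes = Counter(tuple(obj[a] for a in attrs) for obj in objects)
    return sum(n * n for n in sizes.values())

def consistent(objects, attrs):
    """True when objects indiscernible on attrs share one decision
    (the last tuple field), i.e. the positive region is preserved."""
    decision = {}
    for obj in objects:
        key = tuple(obj[a] for a in attrs)
        if decision.setdefault(key, obj[-1]) != obj[-1]:
            return False
    return True

def reduce_attributes(objects, n_cond):
    """Greedy reduction: repeatedly add the attribute yielding the
    smallest (best) priority until the reduct is consistent."""
    reduct = []
    while not consistent(objects, reduct):
        candidates = [a for a in range(n_cond) if a not in reduct]
        best = min(candidates, key=lambda a: priority(objects, reduct + [a]))
        reduct.append(best)
    return reduct

# Compressed decision table from Table 2 (last column = decision).
table = [(2, 0, 1, 1), (1, 0, 1, 0), (0, 0, 0, 0), (1, 2, 0, 1), (2, 2, 0, 1)]
print(reduce_attributes(table, 3))  # a reduct, e.g. [0, 1]
```

Because only the small sample table is scanned in each priority calculation, the loop stays cheap even when the original data set is large, which is the point of the sampling steps above.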
To determine a reasonable value of the sampling distance d, experiments are conducted on three UCI data sets, described below.
UCI data sets.
Data set       Objects  Condition attributes
Statlog        6435     36
Winequality    4898     11
Contraceptive  1473     9
First, an experiment exploring the relationship between different sampling distances and reduction effectiveness is conducted on the Statlog data set, and a reasonable value of the sampling distance is determined.
The relationship between different sampling distances and reduction effectiveness for Statlog.
Sampling distance  Samples  Reduced attributes  Reduction time (s)
0.6                5473     9                   329.61
0.8                3965     8                   195.13
–                  2648     8                   111.42
1.05               2344     7                   79.82
1.1                2077     7                   65.15
1.2                1628     6                   54.59
From Table 4 it can be seen that the larger the sampling distance is, the fewer samples are selected and the greater the difference between the number of reduced attributes and that of the traditional algorithm becomes. To keep the reduction result as consistent as possible with the traditional algorithm, a proper sampling distance must be chosen; clearly, the sampling distance should not be set too large.
Next, the traditional attribute reduction algorithm based on the positive region and Algorithm 4 of this paper are used separately to measure the time required to obtain a reduction and to compare the efficiency of the two algorithms. The corresponding experimental results on the Winequality data set are listed below. There are 11 condition attributes in this database, and the computing time of this paper's method is the sum of the redundancy removal time, sampling time and reduction time.
As seen from Table 5, the original Winequality data set contains nearly 20% redundant data, a very common phenomenon in real data, and the sampling rate is 45%. The number of objects is therefore greatly decreased after redundancy removal and sampling. Moreover, the reduction computed from the samples basically coincides with that of the original data set, while the computing time accounts for only 29% of that on the original data set. Overall, reduction efficiency is clearly greatly improved.
Comparison on the effect of reduction for the Winequality data set.
Number of objects: original 4898; after redundancy removal 3961; after sampling 1788.
Traditional algorithm: reduction result 5 attributes, reduction time 113.62 s.
This paper: reduction result 4 attributes, computing time 33.25 s.
The experimental results on the Contraceptive data set are listed here. There are nine condition attributes in this database, and the computing time of this paper's method is again the sum of the redundancy removal time, sampling time and reduction time.
The result shown in Table 6 is similar to that in Table 5. Redundant data account for 3% of the original Contraceptive data set and the sampling rate reaches 47%, so the number of objects drops significantly. Consequently, the computing time of Algorithm 4 accounts for only 33% of that on the original data set, which again illustrates the efficiency of the proposed approach. Meanwhile, comparing Table 5 with Table 6 shows that the larger the data volume is, the more obvious the advantage of Algorithm 4 becomes.
Comparison on the effect of reduction for the Contraceptive data set.
Number of objects: original 1473; after redundancy removal 1425; after sampling 665.
Traditional algorithm: reduction result 9 attributes, reduction time 12.62 s.
This paper: reduction result 8 attributes, computing time 4.13 s.
For large data sets, redundant data consume a great deal of running time and memory during attribute reduction. To overcome this issue, this paper puts forward an attribute reduction method based on sample extraction and priority. First, the method removes redundant data using quick sort; then the samples near the classification boundary are extracted and used for the reduction calculation; finally, based on the concept of the positive region, an attribute reduction algorithm based on priority is proposed and applied. Experiments show that this method is well suited to attribute reduction on large data sets. With the rise of smart medical care, a large number of medical data resources have been integrated, and knowledge mining from medical big data has become a research focus. The method proposed in this paper can be used to reduce the dimensionality of massive medical data effectively and improve the efficiency of knowledge mining; this is the next direction for application and research.