Abstract
The effectiveness of machine learning models can often be improved by feature selection as a preprocessing step. This process is typically driven by the data alone, and overfitting can yield models that do not correspond to true relationships present in the data set. In this work, we propose leveraging known relationships between variables to constrain and guide feature selection. Using commonalities across domains, we provide a framework for the user to express model constraints while keeping the feature selection process data-driven and sensitive to actual relationships in the data.
Motivation
When building prediction models to solve real-world problems, feature selection is often not considered directly. Instead, many machine learning algorithms perform feature selection implicitly as part of the learning process. If performance is not satisfactory, explicit feature selection can be performed as a pre-processing step. Alternatively, feature selection can be performed ad hoc by a human, but this is discouraged because of its complexity and because humans may (and often do) make sub-optimal selections. We propose a middle road in which feature selection is data-driven but the search for a better feature set is guided by the user's domain knowledge.
Background
Data-driven feature selection has received significant attention, and several techniques are used in practice. [Hall, 2000] presents CFS (correlation-based feature selection), a filter-based method that uses a merit heuristic (based on Pearson's correlation) and best-first search to incrementally add features. The output of feature selection is then used as input to a machine learning algorithm. Wrapper-based feature selection is a similar approach that uses the desired machine learning algorithm itself, in situ, as the merit heuristic [Kohavi and John, 1997].
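To make the CFS idea concrete, the following is a minimal sketch of Hall's merit heuristic combined with a greedy forward search (a simple stand-in for full best-first search). The function names and the synthetic usage below are illustrative, not part of the cited work:

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Hall's CFS merit: k*r_cf / sqrt(k + k*(k-1)*r_ff), where r_cf is the
    mean feature-class correlation and r_ff the mean feature-feature
    correlation over the candidate subset."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def forward_select(X, y):
    """Greedily add the feature that most improves merit; stop when no
    candidate improves on the current best subset."""
    remaining = set(range(X.shape[1]))
    selected, best = [], -np.inf
    while remaining:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:
            break
        best = merit
        selected.append(j)
        remaining.remove(j)
    return selected, best
```

On data with one informative feature, one redundant near-copy of it, and one noise feature, this search keeps the informative feature and rejects the noise, since adding an uncorrelated feature lowers the subset's mean feature-class correlation. A wrapper-based variant would replace `cfs_merit` with cross-validated accuracy of the target learning algorithm.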
Another approach to prediction in multivariate domains is to use time-series methods that implicitly incorporate time-delayed (i.e., lagged) relationships. Vector autoregressive moving average (ARMA) models and multivariate regression are effective techniques [Martens and Næs, 1992]. ARMA methods can be sensitive to collinear variables, especially when there are many variables and the data set is small, so feature selection is a useful pre-processing step.
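As a rough illustration of the lagged-regression setup, the sketch below builds a design matrix of lagged copies of the input series and fits a linear model by least squares. This is a generic construction under stated assumptions, not the cited authors' implementation; the function names are illustrative:

```python
import numpy as np

def make_lagged(series, lags):
    """Stack lagged copies of each input series into a design matrix.

    series: (n, m) array of m aligned time series; lags: positive integers.
    Returns X of shape (n - max(lags), m * len(lags)); row t holds the
    values of every series at t - lag for each requested lag.
    """
    n, m = series.shape
    p = max(lags)
    cols = [series[p - lag : n - lag, j] for lag in lags for j in range(m)]
    return np.column_stack(cols)

def fit_lagged_regression(series, target, lags):
    """Least-squares fit of target[t] on lagged predictors (plus intercept)."""
    p = max(lags)
    X = make_lagged(series, lags)
    y = target[p:]
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # [intercept, one weight per (lag, series) column]
```

With many series and several lags the columns of `X` grow quickly and are often collinear, which is exactly the situation where a feature selection step over the lagged columns helps.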