Abstract
The effectiveness of machine learning models can often be improved by feature selection as a preprocessing step. This process is typically driven by the data alone, and overfitting can yield models that do not correspond to true relationships present in the data set. In this work, we propose leveraging known relationships between variables to constrain and guide feature selection. Using commonalities across domains, we provide a framework for the user to express model constraints while keeping the feature selection process data-driven and sensitive to actual relationships in the data.
Motivation
When building prediction models to solve real-world problems, feature selection is often not considered directly. Instead, many machine learning algorithms perform feature selection implicitly as part of the learning process. If performance is not satisfactory, explicit feature selection can be performed as a pre-processing step. Alternatively, feature selection can be performed ad hoc by a human, but this is discouraged because of its complexity and because humans may (and often do) make sub-optimal selections. We propose a middle road in which feature selection is data-driven but the search for a better feature set is guided by the user's domain knowledge.
Background
Data-driven feature selection has received significant attention, and several techniques are used in practice. [Hall, 2000] presents CFS (correlation-based feature selection), a filter-based method that uses a merit heuristic (based on Pearson's correlation) and best-first search to incrementally add features. The output of feature selection is then used as input to a machine learning algorithm. Wrapper-based feature selection is a similar approach that uses the desired machine learning algorithm itself, in situ, as the merit heuristic [Kohavi and John, 1997].
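To make the CFS idea concrete, the following is a minimal sketch of Hall's merit heuristic combined with a greedy forward search (a simple stand-in for full best-first search). The function names and the synthetic usage below are illustrative, not part of the cited work:

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Hall's CFS merit: k*r_cf / sqrt(k + k*(k-1)*r_ff), where r_cf is the
    mean feature-class correlation and r_ff the mean feature-feature
    correlation over the candidate subset."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def forward_select(X, y):
    """Greedily add the feature that most improves merit; stop when no
    candidate improves on the current best subset."""
    remaining = set(range(X.shape[1]))
    selected, best = [], -np.inf
    while remaining:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:
            break
        best = merit
        selected.append(j)
        remaining.remove(j)
    return selected, best
```

On data with one informative feature, one redundant near-copy of it, and one noise feature, this search keeps the informative feature and rejects the noise, since adding an uncorrelated feature lowers the subset's mean feature-class correlation. A wrapper-based variant would replace `cfs_merit` with cross-validated accuracy of the target learning algorithm.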
Another approach to prediction in multivariate domains is to use time-series methods that implicitly incorporate time-delayed (i.e., lagged) relationships. Vector autoregressive moving average (ARMA) models and multivariate regression are effective techniques [Martens and Næs, 1992]. ARMA methods can be sensitive to collinear variables, especially when there are many variables and the data set is small, so feature selection is a useful pre-processing step.
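As a rough illustration of the lagged-regression setup, the sketch below builds a design matrix of lagged copies of the input series and fits a linear model by least squares. This is a generic construction under stated assumptions, not the cited authors' implementation; the function names are illustrative:

```python
import numpy as np

def make_lagged(series, lags):
    """Stack lagged copies of each input series into a design matrix.

    series: (n, m) array of m aligned time series; lags: positive integers.
    Returns X of shape (n - max(lags), m * len(lags)); row t holds the
    values of every series at t - lag for each requested lag.
    """
    n, m = series.shape
    p = max(lags)
    cols = [series[p - lag : n - lag, j] for lag in lags for j in range(m)]
    return np.column_stack(cols)

def fit_lagged_regression(series, target, lags):
    """Least-squares fit of target[t] on lagged predictors (plus intercept)."""
    p = max(lags)
    X = make_lagged(series, lags)
    y = target[p:]
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # [intercept, one weight per (lag, series) column]
```

With many series and several lags the columns of `X` grow quickly and are often collinear, which is exactly the situation where a feature selection step over the lagged columns helps.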