Newswise — In data analysis, the outliers are often the most interesting information, yet they can go unrecognized by the most common detection methods because those methods make inaccurate assumptions.
But now Michael Houle, a senior university lecturer at New Jersey Institute of Technology’s Ying Wu College of Computing, and his collaborators in Australia, Denmark and Serbia have become outliers themselves by developing the math to prove that breaking those assumptions can work better than conventional methods.
“Outlier detection, one of the most fundamental tasks in data mining, aims to identify observations that deviate from the general distribution of the data. Such observations often deserve special attention as they may reveal phenomena of extreme importance, such as network intrusions, sensor failures or disease,” they wrote in an award-winning paper about their new proof, “Dimensionality-Aware Outlier Detection,” presented at the recent SIAM International Conference on Data Mining (SDM24) in Houston.
“Dimensionality is the number of features that you use to describe your data. If you have a 100 pixel by 100 pixel image, which is three colors per pixel, that’s 30,000 features,” Houle explained.
Dimensionality-Aware Outlier Detection, which was tested on 800 datasets, uses a mathematical concept called local intrinsic dimensionality (LID).
“It is awareness of local variations in dimensionality that makes our method unique,” Houle noted. “The intrinsic dimensionality can be interpreted as the number of main influences that best describe the distribution of the data in space. It does not depend directly on the number of data features or the dimension of the space itself.”
"Intrinsic dimensionality is a popular concept in machine learning, particularly when data is regarded as lying within a sheet, or 'manifold'. The manifold can have a much smaller dimension than the data space, and knowing this number -- the intrinsic dimensionality -- can be an advantage in data modeling and data processing. Fitting a manifold can be very computationally expensive, though."
"Instead, we assess the effective number of dimensions directly, using a notion of intrinsic dimensionality that is entirely local to the data point being tested. Local intrinsic dimensionality infers the dimensional properties through the distribution of distances from the test point to its nearest neighbors within the dataset."
"Using the LID theory, we were able to derive an outlierness criterion that not only took dimensionality properly into account, but did so in a way that the conventional methods have ignored until now."
Traditionally, data miners have sought anomalies using techniques such as the local outlier factor (LOF), simplified LOF and the k-nearest-neighbor distance. All of these ignore dimensionality.
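Both families of baselines are available off the shelf. The sketch below, which uses scikit-learn rather than the authors’ code and a made-up toy dataset, shows how LOF scores and plain k-nearest-neighbor distance scores are typically computed; note that neither score consults any estimate of local dimensionality.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(300, 5)),           # clustered inliers
               rng.uniform(-6, 6, size=(10, 5))])   # scattered outliers

# LOF: compares each point's local density to that of its neighbors
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 marks predicted outliers
lof_scores = -lof.negative_outlier_factor_   # larger = more anomalous

# k-NN distance: the score is simply the distance to the k-th neighbor
nn = NearestNeighbors(n_neighbors=21).fit(X)  # 21 because column 0 is the point itself
knn_scores = nn.kneighbors(X)[0][:, -1]

# Neither score uses any estimate of local intrinsic dimensionality,
# which is the gap the dimensionality-aware method addresses.
print(np.argsort(lof_scores)[-10:])           # indices of the top-10 LOF outliers
print(np.argsort(knn_scores)[-10:])
```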
"My main role was the theoretical contribution and coordinating with my colleagues on the experimental design and evaluation. The theory had been in place for quite some time before we found a way to show the effects in practice. And the effect turned out to be elusive,” Houle stated.
Houle said he came to the subject indirectly, drawn first to something even more arcane.
“Anomaly detection, from my point of view, is only the tip of the iceberg. It was not my first motivation for getting into this line of research - it was just a nice outcome of the theory. Anomaly detection is a classical problem in data mining that has not been fully settled in the two decades since LOF came out. In this context, we were able to nail down through experimental justification what many researchers have been assuming for a while now: intrinsic dimensionality can vary from one part of a dataset to another. And by taking variation in LID into account, we were able to comprehensively outperform anomaly detection methods that have been the state of the art for the past 25 years."
“Outlier detection turned out to be one of the potential application areas. So the way that I'm looking at it is, I'm working on certain fundamentals … in data mining, deep learning, indexing, databases and a number of areas. My colleagues and I are still exploring where it can possibly fit, and what we're targeting these days more than anything else is to try to reinterpret machine learning and deep learning in the light of what this model, what this characterization of complexity, could reveal.”