Applying Anomaly Detection to Healthcare Data
By Ben Lawrence, MBI, RHIA
Anomaly detection, also known as outlier detection, is a technique used to identify unusual patterns that differ significantly from normal data. Anomaly detection can be used in a variety of fields to detect things like credit card, insurance, or healthcare fraud; campaign donation irregularities; cancerous tumors in an MRI scan; or even new or unusual galaxies in astronomical data.
At DataDx, we use anomaly detection to find abnormal values within clinical and financial data sets. Doctors in everyday clinical practice struggle to analyze clinical data due to the rapid, exponential growth of healthcare data, but artificial intelligence could assist them by shortening processing times and improving the quality of care in a practice (1).
The DataDx Anomaly Detection System uses machine learning algorithms to automatically detect inconsistencies in real time and present the results in meaningful, interactive dashboards. The data sets we work with often run to many millions of rows, where anomalies can slip by undetected and would be very difficult to find through manual review.
Jon Mattes, Director of Analytics at DataDx, shares examples of how data inconsistencies can go unnoticed, and why data cleanliness matters, in his post Can You Trust Your Data?
There are several anomaly detection techniques, such as local outlier factor, one-class support vector machine, robust covariance, k-nearest neighbor, and isolation forests. These algorithms find anomalies in two modes called outlier detection and novelty detection. Outlier detection, also known as unsupervised anomaly detection, fits a model to a training data set that may already contain anomalies and identifies the values that lie far from the seemingly normal majority. Novelty detection, also known as semi-supervised anomaly detection, trains on data assumed to be clean and then flags new observations that differ from it (2).
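In scikit-learn terms, the two modes differ mainly in what the model is fit on. The sketch below is illustrative, not DataDx's code; it uses local outlier factor (one of the techniques listed above) on synthetic charge amounts invented for this example:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Synthetic historical charges clustered around $100, with two extreme
# entries mixed in to play the role of contamination.
train = np.concatenate([rng.normal(100, 10, 200), [950.0, -40.0]]).reshape(-1, 1)

# Outlier detection: fit on the (possibly contaminated) data itself.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(train)  # -1 = outlier, 1 = inlier

# Novelty detection: fit on data judged clean, then score unseen values.
novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
novelty.fit(train[labels == 1])
print(novelty.predict([[102.0], [900.0]]))  # 1 = normal, -1 = novelty
```

With `novelty=True` the estimator gains a `predict` method for new observations; in the default mode it only labels the data it was fit on.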
We implement the isolation forest algorithm for our anomaly detection system in the Python programming language with the open source scikit-learn machine learning library, and we surface the results in our Power BI data quality reports. We chose the isolation forest for its speed, accuracy, low computational cost, and ability to scale to extremely large data sets. Unlike most other detection methods, the algorithm explicitly isolates anomalies rather than profiling normal instances (3), and it has outperformed other techniques in our internal testing. To illustrate some of these terms, here is an example of the scikit-learn implementation of the isolation forest (4):
Imagine the white dots as a DataDx data set that contains a practice’s historical information for things like CPT codes, Relative Value Units (RVUs), patient ages, zip codes, dates of service, transactions, or journal entries. That historical information is run through the initial outlier detection to establish what are considered to be normal values. The green and red dots can be thought of as real-time data (novelty detection) that we receive on a daily basis, where a green dot might indicate a normal charge and a red dot might indicate an abnormally high charge for a procedure code. The following example shows how we display this information in our DataDx quality reports:
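The fit-on-history, score-new-data workflow just described can be sketched with scikit-learn's IsolationForest. The dollar values and single-column shape below are invented for illustration and are not DataDx's actual schema:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic historical charge amounts for one procedure code (the "white dots").
historical = rng.normal(loc=150.0, scale=20.0, size=1000).reshape(-1, 1)

# Fit the isolation forest on the practice's historical data.
forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(historical)

# Score today's incoming charges: 1 = normal ("green"), -1 = anomalous ("red").
todays_charges = np.array([[148.0], [162.0], [5200.0]])
print(forest.predict(todays_charges))
</imports>```

Here the $5,200 charge sits far outside the historical distribution, so it would be flagged, while the two charges near $150 would pass.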
This initial overview page gives a snapshot of all detected anomalies for a given detection date. The isolation forest returns an anomaly score, with a higher score indicating a higher likelihood of an anomaly. The table on the bottom left displays the values with the highest scores first, and the severity updates dynamically based on what is selected from the filters on the left (Table, Feature, Rule Violation, and Anomaly Fixed). The box and whisker plot (top left) and the stacked histogram (top right) show where your anomalous values lie. For instance, the leftmost value in the box and whisker plot shows an amount of negative 7 million. If you look at the table, that value is shown with a value of “Yes” for “Anomaly Fixed.” In this case, that amount was actually an entry error that was fixed in the client’s source system, so it is not an anomaly. The “Attestation Form” button on the bottom left links to a page where clients can report values that are mistakenly picked up as anomalies or report any additional feedback. There is also an option to filter out these fixed anomalies by clicking on the “Anomaly Fixed” filter.
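A highest-score-first ranking like the one in that table can be produced as follows. This is a hedged sketch with invented data, not the DataDx implementation; note that scikit-learn's `score_samples` returns higher values for normal points, so negating it yields a score where higher means more likely an anomaly:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Synthetic amounts, including an extreme entry-error value of -7 million.
amounts = np.concatenate([rng.normal(150, 20, 500), [-7_000_000.0, 9_500.0]])
df = pd.DataFrame({"amount": amounts})

forest = IsolationForest(random_state=1).fit(df[["amount"]])
# Negate score_samples so that higher = more anomalous, as in the report.
df["anomaly_score"] = -forest.score_samples(df[["amount"]])

# Highest-scoring rows first, as in the table described above.
print(df.sort_values("anomaly_score", ascending=False).head(3))
```

The -7 million entry error ranks at the top, which matches how such a value would surface for review before being marked “Anomaly Fixed.”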
Figure 3 demonstrates how we can merge the detected anomalies with the original source data. This example shows fictitious journal entries for credit amounts. The blue points indicate values that are considered to be normal, while the red points indicate anomalous values or rule violations. Rule violations, which are values that are above or below a certain threshold, highlight the importance of using anomaly detection, as many anomalous points that lie within the data may be missed if out-of-threshold values are the only focus. As this figure also demonstrates, our system allows users to compare normal versus anomalous data alongside the dates of service the entries were created and the dates the anomalies were detected.
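Joining detector output back onto source records, and contrasting it with a simple threshold rule, can be sketched in pandas. The column names, entry IDs, and the $10,000 threshold below are assumptions for illustration, not DataDx's schema:

```python
import pandas as pd

# Fictitious journal entries (source data), keyed by entry id.
entries = pd.DataFrame({
    "entry_id": [1, 2, 3, 4],
    "credit_amount": [120.0, 135.0, 98_000.0, 110.0],
    "date_of_service": pd.to_datetime(
        ["2020-06-01", "2020-06-02", "2020-06-02", "2020-06-03"]),
})

# Detector output: which entries were flagged, and when.
detections = pd.DataFrame({
    "entry_id": [2, 3],
    "is_anomaly": [True, True],
    "detected_on": pd.to_datetime(["2020-06-03", "2020-06-03"]),
})

# Merge detections back onto the source rows; unmatched rows are normal.
merged = entries.merge(detections, on="entry_id", how="left")
merged["is_anomaly"] = merged["is_anomaly"].fillna(False)

# A simple rule violation: credits above a fixed threshold (assumed value).
THRESHOLD = 10_000.0
merged["rule_violation"] = merged["credit_amount"] > THRESHOLD

print(merged[["entry_id", "credit_amount", "is_anomaly", "rule_violation"]])
```

Entry 2 illustrates the point above: it is flagged as anomalous even though it sits below the threshold, so a threshold rule alone would have missed it.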
Applying this anomaly detection method to healthcare data allows us to pick up on anomalies in real time and convey the results to doctors and administrators in a user-friendly format. If you’re interested in learning about more of the technical details behind this process, you can view a recording of our presentation at the CSG Pro Portland Power BI User Group, Visualizing Anomaly Detection and Interactive Documentation in Power BI: https://www.youtube.com/watch?v=Z2DBzjwQ1rM&t=1s. For more information or to schedule a demo of the DataDx Anomaly Detection System, please contact us at firstname.lastname@example.org.
1. Krittanawong C. The rise of artificial intelligence and the uncertain future for physicians. Eur J Intern Med. 2018 Feb 1;48:e13–4.
2. Novelty and Outlier Detection — scikit-learn 0.23.1 documentation [Internet]. [cited 2020 Jul 7]. Available from: https://scikit-learn.org/stable/modules/outlier_detection.html
3. Liu FT, Ting KM, Zhou Z-H. Isolation Forest. In: 2008 Eighth IEEE International Conference on Data Mining. 2008. p. 413–22.
4. IsolationForest example — scikit-learn 0.23.1 documentation [Internet]. [cited 2020 Jul 7]. Available from: https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html