Outlier Detection

This repository collects the resources I have found most useful (in my personal opinion) while studying and researching outlier detection.

I have written at length about the philosophy and methods of outlier detection on my blog. The four posts are listed below.

  1. The philosophy of outliers (part 1)
  2. Outlier detection univariate methods (part 2)
  3. Outliers detection in regression (part 3)
  4. Outlier detection multivariate methods (part 4)

Besides the blog posts, my technical notes are in the ‘Notes’ folder. The relevant online sources and useful links are collected in the Excel file. The repository also includes several useful handouts and some valuable papers, which you can find in the ‘Handout’ folder.

Among all the materials, William G. Jacoby’s handout is worthy of special mention.

Many useful R packages for outlier detection are available:

  • outliers is useful for univariate outlier detection. It contains several statistical tests (e.g., ‘grubbs’, ‘dixon’).
  • EnvStats provides Rosner’s test (rosnerTest).
  • car is very useful and offers many functions for outlier detection. Its outlierTest() function reports the most extreme observation under the given model and tests whether it is an outlier. The package also provides a series of plotting functions for visualizing outliers: residualPlots, avPlots, qqPlot, and influenceIndexPlot. Among these, influencePlot deserves special mention. It creates a bubble plot of Studentized residuals against hat values, with the area of each circle proportional to the observation’s Cook’s distance. Vertical reference lines are drawn at two and three times the average hat value, and horizontal reference lines at -2, 0, and 2 on the Studentized-residual scale. [x-axis: hat value (with cutoffs); y-axis: Studentized residual; bubble size: Cook’s D.]
  • mvoutlier includes a variety of functions for multivariate outlier detection.
  • DMwR has a useful function, lofactor(), which computes local outlier factors using the LOF algorithm.
  • robustbase provides more advanced calculations of multivariate outlier criteria. For example, the function covMcd computes robust location and scatter estimates via the Minimum Covariance Determinant (MCD).
  • performance offers one of the most comprehensive multivariate outlier detection functions, check_outliers. Through its method options, this function covers most multivariate outlier detection criteria. You can find more details in the check_outliers documentation.
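
As a rough illustration of what influencePlot displays, the sketch below computes the same three quantities (hat values, Studentized residuals, and Cook’s distance) with base R only, so no extra packages are needed. The model formula, the built-in mtcars data set, and all variable names are illustrative assumptions and are not taken from the example scripts in this repository.

```r
# Illustrative model on the built-in mtcars data (an assumption for this sketch).
fit <- lm(mpg ~ wt + hp, data = mtcars)

hat   <- hatvalues(fit)        # leverage: x-axis of the bubble plot
rstud <- rstudent(fit)         # Studentized residuals: y-axis
cook  <- cooks.distance(fit)   # Cook's distance: bubble area

# The average hat value equals p / n, where p counts the coefficients
# (including the intercept) and n is the number of observations.
p <- length(coef(fit))
n <- nobs(fit)
avg_hat <- p / n
hat_cutoffs <- c(2, 3) * avg_hat   # vertical reference lines

# Flag observations beyond the first vertical line or the +/-2 horizontal lines.
flagged <- which(hat > hat_cutoffs[1] | abs(rstud) > 2)
influence_table <- data.frame(hat = hat, rstudent = rstud, cook = cook)[flagged, ]
print(round(influence_table, 3))
```

With the car package installed, calling influencePlot(fit) should draw the corresponding bubble plot. The thresholds used here simply mirror the reference lines described above and are a screening aid, not a formal test.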

In addition, I have included nine R example scripts. They are good practice material for understanding the outlier detection process and some of the available methods. You can download them and run them on your own computer.

Ou Zhang
Research Scientist & Data Scientist

I’m a research scientist, psychometrician, and data scientist, who loves psychometrics, applied statistics, general data science, and programming.