Machine Learning for Input Data Cleaning

Sam Clark

This is a project to apply machine learning algorithms to impute missing data and identify outliers that are likely to be incorrect.

Missing Value Replacement:

This portion of the tool allows users to replace missing values using a number of machine learning algorithms. It also has features that help the user through the very iterative process of creating/selecting the right features, selecting the right machine learning algorithm, and, if necessary, selecting good parameters for the machine learning algorithm.

This includes:

  • Scripts to help with feature creation.
  • Various forms of logging preliminary results in order to help the user discover necessary changes in configurations.
  • Useful estimates of accuracy, running time, etc., to help the user evaluate configurations.
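
As a rough illustration of the kind of evaluation loop described above, the sketch below hides some known values, imputes them, and reports an error estimate and the running time. It is not the tool's own code: the choice of a k-nearest-neighbour imputer, the function names (`knn_impute`, `evaluate`), the field names, and the toy data are all assumptions made for the example.

```python
import math
import time

def knn_impute(rows, target, features, k=3):
    """Fill rows[i][target] where it is None with the mean of the k nearest complete rows."""
    known = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        point = [r[f] for f in features]
        # Rank complete rows by Euclidean distance in feature space.
        nearest = sorted(known, key=lambda c: math.dist(point, [c[f] for f in features]))[:k]
        estimate = sum(c[target] for c in nearest) / len(nearest)
        filled.append({**r, target: estimate})
    return filled

def evaluate(rows, target, features, k=3, holdout_every=4):
    """Hide every holdout_every-th known value, impute it, and return (mean absolute error, seconds)."""
    complete = [r for r in rows if r[target] is not None]
    masked = [{**r, target: None} if i % holdout_every == 0 else dict(r)
              for i, r in enumerate(complete)]
    start = time.perf_counter()
    filled = knn_impute(masked, target, features, k)
    elapsed = time.perf_counter() - start
    errors = [abs(f[target] - r[target])
              for i, (f, r) in enumerate(zip(filled, complete)) if i % holdout_every == 0]
    return sum(errors) / len(errors), elapsed

# Hypothetical toy data: sqft is exactly 100 * lot, so errors are easy to interpret.
rows = [{"lot": i, "sqft": 100 * i} for i in range(1, 13)]
mae, seconds = evaluate(rows, target="sqft", features=["lot"], k=2)
print(f"mean absolute error: {mae:.1f}, seconds: {seconds:.4f}")
```

Re-running `evaluate` with different features, values of `k`, or a different imputer gives comparable numbers for each configuration, which is the point of the iterative process.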

Thorough documentation is also being written, which will be key to making this process manageable for people who don't have experience with machine learning. It will include numerous sample configurations and explanations of the pros and cons of various machine learning algorithms.

Outlier Detection:

This portion of the tool helps users discover the types of outliers that exist in their data and provides them with ways to either systematically remove them or (hopefully) fix them. Outlier discovery works by picking a number of interesting attributes, then computing the Local Outlier Factor (LOF) value for each instance in the data using these attributes. The n instances with the highest LOF values can then be inspected.
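
To make the LOF step concrete, here is a minimal pure-Python sketch of the standard Local Outlier Factor computation followed by a top-n ranking. It is an illustration only, not the tool's implementation: the function names and the toy points are assumptions, it uses exactly k neighbours per point (ties at the k-th distance are not expanded), and it is quadratic in the number of instances.

```python
import math

def local_outlier_factors(points, k=3):
    """Return the LOF score of each point; values well above 1 suggest an outlier."""
    n = len(points)
    # For each point, its k nearest other points as (distance, index) pairs.
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_distance = [nb[-1][0] for nb in neighbours]

    def local_reachability_density(i):
        # Reachability distance from i to j is max(k-distance(j), d(i, j)).
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        mean_reach = sum(reach) / len(reach)
        # Guard against division by zero when a point has k exact duplicates.
        return 1.0 / max(mean_reach, 1e-12)

    density = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbour density divided by the point's own density.
    return [sum(density[j] for _, j in neighbours[i]) / (len(neighbours[i]) * density[i])
            for i in range(n)]

def top_outliers(points, n, k=3):
    """Return the n (index, LOF) pairs with the highest LOF values."""
    scores = local_outlier_factors(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n]]

# Hypothetical toy data: a tight cluster plus one distant point (index 5).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
for index, score in top_outliers(points, n=2):
    print(index, round(score, 2))
```

In practice the attributes should be scaled to comparable ranges first, since Euclidean distance otherwise lets large-valued attributes such as square footage dominate.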

A number of things can be done with these n instances:

  • A threshold value can be picked for LOF and all instances with greater values can be removed.
    • Users don't seem to like this approach since it lacks transparency and may generate false negatives.
  • A data set can be created where instances with LOF values above a certain value are considered outliers. The decision tree algorithm C4.5 can then be run on the data set. After inspecting the tree the user can find rules for defining outliers.
    • This only works well with a small number of attributes; otherwise the tree is incomprehensible.
  • Outliers can be identified individually from the list.
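
The first two options can be sketched as follows: label instances by an LOF threshold, then look for a readable rule that separates the labelled outliers. This is not C4.5 itself. It is a single-split stand-in that uses plain information gain on numeric attributes (C4.5 uses gain ratio, builds a full tree, and prunes it). The function names, attribute names, threshold, and data are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * math.log2(p)
    return result

def label_by_threshold(lof_scores, threshold):
    """Mark an instance as an outlier when its LOF exceeds the threshold."""
    return [score > threshold for score in lof_scores]

def best_split(rows, labels, attributes):
    """Return (attribute, cut, gain) for the rule 'row[attribute] <= cut' with the highest information gain."""
    base = entropy(labels)
    best = (None, None, 0.0)
    for attr in attributes:
        values = sorted(set(r[attr] for r in rows))
        # Candidate cuts are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [l for r, l in zip(rows, labels) if r[attr] <= cut]
            right = [l for r, l in zip(rows, labels) if r[attr] > cut]
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if gain > best[2]:
                best = (attr, cut, gain)
    return best

# Hypothetical instances and LOF values.
rows = [
    {"stories": 1, "sqft": 1500},
    {"stories": 2, "sqft": 2500},
    {"stories": 3, "sqft": 1800},
    {"stories": 99, "sqft": 2000},
    {"stories": 99, "sqft": 2200},
]
lof_scores = [1.0, 1.1, 0.9, 8.0, 7.5]
labels = label_by_threshold(lof_scores, threshold=2.0)
attribute, cut, gain = best_split(rows, labels, ["stories", "sqft"])
print(f"outlier rule candidate: {attribute} > {cut} (gain {gain:.3f})")
```

A rule found this way can be shown to the user and confirmed before anything is removed, which addresses the transparency concern with a bare threshold.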

Outliers can be split in a few ways:

  • Entry errors vs. Systematic errors
    • Entry errors are largely self-explanatory, e.g. a 32-bedroom, 3,000 sq ft home.
    • The most common systematic errors I have run into are those in which the building table disagrees with the parcel table. These can be due to the tables being out of sync.
      • Building footprint is greater than parcel size.
      • Building use description is very different from land use description.
        • This was common in the presentation I gave in the Spring. It applied to 90% of the outliers found.

  • Errors that rely on spatial data:
    • Parcels that have a much smaller price per square foot than neighbors
      • Generally tax exempt
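
The spatial case can be sketched as a comparison of each parcel's price per square foot against that of its nearest neighbours. The code below is an assumption-laden illustration: the field names (`id`, `x`, `y`, `value`, `sqft`), the choice of k, the use of the median, and the 25% ratio are all hypothetical, not values taken from the project.

```python
import math
from statistics import median

def undervalued_parcels(parcels, k=3, ratio=0.25):
    """Return ids of parcels whose price per sq ft is below ratio * the median of their k nearest neighbours."""
    flagged = []
    for p in parcels:
        others = [q for q in parcels if q is not p]
        # Nearest neighbours by straight-line distance between parcel coordinates.
        nearest = sorted(others, key=lambda q: math.dist((p["x"], p["y"]), (q["x"], q["y"])))[:k]
        neighbour_price = median(q["value"] / q["sqft"] for q in nearest)
        if p["value"] / p["sqft"] < ratio * neighbour_price:
            flagged.append(p["id"])
    return flagged

# Hypothetical parcels: four at $100 per sq ft and one at $5 per sq ft.
parcels = [
    {"id": 1, "x": 0, "y": 0, "value": 200000, "sqft": 2000},
    {"id": 2, "x": 0, "y": 1, "value": 300000, "sqft": 3000},
    {"id": 3, "x": 1, "y": 0, "value": 150000, "sqft": 1500},
    {"id": 4, "x": 1, "y": 1, "value": 250000, "sqft": 2500},
    {"id": 5, "x": 0.5, "y": 0.5, "value": 10000, "sqft": 2000},
]
print(undervalued_parcels(parcels))
```

Since parcels flagged this way are often tax exempt rather than wrong, the output is better treated as a list for inspection than as a list for removal.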

I still need to try some of my strategies to fix outliers.

Some Preliminary Results from Puget Sound

Outliers discovered in raw 2005 Puget Sound data:

Entry errors:

  • 1 warehouse of 16 million square feet
  • 14 buildings with 99 stories
  • 29 buildings with Stories > 8 and Bldg SQFT < 5000

Systematic errors:

  • 2000 buildings with footprint > parcel size
  • Over 1% of data where land use description is very different from building use description
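
Once discovered, checks like the ones listed above are simple to express as explicit rules over the building and parcel tables. The sketch below shows two of them; the table layout and field names (`parcel_id`, `stories`, `sqft`, `footprint_sqft`, `size_sqft`) are hypothetical and would need to be mapped to the real schema.

```python
def flag_buildings(buildings, parcels):
    """Return {building id: [reasons]} for buildings that break one of the rules."""
    flags = {}
    for b in buildings:
        reasons = []
        # Entry-error rule: many stories but very little floor space.
        if b["stories"] > 8 and b["sqft"] < 5000:
            reasons.append("stories > 8 and sqft < 5000")
        # Systematic rule: the building footprint cannot exceed its parcel's size.
        parcel = parcels.get(b["parcel_id"])
        if parcel is not None and b["footprint_sqft"] > parcel["size_sqft"]:
            reasons.append("footprint > parcel size")
        if reasons:
            flags[b["id"]] = reasons
    return flags

# Hypothetical rows from a building table and a parcel table keyed by parcel id.
parcels_by_id = {10: {"size_sqft": 6000}, 11: {"size_sqft": 4000}}
buildings = [
    {"id": "a", "parcel_id": 10, "stories": 2, "sqft": 3000, "footprint_sqft": 1500},
    {"id": "b", "parcel_id": 11, "stories": 99, "sqft": 2400, "footprint_sqft": 1200},
    {"id": "c", "parcel_id": 11, "stories": 1, "sqft": 9000, "footprint_sqft": 9000},
]
print(flag_buildings(buildings, parcels_by_id))
```

Recording the reason alongside each flagged building keeps the process transparent, unlike a bare LOF threshold.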

Spatial errors:

  • Undervalued parcels. I haven't been able to ground-truth these at all, so I can't speculate.
Topic revision: r2 - 18 Nov 2009 - 19:15:30 - PaulWaddell
 