Consistency Checker

The Consistency Checker in UrbanSim3

In the previous version of UrbanSim, we had a tool called the consistency checker that applied a set of stored tests to the data and failed if any one of them failed. The idea was useful because it let users examine the state of their database before running a model system on it. But the tool was monolithic and rigid: it often imposed restrictions that users did not necessarily want, and it had a high threshold for success, since a user had to have the entire database 'clean' before it would pass.

We still have the needs that motivated the consistency checker, but have not yet implemented anything like it in UrbanSim4. The only related checks are unit tests for variables, which ensure that the variable code does what it should on trivially small test data. We need something that tests the actual data to be used in the model system.

A Proposed Approach for UrbanSim4

An alternative to the monolithic consistency checker is to develop a declarative specification of requirements at the variable and model level. This would identify not only dependencies, but also valid ranges for variables and an appropriately helpful error message to print when such ranges are not met in a given data set.
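
As a minimal sketch of what such a declarative specification could look like, the following Python defines a requirement for a single variable: its dependencies, its valid range, and the message to print when the range is violated. The class name VariableSpec, its fields, and the range used for building_sqft are all hypothetical and are not part of UrbanSim.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class VariableSpec:
    """Hypothetical declarative requirement for one variable."""
    name: str
    dataset: str
    min_value: Optional[float] = None   # None means "no lower bound"
    max_value: Optional[float] = None   # None means "no upper bound"
    depends_on: Tuple[str, ...] = ()    # names of variables this one needs
    message: str = ("{name} value {value} is not in valid range "
                    "({lo} to {hi}) on DataSet {dataset}")

    def in_range(self, value: float) -> bool:
        """Return True if value satisfies the declared bounds."""
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True

    def check(self, values: Sequence[float]) -> List[str]:
        """Return one formatted message per out-of-range value."""
        return [
            self.message.format(name=self.name, value=v, lo=self.min_value,
                                hi=self.max_value, dataset=self.dataset)
            for v in values if not self.in_range(v)
        ]


# Assumed range for illustration only; real bounds would come from the user.
spec = VariableSpec(name="building_sqft", dataset="Buildings",
                    min_value=0, max_value=1_000_000)
errors = spec.check([1200, -50, 2_500_000])
for line in errors:
    print(line)
```

Because the specification is plain data, the same object could in principle be attached to a variable, a model specification, or a database column without changing the checking code.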

The use cases would be mainly:

  1. Testing a table or set of tables in a database (MySQL, SQL Server, or some other system) against standards defined for a target data model.
  2. Testing a single variable, or a batch list of variables, against user data (not test data).
  3. Testing a single model, or an entire model system, against user data.
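
The first use case, testing a table against a target data model, might be sketched as follows. This example uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for MySQL or SQL Server. The TARGET_MODEL dictionary, the table layout, and the ranges are assumptions made for illustration.

```python
import sqlite3

# Hypothetical target data model: table -> column -> (min, max).
TARGET_MODEL = {
    "buildings": {
        "building_sqft": (0, 1_000_000),
        "year_built": (1800, 2100),
    },
}


def check_table(conn, table, column_ranges):
    """Count rows whose value is NULL or outside the declared range.

    Returns {column: (number_of_bad_rows, total_rows)}. Table and column
    names come from the trusted specification, not from user input, so
    they are interpolated into the SQL; the bounds are bound parameters.
    """
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    results = {}
    for column, (lo, hi) in column_ranges.items():
        query = (f'SELECT COUNT(*) FROM "{table}" '
                 f'WHERE "{column}" IS NULL OR "{column}" < ? OR "{column}" > ?')
        bad = conn.execute(query, (lo, hi)).fetchone()[0]
        results[column] = (bad, total)
    return results


# Small in-memory example with two deliberately invalid values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buildings "
             "(building_id INTEGER, building_sqft REAL, year_built INTEGER)")
conn.executemany("INSERT INTO buildings VALUES (?, ?, ?)",
                 [(1, 1200, 1950), (2, -50, 1990), (3, 3000, 1700)])

report = check_table(conn, "buildings", TARGET_MODEL["buildings"])
for column, (bad, total) in report.items():
    print(f"{column}: {bad} of {total} rows out of range")
```

Counting violations inside the database avoids loading a full table into memory, which is one possible design choice for large tables.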

We would need some kind of specification standard that applies both to variables and to model specifications, since model specifications can use not only variables but also expressions and primary attributes, which also need to have valid ranges defined.

Helpful error messages and warnings might be written to a consistency_report file as well as to the screen, and look something like:

  Real Estate Price Model
     Submodel 1
        Error: building_sqft value %X% is not in valid range (%Y% to %Z%) on DataSet Buildings; this occurred %n% times out of %N%
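
A report in the layout above could be produced along the following lines. This is only a sketch: the function names are hypothetical, %X% is interpreted here as the first offending value found, and an in-memory buffer stands in for the consistency_report file.

```python
import io
import sys


def format_report(model, submodel, attribute, dataset, values, lo, hi):
    """Return report lines for one attribute, or [] if all values are valid."""
    bad = [v for v in values if not (lo <= v <= hi)]
    if not bad:
        return []
    return [
        "  " + model,
        "     " + submodel,
        "        Error: {attr} value {x} is not in valid range ({lo} to {hi}) "
        "on DataSet {ds}; this occurred {n} times out of {total}".format(
            attr=attribute, x=bad[0], lo=lo, hi=hi, ds=dataset,
            n=len(bad), total=len(values)),
    ]


def write_report(lines, *streams):
    """Write the same report text to every stream (screen, file, ...)."""
    text = "\n".join(lines) + "\n"
    for stream in streams:
        stream.write(text)


lines = format_report("Real Estate Price Model", "Submodel 1",
                      "building_sqft", "Buildings",
                      [500, -10, 800, -3], 0, 100000)

# In real use the second stream would be open("consistency_report", "w").
report_file = io.StringIO()
write_report(lines, sys.stdout, report_file)
```

Summarizing repeated violations as a count keeps the report short even when many rows fail the same test.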

Some Extensions

It would be helpful to extend the expression language to incorporate these kinds of tests as well. We would have to pass optional arguments that define the valid ranges for each component used in the expression.
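
One way to picture this extension is a checker that finds the components of an expression and tests each one against an optional range argument. The sketch below is not the UrbanSim expression language: it assumes expressions that parse as Python, it only recognizes bare names (not dotted names), and check_expression, valid_ranges, and land_area are hypothetical.

```python
import ast


def component_names(expression):
    """Return the sorted bare names used in a Python-syntax expression."""
    tree = ast.parse(expression, mode="eval")
    return sorted({node.id for node in ast.walk(tree)
                   if isinstance(node, ast.Name)})


def check_expression(expression, data, valid_ranges=None):
    """Check each component of the expression that has a declared range.

    data maps component name -> list of values.
    valid_ranges maps component name -> (min, max); it is optional, and
    components without an entry are not tested.
    Returns {name: (number_of_bad_values, total_values)} for failures only.
    """
    valid_ranges = valid_ranges or {}
    problems = {}
    for name in component_names(expression):
        if name not in valid_ranges:
            continue
        lo, hi = valid_ranges[name]
        bad = [v for v in data[name] if not (lo <= v <= hi)]
        if bad:
            problems[name] = (len(bad), len(data[name]))
    return problems


data = {"building_sqft": [1000, -5, 2500], "land_area": [500, 0, 800]}
problems = check_expression(
    "building_sqft / land_area",
    data,
    valid_ranges={"building_sqft": (0, 1_000_000), "land_area": (1, 10_000_000)},
)
print(problems)
```

Checking the components before evaluating the expression means a problem such as a zero land_area is reported as a data error rather than surfacing later as a division failure.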

It would also be helpful to be able to include primary attributes via the expression mechanism, for example bldgsqft = building.building_sqft, which would extend the consistency checking to primary attributes.
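
To illustrate how an alias of this form could connect a primary attribute to a declared range, the sketch below splits the alias expression into its parts and looks up the range by dataset and attribute. The helper names and the PRIMARY_RANGES table are hypothetical, and the parsing is deliberately simplistic.

```python
# Hypothetical valid ranges for primary attributes, keyed by (dataset, attribute).
PRIMARY_RANGES = {
    ("building", "building_sqft"): (0, 1_000_000),
}


def parse_alias(expression):
    """Split 'alias = dataset.attribute' into (alias, dataset, attribute)."""
    alias, equals, target = expression.partition("=")
    dataset, dot, attribute = target.strip().partition(".")
    alias, dataset, attribute = alias.strip(), dataset.strip(), attribute.strip()
    if not (equals and dot and alias and dataset and attribute):
        raise ValueError("expected 'alias = dataset.attribute', got %r" % expression)
    return alias, dataset, attribute


def range_for_alias(expression, ranges=PRIMARY_RANGES):
    """Return (alias, (min, max)), or (alias, None) if no range is declared."""
    alias, dataset, attribute = parse_alias(expression)
    return alias, ranges.get((dataset, attribute))


alias, bounds = range_for_alias("bldgsqft = building.building_sqft")
print(alias, bounds)
```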

In the context of the database environment, it would be useful to couple a testing/diagnostic tool with one or more data 'repair' tools, such as data imputation methods.
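
As one illustration of coupling a diagnostic with a repair step, the sketch below replaces missing or out-of-range values with the mean of the valid ones. Mean imputation is chosen here only because it is simple; it is an assumption, not a recommended method, and the function name is hypothetical.

```python
from statistics import mean


def impute_out_of_range(values, lo, hi):
    """Replace None and out-of-range values with the mean of valid values.

    Returns (repaired_values, number_of_values_replaced).
    """
    def is_valid(v):
        return v is not None and lo <= v <= hi

    valid = [v for v in values if is_valid(v)]
    if not valid:
        raise ValueError("no valid values available to impute from")
    fill = mean(valid)
    repaired = [v if is_valid(v) else fill for v in values]
    return repaired, len(values) - len(valid)


repaired, n_replaced = impute_out_of_range([1000, -5, None, 3000], 0, 1_000_000)
print(repaired, n_replaced)
```

Returning the number of replaced values lets the repair step be reported alongside the diagnostic results, so users can see how much of their data was altered.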
