Next: Working with Models Up: Tutorial for the urbansim Previous: Tutorial for the urbansim   Index


Working with Data Sets

A dataset in Opus is an $n \times m$ table, where $n$ is the number of entries and $m$ is the number of characteristics, also called attributes. One attribute must serve as the unique identifier: its values must be unique and numeric, and each must be greater than 0.

Suppose you have a set of household agents with two characteristics, income and number of persons per household, which are uniquely identified by household IDs. The file data/tutorial/households.tab in the urbansim package contains an example dataset for 10 households:

household_id   income         persons
 1               1000           2
 2               2000           3
 3               5000           3
 4               3000           2
 5                500           1
 6              10000           4
 7               8000           4
 8               1000           1
 9               3000           2
10              15000           5
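
To make the table view concrete, the following is a minimal plain-Python sketch (it does not use Opus) that parses the table above into one list per attribute and checks the identifier rule. The helper names read_table and is_valid_id are hypothetical and exist only for this illustration.

```python
# Sketch only: plain-Python parsing of the example table, not Opus code.
TABLE = """\
household_id   income         persons
 1               1000           2
 2               2000           3
 3               5000           3
 4               3000           2
 5                500           1
 6              10000           4
 7               8000           4
 8               1000           1
 9               3000           2
10              15000           5
"""


def read_table(text):
    """Return {attribute_name: list_of_ints}, one list per column."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return {name: [int(row[j]) for row in rows] for j, name in enumerate(header)}


def is_valid_id(values):
    """True if the values are unique and all greater than 0."""
    return len(set(values)) == len(values) and all(v > 0 for v in values)


columns = read_table(TABLE)
n, m = len(columns["household_id"]), len(columns)
print(n, m)                                  # 10 3
print(is_valid_id(columns["household_id"]))  # True
print(is_valid_id(columns["persons"]))       # False: values repeat
```

Only household_id qualifies as an identifier here, because income and persons both contain repeated values.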

In urbansim, datasets are independent of the physical storage of the data. A data storage is represented by a Python object. We create a storage object for the ASCII file households.tab:

>>> import os
>>> import urbansim
>>> us_path = urbansim.__path__[0]
>>> from opus_core.storage_factory import StorageFactory
>>> storage = StorageFactory().get_storage('tab_storage',
        storage_location = os.path.join(us_path, 'data/tutorial'))

The storage here specifies that the data are stored as an ASCII file in a table format. Alternatively, if the table is a MySQL table called "households" stored in a database called "mydatabase", you could use:

>>> import os
>>> from opus_core.store.opus_database import OpusDatabase
>>> from opus_core.storage_factory import StorageFactory
>>> connection = OpusDatabase(hostname = os.environ["MYSQLHOSTNAME"],
                              username = os.environ["MYSQLUSERNAME"],
                              password = os.environ["MYSQLPASSWORD"],
                              database_name = "mydatabase")
>>> storage = StorageFactory().get_storage('mysql_storage',
        storage_location = connection)

(To do: add an example MySQL file, like household.tab, to play around with.)

If the MySQL database is a scenario database (as described in Section 9.1), replace opus_database and OpusDatabase in the above code with scenario_database and ScenarioDatabase.

Opus supports many storage formats, including formats you define yourself. In addition to "tab" and "mysql", the formats that come with Opus are "flt", where the data are stored in a binary format; "xml" for XML data; "csv" for comma-delimited ASCII data; and "dict", where the data are passed directly to the dataset class (see Section 7.3).
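
The idea behind interchangeable storage formats can be illustrated without Opus. In the sketch below, two hypothetical classes (DictStorageSketch and CsvStorageSketch, neither of which is part of opus_core) expose the same load_table() method, so the code that consumes the data does not need to know how the data are stored. The method name load_table is an assumption made for this example and is not taken from the Opus storage interface.

```python
import csv
import io


class DictStorageSketch:
    """Hypothetical: serves tables that are already held in memory."""

    def __init__(self, tables):
        self._tables = tables

    def load_table(self, table_name):
        table = self._tables[table_name]
        return {name: list(values) for name, values in table.items()}


class CsvStorageSketch:
    """Hypothetical: parses comma-delimited text, one string per table."""

    def __init__(self, texts):
        self._texts = texts

    def load_table(self, table_name):
        reader = csv.DictReader(io.StringIO(self._texts[table_name]))
        columns = {name: [] for name in reader.fieldnames}
        for row in reader:
            for name, value in row.items():
                columns[name].append(int(value))
        return columns


def total_income(storage):
    """Consumer code that is independent of the storage format."""
    return sum(storage.load_table("households")["income"])


from_dict = DictStorageSketch(
    {"households": {"household_id": [1, 2], "income": [1000, 2000]}})
from_csv = CsvStorageSketch(
    {"households": "household_id,income\n1,1000\n2,2000\n"})
print(total_income(from_dict), total_income(from_csv))  # 3000 3000
```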

Now we can create a household dataset with the urbansim class HouseholdDataset, using the created storage object:

>>> from urbansim.datasets.household_dataset import HouseholdDataset
>>> households = HouseholdDataset(in_storage = storage,
                  in_table_name = 'households', id_name='household_id')

The HouseholdDataset class is a child class of Dataset from the module opus_core.datasets.dataset, and thus it can use any of Dataset's methods (see Sections 8.2 and 7.2).

Dataset supports lazy loading, so no attributes are loaded for households at this point:

>>> households.get_attribute_names()
[]
But the dataset "knows" which attributes are available on the given storage:
>>> households.get_primary_attribute_names()
['household_id', 'income', 'persons']
The data are loaded as they are needed. For example, loading the unique identifier of the dataset gives:
>>> households.get_id_attribute()
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
>>> households.size()
10

Other attributes can be loaded via the get_attribute() method, which returns a numpy array:

>>> households.get_attribute("income")
array([  1000.,   2000.,   5000.,   3000.,    500.,  10000.,   8000.,
         1000.,   3000.,  15000.])
>>> households.get_attribute_names()
['household_id', 'income']
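
The behaviour shown above can be mimicked with a few lines of plain Python. The class LazyTableSketch below is a hypothetical stand-in written for this illustration, not the Opus Dataset implementation: it knows which columns exist in its source, but copies a column only the first time it is requested.

```python
class LazyTableSketch:
    """Hypothetical stand-in for a dataset that loads columns on demand."""

    def __init__(self, source):
        self._source = source  # plays the role of the storage
        self._loaded = {}      # columns fetched so far
        self.reads = 0         # how many times the source was read

    def primary_attribute_names(self):
        """Names available in the source, whether loaded or not."""
        return list(self._source)

    def loaded_attribute_names(self):
        """Names of the columns that have actually been fetched."""
        return list(self._loaded)

    def get_attribute(self, name):
        """Fetch a column from the source on first use, then reuse it."""
        if name not in self._loaded:
            self.reads += 1
            self._loaded[name] = list(self._source[name])
        return self._loaded[name]


source = {
    "household_id": [1, 2, 3],
    "income": [1000, 2000, 5000],
    "persons": [2, 3, 3],
}
table = LazyTableSketch(source)
print(table.loaded_attribute_names())   # []
print(table.primary_attribute_names())  # ['household_id', 'income', 'persons']
table.get_attribute("income")
table.get_attribute("income")           # second call reuses the loaded copy
print(table.loaded_attribute_names())   # ['income']
print(table.reads)                      # 1
```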

Each attribute of a Dataset is stored as a numpy array.

In the above example, each of the attributes is loaded separately. Alternatively, we can load multiple attributes at once, which can be useful when loading data from a slow storage, such as a SQL database:

>>> households.load_dataset()
>>> households.get_attribute_names()
['household_id', 'persons', 'income']
The load_dataset() method accepts an optional argument attributes that names the attributes to be loaded, e.g. attributes=['income', 'persons'].

We can also plot a histogram of the income attribute (this method requires the matplotlib library):

>>> households.plot_histogram("income", bins = 10)
[Figure: histogram of income]
or (if the rpy library is installed)
>>> households.r_histogram("income")
[Figure: histogram of income produced with rpy]

We can investigate a correlation between attributes by plotting a scatter plot (rpy library required):

>>> households.r_scatter("persons", "income")
[Figure: scatter plot of persons against income produced with rpy]
Correlation coefficient:  0.919147133827

The correlation coefficient between two attributes and the correlation matrix of several attributes, respectively, can be obtained by:

>>> households.correlation_coefficient("persons", "income")
0.91914713382720947
>>> households.correlation_matrix(["persons", "income"])
array([[ 1.        ,  0.91914713],
       [ 0.91914713,  1.        ]], dtype=float32)
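
As a cross-check of the reported value, the Pearson correlation coefficient can be computed directly from the ten households in households.tab. The sketch below uses only the Python standard library. The function pearson is written for this illustration and is not part of Opus, and whether Opus computes the coefficient in exactly this way is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


persons = [2, 3, 3, 2, 1, 4, 4, 1, 2, 5]
income = [1000, 2000, 5000, 3000, 500, 10000, 8000, 1000, 3000, 15000]
r = pearson(persons, income)
print(round(r, 4))  # 0.9191
```

The result agrees with the value above to the precision that single-precision (float32) arithmetic allows.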

A summary of the data in a dataset can be obtained by:

>>> households.summary()
Attribute name        mean           sd           sum        min     max
-------------------------------------------------------------------------
        income      4850.0      4749.56         48500        500   15000
       persons         2.7         1.34            27          1       5


Size: 10  records
identifiers:
        household_id  in range  1 - 10
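
The figures in this summary can be reproduced with the Python standard library. Note that the sd column matches the sample standard deviation (divisor $n - 1$), which is what statistics.stdev computes. The helper summarize below is hypothetical and only mirrors the columns of the table.

```python
import statistics


def summarize(values):
    """Return the same statistics as the summary table for one attribute."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample sd, divisor n - 1
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }


income = [1000, 2000, 5000, 3000, 500, 10000, 8000, 1000, 3000, 15000]
persons = [2, 3, 3, 2, 1, 4, 4, 1, 2, 5]

income_summary = summarize(income)
persons_summary = summarize(persons)
print(round(income_summary["sd"], 2))   # 4749.56
print(round(persons_summary["sd"], 2))  # 1.34
print(income_summary["sum"], persons_summary["sum"])  # 48500 27
```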

To add an attribute to the set of households, for example the location of each household, we do

>>> households.add_primary_attribute(data=[4,6,9,2,4,8,2,1,3,2], name="location")
>>> households.get_attribute_names()
['household_id', 'persons', 'location', 'income']
If the attribute "location" already exists in the dataset, the values are overwritten.

To change specific values in a dataset, one can use

>>> households.modify_attribute(name="location", data=[0,0], index=[0,1])
>>> households.get_attribute("location")
array([0, 0, 9, 2, 4, 8, 2, 1, 3, 2])
Here the argument index gives the array positions (counted from 0), not the household IDs, of the values to be modified.

To determine the location of the household with household_id $= 5$, do

>>> households.get_data_element_by_id(5).location
4
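
The last two examples address the data in different ways: modify_attribute() takes array positions, while get_data_element_by_id() takes an identifier. The plain-Python sketch below reproduces both steps with ordinary lists to show the distinction. The dictionary position_of is a hypothetical helper and says nothing about how Opus implements the lookup internally.

```python
household_ids = list(range(1, 11))
location = [4, 6, 9, 2, 4, 8, 2, 1, 3, 2]

# Modify by position: positions 0 and 1 hold household_id 1 and 2.
for position, value in zip([0, 1], [0, 0]):
    location[position] = value

# Look up by identifier: translate the id to a position first.
position_of = {hid: pos for pos, hid in enumerate(household_ids)}

print(location)                  # [0, 0, 9, 2, 4, 8, 2, 1, 3, 2]
print(position_of[5])            # 4
print(location[position_of[5]])  # 4
```

For household 5 the position and the location value happen to coincide (both are 4). In general, an identifier and its array position differ by at least one, because identifiers must be greater than 0.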

To store data in one of the supported formats, you can use the storage object created at the beginning of this section, or create a new one that uses a different directory or database:

>>> households.write_dataset(out_storage=storage,
                             out_table_name="households_output")
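
For readers who want to see what writing a table to a tab-delimited file involves, here is a generic standard-library sketch. It is not the Opus tab storage: the function write_tab_table is hypothetical, and the exact layout that Opus writes (column order, padding, file name) may differ.

```python
import csv
import io


def write_tab_table(columns, stream):
    """Write {name: values} as tab-delimited text with a header row."""
    names = list(columns)
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(names)
    # zip(*...) turns the per-attribute lists into per-household rows.
    writer.writerows(zip(*(columns[name] for name in names)))


buffer = io.StringIO()  # stands in for a file opened for writing
write_tab_table({"household_id": [1, 2], "location": [0, 0]}, buffer)
print(buffer.getvalue())
```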

urbansim contains many pre-defined dataset classes, such as GridcellDataset, JobDataset, ZoneDataset, FazDataset, NeighborhoodDataset, RaceDataset, RateDataset. Some datasets are described in Section 8.2, Table 8.1.

Each dataset should have a unique dataset name, which identifies it in variable computation (see Section 7.4.1). Each dataset class defined in urbansim initializes an object of class Dataset by passing default values, such as the dataset name, the name of the unique identifier, and the table name and storage of the input data.

>>> households.get_dataset_name()
'household'


info (at) urbansim.org