

Working with Data Sets

A dataset in Opus is treated as an $n \times m$ table, where $n$ is the number of entries and $m$ is the number of characteristics, also called attributes. One of the attributes must serve as a unique identifier: its values must be unique and numeric, and each must be larger than 0.

Suppose you have a set of household agents, each uniquely identified by a household ID, with two characteristics: income and number of persons per household. The file data/tutorial/households.tab in the urbansim package contains an example dataset for 10 households:

household_id   income         persons
 1               1000           2
 2               2000           3
 3               5000           3
 4               3000           2
 5                500           1
 6              10000           4
 7               8000           4
 8               1000           1
 9               3000           2
10              15000           5
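
As an illustration of the table view described above, here is a minimal sketch in plain Python, independent of Opus. The names TABLE_TEXT, read_table and check_id_column are hypothetical helpers introduced only for this example. The sketch parses the same ten rows and checks that the identifier column holds unique values larger than 0.

```python
# Hypothetical stand-in for data/tutorial/households.tab (same values as above).
TABLE_TEXT = """\
household_id   income         persons
 1               1000           2
 2               2000           3
 3               5000           3
 4               3000           2
 5                500           1
 6              10000           4
 7               8000           4
 8               1000           1
 9               3000           2
10              15000           5
"""


def read_table(text):
    """Parse a whitespace-delimited table into a dict of column lists."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    columns = {name: [] for name in header}
    for row in rows:
        for name, value in zip(header, row):
            columns[name].append(int(value))
    return columns


def check_id_column(columns, id_name):
    """Return True if the ID column holds unique values larger than 0."""
    ids = columns[id_name]
    return len(set(ids)) == len(ids) and all(i > 0 for i in ids)


table = read_table(TABLE_TEXT)
n = len(table["household_id"])  # number of entries
m = len(table)                  # number of attributes
print(n, m, check_id_column(table, "household_id"))
```

Here n is 10 and m is 3, and household_id qualifies as the identifier attribute.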

In Opus, datasets are independent of the physical storage of the data. A data storage is represented by a Python object. We create a storage object for the ASCII file households.tab:

>>> import os
>>> import urbansim
>>> us_path = urbansim.__path__[0]
>>> from opus_core.storage_factory import StorageFactory
>>> storage = StorageFactory().get_storage('tab_storage',
        storage_location = os.path.join(us_path, 'data/tutorial'))

The storage here specifies that the data are stored as an ASCII file in a table format. Opus can support many types of storage formats, including formats you define. See Section 22.2 for more details.

Now we create a household dataset with the opus_core class Dataset (see Section 22.1), using the created storage object:

>>> from opus_core.datasets.dataset import Dataset
>>> households = Dataset(in_storage = storage,
                         in_table_name = 'households', 
                         id_name='household_id',
                         dataset_name='household')

Dataset supports lazy loading. Thus, there are no entries loaded for households at this moment:

>>> households.get_attribute_names()
[]

But the dataset "knows" about the attributes available on the given storage:

>>> households.get_primary_attribute_names()
['household_id', 'income', 'persons']

The data are loaded as they are needed. For example, loading the unique identifier of the dataset gives:

>>> households.get_id_attribute()
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
>>> households.size()
10

Other attributes can be loaded via the get_attribute() method, which returns a numpy array:

>>> households.get_attribute("income")
array([  1000.,   2000.,   5000.,   3000.,    500.,  10000.,   8000.,
         1000.,   3000.,  15000.])
>>> households.get_attribute_names()
['household_id', 'income']

Each attribute of a Dataset is stored as a numpy array.

In the above example, each of the attributes is loaded separately. Alternatively, we can load multiple attributes at once, which can be useful when loading data from a slow storage, such as a SQL database:

>>> households.load_dataset()
>>> households.get_attribute_names()
['household_id', 'persons', 'income']

The load_dataset() method accepts an optional argument attributes that specifies the names of the attributes to be loaded, e.g. attributes=['income', 'persons'].
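
The lazy-loading behaviour described above can be pictured with a small toy class. This is only a sketch of the general idea, not the opus_core implementation. LazyTable and its method names are hypothetical, and a plain dictionary stands in for the storage object.

```python
class LazyTable:
    """Toy table that reads columns from a backing store only on demand."""

    def __init__(self, store):
        self._store = store   # stands in for a storage object
        self._loaded = {}     # attributes already read into memory

    def primary_attribute_names(self):
        # Names known to the storage, whether loaded or not.
        return list(self._store)

    def loaded_attribute_names(self):
        return list(self._loaded)

    def get(self, name):
        # Read the column on first access, then serve it from memory.
        if name not in self._loaded:
            self._loaded[name] = list(self._store[name])
        return self._loaded[name]

    def load(self, attributes=None):
        # Load several columns in one call; the default is all of them.
        for name in (attributes or self.primary_attribute_names()):
            self.get(name)


store = {
    "household_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "income": [1000, 2000, 5000, 3000, 500, 10000, 8000, 1000, 3000, 15000],
    "persons": [2, 3, 3, 2, 1, 4, 4, 1, 2, 5],
}
table = LazyTable(store)
print(table.loaded_attribute_names())          # nothing loaded yet
table.get("income")
print(table.loaded_attribute_names())          # only income
table.load()
print(sorted(table.loaded_attribute_names()))  # everything
```

Loading all columns in one call matters mostly when each access to the store is expensive, as with a database query per attribute.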

We can also plot a histogram of the income attribute (this method requires the matplotlib library):

>>> households.plot_histogram("income", bins = 10)

[Figure: histogram of household income, drawn with matplotlib]

or, if the rpy library is installed:

>>> households.r_histogram("income")

[Figure: histogram of household income, drawn with R]
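
To see what a 10-bin histogram of the income values contains without any plotting library, the counts can be computed by hand. The sketch below assumes equal-width bins spanning the minimum to the maximum value, with the maximum counted in the last bin. The helper histogram_counts is hypothetical, and the binning used by the Opus plotting methods may differ in detail.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    if width == 0:
        # All values are identical: put them in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value falls in the last bin rather than past it.
        k = min(int((v - lo) / width), bins - 1)
        counts[k] += 1
    return counts


income = [1000, 2000, 5000, 3000, 500, 10000, 8000, 1000, 3000, 15000]
counts = histogram_counts(income, bins=10)
print(counts)
```

With these data each bin is 1450 wide, and six of the ten households fall into the two lowest bins.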

We can investigate a correlation between attributes by plotting a scatter plot (rpy library required):

>>> households.r_scatter("persons", "income")
Correlation coefficient:  0.919147133827

[Figure: scatter plot of persons versus income]

The correlation coefficient between two attributes and the correlation matrix of several attributes, respectively, can be obtained by:

>>> households.correlation_coefficient("persons", "income")
0.91914713382720947
>>> households.correlation_matrix(["persons", "income"])
array([[ 1.        ,  0.91914713],
       [ 0.91914713,  1.        ]], dtype=float32)
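
The reported value can be reproduced from the table with the standard Pearson formula. The following sketch uses only the Python standard library, and the function name pearson is a hypothetical helper, not part of Opus.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


persons = [2, 3, 3, 2, 1, 4, 4, 1, 2, 5]
income = [1000, 2000, 5000, 3000, 500, 10000, 8000, 1000, 3000, 15000]

r = pearson(persons, income)
# The correlation matrix of the two attributes is symmetric with 1 on the diagonal.
matrix = [[1.0, r], [r, 1.0]]
print(round(r, 6))
```

The result agrees with the coefficient shown above to the printed precision.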

A summary of the data in a dataset can be obtained with:

>>> households.summary()
Attribute name        mean           sd           sum        min     max
-------------------------------------------------------------------------
       persons         2.7         1.34            27          1       5
        income      4850.0      4749.56         48500        500   15000
       


Size: 10  records
identifiers:
        household_id  in range  1 - 10
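
The figures in the summary can be checked independently. In the sketch below, summarize is a hypothetical helper built on the standard statistics module. The sd column above appears consistent with the sample standard deviation (dividing by n - 1), which is what statistics.stdev computes.

```python
import statistics


def summarize(values):
    """Return mean, sample standard deviation, sum, min and max."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # divides by n - 1
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }


persons = [2, 3, 3, 2, 1, 4, 4, 1, 2, 5]
income = [1000, 2000, 5000, 3000, 500, 10000, 8000, 1000, 3000, 15000]

for name, values in (("persons", persons), ("income", income)):
    s = summarize(values)
    print(f"{name:>10} {s['mean']:>10.1f} {s['sd']:>10.2f} "
          f"{s['sum']:>10} {s['min']:>6} {s['max']:>6}")
```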

To add an attribute to the set of households, for example each household's location, use the add_primary_attribute() method:

>>> households.add_primary_attribute(data=[4,6,9,2,4,8,2,1,3,2], name="location")
>>> households.get_attribute_names()
['household_id', 'persons', 'location', 'income']

If the attribute "location" already exists in the dataset, the values are overwritten.

To change specific values in a dataset, one can use

>>> households.modify_attribute(name="location", data=[0,0], index=[0,1])
>>> households.get_attribute("location")
array([0, 0, 9, 2, 4, 8, 2, 1, 3, 2])

Here the argument index gives the zero-based positions of the entries to be modified, not their household IDs: positions 0 and 1 are the first two households.

To determine the location of the household with household_id $= 5$, use

>>> households.get_data_element_by_id(5).location
4
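
The two access styles used above, by position and by identifier, can be contrasted in a short plain-Python sketch. It does not use Opus; the dictionary position_of is a hypothetical helper that maps each identifier to its row position.

```python
household_id = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
location = [4, 6, 9, 2, 4, 8, 2, 1, 3, 2]

# Positional update: positions 0 and 1 are the first two rows.
for position, value in zip([0, 1], [0, 0]):
    location[position] = value

# ID lookup: map each identifier to its row position first.
position_of = {hid: pos for pos, hid in enumerate(household_id)}
print(location[position_of[5]])  # household 5 sits at position 4
```

Keeping the two notions apart matters as soon as the identifiers are not simply 1 to n in order.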

In order to store data in one of the supported formats, you can use the storage object created at the beginning of this section, or create a new one using a different type of storage:

>>> households.write_dataset(out_storage=storage,
                             out_table_name="households_output")
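
As a rough picture of what writing a table to a text storage involves, the sketch below writes columns as tab-delimited text with the standard csv module. The function write_table is hypothetical, and the exact file layout produced by the Opus storage classes is not specified here and may differ.

```python
import csv
import io


def write_table(columns, out):
    """Write a dict of equally long columns as tab-delimited text."""
    names = list(columns)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(names)
    # zip(*...) turns the column lists into rows.
    writer.writerows(zip(*(columns[name] for name in names)))


columns = {
    "household_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "income": [1000, 2000, 5000, 3000, 500, 10000, 8000, 1000, 3000, 15000],
    "persons": [2, 3, 3, 2, 1, 4, 4, 1, 2, 5],
}

buffer = io.StringIO()  # an in-memory file; a real file object works the same way
write_table(columns, buffer)
text = buffer.getvalue()
print(text)
```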

Each dataset should have a unique dataset name, which is used to identify it in variable computation (see Section 22.3.1).

>>> households.get_dataset_name()
'household'

The urbansim package contains many pre-defined dataset classes, such as HouseholdDataset, GridcellDataset, JobDataset, ZoneDataset, FazDataset, NeighborhoodDataset, RaceDataset and RateDataset. Some of these datasets are described in Section 23.2, Table 23.1. They are all subclasses of Dataset with pre-defined values for some arguments, such as id_name, in_table_name or dataset_name.
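
The pattern of a subclass that only supplies default argument values can be sketched as follows. Table and HouseholdTable are hypothetical toy classes introduced for illustration; they do not reproduce the actual Dataset or HouseholdDataset code, and the default values shown are the ones used earlier in this section.

```python
class Table:
    """Toy base class that records how a table is named and identified."""

    def __init__(self, in_table_name, id_name, dataset_name):
        self.in_table_name = in_table_name
        self.id_name = id_name
        self.dataset_name = dataset_name


class HouseholdTable(Table):
    """Same behaviour as Table, with household defaults filled in."""

    def __init__(self, in_table_name="households",
                 id_name="household_id", dataset_name="household"):
        super().__init__(in_table_name, id_name, dataset_name)


default = HouseholdTable()
custom = HouseholdTable(in_table_name="households_output")
print(default.dataset_name, default.id_name, custom.in_table_name)
```

Callers can still override any default, as the second instance shows.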


info (at) urbansim.org