A dataset in Opus is treated as a table whose rows are the entries and whose columns are the characteristics, also called attributes. One of the characteristics must hold unique numeric values greater than 0.
Suppose you have a set of household agents with two characteristics, income and number of persons per household, which are uniquely identified by household IDs. The file data/tutorial/households.tab in the urbansim package contains an example dataset for 10 households:
household_id  income  persons
1             1000    2
2             2000    3
3             5000    3
4             3000    2
5             500     1
6             10000   4
7             8000    4
8             1000    1
9             3000    2
10            15000   5
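The table layout and the identifier constraint can be illustrated without Opus. The following is a minimal sketch in plain Python, assuming the file is whitespace-delimited; the table is embedded as a string, and the helper is_valid_id_column is a hypothetical name that is not part of Opus.

```python
# Contents of the example table, embedded so the sketch is self-contained.
TABLE = """
household_id income persons
1 1000 2
2 2000 3
3 5000 3
4 3000 2
5 500 1
6 10000 4
7 8000 4
8 1000 1
9 3000 2
10 15000 5
"""

# Split into a header row and data rows (assumes whitespace delimiters).
rows = [line.split() for line in TABLE.strip().splitlines()]
header, records = rows[0], rows[1:]

# One list of integers per attribute (column).
columns = {
    name: [int(record[i]) for record in records]
    for i, name in enumerate(header)
}

def is_valid_id_column(values):
    """Hypothetical check: values must be unique and greater than 0."""
    return len(set(values)) == len(values) and all(v > 0 for v in values)

n_entries = len(records)      # number of rows
n_attributes = len(header)    # number of columns

print(n_entries, n_attributes)
print(is_valid_id_column(columns["household_id"]))
print(is_valid_id_column(columns["persons"]))
```

Only household_id passes the check here; persons contains repeated values and so could not serve as the identifier.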
In Opus, datasets are independent of the physical storage of the data. A data storage is represented by a Python object. We create a storage object for the ASCII file households.tab:
>>> import os
>>> import urbansim
>>> us_path = urbansim.__path__[0]
>>> from opus_core.storage_factory import StorageFactory
>>> storage = StorageFactory().get_storage('tab_storage',
        storage_location = os.path.join(us_path, 'data/tutorial'))
The storage here specifies that the data
are stored as an ASCII file in a table format.
Opus can support many types of storage formats, including formats you define. See Section 22.2
for more details.
Now we create a household dataset with the opus_core class Dataset (see Section 22.1), using the created storage object:
>>> from opus_core.datasets.dataset import Dataset
>>> households = Dataset(in_storage = storage,
        in_table_name = 'households',
        id_name='household_id',
        dataset_name='household')
Dataset supports lazy loading. Thus, there are no entries
loaded for households at this moment:
>>> households.get_attribute_names()
[]
But the dataset 'knows' about the attributes living on the given storage:
>>> households.get_primary_attribute_names()
['household_id', 'income', 'persons']
The data are loaded as they are needed. For example, loading the unique identifier of the dataset gives:
>>> households.get_id_attribute()
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> households.size()
10
Other attributes can be loaded via the get_attribute() method which returns a numpy array:
>>> households.get_attribute("income")
array([ 1000., 2000., 5000., 3000., 500., 10000., 8000.,
1000., 3000., 15000.])
>>> households.get_attribute_names()
['household_id', 'income']
Each attribute of a Dataset is stored as a numpy array.
In the above example, each of the attributes is loaded separately. Alternatively, we can load multiple attributes at once, which can be useful when loading data from a slow storage, such as a SQL database:
>>> households.load_dataset()
>>> households.get_attribute_names()
['household_id', 'persons', 'income']
An optional argument attributes can be passed to the load_dataset() method that specifies the names of the attributes to be loaded, e.g. attributes=['income', 'persons'].
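The difference between loading attributes on demand and loading several at once can be sketched with a small stand-alone class. This is only an illustration of the lazy-loading idea under simplifying assumptions, not the Opus implementation: LazyTable and its methods are hypothetical names, and a dictionary stands in for the storage.

```python
class LazyTable:
    """Hypothetical stand-in for a dataset that loads columns on demand."""

    def __init__(self, storage):
        self._storage = storage   # dict: attribute name -> list of values
        self._loaded = {}         # attributes copied into memory so far

    def primary_attribute_names(self):
        # Names known from the storage, whether loaded or not.
        return list(self._storage)

    def loaded_attribute_names(self):
        return list(self._loaded)

    def get_attribute(self, name):
        # Load the column the first time it is requested.
        if name not in self._loaded:
            self._loaded[name] = list(self._storage[name])
        return self._loaded[name]

    def load(self, attributes=None):
        # Load the listed attributes, or all of them if none are listed.
        names = self._storage if attributes is None else attributes
        for name in names:
            self.get_attribute(name)


storage = {
    "household_id": [1, 2, 3],
    "income": [1000, 2000, 5000],
    "persons": [2, 3, 3],
}
table = LazyTable(storage)
print(table.loaded_attribute_names())   # nothing loaded yet
table.get_attribute("income")           # loads one attribute
print(table.loaded_attribute_names())
table.load()                            # loads everything that remains
print(sorted(table.loaded_attribute_names()))
```

With a slow storage, a single bulk load avoids one round trip per attribute, which is the motivation given above for load_dataset().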
We can also plot a histogram of the income attribute (this method requires the matplotlib library):
>>> households.plot_histogram("income", bins = 10)
>>> households.r_histogram("income")
We can investigate a correlation between attributes by plotting a scatter plot (rpy library required):
>>> households.r_scatter("persons", "income")
Correlation coefficient: 0.919147133827
The correlation coefficient between two attributes and the correlation matrix of several attributes, respectively, can be obtained by:
>>> households.correlation_coefficient("persons", "income")
0.91914713382720947
>>> households.correlation_matrix(["persons", "income"])
array([[ 1. , 0.91914713],
[ 0.91914713, 1. ]], type=float32)
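The reported coefficient can be checked independently of Opus. The sketch below assumes that the value is the ordinary Pearson correlation coefficient and recomputes it with numpy from the ten example households; the function name pearson is introduced here for illustration only.

```python
import numpy as np

# Values of the two attributes, in household_id order 1..10.
persons = np.array([2, 3, 3, 2, 1, 4, 4, 1, 2, 5], dtype=float)
income = np.array([1000, 2000, 5000, 3000, 500,
                   10000, 8000, 1000, 3000, 15000], dtype=float)

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r = pearson(persons, income)

# numpy's built-in routine returns the full 2 x 2 correlation matrix.
matrix = np.corrcoef(persons, income)

print(round(r, 6))
print(matrix)
```

The off-diagonal entries of the matrix equal the pairwise coefficient, and the diagonal is 1 because every attribute is perfectly correlated with itself.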
A summary of the data in a dataset can be obtained by:
>>> households.summary()
Attribute name        mean        sd       sum     min       max
-------------------------------------------------------------------------
persons                2.7      1.34        27       1         5
income              4850.0   4749.56     48500     500     15000
Size: 10 records
identifiers:
household_id in range 1 - 10
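The summary statistics can be reproduced with numpy. In this sketch the function summarize is a hypothetical helper, not an Opus call. The sd values shown in the summary match the sample standard deviation (divisor n - 1, i.e. ddof=1 in numpy) rather than the population standard deviation.

```python
import numpy as np

columns = {
    "persons": np.array([2, 3, 3, 2, 1, 4, 4, 1, 2, 5]),
    "income": np.array([1000, 2000, 5000, 3000, 500,
                        10000, 8000, 1000, 3000, 15000]),
}

def summarize(values):
    """Hypothetical helper returning the five statistics of the summary."""
    return {
        "mean": float(values.mean()),
        "sd": float(values.std(ddof=1)),   # sample standard deviation
        "sum": int(values.sum()),
        "min": int(values.min()),
        "max": int(values.max()),
    }

stats = {name: summarize(values) for name, values in columns.items()}

for name, s in stats.items():
    print(f"{name:10s} {s['mean']:8.1f} {s['sd']:9.2f} "
          f"{s['sum']:7d} {s['min']:6d} {s['max']:6d}")
```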
To add an attribute to the set of households, for example each household's location, we do
>>> households.add_primary_attribute(data=[4,6,9,2,4,8,2,1,3,2], name="location")
>>> households.get_attribute_names()
['household_id', 'persons', 'location', 'income']
If the attribute "location" already exists in the dataset, the values are overwritten.
To change specific values in a dataset, one can use
>>> households.modify_attribute(name="location", data=[0,0], index=[0,1])
>>> households.get_attribute("location")
array([0, 0, 9, 2, 4, 8, 2, 1, 3, 2])
Here the argument index gives the positions within the attribute array of the values that are modified.
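Since each attribute is stored as a numpy array, the effect of the index argument corresponds to ordinary numpy index assignment. The following sketch shows that analogy on a plain array. It does not call Opus, and it assumes that the index values are zero-based array positions rather than household IDs, which is consistent with the output shown above.

```python
import numpy as np

# The location attribute as it was added to the dataset.
location = np.array([4, 6, 9, 2, 4, 8, 2, 1, 3, 2])

# Positions to overwrite and the replacement values, in matching order.
index = np.array([0, 1])
data = [0, 0]

# Index assignment changes only the selected positions, in place.
location[index] = data

print(location)
```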
To determine the location of the household with household_id 5, do
>>> households.get_data_element_by_id(5).location
4
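A lookup by identifier differs from a lookup by array position: the identifier is first translated into a position, which is then used to read the attribute. The sketch below shows one way to do this with numpy. The function value_by_id is a hypothetical helper, and nothing here is claimed about how Opus performs the lookup internally.

```python
import numpy as np

household_id = np.arange(1, 11)                          # IDs 1..10
location = np.array([0, 0, 9, 2, 4, 8, 2, 1, 3, 2])      # after the modification

def value_by_id(ids, values, wanted_id):
    """Hypothetical helper: return the value stored for the given ID."""
    positions = np.flatnonzero(ids == wanted_id)
    if positions.size != 1:
        raise KeyError(f"expected exactly one entry with id {wanted_id}")
    return values[positions[0]]

found = int(value_by_id(household_id, location, 5))
print(found)
```

Because the identifiers must be unique, exactly one position can match; the helper raises KeyError when the identifier is absent.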
In order to store data in one of the supported formats, you can use the storage object created at the beginning of this section, or create a new one using a different type of storage:
>>> households.write_dataset(out_storage=storage,
        out_table_name="households_output")
Each dataset should have a unique dataset name that is used as an identification in variable computation (see Section 22.3.1).
>>> households.get_dataset_name()
'household'
The urbansim package contains many pre-defined dataset classes, such as HouseholdDataset, GridcellDataset,
JobDataset, ZoneDataset, FazDataset, NeighborhoodDataset, RaceDataset,
RateDataset. Some datasets are described in Section 23.2,
Table 23.1. They are all children of Dataset with pre-defined values
for some arguments, such as id_name, in_table_name or dataset_name.
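The idea of a child class that pre-defines some constructor arguments can be sketched as follows. Both classes are hypothetical and greatly simplified. They are not the Opus classes, and the default values are simply those used in the household example of this section, which may differ from the defaults in the urbansim package.

```python
class TableDataset:
    """Hypothetical base class standing in for a generic dataset."""

    def __init__(self, in_table_name, id_name, dataset_name):
        self.in_table_name = in_table_name
        self.id_name = id_name
        self.dataset_name = dataset_name


class HouseholdTable(TableDataset):
    """Hypothetical child class with pre-defined argument values."""

    in_table_name_default = "households"
    id_name_default = "household_id"
    dataset_name_default = "household"

    def __init__(self, in_table_name=None, id_name=None, dataset_name=None):
        # Fall back to the class defaults for any argument not supplied.
        super().__init__(
            in_table_name or self.in_table_name_default,
            id_name or self.id_name_default,
            dataset_name or self.dataset_name_default,
        )


# The generic class needs every argument spelled out ...
generic = TableDataset("households", "household_id", "household")
# ... while the child class can be created without any.
preset = HouseholdTable()
# Individual defaults can still be overridden.
custom = HouseholdTable(in_table_name="households_output")

print(preset.in_table_name, preset.id_name, preset.dataset_name)
print(custom.in_table_name)
```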