6.1. UCTB.dataset package¶

6.1.1. UCTB.dataset.data_loader module¶

class UCTB.dataset.data_loader.NodeTrafficLoader(dataset, city=None, data_range='all', train_data_length='all', test_ratio=0.1, closeness_len=6, period_len=7, trend_len=4, target_length=1, normalize=True, workday_parser=<function is_work_day_america>, with_tpe=False, data_dir=None, MergeIndex=1, MergeWay='sum', remove=True, **kwargs)¶

Bases: object

The data loader that extracts and processes data from a DataSet object.

Parameters:

dataset (str) – The path of the dataset pickle file, or the name of the dataset.
city (str or None) – The name of the city, or None if dataset is a file path. Default: None
data_range – The range of data extracted from self.dataset for further use. If set to 'all', all data in self.dataset is used. If set to a float between 0.0 and 1.0, the first data_range proportion of the data in self.dataset is used. If set to a list of two integers [start, end], the data from day start to day (end - 1) is used. Default: 'all'
train_data_length – The length of the train data. If set to 'all', all data in the split train set is used. If set to an int, the latest train_data_length days of data are used as the train set. Default: 'all'
test_ratio (float) – The proportion of the data assigned to the test set when the data is split into a train set and a test set. Default: 0.1
closeness_len (int) – The length of the closeness history. The closeness_len consecutive time slots immediately preceding the target are used as closeness history. Default: 6
period_len (int) – The length of the period history. The data of the same time slot in the preceding period_len consecutive days is used as period history. Default: 7
trend_len (int) – The length of the trend history. The data of the same time slot in the preceding trend_len consecutive weeks (every seven days) is used as trend history. Default: 4
target_length (int) – The number of steps to predict from one piece of history data. Currently must be 1. Default: 1
normalize (bool|str|object) – Select which normalizer to normalize input data. Default: True
workday_parser – Used to build external features to be used in neural methods. Default: is_work_day_america
with_tpe (bool) – If True, data loader will build time position embeddings. Default: False
data_dir (str or None) – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should be None too. Default: None
MergeIndex (int) – The granularity of the dataset will be MergeIndex * original granularity. Default: 1
MergeWay (str) – How to aggregate data when changing the granularity. Can be 'sum', 'average', or 'max'. Default: 'sum'
remove (bool) – If True, the data loader removes stations whose average traffic is less than 1. Otherwise, the data loader uses all stations. Default: True
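
The way data_range, test_ratio and train_data_length interact can be illustrated with a small NumPy sketch. This is not the library's implementation: select_and_split is a hypothetical helper, daily_slots is passed in explicitly, and the sketch assumes that the test set is taken from the end of the selected data and that the test size is rounded to the nearest slot.

```python
import numpy as np


def select_and_split(traffic, daily_slots, data_range="all",
                     train_data_length="all", test_ratio=0.1):
    """Hypothetical helper that mirrors the documented parameter semantics."""
    total = len(traffic)
    if isinstance(data_range, str) and data_range.lower() == "all":
        start, end = 0, total
    elif isinstance(data_range, float):
        # a float keeps the first `data_range` proportion of the data
        start, end = 0, int(data_range * total)
    else:
        # [start, end] is given in days, so convert days to time slots
        start, end = data_range[0] * daily_slots, data_range[1] * daily_slots
    selected = traffic[start:end]

    # Assumption: the last `test_ratio` share of the selection is the test set.
    n_test = int(round(len(selected) * test_ratio))
    train, test = selected[:len(selected) - n_test], selected[len(selected) - n_test:]

    if train_data_length != "all":
        # keep only the latest `train_data_length` days of the train set
        train = train[-int(train_data_length * daily_slots):]
    return train, test


# 10 days of hourly data for 3 stations; each row holds 3 consecutive integers
traffic = np.arange(10 * 24 * 3).reshape(240, 3)
train, test = select_and_split(traffic, daily_slots=24, data_range=[2, 10],
                               train_data_length=3, test_ratio=0.25)
print(train.shape, test.shape)
```

With these inputs, days 2 to 9 are selected (192 slots), the last quarter becomes the test set, and only the latest three days of the remaining train data are kept.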

dataset¶: DataSet – The DataSet object storing basic data.

daily_slots¶: int – The number of time slots in one single day.

station_number¶: int – The number of nodes.

external_dim¶: int – The number of dimensions of external features.

train_closeness¶: np.ndarray – The closeness history of train set data. When with_tpe is False, its shape is [train_time_slot_num, station_number, closeness_len, 1]. On the dimension of closeness_len, data are arranged from earlier time slots to later time slots. If closeness_len is set to 0, train_closeness will be an empty ndarray. train_period, train_trend, test_closeness, test_period, test_trend have similar shape and construction.
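
To make the shapes and the ordering along the history dimension concrete, the following sketch builds closeness, period and trend histories from a [time_slot_num, station_number] array. build_history is a hypothetical helper and not part of UCTB; it assumes that only target slots with a complete history are kept, which may differ from how the loader aligns its samples.

```python
import numpy as np


def build_history(traffic, daily_slots, closeness_len, period_len, trend_len):
    """Hypothetical helper: slice history windows for every valid target slot."""
    week = 7 * daily_slots
    # first target slot for which every required history slot exists
    first = max(closeness_len, period_len * daily_slots, trend_len * week)
    targets = np.arange(first, len(traffic))

    def gather(offsets):
        # offsets are ordered from the earliest slot to the latest slot
        if len(offsets) == 0:
            return np.empty((len(targets), traffic.shape[1], 0, 1))
        stacked = np.stack([traffic[targets - o] for o in offsets], axis=-1)
        return stacked[..., np.newaxis]  # [T, N, len, 1]

    closeness = gather(range(closeness_len, 0, -1))
    period = gather([k * daily_slots for k in range(period_len, 0, -1)])
    trend = gather([k * week for k in range(trend_len, 0, -1)])
    y = traffic[targets][..., np.newaxis]  # [T, N, 1]
    return closeness, period, trend, y


# 30 days with 4 slots per day and 2 stations; station 0 holds the slot index
traffic = np.arange(120)[:, np.newaxis] + np.array([0, 1000])
closeness, period, trend, y = build_history(
    traffic, daily_slots=4, closeness_len=3, period_len=2, trend_len=1)
print(closeness.shape, period.shape, trend.shape, y.shape)
print(closeness[0, 0, :, 0], period[0, 0, :, 0], trend[0, 0, :, 0])
```

For the first target (slot 28), the closeness history holds slots 25 to 27, the period history holds the same slot one and two days earlier, and the trend history holds the same slot one week earlier.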

train_y¶: np.ndarray – The train set data. Its shape is [train_time_slot_num, station_number, 1]. test_y has similar shape and construction.

make_concat(node='all', is_train=True)¶

A function to concatenate all closeness, period and trend history data to use as inputs of models.

Parameters:

node (int or `'all'`) – The index of the node to select. If set to `'all'`, the concatenation result of all nodes is returned. If set to an integer, it is the index of the selected node. Default: `'all'`
is_train (bool) – If set to `True`, `train_closeness`, `train_period`, and `train_trend` will be concatenated. If set to `False`, `test_closeness`, `test_period`, and `test_trend` will be concatenated. Default: True
Returns:	An ndarray with shape [time_slot_num, `station_number`, `closeness_len` + `period_len` + `trend_len`, 1], where time_slot_num is the temporal length of the train set data if `is_train` is `True`, or the temporal length of the test set data if `is_train` is `False`. Along the history dimension (axis 2), data are arranged as `earlier closeness -> later closeness -> earlier period -> later period -> earlier trend -> later trend`.
Return type:	np.ndarray
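
The concatenation order described above can be sketched with plain NumPy. concat_history is a hypothetical stand-alone function, not the method itself, and it assumes that selecting a single node keeps a station axis of length 1; the source does not state the output shape for that case.

```python
import numpy as np


def concat_history(closeness, period, trend, node="all"):
    """Hypothetical sketch: join the three histories along axis 2."""
    parts = [closeness, period, trend]
    if node != "all":
        # Assumption: a single node keeps the station axis with length 1.
        parts = [p[:, [node]] for p in parts]
    # order: closeness -> period -> trend, each already earlier -> later
    return np.concatenate(parts, axis=2)


time_slot_num, station_number = 5, 3
closeness = np.full((time_slot_num, station_number, 6, 1), 1.0)
period = np.full((time_slot_num, station_number, 7, 1), 2.0)
trend = np.full((time_slot_num, station_number, 4, 1), 3.0)

all_nodes = concat_history(closeness, period, trend)
one_node = concat_history(closeness, period, trend, node=1)
print(all_nodes.shape, one_node.shape)
```

The marker values 1.0, 2.0 and 3.0 make it easy to check that positions 0 to 5 of axis 2 come from the closeness history, positions 6 to 12 from the period history, and positions 13 to 16 from the trend history.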

6.1.2. UCTB.dataset.dataset module¶

class UCTB.dataset.dataset.DataSet(dataset, MergeIndex, MergeWay, city=None, data_dir=None)¶

Bases: object

An object storing basic data from a formatted pickle file. See also Build your own datasets.

Parameters:

dataset (str) – A string containing the path of the dataset pickle file or the name of the dataset.
city (str or None) – None if dataset is a file path, or the name of the city. Default: None
data_dir – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should be None too. Default: None

data¶: dict – The data directly from the pickle file. data may have a data['contribute_data'] dict to store supplementary data.

time_range¶: list – From data['TimeRange'] in the format of [YYYY-MM-DD, YYYY-MM-DD] indicating the time range of the data.

time_fitness¶: int – From data['TimeFitness'], indicating the length of a single time slot in minutes.

node_traffic¶: np.ndarray – Data recording the main stream data of the nodes during the time range. From data['Node']['TrafficNode'] with shape [time_slot_num, node_num].

node_monthly_interaction¶: np.ndarray – Data recording the monthly interaction of pairs of nodes. Its shape is [month_num, node_num, node_num]. It is from data['Node']['TrafficMonthlyInteraction'] and is used to build the interaction graph. It is an optional attribute and can be set to an empty list if the interaction graph is not needed.

node_station_info¶: dict – A dict storing the coordinates of nodes. It shall be formatted as {id (may be arbitrary): [id (when sorted, should be consistent with the index of node_traffic), latitude, longitude, other notes]}. It is from data['Node']['StationInfo'] and is used to build the distance graph. It is an optional attribute and can be set to an empty list if the distance graph is not needed.
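
As a rough illustration of the layout these attributes read from, the sketch below assembles a minimal dictionary with the keys named above and round-trips it through pickle. All values are synthetic placeholders. The relation daily_slots = 24 * 60 / time_fitness is an assumption derived from the attribute descriptions, and a real dataset file may contain further keys that are not shown here.

```python
import pickle

import numpy as np

time_fitness = 60                       # minutes per time slot
daily_slots = 24 * 60 // time_fitness   # assumed relation, see lead-in
node_num = 3
time_slot_num = 2 * daily_slots         # two days of synthetic data

data = {
    "TimeRange": ["2020-01-01", "2020-01-02"],  # placeholder dates
    "TimeFitness": time_fitness,
    "Node": {
        # [time_slot_num, node_num]
        "TrafficNode": np.zeros((time_slot_num, node_num)),
        # [month_num, node_num, node_num]; optional
        "TrafficMonthlyInteraction": np.zeros((1, node_num, node_num)),
        # {arbitrary key: [id, latitude, longitude, other notes]}; optional
        "StationInfo": {
            "s0": [0, 40.71, -74.00, "placeholder note"],
            "s1": [1, 40.72, -74.01, "placeholder note"],
            "s2": [2, 40.73, -74.02, "placeholder note"],
        },
    },
}

# round-trip through pickle, the storage format named in this section
restored = pickle.loads(pickle.dumps(data))
node_traffic = restored["Node"]["TrafficNode"]
station_ids = sorted(info[0] for info in restored["Node"]["StationInfo"].values())
print(node_traffic.shape, station_ids)
```

The sorted station ids line up with the column index of node_traffic, which is the consistency requirement stated for node_station_info.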

MergeIndex¶: int – An integer used to adjust the granularity of the dataset. The granularity of the new dataset is time_fitness * MergeIndex. Default: 1

MergeWay¶: str – Can be 'sum' or 'average'. Default: 'sum'
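
The effect of MergeIndex and MergeWay on a [time_slot_num, node_num] array can be sketched as follows. merge_slots is a hypothetical helper, not the library's merge_data; it assumes that the number of time slots must be divisible by MergeIndex, and it includes 'max' because the loader's MergeWay parameter lists it.

```python
import numpy as np

# reducers for the documented MergeWay values
REDUCERS = {"sum": np.sum, "average": np.mean, "max": np.max}


def merge_slots(node_traffic, merge_index, merge_way="sum"):
    """Hypothetical sketch: combine every `merge_index` consecutive slots."""
    if merge_way not in REDUCERS:
        raise ValueError(f"unsupported merge_way: {merge_way!r}")
    if len(node_traffic) % merge_index != 0:
        # Assumption: incomplete trailing groups are rejected, not truncated.
        raise ValueError("time_slot_num must be divisible by merge_index")
    # [T, N] -> [T / merge_index, merge_index, N], then reduce the middle axis
    grouped = node_traffic.reshape(-1, merge_index, node_traffic.shape[1])
    return REDUCERS[merge_way](grouped, axis=1)


node_traffic = np.arange(12).reshape(6, 2)  # 6 slots, 2 nodes
merged_sum = merge_slots(node_traffic, 3, "sum")
merged_avg = merge_slots(node_traffic, 3, "average")
merged_max = merge_slots(node_traffic, 3, "max")
print(merged_sum.tolist(), merged_avg.tolist(), merged_max.tolist())
```

If the original slots were 5 minutes long, the merged array in this example would have a granularity of 5 * 3 = 15 minutes.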

merge_data(data, dataType)¶