6.1. UCTB.dataset package

6.1.1. UCTB.dataset.data_loader module

class UCTB.dataset.data_loader.NodeTrafficLoader(dataset, city=None, data_range='all', train_data_length='all', test_ratio=0.1, closeness_len=6, period_len=7, trend_len=4, target_length=1, normalize=True, workday_parser=<function is_work_day_america>, with_tpe=False, data_dir=None, MergeIndex=1, MergeWay='sum', remove=True, **kwargs)

Bases: object

The data loader that extracts and processes data from a DataSet object.

Parameters:
  • dataset (str) – Either the path to the dataset pickle file or the name of the dataset.
  • city (str or None) – The name of the city, or None if dataset is a file path. Default: None
  • data_range – The range of data extracted from self.dataset for further use. If set to 'all', all data in self.dataset will be used. If set to a float between 0.0 and 1.0, the leading proportion of the data in self.dataset will be used. If set to a list of two integers [start, end], the data from day start to day (end - 1) in self.dataset will be used. Default: 'all'
  • train_data_length – The length of the train data. If set to 'all', all data in the split train set will be used. If set to an int, the latest train_data_length days of data will be used as the train set. Default: 'all'
  • test_ratio (float) – The proportion of the data held out as the test set when the data is split into a train set and a test set. Default: 0.1
  • closeness_len (int) – The length of the closeness data history. The preceding closeness_len consecutive time slots of data will be used as closeness history. Default: 6
  • period_len (int) – The length of the period data history. The data from the same time slot on each of the preceding period_len consecutive days will be used as period history. Default: 7
  • trend_len (int) – The length of the trend data history. The data from the same time slot in each of the preceding trend_len consecutive weeks (every seven days) will be used as trend history. Default: 4
  • target_length (int) – The number of steps to predict from one piece of history data. Currently this must be 1. Default: 1
  • normalize (bool|str|object) – Select which normalizer to normalize input data. Default: True
  • workday_parser – Used to build external features to be used in neural methods. Default: is_work_day_america
  • with_tpe (bool) – If True, data loader will build time position embeddings. Default: False
  • data_dir (str or None) – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should be None too. Default: None
  • MergeIndex (int) – The granularity of the dataset will be MergeIndex * original granularity. Default: 1
  • MergeWay (str) – How to change the data granularity. Currently it can be sum, average, or max. Default: 'sum'
  • remove (bool) – If True, the data loader will remove stations whose average traffic is less than 1. Otherwise, the data loader will use all stations. Default: True
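
The data_range and test_ratio options can be illustrated with a small NumPy sketch. This is not UCTB's code: select_range and split_train_test are hypothetical helpers, and both the rounding and the choice to take the test set from the end of the time axis are assumptions made for illustration.

```python
import numpy as np

def select_range(traffic, data_range, daily_slots):
    """Hypothetical helper mirroring the documented data_range options."""
    if isinstance(data_range, str) and data_range == 'all':
        return traffic
    if isinstance(data_range, float):
        # Keep the leading proportion of the time axis.
        return traffic[:int(len(traffic) * data_range)]
    # Otherwise expect [start, end]: days start .. end - 1.
    start, end = data_range
    return traffic[start * daily_slots:end * daily_slots]

def split_train_test(traffic, test_ratio):
    """Assumption: the last test_ratio share of time slots is the test set."""
    n_test = int(len(traffic) * test_ratio)
    cut = len(traffic) - n_test
    return traffic[:cut], traffic[cut:]

# Ten days of hourly data for three nodes (made-up values).
daily_slots = 24
traffic = np.arange(10 * daily_slots * 3).reshape(10 * daily_slots, 3)

subset = select_range(traffic, [2, 6], daily_slots)  # days 2, 3, 4 and 5
train, test = split_train_test(subset, test_ratio=0.1)
print(subset.shape, train.shape, test.shape)
```
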
dataset

DataSet – The DataSet object storing basic data.

daily_slots

int – The number of time slots in one single day.

station_number

int – The number of nodes.

external_dim

int – The number of dimensions of external features.

train_closeness

np.ndarray – The closeness history of train set data. When with_tpe is False, its shape is [train_time_slot_num, station_number, closeness_len, 1]. On the dimension of closeness_len, data are arranged from earlier time slots to later time slots. If closeness_len is set to 0, train_closeness will be an empty ndarray. train_period, train_trend, test_closeness, test_period, test_trend have similar shape and construction.
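
To make the closeness, period and trend layouts concrete, here is a minimal NumPy sketch that builds arrays with the shapes described above from a [time_slot_num, station_number] matrix. build_history is a hypothetical function written for this illustration, not the loader's implementation; normalization and time position embeddings are left out.

```python
import numpy as np

def build_history(traffic, daily_slots, closeness_len, period_len, trend_len):
    """Illustrative construction of history features (not UCTB's code)."""
    # The first usable target needs the longest history window before it.
    offset = max(closeness_len,
                 period_len * daily_slots,
                 trend_len * 7 * daily_slots)
    closeness, period, trend, target = [], [], [], []
    for t in range(offset, len(traffic)):
        # Lags run from largest to smallest, so earlier slots come first.
        c_idx = np.array([t - k for k in range(closeness_len, 0, -1)], dtype=int)
        p_idx = np.array([t - k * daily_slots
                          for k in range(period_len, 0, -1)], dtype=int)
        q_idx = np.array([t - k * 7 * daily_slots
                          for k in range(trend_len, 0, -1)], dtype=int)
        closeness.append(traffic[c_idx].T)  # [station_number, closeness_len]
        period.append(traffic[p_idx].T)
        trend.append(traffic[q_idx].T)
        target.append(traffic[t])

    def add_channel(xs):
        # Append the trailing axis of size 1.
        return np.asarray(xs)[..., np.newaxis]

    return (add_channel(closeness), add_channel(period),
            add_channel(trend), add_channel(target))

# Two nodes, 40 slots, 4 slots per day; node 0 simply stores the slot index.
traffic = np.stack([np.arange(40), 100 + np.arange(40)], axis=1)
c, p, q, y = build_history(traffic, daily_slots=4,
                           closeness_len=2, period_len=2, trend_len=1)
print(c.shape, p.shape, q.shape, y.shape)
```

Because node 0 stores the slot index, reading c[0, 0, :, 0], p[0, 0, :, 0] and q[0, 0, :, 0] shows directly which past slots feed the first target.
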

train_y

np.ndarray – The train set data. Its shape is [train_time_slot_num, station_number, 1]. test_y has similar shape and construction.

make_concat(node='all', is_train=True)

A function to concatenate all closeness, period and trend history data to use as inputs of models.

Parameters:
  • node (int or 'all') – To specify the index of certain node. If set to 'all', return the concatenation result of all nodes. If set to an integer, it will be the index of the selected node. Default: 'all'
  • is_train (bool) – If set to True, train_closeness, train_period, and train_trend will be concatenated. If set to False, test_closeness, test_period, and test_trend will be concatenated. Default: True
Returns:

An ndarray with shape [time_slot_num, station_number, closeness_len + period_len + trend_len, 1], where time_slot_num is the temporal length of the train set if is_train is True, or of the test set if is_train is False. On the third dimension, data are arranged as earlier closeness -> later closeness -> earlier period -> later period -> earlier trend -> later trend.

Return type:

np.ndarray
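
The concatenation described for make_concat can be sketched with plain NumPy. concat_history below is a hypothetical stand-in, not the method itself. Keeping a node axis of length 1 when a single node index is passed is an assumption.

```python
import numpy as np

def concat_history(closeness, period, trend, node='all'):
    """Join the three history arrays along the history-length axis (axis 2)."""
    merged = np.concatenate([closeness, period, trend], axis=2)
    if isinstance(node, str) and node == 'all':
        return merged
    # Assumption: a single node keeps its axis, giving a length-1 dimension.
    return merged[:, node:node + 1]

# Random stand-ins using the default lengths 6, 7 and 4, for 5 slots and 3 nodes.
rng = np.random.default_rng(0)
closeness = rng.random((5, 3, 6, 1))
period = rng.random((5, 3, 7, 1))
trend = rng.random((5, 3, 4, 1))

all_nodes = concat_history(closeness, period, trend)
one_node = concat_history(closeness, period, trend, node=1)
print(all_nodes.shape, one_node.shape)
```
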

6.1.2. UCTB.dataset.dataset module

class UCTB.dataset.dataset.DataSet(dataset, MergeIndex, MergeWay, city=None, data_dir=None)

Bases: object

An object storing basic data from a formatted pickle file. See also Build your own datasets.

Parameters:
  • dataset (str) – A string containing path of the dataset pickle file or a string of name of the dataset.
  • city (str or None) – None if dataset is file path, or a string of name of the city. Default: None
  • data_dir – The dataset directory. If set to None, a directory will be created. If dataset is file path, data_dir should be None too. Default: None
data

dict – The data directly from the pickle file. data may have a data['contribute_data'] dict to store supplementary data.

time_range

list – From data['TimeRange'] in the format of [YYYY-MM-DD, YYYY-MM-DD] indicating the time range of the data.

time_fitness

int – From data['TimeFitness'], indicating the length of a single time slot in minutes.

node_traffic

np.ndarray – Data recording the main stream data of the nodes during the time range. From data['Node']['TrafficNode'], with shape [time_slot_num, node_num].

node_monthly_interaction

np.ndarray – Data recording the monthly interaction of pairs of nodes. Its shape is [month_num, node_num, node_num]. It comes from data['Node']['TrafficMonthlyInteraction'] and is used to build the interaction graph. It is an optional attribute and can be set to an empty list if the interaction graph is not needed.

node_station_info

dict – A dict storing the coordinates of nodes. It shall be formatted as {id (may be arbitrary): [id (when sorted, should be consistent with the index of node_traffic), latitude, longitude, other notes]}. It comes from data['Node']['StationInfo'] and is used to build the distance graph. It is an optional attribute and can be set to an empty list if the distance graph is not needed.
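
Putting the attributes above together, the dict loaded from the pickle file might look like the following sketch. Only the key names come from the descriptions in this section. The station ids, coordinates, dates and zero-filled arrays are made up, and treating the end date as inclusive and deriving the slots per day as 24 * 60 / TimeFitness are assumptions.

```python
import numpy as np

time_fitness = 60                       # minutes per time slot
daily_slots = 24 * 60 // time_fitness   # assumed relation to slots per day
num_days, num_nodes = 31, 2

data = {
    # Assumption: the end date is inclusive, so this range spans 31 days.
    'TimeRange': ['2020-01-01', '2020-01-31'],
    'TimeFitness': time_fitness,
    'Node': {
        # [time_slot_num, node_num]
        'TrafficNode': np.zeros([num_days * daily_slots, num_nodes]),
        # [month_num, node_num, node_num]; optional
        'TrafficMonthlyInteraction': np.zeros([1, num_nodes, num_nodes]),
        # {arbitrary id: [index-aligned id, latitude, longitude, notes]}
        'StationInfo': {
            'a': [0, 40.7128, -74.0060, 'made-up station A'],
            'b': [1, 40.7306, -73.9352, 'made-up station B'],
        },
    },
}

print(data['Node']['TrafficNode'].shape)
```
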

MergeIndex

int – An integer used to adjust the granularity of the dataset. The granularity of the new dataset is time_fitness * MergeIndex. Default: 1

MergeWay

str – Can be sum or average. Default: 'sum'
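
The effect of MergeIndex and MergeWay on granularity can be sketched as a grouped reduction over the time axis. merge_slots is a hypothetical helper, not the library's merge_data method. Dropping trailing slots that do not fill a complete group is an assumption, and max is included because the loader's parameter list mentions it.

```python
import numpy as np

def merge_slots(traffic, merge_index, merge_way='sum'):
    """Combine every merge_index consecutive time slots into one slot."""
    reducers = {'sum': np.sum, 'average': np.mean, 'max': np.max}
    # Assumption: trailing slots that do not fill a full group are dropped.
    usable = (len(traffic) // merge_index) * merge_index
    grouped = traffic[:usable].reshape(-1, merge_index, traffic.shape[1])
    return reducers[merge_way](grouped, axis=1)

# Six time slots for two nodes, merged three at a time.
traffic = np.arange(12).reshape(6, 2)
summed = merge_slots(traffic, 3, 'sum')
averaged = merge_slots(traffic, 3, 'average')
peak = merge_slots(traffic, 3, 'max')
print(summed, averaged, peak, sep='\n')
```

With a 5-minute time_fitness and MergeIndex set to 3, for example, each merged slot would cover 15 minutes.
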

merge_data(data, dataType)