6.1. UCTB.dataset package

6.1.1. UCTB.dataset.data_loader module

class UCTB.dataset.data_loader.NodeTrafficLoader(dataset, city=None, data_range='all', train_data_length='all', test_ratio=0.1, closeness_len=6, period_len=7, trend_len=4, target_length=1, normalize=True, workday_parser=<function is_work_day_america>, with_tpe=False, data_dir=None, MergeIndex=1, MergeWay='sum', remove=True, **kwargs)

Bases: object

The data loader that extracts and processes data from a DataSet object.

Parameters:
  • dataset (str) – Either the path of the dataset pickle file or the name of the dataset.

  • city (str or None) – None if dataset is a file path; otherwise the name of the city. Default: None

  • data_range – The range of data extracted from self.dataset for further use. If set to 'all', all data in self.dataset are used. If set to a float between 0.0 and 1.0, the leading proportion of the data in self.dataset is used (for example, 0.5 keeps the first half). If set to a list of two integers [start, end], the data from day start through day (end - 1) are used. Default: 'all'

  • train_data_length – The length of the train data. If set to 'all', all data in the split train set are used. If set to an int, the most recent train_data_length days of data are used as the train set. Default: 'all'

  • test_ratio (float) – The proportion of the data assigned to the test set when the data are split into a train set and a test set. Default: 0.1

  • closeness_len (int) – The length of the closeness history. The closeness_len consecutive time slots immediately preceding the target are used as closeness history. Default: 6

  • period_len (int) – The length of the period history. The data of the same time slot in the preceding period_len consecutive days are used as period history. Default: 7

  • trend_len (int) – The length of the trend history. The data of the same time slot in the preceding trend_len consecutive weeks (every seven days) are used as trend history. Default: 4

  • target_length (int) – The number of steps predicted from one piece of history data. Currently this must be 1. Default: 1

  • normalize (bool|str|object) – Select which normalizer to normalize input data. Default: True

  • workday_parser – Used to build external features to be used in neural methods. Default: is_work_day_america

  • with_tpe (bool) – If True, data loader will build time position embeddings. Default: False

  • data_dir (str or None) – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should also be None. Default: None

  • MergeIndex (int) – The granularity of the dataset will be MergeIndex * original granularity. Default: 1

  • MergeWay (str) – How the data are merged when the granularity changes. Currently it can be sum, average, or max. Default: 'sum'

  • remove (bool) – If True, the data loader removes stations whose average traffic is less than 1. Otherwise, the data loader uses all stations. Default: True
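
The three forms of data_range can be read as a rule for choosing a slice of time slots. The following is a minimal sketch under stated assumptions, not UCTB's implementation: select_range is a hypothetical helper, and it assumes that day indices are converted to slot indices by multiplying by daily_slots.

```python
def select_range(num_slots, daily_slots, data_range="all"):
    """Hypothetical helper (not part of UCTB): map data_range to [start, end) slot indices."""
    if isinstance(data_range, str):
        if data_range.lower() != "all":
            raise ValueError("the only supported string is 'all'")
        return 0, num_slots
    if isinstance(data_range, float):
        # Keep the leading proportion of the data.
        return 0, int(data_range * num_slots)
    # [start, end] in days: keep day `start` through day `end - 1`.
    start_day, end_day = data_range
    return start_day * daily_slots, end_day * daily_slots


# Toy numbers: four days of hourly data.
num_slots, daily_slots = 96, 24
print(select_range(num_slots, daily_slots, "all"))
print(select_range(num_slots, daily_slots, 0.5))
print(select_range(num_slots, daily_slots, [1, 3]))
```

With these toy numbers, [1, 3] selects the second and third days only, because the end day is exclusive.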

dataset

The DataSet object storing basic data.

Type:

DataSet

daily_slots

The number of time slots in one single day.

Type:

int

station_number

The number of nodes.

Type:

int

external_dim

The number of dimensions of external features.

Type:

int

train_closeness

The closeness history of train set data. When with_tpe is False, its shape is [train_time_slot_num, station_number, closeness_len, 1]. On the dimension of closeness_len, data are arranged from earlier time slots to later time slots. If closeness_len is set to 0, train_closeness will be an empty ndarray. train_period, train_trend, test_closeness, test_period, test_trend have similar shape and construction.

Type:

np.ndarray
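
To make the closeness, period, and trend descriptions concrete, the sketch below lists which earlier time slots would feed a prediction for slot t, ordered from earlier to later. It follows the parameter descriptions above rather than UCTB's source code; history_indices and the toy traffic array are assumptions made for illustration.

```python
import numpy as np


def history_indices(t, daily_slots, closeness_len, period_len, trend_len):
    """Hypothetical helper: slot indices used as history for slot t (earlier -> later)."""
    # The closeness_len slots immediately before t.
    closeness = [t - k for k in range(closeness_len, 0, -1)]
    # The same slot of the day in each of the preceding period_len days.
    period = [t - k * daily_slots for k in range(period_len, 0, -1)]
    # The same slot of the day in each of the preceding trend_len weeks.
    trend = [t - k * 7 * daily_slots for k in range(trend_len, 0, -1)]
    return closeness, period, trend


# Toy traffic: 30 days of hourly slots for 3 stations, shape [time_slot_num, station_number].
daily_slots, station_number = 24, 3
traffic = np.arange(30 * daily_slots * station_number, dtype=float)
traffic = traffic.reshape(-1, station_number)

t = 29 * daily_slots  # first slot of the last day
c_idx, p_idx, q_idx = history_indices(t, daily_slots, 6, 7, 4)

# One sample of closeness history: [closeness_len, station_number]
# -> [station_number, closeness_len, 1], matching the per-sample layout above.
closeness_sample = traffic[c_idx].T[..., np.newaxis]
print(c_idx, p_idx, q_idx, closeness_sample.shape)
```

Stacking one such sample per target slot along a new leading axis would give the [train_time_slot_num, station_number, closeness_len, 1] layout described for train_closeness.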

train_y

The train set data. Its shape is [train_time_slot_num, station_number, 1]. test_y has similar shape and construction.

Type:

np.ndarray

make_concat(node='all', is_train=True)

A function to concatenate all closeness, period and trend history data to use as inputs of models.

Parameters:
  • node (int or 'all') – Selects which node to return. If set to 'all', the concatenation result of all nodes is returned. If set to an integer, it is the index of the selected node. Default: 'all'

  • is_train (bool) – If set to True, train_closeness, train_period, and train_trend will be concatenated. If set to False, test_closeness, test_period, and test_trend will be concatenated. Default: True

Returns:

An ndarray with shape [time_slot_num, station_number, closeness_len + period_len + trend_len, 1], where time_slot_num is the temporal length of the train set data if is_train is True, or the temporal length of the test set data if is_train is False. On the third dimension (axis 2), data are arranged as earlier closeness -> later closeness -> earlier period -> later period -> earlier trend -> later trend.

Return type:

np.ndarray
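
The returned layout can be reproduced with a plain NumPy concatenation along the history axis. This is a sketch of the documented shape and ordering only, not the body of make_concat; the constant-filled arrays stand in for real closeness, period, and trend data.

```python
import numpy as np

time_slot_num, station_number = 5, 3
closeness_len, period_len, trend_len = 6, 7, 4

# Placeholder histories, each filled with a distinct constant so the order is visible.
closeness = np.zeros((time_slot_num, station_number, closeness_len, 1))
period = np.ones((time_slot_num, station_number, period_len, 1))
trend = np.full((time_slot_num, station_number, trend_len, 1), 2.0)

# Concatenate on axis 2: closeness first, then period, then trend.
concat_all = np.concatenate((closeness, period, trend), axis=2)
print(concat_all.shape)
print(concat_all[0, 0, :, 0])
```

The printed row shows six closeness entries, then seven period entries, then four trend entries, which is the ordering described in the return value.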

6.1.2. UCTB.dataset.dataset module

class UCTB.dataset.dataset.DataSet(dataset, MergeIndex, MergeWay, city=None, data_dir=None)

Bases: object

An object storing basic data from a formatted pickle file. See also Build your own datasets.

Parameters:
  • dataset (str) – A string containing path of the dataset pickle file or a string of name of the dataset.

  • city (str or None) – None if dataset is file path, or a string of name of the city. Default: None

  • data_dir – The dataset directory. If set to None, a directory will be created. If dataset is file path, data_dir should be None too. Default: None

data

The data directly from the pickle file. data may have a data['contribute_data'] dict to store supplementary data.

Type:

dict

time_range

From data['TimeRange'] in the format of [YYYY-MM-DD, YYYY-MM-DD] indicating the time range of the data.

Type:

list

time_fitness

From data['TimeFitness'], indicating the length of a single time slot in minutes.

Type:

int
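
Because time_fitness is the slot length in minutes, the number of slots in one day follows from simple arithmetic. A minimal sketch, assuming time_fitness divides a day evenly; slots_per_day is a hypothetical helper name, and how UCTB itself derives daily_slots is not shown here.

```python
MINUTES_PER_DAY = 24 * 60


def slots_per_day(time_fitness):
    """Hypothetical helper: number of time slots in a day for a slot length in minutes."""
    if MINUTES_PER_DAY % time_fitness != 0:
        raise ValueError("time_fitness is assumed to divide a day evenly")
    return MINUTES_PER_DAY // time_fitness


print(slots_per_day(60))
print(slots_per_day(5))
```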

node_traffic

Data recording the main stream data of the nodes during the time range. From data['Node']['TrafficNode'], with shape [time_slot_num, node_num].

Type:

np.ndarray

node_monthly_interaction

Data recording the monthly interaction of pairs of nodes. Its shape is [month_num, node_num, node_num]. It is from data['Node']['TrafficMonthlyInteraction'] and is used to build the interaction graph. It is an optional attribute and can be set to an empty list if the interaction graph is not needed.

Type:

np.ndarray

node_station_info

A dict storing the coordinates of nodes. It shall be formatted as {id (may be arbitrary): [id (when sorted, should be consistent with the index of node_traffic), latitude, longitude, other notes]}. It is from data['Node']['StationInfo'] and is used to build the distance graph. It is an optional attribute and can be set to an empty list if the distance graph is not needed.

Type:

dict
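
As an illustration of how a dict in this format could feed a distance graph, the sketch below computes pairwise great-circle distances with the haversine formula. It is an assumption-laden example, not UCTB's graph builder: the station records are invented toy values, haversine_km is a hypothetical helper, and sorting records by their inner id is one reading of the ordering rule above.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Toy records in the documented layout: {key: [id, latitude, longitude, notes]}.
station_info = {
    "x": ["1", 40.0, -73.0, "toy station"],
    "y": ["0", 40.0, -74.0, "toy station"],
}

# Assumption: order stations by the inner id so rows line up with node_traffic columns.
ordered = sorted(station_info.values(), key=lambda record: record[0])
distance_km = [
    [haversine_km(u[1], u[2], v[1], v[2]) for v in ordered]
    for u in ordered
]
print(distance_km)
```

A distance graph could then be obtained by thresholding this matrix, with the threshold chosen for the dataset at hand.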

MergeIndex

An integer used to adjust the granularity of the dataset. The granularity of the new dataset is time_fitness * MergeIndex. Default: 1

Type:

int

MergeWay

How the data are merged when the granularity changes. Can be sum or average. Default: sum

Type:

str
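
The effect of MergeIndex and MergeWay on an array shaped like node_traffic can be sketched as grouping consecutive time slots and reducing each group. This is not the library's merge routine: merge_slots is a hypothetical helper, and dropping trailing slots that do not fill a whole group is an assumption of the sketch.

```python
import numpy as np


def merge_slots(node_traffic, merge_index=1, merge_way="sum"):
    """Hypothetical helper: coarsen a [time_slot_num, node_num] array by merge_index."""
    node_num = node_traffic.shape[1]
    # Assumption: discard trailing slots that do not fill a complete group.
    usable = (node_traffic.shape[0] // merge_index) * merge_index
    grouped = node_traffic[:usable].reshape(-1, merge_index, node_num)
    if merge_way == "sum":
        return grouped.sum(axis=1)
    if merge_way == "average":
        return grouped.mean(axis=1)
    if merge_way == "max":
        return grouped.max(axis=1)
    raise ValueError(f"unsupported merge_way: {merge_way!r}")


# Six 30-minute slots for two nodes, merged into three 60-minute slots.
toy_traffic = np.arange(12).reshape(6, 2)
merged_sum = merge_slots(toy_traffic, merge_index=2, merge_way="sum")
merged_avg = merge_slots(toy_traffic, merge_index=2, merge_way="average")
print(merged_sum)
print(merged_avg)
```

Summing suits count-like data such as trip demand, while averaging suits rate-like data such as speed; which one applies depends on what node_traffic measures.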

merge_data(data, dataType)