6.1. UCTB.dataset package
6.1.1. UCTB.dataset.data_loader module
class UCTB.dataset.data_loader.NodeTrafficLoader(dataset, city=None, data_range='all', train_data_length='all', test_ratio=0.1, closeness_len=6, period_len=7, trend_len=4, target_length=1, normalize=True, workday_parser=<function is_work_day_america>, with_tpe=False, data_dir=None, MergeIndex=1, MergeWay='sum', remove=True, **kwargs)
Bases: object
The data loader that extracts and processes data from a DataSet object.
Parameters:
- dataset (str) – The path of the dataset pickle file, or the name of the dataset.
- city (str or None) – None if dataset is a file path; otherwise the name of the city. Default: None
- data_range – The range of data extracted from self.dataset for further use. If set to 'all', all data in self.dataset is used. If set to a float between 0.0 and 1.0, that leading proportion of the data in self.dataset is used. If set to a list of two integers [start, end], the data from day start through day (end - 1) in self.dataset is used. Default: 'all'
- train_data_length – The length of the train data. If set to 'all', all data in the split train set is used. If set to an int, the latest train_data_length days of data are used as the train set. Default: 'all'
- test_ratio (float) – The proportion of the data assigned to the test set when the data is split into a train set and a test set. Default: 0.1
- closeness_len (int) – The length of the closeness history. The preceding closeness_len consecutive time slots of data are used as closeness history. Default: 6
- period_len (int) – The length of the period history. The data at exactly the same time slot in each of the preceding period_len consecutive days is used as period history. Default: 7
- trend_len (int) – The length of the trend history. The data at exactly the same time slot in each of the preceding trend_len consecutive weeks (every seven days) is used as trend history. Default: 4
- target_length (int) – The number of steps to predict from one piece of history data. Currently must be 1. Default: 1
- normalize (bool|str|object) – Selects which normalizer is used to normalize the input data. Default: True
- workday_parser – Used to build external features for neural methods. Default: is_work_day_america
- with_tpe (bool) – If True, the data loader builds time position embeddings. Default: False
- data_dir (str or None) – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should be None too. Default: None
- MergeIndex (int) – The granularity of the dataset will be MergeIndex * original granularity.
- MergeWay (str) – How the data granularity is changed. Currently it can be sum, average, or max.
- remove (bool) – If True, the data loader removes stations whose average traffic is less than 1. Otherwise, the data loader uses all stations.
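The three forms of data_range and the effect of test_ratio can be illustrated with a minimal sketch. The helpers resolve_data_range and split_train_test are hypothetical names introduced here, not part of UCTB; the sketch also assumes that the test set is taken from the end of the selected range and that lengths are truncated to integers, neither of which is stated above.

```python
def resolve_data_range(data_range, total_slots, daily_slots):
    """Return (start, end) time-slot indices for a data_range setting.

    Hypothetical helper: mirrors the three documented forms of data_range.
    """
    if isinstance(data_range, str) and data_range.lower() == 'all':
        return 0, total_slots
    if isinstance(data_range, float):
        if not 0.0 <= data_range <= 1.0:
            raise ValueError('a float data_range must lie in [0.0, 1.0]')
        # Keep the leading proportion of the data.
        return 0, int(data_range * total_slots)
    # [start, end] is given in days: day start through day (end - 1).
    start_day, end_day = data_range
    return start_day * daily_slots, end_day * daily_slots


def split_train_test(num_slots, test_ratio):
    """Return (train_len, test_len); assumes the test set is the tail."""
    test_len = int(num_slots * test_ratio)
    return num_slots - test_len, test_len


daily_slots = 24                 # e.g. one slot per hour
total_slots = 30 * daily_slots   # 30 days of data

print(resolve_data_range('all', total_slots, daily_slots))
print(resolve_data_range(0.5, total_slots, daily_slots))
print(resolve_data_range([2, 5], total_slots, daily_slots))
print(split_train_test(total_slots, 0.1))
```

With [2, 5] the selected slots cover days 2, 3, and 4, matching the "start day to (end - 1) day" wording.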
Attributes:
- dataset (DataSet) – The DataSet object storing basic data.
- daily_slots (int) – The number of time slots in a single day.
- station_number (int) – The number of nodes.
- external_dim (int) – The number of dimensions of external features.
- train_closeness (np.ndarray) – The closeness history of the train set data. When with_tpe is False, its shape is [train_time_slot_num, station_number, closeness_len, 1]. Along the closeness_len dimension, data are arranged from earlier time slots to later time slots. If closeness_len is set to 0, train_closeness will be an empty ndarray. train_period, train_trend, test_closeness, test_period, and test_trend have similar shapes and construction.
- train_y (np.ndarray) – The train set data. Its shape is [train_time_slot_num, station_number, 1]. test_y has a similar shape and construction.
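The shapes and ordering described for the history arrays can be reproduced on toy data. This is a sketch under stated assumptions, not the loader's implementation: build_history is a hypothetical helper, and the way targets are aligned (starting at the first slot for which every history is available) is an assumption.

```python
import numpy as np


def build_history(traffic, lags, first_target):
    """Gather traffic[t - lag] for every target slot t >= first_target.

    traffic: array of shape [time_slot_num, station_number].
    lags: positive slot offsets ordered from largest to smallest, so the
          result runs from earlier to later time slots.
    Returns an array of shape [target_num, station_number, len(lags), 1],
    or an empty ndarray when lags is empty.
    """
    if len(lags) == 0:
        return np.array([])
    total = traffic.shape[0]
    columns = [traffic[first_target - lag: total - lag] for lag in lags]
    return np.stack(columns, axis=-1)[..., np.newaxis]


# Toy settings (much smaller than the defaults) so the arrays stay readable.
daily_slots, closeness_len, period_len, trend_len = 4, 3, 2, 1
traffic = np.arange(40 * 2).reshape(40, 2)   # 40 slots, 2 stations

closeness_lags = list(range(closeness_len, 0, -1))                      # [3, 2, 1]
period_lags = [d * daily_slots for d in range(period_len, 0, -1)]       # [8, 4]
trend_lags = [w * 7 * daily_slots for w in range(trend_len, 0, -1)]     # [28]
first_target = max(closeness_lags + period_lags + trend_lags)

closeness = build_history(traffic, closeness_lags, first_target)
period = build_history(traffic, period_lags, first_target)
trend = build_history(traffic, trend_lags, first_target)
train_y = traffic[first_target:, :, np.newaxis]

print(closeness.shape, period.shape, trend.shape, train_y.shape)
```

Each history array keeps the station dimension second and the history length third, with a trailing singleton dimension, as in the shape given for train_closeness.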
make_concat(node='all', is_train=True)
A function that concatenates all closeness, period, and trend history data for use as model inputs.
Parameters:
- node (int or 'all') – Specifies the index of a certain node. If set to 'all', the concatenation result for all nodes is returned. If set to an integer, it is the index of the selected node. Default: 'all'
- is_train (bool) – If set to True, train_closeness, train_period, and train_trend are concatenated. If set to False, test_closeness, test_period, and test_trend are concatenated. Default: True
Returns: An ndarray with shape [time_slot_num, station_number, closeness_len + period_len + trend_len, 1], where time_slot_num is the temporal length of the train set data if is_train is True, or the temporal length of the test set data if is_train is False. Along the closeness_len + period_len + trend_len dimension, data are arranged as earlier closeness -> later closeness -> earlier period -> later period -> earlier trend -> later trend.
Return type: np.ndarray
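The documented return shape corresponds to a concatenation along the history dimension. The sketch below uses a hypothetical stand-alone function, concat_history, rather than the method itself. It assumes that at least one of the three histories is non-empty and that selecting a single node keeps a station dimension of size 1; the text above does not state either point.

```python
import numpy as np


def concat_history(closeness, period, trend, node='all'):
    """Concatenate history arrays along the history-length axis (axis 2).

    Each input has shape [time_slot_num, station_number, length, 1];
    empty arrays (history length 0) are skipped.
    """
    parts = [p for p in (closeness, period, trend) if p.size > 0]
    history = np.concatenate(parts, axis=2)
    if node != 'all':
        # Assumption: keep the station axis, with size 1, for a single node.
        history = history[:, node:node + 1]
    return history


time_slot_num, station_number = 5, 3
closeness = np.zeros((time_slot_num, station_number, 6, 1))
period = np.ones((time_slot_num, station_number, 7, 1))
trend = np.full((time_slot_num, station_number, 4, 1), 2.0)

all_nodes = concat_history(closeness, period, trend)
one_node = concat_history(closeness, period, trend, node=1)

print(all_nodes.shape)   # third dimension is 6 + 7 + 4
print(one_node.shape)
```

Filling each history with a distinct constant makes the ordering easy to check: the closeness entries come first, then period, then trend.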
6.1.2. UCTB.dataset.dataset module
class UCTB.dataset.dataset.DataSet(dataset, MergeIndex, MergeWay, city=None, data_dir=None)
Bases: object
An object storing basic data from a formatted pickle file. See also Build your own datasets.
Parameters:
- dataset (str) – The path of the dataset pickle file, or the name of the dataset.
- city (str or None) – None if dataset is a file path; otherwise the name of the city. Default: None
- data_dir – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should be None too. Default: None
Attributes:
- data (dict) – The data loaded directly from the pickle file. data may have a data['contribute_data'] dict to store supplementary data.
- time_range (list) – From data['TimeRange'], in the format [YYYY-MM-DD, YYYY-MM-DD], indicating the time range of the data.
- time_fitness (int) – From data['TimeFitness'], indicating the length of a single time slot in minutes.
- node_traffic (np.ndarray) – The main stream data of the nodes during the time range. From data['Node']['TrafficNode'], with shape [time_slot_num, node_num].
- node_monthly_interaction (np.ndarray) – The monthly interaction between pairs of nodes. Its shape is [month_num, node_num, node_num]. It comes from data['Node']['TrafficMonthlyInteraction'] and is used to build the interaction graph. It is an optional attribute and can be set to an empty list if the interaction graph is not needed.
- node_station_info (dict) – A dict storing the coordinates of nodes. It should be formatted as {id (may be arbitrary): [id (when sorted, should be consistent with the index of node_traffic), latitude, longitude, other notes]}. It comes from data['Node']['StationInfo'] and is used to build the distance graph. It is an optional attribute and can be set to an empty list if the distance graph is not needed.
- MergeIndex (int) – An integer used to adjust the granularity of the dataset; the granularity of the new dataset is time_fitness * MergeIndex. Default: 1
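A toy dictionary with the keys named above can make the layout concrete. This is an illustrative sketch only: the values are made up, only keys mentioned in the attribute descriptions are used, and whether the end date in TimeRange is inclusive is not stated here, so the dates are placeholders.

```python
import numpy as np

time_fitness = 60                          # minutes per time slot
daily_slots = 24 * 60 // time_fitness      # slots in one day
num_days, node_num = 2, 3

data = {
    'TimeRange': ['2020-01-01', '2020-01-02'],   # placeholder dates
    'TimeFitness': time_fitness,
    'Node': {
        # Shape [time_slot_num, node_num].
        'TrafficNode': np.zeros((num_days * daily_slots, node_num)),
        # Optional: an empty list when no interaction graph is needed.
        'TrafficMonthlyInteraction': [],
        # {arbitrary id: [id, latitude, longitude, other notes]}
        'StationInfo': {
            'x': [2, 40.75, -73.99, 'made-up station'],
            'y': [0, 40.71, -74.00, 'made-up station'],
            'z': [1, 40.73, -73.98, 'made-up station'],
        },
    },
}

# Sorting by the inner id yields the column order of TrafficNode.
ordered = sorted(data['Node']['StationInfo'].values(), key=lambda v: v[0])
ordered_ids = [entry[0] for entry in ordered]

print(daily_slots, data['Node']['TrafficNode'].shape, ordered_ids)
```

The sort step reflects the requirement that the inner ids, once sorted, line up with the node index of node_traffic.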
merge_data(data, dataType)
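The effect of MergeIndex and MergeWay on granularity can be sketched as grouping consecutive time slots and reducing each group. The function merge_slots below is a hypothetical illustration, not the body of merge_data; dropping a trailing incomplete group is an assumption.

```python
import numpy as np


def merge_slots(traffic, merge_index, merge_way='sum'):
    """Coarsen [time_slot_num, node_num] data by merging merge_index slots.

    merge_way is one of 'sum', 'average', or 'max'.
    """
    reducers = {'sum': np.sum, 'average': np.mean, 'max': np.max}
    if merge_way not in reducers:
        raise ValueError("merge_way must be 'sum', 'average', or 'max'")
    groups = traffic.shape[0] // merge_index
    # Assumption: slots that do not fill a complete group are dropped.
    trimmed = traffic[:groups * merge_index]
    grouped = trimmed.reshape(groups, merge_index, traffic.shape[1])
    return reducers[merge_way](grouped, axis=1)


time_fitness, merge_index = 60, 3
traffic = np.arange(12).reshape(6, 2)      # 6 slots, 2 nodes

merged_sum = merge_slots(traffic, merge_index, 'sum')
merged_avg = merge_slots(traffic, merge_index, 'average')
merged_max = merge_slots(traffic, merge_index, 'max')
new_time_fitness = time_fitness * merge_index   # minutes per merged slot

print(merged_sum.tolist(), merged_avg.tolist(), merged_max.tolist())
print(new_time_fitness)
```

Six hourly slots merged with MergeIndex 3 become two slots of 180 minutes each, consistent with the new granularity time_fitness * MergeIndex.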