6.1. UCTB.dataset package
6.1.1. UCTB.dataset.data_loader module
class UCTB.dataset.data_loader.NodeTrafficLoader(dataset, city=None, data_range='all', train_data_length='all', test_ratio=0.1, closeness_len=6, period_len=7, trend_len=4, target_length=1, normalize=True, workday_parser=<function is_work_day_america>, with_tpe=False, data_dir=None, MergeIndex=1, MergeWay='sum', remove=True, **kwargs)
Bases: object
The data loader that extracts and processes data from a DataSet object.
Parameters:
- dataset (str) – A string containing the path of the dataset pickle file, or the name of the dataset.
- city (str or None) – None if dataset is a file path; otherwise the name of the city. Default: None
- data_range – The range of data extracted from self.dataset for further use. If set to 'all', all data in self.dataset will be used. If set to a float between 0.0 and 1.0, that leading proportion of the data in self.dataset will be used. If set to a list of two integers [start, end], the data from day start to day (end - 1) in self.dataset will be used. Default: 'all'
- train_data_length – The length of the train data. If set to 'all', all data in the split train set will be used. If set to an int, the latest train_data_length days of data will be used as the train set. Default: 'all'
- test_ratio (float) – The fraction of the data held out as the test set when the data are split into a train set and a test set. Default: 0.1
- closeness_len (int) – The length of the closeness history. The preceding closeness_len consecutive time slots of data will be used as closeness history. Default: 6
- period_len (int) – The length of the period history. The data at the same time slot in the preceding period_len consecutive days will be used as period history. Default: 7
- trend_len (int) – The length of the trend history. The data at the same time slot in the preceding trend_len consecutive weeks (every seven days) will be used as trend history. Default: 4
- target_length (int) – The number of steps predicted from one piece of history data. Must be 1 for now. Default: 1
- normalize (bool|str|object) – Selects which normalizer is used to normalize the input data. Default: True
- workday_parser – Used to build the external features used by neural methods. Default: is_work_day_america
- with_tpe (bool) – If True, the data loader will build time position embeddings. Default: False
- data_dir (str or None) – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should be None too. Default: None
- MergeIndex (int) – The granularity of the dataset will be MergeIndex * original granularity.
- MergeWay (str) – How to change the data granularity. Currently it can be sum, average, or max.
- remove (bool) – If True, the data loader will remove stations whose average traffic is less than 1. Otherwise, the data loader will use all stations.
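
The data_range, test_ratio, and train_data_length parameters can be illustrated with a small NumPy sketch. This is not UCTB's code: the helper names select_data_range and split_train_test are hypothetical, and the sketch assumes that the test set is taken from the end of the selected data and that fractional lengths are truncated, neither of which is stated above.

```python
import numpy as np


def select_data_range(traffic, data_range, daily_slots):
    """Pick the part of traffic ([time_slot_num, node_num]) named by data_range.

    Hypothetical helper, written only to illustrate the parameter semantics.
    """
    if isinstance(data_range, str) and data_range == 'all':
        return traffic
    if isinstance(data_range, float):
        # Keep the leading proportion of the time axis (truncation is an assumption).
        return traffic[:int(data_range * len(traffic))]
    start_day, end_day = data_range
    # Day start .. day (end - 1), converted to time-slot indices.
    return traffic[start_day * daily_slots:end_day * daily_slots]


def split_train_test(traffic, test_ratio, train_data_length, daily_slots):
    """Split along time; assumes the test set is the final part of the data."""
    test_len = int(len(traffic) * test_ratio)
    split_at = len(traffic) - test_len
    train, test = traffic[:split_at], traffic[split_at:]
    if train_data_length != 'all':
        # Keep only the latest train_data_length days of the train set.
        train = train[-int(train_data_length) * daily_slots:]
    return train, test


# Ten days of hourly data (daily_slots = 24) for two nodes.
daily_slots = 24
traffic = np.arange(10 * daily_slots * 2).reshape(10 * daily_slots, 2)

first_half = select_data_range(traffic, 0.5, daily_slots)
days_2_to_4 = select_data_range(traffic, [2, 5], daily_slots)
train, test = split_train_test(traffic, 0.1, 3, daily_slots)
print(first_half.shape, days_2_to_4.shape, train.shape, test.shape)
```

With these toy inputs, the list form [2, 5] keeps three whole days, and train_data_length=3 trims the train set to its last three days after the split.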
Attributes:
- dataset (DataSet) – The DataSet object storing basic data.
- daily_slots (int) – The number of time slots in one single day.
- station_number (int) – The number of nodes.
- external_dim (int) – The number of dimensions of external features.
- train_closeness (np.ndarray) – The closeness history of the train set data. When with_tpe is False, its shape is [train_time_slot_num, station_number, closeness_len, 1]. Along the closeness_len dimension, data are arranged from earlier time slots to later time slots. If closeness_len is set to 0, train_closeness will be an empty ndarray. train_period, train_trend, test_closeness, test_period, and test_trend have similar shapes and construction.
- train_y (np.ndarray) – The train set data. Its shape is [train_time_slot_num, station_number, 1]. test_y has a similar shape and construction.
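
The shapes above can be made concrete with a minimal NumPy sketch that builds closeness, period, and trend histories from a [time_slot_num, station_number] array. The function build_history is hypothetical and is not UCTB's implementation; in particular, how the library aligns samples and how many early time slots it discards are assumptions here.

```python
import numpy as np


def build_history(traffic, daily_slots, closeness_len=6, period_len=7, trend_len=4):
    """Illustrative reconstruction of closeness/period/trend histories.

    traffic has shape [time_slot_num, station_number]. Each returned history has
    shape [sample_num, station_number, length, 1], ordered from earlier to later.
    """
    # Offsets (in time slots) back from the target slot, earliest first.
    closeness_offsets = list(range(closeness_len, 0, -1))
    period_offsets = [d * daily_slots for d in range(period_len, 0, -1)]
    trend_offsets = [w * 7 * daily_slots for w in range(trend_len, 0, -1)]

    # Only targets with a full history available are kept (an assumption).
    max_offset = max(closeness_offsets + period_offsets + trend_offsets, default=0)
    targets = np.arange(max_offset, len(traffic))

    def gather(offsets):
        if not offsets:
            return np.array([])  # empty ndarray when the length is 0
        # Stack to [sample_num, length, station_number] ...
        hist = np.stack([traffic[targets - o] for o in offsets], axis=1)
        # ... then reorder to [sample_num, station_number, length, 1].
        return hist.transpose(0, 2, 1)[..., np.newaxis]

    y = traffic[targets][..., np.newaxis]  # [sample_num, station_number, 1]
    return gather(closeness_offsets), gather(period_offsets), gather(trend_offsets), y


# Toy data: 15 days with 4 slots per day, 3 stations; traffic[t, s] == 3 * t + s.
toy = np.arange(60 * 3).reshape(60, 3)
closeness, period, trend, y = build_history(
    toy, daily_slots=4, closeness_len=2, period_len=2, trend_len=1)
print(closeness.shape, period.shape, trend.shape, y.shape)
```

In this toy setting the one-week trend offset (28 slots) is the largest, so the first usable target is slot 28 and 32 samples remain.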
make_concat(node='all', is_train=True)
A function that concatenates all closeness, period, and trend history data for use as model inputs.
Parameters:
- node (int or 'all') – Specifies the index of a certain node. If set to 'all', the concatenation result of all nodes is returned. If set to an integer, it is the index of the selected node. Default: 'all'
- is_train (bool) – If set to True, train_closeness, train_period, and train_trend will be concatenated. If set to False, test_closeness, test_period, and test_trend will be concatenated. Default: True
Returns: An ndarray with shape [time_slot_num, station_number, closeness_len + period_len + trend_len, 1], where time_slot_num is the temporal length of the train set data if is_train is True, or of the test set data if is_train is False. Along the concatenated dimension (the one of length closeness_len + period_len + trend_len), data are arranged as earlier closeness -> later closeness -> earlier period -> later period -> earlier trend -> later trend.
Return type: np.ndarray
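
The concatenation order described for make_concat can be sketched with plain NumPy. The helper concat_history below is hypothetical and only mirrors the documented output shape; skipping empty histories and keeping a station dimension of length 1 when a single node is selected are assumptions, not documented behavior.

```python
import numpy as np


def concat_history(closeness, period, trend, node='all'):
    """Concatenate histories of shape [time_slot_num, station_number, length, 1].

    Hypothetical helper illustrating the documented ordering:
    closeness first, then period, then trend.
    """
    # Histories whose length was set to 0 are empty ndarrays; skip them (assumption).
    parts = [p for p in (closeness, period, trend) if p.size > 0]
    merged = np.concatenate(parts, axis=2)
    if isinstance(node, str) and node == 'all':
        return merged
    # Single node: keep a station dimension of length 1 (assumption).
    return merged[:, node:node + 1]


# Distinct constants make the ordering easy to see.
closeness = np.zeros((5, 3, 6, 1))
period = np.ones((5, 3, 7, 1))
trend = np.full((5, 3, 4, 1), 2.0)

merged = concat_history(closeness, period, trend)
single = concat_history(closeness, period, trend, node=1)
no_trend = concat_history(closeness, period, np.array([]))
print(merged.shape, single.shape, no_trend.shape)
```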
6.1.2. UCTB.dataset.dataset module
class UCTB.dataset.dataset.DataSet(dataset, MergeIndex, MergeWay, city=None, data_dir=None)
Bases: object
An object storing basic data from a formatted pickle file. See also Build your own datasets.
Parameters:
- dataset (str) – A string containing the path of the dataset pickle file, or the name of the dataset.
- city (str or None) – None if dataset is a file path; otherwise the name of the city. Default: None
- data_dir – The dataset directory. If set to None, a directory will be created. If dataset is a file path, data_dir should be None too. Default: None
Attributes:
- data (dict) – The data loaded directly from the pickle file. data may have a data['contribute_data'] dict to store supplementary data.
- time_range (list) – From data['TimeRange'], in the format [YYYY-MM-DD, YYYY-MM-DD], indicating the time range of the data.
- time_fitness (int) – From data['TimeFitness'], indicating how many minutes a single time slot lasts.
- node_traffic (np.ndarray) – The main stream data of the nodes during the time range. From data['Node']['TrafficNode'], with shape [time_slot_num, node_num].
- node_monthly_interaction (np.ndarray) – The monthly interaction between pairs of nodes. Its shape is [month_num, node_num, node_num]. It comes from data['Node']['TrafficMonthlyInteraction'] and is used to build the interaction graph. It is an optional attribute and can be set to an empty list if the interaction graph is not needed.
- node_station_info (dict) – A dict storing the coordinates of nodes. It shall be formatted as {id (may be arbitrary): [id (when sorted, should be consistent with the index of node_traffic), latitude, longitude, other notes]}. It comes from data['Node']['StationInfo'] and is used to build the distance graph. It is an optional attribute and can be set to an empty list if the distance graph is not needed.
- MergeIndex (int) – An integer used to adjust the granularity of the dataset; the granularity of the new dataset is time_fitness * MergeIndex. Default: 1
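
The effect of MergeIndex on granularity can be sketched as follows. The helper merge_granularity is hypothetical and is not a description of what merge_data does internally; dropping a trailing incomplete group of time slots is an assumption. The last lines also show how the number of daily time slots follows from the slot length in minutes.

```python
import numpy as np


def merge_granularity(node_traffic, merge_index, merge_way='sum'):
    """Coarsen [time_slot_num, node_num] data by grouping merge_index slots.

    Hypothetical helper; trailing slots that do not fill a whole group are
    dropped (assumption).
    """
    usable = (len(node_traffic) // merge_index) * merge_index
    grouped = node_traffic[:usable].reshape(-1, merge_index, node_traffic.shape[1])
    if merge_way == 'sum':
        return grouped.sum(axis=1)
    if merge_way == 'average':
        return grouped.mean(axis=1)
    if merge_way == 'max':
        return grouped.max(axis=1)
    raise ValueError("merge_way must be 'sum', 'average' or 'max'")


# Four 30-minute slots for two nodes, merged pairwise into 60-minute slots.
time_fitness = 30
merge_index = 2
node_traffic = np.arange(8).reshape(4, 2)

summed = merge_granularity(node_traffic, merge_index, 'sum')
averaged = merge_granularity(node_traffic, merge_index, 'average')
maxed = merge_granularity(node_traffic, merge_index, 'max')

# New granularity in minutes, and the resulting number of slots per day.
new_time_fitness = time_fitness * merge_index
daily_slots = 24 * 60 // new_time_fitness
print(summed, averaged, maxed, new_time_fitness, daily_slots)
```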
merge_data(data, dataType)