We introduce CellularX, the first spatial-temporal dataset focusing on user-level network experience in telco networks. Whereas most existing datasets offer only cell-level Key Performance Indicators (KPIs), CellularX fills the gap with user-level, multi-dimensional KPI data. In particular, CellularX provides a synthetic dataset for simulation-to-reality (sim2real) research, addressing the scarcity of real-world data for specific scenarios. Additionally, a real-world dataset collected from almost one thousand users is open-sourced. Both datasets support user-level network experience modeling and monitoring, e.g., anomaly detection, anomaly prediction, and root cause analysis.
The potential applications of CellularX include:
1) Sim2real study: CellularX provides a controlled, flexible simulation platform together with paired real-world and simulated data. These can be used to generate low-cost training data, to understand the reality gap in sim2real learning, and to compare sim2real algorithms fairly.
2) AIOps for telco networks: CellularX provides real-world KPI data that can be viewed as a snapshot of user access and network experience. Its multi-dimensional indicators make it suitable for various AIOps tasks, including root cause analysis and anomaly prediction.
We propose CellularX, a large dataset focusing on user-level network eXperience in cellular networks. It provides two sub-datasets, CellularXsim and CellularXreal, enabling a comprehensive view of the user-level network experience.
By integrating these two aspects, CellularX provides a richer understanding of the user-level network experience in wireless cellular networks. To the best of our knowledge, CellularX is the first dataset that focuses on user-level network experience.
This dataset is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License. You are free to use, copy, modify, and distribute the dataset for non-commercial purposes as long as you provide attribution to the original author. Commercial use requires prior permission from the author. This license keeps the dataset available for academic and non-profit research while prohibiting its use for commercial gain.
Author statement: We bear all responsibility in case of violation of rights, etc., and confirm the data license.
BibTeX:
TODO
[anonymous]
[anonymous]
Category | Data |
---|---|
Dataset Name | CellularXreal, CellularXsim |
Size of Dataset | x MB, 2.0 MB |
Number of Instances | x, ~18,000 |
Above: Summary of CellularX dataset
The CellularXreal dataset consists of two parts: multidimensional KPI data and cell information, where each record corresponds to a timestamp and a sampling cell. The data contains a significant number of missing values because not all indicators are reported for each access. The published KPI data covers 982 users and eight KPI dimensions: uplink and downlink user experience rate, uplink and downlink block error rate, the number of uplink and downlink resource blocks, RSRP of the serving cell, and uplink SINR.
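Because missing values are common in CellularXreal, it can help to quantify them per KPI before modeling. The following is a minimal sketch, assuming records have already been loaded as dictionaries keyed by the KPI names listed in this card, with `None` marking a missing value. The function name `missing_ratio` and the toy records are hypothetical and not part of the dataset tooling.

```python
from typing import Dict, List, Optional


def missing_ratio(records: List[Dict[str, Optional[float]]],
                  kpi_names: List[str]) -> Dict[str, float]:
    """Return the fraction of records in which each KPI is missing (None or absent)."""
    total = len(records)
    if total == 0:
        return {name: 0.0 for name in kpi_names}
    return {
        name: sum(1 for r in records if r.get(name) is None) / total
        for name in kpi_names
    }


# Toy records for illustration only; these are not values from the dataset.
records = [
    {"RSRP": -95, "ULSINR": None},
    {"RSRP": None, "ULSINR": None},
    {"RSRP": -101, "ULSINR": 12.5},
    {"RSRP": -88, "ULSINR": None},
]
ratios = missing_ratio(records, ["RSRP", "ULSINR"])
```

A per-KPI ratio like this can guide whether to impute a KPI, drop it, or restrict analysis to records where it is present.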
CellularXsim provides real-world KPI data generated by user equipment moving along four manually designed paths, as well as corresponding simulated data generated with six simulator configurations along the same routes. CellularXsim also includes a set of real-world data generated by randomly roaming within the study area. Simulator configurations and cell information are provided. Four KPI metrics are recorded: RSSI, SINR, RSRP, and RSRQ.
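Since simulated and real-world measurements are available along the same paths, one simple way to look at the reality gap is the mean absolute difference of a KPI such as RSRP. The sketch below assumes the two sequences have already been aligned sample by sample (the alignment procedure is not specified in this card) and that missing samples are represented as `None`. The function name and the toy values are hypothetical.

```python
from typing import Optional, Sequence


def mean_absolute_gap(simulated: Sequence[Optional[float]],
                      measured: Sequence[Optional[float]]) -> float:
    """Mean absolute difference over aligned pairs where both values are present."""
    if len(simulated) != len(measured):
        raise ValueError("sequences must be aligned and equally long")
    diffs = [
        abs(s - m)
        for s, m in zip(simulated, measured)
        if s is not None and m is not None
    ]
    if not diffs:
        raise ValueError("no pairs with both values present")
    return sum(diffs) / len(diffs)


# Toy RSRP values in dBm, for illustration only.
sim_rsrp = [-90.0, -95.0, None, -100.0]
real_rsrp = [-92.0, -94.0, -97.0, None]
gap = mean_absolute_gap(sim_rsrp, real_rsrp)
```

Computing the same quantity for each of the six simulator configurations would give one rough basis for comparing them, though other gap measures may be more appropriate for a given task.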
Intentionally Collected Sensitive Data
No sensitive data was intentionally collected.
Unintentionally Collected Sensitive Data
All user devices in the dataset have been anonymized, and the dataset only contains data from cells in part of a city block, so no user's behavior or identity can be inferred, even with additional methods.
**Limited Maintenance:** The data will not be updated, but any technical issues will be addressed.
Current Version: 1.0
Last Updated: 06/2023
Release Date: N/A
CellularX was collected in the real world in a single collection effort for academic purposes, and maintenance will be limited.
Feedback: For feedback, reach out to shaoyu@tongji.edu.cn.
Below are examples of the kinds of data in the CellularXreal dataset.
Below are examples of the kinds of data in the CellularXsim dataset.
KPI Name | Description | Unit | Type |
---|---|---|---|
RSRP | RSRP of the serving cell | dBm | Integer |
ULThrp | Uplink experience rate | Mbps | Float |
DLThrp | Downlink experience rate | Mbps | Float |
DLPrbNum | The number of downlink resource blocks | / | Integer |
ULPrbNum | The number of uplink resource blocks | / | Integer |
DLBLER | Downlink block error rate | % | Float |
ULBLER | Uplink block error rate | % | Float |
ULSINR | Uplink SINR | dB | Float |
KPI Name | Description | Unit | Type |
---|---|---|---|
CellID | ID of the cell | / | String |
Average RSRP | Average RSRP of the serving cell | dBm | Float |
lon | Longitude of the cell | / | Float |
lat | Latitude of the cell | / | Float |
azimuth | Azimuth of outdoor cell, or indoor cell | / | String |
KPI Name | Description | Unit | Type |
---|---|---|---|
# Receiver Point (#) | Index of sample point in the path | / | Integer |
X(m) | Coordinates of sampling points | m | Float |
Y(m) | Coordinates of sampling points | m | Float |
Z(m) | Coordinates of sampling points | m | Float |
Distance (m) | The distance from the sampling point to the start of the path | m | Float |
Strongest Power (dBm) | Maximum signal power at sampling point | dBm | Float |
Total Power With Phase (dBm) | The total power at the sampling point | dBm | Float |
Best SINR (dB) | Best SINR at sampling point | dB | Float |
RSSI (dBm) | Received Signal Strength Indicator | dBm | Float |
RSRP (dBm) | Reference Signal Received Power | dBm | Float |
RSRQ (dB) | Reference Signal Received Quality | dB | Float |
Strongest power transmitter (Tx #) | Base station index with maximum power | / | Integer |
KPI Name | Description | Unit | Type |
---|---|---|---|
LATITUDE | Latitude of the sample point | / | Float |
LONGITUDE | Longitude of the sample point | / | Float |
TYPE | Type of cell | / | String |
TAC | Tracking Area Code | / | Integer |
PCI | Physical Cell Identifier | / | Integer |
ECI | E-UTRAN Cell Identifier | / | Integer |
EARFCN | E-UTRA Absolute Radio Frequency Channel Number | / | Integer |
RSSI | Received Signal Strength Indicator | dBm | Float |
RSRP | Reference Signal Received Power | dBm | Float |
RSRQ | Reference Signal Received Quality | dB | Float |
SINR | Signal to Interference plus Noise Ratio | / | Float |
KPI Name | Description | Unit | Type |
---|---|---|---|
cell(ECI) | E-UTRAN Cell Identifier | / | Integer |
lat | Latitude of the cell | / | Float |
lon | Longitude of the cell | / | Float |
radius | Positioning error radius | m | Integer |
BSIndexInSimulator | Index of base station in Wireless InSite simulator | / | Integer |
BSNameInSimulator | Name of base station in Wireless InSite simulator | / | Integer |
AIOps, Prediction, Anomaly Detection, Root Cause Analysis
Artificially Generated:
Static: Data was collected once from single or multiple sources.
The raw data of cellularXreal contains a large number of data points filled with default values, indicating that they were not collected. To avoid ambiguity, we replaced these default values with null values.
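The replacement of default values with nulls can be sketched as follows. This is only an illustration under stated assumptions: the actual default values used in the raw data are not documented in this card, so the sentinel `-9999.0` below is hypothetical, as are the function names `clean_value` and `clean_record`. The published data already has this step applied.

```python
from typing import Dict, List, Optional, Set

# Hypothetical sentinel; the real default values in the raw data are not documented here.
ASSUMED_DEFAULTS: Set[float] = {-9999.0}


def clean_value(raw: str, defaults: Set[float] = ASSUMED_DEFAULTS) -> Optional[float]:
    """Parse a raw field; return None if it is empty or equals an assumed default."""
    text = raw.strip()
    if text == "":
        return None
    value = float(text)
    return None if value in defaults else value


def clean_record(record: Dict[str, str], kpi_columns: List[str]) -> Dict[str, Optional[float]]:
    """Apply clean_value to each listed KPI column of one raw record."""
    return {name: clean_value(record.get(name, "")) for name in kpi_columns}


# Toy raw record for illustration only.
raw_record = {"RSRP": "-95", "ULThrp": "-9999", "DLThrp": ""}
cleaned = clean_record(raw_record, ["RSRP", "ULThrp", "DLThrp"])
```

Using an explicit null rather than a numeric placeholder keeps uncollected values from being mistaken for real measurements in downstream statistics.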
When studying sim2real with the CellularXsim dataset, we have the following recommendations for constructing the training, validation, and testing sets. Recall that we manually designed four paths in the study area when constructing the dataset. CellularXsim provides real-world data generated by user equipment moving along these paths, as well as corresponding simulated data generated along the same routes. We recommend using these data as the training and validation sets. CellularXsim also includes a set of real-world data generated by randomly roaming within the study area. We encourage using this randomly sampled data as the testing set to validate the effectiveness of machine learning models, so as to avoid unfair comparisons caused by optimizations tailored to the given paths.
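The recommended split can be sketched as below. The card only states that path data serve for training and validation and that random-roaming data serve for testing; holding out one whole path for validation is an assumption made here for illustration, and the function name `build_splits`, the path identifiers, and the toy samples are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_splits(path_data: Dict[str, List[str]],
                 roaming_data: List[str],
                 val_path: str) -> Tuple[List[str], List[str], List[str]]:
    """Train on all paths except val_path, validate on val_path, test on roaming data."""
    if val_path not in path_data:
        raise KeyError(f"unknown path: {val_path}")
    train = [
        sample
        for path_id, samples in path_data.items()
        if path_id != val_path
        for sample in samples
    ]
    val = list(path_data[val_path])
    test = list(roaming_data)  # never mixed into training or validation
    return train, val, test


# Toy sample labels for illustration only.
paths = {
    "path1": ["p1_s0", "p1_s1"],
    "path2": ["p2_s0"],
    "path3": ["p3_s0", "p3_s1"],
    "path4": ["p4_s0"],
}
roaming = ["r_s0", "r_s1"]
train, val, test = build_splits(paths, roaming, val_path="path4")
```

Keeping the roaming data strictly out of training and validation preserves its role as an unbiased check against path-specific tuning.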