For this challenge, three CDR (Call Detail Record) datasets will be made available by Türk Telekom, together with two files on cell tower locations.
The datasets will include one year of mobile CDR data, collected between January 2017 and December 2017.
All datasets will be stored in plain text format.
BASE TRANSCEIVER STATION LOCATIONS
The geographical coordinates (longitude, latitude) of the mobile network antennae (BTSs, Base Transceiver Stations) are given. Note that several BTSs may be co-located. Each line of this file contains the BTS ID, the ID of the district where the antenna is located, and the antenna coordinates. Each district may contain several antennae.
BTS_ID, district_ID, longitude, latitude
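Because several BTSs may share the same coordinates, it can help to group them by location before any spatial analysis. The following is a minimal sketch, assuming the file is comma-separated with the four columns listed above and an optional header row. The function name group_colocated and the sample rows are hypothetical and are not taken from the real files.

```python
import csv
from collections import defaultdict

def group_colocated(lines):
    """Map each (longitude, latitude) pair to the list of BTS IDs found there."""
    sites = defaultdict(list)
    for row in csv.reader(lines, skipinitialspace=True):
        if not row or row[0].strip() == "BTS_ID":
            continue  # skip blank lines and an optional header row
        bts_id, district_id, lon, lat = (field.strip() for field in row)
        sites[(float(lon), float(lat))].append(bts_id)
    return dict(sites)

# Made-up rows for illustration only (assumed layout, not real data).
sample = [
    "BTS_ID, district_ID, longitude, latitude",
    "1, 10, 29.01, 41.04",
    "2, 10, 29.01, 41.04",
    "3, 11, 29.05, 41.17",
]

# Keep only the locations that host more than one BTS.
colocated = {xy: ids for xy, ids in group_colocated(sample).items() if len(ids) > 1}
print(colocated)
```

BTS IDs are kept as strings here because the description does not state their exact format.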
For coarse mobility data, we do not provide individual base stations, but only district information. There are 971 districts (or prefectures) in Turkey. The base stations included in the dataset are located in approximately 481 districts across the country. The rough geometric center of each district will be provided separately.
district_ID, district_name, city_name, longitude, latitude
1, Beşiktaş, İstanbul, -17.5251,14.74683
2, Sarıyer, İstanbul, -17.5164,14.74673
DATASET 1: ANTENNA TRAFFIC
This dataset contains one year of site-to-site traffic on an hourly basis. The file Veri_Seti1_201701 (i.e. Dataset_1_2017_01, indicating dataset type, collection year and month) contains one month of voice traffic between sites and is structured as follows:
timestamp: day and hour, formatted as YYYY-MM-DD HH (rounded up to the hour, HH from 1 to 24)
outgoing_site_id: id of the site the call originated from
incoming_site_id: id of the site receiving the call
number_of_calls: the total number of calls between these two sites during this hour
refugee_calls: the number of calls between these two sites during this hour originating from numbers with refugee status
total_call_duration: the total duration of all calls between these two sites during this hour
refugee_call_duration: the total duration of calls between these two sites during this hour originating from numbers with refugee status
Similarly, the file Veri_Seti1_SMS_201701 contains one month of text traffic between sites and is structured as follows:
timestamp: day and hour, formatted as YYYY-MM-DD HH (rounded up to the hour, HH from 1 to 24)
outgoing_site_id: id of the site the SMS originated from
incoming_site_id: id of the site receiving the SMS
number_of_SMS: the total number of SMS messages between these two sites during this hour
number of SMS originated from refugees: the number of SMS messages between these two sites during this hour originating from numbers with refugee status
timestamp, outgoing_site_id, incoming_site_id, number_of_calls, refugee_calls, total_call_duration, refugee_call_duration
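The hour field runs from 1 to 24, so it cannot be parsed directly with the usual 0 to 23 hour format. The sketch below shows one way to read a voice record, assuming comma-separated fields in the order of the header above, and assuming that HH labels the end of the hourly slot (so HH = 24 covers 23:00 to midnight of the same day). Both points are assumptions, and the names VoiceRecord, parse_hour and parse_voice_line are hypothetical. Durations are read as floats because the unit is not stated.

```python
from datetime import datetime, timedelta
from typing import NamedTuple

class VoiceRecord(NamedTuple):
    hour_start: datetime
    outgoing_site_id: str
    incoming_site_id: str
    number_of_calls: int
    refugee_calls: int
    total_call_duration: float
    refugee_call_duration: float

def parse_hour(stamp: str) -> datetime:
    """Convert 'YYYY-MM-DD HH' with HH in 1..24 to the start of that hourly slot."""
    day, hour_text = stamp.strip().split()
    hour = int(hour_text)
    if not 1 <= hour <= 24:
        raise ValueError(f"hour out of range 1..24: {stamp!r}")
    # Assumption: HH labels the END of the slot, so the slot starts at HH - 1.
    return datetime.strptime(day, "%Y-%m-%d") + timedelta(hours=hour - 1)

def parse_voice_line(line: str) -> VoiceRecord:
    """Parse one comma-separated voice traffic record."""
    f = [x.strip() for x in line.split(",")]
    return VoiceRecord(parse_hour(f[0]), f[1], f[2],
                       int(f[3]), int(f[4]), float(f[5]), float(f[6]))

# Made-up record for illustration only.
record = parse_voice_line("2017-01-31 24, 12, 34, 10, 2, 600, 90")
print(record.hour_start, record.number_of_calls, record.refugee_calls)
```

If HH instead labels the start of the slot shifted by one, only the timedelta line in parse_hour needs to change.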
DATASET 2: FINE GRAINED MOBILITY
This dataset will provide the cell tower identifiers used by a group of randomly chosen active users to make phone calls and send texts. The data will be timestamped, and a particular group of users will be observed for a period of two weeks. At the end of the two-week period, a fresh sample of active users will be drawn at random. Each sample contains 3% of the refugee base plus an equal number of non-refugee users. To protect privacy, new random identifiers are chosen in every time period. Timestamps are rounded to the minute.
The phone numbers of these users are removed, and each user is assigned a unique random number instead. These numbers will start with 1 for refugees, 2 for non-refugees, and 3 for unknown. However, this indicator should be considered somewhat noisy. Among the users who are marked as refugees, there may be customers who are not refugees, and vice versa. Consequently, it will not be possible to say with 100% certainty whether a given CDR belongs to a refugee or not. There is no identifying information about the other party of the call; only the area code (1: refugee, 2: not refugee, 3: unknown) is given.
It should be noted that there are multiple mobile operators in each region. Therefore, the numbers of phone calls and conversations do not represent actual totals, although they are indicative of the total amount of communication in the region. Values of -99 or 9999 are given for missing antenna information, for instance if the other party uses a different operator.
Monthly traffic is stored in files named Veri_Seti2_201701W_In / Out for voice and Veri_Seti2_201701W_SMS_In / Out for SMS. These are structured as follows:
caller id: randomly assigned value, prefixed with a digit indicating refugee status (1: refugee, 2: non-refugee, 3: unknown)
timestamp: day and time, formatted as YYYY-MM-DD HH:MM (rounded up to the minute)
callee prefix: 1: refugee, 2: non-refugee, 3: unknown
site_id: id of site recording the call
call type: 1 for outgoing, 2 for incoming
If an incoming SMS comes from the 9333 service or from other SMS services and applications, the area code of the other party is given as 3 (unknown).
caller id, timestamp, callee prefix, site id, call type
1138, 2013-04-01 12:32, 1, 52, 1
309095, 2013-04-01 12:33, 3, -1, 2
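The field list above can be decoded along these lines. This is a minimal sketch, assuming comma-separated fields in the order of the header, with the refugee status read from the first digit of the caller id and the values -99 and 9999 treated as missing antenna information, as described earlier. The function name parse_record and the second sample row are hypothetical. Note that the status digit is noisy, so the decoded label is only an indication.

```python
from datetime import datetime

STATUS = {"1": "refugee", "2": "non-refugee", "3": "unknown"}
CALL_TYPE = {"1": "outgoing", "2": "incoming"}
MISSING_SITE_IDS = {"-99", "9999"}  # sentinel values named in the description

def parse_record(line):
    """Parse one fine-grained mobility record into a dictionary."""
    caller_id, stamp, callee_prefix, site_id, call_type = [
        x.strip() for x in line.split(",")
    ]
    return {
        "caller_id": caller_id,
        # The first digit of the random identifier encodes the (noisy) status.
        "caller_status": STATUS.get(caller_id[0], "unknown"),
        "timestamp": datetime.strptime(stamp, "%Y-%m-%d %H:%M"),
        "callee_status": STATUS.get(callee_prefix, "unknown"),
        # Map the documented sentinels to None; other values are kept as given.
        "site_id": None if site_id in MISSING_SITE_IDS else site_id,
        "call_type": CALL_TYPE.get(call_type, "unknown"),
    }

first = parse_record("1138, 2013-04-01 12:32, 1, 52, 1")
second = parse_record("309095, 2013-04-01 12:33, 3, -99, 2")  # made-up row
print(first["caller_status"], first["site_id"], first["call_type"])
print(second["caller_status"], second["site_id"], second["call_type"])
```

Site ids are kept as strings, and any value other than the two documented sentinels is passed through unchanged.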
DATASET 3: COARSE GRAINED MOBILITY
In this dataset, the trajectories of 50,000 randomly selected refugees and 50,000 randomly selected non-refugees are provided for the entire observation period, but with reduced spatial resolution.
The spatial resolution is reduced by replacing antenna identifiers with broader area identifiers, called districts or prefectures. The map of Turkey is officially divided into 971 districts; our dataset contains data from 481 of them.
The dataset is split into 12 monthly files. Veri_Seti3_201701_In/Out will contain records of the form:
caller id: randomly assigned value, prefixed with digit indicating refugee status (1: refugee, 2: non-refugee)
timestamp: day and time, formatted as YYYY-MM-DD HH:MM (rounded up to the minute)
prefecture_id: id of prefecture recording the call
caller id, timestamp, prefecture id
1138, 2013-04-01 12:32, 167
209095, 2013-04-01 12:33, 23
176202, 2013-04-01 12:33, 75
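Records of this form can be turned into per-user trajectories by sorting each user's records by time and keeping the sequence of prefectures visited. The following is a minimal sketch, assuming comma-separated fields in the order shown above. The function name build_trajectories and the sample rows are hypothetical, and collapsing consecutive repeats of the same prefecture is a design choice made here, not something prescribed by the dataset.

```python
from collections import defaultdict
from datetime import datetime

def build_trajectories(lines):
    """Return {caller_id: [prefecture_id, ...]} in chronological order."""
    visits = defaultdict(list)
    for line in lines:
        fields = [x.strip() for x in line.split(",")]
        caller_id, stamp, prefecture_id = fields[:3]  # tolerate a trailing comma
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        visits[caller_id].append((when, prefecture_id))

    trajectories = {}
    for caller_id, events in visits.items():
        events.sort()  # chronological order per user
        path = []
        for _, prefecture_id in events:
            # Record a prefecture only when it differs from the previous one.
            if not path or path[-1] != prefecture_id:
                path.append(prefecture_id)
        trajectories[caller_id] = path
    return trajectories

# Made-up rows for illustration only.
rows = [
    "1138, 2017-01-05 09:10, 167",
    "1138, 2017-01-05 08:00, 23",
    "1138, 2017-01-05 09:45, 167,",
    "209095, 2017-01-05 12:33, 23",
]
trajectories = build_trajectories(rows)
print(trajectories)
```

Since the refugee status digit is noisy, trajectories grouped by the caller id prefix should be read as approximate.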
PROTECTION OF PRIVACY
Dataset 1 contains the number and duration of calls per cell tower. There is little scope for a privacy breach from Dataset 1 alone, since it contains no personally identifiable information about the users. It can be used to study traffic patterns over the entire period, but reveals no information about individual users. This dataset enables analysis of the activity levels of different areas and makes it possible to establish communication links between areas.
Dataset 2 contains detailed call records. To protect the privacy of users, phone numbers are replaced with random numbers, and only two weeks of data are recorded for any given user. It should not be forgotten that this dataset only makes available call records from one operator in each region. The exact physical location of each call is not shared. The dataset only records the id of the cell tower that handled the call. Since calls are not always handled by the nearest cell tower (depending on how busy a tower is, and on the physical lay of the land), this adds another layer of protection.
Dataset 3 contains records for an entire year, but the physical location is indicated only very coarsely. Here too, all personal information is excluded. There is only a refugee status indicator. However, this indicator is not perfect and contains some noise. This makes it impossible to say with certainty whether a record belongs to a refugee or not.