Getting Started with CML Readers¶
In [1]:
import json
import pandas as pd
import cmlreaders as cml
Finding Files on Rhino¶
The PathFinder helper class can be used to locate files on RHINO. Its sole responsibility is to locate and return the path of a file. In many cases, a file could be stored in more than one location. In these situations, PathFinder searches the list of possible locations and returns the path where the file is first found. This implicitly assumes that the list is ordered by priority, so that the preferred location comes before any fall-back location.
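The "first location found wins" behaviour can be sketched in a few lines of plain Python. This is an illustration only and not PathFinder's actual implementation: the helper name `find_first_existing` and the directory names are hypothetical, and a temporary directory stands in for the RHINO root.

```python
import tempfile
from pathlib import Path


def find_first_existing(rootdir, candidates):
    """Return the first candidate path (relative to rootdir) that exists.

    Hypothetical helper: candidates are assumed to be ordered from the
    preferred location to the fall-back location.
    """
    for relative_path in candidates:
        full_path = Path(rootdir) / relative_path
        if full_path.exists():
            return full_path
    raise FileNotFoundError(f"None of {list(candidates)} exist under {rootdir}")


# A temporary directory stands in for the RHINO root.
with tempfile.TemporaryDirectory() as root:
    # Only the fall-back location is created.
    fallback = Path(root) / "legacy" / "pairs.json"
    fallback.parent.mkdir(parents=True)
    fallback.write_text("{}")

    # The preferred location does not exist, so the fall-back is returned.
    found = find_first_existing(root, ["preferred/pairs.json", "legacy/pairs.json"])
    found_name = found.relative_to(root).as_posix()
    print(found_name)
```

Because the loop returns on the first hit, the order of the candidate list is what encodes the priority.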
In [2]:
# If not working on RHINO, specify the mount point.
# Alternatively, set the CML_ROOT environment variable and never
# have to explicitly pass the rootdir keyword argument.
rhino_root = "/mnt/rhino/"
# Instantiate the finder object
finder = cml.PathFinder(subject="R1389J", experiment="catFR5", session=1,
                        localization=0, montage=0, rootdir=rhino_root)
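The comments in the cell above mention the CML_ROOT environment variable as an alternative to passing rootdir. A minimal sketch of how such a lookup could work is shown here. The helper `resolve_rootdir`, the precedence order (explicit argument first, then the environment variable), and the default of "/" are all assumptions made for illustration, not the library's documented behaviour.

```python
import os


def resolve_rootdir(rootdir=None, default="/"):
    """Pick a root directory (hypothetical helper).

    ASSUMPTION: an explicit argument wins over CML_ROOT, and "/" is used
    when neither is given.
    """
    if rootdir is not None:
        return rootdir
    return os.environ.get("CML_ROOT", default)


# Set the variable once, e.g. at the top of a notebook.
os.environ["CML_ROOT"] = "/mnt/rhino/"

from_env = resolve_rootdir()                       # taken from CML_ROOT
explicit = resolve_rootdir("/some/other/mount/")   # explicit argument wins
print(from_env, explicit)
```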
What can you request?¶
The PathFinder has a few built-in properties to help you understand what data types are currently supported. Different file types require that the finder be instantiated with different fields. For example, if you are planning to request localization files, there is no need to specify an experiment, session, or montage. However, it is not a problem to specify too many fields: any extraneous fields are simply ignored when the data type does not require them. The following properties are defined:
- requestable_files: All supported data types
- localization_files: Files related to localization
- montage_files: Files associated with a specific montage
- session_files: Files that are specific to a session. These files could be processed events, Ramulator files, etc.
For high-level information about each of these data types, see the Data Guide section of the documentation.
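One way to read these groupings is as a lookup from data type to the fields the finder needs. The sketch below hard-codes a few entries copied from the property outputs in this guide. The `REQUIRED_FIELDS` mapping and the `fields_needed` helper are hypothetical: the required fields are inferred from the description above and from the directory layout of the example paths, not read from the library.

```python
# A few entries copied from the property outputs shown in this guide.
FILE_GROUPS = {
    "localization_files": ("localization",),
    "montage_files": ("pairs", "contacts", "jacksheet"),
    "session_files": ("task_events", "all_events", "raw_eeg"),
}

# ASSUMPTION: inferred from the prose and example paths, not from the library.
REQUIRED_FIELDS = {
    "localization_files": ("subject", "localization"),
    "montage_files": ("subject", "localization", "montage"),
    "session_files": ("subject", "experiment", "session"),
}


def fields_needed(data_type):
    """Return the (assumed) fields needed to locate a data type."""
    for group, members in FILE_GROUPS.items():
        if data_type in members:
            return REQUIRED_FIELDS[group]
    raise KeyError(f"Unknown data type: {data_type}")


print(fields_needed("pairs"))
print(fields_needed("task_events"))
```

With a real finder, a membership test such as `'pairs' in finder.montage_files` answers the same question of which group a data type belongs to.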
In [3]:
finder.requestable_files
Out[3]:
['r1_index',
'ltp_index',
'pyfr_index',
'pyfr_root',
'localization',
'voxel_coordinates',
'prior_stim_results',
'electrode_coordinates',
'jacksheet',
'area',
'electrode_categories',
'good_leads',
'leads',
'classifier_excluded_leads',
'matlab_bipolar_talstruct',
'matlab_monopolar_talstruct',
'pairs',
'contacts',
'session_summary',
'classifier_summary',
'math_summary',
'target_selection_table',
'baseline_classifier',
'all_events',
'task_events',
'math_events',
'ps4_events',
'sources',
'processed_eeg',
'experiment_log',
'session_log',
'ramulator_session_folder',
'event_log',
'experiment_config',
'raw_eeg',
'odin_config',
'used_classifier',
'excluded_pairs',
'all_pairs']
In [4]:
finder.localization_files
Out[4]:
('localization',)
In [5]:
finder.montage_files
Out[5]:
('pairs',
'contacts',
'voxel_coordinates',
'prior_stim_results',
'electrode_coordinates',
'jacksheet',
'good_leads',
'leads',
'area',
'classifier_excluded_leads',
'electrode_categories',
'target_selection_file',
'baseline_classifier')
In [6]:
finder.session_files
Out[6]:
('session_summary',
'classifier_summary',
'math_summary',
'used_classifier',
'excluded_pairs',
'all_pairs',
'experiment_log',
'session_log',
'event_log',
'experiment_config',
'raw_eeg',
'odin_config',
'all_events',
'task_events',
'math_events',
'ps4_events')
Finding File Paths¶
In [7]:
# Find some example files
example_data_types = ['pairs', 'task_events', 'voxel_coordinates']
for data_type in example_data_types:
    print(finder.find(data_type=data_type))
/mnt/rhino/protocols/r1/subjects/R1389J/localizations/0/montages/0/neuroradiology/current_processed/pairs.json
/mnt/rhino/protocols/r1/subjects/R1389J/experiments/catFR5/sessions/1/behavioral/current_processed/task_events.json
/mnt/rhino/data10/RAM/subjects/R1389J/tal/VOX_coords_mother.txt
Identifying Available Sessions¶
CMLReaders contains a utility function for loading the JSON-formatted index files located in the protocols/ directory on RHINO as a dataframe. Once loaded, standard pandas selection idioms can be used to answer questions such as:
- What subjects completed FR1?
- What experiments did subject R1111M complete?
- How many sessions of PAL1 have been collected?
For many analyses, this will be the first step in determining the sample of subjects to be used.
In [8]:
from cmlreaders import get_data_index
In [9]:
r1_data = get_data_index(kind='r1', rootdir=rhino_root)
r1_data.head()
Out[9]:
| | Recognition | all_events | contacts | experiment | import_type | localization | math_events | montage | original_experiment | original_session | pairs | ps4_events | session | subject | subject_alias | system_version | task_events |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | protocols/r1/subjects/R1001P/experiments/FR1/s... | protocols/r1/subjects/R1001P/localizations/0/m... | FR1 | build | 0 | protocols/r1/subjects/R1001P/experiments/FR1/s... | 0 | NaN | 0 | protocols/r1/subjects/R1001P/localizations/0/m... | NaN | 0 | R1001P | R1001P | NaN | protocols/r1/subjects/R1001P/experiments/FR1/s... |
| 1 | NaN | protocols/r1/subjects/R1001P/experiments/FR1/s... | protocols/r1/subjects/R1001P/localizations/0/m... | FR1 | build | 0 | protocols/r1/subjects/R1001P/experiments/FR1/s... | 0 | NaN | 1 | protocols/r1/subjects/R1001P/localizations/0/m... | NaN | 1 | R1001P | R1001P | NaN | protocols/r1/subjects/R1001P/experiments/FR1/s... |
| 2 | NaN | protocols/r1/subjects/R1001P/experiments/FR2/s... | protocols/r1/subjects/R1001P/localizations/0/m... | FR2 | build | 0 | protocols/r1/subjects/R1001P/experiments/FR2/s... | 0 | NaN | 0 | protocols/r1/subjects/R1001P/localizations/0/m... | NaN | 0 | R1001P | R1001P | NaN | protocols/r1/subjects/R1001P/experiments/FR2/s... |
| 3 | NaN | protocols/r1/subjects/R1001P/experiments/FR2/s... | protocols/r1/subjects/R1001P/localizations/0/m... | FR2 | build | 0 | protocols/r1/subjects/R1001P/experiments/FR2/s... | 0 | NaN | 1 | protocols/r1/subjects/R1001P/localizations/0/m... | NaN | 1 | R1001P | R1001P | NaN | protocols/r1/subjects/R1001P/experiments/FR2/s... |
| 4 | NaN | protocols/r1/subjects/R1001P/experiments/PAL1/... | protocols/r1/subjects/R1001P/localizations/0/m... | PAL1 | build | 0 | protocols/r1/subjects/R1001P/experiments/PAL1/... | 0 | NaN | 0 | protocols/r1/subjects/R1001P/localizations/0/m... | NaN | 0 | R1001P | R1001P | NaN | protocols/r1/subjects/R1001P/experiments/PAL1/... |
In [10]:
# What subjects completed FR1?
fr1_subjects = r1_data[r1_data['experiment'] == 'FR1']['subject'].unique()
fr1_subjects
Out[10]:
array(['R1001P', 'R1002P', 'R1003P', 'R1006P', 'R1010J', 'R1015J',
'R1018P', 'R1020J', 'R1022J', 'R1023J', 'R1026D', 'R1027J',
'R1030J', 'R1031M', 'R1032D', 'R1033D', 'R1034D', 'R1035M',
'R1036M', 'R1039M', 'R1042M', 'R1044J', 'R1045E', 'R1048E',
'R1049J', 'R1050M', 'R1051J', 'R1052E', 'R1053M', 'R1054J',
'R1056M', 'R1057E', 'R1059J', 'R1060M', 'R1061T', 'R1062J',
'R1063C', 'R1065J', 'R1066P', 'R1067P', 'R1068J', 'R1069M',
'R1070T', 'R1074M', 'R1075J', 'R1076D', 'R1077T', 'R1080E',
'R1081J', 'R1083J', 'R1084T', 'R1086M', 'R1089P', 'R1092J',
'R1093J', 'R1094T', 'R1096E', 'R1098D', 'R1100D', 'R1101T',
'R1102P', 'R1104D', 'R1105E', 'R1106M', 'R1108J', 'R1111M',
'R1112M', 'R1113T', 'R1114C', 'R1115T', 'R1118N', 'R1120E',
'R1121M', 'R1122E', 'R1123C', 'R1124J', 'R1125T', 'R1127P',
'R1128E', 'R1129D', 'R1130M', 'R1131M', 'R1134T', 'R1135E',
'R1136N', 'R1137E', 'R1138T', 'R1142N', 'R1145J', 'R1146E',
'R1147P', 'R1148P', 'R1149N', 'R1150J', 'R1151E', 'R1153T',
'R1154D', 'R1155D', 'R1156D', 'R1158T', 'R1159P', 'R1161E',
'R1162N', 'R1163T', 'R1164E', 'R1166D', 'R1167M', 'R1168T',
'R1169P', 'R1170J', 'R1171M', 'R1172E', 'R1173J', 'R1174T',
'R1175N', 'R1176M', 'R1177M', 'R1178P', 'R1184M', 'R1185N',
'R1186P', 'R1187P', 'R1189M', 'R1191J', 'R1193T', 'R1195E',
'R1196N', 'R1198M', 'R1200T', 'R1201P', 'R1202M', 'R1203T',
'R1204T', 'R1207J', 'R1212P', 'R1214M', 'R1215M', 'R1216E',
'R1217T', 'R1221P', 'R1222M', 'R1223E', 'R1226D', 'R1228M',
'R1229M', 'R1230J', 'R1231M', 'R1232N', 'R1234D', 'R1236J',
'R1240T', 'R1241J', 'R1243T', 'R1247P', 'R1250N', 'R1251M',
'R1260D', 'R1264P', 'R1268T', 'R1274T', 'R1275D', 'R1277J',
'R1281E', 'R1283T', 'R1286J', 'R1288P', 'R1290M', 'R1291M',
'R1292E', 'R1293P', 'R1297T', 'R1298E', 'R1299T', 'R1302M',
'R1304N', 'R1306E', 'R1307N', 'R1308T', 'R1309M', 'R1310J',
'R1311T', 'R1313J', 'R1315T', 'R1316T', 'R1317D', 'R1318N',
'R1320D', 'R1321M', 'R1323T', 'R1324M', 'R1325C', 'R1328E',
'R1329T', 'R1330D', 'R1331T', 'R1332M', 'R1334T', 'R1336T',
'R1337E', 'R1338T', 'R1339D', 'R1341T', 'R1342M', 'R1345D',
'R1346T', 'R1347D', 'R1349T', 'R1350D', 'R1351M', 'R1354E',
'R1355T', 'R1358T', 'R1361C', 'R1363T', 'R1364C', 'R1367D',
'R1368T', 'R1373T', 'R1374T', 'R1375C', 'R1376D', 'R1377M',
'R1378T', 'R1379E', 'R1380D', 'R1381T', 'R1383J', 'R1384J',
'R1385E', 'R1386T', 'R1387E', 'R1390M', 'R1391T', 'R1393T',
'R1394E', 'R1395M', 'R1396T', 'R1397D', 'R1398J', 'R1401J',
'R1402E', 'R1404E', 'R1405E', 'R1406M', 'R1409D', 'R1412M',
'R1414E', 'R1415T', 'R1416T', 'R1420T', 'R1421M', 'R1422T',
'R1423E', 'R1425D', 'R1427T', 'R1431J', 'R1438M'], dtype=object)
In [11]:
# What experiments did R1111M complete?
r1111m_experiments = r1_data[r1_data['subject'] == 'R1111M']['experiment'].unique()
r1111m_experiments
Out[11]:
array(['FR1', 'FR2', 'PAL1', 'PAL2', 'PS2', 'catFR1'], dtype=object)
In [12]:
# How many sessions of PAL1 have been collected?
pal_sessions = r1_data[r1_data['experiment'] == 'PAL1']
len(pal_sessions)
Out[12]:
151
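The single-column filters above extend naturally to grouped summaries. The sketch below uses a small made-up dataframe with the same subject, experiment, and session columns as the index (the subject codes are invented), so that it runs without access to RHINO. On real data, the result of get_data_index would take the place of `toy_index`.

```python
import pandas as pd

# Toy stand-in for the index dataframe; subject codes here are made up.
toy_index = pd.DataFrame({
    "subject": ["R0001X", "R0001X", "R0001X", "R0002X", "R0002X"],
    "experiment": ["FR1", "FR1", "PAL1", "FR1", "PAL1"],
    "session": [0, 1, 0, 0, 0],
})

# Number of sessions collected per experiment (one row per session).
sessions_per_experiment = toy_index.groupby("experiment").size()
print(sessions_per_experiment)

# Number of distinct subjects per experiment.
subjects_per_experiment = toy_index.groupby("experiment")["subject"].nunique()
print(subjects_per_experiment)
```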
Loading Data¶
In most cases, the end goal is to load the data into memory rather than just locating a file or determining what data has been collected. For this, CML Readers provides the CMLReader class, which unifies the API for loading data. By default, the location is determined automatically from the file type using the PathFinder class demonstrated earlier. However, a custom path can be given with the file_path keyword. This is useful if you have data stored locally, in the same format as one of the data types supported by CMLReaders, that you would like to load and use. See the “Loading from a Custom Location” section below for an example.
Each data type has a default representation that is returned when you call the .load() method. Most users will want to use this default representation. However, if you would like to get the data in a different format, you have two options:
- Get the reader for the data type and load the data with one of its as_x methods
- Load the data as the default type and convert it manually
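The second option, converting manually, needs nothing beyond pandas when the default representation is a dataframe. The sketch below uses a tiny made-up events table (the column values are invented) in place of a real call to load, and converts it with the standard pandas methods `DataFrame.to_dict` and `DataFrame.to_records`.

```python
import pandas as pd

# Toy events table standing in for a loaded events dataframe (values invented).
toy_events = pd.DataFrame({
    "type": ["WORD", "REC_WORD"],
    "item_name": ["CAT", "CAT"],
    "mstime": [1000, 5000],
})

# Dataframe -> list of per-event dictionaries.
records = toy_events.to_dict(orient="records")
print(records[0])

# Dataframe -> NumPy record array (dropping the dataframe index).
recarray = toy_events.to_records(index=False)
print(recarray.dtype.names)
```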
In [13]:
reader = cml.CMLReader(subject="R1389J", experiment="catFR5", session=1,
                       localization=0, montage=0, rootdir=rhino_root)
Using the Default Representation¶
In [14]:
# Pandas dataframe
events_df = reader.load('task_events')
events_df.head()
Out[14]:
| | eegoffset | category | category_num | eegfile | exp_version | experiment | intrusion | is_stim | item_name | item_num | ... | recog_rt | recognized | rectime | rejected | serialpos | session | stim_list | stim_params | subject | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | X | -999 | | | catFR5 | -999 | False | X | -999 | ... | -999 | -999 | -999 | -999 | -999 | 1 | False | [] | R1389J | STIM_ARTIFACT_DETECTION_START |
| 1 | 5831 | X | -999 | R1389J_catFR5_1_28Feb18_1552.h5 | | catFR5 | -999 | False | | -1 | ... | -999 | -999 | -999 | -999 | -999 | 1 | False | [{'amplitude': 500.0, 'anode_label': 'STG6', '... | R1389J | STIM_ON |
| 2 | 7790 | X | -999 | R1389J_catFR5_1_28Feb18_1552.h5 | | catFR5 | -999 | False | | -1 | ... | -999 | -999 | -999 | -999 | -999 | 1 | False | [{'amplitude': 500.0, 'anode_label': 'STG6', '... | R1389J | STIM_ON |
| 3 | 9786 | X | -999 | R1389J_catFR5_1_28Feb18_1552.h5 | | catFR5 | -999 | False | | -1 | ... | -999 | -999 | -999 | -999 | -999 | 1 | False | [{'amplitude': 500.0, 'anode_label': 'STG6', '... | R1389J | STIM_ON |
| 4 | 11782 | X | -999 | R1389J_catFR5_1_28Feb18_1552.h5 | | catFR5 | -999 | False | | -1 | ... | -999 | -999 | -999 | -999 | -999 | 1 | False | [{'amplitude': 500.0, 'anode_label': 'STG6', '... | R1389J | STIM_ON |
5 rows × 28 columns
In [15]:
# Python dictionary
electrode_categories_dict = reader.load('electrode_categories')
electrode_categories_dict
Out[15]:
{'interictal': ['ONEMC 9',
'ONEMC10',
'ONEMC8',
'SMC3',
'SMC4',
'STG 7 ',
'STG8',
'TWOSTG 4'],
'brain_lesion': ['FOURSC', 'ONESC', 'ONNEMC', 'THREESC', 'TWOMC', 'TWOSC'],
'bad_channel': ['NONE'],
'soz': []}
Using the Underlying Reader¶
In [16]:
# Ask CMLReader to give back the reader instead of the data
event_reader = reader.get_reader('task_events')
type(event_reader)
Out[16]:
cmlreaders.readers.readers.EventReader
In [17]:
# Load the task events as a dictionary instead of the default representation
event_dict = event_reader.as_dict()
event_dict[:1]
Out[17]:
[{'eegoffset': -1,
'category': 'X',
'category_num': -999,
'eegfile': '',
'exp_version': '',
'experiment': 'catFR5',
'intrusion': -999,
'is_stim': False,
'item_name': 'X',
'item_num': -999,
'list': -999,
'montage': 0,
'msoffset': -1,
'mstime': -1,
'phase': '',
'protocol': 'r1',
'recalled': False,
'recog_resp': -999,
'recog_rt': -999,
'recognized': -999,
'rectime': -999,
'rejected': -999,
'serialpos': -999,
'session': 1,
'stim_list': False,
'stim_params': [],
'subject': 'R1389J',
'type': 'STIM_ARTIFACT_DETECTION_START'}]
In [18]:
# Load the task events as a recarray (not recommended)
event_recarray = event_reader.as_recarray()
event_recarray[0]
Out[18]:
(0, -1, 'X', -999, '', '', 'catFR5', -999, False, 'X', -999, -999, 0, -1, -1, '', 'r1', False, -999, -999, -999, -999, -999, -999, 1, False, [], 'R1389J', 'STIM_ARTIFACT_DETECTION_START')
Examples¶
Loading Multiple Sessions¶
CMLReaders is currently designed for loading events one session at a time. This may change in the future, but in the interim, it is straightforward to load multiple sessions' worth of events.
In [19]:
# Find all sessions of FR6 that subject R1409D completed
sessions_completed = r1_data[(r1_data['subject'] == 'R1409D') &
                             (r1_data['experiment'] == 'FR6')]['session'].unique()
sessions_completed
Out[19]:
array([0, 2, 3, 9])
In [20]:
# Verbose method
all_events = []
for session in sessions_completed:
    sess_events = cml.CMLReader(subject="R1409D", experiment="FR6", session=session,
                                localization=0, montage=0, rootdir=rhino_root).load('task_events')
    all_events.append(sess_events)
all_sessions_df = pd.concat(all_events)
len(all_sessions_df)
Out[20]:
4527
In [21]:
# Same operation, but with list comprehension
all_sessions_df = pd.concat([
    cml.CMLReader(subject='R1409D', experiment='FR6', session=session,
                  rootdir=rhino_root).load('events')
    for session in sessions_completed
])
len(all_sessions_df)
Out[21]:
4527
We can also use the special load_events classmethod to load events from multiple subjects and/or experiments.
In [22]:
aggregate_events = cml.CMLReader.load_events(subjects=["R1111M", "R1409D"],
                                             experiments=["FR1"],
                                             rootdir=rhino_root)
aggregate_events.subject.unique()
Out[22]:
array(['R1111M', 'R1409D'], dtype=object)
Loading EEG¶
Loading EEG data is a bit more complicated than loading other data types. For this reason, rather than using the general load method, we instead use load_eeg, which takes special keyword arguments. As always, start by instantiating a CMLReader:
In [23]:
reader = cml.CMLReader(subject='R1111M', experiment='FR1', session=0, rootdir=rhino_root)
Reading a full session¶
If we give no parameters to reader.load_eeg, then by default all data for an entire session will be loaded as the data were recorded:
In [24]:
full_session_eeg = reader.load_eeg()
full_session_eeg.shape
/Users/depalati/src/cmlreaders/cmlreaders/path_finder.py:225: MultiplePathsFoundWarning: Multiple files found. Returning the first file found
'file found', MultiplePathsFoundWarning)
Out[24]:
(1, 100, 1623160)
Reading from events¶
To select EEG epochs based on events, first load the events and use standard pandas idioms to select events of interest:
In [25]:
events = reader.load('events')
word_events = events[events['type'] == 'WORD']
In addition to passing the events, we also need to specify relative start and stop times in milliseconds. Below, we will load data for each word onset starting at the time the word appeared and ending 100 ms later:
In [26]:
word_event_eeg = reader.load_eeg(events=word_events, rel_start=0, rel_stop=100)
word_event_eeg.shape
/Users/depalati/src/cmlreaders/cmlreaders/path_finder.py:225: MultiplePathsFoundWarning: Multiple files found. Returning the first file found
'file found', MultiplePathsFoundWarning)
Out[26]:
(288, 100, 50)
Reading from multiple sessions¶
Building upon aggregate event loading, we can also load EEG data from multiple sessions. The caveat here is that we are restricted to a single subject:
In [27]:
all_fr1_events = reader.load_events(subjects=["R1111M"], experiments=["FR1"], rootdir=rhino_root)
fr1_words = all_fr1_events[all_fr1_events["type"] == "WORD"]
fr1_eeg = reader.load_eeg(events=fr1_words, rel_start=-100, rel_stop=100)
fr1_eeg.shape
/Users/depalati/src/cmlreaders/cmlreaders/path_finder.py:225: MultiplePathsFoundWarning: Multiple files found. Returning the first file found
'file found', MultiplePathsFoundWarning)
Out[27]:
(1020, 100, 100)
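The last axis of the shapes above can be checked with simple arithmetic. The sketch below assumes the axes are ordered (events, channels, samples) and that the recording was sampled at 500 Hz. Both are inferences from the outputs in this guide (a 100 ms window giving 50 samples, a 200 ms window giving 100), not values read from the data, and the helper `expected_samples` is hypothetical.

```python
def expected_samples(rel_start_ms, rel_stop_ms, sample_rate_hz):
    """Number of samples in an epoch from rel_start to rel_stop (milliseconds)."""
    duration_ms = rel_stop_ms - rel_start_ms
    return int(round(duration_ms * sample_rate_hz / 1000))


# ASSUMPTION: 500 Hz is inferred from the shapes shown in this guide.
sample_rate_hz = 500

print(expected_samples(0, 100, sample_rate_hz))     # 50
print(expected_samples(-100, 100, sample_rate_hz))  # 100
```

A mismatch between this estimate and the returned shape would be a hint that the sample rate or the time units are not what was assumed.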
Converting EEG Representation¶
By default, CMLReaders uses a simple container for time series data. However, we provide two utility methods for converting this representation to the formats understood by PTSA and MNE, since these are common libraries for working with EEG data.
In [28]:
ptsa_eeg = word_event_eeg.to_ptsa()
type(ptsa_eeg)
Out[28]:
ptsa.data.timeseries.TimeSeries
In [29]:
mne_eeg = word_event_eeg.to_mne()
type(mne_eeg)
Out[29]:
mne.epochs.EpochsArray