Getting Started with CML Readers

In [1]:
import json
import pandas as pd
import cmlreaders as cml

Finding Files on Rhino

The PathFinder helper class can be used to locate files on RHINO. Its sole responsibility is to locate a file and return its path. In many cases, a file could be in more than one location. In these situations, PathFinder searches the list of possible locations in order and returns the first path where the file is found. This assumes that the list is ordered by priority, so that the preferred location comes before any fall-back location.
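
The first-match search can be pictured with a short, standard-library sketch. It is not the PathFinder implementation: the helper find_first_existing and the two candidate locations are hypothetical and exist only to illustrate why the order of the candidates matters.

```python
import tempfile
from pathlib import Path


def find_first_existing(rootdir, candidates):
    """Return the first candidate path that exists under rootdir.

    `candidates` is ordered by priority: preferred locations first,
    fall-back locations last. (Hypothetical helper, not part of cmlreaders.)
    """
    for relative in candidates:
        path = Path(rootdir) / relative
        if path.exists():
            return path
    raise FileNotFoundError(f"none of {len(candidates)} candidate paths exist")


# Build a throwaway directory tree in which only the fall-back file exists.
with tempfile.TemporaryDirectory() as root:
    fallback = Path(root) / "legacy" / "pairs.json"
    fallback.parent.mkdir(parents=True)
    fallback.write_text("{}")

    found = find_first_existing(root, ["preferred/pairs.json", "legacy/pairs.json"])
    found_name = found.relative_to(root).as_posix()

print(found_name)  # legacy/pairs.json
```

Because the preferred location is missing in this toy tree, the search falls through to the second candidate. Had both files existed, the first one in the list would have been returned.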

In [2]:
# If not working on RHINO, specify the mount point.
# Alternatively, set the CML_ROOT environment variable and never
# have to explicitly pass the rootdir keyword argument.
rhino_root = "/mnt/rhino/"

# Instantiate the finder object
finder = cml.PathFinder(subject="R1389J", experiment="catFR5", session=1,
                        localization=0, montage=0, rootdir=rhino_root)
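
The comment above mentions two ways of supplying the mount point. One plausible way to combine them is sketched below. The helper resolve_rootdir, the precedence order (explicit argument first, then CML_ROOT), and the default of "/" are assumptions made for illustration, not a description of the library's internals.

```python
import os


def resolve_rootdir(rootdir=None, default="/"):
    """Pick a root directory: explicit argument, then CML_ROOT, then default.

    Hypothetical helper; the precedence and the default are assumptions.
    """
    if rootdir is not None:
        return rootdir
    return os.environ.get("CML_ROOT", default)


# Setting the variable here only affects the current Python process.
os.environ["CML_ROOT"] = "/mnt/rhino/"

from_env = resolve_rootdir()          # no argument: use the environment variable
explicit = resolve_rootdir("/data/")  # explicit argument takes precedence
print(from_env, explicit)  # /mnt/rhino/ /data/
```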

What can you request?

The PathFinder has a few built-in properties that show which data types are currently supported. Different file types require the finder to be instantiated with different fields. For example, if you only plan to request localization files, there is no need to specify an experiment, session, or montage. Specifying extra fields is not a problem: any field that a data type does not require is simply ignored. The following properties are defined:

  • requestable_files: All supported data types
  • localization_files: Files related to localization
  • montage_files: Files associated with a specific montage
  • session_files: Files that are specific to a session. These files could be processed events, Ramulator files, etc.

For high-level information about each of these data types, see the Data Guide section of the documentation.

In [3]:
finder.requestable_files
Out[3]:
['r1_index',
 'ltp_index',
 'pyfr_index',
 'pyfr_root',
 'localization',
 'voxel_coordinates',
 'prior_stim_results',
 'electrode_coordinates',
 'jacksheet',
 'area',
 'electrode_categories',
 'good_leads',
 'leads',
 'classifier_excluded_leads',
 'matlab_bipolar_talstruct',
 'matlab_monopolar_talstruct',
 'pairs',
 'contacts',
 'session_summary',
 'classifier_summary',
 'math_summary',
 'target_selection_table',
 'baseline_classifier',
 'all_events',
 'task_events',
 'math_events',
 'ps4_events',
 'sources',
 'processed_eeg',
 'experiment_log',
 'session_log',
 'ramulator_session_folder',
 'event_log',
 'experiment_config',
 'raw_eeg',
 'odin_config',
 'used_classifier',
 'excluded_pairs',
 'all_pairs']
In [4]:
finder.localization_files
Out[4]:
('localization',)
In [5]:
finder.montage_files
Out[5]:
('pairs',
 'contacts',
 'voxel_coordinates',
 'prior_stim_results',
 'electrode_coordinates',
 'jacksheet',
 'good_leads',
 'leads',
 'area',
 'classifier_excluded_leads',
 'electrode_categories',
 'target_selection_file',
 'baseline_classifier')
In [6]:
finder.session_files
Out[6]:
('session_summary',
 'classifier_summary',
 'math_summary',
 'used_classifier',
 'excluded_pairs',
 'all_pairs',
 'experiment_log',
 'session_log',
 'event_log',
 'experiment_config',
 'raw_eeg',
 'odin_config',
 'all_events',
 'task_events',
 'math_events',
 'ps4_events')
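
To find out which group a data type belongs to, a simple membership lookup is enough. The sketch below uses abbreviated copies of the tuples printed above so that it runs without access to RHINO; with a live finder you would pass finder.localization_files, finder.montage_files, and finder.session_files instead. The helper group_of is hypothetical.

```python
# Abbreviated copies of the groups printed above.
localization_files = ("localization",)
montage_files = ("pairs", "contacts", "jacksheet")
session_files = ("task_events", "math_events", "raw_eeg")

groups = {
    "localization": localization_files,
    "montage": montage_files,
    "session": session_files,
}


def group_of(data_type):
    """Return the name of the group that lists data_type, or None."""
    for name, members in groups.items():
        if data_type in members:
            return name
    return None


print(group_of("pairs"))        # montage
print(group_of("task_events"))  # session
print(group_of("r1_index"))     # None
```

The last lookup returns None because some requestable data types, such as r1_index, appear in requestable_files but in none of the three group listings shown above.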

Finding File Paths

In [7]:
# Find some example files
example_data_types = ['pairs', 'task_events', 'voxel_coordinates']
for data_type in example_data_types:
    print(finder.find(data_type=data_type))
/mnt/rhino/protocols/r1/subjects/R1389J/localizations/0/montages/0/neuroradiology/current_processed/pairs.json
/mnt/rhino/protocols/r1/subjects/R1389J/experiments/catFR5/sessions/1/behavioral/current_processed/task_events.json
/mnt/rhino/data10/RAM/subjects/R1389J/tal/VOX_coords_mother.txt

Identifying Available Sessions

CMLReaders contains a utility function that loads the JSON-formatted index files located in the protocols/ directory on RHINO as a dataframe. Once loaded, the standard pandas selection idioms can be used to answer questions such as:

  1. What subjects completed FR1?
  2. What experiments did subject R1111M complete?
  3. How many sessions of PAL1 have been collected?

For many analyses, this will be the first step in determining the sample of subjects to be used.
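
To make the idea of turning a nested index into a table concrete, here is a standard-library sketch. The nesting used for the toy index (protocols, then subjects, experiments, and sessions) is an assumption about the file layout, and flatten_index is a hypothetical helper rather than the function the library uses. The toy entries carry only an import_type field.

```python
import json

# A toy index with an assumed nesting (subjects -> experiments -> sessions).
index_text = """
{"protocols": {"r1": {"subjects": {
  "R1001P": {"experiments": {
    "FR1": {"sessions": {"0": {"import_type": "build"},
                         "1": {"import_type": "build"}}},
    "PAL1": {"sessions": {"0": {"import_type": "build"}}}
  }}
}}}}
"""


def flatten_index(index, protocol):
    """Turn the nested index into one flat record per session."""
    rows = []
    subjects = index["protocols"][protocol]["subjects"]
    for subject, subject_entry in subjects.items():
        for experiment, exp_entry in subject_entry["experiments"].items():
            for session, fields in exp_entry["sessions"].items():
                row = {"subject": subject, "experiment": experiment,
                       "session": int(session)}
                row.update(fields)
                rows.append(row)
    return rows


rows = flatten_index(json.loads(index_text), "r1")
fr1_subjects = sorted({r["subject"] for r in rows if r["experiment"] == "FR1"})
print(len(rows), fr1_subjects)  # 3 ['R1001P']
```

A list of flat records like rows can be passed directly to pandas.DataFrame, which yields one row per session and one column per field.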

In [8]:
from cmlreaders import get_data_index
In [9]:
r1_data = get_data_index(kind='r1', rootdir=rhino_root)
r1_data.head()
Out[9]:
Recognition all_events contacts experiment import_type localization math_events montage original_experiment original_session pairs ps4_events session subject subject_alias system_version task_events
0 NaN protocols/r1/subjects/R1001P/experiments/FR1/s... protocols/r1/subjects/R1001P/localizations/0/m... FR1 build 0 protocols/r1/subjects/R1001P/experiments/FR1/s... 0 NaN 0 protocols/r1/subjects/R1001P/localizations/0/m... NaN 0 R1001P R1001P NaN protocols/r1/subjects/R1001P/experiments/FR1/s...
1 NaN protocols/r1/subjects/R1001P/experiments/FR1/s... protocols/r1/subjects/R1001P/localizations/0/m... FR1 build 0 protocols/r1/subjects/R1001P/experiments/FR1/s... 0 NaN 1 protocols/r1/subjects/R1001P/localizations/0/m... NaN 1 R1001P R1001P NaN protocols/r1/subjects/R1001P/experiments/FR1/s...
2 NaN protocols/r1/subjects/R1001P/experiments/FR2/s... protocols/r1/subjects/R1001P/localizations/0/m... FR2 build 0 protocols/r1/subjects/R1001P/experiments/FR2/s... 0 NaN 0 protocols/r1/subjects/R1001P/localizations/0/m... NaN 0 R1001P R1001P NaN protocols/r1/subjects/R1001P/experiments/FR2/s...
3 NaN protocols/r1/subjects/R1001P/experiments/FR2/s... protocols/r1/subjects/R1001P/localizations/0/m... FR2 build 0 protocols/r1/subjects/R1001P/experiments/FR2/s... 0 NaN 1 protocols/r1/subjects/R1001P/localizations/0/m... NaN 1 R1001P R1001P NaN protocols/r1/subjects/R1001P/experiments/FR2/s...
4 NaN protocols/r1/subjects/R1001P/experiments/PAL1/... protocols/r1/subjects/R1001P/localizations/0/m... PAL1 build 0 protocols/r1/subjects/R1001P/experiments/PAL1/... 0 NaN 0 protocols/r1/subjects/R1001P/localizations/0/m... NaN 0 R1001P R1001P NaN protocols/r1/subjects/R1001P/experiments/PAL1/...
In [10]:
# What subjects completed FR1?
fr1_subjects = r1_data[r1_data['experiment'] == 'FR1']['subject'].unique()
fr1_subjects
Out[10]:
array(['R1001P', 'R1002P', 'R1003P', 'R1006P', 'R1010J', 'R1015J',
       'R1018P', 'R1020J', 'R1022J', 'R1023J', 'R1026D', 'R1027J',
       'R1030J', 'R1031M', 'R1032D', 'R1033D', 'R1034D', 'R1035M',
       'R1036M', 'R1039M', 'R1042M', 'R1044J', 'R1045E', 'R1048E',
       'R1049J', 'R1050M', 'R1051J', 'R1052E', 'R1053M', 'R1054J',
       'R1056M', 'R1057E', 'R1059J', 'R1060M', 'R1061T', 'R1062J',
       'R1063C', 'R1065J', 'R1066P', 'R1067P', 'R1068J', 'R1069M',
       'R1070T', 'R1074M', 'R1075J', 'R1076D', 'R1077T', 'R1080E',
       'R1081J', 'R1083J', 'R1084T', 'R1086M', 'R1089P', 'R1092J',
       'R1093J', 'R1094T', 'R1096E', 'R1098D', 'R1100D', 'R1101T',
       'R1102P', 'R1104D', 'R1105E', 'R1106M', 'R1108J', 'R1111M',
       'R1112M', 'R1113T', 'R1114C', 'R1115T', 'R1118N', 'R1120E',
       'R1121M', 'R1122E', 'R1123C', 'R1124J', 'R1125T', 'R1127P',
       'R1128E', 'R1129D', 'R1130M', 'R1131M', 'R1134T', 'R1135E',
       'R1136N', 'R1137E', 'R1138T', 'R1142N', 'R1145J', 'R1146E',
       'R1147P', 'R1148P', 'R1149N', 'R1150J', 'R1151E', 'R1153T',
       'R1154D', 'R1155D', 'R1156D', 'R1158T', 'R1159P', 'R1161E',
       'R1162N', 'R1163T', 'R1164E', 'R1166D', 'R1167M', 'R1168T',
       'R1169P', 'R1170J', 'R1171M', 'R1172E', 'R1173J', 'R1174T',
       'R1175N', 'R1176M', 'R1177M', 'R1178P', 'R1184M', 'R1185N',
       'R1186P', 'R1187P', 'R1189M', 'R1191J', 'R1193T', 'R1195E',
       'R1196N', 'R1198M', 'R1200T', 'R1201P', 'R1202M', 'R1203T',
       'R1204T', 'R1207J', 'R1212P', 'R1214M', 'R1215M', 'R1216E',
       'R1217T', 'R1221P', 'R1222M', 'R1223E', 'R1226D', 'R1228M',
       'R1229M', 'R1230J', 'R1231M', 'R1232N', 'R1234D', 'R1236J',
       'R1240T', 'R1241J', 'R1243T', 'R1247P', 'R1250N', 'R1251M',
       'R1260D', 'R1264P', 'R1268T', 'R1274T', 'R1275D', 'R1277J',
       'R1281E', 'R1283T', 'R1286J', 'R1288P', 'R1290M', 'R1291M',
       'R1292E', 'R1293P', 'R1297T', 'R1298E', 'R1299T', 'R1302M',
       'R1304N', 'R1306E', 'R1307N', 'R1308T', 'R1309M', 'R1310J',
       'R1311T', 'R1313J', 'R1315T', 'R1316T', 'R1317D', 'R1318N',
       'R1320D', 'R1321M', 'R1323T', 'R1324M', 'R1325C', 'R1328E',
       'R1329T', 'R1330D', 'R1331T', 'R1332M', 'R1334T', 'R1336T',
       'R1337E', 'R1338T', 'R1339D', 'R1341T', 'R1342M', 'R1345D',
       'R1346T', 'R1347D', 'R1349T', 'R1350D', 'R1351M', 'R1354E',
       'R1355T', 'R1358T', 'R1361C', 'R1363T', 'R1364C', 'R1367D',
       'R1368T', 'R1373T', 'R1374T', 'R1375C', 'R1376D', 'R1377M',
       'R1378T', 'R1379E', 'R1380D', 'R1381T', 'R1383J', 'R1384J',
       'R1385E', 'R1386T', 'R1387E', 'R1390M', 'R1391T', 'R1393T',
       'R1394E', 'R1395M', 'R1396T', 'R1397D', 'R1398J', 'R1401J',
       'R1402E', 'R1404E', 'R1405E', 'R1406M', 'R1409D', 'R1412M',
       'R1414E', 'R1415T', 'R1416T', 'R1420T', 'R1421M', 'R1422T',
       'R1423E', 'R1425D', 'R1427T', 'R1431J', 'R1438M'], dtype=object)
In [11]:
# What experiments did R1111M complete?
r1111m_experiments = r1_data[r1_data['subject'] == 'R1111M']['experiment'].unique()
r1111m_experiments
Out[11]:
array(['FR1', 'FR2', 'PAL1', 'PAL2', 'PS2', 'catFR1'], dtype=object)
In [12]:
# How many sessions of PAL1 have been collected?
pal_sessions = r1_data[r1_data['experiment'] == 'PAL1']
len(pal_sessions)
Out[12]:
151

Loading Data

In most cases, the end goal is to load the data into memory rather than just locating a file or determining what data has been collected. For this, CMLReaders provides a class that unifies the API for loading data. By default, the location is determined automatically from the data type using the PathFinder class demonstrated earlier. However, a custom path can be given with the file_path keyword. This is useful if you have data stored locally, in the same format as one of the data types supported by CMLReaders, that you would like to load and use. See the “Loading from a Custom Location” section below for an example.

Each data type has a default representation that is returned when you call the .load() method. Most users will want to use this default representation. However, if you would like to get the data in a different format, you have two options:

  1. Get the reader for the data type and load the data with one of its as_x methods (for example, as_dict)
  2. Load the data as the default type and convert it manually
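
The relationship between a default representation and the as_x methods can be illustrated with a toy class. ToyEventReader and its method names are hypothetical and far simpler than the real readers; the sketch only shows the general pattern of load() dispatching to one of several converters.

```python
class ToyEventReader:
    """Toy reader: one default representation plus explicit as_x converters."""

    default_representation = "records"

    def __init__(self, raw):
        self._raw = raw  # list of event dicts, standing in for a file's content

    def as_records(self):
        """Return a list of per-event dictionaries."""
        return [dict(event) for event in self._raw]

    def as_columns(self):
        """Return a dictionary mapping each field to a list of values."""
        fields = self._raw[0].keys() if self._raw else []
        return {f: [event[f] for event in self._raw] for f in fields}

    def load(self):
        """Dispatch to the converter named by default_representation."""
        return getattr(self, "as_" + self.default_representation)()


toy_reader = ToyEventReader([{"type": "WORD", "session": 1},
                             {"type": "STIM_ON", "session": 1}])
default = toy_reader.load()        # the default representation
columns = toy_reader.as_columns()  # option 1: a different as_x method
print(columns["type"])  # ['WORD', 'STIM_ON']
```
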
In [13]:
reader = cml.CMLReader(subject="R1389J", experiment="catFR5", session=1,
                       localization=0, montage=0, rootdir=rhino_root)

Using the Default Representation

In [14]:
# Pandas dataframe
events_df = reader.load('task_events')
events_df.head()
Out[14]:
eegoffset category category_num eegfile exp_version experiment intrusion is_stim item_name item_num ... recog_rt recognized rectime rejected serialpos session stim_list stim_params subject type
0 -1 X -999 catFR5 -999 False X -999 ... -999 -999 -999 -999 -999 1 False [] R1389J STIM_ARTIFACT_DETECTION_START
1 5831 X -999 R1389J_catFR5_1_28Feb18_1552.h5 catFR5 -999 False -1 ... -999 -999 -999 -999 -999 1 False [{'amplitude': 500.0, 'anode_label': 'STG6', '... R1389J STIM_ON
2 7790 X -999 R1389J_catFR5_1_28Feb18_1552.h5 catFR5 -999 False -1 ... -999 -999 -999 -999 -999 1 False [{'amplitude': 500.0, 'anode_label': 'STG6', '... R1389J STIM_ON
3 9786 X -999 R1389J_catFR5_1_28Feb18_1552.h5 catFR5 -999 False -1 ... -999 -999 -999 -999 -999 1 False [{'amplitude': 500.0, 'anode_label': 'STG6', '... R1389J STIM_ON
4 11782 X -999 R1389J_catFR5_1_28Feb18_1552.h5 catFR5 -999 False -1 ... -999 -999 -999 -999 -999 1 False [{'amplitude': 500.0, 'anode_label': 'STG6', '... R1389J STIM_ON

5 rows × 28 columns

In [15]:
# Python dictionary
electrode_categories_dict = reader.load('electrode_categories')
electrode_categories_dict
Out[15]:
{'interictal': ['ONEMC 9',
  'ONEMC10',
  'ONEMC8',
  'SMC3',
  'SMC4',
  'STG 7 ',
  'STG8',
  'TWOSTG 4'],
 'brain_lesion': ['FOURSC', 'ONESC', 'ONNEMC', 'THREESC', 'TWOMC', 'TWOSC'],
 'bad_channel': ['NONE'],
 'soz': []}

Using the Underlying Reader

In [16]:
# Ask CMLReader to give back the reader instead of the data
event_reader = reader.get_reader('task_events')
type(event_reader)
Out[16]:
cmlreaders.readers.readers.EventReader
In [17]:
# Load the task events as a list of dictionaries instead of the default representation
event_dict = event_reader.as_dict()
event_dict[:1]
Out[17]:
[{'eegoffset': -1,
  'category': 'X',
  'category_num': -999,
  'eegfile': '',
  'exp_version': '',
  'experiment': 'catFR5',
  'intrusion': -999,
  'is_stim': False,
  'item_name': 'X',
  'item_num': -999,
  'list': -999,
  'montage': 0,
  'msoffset': -1,
  'mstime': -1,
  'phase': '',
  'protocol': 'r1',
  'recalled': False,
  'recog_resp': -999,
  'recog_rt': -999,
  'recognized': -999,
  'rectime': -999,
  'rejected': -999,
  'serialpos': -999,
  'session': 1,
  'stim_list': False,
  'stim_params': [],
  'subject': 'R1389J',
  'type': 'STIM_ARTIFACT_DETECTION_START'}]
In [18]:
# Load the task events as a recarray (not recommended)
event_recarray = event_reader.as_recarray()
event_recarray[0]
Out[18]:
(0, -1, 'X', -999, '', '', 'catFR5', -999, False, 'X', -999, -999, 0, -1, -1, '', 'r1', False, -999, -999, -999, -999, -999, -999, 1, False, [], 'R1389J', 'STIM_ARTIFACT_DETECTION_START')

Examples

Loading Multiple Sessions

CMLReaders is currently designed for loading events one session at a time. This may change in the future, but in the interim, it is straightforward to load multiple sessions' worth of events.

In [19]:
# Find all sessions of FR6 that subject R1409D completed
sessions_completed = r1_data[(r1_data['subject'] == 'R1409D') &
                             (r1_data['experiment'] == 'FR6')]['session'].unique()
sessions_completed
Out[19]:
array([0, 2, 3, 9])
In [20]:
# Verbose method
all_events = []
for session in sessions_completed:
    sess_events = cml.CMLReader(subject="R1409D", experiment="FR6", session=session,
                                localization=0, montage=0, rootdir=rhino_root).load('task_events')
    all_events.append(sess_events)

all_sessions_df = pd.concat(all_events)
len(all_sessions_df)
Out[20]:
4527
In [21]:
# Same operation, but with a list comprehension
all_sessions_df = pd.concat([
    cml.CMLReader(subject='R1409D', experiment='FR6', session=session,
                  rootdir=rhino_root).load('events')
    for session in sessions_completed
])
len(all_sessions_df)
Out[21]:
4527
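
When looping over many sessions, it can help to record which ones could not be loaded instead of stopping at the first failure. The sketch below shows that pattern with a stand-in loader so that it runs anywhere. The helper load_sessions and the function fake_loader are hypothetical, and catching FileNotFoundError is an assumption about how a missing file would surface.

```python
def load_sessions(load_one, sessions):
    """Call load_one(session) for each session; collect results and failures."""
    loaded, missing = [], []
    for session in sessions:
        try:
            loaded.append(load_one(session))
        except FileNotFoundError:
            missing.append(session)
    return loaded, missing


def fake_loader(session):
    """Stand-in for a per-session load call; session 3 has no file."""
    if session == 3:
        raise FileNotFoundError(f"no events for session {session}")
    return {"session": session, "n_events": 10}


loaded, missing = load_sessions(fake_loader, [0, 2, 3, 9])
print(len(loaded), missing)  # 3 [3]
```

In real use, load_one would wrap a per-session CMLReader call like the one in the loop above, and the loaded dataframes would then be passed to pd.concat.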

We can also use the special load_events classmethod to load events from multiple subjects and/or experiments.

In [22]:
aggregate_events = cml.CMLReader.load_events(subjects=["R1111M", "R1409D"],
                                             experiments=["FR1"],
                                             rootdir=rhino_root)
aggregate_events.subject.unique()
Out[22]:
array(['R1111M', 'R1409D'], dtype=object)

Loading EEG

Loading EEG data is a bit more complicated than loading other data types. For this reason, rather than using the general load method, we instead use load_eeg which takes special keyword arguments. As always, start by instantiating a CMLReader:

In [23]:
reader = cml.CMLReader(subject='R1111M', experiment='FR1', session=0, rootdir=rhino_root)

Reading a full session

If we give no parameters to reader.load_eeg, then by default all data for an entire session will be loaded as the data were recorded:

In [24]:
full_session_eeg = reader.load_eeg()
full_session_eeg.shape
/Users/depalati/src/cmlreaders/cmlreaders/path_finder.py:225: MultiplePathsFoundWarning: Multiple files found. Returning the first file found
  'file found', MultiplePathsFoundWarning)
Out[24]:
(1, 100, 1623160)

Reading from events

To select EEG epochs based on events, first load the events and use standard pandas idioms to select events of interest:

In [25]:
events = reader.load('events')
word_events = events[events['type'] == 'WORD']

In addition to passing the events, we also need to specify relative start and stop times in milliseconds. Below, we will load data for each word onset starting at the time the word appeared and ending 100 ms later:

In [26]:
word_event_eeg = reader.load_eeg(events=word_events, rel_start=0, rel_stop=100)
word_event_eeg.shape
/Users/depalati/src/cmlreaders/cmlreaders/path_finder.py:225: MultiplePathsFoundWarning: Multiple files found. Returning the first file found
  'file found', MultiplePathsFoundWarning)
Out[26]:
(288, 100, 50)
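
The last dimension of the shape above follows from the window length and the sampling rate. The arithmetic is sketched below under three assumptions: the sampling rate is inferred from the shape (50 samples over 100 ms) rather than read from the data, event offsets are counted in samples, and the stop index is exclusive. The helper epoch_sample_range and the offset of 1000 are hypothetical.

```python
def epoch_sample_range(eegoffset, rel_start_ms, rel_stop_ms, sample_rate_hz):
    """Convert an event offset plus a millisecond window into sample indices."""
    start = eegoffset + int(round(rel_start_ms * sample_rate_hz / 1000))
    stop = eegoffset + int(round(rel_stop_ms * sample_rate_hz / 1000))
    return start, stop


# Inferred (assumed) sampling rate: 50 samples over a 100 ms window.
n_samples, window_ms = 50, 100
sample_rate_hz = n_samples * 1000 / window_ms

# Arbitrary event offset of 1000 samples, same window as the cell above.
start, stop = epoch_sample_range(eegoffset=1000, rel_start_ms=0,
                                 rel_stop_ms=100, sample_rate_hz=sample_rate_hz)
print(sample_rate_hz, start, stop, stop - start)  # 500.0 1000 1050 50
```

Under the same assumed rate, the 1623160 samples of the full-session recording correspond to roughly 54 minutes of data.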

Reading from multiple sessions

Building upon aggregate event loading, we can also load EEG data from multiple sessions. The caveat here is that we are restricted to a single subject:

In [27]:
all_fr1_events = reader.load_events(subjects=["R1111M"], experiments=["FR1"], rootdir=rhino_root)
fr1_words = all_fr1_events[all_fr1_events["type"] == "WORD"]
fr1_eeg = reader.load_eeg(events=fr1_words, rel_start=-100, rel_stop=100)
fr1_eeg.shape
/Users/depalati/src/cmlreaders/cmlreaders/path_finder.py:225: MultiplePathsFoundWarning: Multiple files found. Returning the first file found
  'file found', MultiplePathsFoundWarning)
Out[27]:
(1020, 100, 100)

Converting EEG Representation

By default, CMLReaders uses a simple container for time series data. However, we provide two utility methods for converting this representation to a format that is understood by PTSA and MNE, since these are common libraries for interacting with EEG data.

In [28]:
ptsa_eeg = word_event_eeg.to_ptsa()
type(ptsa_eeg)
Out[28]:
ptsa.data.timeseries.TimeSeries
In [29]:
mne_eeg = word_event_eeg.to_mne()
type(mne_eeg)
Out[29]:
mne.epochs.EpochsArray