Getting Started with CML Readers

In [1]:

import json
import pandas as pd
import cmlreaders as cml

Finding Files on Rhino

The PathFinder helper class locates files on RHINO. Its sole responsibility is to find a file and return its path. In many cases, a file can live in more than one location. In these situations, PathFinder searches the list of possible locations in order and returns the first path at which the file is found. This assumes that the list is ordered by priority, so that the preferred location comes before any fall-back location.
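The search order described above can be sketched with the standard library alone. This is a minimal illustration of "return the first candidate that exists", not the PathFinder implementation; the function name first_existing and both relative paths are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def first_existing(rootdir: str, candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate (relative to rootdir) that exists, else None."""
    root = Path(rootdir)
    for relative in candidates:
        path = root / relative
        if path.exists():
            return path
    return None


# Demo in a temporary directory; both relative paths are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    preferred = "protocols/pairs.json"
    fallback = "legacy/pairs.json"
    # Only the fall-back file exists, so it is the one returned.
    (Path(tmp) / "legacy").mkdir()
    (Path(tmp) / fallback).write_text("{}")
    found = first_existing(tmp, [preferred, fallback])
    found_relative = found.relative_to(tmp).as_posix()
    missing = first_existing(tmp, ["does/not/exist.json"])

print(found_relative)  # legacy/pairs.json
```

Because the candidates are checked in order, putting the preferred location first is what makes the fall-back a fall-back.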

In [2]:

# If not working on RHINO, specify the mount point.
# Alternatively, set the CML_ROOT environment variable and never
# have to explicitly pass the rootdir keyword argument.
rhino_root = "/mnt/rhino/"

# Instantiate the finder object
finder = cml.PathFinder(subject="R1389J", experiment="catFR5", session=1,
                        localization=0, montage=0, rootdir=rhino_root)
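The comment above mentions two ways to supply the mount point: the rootdir keyword and the CML_ROOT environment variable. The precedence can be sketched as follows. This is only an illustration of the idea; the helper resolve_rootdir and the "/" default are assumptions and are not taken from the library.

```python
import os
from typing import Optional


def resolve_rootdir(explicit: Optional[str] = None, default: str = "/") -> str:
    """Pick a root directory: explicit argument, then CML_ROOT, then a default.

    The default of "/" is an assumption made for this sketch.
    """
    if explicit is not None:
        return explicit
    return os.environ.get("CML_ROOT", default)


# Setting the variable once means later calls need no explicit argument.
os.environ["CML_ROOT"] = "/mnt/rhino/"
from_env = resolve_rootdir()
from_arg = resolve_rootdir("/data/rhino/")  # hypothetical alternative mount
```

In practice, exporting CML_ROOT in your shell profile is usually more convenient than setting it inside each notebook.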

What can you request?

The PathFinder has a few built-in properties to help you understand which data types are currently supported. Different file types require that the finder be instantiated with different fields. For example, if you plan to request localization files, there is no need to specify an experiment, session, or montage. Specifying too many fields is not a problem: any field that a data type does not require is simply ignored. The following properties are defined:

requestable_files: All supported data types
localization_files: Files related to localization
montage_files: Files associated with a specific montage
session_files: Files that are specific to a session. These files could be processed events, Ramulator files, etc.

For high-level information about each of these data types, see the Data Guide section of the documentation.

In [3]:

finder.requestable_files

Out[3]:

['r1_index',
 'ltp_index',
 'pyfr_index',
 'pyfr_root',
 'localization',
 'voxel_coordinates',
 'prior_stim_results',
 'electrode_coordinates',
 'jacksheet',
 'area',
 'electrode_categories',
 'good_leads',
 'leads',
 'classifier_excluded_leads',
 'matlab_bipolar_talstruct',
 'matlab_monopolar_talstruct',
 'pairs',
 'contacts',
 'session_summary',
 'classifier_summary',
 'math_summary',
 'target_selection_table',
 'baseline_classifier',
 'all_events',
 'task_events',
 'math_events',
 'ps4_events',
 'sources',
 'processed_eeg',
 'experiment_log',
 'session_log',
 'ramulator_session_folder',
 'event_log',
 'experiment_config',
 'raw_eeg',
 'odin_config',
 'used_classifier',
 'excluded_pairs',
 'all_pairs']

In [4]:

finder.localization_files

Out[4]:

('localization',)

In [5]:

finder.montage_files

Out[5]:

('pairs',
 'contacts',
 'voxel_coordinates',
 'prior_stim_results',
 'electrode_coordinates',
 'jacksheet',
 'good_leads',
 'leads',
 'area',
 'classifier_excluded_leads',
 'electrode_categories',
 'target_selection_file',
 'baseline_classifier')

In [6]:

finder.session_files

Out[6]:

('session_summary',
 'classifier_summary',
 'math_summary',
 'used_classifier',
 'excluded_pairs',
 'all_pairs',
 'experiment_log',
 'session_log',
 'event_log',
 'experiment_config',
 'raw_eeg',
 'odin_config',
 'all_events',
 'task_events',
 'math_events',
 'ps4_events')
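The category tuples above can be used to check, before making a request, which fields a data type is likely to need. The sketch below uses abbreviated copies of those tuples. The REQUIRED_FIELDS mapping is an assumption based on the description earlier in this section, and the helpers category_of and missing_fields are hypothetical names, not part of the library.

```python
# Abbreviated copies of the category tuples printed above.
CATEGORIES = {
    "localization": ("localization",),
    "montage": ("pairs", "contacts", "jacksheet"),
    "session": ("task_events", "math_events", "raw_eeg"),
}

# Assumed field requirements per category (hypothetical, for illustration only).
REQUIRED_FIELDS = {
    "localization": ("subject", "localization"),
    "montage": ("subject", "localization", "montage"),
    "session": ("subject", "experiment", "session"),
}


def category_of(data_type: str) -> str:
    """Return the name of the category that lists data_type."""
    for category, data_types in CATEGORIES.items():
        if data_type in data_types:
            return category
    raise KeyError(f"unknown data type: {data_type}")


def missing_fields(data_type: str, **given) -> list:
    """List the assumed required fields that were not supplied."""
    required = REQUIRED_FIELDS[category_of(data_type)]
    # Compare against None so that valid values such as session=0 count as given.
    return [name for name in required if given.get(name) is None]


print(category_of("pairs"))
print(missing_fields("task_events", subject="R1389J"))
```

With a real finder, the CATEGORIES dictionary could be built from finder.localization_files, finder.montage_files, and finder.session_files instead of being typed by hand.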

Finding File Paths

In [7]:

# Find some example files
example_data_types = ['pairs', 'task_events', 'voxel_coordinates']
for data_type in example_data_types:
    print(finder.find(data_type=data_type))

/mnt/rhino/protocols/r1/subjects/R1389J/localizations/0/montages/0/neuroradiology/current_processed/pairs.json
/mnt/rhino/protocols/r1/subjects/R1389J/experiments/catFR5/sessions/1/behavioral/current_processed/task_events.json
/mnt/rhino/data10/RAM/subjects/R1389J/tal/VOX_coords_mother.txt

Identifying Available Sessions

CMLReaders contains a utility function that loads the JSON-formatted index files located in the protocols/ directory on RHINO as a dataframe. Once loaded, the standard pandas selection idioms can be used to answer questions such as:

What subjects completed FR1?
What experiments did subject R1111M complete?
How many sessions of PAL1 have been collected?

For many analyses, this will be the first step in determining the sample of subjects to be used.
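Beyond simple boolean masks, grouping is often handy when choosing a sample. The sketch below runs on a small made-up dataframe with the same subject, experiment, and session columns as the index; the subject codes and counts are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for the index dataframe; all values are made up.
index = pd.DataFrame({
    "subject": ["S1", "S1", "S1", "S2", "S2"],
    "experiment": ["FR1", "FR1", "PAL1", "FR1", "PAL1"],
    "session": [0, 1, 0, 0, 0],
})

# Number of sessions recorded for each experiment.
sessions_per_experiment = index.groupby("experiment").size()

# Number of distinct subjects who took part in each experiment.
subjects_per_experiment = index.groupby("experiment")["subject"].nunique()

# Subjects who completed both FR1 and PAL1.
fr1 = set(index.loc[index["experiment"] == "FR1", "subject"])
pal1 = set(index.loc[index["experiment"] == "PAL1", "subject"])
both = sorted(fr1 & pal1)

print(sessions_per_experiment)
print(both)
```

The same expressions should work unchanged on the dataframe returned by get_data_index, since they rely only on those three columns.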

In [8]:

from cmlreaders import get_data_index

In [9]:

r1_data = get_data_index(kind='r1', rootdir=rhino_root)
r1_data.head()

Out[9]:

	Recognition	all_events	contacts	experiment	import_type	math_events	original_experiment	original_session	pairs	ps4_events	session	subject	subject_alias	system_version	task_events
0	NaN	protocols/r1/subjects/R1001P/experiments/FR1/s...	protocols/r1/subjects/R1001P/localizations/0/m...	FR1	build	protocols/r1/subjects/R1001P/experiments/FR1/s...	NaN	0	protocols/r1/subjects/R1001P/localizations/0/m...	NaN	0	R1001P	R1001P	NaN	protocols/r1/subjects/R1001P/experiments/FR1/s...
1	NaN	protocols/r1/subjects/R1001P/experiments/FR1/s...	protocols/r1/subjects/R1001P/localizations/0/m...	FR1	build	protocols/r1/subjects/R1001P/experiments/FR1/s...	NaN	1	protocols/r1/subjects/R1001P/localizations/0/m...	NaN	1	R1001P	R1001P	NaN	protocols/r1/subjects/R1001P/experiments/FR1/s...
2	NaN	protocols/r1/subjects/R1001P/experiments/FR2/s...	protocols/r1/subjects/R1001P/localizations/0/m...	FR2	build	protocols/r1/subjects/R1001P/experiments/FR2/s...	NaN	0	protocols/r1/subjects/R1001P/localizations/0/m...	NaN	0	R1001P	R1001P	NaN	protocols/r1/subjects/R1001P/experiments/FR2/s...
3	NaN	protocols/r1/subjects/R1001P/experiments/FR2/s...	protocols/r1/subjects/R1001P/localizations/0/m...	FR2	build	protocols/r1/subjects/R1001P/experiments/FR2/s...	NaN	1	protocols/r1/subjects/R1001P/localizations/0/m...	NaN	1	R1001P	R1001P	NaN	protocols/r1/subjects/R1001P/experiments/FR2/s...
4	NaN	protocols/r1/subjects/R1001P/experiments/PAL1/...	protocols/r1/subjects/R1001P/localizations/0/m...	PAL1	build	protocols/r1/subjects/R1001P/experiments/PAL1/...	NaN	0	protocols/r1/subjects/R1001P/localizations/0/m...	NaN	0	R1001P	R1001P	NaN	protocols/r1/subjects/R1001P/experiments/PAL1/...

In [10]:

# What subjects completed FR1?
fr1_subjects = r1_data[r1_data['experiment'] == 'FR1']['subject'].unique()
fr1_subjects

Out[10]:

array(['R1001P', 'R1002P', 'R1003P', 'R1006P', 'R1010J', 'R1015J',
       'R1018P', 'R1020J', 'R1022J', 'R1023J', 'R1026D', 'R1027J',
       'R1030J', 'R1031M', 'R1032D', 'R1033D', 'R1034D', 'R1035M',
       'R1036M', 'R1039M', 'R1042M', 'R1044J', 'R1045E', 'R1048E',
       'R1049J', 'R1050M', 'R1051J', 'R1052E', 'R1053M', 'R1054J',
       'R1056M', 'R1057E', 'R1059J', 'R1060M', 'R1061T', 'R1062J',
       'R1063C', 'R1065J', 'R1066P', 'R1067P', 'R1068J', 'R1069M',
       'R1070T', 'R1074M', 'R1075J', 'R1076D', 'R1077T', 'R1080E',
       'R1081J', 'R1083J', 'R1084T', 'R1086M', 'R1089P', 'R1092J',
       'R1093J', 'R1094T', 'R1096E', 'R1098D', 'R1100D', 'R1101T',
       'R1102P', 'R1104D', 'R1105E', 'R1106M', 'R1108J', 'R1111M',
       'R1112M', 'R1113T', 'R1114C', 'R1115T', 'R1118N', 'R1120E',
       'R1121M', 'R1122E', 'R1123C', 'R1124J', 'R1125T', 'R1127P',
       'R1128E', 'R1129D', 'R1130M', 'R1131M', 'R1134T', 'R1135E',
       'R1136N', 'R1137E', 'R1138T', 'R1142N', 'R1145J', 'R1146E',
       'R1147P', 'R1148P', 'R1149N', 'R1150J', 'R1151E', 'R1153T',
       'R1154D', 'R1155D', 'R1156D', 'R1158T', 'R1159P', 'R1161E',
       'R1162N', 'R1163T', 'R1164E', 'R1166D', 'R1167M', 'R1168T',
       'R1169P', 'R1170J', 'R1171M', 'R1172E', 'R1173J', 'R1174T',
       'R1175N', 'R1176M', 'R1177M', 'R1178P', 'R1184M', 'R1185N',
       'R1186P', 'R1187P', 'R1189M', 'R1191J', 'R1193T', 'R1195E',
       'R1196N', 'R1198M', 'R1200T', 'R1201P', 'R1202M', 'R1203T',
       'R1204T', 'R1207J', 'R1212P', 'R1214M', 'R1215M', 'R1216E',
       'R1217T', 'R1221P', 'R1222M', 'R1223E', 'R1226D', 'R1228M',
       'R1229M', 'R1230J', 'R1231M', 'R1232N', 'R1234D', 'R1236J',
       'R1240T', 'R1241J', 'R1243T', 'R1247P', 'R1250N', 'R1251M',
       'R1260D', 'R1264P', 'R1268T', 'R1274T', 'R1275D', 'R1277J',
       'R1281E', 'R1283T', 'R1286J', 'R1288P', 'R1290M', 'R1291M',
       'R1292E', 'R1293P', 'R1297T', 'R1298E', 'R1299T', 'R1302M',
       'R1304N', 'R1306E', 'R1307N', 'R1308T', 'R1309M', 'R1310J',
       'R1311T', 'R1313J', 'R1315T', 'R1316T', 'R1317D', 'R1318N',
       'R1320D', 'R1321M', 'R1323T', 'R1324M', 'R1325C', 'R1328E',
       'R1329T', 'R1330D', 'R1331T', 'R1332M', 'R1334T', 'R1336T',
       'R1337E', 'R1338T', 'R1339D', 'R1341T', 'R1342M', 'R1345D',
       'R1346T', 'R1347D', 'R1349T', 'R1350D', 'R1351M', 'R1354E',
       'R1355T', 'R1358T', 'R1361C', 'R1363T', 'R1364C', 'R1367D',
       'R1368T', 'R1373T', 'R1374T', 'R1375C', 'R1376D', 'R1377M',
       'R1378T', 'R1379E', 'R1380D', 'R1381T', 'R1383J', 'R1384J',
       'R1385E', 'R1386T', 'R1387E', 'R1390M', 'R1391T', 'R1393T',
       'R1394E', 'R1395M', 'R1396T', 'R1397D', 'R1398J', 'R1401J',
       'R1402E', 'R1404E', 'R1405E', 'R1406M', 'R1409D', 'R1412M',
       'R1414E', 'R1415T', 'R1416T', 'R1420T', 'R1421M', 'R1422T',
       'R1423E', 'R1425D', 'R1427T', 'R1431J', 'R1438M'], dtype=object)

In [11]:

# What experiments did R1111M complete?
r1111m_experiments = r1_data[r1_data['subject'] == 'R1111M']['experiment'].unique()
r1111m_experiments

Out[11]:

array(['FR1', 'FR2', 'PAL1', 'PAL2', 'PS2', 'catFR1'], dtype=object)

In [12]:

# How many sessions of PAL1 have been collected?
pal_sessions = r1_data[r1_data['experiment'] == 'PAL1']
len(pal_sessions)

Out[12]:

Loading Data

In most cases, the end goal is to load the data into memory rather than just locating a file or determining what data has been collected. For this, CML Readers provides a class that unifies the API for loading data. By default, the location is determined automatically from the file type using the PathFinder class demonstrated earlier. However, a custom path can be given with the file_path keyword. This is useful if you have data stored locally, in the same format as one of the data types supported by CMLReaders, that you would like to load and use. See the “Loading from a Custom Location” section below for an example.

Each data type has a default representation that is returned when you call the .load() method. Most users will want to use this default representation. However, if you would like to get the data in a different format, you have two options:

Get the reader for the data type and load the data with one of its as_x methods
Load the data as the default type and convert it manually
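The second option, converting manually, needs nothing beyond pandas when the default representation is a dataframe. The sketch below uses a small made-up events table; the column names are borrowed from the events shown in this guide, but the values are invented.

```python
import pandas as pd

# Toy events table standing in for a loaded dataframe; values are made up.
events = pd.DataFrame({
    "type": ["WORD", "REC_WORD"],
    "mstime": [1000, 2500],
    "recalled": [True, False],
})

# One dictionary per row.
records = events.to_dict(orient="records")

# A NumPy record array without the dataframe index.
recarray = events.to_records(index=False)

print(records[0])
print(recarray.dtype.names)
```

Converting manually keeps a single load call in your code, at the cost of holding two copies of the data in memory while both objects are alive.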

In [13]:

reader = cml.CMLReader(subject="R1389J", experiment="catFR5", session=1,
                       localization=0, montage=0, rootdir=rhino_root)

Using the Default Representation

In [14]:

# Pandas dataframe
events_df = reader.load('task_events')
events_df.head()

Out[14]:

	eegoffset	category	category_num	eegfile	experiment	intrusion	is_stim	item_name	item_num	...	recog_rt	recognized	rectime	rejected	serialpos	session	stim_list	stim_params	subject	type
0	-1	X	-999		catFR5	-999	False	X	-999	...	-999	-999	-999	-999	-999	1	False	[]	R1389J	STIM_ARTIFACT_DETECTION_START
1	5831	X	-999	R1389J_catFR5_1_28Feb18_1552.h5	catFR5	-999	False		-1	...	-999	-999	-999	-999	-999	1	False	[{'amplitude': 500.0, 'anode_label': 'STG6', '...	R1389J	STIM_ON
2	7790	X	-999	R1389J_catFR5_1_28Feb18_1552.h5	catFR5	-999	False		-1	...	-999	-999	-999	-999	-999	1	False	[{'amplitude': 500.0, 'anode_label': 'STG6', '...	R1389J	STIM_ON
3	9786	X	-999	R1389J_catFR5_1_28Feb18_1552.h5	catFR5	-999	False		-1	...	-999	-999	-999	-999	-999	1	False	[{'amplitude': 500.0, 'anode_label': 'STG6', '...	R1389J	STIM_ON
4	11782	X	-999	R1389J_catFR5_1_28Feb18_1552.h5	catFR5	-999	False		-1	...	-999	-999	-999	-999	-999	1	False	[{'amplitude': 500.0, 'anode_label': 'STG6', '...	R1389J	STIM_ON

5 rows × 28 columns

In [15]:

# Python dictionary
electrode_categories_dict = reader.load('electrode_categories')
electrode_categories_dict

Out[15]:

{'interictal': ['ONEMC 9',
  'ONEMC10',
  'ONEMC8',
  'SMC3',
  'SMC4',
  'STG 7 ',
  'STG8',
  'TWOSTG 4'],
 'brain_lesion': ['FOURSC', 'ONESC', 'ONNEMC', 'THREESC', 'TWOMC', 'TWOSC'],
 'bad_channel': ['NONE'],
 'soz': []}

Using the Underlying Reader

In [16]:

# Ask CMLReader to give back the reader instead of the data
event_reader = reader.get_reader('task_events')
type(event_reader)

Out[16]:

cmlreaders.readers.readers.EventReader

In [17]:

# Load the task events as a list of dictionaries instead of the default representation
event_dict = event_reader.as_dict()
event_dict[:1]

Out[17]:

[{'eegoffset': -1,
  'category': 'X',
  'category_num': -999,
  'eegfile': '',
  'exp_version': '',
  'experiment': 'catFR5',
  'intrusion': -999,
  'is_stim': False,
  'item_name': 'X',
  'item_num': -999,
  'list': -999,
  'montage': 0,
  'msoffset': -1,
  'mstime': -1,
  'phase': '',
  'protocol': 'r1',
  'recalled': False,
  'recog_resp': -999,
  'recog_rt': -999,
  'recognized': -999,
  'rectime': -999,
  'rejected': -999,
  'serialpos': -999,
  'session': 1,
  'stim_list': False,
  'stim_params': [],
  'subject': 'R1389J',
  'type': 'STIM_ARTIFACT_DETECTION_START'}]

In [18]:

# Load the task events as a recarray (not recommended)
event_recarray = event_reader.as_recarray()
event_recarray[0]

Out[18]:

(0, -1, 'X', -999, '', '', 'catFR5', -999, False, 'X', -999, -999, 0, -1, -1, '', 'r1', False, -999, -999, -999, -999, -999, -999, 1, False, [], 'R1389J', 'STIM_ARTIFACT_DETECTION_START')

Examples

Loading Multiple Sessions

CMLReaders is currently designed for loading events one session at a time. This may change in the future, but in the interim, it is straightforward to load multiple sessions' worth of events.
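The general pattern is "load each session, then concatenate". A minimal sketch of it is shown here with a stand-in loader so that it runs without access to RHINO; load_sessions and fake_loader are hypothetical names, and the fake events are made up.

```python
from typing import Callable, Iterable

import pandas as pd


def load_sessions(load_one: Callable[[int], pd.DataFrame],
                  sessions: Iterable[int]) -> pd.DataFrame:
    """Load each session with load_one and stack the results into one dataframe."""
    frames = [load_one(session) for session in sessions]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index,
    # so row labels are not repeated across sessions.
    return pd.concat(frames, ignore_index=True)


def fake_loader(session: int) -> pd.DataFrame:
    """Hypothetical stand-in that returns two made-up events per session."""
    return pd.DataFrame({"session": [session, session],
                         "type": ["WORD", "WORD"]})


combined = load_sessions(fake_loader, [0, 2, 3, 9])
print(len(combined))  # 8
```

With real data, load_one would be a small function that builds a reader for the given session and returns the result of its load call.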

In [19]:

# Find all sessions of FR6 that subject R1409D completed
sessions_completed = r1_data[(r1_data['subject'] == 'R1409D') &
                             (r1_data['experiment'] == 'FR6')]['session'].unique()
sessions_completed

Out[19]:

array([0, 2, 3, 9])

In [20]:

# Verbose method
all_events = []
for session in sessions_completed:
    sess_events = cml.CMLReader(subject="R1409D", experiment="FR6", session=session,
                                localization=0, montage=0, rootdir=rhino_root).load('task_events')
    all_events.append(sess_events)

all_sessions_df = pd.concat(all_events)
len(all_sessions_df)

Out[20]:

In [21]:

# Same operation, but with a list comprehension
all_sessions_df = pd.concat([
    cml.CMLReader(subject="R1409D", experiment="FR6", session=session,
                  localization=0, montage=0, rootdir=rhino_root).load('task_events')
    for session in sessions_completed
])
len(all_sessions_df)

Out[21]:

We can also use the special load_events classmethod to load events from multiple subjects and/or experiments.

In [22]:

aggregate_events = cml.CMLReader.load_events(subjects=["R1111M", "R1409D"],
                                             experiments=["FR1"],
                                             rootdir=rhino_root)
aggregate_events.subject.unique()

Out[22]:

array(['R1111M', 'R1409D'], dtype=object)

Loading EEG

Loading EEG data is a bit more complicated than loading other data types. For this reason, rather than using the general load method, we use load_eeg, which takes special keyword arguments. As always, start by instantiating a CMLReader:

In [23]:

reader = cml.CMLReader(subject='R1111M', experiment='FR1', session=0, rootdir=rhino_root)

Reading a full session

If we give no parameters to reader.load_eeg, then by default all data for an entire session will be loaded as the data were recorded:

In [24]:

full_session_eeg = reader.load_eeg()
full_session_eeg.shape

/Users/depalati/src/cmlreaders/cmlreaders/path_finder.py:225: MultiplePathsFoundWarning: Multiple files found. Returning the first file found
  'file found', MultiplePathsFoundWarning)

Out[24]:

(1, 100, 1623160)

Reading from events

To select EEG epochs based on events, first load the events and use standard pandas idioms to select events of interest:

In [25]:

events = reader.load('events')
word_events = events[events['type'] == 'WORD']

In addition to passing the events, we also need to specify relative start and stop times in milliseconds. Below, we will load data for each word onset starting at the time the word appeared and ending 100 ms later:

In [26]:

word_event_eeg = reader.load_eeg(events=word_events, rel_start=0, rel_stop=100)
word_event_eeg.shape

/Users/depalati/src/cmlreaders/cmlreaders/path_finder.py:225: MultiplePathsFoundWarning: Multiple files found. Returning the first file found
  'file found', MultiplePathsFoundWarning)

Out[26]:

(288, 100, 50)
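The size of the last axis follows from the window length and the sampling rate. The sketch below assumes that the last axis is time and that this recording was sampled at 500 Hz; the rate is inferred from the shape above (50 samples spanning 100 ms) rather than stated anywhere in this guide. The helper samples_in_window is a hypothetical name.

```python
def samples_in_window(rel_start_ms: float, rel_stop_ms: float,
                      sample_rate_hz: float) -> int:
    """Number of samples between two offsets given in milliseconds."""
    duration_s = (rel_stop_ms - rel_start_ms) / 1000
    return int(round(duration_s * sample_rate_hz))


# Assumed sampling rate, inferred from the output shape above.
SAMPLE_RATE_HZ = 500

word_window = samples_in_window(0, 100, SAMPLE_RATE_HZ)
symmetric_window = samples_in_window(-100, 100, SAMPLE_RATE_HZ)

print(word_window)       # 50
print(symmetric_window)  # 100
```

A quick check like this is a cheap way to confirm that rel_start and rel_stop were interpreted in the units you intended.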

Reading from multiple sessions

Building upon aggregate event loading, we can also load EEG data from multiple sessions. The caveat here is that we are restricted to a single subject:

In [27]:

all_fr1_events = reader.load_events(subjects=["R1111M"], experiments=["FR1"], rootdir=rhino_root)
fr1_words = all_fr1_events[all_fr1_events["type"] == "WORD"]
fr1_eeg = reader.load_eeg(events=fr1_words, rel_start=-100, rel_stop=100)
fr1_eeg.shape

/Users/depalati/src/cmlreaders/cmlreaders/path_finder.py:225: MultiplePathsFoundWarning: Multiple files found. Returning the first file found
  'file found', MultiplePathsFoundWarning)

Out[27]:

(1020, 100, 100)
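Because multi-session EEG loading is restricted to a single subject, it can help to check the events before requesting EEG. The guard below is a sketch of that check on a made-up events table; require_single_subject is a hypothetical helper and is not part of CMLReaders.

```python
import pandas as pd


def require_single_subject(events: pd.DataFrame) -> str:
    """Return the only subject in events, or raise if there is not exactly one."""
    subjects = events["subject"].unique()
    if len(subjects) != 1:
        raise ValueError(
            f"expected events from one subject, found {len(subjects)}"
        )
    return str(subjects[0])


# Made-up events for illustration.
one_subject = pd.DataFrame({"subject": ["R1111M", "R1111M"],
                            "type": ["WORD", "WORD"]})
two_subjects = pd.DataFrame({"subject": ["R1111M", "R1409D"],
                             "type": ["WORD", "WORD"]})

subject = require_single_subject(one_subject)

try:
    require_single_subject(two_subjects)
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

Calling a guard like this just before the EEG request turns a confusing downstream failure into an explicit error message.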

Converting EEG Representation

By default, CMLReaders uses a simple container for time series data. However, we provide two utility methods for converting this representation to a format that is understood by PTSA and MNE, since these are common libraries for interacting with EEG data.

In [28]:

ptsa_eeg = word_event_eeg.to_ptsa()
type(ptsa_eeg)

Out[28]:

ptsa.data.timeseries.TimeSeries

In [29]:

mne_eeg = word_event_eeg.to_mne()
type(mne_eeg)

Out[29]:

mne.epochs.EpochsArray