The focus of the project is to predict drug sensitivity (IC50) based on genomic features of cancer cell lines. It will involve regression tasks to predict the IC50 values or classify cell lines as sensitive or resistant to specific drugs. Dataset also allows for identification of genomic markers that correlate with drug response.

Primary targetable variable is LN_IC50 (Natural log of the half-maximal inhibitory concentration): which is the concentration of drug that inhibits cell viability by 50% on a logarithmic scale. Lower values mean higher drug sensitivity.

In [141]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import openpyxl
In [142]:
# Load data
gdsc_data = pd.read_csv('data/GDSC2-dataset.csv')
compound_annotation = pd.read_csv('data/Compounds-annotation.csv')
cellline_details = pd.read_excel('data/Cell_Lines_Details.xlsx', sheet_name = 'Cell line details')
/Users/yongjudan/MLDL_projects/GDSC/gdsc_venv/lib/python3.10/site-packages/openpyxl/worksheet/_reader.py:329: UserWarning: Unknown extension is not supported and will be removed
  warn(msg)
In [143]:
# Overview data and choose columns for use for gdsc_column
gdsc_column = ['COSMIC_ID', 'CELL_LINE_NAME','TCGA_DESC', 'DRUG_ID','DRUG_NAME', 'PUTATIVE_TARGET',
               'PATHWAY_NAME','MIN_CONC','MAX_CONC', 'LN_IC50', 'AUC', 'RMSE', 'Z_SCORE']

gdsc_data =gdsc_data[gdsc_column]
print(gdsc_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242036 entries, 0 to 242035
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   COSMIC_ID        242036 non-null  int64  
 1   CELL_LINE_NAME   242036 non-null  object 
 2   TCGA_DESC        240969 non-null  object 
 3   DRUG_ID          242036 non-null  int64  
 4   DRUG_NAME        242036 non-null  object 
 5   PUTATIVE_TARGET  214881 non-null  object 
 6   PATHWAY_NAME     242036 non-null  object 
 7   MIN_CONC         242036 non-null  float64
 8   MAX_CONC         242036 non-null  float64
 9   LN_IC50          242036 non-null  float64
 10  AUC              242036 non-null  float64
 11  RMSE             242036 non-null  float64
 12  Z_SCORE          242036 non-null  float64
dtypes: float64(6), int64(2), object(5)
memory usage: 24.0+ MB
None
In [144]:
# Overview data and choose columns for use for compound_annotation
annotate_column = ['DRUG_ID', 'DRUG_NAME', 'TARGET', 'TARGET_PATHWAY']

compound_annotation = compound_annotation[annotate_column]
print(compound_annotation.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 621 entries, 0 to 620
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   DRUG_ID         621 non-null    int64 
 1   DRUG_NAME       621 non-null    object
 2   TARGET          579 non-null    object
 3   TARGET_PATHWAY  621 non-null    object
dtypes: int64(1), object(3)
memory usage: 19.5+ KB
None
In [145]:
# Overview data and choose columns for use for cell linedetails
cellline_column = ['COSMIC identifier', 'Sample Name', 'Whole Exome Sequencing (WES)', 'Gene Expression', 'Methylation', 'Drug\nResponse',
                   'GDSC\nTissue descriptor 1', 'GDSC\nTissue\ndescriptor 2', 'Cancer Type\n(matching TCGA label)', 'Microsatellite \ninstability Status (MSI)',
                   'Growth Properties']
cellline_details = cellline_details[cellline_column]

# Rename columns
cellline_details = cellline_details.rename(columns = {'COSMIC identifier' : 'COSMIC_ID', 'Sample Name' : 'CELL_LINE_NAME', 'GDSC\nTissue descriptor 1' : 'GDSC Tissue descriptor 1',
                                                      'GDSC\nTissue\ndescriptor 2' : 'GDSC Tissue descriptor 2', 'Cancer Type\n(matching TCGA label)': 'CANCER_TYPE',
                                                      'Microsatellite \ninstability Status (MSI)':'MSI', 'Drug\nResponse': 'Drug Response'})
print(cellline_details.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1002 entries, 0 to 1001
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   COSMIC_ID                     1001 non-null   float64
 1   CELL_LINE_NAME                1002 non-null   object 
 2   Whole Exome Sequencing (WES)  1002 non-null   object 
 3   Gene Expression               1002 non-null   object 
 4   Methylation                   1002 non-null   object 
 5   Drug Response                 1002 non-null   object 
 6   GDSC Tissue descriptor 1      1001 non-null   object 
 7   GDSC Tissue descriptor 2      1001 non-null   object 
 8   CANCER_TYPE                   826 non-null    object 
 9   MSI                           986 non-null    object 
 10  Growth Properties             999 non-null    object 
dtypes: float64(1), object(10)
memory usage: 86.2+ KB
None
In [146]:
# Merge Tables
merged = pd.merge(gdsc_data, cellline_details, on = ['COSMIC_ID', 'CELL_LINE_NAME'], how = 'left')
final_merge = pd.merge(merged, compound_annotation, on = ['DRUG_ID','DRUG_NAME'], how = 'left')

print(final_merge.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242036 entries, 0 to 242035
Data columns (total 24 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   COSMIC_ID                     242036 non-null  int64  
 1   CELL_LINE_NAME                242036 non-null  object 
 2   TCGA_DESC                     240969 non-null  object 
 3   DRUG_ID                       242036 non-null  int64  
 4   DRUG_NAME                     242036 non-null  object 
 5   PUTATIVE_TARGET               214881 non-null  object 
 6   PATHWAY_NAME                  242036 non-null  object 
 7   MIN_CONC                      242036 non-null  float64
 8   MAX_CONC                      242036 non-null  float64
 9   LN_IC50                       242036 non-null  float64
 10  AUC                           242036 non-null  float64
 11  RMSE                          242036 non-null  float64
 12  Z_SCORE                       242036 non-null  float64
 13  Whole Exome Sequencing (WES)  232670 non-null  object 
 14  Gene Expression               232670 non-null  object 
 15  Methylation                   232670 non-null  object 
 16  Drug Response                 232670 non-null  object 
 17  GDSC Tissue descriptor 1      232670 non-null  object 
 18  GDSC Tissue descriptor 2      232670 non-null  object 
 19  CANCER_TYPE                   190589 non-null  object 
 20  MSI                           229683 non-null  object 
 21  Growth Properties             232670 non-null  object 
 22  TARGET                        214881 non-null  object 
 23  TARGET_PATHWAY                242036 non-null  object 
dtypes: float64(6), int64(2), object(16)
memory usage: 44.3+ MB
None
In [147]:
# Check for duplicates
print(sum(final_merge.duplicated()))
0
In [148]:
# Check for redundant numeric features
plt.figure(figsize=(15, 15))
sns.heatmap(final_merge.corr(numeric_only=True), annot=True, fmt='.2f')
plt.xticks(rotation=45, ha='right')  # Rotate ticker labels for readability
plt.yticks(rotation=0)
plt.tight_layout()
plt.title(f'Correlation Matrix')
plt.show()
# not much of redundantt numeric features
No description has been provided for this image
In [149]:
# Overview of data set
print(final_merge.head(2).T)
                                            0                 1
COSMIC_ID                              683667            684052
CELL_LINE_NAME                         PFSK-1              A673
TCGA_DESC                                  MB      UNCLASSIFIED
DRUG_ID                                  1003              1003
DRUG_NAME                        Camptothecin      Camptothecin
PUTATIVE_TARGET                          TOP1              TOP1
PATHWAY_NAME                  DNA replication   DNA replication
MIN_CONC                               0.0001            0.0001
MAX_CONC                                  0.1               0.1
LN_IC50                             -1.463887         -4.869455
AUC                                   0.93022           0.61497
RMSE                                 0.089052          0.111351
Z_SCORE                              0.433123           -1.4211
Whole Exome Sequencing (WES)                Y                 Y
Gene Expression                             Y                 Y
Methylation                                 Y                 Y
Drug Response                               Y                 Y
GDSC Tissue descriptor 1       nervous_system       soft_tissue
GDSC Tissue descriptor 2      medulloblastoma  rhabdomyosarcoma
CANCER_TYPE                                MB               NaN
MSI                                 MSS/MSI-L         MSS/MSI-L
Growth Properties                    Adherent          Adherent
TARGET                                   TOP1              TOP1
TARGET_PATHWAY                DNA replication   DNA replication
In [150]:
# Drop cateogorical columns with redundancy
drop_columns = ['TARGET_PATHWAY','PUTATIVE_TARGET']
final_merge.drop(columns=drop_columns, inplace = True)
In [151]:
# Data info
print(final_merge.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242036 entries, 0 to 242035
Data columns (total 22 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   COSMIC_ID                     242036 non-null  int64  
 1   CELL_LINE_NAME                242036 non-null  object 
 2   TCGA_DESC                     240969 non-null  object 
 3   DRUG_ID                       242036 non-null  int64  
 4   DRUG_NAME                     242036 non-null  object 
 5   PATHWAY_NAME                  242036 non-null  object 
 6   MIN_CONC                      242036 non-null  float64
 7   MAX_CONC                      242036 non-null  float64
 8   LN_IC50                       242036 non-null  float64
 9   AUC                           242036 non-null  float64
 10  RMSE                          242036 non-null  float64
 11  Z_SCORE                       242036 non-null  float64
 12  Whole Exome Sequencing (WES)  232670 non-null  object 
 13  Gene Expression               232670 non-null  object 
 14  Methylation                   232670 non-null  object 
 15  Drug Response                 232670 non-null  object 
 16  GDSC Tissue descriptor 1      232670 non-null  object 
 17  GDSC Tissue descriptor 2      232670 non-null  object 
 18  CANCER_TYPE                   190589 non-null  object 
 19  MSI                           229683 non-null  object 
 20  Growth Properties             232670 non-null  object 
 21  TARGET                        214881 non-null  object 
dtypes: float64(6), int64(2), object(14)
memory usage: 40.6+ MB
None
In [152]:
# Save as CSV
final_merge.to_csv('gdsc_dataset.csv', index=False)