Chemistry 125-225: Machine Learning in Chemistry
Fall Quarter, 2024
Homework Assignment #3 - Due: December 5, 2024. Turn in a write-up with your responses to all questions below, your code, and outputs (e.g., graphs). Attach all your Python files as well, so we can run them.
Problem 1: K-means Clustering on a Chemistry Dataset
In this problem, you will explore k-means clustering on a chemical dataset using the scikit-learn library. The dataset contains molecular descriptors and water solubility values for various compounds. Your tasks include applying clustering, interpreting results, and evaluating the performance of your clustering model.
Dataset: You will use the delaney solubility with descriptors dataset available through the DeepChem library.
1. Loading and Exploring the Data
(a) Download and load the dataset using DeepChem. Convert it into a Pandas DataFrame with feature columns and a target column for solubility values. The following code snippet will help you get started:
import deepchem as dc
import pandas as pd

# Load the dataset (splitter=None returns a single dataset)
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='ECFP',
                                                       splitter=None)
dataset = datasets[0]

# Convert to a Pandas DataFrame
X = dataset.X
y = dataset.y
columns = [f'feature_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=columns)
df['solubility'] = y.ravel()  # flatten the (n, 1) target array to 1-D

# Display basic information about the dataset
print(df.head())
print(df.describe())
(b) Describe the data briefly. How many samples and features are there? Are there any missing values? If so, how would you handle them?
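A quick way to check for missing values (a minimal sketch, assuming the DataFrame df from part (a)) is:

# Count missing values per column and report any columns that have them
missing_counts = df.isnull().sum()
print(missing_counts[missing_counts > 0])

# If any are found, two simple options are to drop the affected rows
# (df = df.dropna()) or to impute with the column mean (df = df.fillna(df.mean())).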
2. Data Preprocessing
(a) Standardize the features so that they have zero mean and unit variance using the StandardScaler class from sklearn. Explain why standardization is important for k-means clustering.
from sklearn.preprocessing import StandardScaler

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop('solubility', axis=1))
3. Applying K-means Clustering
(a) Apply k-means clustering to the standardized data with k = 3 clusters. Use the KMeans class from sklearn to fit the model and predict the cluster assignments.
from sklearn.cluster import KMeans

# Apply k-means clustering
k = 3
kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
df['cluster'] = clusters
(b) Reduce the dimensionality of the data to 2 components using Principal Component Analysis (PCA) and plot the clusters. Use the PCA class from sklearn. Hint: Use matplotlib for plotting.
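A minimal sketch of this step, assuming X_scaled and clusters from the previous parts, could look like:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the standardized features onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Scatter plot colored by k-means cluster assignment
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', s=10)
plt.legend(*scatter.legend_elements(), title='Cluster')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('K-means Clusters in PCA Space')
plt.show()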
4. Evaluating Clustering Performance
(a) Compute the silhouette score for the clustering using the silhouette_score function from sklearn. What does the silhouette score indicate about the quality of the clustering?
from sklearn.metrics import silhouette_score

# Compute the silhouette score for the k-means cluster assignments
silhouette_avg = silhouette_score(X_scaled, clusters)
print(f"Silhouette Score for k={k}: {silhouette_avg:.3f}")
(b) Experiment with different values of k (e.g., k = 2, 4, 5) and compute the silhouette score for each value. Plot the silhouette scores as a function of k to determine the optimal number of clusters.
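One way to set this up (a sketch, assuming X_scaled from part 2) is:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compute the silhouette score for several values of k
k_values = [2, 3, 4, 5]
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))

plt.plot(k_values, scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs. k')
plt.show()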
5. Exploring Clustering Interpretability
(a) Examine the cluster centroids. What patterns, if any, do you observe? Are there any features that clearly distinguish one cluster from another?
(b) Select a few representative samples from each cluster (if there are at least three points per cluster) and compare their solubility values. What can you infer about the relationship between cluster membership and solubility? Use the following code snippet to sample data:
# Sampling representative data points from each cluster
for cluster_label in range(k):
    cluster_data = df[df['cluster'] == cluster_label]
    # Ensure we do not sample more than the available data points
    num_samples = min(3, len(cluster_data))
    samples = cluster_data.sample(num_samples, random_state=42)
    print(f"Cluster {cluster_label} samples:")
    print(samples[['solubility', 'cluster']])
6. Advanced Analysis and Interpretations
(a) Compare the performance of k-means clustering with another clustering algorithm, such as Agglomerative Hierarchical Clustering or DBSCAN, using the same dataset. Which method performs better based on silhouette scores, and why might this be the case? Provide code for your chosen alternative clustering method and a brief analysis of your results.
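As an example, a minimal sketch using Agglomerative Hierarchical Clustering (one of the suggested alternatives; the choice of three clusters simply mirrors the k-means run above) might look like:

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Fit agglomerative clustering with the same number of clusters as k-means
agglo = AgglomerativeClustering(n_clusters=3)
agglo_labels = agglo.fit_predict(X_scaled)

# Compare silhouette scores of the two methods
print(f"Agglomerative silhouette: {silhouette_score(X_scaled, agglo_labels):.3f}")
print(f"K-means silhouette: {silhouette_score(X_scaled, clusters):.3f}")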
(b) K-means assumes that clusters are spherical and equally sized. Discuss whether this assumption is reasonable for the dataset you are using. Based on your understanding of chemical descriptors, do you expect clusters to have such shapes? If not, suggest preprocessing or alternative methods that might improve clustering performance.
(c) Use elbow method analysis to determine the optimal number of clusters for k-means. Plot the sum of squared distances (inertia) for a range of cluster values (e.g., k = 1 to k = 10) and identify the "elbow point." Explain your choice of the optimal k based on this analysis.
# Elbow method plot
import matplotlib.pyplot as plt

inertia = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Sum of Squared Distances (Inertia)')
plt.show()
(d) Investigate the impact of feature selection on clustering performance. Remove features that have near-zero variance or high correlation with other features, and re-run the k-means clustering. How does feature selection impact the silhouette score and cluster interpretability? Provide a brief explanation supported by code and results.
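A sketch of one possible approach is shown below; it works on the unscaled feature DataFrame from part 1, and the variance and correlation thresholds (0.01 and 0.95) are illustrative choices, not required values:

import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

features = df.drop(['solubility', 'cluster'], axis=1)

# Remove near-zero-variance features
selector = VarianceThreshold(threshold=0.01)
X_reduced = pd.DataFrame(selector.fit_transform(features),
                         columns=features.columns[selector.get_support()])

# Drop one feature from each highly correlated pair (|r| > 0.95)
corr = X_reduced.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_selected = X_reduced.drop(columns=to_drop)

# Re-standardize, re-run k-means, and compare the silhouette score
X_selected_scaled = StandardScaler().fit_transform(X_selected)
labels_sel = KMeans(n_clusters=3, random_state=42).fit_predict(X_selected_scaled)
print(f"Silhouette after feature selection: "
      f"{silhouette_score(X_selected_scaled, labels_sel):.3f}")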
(e) Use a dimensionality reduction technique other than PCA (e.g., t-SNE) to visualize the clusters in a lower-dimensional space. Compare the visualization and interpretability of clusters using PCA versus t-SNE. Discuss any differences in the separation and distribution of clusters.
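A minimal sketch of the t-SNE half of this comparison (the perplexity value is just a reasonable starting point):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Embed the standardized features into two dimensions with t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=clusters, cmap='viridis', s=10)
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('K-means Clusters in t-SNE Space')
plt.show()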
Problem 2: Decision Trees and their Application in Chemistry
In this problem, you will learn about decision trees, a type of supervised learning algorithm used for classification and regression tasks. Decision trees model data by splitting it based on feature values, creating a tree-like structure of decisions. You will explore their application using a chemistry dataset.
Dataset: You will use the delaney solubility with descriptors dataset available through the DeepChem library, which contains molecular descriptors and water solubility values.
1. Introduction to Decision Trees
(a) Read about decision trees: Decision trees split data into branches based on feature values, creating a structure resembling a tree. At each node, a decision is made to split the data further based on some feature until a final prediction is made at a leaf node. Research and briefly describe:
• How decision trees make predictions (e.g., how data is split at each node).
• The advantages and disadvantages of decision trees compared to other algorithms.
2. Loading and Exploring the Data
(a) Download and load the delaney solubility with descriptors dataset using the code below.
Create a Pandas DataFrame with feature columns and a target column for solubility values.
import deepchem as dc
import pandas as pd

# Load the dataset with a random train/validation/test split
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='ECFP',
                                                       splitter='random')
train_dataset, valid_dataset, test_dataset = datasets

# Convert the training set to a Pandas DataFrame
X_train = train_dataset.X
y_train = train_dataset.y
columns = [f'feature_{i}' for i in range(X_train.shape[1])]
df_train = pd.DataFrame(X_train, columns=columns)
df_train['solubility'] = y_train.ravel()  # flatten the (n, 1) target array to 1-D
(b) Describe the dataset: How many features and samples are present? What is the target variable? Print the first few rows to inspect the data.
3. Training a Decision Tree Regressor
(a) Use sklearn’s DecisionTreeRegressor to fit a decision tree model to the training data. Split the features and target variable as follows:
from sklearn.tree import DecisionTreeRegressor

# Define features and target
X = df_train.drop('solubility', axis=1)
y = df_train['solubility']

# Fit the decision tree regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X, y)
(b) Examine the structure of the tree using the plot_tree function from sklearn. Does the tree have a large number of splits? What does this imply about the model's complexity?
from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
tree.plot_tree(model, max_depth=3, filled=True, feature_names=X.columns)
plt.show()
4. Evaluating the Model
(a) Use the trained model to make predictions on the validation data. Compute the Mean Absolute Error (MAE) and Mean Squared Error (MSE) as performance metrics.
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Load validation data
X_valid = valid_dataset.X
y_valid = valid_dataset.y

# Make predictions
y_pred = model.predict(X_valid)

# Calculate performance metrics
mae = mean_absolute_error(y_valid, y_pred)
mse = mean_squared_error(y_valid, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.3f}")
print(f"Mean Squared Error (MSE): {mse:.3f}")
(b) Interpret the performance metrics. Is the model accurate in predicting solubility? What do the values of MAE and MSE suggest?
5. Improving the Decision Tree
(a) Decision trees can easily overfit the training data. Try limiting the maximum depth of the tree (e.g., max_depth=3) and re-evaluate the model using MAE and MSE. How does limiting the depth impact performance on the validation data?
(b) Experiment with other hyperparameters, such as min_samples_split and min_samples_leaf. How do these parameters affect the model's performance and complexity? Provide a brief analysis supported by code and results (a sketch of one possible setup follows below).
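As one possible starting point (a sketch, not a prescribed grid), you could loop over a few hyperparameter settings and record the validation metrics:

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative hyperparameter settings to compare
settings = [
    {'max_depth': 3},
    {'max_depth': 5},
    {'min_samples_split': 10},
    {'min_samples_leaf': 5},
]

for params in settings:
    dt = DecisionTreeRegressor(random_state=42, **params)
    dt.fit(X, y)
    y_pred = dt.predict(X_valid)
    print(f"{params}: MAE={mean_absolute_error(y_valid, y_pred):.3f}, "
          f"MSE={mean_squared_error(y_valid, y_pred):.3f}")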
6. Advanced Analysis (Challenging Question)
(a) Feature importance: Use the feature_importances_ attribute of the trained model to identify the most important features for predicting solubility. Plot the feature importances and interpret the results. Do these features make sense in a chemical context?
import matplotlib.pyplot as plt
import numpy as np

# Plot feature importances in descending order
importance = model.feature_importances_
feature_names = X.columns
indices = np.argsort(importance)[::-1]

plt.figure(figsize=(10, 6))
plt.bar(range(len(importance)), importance[indices])
plt.xticks(range(len(importance)), feature_names[indices], rotation=90)
plt.title('Feature Importances')
plt.show()
(b) Discuss any limitations you observe when using decision trees for this dataset. Suggest potential approaches to overcome these limitations (e.g., using ensemble methods such as Random Forests).
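If you explore an ensemble method, a minimal Random Forest sketch (reusing the training and validation arrays from above; the number of trees is an arbitrary choice) could be:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Fit a random forest and evaluate it on the validation set
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)
y_pred_rf = rf.predict(X_valid)
print(f"Random Forest MAE: {mean_absolute_error(y_valid, y_pred_rf):.3f}")
print(f"Random Forest MSE: {mean_squared_error(y_valid, y_pred_rf):.3f}")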
Problem 3: Exploring t-SNE for Dimensionality Reduction
In this problem, you will learn about t-SNE (t-Distributed Stochastic Neighbor Embedding), a widely used dimensionality reduction algorithm, and apply it to visualize chemical datasets. Answer all subproblems below to demonstrate your understanding of t-SNE and its applications in machine learning for chemistry.
(a) Introduction to t-SNE
1. What is t-SNE? Write a detailed explanation of t-SNE, covering:
(a) Its purpose as a dimensionality reduction technique.
(b) The key concepts of pairwise similarity, high-dimensional probability distributions, and low-dimensional embeddings.
(c) How t-SNE minimizes the KL divergence between the high-dimensional and low-dimensional distributions.
(d) Common applications of t-SNE in chemistry (e.g., clustering molecular features, visualizing datasets).
2. Explain the following t-SNE parameters and their effects:
(a) perplexity: How does it control the size of the neighborhood in high-dimensional space?
(b) learning_rate: What happens if it is set too high or too low?
(c) n_iter: Why is it important to use enough iterations?
(d) metric: Which distance metrics can be used, and why might certain metrics be preferred in chemistry?
(b) Loading and Preprocessing Data
Choose a chemical dataset (e.g., ChEMBL, ESOL, or MOSES) and write Python code to:
1. Load the dataset into a Pandas DataFrame.
2. Standardize the features using StandardScaler.
3. Display the first few rows of the dataset.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load dataset
chem_data = pd.read_csv('chemistry_data.csv')
print(chem_data.head())

# Select numerical features and standardize
features = ['molecular_weight', 'alogp', 'hba', 'hbd', 'psa']
X = chem_data[features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
(c) Applying t-SNE
1. Use t-SNE to reduce the dataset to two dimensions using sklearn.manifold.TSNE.
2. Use the following parameter values: perplexity=30, learning_rate=200, n_iter=1000.
3. Write Python code to generate a scatter plot of the t-SNE embedding using matplotlib or seaborn, with points colored by a categorical property (e.g., bioactivity).
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, n_iter=1000,
            random_state=42)
embedding = tsne.fit_transform(X_scaled)

# Plot t-SNE embedding
plt.figure(figsize=(8, 6))
sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],
                hue=chem_data['bioactivity'], palette='viridis')
plt.title('t-SNE Embedding of Chemical Dataset')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.show()
(d) Experimenting with Parameters
Repeat the t-SNE projection with the following parameter combinations:
1. perplexity=10, learning_rate=50
2. perplexity=50, learning_rate=500
Plot all three embeddings side by side. Discuss how changes in perplexity and learning rate affect the embedding.
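One way to sketch the side-by-side comparison (assuming X_scaled from part (b); the first combination repeats the default run from part (c)):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Parameter combinations to compare
param_sets = [(30, 200), (10, 50), (50, 500)]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, (perp, lr) in zip(axes, param_sets):
    emb = TSNE(n_components=2, perplexity=perp, learning_rate=lr,
               random_state=42).fit_transform(X_scaled)
    ax.scatter(emb[:, 0], emb[:, 1], s=10)
    ax.set_title(f'perplexity={perp}, learning_rate={lr}')
plt.tight_layout()
plt.show()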
(e) Comparing t-SNE with PCA
1. Apply PCA to reduce the dataset to two dimensions.
2. Generate a scatter plot of the PCA embedding.
3. Write a paragraph comparing t-SNE and PCA in terms of their ability to preserve data structure. How does t-SNE’s focus on local structure differ from PCA’s emphasis on global variance?
from sklearn.decomposition import PCA

# Apply PCA
pca = PCA(n_components=2)
pca_embedding = pca.fit_transform(X_scaled)

# Plot PCA embedding
plt.figure(figsize=(8, 6))
sns.scatterplot(x=pca_embedding[:, 0], y=pca_embedding[:, 1],
                hue=chem_data['bioactivity'], palette='viridis')
plt.title('PCA Embedding of Chemical Dataset')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.show()
(f) Reflection and Analysis
Answer the following questions:
1. Do you observe distinct clusters in the t-SNE embedding? What might these clusters represent in the context of molecular properties?
2. Compute the trustworthiness score of the t-SNE embedding. How does this metric quantify the quality of the embedding? Use the following code to calculate the score:
from sklearn.manifold import trustworthiness

# Trustworthiness compares neighborhoods in the original and embedded spaces
score = trustworthiness(X_scaled, embedding, n_neighbors=5)
print(f"Trustworthiness score: {score}")
3. Discuss the challenges and best practices for tuning t-SNE parameters like perplexity and learning rate.
4. Why might t-SNE be particularly useful in chemistry applications? Provide examples, such as cluster- ing compounds or analyzing molecular properties.
Submission Instructions
Submit a report containing:
• Python code for all parts of the problem.
• Plots and visualizations.
• Written answers to all questions and reflections on t-SNE’s performance.
Hints:
• Install scikit-learn if needed:
pip install scikit-learn
• Trustworthiness provides a quantitative measure of embedding quality.
Problem 4: Understanding and Implementing UMAP
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that is particularly effective for visualizing high-dimensional data in low-dimensional spaces. This problem will guide you through understanding, implementing, and analyzing UMAP using Python.
(a) Introduction to UMAP
1. What is UMAP? Research and provide a brief explanation of what UMAP does and how it works. Include a discussion of the following:
• The mathematical foundation of UMAP (manifold learning, topology, etc.).
• The main parameters of UMAP (e.g., n_neighbors, min_dist) and their effects on the embedding.
• A comparison of UMAP with other dimensionality reduction methods such as PCA and t-SNE.
2. Why is UMAP particularly well-suited for visualizing high-dimensional datasets?
(b) Dataset Preparation
Download a high-dimensional dataset of your choice for analysis. For example:
• MNIST: Handwritten digit images (available in sklearn.datasets).
• ESOL, ChEMBL, or QM9: Chemical datasets containing molecular features.
• MOSES: Molecular datasets with SMILES strings.
Use the following Python code snippet to load the MNIST dataset as an example:
from sklearn.datasets import fetch_openml
import pandas as pd

# Load MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
X = mnist.data
y = mnist.target
print(f"Shape of data: {X.shape}, Shape of labels: {y.shape}")
Answer the following questions:
1. What is the dimensionality of the dataset?
2. How would you preprocess this dataset for UMAP? Perform any necessary preprocessing steps, such as scaling or normalization, and provide the Python code.
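For MNIST, for example, a simple preprocessing sketch is shown below; scaling pixel intensities to [0, 1] and working with a random subsample are illustrative choices that keep the UMAP and t-SNE runtimes manageable:

import numpy as np

# Scale pixel intensities from [0, 255] to [0, 1] and convert labels to ints
X = np.asarray(X, dtype=float) / 255.0
y = np.asarray(y).astype(int)

# Optional: use a random subsample to keep the embeddings fast
rng = np.random.default_rng(42)
idx = rng.choice(X.shape[0], size=10000, replace=False)
X, y = X[idx], y[idx]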
(c) Implementing UMAP
Use the umap-learn Python library to reduce the dimensionality of your dataset to 2 dimensions for visualization. Here is a code snippet to get started:
import umap.umap_ as umap
import matplotlib.pyplot as plt

# Initialize and fit UMAP
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_embedded = reducer.fit_transform(X)

# Plot the embedding (convert string labels to ints for the colormap)
plt.figure(figsize=(8, 6))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y.astype(int), cmap='Spectral', s=5)
plt.colorbar(label="Digit Label")
plt.title("UMAP Projection of MNIST Dataset")
plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.show()
Questions:
1. What do the parameters n_neighbors and min_dist control in the UMAP algorithm? Experiment with different values for these parameters and describe their effects on the embedding; a sketch of one possible setup follows after these questions.
2. How does the UMAP embedding compare with the original high-dimensional data?
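A sketch of the parameter experiment in question 1 (assuming the preprocessed X and y from part (b); the parameter values are illustrative):

import umap.umap_ as umap
import matplotlib.pyplot as plt

# Illustrative (n_neighbors, min_dist) combinations to compare
param_sets = [(5, 0.0), (15, 0.1), (50, 0.5)]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, (nn, md) in zip(axes, param_sets):
    emb = umap.UMAP(n_neighbors=nn, min_dist=md, random_state=42).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y.astype(int), cmap='Spectral', s=3)
    ax.set_title(f'n_neighbors={nn}, min_dist={md}')
plt.tight_layout()
plt.show()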
(d) Analyzing UMAP Results
After generating the 2D embedding, analyze the results:
1. Identify any clusters in the 2D projection. Do these clusters correspond to meaningful patterns in the original data (e.g., digit classes in MNIST or chemical properties in molecular datasets)?
2. Compute the pairwise distances between points in the original high-dimensional space and compare them with distances in the 2D embedding (see the sketch after this list for one possible approach). What can you infer about UMAP's ability to preserve local versus global structures?
3. For chemical datasets, relate the UMAP clusters to specific molecular properties such as polarity or molecular weight. Are there clear separations between different types of molecules?
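For question 2, one possible sketch compares pairwise distances on a small random subsample (computing all pairwise distances on the full dataset would be expensive); the subsample size and the use of a Spearman correlation are illustrative choices:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Compare pairwise distances before and after embedding on a random subsample
rng = np.random.default_rng(0)
idx = rng.choice(X_embedded.shape[0], size=500, replace=False)

d_high = pdist(np.asarray(X)[idx])  # distances in the original space
d_low = pdist(X_embedded[idx])      # distances in the 2D embedding

rho, _ = spearmanr(d_high, d_low)
print(f"Spearman correlation between distance sets: {rho:.3f}")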
(e) Comparison with PCA and t-SNE
Perform dimensionality reduction on the same dataset using PCA and t-SNE for comparison. Use the following code snippets:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Plot PCA
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y.astype(int), cmap='Spectral', s=5)
plt.colorbar(label="Digit Label")
plt.title("PCA Projection of MNIST Dataset")
plt.xlabel("PCA Dimension 1")
plt.ylabel("PCA Dimension 2")
plt.show()

# Plot t-SNE
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y.astype(int), cmap='Spectral', s=5)
plt.colorbar(label="Digit Label")
plt.title("t-SNE Projection of MNIST Dataset")
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.show()
Questions:
1. How do the embeddings generated by PCA, t-SNE, and UMAP differ in terms of cluster separation and overall structure?
2. Which method appears to work best for this dataset, and why? Consider factors such as local and global structure preservation, computational efficiency, and interpretability.
3. Reflect on the strengths and limitations of UMAP compared to PCA and t-SNE.
(f) Applications of UMAP in Chemistry
Provide examples of how UMAP can be applied to chemical datasets. Possible applications include:
1. Visualizing chemical space to identify clusters of similar molecules.
2. Analyzing high-dimensional molecular features for drug discovery.
3. Reducing the dimensionality of quantum chemical datasets for machine learning models.
Discuss how UMAP's ability to preserve local structure can be beneficial in each of these scenarios.
Submission: Submit a report containing:
• Python code for each part of the problem.
• Visualizations of the UMAP, PCA, and t-SNE embeddings.
• Written answers to all questions and interpretations of the results.
• An analysis of UMAP’s applications in chemistry.