Big Biomedical Data Analysis

2024-07-12

字数统计: 9.1k字 | 阅读时长≈ 40分

Big Biomedical Data Analysis

DAY 1

Environment

Download Anaconda Distribution | Anaconda

下载安装后，搜索Anaconda Navigator，打开后运行JupyterLab，在里面敲代码就行

JupyterLab快捷键+调试技巧

Translate Your Name Into DNA Code

The corresponding dictionary:

letterToDna = {
    'A': 'GCT', 'B': 'GCA', 'C': 'TGC', 'D': 'GAT',
    'E': 'GAG', 'F': 'TTT', 'G': 'GGG', 'H': 'CAT',
    'I': 'ATA', 'J': 'ATC', 'K': 'AAG', 'L': 'CTC',
    'M': 'ATG', 'N': 'GAC', 'O': 'GAT', 'P': 'CCC',
    'Q': 'GAG', 'R': 'CGT', 'S': 'TCA', 'T': 'ACT',
    'U': 'ACG', 'V': 'GTC', 'W': 'TGG', 'X': 'GTA',
    'Y': 'TAC', 'Z': 'TAT'
}

encode name

#Replace with dictionary
encodedName = [letterToDna[item] for item in "NICK"]
TName="".join(encodedName)
TName
//'GACATATGCAAG'

Base Paring

（碱基配对）

The four bases in DNA are adenine (A), cytosine (C), guanine (G), and thymine (T).

These bases form specific pairs

A with T
G with C

def reverseComplement(seq):
    dic = {"A":"T","T":"A","C":"G","G":"C"}
    resq = [dic[item] for item in seq]
    resq.reverse()
    resq = "".join(resq)
    return resq

1
2
3

RTName = "".join(reverseComplement(TName))
RTName
//'CTTGCATATGTC'

Find Most Frequent 5-mers

K-Mer : a sequence with length of K e.g. 5-mers

def findTop5Mers(sequence):
    k = 5  # k-mer length
    top_kmers = []
    kmer_counts = {}

    # Count occurrences of each k-mer
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i+k]
        if kmer in kmer_counts:
            kmer_counts[kmer] += 1
        else:
            kmer_counts[kmer] = 1

    # Sort the k-mers based on their counts in descending order
    sorted_kmers = sorted(kmer_counts, key=kmer_counts.get, reverse=True)

    # Get the top 5 k-mers
    top_kmers = sorted_kmers[0]

    return top_kmers

DAY 2

DNA data analysis

1)Download a DNA sequence

Ensembl
NCBI
UCSC genome browser

2)Sequence alignment

Sequence 1: ATCGTACG Sequence 2: ATCGAAG

To align these sequences, we might introduce gaps ("-") to indicate insertions or deletions:

Sequence 1: A--TCGTACG Sequence 2: ATCG--AAG

Edit distance:

Why we want to align sequences?

Essentially, sequence alignment is used to find the distance between “sequences”

A lot of applications:

Assembly the genome

Quantify gene expression

Study the conservation between species

Understand the evolution

3)Find a DNA motif

DNA motif（DNA模体），也被称为DNA序列模式或DNA序列基元，在基因组中是指一种特定的短序列。DNA模体在基因组中以重复的形式存在，并且在不同的基因或物种中可能具有保守性。

transcription factor (TF) :转录因子

TF binding pattern-motif

Example: TATA BOX <-> TBP

TATABOX: TATAWAWA (W:[A/T])

How to identify DNA motifs?

Enrichment analysis

富集分析（Enrichment analysis）是一种用于识别基因组中具有显著富集的功能或关联的方法。

An example:

JUND: Transcription factor involved in cell growth, immune response, and tumor growth.
JUNB: Transcription factor regulating cell cycle, immune response, and tumor development.
FOS: Transcription factor family involved in cell processes, stress response, and tumor formation.
IRF1: Interferon regulatory factor playing a role in immune response and gene expression regulation.
IRF2: Interferon regulatory factor involved in development and gene transcription control.
ATF2: Transcription factor important for stress response, DNA repair, and cell regulation.

CHIP-Seq（Chromatin Immunoprecipitation Sequencing）

先使用抗体特异性地富集某个特定的蛋白质与染色质结合的DNA片段。这个蛋白质可以是转录因子、组蛋白修饰酶或其他与染色质相互作用的蛋白质。通过与特定抗体结合，这些蛋白质与其结合的染色质区域可以被选择性地富集。

然后将富集的DNA片段经过适当的处理和准备后进行高通量测序。通过测序，可以得到与染色质结合的DNA片段的序列信息。

Download reads (fastq)SRA Toolkit

Map the reads to the reference genome

tool：Bowtie 2: fast and sensitive read alignment (sourceforge.net)

Process mapped reads

(e.g., sort bam)

Peak calling method

tool：MACS2

Identify enriched motifs(TFs)

4)Identify a DNA mutation

DAY 3

RNA-seq data analysis

RNA与DNA区别：RNA只包含“excellent部分”，DNA包含内含子

Quantify gene expression

即将RNA样本对应到基因上，通过匹配量的比例量化基因的表达量

（例如一组六个RNA，有4个映射到了A，两个映射到B；另一组十个是3:7，那么第一组B的表达量认为高于第二组）。

Linear regression

建立自变量（特征）与因变量之间线性关系

$\hat{y} = \beta_0 x_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$

How to search the parameters?

1）Brute-force

2）Gradient descent

残差平方和：

$ E = _{i=1}^{m} (y_i - _i)^2$

目标——最小化E

Logistic regression

将线性回归模型与sigmoid函数结合，以将线性输出映射到概率值。sigmoid函数将输入值压缩到0和1之间，表示样本属于某个类别的概率。

$P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)}}$

Support-vector machine (SVM)

一种用于分类和回归的机器学习算法

原理：

数据准备：收集带有标签的训练数据，其中每个样本都有对应的特征向量和类别标签。
特征缩放：根据需要对特征进行缩放，以确保它们具有相似的尺度。
定义超平面：SVM 的目标是找到一个最优的超平面，使得两个不同类别的样本被最大化的间隔分开。间隔是指超平面与离它最近的支持向量之间的距离。
核函数（Kernel Function）：如果样本在原始特征空间中无法线性分隔，可以使用核函数将样本映射到高维特征空间，使其线性可分。
训练模型：通过求解优化问题，找到将样本正确分隔的最优超平面。优化问题的目标是最大化间隔，并且可以使用凸优化方法（例如二次规划）来解决。
模型评估：使用测试数据评估模型的性能，常见的评估指标包括准确率、精确率、召回率和 F1 分数等。
模型预测：使用训练好的模型对新样本进行分类预测。

Linear/non-linear classifier

通过Kernel Function使数据线性可分

Decision Tree

另一种分类方法

用树来表示决策规则，并根据特征的值进行逐步的条件分割

Random forest

根据多个Decision Tree，评判出最合适的Class （不容易过拟合）

random forest models, random forest classification, ml, dl, machine learning system

over-fitting

Regularization

通过在损失函数中添加正则化项来约束模型参数的大小，防止过拟合

L1-regularization

$Regularization Term= λ * ||w||$

L2-regularization

$Regularization Term= λ * ||w||^2$

An example data

Pima Indians Diabetes Database (kaggle.com)

Request : Build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes or not

steps:

read and process the raw dataset => extract X and Y

train a classifier model using scikit-learn

make prediction and calculate accuracy

take a screenshot and report your accuracy

(optional), try to incorperate techniques for anti over-fitting

(optional), if you implemented multiple methods, build a meta-learner (ensemble approach)

在JupyterLab创个目录，把下载下来的数据和代码都放到那个目录下面

直接在代码中运行

1 2	!pip3 install scikit-learn !pip install matplotlib

下载这个scikit-learn库，然后可以使用class sklearn.ensemble.RamForestClassifier来对数据进行分析，搭建预测模型

这个类的doc：RandomForestClassifier — scikit-learn 1.5.1 documentation

（也可以用这个包里的其他类，参照API Reference — scikit-learn 1.5.1 documentation，毕竟拟合的方法不止一种，但是我不会。。。）

后面那个matplotlib库是用来画图用的

import pdb, sys, os
import csv
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# 打开数据文件
f = open("diabetes.csv", "r")
lf = f.readlines()
f.close()

# 解析数据文件
lf = [item.strip().split(",") for item in lf]
FNames = lf[0][0:-1]  # 特征名称列表
X = [item[0:-1] for item in lf[1:]]  # 获取特征列表
Y = [item[-1] for item in lf[1:]]  # 获取是否患有糖尿病列表

# 转换数据类型为浮点数
X = [[float(k) for k in item] for item in X]
Y = [float(item) for item in Y]

# 构建随机森林分类器
rf_jd = RandomForestClassifier(max_depth=2, random_state=0)
#max_depth：决策树的最大深度。控制每棵决策树的最大深度限制，即限制树的分支层数。设置较小的深度可以限制模型的复杂度，有助于防止过拟合。可以试着改改看效果
#random_state：随机数生成器的种子值（说是随机，其实是伪随机）。用于控制随机森林中的随机性源，这样可以保证每次运行代码时能获得相同效果


# 在训练数据上拟合随机森林模型
rf_jd.fit(X, Y)

# 使用拟合的模型对训练数据进行预测
Ypre = rf_jd.predict(X)

# 计算预测正确的样本数
correct_predictions = list(Ypre == Y).count(True)

# 打印模型在训练数据上的准确率
print("Accuracy:", rf_jd.score(X, Y))

# 计算特征重要性
FS = rf_jd.feature_importances_

# 绘制特征重要性条形图
plt.bar(FNames, FS)
plt.xticks(rotation=90)
plt.show()

Homework 1

description:

you need to build a classifier (e.g., random forest or svm) for the prediction of TB on HIV patients.
steps :

download the dataset from the NIH GEO database https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE162164&format=file

annotate the patients with its phenotype in other words, some patients are HIV only, the rest are HIV + TB

do some basic file reading and processing (convert float values)

train a model, could be tricky, the performance could be very bad

you need some tricks to minimize the number of features (some feature selection to reduce the feature space) , for example, if you find a gene that is not very different from HIV vs HIV+TB, then you know this feature won't be important

you train the model and calculate the accuracy, report it

write a report (jupyter notebook, detailing each of the steps and results) you need also to tell me what is the biomarker (the most critical feature for the TB+HIV disease)

select features

见后文的p-value

ROC

（Receiver Operating Characteristic）

ROC（Receiver Operating Characteristic）曲线是一种常用的评估分类模型性能的工具。它以分类模型在不同阈值下的真阳性率（True Positive Rate，TPR，也称为灵敏度或召回率）和假阳性率（False Positive Rate，FPR）之间的关系为基础。

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# 模拟的分类预测结果和真实标签
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.2, 0.9, 0.8, 0.3, 0.5])

# 计算ROC曲线的FPR和TPR
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# 计算AUC（Area Under Curve）
auc = roc_auc_score(y_true, y_scores)

# 绘制ROC曲线
plt.plot(fpr, tpr, label='ROC curve (AUC = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], 'k--')  # 绘制对角线
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

通过 plot 函数绘制了 ROC 曲线，其中 fpr 是 X 轴，tpr 是 Y 轴。对角线表示随机猜测的曲线，而我们的目标是使 ROC 曲线尽可能靠近左上角，即 TPR 尽可能大，FPR 尽可能小。

Cross validation

K-fold cross-validation

将数据集随机分成 K 个大小相似的子集（也称为折）。
选择其中一个子集作为测试集，将其余 K-1 个子集作为训练集。
在训练集上训练模型，并在测试集上评估模型的性能。
重复步骤 2 和 3，每次选择不同的子集作为测试集，直到每个子集都被用作测试集。
对 K 次评估结果进行平均，得到最终的性能指标。

因为作业提供的数据量较少（只有25个），可以使用Leave-One-Out，选择24个进行训练，一个进行测试，然后循环测试25次以计算准确度。

DAY 4

Visualize the data

PCA

Essentially, it’s linear transformation

可以理解为投影，用于将高维数据转换为低维表示，同时保留数据集的最大方差。

（降低维度但是尽可能不损失特征）

PCA通过线性变换将原始特征映射到新的正交特征空间，新特征称为主成分。

t-SNE (and UMAP)

也是降维技术，用于可视化高维数据。

但是是非线性的

Clustering methods

K-Means

聚类算法，用于将数据集划分为K个不同的簇（clusters）。每个簇由其内部的数据点组成，这些数据点在特征空间中彼此相似。K-Means算法的目标是最小化数据点与其所属簇中心的平方距离之和。

Hierarchical clustering

聚类算法，它通过逐步合并或分割数据点来构建聚类层次结构。

自顶向下的方法，开始时所有数据点都在一个簇中，然后通过递归，将簇分成更小的子簇来构建聚类层次。

Density estimator

密度估计，根据数据密度估计不同位置数据出现的概率，可以理解为根据数据的密度拟合一个函数

Leiden clustering ( )

Modularity is a measure of the structure of networks graphs which measures the strength of division of a network into modules (also called groups, clusters or communities).

用于在一个网络中实现聚类

Clustering evaluation metrics

Silhouette Score $= (b-a)/max(a,b)$

1表示样本与其自身的簇非常相似，0表示样本位于两个簇的边界上，-1表示样本更适合与其他簇

Practice

Also this data:

Pima Indians Diabetes Database (kaggle.com)

导包和数据初级处理

import pdb, sys, os
import csv
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# 打开数据文件
f = open("diabetes.csv", "r")
lf = f.readlines()
f.close()
# 解析数据文件
lf = [item.strip().split(",") for item in lf]
FNames = lf[0][0:-1]  # 特征名称列表
featureSet = [item[0:-1] for item in lf[1:]]  # 输入特征列表
outcomeSet = [int(item[-1]) for item in lf[1:]]  # 输出标签列表
#数据类型转换
featureSet=np.array(featureSet,dtype=float)

用PCA降为二维

pca = PCA(n_components=2)  # 指定要降到的维度，这里选择2维.方便绘图
reduced_data = pca.fit_transform(featureSet)

xr_x=[item[0] for item in reduced_data]
xr_y=[item[1] for item in reduced_data]

# 定义自定义颜色映射
colors = ['purple', 'yellow']  # 1对应紫色，0对应黄色
cmap = ListedColormap(colors)
plt.scatter(xr_x, xr_y, c=outcomeSet, cmap=cmap)

用t-SNE降维

#创建 t-SNE 模型并将降维后的数据拟合到模型中
tsne = TSNE(n_components=2)
tsne_data = tsne.fit_transform(featureSet)

tsne_x=[item[0] for item in tsne_data]
tsne_y=[item[1] for item in tsne_data]

plt.scatter(tsne_x, tsne_y, c=outcomeSet, cmap=cmap)

K-means Clustering

# 创建 K-means 模型并指定聚类簇数为2
kmeans = KMeans(n_clusters=2, random_state=0)

# 将数据拟合到 K-means 模型中并进行聚类
labels = kmeans.fit_predict(featureSet)

# 获取聚类标签
cpred = kmeans.labels_

可视化聚类成果

#PCA
plt.scatter(xr_x, xr_y, c=cpred, cmap=cmap)

#t-SNE
plt.scatter(tsne_x, tsne_y, c=cpred, cmap=cmap)

DAY 5

Graphical models

Graph

A graph consists of nodes and edges, where nodes represent variables and edges represent the relationships or dependencies between variables.

Bayesian network

Inference

p(x|e)=p(x,e)/p(e)

2Parameter learning

P(A|B)=N(mu,sigma) -->正态分布

通过观察数据来估计贝叶斯网络中的节点的条件概率分布的参数(估计正态分布的参数)

Structure learning

观察数据中这些变量的相互关系，并试图找出最合适的有拓扑结构来表示这些关系。

```python #将矩阵展平为一维数组，并将结果存储在original_flattened中 original_flattened = matrix.flatten() new_flattened = new_matrix.flatten() #pearsonr()函数计算展平后的original_flattened和new_flattened之间的皮尔逊相关系数和p值。 correlation_coefficient, p_value = pearsonr(original_flattened, new_flattened)


  

#### p-value

H0: coin is fair (50% chance for head/tail)

Observation: 10 tests, 9 heads

p-value: the probability of observing 9 heads (and more) by random

p=1-pbinom(9-1,10,0.5)=0.01074219 (1%)

<cutoff (often 5% or 1%), reject the H0

Conclusion (coin is unfair), the probability of wrong conclusion is around 1%

p=1-pbinom(7-1,10,0.5)=0.171=17%

 

> `pbinom(k, n, p)` 计算了二项分布的累积分布函数，其中：
>
> - `k` 是观察到的随机变量的值（或小于该值），
> - `n` 是试验的总次数，
> - `p` 是每次试验成功的概率。



使用 p-value能够进行特征选择

这里是假设x和y没有关系

```python
import numpy as np
from scipy import stats

# 假设有两个特征 X1 和 X2，以及目标变量 y
X1 = np.array([1, 2, 3, 4, 5])
X2 = np.array([6, 7, 8, 9, 10])
y = np.array([0, 1, 0, 1, 0])

# 使用 t-test 计算特征 X1 和目标变量 y 之间的 p-value
t_statistic, p_value = stats.ttest_ind(X1[y==0], X1[y==1])

# 打印 p-value
print("p-value for feature X1:", p_value)

根据所选择的阈值，选择具有低 p-value 的特征，将它们视为重要特征。

特征之间的相关性：p-value 方法通常只考虑单个特征与目标变量之间的关系。如果特征之间存在强相关性，选择一个特征可能会忽略其他与目标变量相关的特征。
多重比较校正：当对多个特征进行假设检验时，需要考虑多重比较校正。常见的方法包括 Bonferroni 校正、Benjamini-Hochberg 过程等，以控制多重比较带来的错误发现率。
预测性能：p-value 方法只考虑了特征与目标变量之间的统计关系，并未直接衡量特征对于预测性能的贡献。在某些情况下，即使 p-value 较高的特征在预测模型中也可能具有重要作用。

Mann-whitney U test

H0: the probability of X being greater than Y is equal to the probability of Y being greater than X.

(原假设：两组之间的分布没有差异)

若p值低于预定的显著性水平，则表明有足够的证据来拒绝原假设，并得出结论认为两组之间存在显著差异。相反，如果p值高于显著性水平，则没有足够的证据拒绝原假设，表明两组之间的差异在统计上不显著。

Markov chain

马尔可夫链

（略）

High-order Markov Chain

考虑核苷酸链：

ACGTACTTCGAGGTTTTTAAACTACTACT

观察二阶马尔可夫转移概率，转移概率表示基于观察到的序列，从一个状态（二聚体）转移到另一个状态的可能性。

导包

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import defaultdict
import random
from scipy.stats import pearsonr

从文件中读取DNA序列

1 2	with open('data/DNA.txt', 'r') as file: dna_sequence = file.read().strip()

统计转移计数

transition_counts = defaultdict(lambda: defaultdict(int))
#创建一个嵌套的默认字典(defaultdict)结构，用于存储转移计数。它的结构是transition_counts[pair][next_base]，初始值为0。

for i in range(len(dna_sequence) - 2):
    #遍历DNA序列每个位置。对于每个位置，提取当前位置及其后两个位置的碱基，将其作为一个碱基对(pair)。获取下一个位置的碱基作为下一个碱基(next_base)。
    pair = dna_sequence[i:i+2]
    next_base = dna_sequence[i+2]
    transition_counts[pair][next_base] += 1

计算转移概率

#保证每行概率之和为1
transition_probabilities = defaultdict(dict)

for pair, next_bases in transition_counts.items():
    total_transitions = sum(next_bases.values())
    for base, count in next_bases.items():
        transition_probabilities[pair][base] = count / total_transitions

定义碱基和碱基对

1 2	bases = ['A', 'C', 'G', 'T'] pairs = [a+b for a in bases for b in bases]

创建转移概率矩阵

matrix = np.zeros((16, 4))

for i, pair in enumerate(pairs):
    for j, base in enumerate(bases):
        matrix[i, j] = transition_probabilities[pair].get(base, 0)

绘制原始序列的转移概率热力图

plt.figure(figsize=(10, 8))
sns.heatmap(matrix, xticklabels=bases, yticklabels=pairs, annot=True, cmap='viridis')#使用Seaborn库绘制热力图
plt.title('Transition Probabilities Heatmap - Original Sequence')
plt.xlabel('Next Base')
plt.ylabel('Base Pair')
plt.savefig('original_transition_heatmap.png')#保存图片至本地

生成新的DNA序列

def generate_next_base(pair, transition_probabilities):
    # 从转移概率字典中提取碱基和对应的概率
    bases, probabilities = zip(*transition_probabilities[pair].items())
    # 根据概率随机选择下一个碱基
    return random.choices(bases, probabilities)[0]

new_sequence = list(dna_sequence[:2])

k = 5000
for _ in range(k):
    pair = ''.join(new_sequence[-2:])
    next_base = generate_next_base(pair, transition_probabilities)
    new_sequence.append(next_base)

new_sequence = ''.join(new_sequence)

绘制生成序列的转移概率热力图

代码跟前面一样

计算原始转移概率矩阵和生成转移概率矩阵之间的相关系数（p-value）

皮尔逊相关系数接近于1时，表示两个变量之间存在强正相关关系
当皮尔逊相关系数接近于-1时，表示两个变量之间存在强负相关关系
当皮尔逊相关系数接近于0时，表示两个变量之间不存在线性关系，或者存在非线性关系。
在皮尔逊相关系数的计算中，p值用于判断相关系数是否显著不为零，即是否存在真正的线性关系。
当p值小于设定的显著性水平（例如0.05）时，我们会拒绝零假设，即认为相关系数是显著不为零的，存在线性关系。而当p值大于显著性水平时，我们无法拒绝零假设，即不能得出相关系数显著不为零的结论。

#将矩阵展平为一维数组，并将结果存储在original_flattened中
original_flattened = matrix.flatten() 
new_flattened = new_matrix.flatten() 
#pearsonr()函数计算展平后的original_flattened和new_flattened之间的皮尔逊相关系数和p值。
correlation_coefficient, p_value = pearsonr(original_flattened, new_flattened)

Hidden Markov Model (HMM)

一个具有马尔可夫性质的系统，其状态是不可观测的（隐藏状态），但我们可以观测到与状态相关的一系列可见数据。并根据观测到的数据和系统状态的转移概率预测系统的状态。

（高中的方法做就行）

Initial probabilities
Transition probabilities

表示系统从一个状态转移到另一个状态的概率。这些概率可以用状态转移矩阵来表示，矩阵的元素包括从一个状态到另一个状态的转移概率。
Emission probabilities P(O|S)

表示在特定状态下观察到某个可见数据的概率。这些概率可以用观测矩阵来表示，矩阵的元素包括在每个状态下观察到每个可见数据的概率。

DAY 6

Single-cell genomics

C1 Most importantly, human eyes can’t see anything beyond 3D

Curse of dimensionality

Analyzing of the high dimensional data often suffers from the curse of dimensionality

The searching space increases exponentially.

Neighbors of each data point also increase exponentially.

Distances are on longer informative

Dimensionality reduction

Linear F

PCA

Non-linear F

Non-linearly separable data

t-SNE (U-MAP)

Auto-encoder

PCA

Find a linear transformation to project the data from HD to LD space that minimize the projection error.

t-SNE

1)Measuring the distance in higher dimensional space (Gaussian distribution)

2)Measuring the distance in lower dimensional space (long-tail student t distribution)

3)The locations of the points in the LD space (y) are determined by minimizing the (non-symmetric) Kullback–Leibler divergenceof the distribution P from the distribution Q. Then use the gradient descent to search the y(i) that minimize the KL divergence C.

Diffusion Map

是一种非线性降维技术

假设对一个马尔科夫链，他的转移概率矩阵为M，那他的后两次地转移概率矩阵就是$M^2$，这就叫Diffusion

Diffusion Map 利用了数据的扩散过程，通过计算数据点之间的相似性和扩散距离来构建低维表示。它可以在保留数据结构的同时，更好地捕捉非线性关系和聚类结构。

Autoencoder

也是一种非线性降维技术

由编码器（Encoder）和解码器（Decoder）组成

将输入数据压缩到低维编码表示，然后再将其解码为原始输入的重构。自编码器可以用于降维、特征学习、数据去噪等任务。

C2. Single-cell data is very noisy

MAGIC

修复单细胞中缺失值的方法

MAGIC (Markov Affinity-based Graph Imputation of Cells) is a computational method used in single-cell RNA sequencing (scRNA-seq) data analysis. It is designed to address the problem of missing gene expression values in scRNA-seq datasets.

Single-cell data denoising

1)Remove cells with low # of expressing genes

2)Remove cells with high % of mitochondrial reads

C3. Single-cell data is enormously heterogeneous

——Clustering

Supervised neural network for clustering

一种结合了监督学习和聚类的方法。传统的聚类方法（如k-means、DBSCAN等）通常是无监督的，它们根据数据的内在结构将样本分组成类别。相比之下，监督学习方法需要已知的标签或目标来进行训练和预测。

Variational autoencoder

Annotate cluster

A few existing solution:

Use marker genes
Use functional analysis (e.g., GO enrichment)
Compare with expression data with known cell types

C4. Reconstructing trajectories from Single-cell data

Infer the cell dynamics (the cellular state change over time) from single-cell data (often time-series)

Monocle

一个用于单细胞RNA测序（scRNA-seq）数据分析的软件包，用于推断和分析细胞发育轨迹。它提供了一系列的算法和工具，帮助研究者从单细胞数据中推断细胞类型、发育状态和细胞转录组的动态变化。

PAGA

基于细胞间的相似性计算构建细胞之间的连接关系，并根据这些连接关系构建一个图形结构。

这个图形结构可以被视为细胞社群的一种抽象，其中每个节点代表一个细胞，边代表细胞之间的关系。

PAGA 使用图论算法来计算和揭示细胞之间的关联和转换。

1)Graph partitioning and abstraction

2)Pseudo-time estimation

3)Preserving Graph topology across resolutions

C5. Reconstructing the regulatory networks underlying the trajectories

Infer the transcription factors and pathways that dictate the cellular dynamics

GENIE3

Gene Network Inference with Ensemble of trees

可以从基因表达数据中推断出基因之间的调控关系和网络拓扑结构

基于Decision Tree的集成方法，利用Random Forest的思想进行基因网络推断。

SCDIFF

Single-Cell Differential Expression Analysis

确定在不同细胞群体之间具有显著差异表达的基因。它通过统计学方法来鉴定差异表达基因，并提供一种量化的方式来衡量差异的显著性。

DAY 7

scanpy

.hg19

可以看成一个矩阵

列名是细胞名

行名是基因名

每个行列对应位置为表达量

AnnData object

10X数据读入和创建AnnData对象:

1	adata=sc.read_10x_mtx('data/filtered gene bc matrices/hg19')

普通矩阵数据读入和创建AnnData对象

matrix = pd.read_table('data/test.tsv', index_col=0)
matrix = pd.DataFrame(matrix.values.T, index=matrix.columns, columns=matrix.index)
matrix.to_csv('data/test.csv')

adata = sc.read_csv("data/test.csv")

.X：一个二维的NumPy数组，表示存储在AnnData对象中的数据矩阵。每一行代表一个细胞，每一列代表一个基因，数据是表达量
.obs：返回细胞注释的表格（列名），其中每一行代表一个观测，每一列代表一个注释信息。.obs可以包含有关每个观测的元数据，如细胞类型、样本来源、处理条件等。
.var：返回基因注释的表格（行名），其中每一行代表一个变量，每一列代表一个注释信息。.var可以包含有关每个变量的元数据，如基因名称、基因类型、基因功能等。
.uns：返回一个未命名结构（Unstructured）注释，是一个字典对象，用于存储与数据相关的其他注释信息。

QC

主要是两个函数进行行和列得过滤

# 过滤细胞（cells），保留至少具有200个基因的细胞
sc.pp.filter_cells(adata, min_genes=200)

# 过滤基因（genes），保留至少在3个细胞中表达的基因
sc.pp.filter_genes(adata, min_cells=3)

计算线粒体基因

1 2	adata.var['mt'] = adata.var_names.str.startswith('MT-') sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

sc.pl.violin(adata, keys=['n_genes_by_counts', 'total_counts', 'pct_counts_mt'], jitter=0.4, multi_panel=True)
plt.savefig("plot/gc_count.pdf")

adata = adata[adata.obs['n_genes_by_counts'] < 2500, :]
adata = adata[adata.obs['pct_counts_mt'] < 5, :]

'n_genes_by_counts'：表示每个细胞中基因计数的数量。它
'total_counts'：表示每个细胞中的总计数。
'pct_counts_mt'：表示线粒体基因计数在总计数中的比例。它可以用来评估细胞中线粒体基因表达的相对程度，从而间接反映细胞质量和代谢状态。

画出小提琴图来确认需要过滤的数值范围

Data Standardization

数据标准化

1 2	sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata)

Scanpy的默认标准化方法是lognormalize：

每个细胞的某一count先除以该细胞的总count，然后乘以scale因子-10000，再做个对数转换log

标准化目的：

在sCRNA-seg中，由于每个细胞的起始转录分子量有限，每个细胞中转录本的捕获以及扩增效率都会有技术差异，因此很难保证样本之间在文库制备上保持高度的一致性。这也造成了多个样本的测序数据中会在在由于文库测序覆盖率(seguencing coverage)不同而引入的系统差异。数据的标准化目的就是消除这些差异，使得我们得到的分析结果不受技术噪音的影响。

小结：消除技术差异的影响

Feature Selection

特征选择

寻找高变基因

即寻找只在某些细胞中表达的基因

sc.pp.highly_variable_genes(adata, n_top_genes=2000)
#绘制
sc.pl.highly_variable_genes(adata)
plt.savefig("plot/highly_variable_genes.pdf")

n_top_gene：表示用于举例的高变基因的数目

Data Normalization

数据归一化

adata.raw = adata	#将原始数据存储在 adata.raw 中
adata = adata[:, adata.var['highly_variable']]	#修改对 adata 的筛选方式
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)

归一化：每个标准化之后的值减去该基因在所有细胞中的均值再除以该基因的标准差value = (exp - mean) / sd

sc.pp.regress_out：回归掉指定的影响因素。通过线性回归分析，去除总计数（'total_counts'）和线粒体基因计数占比（'pct_counts_mt'）对基因表达的影响。这些因素可能导致不同细胞之间的亮度偏移或批次效应。

sc.pp.scale：对数据进行缩放，使数据的值范围在0到指定的最大值（这里是10）之间。

评估细胞周期的影响

1	sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)

Data Dimensionality Reduction

数据降维

PCA

1
2
3

sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True)
plt.savefig("plot/pca_variance.pdf")

Clustering

聚类

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.tsne(adata, n_pcs=40)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['CD79A', 'MS4A1', 'IG', 'CD3D', 'FCER1A', 'FCGR3A', 'n_counts', 'bulk_labels'], s=50, frameon=False, ncols=4, vmax='p99')
plt.savefig("plot/umap_cluster_feature.pdf")
adata.write('data/test.h5ad')

sc.pp.neighbors：计算邻居关系，

n_neighbors=10: 用于指定每个样本要考虑的邻居数量。
n_pcs=40: 用于指定要在计算邻居关系之前保留的主成分数量。这里设置为40，表示在计算邻居关系之前，先对数据进行PCA降维，并保留40个主成分。

sc.tl.umap、sc.tl.tsne：降维便于可视化

sc.tl.leiden：聚类

sc.pl.umap：根据umap结果作图

color: 表示要在 UMAP 图上显示的颜色。它应该是一个包含要显示的基因或特征名称的列表。在这个例子中，使用了 ['CD79A', 'MS4A1', 'IG', 'CD3D', 'FCER1A', 'FCGR3A', 'n_counts', 'bulk_labels']，将这些基因或特征的表达情况映射到 UMAP 图上的颜色。bulk_labels 是表示样本或细胞的批次标签
s: 表示在 UMAP 图上显示的点的大小，默认值为 20。在这个例子中，设置为 50，将点的大小调整为 50。
frameon: 表示是否显示图像周围的边框，默认为 True，即显示边框。在这个例子中，设置为 False，将边框去掉。
ncols: 表示颜色图例的列数，默认为 1，即只有一列。在这个例子中，设置为 4，将颜色图例分成 4 列。
vmax: 表示颜色映射的上限，默认为 None，即自动确定上限。在这个例子中，设置为 'p99'，将颜色映射的上限设置为数据的 99th 百分位数。

Finding differentially expressed genes

寻找差异基因，并且可视化

1
2
3

sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
plt.savefig("plot/diff_feature.pdf")

sc.tl.rank_genes_groups：根据聚类结果计算差异基因

sc.pl.rank_genes_groups：绘制基因差异

n_genes=25: 用于指定要显示的前 n 个差异显著的基因数量。这里设置为 25，表示显示前 25 个基因。
sharey=False: 用于指定是否共享 y 轴刻度。这里设置为 False，表示每个基因的条形图使用独立的 y 轴刻度。

绘制差异基因的dotplot图

1	sc.pl.rank_genes_groups_dotplot(adata, n_genes=4)

n_genes=4：选择每个类中前四的差异基因

指定类进行差异基因计算

sc.tl.rank_genes_groups(adata, "leiden", groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
pic("plot/diff_feature_Ovs1.pdf")
# 通过小提琴图，展示差异基因在两者之间的表达量
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)
pic("plot/diff_feature_n_genes.pdf")

sc.tl.rank_genes_groups：

groups=['0']: 指定要计算基因差异的分组。这里设置为 ['0']，表示计算分组 '0' 相对于参考分组 '1' 的基因差异。
reference='1': 指定用作参考的分组。这里设置为 '1'，表示将分组 '1' 作为参考组。
method='wilcoxon': 指定用于计算差异的统计方法。这里使用了 Wilcoxon 秩和检验。

Labeling cell subsets

给细胞亚群添加文字标注

marker_genes_dict = {
    'B-cell': ['CD79A', 'MS4A1'],
    'Dendritic': ['FCER1A', 'CST3'],
    'Monocytes': ['FCGR3A'],
    'NK': ['GNLY', 'NKG7'],
    'Other': ['IGLL1'],
    'Plasma': ['IGJ'],
    'T-cell': ['CD3D'] 
}# 定义标记基因字典

sc.pl.dotplot(adata, marker_genes_dict, 'leiden', dendrogram=True)
sc.pl.savefig('plot/gene.pdf')

dendrogram=True: 表示是否绘制基于基因表达模式的层次聚类树状图。在这个例子中，设置为 True，表示在基因表达图上绘制层次聚类树状图。

将标签表示到UMAP图

# 定义群集到细胞类型的映射字典
cluster2annotation = {
    '0': 'Monocytes',
    '1': 'Dendritic',
    '2': 'T-cell',
    '3': 'NK',
    '4': 'B-cell',
    '5': 'Dendritic',
    '6': 'Plasma',
    '7': 'other',
    '8': 'Dendritic'
}

# 将 'clusters' 列映射为细胞类型，并将其设置为 'cell type' 列
adata.obs['cell type'] = adata.obs['leiden'].map(cluster2annotation).astype('category')

# 绘制基于细胞类型的基因表达图
sc.pl.dotplot(adata, marker_genes_dict, 'cell type', dendrogram=True)
sc.pl.savefig('plot/gene.pdf')

# 绘制 UMAP 并根据细胞类型着色
sc.pl.umap(adata, color='cell type', legend_loc='on data', frameon=False, legend_fontsize=10, legend_fontoutline=2)

Gene expression display

基因表达展示

画出单个基因在不同细胞内表达量的小提琴图

1 2	sc.pl.violin(adata, ['CD79A', 'MS4A1'], groupby='leiden') sc.pl.savefig('plot/gene.pdf')

批量表示的话用堆叠小提琴图

1	ax = sc.pl.stacked_violin(adata, marker_genes_dict, groupby='leiden', swap_axes=False)

绘制热图，类似于dotplot，但是把表达量缩放到0到1了

1 2	sc.pl.matrixplot(adata, marker_genes_dict, 'leiden', dendrogram=True, cmap='Blues',standard_scale='var', colorbar_title='column scaled\nexpression') sc.pl.savefig('plot/gene.pdf')

绘制矩阵图，与前者类似，以z-score进行缩放

adata.layers['scaled'] = sc.pp.scale(adata, copy=True).X

sc.pl.matrixplot(adata, marker_genes_dict, 'leiden', dendrogram=True, colorbar_title='mean z-score',layer='scaled', vmin=-2, vmax=2, cmap='RdBu_r')
sc.pl.savefig('plot/gene.pdf')

绘制每一类中每一个细胞的指定基因的表达情况

sc.pl.heatmap(adata, marker_genes_dict, groupby='leiden', cmap='viridis', dendrogram=True)
sc.pl.savefig('plot/gene.pdf')

#进行缩放再绘制
ax = sc.pl.heatmap(adata, marker_genes_dict, groupby='leiden', layer='scaled', vmin=-2, vmax=2,cmap='RdBu_r', dendrogram=True, swap_axes=True, figsize=(11, 4))
sc.pl.savefig('plot/gene.pdf')

#用tracksplot形式绘制
ax = sc.pl.tracksplot(adata, marker_genes_dict, groupby='leiden', dendrogram=True)
pic('plot/gene.pdf')

聚类树展示

sc.tl.dendrogram(adata, 'bulk labels')

ax = sc.pl.dendrogram(adata, 'bulk labels')
sc.pl.savefig('plot/gene.pdf')

相关性展示

和聚类树一样展现类与类之间的关系

1 2	ax = sc.pl.correlation_matrix(adata, 'bulk labels', figsize=(5, 3.5)) sc.pl.savefig('plot/gene.pdf')

Homework 2

Single-Cell RNA Sequencing Analysis Using Scanpy

Objective:

Learn to process and analyze single-cell RNA sequencing data of peripheral blood mononuclear cells using Scanpy.

Note:

For the introduction and use of Scanpy, you can refer to the official website (https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html)(https://scanpy.readthedocs.io/en/ stable/tutorials/basics/clustering-2017.html) or the reference code "sc.ipynb" provided by your teacher.

Steps:

Download Data: o Acquire dataset either from Scanpy's official tutorial（https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html） or use the "processed.h5ad" file provided.

Preprocessing: o Follow instructions to filter, normalize, and log-transform the data as per the Scanpy tutorials.

Dimensionality Reduction: o Apply PCA to reduce dataset dimensions and visualize with UMAP or t-SNE.

Clustering: o Clustering: Execute clustering on the dataset using the Leiden algorithm or UMAP to detect cell groups.

Visualization: o Generate plots to display clustering and gene expression patterns.

Report: o Document the analysis process, results, and interpretations in a Jupyter Notebook. Deliverables: Submit the Jupyter Notebook with detailed code, plots, and explanations.

DAY 8

本文作者： NICK
本文链接： https://nicccce.github.io/CourseNote/Big-Biomedical-Data-Analysis/
版权声明： 本博客所有文章除特别声明外，均采用 MIT 许可协议。转载请注明出处！

Big Biomedical Data Analysis

DAY 1

Environment

Translate Your Name Into DNA Code

Base Paring

Find Most Frequent 5-mers

DAY 2

DNA data analysis

DAY 3

RNA-seq data analysis

Linear regression

An example data

Homework 1

select features

ROC

Cross validation

DAY 4

Visualize the data

PCA

t-SNE (and UMAP)

Clustering methods

K-Means

Hierarchical clustering

Density estimator

Leiden clustering ( )

Clustering evaluation metrics

Practice

DAY 5

Graphical models

Graph

Bayesian network

Mann-whitney U test

Markov chain

High-order Markov Chain

Hidden Markov Model (HMM)

DAY 6

Single-cell genomics

C1 Most importantly, human eyes can’t see anything beyond 3D

Curse of dimensionality

Dimensionality reduction

Autoencoder

C2. Single-cell data is very noisy

MAGIC

Single-cell data denoising

C3. Single-cell data is enormously heterogeneous

Supervised neural network for clustering

Variational autoencoder

Annotate cluster

C4. Reconstructing trajectories from Single-cell data

Monocle

PAGA

C5. Reconstructing the regulatory networks underlying the trajectories

GENIE3

SCDIFF

DAY 7

scanpy

.hg19

AnnData object

QC

Data Standardization

Feature Selection

Data Normalization

Data Dimensionality Reduction

Clustering

Finding differentially expressed genes

Labeling cell subsets

Gene expression display

Homework 2

DAY 8