# Count occurrences of each k-mer
for i in range(len(sequence) - k + 1):
    kmer = sequence[i:i+k]
    if kmer in kmer_counts:
        kmer_counts[kmer] += 1
    else:
        kmer_counts[kmer] = 1
# Sort the k-mers by count in descending order
sorted_kmers = sorted(kmer_counts, key=kmer_counts.get, reverse=True)
# Get the top 5 k-mers
top_kmers = sorted_kmers[:5]
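The counting loop above can also be written with `collections.Counter`. The following is a minimal sketch, not the original author's code: the function name `top_kmers`, its parameter `n`, and the example sequence are assumptions made for illustration.

```python
from collections import Counter


def top_kmers(sequence, k, n=5):
    """Return the n most frequent k-mers in sequence as (kmer, count) pairs."""
    if k <= 0 or k > len(sequence):
        return []
    # Slide a window of width k along the sequence and tally every k-mer.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # most_common sorts by count in descending order.
    return counts.most_common(n)


# Hypothetical example sequence (not from the original data).
example = "ATATATGC"
print(top_kmers(example, k=2, n=2))
```

Returning the counts together with the k-mers keeps the ranking and the evidence for it in one place.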
import pdb, sys, os
import csv
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
# Open the data file
with open("diabetes.csv", "r") as f:
    lf = f.readlines()
# Parse the data file
lf = [item.strip().split(",") for item in lf]
FNames = lf[0][0:-1]  # list of feature names
X = [item[0:-1] for item in lf[1:]]  # feature rows
Y = [item[-1] for item in lf[1:]]  # diabetes outcome for each row
# Convert the data to floats
X = [[float(k) for k in item] for item in X]
Y = [float(item) for item in Y]
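The imports above suggest that the next step fits a random forest and ranks the features by permutation importance. Here is a minimal sketch of that step under stated assumptions: the helper `rank_features`, the 70/30 split, and the synthetic data standing in for `diabetes.csv` are all illustrative and not taken from the original.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def rank_features(X, Y, feature_names, seed=0):
    """Fit a random forest and rank features by permutation importance."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=int)
    # Hold out a test split so importance is measured on unseen rows.
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=seed, stratify=Y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    # Shuffle one column at a time and record the drop in accuracy.
    result = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=seed
    )
    order = np.argsort(result.importances_mean)[::-1]
    ranking = [(feature_names[i], float(result.importances_mean[i])) for i in order]
    return accuracy, ranking


# Synthetic stand-in for diabetes.csv: only "f0" carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
Y_demo = (X_demo[:, 0] > 0).astype(int)
acc, ranking = rank_features(X_demo, Y_demo, ["f0", "f1", "f2", "f3"])
print(acc, ranking[0])
```

With the real file, `rank_features(X, Y, FNames)` would take the lists parsed above.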
You need to build a classifier (e.g., random forest or SVM) to predict TB in HIV patients.
Steps:
1) Download the dataset from the NIH GEO database:
https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE162164&format=file
2) Annotate each patient with its phenotype. In other words, some patients are HIV only and the rest are HIV + TB.
3) Do some basic file reading and processing (convert the values to floats).
4) Train a model. This could be tricky, and the performance could be very bad.
5) You need some tricks to minimize the number of features (feature selection to reduce the feature space). For example, if a gene's expression does not differ much between HIV and HIV + TB, then you know this feature won't be important.
6) Train the model, calculate the accuracy, and report it.
7) Write a report (a Jupyter notebook detailing each of the steps and results). You also need to tell me what the biomarker is (the most critical feature for the TB + HIV disease).
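Modularity is a measure of the structure of networks or graphs that quantifies the strength of division of a network into modules (also called groups, clusters, or communities). Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules.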
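To make the definition concrete, here is a small sketch that computes modularity for a given partition. It assumes the standard Newman-Girvan form, Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j), which the notes do not spell out; the function name and the toy graph are illustrative.

```python
import numpy as np


def modularity(adjacency, communities):
    """Newman-Girvan modularity Q of an undirected graph partition."""
    A = np.asarray(adjacency, dtype=float)
    labels = np.asarray(communities)
    degrees = A.sum(axis=1)
    two_m = degrees.sum()  # twice the number of edges
    # same[i, j] is True when nodes i and j share a community.
    same = labels[:, None] == labels[None, :]
    # Expected edge weight between i and j if edges were placed at random.
    expected = np.outer(degrees, degrees) / two_m
    return float(((A - expected) * same).sum() / two_m)


# Toy graph: two triangles joined by a single edge (nodes 2 and 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

print(modularity(A, [0, 0, 0, 1, 1, 1]))  # split along the bridge
print(modularity(A, [0, 0, 0, 0, 0, 0]))  # everything in one community
```

Splitting the graph along the bridge gives a positive score, while putting every node in one community gives zero.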
import pdb, sys, os
import csv
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
# Open the data file
with open("diabetes.csv", "r") as f:
    lf = f.readlines()
# Parse the data file
lf = [item.strip().split(",") for item in lf]
FNames = lf[0][0:-1]  # list of feature names
featureSet = [item[0:-1] for item in lf[1:]]  # input features
outcomeSet = [int(item[-1]) for item in lf[1:]]  # output labels
# Convert the data type
featureSet = np.array(featureSet, dtype=float)
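Given the imports above, a natural continuation is to standardize the features, project them with PCA, and cluster them with k-means. The sketch below assumes that workflow. The synthetic `features` array stands in for `featureSet`, and the choice of two clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for featureSet: two groups of 50 rows, 8 features.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=0.0, size=(50, 8)),
    rng.normal(loc=4.0, size=(50, 8)),
])

# Standardize first so that features on large scales do not dominate.
scaled = StandardScaler().fit_transform(features)
# Project to 2 components for plotting.
embedding = PCA(n_components=2).fit_transform(scaled)
# Cluster in the standardized feature space; k=2 is an assumption.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

print(embedding.shape, np.bincount(clusters))
```

The 2-D `embedding` can then be drawn with `plt.scatter`, colored by `clusters` or by the true outcome labels.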
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import defaultdict
import random
from scipy.stats import pearsonr
Read the DNA sequence from a file
with open('data/DNA.txt', 'r') as file:
    dna_sequence = file.read().strip()
# Normalize the transition counts into probabilities
for pair, next_bases in transition_counts.items():
    total_transitions = sum(next_bases.values())
    for base, count in next_bases.items():
        transition_probabilities[pair][base] = count / total_transitions
Define the bases and base pairs
bases = ['A', 'C', 'G', 'T']
pairs = [a+b for a in bases for b in bases]
Create the transition probability matrix
matrix = np.zeros((16, 4))
for i, pair in enumerate(pairs):
    for j, base in enumerate(bases):
        matrix[i, j] = transition_probabilities[pair].get(base, 0)
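The snippets above use `transition_counts` without showing how it is built. The sketch below fills that gap under one assumption: the model is second-order, so each pair of consecutive bases predicts the base that follows it. The function name and the toy sequence are illustrative.

```python
from collections import defaultdict

import numpy as np


def transition_matrix(dna_sequence, bases=("A", "C", "G", "T")):
    """Estimate P(next base | previous two bases) as a 16 x 4 matrix."""
    pairs = [a + b for a in bases for b in bases]
    # transition_counts[pair][base] = number of times `base` follows `pair`.
    transition_counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(dna_sequence) - 2):
        pair, nxt = dna_sequence[i:i + 2], dna_sequence[i + 2]
        if pair in pairs and nxt in bases:  # skip N or other symbols
            transition_counts[pair][nxt] += 1

    matrix = np.zeros((len(pairs), len(bases)))
    for i, pair in enumerate(pairs):
        total = sum(transition_counts[pair].values())
        for j, base in enumerate(bases):
            if total > 0:  # pairs never seen keep an all-zero row
                matrix[i, j] = transition_counts[pair][base] / total
    return pairs, matrix


pairs, matrix = transition_matrix("ACGTACGTACGA")  # hypothetical toy sequence
print(matrix[pairs.index("CG")])
```

Each row for an observed pair sums to 1, so the matrix can be passed directly to a heatmap such as `sns.heatmap`.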
Most importantly, human eyes can’t see anything beyond 3D.
Curse of dimensionality
Analysis of high-dimensional data often suffers from the curse of dimensionality:
The search space grows exponentially with the number of dimensions.
The number of neighbors of each data point also grows exponentially.
Distances are no longer informative.
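The last point can be checked numerically. The sketch below assumes uniformly random points in the unit hypercube and compares the nearest and farthest neighbor of a random query. The function name and sample sizes are illustrative.

```python
import numpy as np


def distance_contrast(n_points, n_dims, seed=0):
    """Relative gap between the farthest and nearest neighbor of one query."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


for dims in (2, 10, 100, 1000):
    print(dims, round(distance_contrast(1000, dims), 3))
```

As the number of dimensions grows, the contrast shrinks toward zero: the nearest and farthest points end up at almost the same distance, so ranking neighbors by distance carries little information.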
Dimensionality reduction
Linear F: PCA
Non-linear F (for non-linearly separable data): t-SNE (UMAP), auto-encoder
PCA
Find a linear transformation that projects the data from the high-dimensional (HD) space to a low-dimensional (LD) space while minimizing the projection error.
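A minimal NumPy sketch of this idea follows, assuming the projection error is the mean squared reconstruction error. The function name `pca_project` and the toy data are illustrative; in practice `sklearn.decomposition.PCA` does the same job.

```python
import numpy as np


def pca_project(data, n_components):
    """Project rows of data onto the top principal components via SVD."""
    X = np.asarray(data, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    projected = centered @ components.T
    # Map back to the original space to measure the projection error.
    reconstructed = projected @ components + mean
    error = float(np.mean(np.sum((X - reconstructed) ** 2, axis=1)))
    return projected, error


# Toy data: 3-D points that lie almost on a line (an assumption for the demo).
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
data = t @ np.array([[1.0, 2.0, -1.0]]) + 0.01 * rng.normal(size=(200, 3))

projected, error = pca_project(data, n_components=1)
print(projected.shape, error)
```

Because the toy points lie close to a line, a single component already reconstructs them with a very small error.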
t-SNE
1) Measure the pairwise distances in the higher-dimensional space and convert them into similarities with a Gaussian distribution (P).
2) Measure the pairwise distances in the lower-dimensional space and convert them into similarities with a long-tailed Student's t-distribution (Q).
3) The locations of the points in the LD space (y) are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution P from the distribution Q. Gradient descent is then used to search for the y(i) that minimize the KL divergence C.
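The three steps can be written out as formulas. The block below follows the standard t-SNE formulation. The notes only name P, Q, y, and C, so the symbols x_i (HD points), σ_i (per-point Gaussian bandwidth), and N (number of points) are assumptions introduced here.

```latex
\begin{aligned}
p_{j|i} &= \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \\
C &= \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\end{aligned}
```

The heavy tail of the Student's t kernel in q_ij lets moderately distant points sit far apart in the LD space, which reduces crowding in the embedding.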
MAGIC (Markov Affinity-based Graph Imputation of Cells) is a
computational method used in single-cell RNA sequencing (scRNA-seq) data
analysis. It is designed to address the problem of missing gene
expression values in scRNA-seq datasets.
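The core idea, sharing information between similar cells through a Markov matrix built from cell-to-cell affinities, can be illustrated with a heavily simplified sketch. This is not the MAGIC package or its exact algorithm: the fixed Gaussian bandwidth `sigma`, the diffusion time `t`, the function name, and the toy data are all assumptions.

```python
import numpy as np


def diffusion_impute(expression, sigma=1.0, t=3):
    """Smooth a cells x genes matrix by t steps of a Markov diffusion."""
    X = np.asarray(expression, dtype=float)
    # Pairwise squared Euclidean distances between cells.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against rounding below zero
    # Gaussian affinities; a fixed bandwidth is a simplification.
    affinity = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Row-normalize so each row is a probability distribution (Markov matrix).
    markov = affinity / affinity.sum(axis=1, keepdims=True)
    # Diffuse for t steps, then average expression over the diffused neighborhood.
    return np.linalg.matrix_power(markov, t) @ X


# Toy data: two cell groups; one value in group 0 is a dropout (set to 0).
rng = np.random.default_rng(0)
group0 = np.tile([5.0, 5.0, 0.0], (10, 1)) + 0.1 * rng.normal(size=(10, 3))
group1 = np.tile([0.0, 0.0, 5.0], (10, 1)) + 0.1 * rng.normal(size=(10, 3))
cells = np.vstack([group0, group1])
cells[0, 1] = 0.0  # simulated dropout

imputed = diffusion_impute(cells, sigma=3.0, t=3)
print(cells[0, 1], "->", round(imputed[0, 1], 2))
```

The dropped-out value is pulled back toward the level of the other cells in its group. Larger `t` smooths more aggressively and can blur real differences between groups.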
Learn to process and analyze single-cell RNA sequencing data of peripheral blood mononuclear cells using Scanpy.
Note:
For an introduction to Scanpy and its use, you can refer to the official website
(https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html and
https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html)
or the reference code "sc.ipynb" provided by your teacher.
Steps:
1) Download data: Acquire the dataset either from Scanpy's official tutorial
(https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html)
or use the provided "processed.h5ad" file.
2) Preprocessing: Filter, normalize, and log-transform the data as described in the Scanpy tutorials.
3) Dimensionality reduction: Apply PCA to reduce the dataset's dimensions and visualize with UMAP or t-SNE.
4) Clustering: Run clustering on the dataset using the Leiden algorithm to detect cell groups.
5) Visualization: Generate plots to display clustering and gene expression patterns.
6) Report: Document the analysis process, results, and interpretations in a Jupyter Notebook.
Deliverables: Submit the Jupyter Notebook with detailed code, plots, and explanations.
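The steps above could be sketched with Scanpy roughly as follows. This is an outline, not the reference solution in "sc.ipynb". The thresholds (200 genes, 3 cells, 2000 variable genes, 10 neighbors, 40 PCs) follow common tutorial defaults and are assumptions, and the function name is hypothetical. Leiden clustering also needs the `leidenalg` package installed.

```python
def run_pbmc_pipeline(adata):
    """Filter, normalize, embed, and cluster an AnnData object in place."""
    import scanpy as sc  # imported here so the sketch loads without scanpy

    # Basic quality filtering of cells and genes.
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)
    # Normalize each cell to 10,000 counts, then log-transform.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    # Flag the most variable genes before PCA.
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.tl.pca(adata)
    # Build the neighbor graph used by both UMAP and Leiden.
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
    sc.tl.umap(adata)
    sc.tl.leiden(adata, key_added="leiden")
    return adata


# Usage (downloads the PBMC3k data on first call):
# import scanpy as sc
# adata = run_pbmc_pipeline(sc.datasets.pbmc3k())
# sc.pl.umap(adata, color=["leiden", "CST3"])
```

If you start from "processed.h5ad", the filtering and normalization have presumably been applied already, so those calls should be skipped rather than run twice.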