Single-Cell RNA-seq Analysis Using the K-Dense Open-Source Claude Skill

K-Dense is a multi-agent AI research system that coordinates specialized agents to plan experiments, review literature, design analyses, execute code in secure sandboxes, and generate publication-ready reports. It achieved 29.2% accuracy on BixBench (a bioinformatics benchmark), outperforming GPT-5 (22.9%), GPT-4o (18%), and Claude 3.5 Sonnet (18%). I could not find a BixBench result for Claude Sonnet 4.5.

The use cases are:

Drug Discovery Research

- Screen compound libraries from PubChem and ZINC
- Analyze bioactivity data from ChEMBL
- Predict molecular properties with RDKit and DeepChem
- Perform molecular docking with DiffDock

Bioinformatics Analysis

- Process genomic sequences with BioPython
- Analyze single-cell RNA-seq data with Scanpy
- Query gene information from Ensembl and NCBI Gene
- Identify protein-protein interactions via STRING

Materials Science

- Analyze crystal structures with Pymatgen
- Predict material properties
- Design novel compounds and materials

Clinical Research

- Search clinical trials on ClinicalTrials.gov
- Analyze genetic variants in ClinVar
- Review pharmacogenomic data from ClinPGx
- Access cancer mutations from COSMIC

Academic Research

- Literature searches via PubMed
- Patent landscape analysis using USPTO
- Data visualization for publications
- Statistical analysis and hypothesis testing

📋 Executive Summary: Analyzing single-cell RNA-seq data with Scanpy

I only needed to press "yes" many times to create a whole single-cell RNA-seq analysis pipeline, which includes 10 Python scripts, one master bash script, one Python dependency file, and several documentation files.

It appears to provide all the methods used, along with citations for them. It also lists the thresholds that were used, which can be customized as needed. The only thing left to do is run the script and troubleshoot when the pipeline fails. This definitely makes the starting point for an analysis much faster, provided the direction is correct. I will update this post with real test results next week.

Analysis Scripts (10 Python Files)

Script	Purpose	Key Outputs
01_load_and_qc.py	Quality control and filtering	Filtered dataset, QC plots
02_doublet_removal.py	Doublet detection (Scrublet)	Cleaned dataset, doublet plots
03_preprocessing_and_clustering.py	Normalization, PCA, UMAP, Leiden clustering	Processed dataset, UMAP plots
04_cell_type_annotation.py	Cell type identification	Annotated dataset, marker plots
05_cellxgene_integration.py	Public data integration	Integrated dataset, batch correction
06_differential_expression.py	DE analysis (Wilcoxon)	DE results, volcano plots, heatmaps
07_grn_inference.py	Gene regulatory networks (GRNBoost2)	TF-target networks, network graphs
08_pathway_enrichment.py	GO/KEGG/Reactome enrichment	Pathway results, enrichment plots
09_opentargets_analysis.py	Therapeutic target identification	Druggable targets, priority lists
10_generate_report.py	Report generation	README.md, PDF report

Total Lines of Code: ~1,500+ lines of well-documented Python code

Execution Scripts

run_complete_analysis.sh - Master bash script to run all 10 steps sequentially
requirements.txt - Complete list of Python dependencies
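
The contents of run_complete_analysis.sh are not reproduced in this post, so the following is only a sketch, written in Python rather than bash, of what "run all 10 steps sequentially" could look like when the run should stop at the first failing step. The function `run_pipeline` and its `run` parameter are hypothetical names. Only the script names come from the table above.

```python
import subprocess
import sys

# Script names as listed in the Analysis Scripts table.
STEPS = [
    "01_load_and_qc.py",
    "02_doublet_removal.py",
    "03_preprocessing_and_clustering.py",
    "04_cell_type_annotation.py",
    "05_cellxgene_integration.py",
    "06_differential_expression.py",
    "07_grn_inference.py",
    "08_pathway_enrichment.py",
    "09_opentargets_analysis.py",
    "10_generate_report.py",
]


def run_pipeline(steps=STEPS, run=None):
    """Run each step in order; return the name of the first failing step, or None."""
    if run is None:
        # Default runner: execute the script with the current Python interpreter
        # and report its exit code.
        def run(script):
            return subprocess.run([sys.executable, script]).returncode

    for script in steps:
        print(f"Running {script} ...")
        if run(script) != 0:
            print(f"{script} failed; fix it and rerun from this step.")
            return script
    return None
```

Calling `run_pipeline()` from the directory that holds the scripts would run them one by one. Passing a custom `run` callable makes it possible to dry-run the ordering without executing anything.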

Documentation Files

File	Description	Pages/Size
QUICKSTART.md	Quick start guide with installation and usage	7.5 KB
PIPELINE_OVERVIEW.md	Visual pipeline architecture and overview	~500 lines
README.md	Auto-generated analysis report	Generated at runtime

Quality Control Thresholds

min_genes_per_cell = 200
max_genes_per_cell = 6000
max_mitochondrial_pct = 15%
min_cells_per_gene = 3
doublet_rate = 6% (expected)
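
To make these thresholds concrete, here is a minimal pure-Python sketch of how they could be applied to per-cell and per-gene QC metrics. It is not the code from 01_load_and_qc.py, which is not shown here. The function names and example cells are hypothetical, and treating every limit as inclusive is an assumption. In Scanpy, the gene-count and cell-count filters are usually done with `sc.pp.filter_cells` and `sc.pp.filter_genes`.

```python
# Thresholds copied from the list above.
MIN_GENES_PER_CELL = 200
MAX_GENES_PER_CELL = 6000
MAX_MITO_PCT = 15.0
MIN_CELLS_PER_GENE = 3


def cell_passes_qc(n_genes, mito_pct):
    """True if a cell's gene count and mitochondrial percentage are within limits.

    Assumption: all limits are treated as inclusive.
    """
    return (MIN_GENES_PER_CELL <= n_genes <= MAX_GENES_PER_CELL
            and mito_pct <= MAX_MITO_PCT)


def gene_passes_qc(n_cells_expressing):
    """True if a gene is detected in at least MIN_CELLS_PER_GENE cells."""
    return n_cells_expressing >= MIN_CELLS_PER_GENE


# Hypothetical per-cell metrics: (barcode, genes detected, % mitochondrial counts)
cells = [
    ("cell_a", 1500, 4.2),   # within all limits
    ("cell_b", 150, 3.0),    # too few genes
    ("cell_c", 7200, 5.1),   # too many genes
    ("cell_d", 2100, 22.0),  # mitochondrial percentage too high
]
kept = [name for name, n_genes, mito in cells if cell_passes_qc(n_genes, mito)]
print(kept)
```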

Preprocessing Settings

normalization = "size_factor" (target_sum=10,000)
transformation = "log1p"
highly_variable_genes = 2000
pca_components = 50
umap_neighbors = 15
leiden_resolutions = [0.5, 0.8, 1.0]
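
The first two settings describe a simple per-cell transformation: scale each cell's counts so they sum to 10,000, then take log(1 + x). The sketch below shows that arithmetic on one hypothetical cell using only the standard library. The function name `normalize_log1p` is made up for illustration. In Scanpy this step is typically `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`, though whether the generated script calls exactly these functions is an assumption.

```python
import math

TARGET_SUM = 10_000  # target_sum from the settings above


def normalize_log1p(counts, target_sum=TARGET_SUM):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell has nothing to scale; return zeros.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


cell = [0, 3, 1, 6]  # hypothetical raw counts for four genes
normalized = normalize_log1p(cell)
print(normalized)
```

Undoing the log with `math.expm1` recovers values that sum to the target, which is a quick sanity check that the scaling worked.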

Statistical Thresholds

de_test = "wilcoxon"
significance_threshold = 0.05 (adjusted p-value)
fold_change_threshold = 0.5 (log2FC)
pathway_fdr = 0.05
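
As a rough illustration of how the significance and fold-change thresholds combine, the sketch below adjusts p-values with the Benjamini-Hochberg procedure and then keeps genes that pass both cut-offs. Two things here are assumptions: that the adjusted p-value is a Benjamini-Hochberg one, and that the fold-change threshold is applied to the absolute log2FC. The gene names and values are hypothetical.

```python
SIGNIFICANCE_THRESHOLD = 0.05  # adjusted p-value cut-off from the list above
FOLD_CHANGE_THRESHOLD = 0.5    # log2FC cut-off from the list above


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical DE results: (gene, raw p-value, log2 fold change)
results = [
    ("GENE_A", 0.001, 1.2),
    ("GENE_B", 0.02, 0.3),
    ("GENE_C", 0.03, -0.9),
    ("GENE_D", 0.6, 2.0),
]
padj = benjamini_hochberg([p for _, p, _ in results])
significant = [
    gene
    for (gene, _, log2fc), q in zip(results, padj)
    if q < SIGNIFICANCE_THRESHOLD and abs(log2fc) >= FOLD_CHANGE_THRESHOLD
]
print(significant)
```

In this toy example GENE_B is dropped for its small fold change and GENE_D for its large adjusted p-value.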


**Estimated time:** 40-110 minutes (depending on dataset size)

🎓 Scientific Methods

Cell Type Markers Used

The pipeline identifies the following PBMC cell types:

Cell Type	Canonical Markers
CD4+ T cells	IL7R, CD4, CD3D, CD3E
CD8+ T cells	CD8A, CD8B, CD3D, CD3E
B cells	MS4A1 (CD20), CD79A, CD79B, CD19
NK cells	GNLY, NKG7, NCAM1 (CD56), KLRD1, KLRF1
CD14+ Monocytes	CD14, LYZ, S100A8, S100A9
FCGR3A+ Monocytes	FCGR3A (CD16), MS4A7, LYZ
Dendritic cells	FCER1A, CST3, CLEC10A
Platelets	PPBP, PF4, GNG11
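
The marker table can be turned into a simple lookup for labelling clusters. The sketch below scores each cell type by the mean expression of its markers in a cluster and picks the highest score. This is one plausible approach, not necessarily what 04_cell_type_annotation.py does. The function `annotate_cluster` and the example cluster values are hypothetical. Scanpy's `sc.tl.score_genes` offers a more careful per-cell version of the same idea.

```python
# Marker genes copied from the table above.
MARKERS = {
    "CD4+ T cells": ["IL7R", "CD4", "CD3D", "CD3E"],
    "CD8+ T cells": ["CD8A", "CD8B", "CD3D", "CD3E"],
    "B cells": ["MS4A1", "CD79A", "CD79B", "CD19"],
    "NK cells": ["GNLY", "NKG7", "NCAM1", "KLRD1", "KLRF1"],
    "CD14+ Monocytes": ["CD14", "LYZ", "S100A8", "S100A9"],
    "FCGR3A+ Monocytes": ["FCGR3A", "MS4A7", "LYZ"],
    "Dendritic cells": ["FCER1A", "CST3", "CLEC10A"],
    "Platelets": ["PPBP", "PF4", "GNG11"],
}


def annotate_cluster(mean_expr, markers=MARKERS):
    """Return (best cell type, all scores) for one cluster.

    mean_expr maps gene name to the cluster's mean expression; genes missing
    from the mapping count as 0.
    """
    scores = {
        cell_type: sum(mean_expr.get(g, 0.0) for g in genes) / len(genes)
        for cell_type, genes in markers.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical mean expression for one cluster.
cluster = {"MS4A1": 2.5, "CD79A": 3.0, "CD79B": 2.0, "CD19": 1.0, "CD3D": 0.1}
best, scores = annotate_cluster(cluster)
print(best)
```

A mean-score rule like this can mislabel clusters that share markers (for example, both monocyte types express LYZ), so the result is best treated as a first guess to check against marker plots.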

Statistical Methods

Quality Control: MAD-based filtering on QC metrics
Doublet Detection: Scrublet (simulated doublets comparison)
Normalization: Size-factor normalization (library size)
Feature Selection: Highly variable genes (mean-variance relationship)
Dimensionality Reduction: PCA → UMAP
Clustering: Leiden algorithm (graph-based community detection)
Differential Expression: Wilcoxon rank-sum test (non-parametric)
GRN Inference: GRNBoost2 (gradient boosting regression)
Pathway Enrichment: Hypergeometric test with FDR correction
Batch Correction: Harmony (linear correction) or BBKNN (graph-based)
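
Two of the methods in this list are easy to show with the standard library alone: MAD-based outlier bounds for a QC metric, and the hypergeometric tail probability behind over-representation analysis. This is a sketch under stated assumptions, not the pipeline's code. The function names are hypothetical, the cut-off of 5 MADs is an assumed default, and the example numbers are made up. FDR correction (for example Benjamini-Hochberg) would still need to be applied across all tested pathways afterwards.

```python
from math import comb
from statistics import median


def mad_bounds(values, n_mads=5):
    """Return (lower, upper) bounds at median +/- n_mads * MAD.

    n_mads=5 is an assumed default, not a value taken from the pipeline.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med - n_mads * mad, med + n_mads * mad


def hypergeom_enrichment_p(overlap, gene_set_size, n_de_genes, background_size):
    """P(X >= overlap) when n_de_genes are drawn without replacement from a
    background containing gene_set_size pathway genes."""
    total = comb(background_size, n_de_genes)
    tail = sum(
        comb(gene_set_size, k) * comb(background_size - gene_set_size, n_de_genes - k)
        for k in range(overlap, min(gene_set_size, n_de_genes) + 1)
    )
    return tail / total


# Hypothetical QC metric with one extreme value.
lower, upper = mad_bounds([10, 12, 11, 13, 100])
outliers = [v for v in [10, 12, 11, 13, 100] if not lower <= v <= upper]

# Hypothetical enrichment: 3 of 4 DE genes fall in a 5-gene pathway, background of 20.
p = hypergeom_enrichment_p(overlap=3, gene_set_size=5, n_de_genes=4, background_size=20)
print(outliers, p)
```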

Customization

📝 Adjust QC thresholds in 01_load_and_qc.py
📝 Modify clustering resolution in 03_preprocessing_and_clustering.py
📝 Change DE thresholds in 06_differential_expression.py
📝 Update marker genes in 04_cell_type_annotation.py

📚 References & Citations

Primary Methods

Scanpy Framework
- Wolf, F.A., et al. (2018). “SCANPY: large-scale single-cell gene expression data analysis.” Genome Biology 19:15.
Doublet Detection
- Wolock, S.L., et al. (2019). “Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.” Cell Systems 8(4):281-291.
Gene Regulatory Networks
- Moerman, T., et al. (2019). “GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks.” Bioinformatics 35(12):2159-2161.
Leiden Clustering
- Traag, V.A., et al. (2019). “From Louvain to Leiden: guaranteeing well-connected communities.” Scientific Reports 9:5233.
Open Targets Platform
- Ochoa, D., et al. (2021). “Open Targets Platform: supporting systematic drug-target identification and prioritisation.” Nucleic Acids Research 49(D1):D1302-D1310.
CellxGene Census
- Chan Zuckerberg Initiative. (2023). “CellxGene Census: A versioned container of single-cell data.”

Supporting Tools

AnnData: Virshup, I., et al. (2021). bioRxiv.
UMAP: McInnes, L., et al. (2018). arXiv:1802.03426.
Harmony: Korsunsky, I., et al. (2019). Nature Methods 16:1289-1296.
GSEApy: Fang, Z., et al. (2023). Bioinformatics 39(1):btac757.
KEGG: Kanehisa, M., et al. (2021). Nucleic Acids Research 49(D1):D545-D551.
Reactome: Jassal, B., et al. (2020). Nucleic Acids Research 48(D1):D498-D503.

Testing the Claude code through web.
