Posts: 410
Threads: 13
Joined: Oct 2023
Gender: Undisclosed
03-16-2024, 05:37 PM
(This post was last modified: 03-16-2024, 05:51 PM by teepean.)
Over the years I have been asked many times about creating datasets, so here is a new program called aDNA to dataset (AKA make-myself-redundant).
There are both Linux and Windows versions available. The instructions are very simple, and I would like feedback from anyone interested in creating their own datasets from BAMs.
Note: the Windows version includes all the files needed to run except the references. On Linux, samtools and pileupCaller must be in your PATH.
pileupCaller can be downloaded from the link below; for samtools I assume you know how to use apt, pacman, yum, etc.
https://github.com/stschiff/sequenceTools
Main page:
https://github.com/teepean/adna_to_dataset
Download:
https://github.com/teepean/adna_to_datas.../v.0.2.zip
pileupCaller runs with its default settings; to change them, edit the .bat or .sh file.
EDIT: the program supports only hs37d5 and hg19 as references, as those are the most commonly used in aDNA papers. hg38/T2T support can be added if AADR starts supporting those references.
Posts: 154
Threads: 5
Joined: Sep 2023
Gender: Undisclosed
Ethnicity: Levantine
Y-DNA (P): E-CTS6667
03-16-2024, 05:44 PM
(This post was last modified: 03-16-2024, 05:54 PM by Qrts.)
Brilliant work. Thank you for your contributions teepean.
Posts: 301
Threads: 23
Joined: Oct 2023
Gender: Undisclosed
(03-16-2024, 05:37 PM)teepean Wrote: So I have had questions over the years about creating datasets and here I present a new program called aDNA to dataset (AKA make-myself-redundant)
There are both Linux and Windows versions available. The instructions are very simple so I would like to get feedback from anyone interested in creating their own datasets from BAMs.
Notice: Windows version includes all the files necessary to run except references. For Linux you need to have samtools and pileupCaller in your path.
EDIT: the program supports only hs37d5 and hg19 as references as those are the most commonly used in aDNA papers. hg38/T2T support can be added if AADR starts supporting those references.
Sorry for the n00b question, but what is the main purpose of creating your own datasets from BAMs?
Regarding the reference downloads, could this be installed alongside WGSExtract so that the big reference files/paths can be shared?
Posts: 226
Threads: 4
Joined: Feb 2024
(03-16-2024, 05:37 PM)teepean Wrote: So I have had questions over the years about creating datasets and here I present a new program called aDNA to dataset (AKA make-myself-redundant)
EDIT: the program supports only hs37d5 and hg19 as references as those are the most commonly used in aDNA papers. hg38/T2T support can be added if AADR starts supporting those references.
Do you think a T2T CHM13 liftover of the AADR data would be possible? PLINK 2 works perfectly with the CHM13 reference, and many low-coverage ancient DNA samples yield more SNPs when aligned to CHM13.
Posts: 410
Threads: 13
Joined: Oct 2023
Gender: Undisclosed
(03-17-2024, 12:57 PM)ChrisR Wrote: Sorry for n00b question but what is the main purpose of creating own datasets from BAMs?
Regarding references download etc. may I ask if this could be installed together with WGSExtract so that big (reference) files/paths can be shared?
The idea is that more people could create datasets. As for the references, you can edit the script so that it points to a different location:
..\winbin\samtools mpileup -B -q 30 -Q 30 -l ../positions/v42.4.1240K.pos -f ../reference/hs37d5.fa
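To illustrate what "pointing to a different location" means, here is a minimal Python sketch that rebuilds the samtools mpileup arguments quoted above with the reference path as a parameter. The function name, the BAM argument, and the example shared path are hypothetical; only the flags and default paths come from the command in this post, and the step that passes the pileup on to pileupCaller is not shown.

```python
import os

def build_mpileup_command(bam,
                          reference="../reference/hs37d5.fa",
                          positions="../positions/v42.4.1240K.pos",
                          samtools="samtools"):
    """Return the mpileup argument list with configurable paths.

    Hypothetical helper: the real .bat/.sh scripts hard-code these paths.
    """
    return [
        samtools, "mpileup",
        "-B",                         # disable BAQ computation
        "-q", "30",                   # minimum mapping quality
        "-Q", "30",                   # minimum base quality
        "-l", os.fspath(positions),   # list of positions to pile up
        "-f", os.fspath(reference),   # reference FASTA
        os.fspath(bam),               # input BAM (assumed argument)
    ]

# Example: reuse a reference stored elsewhere (path is made up).
cmd = build_mpileup_command("sample.bam", reference="/data/shared/hs37d5.fa")
print(" ".join(cmd))
```

Keeping the arguments in a list rather than one string avoids quoting problems when a path contains spaces.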
Posts: 771
Threads: 4
Joined: Sep 2023
Excellent! This sounds very useful. Quick question: roughly how much RAM does the program use on Windows, and is it compatible with a 32-bit OS?
Posts: 16
Threads: 1
Joined: Sep 2023
ok
Technical question from a noob:
I am running the program on Windows. The download of the references starts without any issues. Then I am prompted to "enter population name". Which population does this refer to?
Posts: 424
Threads: 14
Joined: Sep 2023
Gender: Male
Ethnicity: North Europe
Nationality: Normand French
Y-DNA (P): R-BY3604
Y-DNA (M): I-M253
mtDNA (M): H5a1
mtDNA (P): K1c1c
(03-17-2024, 05:05 PM)Fabrice E Wrote: ok
Technical question from a noob :
I am running the program from Windows. The download for the references starts without any issues. Then, I encounter the question "enter population name". Which population does this refer to?
The output is a PLINK packedped dataset consisting of three files (.bed, .bim, .fam), created in the target subdirectory. The .fam file lists a population name and an individual name for each individual. These are what the program asks you to enter.
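As a rough illustration, a .fam line has six whitespace-separated columns in the PLINK format: family ID, individual ID, father, mother, sex, phenotype. The sketch below assumes the population name ends up in the first column (family ID) and the individual name in the second; I believe that is pileupCaller's default, but treat it as an assumption and check your own .fam. The function name and the sample line are made up.

```python
def parse_fam_line(line):
    """Split one PLINK .fam line into named fields.

    Assumption: column 1 (family ID) holds the population name and
    column 2 (within-family ID) holds the individual name.
    """
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    population, individual, father, mother, sex, phenotype = fields
    return {
        "population": population,
        "individual": individual,
        "father": father,
        "mother": mother,
        "sex": sex,
        "phenotype": phenotype,
    }

# Hypothetical line for an individual "Sample1" in population "MyPop".
record = parse_fam_line("MyPop Sample1 0 0 0 1")
print(record["population"], record["individual"])
```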
Posts: 963
Threads: 24
Joined: Oct 2023
Can someone convert these two files to PLINK, please?
GRC13292545.chrY.bam
GRC13292546.chrY.bam
https://evolbio.ut.ee/chrY/
These are the A00 Y chromosomes used by Karmin et al. (2015).
Posts: 963
Threads: 24
Joined: Oct 2023
In addition, these BAM files are aligned to hg18 (not hg19).
##reference=file:///cvmfs/data.galaxyproject.org/byhand/hg18/sam_index/hg18.fa
The SNP positions (POS, ID) are not the same as in hg19.
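Since the program only supports hs37d5 and hg19, a quick check of the reference named in a file header can save a wasted run. This is only a sketch: the function names and the list of build names are my own, and it simply searches a header line for a known build name, so it will miss files whose headers do not mention the reference.

```python
import re

SUPPORTED_BUILDS = ("hs37d5", "hg19")            # per the first post
KNOWN_BUILDS = ("hs37d5", "hg18", "hg19", "hg38")  # illustrative list only

def guess_build(header_text):
    """Return the first known build name found in the header text, or None."""
    lowered = header_text.lower()
    for build in KNOWN_BUILDS:
        # Match the name only when it is not part of a longer token.
        if re.search(rf"(?<![a-z0-9]){build}(?![a-z0-9])", lowered):
            return build
    return None

def is_supported(build):
    """True if the build is one the program is stated to support."""
    return build in SUPPORTED_BUILDS

header = "##reference=file:///cvmfs/data.galaxyproject.org/byhand/hg18/sam_index/hg18.fa"
build = guess_build(header)
print(build, is_supported(build))
```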
Posts: 410
Threads: 13
Joined: Oct 2023
Gender: Undisclosed
(03-18-2024, 01:57 PM)TanTin Wrote: Can someone convert these 2 files to plink please?
GRC13292545.chrY.bam
GRC13292546.chrY.bam
https://evolbio.ut.ee/chrY/
These are Y chromosomes of A00 used by Karmin et al. (2015) .
This program should not be used for Y-DNA.
Posts: 410
Threads: 13
Joined: Oct 2023
Gender: Undisclosed
(06-06-2024, 11:34 AM)Gabru77 Wrote: Issue with output file of .bam to Plink conversion
Plink output of MT23 that I attempted to convert from .bam:-
https://send.zcyph.cc/download/aa860f445...hQRtZYpLmA
Can someone tell me the problem with output file?
Can you share the bam you tried to create the dataset from?
Posts: 410
Threads: 13
Joined: Oct 2023
Gender: Undisclosed
(06-06-2024, 02:25 PM)Gabru77 Wrote: (06-06-2024, 02:13 PM)teepean Wrote: (06-06-2024, 11:34 AM)Gabru77 Wrote: Issue with output file of .bam to Plink conversion
Plink output of MT23 that I attempted to convert from .bam:-
https://send.zcyph.cc/download/aa860f445...hQRtZYpLmA
Can someone tell me the problem with output file?
Can you share the bam you tried to create the dataset from?
MT23 right here... .bam and .bai
https://www.ebi.ac.uk/ena/browser/view/PRJEB54894
Did you choose this one? The other BAMs for MT23 are unaligned.
MT23.fastq.combined.fq.prefixed.mapped.mappedonly.sorted.qF.sorted.cleaned.merged_rmdup.clean.RG_sort.rescaled.bam