Posts: 64
Threads: 0
Joined: Jun 2024
I'm having an issue with file sizes right now, I recently tried to add the Zlaty Kun sample to the 1240k dataset. The geno file of the Zlaty Kun was around 2.4MB and the 1240k geno file is roughly 5GB. Can someone explain why after using Eigenstrat to merge the two files did the resulting new file balloon by nearly 4x as much at nearly 20GB? I wouldn't mind as much if it didn't also affect performance, now running admixtools methods takes a bit longer to complete and the program constantly has to read the HDD with each and every run
Posts: 953
Threads: 22
Joined: Oct 2023
(06-06-2024, 01:48 PM)ModusOperandi Wrote: I'm having an issue with file sizes right now, I recently tried to add the Zlaty Kun sample to the 1240k dataset. The geno file of the Zlaty Kun was around 2.4MB and the 1240k geno file is roughly 5GB. Can someone explain why after using Eigenstrat to merge the two files did the resulting new file balloon by nearly 4x as much at nearly 20GB? I wouldn't mind as much if it didn't also affect performance, now running admixtools methods takes a bit longer to complete and the program constantly has to read the HDD with each and every run
You will need to "clean" the new data. To make sure the new snips are matching the snips in the dataset that you merge.
1. First - you create the file with the list of all snips in your current dataset: ( v52.2_1240K_public )
plink --allow-no-sex --bfile v52.2_1240K_public --write-snplist --out v52.2_1240K_clean
2. Before to merge: you filter the file that should be added to have only the snips matching v52.2_1240K_clean list:
plink --bfile ZlatyKun_example --extract v52.2_1240K_clean.snplist --make-bed --out ZlatyKun_example_clean
Next you do the merge using the clean file ( ZlatyKun_example_clean )
This way it should not increase the size significantly.
Posts: 953
Threads: 22
Joined: Oct 2023
06-06-2024, 06:29 PM
(This post was last modified: 06-06-2024, 06:34 PM by TanTin.)
(06-06-2024, 03:31 PM)Gabru77 Wrote: Error while merging two Plink files:-
Code: 4016 MB RAM detected; reserving 2008 MB for main workspace.
44 people loaded from Pathak_v1.0.fam.
1 person to be merged from SDLG7.fam.
Of these, 1 is new, while 0 are present in the base dataset.
Warning: Multiple chromosomes seen for variant 'rs748651'.
Warning: Multiple chromosomes seen for variant 'rs3112911'.
Warning: Multiple chromosomes seen for variant 'rs2247450'.
Warning: Multiple chromosomes seen for variant 'rs1072841'.
Warning: Multiple chromosomes seen for variant 'rs34472859'.
Warning: Multiple chromosomes seen for variant 'rs1987475'.
Warning: Multiple chromosomes seen for variant 'rs11103281'.
Warning: Multiple chromosomes seen for variant 'rs1344098'.
Warning: Multiple chromosomes seen for variant 'rs2097266'.
Warning: Multiple positions seen for variant 'rs8051412'.
Warning: Multiple positions seen for variant 'rs35726347'.
Warning: Multiple chromosomes seen for variant 'rs814740'.
Warning: Multiple chromosomes seen for variant 'rs601338'.
Warning: Multiple chromosomes seen for variant 'rs672163'.
Warning: Multiple chromosomes seen for variant 'rs4129148'.
Warning: Multiple chromosomes seen for variant 'rs11480'.
Warning: Multiple chromosomes seen for variant 'rs14115'.
Warning: Multiple chromosomes seen for variant 'rs306875'.
Warning: Multiple chromosomes seen for variant 'rs707689'.
Warning: Multiple chromosomes seen for variant 'rs1736462'.
Warning: Multiple chromosomes seen for variant 'rs6642242'.
Warning: Multiple chromosomes seen for variant 'rs4893102'.
Warning: Multiple chromosomes seen for variant 'rs909439'.
Warning: Multiple chromosomes seen for variant 'rs5983831'.
Warning: Multiple chromosomes seen for variant 'rs5940653'.
Warning: Multiple chromosomes seen for variant 'rs5983854'.
Warning: Multiple chromosomes seen for variant 'rs3093457'.
Warning: Multiple chromosomes seen for variant 'rs731477'.
Warning: Multiple chromosomes seen for variant 'rs731478'.
716503 markers loaded from Pathak_v1.0.bim.
1233013 markers to be merged from SDLG7.bim.
Of these, 822412 are new, while 410601 are present in the base dataset.
Error: 73425 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
result.missnp.
(Warning: if the subsequent merge seems to work, strand errors involving SNPs
with A/T or C/G alleles probably remain in your data. If LD between nearby
SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
that subset of the data to VCF (via e.g. '--recode vcf'), merging with
another tool/script, and then importing the result; PLINK is not yet suited
to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
I've seen such error when merging old datasets. The main reason for such error is because the 2 files are in different format.
Despite both are plink files: they are mapped by different specifications:
example: hg19 and hg18.
Some older files were mapped in the past by using hg18 specification.
snip names could be still the same , but the snip position is different in hg19 and hg18.
For this reason the reported error is: Multiple chromosomes seen for variant 'rs748651' - that means the same snip was reported for 2 different positions.
One solution would be to convert the older file from hg18 to hg19.
This is called: Lift Genome Annotations
https://genome.ucsc.edu/cgi-bin/hgLiftOver
Posts: 953
Threads: 22
Joined: Oct 2023
06-06-2024, 06:39 PM
(This post was last modified: 06-06-2024, 06:47 PM by TanTin.)
For the error:
Error: 73425 variants with 3+ alleles present.
-> if the number of 3+ alleles is small - I just ignore them.
Same as the previous command above: plink --bfile ZlatyKun_example --extract v52.2_1240K_clean.snplist --make-bed --out ZlatyKun_example_clean
but instead of " --extract " use " --exclude " option.
(--exclude result.missnp <------- to exclude the conflicting 3+ alleles snips only !)
By excluding 3+ alleles present you will get lower coverage, but no errors about these conflicting snips.
Posts: 64
Threads: 0
Joined: Jun 2024
(06-06-2024, 06:15 PM)TanTin Wrote: (06-06-2024, 01:48 PM)ModusOperandi Wrote: I'm having an issue with file sizes right now, I recently tried to add the Zlaty Kun sample to the 1240k dataset. The geno file of the Zlaty Kun was around 2.4MB and the 1240k geno file is roughly 5GB. Can someone explain why after using Eigenstrat to merge the two files did the resulting new file balloon by nearly 4x as much at nearly 20GB? I wouldn't mind as much if it didn't also affect performance, now running admixtools methods takes a bit longer to complete and the program constantly has to read the HDD with each and every run
You will need to "clean" the new data. To make sure the new snips are matching the snips in the dataset that you merge.
1. First - you create the file with the list of all snips in your current dataset: ( v52.2_1240K_public )
plink --allow-no-sex --bfile v52.2_1240K_public --write-snplist --out v52.2_1240K_clean
2. Before to merge: you filter the file that should be added to have only the snips matching v52.2_1240K_clean list:
plink --bfile ZlatyKun_example --extract v52.2_1240K_clean.snplist --make-bed --out ZlatyKun_example_clean
Next you do the merge using the clean file ( ZlatyKun_example_clean )
This way it should not increase the size significantly.
I did all the above steps and the file size is still roughly the same
Posts: 953
Threads: 22
Joined: Oct 2023
(06-11-2024, 04:48 PM)ModusOperandi Wrote: I did all the above steps and the file size is still roughly the same
Show Content
Spoiler
1 rs3094315 0 752566 G A
1 rs12124819 0 776546 0 A
1 rs28765502 0 832918 T C
1 rs7419119 0 842013 0 T
1 rs950122 0 846864 0 G
1 rs113171913 0 869303 0 C
1 rs13302957 0 891021 G A
1 rs6696609 0 903426 0 C
There are 2 bim files for the 2 detasets that you are merging.
See the list of the snips in your bim files. Snip name and snip position should be matching for the 2 files.
Example: snip rs3094315 is on position 752566 .
If your datasets are different versions: the snip name may be the same, but the snip position could be different for both files.
Posts: 241
Threads: 6
Joined: Oct 2023
The problem could be that you likely want packedancestrymap format, not eigenstrat or ancestrymap.
Posts: 503
Threads: 72
Joined: Nov 2023
Gender: Male
Ethnicity: Arab
(05-05-2024, 06:12 PM)AimSmall Wrote: You have strand inconsistencies or flipped SNPs... I've never tried to merge Eigenstrat files, just PLINK. Your mileage may vary.
Here is how, who I call "Obi Wan", taught me..... (because he was the master) aka Nganasankan aka Henjin
Note this was using a Windows system at the time. Same technique works elsewhere, just fix your file pathing accordingly.
Basically a four step process....
Code: D:\DataAnalysis\DataSets\v52>plink --allow-no-sex --bfile v52.2_HO_public --bmerge AimSmall_genome --make-bed --out v52_HO_AimSmall
Need to flip for strand inconsistency
henjin — Today at 4:37 PM
This is the procedure to deal with strand inconsistency errors:
alias pli='plink --allow-no-sex'
pli --bfile $a --bmerge $b --make-bed --out merge
pli --bfile $b --flip merge-merge.missnp --make-bed --out $b-flip
pli --bfile $a --bmerge $b-flip --make-bed --out merge
pli --bfile $b-flip --exclude merge-merge.missnp --make-bed --out $b-filtered
pli --bfile $a --bmerge $b-filtered --make-bed --out merged
You can use a function for it too:
plib()(plink --allow-no-sex --bfile "$@")
merg()(plib "$1" --bmerge "$2" --make-bed --out merge;plib "$2" --flip merge-merge.missnp --make-bed --out "$2-flip";plib "$1" --bmerge "$2-flip" --make-bed --out merge;plib "$2-flip" --exclude merge-merge.missnp --make-bed --out "$2-filtered";plib "$1" --bmerge "$2-filtered" --make-bed --out "${3-merged}")
merg v52.2_HO_public AimSmall_genome
Manual method example
D:\DataAnalysis\DataSets\v52>plink --allow-no-sex --bfile AimSmall_genome --flip v52_HO_AimSmall-merge.missnp --make-bed --out AimSmall_genome_flip
D:\DataAnalysis\DataSets\v52>plink --allow-no-sex --bfile v52.2_HO_public --bmerge AimSmall_genome_flip --make-bed --out v52_HO_merged
D:\DataAnalysis\DataSets\v52>plink --allow-no-sex --bfile AimSmall_genome_flip --exclude v52_HO_merged-merge.missnp --make-bed --out AimSmall_genome_flip_filtered
D:\DataAnalysis\DataSets\v52>plink --allow-no-sex --bfile v52.2_HO_public --bmerge AimSmall_genome_flip_filtered --make-bed --out v52_HO_AimSmall2
I followed your steps but when I do this
plink --allow-no-sex --bfile AimSmall_genome --flip v52_HO_AimSmall-merge.missnp --make-bed --out AimSmall_genome_flip
Invalid.bed file size error expected 2077088
But it still flips the snps
Posts: 503
Threads: 72
Joined: Nov 2023
Gender: Male
Ethnicity: Arab
Either im the most unluckiest person in the world or i just have bad luck
This is how i solved the error size validation
This did the trick truncate -s 2077088 Oujda.bed
Posts: 503
Threads: 72
Joined: Nov 2023
Gender: Male
Ethnicity: Arab
Okay not it says 166204 her. Haploid genotypes do I need to fix this or is it fine to leave it as it is
Posts: 503
Threads: 72
Joined: Nov 2023
Gender: Male
Ethnicity: Arab
After merging the datasets in plink i was going to convert to eigenstrat but then it made another silly error of the invidual name is too long despite it not having an issue when i was converting the dataset into plink, so I decided to stay with plink I ran this code on the eigenstrat ind file
awk '{print $1}' v62.0_1240k_public.ind > Sample_Names.txt
this extracted the population individuals I needed, I then used this to replace the first column in the plink .fam file
awk 'NR==FNR {new_id[NR]=$1; next} { $1=new_id[FNR]; print }' Family_Samples.txt v64_0_Berkane2.fam > v64_0_Berkane2_updated.fam
This should give you all the sample names in the original file in your plink file, make sure to add yourself in the last row in the Sample_Names.txt
my first run on qpadm
Code: eft=c('CanaryIslands_Guanche.SG', 'Congo_Kindoki_Protohistoric.AG', 'Syria_TellMasaikh_Medieval_oAfrica.SG', 'Syria_TellMasaikh_Medieval.SG')
>
> right=c('Mbuti.DG', 'Papuan.DG', 'Chukchi.DG', 'Yoruba.DG', 'Russia_UstIshim_IUP.DG', 'Russia_Kostenki14_UP.SG', 'Jordan_PPNB.AG', 'Israel_Natufian.AG', 'Russia_MA1_UP.SG', 'Morocco_Iberomaurusian.AG')
> target='Son'
> qp=qpadm("C:/Users/fromh/v64_0_Berkane2",left,right,target,allsnps=TRUE)
ℹ Reading metadata...
Warning: 1 parsing failure.
row col expected actual file
17630 -- 6 columns 5 columns 'C:\Users\fromh\v64_0_Berkane2.fam'
ℹ Computing block lengths for 2454306 SNPs...
ℹ Computing 36 f4-statistics for block 713 out of 713...
ℹ "allsnps = TRUE" uses different SNPs for each f4-statistic
Number of SNPs used for each f4-statistic:
pop1 pop2 pop3 pop4 n
1 Son CanaryIslands_Guanche.SG Mbuti.DG Chukchi.DG 720627
2 Son CanaryIslands_Guanche.SG Mbuti.DG Israel_Natufian.AG 759331
3 Son CanaryIslands_Guanche.SG Mbuti.DG Jordan_PPNB.AG 742759
4 Son CanaryIslands_Guanche.SG Mbuti.DG Morocco_Iberomaurusian.AG 806598
5 Son CanaryIslands_Guanche.SG Mbuti.DG Papuan.DG 818557
6 Son CanaryIslands_Guanche.SG Mbuti.DG Russia_Kostenki14_UP.SG 767299
7 Son CanaryIslands_Guanche.SG Mbuti.DG Russia_MA1_UP.SG 671142
8 Son CanaryIslands_Guanche.SG Mbuti.DG Russia_UstIshim_IUP.DG 127242
9 Son CanaryIslands_Guanche.SG Mbuti.DG Yoruba.DG 818567
10 Son Congo_Kindoki_Protohistoric.AG Mbuti.DG Chukchi.DG 680495
11 Son Congo_Kindoki_Protohistoric.AG Mbuti.DG Israel_Natufian.AG 777567
12 Son Congo_Kindoki_Protohistoric.AG Mbuti.DG Jordan_PPNB.AG 766951
13 Son Congo_Kindoki_Protohistoric.AG Mbuti.DG Morocco_Iberomaurusian.AG 802266
14 Son Congo_Kindoki_Protohistoric.AG Mbuti.DG Papuan.DG 806325
15 Son Congo_Kindoki_Protohistoric.AG Mbuti.DG Russia_Kostenki14_UP.SG 759921
16 Son Congo_Kindoki_Protohistoric.AG Mbuti.DG Russia_MA1_UP.SG 638072
17 Son Congo_Kindoki_Protohistoric.AG Mbuti.DG Russia_UstIshim_IUP.DG 82878
18 Son Congo_Kindoki_Protohistoric.AG Mbuti.DG Yoruba.DG 806328
19 Son Syria_TellMasaikh_Medieval.SG Mbuti.DG Chukchi.DG 661861
20 Son Syria_TellMasaikh_Medieval.SG Mbuti.DG Israel_Natufian.AG 722093
21 Son Syria_TellMasaikh_Medieval.SG Mbuti.DG Jordan_PPNB.AG 712487
22 Son Syria_TellMasaikh_Medieval.SG Mbuti.DG Morocco_Iberomaurusian.AG 747523
23 Son Syria_TellMasaikh_Medieval.SG Mbuti.DG Papuan.DG 754106
24 Son Syria_TellMasaikh_Medieval.SG Mbuti.DG Russia_Kostenki14_UP.SG 710799
25 Son Syria_TellMasaikh_Medieval.SG Mbuti.DG Russia_MA1_UP.SG 629569
26 Son Syria_TellMasaikh_Medieval.SG Mbuti.DG Russia_UstIshim_IUP.DG 73147
27 Son Syria_TellMasaikh_Medieval.SG Mbuti.DG Yoruba.DG 754116
28 Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG Chukchi.DG 675450
29 Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG Israel_Natufian.AG 730207
30 Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG Jordan_PPNB.AG 719100
31 Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG Morocco_Iberomaurusian.AG 760710
32 Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG Papuan.DG 767999
33 Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG Russia_Kostenki14_UP.SG 723302
34 Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG Russia_MA1_UP.SG 639313
35 Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG Russia_UstIshim_IUP.DG 85829
36 Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG Yoruba.DG 768009
ℹ Computing admixture weights...
ℹ Computing standard errors...
ℹ Computing number of admixture waves...
warning: solve(): system is singular (rcond: 3.52528e-19); attempting approx solution
warning: solve(): system is singular (rcond: 5.98313e-18); attempting approx solution
Warning message:
In f4blockdat_to_f4blocks(.) :
Discarding 443 block(s) due to missing values!
>
> qp$weights
# A tibble: 4 × 5
target left weight se z
<chr> <chr> <dbl> <dbl> <dbl>
1 Son CanaryIslands_Guanche.SG -1.82 3.68 -0.494
2 Son Congo_Kindoki_Protohistoric.AG 0.0601 0.488 0.123
3 Son Syria_TellMasaikh_Medieval_oAfrica.SG 19.6 17.0 1.15
4 Son Syria_TellMasaikh_Medieval.SG -16.8 15.1 -1.12
> qp$popdrop
# A tibble: 15 × 15
pat wt dof chisq p f4rank CanaryIslands_Guanche.SG Congo_Kindoki_Protohisto…¹ Syria_TellMasaikh_Me…² Syria_TellMasaikh_Me…³ feasible best dofdiff
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <dbl>
1 0000 0 6 12.0 6.26e- 2 3 -1.82 0.0601 19.6 -16.8 FALSE NA NA
2 0001 1 7 57.7 4.33e- 10 2 -14.7 1.14 14.5 NA FALSE TRUE 0
3 0010 1 7 65.5 1.19e- 11 2 -14.0 1.44 NA 13.5 FALSE TRUE 0
4 0100 1 7 12.0 9.93e- 2 2 -1.62 NA 20.3 -17.7 FALSE TRUE 0
5 1000 1 7 12.3 9.09e- 2 2 NA -0.0854 19.2 -18.2 FALSE TRUE NA
6 0011 2 8 1116. 1.07e-235 1 0.711 0.289 NA NA TRUE NA NA
7 0101 2 8 76.5 2.50e- 13 1 27.7 NA -26.7 NA FALSE NA NA
8 0110 2 8 101. 2.51e- 18 1 13.8 NA NA -12.8 FALSE NA NA
9 1001 2 8 1158. 1.08e-244 1 NA 0.334 0.666 NA TRUE NA NA
10 1010 2 8 1128. 4.15e-238 1 NA 0.291 NA 0.709 TRUE NA NA
11 1100 2 8 12.9 1.16e- 1 1 NA NA 16.7 -15.7 FALSE NA NA
12 0111 3 9 2567. 0 0 1 NA NA NA TRUE NA NA
13 1011 3 9 2542. 0 0 NA 1 NA NA TRUE NA NA
14 1101 3 9 1500. 1.53e-317 0 NA NA 1 NA TRUE NA NA
15 1110 3 9 1429. 4.34e-302 0 NA NA NA 1 TRUE NA NA
# ℹ abbreviated names: ¹Congo_Kindoki_Protohistoric.AG, ²Syria_TellMasaikh_Medieval_oAfrica.SG, ³Syria_TellMasaikh_Medieval.SG
# ℹ 2 more variables: chisqdiff <dbl>, p_nested <dbl>
>
> `|`=`%>%`
> p=""
> o=""
> rm(t)
> #t=qp$popdrop|dplyr::filter(f4rank!=0&feasible)|arrange(desc(p),chisq)
> t=qp$popdrop|dplyr::filter(f4rank!=0&feasible)|arrange(desc(p),chisq)
>
> p=t|select(7:last_col(5))|apply(1,\(x)na.omit(100*x)|sort(T)|sprintf("%.0f %s",.,names(.))|paste(collapse=" "))
> o=sub("^0","",sprintf(ifelse(t$p<.001,"%.0g","%.3f"),t$p))|paste0((ifelse(t$p > 0.05,"SUCCESS ","FAILED ")),"p=",.," ",p,collapse="\n")
> paste0("Target: ",target,"\nLeft: ",paste(sort(left),collapse=", "),"\nRight: ",paste(right,collapse=", "),"\nFeasible Results:","\n",o)|writeLines
Target: Son
Left: CanaryIslands_Guanche.SG, Congo_Kindoki_Protohistoric.AG, Syria_TellMasaikh_Medieval.SG, Syria_TellMasaikh_Medieval_oAfrica.SG
Right: Mbuti.DG, Papuan.DG, Chukchi.DG, Yoruba.DG, Russia_UstIshim_IUP.DG, Russia_Kostenki14_UP.SG, Jordan_PPNB.AG, Israel_Natufian.AG, Russia_MA1_UP.SG, Morocco_Iberomaurusian.AG
Feasible Results:
FAILED p=1e-235 71 CanaryIslands_Guanche.SG 29 Congo_Kindoki_Protohistoric.AG
FAILED p=4e-238 71 Syria_TellMasaikh_Medieval.SG 29 Congo_Kindoki_Protohistoric.AG
FAILED p=1e-244 67 Syria_TellMasaikh_Medieval_oAfrica.SG 33 Congo_Kindoki_Protohistoric.AG
I am still getting some errors so Ideally I will want to convert back to eigenstrat but I don't have the time right now so I'll try to do this when I get some free time because of work
Posts: 503
Threads: 72
Joined: Nov 2023
Gender: Male
Ethnicity: Arab
Warning: 1 parsing failure.
row col expected actual file
17630 -- 6 columns 5 columns
I fixed this because I was missing a column I had to add an extra column with what’s meant to be the ID which I did now this error doesn’t pop up anymore
Posts: 503
Threads: 72
Joined: Nov 2023
Gender: Male
Ethnicity: Arab
@ AimSmall
I managed to convert it back to eigenstrat but the results look weird not sure what I did wrong I followed all steps to flip snps. The only thing I can think is that I adjusted the size of my .bed file because of the file validation error
Posts: 503
Threads: 72
Joined: Nov 2023
Gender: Male
Ethnicity: Arab
@ AimSmall
Do you think me using my combined kit is causing this issue since it’s 2000k snps as to the 1240k snps causing the bed file too big
Posts: 863
Threads: 48
Joined: Aug 2023
Gender: Male
Ethnicity: Colonial American
Nationality: American
Y-DNA (P): R1b-U152 >R-FTA96415
Y-DNA (M): I2-P37 > I-BY77146
mtDNA (M): J1b1a1a
mtDNA (P): H66a
(10-21-2024, 12:20 PM)Genetics189291 Wrote: @AimSmall
Do you think me using my combined kit is causing this issue since it’s 2000k snps as to the 1240k snps causing the bed file too big
When merging your kit, you should have kept only the SNPs that exist in the 1240K. It's pointless to add any that don't exist as it forces all the other samples to add NO CALLs and swells the resulting dataset.
plink --allow-no-sex --bfile v54p1_1240K_public --write-snplist --out v54p1_1240K_clean
plink --23file AimSmall.txt --extract v54p1_1240K_clean.snplist --make-bed --out AimSmall_v54p1_genome
Then merge AimSmall_v54p1_genome with v54p1_1240K_public or whatever dataset.
|