Dataset merging help
#16
I'm having an issue with file sizes right now, I recently tried to add the Zlaty Kun sample to the 1240k dataset. The geno file of the Zlaty Kun was around 2.4MB and the 1240k geno file is roughly 5GB. Can someone explain why after using Eigenstrat to merge the two files did the resulting new file balloon by nearly 4x as much at nearly 20GB? I wouldn't mind as much if it didn't also affect performance, now running admixtools methods takes a bit longer to complete and the program constantly has to read the HDD with each and every run
Reply
#17
(06-06-2024, 01:48 PM)ModusOperandi Wrote: I'm having an issue with file sizes right now, I recently tried to add the Zlaty Kun sample to the 1240k dataset. The geno file of the Zlaty Kun was around 2.4MB and the 1240k geno file is roughly 5GB. Can someone explain why after using Eigenstrat to merge the two files did the resulting new file balloon by nearly 4x as much at nearly 20GB? I wouldn't mind as much if it didn't also affect performance, now running admixtools methods takes a bit longer to complete and the program constantly has to read the HDD with each and every run

You will need to "clean" the new data. To make sure the new snips are matching the snips in the dataset that you merge.

1. First - you create the file with the list of all snips in your current dataset:  ( v52.2_1240K_public  )
plink --allow-no-sex --bfile v52.2_1240K_public --write-snplist --out v52.2_1240K_clean

2. Before to merge:  you filter the file that should be added to have only the snips matching  v52.2_1240K_clean  list:
plink --bfile   ZlatyKun_example  --extract v52.2_1240K_clean.snplist --make-bed --out ZlatyKun_example_clean

Next you do the merge using the clean file ( ZlatyKun_example_clean )

This way it should not increase the size significantly.
Reply
#18
(06-06-2024, 03:31 PM)Gabru77 Wrote: Error while merging two Plink files:-

Code:
4016 MB RAM detected; reserving 2008 MB for main workspace.
44 people loaded from Pathak_v1.0.fam.
1 person to be merged from SDLG7.fam.
Of these, 1 is new, while 0 are present in the base dataset.
Warning: Multiple chromosomes seen for variant 'rs748651'.
Warning: Multiple chromosomes seen for variant 'rs3112911'.
Warning: Multiple chromosomes seen for variant 'rs2247450'.
Warning: Multiple chromosomes seen for variant 'rs1072841'.
Warning: Multiple chromosomes seen for variant 'rs34472859'.
Warning: Multiple chromosomes seen for variant 'rs1987475'.
Warning: Multiple chromosomes seen for variant 'rs11103281'.
Warning: Multiple chromosomes seen for variant 'rs1344098'.
Warning: Multiple chromosomes seen for variant 'rs2097266'.
Warning: Multiple positions seen for variant 'rs8051412'.
Warning: Multiple positions seen for variant 'rs35726347'.
Warning: Multiple chromosomes seen for variant 'rs814740'.
Warning: Multiple chromosomes seen for variant 'rs601338'.
Warning: Multiple chromosomes seen for variant 'rs672163'.
Warning: Multiple chromosomes seen for variant 'rs4129148'.
Warning: Multiple chromosomes seen for variant 'rs11480'.
Warning: Multiple chromosomes seen for variant 'rs14115'.
Warning: Multiple chromosomes seen for variant 'rs306875'.
Warning: Multiple chromosomes seen for variant 'rs707689'.
Warning: Multiple chromosomes seen for variant 'rs1736462'.
Warning: Multiple chromosomes seen for variant 'rs6642242'.
Warning: Multiple chromosomes seen for variant 'rs4893102'.
Warning: Multiple chromosomes seen for variant 'rs909439'.
Warning: Multiple chromosomes seen for variant 'rs5983831'.
Warning: Multiple chromosomes seen for variant 'rs5940653'.
Warning: Multiple chromosomes seen for variant 'rs5983854'.
Warning: Multiple chromosomes seen for variant 'rs3093457'.
Warning: Multiple chromosomes seen for variant 'rs731477'.
Warning: Multiple chromosomes seen for variant 'rs731478'.
716503 markers loaded from Pathak_v1.0.bim.
1233013 markers to be merged from SDLG7.bim.
Of these, 822412 are new, while 410601 are present in the base dataset.
Error: 73425 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  result.missnp.
  (Warning: if the subsequent merge seems to work, strand errors involving SNPs
  with A/T or C/G alleles probably remain in your data.  If LD between nearby
  SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.

I've seen such error when merging old datasets.  The main reason for such error is because the 2 files are in different format.
Despite both are plink files: they are mapped by different specifications:
example: hg19 and hg18.
Some older files were mapped in the past by using hg18 specification.
snip names could be still the same , but the snip position is different in  hg19 and hg18.

For this reason the reported error is:  Multiple chromosomes seen for variant 'rs748651'    - that means the same snip was reported for 2 different positions. 
One solution would be  to convert the older file from hg18 to hg19.

This is called: Lift Genome Annotations
https://genome.ucsc.edu/cgi-bin/hgLiftOver
Reply
#19
For the error:
Error: 73425 variants with 3+ alleles present.

-> if the number of 3+ alleles is small - I just ignore them.

Same as the previous command above: plink --bfile ZlatyKun_example --extract v52.2_1240K_clean.snplist --make-bed --out ZlatyKun_example_clean
but instead of " --extract " use " --exclude " option.
(--exclude result.missnp <------- to exclude the conflicting 3+ alleles snips only !)

By excluding 3+ alleles present you will get lower coverage, but no errors about these conflicting snips.
Reply
#20
(06-06-2024, 06:15 PM)TanTin Wrote:
(06-06-2024, 01:48 PM)ModusOperandi Wrote: I'm having an issue with file sizes right now, I recently tried to add the Zlaty Kun sample to the 1240k dataset. The geno file of the Zlaty Kun was around 2.4MB and the 1240k geno file is roughly 5GB. Can someone explain why after using Eigenstrat to merge the two files did the resulting new file balloon by nearly 4x as much at nearly 20GB? I wouldn't mind as much if it didn't also affect performance, now running admixtools methods takes a bit longer to complete and the program constantly has to read the HDD with each and every run

You will need to "clean" the new data. To make sure the new snips are matching the snips in the dataset that you merge.

1. First - you create the file with the list of all snips in your current dataset:  ( v52.2_1240K_public  )
plink --allow-no-sex --bfile v52.2_1240K_public --write-snplist --out v52.2_1240K_clean

2. Before to merge:  you filter the file that should be added to have only the snips matching  v52.2_1240K_clean  list:
plink --bfile   ZlatyKun_example  --extract v52.2_1240K_clean.snplist --make-bed --out ZlatyKun_example_clean

Next you do the merge using the clean file ( ZlatyKun_example_clean )

This way it should not increase the size significantly.

I did all the above steps and the file size is still roughly the same
Reply
#21
(06-11-2024, 04:48 PM)ModusOperandi Wrote: I did all the above steps and the file size is still roughly the same

Show Content

There are 2 bim files for the 2 detasets that you are merging.
See the list of the snips in your bim files. Snip name and snip position should be matching for the 2 files.
Example: snip rs3094315 is on position 752566 .

If your datasets are different versions: the snip name may be the same, but the snip position could be different for both files.
Reply
#22
The problem could be that you likely want packedancestrymap format, not eigenstrat or ancestrymap.
Reply
#23
(05-05-2024, 06:12 PM)AimSmall Wrote: You have strand inconsistencies or flipped SNPs...  I've never tried to merge Eigenstrat files, just PLINK.  Your mileage may vary.

Here is how, who I call "Obi Wan",  taught me..... (because he was the master) aka Nganasankan aka Henjin 

Note this was using a Windows system at the time.  Same technique works elsewhere, just fix your file pathing accordingly.


Basically a four step process....
Code:
D:\DataAnalysis\DataSets\v52>plink --allow-no-sex --bfile v52.2_HO_public --bmerge AimSmall_genome --make-bed --out v52_HO_AimSmall

Need to flip for strand inconsistency

henjin — Today at 4:37 PM
This is the procedure to deal with strand inconsistency errors:
alias pli='plink --allow-no-sex'
pli --bfile $a --bmerge $b --make-bed --out merge
pli --bfile $b --flip merge-merge.missnp --make-bed --out $b-flip
pli --bfile $a --bmerge $b-flip --make-bed --out merge
pli --bfile $b-flip --exclude merge-merge.missnp --make-bed --out $b-filtered
pli --bfile $a --bmerge $b-filtered --make-bed --out merged


You can use a function for it too:
plib()(plink --allow-no-sex --bfile "$@")
merg()(plib "$1" --bmerge "$2" --make-bed --out merge;plib "$2" --flip merge-merge.missnp --make-bed --out "$2-flip";plib "$1" --bmerge "$2-flip" --make-bed --out merge;plib "$2-flip" --exclude merge-merge.missnp --make-bed --out "$2-filtered";plib "$1" --bmerge "$2-filtered" --make-bed --out "${3-merged}")
merg v52.2_HO_public AimSmall_genome


Manual method example

D:\DataAnalysis\DataSets\v52>plink --allow-no-sex --bfile AimSmall_genome --flip v52_HO_AimSmall-merge.missnp --make-bed --out AimSmall_genome_flip

D:\DataAnalysis\DataSets\v52>plink --allow-no-sex --bfile v52.2_HO_public --bmerge AimSmall_genome_flip --make-bed --out v52_HO_merged

D:\DataAnalysis\DataSets\v52>plink --allow-no-sex --bfile AimSmall_genome_flip --exclude v52_HO_merged-merge.missnp --make-bed --out AimSmall_genome_flip_filtered

D:\DataAnalysis\DataSets\v52>plink --allow-no-sex --bfile v52.2_HO_public --bmerge AimSmall_genome_flip_filtered --make-bed --out v52_HO_AimSmall2

I followed your steps but when I do this 


plink --allow-no-sex --bfile AimSmall_genome --flip v52_HO_AimSmall-merge.missnp --make-bed --out AimSmall_genome_flip

Invalid.bed file size error expected 2077088

But it still flips the snps
Reply
#24
Either im the most unluckiest person in the world or i just have bad luck

This is how i solved the error size validation

This did the trick truncate -s 2077088 Oujda.bed
Reply
#25
Okay not it says 166204 her. Haploid genotypes do I need to fix this or is it fine to leave it as it is
Reply
#26
After merging the datasets in plink i was going to convert to eigenstrat but then it made another silly error of the invidual name is too long despite it not having an issue when i was converting the dataset into plink, so I decided to stay with plink I ran this code on the eigenstrat ind file

awk '{print $1}' v62.0_1240k_public.ind > Sample_Names.txt

this extracted the population individuals I needed, I then used this to replace the first column in the plink .fam file


awk 'NR==FNR {new_id[NR]=$1; next} { $1=new_id[FNR]; print }' Family_Samples.txt v64_0_Berkane2.fam > v64_0_Berkane2_updated.fam

This should give you all the sample names in the original file in your plink file, make sure to add yourself in the last row in the Sample_Names.txt

my first run on qpadm

Code:
eft=c('CanaryIslands_Guanche.SG', 'Congo_Kindoki_Protohistoric.AG', 'Syria_TellMasaikh_Medieval_oAfrica.SG', 'Syria_TellMasaikh_Medieval.SG')
>
> right=c('Mbuti.DG', 'Papuan.DG', 'Chukchi.DG', 'Yoruba.DG', 'Russia_UstIshim_IUP.DG', 'Russia_Kostenki14_UP.SG', 'Jordan_PPNB.AG', 'Israel_Natufian.AG', 'Russia_MA1_UP.SG', 'Morocco_Iberomaurusian.AG')
> target='Son'
> qp=qpadm("C:/Users/fromh/v64_0_Berkane2",left,right,target,allsnps=TRUE)
ℹ Reading metadata...
Warning: 1 parsing failure.
  row col  expected    actual                                   file
17630  -- 6 columns 5 columns 'C:\Users\fromh\v64_0_Berkane2.fam'

ℹ Computing block lengths for 2454306 SNPs...
ℹ Computing 36 f4-statistics for block 713 out of 713...
ℹ "allsnps = TRUE" uses different SNPs for each f4-statistic
  Number of SNPs used for each f4-statistic:
   pop1                                  pop2     pop3                      pop4      n
1   Son              CanaryIslands_Guanche.SG Mbuti.DG                Chukchi.DG 720627
2   Son              CanaryIslands_Guanche.SG Mbuti.DG        Israel_Natufian.AG 759331
3   Son              CanaryIslands_Guanche.SG Mbuti.DG            Jordan_PPNB.AG 742759
4   Son              CanaryIslands_Guanche.SG Mbuti.DG Morocco_Iberomaurusian.AG 806598
5   Son              CanaryIslands_Guanche.SG Mbuti.DG                 Papuan.DG 818557
6   Son              CanaryIslands_Guanche.SG Mbuti.DG   Russia_Kostenki14_UP.SG 767299
7   Son              CanaryIslands_Guanche.SG Mbuti.DG          Russia_MA1_UP.SG 671142
8   Son              CanaryIslands_Guanche.SG Mbuti.DG    Russia_UstIshim_IUP.DG 127242
9   Son              CanaryIslands_Guanche.SG Mbuti.DG                 Yoruba.DG 818567
10  Son        Congo_Kindoki_Protohistoric.AG Mbuti.DG                Chukchi.DG 680495
11  Son        Congo_Kindoki_Protohistoric.AG Mbuti.DG        Israel_Natufian.AG 777567
12  Son        Congo_Kindoki_Protohistoric.AG Mbuti.DG            Jordan_PPNB.AG 766951
13  Son        Congo_Kindoki_Protohistoric.AG Mbuti.DG Morocco_Iberomaurusian.AG 802266
14  Son        Congo_Kindoki_Protohistoric.AG Mbuti.DG                 Papuan.DG 806325
15  Son        Congo_Kindoki_Protohistoric.AG Mbuti.DG   Russia_Kostenki14_UP.SG 759921
16  Son        Congo_Kindoki_Protohistoric.AG Mbuti.DG          Russia_MA1_UP.SG 638072
17  Son        Congo_Kindoki_Protohistoric.AG Mbuti.DG    Russia_UstIshim_IUP.DG  82878
18  Son        Congo_Kindoki_Protohistoric.AG Mbuti.DG                 Yoruba.DG 806328
19  Son         Syria_TellMasaikh_Medieval.SG Mbuti.DG                Chukchi.DG 661861
20  Son         Syria_TellMasaikh_Medieval.SG Mbuti.DG        Israel_Natufian.AG 722093
21  Son         Syria_TellMasaikh_Medieval.SG Mbuti.DG            Jordan_PPNB.AG 712487
22  Son         Syria_TellMasaikh_Medieval.SG Mbuti.DG Morocco_Iberomaurusian.AG 747523
23  Son         Syria_TellMasaikh_Medieval.SG Mbuti.DG                 Papuan.DG 754106
24  Son         Syria_TellMasaikh_Medieval.SG Mbuti.DG   Russia_Kostenki14_UP.SG 710799
25  Son         Syria_TellMasaikh_Medieval.SG Mbuti.DG          Russia_MA1_UP.SG 629569
26  Son         Syria_TellMasaikh_Medieval.SG Mbuti.DG    Russia_UstIshim_IUP.DG  73147
27  Son         Syria_TellMasaikh_Medieval.SG Mbuti.DG                 Yoruba.DG 754116
28  Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG                Chukchi.DG 675450
29  Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG        Israel_Natufian.AG 730207
30  Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG            Jordan_PPNB.AG 719100
31  Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG Morocco_Iberomaurusian.AG 760710
32  Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG                 Papuan.DG 767999
33  Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG   Russia_Kostenki14_UP.SG 723302
34  Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG          Russia_MA1_UP.SG 639313
35  Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG    Russia_UstIshim_IUP.DG  85829
36  Son Syria_TellMasaikh_Medieval_oAfrica.SG Mbuti.DG                 Yoruba.DG 768009
ℹ Computing admixture weights...
ℹ Computing standard errors...
ℹ Computing number of admixture waves...

warning: solve(): system is singular (rcond: 3.52528e-19); attempting approx solution

warning: solve(): system is singular (rcond: 5.98313e-18); attempting approx solution
Warning message:
In f4blockdat_to_f4blocks(.) :
  Discarding 443 block(s) due to missing values!
>
> qp$weights
# A tibble: 4 × 5
  target left                                    weight     se      z
  <chr>  <chr>                                    <dbl>  <dbl>  <dbl>
1 Son    CanaryIslands_Guanche.SG               -1.82    3.68  -0.494
2 Son    Congo_Kindoki_Protohistoric.AG          0.0601  0.488  0.123
3 Son    Syria_TellMasaikh_Medieval_oAfrica.SG  19.6    17.0    1.15
4 Son    Syria_TellMasaikh_Medieval.SG         -16.8    15.1   -1.12
> qp$popdrop
# A tibble: 15 × 15
   pat      wt   dof  chisq         p f4rank CanaryIslands_Guanche.SG Congo_Kindoki_Protohisto…¹ Syria_TellMasaikh_Me…² Syria_TellMasaikh_Me…³ feasible best  dofdiff
   <chr> <dbl> <dbl>  <dbl>     <dbl>  <dbl>                    <dbl>                      <dbl>                  <dbl>                  <dbl> <lgl>    <lgl>   <dbl>
1 0000      0     6   12.0 6.26e-  2      3                   -1.82                      0.0601                 19.6                  -16.8   FALSE    NA         NA
2 0001      1     7   57.7 4.33e- 10      2                  -14.7                       1.14                   14.5                   NA     FALSE    TRUE        0
3 0010      1     7   65.5 1.19e- 11      2                  -14.0                       1.44                   NA                     13.5   FALSE    TRUE        0
4 0100      1     7   12.0 9.93e-  2      2                   -1.62                     NA                      20.3                  -17.7   FALSE    TRUE        0
5 1000      1     7   12.3 9.09e-  2      2                   NA                        -0.0854                 19.2                  -18.2   FALSE    TRUE       NA
6 0011      2     8 1116.  1.07e-235      1                    0.711                     0.289                  NA                     NA     TRUE     NA         NA
7 0101      2     8   76.5 2.50e- 13      1                   27.7                      NA                     -26.7                   NA     FALSE    NA         NA
8 0110      2     8  101.  2.51e- 18      1                   13.8                      NA                      NA                    -12.8   FALSE    NA         NA
9 1001      2     8 1158.  1.08e-244      1                   NA                         0.334                   0.666                 NA     TRUE     NA         NA
10 1010      2     8 1128.  4.15e-238      1                   NA                         0.291                  NA                      0.709 TRUE     NA         NA
11 1100      2     8   12.9 1.16e-  1      1                   NA                        NA                      16.7                  -15.7   FALSE    NA         NA
12 0111      3     9 2567.  0              0                    1                        NA                      NA                     NA     TRUE     NA         NA
13 1011      3     9 2542.  0              0                   NA                         1                      NA                     NA     TRUE     NA         NA
14 1101      3     9 1500.  1.53e-317      0                   NA                        NA                       1                     NA     TRUE     NA         NA
15 1110      3     9 1429.  4.34e-302      0                   NA                        NA                      NA                      1     TRUE     NA         NA
# ℹ abbreviated names: ¹​Congo_Kindoki_Protohistoric.AG, ²​Syria_TellMasaikh_Medieval_oAfrica.SG, ³​Syria_TellMasaikh_Medieval.SG
# ℹ 2 more variables: chisqdiff <dbl>, p_nested <dbl>
>
> `|`=`%>%`
> p=""
> o=""
> rm(t)
> #t=qp$popdrop|dplyr::filter(f4rank!=0&feasible)|arrange(desc(p),chisq)
> t=qp$popdrop|dplyr::filter(f4rank!=0&feasible)|arrange(desc(p),chisq)
>
> p=t|select(7:last_col(5))|apply(1,\(x)na.omit(100*x)|sort(T)|sprintf("%.0f %s",.,names(.))|paste(collapse=" "))
> o=sub("^0","",sprintf(ifelse(t$p<.001,"%.0g","%.3f"),t$p))|paste0((ifelse(t$p > 0.05,"SUCCESS ","FAILED  ")),"p=",.," ",p,collapse="\n")
> paste0("Target: ",target,"\nLeft: ",paste(sort(left),collapse=", "),"\nRight: ",paste(right,collapse=", "),"\nFeasible Results:","\n",o)|writeLines
Target: Son
Left: CanaryIslands_Guanche.SG, Congo_Kindoki_Protohistoric.AG, Syria_TellMasaikh_Medieval.SG, Syria_TellMasaikh_Medieval_oAfrica.SG
Right: Mbuti.DG, Papuan.DG, Chukchi.DG, Yoruba.DG, Russia_UstIshim_IUP.DG, Russia_Kostenki14_UP.SG, Jordan_PPNB.AG, Israel_Natufian.AG, Russia_MA1_UP.SG, Morocco_Iberomaurusian.AG
Feasible Results:
FAILED  p=1e-235 71 CanaryIslands_Guanche.SG 29 Congo_Kindoki_Protohistoric.AG
FAILED  p=4e-238 71 Syria_TellMasaikh_Medieval.SG 29 Congo_Kindoki_Protohistoric.AG
FAILED  p=1e-244 67 Syria_TellMasaikh_Medieval_oAfrica.SG 33 Congo_Kindoki_Protohistoric.AG

I am still getting some errors so Ideally I will want to convert back to eigenstrat but I don't have the time right now so I'll try to do this when I get some free time because of work
Reply
#27
Warning: 1 parsing failure.
row col expected actual file
17630 -- 6 columns 5 columns

I fixed this because I was missing a column I had to add an extra column with what’s meant to be the ID which I did now this error doesn’t pop up anymore
Reply
#28
@AimSmall

I managed to convert it back to eigenstrat but the results look weird not sure what I did wrong I followed all steps to flip snps. The only thing I can think is that I adjusted the size of my .bed file because of the file validation error
Reply
#29
@AimSmall

Do you think me using my combined kit is causing this issue since it’s 2000k snps as to the 1240k snps causing the bed file too big
Reply
#30
(10-21-2024, 12:20 PM)Genetics189291 Wrote: @AimSmall

Do you think me using my combined kit is causing this issue since it’s 2000k snps as to the 1240k snps causing the bed file too big

When merging your kit, you should have kept only the SNPs that exist in the 1240K.  It's pointless to add any that don't exist as it forces all the other samples to add NO CALLs and swells the resulting dataset.

plink --allow-no-sex --bfile v54p1_1240K_public --write-snplist --out v54p1_1240K_clean
plink --23file AimSmall.txt --extract v54p1_1240K_clean.snplist --make-bed --out AimSmall_v54p1_genome

Then merge  AimSmall_v54p1_genome with  v54p1_1240K_public or whatever dataset.
Reply


Forum Jump:


Users browsing this thread: 1 Guest(s)