Selective backup from cluster

#notes #Rsync #Backup

Background

Suppose you have a large amount of data (from numerical simulations) on a remote cluster, say about 10 TB. In most cases you no longer need all of it, but it is worth saving the subset that can be used to restart a computation (the .ser files in my current case). This note shows one way to do that.

Generate a list to be backed up

Here is a Python script that writes the paths of all .ser files you want to back up to a text file.

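    #!/usr/bin/env python3
    # List the .ser files to back up: every n-th file plus the latest one per directory.
    import os
    import sys
    from datetime import datetime
    
    saveFreq = 10  # default: keep one file out of every 10
    if len(sys.argv) < 2:
        print("Need one argument: the ser saved frequency")
        print("The default value 10 is used")
    else:
        saveFreq = int(sys.argv[1])
    
    now = datetime.now()
    date_time = now.strftime("%Y%m%d-%H%M%S")
    with open('bkupSerList_' + date_time + '.txt', 'a') as of:
        for root, dirs, files in os.walk(os.getcwd()):
            # Prune directories that never need to be searched.
            for unin in ['in', 'wall', 'out']:
                if unin in dirs:
                    dirs.remove(unin)
    
            # Sorted names; zero-padded time stamps sort chronologically.
            fl = sorted(f for f in files if f.endswith(".ser"))
    
            # Directories with fewer than two .ser files are skipped.
            if len(fl) > 1:
                # Always keep the most recent file ...
                print(os.path.join(root, fl[-1]), file=of)
                # ... then one file out of every saveFreq, without listing the last twice.
                for idx in range(0, len(fl) - 1, saveFreq):
                    print(os.path.join(root, fl[idx]), file=of)
                of.flush()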

The only argument is an integer n giving the save frequency: one .ser file out of every n is kept, and the most recent file in each directory is always kept as well. The default value is 10.
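To make the selection rule easier to check on its own, here is a minimal sketch of it as a standalone function. The name `select_for_backup` and the file names in the example are hypothetical and not part of the script above; unlike the script, the sketch returns the files in sorted order and does not skip directories that hold a single file.

```python
def select_for_backup(names, save_freq=10):
    """Return every save_freq-th name (in sorted order) plus the last one."""
    ordered = sorted(names)
    if not ordered:
        return []
    # Indices 0, save_freq, 2*save_freq, ...
    picked = ordered[::save_freq]
    # Always keep the most recent file, without listing it twice.
    if picked[-1] != ordered[-1]:
        picked.append(ordered[-1])
    return picked


# Hypothetical example: 25 restart files named 000.ser ... 024.ser
names = ["%03d.ser" % i for i in range(25)]
print(select_for_backup(names, 10))
# → ['000.ser', '010.ser', '020.ser', '024.ser']
```

The sort is lexicographic, so it only matches chronological order when the time stamps in the file names are zero-padded, as they are in the listing below.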

Running this script in /scratch/jlv/test gives, for example, the following lines:

    /scratch/jlv/test/RBC/CG/muS5d3/RBCVesPCG_20180609-235417_kappa_1.0_Shear_gamma_0.0_pN_40.0/013.9000000.ser
    /scratch/jlv/test/RBC/CG/muS5d3/RBCVesPCG_20180609-235417_kappa_1.0_Shear_gamma_0.0_pN_40.0/000.0000000.ser
    /scratch/jlv/test/RBC/CG/muS5d3/RBCVesPCG_20180609-235417_kappa_1.0_Shear_gamma_0.0_pN_40.0/000.5200000.ser
    /scratch/jlv/test/RBC/CG/muS5d3/RBCVesPCG_20180609-235417_kappa_1.0_Shear_gamma_0.0_pN_40.0/001.0200000.ser
    /scratch/jlv/test/RBC/CG/muS5d3/RBCVesPCG_20180609-235417_kappa_1.0_Shear_gamma_0.0_pN_40.0/001.5200000.ser
    /scratch/jlv/test/RBC/CG/muS5d3/RBCVesPCG_20180609-235417_kappa_1.0_Shear_gamma_0.0_pN_40.0/002.0300000.ser
    /scratch/jlv/test/RBC/CG/muS5d3/RBCVesPCG_20180609-235417_kappa_1.0_Shear_gamma_0.0_pN_40.0/002.6300000.ser
    /scratch/jlv/test/RBC/CG/muS5d3/RBCVesPCG_20180609-235417_kappa_1.0_Shear_gamma_0.0_pN_40.0/003.2300000.ser
    /scratch/jlv/test/RBC/CG/muS5d3/RBCVesPCG_20180609-235417_kappa_1.0_Shear_gamma_0.0_pN_40.0/003.8300000.ser
    /scratch/jlv/test/RBC/CG/muS5d3/RBCVesPCG_20180609-235417_kappa_1.0_Shear_gamma_0.0_pN_40.0/004.4300000.ser

Copy to a temporary directory

Suppose the destination directory is /scratch/jlv/tmp_for_bkup_serfiles. Before copying, the leading string /scratch/jlv/test/ must be trimmed from every line of the list so that the paths become relative; this is easily done with vim. The local copy can then be done, from within /scratch/jlv/test, by

    rsync -uvhR `cat bkupSerList_20210623-152003.txt` /scratch/jlv/tmp_for_bkup_serfiles/
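If you would rather not edit the list by hand, the prefix can also be stripped with a few lines of Python. This is only a sketch: the function name `strip_prefix` is hypothetical, and the example path is a shortened stand-in for the real entries of the list.

```python
def strip_prefix(lines, prefix):
    """Remove a leading directory prefix from each non-empty line."""
    relative = []
    for line in lines:
        path = line.strip()
        if not path:
            continue  # ignore blank lines
        if not path.startswith(prefix):
            raise ValueError("unexpected path outside prefix: " + path)
        relative.append(path[len(prefix):])
    return relative


# Hypothetical, shortened example entries
lines = [
    "/scratch/jlv/test/RBC/CG/run1/000.0000000.ser\n",
    "/scratch/jlv/test/RBC/CG/run1/013.9000000.ser\n",
]
for path in strip_prefix(lines, "/scratch/jlv/test/"):
    print(path)
```

In practice the lines would be read from the bkupSerList file and the result written back to a text file. rsync also has a `--files-from` option that reads such a list directly, which may be worth considering when the list is too long to pass on the command line.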

Backup to your own disk

Suppose the local directory where you want to save your data is /media/jlv/WD1/meso_Marseille/scratch/test. The backup is then done with another rsync command, run from within the destination directory:

    rsync -ravh -e ssh jlv@login.mesocentre.univ-amu.fr:/scratch/jlv/tmp_for_bkup_serfiles/ .

And finally, don’t forget to delete the data on /scratch/jlv/tmp_for_bkup_serfiles.

Notes

When copying data with rsync to an external drive formatted as exFAT, for example, one needs to add the option --modify-window=1 to avoid copying all files every time.

    # https://unix.stackexchange.com/questions/552349/rsync-over-ssh-copies-all-files-every-time
    rsync --modify-window=1 -ravh -e ssh jlv@login.mesocentre.univ-amu.fr:/scratch/jlv/tmp_for_bkup_serfiles/ .
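As a rough illustration of what the option changes: rsync treats two modification times as equal when they differ by no more than the modify-window value. The sketch below mimics only that comparison; the function name `same_mtime` and the time stamps are hypothetical, and the one-second offset stands in for whatever rounding the destination filesystem applies.

```python
def same_mtime(src_mtime, dst_mtime, modify_window=0):
    """Return True if two modification times (in seconds) match within the window."""
    return abs(src_mtime - dst_mtime) <= modify_window


# Hypothetical time stamps: the copy on the external drive is off by one second.
src = 1624454403
dst = 1624454404

print(same_mtime(src, dst))                   # exact comparison
print(same_mtime(src, dst, modify_window=1))  # tolerant comparison
```

With an exact comparison the file looks modified and would be transferred again; with a window of one second it is considered up to date.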