python - Batch renaming of (basename) files/folders using an "index"

Question

Batch renaming of files and folders is a frequently asked question, but after some searching I don't think any of the existing ones is similar to mine.

Background: we send some biological samples to a service provider, who returns files with unique names and a table in text format containing the file name and the sample it came from:

head samples.txt
fq_file Sample_ID   Sample_name Library_ID  FC_Number   Track_Lanes_Pos
L2369_Track-3885_R1.fastq.gz    S1746_B_7_t B 7 t   L2369_B_7_t 163 6
L2349_Track-3865_R1.fastq.gz    S1726_A_3_t A 3 t   L2349_A_3_t 163 5
L2354_Track-3870_R1.fastq.gz    S1731_A_GFP_c   A GFP c L2354_A_GFP_c   163 5
L2377_Track-3893_R1.fastq.gz    S1754_B_7_c B 7 c   L2377_B_7_c 163 7
L2362_Track-3878_R1.fastq.gz    S1739_B_GFP_t   B GFP t L2362_B_GFP_t   163 6

Directory structure (for 34 directories):

L2369_Track-3885_
   accepted_hits.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info
L2349_Track-3865_
   accepted_hits.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info

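Goal: since the file names are meaningless and hard to interpret, I want to rename the files ending in .bam (keeping the suffix) and the folders after the corresponding sample name, rearranged in a more suitable order. The result should look like this: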

7_t_B
   7_t_B.bam
   deletions.bed   
   junctions.bed         
   logs
   7_t_B.bam.bai  
   insertions.bed  
   left_kept_reads.info
3_t_A
   3_t_A.bam      
   deletions.bed   
   junctions.bed         
   logs
   accepted_hits.bam.bai  
   insertions.bed  
   left_kept_reads.info

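I hacked together a solution with bash and python (I'm a newbie), but it feels over-engineered. The question is whether there is a simpler, more elegant way that I have missed. Solutions could be in python, bash, or R, or perhaps awk, since I'm trying to learn it. Being a relative beginner does make you complicate things.

Here is my solution:

A wrapper puts it all in place and gives an idea of the workflow: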

#! /bin/bash

# select columns of interest and write them to a file - basenames
# (overwrite rather than append, so a re-run does not duplicate entries)
tail -n +2 samples.txt | cut -d$'\t' -f1,3 > BAMfilames.txt

# call my little python script that creates a new .sh with the renaming commands
./renameBamFiles.py

# finally do the renaming
./renameBam.sh

# and the folders too
./renameBamFolder.sh

renameBamFiles.py:

#! /usr/bin/env python
import re

# Read in the data sample file and create a bash file that will rename the tophat output
# the renaming will be as follows:
# mv L2377_Track-3893_R1_ L2377_Track-3893_R1_SRSF7_cyto_B
# 

# Set the input file name
# (The program must be run from within the directory 
#  that contains this data file)
InFileName = 'BAMfilames.txt'


### Rename BAM files

# Open the input file for reading
InFile = open(InFileName, 'r')


# Open the output file for writing
OutFileName= 'renameBam.sh'

OutFile=open(OutFileName,'w') # You can append instead with 'a'

OutFile.write("#! /bin/bash"+"\n")
OutFile.write(" "+"\n")


# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line=Line.strip('\n')

    ## Separate the line into a list of its tab-delimited components
    ElementList=Line.split('\t')

    # separate the folder string from the experimental name
    fileroot=ElementList[1]
    fileroot=fileroot.split()

    # create variable names using regex
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName=folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])

    command= "for file in %s/accepted_hits.*; do mv $file ${file/accepted_hits/%s}; done" % (folderName, fileName)

    print(command)
    OutFile.write(command+"\n")  


# After the loop is completed, close the files
InFile.close()
OutFile.close()


### Rename folders

# Open the input file for reading
InFile = open(InFileName, 'r')


# Open the output file for writing
OutFileName= 'renameBamFolder.sh'

OutFile=open(OutFileName,'w') 

OutFile.write("#! /bin/bash"+"\n")
OutFile.write(" "+"\n")


# Loop through each line in the file
for Line in InFile:
    ## Remove the line ending characters
    Line=Line.strip('\n')

    ## Separate the line into a list of its tab-delimited components
    ElementList=Line.split('\t')

    # separate the folder string from the experimental name
    fileroot=ElementList[1]
    fileroot=fileroot.split()

    # create variable names using regex
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0])
    folderName=folderName.strip('\n')
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0])

    command= "mv %s %s" % (folderName, fileName)

    print(command)

    OutFile.write(command+"\n")  


# After the loop is completed, close the files
InFile.close()
OutFile.close()

renameBam.sh, created by the previous Python script:

#! /bin/bash

for file in L2369_Track-3885_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/7_t_B}; done
for file in L2349_Track-3865_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/3_t_A}; done
for file in L2354_Track-3870_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/GFP_c_A}; done
(..)

The folder-renaming script renameBamFolder.sh is very similar:

mv L2369_Track-3885_R1_ 7_t_B
mv L2349_Track-3865_R1_ 3_t_A
mv L2354_Track-3870_R1_ GFP_c_A
mv L2377_Track-3893_R1_ 7_c_B

Since I'm learning, I think a few examples of different ways of doing this, and of how to think about doing it, would be very useful.

score 2 · Accepted Answer

One simple way in bash:

# -mindepth 1 skips "." itself, whose name would match every line as a regex
find . -mindepth 1 -type d -print |
while IFS= read -r oldPath; do

   parent=$(dirname "$oldPath")
   old=$(basename "$oldPath")
   new=$(awk -v old="$old" '$1~"^"old{print $4"_"$5"_"$3}' samples.txt)

   if [ -n "$new" ]; then
      newPath="${parent}/${new}"
      echo mv "$oldPath" "$newPath"
      echo mv "${newPath}/accepted_hits.bam" "${newPath}/${new}.bam"
   fi
done

After the initial test, remove the "echo"s so that the "mv"s actually run.

As @triplee's answer shows, it's even simpler if all the target directories are at one level. cd to their parent directory and run:

awk 'NR>1{sub(/[^_]+$/,"",$1); print $1" "$4"_"$5"_"$3}' samples.txt |
while read -r old new; do
   echo mv "$old" "$new"
   echo mv "${new}/accepted_hits.bam" "${new}/${new}.bam"
done

In one of your expected outputs you renamed the ".bai" file; in the other you didn't, and you didn't say whether you want it renamed. If you do want it renamed, just add

echo mv "${new}/accepted_hits.bam.bai" "${new}/${new}.bam.bai"

to whichever of the solutions above you prefer.
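
For comparison, here is a rough Python sketch of the same single-level approach. It is not part of the answer above: the function names `plan_renames` and `apply_renames` and the `dry_run` flag are made up for illustration, and it assumes the table is whitespace-separated the way awk sees it. It renames every `accepted_hits.*` file, so the `.bai` index is covered too, and by default it only prints what it would do, like the `echo` in the shell version.

```python
import os


def plan_renames(samples_path):
    """Return (old_dir, new_name) pairs read from a samples.txt-style table."""
    pairs = []
    with open(samples_path) as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = line.split()  # assumption: whitespace-separated columns
            if len(fields) < 5:
                continue  # blank or malformed line
            # Drop everything after the last underscore, like sub(/[^_]+$/, "", $1)
            old_dir = fields[0][: fields[0].rfind("_") + 1]
            new_name = "_".join((fields[3], fields[4], fields[2]))
            pairs.append((old_dir, new_name))
    return pairs


def apply_renames(pairs, root=".", dry_run=True):
    """Rename accepted_hits.* inside each directory, then the directory itself."""
    actions = []
    for old_dir, new_name in pairs:
        old_path = os.path.join(root, old_dir)
        if not os.path.isdir(old_path):
            continue  # no such directory, nothing to do
        for entry in sorted(os.listdir(old_path)):
            if entry.startswith("accepted_hits."):
                suffix = entry[len("accepted_hits"):]  # ".bam", ".bam.bai", ...
                actions.append((os.path.join(old_path, entry),
                                os.path.join(old_path, new_name + suffix)))
        # The directory is renamed last, after the files inside it
        actions.append((old_path, os.path.join(root, new_name)))
    for src, dst in actions:
        if dry_run:
            print("mv", src, dst)
        else:
            os.rename(src, dst)
    return actions
```

Calling `apply_renames(plan_renames("samples.txt"))` from the parent directory prints the planned moves; passing `dry_run=False` performs them.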

score 0 · Accepted Answer

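This isn't quite what you're looking for (just thinking outside the box): you might consider an alternative "view" of your file system, using the term "view" the way a database view relates to a table. This can be done via FUSE, "filesystem in userspace". Many existing utilities can do this, though I don't know of one that works generically on a set of files specifically for renaming/reorganizing. As a concrete example of how it can be used, though, pytagsfs creates a virtual (FUSE) file system that presents your files in whatever directory structure you like, based on rules you define. (It might work here too, but pytagsfs is really aimed at media files.) You then just work with that (virtual) file system using whatever programs normally access the data. Or, to make the virtual directory structure permanent (if pytagsfs doesn't already have an option for this), just copy the virtual file system to another directory (outside the virtual file system).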
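
To make the "view" idea concrete without FUSE, one lighter-weight stand-in is a tree of symbolic links that shows the data under the new names while leaving the originals untouched. The sketch below is exactly that: plain symlinks, not pytagsfs and not a FUSE file system. The function name `build_symlink_view` and the shape of `mapping` (old directory name to new name) are assumptions for illustration, and it assumes a platform where creating symlinks is permitted.

```python
import os


def build_symlink_view(mapping, source_root, view_root):
    """Create view_root/<new_name>/ full of symlinks into source_root/<old_dir>/."""
    links = []
    for old_dir, new_name in mapping.items():
        source_dir = os.path.abspath(os.path.join(source_root, old_dir))
        if not os.path.isdir(source_dir):
            continue  # nothing to show for this entry
        view_dir = os.path.join(view_root, new_name)
        os.makedirs(view_dir, exist_ok=True)
        for entry in sorted(os.listdir(source_dir)):
            shown = entry
            if entry.startswith("accepted_hits."):
                # Present accepted_hits.bam(.bai) under the sample name
                shown = new_name + entry[len("accepted_hits"):]
            link = os.path.join(view_dir, shown)
            if not os.path.lexists(link):
                os.symlink(os.path.join(source_dir, entry), link)
            links.append(link)
    return links
```

Programs can then be pointed at the view directory, and deleting it removes only the links, not the data.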

score 0 · Accepted Answer

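Of course, you can do this in Python alone, and it makes for a small, readable script.

First, read the samples.txt file and build a map from the existing file prefixes to the desired mapped prefixes. The file isn't formatted for Python's csv reader module, because the column separator is also used inside a data column (Sample_name).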

mapping = {}
with open("samples.txt") as samples:
    # throw away the header line
    samples.readline()
    for line in samples:
        # split the columns on any run of whitespace (spaces or tabs)
        fields = line.split()
        # skip blank or malformed lines
        if len(fields) < 6:
            continue
        # Sample_name ("B 7 t") contains spaces, so it splits into three fields;
        # with the sample data the next three names hold "B", "7" and "t"
        fq_file, sample_id, Sample_name, Library_ID, FC_Number, track_lanes_pos, *other = fields
        # the [:-2] throws away the "R1" suffix, as in the example above
        file_prefix = fq_file.split(".")[0][:-2]
        target_id = "_".join((Library_ID, FC_Number, Sample_name))
        mapping[file_prefix] = target_id

Then check the directory names and remap the ".bam" files inside each directory:

import os
for entry in os.listdir("."):
    if entry in mapping:
        dir_prefix = "./" + entry + "/"
        for file_entry in os.listdir(dir_prefix):
            if ".bam" in file_entry:
                # replace the part before ".bam", keeping suffixes such as ".bam.bai"
                parts = file_entry.split(".bam")
                parts[0] = mapping[entry]
                new_name = ".bam".join(parts)

                os.rename(dir_prefix + file_entry, dir_prefix + new_name)
        os.rename(entry, mapping[entry])

score 0 · Accepted Answer

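It seems you can easily read the required fields from the index file in a simple while loop. It isn't clear how the file is structured, so I'm assuming it is whitespace-delimited and that Sample_Id is actually four fields (the complex sample_id, then the three components of the name). Or maybe you have a tab-delimited file whose Sample_Id field contains internal spaces? Either way, this should be easy to adapt if my assumption is wrong.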

# Skip the annoying field names
tail -n +2 samples.txt |
while read fq _ c a b chaff; do
    dir=${fq%R1.fastq.gz}
    new="${a}_${b}_$c"
    echo mv "$dir"/accepted_hits.bam "$dir/$new".bam
    echo mv "$dir"/accepted_hits.bam.bai "$dir/$new".bam.bai
    echo mv "$dir" "$new"
done

If the output looks like what you want, take out the echo.
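
On the question of how the file is structured: the asker's own wrapper cuts on tabs and then splits the third column on spaces, which suggests a tab-delimited table with spaces inside Sample_name. Under that assumption (and it is only an assumption), Python's csv module can read the table directly. The function name `names_from_tsv` is hypothetical, and stripping the `R1.fastq.gz` ending mirrors the `${fq%R1.fastq.gz}` expansion in the loop above.

```python
import csv
import io


def names_from_tsv(handle):
    """Map directory prefix -> new name, assuming a tab-delimited table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    result = {}
    for row in reader:
        if len(row) < 3:
            continue  # blank or malformed row
        parts = row[2].split()  # Sample_name such as "B 7 t"
        if len(parts) != 3:
            continue  # not the three-component name this sketch expects
        group, number, kind = parts
        fq_file = row[0]
        ending = "R1.fastq.gz"
        directory = fq_file[: -len(ending)] if fq_file.endswith(ending) else fq_file
        result[directory] = "_".join((number, kind, group))
    return result


# Small in-memory example using one row from the question
example = io.StringIO(
    "fq_file\tSample_ID\tSample_name\tLibrary_ID\tFC_Number\tTrack_Lanes_Pos\n"
    "L2354_Track-3870_R1.fastq.gz\tS1731_A_GFP_c\tA GFP c\tL2354_A_GFP_c\t163\t5\n"
)
example_names = names_from_tsv(example)
```

If the real file turns out to be space-separated instead, every row ends up with a single column and the result is simply empty, which makes a wrong guess easy to spot.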

score 0 · Accepted Answer

Here's one way using a shell script. Run it like this:

script.sh /path/to/samples.txt /path/to/data

Contents of script.sh:

# add directory names to an array
while IFS= read -r -d '' dir; do

    dirs+=("$dir")

done < <(find "$2"/* -type d -print0)


# process the sample list
while IFS=$'\t' read -r -a list; do

    for i in "${dirs[@]}"; do

        # if the directory is in the sample list
        if [ "${i##*/}" == "${list[0]%R1.fastq.gz}" ]; then

            tag="${list[3]}_${list[4]}_${list[2]}"
            new="${i%/*}/$tag"
            bam="$new/accepted_hits.bam"

            # only change name if there's a bam file (checked in the old directory)
            if [ -f "$i/accepted_hits.bam" ]; then

                mv "$i" "$new"
                mv "$bam" "$new/$tag.bam"
            fi
        fi
    done

done < <(tail -n +2 "$1")
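
The script above guards against a missing bam file. In the same spirit, one extra pre-flight check that may be worth having is for target names that collide, since `mv` onto an existing directory moves the source inside it instead of renaming it. The Python sketch below is an illustration only: `find_collisions` is a hypothetical helper, and `pairs` is assumed to be a list of (old directory, new name) tuples built from samples.txt.

```python
from collections import Counter


def find_collisions(pairs, existing=()):
    """Return new names that are used more than once or that already exist."""
    counts = Counter(new for _old, new in pairs)
    duplicates = {name for name, count in counts.items() if count > 1}
    present = set(existing)
    # A name that already exists on disk clashes unless it is the source itself
    clashes = {new for old, new in pairs if new in present and new != old}
    return sorted(duplicates | clashes)
```

Running it on the planned pairs, with `existing` set to the current directory listing, before any `mv` gives a list that should be empty.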
