Questions tagged "genome" - Stack Overflow (Japanese site)

0 votes

2 answers

2225 views

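python - Fast way to get human genome sequences by coordinates

I want to randomly fetch a large number of human genome fragments (more than 500 million).

This is one part of a larger process. I have a .sam result file from bowtie containing 10 million human genome read alignments. I want to compare each query read with the reference sequence it aligned to, as given in the SAM file. The reference I used is hg19.fa from UCSC. So I need to be able to use the positions in the SAM file to fetch sequences from hg19.fa (or from the per-chromosome files).

For example, given chr4:35654-35695, I can get the 42 bp sequence: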

gtcttccagggtttttatattttttgggtttacacttaagt

So far I have two solutions:

1. A Python script that fetches the sequence from the UCSC DAS server: http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr4:35654,35695

2. A Python script that calls the `samtools faidx` command and returns the command output, taken from this post: http://seqanswers.com/forums/showthread.php?t=23606&highlight=fetch+genome+coordinate

However, both are slow. samtools faidx is a little faster than fetching from the DAS server, but it is still slow.

So, is there a FAST way to do this? I have separate per-chromosome FASTA files as well as the hg19.fa file.
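One possible direction, offered as a sketch rather than as an answer from the thread: avoid starting a process or an HTTP request per fragment, and instead read straight from the FASTA file by byte offset, which is the idea behind the `.fai` index that `samtools faidx` builds. The helper names `build_index` and `fetch` are hypothetical, coordinates are assumed to be 1-based and inclusive, and every sequence line within a record is assumed to have the same width (except the last one).

```python
def build_index(fasta_path):
    """Scan the FASTA once and record, per sequence name:
    (length, byte offset of first base, bases per line, bytes per line)."""
    index = {}
    name = None
    length = offset = line_bases = line_bytes = 0
    pos = 0  # byte position of the start of the current line
    with open(fasta_path, "rb") as fh:
        for raw in fh:
            if raw.startswith(b">"):
                if name is not None:
                    index[name] = (length, offset, line_bases, line_bytes)
                # Sequence name is the first word after '>'.
                name = raw[1:].split()[0].decode()
                length = line_bases = line_bytes = 0
                offset = pos + len(raw)
            elif name is not None:
                stripped = raw.rstrip(b"\r\n")
                if line_bases == 0:
                    line_bases, line_bytes = len(stripped), len(raw)
                length += len(stripped)
            pos += len(raw)
    if name is not None:
        index[name] = (length, offset, line_bases, line_bytes)
    return index


def fetch(fh, index, chrom, start, end):
    """Return bases start..end (1-based, inclusive) using one seek and one read.
    `fh` must be the FASTA file opened in binary mode."""
    length, offset, line_bases, line_bytes = index[chrom]
    if not (1 <= start <= end <= length):
        raise ValueError("coordinates out of range")
    first, last = start - 1, end - 1  # 0-based base positions
    byte_start = offset + (first // line_bases) * line_bytes + first % line_bases
    byte_end = offset + (last // line_bases) * line_bytes + last % line_bases + 1
    fh.seek(byte_start)
    chunk = fh.read(byte_end - byte_start)
    # Drop the line breaks that fall inside the requested span.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

With the index built once and the file handle kept open, a call such as `fetch(fh, index, "chr4", 35654, 35695)` costs one seek and one short read. Libraries such as pysam offer indexed FASTA access of this kind, so this sketch mainly illustrates why staying in-process can help.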

2014-04-15T16:25:55.787

0 votes

2 answers

1749 views

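multidimensional-array - AWK: extract lines if a column in file 1 falls within a range declared in two columns of another file

I am currently struggling with an AWK problem that I have not been able to solve yet. I have one huge file (30 GB) of genomic data that holds a list of positions (declared in columns 1 and 2), and a second file that holds a number of ranges (declared in columns 3, 4 and 5). I want to extract every line of the first file whose position falls within a range declared in the second file. Because a position is only unique within a given chromosome (chr), I first have to test whether the chr is identical (that is, col1 of file 1 matches col3 of file 2).

File 1

File 2

Expected output

Summary of what I am trying to do (half coded):

I more or less understand how to solve this by putting file 1 into an array and using the position as the index, but then I still have a problem with the chr, and besides, file 1 is far too big to put in an array (even though I have 128 GB of RAM). I tried a few things with multidimensional arrays, but I could not really work out how to do that either.

Thanks for your help.

Update, 2014/8/5: I added a third line to file 2 containing another range on the same chr as the second line. This line is skipped by the script below.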
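A minimal sketch of one way around the memory problem, written in Python rather than AWK and not taken from the thread: keep only the (small) range file in memory, grouped by chromosome, and stream the 30 GB file line by line. It assumes whitespace-separated columns without a header line, that column 2 of file 1 is the position, and that columns 4 and 5 of file 2 are inclusive range bounds; the function names are hypothetical.

```python
from collections import defaultdict


def load_ranges(lines):
    """Read file 2: chr in column 3, range start/end in columns 4 and 5.
    Returns {chr: [(start, end), ...]} so several ranges per chr are kept."""
    ranges = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or short lines
        ranges[fields[2]].append((int(fields[3]), int(fields[4])))
    return ranges


def filter_positions(lines, ranges):
    """Stream file 1 (chr in column 1, position in column 2) and yield
    every line whose position lies inside any range on the same chr."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        chrom, pos = fields[0], int(fields[1])
        if any(start <= pos <= end for start, end in ranges.get(chrom, ())):
            yield line.rstrip("\n")
```

Because every range is stored in a list under its chromosome, a second range on the same chr is checked as well instead of overwriting the first. For a very large number of ranges per chromosome, a sorted list with binary search or an interval tree would likely be a better fit than the linear scan used here.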

multidimensional-array awk bioinformatics genome

2014-05-07T19:02:02.203

0 votes

1 answer

421 views

r - Complex Superimposed Horizontal R Barplots with Multiple Values on each Bar

I have been trying for months to figure out how to do this, so hopefully somebody can give me some clarity. I have created an R script that displays all of the values in my database's Genes table. So it gives the length (in nucleotides) of each gene, and I lay it out horizontally.

The main idea was to take values from another table called QGRS, which contains the length of each QGRS. The issue I am having is that there are many QGRSs on a single gene, so I can't figure out how to use R to show this. There may be a better way, but my idea was to have the horizontal gene-length bars be one color, and have the QGRS lengths appear right over those bars in a different color to highlight each QGRS's location on the gene. And this is for all of the genes. I don't understand how to get multiple values over a single bar, and then how to superimpose the two graphs properly.

I hope this makes sense. Here is what I have:

And here is what it outputs [long picture!]: [image]

** Note, the numbers on the left are cut off a bit, I have no idea why... but they are the gene IDs straight from NCBI, just a reference to label them as.

Let me know if more information is needed. Please, any help I would greatly appreciate. I really tried to search for the answers for months (this entire past semester), but I don't think I'm very competent at this. It's too complex for me.

Now I know that I could make another graph for the QGRS but if it was this same way, they would each come out on different lines! So that's not helpful.

Also, my Genes table works like this. I have 5 genes per chromosome, for all the chromosomes in the human genome (24 if you count the X and Y separately). So if needed, the genes graph too could be combined to have only 24 lines and where each line consists of the 5 genes, but I doubt this helps.

--------EDIT------------

Here is sample data from the Genes table, the 5 genes for chromosomes 1 and 2:

And here is sample data from the QGRS table (just a few lines for gene '8682', the first line in the sample data above):

r graph bar-chart genome

2014-06-01T22:52:17.400

0 votes

2 answers

190 views

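python - How do I extract sequences from genomes based on the homologs of a sequence?

I have sequences that have homologs in several species, along with the scores of those homologs.

Here is an example record from the gff file:

==>4592637 => the sequence's NAPP (Nucleic Acid Phylogenetic Profiling database) ID (not a GenBank ID)

==>Beutenbergia_cavernae_DSM_12333 => species name of the sequence

==>TILL => type of the sequence

==>70731 .. 70780 => start and end of the sequence

==>clst_id=429 => the ID of this sequence's cluster

==>SubjectOrganism => name of the species in which the sequence has a homolog

==>SubjectScore => score of the sequence's homolog in that species (blastn score)

I want to extract the sequence from each SubjectOrganism at the location where the sequence (4592637) is similar.

How can I use Python to extract, from the genomes, the sequences in which my sequence has homologs?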

python extract sequence biopython genome

2014-07-09T10:27:34.177

0 votes

1 answer

2725 views

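r - Subtracting two GRanges objects from each other

I have looked around, and there does not seem to be a previously posted question on this. I have two GRanges objects with some coordinates, and I want to subtract the intervals of one from the other. This is different from finding overlaps with findOverlaps() or intersect().

For example:

And I want:

The following works, but it is rather clumsy, and it has to be run chromosome by chromosome because the number of intervals per chromosome does not match between the two objects.

Is there a faster, more effective way that works on the two GRanges objects as a whole? Thanks!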
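To make the intended operation concrete, here is a small language-neutral sketch in Python (not R, and not the asker's code) of interval subtraction done for all chromosomes in one pass. It assumes closed, 1-based intervals given as hypothetical `(chrom, start, end)` tuples and ignores strand. Within Bioconductor itself, GenomicRanges documents a `setdiff()` method for GRanges, which may already cover this case.

```python
from collections import defaultdict


def subtract_intervals(a, b):
    """Return the parts of the intervals in `a` that are not covered by any
    interval in `b`. Both inputs are iterables of (chrom, start, end) with
    closed, 1-based coordinates; the two inputs need not have the same
    number of intervals per chromosome."""
    cuts_by_chrom = defaultdict(list)
    for chrom, start, end in b:
        cuts_by_chrom[chrom].append((start, end))
    for cuts in cuts_by_chrom.values():
        cuts.sort()

    result = []
    for chrom, start, end in a:
        cursor = start  # first position not yet emitted or removed
        for cut_start, cut_end in cuts_by_chrom.get(chrom, ()):
            if cut_end < cursor:
                continue  # cut lies entirely before the remaining part
            if cut_start > end:
                break  # cuts are sorted, so nothing further can overlap
            if cut_start > cursor:
                result.append((chrom, cursor, cut_start - 1))
            cursor = max(cursor, cut_end + 1)
            if cursor > end:
                break
        if cursor <= end:
            result.append((chrom, cursor, end))
    return result
```

Grouping the subtracted intervals by chromosome up front is what removes the need for a per-chromosome loop in user code. Unlike a set difference, this sketch does not merge overlapping intervals within `a`; each input interval is trimmed independently.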

r bioinformatics range bioconductor genome

2014-07-10T13:18:24.103

0 votes

1 answer

632 views

biopython - BLASTing a genome with Biopython

I should be getting a result, since the sequence really is a genome sequence, but this gives no output. Where is the error? Is the query correct?

biopython blast genome

2014-07-11T09:54:26.490

0 votes

5 answers

443 views

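perl - Sorting genomic positions in Perl

I have a list of genomic positions in the format chromosome:start-end

For example:

I want to sort these by chromosome number and numeric start position, to get this:

What is an effective and efficient way to do this in Perl?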
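The sort described above comes down to parsing each position into a numeric key once and sorting on that key. Here is a minimal sketch of the idea in Python (not Perl, and not one of the thread's answers). It assumes positions look like `chr2:300-400` with an optional `chr` prefix, and that non-numeric chromosomes such as X and Y should come after the numbered ones; `locus_key` is a hypothetical name.

```python
import re

# chromosome:start-end, with an optional "chr" prefix on the chromosome
_LOCUS = re.compile(r"^(?:chr)?([^:]+):(\d+)-(\d+)$")


def locus_key(locus):
    """Build a sort key: numbered chromosomes in numeric order first,
    then named ones (X, Y, ...) alphabetically; ties broken by start, end."""
    match = _LOCUS.match(locus.strip())
    if match is None:
        raise ValueError(f"unrecognised locus: {locus!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if chrom.isdigit():
        return (0, int(chrom), "", start, end)
    return (1, 0, chrom, start, end)
```

Passing it as `sorted(positions, key=locus_key)` parses each string only once. The same parse-once-then-sort shape is what a Schwartzian transform gives in Perl.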

perl bioinformatics genome

2014-07-18T14:26:48.080

0 votes

1 answer

84 views

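r - Multiple linear models

I currently have two data tables: one contains the independent and control variables in its columns, and the other contains the dependent variables in its rows.

Can anyone help me work out a way to run linear models from the two tables, repeated for each row of the dependent-variable table?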

r statistics lm genome

2014-09-01T20:45:25.860
