I have the following statement in my code:

df.loc[i] = [df.iloc[0][0], i, np.nan]

where i is the iteration variable of the for loop that this statement sits in, np is the imported numpy module, and df is a DataFrame that looks like this:

   build_number   name  cycles
0           390  adpcm   21598
1           390    aes    5441
2           390  dfadd     463
3           390  dfdiv    1323
4           390  dfmul     167
5           390  dfsin   39589
6           390    gsm    6417
7           390   mips    4205
8           390  mpeg2    1993
9           390    sha  348417

As you can see, the statement inserts a new row into the DataFrame df and fills the last column, cycles, of that newly inserted row with a NaN value.
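To make the intent of that statement concrete, here is a minimal sketch on a standalone DataFrame. The three-row sample and the name new_label are made up for illustration, and df.iloc[0, 0] is used in place of df.iloc[0][0] because it reads the same cell in a single indexing step.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample shaped like the DataFrame shown above.
df = pd.DataFrame({
    "build_number": [390, 390, 390],
    "name": ["adpcm", "aes", "dfadd"],
    "cycles": [21598, 5441, 463],
})

# Assigning to a label that does not exist yet appends a new row.
# The build number is copied from the first row; cycles is left as NaN.
new_label = len(df)
df.loc[new_label] = [df.iloc[0, 0], "blowfish", np.nan]

print(df)
```

On a DataFrame created directly like this one, the row is simply appended under the new label.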

However, when I do this, I get the following warning message:

/usr/local/bin/ipython:28: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Having looked at the documentation, I still don't understand what the problem or risk is here. I thought I was already following the recommendation by using loc and iloc?

Thank you.

EDIT: At the request of @EdChum, I have added below the function that uses the above statement:

def patch_missing_benchmarks(refined_dataframe):
    '''
    Patches up a given DataFrame, ensuring that all build_numbers have the complete
    set of benchmark names, inserting NaN values at the column where the data is
    supposed to be residing in.

    Accepts:
    --------
    * refined_dataframe
    DataFrame that was returned from the remove_early_retries() function and that 
    contains no duplicates of benchmarks within a given build number and also has been
    sorted nicely to ensure that build numbers are in alphabetical order.
    However, this function can also accept the DataFrame that has not been sorted, so
    long as it has no repetition of benchmark names within a given build number.

    Returns:
    -------
    * patched_benchmark_df
    DataFrame with all Build numbers filled with the complete set of benchmark data,
    with those previously missing benchmarks now having NaN values for their data.
    '''
    patched_df_list = []
    benchmark_list = ['adpcm', 'aes', 'blowfish', 'dfadd', 'dfdiv', 'dfmul', 
                      'dfsin', 'gsm', 'jpeg', 'mips', 'mpeg2', 'sha']
    benchmark_series = pd.Series(data = benchmark_list)

    for number in refined_dataframe['build_number'].drop_duplicates().values:
        # df must be a DataFrame whose data has been sorted according to build_number
        # followed by benchmark name
        df = refined_dataframe.query('build_number == %d' % number)

        # Now we compare the benchmark names present in our section of the DataFrame
        # with the Series containing the complete collection of Benchmark names and 
        # get back a boolean DataFrame telling us precisely what benchmark names 
        # are missing
        boolean_bench = benchmark_series.isin(df['name'])
        list_names = []
        for i in range(0, len(boolean_bench)):
            if boolean_bench[i] == False:
                name_to_insert = benchmark_series[i]
                list_names.append(name_to_insert)
            else:
                continue
        print 'These are the missing benchmarks for build number',number,':'
        print list_names

        for i in list_names:
            # create a new row with index that is benchmark name itself to avoid overwriting 
            # any existing data, then insert the right values into that row, filling in the 
            # space name with the right benchmark name, and missing data with NaN
            df.loc[i] = [df.iloc[0][0], i, np.nan]

        patched_for_benchmarks_df = df.sort_index(by=['build_number',
                                                      'name']).reset_index(drop = True)

        patched_df_list.append(patched_for_benchmarks_df)

    # we make sure we call a dropna method at threshold 2 to drop those rows whose benchmark
    # names as well as cycles names are NaN, leaving behind the newly inserted rows with
    # benchmark names but that now have the data as NaN values
    patched_benchmark_df = pd.concat(objs = patched_df_list, ignore_index = 
                                     True).sort_index(by= ['build_number',
                                     'name']).dropna(thresh = 2).reset_index(drop = True)

    return patched_benchmark_df
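For reference, here is a reduced sketch of the per-build step in the function above. It differs from that code in one way: the subset returned by query() is copied explicitly before rows are added to it. The sample data, the three-name benchmark list, and the names refined, subset, missing and patched are hypothetical. sort_values stands in for the older sort_index(by=...) call. Whether the warning appears without the .copy() depends on the pandas version, so treat this as an illustration of the pattern rather than a confirmed diagnosis.

```python
import numpy as np
import pandas as pd

# Hypothetical input: build 390 lacks 'blowfish', build 391 is complete.
refined = pd.DataFrame({
    "build_number": [390, 390, 391, 391, 391],
    "name": ["adpcm", "aes", "adpcm", "aes", "blowfish"],
    "cycles": [21598, 5441, 21000, 5400, 900],
})
benchmarks = ["adpcm", "aes", "blowfish"]

# query() returns an object derived from `refined`; .copy() makes it
# explicit that the rows added below belong to this subset only.
subset = refined.query("build_number == 390").copy()

# Benchmark names that this build does not have yet.
present = set(subset["name"])
missing = [name for name in benchmarks if name not in present]

# Add one row per missing benchmark, labelled by the benchmark name,
# with NaN in the cycles column.
for name in missing:
    subset.loc[name] = [subset.iloc[0, 0], name, np.nan]

patched = subset.sort_values(["build_number", "name"]).reset_index(drop=True)
print(patched)
```

The original refined frame is left untouched here; only the copied subset gains the extra row.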
1 Answer