r - Generating dummy webshop data in R: incorporating parameters when randomly generating transactions

Question

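In a course I am currently taking, we are trying to build dummy transaction, customer and product datasets to showcase a machine learning use case in a webshop environment and a financial dashboard. Unfortunately, no dummy data was provided. I figured this would be a good way to improve my R knowledge, but I am running into serious difficulties making it work.

The idea is to specify a few parameters/rules (arbitrary and fictitious, but suitable for demonstrating certain clustering algorithms). I am basically hiding a pattern and then trying to recover it with machine learning (not part of this question). The pattern I am hiding is based on the product adoption life cycle, and is meant to show how different customer types can be identified for targeted marketing.

I will demonstrate what I am looking for. I would like to keep it as realistic as possible. I tried to do so by drawing the number of transactions per customer and other characteristics from normal distributions; I am completely open to other ways of doing this.

Here is how far I have come. First, build a table of customers: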

# Define Customer Types & Respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15)   # Probability of being in each group.

set.seed(1)   # Set seed to make reproducible
Customers <- data.frame(ID=(1:10000), 
  CustomerType = sample(CustomerTypes, size=10000,
                                  replace=TRUE, prob=PropCustTypes),
  NumBought = rnorm(10000,3,2)   # Number of Transactions to Generate, open to alternative solutions?
)
Customers$NumBought[Customers$NumBought < 0] <- 0   # Floor NumBought at 0 (fixes the 'Numbought' typo and the indexing)
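Since the comment above asks for alternatives to `rnorm()` for the number of transactions, here is a minimal sketch using a Poisson draw instead. The rate of 3 simply mirrors the mean used above and is an assumption, as are the names `n_customers` and `num_bought`.

```r
# Sketch: draw per-customer transaction counts from a Poisson distribution.
# lambda = 3 mirrors the mean passed to rnorm() above; it is an assumed value.
set.seed(1)
n_customers <- 10000
num_bought <- rpois(n_customers, lambda = 3)

# Counts are whole numbers and never negative, so no flooring step is needed.
summary(num_bought)
```

A Poisson count also avoids the fractional values that `rnorm()` produces, which matters later when the counts are summed to size the transaction table.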

Next, generate a table of products to choose from:

Products <- data.frame(
  ID=(1:50),
  DateReleased = rep(as.Date("2012-12-12"),50)+rnorm(50,0,8000),
  SuggestedPrice = rnorm(50, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10   # Floor SuggestedPrice at $10
Products$DateReleased[Products$DateReleased < as.Date("2013-04-10")] <- as.Date("2013-04-10")   # No release date earlier than 1 year ago

Now I would like to generate n transactions (the number is in the customer table above), based on the following parameters for each variable that is currently relevant:

Parameters <- data.frame(
  CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"), # Labels match CustomerTypes above
  BySearchEngine   = c(0.10, .40, 0.50, 0.6), # Probability of coming through channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
    stringsAsFactors=FALSE)

Parameters
   CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1  EarlyAdopter            0.1             0.60          0.30          1     0.00
2   Pragmatists            0.4             0.30          0.30          6     0.00
3 Conservatives            0.5             0.15          0.35         12     0.05
4   Dealseekers            0.6             0.05          0.35         12     0.10

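The idea is that "EarlyAdopters" would have (on average, normally distributed) 10% of their transactions labelled "BySearchEngine", 60% "ByDirectCustomer" and 30% "ByPartnerBlog". These values must be mutually exclusive: in the final dataset a transaction cannot be obtained via both a partner blog and a search engine. The options are: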

ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")

Furthermore, I would like to generate a normally distributed discount variable using the means above. For simplicity, the standard deviation may be mean/5.
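One way to read the two requirements above (mutually exclusive channels, and a discount with standard deviation mean/5) is sketched below. It is only an illustration: `params`, `obtained_by` and the helper `draw_for_type()` are hypothetical names, and clipping negative discounts at 0 is an added assumption.

```r
set.seed(1)
# Copy of the relevant parameter columns; values are taken from the table above.
params <- data.frame(
  CustomerType     = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"),
  BySearchEngine   = c(0.10, 0.40, 0.50, 0.60),
  ByDirectCustomer = c(0.60, 0.30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, 0.30, 0.35, 0.35),
  Discount         = c(0, 0, 0.05, 0.10),
  stringsAsFactors = FALSE)
obtained_by <- c("SearchEngine", "DirectCustomer", "PartnerBlog")

# Hypothetical helper: draw n transactions for one customer type.
draw_for_type <- function(type, n) {
  p <- params[params$CustomerType == type, ]
  probs <- unlist(p[c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
  data.frame(
    CustomerType = type,
    # sample() returns exactly one channel per transaction, so channels exclude each other
    ReferredBy = sample(obtained_by, n, replace = TRUE, prob = probs),
    # sd = mean / 5 as proposed; negative draws are clipped to 0 (assumption)
    Discount = pmax(rnorm(n, mean = p$Discount, sd = p$Discount / 5), 0),
    stringsAsFactors = FALSE)
}

demo <- draw_for_type("Conservatives", 1000)
table(demo$ReferredBy) / nrow(demo)   # observed channel shares
mean(demo$Discount)                   # observed average discount
```

With a mean discount of 0 the standard deviation is also 0, so early adopters and pragmatists always get exactly 0 under this rule.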

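Next, the trickiest part: I would like to generate these transactions according to a few rules.

Somewhat evenly distributed over the days, perhaps slightly more on weekends;
Spread out between 2006 and 2014;
The number of transactions per customer spread out over the years;
Customers cannot buy products that have not been released yet.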
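The date rules above could be sketched roughly as follows. The exact date range, the 20% weekend uplift, the sample sizes and the example release date are all assumptions chosen for illustration.

```r
set.seed(1)
# Every calendar day in the (assumed) range 2006-01-01 to 2014-12-31.
all_days <- seq(as.Date("2006-01-01"), as.Date("2014-12-31"), by = "day")

# as.POSIXlt()$wday is 0 for Sunday and 6 for Saturday.
is_weekend <- as.POSIXlt(all_days)$wday %in% c(0, 6)
day_weight <- ifelse(is_weekend, 1.2, 1.0)   # assumed 20% weekend uplift

# Draw purchase dates: roughly even over the days, slightly heavier on weekends.
n_transactions <- 5000                        # illustrative size
purchase_dates <- sample(all_days, n_transactions, replace = TRUE, prob = day_weight)

# For a given product, only days on or after its release are eligible.
release_date <- as.Date("2010-06-01")         # hypothetical product
eligible <- all_days >= release_date
dates_for_product <- sample(all_days[eligible], 100, replace = TRUE,
                            prob = day_weight[eligible])
```

Restricting the candidate days before sampling guarantees the release rule, rather than sampling first and discarding invalid rows afterwards.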

Other parameters:

YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <-  1 # Same question? Likely dependent on YearlyMax

For CustomerID 2, the result could look like this:

Transactions <- data.frame(
    ID        = c(1,2),
    CustomerID = c(2,2), # The customer that bought the item.
    ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
    DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
    ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
    GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
    Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.    

Transactions
  ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
2  2          2       100     2012-12-03   SearchEngine      52.99     0.00

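I am increasingly confident writing R code, but I am struggling to write code that keeps the global parameters (the daily distribution of transactions, the yearly maximum number of transactions per customer) and the various linkages in line:

Timeliness: how quickly people buy after release
ReferredBy: how did this customer arrive at my website?
How much of a discount the customer received (to illustrate how sensitive a customer is to discounts)

Because of this, I do not know whether I should write a for loop over the customer table, generating transactions per customer, or whether I should take a different route. Any contribution is greatly appreciated. I would like to solve this problem in R, but alternative dummy datasets are welcome as well. I will update this post as I make progress.

My current pseudocode:

Assign customers to a customer type with sample()
Generate Customers$NumBought transactions
... still thinking?

EDIT: Generating the transaction table. Next, I need to fill it with the right data: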

Tr <- data.frame(
  ID = 1:sum(Customers$NumBought),
  CustomerID = NA,
  DateOfPurchase = NA,
  ReferredBy = NA,
  GrossPrice=NA,
  Discount=NA)

score 2 · Accepted Answer

Very roughly, set up a database of days and the number of visits on each day:

days <- data.frame(day=1:8000, customerRate = XtotalNumberOfVisits/8000) # mean visits per day (total visits / number of days)
# you could change the customerRate to reflect promotions, time since launch, ...
days$nVisits <- rpois(8000, days$customerRate)

Then catalogue the visits:

    visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
    visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
    visits$nPurchases <- rpois(nrow(visits), XpurchaseRate[visits$customerType])

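All of the variables with an X in front of them are parameters of your process. Similarly, you would go on to generate a transaction database by parameterizing the relative likelihood among the available objects according to your other columns. Alternatively, you can generate a visits database that includes a key to each of the products available on that day: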

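   # XproductIDs and XreleaseDays are placeholders for your own product IDs and release days
   productRelease <- data.frame(id=XproductIDs, releaseDay=sort(XreleaseDays)) # i.e. df is sorted by releaseDay; assumes the first release is on day 1
   visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
   visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
   # number of products already released on each day
   days$productsAvailable <- rep(1:nrow(productRelease), times=diff(c(productRelease$releaseDay, nrow(days)+1)))
   # one row per (visit, available product) pair
   visits <- visits[rep(1:nrow(visits), times=days$productsAvailable[visits$day]),]
   visits$prodID <- ave(rep(1, nrow(visits)), visits$id, FUN=cumsum) # running product index within each visit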

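Then, for each row, you can decide on a function that gives the probability that the customer purchases that item (based on the day, the customer and the product), and fill in the purchases with visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability.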
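As an illustration of such a probability function, here is a minimal sketch. The functional form, the `mean_lag` values, the `base` rate and the toy `visits` table are all assumptions made up for this example; only `XmyProbability` and `didTheyPurchase` come from the text above.

```r
set.seed(1)
# Toy visit-by-product table; every value here is an illustrative assumption.
visits <- data.frame(
  day          = sample(200:400, 1000, replace = TRUE),
  customerType = sample(4, 1000, replace = TRUE),
  releaseDay   = sample(1:200, 1000, replace = TRUE))

# Assumed typical lag (in days) between release and purchase per customer type.
mean_lag <- c(30, 180, 360, 360)

# Hypothetical probability function: highest when the product's age matches
# the customer type's typical lag, falling off on either side.
purchase_prob <- function(day, customerType, releaseDay, base = 0.3) {
  age <- day - releaseDay
  lag <- mean_lag[customerType]
  base * exp(-((age - lag) / lag)^2)
}

XmyProbability <- with(visits, purchase_prob(day, customerType, releaseDay))
# One Bernoulli draw per row: TRUE means the product was purchased on that visit.
visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability
```

Because the function is vectorised, the whole table is scored in one call and no loop over rows is needed.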

Sorry, I typed this straight in, so there are probably typos scattered throughout, but hopefully it gives you the idea.

score 0 · Accepted Answer

Following Gavin, I solved the problem with the following code:

First, instantiate the CustomerTypes:

require(lubridate)
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15)   # Probability for being in each group.

Set the parameters for the customer types:

set.seed(1)   # Set seed to make reproducible
Parameters <- data.frame(
  CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"), # Must match CustomerTypes, otherwise merge() drops these rows
  BySearchEngine   = c(0.10, .40, 0.50, 0.6), # Probability of choosing channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
  stringsAsFactors=FALSE)

Describe the number of visitors:

TotalVisits <- 20000
NumDays <- 100
StartDate <- as.Date("2009-01-04")
NumProducts <- 100
StartProductRelease <- as.Date("2007-01-04") # Products are selected based on this date, so include
                                             # a few years prior, as people also buy products older than 2 years
AnnualGrowth <- 0.15

Now, as suggested, build a dataset of days. I added DaysSinceStart so it can be used to grow the business over time.

days <- data.frame(
  day            = StartDate+1:NumDays, 
  DaysSinceStart = StartDate+1:NumDays - StartDate,
  CustomerRate = TotalVisits/NumDays)

days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
  as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*1.5)
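The code above defines AnnualGrowth and DaysSinceStart but does not yet apply them to the daily rate. A minimal sketch of one way to do so follows; compounding the rate per 365 days and folding the weekend multiplier into the rate are both assumptions, not part of the original solution.

```r
set.seed(1)
TotalVisits  <- 20000
NumDays      <- 100
StartDate    <- as.Date("2009-01-04")
AnnualGrowth <- 0.15

days <- data.frame(day = StartDate + 1:NumDays)
days$DaysSinceStart <- as.numeric(days$day - StartDate)

# Compound the base daily rate by AnnualGrowth per 365 days elapsed.
days$CustomerRate <- (TotalVisits / NumDays) *
  (1 + AnnualGrowth)^(days$DaysSinceStart / 365)

# Weekend uplift applied to the rate itself (1.5 is the multiplier used above).
weekend <- as.POSIXlt(days$day)$wday %in% c(0, 6)
days$CustomerRate[weekend] <- days$CustomerRate[weekend] * 1.5

# rpois() accepts a vector of rates, one per day.
days$nPurchases <- rpois(NumDays, days$CustomerRate)
```

Note that once growth and the weekend uplift are applied, the total number of purchases no longer averages out to TotalVisits; it becomes a starting level rather than a target.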

Now, build the transactions from these days:

Transactions <- data.frame(
  ID           = 1:sum(days$nPurchases),
  Date         = rep(days$day, times=days$nPurchases),
  CustomerType = sample(CustomerTypes, sum(days$nPurchases), replace=TRUE, prob=PropCustTypes),
  NewCustomer  = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
  CustomerID   = NA,
  ProductID = NA,
  ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)

Transactions <- merge(Transactions,Parameters, by="CustomerType") # Append probabilities to table for use in 'sample', haven't found a better way to vlookup?
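On the "vlookup" question in the comment above: base R's `match()` is a common way to do this kind of lookup. The sketch below uses a small toy table (`Tx`, with made-up rows) rather than the real Transactions table.

```r
# Lookup table, reduced to the columns needed for the illustration.
Parameters <- data.frame(
  CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"),
  Timeliness   = c(1, 6, 12, 12),
  Discount     = c(0, 0, 0.05, 0.10),
  stringsAsFactors = FALSE)

# Toy transactions table (illustrative rows only).
Tx <- data.frame(
  ID = 1:5,
  CustomerType = c("Pragmatists", "EarlyAdopter", "Dealseekers",
                   "Pragmatists", "Conservatives"),
  stringsAsFactors = FALSE)

# match() gives, for each transaction, the row of Parameters with the same type.
idx <- match(Tx$CustomerType, Parameters$CustomerType)
Tx$Timeliness <- Parameters$Timeliness[idx]
Tx$Discount   <- Parameters$Discount[idx]
Tx
```

Unlike `merge()`, this keeps the original row order, and a type that is missing from the lookup table shows up as NA instead of the row being dropped silently.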

Start with a few customers that can be chosen from when the customer is not new:

Customers <- data.frame(ID=(1:100), 
                        CustomerType = sample(CustomerTypes, size=100,
                                              replace=TRUE, prob=PropCustTypes)
); Customers$CustomerType <- as.character(Customers$CustomerType)
# Now make a new customer if transaction is with new customer, otherwise choose one with the right type.

Build up a set of products to choose from, with evenly spaced release dates:

ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
  ID = 1:NumProducts,
  DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts,
                                         as.numeric(Upper-Lower),
                                         as.numeric(Upper-Lower)/NumProducts))),
  SuggestedPrice = rnorm(NumProducts, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10   # Floor SuggestedPrice at $10

ReferredByOptions <- c("SearchEngine", "DirectCustomer", "PartnerBlog") # Same order as the By* probability columns

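Next, loop over the newly created Transactions data.frame and choose from the available products (measured as purchase date minus average timeliness (in months) * 30 days, +/- 15 days). New customers are assigned a new CustomerID; otherwise an existing customer is chosen. The other fields are determined by the parameters above.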

Start.time <- Sys.time()
for (i in 1:length(Transactions$ID)){

  if (Transactions[i,]$NewCustomer==1){
    NewCustomerID <- max(Customers$ID, na.rm=T)+1
    Customers[NewCustomerID,]$ID = NewCustomerID
    Transactions[i,]$CustomerID <- NewCustomerID
    Customers[NewCustomerID,]$CustomerType <- Transactions[i,]$CustomerType
  }
  if (Transactions[i,]$NewCustomer==0){
    Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
                                          1,replace=FALSE)
  }
  Transactions[i,]$Discount <- rnorm(1,Transactions[i,]$Discount,Transactions[i,]$Discount/20)
  Transactions[i,]$Timeliness <- rnorm(1,Transactions[i,]$Timeliness, Transactions[i,]$Timeliness/6)
  Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
                               prob=unlist(Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])) # 'Current' was never defined; use this row's channel probabilities

  CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
  ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
  Transactions[i,]$ProductID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}
Elapsed <- Sys.time()-Start.time
length(Transactions$ID)

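And that's it! Unfortunately, it takes about 22 minutes for a dataset with 20,000 products sold over 100 days. That is not necessarily a problem, but I am very interested in potential improvements.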
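One likely source of the run time is the row-by-row assignment into the data.frame inside the loop. As a rough sketch of a possible improvement, several of the per-row draws can be done in single vectorised calls. The toy table `Tx` below stands in for the merged Transactions table, and its values are illustrative. The product choice uses a different technique from the loop: instead of sampling uniformly among products released within +/- 15 days, it jitters a target date by up to 15 days and takes the latest product released on or before it.

```r
set.seed(1)
n <- 20000
# Toy stand-in for the merged Transactions table; values are illustrative.
Tx <- data.frame(
  Date       = as.Date("2009-01-04") + sample(1:100, n, replace = TRUE),
  Timeliness = sample(c(1, 6, 12), n, replace = TRUE),
  Discount   = sample(c(0, 0.05, 0.10), n, replace = TRUE))

# rnorm() is vectorised over mean and sd, so one call replaces n calls.
Tx$Discount   <- rnorm(n, Tx$Discount,   Tx$Discount / 20)
Tx$Timeliness <- rnorm(n, Tx$Timeliness, Tx$Timeliness / 6)

# Evenly spaced release dates, roughly like the Products table above.
Products <- data.frame(
  ID = 1:100,
  DateReleased = seq(as.Date("2007-01-12"), as.Date("2009-04-14"), length.out = 100))

# Target release day per transaction, in numeric days, with +/- 15 days of jitter.
target <- as.numeric(Tx$Date) - Tx$Timeliness * 30 + sample(-15:15, n, replace = TRUE)
target <- pmin(target, as.numeric(Tx$Date))   # never after the purchase date

# findInterval() gives the latest product released on or before each target.
pos <- findInterval(target, as.numeric(Products$DateReleased))
Tx$ProductID <- Products$ID[pmax(pos, 1L)]    # fall back to the first product
```

New-customer IDs can be handled in a similar way, for example by numbering the rows where NewCustomer is 1 with `cumsum()`, though picking returning customers only from those who already exist at that point still needs some care.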
