SQLite WAL concurrent write performance on UNIX systems

Answer 1

Confirmed on Ubuntu 18.04; not yet tested on Windows.

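I simplified your example and added instrumentation code. The first plot shows the number of blobs written by each child process. In it, the plateaus are periods of inactivity on all cores lasting about 0.2 seconds, and the steep rises are bursts of writes on all cores. The second plot shows the raw data; it is most useful with plotly, which does not work in StackOverflow answers.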
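The plateaus can also be located numerically instead of being read off the plot. The following is a minimal base-R sketch, not part of the original answer: `find_stalls()` and its `threshold` argument are hypothetical names, and the demo timestamps are synthetic.

```r
# Hypothetical helper (not from the original answer): report every pause
# longer than `threshold` seconds between consecutive events, pooled
# over all worker processes.
find_stalls <- function(time, threshold = 0.1) {
  time <- sort(as.numeric(time))   # seconds, ascending (POSIXct is accepted)
  gap <- diff(time)                # idle time between consecutive events
  idx <- which(gap > threshold)    # gaps that count as a stall
  data.frame(
    start = time[idx],
    end = time[idx + 1],
    duration = gap[idx]
  )
}

# Synthetic timestamps: two bursts separated by a 0.2 s pause.
demo <- c(0.00, 0.01, 0.02, 0.22, 0.23)
stalls <- find_stalls(demo)
print(stalls)
```

With the timing data from the code further down, the call would presumably be `find_stalls(tbl$time)`; a suitable threshold depends on how long a single insert normally takes.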

With `gc()` enabled, the run takes longer, but the load is distributed more evenly, as the plots below show.
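One way to put a number on "more evenly distributed" is to bin the timestamps into fixed-width windows and compare how much the per-window event counts vary. This is only a sketch under my own assumptions: `burstiness()`, the `width` argument, and the coefficient of variation as the metric are illustrative choices, not something the original measurements used.

```r
# Hypothetical metric (an assumption, not from the original answer):
# coefficient of variation of event counts per time window.
# 0 means perfectly even load; larger values mean burstier writes.
burstiness <- function(time, width = 0.05) {
  time <- as.numeric(time)
  # Integer window index for each event, starting at 1.
  bin <- as.integer(floor((time - min(time)) / width)) + 1L
  # Count events per window, keeping empty windows as zeros.
  counts <- tabulate(bin, nbins = max(bin))
  sd(counts) / mean(counts)
}

# Synthetic data: 100 evenly spaced events vs. two bursts of 50 events.
even <- (0:99) / 10
bursty <- c((0:49) / 100, 9 + (0:49) / 100)

burstiness(even, width = 1)
burstiness(bursty, width = 1)
```

Applied to the two runs, `burstiness(tbl$time)` should come out lower for the run with `gc()` if the load really is spread more evenly.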

I don't know what is going on here. Can you replicate this with this setup and experiment further? I would appreciate feedback here or in the RSQLite issue tracker.

Baseline run, no `gc()`

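make.con <- function() {
  # Print timestamps with microsecond precision.
  options(digits.secs = 6)

  # Assign globally so that fn() can find the connection in each worker.
  con <<- DBI::dbConnect(RSQLite::SQLite(), dbname = "db.sqlite")
  # Write-ahead logging.
  DBI::dbExecute(con, "PRAGMA journal_mode = WAL;")
  # Wait up to 60 s for a lock before failing with SQLITE_BUSY.
  DBI::dbExecute(con, "PRAGMA busy_timeout = 60000;")
  # Hand writes to the operating system without waiting for a disk sync.
  DBI::dbExecute(con, "PRAGMA synchronous = OFF;")
  DBI::dbExecute(con, "
    CREATE TABLE IF NOT EXISTS tmp (
      id INTEGER NOT NULL,
      blob BLOB NOT NULL,
      PRIMARY KEY (id)
  )")
}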
make.con()
#> [1] 0

blob <- serialize(list(rand = runif(1000)), connection = NULL, xdr = FALSE)

fn <- function(x) {
  # Take a timestamp around each DBI call so that stalls can be
  # attributed to a specific step.
  time0 <- Sys.time()
  rs <- DBI::dbSendQuery(con, "INSERT INTO tmp (blob) VALUES (:blob);")
  time1 <- Sys.time()
  DBI::dbBind(rs, params = list("blob" = list(blob)))
  time2 <- Sys.time()
  DBI::dbClearResult(rs)
  time3 <- Sys.time()
  # Uncomment the next line for the run with gc().
  # gc()
  time4 <- Sys.time()
  list(pid = unix::getpid(), time0 = time0, time1 = time1, time2 = time2, time3 = time3, time4 = time4)
}

n <- 1000L

parallel::setDefaultCluster(parallel::makeCluster(8L))
parallel::clusterExport(varlist = c("make.con", "blob"))
invisible(parallel::clusterEvalQ(expr = {
  make.con()
}))

data <- parallel::parLapply(X = 1:n, fun = fn, chunk.size = 50L)

parallel::stopCluster(cl = parallel::getDefaultCluster())

library(tidyverse)

tbl <-
  data %>%
  transpose() %>%
  map(unlist, recursive = FALSE) %>%
  as_tibble() %>%
  rowid_to_column() %>%
  pivot_longer(-c(rowid, pid), names_to = "step", values_to = "time") %>%
  mutate(time = as.POSIXct(time, origin = "1970-01-01")) %>%
  mutate(pid = factor(pid)) %>%
  arrange(time)

tbl %>%
  group_by(pid) %>%
  mutate(cum = row_number()) %>%
  ungroup() %>%
  ggplot(aes(x = time, y = cum, color = pid)) +
  geom_line()

p <-
  tbl %>%
  ggplot(aes(x = time, y = factor(pid), group = 1)) +
  geom_path() +
  geom_point(aes(color = step))

p

plotly::ggplotly(p)

(plotly does not work on StackOverflow)

Created on 2020-01-30 by the reprex package (v0.3.0)
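The timestamps recorded by `fn` can also be turned into per-step durations, which may help show whether the time goes into sending, binding, clearing, or `gc()`. The sketch below is my own addition: `step_durations()` and its column names are hypothetical, and the demo record is synthetic rather than taken from a real run.

```r
# Hypothetical helper (not from the original answer): turn a list of timing
# records, shaped like the return value of fn(), into one row per call with
# the seconds spent in each step.
step_durations <- function(records) {
  secs <- function(a, b) as.numeric(b) - as.numeric(a)
  rows <- lapply(records, function(r) {
    data.frame(
      pid   = r$pid,
      send  = secs(r$time0, r$time1),  # around dbSendQuery()
      bind  = secs(r$time1, r$time2),  # around dbBind()
      clear = secs(r$time2, r$time3),  # around dbClearResult()
      gc    = secs(r$time3, r$time4)   # around gc(), if enabled
    )
  })
  do.call(rbind, rows)
}

# Synthetic record standing in for one element of `data`.
t0 <- as.POSIXct("2020-01-30 12:00:00", tz = "UTC")
demo <- list(list(
  pid = 1L,
  time0 = t0, time1 = t0 + 0.001, time2 = t0 + 0.004,
  time3 = t0 + 0.005, time4 = t0 + 0.005
))
durations <- step_durations(demo)
print(durations)
```

On the real results this would be `step_durations(data)`, after which something like `aggregate(. ~ pid, durations, sum)` gives the total time per step and worker.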

Results with `gc()`
