This document is focused on using data.table
as a
dependency in other R packages. If you are interested in using
data.table
C code from a non-R application, or in calling
its C functions directly, jump to the last
section of this vignette.
Importing data.table
is no different from importing
other R packages. This vignette is meant to answer the most common
questions arising around that subject; the lessons presented here can be
applied to other R packages.
data.table
One of the biggest features of data.table
is its concise
syntax which makes exploratory analysis faster and easier to write and
perceive; this convenience can drive package authors to use
data.table
. Another, perhaps more important reason is high
performance. When outsourcing heavy computing tasks from your package to
data.table
, you usually get top performance without needing
to re-invent any of these numerical optimization tricks on your own.
data.table
is easyIt is very easy to use data.table
as a dependency due to
the fact that data.table
does not have any of its own
dependencies. This applies both to operating system and to R
dependencies. It means that if you have R installed on your machine, it
already has everything needed to install data.table
. It
also means that adding data.table
as a dependency of your
package will not result in a chain of other recursive dependencies to
install, making it very convenient for offline installation.
DESCRIPTION
fileThe first place to define a dependency in a package is the
DESCRIPTION
file. Most commonly, you will need to add
data.table
under the Imports:
field. Doing so
will necessitate an installation of data.table
before your
package can compile/install. As mentioned above, no other packages will
be installed because data.table
does not have any
dependencies of its own. You can also specify the minimal required
version of a dependency; for example, if your package is using the
fwrite
function, which was introduced in
data.table
in version 1.9.8, you should incorporate this as
Imports: data.table (>= 1.9.8)
. This way you can ensure
that the version of data.table
installed is 1.9.8 or later
before your users will be able to install your package. Besides the
Imports:
field, you can also use
Depends: data.table
but we strongly discourage this
approach (and may disallow it in future) because this loads
data.table
into your user’s workspace; i.e. it enables
data.table
functionality in your user’s scripts without
them requesting that. Imports:
is the proper way to use
data.table
within your package without inflicting
data.table
on your user. In fact, we hope the
Depends:
field is eventually deprecated in R since this is
true for all packages.
NAMESPACE
fileThe next thing is to define what content of data.table
your package is using. This needs to be done in the
NAMESPACE
file. Most commonly, package authors will want to
use import(data.table)
which will import all exported
(i.e., listed in data.table
’s own NAMESPACE
file) functions from data.table
.
You may also want to use just a subset of data.table
functions; for example, some packages may simply make use of
data.table
’s high-performance CSV reader and writer, for
which you can add importFrom(data.table, fread, fwrite)
in
your NAMESPACE
file. It is also possible to import all
functions from a package excluding particular ones using
import(data.table, except=c(fread, fwrite))
.
Be sure to read also the note about non-standard evaluation in
data.table
in the section on “undefined
globals”
As an example we will define two functions in a.pkg
package that uses data.table
. One function,
gen
, will generate a simple data.table
;
another, aggr
, will do a simple aggregation of it.
Be sure to include tests in your package. Before each major release
of data.table
, we check reverse dependencies. This means
that if any changes in data.table
would break your code, we
will be able to spot breaking changes and inform you before releasing
the new version. This of course assumes you will publish your package to
CRAN or Bioconductor. The most basic test can be a plaintext R script in
your package directory tests/test.R
:
When testing your package, you may want to use
R CMD check --no-stop-on-test-error
, which will continue
after an error and run all your tests (as opposed to stopping on the
first line of script that failed) NB this requires R 3.4.0 or
greater.
testthat
It is very common to use the testthat
package for
purpose of tests. Testing a package that imports data.table
is no different from testing other packages. An example test script
tests/testthat/test-pkg.R
:
context("pkg tests")
test_that("generate dt", { expect_true(nrow(gen()) == 100) })
test_that("aggregate dt", { expect_true(nrow(aggr(gen())) < 100) })
If data.table
is in Suggests (but not Imports) then you
need to declare .datatable.aware=TRUE
in one of the R/*
files to avoid “object not found” errors when testing via
testthat::test_package
or
testthat::test_check
.
data.table
’s use of R’s deferred evaluation (especially
on the left-hand side of :=
) is not well-recognised by
R CMD check
. This results in NOTE
s like the
following during package check:
* checking R code for possible problems ... NOTE
aggr: no visible binding for global variable 'grp'
gen: no visible binding for global variable 'grp'
gen: no visible binding for global variable 'id'
Undefined global functions or variables:
grp id
The easiest way to deal with this is to pre-define those variables
within your package and set them to NULL
, optionally adding
a comment (as is done in the refined version of gen
below).
When possible, you could also use a character vector instead of symbols
(as in aggr
below):
gen = function (n = 100L) {
id = grp = NULL # due to NSE notes in R CMD check
dt = as.data.table(list(id = seq_len(n)))
dt[, grp := ((id - 1) %% 26) + 1
][, grp := letters[grp]
][]
}
aggr = function (x) {
stopifnot(
is.data.table(x),
"grp" %in% names(x)
)
x[, .N, by = "grp"]
}
The case for data.table
’s special symbols
(e.g. .SD
and .N
) and assignment operator
(:=
) is slightly different (see ?.N
for more,
including a complete listing of such symbols). You should import
whichever of these values you use from data.table
’s
namespace to protect against any issues arising from the unlikely
scenario that we change the exported value of these in the future,
e.g. if you want to use .N
, .I
, and
:=
, a minimal NAMESPACE
would have:
Much simpler is to just use import(data.table)
which
will greedily allow usage in your package’s code of any object exported
from data.table
.
If you don’t mind having id
and grp
registered as variables globally in your package namespace you can use
?globalVariables
. Be aware that these notes do not have any
impact on the code or its functionality; if you are not going to publish
your package, you may simply choose to ignore them.
Common practice by R packages is to provide customization options set
by options(name=val)
and fetched using
getOption("name", default)
. Function arguments often
specify a call to getOption()
so that the user knows (from
?fun
or args(fun)
) the name of the option
controlling the default for that parameter;
e.g. fun(..., verbose=getOption("datatable.verbose", FALSE))
.
All data.table
options start with datatable.
so as to not conflict with options in other packages. A user simply
calls options(datatable.verbose=TRUE)
to turn on verbosity.
This affects all data.table function calls unless
verbose=FALSE
is provided explicitly;
e.g. fun(..., verbose=FALSE)
.
The option mechanism in R is global. Meaning that if a user
sets a data.table
option for their own use, that setting
also affects code inside any package that is using
data.table
too. For an option like
datatable.verbose
, this is exactly the desired behavior
since the desire is to trace and log all data.table
operations from wherever they originate; turning on verbosity does not
affect the results. Another unique-to-R and excellent-for-production
option is R’s options(warn=2)
which turns all warnings into
errors. Again, the desire is to affect any warning in any package so as
to not miss any warnings in production. There are 6
datatable.print.*
options and 3 optimization options which
do not affect the result of operations. However, there is one
data.table
option that does and is now a concern:
datatable.nomatch
. This option changes the default join
from outer to inner. [Aside, the default join is outer because outer is
safer; it doesn’t drop missing data silently; moreover it is consistent
to base R way of matching by names and indices.] Some users prefer inner
join to be the default and we provided this option for them. However, a
user setting this option can unintentionally change the behavior of
joins inside packages that use data.table
. Accordingly, in
v1.12.4 (Oct 2019) a message was printed when the
datatable.nomatch
option was used, and from v1.14.2 it is
now ignored with warning. It was the only data.table
option
with this concern.
If you face any problems in creating a package that uses data.table,
please confirm that the problem is reproducible in a clean R session
using the R console: R CMD check package.name
.
Some of the most common issues developers are facing are usually
related to helper tools that are meant to automate some package
development tasks, for example, using roxygen
to generate
your NAMESPACE
file from metadata in the R code files.
Others are related to helpers that build and check the package.
Unfortunately, these helpers sometimes have unintended/hidden side
effects which can obscure the source of your troubles. As such, be sure
to double check using R console (run R on the command line) and ensure
the import is defined in the DESCRIPTION
and
NAMESPACE
files following the instructions above.
If you are not able to reproduce problems you have using the plain R
console build and check, you may try to get some support based on past
issues we’ve encountered with data.table
interacting with
helper tools: devtools#192 or
devtools#1472.
Since version 1.10.5 data.table
is licensed as Mozilla
Public License (MPL). The reasons for the change from GPL should be read
in full here and
you can read more about MPL on Wikipedia here and
here.
data.table
: SuggestsIf you want to use data.table
conditionally, i.e., only
when it is installed, you should use Suggests: data.table
in your DESCRIPTION
file instead of using
Imports: data.table
. By default this definition will not
force installation of data.table
when installing your
package. This also requires you to conditionally use
data.table
in your package code which should be done using
the ?requireNamespace
function. The below example
demonstrates conditional use of data.table
’s fast CSV
writer ?fwrite
. If the data.table
package is
not installed, the much-slower base R ?write.table
function
is used instead.
my.write = function (x) {
if(requireNamespace("data.table", quietly=TRUE)) {
data.table::fwrite(x, "data.csv")
} else {
write.table(x, "data.csv")
}
}
A slightly more extended version of this would also ensure that the
installed version of data.table
is recent enough to have
the fwrite
function available:
my.write = function (x) {
if(requireNamespace("data.table", quietly=TRUE) &&
utils::packageVersion("data.table") >= "1.9.8") {
data.table::fwrite(x, "data.csv")
} else {
write.table(x, "data.csv")
}
}
When using a package as a suggested dependency, you should not
import
it in the NAMESPACE
file. Just mention
it in the DESCRIPTION
file. When using
data.table
functions in package code (R/* files) you need
to use the data.table::
prefix because none of them are
imported. When using data.table
in package tests
(e.g. tests/testthat/test* files), you need to declare
.datatable.aware=TRUE
in one of the R/* files.
data.table
in Imports
but nothing
importedSome users (e.g.)
may prefer to eschew using importFrom
or
import
in their NAMESPACE
file and instead use
data.table::
qualification on all internal code (of course
keeping data.table
under their Imports:
in
DESCRIPTION
).
In this case, the un-exported function [.data.table
will
revert to calling [.data.frame
as a safeguard since
data.table
has no way of knowing that the parent package is
aware it’s attempting to make calls against the syntax of
data.table
’s query API (which could lead to unexpected
behavior as the structure of calls to [.data.frame
and
[.data.table
fundamentally differ, e.g. the latter has many
more arguments).
If this is anyway your preferred approach to package development,
please define .datatable.aware = TRUE
anywhere in your R
source code (no need to export). This tells data.table
that
you as a package developer have designed your code to intentionally rely
on data.table
functionality even though it may not be
obvious from inspecting your NAMESPACE
file.
data.table
determines on the fly whether the calling
function is aware it’s tapping into data.table
with the
internal cedta
function (Calling
Environment is Data
Table Aware), which, beyond checking
the ?getNamespaceImports
for your package, also checks the
existence of this variable (among other things).
For more canonical documentation of defining packages dependency check the official manual: Writing R Extensions.
Some of internally used C routines are now exported on C level thus
can be used in R packages directly from their C code. See ?cdt
for details and Writing
R Extensions Linking to native routines in other packages
section for usage.
Some tiny parts of data.table
C code were isolated from
the R C API and can now be used from non-R applications by linking to
.so / .dll files. More concrete details about this will be provided
later; for now you can study the C code that was isolated from the R C
API in src/fread.c
and src/fwrite.c.
To convert a Depends
dependency on
data.table
to an Imports
dependency in your
package, follow these steps:
Before:
Depends:
R (>= 3.5.0),
data.table
Imports:
After:
Depends:
R (>= 3.5.0)
Imports:
data.table
R CMD check
Run R CMD check
to identify any missing imports or
symbols. This step helps:
data.table
that are not explicitly imported..N
, .SD
,
and :=
.Note: Not all such usages are caught by R CMD check
. In
particular, R CMD check
skips some symbols/functions in
formulas and will completely miss parsed expressions like
parse(text = "data.table(a = 1)")
. Packages will need good
test coverage to detect these edge cases.
Based on the R CMD check
results, ensure all used
functions, special symbols, S3 generics, and S4 classes from
data.table
are imported.
That means adding importFrom(data.table, ...)
directives
for symbols, functions, and S3 generics, and/or
importClassesFrom(data.table, ...)
directives for S4
classes as appropriate. See ‘Writing R Extensions’ for full details on
how to do so properly.
Alternatively, you can import all functions from
data.table
at once, though this is generally not
recommended:
Justification for Avoiding Blanket Imports: 1.
Documentation: The NAMESPACE file can serve as good
documentation of how you depend on certain packages. 2. Avoiding
Conflicts: Blanket imports leave you open to subtle breakage.
For example, if you import(pkgA)
and
import(pkgB)
, but later pkgB exports a function also
exported by pkgA, this will break your package due to conflicts in your
namespace, which is disallowed by R CMD check
and CRAN.
When you move a package from Depends
to
Imports
, it will no longer be automatically attached when
your package is loaded. This can be important for examples, tests,
vignettes, and demos, where Imports
packages need to be
attached explicitly.
Before (with Depends
):
# data.table functions are directly available
library(MyPkgDependsDataTable)
dt <- data.table(x = 1:10, y = letters[1:10])
setDT(dt)
result <- merge(dt, other_dt, by = "x")
After (with Imports
):
Imports
Depends
alters your
users’ search()
path, possibly without their wanting to do
so.Depends
can lead to conflicts and compatibility issues over
time.data.table
data.table
is easyDESCRIPTION
fileNAMESPACE
filetestthat
data.table
: Suggestsdata.table
in Imports
but nothing imported