Those who work with R know that the language was originally designed as a tool for interactive use. Naturally, techniques that are convenient for step-by-step console work by a subject-matter expert are of little use when building an application for an end user. The ability to get detailed diagnostics immediately after an error, inspect all variables and the stack trace, and execute pieces of code manually (perhaps after tweaking some variables) — none of this is available when R applications run unattended in an enterprise environment. (When we say R applications, we mostly mean Shiny web applications.)
However, things are not that bad. The R ecosystem (packages and approaches) has matured to the point where a handful of very simple techniques can elegantly solve the problem of making user-facing applications stable and reliable. Some of them are described below.
This is a continuation of previous publications.
What is the complexity of the problem?
The main class of tasks for which R is used is data processing in all its variety. And even a fully debugged algorithm, surrounded by tests on all sides and thoroughly documented, can easily break and produce nonsense if malformed data is fed into it.
Data can come in from other information systems as well as from users. In the first case you can demand compliance with an API and impose very strict requirements on the stability of the data flow; in the second, there is no escaping surprises. A person can make a mistake and upload the wrong file, or put the wrong content into it. 99% of users work with Excel and prefer to feed it to the system: multi-sheet workbooks with clever formatting. In this case the task gets even harder. A document that looks perfectly valid to a human may be complete nonsense from the machine's point of view. Dates drift (the famous story that "Excel's designer thought 1900 was a leap year, but it was not"). Numeric values are stored as text, and vice versa. Invisible cells, hidden formulas... and much more. It is impossible to anticipate every possible pitfall in advance — imagination simply runs out. Duplicated records showing up in various joins against messy sources are alone worth a lot.
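One defensive tactic against the "numbers stored as text" problem is to read every Excel column as text first and only then attempt conversion, reporting each cell that fails. A minimal base-R sketch — the column names and values here are invented for illustration, and the data frame stands in for what `readxl::read_excel(path, col_types = "text")` would return:

```r
# Simulated result of reading a sheet with everything forced to text:
# numbers typed as text arrive as character strings, bad cells and all.
raw <- data.frame(
  qty   = c("10", "7", "n/a", "3"),
  price = c("1.50", "2,30", "4.00", "5.10"),  # "2,30": a decimal comma slipped in
  stringsAsFactors = FALSE
)

# Try to convert the expected-numeric columns; collect a message per
# column listing the rows where conversion failed.
check_numeric <- function(df, cols) {
  msgs <- character(0)
  for (col in cols) {
    converted <- suppressWarnings(as.numeric(df[[col]]))
    bad <- which(!is.na(df[[col]]) & is.na(converted))
    if (length(bad) > 0)
      msgs <- c(msgs, sprintf("Column '%s': non-numeric values at rows %s",
                              col, toString(bad)))
    df[[col]] <- converted
  }
  list(data = df, problems = msgs)
}

res <- check_numeric(raw, c("qty", "price"))
print(res$problems)
```

This way the user gets a list of every offending cell at once, instead of a cryptic failure on the first one.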
We will also take the following into account:
The excellent document "An introduction to data cleaning with R" describes the process of preliminary data preparation. For what follows, we single out two validation phases: technical and logical.
- Technical validation checks that the data source itself is correct: structure, types, quantitative characteristics.
- Logical validation can be multi-stage and carried out in the course of the calculations; it verifies that individual data elements, or combinations of them, satisfy various logical requirements.
- One of the basic rules of user-interface development is to produce the most complete diagnostics possible when the user makes a mistake. That is, if the user has already uploaded a file, it should be checked as thoroughly as possible and a full report of all errors returned (ideally with an explanation of what exactly is wrong), rather than failing at the very first problem with a message like "Incorrect input value @ line 528493, pos 17" and demanding a new file with just that one bug fixed. This approach reduces the number of iterations needed to produce a correct source file many times over and improves the quality of the final result.
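This "collect everything, then report" style maps directly onto the assertion collections provided by the `checkmate` package. A minimal sketch — the data frame and the particular rules are invented for illustration:

```r
library(checkmate)

upload <- data.frame(name = c("a", "b"), val = c(1, -5))

# Accumulate ALL failed checks instead of stopping at the first one
coll <- makeAssertCollection()
assert_data_frame(upload, min.rows = 1, add = coll)
assert_subset(c("name", "val", "ship_date"), names(upload), add = coll)  # fails: no ship_date
assert_numeric(upload$val, lower = 0, add = coll)                        # fails: -5

msgs <- coll$getMessages()   # full list of diagnostics to show the user
# reportAssertions(coll)     # or: raise one error listing every problem
```

Passing checks add nothing to the collection, so `msgs` ends up containing exactly one message per violated rule.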
Validation technologies and methods
Let's start from the end. For logical validation there are a number of packages; in our practice we settled on the following approaches.
- The now-classic `dplyr`. In simple cases it is convenient to simply build a pipe with a series of checks and analyze the final result.
- The `validate` package, for checking technically correct objects against a specified set of rules.
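A sketch of what a rule check with `validate` can look like — the data and the particular rules are invented for illustration:

```r
library(validate)

orders <- data.frame(qty   = c(10, -2, 5),
                     price = c(1.5, 2.0, NA))

# Declare the business rules once, apart from the data
rules <- validator(
  qty > 0,                 # quantities must be positive
  !is.na(price),           # price is mandatory
  qty * price < 1000       # sanity cap on the order total
)

# Confront the data with the rules and get a per-rule report
cf <- confront(orders, rules)
summary(cf)   # counts of passes, fails and NAs for each rule
```

The rules live separately from the code that applies them, so domain experts can review (or even maintain) them without reading the rest of the application.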
For technical validation, we chose the following approaches:
- The `checkmate` package, with its wide range of fast functions for various technical checks.
- Explicit exception handling: "Advanced R. Debugging, condition handling, and defensive programming", "Advanced R. Beyond Exception Handling: Conditions and Restarts" — both for carrying out the full scope of validation in one pass and for keeping the application stable.
- The `purrr` wrappers for exceptions. Very useful inside a pipe.
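For example, `purrr::safely()` turns a function that may throw into one that always returns a `result`/`error` pair, so a pipe can process every element and report the failures afterwards. A generic sketch (not code from the article):

```r
library(purrr)

safe_log <- safely(log)          # never throws: returns list(result=, error=)

inputs <- list(10, "oops", 100)  # one element is deliberately broken
res    <- map(inputs, safe_log)

oks  <- compact(map(res, "result"))  # successful results only
errs <- compact(map(res, "error"))   # captured conditions only
```

`possibly()` (fixed fallback value) and `quietly()` (captures messages and warnings) follow the same pattern when a full error object is not needed.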
In code that is broken into functions, an important element of "defensive programming" is validation of the functions' input and output parameters. In a dynamically typed language, type checking has to be done by hand. The `checkmate` package is ideal for basic types, especially its `qtest`/`qassert` functions. For checking a `data.frame` we settled on roughly the following construction (checking both names and types). The trick of merging the name and the type into one string reduces the number of lines in the check.
```r
ff <- function(dataframe1, dataframe2){
  # name of the current function, for diagnostics
  calledFun <- deparse(as.list(sys.call())[[1]])
  tic("Calculating XYZ")
  # expected structure: "column name :: type" (class, not typeof, so Date stays Date)
  list(dataframe1 = c("name :: character", "val :: numeric", "ship_date :: Date"),
       dataframe2 = c("out :: character", "label :: character")) %>%
    purrr::iwalk(~{
      flog.info(glue::glue("Function {calledFun}: checking '{.y}' parameter ",
                           "with expected structure '{collapse(.x, sep=', ')}'"))
      rlang::eval_bare(rlang::sym(.y)) %>%
        assertDataFrame(min.rows = 1, min.cols = length(.x)) %>%
        {assertSetEqual(.x, stri_join(names(.), map_chr(., class), sep = " :: "),
                        .var.name = .y)}
      # {assertSubset(.x, stri_join(names(.), map_chr(., typeof), sep = " :: "))}
    })
  …
}
```
As the type-reporting function you can pick whichever suits the expected data. `class` was chosen because it reports a date as `Date` rather than as a number (its internal representation). The question of determining data types in R is covered in detail in the discussion "'mode' and 'class' and 'typeof' are insufficient". Between `assertSetEqual` and `assertSubset`, choose the former for an exact match of the columns, or the latter for the minimal requirement that the expected columns merely be present.
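The `qassert` shorthand mentioned above compresses class, NA policy, length and range into a single pattern string. A few illustrative calls (the values are arbitrary):

```r
library(checkmate)

# One compact pattern string: class letter (uppercase = NAs forbidden),
# length constraint, and an optional numeric range
qassert(3.14, "N1")         # single numeric value, no NAs
qassert(letters, "S+")      # character vector of length >= 1
qassert(0.5, "N1[0,1]")     # single numeric within [0, 1]
ok <- qtest("oops", "N1")   # qtest() returns FALSE instead of stopping
```

`qassert` aborts with an informative error on violation, while `qtest` just returns a logical — handy when you want to collect diagnostics yourself.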
For practical purposes, this small set completely covers most needs.
Previous publication - R as a lifeline for the system administrator .