
Gallery: datatool performance


Documents that use the datatool package can take a long time to build, but there are various ways to speed things up a bit. The test files described here can be found in the src/tests/performance subdirectory of datatooltk’s GitHub repository.

The tests described here were compiled with datatool.sty v2.29 and datatooltk v1.8. Older versions may cause an undefined control sequence error or unknown option error.

1. Load Speed

Simply loading data can take a while, especially for large amounts of data. The tests in this section load data using various methods, and then simply iterate over the data, displaying the first three columns:

\DTLforeach*{testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {\Fieldi, \Fieldii, \Fieldiii.\par}

or (where \testdata has been defined as the database label):

\DTLforeach*{\testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {\Fieldi, \Fieldii, \Fieldiii.\par}

Each test document has an associated bash script to perform the complete document build. The build time is then calculated using the GNU time command:

/usr/bin/time -f "$filename:\t%e\t%E" --append --output=$logfile $filename

where $filename is the name of the document build bash script ($logfile is just the name of the file containing the results of all the tests). The results are summarised in Table 1.

Table 1: Load Summary
Test    Build Time (minutes:seconds)
1.1 \DTLloaddb (1000x3) 0:12.71
1.2 \DTLloadrawdb (1000x3) 0:12.00
1.3 \DTLnewdb (1000x3) 0:09.58
1.4 datatooltk --nomap-tex-specials --csv (1000x3) 0:01.03
1.5 datatooltk --map-tex-specials --csv (1000x3) 0:01.14
1.6 datatooltk --in with interpreted \DTLnewdb (1000x3) 0:01.68
1.7 datatooltk skipped (1000x3) 0:00.71
1.8 \DTLloaddb (1000x6) 0:39.00
1.9 \DTLloadrawdb (1000x6) 0:42.30
1.10 \DTLnewdb (1000x6) 0:39.62
1.11 datatooltk --nomap-tex-specials --csv (1000x6) 0:01.52
1.12 datatooltk --map-tex-specials --csv (1000x6) 0:01.65
1.13 datatooltk --in with interpreted \DTLnewdb (1000x6) 0:02.34
1.14 fp.sty with datatooltk skipped (1000x6) 0:01.11
1.15 pgfmath.sty with datatooltk skipped (1000x6) 0:01.11
1.16 datatooltk --nomap-tex-specials --remove-columns (1000x6) 0:01.44
1.17 datatooltk --remove-columns skipped (1000x6) 0:01.10

1.1 \DTLloaddb (1000x3)

This test uses \DTLloaddb to load the CSV file sample-1000x3.csv, which has a header row:

Field1,Field2,Field3
followed by 1000 rows of 3 columns containing randomly generated words. This data includes \& in one of the entries:
A\&M,amaurosis,world-plaguing

The document (test-csv-doc1.tex) simply loads this data using:

\DTLloaddb{testdata}{sample-1000x3.csv}
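
Putting this together with the loop shown at the start of this section, the complete test document is presumably little more than the following (a sketch; the actual file in the repository may differ slightly):

\documentclass{article}

\usepackage{datatool}

\DTLloaddb{testdata}{sample-1000x3.csv}

\begin{document}
\DTLforeach*{testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {\Fieldi, \Fieldii, \Fieldiii.\par}

\end{document}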

The build process is simply:

pdflatex test-csv-doc1.tex

1.2 \DTLloadrawdb (1000x3)

This test is much like the previous one, but it uses a slightly modified CSV file sample-1000x3-spchar.csv. This only differs by one character from sample-1000x3.csv as it doesn’t escape the & special character:

A&M,amaurosis,world-plaguing

The test document (test-csv-doc3.tex) loads this data using:

\DTLloadrawdb{testdata}{sample-1000x3-spchar.csv}

This command converts & to \&. This method is intended for CSV files that have been generated by a non-TeX aware tool, so the file shouldn’t contain any commands. Neither \DTLloaddb nor \DTLloadrawdb can load CSV files that contain multiline cells.

The build process is simply

pdflatex test-csv-doc3.tex

1.3 \DTLnewdb (1000x3)

For this test, the CSV file from Test 1.1 (sample-1000x3.csv) was converted to sample-1000x3-newdb.tex, which uses the user-level commands provided by datatool for constructing a database. The database and the first few rows are created using (line breaks added for clarity):

\newcommand{\testdata}{testdata}
\DTLnewdb{\testdata}
\DTLnewrow{\testdata}
\DTLnewdbentry{\testdata}{Field1}{graced}
\DTLnewdbentry{\testdata}{Field2}{premankind}
\DTLnewdbentry{\testdata}{Field3}{bachelor}
\DTLnewrow{\testdata}
\DTLnewdbentry{\testdata}{Field1}{toad-housing}
\DTLnewdbentry{\testdata}{Field2}{undersighted}
\DTLnewdbentry{\testdata}{Field3}{levels}
\DTLnewrow{\testdata}
\DTLnewdbentry{\testdata}{Field1}{A\&M}
\DTLnewdbentry{\testdata}{Field2}{amaurosis}
\DTLnewdbentry{\testdata}{Field3}{world-plaguing}

The test document (test-newdb-doc.tex) loads this data using:

\input{sample-1000x3-newdb.tex}

The build process is simply:

pdflatex test-newdb-doc.tex

1.4 datatooltk --nomap-tex-specials --csv (1000x3)

For this test, the CSV file from Test 1.1 (sample-1000x3.csv) is first converted to a .dbtex file by datatooltk. Since the CSV file contains \&, the settings --nomap-tex-specials and --nocsv-escape are required.

The resulting file is then input in the document (test-dbtex-doc.tex) using:

\DTLloaddbtex{\testdata}{data.dbtex}
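
The document itself is presumably just this \DTLloaddbtex line together with the \testdata form of the loop from the start of this section, along these lines (a sketch):

\documentclass{article}

\usepackage{datatool}

\DTLloaddbtex{\testdata}{data.dbtex}

\begin{document}

\DTLforeach*{\testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {\Fieldi, \Fieldii, \Fieldiii.\par}

\end{document}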

The build process is now:

datatooltk --nomap-tex-specials --nocsv-escape --csv sample-1000x3.csv -o data.dbtex
pdflatex -jobname test-dbtex-doc1 test-dbtex-doc

1.5 datatooltk --map-tex-specials --csv (1000x3)

This is like the previous example, except that it uses the CSV file with the unescaped & special character (sample-1000x3-spchar.csv). This means that the --map-tex-specials switch is required. The CSV file doesn’t contain any backslash characters so no escape character is required, but I’ve set it anyway for consistency. The document is exactly the same as for the previous test. This build process is now:

datatooltk --map-tex-specials --csv-escape '\' --csv sample-1000x3-spchar.csv -o data.dbtex
pdflatex -jobname test-dbtex-doc4 test-dbtex-doc

1.6 datatooltk --in with interpreted \DTLnewdb (1000x3)

As from version 1.7, datatooltk will first assume that an input file (optionally identified by --in; not an imported file, such as a CSV file) is in the format created by \DTLsaverawdb. If the contents of the file don’t match this exact format, then datatooltk will use the TeX Parser Library to try to interpret the contents. This is slower, but it means that datatooltk can now read files like sample-1000x3-newdb.tex that contain the database construction commands \DTLnewdb, \DTLnewrow and \DTLnewdbentry.
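
As an aside, a database that has already been loaded in a document can itself be saved in that optimised format, along the lines of the following sketch, so that a subsequent datatooltk --in run can pick it up without needing the TeX Parser Library:

\DTLsaverawdb{testdata}{saved-data.dbtex}% file name chosen just for illustration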

This test still uses the same document test-dbtex-doc.tex from the previous two tests. The build process is now:

datatooltk --in sample-1000x3-newdb.tex -o data.dbtex
pdflatex -jobname test-dbtex-doc5 test-dbtex-doc

1.7 datatooltk skipped (1000x3)

The advantage of using datatooltk is that this step can be skipped in the build process if the data hasn’t changed. This is useful if you can apply conditionals in an automated build sequence. For example, with arara version 4.0 you can add the following comment lines:

% arara: datatooltk: {input: sample-1000x3-newdb.tex, output: data.dbtex} if changed(toFile("sample-1000x3-newdb.tex")) || missing(toFile("data.dbtex"))
% arara: pdflatex

This will only run datatooltk if the database file (data.dbtex) needs updating. Compare this with the first three tests that use \DTLloaddb, \DTLloadrawdb and \DTLnewdb, where on every LaTeX run the data must be parsed and converted into datatool’s internal format.

This test was performed immediately after the previous test, so the file data.dbtex is present and contains 1000 rows and 3 columns. This mimics a subsequent build where the original data hasn’t been modified and therefore the datatooltk invocation can be skipped, so the build process is simply:

pdflatex -jobname test-dbtex-doc6 test-dbtex-doc

1.8 \DTLloaddb (1000x6)

This test is much like Test 1.1, except that it loads a different CSV file (sample-1000x6.csv) that has 1000 rows and 6 columns of randomly generated data. In this case there are no special characters to worry about. The test document (test-csv-1000x6-doc1.tex) still only accesses the first three columns, but all six columns must be parsed when the data is loaded. The build process is simply:

pdflatex test-csv-1000x6-doc1

1.9 \DTLloadrawdb (1000x6)

This is almost identical to the previous test but uses \DTLloadrawdb instead:

\DTLloadrawdb{testdata}{sample-1000x6.csv}

The same CSV file is used. The build process is simply:

pdflatex test-csv-1000x6-doc2

1.10 \DTLnewdb (1000x6)

For this test, the CSV file from the previous test (sample-1000x6.csv) was converted to sample-1000x6-newdb.tex, which uses the user-level commands provided by datatool for constructing a database. Again, the data contains six columns but the document (test-newdb-1000x6-doc.tex) only uses the first three. The build process is simply:

pdflatex test-newdb-1000x6-doc

1.11 datatooltk --nomap-tex-specials --csv (1000x6)

For this test, the CSV file sample-1000x6.csv is first converted to a .dbtex file by datatooltk. The test document is test-dbtex-doc.tex (the same as for the other tests using datatooltk). There are no special characters to worry about so no mapping is required. The build process is:

datatooltk --nomap-tex-specials --nocsv-escape --csv sample-1000x6.csv -o data.dbtex
pdflatex -jobname test-dbtex-1000x6-doc1 test-dbtex-doc

1.12 datatooltk --map-tex-specials --csv (1000x6)

Although sample-1000x6.csv doesn’t contain any special characters, this test uses --map-tex-specials for comparison. The build process is:

datatooltk --map-tex-specials --csv-escape '\' --csv sample-1000x6.csv -o data.dbtex
pdflatex -jobname test-dbtex-1000x6-doc2 test-dbtex-doc

1.13 datatooltk --in with interpreted \DTLnewdb (1000x6)

This uses datatooltk to process the file sample-1000x6-newdb.tex (from Test 1.10) with the TeX Parser Library. The build process is:

datatooltk --in sample-1000x6-newdb.tex -o data.dbtex
pdflatex -jobname test-dbtex-1000x6-doc3 test-dbtex-doc

1.14 datatooltk skipped (1000x6)

Again it’s useful to consider a conditional build where the datatooltk step can be omitted if the original data hasn’t changed. So this test just builds the document (test-dbtex-doc.tex) using the data.dbtex file created from the previous test. The build process is simply:

pdflatex -jobname test-dbtex-1000x6-doc4 test-dbtex-doc

1.15 pgfmath with datatooltk skipped (1000x6)

The datatool package loads either the fp or the pgfmath package to perform floating point arithmetic. The default is to use fp. This test is the same as the previous test except that the test document (test-dbtex-pgfmath-doc.tex) loads datatool with the math=pgfmath option:

\usepackage[math=pgfmath]{datatool}

Again this test skips the datatooltk call in the build process, as the data.dbtex file doesn’t need updating.

pdflatex -jobname test-dbtex-1000x6-doc5 test-dbtex-pgfmath-doc

Comparing the build times between this test and the previous one shows the difference in package loading time, which turns out to be insignificant. (No numerical computations or comparisons are made with either of those packages in these tests, though, so there may still be some difference in performance if those commands are required.)
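
For example, a command such as the following (purely illustrative; it doesn’t appear in any of these test documents) would exercise whichever math engine datatool was loaded with:

\DTLadd{\result}{25.6}{48.2}% stores the sum (73.8) in \result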

1.16 datatooltk --nomap-tex-specials --remove-columns (1000x6)

The document only requires the first three columns, so this test strips the unrequired columns before saving to the .dbtex file. The column removal is always performed last by datatooltk, which allows a column to be used for sorting, merging or filtering purposes even if that column isn’t actually required in the document. Although it takes time to remove the unwanted columns, time is saved by not writing those extra columns to the .dbtex file, which in turn reduces the amount of data the document has to load to that of the smaller 1000 x 3 database from the earlier tests.

datatooltk --nomap-tex-specials --nocsv-escape --csv sample-1000x6.csv -o data.dbtex --remove-columns "4-"
pdflatex -jobname test-dbtex-1000x6-doc8 test-dbtex-doc

1.17 datatooltk --remove-columns skipped (1000x6)

For comparison with Test 1.14, the data.dbtex from the previous test was used without recreating it. The build process is simply:

pdflatex -jobname test-dbtex-1000x6-doc9 test-dbtex-doc

This shows the reduction in build time when the unrequired columns are absent (which reduces the database size to that of the one in Test 1.6).

2. Sort Speed

Sorting is typically the most time-consuming operation performed on a database. The datatool package provides \dtlsort that takes a comparison handler macro as the final argument. There are four pre-defined handler commands, but they all use a fairly primitive character code comparison, so if you are sorting according to a language that contains extended Latin or non-Latin characters, then you simply can’t sort the data in a locale-sensitive manner within TeX unless you provide a sort handler macro that can compare characters according to your alphabet. This would really require a Unicode-aware engine (XeLaTeX or LuaLaTeX) as TeX considers UTF-8 characters as two tokens. So not only is it faster to sort using an external tool, it’s also the only practical way of sorting according to a particular locale.

Since this section is comparing \dtlsort against datatooltk --sort, the tests using sample-1000x3-newdb.tex and sample-1000x6-newdb.tex are omitted.

As in the previous section, the test documents simply iterate over the data, displaying the first three columns. The build is again performed using bash scripts in order to time each test. The results are summarised in Table 2.

Note that there is a significant saving in the document build time if datatooltk is used to sort the data, and (see Tests 1.6 and 1.13 in Table 1) even more so if a conditional is used in the document build to skip datatooltk when the data hasn’t been modified.

Table 2: Sort Summary
Test    Build Time (minutes:seconds)
2.1 \DTLloaddb and \dtlsort (1000x3) 15:31.54
2.2 datatooltk --sort-case-insensitive --sort-locale none (1000x3) 0:01.04
2.3 datatooltk --sort-locale en-GB (1000x3) 0:01.05
2.4 UTF-8 \DTLloaddb and \dtlsort (1000x3) 15:51.16
2.5 UTF-8 \DTLloadrawdb and \dtlsort (1000x3) 15:25.92
2.6 UTF-8 datatooltk --nomap-tex-specials --sort en-GB (1000x3) 0:01.03
2.7 UTF-8 datatooltk --map-tex-specials --sort en-GB (1000x3) 0:01.23
2.8 \DTLloaddb and \dtlsort (1000x6) 30:07.43
2.9 datatooltk --sort en-GB (1000x6) 0:01.63
2.10 datatooltk --sort en-GB --remove-columns (1000x6) 0:01.57

2.1 \DTLloaddb and \dtlsort (1000x3)

This test is much like Test 1.1. The same CSV file sample-1000x3.csv is loaded using

\DTLloaddb{testdata}{sample-1000x3.csv}

but in this case the data is additionally sorted in the document using the case-insensitive handler \dtlicompare:

\dtlsort{Field1}{testdata}{\dtlicompare}

The document build is simply:

pdflatex test-csv-doc2

2.2 datatooltk --sort-case-insensitive --sort-locale none (1000x3)

This test is much like Test 1.4 but additionally sorts the data using datatooltk’s case-insensitive character code comparison. The build process is:

datatooltk --nomap-tex-specials --nocsv-escape --sort Field1 --sort-locale none --sort-case-insensitive --csv sample-1000x3.csv -o data.dbtex
pdflatex -jobname test-dbtex-doc2 test-dbtex-doc

The compile time shown in Table 2 shows that this method is significantly faster and, when compared with the result for Test 1.4 in Table 1, that the added sort action has an insignificant effect on the overall document build time.

2.3 datatooltk --sort-locale en-GB (1000x3)

The previous test uses the case-insensitive character code sort method to provide an analogous comparison to Test 2.1 (which uses the \dtlicompare handler). This test uses a locale-sensitive sort instead. Since the data consists of randomly generated words within the ASCII set, the resulting order is much the same (punctuation may be treated differently).

The build process is now:

datatooltk --nomap-tex-specials --nocsv-escape --sort Field1 --sort-locale en-GB --csv sample-1000x3.csv -o data.dbtex
pdflatex -jobname test-dbtex-doc3 test-dbtex-doc

This is marginally slower than the previous test, but the increase is merely in fractions of a second.

2.4 UTF-8 \DTLloaddb and \dtlsort (1000x3)

The 1000x3 CSV file from earlier was copied and some of the entries in the first column were replaced:

Field1,Field2,Field3
Øresund,premankind,bachelor
Þríhyrningsvatn,undersighted,levels
A\&M,amaurosis,world-plaguing
A\textasciitilde M,childminder,whitestraits
Ängelholm,mortalist,improducible
ætherial,profitmongering,femorofibular
œsophagus,Meigs,fluxes
Łobez,overindustrialization,acanthi
Ćmielów,tetracosane,semipacifistic
Żelechów,thermistors,polysaprobic
Ångström,emotionalise,Ulyanovsk
lackluster,Longport,zinked

The new CSV file was saved as sample-1000x3-utf8.csv and is UTF-8 encoded. Note that in addition to \& there is also a similar entry with \textasciitilde. This is again loaded in the document using \DTLloaddb but the document (test-csv-utf8-doc1.tex) requires UTF-8 support:

\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{datatool}

\DTLloaddb{testdata}{sample-1000x3-utf8.csv}
\dtlsort{Field1}{testdata}{\dtlicompare}

\begin{document}
\DTLforeach*{testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {\Fieldi, \Fieldii, \Fieldiii.\par}

\end{document}

The build process is simply:

pdflatex test-csv-utf8-doc1

2.5 UTF-8 \DTLloadrawdb and \dtlsort (1000x3)

This test won’t work with old versions of datatool. The problem is that \DTLloadrawdb uses expansion to perform the mapping, which is why it’s not supposed to be used with CSV files containing TeX code. Unfortunately, the UTF-8 support provided by inputenc works by turning the first octet of each UTF-8 character into an active character. (This is a symptom of TeX’s pre-Unicode origins, and is not a problem with modern Unicode-aware engines.) These active octets caused expansion problems. This is fixed now that \DTLloadrawdb has switched to using protected expansion, but it’s still important to remember what’s happening behind the scenes. When comparing, say, “AMS” with “Ångström”, the handler isn’t comparing the strings “AMS” and “Ångström”, but is actually comparing “AMS” with “\IeC {\r A}ngstr\IeC {\"o}m”.

The document (test-csv-utf8-doc2.tex) is now:

\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{datatool}

\DTLloadrawdb{testdata}{sample-1000x3-utf8-spchar.csv}
\dtlsort{Field1}{testdata}{\dtlicompare}

\begin{document}
\DTLforeach*{testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {\Fieldi, \Fieldii, \Fieldiii.\par}

\end{document}

The build process again just involves a single LaTeX run:

pdflatex test-csv-utf8-doc2

2.6 UTF-8 datatooltk --nomap-tex-specials --sort-locale en-GB (1000x3)

This test uses the sample-1000x3-utf8.csv file containing UTF-8 characters and LaTeX commands (\& and \textasciitilde). Again the document (test-dbtex-utf8-doc.tex) needs to include UTF-8 support:

\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{datatool}

\DTLloaddbtex{\testdata}{data.dbtex}

\begin{document}

\DTLforeach*{\testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {\Fieldi, \Fieldii, \Fieldiii.\par}

\end{document}

The build process is now:

datatooltk --nomap-tex-specials --nocsv-escape --sort Field1 --sort-locale en-GB --csv sample-1000x3-utf8.csv -o data.dbtex
pdflatex -jobname test-dbtex-utf8-doc1 test-dbtex-utf8-doc

Note that datatooltk (unlike bib2gls) doesn’t try interpreting the database contents while it’s sorting. (The TeX Parser Library, which is also used by bib2gls, is only used when datatooltk’s quick load fails or when importing probsoln files.) This means that \& is seen as two characters (\ and &) by datatooltk and \textasciitilde is seen as 15 characters.

2.7 UTF-8 datatooltk --map-tex-specials --sort-locale en-GB (1000x3)

This is like the previous test except that it uses sample-1000x3-utf8-spchar.csv, which contains the special characters & and ~. Note, however, that --map-tex-specials is performed while the data is being imported. This means that by the time the sorting occurs, the special characters will already have been replaced by their corresponding map values (\& and \textasciitilde by default), so the ordering is the same as the previous test.

The document build process is:

datatooltk --map-tex-specials --csv-escape '\' --sort Field1 --sort-locale en-GB --csv sample-1000x3-utf8-spchar.csv -o data.dbtex
pdflatex -jobname test-dbtex-utf8-doc2 test-dbtex-utf8-doc

2.8 \DTLloaddb and \dtlsort (1000x6)

To compare how well the methods scale, this test is like Test 2.1 but the document (test-csv-1000x6-doc3.tex) uses the larger sample-1000x6.csv file.

The document build is again just a single LaTeX run:

pdflatex test-csv-1000x6-doc3

2.9 datatooltk --sort-locale en-GB (1000x6)

This test is like Test 2.3 but uses the larger sample-1000x6.csv file.

The document build is:

datatooltk --nomap-tex-specials --nocsv-escape --sort Field1 --sort-locale en-GB --csv sample-1000x6.csv -o data.dbtex
pdflatex -jobname test-dbtex-1000x6-doc6 test-dbtex-doc

2.10 datatooltk --sort-locale en-GB --remove-columns (1000x6)

This test is like the previous one but uses --remove-columns to remove the columns that aren’t used in the document. As with Test 1.17, the column removal reduces the build time.

The document build is:

datatooltk --nomap-tex-specials --nocsv-escape --sort Field1 --sort-locale en-GB --csv sample-1000x6.csv -o data.dbtex --remove-columns "4-"
pdflatex -jobname test-dbtex-1000x6-doc7 test-dbtex-doc

3. Filter Speed

Filtering is quite often performed in conjunction with sorting, but for the purposes of comparison, no sorting is performed in these tests. If sorting is also required, see the previous section. As with the removal of unrequired columns, removing unwanted rows also reduces the overall build time.

Table 3: Filter Summary
Test    Build Time (minutes:seconds)
3.1 UTF-8 \DTLloaddb and \DTLifstringlt* (1000x3) 0:12.49
3.2 UTF-8 datatooltk --filter (1000x3) 0:00.43

3.1 UTF-8 \DTLloaddb and \DTLifstringlt* (1000x3)

This test is much like Test 1.1 except that it uses the UTF-8 CSV file sample-1000x3-utf8.csv, which is loaded using

\DTLloaddb{testdata}{sample-1000x3-utf8.csv}

This means that UTF-8 support must be enabled:

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

The loop is now changed to filter the entries so that only those rows where the text in the first column is less than “B” are displayed in the document. This uses the starred version of \DTLifstringlt for a case-insensitive comparison:

\DTLforeach*{testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {%
   \DTLifstringlt*{\Fieldi}{B}%
   {\Fieldi, \Fieldii, \Fieldiii.\DTLpar}%
   {}%
 }

This internally uses \dtlicompare, which is the comparison handler used in the earlier sort tests with \dtlsort. Note that \DTLifstringlt performs a protected expansion on its arguments before passing them to the handler, so the UTF-8 characters will be expanded into their \IeC form. As a result, all the rows that start with an extended character are included, because the handler places \IeC before “A”.

Since \DTLifstringlt is a short command, \par can’t be used, so the paragraph break has been changed to \DTLpar.

The document build again is just a single LaTeX call:

pdflatex test-csv-filter-doc1

Note that the entire CSV file has to be parsed even though only a small portion of it is actually required in the document.

3.2 datatooltk --filter (1000x3)

This method uses datatooltk to filter the same data (sample-1000x3-utf8.csv). The --filter option uses the same comparison as the --sort option, so even though datatooltk isn’t being used to sort the data, the --sort-locale switch is still used.

The document (test-dbtex-utf8-doc.tex) just requires UTF-8 support. There’s no change to the loop as the filtering is performed by datatooltk:

\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{datatool}

\DTLloaddbtex{\testdata}{data.dbtex}

\begin{document}

\DTLforeach*{\testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {\Fieldi, \Fieldii, \Fieldiii.\par}

\end{document}

The document build is

datatooltk --nomap-tex-specials --nocsv-escape --filter Field1 lt B --filter-include --sort-locale en-GB --csv sample-1000x3-utf8.csv -o data.dbtex
pdflatex -jobname test-dbtex-filter-doc1 test-dbtex-utf8-doc

The resulting document includes the rows containing Ängelholm, ætherial and Ångström, but not those containing any of the other extended characters.

Different locales produce different results. For example, if --sort-locale en-GB is changed to --sort-locale is (Icelandic) then “ætherial” is no longer included in the document, whereas --sort-locale sv (Swedish) omits Ängelholm, ætherial and Ångström.

4. Iteration Speed

The tests so far have compared using TeX vs datatooltk to process the data. This section now looks at the various ways of iterating over the data within the document. Since the load time isn’t being considered here, all the test documents in this section use the data.dbtex file created in Test 1.13 using:

datatooltk --in sample-1000x6-newdb.tex -o data.dbtex

This system call isn’t included in the times listed in Table 4. For easier comparison, the results from Test 1.14 and Test 1.17 are included. The build process is a single LaTeX call.

Table 4: Iteration Summary
Test    Build Time (minutes:seconds)
1.14 \DTLforeach* 0:01.11
1.17 \DTLforeach* (1000x6 reduced to 1000x3) 0:01.10
4.1 Unstarred \DTLforeach 0:07.78
4.2 \loop and \dtlgetrow 0:05.09
4.3 \loop and \DTLassign 0:05.14
4.4 \loop and \DTLgetvalue 0:07.05

4.1 Unstarred \DTLforeach

This is a slightly modified version of Test 1.13 that uses the unstarred version of \DTLforeach:

\DTLforeach{\testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {\Fieldi, \Fieldii, \Fieldiii.\par}

Other than that it’s the same as Test 1.13, but since the datatooltk call is omitted from the build times in this section, it needs to be compared with the result of Test 1.14 instead. This takes longer than the starred version, as the unstarred version permits editing of the current row, which means that the current row has to be updated at the end of each iteration.

4.2 \loop and \dtlgetrow

This test uses \loop to iterate over all the rows in the database. In each iteration, \dtlgetrow is used to access the current row and \dtlgetentryfromcurrentrow is used to access the given column of the current row:

\newcount\rowctr
\loop
  \advance\rowctr by 1\relax
  \dtlgetrow{\testdata}{\rowctr}%
  \dtlgetentryfromcurrentrow{\Fieldi}{1}%
  \dtlgetentryfromcurrentrow{\Fieldii}{2}%
  \dtlgetentryfromcurrentrow{\Fieldiii}{3}%
  \Fieldi, \Fieldii, \Fieldiii.\par
\ifnum\rowctr<\DTLrowcount{\testdata}
\repeat

This takes longer than using \DTLforeach* but not as long as the unstarred version.

4.3 \loop and \DTLassign

This method also uses \loop but now the data is accessed using \DTLassign:

\newcount\rowctr
\loop
  \advance\rowctr by 1\relax
  \DTLassign{\testdata}{\rowctr}{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}%
  \Fieldi, \Fieldii, \Fieldiii.\par
\ifnum\rowctr<\DTLrowcount{\testdata}
\repeat

The difference in build time between this and the previous test is insignificant.

4.4 \loop and \DTLgetvalue

This method also uses \loop but now the data is accessed using \DTLgetvalue:

\newcount\rowctr
\loop
  \advance\rowctr by 1\relax
  \DTLgetvalue{\Fieldi}{\testdata}{\rowctr}{1}%
  \DTLgetvalue{\Fieldii}{\testdata}{\rowctr}{2}%
  \DTLgetvalue{\Fieldiii}{\testdata}{\rowctr}{3}%
  \Fieldi, \Fieldii, \Fieldiii.\par
\ifnum\rowctr<\DTLrowcount{\testdata}
\repeat

This test is slower than the previous two tests, but is slightly faster than the unstarred \DTLforeach.

5. Single Lookup

Accessing an individual row of data is a special case of filtering. The datatool package provides a number of commands for looking up a particular row or cell. If you know both the column index and row index you can use \DTLgetvalue{cs}{db}{row idx}{column idx}, which fetches the value and stores it in the control sequence cs. The indexes start from 1 for the first row/column of data. If you want to reference a column by its label instead, \dtlcolumnindex{db}{column label} expands to the column index.

If you want to access an entire row, you can use commands like \DTLassign{db}{row idx}{assign list} or \DTLassignfirstmatch{db}{col key}{value}{assign list}.

The tests here all look up the row from the sample-1000x3.csv data that contains the word “grammatically” in the first column (identified by the label Field1). This row has the index 327 (of 1000 rows of data, which corresponds to line 328 of the CSV file) and is just under a third of the way through the data. If the data happened to contain multiple matching rows, all the tests here, except for Test 5.1, would only match the first instance. A real-world set of data would typically have a column containing a unique identifier, such as a customer number or product code, which would be used to identify the required row.
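
For example, since the row index is known in this case, the cell containing “grammatically” could be fetched directly (a sketch; \myValue is just an illustrative command name):

\DTLgetvalue{\myValue}{testdata}{327}{\dtlcolumnindex{testdata}{Field1}}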

(In general, don’t assume that the row indexes will remain constant if it’s likely that the data source may be edited. For example, data representing students may have John Smith’s data on row 25 in one semester, but if a student higher up in the data drops out and is deleted, the next semester Smith’s data may be on row 24 instead.)

The results for this section are summarised in Table 5. Again, the fastest method is to use datatooltk to preprocess the data.

Table 5: Single Lookup Summary
Test    Build Time (minutes:seconds)
5.1 \DTLloaddb and filtered loop 0:12.04
5.2 \DTLloaddb and filtered loop with break 0:11.56
5.3 \DTLloaddb and \DTLassignfirstmatch 0:11.33
5.4 datatooltk and \DTLassign 0:00.41
5.5 datatooltk and \DTLforeach* with break 0:00.42

5.1 \DTLloaddb and filtered loop

This is just a variation of the filtered example in Test 3.1. As before, the test data sample-1000x3.csv is loaded using:

\DTLloaddb{testdata}{sample-1000x3.csv}

The data is then iterated over as with the filtered example. The string comparison command \ifdefstring is provided by etoolbox, which is automatically loaded by datatool.

\DTLforeach*{testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {%
   \ifdefstring{\Fieldi}{grammatically}%
   {\Fieldi, \Fieldii, \Fieldiii.\par}%
   {}%
 }

This is a rather inefficient method as it iterates over the entire set of data. The document build is simply:

pdflatex test-csv-filter-doc1

5.2 \DTLloaddb and filtered loop with break

This is an improvement on the previous example as it uses \dtlbreak to break out of the loop once the required row has been found:

\DTLforeach*{testdata}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
 {%
   \ifdefstring{\Fieldi}{grammatically}{\dtlbreak}{}%
 }

\Fieldi, \Fieldii, \Fieldiii.

Knowing that only one row is required means that the code doesn’t have to iterate over all 1000 rows. (In this case, the loop is broken when it’s only about 32% of the way through.) This also means that the \DTLforeach block of code can be moved to the preamble and the assigned commands \Fieldi, \Fieldii and \Fieldiii can be used throughout the document.

The document build is simply:

pdflatex test-csv-filter-doc1

The drawback with this method is that if the row isn’t found, \Fieldi, \Fieldii and \Fieldiii will have the value of the last row. You can add a test using the internal conditional \if@endfor provided by xfor (which is used by datatool for loop management).

5.3 \DTLloaddb and \DTLassignfirstmatch

This test uses \DTLassignfirstmatch, which uses \dtlgetrowindex to find the index of the first row to match the given value. Once the row index is found, the row information is accessed using the same internal code as \DTLassign. This means that no loop is required.

\DTLassignfirstmatch{testdata}{Field1}{grammatically}
 {\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}

\Fieldi, \Fieldii, \Fieldiii.

As with the previous test, the assignment command may be moved to the preamble and the values will be available for the whole document. If the match is unsuccessful, an error occurs (“No match found for \DTLassignfirstmatch”, followed by “Undefined control sequence” wherever \Fieldi, \Fieldii or \Fieldiii is subsequently used).

The document build is simply:

pdflatex test-csv-filter-doc1

5.4 datatooltk and \DTLassign

This test uses datatooltk to fetch only the required row. Since the document is now loading a database containing only one row, I know the row index will be 1 (regardless of the row number of the original data), so the data can simply be fetched with \DTLassign:

\DTLassign{\testdata}{1}{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}

\Fieldi, \Fieldii, \Fieldiii.

Again the assignment command may be moved to the preamble. The document build is:

datatooltk --nomap-tex-specials --nocsv-escape --csv sample-1000x3.csv --filter Field1 eq grammatically -o data.dbtex
pdflatex -jobname test-dbtex-single-doc1 test-dbtex-single-doc

If there’s no match, the database will be empty and the document will generate the error:

! Argument of \@dtl@getrow has an extra }.
<inserted text> 
                \par 
l.9 ...di=Field1,\Fieldii=Field2,\Fieldiii=Field3}

So you may want to consider checking the row count first. For example:

\ifcase\DTLrowcount{\testdata}
  \newcommand{\Fieldi}{??}
  \newcommand{\Fieldii}{??}
  \newcommand{\Fieldiii}{??}
\or
  \DTLassign{\testdata}{1}{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
\else
  \GenericWarning{}{Multiple matches found}
  \DTLassign{\testdata}{1}{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
\fi

5.5 datatooltk and \DTLforeach* with break

This is like the previous test but uses \DTLforeach* instead of \DTLassign:

\DTLforeach*{\testdata}{\Fieldi=Field1,\Fieldii=Field2,\Fieldiii=Field3}
{\dtlbreak}

\Fieldi, \Fieldii, \Fieldiii.

Remember that unlike Test 5.1 and Test 5.2, the database loaded by the document only has a single row, so even without \dtlbreak, there’s only a single iteration. This is fractionally longer than Test 5.4 as a result of the loop initialisation. The document build process is:

datatooltk --nomap-tex-specials --nocsv-escape --csv sample-1000x3.csv --filter Field1 eq grammatically -o data.dbtex
pdflatex -jobname test-dbtex-single-doc2 test-dbtex-single-doc2

In this case, if there’s no match the database will be empty, so there are no rows to iterate over. This means that \Fieldi, \Fieldii and \Fieldiii will be undefined.
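
If that’s a possibility, the fields can be guarded with etoolbox’s \ifdef (etoolbox is automatically loaded by datatool), for example:

\ifdef{\Fieldi}
 {\Fieldi, \Fieldii, \Fieldiii.}
 {No matching row was found.}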

Summary

To improve the document build process:

  1. Use datatooltk to parse the data and convert to the more efficient .dbtex file, which can be input in the document using \DTLloaddbtex.
  2. Remove unnecessary data by filtering, truncating or column removal.
  3. If the data needs sorting or column/row removal, combine this in the first step.
  4. Adjust the document build process so that datatooltk is only run when the data is changed (or if the sorting or filtering conditions change).
  5. If you have to iterate over a large database in your document, use the starred form \DTLforeach* (where possible) rather than the slower unstarred form \DTLforeach.