Many `parseInt` call-sites already provide the `radix` argument, and this rule helps improve consistency in the code-base; see https://eslint.org/docs/latest/rules/radix
*Please note:* The rule is disabled in `src/scripting_api/util.js` for now, since it's not obvious at a glance (at least to me) what the correct `radix` argument should be there.
This method is usually used with loops, and it should be a tiny bit more efficient to use an iterator directly rather than first iterating through ` Map`-values to create a temporary `Array` that we finally iterate through at the call-site.
Note that the `getRawValues` method is old code, and originally the `Dict` class stored its data in a regular `Object`, hence why the old code was written that way.
This method is usually used with loops, and it should be a tiny bit more efficient to use an iterator directly rather than first iterating through ` Map`-keys to create a temporary `Array` that we finally iterate through at the call-site.
Note that the `getKeys` method is old code, and originally the `Dict` class stored its data in a regular `Object`, hence why the old code was written that way.
This changes a number of loops currently using `Dict.prototype.{getKeys, getRaw}`, since it should be a tiny bit more efficient to use an iterator directly rather than first iterating through `Map`-keys to create a temporary `Array` that we finally iterate through at the call-site.
Note that the `getKeys` method is much older than `getRawEntries`, and originally the `Dict` class stored its data in a regular `Object`, hence why the old code was written that way.
There's a number of classes where the constructors can be removed completely by instead using class fields, which help to slightly shorten the code.
It seems that `unicorn/prefer-class-fields` ESLint plugin, see PR 20657, unfortunately isn't able to detect all of these cases.
For now it's just possible to create a single pdf in selecting some pages in different pdf sources.
The merge is for now pretty basic (it's why it's still a WIP) none of these data are merged for now:
- the struct trees
- the page labels
- the outlines
- named destinations
For there are 2 new ref tests where some new pdfs are created: one with some extracted pages and an other
one (encrypted) which is just rewritten.
The ref images are generated from the original pdfs in selecting the page we want and the new images are
taken from the generated pdfs.
When there is no tree, the tags for the new annotions are just put under the root element.
When there is a tree, we insert the new tags at the right place in using the value
of structTreeParentId (added in PR #16916).
After PR 12563 we're now free to use optional chaining in the worker-thread as well. (This patch also fixes one previously "missed" case in the `web/` folder.)
For the MOZCENTRAL build-target this patch reduces the total bundle-size by `1.6` kilobytes.
This *special* build-target is very old, and was introduced with the first pre-processor that only uses comments to enable/disable code.
When the new pre-processor was added `PRODUCTION` effectively became redundant, at least in JavaScript code, since `typeof PDFJSDev === "undefined"` checks now do the same thing.
This patch proposes that we remove `PRODUCTION` from the JavaScript code, since that simplifies the conditions and thus improves readability in many cases.
*Please note:* There's not, nor has there ever been, any gulp-task that set `PRODUCTION = false` during building.
This patch removes the existing `forEach` methods, in favor of making the classes properly iterable instead. Given that the classes are using a `Set` respectively a `Map` internally, implementing this is very easy/efficient and allows us to simplify some existing code.
Trying to use a non-string argument in either a `Cmd` or a `Name` is not intended, and would basically be an implementation error. Hence we can add a non-PRODUCTION check to enforce this, similar to the existing one used e.g. in the `Dict.set` method.
Trying to use a non-string `key` in a `Dict` is not intended, and would basically be an implementation error. Hence we can add a non-PRODUCTION check to enforce this, complementing the existing `value` check added in PR 11672.
This helper function is not really needed, since it's just a wrapper around a simple `instanceof` check, and it only adds unnecessary indirection in the code.
At this point all the various Stream-classes extends an abstract base-class, hence this helper function is no longer necessary and only adds unnecessary indirection in the code.
*Please note:* While this patch on its own is sufficient to prevent the worker-thread from hanging, however in combination with PR 14311 these PDF documents will both load *and* render correctly.
Rather than focusing on the particular structure of these PDF documents, it seemed (at least to me) to make sense to try and prevent all circular references when fetching/looking-up data using the XRef table.
To avoid a solution that required tracking the references manually everywhere, the implementation settled on here instead handles that internally in the `XRef.fetch`-method. This should work, since that method *and* the `Parser`/`Lexer`-implementations are completely synchronous.
Note also that the existing `XRef`-caching, used for all data-types *except* Streams, should hopefully help to lessen the performance impact of these changes.
One *potential* problem with these changes could be certain *browser* exceptions, since those are generally not catchable in JavaScript code, however those would most likely "stop" worker-thread parsing anyway (at least I hope so).
Finally, note that I settled on returning dummy-data rather than throwing an exception. This was done to allow parsing, for the rest of the document, to continue such that *one* bad reference doesn't prevent an entire document from loading.
Fixes two of the issues listed in issue 14303, namely the `poppler-91414-0.zip-2.gz-53.pdf` and `poppler-91414-0.zip-2.gz-54.pdf` documents.
By not adding any additional non-`Dict` entries to the list of candidates for merging of sub-dictionaries, we can very slightly reduce the amount of parsing required by not having to *again* iterate through unmergeable data.
Given how trivial the `isEOF` function is, we can simply inline the check at the various call-sites and remove the function (which ought to be ever so slightly more efficient as well).
Furthermore, this patch also changes the `EOF` primitive itself to a `Symbol` instead of an Object since that has the nice benefit of making it unclonable (thus preventing *accidentally* trying to send `EOF` from the worker-thread).
Currently the `!mergeSubDicts` code-path is essentially just duplicated code, which we can easily avoid by simply moving that check. (This may lead to ever so slightly more parsing for this case, but the difference ought to be negligible in practice.)
When working on PR 13612, I mostly prioritized a simple solution that didn't require touching a lot of code. However, while working on PR 13735 I started to realize that the static `Name.empty` construction really wasn't a good idea.
In particular, having a special `Name`-instance where the `name`-property isn't actually a String is confusing (to put it mildly) and can easily lead to issues elsewhere. The only reason for not simply allowing the `name`-property to be an *empty* string, in PR 13612, was to avoid having to touch a lot of existing code. However, it turns out that this is only limited to a few methods in the `PartialEvaluator` and a few of the `BaseLocalCache`-implementations, all of which can be easily re-factored to handle *empty* `Name`-instances.
All-in-all, I think that this patch is even an *overall* improvement since we're now validating (what should always be) `Name`-data better in the `PartialEvaluator`.
This is what I ought to have done from the start, sorry about the code churn here!
Apparently some really bad PDF software can create documents with *empty* `Name`-entries, which we thus need to somehow deal with.
While I don't know if this patch is necessarily the best solution, it should at least ensure that the *empty* `Name`-instance cannot accidentally match a proper `Name`-instance (and it doesn't require changes to a lot of existing code).[1]
---
[1] I briefly considered using a `Symbol` rather than an Object, but quickly decided against that since the former one [is not clonable](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#supported_types) and `Name`-instances may be sent to the API.
Similar to the `get`/`getAsync` methods, this should be a *tiny* bit more efficient which cannot hurt considering that `getArray` is now used a lot more than when initially added.
This patch was tested using the PDF file from issue 2618, i.e. https://bug570667.bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file:
```
[
{ "id": "issue2618",
"file": "../web/pdfs/issue2618.pdf",
"md5": "",
"rounds": 50,
"type": "eq"
}
]
```
which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ---- | -------------
firefox | Overall | 50 | 3417 | 3426 | 9 | 0.27 |
firefox | Page Request | 50 | 1 | 1 | 0 | 5.41 |
firefox | Rendering | 50 | 3416 | 3426 | 9 | 0.27 |
```
Based on these results, there's no significant performance regression from using standard classes and this patch should thus be OK.
By having an abstract base-class, it becomes a lot clearer exactly which methods/getters are expected to exist on all Stream instances.
Furthermore, since a number of the methods are *identical* for all Stream implementations, this reduces unnecessary code duplication in the `Stream`, `DecodeStream`, and `ChunkedStream` classes.
For e.g. `gulp mozcentral`, the *built* `pdf.worker.js` files decreases from `1 619 329` to `1 616 115` bytes with this patch-series.
There's built-in ESLint rule, see `sort-imports`, to ensure that all `import`-statements are sorted alphabetically, since that often helps with readability.
Unfortunately there's no corresponding rule to sort `export`-statements alphabetically, however there's an ESLint plugin which does this; please see https://www.npmjs.com/package/eslint-plugin-sort-exports
The only downside here is that it's not automatically fixable, but the re-ordering is a one-time "cost" and the plugin will help maintain a *consistent* ordering of `export`-statements in the future.
*Note:* To reduce the possibility of introducing any errors here, the re-ordering was done by simply selecting the relevant lines and then using the built-in sort-functionality of my editor.
The `PartialEvaluator.hasBlendModes` method is necessary to determine if there's any blend modes on a page, which unfortunately requires *synchronous* parsing of the /Resources of each page before its rendering can start (see the "StartRenderPage"-message).
In practice it's not uncommon for certain /Resources-entries to be found on more than one page (referenced via the XRef-table), which thus leads to unnecessary re-fetching/re-parsing of data in `PartialEvaluator.hasBlendModes`.
To improve performance, especially in pathological cases, we can cache /Resources-entries when it's absolutely clear that they do not contain *any* blend modes at all[1]. This way, subsequent `PartialEvaluator.hasBlendModes` calls can be made significantly more efficient.
This patch was tested using the PDF file from issue 6961, i.e. https://github.com/mozilla/pdf.js/files/121712/test.pdf:
```
[
{ "id": "issue6961",
"file": "../web/pdfs/issue6961.pdf",
"md5": "a80e4357a8fda758d96c2c76f2980b03",
"rounds": 100,
"type": "eq"
}
]
```
which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, page, stat --
browser | page | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ---- | ------------ | ----- | ------------ | ----------- | ---- | ------ | -------------
firefox | 0 | Overall | 100 | 1034 | 555 | -480 | -46.39 | faster
firefox | 0 | Page Request | 100 | 489 | 7 | -482 | -98.67 | faster
firefox | 0 | Rendering | 100 | 545 | 548 | 2 | 0.45 |
firefox | 1 | Overall | 100 | 912 | 428 | -484 | -53.06 | faster
firefox | 1 | Page Request | 100 | 487 | 1 | -486 | -99.77 | faster
firefox | 1 | Rendering | 100 | 425 | 427 | 2 | 0.51 |
```
---
[1] In the case where blend modes *are* found, it becomes a lot more difficult to know if it's generally safe to skip /Resources-entries. Hence we don't cache anything in that case, however note that most document/pages do not utilize blend modes anyway.
This simplifies/consolidates the ESLint configuration slightly in the `src/` folder, and prevents the addition of any new files where `var` is being used.[1]
Hence we no longer need to manually add `/* eslint no-var: error */` in files, which is easy to forget, and can instead disable the rule in the `src/core/` files where `var` is still in use.
---
[1] Obviously the `no-var` rule can, in the same way as every other rule, be disabled on a case-by-case basis where actually necessary.
Currently there's nothing that prevents modification of the `Dict.empty` primitive, which obviously needs to be *truly* empty to prevent any future (hard to find) bugs.
This allows for merging of dictionaries one level deeper than previously. This could be useful e.g. for /Resources dictionaries, where you want to e.g. merge their respective /Font dictionaries (and other) together rather than picking just the first one.
When the old `Dict.getAll()` method was removed, it was replaced with a `Dict.getKeys()` call and `Dict.get(...)` calls (in a loop).
While this pattern obviously makes a lot of sense in many cases, there's some instances where we actually want the *raw* `Dict` values (i.e. `Ref`s where applicable). In those cases, `Dict.getRaw(...)` calls are instead used within the loop. However, by introducing a new `Dict.getRawValues()` method we can reduce the number of (strictly unnecessary) function calls by simply getting the *raw* `Dict` values directly.
Using a `Map` instead of an `Object` provides some advantages such as
cheaper ways to get the size of the cache, to find out if an entry is
contained in the cache and to iterate over the cache. Moreover, we can
clear and re-use the same `Map` object now instead of creating a new
one.