RED-8825: general improvements
* classify rulings as underline/strikethrough
* improve performance of CleanRulings.lineBetween
* use lineBetween where possible
* wip, still todo:
  - Header/Footer by Ruling for all rotations
  - actually the ticket, optimizing layoutparsing for documine
parent 1916e626df
commit 4761d2e1a2
@@ -24,7 +24,7 @@ public class RulingIntersectionFinder {

/**
- * Implementation to find line intersection in O(P + n log n), where n is the number of lines and P the numer of intersections
+ * Implementation to find line intersection in O(P + n log n), where n is the number of lines and P the numer of intersections.
 * based on <a href="http://people.csail.mit.edu/indyk/6.838-old/handouts/lec2.pdf">Segment Intersection by Piotr Indyk</a>
 * The algorithm assumes there are only horizontal and vertical lines which are unique in their coordinates. (E.g. no overlapping horizontal lines exist)
 * As a high level overview, the algorithm uses a sweep line advancing from left to right.
@@ -32,13 +32,10 @@ public class RulingIntersectionFinder {
 * When the sweep line hits a vertical line, it then checks for all intersections with the currently intersected horizontal rulings.
 * THe trick of the algorithm is using a binary search tree to store the currently intersected horizontal rulings. This way the lookup should be in O(log n).
 * This way the initial sorting step has the highest complexity class (O(n log n) and thus determines the complexity class of the entire algorithm
 *
 * Unfortunately, the implementation here takes a few liberties compared to the original algorithm. The binary search tree is replaced by an ordered Set which is simply looped over.
 * Therefore, this implementation's worst case, where all horizontal lines span the entire sweep, you are essentially performing the naive approach with a bunch of overhead.
 * Since we are using this implementation to find table cells, one can expect this worst case to always be the case.
 *
 * A simple runtime comparison for a single page with the most lines we can expect (SinglePages/AbsolutelyEnormousTable.pdf with 30 horizontals and 144 verticals) shows this implementation takes roughly 14 ms, whereas the naive approach takes 7 ms. Both are negligible, but the naive approach is two times as fast.
 *
 * If we would like to make this faster, we would need a better data structure for 'TreeMap<Ruling, Void> horizontalRulingsInCurrentSweep', where we can query the TreeMap for all horizontal rulings in a given interval in O(log n).
 *
 * @param horizontals a list of non-overlapping horizontal rulings
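
The Javadoc in this hunk notes that a faster variant would need to query the horizontal rulings currently under the sweep line for a given interval in O(log n). The following is a minimal sketch of that idea, not the code from this commit. It assumes hypothetical `Horizontal`, `Vertical` and `Point` records in place of the real `Ruling` class, keys the active set by y coordinate, and relies on the same assumption as the Javadoc that no two active horizontals share a y coordinate. It needs Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

class SweepLineSketch {

    // Hypothetical stand-ins for the Ruling class used in the commit.
    record Horizontal(float y, float xStart, float xEnd) {}
    record Vertical(float x, float yStart, float yEnd) {}
    record Point(float x, float y) {}

    // Declaration order doubles as the tie-break order at equal x:
    // START before VERTICAL before END, so touching endpoints still count.
    private enum Type { START, VERTICAL, END }

    private record Event(float x, Type type, Horizontal h, Vertical v) {}

    static List<Point> findIntersections(List<Horizontal> horizontals, List<Vertical> verticals) {
        List<Event> events = new ArrayList<>();
        for (Horizontal h : horizontals) {
            events.add(new Event(Math.min(h.xStart(), h.xEnd()), Type.START, h, null));
            events.add(new Event(Math.max(h.xStart(), h.xEnd()), Type.END, h, null));
        }
        for (Vertical v : verticals) {
            events.add(new Event(v.x(), Type.VERTICAL, null, v));
        }
        // Initial sort of all sweep events, O(n log n).
        events.sort(Comparator.comparingDouble((Event e) -> e.x()).thenComparing(Event::type));

        // Horizontals currently crossed by the sweep line, keyed by y.
        // Assumption: no two active horizontals share a y coordinate.
        TreeMap<Float, Horizontal> active = new TreeMap<>();
        List<Point> intersections = new ArrayList<>();
        for (Event e : events) {
            switch (e.type()) {
                case START -> active.put(e.h().y(), e.h());
                case END -> active.remove(e.h().y());
                case VERTICAL -> {
                    float low = Math.min(e.v().yStart(), e.v().yEnd());
                    float high = Math.max(e.v().yStart(), e.v().yEnd());
                    // Range view over [low, high]: only horizontals inside the
                    // vertical's extent are visited, instead of the whole active set.
                    for (Horizontal h : active.subMap(low, true, high, true).values()) {
                        intersections.add(new Point(e.v().x(), h.y()));
                    }
                }
            }
        }
        return intersections;
    }
}
```

Whether this would beat the naive approach on real pages is an open question: with about 30 horizontals and 144 verticals, as in the measurement quoted above, constant factors may well dominate.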
@@ -162,7 +159,7 @@ public class RulingIntersectionFinder {
}

- public SweepStep(Type type, float y_position, Ruling ruling) {
+ SweepStep(Type type, float y_position, Ruling ruling) {

    this.type = type;
    this.y_position = y_position;
@@ -72,7 +72,7 @@ public class LayoutparsingVisualizations {
new Color(121, 85, 72));

@Setter
- boolean active = false;
+ boolean active;

final Visualizations words = Visualizations.builder().layer(ContentStreams.WORDS).build();
final Visualizations lines = Visualizations.builder().layer(ContentStreams.LINES).build();