Sweeper Index Page
To run sweeper, use paste_sweep.php.
Here are some sample documents:
In very general terms, preparing a document for publishing is a four-step process:
- Conversion
  - it looks right and consistent throughout and validates
  - this step is required for a number of reasons:
    - inaccuracies in the automated Word to HTML conversion process
    - mistakes in authoring (e.g. using a style instead of tagging to make a header "look" the way they want it)
    - missing information that will be required for quality HTML (e.g. alt text for images)
- Coding
  - it has the proper styles and classes according to the client specifications
  - it has high quality code for accessibility (tables, figure/images and abbreviations)
- Templating
- Delivery
All of these steps can be done manually, perhaps with the assistance of some commercially available tools. Some tools, however, can make the work more accurate, more efficient (faster), or both.
Procedure:
- Do a first review of the source document and determine the number of pages, headings, figures and images - both decorative and content, tables, equations (rare), footnotes and endnotes. For the sample file we get the following:
- pages=29
- headers=21 from counting Table of Contents
- Images. Use Find and then down arrow "Find Graphics" =7. Have a look at them by using the down arrow "Next Search Result". Looks like 4 decorative images on page 1, an org chart on pg 4 that looks like SmartArt (that might help with alt text, long description!), a call-out box on pg 13, and some kind of image in the header or footer that will go away with conversion. Notes: Watch for what happens to the call-out box in conversion!
- Tables. Use Find Tables in Word Search. Again scroll through them. Seems to be 10 tables. There is a weird one that is a legend after Table 5. Ok, so tables 1-8 are labelled Table 1 to Table 7 (+ the weird one), and two more: one in Appendix B, and the final one in Appendix C. Notes: Watch for what happens to the legend table in conversion!
- Equations. Use Find. None.
- Footnotes, Endnotes. Use Find. 2 found. Have a look to see if they are footnotes or endnotes. You can right click on the text and tell it is a footnote since it offers the choice "convert to endnote", and it appears at the bottom of a page like a footnote. Same with the second one. NOTE!!! The footnote numbers will be lost when converting from Word to Dreamweaver. This is a known problem. You can add them back in manually until we have a better solution!
- Page Headers and Footers. Note, we don't worry about those at all. They just disappear in conversion.
- You need MS Word and Dreamweaver. This procedure assumes Word 2010 and Dreamweaver CS 5.5.
- Make sure your preferences in DW are correct. Look at copy / paste and make sure you have selected the third bullet, "text with structure plus basic formatting".
- You need to make sure your work validates or is close and gives "expected results" at every step of the way using: https://validator.w3.org/
- Open Dreamweaver, create a new blank page, doctype HTML5
Use the following at the top of the page:
<!DOCTYPE HTML>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Untitled Document</title>
</head>
<body>
</body>
</html>
- Open your source doc in MS Word. Select all (CTRL-A), Copy, Paste into Dreamweaver in Design View. If your Word doc is very large, DW may not like it. In this case you need to adjust DW by changing its config using the procedure outlined at: https://www.jdhodges.com/blog/dreamweaver-document-too-large-copy-past/
- Do initial clean-up
- check on the stuff we were worried about, the legend table and the call-out
- the legend table came through as a table. We can review it along with the tables later
- the call-out box came through as an image with alt-text of the call-out. We can also fix that later, I suppose.
- fix headers
- copy all the code inside the <body> (excluding the body tags themselves) and paste it into paste_sweep. Use "shift headers down". Paste the result back. Done!
- oops, not quite done. The main section headings have numbers in the ToC, but not in the content. Add them in. Done!
- The Appendix headers had to be fixed manually. Might want to check this in conversion.
- tried validating - forget it. Over 1000 errors!
- let's try "basic" and see if that helps. Make sure to grab the code again since we have adjusted it manually for the headings.
- did some things, mainly nbsp type stuff in content. Not sure what this all means, but I guess it helps to understand.
- [7.93] [<strong( [^<>]*){0,1}>\s*<\/strong>] 5
[8.26] combine_inline 7
[8.31] footnotes 4
[8.35] phys_units 4
[9.45] non_breaking_date 55
[9.47] non_breaking_dollar_amounts 5
[9.52] non_breaking_year_range 5
[9.57] citation 3
[16.86] basiclanguage[ :] 1
Total changes in this file: 89
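The log above names the rules that "basic" applied but does not show what they do. Here is a minimal sketch in Python of two of them, under stated assumptions: the first pattern is adapted from the first log line (empty <strong> elements), and the second assumes that non_breaking_date joins the parts of a date such as "March 5, 2013" with &nbsp;. The function names and the date pattern are hypothetical illustrations, not Sweeper's actual code.

```python
import re

# Adapted from the pattern in the log line: an empty <strong> element,
# optionally with attributes, containing only whitespace.
EMPTY_STRONG = re.compile(r"<strong( [^<>]*){0,1}>\s*</strong>")

# Assumed date shape: English month name, day, comma, four-digit year.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE = re.compile(r"\b(" + MONTHS + r") (\d{1,2}), (\d{4})\b")


def strip_empty_strong(html):
    """Remove empty <strong> elements; return (new_html, change_count)."""
    return EMPTY_STRONG.subn("", html)


def non_breaking_dates(html):
    """Join the parts of a date with &nbsp;; return (new_html, change_count)."""
    return DATE.subn(r"\1&nbsp;\2,&nbsp;\3", html)
```

Both helpers return the number of substitutions alongside the new text, which mirrors the per-rule counts shown in the log.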
- still very impatient, so let's try clean_word_clf2 and see if that gets us closer to validating ...
- this did a lot! took over a minute
- down to 30 errors in validating. And all on the open table tag which is an easy fix. 3 errors x 10 tables. So we are much closer to clean code. Let's carry on.
- interestingly, it didn't do table accessibility. This could be good or bad. Not sure.
- bad to be impatient. You must fix the footnotes before running clean_word. Just add in the numbers that get removed by conversion
- let's fix the Table of Contents
- let's give structure a try without any prep... nope, didn't do anything. Maybe if we clean this up a bit in some way, let's experiment ... ok. Here is what you do. You turn the existing ToC into a list (ul), then indent the ones that are sub-sections of the H2, and then run structure, and it does its magic!
- this leaves a lot of empty lines / white space, so I figure maybe try basic and see if it cleans things up.
- [3.53] === DOM saved ===
[3.56] post_dom_stripme (better stripme) 2
[4.67] non_breaking_date 55
[4.68] non_breaking_dollar_amounts 5
[4.73] non_breaking_year_range 5
[4.77] post_dom_stripme (stripme) 5
Total changes in this file: 73
- seems to report doing some of the same things again. No problem for me!
- let's check validation again, now that we have a cleaned-up ToC
- still 30 errors, same ones on table, so we are making progress without losing ground on validation. Good!
- Let's try table accessibility. As is. No joy. Hmmm. Maybe we can fix the captions, or maybe we need to mark the headers in DW, or maybe we can try clean_word_clf2, or maybe we can re-run dom_table_accessibility
- let's try a re-run ... Nope
- maybe try dom_table_accessibility (simple), yup that worked. I wonder if it helps ... Didn't do caption, but did
- let's try clean_word_clf2 first.
- stop! I notice it did do tbody, and also gave a message about not being able to find thead. So maybe that indicates it needs the header cells marked. Now maybe this is because these are all complex tables, and sweeper just needs a little help in these cases. Ok, let's mark the headers and see if that gets us further.
- yup, that worked!
- [3.35] === DOM saved ===
[3.44] post_dom_newtag 344
[3.45] dom_init = true
[3.45] xpath_init = true
Warning: DOMNode::cloneNode(): ID header16 already defined in /home/sweeper2/public_html/retidy.php on line 6555
[4.49] === DOM saved ===
[4.51] post_dom_stripme [ [\w]*="stripme"] 248
[4.54] post_dom_stripme (deleteme) 3
[4.55] post_dom_newtag 4
[4.55] post_dom_newtag 74
Total changes in this file: 673
- got a warning about id header16, so maybe should check that out!
- let's check validation again. Oops, one new error.
- Element tbody not allowed as child of element table in this context. (Suppressing further errors from this subtree.) From line 196, column 9; to line 197, column 7
- I wonder if that is the legend table. Let's check. Maybe it coincides with header16 ...
- well, this was simply the order of tfoot and tbody. Now we don't really need tfoot, so I just moved the code around. And yes, it did have to do with header16 as well. After tweaking the code, re-validated, and back to the same 30 errors. Still good.
- I am not going to try to check if the header macro code is right. I will assume if you mark the headers correctly, dom_table_accessibility does its thing well. Let's face it, nobody actually checks this except in very rare circumstances, so doing a manual check of what the macro does is not effective.
- I notice the captions were not picked up, so I guess I have to do prep on that as well to make the code more easily recognized.
- Let's try again putting the table name in its own <p> tag alone above the table. I notice it actually did Table 5 and 6, so they didn't need adjustment.
- Yes, that worked. The captions are now in place. Note Appendix B and C have no caption. So sad.
- Let's finish the last two macros, and see if we end up with clean code that validates ... abbr (don't forget to set language and pick a Department) and update to WET4
- abbr - not working, suspect bug in code / deployment (need to fix!)
- update_to_WET4, changes the table tag
- VALIDATES!
- notice the footnotes have lost their numbers and also are not coded well. Still some work to do ...
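The "ID header16 already defined" warning in the table accessibility run above suggests checking the page for duplicate id values before validating. A minimal sketch in Python using only the standard library; IdCollector and duplicate_ids are hypothetical helper names and are not part of Sweeper.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute value seen in start tags."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def duplicate_ids(html):
    """Return a sorted list of id values that occur more than once."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(collector.ids)
    return sorted(value for value, n in counts.items() if n > 1)
```

Pasting the contents of <body> into duplicate_ids returns an empty list when every id is unique; any value it reports is a candidate for the kind of manual tweak described above.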
- Fix images:
- fix broken paragraph / spacing
- fix lists
- fix header and make sure levels are good, use shift up / shift down if that helps
- add basic alt text to images
- make sure all links are working including Table of Contents
- don't worry about extra spaces between words, we will fix that with a macro later, or you can fix it in DW in Code View with Search " +" Replace using Regular Expressions.
- fix table headers on tables with headers in the middle of rows. We will need this later when we run the Table Accessibility macro.
- fix table captions / names, make sure they are alone on a line above the table, they will turn into captions later
- get rid of italics (check on this)
- it likely won't validate, but we will fix that soon
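The Search " +" replace mentioned in the clean-up list above can also be done outside Dreamweaver. A minimal sketch in Python, assuming the replacement is a single space (the note above does not say what to replace with); collapse_spaces is a hypothetical helper name, not a Sweeper macro.

```python
import re


def collapse_spaces(text):
    """Replace every run of one or more spaces with a single space."""
    return re.sub(r" +", " ", text)
```

Note that this also collapses code indentation and spaces inside <pre> blocks, so it is best applied to content rather than to a whole file.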
- Run "arbitrary_sweep"
- this cleans up a few things that Word leaves or doesn't leave that we don't like (footnotes, endnotes, tables and lists)
- Run "cleanwordclf2"
- this does a lot and may even take a minute or two. Be patient. You will like it!
- it cleans up some artifacts of the automated Word to DW conversion
- it removes things like ul type=disc and makes them just ul
- it changes the code for footnotes and endnotes
- it also changes styles to classes if you use "cleanwordclf2"
- watch out at the bottom of the file for an extra ">"
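As an illustration of the ul type=disc clean-up described above, here is a minimal sketch in Python. It is not the cleanwordclf2 code; the pattern and the name strip_ul_type are assumptions, and it only handles a <ul> whose sole attribute is type.

```python
import re

# Assumed shape of the Word/DW artifact: <ul type="disc"> or <ul type=disc>.
UL_TYPE = re.compile(r'<ul\s+type\s*=\s*"?disc"?\s*>', re.IGNORECASE)


def strip_ul_type(html):
    """Rewrite <ul type="disc"> as a plain <ul>."""
    return UL_TYPE.sub("<ul>", html)
```

A <ul> that carries other attributes as well is left untouched by this pattern, which keeps the sketch conservative.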
- Run "arbitrary_sweep" a second time
- this cleans up a few final things such as the table tag
- At this point, you should have a page that validates. Go ahead and check to be sure.
- [Run "basic"] Optional
- This does a number of things, mostly quality improvements in the code
- removes extra spaces
- removes empty tags
- inserts nbsp in good places like numbers and dates
- Mainly you won't "see" any difference, but the code is much improved.
- If you have run clean_word_clf2, and there were no errors, this is likely redundant
- [Run "table_accessibility"] Optional
- [Run "structure"] Optional
- [Run "find_abbr"] Optional
- Run "abbr". Make sure to select a Department and Language.
- [Run "update_to_WET4"] Optional