XML is not different from HTML

by **Yves** on Tue Jul 25, 2017 7:08 pm

Dear All,
I am just getting started with ⎕XML (v16.0).

I tried it on the source of this page:
html ← Samples.HTTPGet 'http://monip.org/'

Then
⎕XML html
signals a DOMAIN ERROR: tag mismatch.

What is the simplest way to break HTML down with ⎕XML?
If it is possible, of course.

Thanks,
Yves

by **Richard|Dyalog** on Wed Jul 26, 2017 10:13 am

Hi, Yves - unfortunately, your assertion in the title is incorrect; although they superficially look similar and both describe themselves as "markup languages", HTML is very different from XML and consequently an XML parser cannot process it.

In general, XML is flexible on tag naming but strict on structure: opening tags must have matching closing tags so that nesting can be deduced; HTML, on the other hand, pre-defines the set of tags which are valid and is tolerant of missing end tags because it is able to infer them from its understanding of the tag meanings (for example, </p> at the end of paragraphs may be omitted because the HTML parser knows that paragraphs cannot contain paragraphs; an opening <p> therefore implicitly terminates the preceding paragraph). There are other, more subtle, differences such as preservation of space, case sensitivity and so on.

The source of the web page contains a <META> tag with no corresponding closing tag, within a <head> section. When </head> is encountered, XML rules specify that </META> should have been processed, which is why ⎕XML has signalled an error.
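To see the rule in isolation, here is a minimal sketch. `WellFormed` is a hypothetical helper (not part of Dyalog) that traps error 11 (DOMAIN ERROR), and the two snippets are made-up stand-ins for the page's <head> section, not its actual source.

```apl
      ⍝ Hypothetical helper: 1 if ⎕XML accepts the text, 0 if it signals DOMAIN ERROR
      WellFormed←{11::0 ⋄ 1⊣⎕XML ⍵}

      WellFormed '<head><meta name="x"/></head>'   ⍝ self-closed tag, XML style
      WellFormed '<head><meta name="x"></head>'    ⍝ unclosed tag, HTML style
```

Given the error described above, the first call should return 1 and the second 0.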

XHTML is a format which unifies HTML and XML - in particular, it is HTML that is syntactically valid XML, and this means ⎕XML is able to parse it. There are converters available which will convert HTML to XHTML (e.g. at http://www.it.uc3m.es/jaf/html2xhtml/), so you could use something like that as a preprocessor.

by **Brian|Dyalog** on Wed Jul 26, 2017 2:21 pm

Hi Yves,

I'd written some simple utilities to convert HTML to XHTML in order to make it possible to pass website contents to ⎕XML. I didn't take a comprehensive approach to the problem though - I just "fixed" issues that cropped up as I encountered them.

I took a look at the HTML returned by the URL 'http://monip.org' and while it has very little HTML, the HTML that it does have is atrocious and beyond the scope of my simple utilities. My utility will fix the <meta> tags, but there are unbalanced tags and tags whose attributes are not quoted and my utility doesn't address those problems (yet).

I took a look at the converter that Richard suggested and it has a downloadable command line version. I downloaded it to my c:\tmp\ folder. So, in theory you could do something like this...

      ⍝ Use HttpCommand - it's better than Samples.HTTPGet
      ]load HttpCommand

      z←(HttpCommand.Get 'monip.org').Data  ⍝ grab the page content
      z ⎕NPUT 'c:\tmp\tmp.html' 1  ⍝ write the data to a file

      q←∊⎕SH 'c:\tmp\html2xhtml-1.3\html2xhtml c:\tmp\tmp.html' ⍝ convert it

      'Level' 'Element' 'Data'⍪(⎕XML q)[;1 2 3]  ⍝ TADA!
 Level  Element  Data                                       
     0  html                                                
     1  head                                                
     2  title    MonIP.org v1.0                             
     1  body                                                
     2  p                                                   
     3  font                                                
     4  br                                                  
     4           IP : 123.123.123.123                         
     4  br                                                  
     3  font                                                
     4  i        cpe-72-230-140-25.rochester.res.rr.com     
     4  br                                                  
     3  font                                                
     4  br                                                  
     4  br                                                  
     4           Pas de proxy dÃ©tectÃ© - No Proxy detected

N.B. I changed my actual IP address in the display above.
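Once ⎕XML has produced its matrix, individual elements can be picked out by name or nesting level. The following is only a sketch: the XHTML literal `x` is an illustrative stand-in rather than the converter's real output, and the names `x` and `m` are assumptions.

```apl
      ⍝ Illustrative XHTML literal, not the real page source
      x←'<html><head><title>MonIP.org v1.0</title></head><body><p>hello</p></body></html>'
      m←⎕XML x                      ⍝ columns: level, element name, data, ...

      (m[;2]∊⊂'title')⌿m[;3]        ⍝ data of every <title> element
      (m[;1]=2)⌿m[;2 3]             ⍝ names and data of elements at level 2
```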

I hope this helps!

by **Yves** on Sat Jul 29, 2017 3:43 pm

Hello Richard, Brian, & all,
Many thanks for taking the time to help.

That was a good explanation of HTML and XHTML.
Using the converter as a preprocessor is a good idea.
Thank you Richard.

Brian, your example is perfect and works well.
A personal remark: the line with Rochester contains a technical identifier from your provider. I think it is a good habit to mask that part as well.

For the line
Pas de proxy dÃ©tectÃ© - No Proxy detected
the odd-looking characters are just a mis-encoding: the UTF-8 bytes are being shown as individual characters.

A suggestion (thank you Vince):

q←'UTF-8'⎕UCS ⎕UCS ∊⎕SH 'c:\tmp\html2xhtml-1.3\html2xhtml c:\tmp\tmp.html'
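A small sketch of why this works, using just the two garbled characters from the line above: monadic ⎕UCS turns the characters back into their code points, which here are the original UTF-8 bytes, and 'UTF-8' ⎕UCS decodes those bytes into text. It assumes a Unicode edition of Dyalog and that every character in the text has a code point below 256.

```apl
      ⎕UCS 'Ã©'                 ⍝ code points of the two garbled characters
195 169
      'UTF-8' ⎕UCS 195 169      ⍝ decode those two bytes as UTF-8
é
      'UTF-8' ⎕UCS ⎕UCS 'Ã©'    ⍝ both steps together
é
```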

Thank you Brian.

Maybe a future ⎕HTML will be born...
Yves

by **Yves** on Sun Jul 30, 2017 2:48 pm

Dear All,
Unfortunately, this converter mis-encodes UTF-8.

With this page:
http://www.sacred-texts.com/hin/mbs/mbs01001.htm

the encoding is correct before the converter runs and lost afterwards.
But ⎕XML works well!

craft time :)
