XML is not different from HTML

General APL language issues

Postby Yves on Tue Jul 25, 2017 7:08 pm

Dear All,
I am starting out with ⎕XML (v16.0).

I tried it on the source of this page:
html ← Samples.HTTPGet 'http://monip.org/'

Then
⎕xml html
generates a DOMAIN ERROR: tag mismatch.

What is the simplest way to break the HTML apart with ⎕XML?
If it is possible, of course.

Thanks,
Yves

Re: XML is not different from HTML

Postby Richard|Dyalog on Wed Jul 26, 2017 10:13 am

Hi, Yves - unfortunately, your assertion in the title is incorrect; although they superficially look similar and both describe themselves as "markup languages", HTML is very different from XML and consequently an XML parser cannot process it.

In general, XML is flexible on tag naming but strict on structure: opening tags must have matching closing tags so that nesting can be deduced; HTML, on the other hand, pre-defines the set of tags which are valid and is tolerant of missing end tags because it is able to infer them from its understanding of the tag meanings (for example, </p> at the end of paragraphs may be omitted because the HTML parser knows that paragraphs cannot contain paragraphs; an opening <p> therefore implicitly terminates the preceding paragraph). There are other, more subtle, differences such as preservation of space, case sensitivity and so on.

The source of the web page contains a <META> tag with no corresponding closing tag, within a <head> section. When </head> is encountered, XML rules specify that </META> should have been processed, which is why ⎕XML has signalled an error.
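
The difference can be reproduced on a small scale. The following is a minimal sketch, assuming Dyalog APL; the two sample strings and the helper name try are made up for illustration and are not taken from the monip.org page.

```apl
      ⍝ Well-formed: <meta/> is self-closed, so every tag is balanced
      good←'<head><meta charset="utf-8"/><title>t</title></head>'
      ⍝ Not well-formed: <META> is opened but never closed before </head>
      bad←'<head><META charset="utf-8"><title>t</title></head>'

      ⍝ Hypothetical helper: error guard 11 traps DOMAIN ERROR
      ⍝ so that the failure comes back as text instead of interrupting the session
      try←{11::'DOMAIN ERROR' ⋄ ⎕XML ⍵}

      try good      ⍝ parses: a matrix with one row per element
      try bad       ⍝ tag mismatch: the guard's text is returned instead
```

The guard is only there to make the two cases easy to compare side by side; calling ⎕XML directly on the second string signals the error described above.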

XHTML is a format which unifies HTML and XML - in particular, it is HTML that is syntactically valid XML, and this means ⎕XML is able to parse it. There are converters available which will convert HTML to XHTML (e.g. at http://www.it.uc3m.es/jaf/html2xhtml/), so you could use something like that as a preprocessor.

Re: XML is not different from HTML

Postby Brian|Dyalog on Wed Jul 26, 2017 2:21 pm

Hi Yves,

I'd written some simple utilities to convert HTML to XHTML in order to make it possible to pass website contents to ⎕XML. I didn't take a comprehensive approach to the problem though - I just "fixed" issues that cropped up as I encountered them.

I took a look at the HTML returned by the URL 'http://monip.org' and while it has very little HTML, the HTML that it does have is atrocious and beyond the scope of my simple utilities. My utility will fix the <meta> tags, but there are unbalanced tags and tags whose attributes are not quoted and my utility doesn't address those problems (yet).
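
To give an idea of what such a "fix" might look like, here is a minimal sketch using ⎕R. It is not the utility mentioned above: the name CloseMeta and the regular expression are assumptions made for illustration, and the pattern only self-closes <meta> tags. It does nothing about unbalanced tags or unquoted attributes, and it would be confused by a ">" inside an attribute value.

```apl
      ⍝ Hypothetical helper: rewrite every <meta ...> or <meta .../> as <meta .../>
      ⍝ 'IC' 1 makes the match case-insensitive, so <META> is handled too
      ⍝ \1 in the replacement is the tag name plus its attributes
      CloseMeta←'<(meta\b[^>]*?)\s*/?>' ⎕R '<\1/>' ⍠ 'IC' 1

      fixed←CloseMeta '<head><META charset="utf-8"><title>t</title></head>'
      ⎕XML fixed    ⍝ should now parse, since <META .../> is self-closed
```

A regular expression is a stopgap here, which is why a real HTML-to-XHTML converter is the more robust route for pages in the wild.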

I took a look at the converter that Richard suggested and it has a downloadable command line version. I downloaded it to my c:\tmp\ folder. So, in theory you could do something like this...

      ⍝ Use HttpCommand - it's better than Samples.HTTPGet
      ]load HttpCommand

      z←(HttpCommand.Get 'monip.org').Data ⍝ grab the page content
      z ⎕NPUT 'c:\tmp\tmp.html' 1 ⍝ write the data to a file

      q←∊⎕SH 'c:\tmp\html2xhtml-1.3\html2xhtml c:\tmp\tmp.html' ⍝ convert it

      'Level' 'Element' 'Data'⍪(⎕XML q)[;1 2 3] ⍝ TADA!
 Level  Element  Data
 0      html
 1      head
 2      title    MonIP.org v1.0
 1      body
 2      p
 3      font
 4      br
 4               IP : 123.123.123.123
 4      br
 3      font
 4      i        cpe-72-230-140-25.rochester.res.rr.com
 4      br
 3      font
 4      br
 4      br
 4               Pas de proxy détecté - No Proxy detected

N.B. I changed my actual IP address in the display above.
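
Once the text parses, the result is an ordinary matrix, so rows can be picked out with normal indexing and compression. A minimal sketch, assuming a small stand-in XHTML string rather than the real converter output; the names sample, x and names are hypothetical.

```apl
      ⍝ Stand-in for the converted page, kept tiny on purpose
      sample←'<html><head><title>MonIP.org v1.0</title></head><body><p>hi</p></body></html>'

      x←⎕XML sample                ⍝ columns 1 2 3: level, element name, data
      names←x[;2]                  ⍝ element names as a vector of character vectors

      (names∊⊂'title')/x[;3]       ⍝ data of every <title> element
      (x[;1]=2)⌿x[;1 2 3]          ⍝ rows at level 2, i.e. two levels below the root
```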

I hope this helps!

Re: XML is not different from HTML

Postby Yves on Sat Jul 29, 2017 3:43 pm

Hello Richard, Brian, and all,
Many thanks for taking the time to help.

That is a good explanation of HTML and XHTML.
Using the converter as a preprocessor is a good idea.
Thank you, Richard.

Brian, your example is perfect and works well.
A personal point: the line with Rochester contains a technical identifier from your provider. I think it is a good habit to mask that part as well.

For the line
Pas de proxy détecté - No Proxy detected
the Chinese characters are just a mis-encoding.

A suggestion (thank you, Vince):

q←'UTF-8'⎕UCS ⎕UCS ∊⎕SH 'c:\tmp\html2xhtml-1.3\html2xhtml c:\tmp\tmp.html'
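
Why this works can be seen on a short string. The sketch below assumes the garbling comes from each UTF-8 byte being read as a separate character, which is what the expression above undoes; the names s, bytes, garbled and fixed are hypothetical.

```apl
      s←'détecté'
      bytes←'UTF-8' ⎕UCS s              ⍝ characters → UTF-8 byte values (integers)
      garbled←⎕UCS bytes                ⍝ each byte misread as its own character
      fixed←'UTF-8' ⎕UCS ⎕UCS garbled   ⍝ back to byte values, then decode as UTF-8
      fixed≡s                           ⍝ compare the repaired text with the original
```

The inner ⎕UCS turns the garbled characters back into byte values and the outer 'UTF-8' ⎕UCS decodes those bytes properly. If the text was damaged in some other way, for example bytes replaced by a substitute character, this round trip cannot recover it.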

Thank you, Brian.

Maybe a future ⎕HTML will be born...
Yves

Re: XML is not different from HTML

Postby Yves on Sun Jul 30, 2017 2:48 pm

Dear All,
Unfortunately, this converter mis-encodes UTF-8.

With this page:
http://www.sacred-texts.com/hin/mbs/mbs01001.htm

the encoding is correct before the converter and lost after it.
But ⎕XML works well!

Craft time :)
Yves

