Functions and subselects in single FLWOR

I'm writing an XQuery to analyse large numbers of XML files that store queries similar to the example below. For these queries I'd like to calculate averages, sums and other information on various subelements. Additionally I'd like to generate subsections of the queries in the same document, for instance all queries that have no hits.

As I'll be manipulating hundreds of thousands of XML files, I'd like to make my XQuery as efficient as possible. I've tried to use a single for iteration across the documents, but I simply cannot figure out how to derive all the information I need.

Here's a sample XML:

<Query>
  <QueryString>Gigabyte Sapphire GTX-860</QueryString>
  <StatusCode>0</StatusCode>
  <QueryTime>0.04669069110297385</QueryTime>
  <Hits>8</Hits>
  <Date>2013-05-02</Date>
  <Time>12:07:07</Time>
  <LastModified>12:07:07</LastModified>
  <Pages resultsPerPage="10" clickCount="2">
    <Page resultCount="8" visited="true">
      <Result index="1" clickIndex="0" timeViewed="0" pid="85405" title="DDR3 1024 MB" />
      <Result index="2" clickIndex="1" timeViewed="178" pid="54065" title="ATK Excellium&#x9;" />
      <Result index="3" clickIndex="0" timeViewed="0" pid="74902" title="Intel E9650" />
      <Result index="4" clickIndex="0" timeViewed="0" pid="56468" title="ASUS Radeon HD 7980" />
      <Result index="5" clickIndex="0" timeViewed="0" pid="31072" title="Intel E7500" />
      <Result index="6" clickIndex="0" timeViewed="0" pid="26620" title="DDR3 2048 MB" />
      <Result index="7" clickIndex="2" timeViewed="92" pid="55625" title="Gigabyte Sapphire 7770" />
      <Result index="8" clickIndex="0" timeViewed="0" pid="67701" title="Intel E9650" />
    </Page>
  </Pages>
</Query>

Here's the XQuery:

let $doc := collection('file:///C:/REP/XML/input?select=*.xml')
for $y in (
    <Queries>
    {
        for $x in $doc
        let $hits := $x/Query/Hits
        return <Query hits="{$hits}" >{$x/Query/QueryString/string()}</Query>
    }
    </Queries>
)
let $avgHits := avg(data($y/Query/@hits))
let $numQueries := count($y/*)
return <Statistics avgHits="{$avgHits}" numQueries="{$numQueries}"/>

This correctly returns <Statistics numQueries="10" avgHits="19.7"/> for a sample of 10 XML files. Is this the right approach? I seem to need the double for so that I can group the Queries from disjoint files together, as I can't seem to run functions on them otherwise.

I also need to repeat some queries inside the created <Statistics> element. Do I need to repeat a FLWOR statement? I can't bring summed or averaged values outside the for statement that calculates them, yet I can't calculate them and perform a subselect in the same loop, since I'd have to include a where clause to filter them.

(Update) This is the query that I've come up with to include subsections of the queries, but as I mentioned, I'm worried about the performance.

let $doc := collection('file:///C:/REP/XML/input?select=*.xml')
for $y in (
    <Queries>
    {
        for $x in $doc
        let $hits := $x/Query/Hits
        return <Query hits="{$hits}" >{$x/Query/QueryString/string()}</Query>
    }
    </Queries>
)
let $avgHits := avg(data($y/Query/@hits))
let $numQueries := count($y/*)
return <Statistics avgHits="{$avgHits}" numQueries="{$numQueries}">
    {
    for $x in $doc
    let $hits := $x/Query/Hits
    where $x/Query/Hits < 10
    return <Query hits="{$hits}" >{$x/Query/QueryString/string()}</Query>
    }   
</Statistics>

Will the XQuery processor optimise my for statements or will it access all XML files with every for that loops across them? Will the first let statement prevent this?

This is the kind of document I'm aiming to generate:

<DailyStats date="2013-04-15" >
    <DayStats>
        <QueryCount>24644</QueryCount>
        <Errors>0</Errors>
        <EmptySearches>643</EmptySearches>
        <AverageSearchTime>0.0213</AverageSearchTime>
        <AverageSearchesPerHour>236</AverageSearchesPerHour>
    </DayStats>
    <StoredQueries>
        <FailedSearches>
            <FailedSearch time="23:33:34" query="blurey" searchTime="0.0524" />
        </FailedSearches>
    </StoredQueries>
</DailyStats>

Answers


If you are worried about performance, you should use an XML database (if you are not already doing so), as it improves performance by indexing the data. For example, with BaseX you can load your XML files into a database and then access all nodes with `db:open("your-db")`, avoiding the nested for loops. You could also use database-specific indexes to speed up your query. A simple XQuery processor working on the file system will certainly touch each XML file, as it knows nothing about the data in each file.
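As a rough sketch of the database approach: assuming the files have already been loaded into a BaseX database (for instance with the `CREATE DB` command) and that each document has a `Query` root element as in the sample, the statistics could be computed without any nested loops. The name "your-db" is only a placeholder, and `db:open` is the function name in older BaseX releases; newer releases may expose it under a different name.

```xquery
(: "your-db" is a placeholder database name, not a real one. :)
(: Assumes every stored document has <Query> as its root element. :)
let $queries := db:open("your-db")/Query
return
  <Statistics avgHits="{avg($queries/Hits)}"
              numQueries="{count($queries)}"/>
```

Whether the indexes actually help here depends on the query; aggregates such as avg still have to visit every Hits element.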

Apart from that, your XQuery looks basically fine to me. Optimization, as I tried to point out, heavily depends on the processor/database you are using.
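If it helps, here is one possible simplification, offered as a sketch rather than a guaranteed speed-up: bind the Query elements to a variable once, compute the aggregates directly on that sequence, and use a predicate for the filtered subsection. This avoids building the intermediate <Queries> element and the second explicit loop over $doc. Whether the files are actually read only once still depends on the processor.

```xquery
(: Bind every <Query> root element once; the collection URI is the one from the question. :)
let $queries := collection('file:///C:/REP/XML/input?select=*.xml')/Query
(: Aggregate functions accept the node sequence directly; untyped values are cast to xs:double. :)
let $avgHits := avg($queries/Hits)
let $numQueries := count($queries)
return
  <Statistics avgHits="{$avgHits}" numQueries="{$numQueries}">
  {
    (: A predicate takes the place of the second for/where over the documents. :)
    for $q in $queries[Hits < 10]
    return <Query hits="{$q/Hits}">{$q/QueryString/string()}</Query>
  }
  </Statistics>
```

Further subsections, such as the queries with no hits, could be added the same way with another predicate on $queries, for example $queries[Hits = 0].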

Yes, you will have to run some tests. It is nearly impossible to predict the actual runtime, because it depends heavily on your data and your query. However, it shouldn't be too hard to switch to a database later on, so I wouldn't worry too much about it.

