Data characteristics

 

The clickstream data collection method used in Clipen is based on packet sniffing of the tracked website's HTTP(S) communication. Although processing this type of source data is quite demanding, the method provides the most complete clickstream data collection of the methods compared below, so the effort pays off.

Feature                                           Server-side     Client-side   Server-side
Data collection method                            web log files   page tags     packet sniffing
non-invasive data collection (easy deployment)    +               -             +
client communication disconnects information [1]  -               -             +
POST method data available [2]                    -               +/-           +
website content delivery information [3]          +/-             -             +
hit-level (atomic) granularity data available     +               +/-           +
full spectrum of hits including non-pages         +               -             +
server-side timing information                    -               -             +
Internet connection speed information             +/-             +             +
missing or broken pages information               +               -             +
client forwarded-for information [4]              -               -             +
server redirects information                      -               -             +
copes with proxy or browser caching               -               +             -
robots and automated agents data available        +               -             +
website content time-to-serve [5]                 +/-             -             +
no performance impact on clients                  +               -             +
independence from JavaScript-enabled clients      +               -             +
copes with multiple servers of a single website   +/-             +             +

+/- means that some applications may provide the feature, but it is not generally available.

The Internet's communication infrastructure is based on the TCP/IP protocol suite. The protocols underlying HTTP (especially IP and TCP) provide further useful information about actual data transmission and timing over the network. Packet sniffing makes it possible to analyze TCP communication streams, which show how many bytes were sent and acknowledged in each direction between server and client. Packet sniffing generally cannot provide data from SSL/TLS-encrypted HTTPS communication; Clipen, however, supports this and is able to decrypt the communication, thus providing the full spectrum of clickstream data.
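As a rough illustration of the kind of information a captured TCP stream carries, the sketch below counts how many payload bytes the server sent and how many the client acknowledged. The `Segment` record and the `byte_counts` function are hypothetical names invented for this example, not part of Clipen; the sketch also assumes no sequence-number wraparound and ignores the sequence number consumed by FIN.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One captured TCP segment (hypothetical, simplified record)."""
    from_server: bool  # True if sent by the server, False if sent by the client
    seq: int           # sequence number of the first payload byte
    ack: int           # cumulative acknowledgment number
    payload_len: int   # number of payload bytes carried


def byte_counts(segments, server_isn):
    """Return (sent, acked) payload bytes for the server-to-client direction.

    server_isn is the server's initial sequence number. The SYN consumes
    one sequence number, so the first payload byte is server_isn + 1.
    """
    first_byte = server_isn + 1
    highest_sent = first_byte
    highest_acked = first_byte
    for s in segments:
        if s.from_server:
            # Taking the maximum means retransmissions are not counted twice.
            highest_sent = max(highest_sent, s.seq + s.payload_len)
        else:
            # The client's ack field names the next server byte it expects.
            highest_acked = max(highest_acked, s.ack)
    return highest_sent - first_byte, highest_acked - first_byte


capture = [
    Segment(from_server=True, seq=1001, ack=501, payload_len=500),
    Segment(from_server=True, seq=1501, ack=501, payload_len=500),
    Segment(from_server=False, seq=501, ack=1501, payload_len=0),
]
sent, acked = byte_counts(capture, server_isn=1000)
```

In this made-up capture the server has sent 1000 bytes, of which the client has acknowledged 500. A gap of this kind that never closes is one way a client disconnect during delivery could show up in the data.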

Because the technology is located on the server side of the tracked website, proxy and browser caching may sometimes affect data collection: a request answered from a cache never reaches the server, so it is never captured. To prevent unwanted caching of HTML pages (caching of pictures and similar media content may be desirable), it is important to apply "proxy busting" techniques.
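The text does not name specific "proxy busting" techniques. One common approach, sketched here as an assumption rather than as Clipen's recommendation, is to send standard HTTP cache-control response headers that forbid caching of HTML while still allowing it for other media. The function name `cache_headers` and the one-day lifetime for media are illustrative choices.

```python
def cache_headers(content_type: str) -> dict:
    """Pick response headers: forbid caching of HTML, allow it for other media."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return {
            "Cache-Control": "no-store, no-cache, must-revalidate, max-age=0",
            "Pragma": "no-cache",  # understood by HTTP/1.0 caches
            "Expires": "0",        # an invalid date is treated as already expired
        }
    # Illustrative value: let images and similar media be cached for one day.
    return {"Cache-Control": "public, max-age=86400"}


html_headers = cache_headers("text/html; charset=utf-8")
image_headers = cache_headers("image/png")
```

With headers like these, every HTML page request has to travel to the origin server, where the sniffer can see it, while media files can still be served from caches.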

In addition, Clipen provides an indicator of so-called blind referrers, which helps to specify at the session level how much the completeness of the session data may be affected. The Clipen user may then decide to reject the data from a whole session if it falls below the required data quality level. Because this indicator may also point to data noise or data corruption that is not caused by caching, it can be an important component of quality assurance.
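How Clipen computes this indicator is not described here. The sketch below shows one plausible interpretation, stated as an assumption: a referrer is counted as "blind" when it names a URL on the tracked site that was never captured as a hit earlier in the same session. The function names, the tuple layout, and the 0.2 threshold are all hypothetical.

```python
from urllib.parse import urlsplit


def blind_referrer_ratio(hits, site_host):
    """Share of on-site referrers that point to a URL never seen in the session.

    hits is a list of (url, referrer) tuples in capture order;
    referrer may be None when the request carried no Referer header.
    """
    seen = set()
    internal = 0
    blind = 0
    for url, referrer in hits:
        if referrer and urlsplit(referrer).hostname == site_host:
            internal += 1
            if referrer not in seen:
                blind += 1  # the referring page was never captured
        seen.add(url)
    return blind / internal if internal else 0.0


def accept_session(hits, site_host, max_ratio=0.2):
    """Reject the whole session when too many referrers are blind."""
    return blind_referrer_ratio(hits, site_host) <= max_ratio


session = [
    ("http://example.com/", None),
    ("http://example.com/a", "http://example.com/"),
    ("http://example.com/c", "http://example.com/b"),  # /b was never captured
]
ratio = blind_referrer_ratio(session, "example.com")
keep = accept_session(session, "example.com")
```

Here one of the two on-site referrers is blind, giving a ratio of 0.5, so the session would be rejected under the illustrative threshold.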

Output clickstream data in the Clipen database are stored at the hit and session levels of granularity, and always belong to a specific session and a specific tracked website. The hit level contains hits as well as pages: every page is a hit, but not every hit is a page (pages are a subset of hits). The only hits/pages stored in the Clipen database are those whose HTTP request and response messages each contain a complete header, including the empty line that marks the end of the header fields. Each stored hit/page always includes both messages (parts) of the HTTP communication.
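The storage rule can be sketched as a check for the empty line (CRLF CRLF) that ends an HTTP/1.x header in both messages. The function names are hypothetical, and the page test is an assumption made only for illustration: the text does not say how Clipen tells pages from other hits, so the sketch simply treats an HTML response as a page.

```python
HEADER_END = b"\r\n\r\n"  # empty line that terminates an HTTP/1.x header


def has_complete_header(message: bytes) -> bool:
    """True if the captured message contains the whole header section."""
    return HEADER_END in message


def is_storable(request: bytes, response: bytes) -> bool:
    """A hit qualifies only if both messages have complete headers."""
    return has_complete_header(request) and has_complete_header(response)


def is_page(response: bytes) -> bool:
    """Assumption for illustration: a hit is a page if it returns HTML."""
    head = response.split(HEADER_END, 1)[0].decode("iso-8859-1").lower()
    for line in head.split("\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(":")
        if name.strip() == "content-type":
            return value.strip().startswith("text/html")
    return False


request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
html_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>"
png_response = b"HTTP/1.1 200 OK\r\nContent-Type: image/png\r\n\r\n\x89PNG"
truncated = b"HTTP/1.1 200 OK\r\nContent-Type: text/ht"
```

In this example the HTML and PNG exchanges are both storable hits, only the HTML one counts as a page, and the exchange with the truncated response is discarded.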

The session level holds mostly aggregated data, obtained while processing the session's hits and pages. Each session's data include qualitative and quantitative attributes that help to decide whether and how the session data should be passed along the ETL chain by the Clipen user's external ETL application.
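The relationship between hit-level and session-level data can be sketched as a simple aggregation followed by a filter on the ETL side. The attribute names (`hit_count`, `page_count`, `total_bytes`, `duration`) and the export rule are hypothetical and are not Clipen's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical hit record with only the fields this sketch needs."""
    timestamp: float     # seconds since some reference point
    is_page: bool        # True if the hit is also a page
    response_bytes: int  # size of the HTTP response


def summarize_session(hits):
    """Aggregate the hits of one session into session-level attributes."""
    times = [h.timestamp for h in hits]
    return {
        "hit_count": len(hits),
        "page_count": sum(h.is_page for h in hits),
        "total_bytes": sum(h.response_bytes for h in hits),
        "duration": max(times) - min(times) if times else 0.0,
    }


def should_export(summary, min_pages=1):
    """Illustrative ETL rule: skip sessions that contain no pages."""
    return summary["page_count"] >= min_pages


hits = [Hit(10.0, True, 2000), Hit(10.5, False, 500), Hit(25.0, True, 3000)]
summary = summarize_session(hits)
export = should_export(summary)
```

An external ETL application could apply rules of this kind to the session-level attributes before loading the underlying hits.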

Besides the output clickstream data, the Clipen database also contains output definition data of a static nature. These data represent the configuration settings of the tracked website. Although they may change from time to time when the Clipen user modifies the settings using the ClipConf tool, the changes are driven entirely by the Clipen user through the configuration process.