User Tracking on Academic Publisher Platforms

Cody Hanson
@codyh
codyhanson@umn.edu

Prepared for the Coalition for Networked Information Spring 2019 Member Meeting, April 8-9, 2019, St. Louis, Missouri.

Updated and expanded in a keynote at the EResources MN Conference, October 4, 2019.

I studied the page source from fifteen different publisher platforms and found that publishers of library resources use technology that actively undermines patron privacy. This advertising and marketing technology makes it impossible to ensure that the use of electronic library resources remains private.

Background

I was inspired to do this research by three talks I saw at the December 2018 Coalition for Networked Information (CNI) meeting in Washington, D.C. The first was “RA21: Resource Access for the 21st Century – Pilot Results and New Recommended Practices,” a briefing by Todd Carpenter of the National Information Standards Organization (NISO), Jean Shipman of Elsevier, and Ralph Youngen of the American Chemical Society. Several attendees asked pointed questions about the privacy implications of wiring our institutional identity management systems into publisher platforms. In response to these questions, Carpenter noted correctly that the mechanisms of RA21 don’t require that personally identifiable information (PII) be sent to publisher platforms for authentication. He added that publishers don’t need PII from RA21 to be able to identify library users. This last comment was intended to reassure attendees that the harvest of PII isn’t the intent of RA21, but I found the implication that publishers can and do identify library users concerning.

The second briefing that provoked me was by Kenning Arlitsch and Scott W. H. Young, both of Montana State University, presenting information from their paper with Patrick O’Brien and Karl Benedict, “Protecting privacy on the web: A study of HTTPS and Google Analytics implementation in academic library websites.” This project employed a programmatic analysis of the page source code from 279 academic library websites to determine if and how a handful of privacy-protecting mechanisms were implemented therein.

A similar, though more hands-on, approach was taken by Katie Zimmerman and Micah Altman of MIT, whose briefing “Evaluating and Closing Privacy Gaps for Online Library Services” detailed their exhaustive review of the privacy policies from a number of library platforms to identify the stated privacy protections. This presentation also touched briefly on web tracking mechanisms on publisher platforms.

Could an analysis of such web tracking mechanisms, I wondered, shed light on the user identification measures hinted at by Todd Carpenter?

Methodology

In January and February 2019, I set out to investigate tracking on publisher platforms by analyzing the code delivered to user browsers for article pages on major publishing platforms. To do so, with the assistance of my colleague Michael Berkowski, I gathered a list of the 100 Digital Object Identifiers (DOIs) most frequently accessed through the proxy server at the University of Minnesota Libraries during the previous two years. The articles these DOIs resolve to came from fifteen different publisher platforms.

I chose the most frequently accessed article on the list from each of the fifteen platforms and accessed it myself through DOI.org. I used my on-campus workstation, which falls within our IP-authentication range with all fifteen publishers. I then downloaded a complete archive of the article page, including all HTML, JavaScript, CSS, and images. I used an unmodified install of the Chrome browser with only the Ghostery plug-in added. Ghostery is an ad blocker, but I configured it to block nothing; instead, I used it for its analysis of the third-party code being loaded on the page. For each article, I recorded all the third-party assets being loaded on the page and used Ghostery’s website to identify the origin of each asset.

I also reviewed, to the best of my ability, the source code of the article page and looked for relevant first-party code.
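This kind of review can also be approximated programmatically. The following is an illustrative sketch, not the tooling I used; the page markup and the publisher domain are invented:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class AssetCollector(HTMLParser):
    """Collect the hosts of externally loaded scripts, stylesheets, and images."""
    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Only absolute URLs point at third-party (or first-party CDN) hosts.
            if name in ("src", "href") and value and value.startswith("http"):
                self.hosts.add(urlparse(value).netloc)

# A toy page standing in for a saved article archive.
page = """
<html><head>
  <script src="https://www.google-analytics.com/analytics.js"></script>
  <script src="https://connect.facebook.net/en_US/fbevents.js"></script>
  <link rel="stylesheet" href="/local/styles.css">
</head><body></body></html>
"""

first_party = "www.examplepublisher.com"  # hypothetical publisher domain
collector = AssetCollector()
collector.feed(page)
third_party = sorted(h for h in collector.hosts if h != first_party)
print(third_party)  # ['connect.facebook.net', 'www.google-analytics.com']
```

A full analysis would also need to follow the scripts these assets load in turn, since third-party code routinely pulls in further third parties.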

Findings

I found that, on average, each publisher site had eighteen third-party assets being loaded on their article pages. The median was ten. One publisher platform, the only one I will name here today, had zero: InformPubsOnline. One platform had over 100. In total, I found 139 different third-party asset sources across these fifteen articles.

Significance of third-party assets

The reason I was interested in third-party assets being loaded on these sites is that any JavaScript loaded on these pages has access to the entire DOM, or document object model, meaning it can read the address and contents of the page. It also has access to every user action that happens on that page, and can itself load additional scripts from additional sources. So when, for example, a publisher puts JavaScript from Google on its pages, Google can record any information from the page about the article being sought, or search terms from a library user in the publisher platform. Fourteen of the fifteen publisher platforms included Google code on the article page.

Further, should a user have active cookies in their browser set against the same domain from which third-party JavaScript is loaded, that JavaScript can read both the contents of the page and the contents of that cookie. This means, for instance, that if you’ve allowed Facebook to cache your user credentials in your browser (using “remember me on this computer” or a similar feature), and JavaScript from Facebook is loaded on a page you visit, Facebook not only is able to see the entire contents of the page, but they are able to attribute your visit to your Facebook account by reading those stored credentials. At least four of the fifteen publisher platforms included Facebook code on the article page.
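The mechanism in the browser is JavaScript, but the flow it enables can be simulated in a few lines of Python. Everything below is hypothetical: the URLs, the cookie value, and the tracker endpoint are invented to show how page context and a stored identity combine into one attributable event:

```python
from urllib.parse import urlencode

# Hypothetical state visible to a third-party script embedded on an article page.
page_url = "https://www.examplepublisher.com/doi/10.0000/example"
page_title = "Example Article on a Sensitive Topic"
search_terms = "sensitive health query"

# Cookie previously set by the third party on its own domain ("remember me").
third_party_cookie = {"user_id": "fb-123456789"}

# The embedded script can read both the page and its own cookie, then report
# them to its origin as a single event tied to a known account.
beacon = "https://tracker.example.net/collect?" + urlencode({
    "url": page_url,
    "title": page_title,
    "q": search_terms,
    "uid": third_party_cookie["user_id"],
})
print(beacon)
```

The point of the sketch is the join: neither the pageview nor the cookie is especially revealing alone, but the script sees both at once.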

Some of the third-party assets included on the publisher platforms are relatively benign, such as code from Pingdom or New Relic. These provide site monitoring and performance information. Likewise, HotJar operates with a robust user privacy model. However, every one of the third-party scripts has access to the DOM and the ability to record user behavior, as well as the ability to load additional scripts.

Browser fingerprinting

But say Facebook (or Google, or Twitter, or Yahoo! Ad Exchange) doesn’t have a cookie in the user’s browser. They are still very much interested in connecting a user’s activity to their past or future behavior. The more comprehensive the picture of the user’s behavior, the better the targeting of ads for that user. To build this picture, social networks, ad networks, and data brokers record all that could possibly be identifying. One common tool is called browser fingerprinting.

Browser fingerprinting is the practice of gathering all the metadata available to a web server about the user’s browser, including but not limited to:

- the user agent string
- HTTP accept headers
- browser plugin details
- time zone
- screen size and color depth
- installed system fonts
- whether cookies are enabled
- canvas and WebGL rendering characteristics

This list has been largely drawn from the Electronic Frontier Foundation’s Panopticlick tool.

Any one of these pieces of metadata on its own doesn’t provide meaningful information about who the user is. Taken together, however, the metadata about the user’s browser can be sufficient to identify a user. For example, Panopticlick estimates that the browser I most frequently use has a fingerprint matched by only one in 103,433.5 browsers. So when Facebook or Google identifies my browser fingerprint on another page, or another day, or from another IP address, it can reasonably attribute those pageviews to a single user.
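Two small calculations make this concrete. A fingerprint is typically a stable hash over the concatenated attribute values, and a rarity like "one in 103,433.5" can be read as bits of identifying information. The attribute values below are made up:

```python
import hashlib
import math

# Made-up browser attributes of the kind Panopticlick measures.
attributes = [
    "Mozilla/5.0 (X11; Linux x86_64) ...",  # user agent
    "1920x1080x24",                         # screen size and color depth
    "America/Chicago",                      # time zone
    "en-US",                                # language
    "DejaVu Sans, Liberation Serif, ...",   # installed fonts
]

# Trackers commonly reduce the attribute set to a single stable identifier.
fingerprint = hashlib.sha256("|".join(attributes).encode()).hexdigest()
print(fingerprint[:16])

# "One in 103,433.5 browsers" corresponds to about 16.7 bits of identifying
# information (log2 of the rarity); each extra bit halves the anonymity set.
bits = math.log2(103433.5)
print(round(bits, 1))  # 16.7
```

Since roughly 33 bits suffice to single out one person among the world’s population, a browser contributing 16.7 bits needs only modest additional signals, such as an IP address, to become effectively unique.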

Audience tools

Techniques like browser fingerprinting are how massive ad networks like Google’s DoubleClick track users across devices, networks, and sessions. It’s how products you search for on your phone end up advertised to you on your laptop. (Thirteen of the fifteen publisher platforms I studied included DoubleClick code.) But tracking of user identity information is not strictly the province of sophisticated social networks and ad networks.

There is a class of software I’ll refer to as audience tools that makes these user identification techniques available to any site owner. I was interested in these because they are typically licensed by publishers from software companies at a significant cost, indicating deliberation and intention on the part of the publishers. However, because third parties can load their own JavaScript on publisher platforms, I do not know definitively whether these tools are used directly by publishers, or by their partners.

Adobe Audience Manager, Oracle Marketing Cloud, and Neustar are a few of the audience tools I found on the publisher platforms. Each of these tools allows site owners to identify their site visitors. Using IP addresses, browser fingerprinting, first-party data (such as user account information or email marketing lists from the publisher), and third-party data from data brokers, these tools bring together an individual user’s activity across sessions, devices, and sites. They build a comprehensive history of platform use by individually-identified users.

A marketing video on Neustar’s site describes their technology by saying, “Identity is not static. It is dynamic. Only Neustar’s OneID system has holistic identity resolution, corroborated as often as every fifteen minutes, with eleven billion daily updates from multiple sources.” To translate, when publishers include Neustar code on their platforms, any identifying information about library users is matched against Neustar’s massive identity database. Four of the fifteen publisher platforms I surveyed included Neustar code.

Adobe Audience Manager’s site touts its ability to “…turn fragmented data, from any channel or device, into meaningful audiences that you can act on right away.” The implication is that Adobe Audience Manager combines usage from multiple sites and devices into a single user profile. Elsewhere, the site describes Adobe Audience Manager’s ability to “deliver offers only to users when they are logged in, or based on previous log in activity.” This reveals that Adobe Audience Manager is able to identify users even when they are not logged in to the client platform. Adobe Audience Manager also offers a marketplace where clients can purchase access to data sets to enrich their user data from brokers like Acxiom, which boasts “comprehensive consumer data on approximately 250 million U.S. addressable consumers…” Six of the publisher platforms I surveyed included Adobe Audience Manager code.

Oracle Marketing Cloud advertises its ability to “connect with an individual customer across all channels and devices” using what they call the Oracle ID Graph, which “ingests massive amounts of IDs across cookies, login, HH [household], email, and mobile ad IDs…The Oracle ID Graph can reach over 90% of people online in the US and in markets that matter internationally…” This implies that when Oracle Marketing Cloud code is loaded on publisher platforms, it is highly likely that the library user will be matched to a profile in Oracle’s ID Graph. At least four publisher platforms I surveyed included Oracle Marketing Cloud code.

And what are the sources of the billions of data points used by these companies? Eleven of the fifteen publisher platforms I looked at are among them, either by direct use of one of the three audience tools above, or through inclusion of AddThis, a social media widget that shares user activity data with 44 ad network and data broker partners, including Neustar, Adobe, Oracle, and Google.

The companies described above are a small sample of the dozens of companies whose code is loaded on publisher platforms, all of which are technically able to gather similar user data, and many of which do.

Aggregated identity

Recording of user activity attributed to browser fingerprints or other information is sometimes referred to as creating a “shadow profile.” If this sounds familiar, it’s because Facebook was recently asked some pointed questions about the practice by a congressional panel. Facebook has long been suspected of using its ad network and contact information uploaded by users and advertisers to build hidden profiles for people who don’t themselves have Facebook accounts. These profiles allow people with no Facebook account to be targeted as effectively as Facebook users on sites in Facebook’s ad network. Similarly, no library user has created an account with Neustar or Oracle Marketing Cloud that would make them aware of the information these companies are collecting. These companies have assembled profiles for people out of bits and pieces of information gathered from many sources.

Identity aggregates around users in these systems. User activity initially associated only with a browser fingerprint can be combined with other activity on the basis of a shared cookie or IP address, and vice-versa. Eventually, should the user authenticate to Facebook, Google, or a publisher platform, their identity will be associated with all past or future activity from that IP address or browser, whether authenticated or not. Thus, activity that when it takes place has no accompanying information we’d typically classify as PII eventually becomes directly attributable to an identified individual.
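This aggregation amounts to transitively merging events that share any identifier, essentially a connected-components computation. A toy illustration, with all data invented and real audience tools being far more elaborate:

```python
# Each event carries whatever identifiers were observable at the time.
events = [
    {"fingerprint": "fp-1"},                          # anonymous pageview
    {"fingerprint": "fp-1", "ip": "10.0.0.5"},        # same browser, IP now seen
    {"ip": "10.0.0.5", "cookie": "c-99"},             # same IP, cookie now seen
    {"cookie": "c-99", "login": "user@example.com"},  # user eventually logs in
]

# Merge any clusters of identifiers that overlap with the new event.
clusters = []
for event in events:
    merged = set(event.values())
    rest = []
    for cluster in clusters:
        if cluster & merged:
            merged |= cluster  # shared identifier: fold cluster in
        else:
            rest.append(cluster)
    clusters = rest + [merged]

print(len(clusters))    # all four events collapse into one identity
print(sorted(clusters[0]))
```

The first event carried nothing we’d normally call PII, yet once the chain closes with a login, that anonymous pageview sits in the same cluster as an email address.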

Aggregated identity challenges libraries’ historical assumptions about privacy and anonymity, which presume that activity is personally identifiable if and only if that activity is directly and simultaneously associated with personally identifiable information. This notion of privacy, now enshrined in licenses, policies, and law, doesn’t conceive of our present world of storage and processing abundance, which allows social networks, ad networks, and data brokers to record indefinitely all user activity available to them along with all potentially identifying information available to them.

Privacy implications for libraries

Upon evaluation of these audience tools and others, I conclude that many publisher platforms seek to maximize, rather than minimize, the library user identity information that gets associated with users’ behavior. Further, whether intentionally or not, these platforms are sharing user behavior data with third parties that aggregate identity data. It is definitely the case that publishers don’t need RA21 or even what we typically think of as PII to identify library users.

What I found on these sites shocked me. I was well aware of the privacy-infringing tendencies of social networks and ad networks, but I did not expect to find this activity on library publisher platforms. I had assumed that efforts like the NISO Privacy Principles, drafted by a combined group of library and publisher representatives, indicated an understanding on the part of publishers of libraries’ privacy values. While publishers may not be collecting or sharing complete identity information alongside user behavior, they are certainly violating library privacy values by embracing technology designed for the express purpose of resolving, aggregating, and sharing user identity.

My colleagues and I have spent countless hours agonizing over how to responsibly manage user activity information: how to manage records, logs, policies, and access controls. I spoke proudly less than a year ago at a NISO event about how I felt the proxy server was a firewall protecting patron identity. All the while, every page loaded by our users on fourteen of these fifteen platforms was sending off to data brokers the kind of library usage information that twelve years ago Library Connection librarians fought the FBI to protect.

I’m pleased to see the emphasis on patron privacy in recent statements such as the addition of Article VII to the ALA Library Bill of Rights and the Statement on Patron Privacy and Database Access published by Stanford and signed by a number of other prominent academic libraries. However, my research shows that unless the code being served by publishers to other libraries differs significantly from that being delivered to UMN, these statements are aspirational at best. These statements do not accurately represent the current state of patron privacy in use of licensed resources.

Under current conditions, I don’t believe it’s possible for libraries to provide meaningful assurance of privacy or anonymity to users of licensed resources. Libraries ought to take care to delineate statements of values from statements of fact, lest patrons be misled about our ability to protect their privacy. I would encourage anyone who shares my concerns about publishers' privacy-infringing practices to spend time looking at article pages on publisher platforms to confirm with whom library patron usage data is being shared.

If libraries hope to curtail these privacy violations, the only tool available to us, absent significant regulation, is the license agreements we sign with publishers. At the December 2018 CNI meeting, Lisa Janicke Hinchliffe and Katie Zimmerman spoke about efforts underway to develop model license language that constrains privacy infringement on publisher platforms. I welcome these efforts. I intend to do my part to influence license language at my own library and to advocate for a redefinition of PII. I also intend to alter significantly the way I speak about libraries and privacy, until the reality of the systems and services we provide to our users more directly aligns with our professed values.

With gratitude to the friends and colleagues who helped me shape this piece through conversation, edits, or their own estimable work on similar topics, and whose mention here does not imply their endorsement of my work:

Michael Berkowski, Sunshine Carter, Scarlet Galvan, Jason Griffey, Eric Hellman, Lisa Hinchliffe, David Lacy, Matthew Reidsma, Dorothea Salo, Franklin Sayre, Nancy Sims, Elizabeth Temple