Data Fusion
In this lesson, we’re going to explore how data fusion techniques can be used to combine information from different sources to build a more accurate picture of an individual. To explain the consequences of this technical process, I have created several fictitious example situations that are plausible given current data collection practices and evolving data fusion algorithms.
Overview
“Data fusion techniques combine data from multiple sensors, and related information from associated databases, to achieve improved accuracies and more specific inferences than could be achieved by the use of a single sensor alone.”1
The term data fusion, somewhat ironically, came into vogue in 1984, when Lockheed Martin described tactical data fusion systems for military applications. These systems would combine information from battlefield sensors with other sources to give the U.S. military a tactical advantage.2 Today, a variety of techniques are employed, and several different classification schemes can be used to describe them. However, they all share the same objective: to combine multiple, potentially unrelated pieces of information in order to improve what is known or to draw some kind of conclusion.3
Classical data fusion algorithms use complex mathematics and statistics to identify relationships between pieces of data.4 These algorithms have found uses outside the military in fields such as predictive law enforcement, in which government agencies use vast amounts of collected data to try to identify both potential crimes and potential suspects.5 Companies can use the same techniques to try to identify categories of potential new shoppers, such as pregnant women who will soon be spending on maternity items, diapers, and so forth.6
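To make the classical case a little more concrete without getting into the full mathematics, here is a minimal sketch of one textbook statistical technique, inverse-variance weighting, which combines two noisy measurements of the same quantity into a single estimate that is more accurate than either measurement alone. The sensor values and noise levels below are invented purely for illustration; real systems layer far more sophisticated estimators, such as Kalman filters, on top of this basic idea.

```python
import random

def fuse(estimate_a, variance_a, estimate_b, variance_b):
    """Combine two noisy estimates of the same quantity by weighting each
    estimate by the inverse of its variance (a classical fusion step)."""
    weight_a = 1.0 / variance_a
    weight_b = 1.0 / variance_b
    fused_estimate = (weight_a * estimate_a + weight_b * estimate_b) / (weight_a + weight_b)
    fused_variance = 1.0 / (weight_a + weight_b)
    return fused_estimate, fused_variance

# Hypothetical example: two sensors measure the same 21.0-degree room.
truth = 21.0
sensor_a = truth + random.gauss(0, 0.5)   # noisier sensor, variance 0.25
sensor_b = truth + random.gauss(0, 0.2)   # better sensor, variance 0.04

estimate, variance = fuse(sensor_a, 0.25, sensor_b, 0.04)
print(f"A: {sensor_a:.2f}  B: {sensor_b:.2f}  fused: {estimate:.2f} (variance {variance:.3f})")
```

The fused variance (about 0.034 here) is smaller than either sensor’s variance on its own, which is exactly the “improved accuracies” promised in the definition quoted above.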
Artificial intelligence techniques, such as machine learning, may improve the accuracy and efficiency of data fusion over time.7 As computational capabilities increase in low-power devices, it might even become possible to perform AI-assisted fusion on the same devices that collect the information.8 Such a development could increase both the quality and the invasiveness of information sent to data brokers.
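As a rough sketch of what even very simple machine-learning-assisted fusion can look like, the example below trains a logistic regression classifier on synthetic data to combine features from two hypothetical sources (a wearable and a phone) into a single label. The feature names, the data, and the “high risk” label are all invented for this illustration and are not drawn from any real product.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training rows, each fusing features from two hypothetical sources:
# [resting_heart_rate_bpm (wearable), late_night_location_visits_per_week (phone)].
X = np.vstack([
    rng.normal(loc=[62, 0.5], scale=[5, 0.5], size=(100, 2)),  # label 0: "low risk"
    rng.normal(loc=[78, 4.0], scale=[5, 1.5], size=(100, 2)),  # label 1: "high risk"
])
y = np.array([0] * 100 + [1] * 100)

# The classifier learns how much weight to give each fused feature,
# which is the job the classical statistics would otherwise do by hand.
model = LogisticRegression().fit(X, y)

# Score a new, previously unseen person from their fused data points.
person = np.array([[74.0, 3.0]])
print("estimated probability of the 'high risk' label:",
      round(model.predict_proba(person)[0, 1], 3))
```

The model itself is unremarkable; the point is the pattern. Once features from separate sources are lined up in the same rows, a label like “high risk” can be attached to new people automatically, whether or not that label is justified.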
Risks Involving Health Data
As we have seen in previous lessons in this section, quite a few different devices and systems can collect health and fitness data from individuals. Pharmacies can collect prescription information if a person gives consent, which often occurs when signing up for a discount or rewards program. A person who takes a DNA test creates shareable records of their genetic information. Health and fitness apps collect a considerable amount of sensitive information about overall activity, menstruation, sexual behavior, and mental health. Fitness trackers provide high-resolution information about heart rates, movements, and body temperature. Connected sex toys can collect a significant amount of extremely intimate data.
Each of these categories of health data can be problematic by itself, both when used for marketing or targeting and when leaked during a data breach. However, much more powerful and intrusive inferences about a person can be drawn if these individual data points from different devices are all fused together into one comprehensive view. Given enough data points – and enough data fusion – an extremely detailed description of a person can be created. The results of this kind of fusion could be beneficial if they helped the person catch a cancer early, yielded some insight that improved the person’s life, or connected the person with a lucrative and rewarding career. Unfortunately, data fusion can just as easily lead to darker outcomes. A person could be ridiculed, stalked, discriminated against, or even targeted based on an extremely accurate profile created from fused information.
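For a sense of how little machinery that “one comprehensive view” actually requires, here is a minimal, purely illustrative sketch that links records from three hypothetical data feeds on a shared identifier (a plain email address, standing in for the hashed emails and advertising IDs that brokers really use) and then applies a crude rule to the fused profile. The feeds, field names, and rule are all invented for this example.

```python
from collections import defaultdict

# Hypothetical feeds purchased from three different sources.
pharmacy = [{"email": "student@example.com", "rx": "medication_a"}]
fitness = [{"email": "student@example.com", "resting_hr": 74, "sleep_hours": 5.1}]
apps = [{"email": "student@example.com", "late_night_checkins": 3}]

# Step 1: link records that share an identifier into one profile per person.
profiles = defaultdict(dict)
for feed in (pharmacy, fitness, apps):
    for record in feed:
        profiles[record["email"]].update(record)

# Step 2: apply an inference rule to the fused profile. The rule is
# deliberately crude; fused data invites confident-sounding conclusions
# whether or not they are justified.
for email, profile in profiles.items():
    flagged = profile.get("sleep_hours", 8) < 6 and profile.get("late_night_checkins", 0) >= 2
    print(email, "->", "flagged as 'high risk'" if flagged else "not flagged")
```

Real brokers and fusion vendors operate at vastly larger scale and with fuzzier matching, but the join-then-infer pattern is the same, and so is the danger: the rule at the end can be as flimsy as you like and still produce a confident-sounding flag.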
Examples
In order to explain the consequences of data fusion without getting bogged down in the mathematics, I have created four fictitious examples that are not based on any real people or situations.
Example 1: Jack’s Dream Job
Jack is an ordinary college student. He attends class regularly, mostly does his work, and is active with his fraternity. Jack exercises regularly and wears a Fitbit 24/7. Jack also parties regularly with his fraternity brothers, consuming significant amounts of alcohol most weekends. Jack’s phone has numerous apps installed, some of which contain pervasive tracking functionality.
As Jack approaches graduation, he diligently prepares his resume and hopes for an interview with SuperHappy Corporation, which is a company that is well-known in his industry as a great place to work. A job at SuperHappy is Jack’s ultimate dream job. His excitement peaks when he’s called in for an interview, which seems to go quite well. The hiring manager at SuperHappy extends an excellent offer to Jack, and he eagerly accepts. Now he just needs to wait on the various Human Resources processes to be completed before he starts.
A week later, Jack receives a call from the hiring manager. His offer with SuperHappy Corp. has been rescinded, as he failed the background check portion of the hiring process. Jack has never been arrested. In fact, he’s never even had so much as a speeding ticket. The hiring manager notifies Jack that the decision has been made by an outside background investigation firm, and Jack is given a number to call to request a copy of the report. He calls the number and then has to wait six weeks for the report to be sent to him via snail mail.
Upon opening the report, Jack discovers that the firm has fused data from a variety of sources, including his phone apps and his Fitbit. Jack’s phone apps have revealed his location history, which includes numerous visits to well-known party spots. Data from his Fitbit, showing his heart rate and body temperature at different times, suggest the consumption of alcohol. This inference has been fused with the location data, and the background investigation firm has labeled Jack a “high risk” person who makes “poor choices.” Jack’s chance at a dream job has been ruined by inferences a company should never have been able to make, using data that Jack didn’t even know was being collected.
Example 2: John’s Life Insurance
John is a serious student in college. He is careful about his health, since he knows that a certain kind of cancer runs in his family. Therefore, he doesn’t drink, he doesn’t smoke, and he exercises regularly. John doesn’t have too many apps on his phone, nor does he wear any kind of fitness tracker. However, he wonders about his future cancer risk and decides to order a commercial DNA test kit. The test results show what he already knows, which is that there is a history of this cancer in his family. However, they don’t give any conclusive estimate of John’s actual risk. He decides to use a nutrition tracking app to help him watch what he eats, but he’s otherwise unconcerned.
Ten years after he graduates from college, John is happily married and is expecting his first child. He decides to apply for whole life insurance in order to protect his young family in case he gets sick or gets into some kind of accident. He goes through the application process and has an insurance physical completed, which doesn’t show any concerning health conditions. Two weeks after applying, John receives a denial letter from the life insurance company explaining that he is “too risky” to insure. He calls the company and asks for an explanation, but they won’t provide any more details other than to say that he “failed underwriting.”
A year later, the insurance company has a data breach, and John is able to find his records on the Dark Web. The insurance company had purchased information from a data broker, and that information included John’s DNA test results, public records showing he’d had a few speeding tickets over the years, and information obtained by the nutrition app. The insurance company’s AI data fusion algorithm came to the conclusion that John had a high risk of an early death and determined that he shouldn’t be issued a life insurance policy. Of course, the AI algorithm proved to be wrong, John never got cancer, and he lived to 102. Nevertheless, he couldn’t get life insurance due to the various pieces of data that were collected about him and fused by a faulty algorithm.
Example 3: Jill’s Arrest
Jill is also an ordinary college student. She doesn’t drink heavily or use drugs, but she does party occasionally. Jill uses social media extensively to keep in touch with her friends, and she uses the social media company’s app on her phone. In fact, Jill loves phone apps. She uses an app to track her period, another app to watch what she eats, and a third app to interface with her fitness tracker, which she only wears when exercising. Jill likes to shop, and she has all the apps from her favorite stores, including her local grocery store and pharmacy.
Jill dates a bunch of different guys in college while she tries to decide what her “type” is, but she doesn’t stay in any particularly lengthy relationships. Her health is good, and she usually only needs prescriptions for her birth control, the occasional antibiotics to deal with college respiratory crud, and the occasional medicine to deal with the yeast infections those antibiotics cause. Nothing is unusual until her senior year, the night she goes to this one party off campus. Something is strange about a drink she takes, and she falls down the stairs in front of the house. Fortunately, she went to the party with her good friends, who immediately take her home and care for her.
A week or two after the party, Jill is still a little sore from the fall, but she also has unusually heavy bleeding and cramping – much more than is usual for her period. She goes to the campus health clinic, which runs some tests and determines that Jill has contracted a treatable sexually transmitted infection. The clinic prescribes some antibiotics and sends her on her way. With graduation just around the corner, Jill decides to stop dating for a while so that she can focus on her upcoming career. She lands a job in another state, relocates to a new city, and excitedly begins her new adult life.
Several years after moving to the new city, the local police arrive with a warrant for Jill’s arrest. She is confused – there must be some mistake, and she is confident she hasn’t broken any laws. The warrant is actually for her extradition to the state where her college was located. She is being charged with having an illegal abortion, despite the fact that she has never even been pregnant!
Jill’s attorney immediately moves for discovery to determine why the authorities in her old state have filed this insane charge. It turns out that a grand jury has indicted her based on evidence obtained by fusing data from several sources. Her social media and shopping apps, as well as her fitness data, allowed a predictive policing algorithm to determine the dates that Jill dated her various boyfriends. Fused health data also revealed that the campus health clinic had not gotten Jill’s diagnosis fully correct. While she had contracted a mild STI, she had also miscarried as a result of the fall at the party.
The overzealous local prosecutor claims that Jill pretended to be drunk and staged the fall in order to induce an end to her pregnancy, despite having no evidence to show that Jill even knew she was pregnant. Fortunately, the judge is sensible and dismisses the charges with prejudice. Unfortunately, the damage is already done: Jill’s company fires her after her arrest, and her mugshot is all over TV and the Internet. Sensationalistic news media have obtained her private data from the prosecutor’s office, going so far as to publish charts of her cycles throughout college and to speculate about times when she might have forgotten to take her birth control. Even some members of her family accuse her of falling intentionally. This horrific ending is brought to you by the surveillance economy, coupled with data fusion algorithms, an unethical prosecutor, and absolutely sloppy police work.
Example 4: Jane’s Addiction
Our last story concerns Jane, who is (of course) also a college student. Jane is artistic and progressive in her views. She is sex-positive and openly bisexual – facts that she even puts on her public art portfolio website. Jane dates a number of people in college, several of whom give her Internet-connected sex toys. Like most college students, Jane has a pretty extensive collection of apps on her phone. She uses a fitness tracker with her phone, regularly surfs the Internet using the phone’s browser, and even keeps a detailed diary using a cool diary app she found in the App Store (or maybe it was the Play Store). Jane is in a sorority and is really close with a few of her sorority sisters.
During her junior year, Jane notices that everyone seems to be talking about her. She discovers that some of her sorority sisters have been gossiping and comparing her to a middle school boy. She confronts one of the people she considered a friend and demands to know why she is being treated this way. The other person proceeds to explain that one of the fraternity brothers has purchased a complete set of records about Jane from a data broker and has used an AI algorithm to fuse the data from the records into a comprehensive profile. As it happens, the manufacturers of the various sex toys Jane has been given have been sharing her data – data that include times and durations of use, what settings are employed, and so forth. Data from her fitness tracker shows her elevated heart rate and changes in other vital signs that correlate with the sex toy data. Both the algorithm and the people who looked at her data noticed a lot of use, averaging 5 or 6 times per day.
Somehow, Jane’s situation grows even worse. It turns out that the cool diary app she has been using has been selling her diary entries – verbatim – to the same data broker that had collected her other data. Fusion tools have mined the diary data and used AI-assisted language processing to filter the entries to the ones that addressed her toy usage, including her self-expressed desire to stop and a feeling that she has lost control. Her addiction is further confirmed by data from the mental health app she used to connect with a sex therapist. While the therapist has properly maintained confidentiality (as required by HIPAA), the app has recorded everything and sold it to the same data broker. Thanks to the large quantity of information and data fusion, every Greek organization on campus now knows her secret.
Mitigation
You might be thinking that these examples are far-fetched, but they most definitely are not. All the data collection aspects of these examples are already occurring today. Data fusion isn’t quite at the level of mining a diary (as far as I know), but AI language processing is advancing rapidly. Remember that tomorrow’s inferences using tomorrow’s algorithms can be run on data that was collected yesterday and stored for later processing (not unlike the “harvest now, decrypt later” type of cyberattack). Recall from our past discussions on tracking that companies can use information from data brokers in their hiring decisions, so Jack’s story is entirely plausible. John’s story is equally plausible, as you should remember from recently reading that the Genetic Information Nondiscrimination Act (GINA) only outlaws discrimination based on genetic data in health insurance, not in life insurance.
Sadly, we have already seen real events that rival or exceed Jill’s and Jane’s stories. In 2019, a woman named Chelsea Becker gave birth to a stillborn boy in a California hospital. She had a history of drug addiction, so the local prosecutor charged her with murdering a human fetus. A judge eventually dismissed the charge, but only after Becker had spent 16 months in jail.9 Becker’s case was not unique. A woman in Oklahoma was sentenced to 4 years in prison in 2021 after she miscarried. Her history of drug use was cited as evidence of a self-induced abortion, despite expert testimony that the fetus had abnormalities. Cases such as these date back long before data fusion was ever an issue. In 1989, South Carolina Ninth Circuit Solicitor Charlie Condon (who would later become state Attorney General) prosecuted some 30 women, alleging crack cocaine use during pregnancy. A nurse at the Medical University of South Carolina (MUSC – about two hours’ drive from Coastal Carolina University) conspired with Condon to test for cocaine use any pregnant woman who looked like she might be a drug addict. Since 29 of the 30 women were Black, and the one white woman had a Black boyfriend, racial profiling appears evident.10
Rampant data collection and data fusion make situations like the ones in both the fictitious and real stories more likely, since algorithms can be employed to look for connections in data to try to support some kind of agenda. These connections may prove to be false, as shown in John’s story, but the impacts can still be extremely damaging. It is for this reason that I keep hammering on the single most important point about digital privacy, which is to minimize the data “they” collect about you. In other words, practice anti-forensic techniques. Any time you see some new “connected” gadget or some cloud-based service that is supposed to improve your life in some way, stop and think about the possible consequences that could follow from the data it collects. Read and understand service terms, license agreements, and privacy policies before signing up for literally anything these days.
Finally, don’t assume that companies will be forthcoming about what they collect. In the fictitious stories in this lesson, I took some creative license and found a way to inform the subjects of the stories about the inferences drawn from the collected data: an optimistically detailed report in Jack’s story, a data breach in John’s story, legal discovery in Jill’s story, and someone conveniently purchasing a data set from a broker in Jane’s story. In reality, most companies that collect and process data have every incentive to hide what they know from the data subject. Insurance companies are not necessarily going to explain why underwriting failed. Data brokers aren’t eager to tell people just what they know about them. It is in these companies’ interest to withhold as much information as possible, since people would be far less willing to share their personal information if they knew just how utterly creepy the surveillance economy has become.
Notes and References
1. David L. Hall and James Llinas. “An Introduction to Multisensor Data Fusion.” Proceedings of the IEEE 85(1): 6-23. January 1997.
2. Simson L. Garfinkel. “Data Fusion: The Ups and Downs of All-Encompassing Digital Profiles.” Scientific American. September 2008.
3. Federico Castanedo. “A Review of Data Fusion Techniques.” The Scientific World Journal 2013: 704504. October 27, 2013.
4. Marek Gagolewski. Data Fusion: Theory, Methods, and Applications. Warsaw, Poland: Institute of Computer Science, Polish Academy of Sciences. 2015.
5. Eric Siegel. “The Real Reason the NSA Wants Your Data: Predictive Law Enforcement.” The European Business Review. July 24, 2016.
6. Gregory Piatetsky. “Did Target Really Predict a Teen’s Pregnancy? The Inside Story.” KDnuggets. May 7, 2014.
7. Tong Meng, Xuyang Jing, Zheng Yan, and Witold Pedrycz. “A Survey on Machine Learning for Data Fusion.” Information Fusion 57: 115-129. May 2020.
8. Arslan Munir, Erik Blasch, Jisu Kwon, Joonho Kong, and Alexander Aved. “Artificial Intelligence and Data Fusion at the Edge.” IEEE Aerospace and Electronic Systems Magazine 36(7): 62-78. July 7, 2021.
9. Sam Levin. “She was jailed for losing a pregnancy. Her nightmare could become more common.” The Guardian. June 4, 2022.
10. Ryan C. Hermens. “The Long, Scary History of Doctors Reporting Pregnant People to the Cops.” Mother Jones. April 15, 2022.