Extract URL Contents With PHP and jQuery
How do you extract the contents of a URL? This post shows how to fetch the title, description and image of any submitted URL, the way websites such as Facebook, Twitter and Google do.
We will create the following files:
- index.php: contains the HTML form used to submit a URL for extraction.
- extract-contents.php: contains the code that fetches the required data from the submitted URL.
- javascript.js: contains the code that sends the AJAX request to extract-contents.php.
- style.css: contains the styles for the HTML page and the URL data box.
index.php
<!DOCTYPE html>
<html>
<head>
<title>Extract URL Contents with PHP and jQuery - Demo</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
<script type="text/javascript" src="js/jquery-3.1.1.min.js"></script>
<script type="text/javascript" src="js/javascript.js"></script>
<link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
<link rel="stylesheet" href="css/style.css" />
</head>
<body>
<div class="container">
<div class="extract-wrapper">
<label>Enter an absolute URL like https://www.codestacked.info</label>
<form class="url-extract-form">
<div class="input-group">
<input type="url" class="form-control url-input" value="" required="required" placeholder="Enter a URL to extract contents" />
<button type="submit" class="btn btn-green">Extract</button>
</div>
<div class="loader">
<i class="fa fa-spinner fa-spin"></i>
</div>
</form>
<div class="content-wrapper" id="content-wrapper"></div>
</div>
</div>
</body>
</html>
In extract-contents.php we first validate the submitted URL against a regular expression. If the URL is valid, we fetch its contents and load them as HTML into a new DOM document. The title, description and image all start out empty. We then collect the page's images into an array, so that if the document has no Open Graph image we can fall back to the first image on the page.
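A hand-written pattern can be stricter than intended; for example, a character class that does not include "?" will reject URLs with a query string. As a hedged alternative (not what extract-contents.php does), the sketch below validates with PHP's built-in filter_var() and then restricts the scheme to http and https. The helper name is_fetchable_url() is hypothetical.

```php
<?php
// Hypothetical helper, not part of extract-contents.php.
function is_fetchable_url(string $url): bool
{
    // FILTER_VALIDATE_URL checks the overall URL syntax
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Only allow web schemes; filter_var() alone also accepts others such as ftp
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

var_dump(is_fetchable_url('https://www.codestacked.info'));
var_dump(is_fetchable_url('https://example.com/search?q=php'));
var_dump(is_fetchable_url('ftp://example.com/file.txt'));
var_dump(is_fetchable_url('not a url'));
```

Under these assumptions the first two calls should be accepted and the last two rejected. Note that this only checks syntax and scheme; it says nothing about whether the page can actually be fetched.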
Next we look up the three values we need. Open Graph meta tags take priority: if they exist, we use them for the title, description and image. Otherwise we fall back to the document's title tag and description meta tag, and to the first image on the page. A DOMXPath object lets us reach these elements in the loaded DOM document with XPath queries.
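The lookup order described above can be tried in isolation. The following is a minimal sketch, assuming a hard-coded sample page instead of a fetched one; the helper first_value() and the sample markup are hypothetical and only illustrate the Open Graph first, plain tags second idea.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = <<<HTML
<!DOCTYPE html>
<html>
<head>
<title>Plain document title</title>
<meta name="description" content="Plain meta description">
<meta property="og:title" content="Open Graph title">
</head>
<body><img src="/images/first.png" alt=""></body>
</html>
HTML;

// Return the value of the first node matched by $primary,
// or of the first node matched by $fallback, or '' if neither matches.
function first_value(DOMXPath $xpath, string $primary, string $fallback): string
{
    foreach ([$primary, $fallback] as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $node = $nodes->item(0);
            // <meta> tags carry their value in "content"; <title> carries it as text
            return $node->hasAttribute('content')
                ? $node->getAttribute('content')
                : trim($node->textContent);
        }
    }
    return '';
}

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$title = first_value($xpath, '//meta[@property="og:title"]', '//title');
$description = first_value($xpath, '//meta[@property="og:description"]', '//meta[@name="description"]');

echo $title, "\n";
echo $description, "\n";
```

With this sample the title comes from the og:title tag, while the description falls back to the plain description meta tag because no og:description is present.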
extract-contents.php
<?php
if($_POST){
$post = $_POST;
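// Trim whitespace but keep the original case, since URL paths are case-sensitive
$url = trim($post['url'] ?? '');
// Prepend a scheme only if the URL does not already have one
$url = preg_match('/^(https?|ftp):\/\//i', $url) ? $url : 'https://'. $url;
// regular expression to validate url (case-insensitive)
$regex = '/^((https?|ftp):\/\/)(www\.)?[\w\-]+\.[a-z]{2,4}\/?[\w\/\-]*(\.[a-z]{2,4})?$/i';
// Capture the scheme and host (e.g. https://example.com) to resolve relative image paths later
preg_match('/^(https?|ftp):\/\/[^\/]+/i', $url, $hostname);
// Check if url is a valid url
if(preg_match($regex, $url)){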
// Get contents of url
$content =@file_get_contents($url);
// If failed to get contents show an error
if(!$content){
die('<div class="error">Error parsing the submitted URL.</div>');
}
$title = $description = $image = "";
$images_arr = [];
// Open new dom document object
$dom = new DOMDocument('1.0', 'UTF-8');
// Load url content to dom document object (@ silences warnings from malformed markup)
@$dom->loadHTML($content);
// Get images from dom document
$images = $dom->getElementsByTagName('img');
// Loop through images and push those with a usable path to images array
foreach ($images as $img)
{
$src = $img->getAttribute('src');
if(!empty(parse_url($src, PHP_URL_PATH)))
$images_arr[] = $src;
}
// Open xpath object for current dom document
$xPath = new DOMXPath($dom);
$og_title = $xPath->query('//meta[@property="og:title"]');
$og_description = $xPath->query('//meta[@property="og:description"]');
$og_image = $xPath->query('//meta[@property="og:image"]');
$meta_description = $xPath->query('//meta[@name="description"]');
$meta_title = $xPath->query('//title');
// Prepare title of document
if($og_title->length){
$title = $og_title -> item(0)->getAttribute('content');
}elseif($meta_title->length){
$title = $meta_title -> item(0)->textContent;
}
// Prepare description of document
if($og_description->length){
$description = $og_description -> item(0)->getAttribute('content');
}elseif($meta_description->length){
$description = $meta_description -> item(0)->getAttribute('content');
}
// Prepare image of document
if($og_image->length){
$image = $og_image -> item(0)->getAttribute('content');
}elseif(!empty($images_arr)){
$image = reset($images_arr);
}?>
<div class="url-info-box">
<?php
if(!empty($image)){
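// Keep absolute (http/https) and protocol-relative (//) image urls as they are,
// otherwise treat the value as a path on the submitted url's host
$image = (preg_match('/^https?:\/\//i',$image)) || (preg_match('/^\/\//',$image))
? $image
: rtrim($hostname[0], '/').'/'.ltrim($image, '/');
// getimagesize() needs a scheme, so add one to protocol-relative urls
$size = @getimagesize(preg_match('/^\/\//',$image) ? 'https:'.$image : $image);
// getimagesize() returns false on failure; fall back to empty dimensions
list($width, $height) = $size ?: ['', ''];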
?>
<div class="image">
<img src="<?=htmlspecialchars($image)?>" class="img-responsive" width="<?=$width?>" height="<?=$height?>" alt=""/>
</div>
<?php } ?>
<div class="data">
<div class="title">
<?=htmlspecialchars($title);?>
</div>
<div class="description"><?=htmlspecialchars($description);?></div>
</div>
</div>
<?php
}else{
echo '<div class="error">Invalid URL submitted.</div>';
}
}
?>
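The image handling in extract-contents.php has to cope with several forms of src value. As a rough sketch of how those cases could be separated, the hypothetical helper resolve_image_url() below turns a src into an absolute URL, using the URL of the page it was found on as the base. It is an illustration under simplifying assumptions: it ignores ports, "../" segments and data: URIs.

```php
<?php
// Hypothetical helper: turn an <img> src into an absolute URL,
// using the URL of the page it was found on as the base.
function resolve_image_url(string $pageUrl, string $src): string
{
    // Already absolute (http:// or https://)
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    $base = parse_url($pageUrl);
    $scheme = $base['scheme'] ?? 'https';
    $host = $base['host'] ?? '';
    // Protocol-relative (//cdn.example.com/a.png): reuse the page's scheme
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }
    $origin = $scheme . '://' . $host;
    // Root-relative (/img/a.png): append to scheme and host
    if (strpos($src, '/') === 0) {
        return $origin . $src;
    }
    // Path-relative (a.png): resolve against the directory of the page path
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . ($dir === '' ? '/' : $dir) . $src;
}

echo resolve_image_url('https://example.com/blog/post', '/img/a.png'), "\n";
echo resolve_image_url('https://example.com/blog/post', 'a.png'), "\n";
```

For a page at https://example.com/blog/post, a root-relative src is attached to the host, while a bare file name is resolved against the /blog/ directory.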
javascript.js
$(document).ready(function(){
$(".url-extract-form").on("submit",function(e){
e.preventDefault();
var url = $(".url-input").val();
$(".content-wrapper").hide();
if(url != ''){
$(".loader").fadeIn();
$.ajax({
url: "extract-contents.php",
type: "POST",
data:{
url: url
},
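success: function(data){
$(".content-wrapper").html(data).slideDown();
},
error: function(){
$(".content-wrapper").html('<div class="error">Request failed. Please try again.</div>').slideDown();
},
complete: function(){
// Hide the loader whether the request succeeded or failed
$(".loader").fadeOut();
}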
});
}
});
});
style.css
*{
box-sizing: border-box;
}
html,body{
margin: 0;
padding: 0;
}
body{
background-color: #f6f6f6;
font-family: "Segoe UI", "Roboto", "Helvetica", sans-serif;
font-size: 15px;
font-weight: normal;
font-style: normal;
}
.container{
max-width: 1024px;
margin: 0 auto;
padding-left: 15px;
padding-right: 15px;
}
.url-extract-form{
position: relative;
margin-bottom: 1rem;
}
.extract-wrapper label{
display: inline-block;
margin-bottom: 0.25rem;
}
.input-group {
position: relative;
display: flex;
flex-wrap: wrap;
align-items: stretch;
width: 100%;
}
.form-control{
border: 1px solid #ddd;
padding: 10px;
position: relative;
font-size: inherit;
flex: 1 1 auto;
width: 1%;
min-width: 0;
}
.form-control:focus {
border-color: #00c0ef;
outline: 0;
}
.loader{
position: absolute;
inset: 0;
font-size: 1.75rem;
background: rgba(150,150,150,0.5);
z-index: 5;
padding: 0px 10px;
display: none;
color: #006699;
text-align: center;
}
.url-extract-form button{
display: inline-block;
padding: 5px 10px;
cursor: pointer;
font: inherit;
background: #00a65a;
border: 1px solid #009549;
color: #fff;
margin-left: -1px;
}
.content-wrapper .error{
padding: 10px;
background: #e95454;
color: #fff;
}
.url-info-box{
background: #fefefe;
border: 1px solid #fefefe;
overflow: hidden;
font-size: 13px;
max-width: 300px;
}
.img-responsive{
max-width: 100%;
height: auto;
display: block;
margin: 0 auto;
}
.url-info-box .data{
padding: 15px;
background: #efefef;
}
.url-info-box .title{
font-weight: bold;
max-height: 35px;
overflow: hidden;
color: #3778cd;
}